[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-DirtyHarryLYL--Transformer-in-Vision":3,"tool-DirtyHarryLYL--Transformer-in-Vision":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150720,2,"2026-04-11T11:33:10",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 
{"tool-DirtyHarryLYL--Transformer-in-Vision":{"id":6689,"github_repo":"DirtyHarryLYL\u002FTransformer-in-Vision","name":"Transformer-in-Vision","description_en":"Recent Transformer-based CV and related works.","ai_summary_en":"Transformer-in-Vision is an open-source resource hub for computer vision that systematically collects and tracks the latest Transformer-based research and applications. As the Transformer has crossed over from natural language processing into vision tasks and become a core component of modern AI models, the repository gathers papers, code links, and project pages so that practitioners can quickly keep up with the state of the art.\n\nIt addresses the difficulty vision researchers face in efficiently filtering and following the flood of new literature. Coverage is broad: classic image-generation models (e.g. DALL·E 2, Stable Diffusion, Imagen), video understanding, multimodal learning (e.g. CLIP, LAVIS), robot interaction, and survey articles on niche areas such as sensor fusion for autonomous driving. It also collects practical tutorial links, from the foundational \"Attention is All You Need\" paper to implementations in various frameworks.\n\nThe list is well suited to AI researchers, algorithm engineers, and students interested in deep learning: researchers looking for inspiration and developers who need to reproduce recent models will both find useful references. Its particular value lies in structuring scattered high-quality resources and continually adding new trends, including the intersection of large language models and vision (LLM-in-Vision), making it an efficient entry point into the vision-Transformer landscape.","readme_en":"# Transformer-in-Vision\nRecent Transformer-based CV and related works. Welcome to comment\u002Fcontribute! \n\nThe transformer is now a basic component, adopted in nearly all AI models. 
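Nearly every entry below builds on the scaled dot-product attention introduced in "Attention is all you need" (listed under Resource): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. As a quick refresher, here is a minimal NumPy sketch for illustration only (not code from this repository or from any listed paper), with image patches playing the role of tokens:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity, scaled
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attention-weighted sum of values

# Toy usage: 4 "image patch" tokens, 8-dim queries/keys, 16-dim values.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```

Vision Transformers stack this operation (with learned Q/K/V projections and multiple heads) over patch embeddings; the papers below largely vary how, where, and at what cost this attention is applied.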
Keep updated --> updated irregularly.\n\nNew Hope: [LLM-in-Vision](https:\u002F\u002Fgithub.com\u002FDirtyHarryLYL\u002FLLM-in-Vision)\n\n## Resource\n\n- **ChatGPT** for **Robotics**: Design Principles and Model Abilities, [[Paper]](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fuploads\u002Fprod\u002F2023\u002F02\u002FChatGPT___Robotics.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPromptCraft-Robotics)\n\n- DIFFUSIONDB [[Page]](https:\u002F\u002Fpoloclub.github.io\u002Fdiffusiondb), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14896.pdf)\n\n- LAION-5B [[Page]](https:\u002F\u002Flaion.ai\u002Flaion-5b-a-new-era-of-open-large-scale-multi-modal-datasets\u002F), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08402.pdf)\n\n- LAVIS [[Page]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09019.pdf)\n\n- Imagen Video [[Page]](https:\u002F\u002Fimagen.research.google\u002Fvideo\u002F), [[Paper]](https:\u002F\u002Fimagen.research.google\u002Fvideo\u002Fpaper.pdf)\n\n- Phenaki [[Page]](https:\u002F\u002Fphenaki.video\u002F), [[Paper]](https:\u002F\u002Fopenreview.net\u002Fpdf?id=vOEXS39nOF)\n\n- DREAMFUSION [[Page]](https:\u002F\u002Fdreamfusion3d.github.io\u002F), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14988.pdf)\n\n- MAKE-A-VIDEO [[Page]](https:\u002F\u002Fmake-a-video.github.io\u002F), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14792.pdf)\n\n- Stable Diffusion [[Page]](https:\u002F\u002Fommer-lab.com\u002Fresearch\u002Flatent-diffusion-models\u002F), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10752.pdf)\n\n- NUWA-Infinity [[Page]](https:\u002F\u002Fnuwa-infinity.microsoft.com\u002F#\u002F), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09814.pdf)\n\n- Parti [[Page]](https:\u002F\u002Fparti.research.google\u002F), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fparti)\n\n- Imagen [[Page]](https:\u002F\u002Fimagen.research.google\u002F), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11487.pdf)\n\n- Gato: A Generalist Agent, [[Paper]](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FA%20Generalist%20Agent\u002FGeneralist%20Agent.pdf)\n\n- PaLM: Scaling Language Modeling with Pathways, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02311.pdf)\n\n- DALL·E 2 [[Page]](https:\u002F\u002Fopenai.com\u002Fdall-e-2\u002F), [[Paper]](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-2.pdf)\n\n- SCENIC: A JAX Library for Computer Vision Research and Beyond, [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic)\n\n- V-L joint learning study (with good tables): [[METER]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02387.pdf), [[Kaleido-BERT]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.16110.pdf)\n\n- Attention is all you need, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1706.03762.pdf)\n\n- CLIP [[Page]](https:\u002F\u002Fopenai.com\u002Fblog\u002Fclip\u002F), [[Paper]](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002FLearning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP), [[arXiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.00020.pdf)\n\n- DALL·E [[Page]](https:\u002F\u002Fopenai.com\u002Fblog\u002Fdall-e\u002F), [[Code]](https:\u002F\u002Fgithub.com\u002Fopenai\u002FDALL-E), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.12092.pdf)\n\n- 
[huggingface\u002Ftransformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n\n- [Kyubyong\u002Ftransformer](https:\u002F\u002Fgithub.com\u002FKyubyong\u002Ftransformer), TF\n\n- [jadore801120\u002Fattention-is-all-you-need-pytorch](https:\u002F\u002Fgithub.com\u002Fjadore801120\u002Fattention-is-all-you-need-pytorch), Torch\n\n- [krasserm\u002Ffairseq-image-captioning](https:\u002F\u002Fgithub.com\u002Fkrasserm\u002Ffairseq-image-captioning)\n\n- [PyTorch Transformers Tutorials](https:\u002F\u002Fgithub.com\u002Fabhimishra91\u002Ftransformers-tutorials)\n\n- [ictnlp\u002Fawesome-transformer](https:\u002F\u002Fgithub.com\u002Fictnlp\u002Fawesome-transformer)\n\n- [basicv8vc\u002Fawesome-transformer](https:\u002F\u002Fgithub.com\u002Fbasicv8vc\u002Fawesome-transformer)\n\n- [dk-liang\u002FAwesome-Visual-Transformer](https:\u002F\u002Fgithub.com\u002Fdk-liang\u002FAwesome-Visual-Transformer)\n\n- [yuewang-cuhk\u002Fawesome-vision-language-pretraining-papers](https:\u002F\u002Fgithub.com\u002Fyuewang-cuhk\u002Fawesome-vision-language-pretraining-papers)\n\n## Survey\n\n- (arXiv 2023.2) TRANSFORMER-BASED **SENSOR FUSION** FOR **AUTONOMOUS DRIVING**: A SURVEY, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11481.pdf), [[Page]](https:\u002F\u002Fgithub.com\u002FApoorvRoboticist\u002FTransformers-Sensor-Fusion)\n\n- (arXiv 2023.2) Deep Learning for **Video-Text Retrieval**: a Review, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12552.pdf)\n\n- (arXiv 2023.2) Large-scale **Multi-Modal Pre-trained Models**: A Comprehensive Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10035.pdf)\n\n- (arXiv 2023.2) Transformer-based **Generative Adversarial Networks** in Computer Vision: A Comprehensive Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08641.pdf)\n\n- (arXiv 2023.2) **Knowledge Distillation** in Vision Transformers: A Critical Review, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2302\u002F2302.02108.pdf)\n\n- (arXiv 2023.2) A Survey on **Efficient Training** of Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01107.pdf)\n\n- (arXiv 2023.1) ChatGPT is not all you need. 
A State of the Art Review of **large Generative AI models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04655.pdf)\n\n- (arXiv 2022.12) Transformers in **Action Recognition**: A Review on Temporal Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01921.pdf)\n\n- (arXiv 2022.11) Vision Transformers in **Medical Imaging**: A Review, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.10043.pdf)\n\n- (arXiv 2022.11) A survey on **knowledge**-enhanced **multimodal** learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12328.pdf)\n\n- (arXiv 2022.10) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09263.pdf)\n\n- (arXiv 2022.10) A Survey on Graph Neural Networks and **Graph** Transformers in Computer Vision: A Task-Oriented Perspective, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13232.pdf)\n\n- (arXiv 2022.09) VISION TRANSFORMERS FOR **ACTION RECOGNITION**: A SURVEY, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05700.pdf)\n\n- (arXiv 2022.09) Transformers in **Remote Sensing**: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01206.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVIROBO-15\u002FTransformer-in-Remote-Sensing)\n\n- (arXiv 2022.08) **3D Vision** with Transformers: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04309.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flahoud\u002F3d-vision-transformers)\n\n- (arXiv 2022.08) A Survey on **Masked Autoencoder** for Self-supervised Learning in Vision and Beyond, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00173.pdf)\n\n- (arXiv 2022.07) **Vision** Transformers: State of the Art and Research Challenges, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03041.pdf)\n\n- (arXiv 2022.07) **SELF-SUPERVISED** LEARNING FOR **VIDEOS**: A SURVEY, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00419.pdf)\n\n- (arXiv 2022.06) **Multimodal** Learning with Transformers: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06488.pdf)\n\n- (arXiv 2022.05) Vision Transformer: **Vit** and its **Derivatives**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11239.pdf)\n\n- (arXiv 2022.05) Transformers in 3D **Point Clouds**: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07417.pdf)\n\n- (arXiv 2022.04) **Visual Attention** Methods in Deep Learning: An In-Depth Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07756.pdf)\n\n- (arXiv 2022.04) **Vision-and-Language** Pretrained Models: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07356.pdf)\n\n- (arXiv 2022.03) A Roadmap for **Big Model**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14101.pdf)\n\n- (arXiv 2022.03) Transformers Meet **Visual** Learning Understanding: A Comprehensive Review, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12944.pdf)\n\n- (arXiv 2022.03) Recent Advances in **Vision** Transformer: A Survey and Outlook of Recent Work, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01536.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Fkhawar512\u002FViT-Survey)\n\n- (arXiv 2022.02) A Survey of **Vision-Language** Pre-Trained Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10936.pdf)\n\n- (arXiv 2022.02) VLP: A Survey on **Vision-Language** Pre-training, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09061.pdf)\n\n- (arXiv 2022.02) Transformer for **Graphs**: An Overview from Architecture Perspective, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.08455.pdf)\n\n- (arXiv 2022.01) **Video** Transformers: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05991.pdf)\n\n- (arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP **MLP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.04060.pdf)\n\n- (arXiv 2021.11) A Survey of **Visual** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.06091.pdf)\n\n- (arXiv 2021.09) Survey: Transformer based **Video-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.09920.pdf)\n\n- (arXiv 2021.06) A Survey of **Transformers**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.04554.pdf)\n\n- (arXiv 2021.06) **Attention** mechanisms and deep learning for machine vision: A survey of the state of the art, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.07550.pdf)\n\n- (arXiv 2021.06) **Pre-Trained Models**: Past, Present and Future, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.07139.pdf)\n\n- (arXiv 2021.05) Can Attention Enable **MLPs** To Catch Up With CNNs? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2105.15078.pdf)\n\n- (arXiv 2021.03) A Practical Survey on **Faster** and **Lighter** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.14636.pdf)\n\n- (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with **Language and Vision**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.04037.pdf)\n\n- (arXiv 2021.01) A Survey on **Visual** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.12556.pdf)\n\n- (arXiv 2020.9) **Efficient** Transformers: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2009.06732.pdf)\n\n- (arXiv 2020.1) **Transformers in Vision**: A Survey, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.01169.pdf)\n\n## Recent Papers\n\n### 2023.8\n\n- (arXiv 2023.8) VL-PET: Vision-and-Language Parameter-**Efficient Tuning** via Granularity Control, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.09804), [[Project]](https:\u002F\u002Fhenryhzy.github.io\u002FVL-PET\u002F)\n\n### 2023.5\n\n- (arXiv 2023.5) Understanding Gaussian **Attention** Bias of Vision Transformers Using Effective Receptive Fields, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04722.pdf)\n\n### 2023.3\n\n- (arXiv 2023.3) Query-Dependent **Video** Representation for **Moment Retrieval** and **Highlight Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.13874.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwjun0830\u002FQD-DETR)\n\n### 2023.2\n\n- (arXiv 2023.2) **Open-domain Visual Entity Recognition**: Towards Recognizing Millions of Wikipedia Entities, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11154.pdf)\n\n- (arXiv 2023.2) KS-DETR: Knowledge Sharing in Attention Learning for **Detection** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11208.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fedocanonymous\u002FKS-DETR)\n\n- (arXiv 2023.2) HUMAN MOTIONFORMER: **TRANSFERRING** HUMAN **MOTIONS** WITH VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11306.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKumapowerLIU\u002FHuman-MotionFormer)\n\n- (arXiv 2023.2) Aligning 
**Text-to-Image** Models using **Human Feedback**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12192.pdf)\n\n- (arXiv 2023.2) Controlled and Conditional **Text to Image** Generation with Diffusion Prior, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11710.pdf)\n\n- (arXiv 2023.2) Can Pre-trained Vision and Language Models Answer **Visual Information-Seeking Questions**? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11713.pdf), [[Code]](https:\u002F\u002Fopen-vison-language.github.io\u002Finfoseek)\n\n- (arXiv 2023.2) OBJECT-CENTRIC **VIDEO PREDICTION** VIA DECOUPLING OF OBJECT DYNAMICS AND INTERACTIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11850.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Focvp-vp)\n\n- (arXiv 2023.2) Distribution Normalization: An “Effortless” **Test-Time Augmentation** for Contrastively Learned **Visual-language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11084.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffengyuli2002\u002Fdistribution-normalization)\n\n- (arXiv 2023.2) Teaching **CLIP** to **Count** to Ten, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12066.pdf), [[Project]](https:\u002F\u002Fteaching-clip-to-count.github.io\u002F)\n\n- (arXiv 2023.2) Designing an Encoder for Fast Personalization of **Text-to-Image** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12228.pdf), [[Project]](https:\u002F\u002Ftuning-encoder.github.io\u002F)\n\n- (arXiv 2023.2) Side Adapter Network for **Open-Vocabulary Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12242.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMendelXu\u002FSAN)\n\n- (arXiv 2023.2) Learning Visual Representations via **Language-Guided Sampling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12248.pdf)\n\n- (arXiv 2023.2) VoxFormer: Sparse Voxel Transformer for Camera-based **3D Semantic Scene Completion**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12251.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer)\n\n- (arXiv 2023.2) Language-Driven Representation Learning for **Robotics**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12766.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fvoltron-robotics)\n\n- (arXiv 2023.2) A Convolutional Vision Transformer for **Semantic Segmentation** of Side-Scan **Sonar** Data, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12416.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhayatrajani\u002Fs3seg-vit)\n\n- (arXiv 2023.2) **Lightweight** Real-time Semantic **Segmentation** Network with Efficient Transformer and CNN, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10484.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIVIPLab\u002FLETNet)\n\n- (arXiv 2023.2) VIEWCO: DISCOVERING **TEXT-SUPERVISED** **SEGMENTATION** MASKS VIA MULTI-VIEW SEMANTIC CONSISTENCY, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10307.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpzhren\u002FViewCo)\n\n- (arXiv 2023.2) CertViT: Certified **Robustness** of Pre-Trained Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10287.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsagarverma\u002Ftransformer-lipschitz)\n\n- (arXiv 2023.2) Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for **Grounding Viewpoint Descriptions**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10282.pdf)\n\n- (arXiv 2023.2) MaskedKD: Efficient **Distillation** of Vision Transformers with **Masked** Images, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10494.pdf)\n\n- (arXiv 2023.2) A General Visual Representation Guided Framework with Global Affinity for **Weakly Supervised Salient Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10697.pdf)\n\n- (arXiv 2023.2) ViTA: A Vision Transformer **Inference Accelerator** for **Edge** Applications, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09108.pdf)\n\n- (arXiv 2023.2) **Video Action Recognition** Collaborative Learning with Dynamics via PSO-ConvNet Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09187.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fleonlha\u002FVideo-Action-Recognition-via-PSO-ConvNet-Transformer-Collaborative-Learning-with-Dynamics)\n\n- (arXiv 2023.2) A Pilot **Evaluation** of ChatGPT and DALL-E 2 on **Decision Making** and **Spatial Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2302\u002F2302.09068.pdf)\n\n- (arXiv 2023.2) StyLIP: Multi-Scale Style-Conditioned Prompt Learning for **CLIP**-based **Domain Generalization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09251.pdf)\n\n- (arXiv 2023.2) Meta Style Adversarial Training for Cross-Domain **Few-Shot** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09309.pdf)\n\n- (arXiv 2023.2) HYNETER: HYBRID NETWORK TRANSFORMER FOR OBJECT **DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09365.pdf)\n\n- (arXiv 2023.2) STOA-VLP: Spatial-Temporal Modeling of Object and Action for **Video-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09736.pdf)\n\n- (arXiv 2023.2) Constraint and Union for Partially-Supervised **Temporal Sentence Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09850.pdf)\n\n- (arXiv 2023.2) STB-VMM: Swin Transformer Based **Video Motion Magnification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10001.pdf)\n\n- (arXiv 2023.2) **Fashion Image Retrieval** with Multi-Granular Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08902.pdf)\n\n- (arXiv 2023.2) LayoutDiffuse: Adapting Foundational Diffusion Models for **Layout-to-Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08908.pdf)\n\n- (arXiv 2023.2) CK-Transformer: Commonsense Knowledge Enhanced Transformers for **Referring Expression Comprehension**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09027.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFightingFighting\u002FCK-Transformer)\n\n- (arXiv 2023.2) MaskSketch: Unpaired Structure-guided Masked **Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05496.pdf)\n\n- (arXiv 2023.2) Single **Motion** **Diffusion**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05905.pdf), [[Code]](https:\u002F\u002Fsinmdm.github.io\u002FSinMDM-page)\n\n- (arXiv 2023.2) Tri-Perspective View for Vision-Based **3D Semantic Occupancy Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07817.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FTPVFormer)\n\n- (arXiv 2023.2) ANSEL Photobot: A **Robot** **Event Photographer** with Semantic Intelligence, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07931.pdf)\n\n- (arXiv 2023.2) ForceFormer: Exploring Social 
Force and Transformer for **Pedestrian Trajectory Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07583.pdf)\n\n- (arXiv 2023.2) **Video** Probabilistic **Diffusion** Models in Projected Latent Space, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07685.pdf)\n\n- (arXiv 2023.2) Dataset Interfaces: **Diagnosing Model Failures** Using Controllable Counterfactual Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07865.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMadryLab\u002Fdataset-interfaces)\n\n- (arXiv 2023.2) Learning to Substitute Ingredients in **Recipes**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07960.pdf)\n\n- (arXiv 2023.2) **Energy** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07253.pdf)\n\n- (arXiv 2023.2) Efficiency 360: **Efficient** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08374.pdf)\n\n- (arXiv 2023.2) A-la-carte **Prompt Tuning** (APT): Combining Distinct Data Via Composable Prompting, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07994.pdf)\n\n- (arXiv 2023.2) Effective Data **Augmentation** With **Diffusion** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07944.pdf), [[Project]](https:\u002F\u002Fbtrabuc.co\u002Fda-fusion)\n\n- (arXiv 2023.2) PRedItOR: Text Guided **Image Editing** with Diffusion Prior, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07979.pdf)\n\n- (arXiv 2023.2) TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary **One-Shot Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08047.pdf)\n\n- (arXiv 2023.2) Hierarchical Cross-modal Transformer for **RGB-D Salient Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08052.pdf)\n\n- (arXiv 2023.2) MINOTAUR: Multi-task **Video Grounding** From Multimodal Queries, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08063.pdf)\n\n- (arXiv 2023.2) Towards **Efficient** Visual **Adaption** via Structural Re-parameterization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08106.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fluogen1996\u002FRepAdapter)\n\n- (arXiv 2023.2) Efficient **3D Object Reconstruction** using Visual Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08474.pdf)\n\n- (arXiv 2023.2) Retrieval-augmented Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08268.pdf)\n\n- (arXiv 2023.2) Robust Human **Motion Forecasting** using Transformer-based Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08274.pdf)\n\n- (arXiv 2023.2) VQ3D: Learning a **3D**-Aware **Generative** Model on ImageNet, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06833.pdf), [[Project]](https:\u002F\u002Fkylesargent.github.io\u002Fvq3d)\n\n- (arXiv 2023.2) UKnow: A Unified Knowledge Protocol for **Common-Sense Reasoning** and **Vision-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06891.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGongggg\u002FUKnow)\n\n- (arXiv 2023.2) A **THEORETICAL** UNDERSTANDING OF **SHALLOW** VISION TRANSFORMERS: LEARNING, GENERALIZATION, AND SAMPLE COMPLEXITY, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06015.pdf)\n\n- (arXiv 2023.2) A Simple Zero-shot Prompt Weighting Technique to Improve **Prompt** Ensembling in **Text-Image** Models, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06235.pdf)\n\n- (arXiv 2023.2) Generalized Few-Shot **Continual Learning** with Contrastive Mixture of Adapters, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05936.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyawencui\u002FCMoA)\n\n- (arXiv 2023.2) Actional Atomic-Concept Learning for Demystifying **Vision-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06072.pdf)\n\n- (arXiv 2023.2) Towards Local Visual Modeling for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06098.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxmu-xiaoma666\u002FLSTNet)\n\n- (arXiv 2023.2) CLIP-RR: IMPROVED CLIP NETWORK FOR RELATION-FOCUSED **CROSS-MODAL INFORMATION RETRIEVAL**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06350.pdf)\n\n- (arXiv 2023.2) **Anticipating** Next Active Objects for **Egocentric Videos**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06358.pdf), [[Code]]()\n\n- (arXiv 2023.2) UniAdapter: Unified Parameter-Efficient Transfer Learning for **Cross-modal Modeling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06605.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FRERV\u002FUniAdapter)\n\n- (arXiv 2023.2) TEAM **DETR**: GUIDE QUERIES AS A PROFESSIONAL TEAM IN DETECTION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07116.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhorrible-dong\u002FTeamDETR)\n\n- (arXiv 2023.2) ConceptFusion: Open-set **Multimodal** **3D Mapping**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07241.pdf), [[Project]](https:\u002F\u002Fconcept-fusion.github.io\u002F)\n\n- (arXiv 2023.2) Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for **Multi-Modal Fact Verification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07740.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwwweiwei\u002FPre-CoFactv2-AAAI-2023)\n\n- (arXiv 2023.2) PolyFormer: Referring Image **Segmentation** as Sequential Polygon Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07387.pdf)\n\n- (arXiv 2023.2) Pose-Oriented Transformer with Uncertainty-Guided Refinement for **2D-to-3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07408.pdf)\n\n- (arXiv 2023.2) TFormer: A Transmission-Friendly ViT Model for **IoT** Devices, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07734.pdf), [[Code]]()\n\n- (arXiv 2023.2) Adding Conditional Control to **Text-to-Image Diffusion** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05543.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flllyasviel\u002FControlNet)\n\n- (arXiv 2023.2) Invariant **Slot Attention**: **Object Discovery** with Slot-Centric Reference Frames, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04973.pdf)\n\n- (arXiv 2023.2) IS MULTI-MODAL **VISION** SUPERVISION **BENEFICIAL** TO **LANGUAGE**? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05016.pdf)\n\n- (arXiv 2023.2) Data-Driven **Stochastic Motion Evaluation** and **Optimization** with Image by Spatially-Aligned Temporal Encoding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05041.pdf)\n\n- (arXiv 2023.2) **Scaling** Vision Transformers to **22 Billion Parameters**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05442.pdf)\n\n- (arXiv 2023.2) Adapting **Pre-trained** Vision Transformers from **2D to 3D** through Weight Inflation Improves Medical Image Segmentation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04303.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyuhui-zh15\u002FTransSeg)\n\n- (arXiv 2023.2) Mitigating **Bias** in Visual Transformers via Targeted Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04358.pdf)\n\n- (arXiv 2023.2) IH-ViT: Vision Transformer-based **Integrated Circuit Appearance Defect Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2302\u002F2302.04521.pdf)\n\n- (arXiv 2023.2) Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04858.pdf)\n\n- (arXiv 2023.2) Learning by Asking for **Embodied** Visual **Navigation** and **Task Completion**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04865.pdf)\n\n- (arXiv 2023.2) **Reversible** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04869.pdf), [[Code1]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fslowfast), [[Code2]](https:\u002F\u002Fgithub.com\u002Fkarttikeya\u002FminREV)\n\n- (arXiv 2023.2) Neural Congealing: **Aligning Images** to a Joint **Semantic Atlas**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03956.pdf), [[Project]](https:\u002F\u002Fneural-congealing.github.io\u002F)\n\n- (arXiv 2023.2) **Adversarial Prompting** for Black Box Foundation Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04237.pdf)\n\n- (arXiv 2023.2) Understanding Why ViT **Trains** Badly on **Small Datasets**: An Intuitive Perspective, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03751.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBoyuanJackChen\u002FMiniProject2_VisTrans)\n\n- (arXiv 2023.2) CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER **ATTENTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03985.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjoyfang1106\u002FMRLA)\n\n- (arXiv 2023.2) Convolutional Neural Networks Trained to **Identify Words** Provide a Good Account of Visual Form Priming Effects, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03992.pdf)\n\n- (arXiv 2023.2) Zero-shot **Generation** of Coherent **Storybook** from Plain Text Story using Diffusion Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03900.pdf)\n\n- (arXiv 2023.2) OSRT: Omnidirectional **Image Super-Resolution** with Distortion-aware Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03453.pdf)\n\n- (arXiv 2023.2) Pic2Word: Mapping Pictures to Words for Zero-shot **Composed** **Image Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03084.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fcomposed_image_retrieval)\n\n- (arXiv 2023.2) SimCon Loss with Multiple Views for Text Supervised **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03432.pdf)\n\n- (arXiv 2023.2) 
PhysFormer++: **Facial** Video-based **Physiological Measurement** with SlowFast Temporal Difference Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03548.pdf)\n\n- (arXiv 2023.2) Scaling **Self-Supervised** End-to-End **Driving** with Multi-View Attention Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03198.pdf)\n\n- (arXiv 2023.2) HumanMAC: Masked Motion Completion for **Human Motion Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03665.pdf), [[Project]](https:\u002F\u002Flhchen.top\u002FHuman-MAC\u002F)\n\n- (arXiv 2023.2) LAMPP: **Language Models** as Probabilistic Priors for **Perception** and **Action**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02801.pdf)\n\n- (arXiv 2023.2) Zero-Shot **Robot Manipulation** from Passive Human Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02011.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fhuman-0shot-robot)\n\n- (arXiv 2023.2) MixFormer: End-to-End **Tracking** with Iterative Mixed Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02814.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FMixFormer)\n\n- (arXiv 2023.2) LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale **Image-Text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02908.pdf)\n\n- (arXiv 2023.2) V1T: large-scale **mouse V1 response prediction** using a Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03023.pdf)\n\n- (arXiv 2023.2) AIM: ADAPTING **IMAGE MODELS** FOR EFFICIENT **VIDEO ACTION RECOGNITION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03024.pdf), [[Project]](https:\u002F\u002Fadapt-image-models.github.io\u002F)\n\n- (arXiv 2023.2) KDEformer: **Accelerating** Transformers via Kernel Density Estimation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02451.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmajid-daliri\u002Fkdeformer)\n\n- (arXiv 2023.2) Semantic-Guided **Image Augmentation** with Pre-trained Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02070.pdf)\n\n- (arXiv 2023.2) X-ReID: Cross-Instance Transformer for Identity-Level **Person Re-Identification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02075.pdf)\n\n- (arXiv 2023.2) MOMA: **Distill** from Self-Supervised Teachers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02089.pdf)\n\n- (arXiv 2023.2) Learning to Agree on Vision Attention for **Visual Commonsense Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02117.pdf)\n\n- (arXiv 2023.2) Efficient End-to-End **Video Question Answering** with Pyramidal Multimodal Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02136.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTrunpm\u002FPMT-AAAI23)\n\n- (arXiv 2023.2) LipFormer: Learning to **Lipread** Unseen Speakers based on Visual-Landmark Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02141.pdf)\n\n- (arXiv 2023.2) Oscillation-free **Quantization** for Low-bit Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02210.pdf)\n\n- (arXiv 2023.2) Design Booster: A Text-Guided Diffusion Model for **Image Translation** with Spatial Layout Preservation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02284.pdf)\n\n- (arXiv 2023.2) Contrast with Reconstruct: **Contrastive** **3D** Representation Learning Guided by Generative Pretraining, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02318.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fqizekun\u002FReCon)\n\n- (arXiv 2023.2) Leaving Reality to Imagination: **Robust** **Classification** via **Generated** Datasets, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02503.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHritikbansal\u002Fgenerative-robustness)\n\n- (arXiv 2023.2) CHiLS: Zero-Shot Image **Classification** with **Hierarchical** Label Sets, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02551.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Facmi-lab\u002FCHILS)\n\n- (arXiv 2023.2) Zero-shot **Image-to-Image** Translation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03027.pdf), [[Project]](https:\u002F\u002Fpix2pixzero.github.io\u002F)\n\n- (arXiv 2023.2) Learning a **Fourier Transform** for Linear Relative **Positional Encodings** in Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01925.pdf)\n\n- (arXiv 2023.2) EXPLICIT BOX DETECTION UNIFIES END-TO-END **MULTI-PERSON POSE ESTIMATION**, [[Paper]](http:\u002F\u002Fmy.sjtu.edu.cn\u002FTask), [[Code]](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FED-Pose)\n\n- (arXiv 2023.2) CFFT-GAN: Cross-domain Feature Fusion Transformer for Exemplar-based **Image Translation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01608.pdf)\n\n- (arXiv 2023.2) DEVICE: DEpth and VIsual ConcEpts Aware Transformer for **TextCaps**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01540.pdf)\n\n- (arXiv 2023.2) CVTNet: A Cross-View Transformer Network for **Place Recognition** Using **LiDAR** Data, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01665.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBIT-MJY\u002FCVTNet)\n\n- (arXiv 2023.2) DilateFormer: **Multi-Scale Dilated** Transformer for Visual Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01791.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJIAOJIAYUASD\u002Fdilateformer)\n\n- (arXiv 2023.2) HDFormer: High-order Directed Transformer for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01825.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhyer\u002FHDFormer)\n\n- (arXiv 2023.2) IC^3: Image Captioning by Committee Consensus, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01328.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDavidMChan\u002Fcaption-by-committee)\n\n- (arXiv 2023.2) Boosting Low-Data Instance **Segmentation** by Unsupervised Pre-training with Saliency Prompt, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01171.pdf)\n\n- (arXiv 2023.2) QR-CLIP: Introducing Explicit Open-World Knowledge for **Location and Time Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00952.pdf)\n\n- (arXiv 2023.2) Vision Transformer-based Feature Extraction for **Generalized Zero-Shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00875.pdf)\n\n- (arXiv 2023.2) **Multimodal** Chain-of-Thought **Reasoning** in Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00923.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Fmm-cot)\n\n- (arXiv 2023.2) CLIPood: Generalizing **CLIP** to **Out-of-Distributions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00864.pdf)\n\n- (arXiv 2023.2) Language Quantized AutoEncoders: Towards Unsupervised **Text-Image** Alignment, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00902.pdf)\n\n- (arXiv 2023.2) The geometry of **hidden representations** of large transformer models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00294.pdf)\n\n- (arXiv 2023.2) **Debiasing** **Vision-Language** Models via Biased Prompts, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00070.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchingyaoc\u002Fdebias_vl)\n\n- (arXiv 2023.2) COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR **OPEN-VOCABULARY VIDEO RELATION DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00268.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDawn-LX\u002FOpenVoc-VidVRD)\n\n- (arXiv 2023.2) mPLUG-2: A Modularized **Multi-modal** Foundation Model Across Text, Image and Video, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00402.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Falibaba\u002FAliceMind\u002Ftree\u002Fmain\u002FmPLUG)\n\n- (arXiv 2023.2) Transforming **CLIP** to an **Open-vocabulary Video Model** via Interpolated Weight Optimization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00624.pdf)\n\n- (arXiv 2023.2) ADAPT: Action-aware Driving **Caption** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00673.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjxbbb\u002FADAPT)\n\n### 2023.1\n\n- (arXiv 2023.1) AdaPoinTr: Diverse **Point Cloud Completion** with Adaptive Geometry-Aware Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04545.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyuxumin\u002FPoinTr)\n\n- (arXiv 2023.1) **EXIF** as Language: Learning Cross-Modal Associations Between **Images and Camera Metadata**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04647.pdf), [[Project]](https:\u002F\u002Fhellomuffin.github.io\u002Fexif-as-language)\n\n- (arXiv 2023.1) Head-Free Lightweight **Semantic Segmentation** with Linear Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04648.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdongbo811\u002FAFFormer)\n\n- (arXiv 2023.1) Geometry-biased Transformers for **Novel View Synthesis**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04650.pdf), [[Project]](https:\u002F\u002Fmayankgrwl97.github.io\u002Fgbt)\n\n- (arXiv 2023.1) **Continual** **Few-Shot** Learning Using HyperTransformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04584.pdf)\n\n- (arXiv 2023.1) SEMPPL: PREDICTING **PSEUDO-LABELS** FOR BETTER **CONTRASTIVE** REPRESENTATIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05158.pdf)\n\n- (arXiv 2023.1) Learning to **Summarize Videos** by Contrasting Clips, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05213.pdf)\n\n- (arXiv 2023.1) 
Guiding **Text-to-Image** **Diffusion** Model Towards Grounded Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05221.pdf), [[Project]](https:\u002F\u002Flipurple.github.io\u002FGrounded_Diffusion\u002F)\n\n- (arXiv 2023.1) Domain Expansion of **Image Generators**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05225.pdf), [[Code]](https:\u002F\u002Fyotamnitzan.github.io\u002Fdomain-expansion\u002F)\n\n- (arXiv 2023.1) Scene-centric vs. Object-centric Image-Text **Cross-modal Retrieval**: A Reproducibility Study, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05174.pdf)\n\n- (arXiv 2023.1) Tracr: Compiled Transformers as a Laboratory for **Interpretability**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05062.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Ftracr)\n\n- (arXiv 2023.1) **CLIP** the Gap: A Single **Domain Generalization** Approach for Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05499.pdf)\n\n- (arXiv 2023.1) **Text to Point Cloud Localization** with Relation-Enhanced Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05372.pdf)\n\n- (arXiv 2023.1) GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured **Pruning** for Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05345.pdf)\n\n- (arXiv 2023.1) Toward Building General Foundation Models for Language, Vision, and **Vision-Language** Understanding Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05065.pdf)\n\n- (arXiv 2023.1) ViTs for SITS: Vision Transformers for **Satellite Image Time Series**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04944.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmichaeltrs\u002FDeepSatModels)\n\n- (arXiv 2023.1) CLIP2Scene: Towards Label-efficient **3D Scene Understanding** by **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04926.pdf)\n\n- (arXiv 2023.1) A Large-Scale Outdoor Multi-modal **Dataset** and Benchmark for **Novel View Synthesis** and Implicit **Scene Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06782.pdf), [[Project]](https:\u002F\u002Fommo.luchongshan.com\u002F)\n\n- (arXiv 2023.1) USER: Unified Semantic Enhancement with Momentum Contrast for **Image-Text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06844.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhangy0822\u002FUSER)\n\n- (arXiv 2023.1) SAT: Size-Aware Transformer for 3D **Point Cloud Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06869.pdf)\n\n- (arXiv 2023.1) **Masked** **Visual** Reconstruction in **Language** Semantic Space, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06958.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FRILS)\n\n- (arXiv 2023.1) Vision Learners Meet Web **Image-Text** Pairs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07088.pdf), [[Code]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ftennant\u002FMUG_caption)\n\n- (arXiv 2023.1) GLIGEN: Open-Set Grounded **Text-to-Image** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07093.pdf), [[Project]](https:\u002F\u002Fgligen.github.io\u002F)\n\n- (arXiv 2023.1) **Learning** Customized Visual Models with **Retrieval**-Augmented **Knowledge**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07094.pdf), [[Project]](https:\u002F\u002Freact-vl.github.io\u002F)\n\n- (arXiv 2023.1) UATVR: 
Uncertainty-Adaptive **Text-Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06309.pdf)\n\n- (arXiv 2023.1) Learning Aligned Cross-modal Representations for **Referring Image Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06429.pdf)\n\n- (arXiv 2023.1) T2M-GPT: **Generating** Human **Motion** from Textual Descriptions with Discrete Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06052.pdf), [[Project]](https:\u002F\u002Fmael-zys.github.io\u002FT2M-GPT\u002F)\n\n- (arXiv 2023.1) DSVT: Dynamic **Sparse Voxel** Transformer with Rotated Sets, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06051.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHaiyang-W\u002FDSVT)\n\n- (arXiv 2023.1) CMAE-V: Contrastive Masked Autoencoders for **Video Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06018.pdf)\n\n- (arXiv 2023.1) Generating Templated Caption for **Video Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05997.pdf)\n\n- (arXiv 2023.1) Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised **Depth Estimation** in Dynamic Scenes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05871.pdf)\n\n- (arXiv 2023.1) SwinDepth: Unsupervised **Depth Estimation** using Monocular Sequences via Swin Transformer and Densely Cascaded Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06715.pdf)\n\n- (arXiv 2023.1) **CLIP**TER: Looking at the Bigger Picture in **Scene Text Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07464.pdf)\n\n- (arXiv 2023.1) Temporal Perceiving **Video-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07463.pdf)\n\n- (arXiv 2023.1) Joint Representation Learning for **Text** and 3D **Point Cloud**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07584.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FText4Point)\n\n- (arXiv 2023.1) Effective End-to-End **Vision Language** Pretraining with Semantic Visual Loss, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07236.pdf)\n\n- (arXiv 2023.1) PTA-Det: Point Transformer Associating Point cloud and Image for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07301.pdf)\n\n- (arXiv 2023.1) **Face Recognition** in the age of CLIP & Billion image datasets, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07315.pdf)\n\n- (arXiv 2023.1) HSTFormer: Hierarchical Spatial-Temporal Transformers for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07322.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fqianxiaoye825\u002FHSTFormer)\n\n- (arXiv 2023.1) Towards Models that Can **See** and **Read**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07389.pdf)\n\n- (arXiv 2023.1) **Embodied** Agents for Efficient Exploration and Smart Scene Description, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07150.pdf)\n\n- (arXiv 2023.1) **Self-Supervised Learning** from Images with a Joint-Embedding Predictive Architecture, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08243.pdf)\n\n- (arXiv 2023.1) Revisiting the Spatial and Temporal Modeling for **Few-shot Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07944.pdf)\n\n- (arXiv 2023.1) Multimodal Video Adapter for Parameter Efficient **Video Text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07868.pdf)\n\n- 
(arXiv 2023.1) **Self Supervision** Does Not Help Natural Language Supervision at Scale, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07836.pdf)\n\n- (arXiv 2023.1) MULTI-TARGET MULTI-CAMERA **VEHICLE TRACKING** USING TRANSFORMER-BASED CAMERA LINK MODEL AND SPATIAL-TEMPORAL INFORMATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07805.pdf)\n\n- (arXiv 2023.1) ATMAN: **Understanding** Transformer Predictions Through Memory Efficient **Attention** Manipulation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08110.pdf)\n\n- (arXiv 2023.1) DDS: Decoupled Dynamic **Scene-Graph Generation** Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07666.pdf), [[Code]]()\n\n- (arXiv 2023.1) Visual Writing Prompts: Character-Grounded **Story Generation** with Curated Image Sequences, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08571.pdf)\n\n- (arXiv 2023.1) **Image Memorability Prediction** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08647.pdf)\n\n- (arXiv 2023.1) HOLISTICALLY **EXPLAINABLE** VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08669.pdf)\n\n- (arXiv 2023.1) FlatFormer: Flattened Window Attention for **Efficient** **Point Cloud** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08739.pdf)\n\n- (arXiv 2023.1) LEGO-Net: Learning Regular **Rearrangements** of **Objects** in Rooms, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09629.pdf), [[Project]](https:\u002F\u002Fivl.cs.brown.edu\u002Fprojects\u002Flego-net)\n\n- (arXiv 2023.1) Zorro: the masked **multimodal** transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09595.pdf)\n\n- (arXiv 2023.1) Towards Robust **Video Instance Segmentation** with Temporal-Aware Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09416.pdf)\n\n- (arXiv 2023.1) Learning **Open-vocabulary Semantic Segmentation** Models From Natural Language Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09121.pdf), [[Project]](https:\u002F\u002Fjazzcharles.github.io\u002FOVSegmentor\u002F)\n\n- (arXiv 2023.1) Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object **Interaction Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09209.pdf), [[Code]](https:\u002F\u002Feth-ait.github.io\u002Ftransfusion-proj\u002F)\n\n- (arXiv 2023.1) Combined Use of Federated Learning and Image Encryption for **Privacy**-Preserving **Image Classification** with Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09255.pdf)\n\n- (arXiv 2023.1) Slice Transformer and Self-supervised Learning for **6DoF Localization** in 3D Point Cloud Maps, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08957.pdf)\n\n- (arXiv 2023.1) IMPROVING ACCURACY OF **ZERO-SHOT ACTION RECOGNITION** WITH HANDCRAFTED FEATURES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08874.pdf)\n\n- (arXiv 2023.1) Learning to View: Decision Transformers for **Active Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09544.pdf)\n\n- (arXiv 2023.1) Visual Semantic Relatedness Dataset for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08784.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fahmedssabir\u002FTextual-Visual-Semantic-Dataset)\n\n- (arXiv 2023.1) VERSATILE NEURAL PROCESSES FOR LEARNING **IMPLICIT NEURAL REPRESENTATIONS**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08883.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZongyuGuo\u002FVersatile-NP)\n\n- (arXiv 2023.1) RangeViT: Towards Vision Transformers for **3D Semantic Segmentation** in Autonomous Driving, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10222.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002Frangevit)\n\n- (arXiv 2023.1) Exploiting Optical Flow Guidance for Transformer-Based **Video Inpainting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10048.pdf)\n\n- (arXiv 2023.1) Image **Super-Resolution** using Efficient Striped Window Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09869.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFried-Rice-Lab\u002FFriedRiceLab)\n\n- (arXiv 2023.1) **Out of Distribution** Performance of State of Art Vision Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10750.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsalman-lui\u002Fvision_course_project)\n\n- (arXiv 2023.1) Compact Transformer **Tracker** with Correlative Masked Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10938.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHUSTDML\u002FCTTrack)\n\n- (arXiv 2023.1) **Vision-Language** Models Performing Zero-Shot Tasks Exhibit **Gender**-based **Disparities**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11100.pdf)\n\n- (arXiv 2023.1) Cut and Learn for **Unsupervised** Object **Detection** and Instance **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11320.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FCutLER)\n\n- (arXiv 2023.1) Explaining Visual **Biases** as Words by Generating Captions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11104.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Falinlab\u002Fb2t)\n\n- (arXiv 2023.1) Revisiting **Temporal Modeling** for **CLIP**-based Image-to-Video Knowledge Transferring, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11116.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffarewellthree\u002FSTAN)\n\n- (arXiv 2023.1) **Multi-video Moment Ranking** with Multimodal Clue, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13606.pdf)\n\n- (arXiv 2023.1) SDF-FORMER: **MONOCULAR SCENE RECONSTRUCTION** WITH 3D SDF TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13510.pdf), [[Project]](https:\u002F\u002Fweihaosky.github.io\u002Fsdfformer)\n\n- (arXiv 2023.1) Grounding Language Models to Images for **Multimodal Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13823.pdf)\n\n- (arXiv 2023.1) Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for **Visual Commonsense Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13335.pdf)\n\n- (arXiv 2023.1) A Modular Multi-stage Lightweight Graph Transformer Network for **Human Pose and Shape Estimation** from 2D Human Pose, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13403.pdf)\n\n- (arXiv 2023.1) Priors are Powerful: Improving a Transformer for **Multi-camera 3D Detection** with 2D Priors, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13592.pdf)\n\n- (arXiv 2023.1) UPop: Unified and Progressive Pruning for **Compressing** **Vision-Language** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13741.pdf)\n\n- (arXiv 2023.1) **Fairness**-aware Vision Transformer via Debiased Self-Attention, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13803.pdf)\n\n- (arXiv 2023.1) Anchor-Based Adversarially Robust **Zero-Shot Learning** Driven by Language, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13096.pdf)\n\n- (arXiv 2023.1) Distilling Internet-Scale **Vision-Language** Models into **Embodied** Agents, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12507.pdf)\n\n- (arXiv 2023.1) 6-DoF Robotic **Grasping** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12476.pdf)\n\n- (arXiv 2023.1) Do Embodied Agents Dream of Pixelated Sheep?: **Embodied Decision Making** using Language Guided World Modelling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12050.pdf), [[Project]](https:\u002F\u002Fdeckardagent.github.io\u002F)\n\n- (arXiv 2023.1) GALIP: Generative Adversarial CLIPs for **Text-to-Image** Synthesis, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12959.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ftobran\u002FGALIP)\n\n- (arXiv 2023.1) STAIR: Learning **Sparse** **Text and Image** Representation in Grounded Tokens, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13081.pdf)\n\n- (arXiv 2023.1) **Aerial** Image Object **Detection** With Vision Transformer Detector (ViTDet), [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2301\u002F2301.12058.pdf)\n\n- (arXiv 2023.1) Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on **Image Restoration**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12332.pdf)\n\n- (arXiv 2023.1) Debiased Fine-Tuning for **Vision-language** Models by **Prompt** Regularization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12429.pdf), [[Code]]()\n\n- (arXiv 2023.1) BLIP-2: Bootstrapping **Language-Image** Pre-training with **Frozen** Image Encoders and Large Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12597.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2)\n\n- (arXiv 2023.1) Tagging before Alignment: Integrating Multi-Modal Tags for **Video-Text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12644.pdf)\n\n- (arXiv 2023.1) SEAFORMER: SQUEEZE-ENHANCED AXIAL TRANSFORMER FOR MOBILE SEMANTIC **SEGMENTATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13156.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FSeaFormer)\n\n- (arXiv 2023.1) Learning 6-DoF Fine-grained **Grasp Detection** Based on Part Affordance Grounding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11564.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Flang-shape)\n\n- (arXiv 2023.1) Multimodal Event Transformer for **Image-guided Story Ending Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11357.pdf)\n\n- (arXiv 2023.1) Style-Aware Contrastive Learning for Multi-Style Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11367.pdf)\n\n- (arXiv 2023.1) 3DShape2VecSet: A **3D Shape Representation** for Neural Fields and Generative Diffusion Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11445.pdf)\n\n- (arXiv 2023.1) Semi-Parametric **Video-Grounded Text Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11507.pdf)\n\n- (arXiv 2023.1) **Robust** Transformer with Locality Inductive Bias and Feature Normalization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11553.pdf)\n\n- 
(arXiv 2023.1) LEVERAGING THE THIRD DIMENSION IN **CONTRASTIVE LEARNING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11790.pdf)\n\n- (arXiv 2023.1) Understanding **Self-Supervised** Pretraining with **Part**-Aware Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11915.pdf)\n\n- (arXiv 2023.1) Hypergraph Transformer for **Skeleton-based Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09590.pdf)\n\n- (arXiv 2023.1) CPT-V: A Contrastive Approach to Post-Training **Quantization** of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09643.pdf)\n\n- (arXiv 2023.1) InstructPix2Pix: Learning to Follow **Image Editing** Instructions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09800.pdf), [[Code]](http:\u002F\u002Ftimothybrooks.com\u002Finstruct-pix2pix)\n\n- (arXiv 2023.1) OvarNet: Towards Open-vocabulary Object **Attribute Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09506.pdf), [[Project]](https:\u002F\u002Fkyanchen.github.io\u002FOvarNet)\n\n- (arXiv 2023.1) **Token** Transformer: Can class token help window-based transformer build better **long-range interactions**? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06083.pdf)\n\n- (arXiv 2023.1) Toward Building General **Foundation Models** for Language, Vision, and Vision-Language Understanding Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05065.pdf)\n\n- (arXiv 2023.1) Multimodal Inverse Cloze Task for Knowledge-based **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04366.pdf), [[Code]]()\n\n- (arXiv 2023.1) FGAHOI: Fine-Grained Anchors for **Human-Object Interaction** Detection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04019.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxiaomabufei\u002FFGAHOI)\n\n- (arXiv 2023.1) Parallel Reasoning Network for **Human-Object Interaction** Detection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.03510.pdf)\n\n- (arXiv 2023.1) In Defense of Structural Symbolic Representation for **Video Event-Relation Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.03410.pdf)\n\n- (arXiv 2023.1) **Scene Synthesis** from Human **Motion**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.01424.pdf), [[Project]](https:\u002F\u002Flijiaman.github.io\u002Fprojects\u002Fsummon\u002F)\n\n### 2022.12\n\n- (arXiv 2022.12) EVA: Exploring the Limits of **Masked Visual Representation** Learning at Scale, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07636.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEVA)\n\n- (arXiv 2022.12) OneFormer: One Transformer to Rule Universal Image **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06220.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FOneFormer)\n\n- (arXiv 2022.12) MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards **Multi-modal Open-domain Conversation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05719.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Fvictorsungo\u002FMMDialog)\n\n- (arXiv 2022.12) Why is Winoground Hard? Investigating Failures in **Visuolinguistic Compositionality**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00768.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fajd12342\u002Fwhy-winoground-hard)\n\n- (arXiv 2022.12) Multimodal **Information Bottleneck**: Learning Minimal Sufficient Unimodal and **Multimodal** Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17444.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTmacMai\u002FMultimodal-Information-Bottleneck)\n\n- (arXiv 2022.12) CLIP-FLOW: CONTRASTIVE LEARNING BY SEMISUPERVISED ITERATIVE PSEUDO LABELING FOR **OPTICAL FLOW ESTIMATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14383.pdf)\n\n- (arXiv 2022.12) INSTRUCTION-FOLLOWING **AGENTS** WITH JOINTLY PRE-TRAINED **VISION-LANGUAGE** MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13431.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flhao499\u002Finstructrl)\n\n- (arXiv 2022.12) MetaFormer **Baselines** for Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13452.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fmetaformer)\n\n- (arXiv 2022.12) ViTCoD: Vision Transformer **Acceleration** via Dedicated Algorithm and Accelerator Co-Design, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09573.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGATECH-EIC\u002FViTCoD)\n\n- (arXiv 2022.12) FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED **ROBOT** DATA, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10047.pdf), [[Project]](https:\u002F\u002Fplay-to-policy.github.io\u002F)\n\n- (arXiv 2022.12) Optimizing **Prompts** for **Text-to-Image** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09611.pdf), [[Code]](https:\u002F\u002Faka.ms\u002Fpromptist)\n\n- (arXiv 2022.12) Attentive **Mask** **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08653.pdf)\n\n- (arXiv 2022.12) Rethinking **Cooking State Recognition** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08586.pdf)\n\n- (arXiv 2022.12) Enhancing **Multi-modal** and **Multi-hop Question Answering** via Structured Knowledge and Unified Retrieval-Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08632.pdf), [[Code]]()\n\n- (arXiv 2022.12) MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in **Vision and Language** Models & Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08158.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHeidelberg-NLP\u002FMM-SHAP)\n\n- (arXiv 2022.12) RepQ-ViT: Scale Reparameterization for Post-Training **Quantization** of Vision Transformers, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08254.pdf)\n\n- (arXiv 2022.12) WAVENHANCER: UNIFYING WAVELET AND TRANSFORMER FOR **IMAGE ENHANCEMENT**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08327.pdf)\n\n- (arXiv 2022.12) AUTOENCODERS AS CROSS-MODAL TEACHERS: CAN PRETRAINED 2D IMAGE TRANSFORMERS HELP **3D REPRESENTATION** LEARNING?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08320.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FACT)\n\n- (arXiv 2022.12) SceneGATE: Scene-Graph based co-Attention networks for TExt **visual question answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08283.pdf)\n\n- (arXiv 2022.12) Emergent **Analogical Reasoning** in Large Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09196.pdf)\n\n- (arXiv 2022.12) Unleashing the Power of **Visual Prompting** At the Pixel Level, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10556.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FEVP)\n\n- (arXiv 2022.12) Does **CLIP** Bind Concepts? Probing **Compositionality** in Large Image Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10537.pdf)\n\n- (arXiv 2022.12) LayoutDETR: Detection Transformer Is a Good Multimodal **Layout Designer**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09877.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLayoutDETR)\n\n- (arXiv 2022.12) Towards Unsupervised **Visual Reasoning**: Do Off-The-Shelf Features Know How to Reason?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10292.pdf)\n\n- (arXiv 2022.12) Benchmarking **Spatial Relationships** in **Text-to-Image** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10015.pdf), [[Project]](https:\u002F\u002Fvisort2i.github.io\u002F)\n\n- (arXiv 2022.12) MetaCLUE: Towards Comprehensive **Visual Metaphors** Research, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09898.pdf), [[Project]](https:\u002F\u002Fmetaclue.github.io\u002F)\n\n- (arXiv 2022.12) Tackling Ambiguity with Images: Improved **Multimodal** Machine **Translation** and Contrastive Evaluation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10140.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMatthieuFP\u002FCoMMuTE.git)\n\n- (arXiv 2022.12) Cross-modal Attention Congruence Regularization for **Vision-Language** **Relation** Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10549.pdf)\n\n- (arXiv 2022.12) Does unsupervised **grammar induction** need pixels?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10564.pdf)\n\n- (arXiv 2022.12) Hi-LASSIE: High-Fidelity **Articulated** Shape and Skeleton **Discovery** from Sparse **Image** Ensemble, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11042.pdf)\n\n- (arXiv 2022.12) MAViC: Multimodal Active Learning for **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11109.pdf)\n\n- (arXiv 2022.12) What Makes for Good **Tokenizers** in Vision Transformer? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11115.pdf)\n\n- (arXiv 2022.12) Not Just Pretty Pictures: **Text-to-Image** Generators Enable Interpretable Interventions for **Robust** Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11237.pdf), [[Code]]()\n\n- (arXiv 2022.12) Generalized Decoding for **Pixel**, **Image**, and **Language**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11270.pdf), [[Project]](https:\u002F\u002Fx-decoder-vl.github.io\u002F)\n\n- (arXiv 2022.12) METEOR Guided Divergence for **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10690.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fd-rothen\u002Fbmhrl)\n\n- (arXiv 2022.12) SLGTFORMER: AN ATTENTION-BASED APPROACH TO **SIGN LANGUAGE RECOGNITION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10746.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fneilsong\u002Fslt)\n\n- (arXiv 2022.12) FROM IMAGES TO TEXTUAL **PROMPTS**: ZERO-SHOT **VQA** WITH FROZEN LARGE LANGUAGE MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10846.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fimg2prompt-vqa)\n\n- (arXiv 2022.12) 3D Highlighter: Localizing Regions on **3D** Shapes via **Text** Descriptions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11263.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fthreedle\u002F3DHighlighter)\n\n- (arXiv 2022.12) Contrastive **Language-Vision** AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification **Bias**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11261.pdf)\n\n- (arXiv 2022.12) Ultra-High-Definition **Low-Light Image Enhancement**: A Benchmark and Transformer-Based Method, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11548.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTaoWangzj\u002FLLFormer)\n\n- (arXiv 2022.12) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for **Text-to-Video** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11565.pdf), [[Project]](https:\u002F\u002Ftuneavideo.github.io\u002F)\n\n- (arXiv 2022.12) Beyond SOT: It’s Time to **Track** **Multiple** Generic **Objects** at Once, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11920.pdf)\n\n- (arXiv 2022.12) KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL **EMBODIED NAVIGATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11345.pdf)\n\n- (arXiv 2022.12) SegViT: **Semantic Segmentation** with Plain Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05844.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzbwxp\u002FSegVit)\n\n- (arXiv 2022.12) Open-Vocabulary **Temporal Action Detection** with Off-the-Shelf Image-Text Features, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10596.pdf)\n\n- (arXiv 2022.12) Point·E: A System for **Generating 3D Point Clouds** from Complex **Prompts**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08751.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fpoint-e)\n\n- (arXiv 2022.12) Inductive Attention for **Video Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08830.pdf)\n\n- (arXiv 2022.12) **Image-and-Language** Understanding from Pixels Only, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08045.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbig_vision)\n\n- (arXiv 2022.12) FlexiViT: One Model 
for All **Patch Sizes**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08013.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbig_vision)\n\n- (arXiv 2022.12) **Unsupervised** Object **Localization**: Observing the Background to Discover Objects, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07834.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FFOUND)\n\n- (arXiv 2022.12) Vision Transformers are Parameter-Efficient **Audio-Visual** Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07983.pdf), [[Project]](https:\u002F\u002Fgenjib.github.io\u002Fproject_page\u002FLAVISH\u002F)\n\n- (arXiv 2022.12) Full Contextual Attention for Multi-resolution Transformers in **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07890.pdf)\n\n- (arXiv 2022.12) DETR4D: Direct Multi-View **3D Object Detection** with Sparse Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07849.pdf)\n\n- (arXiv 2022.12) Enhanced Training of Query-Based Object **Detection** via Selective Query Recollection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07593.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFangyi-Chen\u002FSQR)\n\n- (arXiv 2022.12) TEXT-GUIDED MASK-FREE LOCAL **IMAGE RETOUCHING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07603.pdf)\n\n- (arXiv 2022.12) Summary-Oriented Vision Modeling for **Multimodal Abstractive Summarization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07672.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FXL2248\u002FSOV-MAS)\n\n- (arXiv 2022.12) One-Shot Domain Adaptive and Generalizable **Semantic Segmentation** with Class-Aware Cross-Domain Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07292.pdf)\n\n- (arXiv 2022.12) ConQueR: Query Contrast Voxel-DETR for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07289.pdf)\n\n- (arXiv 2022.12) Examining the **Difference** Among **Transformers** and **CNNs** with Explanation Methods, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06872.pdf)\n\n- (arXiv 2022.12) Find Someone Who: Visual Commonsense Understanding in Human-Centric **Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06971.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHxyou\u002FHumanCog)\n\n- (arXiv 2022.12) Dual-branch Cross-Patch Attention Learning for **Group Affect Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07055.pdf)\n\n- (arXiv 2022.12) Cross-Modal Similarity-Based Curriculum Learning for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07075.pdf)\n\n- (arXiv 2022.12) NLIP: Noise-robust **Language-Image** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07086.pdf)\n\n- (arXiv 2022.12) Lidar**CLIP** or: How I Learned to Talk to **Point Clouds**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06858.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fatonderski\u002Flidarclip)\n\n- (arXiv 2022.12) **CLIP**SEP: LEARNING TEXT-QUERIED **SOUND SEPARATION** WITH NOISY UNLABELED VIDEOS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07065.pdf)\n\n- (arXiv 2022.12) Reproducible **scaling laws** for contrastive language-image learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07143.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLAION-AI\u002Fscaling-laws-openclip)\n\n- (arXiv 2022.12) WHAT DO VISION TRANSFORMERS LEARN? 
A VISUAL **EXPLORATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06727.pdf)\n\n- (arXiv 2022.12) Self-Play and Self-Describe: **Policy Adaptation** with **Vision-Language** Foundation Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07398.pdf), [[Project]](https:\u002F\u002Fgeyuying.github.io\u002FSPLAYD)\n\n- (arXiv 2022.12) GPVIT: A **HIGH RESOLUTION** NON-HIERARCHICAL VISION TRANSFORMER WITH GROUP PROPAGATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06795.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChenhongyiYang\u002FGPViT)\n\n- (arXiv 2022.12) Learning 3D Representations from 2D Pre-trained Models via **Image-to-Point** Masked Autoencoders, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06785.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FI2P-MAE)\n\n- (arXiv 2022.12) Parallel Queries for **Human-Object Interaction Detection**, [[Paper]](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3551626.3564944)\n\n- (arXiv 2022.12) Structure-Guided **Image Completion** with Image-level and Object-level Semantic Discriminators, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06310.pdf)\n\n- (arXiv 2022.12) Localized Latent Updates for **Fine-Tuning** **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06556.pdf)\n\n- (arXiv 2022.12) CamoFormer: Masked Separable Attention for **Camouflaged Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06570.pdf)\n\n- (arXiv 2022.12) FastMIM: Expediting **Masked** Image Modeling Pre-training for Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06593.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fggjy\u002FFastMIM.pytorch)\n\n- (arXiv 2022.12) OAMixer: Object-aware **Mixing** Layer for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06595.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Falinlab\u002FOAMixer)\n\n- (arXiv 2022.12) Doubly Right **Object Recognition**: A Why **Prompt** for Visual **Rationales**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06202.pdf)\n\n- (arXiv 2022.12) RT-1: **ROBOTICS** TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06817.pdf), [[Project]](https:\u002F\u002Frobotics-transformer.github.io\u002F)\n\n- (arXiv 2022.12) **Egocentric Video** Task Translation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06301.pdf)\n\n- (arXiv 2022.12) ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved **Visio-Linguistic** Models in **3D** Scenes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06250.pdf), [[Project]](https:\u002F\u002Fscanents3d.github.io\u002F)\n\n- (arXiv 2022.12) **Curriculum Learning** Meets Weakly Supervised **Modality Correlation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07619.pdf)\n\n- (arXiv 2022.12) IMoS: Intent-Driven Full-Body **Motion Synthesis** for **Human-Object Interactions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07555.pdf)\n\n- (arXiv 2022.12) MultiAct: Long-Term **3D Human Motion Generation** from Multiple Action Labels, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.05897.pdf)\n\n- (arXiv 2022.12) A New Path: Scaling **Vision-and-Language Navigation** with Synthetic Instructions and Imitation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03112.pdf)\n\n- (arXiv 2022.12) Beyond Object Recognition: A New Benchmark 
towards **Object Concept Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.02710.pdf), [[Project]](https:\u002F\u002Fmvig-rhos.com\u002Focl)\n\n- (arXiv 2022.12) ViTPose+: Vision Transformer Foundation Model for Generic Body **Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04246.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FViTPose)\n\n- (arXiv 2022.12) Structured **Vision-Language** Pretraining for **Computational** Cooking, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04267.pdf)\n\n- (arXiv 2022.12) MIME: **Human**-Aware **3D Scene Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04360.pdf), [[Project]](https:\u002F\u002Fmime.is.tue.mpg.de\u002F)\n\n- (arXiv 2022.12) OFASys: A **Multi-Modal Multi-Task** Learning System for Building **Generalist Models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04408.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FOFASys)\n\n- (arXiv 2022.12) Task **Bias** in **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04412.pdf)\n\n- (arXiv 2022.12) Multi-Concept Customization of **Text-to-Image** **Diffusion**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04488.pdf), [[Code]](https:\u002F\u002Fwww.cs.cmu.edu\u002F~custom-diffusion\u002F)\n\n- (arXiv 2022.12) Few-View Object **Reconstruction** with Unknown Categories and Camera Poses, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04492.pdf), [[Project]](https:\u002F\u002Fut-austin-rpl.github.io\u002FFORGE\u002F)\n\n- (arXiv 2022.12) Masked Video Distillation: Rethinking **Masked** Feature Modeling for **Self-supervised** **Video Representation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04500.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fruiwang2021\u002Fmvd)\n\n- (arXiv 2022.12) Learning **Video** Representations from **Large Language Models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04501.pdf), [[Project]](https:\u002F\u002Ffacebookresearch.github.io\u002FLaViLa)\n\n- (arXiv 2022.12) Frozen **CLIP** Model is Efficient **Point Cloud** Backbone, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04098.pdf)\n\n- (arXiv 2022.12) DialogCC: Large-scale **Multi-Modal Dialogue** Dataset, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04119.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Fpassing2961\u002FDialogCC)\n\n- (arXiv 2022.12) Group Generalized Mean **Pooling** for Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04114.pdf)\n\n- (arXiv 2022.12) LEARNING DOMAIN INVARIANT **PROMPT** FOR **VISION-LANGUAGE** MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04196.pdf)\n\n- (arXiv 2022.12) LLM-Planner: Few-Shot Grounded **Planning** for **Embodied** Agents with **Large Language Models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04088.pdf)\n\n- (arXiv 2022.12) Hyperbolic **Contrastive** Learning for Visual **Representations** beyond Objects, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.00653.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshlokk\u002FHCL\u002Ftree\u002Fmain\u002FHCL)\n\n### 2022.11\n\n- (arXiv 2022.11) Texts as Images in Prompt Tuning for **Multi-Label Image Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12739.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fguozix\u002FTaI-DPT)\n\n- (arXiv 2022.11) Tell Me What Happened: Unifying **Text-guided Video Completion** via Multimodal Masked Video Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12824.pdf)\n\n- (arXiv 2022.11) InDiReCT: Language-Guided Zero-Shot Deep **Metric Learning** for Images, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12760.pdf)\n\n- (arXiv 2022.11) VoP: Text-Video Co-operative Prompt Tuning for **Cross-Modal Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12764.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbighuang624\u002FVoP)\n\n- (arXiv 2022.11) **Completing point cloud** from few points by Wasserstein GAN and Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.12746.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FWxfQjh\u002FStability-point-recovery.git)\n\n- (arXiv 2022.11) Integrally Pre-Trained Transformer **Pyramid** Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12735.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsunsmarterjie\u002FiTPN)\n\n- (arXiv 2022.11) Data Augmentation Vision Transformer for **Fine-grained Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12879.pdf)\n\n- (arXiv 2022.11) **DETR**s with Collaborative Hybrid Assignments **Training**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12860.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FCo-DETR)\n\n- (arXiv 2022.11) Open-vocabulary **Attribute Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12914.pdf), [[Project]](https:\u002F\u002Fovad-benchmark.github.io\u002F)\n\n- (arXiv 2022.11) Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised **Monocular Depth Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13202.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fnoahzn\u002FLite-Mono)\n\n- (arXiv 2022.11) Inversion-Based **Creativity Transfer** with Diffusion Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13203.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FzyxElsa\u002Fcreativity-transfer)\n\n- (arXiv 2022.11) CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free **Continual Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13218.pdf)\n\n- (arXiv 2022.11) SVFormer: Semi-supervised Video Transformer for **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13222.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChenHsing\u002FSVFormer)\n\n- (arXiv 2022.11) Generalizable **Implicit Neural Representations** via Instance Pattern Composers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13223.pdf)\n\n- (arXiv 2022.11) Improving **Visual-textual Sentiment Analysis** by Fusing Expert Features, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12981.pdf)\n\n- (arXiv 2022.11) **Self-Supervised** Learning based on Heat Equation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13228.pdf)\n\n- (arXiv 2022.11) Peekaboo: 
**Text to Image** Diffusion Models are Zero-Shot Segmentors, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13224.pdf)\n\n- (arXiv 2022.11) Paint by Example: Exemplar-based **Image Editing** with Diffusion Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13227.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFantasy-Studio\u002FPaint-by-Example)\n\n- (arXiv 2022.11) Human or Machine? **Turing Tests** for Vision and Language, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13087.pdf), [[Code]](https:\u002F\u002Ftinyurl.com\u002F8x8nha7p)\n\n- (arXiv 2022.11) Teach-DETR: Better **Training** **DETR** with Teachers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11953.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLeonHLJ\u002FTeach-DETR)\n\n- (arXiv 2022.11) Conv2Former: A Simple Transformer-Style **ConvNet** for Visual Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11943.pdf)\n\n- (arXiv 2022.11) X^2-VLM: All-In-One Pre-trained Model For **Vision-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12402.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzengyan-97\u002FX2-VLM)\n\n- (arXiv 2022.11) Aligning Source Visual and Target Language Domains for Unpaired **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12148.pdf)\n\n- (arXiv 2022.11) On the Transferability of Visual Features in **Generalized Zero-Shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12494.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fuvavision\u002FTV-GZSL)\n\n- (arXiv 2022.11) Generalizable Industrial Visual **Anomaly Detection** with Self-Induction Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12311.pdf)\n\n- (arXiv 2022.11) Transformer Based Multi-Grained Features for Unsupervised **Person Re-Identification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12280.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FRikoLi\u002FWACV23-workshop-TMGF)\n\n- (arXiv 2022.11) Efficient Frequency Domain-based Transformers for High-Quality Image **Deblurring**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12250.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkkkls\u002FFFTformer)\n\n- (arXiv 2022.11) Event Transformer+. 
A multi-purpose solution for efficient **event data processing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12222.pdf)\n\n- (arXiv 2022.11) MagicPony: Learning Articulated **3D Animals** in the Wild, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12497.pdf), [[Project]](https:\u002F\u002F3dmagicpony.github.io\u002F)\n\n- (arXiv 2022.11) Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free **Continual Learning** of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12292.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOcraM17\u002FGCAB-CFDC)\n\n- (arXiv 2022.11) Expectation-Maximization Contrastive Learning for Compact **Video-and-Language** Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11427.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjpthu17\u002FEMCL)\n\n- (arXiv 2022.11) N-Gram in Swin Transformers for Efficient Lightweight **Image Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11436.pdf)\n\n- (arXiv 2022.11) **Robotic** Skill Acquisition via Instruction Augmentation with Vision-Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11736.pdf), [[Code]](https:\u002F\u002Finstructionaugmentation.github.io\u002F)\n\n- (arXiv 2022.11) Peeling the Onion: Hierarchical Reduction of Data Redundancy for **Efficient** Vision Transformer **Training**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10801.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZLKong\u002FTri-Level-ViT)\n\n- (arXiv 2022.11) Unifying **Vision-Language** Representation Space with Single-tower Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11153.pdf)\n\n- (arXiv 2022.11) DeepSolo: Let Transformer Decoder with Explicit Points Solo for **Text Spotting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10772.pdf)\n\n- (arXiv 2022.11) Castling-ViT: **Compressing Self-Attention** via Switching Towards Linear-Angular Attention During Vision Transformer Inference, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10526.pdf)\n\n- (arXiv 2022.11) CL-CrossVQA: A Continual Learning Benchmark for **Cross-Domain Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10567.pdf)\n\n- (arXiv 2022.11) Normal Transformer: Extracting Surface Geometry from **LiDAR** Points Enhanced by Visual Semantics, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10580.pdf)\n\n- (arXiv 2022.11) A Unified Model for **Video** Understanding and Knowledge Embedding with Heterogeneous **Knowledge Graph** Dataset, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10624.pdf)\n\n- (arXiv 2022.11) Efficient **Video Representation** Learning via Masked Video Modeling with Motion-centric Token Selection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10636.pdf)\n\n- (arXiv 2022.11) DiffStyler: Controllable Dual Diffusion for Text-Driven **Image Stylization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10682.pdf)\n\n- (arXiv 2022.11) TORE: Token Reduction for Efficient **Human Mesh Recovery** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10705.pdf)\n\n- (arXiv 2022.11) **Synthesizing** Coherent **Story** with Auto-Regressive Latent Diffusion Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10950.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFlash-321\u002FARLDM)\n\n- (arXiv 2022.11) Are **Out-of-Distribution Detection** Methods Reliable?, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10892.pdf)\n\n- (arXiv 2022.11) GLT-T: Global-Local Transformer Voting for **3D Single Object Tracking** in Point Clouds, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10927.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhaooozi\u002FGLT-T)\n\n- (arXiv 2022.11) CROSS-MODAL CONTRASTIVE LEARNING FOR ROBUST REASONING IN **VQA**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11190.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fqizhust\u002Fcmcl_vqa_pl)\n\n- (arXiv 2022.11) LISA: Localized **Image Stylization** with Audio via Implicit Neural Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11381.pdf)\n\n- (arXiv 2022.11) MagicVideo: Efficient **Video Generation** With Latent Diffusion Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11018.pdf), [[Code]](https:\u002F\u002Fmagicvideo.github.io\u002F#)\n\n- (arXiv 2022.11) DreamArtist: Towards Controllable One-Shot **Text-to-Image** Generation via Contrastive Prompt-Tuning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11337.pdf)\n\n- (arXiv 2022.11) Hybrid Transformer Based Feature Fusion for Self-Supervised **Monocular Depth Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11066.pdf)\n\n- (arXiv 2022.11) Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable **Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11158.pdf)\n\n- (arXiv 2022.11) Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language **Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11116.pdf)\n\n- (arXiv 2022.11) You Need Multiple Exiting: Dynamic Early Exiting for **Accelerating** Unified Vision Language Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11152.pdf)\n\n- (arXiv 2022.11) Beyond Attentive Tokens: Incorporating Token Importance and Diversity for **Efficient** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11315.pdf)\n\n- (arXiv 2022.11) FlowLens: Seeing Beyond the **FoV** via Flow-guided **Clip**-Recurrent Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11293.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMasterHow\u002FFlowLens)\n\n- (arXiv 2022.11) PS-Transformer: Learning Sparse **Photometric Stereo** Network using Self-Attention Mechanism, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11386.pdf)\n\n- (arXiv 2022.11) On the Robustness, Generalization, and Forgetting of Shape-Texture Debiased **Continual Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11174.pdf)\n\n- (arXiv 2022.11) Vision Transformer with Super **Token Sampling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11167.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhhb072\u002FSViT)\n\n- (arXiv 2022.11) Detect Only What You Specify: Object **Detection** with Linguistic Target, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11572.pdf)\n\n- (arXiv 2022.11) Visual Programming: Compositional **visual reasoning** without training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11559.pdf), [[Project]](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fvisprog)\n\n- (arXiv 2022.11) ClipCrop: Conditioned **Cropping** Driven by **Vision-Language** Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11492.pdf)\n\n- (arXiv 2022.11) SMAUG: Sparse **Masked** Autoencoder for **Efficient** 
**Video-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11446.pdf)\n\n- (arXiv 2022.11) **Blur Interpolation** Transformer for Real-World Motion from Blur, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11423.pdf)\n\n- (arXiv 2022.11) Mean Shift Mask Transformer for Unseen Object Instance **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11679.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYoungSean\u002FUnseenObjectsWithMeanShift)\n\n- (arXiv 2022.11) PointCLIP V2: Adapting **CLIP** for Powerful **3D** Open-world Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11682.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyangyangyang127\u002FPointCLIP_V2)\n\n- (arXiv 2022.11) Exploring Discrete **Diffusion** Models for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11694.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbuxiangzhiren\u002FDDCap)\n\n- (arXiv 2022.11) PERCEIVER-VL: **Efficient** **Vision-and-Language** Modeling with Iterative Latent Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11701.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzinengtang\u002FPerceiver_VL)\n\n- (arXiv 2022.11) Multitask **Vision-Language** **Prompt** Tuning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11720.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FsIncerass\u002FMVLPT)\n\n- (arXiv 2022.11) Teaching **Structured** **Vision & Language** Concepts to Vision & Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11733.pdf)\n\n- (arXiv 2022.11) WEIGHTED **ENSEMBLE** **SELF-SUPERVISED** LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09981.pdf)\n\n- (arXiv 2022.11) BEVFormer v2: Adapting Modern Image Backbones to **Bird’s-Eye-View Recognition** via Perspective Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10439.pdf)\n\n- (arXiv 2022.11) Task Residual for Tuning **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10277.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgeekyutao\u002FTaskRes)\n\n- (arXiv 2022.11) α DARTS Once More: Enhancing Differentiable **Architecture Search** by **Masked** Image Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10105.pdf)\n\n- (arXiv 2022.11) Delving into Transformer for Incremental Semantic **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10253.pdf)\n\n- (arXiv 2022.11) DETRDistill: A Universal **Knowledge Distillation** Framework for **DETR**-families, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10156.pdf)\n\n- (arXiv 2022.11) PromptCap: Prompt-Guided Task-Aware Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09699.pdf)\n\n- (arXiv 2022.11) UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH **VIDEO** UNIFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09552.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FUniFormerV2)\n\n- (arXiv 2022.11) **Masked** Reconstruction **Contrastive** Learning with Information Bottleneck Principle, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09013.pdf)\n\n- (arXiv 2022.11) Listen, denoise, action! 
Audio-driven **motion synthesis** with diffusion models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09707.pdf), [[Project]](https:\u002F\u002Fwww.speech.kth.se\u002Fresearch\u002Flisten-denoise-action\u002F)\n\n- (arXiv 2022.11) ConStruct-VL: Data-Free Continual **Structured VL Concepts** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09790.pdf)\n\n- (arXiv 2022.11) How to **Fine-Tune** Vision Models with **SGD**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09359.pdf)\n\n- (arXiv 2022.11) Progressive Tree-Structured Prototype Network for End-to-End Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09460.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNovaMind-Z\u002FPTSN)\n\n- (arXiv 2022.11) CapEnrich: Enriching **Caption** Semantics for Web Images via Cross-modal Pre-trained Knowledge, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09371.pdf), [[Code]]()\n\n- (arXiv 2022.11) Visual Commonsense-aware Representation Network for **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09469.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzchoi\u002FVCRN)\n\n- (arXiv 2022.11) Language Conditioned Spatial Relation Reasoning for **3D Object Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09646.pdf), [[Code]](https:\u002F\u002Fcshizhe.github.io\u002Fprojects\u002Fvil3dref.html)\n\n- (arXiv 2022.11) HARDVS: Revisiting Human **Activity Recognition** with **Dynamic Vision Sensors**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09648.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FEvent-AHU\u002FHARDVS)\n\n- (arXiv 2022.11) Towards All-in-one **Pre-training** via Maximizing **Multi-modal** Mutual Information, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09807.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FM3I-Pretraining)\n\n- (arXiv 2022.11) Uni-Perceiver v2: A **Generalist** Model for Large-Scale **Vision** and **Vision-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09808.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffundamentalvision\u002FUni-Perceiver)\n\n- (arXiv 2022.11) D^3ETR: Decoder **Distillation** for **Detection** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09768.pdf)\n\n- (arXiv 2022.11) **CAE** v2: Context Autoencoder with **CLIP** Target, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09799.pdf)\n\n- (arXiv 2022.11) Cross-Modal Adapter for **Text-Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09623.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FCross-Modal-Adapter)\n\n- (arXiv 2022.11) TOKEN **TURING MACHINES**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09119.pdf)\n\n- (arXiv 2022.11) WILL LARGE-SCALE **GENERATIVE** MODELS CORRUPT **FUTURE DATASETS**? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08095.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmoskomule\u002Fdataset-contamination)\n\n- (arXiv 2022.11) Demystify **Self-Attention** in Vision Transformers from a Semantic Perspective: Analysis and Application, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08543.pdf)\n\n- (arXiv 2022.11) SATVSR: Scenario Adaptive Transformer for Cross Scenarios **Video Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.08703.pdf)\n\n- (arXiv 2022.11) TransCC: Transformer-based **Multiple Illuminant Color Constancy** Using Multitask Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08772.pdf)\n\n- (arXiv 2022.11) Stare at What You See: **Masked Image Modeling** without Reconstruction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08887.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenPerceptionX\u002Fmaskalign)\n\n- (arXiv 2022.11) HeatViT: Hardware-Efficient Adaptive **Token Pruning** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08110.pdf)\n\n- (arXiv 2022.11) Cross-domain Federated Adaptive **Prompt Tuning** for **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07864.pdf)\n\n- (arXiv 2022.11) YORO - Lightweight End to End **Visual Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07912.pdf)\n\n- (arXiv 2022.11) **Knowledge Distillation** for Detection Transformer with Consistent Distillation Points Sampling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08071.pdf)\n\n- (arXiv 2022.11) BiViT: Extremely **Compressed** **Binary** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07091.pdf)\n\n- (arXiv 2022.11) ContextCLIP: Contextual Alignment of **Image-Text** pairs on **CLIP** visual representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07122.pdf)\n\n- (arXiv 2022.11) Zero-shot Image **Captioning** by Anchor-augmented Vision-Language Space Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07275.pdf)\n\n- (arXiv 2022.11) Seeing Beyond the **Brain**: Conditional Diffusion Model with Sparse Masked Modeling for **Vision Decoding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06956.pdf), [[Project]](https:\u002F\u002Fmind-vis.github.io\u002F)\n\n- (arXiv 2022.11) Enhancing **Few-Shot Image Classification** with Cosine Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06828.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvinuni-vishc\u002FFew-Shot-Cosine-Transformer)\n\n- (arXiv 2022.11) SCOTCH and SODA: A Transformer **Video Shadow Detection** Framework, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06885.pdf)\n\n- (arXiv 2022.11) AU-Aware Vision Transformers for Biased **Facial Expression Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06609.pdf)\n\n- (arXiv 2022.11) Fast Text-Conditional Discrete **Denoising** on Vector-Quantized Latent Spaces, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07292.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdome272\u002FPaella)\n\n- (arXiv 2022.11) Large-Scale Bidirectional Training for Zero-Shot Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06774.pdf)\n\n- (arXiv 2022.11) Grafting Pre-trained Models for Multimodal **Headline Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07210.pdf)\n\n- (arXiv 2022.11) CabViT: Cross 
**Attention** among Blocks for Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07198.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhkzhang91\u002FCabViT)\n\n- (arXiv 2022.11) **Composed Image Retrieval** with Text Feedback via Multi-grained Uncertainty Regularization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07394.pdf)\n\n- (arXiv 2022.11) SSGVS: Semantic **Scene Graph-to-Video** Synthesis, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06119.pdf)\n\n- (arXiv 2022.11) One-Time **Model Adaptation** to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06276.pdf)\n\n- (arXiv 2022.11) An Improved End-to-End **Multi-Target Tracking** Method Based on Transformer Self-Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.06001.pdf)\n\n- (arXiv 2022.11) Zero-shot Visual Commonsense **Immorality Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05521.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fku-vai\u002FZero-shot-Visual-Commonsense-Immorality-Prediction)\n\n- (arXiv 2022.11) Hyperbolic Cosine Transformer for **LiDAR 3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.05580.pdf)\n\n- (arXiv 2022.11) **Training** a Vision Transformer from scratch in less than 24 hours with 1 GPU, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05187.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBorealisAI\u002Fefficient-vit-training)\n\n- (arXiv 2022.11) ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer **Acceleration** with a Linear Taylor Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05109.pdf)\n\n- (arXiv 2022.11) SimOn: A Simple Framework for **Online Temporal Action Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04905.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTuanTNG\u002FSimOn)\n\n- (arXiv 2022.11) ERNIE-UNIX^2: A UNIFIED **CROSS-LINGUAL CROSS-MODAL** FRAMEWORK FOR UNDERSTANDING AND GENERATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04861.pdf)\n\n- (arXiv 2022.11) SG-Shuffle: Multi-aspect Shuffle Transformer for **Scene Graph Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04773.pdf)\n\n- (arXiv 2022.11) Understanding Cross-modal Interactions in V&L Models that Generate **Scene Descriptions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04971.pdf)\n\n- (arXiv 2022.11) VieCap4H - VLSP 2021: ObjectAoA - Enhancing performance of Object Relation Transformer with Attention on Attention for **Vietnamese** image **captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05405.pdf)\n\n- (arXiv 2022.11) Watching the News: Towards **VideoQA** Models that can Read, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05588.pdf), [[Project]](http:\u002F\u002Fcvit.iiit.ac.in\u002Fresearch\u002Fprojects\u002Fcvit-projects\u002Fvideoqa)\n\n- (arXiv 2022.11) Efficient Joint **Detection** and **Multiple Object Tracking** with Spatially Aware Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05654.pdf)\n\n- (arXiv 2022.11) **Demystify** Transformers & **Convolutions** in Modern Image Deep Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05781.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FSTM-Evaluation)\n\n- (arXiv 2022.11) InternImage: 
Exploring Large-Scale Vision Foundation Models with **Deformable Convolutions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05778.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternImage)\n\n- (arXiv 2022.11) DEPTHFORMER: MULTIMODAL POSITIONAL ENCODINGS AND CROSS-INPUT ATTENTION FOR TRANSFORMER-BASED **SEGMENTATION** NETWORKS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04188.pdf)\n\n- (arXiv 2022.11) Sequential Transformer for End-to-End **Person Search**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04323.pdf)\n\n- (arXiv 2022.11) Prompting Large Pre-trained Vision-Language Models For **Compositional Concept Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05077.pdf)\n\n- (arXiv 2022.11) CASA: Category-agnostic **Skeletal Animal Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03568.pdf)\n\n- (arXiv 2022.11) ViT-CX: Causal **Explanation** of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03064.pdf)\n\n- (arXiv 2022.11) Disentangling Content and Motion for **Text-Based Neural Video Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02980.pdf)\n\n- (arXiv 2022.11) **Efficient** Multi-order Gated Aggregation Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03295.pdf)\n\n- (arXiv 2022.11) CLOP: **Video-and-Language** Pre-Training with Knowledge Regularizations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03314.pdf)\n\n- (arXiv 2022.11) MSMG-Net: Multi-scale Multi-grained Supervised Networks for Multi-task Image Manipulation **Detection** and **Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.03140.pdf)\n\n- (arXiv 2022.11) Understanding and Mitigating Overfitting in **Prompt** Tuning for **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02219.pdf), [[Code]](https:\u002F\u002Ftinyurl.com\u002Fmpe64f89)\n\n- (arXiv 2022.11) Zero-shot **Video Moment Retrieval** With Off-the-Shelf Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02178.pdf)\n\n- (arXiv 2022.11) Scaling **Multimodal** Pre-Training via Cross-Modality Gradient Harmonization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02077.pdf)\n\n- (arXiv 2022.11) A Transformer Architecture for Online **Gesture Recognition** of Mathematical Expressions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02643.pdf)\n\n- (arXiv 2022.11) Evaluating and Improving Factuality in **Multimodal Abstractive Summarization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02580.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmeetdavidwan\u002Ffaithful-multimodal-summ)\n\n- (arXiv 2022.11) RCDPT: **RADAR-CAMERA FUSION** DENSE PREDICTION TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02432.pdf)\n\n- (arXiv 2022.11) **Video Event Extraction** via Tracking Visual States of Arguments, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01781.pdf)\n\n- (arXiv 2022.11) The **Lottery Ticket** Hypothesis for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01484.pdf)\n\n- (arXiv 2022.11) TEXTCRAFT: ZERO-SHOT GENERATION OF HIGH-FIDELITY AND DIVERSE **SHAPES FROM TEXT**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01427.pdf)\n\n- (arXiv 2022.11) PolyBuilding: Polygon Transformer for End-to-End **Building Extraction**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01589.pdf)\n\n- (arXiv 2022.11) RETHINKING **HIERARCHIES** IN PRE-TRAINED PLAIN VISION TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01785.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FHPViT)\n\n- (arXiv 2022.11) SAP-**DETR**: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02006.pdf)\n\n- (arXiv 2022.11) Could Giant Pretrained Image Models Extract **Universal Representations**? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02043.pdf)\n\n- (arXiv 2022.11) MAEDAY: MAE for few and zero shot **AnomalY-Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14307.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FEliSchwartz\u002FMAEDAY)\n\n- (arXiv 2022.11) Degenerate Swin to Win: Plain **Window-based** Transformer without Sophisticated Operations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14255.pdf)\n\n- (arXiv 2022.11) Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for **3D Visual Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14241.pdf), [[Code]](https:\u002F\u002Feslambakr.github.io\u002FLAR.github.io\u002F)\n\n- (arXiv 2022.11) SpaText: Spatio-Textual Representation for **Controllable Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14305.pdf), [[Project]](https:\u002F\u002Fomriavrahami.com\u002Fspatext)\n\n- (arXiv 2022.11) Learning **3D** Scene Priors with **2D** Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14157.pdf), [[Project]](https:\u002F\u002Fyinyunie.github.io\u002Fsceneprior-page\u002F)\n\n- (arXiv 2022.11) PoET: Pose Estimation Transformer for Single-View, Multi-Object **6D Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14125.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Faau-cns\u002Fpoet)\n\n- (arXiv 2022.11) Spatial-Spectral Transformer for **Hyperspectral Image Denoising**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14090.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMyuLi\u002FSST)\n\n- (arXiv 2022.11) Training **Vision-Language** Models with Less Bimodal Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00262.pdf)\n\n- (arXiv 2022.11) Text-Only Training for Image **Captioning** using Noise-Injected **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00575.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDavidHuji\u002FCapDec)\n\n- (arXiv 2022.11) Attention-based **Neural Cellular Automata**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01233.pdf)\n\n- (arXiv 2022.11) eDiff-I: **Text-to-Image** Diffusion Models with an Ensemble of Expert Denoisers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01324.pdf), [[Code]](https:\u002F\u002Fdeepimagination.cc\u002FeDiff-I\u002F)\n\n- (arXiv 2022.11) Chinese CLIP: Contrastive **Vision-Language** Pretraining in **Chinese**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01335.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FChinese-CLIP)\n\n- (arXiv 2022.11) P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for **Open-Vocabulary Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00849.pdf)\n\n- (arXiv 2022.11) tSF: Transformer-based Semantic Filter for **Few-Shot Learning**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00868.pdf)\n\n- (arXiv 2022.11) WITT: A WIRELESS IMAGE TRANSMISSION TRANSFORMER FOR **SEMANTIC COMMUNICATIONS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00937.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKeYang8\u002FWITT)\n\n- (arXiv 2022.11) Pair DETR: Contrastive Learning **Speeds Up** **DETR** Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16476.pdf)\n\n- (arXiv 2022.11) Interaction Visual Transformer for **Egocentric Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14154.pdf)\n\n- (arXiv 2022.11) UDE: A Unified Driving Engine for Human **Motion Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.16016.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzixiangzhou916\u002FUDE\u002F)\n\n- (arXiv 2022.11) Action-**GPT**: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot **Action Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.15603.pdf), [[Project]](https:\u002F\u002Factiongpt.github.io\u002F)\n\n- (arXiv 2022.11) Human or Machine? **Turing Tests** for **Vision** and **Language**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13087.pdf), [[Code]](https:\u002F\u002Ftinyurl.com\u002F8x8nha7p)\n\n- (arXiv 2022.11) Knowledge **Prompting** for Few-shot **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12030.pdf)\n\n- (arXiv 2022.11) UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16031.pdf), [[Project]](https:\u002F\u002Fupainting.github.io\u002F)\n\n- (arXiv 2022.11) LVP-M^3: Language-aware Visual Prompt for **Multilingual Multimodal Machine Translation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15461.pdf)\n\n- (arXiv 2022.11) PROCONTEXT: PROGRESSIVE CONTEXT TRANSFORMER FOR **TRACKING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15511.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjp-lan\u002FProContEXT)\n\n- (arXiv 2022.11) Video based Object **6D Pose Estimation** using Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13540.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FApoorvaBeedu\u002FVideoPose)\n\n- (arXiv 2022.11) S2WAT: **Image Style Transfer** via Hierarchical Vision Transformer using Strips Window Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12381.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAlienZhang1996\u002FS2WAT)\n\n- (arXiv 2022.11) Holistic Interaction Transformer Network for **Action Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12686.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjoslefaure\u002FHIT)\n\n- (arXiv 2022.11) Learning and Retrieval from Prior Data for Skill-based **Imitation Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11435.pdf), [[Code]](https:\u002F\u002Fut-austin-rpl.github.io\u002Fsailor)\n\n- (arXiv 2022.11) SimpleClick: **Interactive** Image **Segmentation** with Simple Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11006.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Funcbiag\u002FSimpleClick)\n\n- (arXiv 2022.11) TANGO: **Text-driven** Photorealistic and Robust **3D Stylization** via Lighting Decomposition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11277.pdf), [[Code]](https:\u002F\u002Fcyw-3d.github.io\u002Ftango\u002F)\n\n- (arXiv 2022.11) CPL: 
Counterfactual **Prompt** Learning for **Vision and Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10362.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Feric-ai-lab\u002FCPL)\n\n- (arXiv 2022.11) Plug-and-Play VQA: Zero-shot **VQA** by Conjoining Large Pretrained Models with Zero Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08773.pdf)\n\n- (arXiv 2022.11) Selective Query-guided Debiasing for **Video** Corpus Moment **Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08714.pdf)\n\n- (arXiv 2022.11) Scaling & Shifting Your Features: A New Baseline for **Efficient Model Tuning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08823.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdongzelian\u002FSSF)\n\n- (arXiv 2022.11) DENOISING **MASKED AUTOENCODERS** ARE CERTIFIABLE ROBUST VISION LEARNERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06983.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fquanlin-wu\u002Fdmae)\n\n- (arXiv 2022.11) **Token-Label Alignment** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06455.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FEuphoria16\u002FTL-Align)\n\n- (arXiv 2022.11) **CLIP**-Fields: Weakly Supervised Semantic Fields for **Robotic** Memory, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05663.pdf), [[Code]](https:\u002F\u002Fmahis.life\u002Fclip-fields)\n\n- (arXiv 2022.11) Multi-Scale Wavelet Transformer for **Face Forgery Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03899.pdf)\n\n- (arXiv 2022.11) **CLIP**-PAE: PROJECTION-AUGMENTATION EMBEDDING TO EXTRACT RELEVANT FEATURES FOR A DISENTANGLED, INTERPRETABLE, AND CONTROLLABLE **TEXT-GUIDED IMAGE MANIPULATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03919.pdf)\n\n- (arXiv 2022.11) VISUAL PROMPT TUNING FOR **TEST-TIME DOMAIN ADAPTATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04831.pdf)\n\n- (arXiv 2022.11) Fast**CLIP**styler: Optimisation-free Text-based **Image Style Transfer** Using Style Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03461.pdf)\n\n- (arXiv 2022.11) PROGRESSIVE DENOISING MODEL FOR FINEGRAINED **TEXT-TO-IMAGE** GENERATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02291.pdf)\n\n- (arXiv 2022.11) **DALL-E**-Bot: Introducing Web-Scale Diffusion Models to **Robotics**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02438.pdf), [[Project]](https:\u002F\u002Fwww.robot-learning.uk\u002Fdall-e-bot)\n\n- (arXiv 2022.11) Decomposed Soft Prompt Guided Fusion Enhancing for **Compositional Zero-Shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10681.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FForest-art\u002FDFSP.git)\n\n- (arXiv 2022.11) ACCURATE **IMAGE RESTORATION** WITH ATTENTION RETRACTABLE TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01427.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgladzhang\u002FART)\n\n- (arXiv 2022.11) **Dilated** Neighborhood **Attention** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15001.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FNeighborhood-Attention-Transformer)\n\n- (arXiv 2022.11) Unified Loss of Pair Similarity Optimization for **Vision-Language Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13869.pdf)\n\n- (arXiv 2022.11) TVLT: Textless **Vision-Language** Transformer, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14156.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzinengtang\u002FTVLT)\n\n### 2022.10\n\n- (arXiv 2022.10) DiMBERT: Learning **Vision-Language** Grounded Representations with Disentangled Multimodal-Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16431.pdf)\n\n- (arXiv 2022.10) TFORMER: **3D TOOTH SEGMENTATION** IN MESH SCANS WITH GEOMETRY GUIDED TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16627.pdf)\n\n- (arXiv 2022.10) **ON-THE-FLY** OBJECT **DETECTION** USING STYLEGAN WITH **CLIP** GUIDANCE, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16742.pdf)\n\n- (arXiv 2022.10) Image-free Domain Generalization via **CLIP** for **3D Hand Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16788.pdf)\n\n- (arXiv 2022.10) Temporal-Viewpoint Transportation Plan for **Skeletal Few-shot Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16820.pdf)\n\n- (arXiv 2022.10) A SIMPLE, EFFICIENT AND SCALABLE CONTRASTIVE **MASKED AUTOENCODER** FOR LEARNING VISUAL REPRESENTATIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16870.pdf)\n\n- (arXiv 2022.10) Time-rEversed diffusioN tEnsor Transformer: A new TENET of **Few-Shot Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16897.pdf)\n\n- (arXiv 2022.10) **Foreign Object Debris Detection** for Airport Pavement Images based on Self-supervised Localization and Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16901.pdf)\n\n- (arXiv 2022.10) ViT-LSLA: Vision Transformer with **Light Self-Limited-Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17115.pdf)\n\n- (arXiv 2022.10) Generative Negative Text Replay for Continual **Vision-Language** Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17322.pdf)\n\n- (arXiv 2022.10) PatchRot: A **Self-Supervised** Technique for Training Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15722.pdf)\n\n- (arXiv 2022.10) MULTIMODAL TRANSFORMER DISTILLATION FOR **AUDIO-VISUAL** SYNCHRONIZATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15563.pdf)\n\n- (arXiv 2022.10) **Multimodal** Transformer for Parallel Concatenated Variational Autoencoders, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16174.pdf)\n\n- (arXiv 2022.10) Differentially **Private** CutMix for Split Learning with Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15986.pdf)\n\n- (arXiv 2022.10) OHMG: ZERO-SHOT OPEN-VOCABULARY HUMAN **MOTION GENERATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15929.pdf)\n\n- (arXiv 2022.10) VLT: Vision-Language Transformer and Query Generation for **Referring Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15871.pdf)\n\n- (arXiv 2022.10) PSFORMER: POINT TRANSFORMER FOR **3D SALIENT OBJECT DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15933.pdf)\n\n- (arXiv 2022.10) **GRAFTING** VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15943.pdf)\n\n- (arXiv 2022.10) Generalization Differences between End-to-End and Neuro-Symbolic **Vision-Language** **Reasoning** Systems, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15037)\n\n- (arXiv 2022.10) FaD-VLP: **Fashion** **Vision-and-Language** Pre-training towards Unified Retrieval and Captioning, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15028.pdf)\n\n- (arXiv 2022.10) Masked **Vision-Language** Transformer in **Fashion**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15110.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGewelsJI\u002FMVLT)\n\n- (arXiv 2022.10) Learning Variational Motion Prior for **Video-based Motion Capture**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15134.pdf)\n\n- (arXiv 2022.10) **Open-vocabulary Semantic Segmentation** with Frozen Vision-Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15138.pdf), [[Code]](https:\u002F\u002Fyyh-rain-song.github.io\u002FFusioner_webpage\u002F)\n\n- (arXiv 2022.10) **TEXT2MODEL**: MODEL INDUCTION FOR ZERO-SHOT GENERALIZATION USING TASK DESCRIPTIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15182.pdf)\n\n- (arXiv 2022.10) Learning Joint Representation of **Human Motion** and **Language**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15187.pdf)\n\n- (arXiv 2022.10) ERNIE-ViLG 2.0: Improving **Text-to-Image** Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15257.pdf)\n\n- (arXiv 2022.10) MSF3DDETR: Multi-Sensor Fusion **3D** Detection Transformer for **Autonomous Driving**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15316.pdf)\n\n- (arXiv 2022.10) Li3DeTr: A **LiDAR** based **3D Detection** Transformer, [[Paper]]()\n\n- (arXiv 2022.10) Masked Transformer for **image Anomaly Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15540.pdf)\n\n- (arXiv 2022.10) Discovering Design Concepts for **CAD Sketches**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14451.pdf)\n\n- (arXiv 2022.10) Compressing And Debiasing Vision-Language Pre-Trained Models for **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14558.pdf)\n\n- (arXiv 2022.10) End-to-End Multimodal Representation Learning for **Video Dialog**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14512.pdf)\n\n- (arXiv 2022.10) TPFNet: A Novel **Text In-painting** Transformer for Text Removal, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14461.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FCandleLabAI\u002FTPFNet)\n\n- (arXiv 2022.10) IMU2CLIP: MULTIMODAL CONTRASTIVE LEARNING FOR **IMU MOTION SENSORS** FROM **EGOCENTRIC** VIDEOS AND TEXT NARRATIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14395.pdf)\n\n- (arXiv 2022.10) Can Transformer Attention Spread Give Insights Into **Uncertainty** of **Detected** and **Tracked** Objects? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14391.pdf)\n\n- (arXiv 2022.10) SemFormer: Semantic Guided Activation Transformer for **Weakly Supervised Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14618.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJLChen-C\u002FSemFormer)\n\n- (arXiv 2022.10) End-to-end **Tracking** with a Multi-query Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14601.pdf)\n\n- (arXiv 2022.10) Explicitly Increasing **Input Information Density** for Vision Transformers on **Small Datasets**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14319.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxiangyu8\u002FDenseVT)\n\n- (arXiv 2022.10) TAMFORMER: MULTI-MODAL TRANSFORMER WITH LEARNED ATTENTION MASK FOR **EARLY INTENT PREDICTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14714.pdf)\n\n- (arXiv 2022.10) **VISUAL ANSWER LOCALIZATION** WITH CROSS-MODAL MUTUAL KNOWLEDGE TRANSFER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14823.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FWENGSYX\u002FMutualSL)\n\n- (arXiv 2022.10) Visual **Semantic Parsing**: From Images to Abstract Meaning Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14862.pdf)\n\n- (arXiv 2022.10) End-to-end Transformer for **Compressed Video Quality Enhancement**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13827.pdf)\n\n- (arXiv 2022.10) PlanT: Explainable **Planning** Transformers via Object-Level Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14222.pdf), [[Project]](https:\u002F\u002Fwww.katrinrenz.de\u002Fplant)\n\n- (arXiv 2022.10) Strong-TransCenter: Improved **Multi-Object Tracking** based on Transformers with Dense Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13570.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Famitgalor18\u002FSTC_Tracker)\n\n- (arXiv 2022.10) GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online **Action Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13605.pdf)\n\n- (arXiv 2022.10) VLC-BERT: **Visual Question Answering** with Contextualized Commonsense Knowledge, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13626.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Faditya10\u002FVLC-BERT)\n\n- (arXiv 2022.10) Learning by Hallucinating: **Vision-Language** Pre-training with Weak Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13591.pdf)\n\n- (arXiv 2022.10) Learning Explicit **Object-Centric Representations** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14139.pdf)\n\n- (arXiv 2022.10) Abductive **Action** Inference, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13984.pdf)\n\n- (arXiv 2022.10) Minutiae-Guided **Fingerprint** Embeddings via Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13994.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ftba)\n\n- (arXiv 2022.10) 3DALL-E: Integrating **Text-to-Image** AI in **3D** Design Workflows, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11603.pdf)\n\n- (arXiv 2022.10) COMPOSING **ENSEMBLES** OF **PRE-TRAINED MODELS** VIA ITERATIVE CONSENSUS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11522.pdf), [[Code]](https:\u002F\u002Fenergy-based-model.github.io\u002Fcomposing-pretrained-models)\n\n- (arXiv 2022.10) Do **Vision-and-Language** Transformers Learn Grounded 
**Predicate-Noun Dependencies**?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12079.pdf)\n\n- (arXiv 2022.10) Boosting vision transformers for **image retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11909.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdealicious-inc\u002FDToP)\n\n- (arXiv 2022.10) LiteVL: Efficient **Video-Language** Learning with Enhanced Spatial-Temporal Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11929.pdf)\n\n- (arXiv 2022.10) Fine-grained Semantic Alignment Network for Weakly Supervised **Temporal Language Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11933.pdf)\n\n- (arXiv 2022.10) **Face** **Pyramid** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11974.pdf), [[Code]](https:\u002F\u002Fkhawar-islam.github.io\u002Ffpvt\u002F)\n\n- (arXiv 2022.10) Context-Enhanced **Stereo** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11719.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fguoweiyu\u002FContext-Enhanced-Stereo-Transformer)\n\n- (arXiv 2022.10) CRT-6D: Fast **6D Object Pose Estimation** with Cascaded Refinement Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11718.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPedroCastro\u002FCRT-6D)\n\n- (arXiv 2022.10) Rethinking Learning Approaches for Long-Term **Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11566.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNmegha2601\u002Fanticipatr)\n\n- (arXiv 2022.10) Extending **Phrase Grounding** with Pronouns in Visual Dialogues, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12658.pdf), [[Code]]()\n\n- (arXiv 2022.10) Accumulated Trivial **Attention** Matters in Vision Transformers on **Small Datasets**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12333.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxiangyu8\u002FSATA)\n\n- (arXiv 2022.10) Transformers For **Recognition** In **Overhead Imagery**: A Reality Check, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12599.pdf)\n\n- (arXiv 2022.10) Anticipative Feature Fusion Transformer for Multi-Modal **Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12649.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzeyun-zhong\u002FAFFT)\n\n- (arXiv 2022.10) UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for **Face Forgery Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12752.pdf)\n\n- (arXiv 2022.10) LCPFormer: Towards Effective 3D **Point Cloud** Analysis via Local Context Propagation in Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12755.pdf)\n\n- (arXiv 2022.10) Towards Real-Time **Text2Video** via **CLIP**-Guided, Pixel-Level Optimization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12826.pdf), [[Code]](https:\u002F\u002Fpschaldenbrand.github.io\u002Ftext2video\u002F)\n\n- (arXiv 2022.10) Language-free Training for Zero-shot **Video Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12977.pdf), [[Code]]()\n\n- (arXiv 2022.10) Foreground Guidance and Multi-Layer Feature Fusion for **Unsupervised Object Discovery** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13053.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVDIGPKU\u002FFORMULA)\n\n- (arXiv 2022.10) Towards Unifying **Reference Expression** Generation and Comprehension, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13076.pdf)\n\n- (arXiv 2022.10) Robust **Self-Supervised Learning** with Lie Groups, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13356.pdf)\n\n- (arXiv 2022.10) VIOLA: Imitation Learning for Vision-Based **Manipulation** with Object Proposal Priors, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11339.pdf), [[Code]](https:\u002F\u002Fut-austin-rpl.github.io\u002FVIOLA)\n\n- (arXiv 2022.10) VTC: Improving **Video-Text Retrieval** with User Comments, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10820.pdf), [[Project]](https:\u002F\u002Funitaryai.github.io\u002Fvtc-paper)\n\n- (arXiv 2022.10) SOLVING **REASONING** TASKS WITH A SLOT TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11394.pdf), [[Code]]()\n\n- (arXiv 2022.10) Prompting through Prototype: A Prototype-based **Prompt** Learning on Pretrained **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10841.pdf)\n\n- (arXiv 2022.10) Grounded **Video Situation Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10828.pdf), [[Project]](https:\u002F\u002Fzeeshank95.github.io\u002Fgrvidsitu)\n\n- (arXiv 2022.10) Single **Image Super-Resolution** Using Lightweight Networks Based on Swin Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11019.pdf)\n\n- (arXiv 2022.10) Visual Spatial Description: Controlled **Spatial**-Oriented **Image-to-Text** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11109.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhaoyucs\u002FVSD)\n\n- (arXiv 2022.10) Movie**CLIP**: Visual **Scene Recognition** in Movies, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11065.pdf)\n\n- (arXiv 2022.10) PointTAD: Multi-Label **Temporal Action Detection** with Learnable Query Points, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11035.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FPointTAD)\n\n- (arXiv 2022.10) TOWARDS SUSTAINABLE **SELF-SUPERVISED** LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11016.pdf)\n\n- (arXiv 2022.10) **Visual-Semantic** Contrastive Alignment for Few-Shot Image Classification, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11000.pdf)\n\n- (arXiv 2022.10) i-MAE: ARE LATENT REPRESENTATIONS IN **MASKED AUTOENCODERS** LINEARLY SEPARABLE? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11470.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvision-learning-acceleration-lab\u002Fi-mae)\n\n- (arXiv 2022.10) 2nd Place Solution to ECCV 2022 Challenge: Transformer-based **Action recognition** in **hand-object** interacting scenarios, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11387.pdf)\n\n- (arXiv 2022.10) 1st Place Solution to ECCV 2022 Challenge on HBHA: Transformer-based Global **3D Hand Pose Estimation** in Two Hands Manipulating Objects Scenarios, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11384.pdf)\n\n- (arXiv 2022.10) **DALLE-2** is Seeing Double: **Flaws** in Word-to-Concept Mapping in Text2Image Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10606.pdf)\n\n- (arXiv 2022.10) **CLIP**-Driven Fine-grained Text-Image **Person Re-identification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10276.pdf)\n\n- (arXiv 2022.10) Dense but Efficient **VideoQA** for Intricate Compositional Reasoning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10300.pdf)\n\n- (arXiv 2022.10) Multi-view **Gait Recognition** based on SiameseVisionTransformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2210\u002F2210.10421.pdf)\n\n- (arXiv 2022.10) TOIST: Task Oriented **Instance Segmentation** Transformer with Noun-Pronoun Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10775.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAIR-DISCOVER\u002FTOIST)\n\n- (arXiv 2022.10) CroCo: **Self-Supervised** Pre-training for **3D** Vision Tasks by Cross-View **Completion**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10716.pdf), [[Project]](https:\u002F\u002Feurope.naverlabs.com\u002Fresearch\u002Fcomputer-vision\u002Fcroco\u002F)\n\n- (arXiv 2022.10) A Unified View of **Masked** Image Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10615.pdf), [[Code]](https:\u002F\u002Faka.ms\u002Funimim)\n\n- (arXiv 2022.10) Cross-Modal Fusion Distillation for Fine-Grained **Sketch-Based Image Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10486.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fabhrac\u002Fxmodal-vit)\n\n- (arXiv 2022.10) BOAT: Bilateral Local **Attention** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.13027.pdf)\n\n- (arXiv 2022.10) **TOKEN MERGING**: YOUR VIT BUT **FASTER**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09461.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FToMe)\n\n- (arXiv 2022.10) Using Language to Extend to **Unseen Domains**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09520.pdf), [[Code]]()\n\n- (arXiv 2022.10) SWINV2-IMAGEN: HIERARCHICAL VISION TRANSFORMER DIFFUSION MODELS FOR **TEXT-TO-IMAGE** GENERATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09549.pdf)\n\n- (arXiv 2022.10) HUMANISE: Language-conditioned Human **Motion Generation** in **3D Scenes**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09729.pdf), [[Project]](https:\u002F\u002Fsilverster98.github.io\u002FHUMANISE\u002F)\n\n- (arXiv 2022.10) Transfer-learning for **video classification**: Video Swin Transformer on multiple domains, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09969.pdf)\n\n- (arXiv 2022.10) PERCEPTUAL **GROUPING** IN **VISION-LANGUAGE** MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09996.pdf)\n\n- (arXiv 2022.10) How Mask Matters: Towards 
**Theoretical Understandings** of **Masked Autoencoders**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08344.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhangq327\u002FU-MAE)\n\n- (arXiv 2022.10) **LINEAR** **VIDEO** TRANSFORMER WITH FEATURE FIXATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08164.pdf), [[Code]]()\n\n- (arXiv 2022.10) Transformer-based **dimensionality reduction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08288.pdf)\n\n- (arXiv 2022.10) Bridging the Domain Gap for **Multi-Agent Perception**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08451.pdf)\n\n- (arXiv 2022.10) TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-**Drone** **Detection** in **Aerial** Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08423.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ftusharsangam\u002FTransVisDrone)\n\n- (arXiv 2022.10) SCRATCHING VISUAL TRANSFORMER’S BACK WITH UNIFORM **ATTENTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08457.pdf)\n\n- (arXiv 2022.10) Increasing Visual Awareness in **Multimodal Neural Machine Translation** from an Information Theoretic Perspective, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08478.pdf)\n\n- (arXiv 2022.10) TLDW: Extreme Multimodal **Summarisation** of News **Videos**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08481.pdf), [[Code]]()\n\n- (arXiv 2022.10) Character-Centric **Story Visualization** via Visual Planning and Token Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08465.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPlusLabNLP\u002FVP-CSV)\n\n- (arXiv 2022.10) COFAR: Commonsense and Factual Reasoning in **Image Search**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08554.pdf), [[Code]](https:\u002F\u002Fvl2g.github.io\u002Fprojects\u002Fcofar)\n\n- (arXiv 2022.10) Learning Self-Regularized **Adversarial** Views for Self-Supervised Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08458.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTrent-tangtao\u002FAutoView)\n\n- (arXiv 2022.10) Temporal and Contextual Transformer for **Multi-Camera Editing** of TV Shows, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08737.pdf)\n\n- (arXiv 2022.10) **Forecasting** Human **Trajectory** from Scene History, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08732.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMaKaRuiNah\u002FSHENet)\n\n- (arXiv 2022.10) SGRAM: Improving **Scene Graph Parsing** via Abstract Meaning Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08675.pdf)\n\n- (arXiv 2022.10) Contrastive **Language-Image** Pre-Training with Knowledge Graphs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08901.pdf)\n\n- (arXiv 2022.10) A Saccaded Visual Transformer for General **Object Spotting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09220.pdf)\n\n- (arXiv 2022.10) Vision Transformers provably learn **spatial structure**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09221.pdf)\n\n- (arXiv 2022.10) oViT: An Accurate Second-Order **Pruning** Framework for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09223.pdf)\n\n- (arXiv 2022.10) Fine-grained **Category Discovery** under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07733.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002FLackel\u002FHierarchical_Weighted_SCL)\n\n- (arXiv 2022.10) Non-Contrastive Learning Meets **Language-Image** Pre-Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09304.pdf)\n\n- (arXiv 2022.10) Frame Mining: a Free Lunch for Learning Robotic **Manipulation** from 3D Point Clouds, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07442.pdf), [[Project]](https:\u002F\u002Fcolin97.github.io\u002FFrameMining\u002F)\n\n- (arXiv 2022.10) Pretrained Transformers Do not Always Improve **Robustness**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07663.pdf)\n\n- (arXiv 2022.10) Plausible May Not Be Faithful: Probing Object Hallucination in **Vision-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07688.pdf)\n\n- (arXiv 2022.10) CONTRASTIVE **AUDIO-VISUAL** **MASKED** AUTOENCODER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07839.pdf)\n\n- (arXiv 2022.10) SWFormer: Sparse Window Transformer for **3D Object Detection** in Point Clouds, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07372.pdf)\n\n- (arXiv 2022.10) Trailers12k: Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label **Movie Trailer Genre Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07983.pdf)\n\n- (arXiv 2022.10) AVLEN: Audio-Visual-Language Embodied **Navigation** in 3D Environments, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07940.pdf)\n\n- (arXiv 2022.10) MOVE: Unsupervised Movable Object **Segmentation** and **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07920.pdf)\n\n- (arXiv 2022.10) IS SYNTHETIC DATA FROM **GENERATIVE** MODELS READY FOR IMAGE **RECOGNITION**?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07574.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FSyntheticData)\n\n- (arXiv 2022.10) Towards Transformer-based Homogenization of **Satellite Imagery** for Landsat-8 and Sentinel-2, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07654.pdf)\n\n- (arXiv 2022.10) MCTNET: A MULTI-SCALE CNN-TRANSFORMER NETWORK FOR **CHANGE DETECTION** IN OPTICAL **REMOTE SENSING** IMAGES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07601.pdf)\n\n- (arXiv 2022.10) Vision Transformer **Visualization**: What Neurons Tell and How Neurons Behave? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07646.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FbyM1902\u002FViT_visualization)\n\n- (arXiv 2022.10) TokenMixup: Efficient Attention-guided Token-level Data **Augmentation** for Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07562.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmlvlab\u002FTokenMixup)\n\n- (arXiv 2022.10) SQA3D: SITUATED **QUESTION ANSWERING** IN **3D** SCENES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07474.pdf)\n\n- (arXiv 2022.10) When **Adversarial** Training Meets Vision Transformers: Recipes from Training to Architecture, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07540.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmo666666\u002FWhen-Adversarial-Training-Meets-Vision-Transformers)\n\n- (arXiv 2022.10) STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07503.pdf)\n\n- (arXiv 2022.10) PedFormer: **Pedestrian Behavior Prediction** via Cross-Modal Attention Modulation and Gated Multitask Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07886.pdf)\n\n- (arXiv 2022.10) One Model to Edit Them All: Free-Form Text-Driven **Image Manipulation** with Semantic Modulations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07883.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKumapowerLIU\u002FFFCLIP)\n\n- (arXiv 2022.10) IMAGINARYNET: LEARNING OBJECT **DETECTORS** WITHOUT REAL IMAGES AND ANNOTATIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06886.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkodenii\u002FImaginaryNet)\n\n- (arXiv 2022.10) Feature-Proxy Transformer for **Few-Shot Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06908.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJarvis73\u002FFPTrans)\n\n- (arXiv 2022.10) Scene **Text Image Super-Resolution** via Content Perceptual Loss and Criss-Cross Transformer Blocks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06924.pdf)\n\n- (arXiv 2022.10) UNIFIED **VISION AND LANGUAGE** **PROMPT** LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07225.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyuhangzang\u002FUPT)\n\n- (arXiv 2022.10) Exploring Long-Sequence **Masked Autoencoders**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07224.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flong_seq_mae)\n\n- (arXiv 2022.10) MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for **Vision-Language** Few-Shot **Prompting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07179.pdf)\n\n- (arXiv 2022.10) Interactive **Language**: Talking to **Robots** in Real Time, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06407.pdf), [[Project]](https:\u002F\u002Finteractive-language.github.io\u002F)\n\n- (arXiv 2022.10) RTFormer: Efficient Design for Real-Time **Semantic Segmentation** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07124.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleSeg)\n\n- (arXiv 2022.10) How to **Train** Vision Transformer on **Small-scale Datasets**?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07240.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhananshafi\u002Fvits-for-small-scale-datasets)\n\n- (arXiv 2022.10) Hate-CLIPper: Multimodal Hateful **Meme Classification** based 
on Cross-modal Interaction of **CLIP** Features, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05916.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgokulkarthik\u002Fhateclipper)\n\n- (arXiv 2022.10) Large Models are Parsimonious Learners: **Activation Sparsity** in Trained Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06313.pdf)\n\n- (arXiv 2022.10) CURVED **REPRESENTATION SPACE** OF VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05742.pdf)\n\n- (arXiv 2022.10) **Foundation** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06423.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n- (arXiv 2022.10) Underspecification in **Scene Description-to-Depiction** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05815.pdf)\n\n- (arXiv 2022.10) Continuous conditional **video synthesis** by neural processes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05810.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNPVS\u002FNPVS)\n\n- (arXiv 2022.10) SAIT: SPARSE VISION TRANSFORMERS THROUGH ADAPTIVE TOKEN **PRUNING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05832.pdf)\n\n- (arXiv 2022.10) ZITS++: **Image Inpainting** by Improving the Incremental Transformer on Structural Priors, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05950.pdf)\n\n- (arXiv 2022.10) SLOTFORMER: UNSUPERVISED VISUAL **DYNAMICS SIMULATION** WITH OBJECT-CENTRIC MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05861.pdf), [[Project]](https:\u002F\u002Fslotformer.github.io\u002F)\n\n- (arXiv 2022.10) Learning by Asking Questions for Knowledge-based Novel **Object Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05879.pdf)\n\n- (arXiv 2022.10) Bridging the **Gap** Between Vision Transformers and **Convolutional Neural Networks** on Small Datasets, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05958.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FArieSeirack\u002FDHVT)\n\n- (arXiv 2022.10) GGViT: Multistream Vision Transformer Network in Face2Face **Facial Reenactment Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05990.pdf)\n\n- (arXiv 2022.10) Distilling Knowledge from Language Models for **Video-based Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05991.pdf)\n\n- (arXiv 2022.10) Long-Form **Video-Language** Pre-Training with Multimodal Temporal Contrastive Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06031.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain)\n\n- (arXiv 2022.10) M3VIDEO: **MASKED** MOTION MODELING FOR SELF-SUPERVISED **VIDEO REPRESENTATION** LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06096.pdf)\n\n- (arXiv 2022.10) Uplift and Upsample: Efficient **3D Human Pose Estimation** with Uplifting Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06110.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoldbricklemon\u002Fuplift-upsample-3dhpe)\n\n- (arXiv 2022.10) FontTransformer: Few-shot High-resolution Chinese **Glyph Image Synthesis** via Stacked Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06301.pdf)\n\n- (arXiv 2022.10) AISFormer: Amodal **Instance Segmentation** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06323.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FUARK-AICV\u002FAISFormer)\n\n- (arXiv 2022.10) 
ViewBirdiformer: Learning to **recover** ground-plane **crowd trajectories** and **ego-motion** from a single ego-centric view, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06332.pdf)\n\n- (arXiv 2022.10) One does not fit all! On the Complementarity of Vision Encoders for **Vision and Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06379.pdf)\n\n- (arXiv 2022.10) **PROMPT GENERATION** NETWORKS FOR EFFICIENT **ADAPTATION** OF FROZEN VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06466.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjochemloedeman\u002FPGN)\n\n- (arXiv 2022.10) Generating Executable Action **Plans** with Environmentally-Aware Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04964.pdf)\n\n- (arXiv 2022.10) AVE-**CLIP**: AudioCLIP-based Multi-window Temporal Transformer for **Audio Visual Event Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05060.pdf)\n\n- (arXiv 2022.10) Improving Dense **Contrastive Learning** with Dense Negative Pairs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05063.pdf)\n\n- (arXiv 2022.10) Fine-Grained **Image Style Transfer** with Visual Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05176.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FSTTR)\n\n- (arXiv 2022.10) IT TAKES TWO: **MASKED** APPEARANCE-MOTION MODELING FOR **SELF-SUPERVISED VIDEO** TRANSFORMER PRE-TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05234.pdf)\n\n- (arXiv 2022.10) Contrastive **Video-Language** Learning with Fine-grained Frame Sampling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05039.pdf)\n\n- (arXiv 2022.10) Style-Guided Inference of Transformer for High-resolution **Image Synthesis**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05533.pdf)\n\n- (arXiv 2022.10) MAP: Modality-Agnostic Uncertainty-Aware **Vision-Language** Pre-training Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05335.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIIGROUP\u002FMAP)\n\n- (arXiv 2022.10) LEARNING TO LOCATE VISUAL **ANSWER** IN **VIDEO** CORPUS USING **QUESTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05423.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FWENGSYX\u002FCCGS)\n\n- (arXiv 2022.10) UNDERSTANDING **EMBODIED** REFERENCE WITH TOUCH-LINE TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05668.pdf)\n\n- (arXiv 2022.10) **Point** Transformer V2: Grouped Vector Attention and Partition-based Pooling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05666.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGofinge\u002FPointTransformerV2)\n\n- (arXiv 2022.10) See, Plan, Predict: Language-guided Cognitive **Planning** with Video Prediction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03825.pdf)\n\n- (arXiv 2022.10) USING BOTH DEMONSTRATIONS AND LANGUAGE INSTRUCTIONS TO EFFICIENTLY LEARN **ROBOTIC** TASKS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04476.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdel-taco-learning)\n\n- (arXiv 2022.10) Generating image **captions** with external encyclopedic knowledge, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04806.pdf)\n\n- (arXiv 2022.10) LOCL: Learning **Object-Attribute Composition** using Localization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03780.pdf)\n\n- (arXiv 2022.10) SVL-Adapter: 
Self-Supervised Adapter for **Vision-Language** Pretrained Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03794.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fomipan\u002Fsvl_adapter)\n\n- (arXiv 2022.10) ConTra: (Con)text (Tra)nsformer for **Cross-Modal Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04341.pdf)\n\n- (arXiv 2022.10) Learning Fine-Grained Visual Understanding for **Video Question Answering** via Decoupling Spatial-Temporal Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03941.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshinying\u002Fdest)\n\n- (arXiv 2022.10) (Fusionformer):Exploiting the Joint Motion Synergy with Fusion Network Based On Transformer for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04006.pdf)\n\n- (arXiv 2022.10) Fast-ParC: Position Aware Global **Kernel** for ConvNets and ViTs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04020.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyangtao2019yt\u002FFast_ParC.git)\n\n- (arXiv 2022.10) OGC: Unsupervised **3D Object Segmentation** from Rigid Dynamics of Point Clouds, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04458.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FvLAR-group\u002FOGC)\n\n- (arXiv 2022.10) Multi-Modal Fusion Transformer for **Visual Question Answering** in **Remote Sensing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04510.pdf), [[Code]](https:\u002F\u002Fgit.tu-berlin.de\u002Frsim\u002Fmulti-modal-fusion-transformer-for-vqa-in-rs)\n\n- (arXiv 2022.10) Semantics-Consistent **Cross-domain Summarization** via Optimal Transport Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04722.pdf)\n\n- (arXiv 2022.10) VOLTA: **VISION-LANGUAGE** TRANSFORMER WITH WEAKLY-SUPERVISED LOCAL-FEATURE ALIGNMENT, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04135.pdf)\n\n- (arXiv 2022.10) OPEN-VOCABULARY SEMANTIC **SEGMENTATION** WITH MASK-ADAPTED **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04150.pdf), [[Project]](https:\u002F\u002Fjeff-liangf.github.io\u002Fprojects\u002Fovseg)\n\n- (arXiv 2022.10) MAMO: Masked Multimodal Modeling for Fine-Grained **Vision-Language** Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04183.pdf)\n\n- (arXiv 2022.10) SELF-SUPERVISED **VIDEO** REPRESENTATION LEARNING WITH MOTION-AWARE **MASKED AUTOENCODERS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04154.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhappy-hsy\u002FMotionMAE)\n\n- (arXiv 2022.10) LEARNING TO DECOMPOSE **VISUAL** FEATURES WITH LATENT **TEXTUAL** PROMPTS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04287.pdf)\n\n- (arXiv 2022.10) DCVQE: A Hierarchical Transformer for **Video Quality Assessment**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04377.pdf)\n\n- (arXiv 2022.10) **Fine-grained Object** Categorization for Service **Robots**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04613.pdf)\n\n- (arXiv 2022.10) **CLIP**-DIFFUSION-LM: APPLY DIFFUSION MODEL ON IMAGE **CAPTIONING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04559.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxu-shitong\u002Fdiffusion-image-captioning)\n\n- (arXiv 2022.10) A Memory Transformer Network for **Incremental Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04485.pdf)\n\n- (arXiv 2022.10) Bridging **CLIP** and StyleGAN through 
Latent Alignment for **Image Editing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04506.pdf)\n\n- (arXiv 2022.10) LMQFormer: A Laplace-Prior-Guided Mask Query Transformer for **Lightweight Snow Removal**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04787.pdf)\n\n- (arXiv 2022.10) FS-DETR: **FEW-SHOT** **DETECTION** TRANSFORMER WITH PROMPTING AND WITHOUT RE-TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04845.pdf)\n\n- (arXiv 2022.10) Transformer-based Localization from **Embodied** Dialog with Large-scale Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04864.pdf)\n\n- (arXiv 2022.10) Turbo Training with Token **Dropout**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04889.pdf)\n\n- (arXiv 2022.10) Polyhistor: Parameter-**Efficient** **Multi-Task Adaptation** for Dense Vision Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03265.pdf)\n\n- (arXiv 2022.10) C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual **Text-Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03625.pdf)\n\n- (arXiv 2022.10) **Pose** Guided **Human Image Synthesis** with Partially Decoupled GAN, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03627.pdf)\n\n- (arXiv 2022.10) A Simple Plugin for Transforming **Images** to **Arbitrary Scales**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03417.pdf), [[Project]](https:\u002F\u002Flipurple.github.io\u002FARIS_Webpage\u002F)\n\n- (arXiv 2022.10) Time-Space Transformers for **Video Panoptic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03546.pdf)\n\n- (arXiv 2022.10) MOAT: ALTERNATING **MOBILE** CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01820.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fdeeplab2)\n\n- (arXiv 2022.10) **IMAGEN** VIDEO: HIGH DEFINITION **VIDEO GENERATION** WITH **DIFFUSION** MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02303.pdf), [[Project]](https:\u002F\u002Fimagen.research.google\u002Fvideo\u002F)\n\n- (arXiv 2022.10) clip2latent: **Text** driven sampling of a pre-trained **StyleGAN** using denoising diffusion and **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02347.pdf)\n\n- (arXiv 2022.10) FQDet: **Fast**-converging Query-based **Detector**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02318.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FCedricPicron\u002FFQDet)\n\n- (arXiv 2022.10) VARIATIONAL **PROMPT** TUNING IMPROVES GENERALIZATION OF **VISION-LANGUAGE** MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02390.pdf)\n\n- (arXiv 2022.10) Grounding **Language** with **Visual** **Affordances** over Unstructured Data, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01911.pdf), [[Project]](http:\u002F\u002Fhulc2.cs.uni-freiburg.de\u002F)\n\n- (arXiv 2022.10) Granularity-aware Adaptation for **Image Retrieval** over Multiple Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02254.pdf)\n\n- (arXiv 2022.10) WHEN AND WHY **VISION-LANGUAGE** MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01936.pdf)\n\n- (arXiv 2022.10) Multi-view **Human** Body **Mesh** Translator, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01886.pdf)\n\n- (arXiv 2022.10) EXPLORING THE ROLE OF MEAN TEACHERS IN **SELF-SUPERVISED** **MASKED** AUTO-ENCODERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02077.pdf)\n\n- (arXiv 2022.10) **Point Cloud Recognition** with Position-to-Structure Attention Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02030.pdf)\n\n- (arXiv 2022.10) TEMPORALLY CONSISTENT VIDEO TRANSFORMER FOR LONG-TERM **VIDEO PREDICTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02396.pdf), [[Code]](https:\u002F\u002Fwilson1yan.github.io\u002Fteco)\n\n- (arXiv 2022.10) PHENAKI: VARIABLE LENGTH **VIDEO GENERATION** FROM OPEN DOMAIN TEXTUAL DESCRIPTIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02399.pdf)\n\n- (arXiv 2022.10) MuRAG: Multimodal Retrieval-Augmented Generator for **Open Question Answering** over Images and Text, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02928.pdf)\n\n- (arXiv 2022.10) Real-World **Robot Learning** with **Masked** Visual Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03109.pdf), [[Project]](https:\u002F\u002Ftetexiao.com\u002Fprojects\u002Freal-mvp)\n\n- (arXiv 2022.10) BaseTransformers: Attention over base data-points for **One Shot** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02476.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmayug\u002FBaseTransformers)\n\n- (arXiv 2022.10) Focal and Global Spatial-Temporal Transformer for **Skeleton**-based **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02693.pdf)\n\n- (arXiv 2022.10) Vision Transformer Based Model for **Describing** a Set of **Images** as a Story, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02762.pdf)\n\n- (arXiv 2022.10) **Video Referring Expression Comprehension** via Transformer with Content-aware Query, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02953.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmengcaopku\u002FContFormer)\n\n- (arXiv 2022.10) **EFFECTIVE** **SELF-SUPERVISED** PRE-TRAINING ON LOW-COMPUTE NETWORKS WITHOUT DISTILLATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02808.pdf)\n\n- (arXiv 2022.10) **CLIP** MODEL IS AN EFFICIENT **CONTINUAL LEARNER**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03114.pdf)\n\n- (arXiv 2022.10) Content-Based Search for Deep **Generative** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03116.pdf)\n\n- (arXiv 2022.10) MAPLE: **MULTI-MODAL** **PROMPT** LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03117.pdf), [[Code]](https:\u002F\u002Ftinyurl.com\u002F2dzs8f3w)\n\n- (arXiv 2022.10) SYSTEMATIC GENERALIZATION AND EMERGENT STRUCTURES IN TRANSFORMERS TRAINED ON **STRUCTURED TASKS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00400.pdf)\n\n- (arXiv 2022.10) WIDE **ATTENTION** IS THE WAY FORWARD FOR TRANSFORMERS? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00640.pdf)\n\n- (arXiv 2022.10) DARTFORMER: **FINDING** THE BEST TYPE OF **ATTENTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00641.pdf)\n\n- (arXiv 2022.10) MOBILEVITV3: **MOBILE**-FRIENDLY VISION TRANSFORMER WITH SIMPLE AND EFFECTIVE FUSION OF LOCAL, GLOBAL AND INPUT FEATURES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15159.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FmicronDLA\u002FMobileViTv3)\n\n- (arXiv 2022.10) Differentiable Parsing and Visual **Grounding** of Verbal Instructions for **Object Placement**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00215.pdf), [[Project]](https:\u002F\u002F1989ryan.github.io\u002Fprojects\u002Fparagon.html)\n\n- (arXiv 2022.10) EAPruning: Evolutionary **Pruning** for Vision Transformers and CNNs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00181.pdf)\n\n- (arXiv 2022.10) Motion-inductive Self-supervised **Object Discovery** in Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00221.pdf)\n\n- (arXiv 2022.10) Fully Transformer Network for Change Detection of **Remote Sensing** Images, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00757.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAI-Zhpp\u002FFTN)\n\n- (arXiv 2022.10) TOWARDS A UNIFIED VIEW ON VISUAL PARAMETER-**EFFICIENT** **TRANSFER LEARNING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00788.pdf)\n\n- (arXiv 2022.10) Visual **Prompt** Tuning for Generative **Transfer Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00990.pdf)\n\n- (arXiv 2022.10) A Strong Transfer Baseline for **RGB-D Fusion** in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00843.pdf)\n\n- (arXiv 2022.10) LPT: **LONG-TAILED** **PROMPT** TUNING FOR IMAGE CLASSIFICATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01033.pdf)\n\n- (arXiv 2022.10) Expediting Large-Scale Vision Transformer for **Dense Prediction** without Fine-tuning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01035.pdf)\n\n- (arXiv 2022.10) CLIP2POINT: TRANSFER **CLIP** TO **POINT CLOUD CLASSIFICATION** WITH IMAGE-DEPTH PRE-TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01055.pdf)\n\n- (arXiv 2022.10) Dual-former: Hybrid Self-attention Transformer for Efficient **Image Restoration**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01069.pdf)\n\n- (arXiv 2022.10) LANGUAGE-AWARE **SOFT PROMPTING** FOR **VISION & LANGUAGE** FOUNDATION MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01115.pdf)\n\n- (arXiv 2022.10) ASIF: COUPLED DATA TURNS UNIMODAL MODELS TO **MULTIMODAL** WITHOUT TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01738.pdf)\n\n- (arXiv 2022.10) ImmFusion: Robust mmWave-RGB Fusion for **3D Human Body Reconstruction** in All Weather Conditions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01346.pdf)\n\n- (arXiv 2022.10) PROMPT LEARNING WITH **OPTIMAL TRANSPORT** FOR **VISION-LANGUAGE** MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01253.pdf)\n\n- (arXiv 2022.10) Bridged Transformer for Vision and Point Cloud **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01391.pdf)\n\n- (arXiv 2022.10) Dense Prediction Transformer for Scale Estimation in Monocular Visual **Odometry**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01723.pdf)\n\n- (arXiv 2022.10) HUMAN **MOTION** **DIFFUSION** 
MODEL, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14916.pdf), [[Project]](https:\u002F\u002Fguytevet.github.io\u002Fmdm-page\u002F)\n\n- (arXiv 2022.10) TokenFlow: Rethinking Fine-grained Cross-modal Alignment in **Vision-Language Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13822.pdf)\n\n- (arXiv 2022.10) Uni**CLIP**: Unified Framework for Contrastive **Language–Image** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13430.pdf)\n\n- (arXiv 2022.10) CrossDTR: Cross-view and Depth-guided Transformers for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13507.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsty61010\u002FCrossDTR)\n\n- (arXiv 2022.10) Multi-dataset Training of Transformers for Robust **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12362.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJunweiLiang\u002FMultiTrain)\n\n- (arXiv 2022.10) Multi-Scale **Human-Object Interaction** Detector, [[Paper]](https:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=9927451)\n\n- (arXiv 2022.10) LGDN: Language-Guided Denoising Network for **Video-Language** Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11388.pdf)\n\n- (arXiv 2022.10) RaP: Redundancy-aware Video-language Pre-training for **Text-Video** Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06881.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcaskcsg\u002FVLP\u002Ftree\u002Fmain\u002FRaP)\n\n- (arXiv 2022.10) Intermediate Prototype Mining Transformer for Few-Shot **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06780.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLIUYUANWEI98\u002FIPMT)\n\n- (arXiv 2022.10) Decoding Visual Neural Representations by Multimodal Learning of **Brain-Visual-Linguistic** Features, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06756.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChangdeDu\u002FBraVL)\n\n- (arXiv 2022.10) Q-ViT: Accurate and Fully **Quantized** Low-bit Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06707.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYanjingLi0202\u002FQ-ViT)\n\n- (arXiv 2022.10) Prepended Domain Transformer: Heterogeneous **Face Recognition** without Bells and Whistles, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06529.pdf)\n\n- (arXiv 2022.10) Visual Knowledge Graph for Human **Action Reasoning in Videos**, [[Paper]](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3503161.3548257)\n\n- (arXiv 2022.10) Human Joint Kinematics Diffusion-Refinement for Stochastic **Motion Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05976.pdf)\n\n- (arXiv 2022.10) VIMA: GENERAL **ROBOT MANIPULATION** WITH MULTIMODAL **PROMPTS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03094.pdf), [[Project]](https:\u002F\u002Fvimalabs.github.io\u002F)\n\n- (arXiv 2022.10) What Should the System Do Next?: **Operative Action Captioning** for Estimating System Actions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02735.pdf)\n\n- (arXiv 2022.10) DMMGAN: Diverse Multi **Motion Prediction of 3D Human Joints** using Attention-Based Generative Adversarial Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09124.pdf)\n\n- (arXiv 2022.10) PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to **6 DoF Tracking**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07589.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fnv-nguyen\u002Fpizza)\n\n### 2022.09\n\n- (arXiv 2022.09) SELF-DISTILLATION FOR FURTHER **PRE-TRAINING** OF TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02871.pdf)\n\n- (arXiv 2022.09) **Visuo-Tactile** Transformers for Manipulation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00121.pdf), [[Project]](https:\u002F\u002Fwww.mmintlab.com\u002Fvtt)\n\n- (arXiv 2022.09) UNDERSTANDING PURE **CLIP** GUIDANCE FOR **VOXEL** GRID **NERF** MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15172.pdf), [[Project]](https:\u002F\u002Fhanhung.github.io\u002FPureCLIPNeRF\u002F)\n\n- (arXiv 2022.09) Dual Progressive Transformations for Weakly Supervised Semantic **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15211.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhuodongjian0603\u002Fcrt)\n\n- (arXiv 2022.09) Transformers for Object **Detection** in Large **Point Clouds**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15258.pdf)\n\n- (arXiv 2022.09) **DIFFUSION**-BASED **IMAGE TRANSLATION** USING DISENTANGLED STYLE AND CONTENT REPRESENTATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15264.pdf)\n\n- (arXiv 2022.09) ERNIE-VIL 2.0: MULTI-VIEW CONTRASTIVE LEARNING FOR **IMAGE-TEXT** PRE-TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15270.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FERNIE\u002F)\n\n- (arXiv 2022.09) LEARNING TRANSFERABLE **SPATIOTEMPORAL** REPRESENTATIONS FROM NATURAL **SCRIPT** KNOWLEDGE, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15280.pdf)\n\n- (arXiv 2022.09) SMALLCAP: Lightweight Image **Captioning** Prompted with Retrieval Augmentation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15323.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FRitaRamo\u002Fsmallcap)\n\n- (arXiv 2022.09) SPIKFORMER: WHEN **SPIKING NEURAL NETWORK** MEETS TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15425.pdf)\n\n- (arXiv 2022.09) F-VLM: OPEN-VOCABULARY OBJECT **DETECTION** UPON FROZEN VISION AND LANGUAGE MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15639.pdf)\n\n- (arXiv 2022.09) CONTRASTIVE CORPUS ATTRIBUTION FOR EXPLAINING REPRESENTATIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00107.pdf)\n\n- (arXiv 2022.09) Alignment-guided Temporal Attention for **Video Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00132.pdf)\n\n- (arXiv 2022.09) EDA: Explicit Text-Decoupling and Dense Alignment for **3D Visual and Language** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14941.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyanmin-wu\u002FEDA)\n\n- (arXiv 2022.09) SPOTLIGHT: **MOBILE UI UNDERSTANDING** USING VISION-LANGUAGE MODELS WITH A FOCUS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14927.pdf)\n\n- (arXiv 2022.09) DREAMFUSION: **TEXT-TO-3D** USING 2D **DIFFUSION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14988.pdf), [[Project]](https:\u002F\u002Fdreamfusion3d.github.io\u002F)\n\n- (arXiv 2022.09) REST: RETRIEVE & SELF-TRAIN FOR GENERATIVE **ACTION RECOGNITION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15000.pdf)\n\n- (arXiv 2022.09) **Effective** Vision Transformer **Training**: A Data-Centric Perspective, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15006.pdf)\n\n- 
(arXiv 2022.09) Human-in-the-loop Robotic **Grasping** using BERT Scene Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14026.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fhitl-grasping-bert)\n\n- (arXiv 2022.09) Revisiting **Few-Shot** Learning from a **Causal** Perspective, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13816.pdf)\n\n- (arXiv 2022.09) **Attacking** Compressed Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13785.pdf)\n\n- (arXiv 2022.09) Adaptive Sparse ViT: Towards Learnable Adaptive **Token Pruning** by Fully Exploiting Self-Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13802.pdf)\n\n- (arXiv 2022.09) DeViT: Deformed Vision Transformers in **Video Inpainting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13925.pdf)\n\n- (arXiv 2022.09) Obj2Seq: Formatting **Objects** as Sequences with Class Prompt for Visual Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13948.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FCASIA-IVA-Lab\u002FObj2Seq)\n\n- (arXiv 2022.09) Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual **Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13959.pdf)\n\n- (arXiv 2022.09) Motion Transformer for Unsupervised **Image Animation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14024.pdf)\n\n- (arXiv 2022.09) Weighted Contrastive **Hashing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14099.pdf), [[Code]](http:\u002F\u002Fgithub.com\u002FRosieYuu\u002FWCH)\n\n- (arXiv 2022.09) CALIP: **Zero-Shot** Enhancement of **CLIP** with Parameter-free Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14169.pdf)\n\n- (arXiv 2022.09) Dialog Acts for Task-Driven Embodied Agents, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12953.pdf)\n\n- (arXiv 2022.09) NEURAL MARIONETTE: A Transformer-based Multi-action Human **Motion Synthesis** System, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13204.pdf), [[Code]](https:\u002F\u002Fwjohnnyw.github.io\u002Fblog\u002Ftag2motion\u002F)\n\n- (arXiv 2022.09) Embracing Consistency: A One-Stage Approach for **Spatio-Temporal Video Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13306.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FSTCAT)\n\n- (arXiv 2022.09) Text-Adaptive Multiple Visual Prototype Matching for **Video-Text** Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13307.pdf)\n\n- (arXiv 2022.09) Towards Parameter-Efficient Integration of Pre-Trained **Language** Models In Temporal **Video** Grounding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13359.pdf)\n\n- (arXiv 2022.09) **Anomaly Detection** in **Aerial** Videos with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13363.pdf), [[Code]](https:\u002F\u002Fyoutu.be\u002FancczYryOBY)\n\n- (arXiv 2022.09) AdaFocusV3: On Unified Spatial-temporal Dynamic **Video Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13465.pdf)\n\n- (arXiv 2022.09) **Motion** Transformer with Global Intention **Localization** and Local Movement Refinement, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13508.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsshaoshuai\u002FMTR)\n\n- (arXiv 2022.09) FREESEG: FREE MASK FROM INTERPRETABLE CONTRASTIVE LANGUAGE-IMAGE PRETRAINING FOR **SEMANTIC SEGMENTATION**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13558.pdf)\n\n- (arXiv 2022.09) Learning State-Aware Visual Representations from Audible **Interactions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13583.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHimangiM\u002FRepLAI)\n\n- (arXiv 2022.09) Towards Explainable **3D** Grounded **Visual Question Answering**: A New Benchmark and Strong Baseline, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12028.pdf)\n\n- (arXiv 2022.09) Leveraging Self-Supervised Training for **Unintentional Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11870.pdf)\n\n- (arXiv 2022.09) NeRF-Loc: Transformer-Based **Object Localization** Within **Neural Radiance Fields**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12068.pdf)\n\n- (arXiv 2022.09) All are Worth Words: a ViT Backbone for Score-based **Diffusion** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12152.pdf)\n\n- (arXiv 2022.09) Paraphrasing Is All You Need for Novel Object **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12343.pdf)\n\n- (arXiv 2022.09) Collaboration of Pre-trained Models Makes Better **Few-shot** Learner, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12255.pdf)\n\n- (arXiv 2022.09) Multi-modal **Video Chapter Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2209\u002F2209.12694.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fczt117\u002FMVCG)\n\n- (arXiv 2022.09) **Best Prompts** for **Text-to-Image** Models and How to Find Them, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11711)\n\n- (arXiv 2022.09) Swin2SR: SwinV2 Transformer for **Compressed Image Super-Resolution** and **Restoration**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11345.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmv-lab\u002Fswin2sr)\n\n- (arXiv 2022.09) 3DPCT: 3D **Point Cloud** Transformer with Dual Self-attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11255.pdf)\n\n- (arXiv 2022.09) **LIGHTWEIGHT** TRANSFORMERS FOR HUMAN **ACTIVITY RECOGNITION** ON MOBILE DEVICES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11750.pdf)\n\n- (arXiv 2022.09) PACT: Perception-Action Causal Transformer for **Autoregressive Robotics Pre-Training**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11133.pdf)\n\n- (arXiv 2022.09) UniColor: A Unified Framework for Multi-Modal **Colorization** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11223.pdf), [[Code]](https:\u002F\u002Fluckyhzt.github.io\u002Funicolor)\n\n- (arXiv 2022.09) **Traffic Accident Risk Forecasting** using Contextual Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11180.pdf)\n\n- (arXiv 2022.09) CONE: An Efficient COarse-to-fiNE Alignment Framework for **Long Video Temporal Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10918.pdf)\n\n- (arXiv 2022.09) **Recipe Generation** from Unsegmented Cooking Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10134.pdf)\n\n- (arXiv 2022.09) PicT: A Slim Weakly Supervised Vision Transformer for **Pavement Distress Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10074.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDearCaat\u002FPicT)\n\n- (arXiv 2022.09) Show, Interpret and Tell: Entity-aware Contextualised Image **Captioning** in Wikipedia, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10474.pdf)\n\n- (arXiv 2022.09) RNGDet++: **Road Network Graph Detection** by Transformer with Instance Segmentation and Multi-scale Features Enhancement, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10150.pdf), [[Code]](https:\u002F\u002Ftonyxuqaq.github.io\u002Fprojects\u002FRNGDetPlusPlus\u002F)\n\n- (arXiv 2022.09) Toward 3D Spatial Reasoning for Human-like Text-based **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10326.pdf)\n\n- (arXiv 2022.09) I2DFormer: Learning **Image** to **Document** Attention for **Zero-Shot Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10304.pdf)\n\n- (arXiv 2022.09) **Integer** Fine-tuning of Transformer-based Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09815.pdf)\n\n- (arXiv 2022.09) Open-vocabulary Queryable Scene Representations for **Real World Planning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09874.pdf), [[Code]](https:\u002F\u002Fnlmap-saycan.github.io\u002F)\n\n- (arXiv 2022.09) Det**CLIP**: Dictionary-Enriched Visual-Concept Paralleled Pre-training for **Open-world Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09407.pdf)\n\n- (arXiv 2022.09) Hierarchical Temporal Transformer for **3D Hand Pose Estimation** and **Action Recognition** from **Egocentric** RGB Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09484.pdf)\n\n- (arXiv 2022.09) **Graph** Reasoning Transformer for **Image Parsing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09545.pdf)\n\n- (arXiv 2022.09) **Quantum** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08167.pdf)\n\n- (arXiv 2022.09) Active **Visual Search** in the Wild, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08803.pdf)\n\n- (arXiv 2022.09) PPT: token-Pruned Pose Transformer for monocular and multi-view human **pose estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08194.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHowieMa\u002FPPT)\n\n- (arXiv 2022.09) Learning Distinct and Representative **Modes** for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08231.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbladewaltz1\u002FModeCap)\n\n- (arXiv 2022.09) TODE-Trans: **Transparent** Object **Depth Estimation** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08455.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyuchendoudou\u002FTODE)\n\n- (arXiv 2022.09) Tree-based **Text-Vision** BERT for Video Search in Baidu Video Advertising, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08759.pdf)\n\n- (arXiv 2022.09) Integrative Feature and Cost Aggregation with Transformers for **Dense Correspondence**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08742.pdf)\n\n- (arXiv 2022.09) Axially Expanded Windows for **Local-Global Interaction** in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08726.pdf)\n\n- (arXiv 2022.09) UNCERTAINTY AWARE MULTITASK PYRAMID VISION TRANSFORMER FOR **UAV**-BASED **OBJECT RE-IDENTIFICATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08686.pdf)\n\n- (arXiv 2022.09) TASKED: Transformer-based Adversarial learning for **human activity recognition** using **wearable sensors** via Self-KnowledgE Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09092.pdf)\n\n- (arXiv 
2022.09) EcoFormer: Energy-Saving Attention with **Linear Complexity**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09004.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fziplab\u002FEcoFormer)\n\n- (arXiv 2022.09) **Panoramic** Vision Transformer for **Saliency Detection** in 360◦ Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08956.pdf)\n\n- (arXiv 2022.09) THE BIASED ARTIST: EXPLOITING CULTURAL **BIASES** VIA HOMOGLYPHS IN **TEXT-GUIDED IMAGE GENERATION MODELS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08891.pdf)\n\n- (arXiv 2022.09) **Scene Graph Modification** as Incremental Structure Expanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09093.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTHU-BPM\u002FSGM)\n\n- (arXiv 2022.09) Discriminative Sampling of Proposals in Self-Supervised Transformers for **Weakly Supervised Object Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09209.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshakeebmurtaza\u002Fdips)\n\n- (arXiv 2022.09) Real-time **Online Video Detection** with Temporal Smoothing Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09236.pdf)\n\n- (arXiv 2022.09) ViT-DD: Multi-Task Vision Transformer for Semi-Supervised **Driver Distraction Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09178.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPurdueDigitalTwin\u002FViT-DD)\n\n- (arXiv 2022.09) Code as Policies: Language Model Programs for **Embodied Control**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07753.pdf), [[Project]](https:\u002F\u002Fcode-as-policies.github.io\u002F)\n\n- (arXiv 2022.09) SQ-Swin: a Pretrained Siamese Quadratic Swin Transformer for **Lettuce Browning Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07683.pdf)\n\n- (arXiv 2022.09) Self-Attentive Pooling for **Efficient** Deep Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07659.pdf)\n\n- (arXiv 2022.09) Domain-Unified Prompt Representations for Source-Free **Domain Generalization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14926.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmuse1998\u002FSource-Free-Domain-Generalization)\n\n- (arXiv 2022.09) BRIDGING THE GAP TO REAL-WORLD **OBJECTCENTRIC LEARNING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14860.pdf)\n\n- (arXiv 2022.09) Prompt-guided **Scene Generation** for **3D** Zero-Shot Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14690.pdf)\n\n- (arXiv 2022.09) RE-IMAGEN: RETRIEVAL-AUGMENTED **TEXT-TO-IMAGE** GENERATOR, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14491.pdf)\n\n- (arXiv 2022.09) Distribution Aware **Metrics** for Conditional Natural **Language Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07518.pdf)\n\n- (arXiv 2022.09) **CLIP**ping Privacy: Identity Inference **Attacks** on Multi-Modal Machine Learning Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07341.pdf)\n\n- (arXiv 2022.09) Finetuning Pretrained **Vision-Language** Models with Correlation Information Bottleneck for Robust **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06954.pdf)\n\n- (arXiv 2022.09) PriorLane: A Prior Knowledge Enhanced **Lane Detection** Approach Based on Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06994.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Fvincentqqb\u002FPriorLane)\n\n- (arXiv 2022.09) Can We Solve **3D** Vision Tasks Starting from A **2D** Vision **Transformer**? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07026.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FSimple3D-Former.git)\n\n- (arXiv 2022.09) EXPLORING VISUAL INTERPRETABILITY FOR CONTRASTIVE **LANGUAGE-IMAGE** PRE-TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07046.pdf)\n\n- (arXiv 2022.09) OmniVL: One Foundation Model for **Image-Language** and **Video-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07526.pdf)\n\n- (arXiv 2022.09) Test-Time Training with **Masked Autoencoders**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07522.pdf), [[Code]](https:\u002F\u002Fyossigandelsman.github.io\u002Fttt_mae\u002Findex.html)\n\n- (arXiv 2022.09) VISUAL **RECOGNITION** WITH DEEP NEAREST **CENTROIDS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07383.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChengHan111\u002FDNC)\n\n- (arXiv 2022.09) One-Shot Transfer of **Affordance** Regions? AffCorrs! [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07147.pdf), [[Code]](https:\u002F\u002Fsites.google.com\u002Fview\u002Faffcorrs)\n\n- (arXiv 2022.09) Test-Time **Prompt Tuning** for Zero-Shot Generalization in **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07511.pdf), [[Code]](https:\u002F\u002Fazshue.github.io\u002FTPT\u002F)\n\n- (arXiv 2022.09) A Light Recipe to Train **Robust** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07399.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdedeswim\u002Fvits-robustness-torch)\n\n- (arXiv 2022.09) On the Surprising Effectiveness of Transformers in Low-Labeled **Video Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07474.pdf)\n\n- (arXiv 2022.09) Number of **Attention Heads** vs. 
Number of Transformer-**Encoders** in Computer Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07221.pdf)\n\n- (arXiv 2022.09) Global Semantic Descriptors for **Zero-Shot Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12061.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvalterlej\u002Fobjsentzsar)\n\n- (arXiv 2022.09) Revisiting Neural **Scaling Laws** in Language and Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06640.pdf)\n\n- (arXiv 2022.09) Small Transformers Compute Universal **Metric Embeddings**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06788.pdf)\n\n- (arXiv 2022.09) **CLIP**-ViP: Adapting Pre-trained Image-Text Model to **Video-Language** Representation Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06430.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain\u002Ftree\u002Fmain\u002FCLIP-ViP)\n\n- (arXiv 2022.09) CRAFT: Camera-Radar **3D Object Detection** with Spatio-Contextual Fusion Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06535.pdf)\n\n- (arXiv 2022.09) Transformers and CNNs both Beat Humans on **SBIR**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06629.pdf)\n\n- (arXiv 2022.09) PaLI: A Jointly-Scaled **Multilingual** **Language-Image** Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06794.pdf)\n\n- (arXiv 2022.09) MUST-VQA: MUltilingual Scene-text **VQA**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06730.pdf), [[Code]](https:\u002F\u002Fwww.ethnologue.com\u002Fenterprise-faq\u002Fhow-many-languages-world-are-unwritten-0)\n\n- (arXiv 2022.09) Leveraging Large Language Models for **Robot 3D Scene Understanding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05629.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMIT-SPARK\u002Fllm_scene_understanding)\n\n- (arXiv 2022.09) A lightweight Transformer-based model for **fish landmark detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05777.pdf)\n\n- (arXiv 2022.09) PSAQ-ViT V2: Towards Accurate and General Data-Free **Quantization** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05687.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzkkli\u002FPSAQ-ViT)\n\n- (arXiv 2022.09) ComplETR: Reducing the cost of annotations for object **detection** in dense scenes with vision transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05654.pdf)\n\n- (arXiv 2022.09) Semantic2Graph: Graph-based Multi-modal Feature for **Action Segmentation** in **Videos**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2209\u002F2209.05653.pdf)\n\n- (arXiv 2022.09) CenterFormer: Center-based Transformer for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05588.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTuSimple\u002Fcenterformer)\n\n- (arXiv 2022.09) PreSTU: Pre-Training for **Scene-Text** Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05534.pdf)\n\n- (arXiv 2022.09) OmDet: Language-Aware Object **Detection** with Large-scale **Vision-Language** Multi-dataset Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05946.pdf)\n\n- (arXiv 2022.09) DMTNet: Dynamic Multi-scale Network for Dual-pixel Images **Defocus Deblurring** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06040.pdf)\n\n- (arXiv 2022.09) SeRP: Self-Supervised Representation Learning Using 
Perturbed **Point Clouds**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06067.pdf)\n\n- (arXiv 2022.09) VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06103.pdf)\n\n- (arXiv 2022.09) Story**DALL-E**: Adapting Pretrained Text-to-Image Transformers for **Story Continuation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06192.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fadymaharana\u002Fstorydalle)\n\n- (arXiv 2022.09) ON THE **COMPUTATIONAL COMPLEXITY** OF SELF-ATTENTION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04881.pdf)\n\n- (arXiv 2022.09) Instruction-driven history-aware policies for **robotic manipulations**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04899.pdf), [[Code]](https:\u002F\u002Fguhur.github.io\u002Fhiveformer\u002F)\n\n- (arXiv 2022.09) Towards Multi-Lingual **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05401.pdf)\n\n- (arXiv 2022.09) PERCEIVER-ACTOR: A Multi-Task Transformer for **Robotic Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05451.pdf), [[Project]](https:\u002F\u002Fperact.github.io\u002F)\n\n- (arXiv 2022.09) GLOBAL PROTOTYPE ENCODING FOR INCREMENTAL **VIDEO HIGHLIGHTS DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05166.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FForeverPs\u002FGPE)\n\n- (arXiv 2022.09) Self-Supervised Multimodal Fusion Transformer for **Passive Activity Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03765.pdf)\n\n- (arXiv 2022.09) FETA: Towards Specializing **Foundation Models** for **Expert Task** Applications, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03648.pdf)\n\n- (arXiv 2022.09) Prior Knowledge-Guided **Attention** in Self-Supervised Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03745.pdf)\n\n- (arXiv 2022.09) Exploring Target Representations for **Masked Autoencoders**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03917.pdf)\n\n- (arXiv 2022.09) ISS: IMAGE AS STEPPING STONE FOR **TEXT-GUIDED 3D SHAPE GENERATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04145.pdf)\n\n- (arXiv 2022.09) Towards Confidence-guided **Shape Completion** for **Robotic** Applications, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04300.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fandrearosasco\u002Fhyperpcr)\n\n- (arXiv 2022.09) Pre-training **image-language** transformers for **open-vocabulary** tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04372.pdf)\n\n- (arXiv 2022.09) Improved Masked **Image Generation** with Token-Critic, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04439.pdf)\n\n- (arXiv 2022.09) Do As I Can, Not As I Say: **Grounding Language** in **Robotic** Affordances, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01691.pdf), [[Code]](https:\u002F\u002Fsay-can.github.io\u002F)\n\n- (arXiv 2022.09) Uformer-ICS: A Specialized U-Shaped Transformer for **Image Compressive Sensing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01763.pdf)\n\n- (arXiv 2022.09) An Empirical Study of End-to-End **Video-Language** Transformers with **Masked** Visual Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01540.pdf)\n\n- (arXiv 2022.09) Spatial-Temporal Transformer for **Video Snapshot Compressive Imaging**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01578.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fucaswangls\u002FSTFormer)\n\n- (arXiv 2022.09) MAFormer: A Transformer Network with Multi-scale **Attention** Fusion for Visual Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01620.pdf)\n\n- (arXiv 2022.09) SEFormer: Structure Embedding Transformer for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01745.pdf)\n\n- (arXiv 2022.09) ADTR: **Anomaly Detection** Transformer with Feature Reconstruction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01816.pdf)\n\n- (arXiv 2022.09) Learning Canonical Embeddings for Unsupervised Shape **Correspondence** with Locally Linear Transformations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02152.pdf)\n\n- (arXiv 2022.09) Transformer-CNN Cohort: **Semi-supervised Semantic Segmentation** by the Best of Both Students, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02178.pdf)\n\n- (arXiv 2022.09) PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards **Video Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02242.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHon-Wong\u002FPTSEFormer)\n\n- (arXiv 2022.09) VITKD: PRACTICAL GUIDELINES FOR VIT FEATURE **KNOWLEDGE DISTILLATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02432.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyzd-v\u002Fcls_KD)\n\n- (arXiv 2022.09) DPIT: Dual-Pipeline Integrated Transformer for Human **Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02431.pdf)\n\n- (arXiv 2022.09) SkeletonMAE: Spatial-Temporal **Masked Autoencoders** for Self-supervised **Skeleton Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02399.pdf)\n\n- (arXiv 2022.09) What does a platypus look like? 
Generating customized prompts for **zero-shot image classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03320.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsarahpratt\u002FCuPL)\n\n- (arXiv 2022.09) AI Illustrator: Translating Raw Descriptions into **Images** by **Prompt**-based Cross-Modal **Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03160.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FAI_Illustrator)\n\n- (arXiv 2022.09) MimCo: **Masked** Image Modeling Pre-training with Contrastive Teacher, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03063.pdf)\n\n- (arXiv 2022.09) **Multi-modal** Contrastive Representation Learning for **Entity Alignment**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00891.pdf)\n\n- (arXiv 2022.09) Zero-Shot Multi-Modal **Artist-Controlled Retrieval** and **Exploration** of **3D** Object Sets, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00682.pdf)\n\n- (arXiv 2022.09) Geometry Aligned Variational Transformer for **Image-conditioned Layout Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00852.pdf)\n\n- (arXiv 2022.09) Real-time **3D** Single Object **Tracking** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00860.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshanjiayao\u002FPTT)\n\n- (arXiv 2022.09) Video-Guided Curriculum Learning for **Spoken Video Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00277.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmarmot-xy\u002FSpoken-Video-Grounding)\n\n- (arXiv 2022.09) FLAME: Free-form Language-based **Motion Synthesis** & **Editing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00349.pdf)\n\n- (arXiv 2022.09) TOKENCUT: **SEGMENTING** OBJECTS IN IMAGES AND VIDEOS WITH SELF-SUPERVISED TRANSFORMER AND NORMALIZED CUT, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00383.pdf), [[Code]](https:\u002F\u002Fwww.m-psi.fr\u002FPapers\u002FTokenCut2022\u002F)\n\n- (arXiv 2022.09) Unified Fully and Timestamp Supervised **Temporal Action Segmentation** via Sequence to Sequence Translation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00638.pdf)\n\n- (arXiv 2022.09) MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised **Point Cloud Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00407.pdf), [[Project]](http:\u002F\u002Fxiaodongchen.cn\u002FMAPLE\u002F)\n\n- (arXiv 2022.09) **Visual Prompting** via Image Inpainting, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00647.pdf), [[Project]](https:\u002F\u002Fyossigandelsman.github.io\u002Fvisual_prompt)\n\n- (arXiv 2022.09) RLIP: Relational **Language-Image** Pre-training for **Human-Object Interaction** Detection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01814.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJacobYuan7\u002FRLIP)\n\n### 2022.08\n\n- (arXiv 2022.08) On Grounded Planning for **Embodied** Tasks with Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00465.pdf), [[Project]](https:\u002F\u002Finklab.usc.edu\u002FG-PlanET\u002F)\n\n- (arXiv 2022.08) **Group Activity Recognition** in Basketball Tracking Data - Neural Embeddings in Team Sports (NETS), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00451.pdf)\n\n- (arXiv 2022.08) SWIN-TRANSFORMER-YOLOV5 FOR REAL-TIME WINE GRAPE BUNCH **DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14508.pdf)\n\n- 
(arXiv 2022.08) SIM-Trans: Structure Information Modeling Transformer for **Fine-grained Visual Categorization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14607.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPKU-ICST-MIPL\u002FSIM-Trans_ACMMM2022)\n\n- (arXiv 2022.08) INJECTING **IMAGE DETAILS** INTO **CLIP**’S FEATURE SPACE, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14649.pdf)\n\n- (arXiv 2022.08) Hierarchical Local-Global Transformer for **Temporal Sentence Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14882.pdf)\n\n- (arXiv 2022.08) EViT: **Privacy**-Preserving **Image Retrieval** via Encrypted Vision Transformer in Cloud Computing, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14657.pdf)\n\n- (arXiv 2022.08) TRUST: An Accurate and End-to-End **Table structure Recognizer** Using Splitting-based Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14687.pdf)\n\n- (arXiv 2022.08) ELMformer: Efficient Raw **Image Restoration** with a Locally Multiplicative Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14704.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fleonmakise\u002FELMformer)\n\n- (arXiv 2022.08) SoMoFormer: **Multi-Person Pose Forecasting** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14023.pdf)\n\n- (arXiv 2022.08) A Circular Window-based Cascade Transformer for **Online Action Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14209.pdf)\n\n- (arXiv 2022.08) ASpanFormer: Detector-Free **Image Matching** with Adaptive Span Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14201.pdf)\n\n- (arXiv 2022.08) Robust Sound-Guided **Image Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14114.pdf)\n\n- (arXiv 2022.08) TrojViT: **Trojan Insertion** in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13049.pdf)\n\n- (arXiv 2022.08) User-Controllable Latent Transformer for StyleGAN **Image Layout Editing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12408.pdf)\n\n- (arXiv 2022.08) Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for **Few-Shot Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12398.pdf)\n\n- (arXiv 2022.08) JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational **Embodied Agents**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13266.pdf)\n\n- (arXiv 2022.08) TFusion: Transformer based N-to-One **Multimodal** Fusion Block, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12776.pdf)\n\n- (arXiv 2022.08) VMFormer: End-to-End **Video Matting** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12801.pdf), [[Code]](https:\u002F\u002Fchrisjuniorli.github.io\u002Fproject\u002FVMFormer\u002F)\n\n- (arXiv 2022.08) LOGICRANK: Logic Induced Reranking for Generative **Text-to-Image** Systems, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13518.pdf)\n\n- (arXiv 2022.08) CLUSTR: EXPLORING **EFFICIENT SELF-ATTENTION** VIA CLUSTERING FOR VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13138.pdf)\n\n- (arXiv 2022.08) **Federated Zero-Shot Learning** with Mid-Level Semantic Knowledge Transfer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13465.pdf)\n\n- (arXiv 2022.08) **Prompt Tuning** with Soft Context Sharing for **Vision-Language** Models, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13474.pdf)\n\n- (arXiv 2022.08) Efficient **Vision-Language** Pretraining with Visual Concepts and Hierarchical Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13628.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmshukor\u002FViCHA)\n\n- (arXiv 2022.08) CounTR: Transformer-based Generalised Visual **Counting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13721.pdf), [[Code]](https:\u002F\u002Fverg-avesta.github.io\u002FCounTR_Webpage\u002F)\n\n- (arXiv 2022.08) **Open-Set** Semi-Supervised Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13722.pdf)\n\n- (arXiv 2022.08) g**Swin**: Gated **MLP** Vision Model with Hierarchical Structure of Shifted Window, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11718.pdf)\n\n- (arXiv 2022.08) Adaptive Perception Transformer for **Temporal Action Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11908.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSouperO\u002FAdaPerFormer)\n\n- (arXiv 2022.08) Symbolic Replay: **Scene Graph** as Prompt for Continual Learning on **VQA** Task, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12037.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FCLVQA)\n\n- (arXiv 2022.08) **Masked** Autoencoders Enable Efficient **Knowledge Distillers**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12256.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FDMAE)\n\n- (arXiv 2022.08) LaTe**RF**: Label and **Text** Driven Object Radiance Fields, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01583.pdf)\n\n- (arXiv 2022.08) Video Mobile-Former: **Video Recognition** with **Efficient** Global Spatial-temporal Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12257.pdf)\n\n- (arXiv 2022.08) Pix4Point: Image Pretrained Transformers for 3D **Point Cloud Understanding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12259.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fguochengqian\u002FPix4Point)\n\n- (arXiv 2022.08) Mask**CLIP**: **Masked** Self-Distillation Advances Contrastive Language-Image Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12262.pdf)\n\n- (arXiv 2022.08) Visual Subtitle Feature Enhanced **Video Outline Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11307.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAopolin-Lv\u002FVSENet)\n\n- (arXiv 2022.08) CATS: COMPLEMENTARY **CNN** AND TRANSFORMER ENCODERS FOR **SEGMENTATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11572.pdf)\n\n- (arXiv 2022.08) Modeling Paragraph-Level **Vision-Language** Semantic Alignment for Multi-Modal Summarization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11303.pdf)\n\n- (arXiv 2022.08) Fashion**VQA**: A Domain-Specific Visual Question Answering System, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11253.pdf)\n\n- (arXiv 2022.08) K-ORDER GRAPH-ORIENTED TRANSFORMER WITH GRAATTENTION FOR **3D POSE AND SHAPE ESTIMATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11328.pdf)\n\n- (arXiv 2022.08) Towards **Efficient** Use of Multi-Scale Features in Transformer-Based Object **Detectors**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11356.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FIMFA)\n\n- (arXiv 2022.08) Improving **video retrieval** using **multilingual** knowledge transfer, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11553.pdf)\n\n- (arXiv 2022.08) **EFFICIENT** SPARSELY ACTIVATED TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14580.pdf)\n\n- (arXiv 2022.08) M2HF: MULTI-LEVEL MULTI-MODAL HYBRID FUSION FOR **TEXT-VIDEO RETRIEVAL**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07664.pdf)\n\n- (arXiv 2022.08) **Accelerating** Vision Transformer Training via a Patch Sampling Schedule, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09520.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002FBradMcDanel\u002Fpss)\n\n- (arXiv 2022.08) A Dual Modality Approach For (Zero-Shot) **Multi-Label Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09562.pdf)\n\n- (arXiv 2022.08) Offline **Handwritten Mathematical Recognition** using Adversarial Learning and Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09662.pdf)\n\n- (arXiv 2022.08) Semantic-enhanced Image **Clustering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09849.pdf)\n\n- (arXiv 2022.08) DPTNet: A Dual-Path Transformer Architecture for **Scene Text Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09878.pdf)\n\n- (arXiv 2022.08) ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for **Interpretable Image Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.10431.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzju-vipa\u002FProtoPFormer)\n\n- (arXiv 2022.08) Image as a Foreign Language: BEIT Pretraining for All Vision and **Vision-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.10442.pdf), [[Project]](https:\u002F\u002Faka.ms\u002Fbeit-3)\n\n- (arXiv 2022.08) PoseBERT: A Generic Transformer Module for **Temporal 3D Human Modeling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.10211.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fposebert)\n\n- (arXiv 2022.08) **EFFICIENT** ATTENTION-FREE **VIDEO** SHIFT TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11108.pdf)\n\n- (arXiv 2022.08) Flat Multi-modal Interaction Transformer for **Named Entity Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11039.pdf)\n\n- (arXiv 2022.08) **Dance Style Transfer** with Cross-modal Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09406.pdf)\n\n- (arXiv 2022.08) Improved **Image Classification** with Token Fusion , [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2208\u002F2208.09183.pdf)\n\n- (arXiv 2022.08) VAuLT: Augmenting the **Vision-and-Language** Transformer with the Propagation of Deep Language Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09021.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgchochla\u002FVAuLT)\n\n- (arXiv 2022.08) **TEXT TO IMAGE GENERATION**: LEAVING NO LANGUAGE BEHIND, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09333.pdf)\n\n- (arXiv 2022.08) Aspect-based **Sentiment Classification** with Sequential Cross-modal Semantic Graph, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09417.pdf)\n\n- (arXiv 2022.08) Diverse **Video Captioning** by Adaptive Spatio-temporal Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09266.pdf)\n\n- (arXiv 2022.08) VL**MAE**: **Vision-Language** Masked Autoencoder, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09374.pdf)\n\n- (arXiv 2022.08) SoMoFormer: Social-Aware Motion 
Transformer for **Multi-Person Motion Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09224.pdf)\n\n- (arXiv 2022.08) ILLUME: Rationalizing **Vision-Language** Models by Interacting with their Jabber, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08241.pdf)\n\n- (arXiv 2022.08) ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human **Activity Recognition** in Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07929.pdf)\n\n- (arXiv 2022.08) UniLayout: Taming Unified Sequence-to-Sequence Transformers for **Graphic Layout Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08037.pdf)\n\n- (arXiv 2022.08) InterTrack: Interaction Transformer for **3D Multi-Object Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08041.pdf)\n\n- (arXiv 2022.08) Understanding **Attention** for **Vision-and-Language** Task, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08104.pdf)\n\n- (arXiv 2022.08) Towards **Open-vocabulary Scene Graph Generation** with Prompt-based Finetuning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08165.pdf)\n\n- (arXiv 2022.08) Class-Aware Visual Prompt Tuning for **Vision-Language** Pre-Trained Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08340.pdf)\n\n- (arXiv 2022.08) Unifying Visual **Perception** by Dispersible **Points** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08630.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniHead)\n\n- (arXiv 2022.08) **Text-to-Image Generation** via Implicit Visual Guidance and Hypernetwork, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08493.pdf)\n\n- (arXiv 2022.08) ConMatch: **Semi-Supervised Learning** with Confidence-Guided Consistency Regularization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08631.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJiwonCocoder\u002FConMatch)\n\n- (arXiv 2022.08) The 8-Point Algorithm as an Inductive Bias for **Relative Pose Prediction** by ViTs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08988.pdf)\n\n- (arXiv 2022.08) Open-Vocabulary **Panoptic Segmentation** with Mask**CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08984.pdf)\n\n- (arXiv 2022.08) Prompt Vision Transformer for **Domain Generalization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08914.pdf)\n\n- (arXiv 2022.08) GSRFormer: **Grounded Situation Recognition** Transformer with Alternate Semantic Attention Refinement, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08965.pdf)\n\n- (arXiv 2022.08) CONVIFORMERS: **CONVOLUTIONALLY** GUIDED VISION TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08900.pdf)\n\n- (arXiv 2022.08) Learning Spatial-Frequency Transformer for Visual Object **Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08829.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTchuanm\u002FSFTransT.git)\n\n- (arXiv 2022.08) **Efficient** **Multimodal** Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07589.pdf)\n\n- (arXiv 2022.08) Your ViT is Secretly a Hybrid **Discriminative-Generative** **Diffusion** Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07791.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsndnyang\u002FDiffusion_ViT)\n\n- (arXiv 2022.08) LLM.int8(): 8-bit **Matrix Multiplication** for Transformers at **Scale**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07339.pdf)\n\n- (arXiv 2022.08) ExpansionNet v2: Block Static Expansion in fast end to end training for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06551.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjchenghu\u002FExpansionNet_v2)\n\n- (arXiv 2022.08) Multi-modal Transformer **Path Prediction** for Autonomous Vehicle, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07256.pdf)\n\n- (arXiv 2022.08) Flow-Guided Transformer for **Video Inpainting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06768.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhitachinsk\u002FFGT)\n\n- (arXiv 2022.08) TL;DW? **Summarizing Instructional Videos** with Task Relevance & Cross-Modal Saliency, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06773.pdf), [[Project]](https:\u002F\u002Fmedhini.github.io\u002Fivsum\u002F)\n\n- (arXiv 2022.08) HoW-3D: Holistic **3D Wireframe Perception** from a Single Image, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06999.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FWenchao-M\u002FHoW-3D)\n\n- (arXiv 2022.08) **BEIT V2**: **Masked Image Modeling** with Vector-Quantized Visual Tokenizers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06366.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n- (arXiv 2022.08) MILAN: **Masked Image Pretraining** on Language Assisted Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06049.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzejiangh\u002FMILAN)\n\n- (arXiv 2022.08) Hybrid Transformer Network for **Deepfake Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05820.pdf)\n\n- (arXiv 2022.08) **Semi-supervised** Vision Transformers at Scale, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05688.pdf)\n\n- (arXiv 2022.08) PPMN: Pixel-Phrase Matching Network for One-Stage **Panoptic Narrative Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05647.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdzh19990407\u002FPPMN)\n\n- (arXiv 2022.08) Exploring Anchor-based Detection for **Ego4D** Natural **Language Query**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05375.pdf)\n\n- (arXiv 2022.08) Language Supervised Training for **Skeleton-based Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05318.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMartinXM\u002FLST)\n\n- (arXiv 2022.08) Exploring Point-BEV Fusion for 3D **Point Cloud Object Tracking** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05216.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJasonkks\u002FPTTR)\n\n- (arXiv 2022.08) Ghost-free **High Dynamic Range Imaging** with Context-aware Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05114.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmegvii-research\u002FHDR-Transformer)\n\n- (arXiv 2022.08) **CLIP**-based Neural Neighbor **Style Transfer** for **3D Assets**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04370.pdf)\n\n- (arXiv 2022.08) **Sports Video Analysis** on Large-Scale Data, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04897.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjackwu502\u002FNSVA)\n\n- (arXiv 2022.08) How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? 
An Empirical Study Involving **Art Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04693.pdf)\n\n- (arXiv 2022.08) In the Eye of Transformer: Global-Local Correlation for **Egocentric Gaze Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04464.pdf), [[Code]](https:\u002F\u002Fbolinlai.github.io\u002FGLC-EgoGazeEst)\n\n- (arXiv 2022.08) **DALLE**-URBAN: Capturing the **urban** design expertise of large **text to image** transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04139.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsachith500\u002FDALLEURBAN)\n\n- (arXiv 2022.08) PlaneFormers: From Sparse View Planes to **3D Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04307.pdf), [[Code]](https:\u002F\u002Fsamiragarwala.github.io\u002FPlaneFormers)\n\n- (arXiv 2022.08) Boosting **Video-Text Retrieval** with Explicit High-Level Semantics, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04215.pdf)\n\n- (arXiv 2022.08) Distinctive Image **Captioning** via **CLIP** Guided Group Optimization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04254.pdf)\n\n- (arXiv 2022.08) Understanding **Masked Image Modeling** via Learning Occlusion Invariant Feature, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04164.pdf)\n\n- (arXiv 2022.08) GRIT-VLP: Grouped Mini-batch Sampling for Efficient **Vision and Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04060.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjaeseokbyun\u002FGRIT-VLP)\n\n- (arXiv 2022.08) Advancing Plain Vision Transformer Towards **Remote Sensing** Foundation Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03987.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FRemote-Sensing-RVSA)\n\n- (arXiv 2022.08) Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and **Grasping** Specular and Transparent Objects, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03792.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPKU-EPIC\u002FDREDS)\n\n- (arXiv 2022.08) Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03704.pdf)\n\n- (arXiv 2022.08) Frozen **CLIP** Models are **Efficient Video** Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03550.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002Fefficient-video-recognition)\n\n- (arXiv 2022.08) MonoViT: Self-Supervised **Monocular Depth Estimation** with a Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03543.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzxcqlf\u002FMonoViT)\n\n- (arXiv 2022.08) HaloAE: An HaloNet based Local Transformer Auto-Encoder for **Anomaly Detection** and **Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03486.pdf), [[Code]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FHaloAE-E27B\u002FREADME.md)\n\n- (arXiv 2022.08) IVT: An End-to-End Instance-guided Video Transformer for **3D Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03431.pdf)\n\n- (arXiv 2022.08) A Sketch Is Worth a Thousand Words: **Image Retrieval** with **Text** and **Sketch**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03354.pdf), [[Code]](https:\u002F\u002Fjanesjanes.github.io\u002Ftsbir\u002F)\n\n- (arXiv 2022.08) PointConvFormer: Revenge of 
the **Point-based Convolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02879.pdf)\n\n- (arXiv 2022.08) ChiQA: A Large Scale Image-based Real-World **Question Answering Dataset** for Multi-Modal Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03030.pdf)\n\n- (arXiv 2022.08) LaTTe: **Language** **Trajectory** TransformEr, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02918.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Farthurfenderbucker\u002FNL_trajectory_reshaper)\n\n- (arXiv 2022.08) Learning Spatiotemporal Frequency-Transformer for **Compressed Video Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03012.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FFTVSR)\n\n- (arXiv 2022.08) TransMatting: Enhancing **Transparent Objects Matting** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03007.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002FAceCHQ\u002FTransMatting)\n\n- (arXiv 2022.08) Word-Level Fine-Grained **Story Visualization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02341.pdf)\n\n- (arXiv 2022.08) Fine-Grained Semantically Aligned **Vision-Language** Pre-Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02515.pdf)\n\n- (arXiv 2022.08) Expanding **Language-Image** Pretrained Models for General **Video Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02816.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVideoX\u002Ftree\u002Fmaster\u002FX-CLIP)\n\n- (arXiv 2022.08) P2P: Tuning Pre-trained Image Models for **Point Cloud Analysis** with Point-to-Pixel Prompting, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02812.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwangzy22\u002FP2P)\n\n- (arXiv 2022.08) **Drop**Key, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02646.pdf)\n\n- (arXiv 2022.08) MVSFormer: **Multi-View Stereo** with Pre-trained Vision Transformers and Temperature-based Depth, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02541.pdf)\n\n- (arXiv 2022.08) Per-Clip Video Object **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01924.pdf)\n\n- (arXiv 2022.08) XCon: Learning with Experts for **Fine-grained Category Discovery**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01898.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYiXXin\u002FXCon)\n\n- (arXiv 2022.08) Combined CNN Transformer Encoder for Enhanced Fine-grained Human **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01897.pdf)\n\n- (arXiv 2022.08) RE-ATTENTION TRANSFORMER FOR WEAKLY SUPERVISED **OBJECT LOCALIZATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01838.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsu-hui-zz\u002FReAttentionTransformer)\n\n- (arXiv 2022.08) TAG: Boosting Text-**VQA** via Text-aware Visual Question-answer Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01813.pdf)\n\n- (arXiv 2022.08) Two-Stream Transformer Architecture for **Long Form Video Understanding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01753.pdf)\n\n- (arXiv 2022.08) A Fast **Text-Driven** Approach for **Generating Artistic Content**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01748.pdf)\n\n- (arXiv 2022.08) DAHITRA: **DAMAGE ASSESSMENT** USING A NOVEL HIERARCHICAL TRANSFORMER ARCHITECTURE, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02205.pdf)\n\n- 
(arXiv 2022.08) MinVIS: A Minimal **Video Instance Segmentation** Framework without Video-based Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02245.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FMinVIS)\n\n- (arXiv 2022.08) **Masked** **Vision and Language** Modeling for Multi-modal Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02131.pdf)\n\n- (arXiv 2022.08) SSformer: A **Lightweight** Transformer for Semantic Segmentation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02034.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshiwt03\u002FSSformer)\n\n- (arXiv 2022.08) **Pose** Uncertainty Aware **Movement Synchrony Estimation** via Spatial-Temporal Graph Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01161.pdf)\n\n- (arXiv 2022.08) Making the Best of Both Worlds: A Domain-Oriented Transformer for **Unsupervised Domain Adaptation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01195.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBIT-DA\u002FDomain-Oriented-Transformer)\n\n- (arXiv 2022.08) Unified Normalization for **Accelerating** and **Stabilizing** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01313.pdf)\n\n- (arXiv 2022.08) An Image is Worth One Word: Personalizing **Text-to-Image Generation** using Textual Inversion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01618.pdf), [[Project]](https:\u002F\u002Ftextual-inversion.github.io\u002F)\n\n- (arXiv 2022.08) Prompt-to-**Prompt** **Image Editing** with Cross Attention Control, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01626.pdf)\n\n- (arXiv 2022.08) Momentum Transformer: Closing the Performance Gap Between Self-attention and Its **Linearization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00579.pdf)\n\n- (arXiv 2022.08) Testing Relational Understanding in **Text-Guided Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00005.pdf)\n\n- (arXiv 2022.08) UAVM: A Unified Model for **Audio-Visual** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00061.pdf)\n\n- (arXiv 2022.08) Meta-**DETR**: Image-Level **Few-Shot** Detection with Inter-Class Correlation Exploitation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00219.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FMeta-DETR)\n\n- (arXiv 2022.08) Point Primitive Transformer for Long-Term **4D Point Cloud Video Understanding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00281.pdf)\n\n- (arXiv 2022.08) One for All: One-stage **Referring Expression Comprehension** with Dynamic Reasoning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00361.pdf)\n\n- (arXiv 2022.08) Toward Understanding WordArt: Corner-Guided Transformer for **Scene Text Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00438.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxdxie\u002FWordArt)\n\n- (arXiv 2022.08) SdAE: Self-distillated **Masked Autoencoder**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00449.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAbrahamYabo\u002FSdAE)\n\n- (arXiv 2022.08) Augmenting **Vision Language** Pretraining by Learning Codebook with Visual Semantics, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00475.pdf)\n\n- (arXiv 2022.08) STrajNet: **Occupancy Flow Prediction** via Multi-modal Swin Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00394.pdf)\n\n- 
(arXiv 2022.08) D^3Former: Debiased Dual Distilled Transformer for Incremental Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00777.pdf), [[Code]](https:\u002F\u002Ftinyurl.com\u002Fd3former)\n\n- (arXiv 2022.08) Local Perception-Aware Transformer for **Aerial Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00662.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvision4robotics\u002FLPAT)\n\n- (arXiv 2022.08) SIAMIXFORMER: A SIAMESE TRANSFORMER NETWORK FOR BUILDING DETECTION AND CHANGE DETECTION FROM BI-TEMPORAL **REMOTE SENSING** IMAGES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00657.pdf)\n\n- (arXiv 2022.08) Transformers as Meta-Learners for **Implicit Neural Representations**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02801.pdf), [[Code]](https:\u002F\u002Fyinboc.github.io\u002Ftrans-inr\u002F)\n\n- (arXiv 2022.08) **Video Question Answering** with Iterative Video-Text Co-Tokenization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00934.pdf), [[Code]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fvideoqa-cotokenization)\n\n- (arXiv 2022.08) Understanding Adversarial **Robustness** of Vision Transformers via Cauchy Problem, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00906.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTrustAI\u002FODE4RobustViT)\n\n### 2022.07 \n\n- (arXiv 2022.07) Pro-tuning: Unified **Prompt Tuning** for Vision Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14381.pdf)\n\n- (arXiv 2022.07) ALADIN: Distilling Fine-grained Alignment Scores for Efficient **Image-Text** Matching and Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14757.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmesnico\u002FALADIN)\n\n- (arXiv 2022.07) Curriculum Learning for Data-Efficient **Vision-Language** Alignme, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14525.pdf)\n\n- (arXiv 2022.07) DnSwin: Toward Real-World **Denoising** via Continuous Wavelet Sliding-Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13861.pdf)\n\n- (arXiv 2022.07) Cross-Attention of Disentangled Modalities for **3D Human Mesh Recovery** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13820.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpostech-ami\u002FFastMETRO)\n\n- (arXiv 2022.07) AvatarPoser: Articulated **Full-Body Pose Tracking** from Sparse Motion Sensing, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13784.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Feth-siplab\u002FAvatarPoser)\n\n- (arXiv 2022.07) Semantic-Aligned Matching for Enhanced **DETR** Convergence and Multi-Scale Feature Fusion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14172.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FSAM-DETR)\n\n- (arXiv 2022.07) Safety-Enhanced **Autonomous Driving** Using Interpretable Sensor Fusion Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14024.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FInterFuser)\n\n- (arXiv 2022.07) Video Mask Transfiner for High-Quality **Video Instance Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14012.pdf), [[Project]](http:\u002F\u002Fvis.xyz\u002Fpub\u002Fvmt)\n\n- (arXiv 2022.07) A **Variational AutoEncoder** for Transformers with Nonparametric Variational Information Bottleneck, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13529.pdf)\n\n- (arXiv 2022.07) Online 
**Continual Learning** with Contrastive Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13516.pdf)\n\n- (arXiv 2022.07) Retrieval-Augmented Transformer for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13162.pdf)\n\n- (arXiv 2022.07) Spatiotemporal Self-attention Modeling with Temporal Patch Shift for **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13259.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMartinXM\u002FTPS)\n\n- (arXiv 2022.07) Is Attention All **NeRF** Needs?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13298.pdf), [[Code]](https:\u002F\u002Fvita-group.github.io\u002FGNT\u002F)\n\n- (arXiv 2022.07) **Convolutional Embedding** Makes Hierarchical Vision Transformer Stronger, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13317.pdf)\n\n- (arXiv 2022.07) SiRi: A Simple Selective Retraining Mechanism for Transformer-based **Visual Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13325.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fqumengxue\u002Fsiri-vg.git)\n\n- (arXiv 2022.07) Deep **Clustering** with Features from **Self-Supervised** Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13364.pdf)\n\n- (arXiv 2022.07) Contrastive **Masked Autoencoders** are Stronger Vision Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13532.pdf)\n\n- (arXiv 2022.07) VICTOR: VISUAL **INCOMPATIBILITY DETECTION** WITH TRANSFORMERS AND FASHION-SPECIFIC CONTRASTIVE PRE-TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13458.pdf)\n\n- (arXiv 2022.07) Compositional **Human-Scene Interaction Synthesis** with Semantic Control, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12824.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzkf1997\u002FCOINS)\n\n- (arXiv 2022.07) Static and Dynamic Concepts for **Self-supervised** **Video** Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12795.pdf)\n\n- (arXiv 2022.07) Unsupervised Domain Adaptation for Video Transformers in **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12842.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvturrisi\u002FUDAVT)\n\n- (arXiv 2022.07) LaKo: Knowledge-driven **Visual Question Answering** via Late Knowledge-to-Text Injection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12888.pdf)\n\n- (arXiv 2022.07) TransFiner: A Full-Scale Refinement Approach for **Multiple Object Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12967.pdf)\n\n- (arXiv 2022.07) S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for **Domain Incremental Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12819.pdf)\n\n- (arXiv 2022.07) WinoGAViL: Gamified Association **Benchmark** to Challenge **Vision-and-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12576.pdf), [[Project]](https:\u002F\u002Fwinogavil.github.io\u002F)\n\n- (arXiv 2022.07) Cross-Modal Causal Relational Reasoning for Event-Level **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12647.pdf)\n\n- (arXiv 2022.07) Graph Neural Network and Spatiotemporal Transformer Attention for **3D** Video Object **Detection** from Point Clouds, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12659.pdf)\n\n- (arXiv 2022.07) Learning Visual Representation from Modality-Shared Contrastive **Language-Image** 
Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12661.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHxyou\u002FMSCLIP)\n\n- (arXiv 2022.07) V^2L: Leveraging Vision and **Vision-language** Models into Large-scale **Product Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12994.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FWangWenhao0716\u002FV2L)\n\n- (arXiv 2022.07) NewsStories: Illustrating **articles** with **visual** summaries, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13061.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002FNewsStoriesData\u002Fnewsstories.github.io)\n\n- (arXiv 2022.07) **DETR**s with Hybrid **Matching**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13080.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHDETR)\n\n- (arXiv 2022.07) GROUP **DETR**: **FAST** TRAINING CONVERGENCE WITH DECOUPLED ONE-TO-MANY LABEL ASSIGNMENT, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13085.pdf)\n\n- (arXiv 2022.07) Improved **Super Resolution** of MR Images Using CNNs and Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11748.pdf)\n\n- (arXiv 2022.07) Video Swin Transformers for **Egocentric Video** Understanding @ Ego4D Challenges 2022, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2207\u002F2207.11329.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBCV-Uniandes\u002FPNR_OSCC)\n\n- (arXiv 2022.07) An Impartial Take to the CNN vs Transformer **Robustness** Contest, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11347.pdf)\n\n- (arXiv 2022.07) **Generative** Artisan: A Semantic-Aware and Controllable **CLIP**styler, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11598.pdf)\n\n- (arXiv 2022.07) MAR: Masked Autoencoders for Efficient **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11660.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Falibaba-mmai-research\u002FMasked-Action-Recognition)\n\n- (arXiv 2022.07) **Object State Change Classification** in **Egocentric** Videos using the Divided Space-Time Attention Mechanism, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11814.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FObjectStateChange)\n\n- (arXiv 2022.07) Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for **Panoramic Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11860.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FTrans4PASS)\n\n- (arXiv 2022.07) Reference-based Image **Super-Resolution** with Deformable Attention Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11938.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcaojiezhang\u002FDATSR)\n\n- (arXiv 2022.07) JIGSAW-VIT: LEARNING **JIGSAW PUZZLES** IN VISION TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11971.pdf), [[Code]](https:\u002F\u002Fyingyichen-cyy.github.io\u002FJigsaw-ViT)\n\n- (arXiv 2022.07) TransCL: Transformer Makes Strong and Flexible **Compressive Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11972.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMC-E\u002FTransCL\u002F)\n\n- (arXiv 2022.07) 3D Siamese Transformer Network for Single Object **Tracking** on **Point Clouds**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11995.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffpthink\u002FSTNet)\n\n- (arXiv 2022.07) 
Intention-Conditioned Long-Term Human **Egocentric Action Forecasting** @ EGO4D Challenge 2022, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12080.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FEvm7\u002Fego4dlta-icvae)\n\n- (arXiv 2022.07) IGFormer: Interaction Graph Transformer for **Skeleton**-based **Human Interaction Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12100.pdf)\n\n- (arXiv 2022.07) Is **GPT-3** all you need for **Visual Question Answering** in Cultural Heritage? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12101.pdf)\n\n- (arXiv 2022.07) Applying Spatiotemporal Attention to **Identify Distracted** and **Drowsy Driving** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12148.pdf)\n\n- (arXiv 2022.07) **Action Quality Assessment** using Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12318.pdf)\n\n- (arXiv 2022.07) Self-Distilled Vision Transformer for **Domain Generalization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12392.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmaryam089\u002FSDViT)\n\n- (arXiv 2022.07) Exploring **CLIP** for **Assessing** the Look and Feel of **Images**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12396.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIceClear\u002FCLIP-IQA)\n\n- (arXiv 2022.07) Transformer with Implicit Edges for Particle-based **Physics Simulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10860.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fftbabi\u002FTIE_ECCV2022.git)\n\n- (arXiv 2022.07) Auto-regressive **Image Synthesis** with Integrated Quantization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10776.pdf)\n\n- (arXiv 2022.07) Efficient Modeling of Future Context for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10897.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffeizc\u002FFuture-Caption)\n\n- (arXiv 2022.07) Zero-Shot Video **Captioning** with Evolving Pseudo-Tokens, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11100.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYoadTew\u002Fzero-shot-video-to-text)\n\n- (arXiv 2022.07) Panoptic **Scene Graph** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11247.pdf), [[Project]](https:\u002F\u002Fpsgdataset.org\u002F), [[Code]](https:\u002F\u002Fgithub.com\u002FJingkang50\u002FOpenPSG)\n\n- (arXiv 2022.07) **Facial Expression Recognition** using Vanilla ViT backbones with MAE Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11081.pdf)\n\n- (arXiv 2022.07) Target-Driven Structured Transformer Planner for **Vision-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11201.pdf)\n\n- (arXiv 2022.07) **Scaling Laws** vs Model Architectures: How does Inductive Bias Influence Scaling? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10551.pdf)\n\n- (arXiv 2022.07) Hybrid CNN-Transformer Model For **Facial Affect Recognition** In the ABAW4 Challenge, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10201.pdf)\n\n- (arXiv 2022.07) Mesh**MAE**: Masked Autoencoders for 3D **Mesh** Data Analysis, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10228.pdf)\n\n- (arXiv 2022.07) SeedFormer: Patch Seeds based **Point Cloud Completion** with Upsample Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10315.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhrzhou2\u002Fseedformer)\n\n- (arXiv 2022.07) LocVTP: **Video-Text** Pre-training for Temporal Localization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10362.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmengcaopku\u002FLocVTP)\n\n- (arXiv 2022.07) Temporal Saliency Query Network for **Efficient Video Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10379.pdf), [[Code]](https:\u002F\u002Flawrencexia2008.github.io\u002Fprojects\u002Ftsqnet)\n\n- (arXiv 2022.07) Pose for Everything: Towards Category-Agnostic **Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10387.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fluminxu\u002FPose-for-Everything)\n\n- (arXiv 2022.07) Weakly Supervised **Object Localization** via Transformer with Implicit Spatial Calibration, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10447.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002F164140757\u002FSCM)\n\n- (arXiv 2022.07) An Efficient **Spatio-Temporal** Pyramid Transformer for **Action Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10448.pdf)\n\n- (arXiv 2022.07) Towards **Efficient Adversarial Training** on Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10498.pdf)\n\n- (arXiv 2022.07) TinyViT: Fast Pretraining Distillation for **Small** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10666.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FCream\u002Ftree\u002Fmain\u002FTinyViT)\n\n- (arXiv 2022.07) Hierarchically Self-Supervised Transformer for Human **Skeleton Representation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09644.pdf), [[Code]]( https:\u002F\u002Fgithub.com\u002Fyuxiaochen1103\u002FHi-TRS)\n\n- (arXiv 2022.07) Explicit Image **Caption Editing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09625.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbaaaad\u002FECE)\n\n- (arXiv 2022.07) AiATrack: Attention in Attention for Transformer Visual **Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09603.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLittle-Podi\u002FAiATrack)\n\n- (arXiv 2022.07) Tip-Adapter: Training-free Adaption of **CLIP** for **Few-shot Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09519.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgaopengcuhk\u002FTip-Adapter)\n\n- (arXiv 2022.07) Single Frame **Atmospheric Turbulence Mitigation**: A Benchmark Study and A New Physics-Inspired Transformer Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10040.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FTurbNet)\n\n- (arXiv 2022.07) HTNet: Anchor-free **Temporal Action Localization** with Hierarchical Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09662.pdf)\n\n- (arXiv 2022.07) GRIT: 
Faster and Better Image **captioning** Transformer Using Dual Visual Features, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09666.pdf)\n\n- (arXiv 2022.07) OTPose: Occlusion-Aware Transformer for **Pose Estimation** in Sparsely-Labeled Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09725.pdf)\n\n- (arXiv 2022.07) FaceFormer: Scale-aware Blind **Face Restoration** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09790.pdf)\n\n- (arXiv 2022.07) Multimodal Transformer for **Automatic 3D Annotation** and Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09805.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FCliu2\u002FMTrans)\n\n- (arXiv 2022.07) Temporal and cross-modal attention for **audio-visual** **zero-shot** learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09966.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FExplainableML\u002FTCAF-GZSL)\n\n- (arXiv 2022.07) Locality Guidance for Improving Vision Transformers on **Tiny Datasets**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10026.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flkhl\u002Ftiny-transformers)\n\n- (arXiv 2022.07) Is an Object-Centric Video Representation Beneficial for Transfer? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10075.pdf)\n\n- (arXiv 2022.07) DUQIM-Net: Probabilistic Object Hierarchy Representation for Multi-View **Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09105.pdf)\n\n- (arXiv 2022.07) RELATIONAL FUTURE **CAPTIONING** MODEL FOR EXPLAINING LIKELY COLLISIONS IN DAILY TASKS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09083.pdf)\n\n- (arXiv 2022.07) Conditional **DETR** V2: **Efficient** Detection Transformer with Box Queries, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08914.pdf)\n\n- (arXiv 2022.07) Exploiting Unlabeled Data with **Vision and Language** Models for Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08954.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxiaofeng94\u002FVL-PLM)\n\n- (arXiv 2022.07) TTVFI: Learning Trajectory-Aware Transformer for **Video Frame Interpolation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09048.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FTTVFI.git)\n\n- (arXiv 2022.07) Time Is MattEr: **Temporal Self-supervision** for Video Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09067.pdf)\n\n- (arXiv 2022.07) IDET: Iterative Difference-Enhanced Transformers for **High-Quality Change Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09240.pdf)\n\n- (arXiv 2022.07) Don’t Stop Learning: Towards **Continual Learning** for the **CLIP** Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09248.pdf)\n\n- (arXiv 2022.07) **Action Quality Assessment** with Temporal Parsing Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09270.pdf)\n\n- (arXiv 2022.07) Visual **Representation** Learning with Transformer: A Sequence-to-Sequence Perspective, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09339.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FSETR)\n\n- (arXiv 2022.07) Structural Prior Guided Generative Adversarial Transformers for **Low-Light Image Enhancement**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07828.pdf)\n\n- (arXiv 2022.07) TS2-Net: Token Shift and Selection Transformer for **Text-Video Retrieval**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07852.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyuqi657\u002Fts2_net)\n\n- (arXiv 2022.07) Clover: Towards A Unified **Video-Language** Alignment and Fusion Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07885.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLeeYN-43\u002FClover)\n\n- (arXiv 2022.07) SatMAE: Pre-training Transformers for Temporal and Multi-Spectral **Satellite Imagery**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08051.pdf)\n\n- (arXiv 2022.07) FashionViL: Fashion-Focused **Vision-and-Language** Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08150.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBrandonHanx\u002Fmmf)\n\n- (arXiv 2022.07) Zero-Shot **Temporal Action Detection** via Vision-Language Prompting, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08184.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsauradip\u002FSTALE)\n\n- (arXiv 2022.07) Rethinking Alignment in **Video Super-Resolution** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08494.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FXPixelGroup\u002FRethinkVSRAlignment)\n\n- (arXiv 2022.07) Open-world **Semantic Segmentation** via Contrasting and Clustering Vision-Language Embedding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08455.pdf)\n\n- (arXiv 2022.07) TokenMix: Rethinking Image Mixing for Data **Augmentation** in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08409.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FTokenMix)\n\n- (arXiv 2022.07) Towards the Human Global Context: Does the **Vision-Language** Model Really Judge Like a Human Being? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08333.pdf)\n\n- (arXiv 2022.07) Defect Transformer: An Efficient Hybrid Transformer Architecture for **Surface Defect Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08319.pdf)\n\n- (arXiv 2022.07) Semantic **Novelty Detection** via Relational Reasoning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08699.pdf)\n\n- (arXiv 2022.07) Unifying **Event Detection** and **Captioning** as Sequence Generation via Pre-Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08625.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FQiQAng\u002FUEDVC)\n\n- (arXiv 2022.07) Multi-manifold **Attention** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08569.pdf)\n\n- (arXiv 2022.07) UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in **Bird’s-Eye-View**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08536.pdf)\n\n- (arXiv 2022.07) **Position Prediction** as an Effective Pretraining Strategy, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07611.pdf)\n\n- (arXiv 2022.07) **Lightweight** Vision Transformer with Cross Feature Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07268.pdf)\n\n- (arXiv 2022.07) Parameterization of **Cross-Token Relations** with Relative Positional Encoding for Vision **MLP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07284.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZhicaiwww\u002FPosMLP)\n\n- (arXiv 2022.07) X-CLIP: End-to-End Multi-grained Contrastive Learning for **Video-Text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07285.pdf)\n\n- (arXiv 2022.07) Learning Parallax Transformer Network for **Stereo Image JPEG Artifacts Removal**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07335.pdf)\n\n- (arXiv 2022.07) A Dual-Masked Auto-Encoder for **Robust Motion Capture** with Spatial-Temporal Skeletal Token Completion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07381.pdf)\n\n- (arXiv 2022.07) Is a **Caption** Worth a Thousand **Images**? 
A Controlled Study for **Representation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07635.pdf)\n\n- (arXiv 2022.07) Multimodal **Open-Vocabulary Video Classification** via Pre-Trained Vision and Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07646.pdf)\n\n- (arXiv 2022.07) Cross-Attention Transformer for **Video Interpolation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04132.pdf)\n\n- (arXiv 2022.07) Towards Multimodal **Vision-Language** Models Generating Non-Generic Text, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04174.pdf)\n\n- (arXiv 2022.07) QKVA grid: **Attention** in Image Perspective and Stacked DETR, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04313.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshengwenyuan\u002Fsdetr)\n\n- (arXiv 2022.07) Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person **3D Pose Estimation Tracking** and **Forecasting** on a Video Snippet, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04320.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJimmyZou\u002FSnipper)\n\n- (arXiv 2022.07) Horizontal and Vertical **Attention** in Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04399.pdf)\n\n- (arXiv 2022.07) CoMER: Modeling Coverage for Transformer-based **Handwritten Mathematical Expression Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04410.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGreen-Wood\u002FCoMER)\n\n- (arXiv 2022.07) DPText-DETR: Towards Better **Scene Text Detection** with Dynamic Points in Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04491.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fymy-k\u002FDPText-DETR)\n\n- (arXiv 2022.07) DEPTHFORMER: MULTISCALE VISION TRANSFORMER FOR **MONOCULAR DEPTH ESTIMATION** WITH GLOBAL LOCAL INFORMATION FUSION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04535.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fashutosh1807\u002FDepthformer.git)\n\n- (arXiv 2022.07) LaT: Latent Translation with Cycle-Consistency for **Video-Text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04858.pdf)\n\n- (arXiv 2022.07) **Dual** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04976.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYehLi\u002FImageNetModel)\n\n- (arXiv 2022.07) Wave-ViT: Unifying **Wavelet** and Transformers for Visual **Representation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04978.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYehLi\u002FImageNetModel)\n\n- (arXiv 2022.07) Scaling Novel Object **Detection** with Weakly Supervised Detection Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05205.pdf)\n\n- (arXiv 2022.07) Hunting Group Clues with Transformers for Social **Group Activity Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05254.pdf)\n\n- (arXiv 2022.07) **Outpainting** by Queries, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05312.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKaiseem\u002FQueryOTR)\n\n- (arXiv 2022.07) IDEA: Increasing Text Diversity via Online Multi-Label Recognition for **Vision-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05333.pdf)\n\n- (arXiv 2022.07) Video Graph Transformer for **Video Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05342.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FVGT)\n\n- (arXiv 2022.07) Next-ViT: Next Generation Vision Transformer for **Efficient Deployment** in Realistic **Industrial** Scenarios, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05501.pdf)\n\n- (arXiv 2022.07) UniNet: Unified **Architecture Search** with Convolution, Transformer, and MLP, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05420.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniNet)\n\n- (arXiv 2022.07) Image and Model Transformation with **Secret Key** for Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05366.pdf)\n\n- (arXiv 2022.07) eX-ViT: A Novel eXplainable Vision Transformer for **Weakly Supervised Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05358.pdf)\n\n- (arXiv 2022.07) Compound Prototype Matching for **Few-shot Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05515.pdf)\n\n- (arXiv 2022.07) Long-term Leap Attention, Short-term Periodic Shift for **Video Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05526.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVideoNetworks\u002FLAPS-transformer)\n\n- (arXiv 2022.07) LightViT: Towards **Light**-Weight **Convolution-Free** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05557.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhunto\u002FLightViT)\n\n- (arXiv 2022.07) Learning from **Label Relationships** in Human Affect, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05577.pdf)\n\n- (arXiv 2022.07) MSP-Former: Multi-Scale Projection Transformer for Single Image **Desnowing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05621.pdf)\n\n- (arXiv 2022.07) Tell Me the Evidence? 
Dual **Visual-Linguistic** Interaction for **Answer Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05703.pdf)\n\n- (arXiv 2022.07) Vision Transformer for NeRF-Based **View Synthesis** from a Single Input Image, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05736.pdf), [[Code]](https:\u002F\u002Fcseweb.ucsd.edu\u002F~viscomp\u002Fprojects\u002FVisionNeRF\u002F)\n\n- (arXiv 2022.07) COSIM: Commonsense Reasoning for **Counterfactual Scene Imagination**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03961.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhyounghk\u002FCoSIm)\n\n- (arXiv 2022.07) Beyond Transfer Learning: Co-finetuning for **Action Localisation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03807.pdf)\n\n- (arXiv 2022.07) RePFormer: Refinement Pyramid Transformer for Robust **Facial Landmark Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03917.pdf)\n\n- (arXiv 2022.07) k-means **Mask** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04044.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fdeeplab2)\n\n- (arXiv 2022.07) **Training** Transformers Together, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03481.pdf), [[Code]](https:\u002F\u002Ftraining-transformers-together.github.io\u002F)\n\n- (arXiv 2022.07) Improving **Few-Shot Image Classification** Using Machine- and User-Generated Natural Language Descriptions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03133.pdf)\n\n- (arXiv 2022.07) MaiT: Leverage **Attention Masks** for More **Efficient** Image Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03006.pdf)\n\n- (arXiv 2022.07) Dual-Stream Transformer for Generic **Event Boundary Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03038.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGX77\u002FDual-Stream-Transformer-for-Generic-Event-Boundary-Captioning)\n\n- (arXiv 2022.07) **Softmax-free** Linear Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03341.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FSOFT)\n\n- (arXiv 2022.07) Bridging the Gap between Object and Image-level Representations for **Open-Vocabulary Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03482.pdf), [[Code]](https:\u002F\u002Fbit.ly\u002F3byZoQp)\n\n- (arXiv 2022.07) Transformers are Adaptable **Task Planners**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02442.pdf), [[Code]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002Ftemporal_task_planner-Paper148\u002F)\n\n- (arXiv 2022.07) **ARRAY CAMERA IMAGE FUSION** USING PHYSICS-AWARE TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02250.pdf)\n\n- (arXiv 2022.07) OSFormer: One-Stage Camouflaged Instance **Segmentation** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02255.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPJLallen\u002FOSFormer)\n\n- (arXiv 2022.07) Weakly Supervised Grounding for **VQA** in Vision-Language Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02334.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Faurooj\u002FWSG-VQA-VLTransformers)\n\n- (arXiv 2022.07) PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02583.pdf)\n\n- (arXiv 2022.07) STVGFormer: Spatio-Temporal **Video Grounding** 
with Static-Dynamic Cross-Modal Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02756.pdf)\n\n- (arXiv 2022.07) Towards Counterfactual **Image Manipulation** via **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02812.pdf)\n\n- (arXiv 2022.07) MatFormer: A **Generative** Model for Procedural **Materials**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01044.pdf)\n\n- (arXiv 2022.07) Multimodal Frame-Scoring Transformer for **Video Summarization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01814.pdf)\n\n- (arXiv 2022.07) **3D Part Assembly** Generation with Instance Encoded Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01779.pdf)\n\n- (arXiv 2022.07) Scene-Aware Prompt for Multi-modal **Dialogue Understanding and Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01823.pdf)\n\n- (arXiv 2022.07) **Efficient** Representation Learning via Adaptive Context Pooling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01844.pdf)\n\n- (arXiv 2022.07) **Gaze Target Estimation** inspired by Interactive Attention, [[Paper]](https:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=9828503), [[Code]](https:\u002F\u002Fgithub.com\u002Fnkuhzx\u002FVSG-IA)\n\n- (arXiv 2022.07) Generalizable Patch-Based **Neural Rendering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10662.pdf), [[Project]](https:\u002F\u002Fmohammedsuhail.net\u002Fgen_patch_neural_rendering\u002F)\n\n- (arXiv 2022.07) Interaction Transformer for Human **Reaction Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01685.pdf)\n\n- (arXiv 2022.07) TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of **3D Human Motions and Texts**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01696.pdf), [[Project]](https:\u002F\u002Fericguo5513.github.io\u002FTM2T\u002F)\n\n- (arXiv 2022.07) FishFormer: Annulus Slicing-based Transformer for **Fisheye Rectification** with Efficacy Domain Exploration, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01925.pdf)\n\n- (arXiv 2022.07) Open-Vocabulary Multi-Label Classification via Multi-modal **Knowledge Transfer**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01887.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fseanhe97\u002FMKT)\n\n- (arXiv 2022.07) Toward Explainable and Fine-Grained **3D Grounding** through Referring Textual Phrases, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01821.pdf), [[Code]](https:\u002F\u002Fyanx27.github.io\u002Fphraserefer\u002F)\n\n- (arXiv 2022.07) Improving **Semantic Segmentation** in Transformers using Hierarchical Inter-Level Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02126.pdf)\n\n- (arXiv 2022.07) MULTI-MODAL **ROBUSTNESS** ANALYSIS AGAINST **LANGUAGE AND VISUAL** PERTURBATIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02159.pdf), [[Project]](https:\u002F\u002Fmaddy12.github.io\u002FMultiModalVideoRobustness\u002F)\n\n- (arXiv 2022.07) CoBEVT: Cooperative **Bird’s Eye View Semantic Segmentation** with Sparse Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02202.pdf)\n\n- (arXiv 2022.07) **Segmenting Moving Objects** via an Object-Centric Layered Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02206.pdf)\n\n- (arXiv 2022.07) Counterfactually Measuring and Eliminating **Social Bias** in **Vision-Language** Pre-training Models, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01056.pdf)\n\n- (arXiv 2022.07) Contrastive Cross-Modal Knowledge Sharing Pre-training for **Vision-Language** Representation Learning and Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00733.pdf)\n\n- (arXiv 2022.07) Learning Cross-Image Object Semantic Relation in Transformer for **Few-Shot Fine-Grained** Image **Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00784.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJiakangYuan\u002FHelixFormer)\n\n- (arXiv 2022.07) Memory-Based Label-Text Tuning for **Few-Shot** Class-**Incremental** **Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01036.pdf)\n\n- (arXiv 2022.07) Exploiting Context Information for Generic Event Boundary **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01050.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzjr2000\u002FContext-GEBC)\n\n- (arXiv 2022.07) You Only Need One **Detector**: Unified Object Detector for **Different Modalities** based on Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01071.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fliketheflower\u002FYONOD.git)\n\n- (arXiv 2022.07) Divert More Attention to **Vision-Language Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01076.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJudasDie\u002FSOTS)\n\n- (arXiv 2022.07) Can **Language** Understand **Depth**? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01077.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAdonis-galaxy\u002FDepthCLIP)\n\n- (arXiv 2022.07) TANet: Transformer-based Asymmetric Network for **RGB-D Salient Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01172.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flc012463\u002FTANet)\n\n- (arXiv 2022.07) DUET: Cross-modal Semantic Grounding for **Contrastive Zero-shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01328.pdf)\n\n- (arXiv 2022.07) Transferring **Textual Knowledge** for Visual Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01297.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwhwu95\u002FText4Vis)\n\n- (arXiv 2022.07) R^2-VOS: Robust Referring **Video** Object **Segmentation** via Relational Cycle Consistency, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01203.pdf)\n\n- (arXiv 2022.07) CRFormer: A Cross-Region Transformer for **Shadow Removal**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01600.pdf)\n\n- (arXiv 2022.07) Dynamic **Spatial Sparsification** for **Efficient** Vision Transformers and Convolutional Neural Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01580.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fraoyongming\u002FDynamicViT)\n\n- (arXiv 2022.07) Back to MLP: A Simple Baseline for Human **Motion Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01567.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdulucas\u002FsiMLPe)\n\n- (arXiv 2022.07) I-ViT: Integer-only **Quantization** for **Efficient** Vision Transformer Inference, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01405.pdf)\n\n- (arXiv 2022.07) Rethinking **Query-Key** Pairwise Interactions in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00188.pdf)\n\n- (arXiv 2022.07) LARGE-SCALE **ROBUSTNESS** ANALYSIS OF **VIDEO ACTION RECOGNITION** MODELS, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01398.pdf), [[Code]](https:\u002F\u002Frose-ar.github.io\u002F)\n\n- (arXiv 2022.07) VL-CheckList: **Evaluating** Pre-trained **Vision-Language** Models with Objects, Attributes and Relations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00221.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FVL-CheckList)\n\n- (arXiv 2022.07) **Masked Autoencoders** for Self-Supervised Learning on Automotive **Point Clouds**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00531.pdf)\n\n- (arXiv 2022.07) MotionMixer: **MLP**-based **3D** Human Body **Pose Forecasting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00499.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMotionMLP\u002FMotionMixer)\n\n- (arXiv 2022.07) DALG: Deep Attentive Local and Global Modeling for **Image Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00287.pdf)\n\n- (arXiv 2022.07) PolarFormer: Multi-camera **3D Object Detection** with Polar Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15398.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FPolarFormer)\n\n- (arXiv 2022.07) CTrGAN: Cycle Transformers GAN for **Gait Transfer**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15248.pdf)\n\n- (arXiv 2022.07) LM-Nav: Robotic **Navigation** with Large Pre-Trained Models of **Language**, **Vision**, and **Action**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04429.pdf)\n\n- (arXiv 2022.07) Bootstrapped **Masked Autoencoders** for Vision BERT Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07116.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLightDXY\u002FBootMAE)\n\n- (arXiv 2022.07) ReAct: **Temporal Action Detection** with Relational Queries, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07097.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsssste\u002FReact)\n\n- (arXiv 2022.07) **Benchmarking** **Omni-Vision** Representation through the Lens of Visual **Realms**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07106.pdf), [[Project]](https:\u002F\u002Fzhangyuanhan-ai.github.io\u002FOmniBenchmark)\n\n- (arXiv 2022.07) **Convolutional Bypasses** Are Better Vision Transformer **Adapters**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07039.pdf)\n\n- (arXiv 2022.07) **LANGUAGE MODELLING** WITH PIXELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06991.pdf)\n\n- (arXiv 2022.07) Transformer-based Context Condensation for Boosting Feature Pyramids in Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06603.pdf)\n\n- (arXiv 2022.07) **DEEPFAKE VIDEO DETECTION** WITH SPATIOTEMPORAL DROPOUT TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06612.pdf)\n\n- (arXiv 2022.07) iColoriT: Towards Propagating Local Hint to the Right Region in **Interactive Colorization** by Leveraging Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06831.pdf)\n\n- (arXiv 2022.07) **Imaging** through the **Atmosphere** using **Turbulence** Mitigation Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06465.pdf)\n\n- (arXiv 2022.07) Symmetry-Aware Transformer-based **Mirror Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06332.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ftyhuang0428\u002FSATNet)\n\n- (arXiv 2022.07) Pyramid Transformer for **Traffic Sign Detection**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06067.pdf)\n\n- (arXiv 2022.07) Global-local Motion Transformer for Unsupervised **Skeleton**-based **Action** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06101.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBoeun-Kim\u002FGL-Transformer)\n\n- (arXiv 2022.07) DynaST: Dynamic Sparse Transformer for Exemplar-Guided **Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06124.pdf)\n\n- (arXiv 2022.07) Trans4Map: Revisiting Holistic Top-down Mapping from **Egocentric Images** to Allocentric Semantics with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06205.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FTrans4Map)\n\n- (arXiv 2022.07) Entry-Flipped Transformer for Inference and Prediction of **Participant Behavior**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06235.pdf)\n\n- (arXiv 2022.07) Wayformer: **Motion Forecasting** via Simple & Efficient Attention Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05844.pdf)\n\n- (arXiv 2022.07) Diverse **Dance Synthesis** via Keyframes with Transformer Controllers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05906.pdf)\n\n- (arXiv 2022.07) Learning to Estimate **External Forces** of Human **Motion** in Video, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05845.pdf)\n\n- (arXiv 2022.07) Vision Transformer for **Contrastive Clustering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12925.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJackKoLing\u002FVTCC)\n\n- (arXiv 2022.07) Pose2Room: Understanding **3D Scenes** from **Human Activities**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03030.pdf)\n\n- (arXiv 2022.07) Towards Hard-Positive Query Mining for DETR-based **Human-Object Interaction Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05293.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMuchHair\u002FHQM)\n\n- (arXiv 2022.07) Cross-Architecture **Knowledge Distillation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05273.pdf)\n\n- (arXiv 2022.07) Distance Matters in **Human-Object Interaction Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01869.pdf)\n\n### 2022.06\n\n- (arXiv 2022.06) TENET: Transformer Encoding Network for Effective Temporal Flow on **Motion Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00170.pdf)\n\n- (arXiv 2022.06) GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for **Few-Shot Gait Impairment Severity Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00106.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmarkendo\u002FGaitForeMer)\n\n- (arXiv 2022.06) GSCLIP : A Framework for Explaining **Distribution Shifts** in Natural **Language**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15007.pdf)\n\n- (arXiv 2022.06) Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15002.pdf)\n\n- (arXiv 2022.06) Causality for Inherently **Explainable** Transformers: CAT-XPLAIN, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14841.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmvrl\u002FCAT-XPLAIN)\n\n- (arXiv 2022.06) A Unified End-to-End Retriever-Reader Framework for Knowledge-based **VQA**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14989.pdf)\n\n- (arXiv 2022.06) **Continual Learning** with Transformers for **Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14085.pdf)\n\n- (arXiv 2022.06) ST-Adapter: Parameter-**Efficient** **Image-to-Video** Transfer Learning for **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13559.pdf)\n\n- (arXiv 2022.06) **Robustifying** Vision Transformer without Retraining from Scratch by **Test-Time** Class-Conditional Feature Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13951.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkojima-takeshi188\u002FCFA)\n\n- (arXiv 2022.06) Leveraging **Language** for Accelerated Learning of **Tool Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13074.pdf)\n\n- (arXiv 2022.06) RoME: Role-aware Mixture-of-Expert Transformer for **Text-to-Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12845.pdf)\n\n- (arXiv 2022.06) VLCAP: **VISION-LANGUAGE** WITH CONTRASTIVE LEARNING FOR COHERENT VIDEO PARAGRAPH **CAPTIONING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12972.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FUARK-AICV\u002FVLCAP)\n\n- (arXiv 2022.06) Video2**StyleGAN**: Encoding **Video** in Latent Space for **Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13078.pdf)\n\n- (arXiv 2022.06) Text-Driven **Stylization** of **Video** Objects, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12396.pdf), [[Project]](https:\u002F\u002Fsloeschcke.github.io\u002FText-Driven-Stylization-of-Video-Objects\u002F)\n\n- (arXiv 2022.06) **Open Vocabulary** Object **Detection** with Proposal Mining and Prediction Equalization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11134.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPealing\u002FMEDet)\n\n- (arXiv 2022.06) CMT-DeepLab: Clustering Mask Transformers for Panoptic **Segmentation**, [[Paper]](about:blank)\n\n- (arXiv 2022.06) Towards Adversarial **Attack** on **Vision-Language** Pre-training Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09391.pdf)\n\n- (arXiv 2022.06) CLiMB: A **Continual Learning** Benchmark for **Vision-and-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09059.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGLAMOR-USC\u002FCLiMB)\n\n- (arXiv 2022.06) **VISUALIZING** AND UNDERSTANDING **SELF-SUPERVISED** VISION LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09753.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffawazsammani\u002Fxai-ssl)\n\n- (arXiv 2022.06) VReBERT: A Simple and Flexible Transformer for **Visual Relationship Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09111.pdf)\n\n- (arXiv 2022.06) Bear the Query in Mind: **Visual Grounding** with Query-conditioned Convolution, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09114.pdf)\n\n- (arXiv 2022.06) **DALL-E** for **Detection**: Language-driven Context Image Synthesis for Object Detection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09592.pdf)\n\n- (arXiv 2022.06) REVECA – Rich Encoder-decoder framework for **Video Event CAptioner**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09178.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTooTouch\u002FREVECA)\n\n- (arXiv 2022.06) SAViR-T: Spatially Attentive **Visual Reasoning** with Transformers, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09265.pdf)\n\n- (arXiv 2022.06) EATFormer: Improving Vision Transformer Inspired by **Evolutionary** Algorithm, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09325.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhangzjn\u002FEATFormer)\n\n- (arXiv 2022.06) Dual**CoOp**: Fast Adaptation to **Multi-Label** Recognition with Limited Annotations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09541.pdf)\n\n- (arXiv 2022.06) Capturing and Inferring Dense Full-Body **Human-Scene Contact**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09553.pdf), [[Project]](https:\u002F\u002Frich.is.tue.mpg.de\u002F)\n\n- (arXiv 2022.06) M&M Mix: A Multimodal Multiview Transformer **Ensemble**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09852.pdf)\n\n- (arXiv 2022.06) DisCoVQA: Temporal Distortion-Content Transformers for **Video Quality Assessment**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09853.pdf)\n\n- (arXiv 2022.06) Voxel-**MAE**: Masked Autoencoders for Pre-training Large-scale **Point Clouds**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09900.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchaytonmin\u002FVoxel-MAE)\n\n- (arXiv 2022.06) **Global Context** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09959.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGCViT)\n\n- (arXiv 2022.06) **Counting** Varying Density **Crowds** Through Density Guided Adaptive Selection CNN and Transformer Estimation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10075.pdf)\n\n- (arXiv 2022.06) One-stage **Action Detection** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10080.pdf)\n\n- (arXiv 2022.06) Sem**MAE**: Semantic-Guided Masking for Learning Masked Autoencoders, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10207.pdf)\n\n- (arXiv 2022.06) TRANSFORMER-BASED MULTI-MODAL PROPOSAL AND RE-RANK FOR WIKIPEDIA **IMAGE-CAPTION** MATCHING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10436.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmesnico\u002FWiki-Image-Caption-Matching)\n\n- (arXiv 2022.06) **Vicinity** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10552.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FVicinity-Vision-Transformer)\n\n- (arXiv 2022.06) EdgeNeXt: **Efficiently** Amalgamated **CNN-Transformer** Architecture for Mobile Vision Applications, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10589.pdf), [[Code]](https:\u002F\u002Ft.ly\u002F_Vu9)\n\n- (arXiv 2022.06) Temporally Consistent Semantic **Video Editing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10590.pdf)\n\n- (arXiv 2022.06) VLMbench: A Compositional Benchmark for **Vision-and-Language Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08522.pdf)\n\n- (arXiv 2022.06) MINEDOJO: Building Open-Ended **Embodied Agents** with Internet-Scale Knowledge, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08853.pdf), [[Project]](https:\u002F\u002Fminedojo.org\u002F)\n\n- (arXiv 2022.06) IRISformer: Dense Vision Transformers for Single-Image **Inverse Rendering** in Indoor Scenes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08423.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FViLab-UCSD\u002FIRISformer)\n\n- (arXiv 2022.06) Backdoor **Attacks** on Vision Transformers, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08477.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FUCDvision\u002Fbackdoor_transformer.git)\n\n- (arXiv 2022.06) Rectify ViT **Shortcut** Learning by Visual **Saliency**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08567.pdf)\n\n- (arXiv 2022.06) Learning Using Privileged Information for **Zero-Shot Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08632.pdf)\n\n- (arXiv 2022.06) Bridge-Tower: Building Bridges Between Encoders in **Vision-Language** Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08657.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBridgeTower)\n\n- (arXiv 2022.06) CtrlFormer: Learning Transferable State Representation for **Visual Control** via Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08883.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fctrlformer-icml\u002F)\n\n- (arXiv 2022.06) SimA: Simple **Softmax-free** **Attention** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08898.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FUCDvision\u002Fsima)\n\n- (arXiv 2022.06) UNIFIED-IO: A **UNIFIED MODEL** FOR **VISION, LANGUAGE**, AND **MULTI-MODAL** TASKS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08916.pdf), [[Project]](https:\u002F\u002Funified-io.allenai.org\u002F)\n\n- (arXiv 2022.06) VLMixer: Unpaired **Vision-Language** Pre-training via Cross-Modal CutMix, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08919.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fttengwang\u002FVLMixer)\n\n- (arXiv 2022.06) ReLER@ZJU-Alibaba Submission to the **Ego4D** Natural Language **Queries** Challenge 2022, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00383.pdf)\n\n- (arXiv 2022.06) Video + **CLIP** Baseline for **Ego4D** Long-term **Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00579.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsrijandas07\u002Fclip_baseline_LTA_Ego4d)\n\n- (arXiv 2022.06) What makes **domain generalization** hard?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07802.pdf)\n\n- (arXiv 2022.06) SAVi++: Towards End-to-End **Object-Centric Learning** from Real-World **Videos**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07764.pdf), [[Code]](https:\u002F\u002Fslot-attention-video.github.io\u002Fsavi++\u002F)\n\n- (arXiv 2022.06) Disentangling **visual** and **written** **concepts** in **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07835.pdf), [[Project]](https:\u002F\u002Fjoaanna.github.io\u002Fdisentangling_spelling_in_clip\u002F)\n\n- (arXiv 2022.06) Multi-scale Cooperative Multimodal Transformers for Multimodal **Sentiment Analysis** in Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07981.pdf)\n\n- (arXiv 2022.06) **Patch**-level **Representation** Learning for Self-supervised Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07990.pdf)\n\n- (arXiv 2022.06) **Zero-Shot Video Question Answering** via Frozen Bidirectional Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08155.pdf), [[Code]](https:\u002F\u002Fantoyang.github.io\u002Ffrozenbilm.html)\n\n- (arXiv 2022.06) Omni**MAE**: Single Model Masked Pretraining on Images and Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08356.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fomnivore)\n\n- (arXiv 2022.06) **Adapting** Self-Supervised Vision Transformers by Probing Attention-Conditioned **Masking** Consistency, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08222.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvirajprabhu\u002FPACMAC)\n\n- (arXiv 2022.06) LAVENDER: Unifying **Video-Language** Understanding as Masked Language Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07160.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLAVENDER)\n\n- (arXiv 2022.06) Multimodal Event Graphs: Towards **Event Centric Understanding** of **Multimodal** World, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07207.pdf)\n\n- (arXiv 2022.06) Rethinking Generalization in **Few-Shot Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07267.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmrkshllr\u002FFewTURE)\n\n- (arXiv 2022.06) VCT: A **Video Compression** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07307.pdf)\n\n- (arXiv 2022.06) **Forecasting** of **depth** and **ego-motion** with transformers and self-supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07435.pdf)\n\n- (arXiv 2022.06) Coarse-to-Fine **Vision-Language** Pre-training with Fusion in the Backbone, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07643.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FFIBER)\n\n- (arXiv 2022.06) SP-ViT: Learning **2D Spatial Priors** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07662.pdf)\n\n- (arXiv 2022.06) A Simple **Data Mixing** Prior for Improving **Self-Supervised** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07692.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOliverRensu\u002FSDMP)\n\n- (arXiv 2022.06) Prefix Language Models are **Unified Modal Learners**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07699.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshizhediao\u002FDaVinci)\n\n- (arXiv 2022.06) **Masked Frequency Modeling** for Self-Supervised Visual Pre-Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07706.pdf), [[Code]](https:\u002F\u002Fwww.mmlab-ntu.com\u002Fproject\u002Fmfm\u002Findex.html)\n\n- (arXiv 2022.06) Generalizable **Neural Radiance Fields** for Novel View Synthesis with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05375.pdf)\n\n- (arXiv 2022.06) A Unified **Continuous Learning** Framework for Multi-modal Knowledge Discovery and Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05555.pdf)\n\n- (arXiv 2022.06) Learning to Estimate **Shapley Values** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05282.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsuinleelab\u002Fvit-shapley)\n\n- (arXiv 2022.06) Graph-based Spatial Transformer with Memory Replay for Multi-future **Pedestrian Trajectory Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05712.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJacobieee\u002FST-MR)\n\n- (arXiv 2022.06) **GLIPv2**: Unifying Localization and **VL Understanding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05836.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP)\n\n- (arXiv 2022.06) INDIGO: Intrinsic Multimodality for **Domain Generalization**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05912.pdf)\n\n- (arXiv 2022.06) TRANSDUCTIVE **CLIP** WITH **CLASS-CONDITIONAL** CONTRASTIVE LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06177.pdf)\n\n- (arXiv 2022.06) SILVER-BULLET-3D AT MANISKILL 2021: LEARNING-FROM-DEMONSTRATIONS AND HEURISTIC RULE-BASED METHODS FOR **OBJECT MANIPULATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06289.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcaiqi\u002FSilver-Bullet-3D\u002F)\n\n- (arXiv 2022.06) MLP-3D: A **MLP**-like **3D** Architecture with Grouped Time Mixing, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06292.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZhaofanQiu\u002FMLP-3D)\n\n- (arXiv 2022.06) Visual Transformer for Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06323.pdf)\n\n- (arXiv 2022.06) Bringing **Image Scene Structure to Video** via Frame-Clip Consistency of Object Tokens, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06346.pdf), [[Code]](https:\u002F\u002Feladb3.github.io\u002FSViT\u002F)\n\n- (arXiv 2022.06) TransVG++: End-to-End **Visual Grounding** with Language Conditioned Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06619.pdf)\n\n- (arXiv 2022.06) ReCo: Retrieve and Co-**segment** for **Zero-shot** Transfer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07045.pdf), [[Project]](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Freco)\n\n- (arXiv 2022.06) MAREO: MEMORY- AND ATTENTION- BASED **VISUAL REASONING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04928.pdf)\n\n- (arXiv 2022.06) Recurrent Transformer Variational Autoencoders for **Multi-Action Motion Synthesis**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06741.pdf)\n\n- (arXiv 2022.06) **Object Scene Representation** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06922.pdf)\n\n- (arXiv 2022.06) Comprehending and Ordering Semantics for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06930.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYehLi\u002Fxmodaler\u002Ftree\u002Fmaster\u002Fconfigs\u002Fimage_caption\u002Fcosnet)\n\n- (arXiv 2022.06) Exploring **Adversarial Attacks** and **Defenses** in Vision Transformers trained with **DINO**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06761.pdf)\n\n- (arXiv 2022.06) **Peripheral** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06801.pdf), [[Code]](http:\u002F\u002Fcvlab.postech.ac.kr\u002Fresearch\u002FPerViT\u002F)\n\n- (arXiv 2022.06) Efficient Decoder-free Object **Detection** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06829.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPealing\u002FDFFT)\n\n- (arXiv 2022.06) Prototypical **Contrastive Language Image Pretraining**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10996.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmegvii-research\u002Fprotoclip)\n\n- (arXiv 2022.06) SpA-Former: Transformer image **shadow detection and removal** via spatial attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10910.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhangbaijin\u002FSpA-Former-shadow-removal)\n\n- (arXiv 2022.06) A Unified and Biologically-Plausible **Relational Graph Representation** of Vision Transformers, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11073.pdf)\n\n- (arXiv 2022.06) Can Foundation Models Talk **Causality**? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10591.pdf)\n\n- (arXiv 2022.06) Learning **Viewpoint-Agnostic** Visual **Representations** by Recovering Tokens in 3D Space, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11895.pdf), [[Code]](https:\u002F\u002Fwww3.cs.stonybrook.edu\u002F~jishang\u002F3dtrl\u002F3dtrl.html)\n\n- (arXiv 2022.06) MaskViT: **Masked** Visual Pre-Training for **Video Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11894.pdf)\n\n- (arXiv 2022.06) PromptPose: Language **Prompt** Helps **Animal Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11752.pdf)\n\n- (arXiv 2022.06) **Video PreTraining** (VPT): Learning to Act by Watching **Unlabeled** **Online** Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11795.pdf)\n\n- (arXiv 2022.06) MERLOT Reserve: Neural Script Knowledge through **Vision and Language and Sound**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02639.pdf), [[Project]](https:\u002F\u002Frowanzellers.com\u002Fmerlotreserve)\n\n- (arXiv 2022.06) Building Spatio-temporal Transformers for **Egocentric 3D Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04785.pdf)\n\n- (arXiv 2022.06) **Position** Labels for **Self-Supervised** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04981.pdf)\n\n- (arXiv 2022.06) Exploring Feature Self-relation for **Self-supervised** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05184.pdf)\n\n- (arXiv 2022.06) Patch-based Object-centric Transformers for Efficient **Video Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04003.pdf), [[Code]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fpovt-public)\n\n- (arXiv 2022.06) Sparse Fusion **Mixture-of-Experts** are Domain Generalizable Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04046.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLuodian\u002FSF-MoE-DG)\n\n- (arXiv 2022.06) VN-Transformer: **Rotation-Equivariant** Attention for Vector Neurons, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04176.pdf)\n\n- (arXiv 2022.06) **CLIP**-Actor: **Text**-Driven Recommendation and Stylization for **Animating Human Meshes**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04382.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYouwang-Kim\u002FCLIP-Actor)\n\n- (arXiv 2022.06) **OOD** Augmentation May Be at Odds with **Open-Set Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04242.pdf)\n\n- (arXiv 2022.06) Draft-and-Revise: Effective **Image Generation** with Contextual RQ-Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04452.pdf)\n\n- (arXiv 2022.06) cycle text2face: cycle **text-to-face gan** via transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04503.pdf)\n\n- (arXiv 2022.06) Efficient and Robust **2D-to-BEV** Representation Learning via Geometry-guided Kernel Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04584.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FGKT)\n\n- (arXiv 2022.06) Transformer based Urdu Handwritten **Text** Optical Character Reader, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04575.pdf)\n\n- (arXiv 2022.06) **Spatial Entropy Regularization** for Vision Transformers, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04636.pdf)\n\n- (arXiv 2022.06) On Data Scaling in **Masked Image Modeling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04664.pdf)\n\n- (arXiv 2022.06) Extreme **Masking** for Learning Instance and Distributed Visual **Representations**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04667.pdf)\n\n- (arXiv 2022.06) GateHUB: Gated History Unit with Background Suppression for **Online Action Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04668.pdf)\n\n- (arXiv 2022.06) **Anomaly detection** in surveillance videos using transformer based attention model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01524.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkapildeshpande\u002FAnomaly-Detection-in-Surveillance-Videos)\n\n- (arXiv 2022.06) Contra**CLIP**: Interpretable **GAN** generation driven by pairs of contrasting sentences, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02104.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchi0tzp\u002FContraCLIP)\n\n- (arXiv 2022.06) EAANet: **Efficient** Attention Augmented Convolutional Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01821.pdf)\n\n- (arXiv 2022.06) Visual Clues: Bridging **Vision and Language** Foundations for Image Paragraph **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01843.pdf)\n\n- (arXiv 2022.06) Recurrent **Video Restoration** Transformer with Guided Deformable Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02146.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJingyunLiang\u002FRVRT)\n\n- (arXiv 2022.06) Rethinking the **Openness** of **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01986.pdf)\n\n- (arXiv 2022.06) OrdinalCLIP: Learning Rank Prompts for **Language-Guided Ordinal Regression**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02338.pdf)\n\n- (arXiv 2022.06) Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel **Video-Language Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02082.pdf)\n\n- (arXiv 2022.06) CONTRASTIVE GRAPH MULTIMODAL MODEL FOR **TEXT CLASSIFICATION** IN VIDEOS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02343.pdf)\n\n- (arXiv 2022.06) Separable **Self-attention** for **Mobile** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02680.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-cvnets)\n\n- (arXiv 2022.06) Mask **DINO**: Towards A Unified Transformer-based Framework for Object **Detection** and **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02777.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIDEACVR\u002FMaskDINO)\n\n- (arXiv 2022.06) Multimodal Contrastive Learning with LIMoE: the **Language-Image** **Mixture of Experts**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02770.pdf)\n\n- (arXiv 2022.06) cViL: Cross-Lingual Training of **Vision-Language** Models using Knowledge Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03354.pdf)\n\n- (arXiv 2022.06) **Masked** **Unsupervised** Self-training for Zero-shot Image Classification, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02967.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FMUST)\n\n- (arXiv 2022.06) DETR++: Taming Your Multi-Scale **Detection** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02977.pdf)\n\n- (arXiv 
2022.06) Structured Context Transformer for Generic **Event Boundary Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02985.pdf)\n\n- (arXiv 2022.06) Revealing Single Frame Bias for **Video-and-Language** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03428.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjayleicn\u002Fsingularity)\n\n- (arXiv 2022.06) Cerberus Transformer: Joint **Semantic**, **Affordance** and **Attribute** Parsing, [[Paper]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FChen_Cerberus_Transformer_Joint_Semantic_Affordance_and_Attribute_Parsing_CVPR_2022_paper.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOPEN-AIR-SUN\u002FCerberus)\n\n- (arXiv 2022.06) Can **CNNs** Be More **Robust** Than Transformers? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03452.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FRobustCNN)\n\n- (arXiv 2022.06) Detection Hub: Unifying Object **Detection** Datasets via Query Adaptation on Language Embedding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03484.pdf)\n\n- (CVPR 2022) Keypoint Transformer: Solving Joint Identification in Challenging **Hands and Object Interactions** for Accurate **3D Pose** Estimation, [[Paper]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FHampali_Keypoint_Transformer_Solving_Joint_Identification_in_Challenging_Hands_and_Object_CVPR_2022_paper.pdf)\n\n- (arXiv 2022.06) A-OKVQA: A Benchmark for **Visual Question Answering** using World Knowledge, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01718.pdf), [[Project]](http:\u002F\u002Fa-okvqa.allenai.org\u002F)\n\n- (arXiv 2022.06) Revisiting the “Video” in **Video-Language** Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01720.pdf), [[Project]](https:\u002F\u002Fstanfordvl.github.io\u002Fatp-revisit-video-lang\u002F)\n\n- (arXiv 2022.06) Efficient **Self-supervised** Vision Pretraining with Local **Masked** Reconstruction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00790.pdf)\n\n- (arXiv 2022.06) Modeling Image Composition for Complex **Scene Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00923.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJohnDreamer\u002FTwFA)\n\n- (arXiv 2022.06) Unified Recurrence Modeling for **Video Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01009.pdf)\n\n- (arXiv 2022.06) **Prefix Conditioning** Unifies Language and Label Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01125.pdf)\n\n- (arXiv 2022.06) Optimizing Relevance Maps of Vision Transformers Improves **Robustness**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01161.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhila-chefer\u002FRobustViT)\n\n- (arXiv 2022.06) VL-BEIT: Generative **Vision-Language** Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01127.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n- (arXiv 2022.06) **Efficient**Former: Vision Transformers at MobileNet Speed, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01191.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FEfficientFormer)\n\n- (arXiv 2022.06) REVIVE: Regional Visual Representation Matters in Knowledge-Based **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01201.pdf)\n\n- (arXiv 2022.06) Siamese Image 
Modeling for **Self-Supervised** Vision Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01204.pdf)\n\n- (CVPR 2022) Distillation Using Oracle Queries for Transformer-based **Human-Object Interaction Detection**, [[Paper]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FQu_Distillation_Using_Oracle_Queries_for_Transformer-Based_Human-Object_Interaction_Detection_CVPR_2022_paper.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSherlockHolmes221\u002FDOQ)\n\n- (CVPR 2022) Exploring Structure-aware Transformer over Interaction Proposals for **Human-Object Interaction Detection**, [[Paper]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FZhang_Exploring_Structure-Aware_Transformer_Over_Interaction_Proposals_for_Human-Object_Interaction_Detection_CVPR_2022_paper.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzyong812\u002FSTIP)\n\n- (CVPR 2022) Human **Trajectory Prediction** with Momentary Observation, [[Paper]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FSun_Human_Trajectory_Prediction_With_Momentary_Observation_CVPR_2022_paper.pdf)\n\n- (arXiv 2022.06) Where are my Neighbors? Exploiting Patches Relations in **Self-Supervised** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00481.pdf)\n\n- (arXiv 2022.06) Unifying Voxel-based Representation with Transformer for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00630.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FUVTR)\n\n- (arXiv 2022.06) Extreme **Floorplan Reconstruction** by Structure-Hallucinating Transformer Cascades, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00645.pdf)\n\n- (arXiv 2022.06) Cross-View Language Modeling: Towards Unified Cross-Lingual **Cross-Modal Pre-training**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00621.pdf)\n\n- (arXiv 2022.06) VALHALLA: **Visual Hallucination** for Machine Translation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00100.pdf), [[Code]](http:\u002F\u002Fwww.svcl.ucsd.edu\u002Fprojects\u002Fvalhalla)\n\n- (arXiv 2022.06) Learning Sequential Contexts using Transformer for **3D Hand Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00171.pdf)\n\n- (arXiv 2022.06) CLIP4IDC: **CLIP** for Image Difference **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00629.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsushizixin\u002FCLIP4IDC)\n\n- (arXiv 2022.06) Cross-domain **Detection** Transformer based on Spatial-aware and Semantic-aware Token Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00222.pdf)\n\n- (arXiv 2022.06) Vision **GNN**: An Image is Worth Graph of Nodes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00272.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FCV-Backbones)\n\n- (arXiv 2022.06) Weakly-supervised Action Transition Learning for Stochastic Human **Motion Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15608.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwei-mao-2019\u002FWAT)\n\n- (arXiv 2022.06) TubeFormer-**DeepLab**: Video Mask Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15361.pdf)\n\n- (arXiv 2022.06) **Video**-based **Human-Object Interaction** Detection from Tubelet Tokens, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01908.pdf)\n\n### 2022.05\n\n- (arXiv 
2022.05) HeatER: An Efficient and Unified Network for **Human Reconstruction** via Heatmap-based TransformER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15448.pdf)\n\n- (arXiv 2022.05) Robotic **grasp detection** based on Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2205\u002F2205.15112.pdf)\n\n- (arXiv 2022.05) Multimodal **Masked Autoencoders** Learn Transferable Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14204.pdf)\n\n- (arXiv 2022.05) Multimodal **Fake News Detection** via **CLIP**-Guided Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14304.pdf)\n\n- (arXiv 2022.05) WT-MVSNet: Window-based Transformers for **Multi-view Stereo**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14319.pdf)\n\n- (arXiv 2022.05) Object-wise **Masked Autoencoders** for **Fast** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14338.pdf)\n\n- (arXiv 2022.05) A Closer Look at **Self-supervised** **Lightweight** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14443.pdf)\n\n- (arXiv 2022.05) Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14458.pdf)\n\n- (arXiv 2022.05) CY**CLIP**: Cyclic Contrastive **Language-Image** Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14459.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoel-shashank\u002FCyCLIP)\n\n- (arXiv 2022.05) MDMLP: Image Classification from Scratch on Small Datasets with **MLP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14477.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAmoza-Theodore\u002FMDMLP)\n\n- (arXiv 2022.05) SupMAE: **Supervised** **Masked Autoencoders** Are Efficient Vision Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14540.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcmu-enyac\u002Fsupmae)\n\n- (arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for **Multi-view 3D Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14575.pdf)\n\n- (arXiv 2022.05) Prompt-aligned Gradient for **Prompt** Tuning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14865.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBeierZhu\u002FPrompt-align)\n\n- (arXiv 2022.05) **Illumination** Adaptive Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14871.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcuiziteng\u002FIllumination-Adaptive-Transformer)\n\n- (arXiv 2022.05) HiViT: **Hierarchical** Vision Transformer Meets **Masked Image Modeling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14949.pdf)\n\n- (arXiv 2022.05) GMML is All you Need, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14986.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSara-Ahmed\u002FGMML)\n\n- (arXiv 2022.05) COMPLETEDT: **POINT CLOUD COMPLETION** WITH DENSE AUGMENT INFERENCE TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14999.pdf)\n\n- (arXiv 2022.05) Self-Supervised Pre-training of Vision Transformers for **Dense Prediction** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15173.pdf)\n\n- (arXiv 2022.05) VLUE: A Multi-Task **Benchmark** for Evaluating **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15237.pdf), [[Benchmark]](https:\u002F\u002Fvlue-benchmark.github.io\u002F), 
[[Code]](https:\u002F\u002Fgithub.com\u002FMichaelZhouwang\u002FVLUE)\n\n- (arXiv 2022.05) Architecture-Agnostic **Masked Image Modeling** – From ViT back to CNN, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13943.pdf)\n\n- (arXiv 2022.05) **Contrastive** Learning Rivals **Masked Image Modeling** in Fine-tuning via Feature Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14141.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSwinTransformer\u002FFeature-Distillation)\n\n- (arXiv 2022.05) GIT: A Generative **Image-to-text** Transformer for Vision and Language, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14100.pdf)\n\n- (arXiv 2022.05) 3DILG: Irregular Latent Grids for **3D Generative Modeling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13914.pdf)\n\n- (arXiv 2022.05) Simple **Unsupervised** **Object-Centric Learning** for Complex and Naturalistic Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14065.pdf), [[Code]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fslot-transformer-for-videos)\n\n- (arXiv 2022.05) Future Transformer for Long-term **Action Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14022.pdf), [[Project]](http:\u002F\u002Fcvlab.postech.ac.kr\u002Fresearch\u002FFUTR)\n\n- (arXiv 2022.05) X-ViT: High Performance **Linear** Vision Transformer without Softmax, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13805.pdf)\n\n- (arXiv 2022.05) **Knowledge Distillation** via the Target-aware Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10793.pdf)\n\n- (arXiv 2022.05) Dynamic **Query** Selection for Fast Visual Perceiver, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10873.pdf)\n\n- (arXiv 2022.05) MonoFormer: Towards Generalization of self-supervised monocular **depth** estimation with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11083.pdf)\n\n- (arXiv 2022.05) PEVL: Position-enhanced Pre-training and Prompt Tuning for **Vision-language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11169.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FPEVL)\n\n- (arXiv 2022.05) Supporting **Vision-Language** Model Inference with Causality-pruning Knowledge **Prompt**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11100.pdf)\n\n- (arXiv 2022.05) Super Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11397.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flmbxmu\u002FSuperViT)\n\n- (arXiv 2022.05) mPLUG: Effective and Efficient **Vision-Language** Learning by Cross-modal Skip-connections, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12005.pdf)\n\n- (arXiv 2022.05) **VQA**-GNN: **Reasoning** with Multimodal Semantic Graph for Visual Question Answering, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11501.pdf)\n\n- (arXiv 2022.05) UMSNet: An Universal Multi-sensor Network for Human **Activity Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11756.pdf)\n\n- (arXiv 2022.05) **Privacy**-Preserving Image **Classification** Using Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12041.pdf)\n\n- (arXiv 2022.05) HiVLP: Hierarchical **Vision-Language** Pre-Training for Fast Image-Text Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12105.pdf)\n\n- (arXiv 2022.05) ASSET: Autoregressive **Semantic Scene Editing** with Transformers at High Resolutions, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12231.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDifanLiu\u002FASSET)\n\n- (arXiv 2022.05) HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent **Trajectory Prediction** via Scene Encoding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09753.pdf)\n\n- (arXiv 2022.05) **Mask**-guided Vision Transformer (MG-ViT) for **Few-Shot** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09995.pdf)\n\n- (arXiv 2022.05) Degradation-Aware Unfolding Half-Shuffle Transformer for **Spectral Compressive Imaging**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10102.pdf)\n\n- (arXiv 2022.05) Uniform Masking: Enabling **MAE** Pre-training for **Pyramid**-based Vision Transformers with Locality, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10063.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fimplus\u002FUM-MAE)\n\n- (arXiv 2022.05) Visual **Concepts** Tokenization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10093.pdf)\n\n- (arXiv 2022.05) MSTRIQ: No Reference **Image Quality Assessment** Based on Swin Transformer with Multi-Stage Fusion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10101.pdf)\n\n- (arXiv 2022.05) CogVideo: Large-scale Pretraining for **Text-to-Video** Generation via Transformers, [[Paper]](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo\u002Fblob\u002Fmain\u002Fpaper\u002FCogVideo-arxiv.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo)\n\n- (arXiv 2022.05) Evidence for **Hypodescent** in Visual Semantic AI, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10764.pdf)\n\n- (arXiv 2022.05) Boosting Camouflaged Object **Detection** with Dual-Task Interactive Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10579.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fliuzywen\u002FCOD)\n\n- (arXiv 2022.05) muNet: Evolving Pretrained Deep Neural Networks into Scalable **Auto-tuning Multitask Systems**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10937.pdf)\n\n- (arXiv 2022.05) Large Language Models are **Zero-Shot Reasoners**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11916.pdf)\n\n- (arXiv 2022.05) AdaptFormer: **Adapting** Vision Transformers for **Scalable** Visual Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13535.pdf), [[Code]](http:\u002F\u002Fwww.shoufachen.com\u002Fadaptformer-page)\n\n- (arXiv 2022.05) **Green** Hierarchical Vision Transformer for **Masked** Image Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13515.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLayneH\u002FGreenMIM)\n\n- (arXiv 2022.05) Efficient U-Transformer with Boundary-Aware Loss for **Action Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13425.pdf)\n\n- (arXiv 2022.05) Cross-Architecture **Self-supervised Video** Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13313.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fguoshengcv\u002FCACL)\n\n- (arXiv 2022.05) **Prompt**-based Learning for Unpaired Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13125.pdf)\n\n- (arXiv 2022.05) MixMIM: **Mixed** and **Masked** Image Modeling for **Efficient** Visual Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13137.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FMixMIM)\n\n- (arXiv 2022.05) **Fast** Vision Transformers with 
HiLo **Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13213.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzip-group\u002FLITv2)\n\n- (arXiv 2022.05) Fine-grained Image **Captioning** with **CLIP** Reward, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13115.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fj-min\u002FCLIP-Caption-Reward)\n\n- (arXiv 2022.05) Mutual Information Divergence: A Unified Metric for Multimodal **Generative Models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13445.pdf)\n\n- (arXiv 2022.05) MoCoViT: **Mobile** **Convolutional** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12635.pdf)\n\n- (arXiv 2022.05) AO2-DETR: Arbitrary-Oriented Object **Detection** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12785.pdf)\n\n- (arXiv 2022.05) **Inception** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12956.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FiFormer)\n\n- (arXiv 2022.05) VTP: Volumetric Transformer for Multi-view Multi-person **3D Pose** Estimation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12602.pdf)\n\n- (arXiv 2022.05) UViM: A **Unified Modeling** Approach for Vision with Learned Guiding Codes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10337.pdf)\n\n- (arXiv 2022.05) Language Models with Image Descriptors are Strong Few-Shot **Video-Language** Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10747.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMikeWangWZHL\u002FVidIL)\n\n- (arXiv 2022.05) **Training** Vision-Language Transformers from **Captions** Alone, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09256.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fguilk\u002FVLC)\n\n- (arXiv 2022.05) **Voxel**-informed **Language** Grounding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09710.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Frcorona\u002Fvoxel_informed_language_grounding)\n\n- (arXiv 2022.05) Cross-Enhancement Transformer for **Action Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09445.pdf)\n\n- (arXiv 2022.05) TRT-ViT: **TensorRT**-oriented Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09579.pdf)\n\n- (arXiv 2022.05) Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09613.pdf)\n\n- (arXiv 2022.05) A graph-transformer for whole slide image **classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09671.pdf)\n\n- (arXiv 2022.05) VNT-Net: **Rotational Invariant** Vector Neuron Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09690.pdf)\n\n- (arXiv 2022.05) **Masked** Image Modeling with Denoising **Contrast**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09616.pdf)\n\n- (arXiv 2022.05) Cross-subject **Action Unit Detection** with Meta Learning and Transformer-based Relation Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08787.pdf)\n\n- (arXiv 2022.05) **Masked Autoencoders** As Spatiotemporal Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09113.pdf)\n\n- (arXiv 2022.05) BodyMap: Learning Full-**Body** Dense **Correspondence** Map, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09111.pdf), [[Code]](https:\u002F\u002Fnsarafianos.github.io\u002Fbodymap)\n\n- (arXiv 2022.05) Unraveling 
**Attention** via Convex Duality: Analysis and Interpretations of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08078.pdf)\n\n- (arXiv 2022.05) Avatar**CLIP**: Zero-Shot Text-Driven Generation and Animation of 3D **Avatars**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08535.pdf)\n\n- (arXiv 2022.05) Vision Transformer Adapter for **Dense Predictions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08534.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fczczup\u002FViT-Adapter)\n\n- (arXiv 2022.05) Demo: Real-Time **Semantic Communications** with a Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03886.pdf)\n\n- (arXiv 2022.05) MulT: An End-to-End **Multitask** Learning Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08303.pdf), [[Code]](https:\u002F\u002Fivrl.github.io\u002FMulT\u002F)\n\n- (arXiv 2022.05) A **CLIP**-Hitchhiker’s Guide to Long **Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08508.pdf)\n\n- (arXiv 2022.05) Video **Frame Interpolation** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07230.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FVFIformer)\n\n- (arXiv 2022.05) Dense residual Transformer for Image **Denoising**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06944.pdf)\n\n- (arXiv 2022.05) Learning Lip-Based **Audio-Visual** Speaker Embeddings with AV-HuBERT, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07180.pdf)\n\n- (arXiv 2022.05) **Robot Cooking** with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05960.pdf)\n\n- (arXiv 2022.05) Entity-aware and Motion-aware Transformers for Language-driven **Action Localization** in Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05854.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshuoyang129\u002FEAMAT)\n\n- (arXiv 2022.05) Learning to **Retrieve Videos** by Asking Questions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05739.pdf)\n\n- (arXiv 2022.05) One Model, **Multiple Modalities**: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06126.pdf)\n\n- (arXiv 2022.05) Simple Open-Vocabulary Object **Detection** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06230.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic\u002Ftree\u002Fmain\u002Fscenic\u002Fprojects\u002Fowl_vit)\n\n- (arXiv 2022.05) AggPose: Deep Aggregation Vision Transformer for Infant **Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05277.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSZAR-LAB\u002FAggPose)\n\n- (arXiv 2022.05) An Empirical Study of Self-supervised Learning Approaches for Object **Detection** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05543.pdf), [[Code-DETR]](https:\u002F\u002Fgithub.com\u002Fgokulkarthik\u002Fdetr), [[Code-Deform-DETR]](https:\u002F\u002Fgithub.com\u002Fgokulkarthik\u002FDeformable-DETR)\n\n- (arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image **Inpainting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05076.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fliuqk3\u002FPUT)\n\n- (arXiv 2022.05) Transformer-based Cross-Modal **Recipe** Embeddings with Large Batch Training, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04948.pdf)\n\n- (arXiv 2022.05) Spatio-Temporal Transformer for Dynamic **Facial Expression Recognition** in the Wild, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04749.pdf)\n\n- (arXiv 2022.05) Generalizable **Task Planning** through Representation Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07993.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fgentp)\n\n- (arXiv 2022.05) EdgeViTs: Competing **Light-weight** CNNs on **Mobile Devices** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03436.pdf)\n\n- (arXiv 2022.05) Activating More Pixels in Image **Super-Resolution** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04437.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchxy95\u002FHAT)\n\n- (arXiv 2022.05) Row-wise **Accelerator** for Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03998.pdf)\n\n- (arXiv 2022.05) SparseTT: Visual **Tracking** with Sparse Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03776.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffzh0917\u002FSparseTT)\n\n- (arXiv 2022.05) RoViST: Learning Robust Metrics for **Visual Storytelling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03774.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fusydnlp\u002Frovist)\n\n- (arXiv 2022.05) Beyond Bounding Box: Multimodal Knowledge Learning for Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04072.pdf)\n\n- (arXiv 2022.05) Multilevel Hierarchical Network with Multiscale Sampling for **Video Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04061.pdf)\n\n- (arXiv 2022.05) Incremental-DETR: Incremental Few-Shot Object **Detection** via Self-Supervised Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04042.pdf)\n\n- (arXiv 2022.05) Conv**MAE**: Masked **Convolution** Meets Masked Autoencoders, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03892.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAlpha-VL\u002FConvMAE)\n\n- (arXiv 2022.05) Cross-lingual Adaptation for **Recipe Retrieval** with Mixup, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03891.pdf)\n\n- (arXiv 2022.05) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A **Vision-Language** Framework, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03860.pdf)\n\n- (arXiv 2022.05) Transformer **Tracking** with Cyclic Shifting Window Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03806.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSkyeSong38\u002FCSWinTT)\n\n- (arXiv 2022.05) Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04363.pdf)\n\n- (arXiv 2022.05) Prompt Distribution Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03340.pdf)\n\n- (arXiv 2022.05) CLIP-CLOP: **CLIP**-Guided **Collage** and **Photomontage**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03146.pdf)\n\n- (arXiv 2022.05) Dual-Level Decoupled Transformer for **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03039.pdf)\n\n- (arXiv 2022.05) Declaration-based Prompt Tuning for **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02456.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002FCCIIPLab\u002FDPT)\n\n- (arXiv 2022.05) P^3IV: Probabilistic **Procedure Planning** from **Instructional Videos** with Weak Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02300.pdf)\n\n- (arXiv 2022.05) Language Models Can See: Plugging **Visual** Controls in **Text Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02655.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyxuansu\u002FMAGIC)\n\n- (arXiv 2022.05) YOLOPose: Transformer-based Multi-Object **6D Pose Estimation** using Keypoint Regression, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02536.pdf)\n\n- (arXiv 2022.05) Cross-view Transformers for real-time Map-view **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02833.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbradyz\u002Fcross_view_transformers)\n\n- (arXiv 2022.05) i-Code: An Integrative and Composable **Multimodal** Learning Framework, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01818.pdf)\n\n- (arXiv 2022.05) **Visual Commonsense** in Pretrained Unimodal and Multimodal Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01850.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002FChenyuHeidiZhang\u002FVL-commonsense)\n\n- (arXiv 2022.05) Dual Cross-Attention Learning for Fine-Grained Visual **Categorization** and Object **Re-Identification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02151.pdf)\n\n- (arXiv 2022.05) RecipeSnap - a lightweight **image to recipe** model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02141.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjianfa\u002FRecipeSnap-a-lightweight-image-to-recipe-model.git)\n\n- (arXiv 2022.05) CoCa: Contrastive Captioners are **Image-Text** Foundation Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01917.pdf)\n\n- (arXiv 2022.05) Data Determines Distributional **Robustness** in Contrastive Language Image Pre-training (**CLIP**), [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01397.pdf)\n\n- (arXiv 2022.05) Cross-modal Representation Learning for **Zero-shot Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2205\u002F2205.01657.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FResT)\n\n- (arXiv 2022.05) Cross-Domain Object **Detection** with Mean-Teacher Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01643.pdf)\n\n- (arXiv 2022.05) Better plain ViT baselines for **ImageNet-1k**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01580.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbig_vision)\n\n- (arXiv 2022.05) Reinforced Swin-Convs Transformer for **Underwater Image Enhancement**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00434.pdf)\n\n- (arXiv 2022.05) UTC: A Unified Transformer with Inter-Task Contrastive Learning for **Visual Dialog**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00423.pdf)\n\n- (arXiv 2022.05) Answer-Me: Multi-Task Open-Vocabulary **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00949.pdf)\n\n- (arXiv 2022.05) Center**CLIP**: Token Clustering for Efficient **Text-Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00823.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmzhaoshuai\u002FCenterCLIP)\n\n- (arXiv 2022.05) Arbitrary Shape **Text Detection** via Boundary Transformer, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05320.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGXYM\u002FTextBPN-Puls-Plus)\n\n- (arXiv 2022.05) HULC: **3D Human Motion Capture** with Pose Manifold Sampling and Dense Contact Guidance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05677.pdf), [[Project]](https:\u002F\u002Fvcai.mpi-inf.mpg.de\u002Fprojects\u002FHULC)\n\n### 2022.04\n\n- (arXiv 2022.04) Learn to Understand Negation in **Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00132.pdf)\n\n- (arXiv 2022.04) LayoutBERT: Masked Language **Layout** Model for Object Insertion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00347.pdf)\n\n- (arXiv 2022.04) Improving **Visual Grounding** with Visual-Linguistic Verification and Iterative Reasoning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00272.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyangli18\u002FVLTVG)\n\n- (arXiv 2022.04) Coarse-to-Fine **Video Denoising** with Dual-Stage Spatial-Channel Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00214.pdf)\n\n- (arXiv 2022.04) SideRT: A Real-time Pure Transformer Architecture for Single Image **Depth Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13892.pdf)\n\n- (arXiv 2022.04) Where in the World is this Image? Transformer-based **Geo-localization** in the Wild, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13861.pdf)\n\n- (arXiv 2022.04) **Depth Estimation** with Simplified Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13791.pdf)\n\n- (arXiv 2022.04) A very preliminary **analysis** of **DALL-E 2**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2204\u002F2204.13807.pdf)\n\n- (arXiv 2022.04) CogView2: Faster and Better **Text-to-Image Generation** via Hierarchical Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14217.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogView2)\n\n- (arXiv 2022.04) **CLIP**-Art: Contrastive Pre-training for **Fine-Grained Art Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14244.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKeremTurgutlu\u002Fclip_art)\n\n- (arXiv 2022.04) TEMOS: **Generating** diverse human **motions** from textual descriptions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14109.pdf), [[Project]](https:\u002F\u002Fimagine.enpc.fr\u002F~petrovim\u002Ftemos)\n\n- (arXiv 2022.04) PyramidCLIP: Hierarchical Feature Alignment for **Vision-language** Model Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14095.pdf)\n\n- (arXiv 2022.04) Symmetric Transformer-based Network for **Unsupervised Image Registration**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13575.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMingR-Ma\u002FSymTrans)\n\n- (arXiv 2022.04) Tragedy Plus Time: Capturing **Unintended Human Activities** from Weakly-labeled Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13548.pdf), [[Code]](https:\u002F\u002Fasu-apg.github.io\u002FTragedyPlusTime)\n\n- (arXiv 2022.04) CapOnImage: Context-driven Dense-**Captioning** on Image, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12974.pdf)\n\n- (arXiv 2022.04) Self-Supervised Learning of Object Parts for Semantic **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13101.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMkuuWaUjinga\u002Fleopart)\n\n- 
(arXiv 2022.04) DearKD: Data-**Efficient** Early **Knowledge Distillation** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12997.pdf)\n\n- (arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12817.pdf)\n\n- (arXiv 2022.04) Self-Driving Car **Steering Angle Prediction**: Let Transformer Be a Car Again, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12748.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchingisooinar\u002FAI)\n\n- (arXiv 2022.04) ClothFormer: Taming Video **Virtual Try-on** in All Module, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12151.pdf)\n\n- (arXiv 2022.04) Deeper Insights into ViTs **Robustness** towards Common Corruptions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12143.pdf)\n\n- (arXiv 2022.04) VITPOSE: SIMPLE VISION TRANSFORMER BASELINES FOR HUMAN **POSE ESTIMATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12484.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FViTPose)\n\n- (arXiv 2022.04) Understanding The **Robustness** in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12451.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFAN)\n\n- (arXiv 2022.04) MILES: Visual BERT Pre-training with Injected Language Semantics for **Video-text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12408.pdf)\n\n- (arXiv 2022.04) Contrastive Language-Action Pre-training for **Temporal Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12293.pdf)\n\n- (arXiv 2022.04) Boosting **Adversarial Transferability** of **MLP**-Mixer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12204.pdf)\n\n- (arXiv 2022.04) Adaptive **Split-Fusion** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12196.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fszx503045266\u002FASF-former)\n\n- (arXiv 2022.04) Can Foundation Models Perform Zero-Shot Task Specification For **Robot Manipulation**? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11134.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fzestproject)\n\n- (arXiv 2022.04) RELVIT: CONCEPT-GUIDED VISION TRANSFORMER FOR **VISUAL RELATIONAL REASONING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11167.pdf)\n\n- (arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic **Retail Checkout**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11024.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fistiakshihab\u002Fautomated-retail-checkout-aicity22)\n\n- (arXiv 2022.04) **CLIP**-DISSECT: AUTOMATIC **DESCRIPTION** OF **NEURON** REPRESENTATIONS IN DEEP VISION NETWORKS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10965.pdf)\n\n- (arXiv 2022.04) TEMOS: **Generating** diverse human **motions** from textual descriptions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14109.pdf), [[Project]](https:\u002F\u002Fimagine.enpc.fr\u002F~petrovim\u002Ftemos)\n\n- (arXiv 2022.04) Unsupervised Hierarchical **Semantic Segmentation** with Multiview Cosegmentation and Clustering Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11432.pdf)\n\n- (arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for **Infrared and Visible Images**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11436.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZhishe-Wang\u002FSwinFuse)\n\n- (arXiv 2022.04) OCFormer: One-Class Transformer Network for **Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11449.pdf)\n\n- (arXiv 2022.04) DRT: A Lightweight Single Image **Deraining** Recursive Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11385.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYC-Liang\u002FDRT)\n\n- (arXiv 2022.04) Hypergraph Transformer: Weakly-Supervised Multi-hop Reasoning for **Knowledge-based Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10448.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyujungheo\u002Fkbvqa-public)\n\n- (arXiv 2022.04) ParkPredict+: **Multimodal Intent** and **Motion Prediction** for **Vehicles** in Parking Lots with CNN and Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10777.pdf)\n\n- (arXiv 2022.04) iCAR: Bridging Image Classification and **Image-text** Alignment for Visual Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10760.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fweiyx16\u002FiCAR)\n\n- (arXiv 2022.04) DIVERSE INSTANCE DISCOVERY: VISION-TRANSFORMER FOR INSTANCE-AWARE **MULTI-LABEL IMAGE RECOGNITION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10731.pdf)\n\n- (arXiv 2022.04) Spatiality-guided Transformer for 3D Dense **Captioning** on **Point Clouds**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10688.pdf), [[Code]](https:\u002F\u002Fspacap3d.github.io\u002F)\n\n- (arXiv 2022.04) DFAM-DETR: Deformable feature based attention mechanism DETR on slender object **detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10667.pdf)\n\n- (arXiv 2022.04) NFormer: Robust **Person Re-identification** with Neighbor Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09331.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhaochenheheda\u002FNFormer)\n\n- (arXiv 2022.04) **Video Moment Retrieval** from Text Queries via Single Frame Annotation, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09409.pdf)\n\n- (arXiv 2022.04) GIMO: Gaze-Informed **Human Motion Prediction** in Context, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09443.pdf)\n\n- (arXiv 2022.04) VQGAN-CLIP: Open Domain **Image Generation and Editing** with Natural Language Guidance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08583.pdf)\n\n- (arXiv 2022.04) Sim-2-Sim Transfer for **Vision-and-Language Navigation** in Continuous Environments, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09667.pdf)\n\n- (arXiv 2022.04) Not All Tokens Are Equal: **Human-centric** Visual **Analysis** via Token Clustering Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08680.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzengwang430521\u002FTCFormer.git)\n\n- (arXiv 2022.04) **Multimodal** Token Fusion for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08721.pdf)\n\n- (arXiv 2022.04) Self-Calibrated Efficient Transformer for Lightweight **Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08913.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAlexZou14\u002FSCET)\n\n- (arXiv 2022.04) Searching Intrinsic **Dimensions** of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07722.pdf)\n\n- (arXiv 2022.04) Towards **Lightweight** Transformer via Group-wise Transformation for **Vision-and-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07780.pdf)\n\n- (arXiv 2022.04) Multimodal Few-Shot Object **Detection** with Meta-Learning Based Cross-Modal Prompting, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07841.pdf)\n\n- (arXiv 2022.04) Multi-Frame Self-Supervised **Depth** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07616.pdf), [[Code]](https:\u002F\u002Fsites.google.com\u002Ftri.global\u002Fdepthformer)\n\n- (arXiv 2022.04) MST++: Multi-stage Spectral-wise Transformer for Efficient **Spectral Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07908.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcaiyuanhao1998\u002FMST-plus-plus)\n\n- (arXiv 2022.04) Vision-Language Pre-Training for Multimodal Aspect-Based **Sentiment Analysis**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07955.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNUSTM\u002FVLP-MABSA)\n\n- (arXiv 2022.04) An Extendable, Efficient and Effective Transformer-based Object **Detector**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07962.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fnaver-ai\u002Fvidt)\n\n- (arXiv 2022.04) VDTR: **Video Deblurring** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08023.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fljzycmd\u002FVDTR)\n\n- (arXiv 2022.04) BSRT: Improving Burst **Super-Resolution** with Swin Transformer and Flow-Guided Deformable Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08332.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAlgolzw\u002FBSRT)\n\n- (arXiv 2022.04) Temporally Efficient Vision Transformer for **Video Instance Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08412.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FTeViT)\n\n- (arXiv 2022.04) VSA: Learning Varied-Size Window **Attention** in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08446.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FViTAE-VSA)\n\n- (arXiv 2022.04) XDBERT: Distilling Visual Information to **BERT** from Cross-Modal Systems to Improve Language Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07316.pdf)\n\n- (arXiv 2022.04) IMPROVING CROSS-MODAL UNDERSTANDING IN **VISUAL DIALOG** VIA CONTRASTIVE LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07302.pdf)\n\n- (arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient **Multi-View Stereo**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07346.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJeffWang987\u002FMVSTER)\n\n- (arXiv 2022.04) UNCONDITIONAL **IMAGE-TEXT PAIR GENERATION** WITH MULTIMODAL CROSS QUANTIZER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07537.pdf)\n\n- (arXiv 2022.04) Pushing the Limits of Simple Pipelines for **Few-Shot Learning**: External Data and Fine-Tuning Make a Difference, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07305.pdf)\n\n- (arXiv 2022.04) COTS: Collaborative Two-Stream **Vision-Language** Pre-Training Model for Cross-Modal Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07441.pdf)\n\n- (arXiv 2022.04) Image **Captioning** In the Transformer Age, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07374.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSjokerLily\u002Fawesome-image-captioning)\n\n- (arXiv 2022.04) **ResT** V2: Simpler, Faster and Stronger, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07366.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwofmanaf\u002FResT)\n\n- (arXiv 2022.04) Lightweight Bimodal Network for Single-Image **Super-Resolution** via Symmetric CNN and Recursive Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13286.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIVIPLab\u002FLBNet)\n\n- (arXiv 2022.04) Temporal Progressive Attention for **Early Action Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13340.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Falexandrosstergiou\u002Fprogressive-action-prediction)\n\n- (arXiv 2022.04) Keep the Caption Information: Preventing Shortcut Learning in Contrastive **Image-Caption** Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13382.pdf)\n\n- (arXiv 2022.04) Flamingo: a **Visual Language** Model for **Few-Shot** Learning, [[Paper]](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FDeepMind.com\u002FBlog\u002Ftackling-multiple-tasks-with-a-single-visual-language-model\u002Fflamingo.pdf)\n\n- (arXiv 2022.04) RELVIT: CONCEPT-GUIDED VISION TRANSFORMER FOR **VISUAL RELATIONAL REASONING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11167.pdf)\n\n- (arXiv 2022.04) **Unsupervised** Human **Action** Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10312.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIIT-PAVIS\u002FUHAR_Skeletal_Laplacian)\n\n- (arXiv 2022.04) Learning **Future Object Prediction** with a Spatiotemporal Detection Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10321.pdf)\n\n- (arXiv 2022.04) R^2-Trans: **Fine-Grained Visual Categorization** with Redundancy Reduction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10095.pdf), [[Code]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FR-2-Trans)\n\n- (arXiv 2022.04) A New Dataset and Transformer for 
**Stereoscopic Video Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10039.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FH-deep\u002FTrans-SVSR\u002F)\n\n- (arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View **Geolocalization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09967.pdf)\n\n- (arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based **Image Quality Assessment**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09779.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKomalPal9610\u002FIQA)\n\n- (arXiv 2022.04) BTranspose: Bottleneck Transformers for **Human Pose Estimation** with Self-Supervised Pre-Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10209.pdf)\n\n- (arXiv 2022.04) **Human-Object Interaction Detection** via Disentangled Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09290.pdf)\n\n- (arXiv 2022.04) ELEVATER: A **Benchmark** and **Toolkit** for Evaluating **Language-Augmented Visual Models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08790.pdf)\n\n- (arXiv 2022.04) Interactiveness Field in **Human-Object Interactions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07718.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FForuck\u002FInteractiveness-Field)\n\n- (arXiv 2022.04) DeiT III: Revenge of the **ViT**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07118.pdf)\n\n- (arXiv 2022.04) Residual Swin Transformer Channel Attention Network for **Image Demosaicing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07098.pdf)\n\n- (arXiv 2022.04) Neighborhood **Attention** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07143.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FNeighborhood-Attention-Transformer)\n\n- (arXiv 2022.04) MiniViT: **Compressing** Vision Transformers with Weight Multiplexing, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07154.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FCream)\n\n- (arXiv 2022.04) ViTOL: Vision Transformer for **Weakly Supervised Object Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06772.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSaurav-31\u002FViTOL)\n\n- (arXiv 2022.04) What Matters in Language Conditioned Robotic **Imitation Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06252.pdf), [[Code]](http:\u002F\u002Fhulc.cs.uni-freiburg.de\u002F)\n\n- (arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for **Partially Observable Scenes**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00656)\n\n- (arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for **Referring Expression Comprehension**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05991.pdf)\n\n- (arXiv 2022.04) Are **Multimodal** Transformers **Robust** to Missing Modality? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05454.pdf)\n\n- (arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05525.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FTopFormer)\n\n- (arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise **Vision-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05626.pdf)\n\n- (arXiv 2022.04) **Event** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05172.pdf)\n\n- (arXiv 2022.04) Evaluating Vision Transformer Methods for **Deep Reinforcement Learning** from Pixels, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04905.pdf)\n\n- (arXiv 2022.04) ManiTrans: Entity-Level **Text-Guided Image Manipulation** via Token-wise Semantic Alignment and Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04428.pdf), [[Code]](https:\u002F\u002Fjawang19.github.io\u002Fmanitrans)\n\n- (arXiv 2022.04) Multimodal Transformer for Nursing **Activity Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04564.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMomilijaz96\u002FMMT_for_NCRC)\n\n- (arXiv 2022.04) **Robust** Cross-Modal Representation Learning with Progressive Self-Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04588.pdf)\n\n- (arXiv 2022.04) Stripformer: Strip Transformer for Fast Image **Deblurring**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04627.pdf)\n\n- (arXiv 2022.04) No Token Left Behind: **Explainability**-Aided Image Classification and Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04908.pdf)\n\n- (arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for **Human Fashion Segmentation and Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04654.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxushilin1\u002FFashionFormer)\n\n- (arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for **Panoptic Part Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04655.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FPanoptic-PartFormer)\n\n- (arXiv 2022.04) DILEMMA: Self-Supervised **Shape and Texture** Learning with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04788.pdf)\n\n- (arXiv 2022.04) Learning Trajectory-Aware Transformer for **Video Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04216.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FTTVSR.git)\n\n- (arXiv 2022.04) Learning to Induce **Causal** Structure, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04875.pdf)\n\n- (arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in **Human Object Interaction Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04836.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmlvlab\u002FCPChoi)\n\n- (arXiv 2022.04) Category-Aware Transformer Network for Better **Human-Object Interaction Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04911.pdf)\n\n- (arXiv 2022.04) Does **Robustness** on ImageNet Transfer to Downstream Tasks?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03934.pdf)\n\n- (arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for **Facial Expression Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04083.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Fzczcwh\u002FPOSTER)\n\n- (arXiv 2022.04) Vision Transformers for Single Image **Dehazing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03883.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIDKiro\u002FDehazeFormer)\n\n- (arXiv 2022.04) Underwater **Image Enhancement** Using Pre-trained Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04199.pdf)\n\n- (arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient **event data processing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03355.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAlbertoSabater\u002FEventTransformer)\n\n- (arXiv 2022.04) PSTR: End-to-End One-Step **Person Search** With Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03340.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJialeCao001\u002FPSTR)\n\n- (arXiv 2022.04) Adapting **CLIP** For **Phrase Localization** Without Further Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03647.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpals-ttic\u002Fadapting-CLIP)\n\n- (arXiv 2022.04) FineDiving: A Fine-grained **Dataset** for Procedure-aware **Action Quality Assessment**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03646.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Fxujinglin\u002FFineDiving)\n\n- (arXiv 2022.04) DaViT: Dual **Attention** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03645.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdingmyu\u002Fdavit)\n\n- (arXiv 2022.04) Unsupervised Prompt Learning for **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03649.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ftonyhuang2022\u002FUPL)\n\n- (arXiv 2022.04) **Long Video Generation** with Time-Agnostic VQGAN and Time-Sensitive Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03638.pdf), [[Project]](https:\u002F\u002Fsongweige.github.io\u002Fprojects\u002Ftats\u002Findex.html)\n\n- (arXiv 2022.04) Unified Contrastive Learning in **Image-Text-Label** Space, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03610.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FUniCL)\n\n- (arXiv 2022.04) HunYuan_tvr for **Text-Video** Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03382.pdf)\n\n- (arXiv 2022.04) LEARNING TO COMPOSE SOFT PROMPTS FOR **COMPOSITIONAL ZERO-SHOT LEARNING**, [[Paper]]()\n\n- (arXiv 2022.04) End-to-End Zero-Shot **HOI** Detection via **Vision and Language** Knowledge Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03541.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmrwu-mac\u002FEoID)\n\n- (arXiv 2022.04) **Temporal Alignment** Networks for Long-term Video, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02968.pdf), [[Code]](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Ftan\u002F)\n\n- (arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for **Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02964.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FMIMDet)\n\n- (arXiv 2022.04) MixFormer: **Mixing Features** across Windows and Dimensions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02557.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleClas)\n\n- (arXiv 2022.04) CM3: A **CAUSAL** MASKED **MULTIMODAL** MODEL OF THE INTERNET, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07520.pdf)\n\n- (arXiv 2022.04) DO AS I CAN, NOT AS I SAY: GROUNDING LANGUAGE IN **ROBOTIC** AFFORDANCES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01691.pdf), [[Project]](https:\u002F\u002Fsay-can.github.io\u002F)\n\n- (arXiv 2022.04) TransGeo: Transformer Is All You Need for **Cross-view Image Geo-localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00097.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJeff-Zilence\u002FTransGeo2022)\n\n- (arXiv 2022.04) Socratic Models: Composing **Zero-Shot Multimodal Reasoning** with Language, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00598.pdf), [[Project]](https:\u002F\u002Fsocraticmodels.github.io\u002F)\n\n- (arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00452.pdf)\n\n- (arXiv 2022.04) Learning **Audio-Video** Modalities from Image Captions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00679.pdf)\n\n- (arXiv 2022.04) Improving Vision Transformers by Revisiting **High-frequency Components**, [[Paper]]()\n\n- (arXiv 2022.04) POS-BERT: **Point Cloud** One-Stage BERT Pre-Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00989.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffukexue\u002FPOS-BERT)\n\n- (arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for **Monocular Depth Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00987.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhyever\u002FMonocular-Depth-Estimation-Toolbox)\n\n- (arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for **Dense Representation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01254.pdf)\n\n- (arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for **Repetitive Action Counting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01018.pdf)\n\n- (arXiv 2022.04) **Long** Movie Clip Classification with State-Space **Video** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01692.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FViS4mer)\n\n- (arXiv 2022.04) TALLFormer: **Temporal Action Localization** with Long-memory Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01680.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fklauscc\u002FTALLFormer)\n\n- (arXiv 2022.04) Multi**MAE**: Multi-modal Multi-task Masked Autoencoders, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01678.pdf), [[Project]](https:\u002F\u002Fmultimae.epfl.ch\u002F)\n\n- (arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen **vision-language** representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01694.pdf)\n\n- (arXiv 2022.04) SE(3)-Equivariant Attention Networks for **Shape Reconstruction** in Function Space, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02394.pdf)\n\n- (arXiv 2022.04) Multi-View Transformer for **3D Visual Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02174.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsega-hsj\u002FMVT-3DVG)\n\n- (arXiv 2022.04) VISION TRANSFORMER EQUIPPED WITH NEURAL RESIZER ON **FACIAL EXPRESSION RECOGNITION** TASK, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02181.pdf)\n\n- (arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for **Group Activity Recognition**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02148.pdf), [[Project]](https:\u002F\u002Fmingfei.info\u002FDual-AI\u002F)\n\n- (arXiv 2022.04) Detector-Free Weakly Supervised **Group Activity Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02139.pdf)\n\n- (arXiv 2022.04) Joint **Hand Motion and Interaction Hotspots Prediction** from Egocentric Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01696.pdf), [[Project]](https:\u002F\u002Fstevenlsw.github.io\u002Fhoi-forecast)\n\n- (arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting **human-object interactions**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00746.pdf)\n\n- (arXiv 2022.04) MaxViT: **Multi-Axis** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01697.pdf)\n\n### 2022.03\n\n- (arXiv 2022.03) A **ConvNet** for the 2020s, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03545.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FConvNeXt)\n\n- (arXiv 2022.03) DeepNet: Scaling Transformers to **1,000 Layers**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00555.pdf)\n\n- (arXiv 2022.03) Spatial-Temporal Parallel Transformer for **Arm-Hand Dynamic Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16202.pdf)\n\n- (arXiv 2022.03) ViSTA: **Vision** and **Scene Text** Aggregation for Cross-Modal **Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16778.pdf)\n\n- (arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16768.pdf), [[Project]](http:\u002F\u002Fcvlab.postech.ac.kr\u002Fresearch\u002Frestr\u002F)\n\n- (arXiv 2022.03) CREATE: A **Benchmark** for Chinese Short **Video Retrieval** and **Title Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16763.pdf)\n\n- (arXiv 2022.03) **Deformable** **Video** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16795.pdf)\n\n- (arXiv 2022.03) End-to-End **Trajectory** Distribution Prediction Based on Occupancy Grid Maps, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16910.pdf)\n\n- (arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust **Optical Flow**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16896.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Faskerlee\u002Fcraft)\n\n- (arXiv 2022.03) VL-InterpreT: An Interactive **Visualization** Tool for Interpreting **Vision-Language** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17247.pdf), [[App]](http:\u002F\u002Fvlinterpretenv4env-env.eba-vmhhefup.us-east-2.elasticbeanstalk.com\u002F)\n\n- (arXiv 2022.03) TransEditor: Transformer-Based Dual-Space **GAN** for Highly Controllable **Facial Editing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17266.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBillyXYB\u002FTransEditor)\n\n- (arXiv 2022.03) BEVFormer: Learning **Bird’s-Eye-View** Representation from Multi-Camera Images via Spatiotemporal Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17270.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhiqi-li\u002FBEVFormer)\n\n- (arXiv 2022.03) **Visual Prompting**: Modifying Pixel Space to Adapt Pre-trained Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17274.pdf), [[Code]](https:\u002F\u002Fhjbahng.github.io\u002Fvisual_prompting\u002F)\n\n- (arXiv 
2022.03) Bringing Old **Films** Back to Life, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17276.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fraywzy\u002FBringing-Old-Films-Back-to-Life)\n\n- (arXiv 2022.03) Learning to Prompt for **Open-Vocabulary Object Detection** with Vision-Language Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14940.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdyabel\u002Fdetpro)\n\n- (arXiv 2022.03) SeqTR: A Simple yet Universal Network for **Visual Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16265.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsean-zhuh\u002FSeqTR)\n\n- (arXiv 2022.03) InstaFormer: Instance-Aware **Image-to-Image Translation** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16248.pdf)\n\n- (arXiv 2022.03) Omni-DETR: **Omni-Supervised** Object **Detection** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16089.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Famazon-research\u002Fomni-detr)\n\n- (arXiv 2022.03) Learning **Program Representations** for Food Images and Cooking Recipes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16071.pdf), [[Project]](http:\u002F\u002Fcookingprograms.csail.mit.edu\u002F)\n\n- (arXiv 2022.03) ITTR: **Unpaired Image-to-Image Translation** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16015.pdf)\n\n- (arXiv 2022.03) VPTR: **Efficient** Transformers for **Video Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15836.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FXiYe20\u002FVPTR)\n\n- (arXiv 2022.03) Parameter-**efficient** Fine-tuning for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16329.pdf)\n\n- (arXiv 2022.03) TubeDETR: Spatio-Temporal Video **Grounding** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16434.pdf), [[Code]](https:\u002F\u002Fantoyang.github.io\u002Ftubedetr.html)\n\n- (arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16527.pdf)\n\n- (arXiv 2022.03) PROMPTDET: EXPAND YOUR **DETECTOR** VOCABULARY WITH UNCURATED IMAGES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16513.pdf), [[Code]](https:\u002F\u002Ffcjian.github.io\u002Fpromptdet)\n\n- (arXiv 2022.03) **Few-Shot** Object **Detection** with Fully Cross-Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15021.pdf)\n\n- (arXiv 2022.03) Unified Transformer Tracker for Object **Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15175.pdf)\n\n- (arXiv 2022.03) X-Pool: Cross-Modal **Language-Video** Attention for Text-Video Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15086.pdf), [[Code]](https:\u002F\u002Flayer6ai-labs.github.io\u002Fxpool\u002F)\n\n- (arXiv 2022.03) Fine-tuning Image Transformers using Learnable **Memory**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15243.pdf)\n\n- (arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image **Inpainting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15270.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffenglinglwb\u002FMAT)\n\n- (arXiv 2022.03) mc-BEiT: Multi-choice Discretization for **Image BERT** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15371.pdf)\n\n- (arXiv 2022.03) End-to-End Transformer Based Model for Image **Captioning**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15350.pdf)\n\n- (arXiv 2022.03) Hybrid Routing Transformer for **Zero-Shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15310.pdf)\n\n- (arXiv 2022.03) TREATMENT LEARNING TRANSFORMER FOR **NOISY IMAGE CLASSIFICATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15529.pdf)\n\n- (arXiv 2022.03) Do **Vision-Language** Pretrained Models Learn **Primitive Concepts**?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17271.pdf)\n\n- (arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time **Human Motion Reconstruction** from Sparse **IMUs**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15720.pdf)\n\n- (arXiv 2022.03) SepViT: **Separable** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15380.pdf)\n\n- (arXiv 2022.03) MatteFormer: Transformer-Based **Image Matting** via Prior-Tokens, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15662.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwebtoon\u002Fmatteformer)\n\n- (arXiv 2022.03) Feature Selective Transformer for **Semantic Image Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14124.pdf)\n\n- (arXiv 2022.03) Bridge-Prompt: Towards **Ordinal Action Understanding** in Instructional Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14104.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fttlmh\u002FBridge-Prompt)\n\n- (arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time **Video Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14186.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fllmpass\u002FRSTT)\n\n- (arXiv 2022.03) Single-Stream Multi-Level Alignment for **Vision-Language** Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14395.pdf)\n\n- (arXiv 2022.03) Beyond Masking: Demystifying **Token-Based Pre-Training** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14313.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsunsmarterjie\u002Fbeyond_masking)\n\n- (arXiv 2022.03) Collaborative Transformers for **Grounded Situation Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16518.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjhcho99\u002FCoFormer)\n\n- (arXiv 2022.03) Object Memory Transformer for Object Goal **Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14708.pdf)\n\n- (arXiv 2022.03) Brain-inspired **Multilayer Perceptron** with **Spiking Neurons**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14679.pdf), [[Code]](https:\u002F\u002Fgitee.com\u002Fmindspore\u002Fmodels\u002Ftree\u002Fmaster\u002Fresearch\u002Fcv\u002Fsnn_mlp)\n\n- (arXiv 2022.03) HandOccNet: Occlusion-Robust **3D Hand Mesh Estimation** Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14564.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fnamepllet\u002FHandOccNet)\n\n- (arXiv 2022.03) REGTR: End-to-end **Point Cloud Correspondences** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14517.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyewzijian\u002FRegTR)\n\n- (arXiv 2022.03) Automated Progressive Learning for **Efficient Training** of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14509.pdf)\n\n- (arXiv 2022.03) Stratified Transformer for 3D **Point Cloud Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14508.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FStratified-Transformer)\n\n- (arXiv 2022.03) NOC-REK: Novel Object **Captioning** with Retrieved Vocabulary from External Knowledge, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14499.pdf)\n\n- (arXiv 2022.03) **FACIAL EXPRESSION RECOGNITION** WITH SWIN TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13472.pdf)\n\n- (arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch **Robustness**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13639.pdf)\n\n- (arXiv 2022.03) Efficient Visual **Tracking** via Hierarchical Cross-Attention Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13537.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchenxin-dlut\u002FHCAT)\n\n- (arXiv 2022.03) High-Performance Transformer **Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13533.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchenxin-dlut\u002FTransT-M)\n\n- (arXiv 2022.03) RayTran: **3D pose estimation** and **shape reconstruction** of multiple objects from videos with ray-traced transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13296.pdf)\n\n- (arXiv 2022.03) Multi-modal Multi-label **Facial Action Unit Detection** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13301.pdf)\n\n- (arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular **3D** Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13310.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FMonoDETR.git)\n\n- (arXiv 2022.03) **Text to Mesh** Without 3D Supervision Using Limit Subdivision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13333.pdf), [[Project]](https:\u002F\u002Fwww.nasir.lol\u002Fclipmesh)\n\n- (arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for **HOI Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13954.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYueLiao\u002Fgen-vlkt)\n\n- (arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13387.pdf)\n\n- (arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained **Image-Text** Models for Zero-Shot Video Understanding Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13371.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbryant1410\u002F)\n\n- (arXiv 2022.03) Vision Transformer **Compression** with Structured Pruning and Low Rank Approximation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13444.pdf)\n\n- (arXiv 2022.03) Multi-Modal Learning for **AU Detection** Based on Multi-Head Fused Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11441.pdf)\n\n- (arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End **Human-Object Interaction Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14709.pdf)\n\n- (arXiv 2022.03) Learning Patch-to-Cluster **Attention** in Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11987.pdf)\n\n- (arXiv 2022.03) Visual **Prompt Tuning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12119.pdf)\n\n- (arXiv 2022.03) Training-free Transformer **Architecture Search**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12217.pdf)\n\n- (arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for 
**Self-Supervised Video Pre-Training**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12602.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FVideoMAE)\n\n- (arXiv 2022.03) METAMORPH: LEARNING **UNIVERSAL CONTROLLERS** WITH TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11931.pdf), [[Project]](https:\u002F\u002Fmetamorph-iclr.github.io\u002Fsite\u002F)\n\n- (arXiv 2022.03) A Prompt Array Keeps the Bias Away: **Debiasing** **Vision-Language** Models with Adversarial Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11933.pdf)\n\n- (arXiv 2022.03) Reshaping **Robot Trajectories** Using Natural **Language** Commands: A Study of Multi-Modal Data Alignment Using Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13411.pdf), [[Project]](https:\u002F\u002Farthurfenderbucker.github.io\u002FNL_trajectory_reshaper\u002F)\n\n- (arXiv 2022.03) Associating Objects with Scalable Transformers for **Video Object Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11442.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FAOT)\n\n- (arXiv 2022.03) HOP: History-and-Order Aware Pre-training for **Vision-and-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11591.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYanyuanQiao\u002FHOP-VLN)\n\n- (arXiv 2022.03) Learning to **generate line drawings** that convey geometry and semantics, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12691.pdf), [[Project]](https:\u002F\u002Fcarolineec.github.io\u002Finformative_drawings\u002F)\n\n- (arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint **Video Moment Retrieval** and **Highlight Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12745.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FUMT)\n\n- (arXiv 2022.03) AIMusicGuru: Music Assisted **Human Pose Correction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12829.pdf)\n\n- (arXiv 2022.03) What to Hide from Your Students: Attention-Guided **Masked Image Modeling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12719.pdf)\n\n- (arXiv 2022.03) Towards Efficient and Elastic **Visual Question Answering** with Doubly Slimmable Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12814.pdf)\n\n- (arXiv 2022.03) ViT-FOD: A Vision Transformer based **Fine-grained Object Discriminator**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12816.pdf)\n\n- (arXiv 2022.03) **Keypoints Tracking** via Transformer Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12848.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLexaNagiBator228\u002FKeypoints-Tracking-via-Transformer-Networks\u002F)\n\n- (arXiv 2022.03) Beyond Fixation: **Dynamic Window** Visual Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12856.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpzhren\u002FDW-ViT)\n\n- (arXiv 2022.03) Make-A-Scene: Scene-Based **Text-to-Image** Generation with Human Priors, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13131.pdf)\n\n- (arXiv 2022.03) Self-supervised Video-centralised Transformer for **Video Face Clustering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13166.pdf)\n\n- (arXiv 2022.03) Towards Exemplar-Free **Continual Learning** in Vision Transformers: an Account of Attention, Functional and Weight Regularization, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13167.pdf)\n\n- (arXiv 2022.03) Global **Tracking** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13250.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxingyizhou\u002FGTR)\n\n- (arXiv 2022.03) **Video Instance Segmentation** via Multi-scale Spatio-temporal Split Attention Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13253.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOmkarThawakar\u002FMSSTS-VIS)\n\n- (arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for **Conditional Human Motion Animation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11632.pdf)\n\n- (arXiv 2022.03) Look for the Change: Learning **Object States** and **State-Modifying Actions** from Untrimmed Web Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11637.pdf), [[Project]](https:\u002F\u002Fdata.ciirc.cvut.cz\u002Fpublic\u002Fprojects\u002F2022LookForTheChange\u002F)\n\n- (arXiv 2022.03) GradViT: **Gradient Inversion** of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11894.pdf), [[Code]](https:\u002F\u002Fgradvit.github.io\u002F)\n\n- (arXiv 2022.03) **Mask Usage Recognition** using Vision Transformer with Transfer Learning and Data Augmentation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11542.pdf)\n\n- (arXiv 2022.03) Under the Hood of Transformer Networks for **Trajectory Forecasting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11878.pdf)\n\n- (arXiv 2022.03) **Open-Vocabulary DETR** with Conditional Matching, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11876.pdf)\n\n- (arXiv 2022.03) Meta-attention for ViT-backed **Continual Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11684.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzju-vipa\u002FMEAT-TIL)\n\n- (arXiv 2022.03) CNNs and Transformers Perceive **Hybrid Images** Similar to Humans, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11678.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Faliborji\u002Fhybrid_images.git)\n\n- (arXiv 2022.03) Bailando: **3D Dance Generation** by Actor-Critic GPT with Choreographic Memory, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13055.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flisiyao21\u002FBailando\u002F)\n\n- (arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal **Text and Image** Data, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.12692.pdf)\n\n- (arXiv 2022.03) ViewFormer: **NeRF-free Neural Rendering** from Few Images Using Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10157.pdf)\n\n- (arXiv 2022.03) **CLIP** on Wheels: Zero-Shot Object **Navigation** as Object Localization and Exploration, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10421.pdf)\n\n- (arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to **3D Object Detection** from Point Clouds, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10314.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fskyhehe123\u002FVoxSeT)\n\n- (arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image **Super Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10247.pdf)\n\n- (arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to **Robust Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10233.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Fuark-cviu\u002FDirecFormer)\n\n- (arXiv 2022.03) MixFormer: End-to-End **Tracking** with Iterative Mixed Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11082.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FMixFormer)\n\n- (arXiv 2022.03) PersFormer: **3D Lane Detection** via Perspective Transformer and the OpenLane Benchmark, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11089.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOpenPerceptionX\u002FOpenLane)\n\n- (arXiv 2022.03) Relationformer: A Unified Framework for **Image-to-Graph** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10202.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsuprosanna\u002Frelationformer)\n\n- (arXiv 2022.03) **CLIP** meets GamePhysics: Towards **bug identification** in gameplay videos using zero-shot transfer learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11096.pdf), [[Code]](https:\u002F\u002Fasgaardlab.github.io\u002FCLIPxGamePhysics\u002F)\n\n- (arXiv 2022.03) **Hyperbolic** Vision Transformers: Combining Improvements in Metric Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10833.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhtdt\u002Fhyp_metric)\n\n- (arXiv 2022.03) MonoDTR: Monocular **3D Object Detection** with Depth-Aware Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10981.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkuanchihhuang\u002FMonoDTR)\n\n- (arXiv 2022.03) Transformer-based **HTR** for Historical Documents, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11008.pdf)\n\n- (arXiv 2022.03) simCrossTrans: A Simple **Cross-Modality** Transfer Learning for Object **Detection** with ConvNets or Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10456.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fliketheflower\u002FsimCrossTrans)\n\n- (arXiv 2022.03) End-to-End **Human-Gaze-Target Detection** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10433.pdf)\n\n- (arXiv 2022.03) End-to-End **Video Text Spotting** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10539.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FTransDETR)\n\n- (arXiv 2022.03) Open-Vocabulary One-Stage **Detection** with Hierarchical **Visual-Language** Knowledge Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10593.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FmengqiDyangge\u002FHierKD)\n\n- (arXiv 2022.03) V2X-ViT: **Vehicle**-to-Everything Cooperative Perception with Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10638.pdf)\n\n- (arXiv 2022.03) LocATe: End-to-end **Localization of Actions** in 3D with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10719.pdf)\n\n- (arXiv 2022.03) AnoViT: **Unsupervised Anomaly Detection and Localization** with Vision Transformer-based Encoder-Decoder, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10808.pdf)\n\n- (arXiv 2022.03) ViM: **Out-Of-Distribution** with Virtual-logit Matching, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10807.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhaoqiwang\u002Fvim)\n\n- (arXiv 2022.03) ScalableViT: Rethinking the Context-oriented **Generalization** of Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10790.pdf)\n\n- (arXiv 2022.03) Iwin: **Human-Object 
Interaction Detection** via Transformer with Irregular Windows, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10537.pdf)\n\n- (arXiv 2022.03) Vision Transformer with Convolutions **Architecture Search**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.10435.pdf)\n\n- (arXiv 2022.03) Cascade Transformers for End-to-End **Person Search**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09642.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKitware\u002FCOAT)\n\n- (arXiv 2022.03) CodedVTR: Codebook-based Sparse **Voxel** Transformer with Geometric Guidance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09887.pdf)\n\n- (arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for **Feature Matching**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09645.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FMatchFormer)\n\n- (arXiv 2022.03) Local-Global Context Aware Transformer for **Language-Guided Video Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09773.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fleonnnop\u002FLocater)\n\n- (arXiv 2022.03) **Three things** everyone should know about Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09795.pdf)\n\n- (arXiv 2022.03) Are Vision Transformers **Robust** to Spurious Correlations? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09125.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdeeplearning-wisc\u002Fvit-spurious-robustness)\n\n- (arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR **CROSS-VIEW GEO-LOCALIZATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09135.pdf)\n\n- (arXiv 2022.03) DU-VLG: Unifying **Vision-and-Language** Generation via Dual Sequence-to-Sequence Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09052.pdf)\n\n- (arXiv 2022.03) Semantic-aligned Fusion Transformer for **One-shot** Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09093.pdf)\n\n- (arXiv 2022.03) UNIMO-2: End-to-End Unified **Vision-Language** Grounded Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09067.pdf), [[Code]](https:\u002F\u002Funimo-ptm.github.io\u002F)\n\n- (arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for **Few-shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09064.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FStomachCold\u002FHCTransformers)\n\n- (arXiv 2022.03) One-Shot Adaptation of **GAN** in Just One **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09301.pdf)\n\n- (arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° **Depth Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09283.pdf)\n\n- (arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive **Trajectory Prediction** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09293.pdf)\n\n- (arXiv 2022.03) Look Outside the Room: **Synthesizing** A Consistent Long-Term **3D Scene Video** from A Single Image, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09457.pdf), [[Code]](https:\u002F\u002Fxrenaa.github.io\u002Flook-outside-room\u002F)\n\n- (arXiv 2022.03) Transframer: Arbitrary **Frame Prediction** with Generative Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09494.pdf)\n\n- (arXiv 2022.03) Towards Data-**Efficient** Detection Transformers, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09507.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fencounter1997\u002FDE-DETRs)\n\n- (arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for **Saliency** Ranking, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09416.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGrassBro\u002FOCOR)\n\n- (arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST **ADVERSARIAL** PERTURBATIONS? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08392.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FRICE-EIC\u002FPatch-Fool)\n\n- (arXiv 2022.03) WegFormer: Transformers for Weakly Supervised **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08421.pdf)\n\n- (arXiv 2022.03) **Open Set Recognition** using Vision Transformer with an Additional Detection Head, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08441.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffeiyang-cai\u002Fosr_vit.git)\n\n- (arXiv 2022.03) UNIFIED VISUAL TRANSFORMER **COMPRESSION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08243.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FUVC)\n\n- (arXiv 2022.03) Towards Practical **Certifiable Patch Defense** with Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08519.pdf)\n\n- (arXiv 2022.03) EDTER: **Edge Detection** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08566.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMengyangPu\u002FEDTER)\n\n- (arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned **3D Human Motion Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07706.pdf)\n\n- (arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for **Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07682.pdf)\n\n- (arXiv 2022.03) Revitalize Region Feature for Democratizing **Video-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07720.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FCuthbertCai\u002FDemoVLP)\n\n- (arXiv 2022.03) Inverted Pyramid Multi-task Transformer for **Dense Scene Understanding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07997.pdf)\n\n- (arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07988.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Falpc91\u002FTransDA)\n\n- (arXiv 2022.03) Style Transformer for **Image Inversion** and **Editing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07932.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsapphire497\u002Fstyle-transformer)\n\n- (arXiv 2022.03) MotionCLIP: Exposing Human **Motion Generation** to **CLIP** Space, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08063.pdf), [[Project]](https:\u002F\u002Fguytevet.github.io\u002Fmotionclip-page\u002F)\n\n- (arXiv 2022.03) The Principle of **Diversity**: Training Stronger Vision Transformers Calls for Reducing All Levels of **Redundancy**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06345.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FDiverse-ViT)\n\n- (arXiv 2022.03) Enabling **Multimodal Generation** on CLIP via Vision-Language Knowledge Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06386.pdf)\n\n- (arXiv 2022.03) Sparse Local Patch 
Transformer for Robust **Face Alignment** and **Landmarks Inherent Relation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06541.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJiahao-UTS\u002FSLPT-master)\n\n- (arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient **crowd counting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06388.pdf)\n\n- (arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for **Salient Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06429.pdf)\n\n- (arXiv 2022.03) DATR: Domain-adaptive transformer for **multi-domain landmark detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06433.pdf)\n\n- (arXiv 2022.03) EventFormer: AU Event Transformer for **Facial Action** Unit Event Detection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06355.pdf)\n\n- (arXiv 2022.03) Accelerating **DETR** **Convergence** via Semantic-Aligned Matching, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06883.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FSAM-DETR)\n\n- (arXiv 2022.03) All in One: Exploring Unified **Video-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07303.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshowlab\u002Fall-in-one)\n\n- (arXiv 2022.03) CLIP Models are **Few-shot** Learners: Empirical Studies on VQA and Visual Entailment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07190.pdf)\n\n- (arXiv 2022.03) EIT: **Efficiently** Lead Inductive Biases to ViT, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07116.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMrHaiPi\u002FEIT)\n\n- (arXiv 2022.03) Self-Promoted Supervision for **Few-Shot** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07057.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDongSky\u002Ffew-shot-vit)\n\n- (arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for **Video Retrieval**, One More Step Towards Generalization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07086.pdf)\n\n- (arXiv 2022.03) Disentangled Representation Learning for **Text-Video** Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07111.pdf)\n\n- (arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for **Weakly Supervised Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07239.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fliruiwen\u002FTransCAM)\n\n- (arXiv 2022.03) Synopses of Movie Narratives: a **Video-Language Dataset** for Story Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05711.pdf), [[Dataset]](https:\u002F\u002Fgithub.com\u002Finsundaycathy\u002FSYMON)\n\n- (arXiv 2022.03) Visualizing and Understanding **Patch Interactions** in Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05922.pdf)\n\n- (arXiv 2022.03) **ANTI-OVERSMOOTHING** IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05962.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FViT-Anti-Oversmoothing)\n\n- (arXiv 2022.03) Democratizing Contrastive **Language-Image** Pre-training: A CLIP **Benchmark** of Data, Model, and Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05796.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002FSense-GVT\u002FDeCLIP)\n\n- (arXiv 2022.03) ActiveMLP: An **MLP**-like Architecture with Active Token Mixer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06108.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FActiveMLP)\n\n- (arXiv 2022.03) **Zero-Shot Action Recognition** with Transformer-based Video Semantic Embedding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05156.pdf)\n\n- (arXiv 2022.03) TrueType Transformer: **Character and Font Style Recognition** in Outline Format, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05338.pdf)\n\n- (arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for **Image-Text** Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05465.pdf)\n\n- (arXiv 2022.03) MVP: **Multimodality**-guided Visual Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05175.pdf)\n\n- (arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for **Scene Text Spotting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05122.pdf)\n\n- (arXiv 2022.03) **Multi-Modal** Mixup for **Robust** Fine-tuning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03897.pdf)\n\n- (arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for **Egocentric Assistant**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04203.pdf), [[Project]](https:\u002F\u002Fshowlab.github.io\u002Fassistq\u002F)\n\n- (arXiv 2022.03) **Coarse-to-Fine** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03821.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChenMnZ\u002FCF-ViT)\n\n- (arXiv 2022.03) Monocular Robot **Navigation** with Self-Supervised Pretrained Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03682.pdf)\n\n- (arXiv 2022.03) WAVEMIX: RESOURCE-**EFFICIENT** TOKEN MIXING FOR IMAGES, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03689.pdf)\n\n- (arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED **AUDIO-VISUAL** VOICE SEPARATION TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04099.pdf), [[Code]](https:\u002F\u002Fipcv.github.io\u002FVoViT\u002F)\n\n- (arXiv 2022.03) Graph Attention Transformer Network for **Multi-Label** Image **Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04049.pdf)\n\n- (arXiv 2022.03) EDGEFORMER: IMPROVING **LIGHT-WEIGHT CONVNETS** BY LEARNING FROM VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03952.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhkzhang91\u002FEdgeFormer)\n\n- (arXiv 2022.03) Skating-Mixer: Multimodal **MLP** for **Scoring Figure Skating**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03990.pdf)\n\n- (arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group **Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03937.pdf)\n\n- (arXiv 2022.03) CP-ViT: Cascade Vision Transformer **Pruning** via Progressive Sparsity Prediction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04570.pdf)\n\n- (arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot **Vision-Language** **Transfer Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04904.pdf)\n\n- (arXiv 2022.03) ChiTransformer: Towards Reliable **Stereo** from Cues, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04554.pdf)\n\n- (arXiv 2022.03) A Unified Transformer Framework for 
Group-based Segmentation: **Co-Segmentation**, **Co-Saliency Detection** and **Video Salient Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04708.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsuyukun666\u002FUFO)\n\n- (arXiv 2022.03) Coarse-to-Fine Sparse Transformer for **Hyperspectral Image Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04845.pdf)\n\n- (arXiv 2022.03) CMX: Cross-Modal Fusion for **RGB-X Semantic Segmentation** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04838.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhuaaaliu\u002FRGBX_Semantic_Segmentation)\n\n- (arXiv 2022.03) Multiscale Transformer for **Hyperspectral Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04771.pdf)\n\n- (arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in **Multi-modal Contrastive Representation** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02053.pdf), [[Code]](https:\u002F\u002Fmodalitygap.readthedocs.io\u002F)\n\n- (arXiv 2022.03) Autoregressive **Image Generation** using Residual Quantization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01941.pdf)\n\n- (arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED **IMAGE COMPRESSION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02452.pdf)\n\n- (arXiv 2022.03) Patch Similarity Aware Data-Free **Quantization** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02250.pdf)\n\n- (arXiv 2022.03) ViT-P: Rethinking Data-**efficient** Vision Transformers from Locality, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02358.pdf)\n\n- (arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR **DOCUMENT IMAGE** TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02378.pdf)\n\n- (arXiv 2022.03) Towards **Efficient** and **Scalable** Sharpness-Aware Minimization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02714.pdf)\n\n- (arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for **Pansharpening**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02503.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwgcban\u002FHyperTransformer)\n\n- (arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR **UNPAIRED IMAGE-TO-IMAGE TRANSLATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02557.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLS4GAN\u002Fuvcgan)\n\n- (arXiv 2022.03) Show Me What and Tell Me How: **Video Synthesis** via Multimodal Conditioning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02573.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FMMVID)\n\n- (arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR **PAN-SHARPENING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02916.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhysora\u002FPanFormer)\n\n- (arXiv 2022.03) Multi-class Token Transformer for **Weakly Supervised Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02891.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxulianuwa\u002FMCTformer)\n\n- (arXiv 2022.03) Cross Language Image Matching for **Weakly Supervised Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02668.pdf)\n\n- (arXiv 2022.03) Learning Affinity from Attention: End-to-End **Weakly-Supervised Semantic Segmentation** 
with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02664.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Frulixiang\u002Fafa)\n\n- (arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03605.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FIDEACVR\u002FDINO)\n\n- (arXiv 2022.03) MetaFormer : A Unified Meta Framework for **Fine-Grained Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02751.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdqshuai\u002FMetaFormer)\n\n- (arXiv 2022.03) **Audio-visual** Generalised Zero-shot Learning with Cross-modal Attention and Language, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03598.pdf)\n\n- (arXiv 2022.03) Knowledge Amalgamation for Object **Detection** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03187.pdf)\n\n- (arXiv 2022.03) Learnable Irrelevant Modality Dropout for **Multimodal Action Recognition** on Modality-Specific Annotated Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03014.pdf)\n\n- (arXiv 2022.03) Modeling Coreference Relations in **Visual Dialog**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02986.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMingxiao-Li\u002FModeling-Coreference-Relations-in-Visual-Dialog)\n\n- (arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR **FACE PRESENTATION ATTACK DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01562.pdf)\n\n- (arXiv 2022.03) Multi-Tailed Vision Transformer for **Efficient Inference**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01587.pdf)\n\n- (arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01452.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FTrans4PASS)\n\n- (arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in **Ecology**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01726.pdf)\n\n- (arXiv 2022.03) LGT-Net: Indoor Panoramic Room **Layout Estimation** with Geometry-Aware Transformer Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01824.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhigangjiang\u002FLGT-Net)\n\n- (arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based **Interaction Modeling** and **Trajectory Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01880.pdf)\n\n- (arXiv 2022.03) DCT-Former: **Efficient** Self-Attention with Discrete Cosine Transform, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01178.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcscribano\u002FDCT-Former-Public)\n\n- (arXiv 2022.03) Unsupervised **Vision-and-Language** Pre-training via Retrieval-based Multi-Granular Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00242.pdf)\n\n- (arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in **Point Cloud**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.00138.pdf)\n\n- (arXiv 2022.03) **CLIP**-GEN: Language-Free Training of a **Text-to-Image** Generator with CLIP, [[Paper]]()\n\n- (arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for **3D** Human **Pose** Estimation in Video, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00859.pdf)\n\n- (arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for **3D Dense Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00843.pdf)\n\n- (arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for **Point Cloud** Classification, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00828.pdf)\n\n- (arXiv 2022.03) DeciWatch: A Simple Baseline for 10× **Efficient** 2D and 3D **Pose** Estimation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08713.pdf)\n\n- (arXiv 2022.03) D_2ETR: **Decoder-Only DETR** with Computationally Efficient Cross-Scale Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00860.pdf)\n\n- (arXiv 2022.03) Incremental Transformer Structure Enhanced Image **Inpainting** with Masking Positional Encoding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00867.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDQiaole\u002FZITS_inpainting)\n\n- (arXiv 2022.03) Self-supervised Transformer for **Deepfake Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01265.pdf)\n\n- (arXiv 2022.03) Aggregated **Pyramid** Vision Transformer: Splittransform-merge Strategy for Image Recognition without Convolutions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.00960.pdf)\n\n- (arXiv 2022.03) TransDARC: Transformer-based **Driver Activity Recognition** with Latent Space Feature Calibration, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00927.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKPeng9510\u002FTransDARC)\n\n- (arXiv 2022.03) DN-DETR: **Accelerate** DETR **Training** by Introducing Query DeNoising, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01305.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFengLi-ust\u002FDN-DETR)\n\n- (arXiv 2022.03) **Protecting Celebrities** with Identity Consistency Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01318.pdf)\n\n- (arXiv 2022.03) Masked Visual Pre-training for **Motor Control**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06173.pdf), [[Project]](https:\u002F\u002Ftetexiao.com\u002Fprojects\u002Fmvp)\n\n- (arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and **Vision-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05081.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffawazsammani\u002Fnlxgpt)\n\n- (arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05557.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKaiyangZhou\u002FCoOp)\n\n- (arXiv 2022.03) **Lane Detection** with Versatile AtrousFormer and Local Semantic Guidance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04067.pdf)\n\n- (arXiv 2022.03) DALL-EVAL: Probing the Reasoning Skills and Social Biases of **Text-to-Image** Generative Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04053.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fj-min\u002FDallEval)\n\n- (arXiv 2022.03) **Forecasting** Characteristic **3D Poses** of Human Actions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.15079.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchrdiller\u002Fcharacteristic3dposes)\n\n### 2022.02\n\n- (arXiv 2022.02) **Bayesian Structure Learning** with Generative Flow Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13903.pdf)\n\n- 
(arXiv 2022.02) Towards **Unsupervised Domain Adaptation** via Domain-Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13777.pdf)\n\n- (arXiv 2022.02) An End-to-End Transformer Model for **Crowd Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13065.pdf)\n\n- (arXiv 2022.02) Instantaneous **Physiological Estimation** using Video Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12368.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Frevanurambareesh\u002Finstantaneous_transformer)\n\n- (arXiv 2022.02) Style**CLIP**Draw: Coupling Content and Style in **Text-to-Drawing** Translation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12362.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpschaldenbrand\u002FStyleCLIPDraw)\n\n- (arXiv 2022.02) ATTENTION ENABLES ZERO **APPROXIMATION** ERROR, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12166.pdf)\n\n- (arXiv 2022.02) When Transformer Meets **Robotic Grasping**: Exploits Context for Efficient Grasp Detection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11911.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FWangShaoSUN\u002Fgrasp-transformer)\n\n- (arXiv 2022.02) **AUTO-SCALING** VISION TRANSFORMERS WITHOUT TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11921.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FAsViT)\n\n- (arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for **Vision-and-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11742.pdf), [[Project]](https:\u002F\u002Fcshizhe.github.io\u002Fprojects\u002Fvln_duet.html)\n\n- (arXiv 2022.02) LEARNING TO **MERGE TOKENS** IN VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12015.pdf)\n\n- (arXiv 2022.02) ProFormer: Learning **Data-efficient** Representations of **Body Movement** with Prototype-based Feature Augmentation and Visual Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11423.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKPeng9510\u002FProFormer)\n\n- (arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR **UNSUPERVISED OBJECT DISCOVERY** USING NORMALIZED CUT, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11539.pdf), [[Project]](https:\u002F\u002Fwww.m-psi.fr\u002FPapers\u002FTokenCut2022\u002F)\n\n- (arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal **Texture Synthesis**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11703.pdf)\n\n- (arXiv 2022.02) CaMEL: Mean Teacher Learning for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10492.pdf)\n\n- (arXiv 2022.02) **Hierarchical** Perceiver, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10890.pdf)\n\n- (arXiv 2022.02) **Movies2Scenes**: Learning Scene Representations Using Movie Similarities, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10650.pdf)\n\n- (arXiv 2022.02) GroupViT: **Semantic Segmentation** Emerges from Text Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11094.pdf)\n\n- (arXiv 2022.02) Snowflake Point Deconvolution for **Point Cloud** Completion and Generation with Skip-Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09367.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAllenXiangX\u002FSnowflakeNet)\n\n- (arXiv 2022.02) Audio Visual Scene-Aware **Dialog Generation** with Transformer-based Video Representations, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09979.pdf)\n\n- (arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring **Inductive Bias** for Image Recognition and Beyond, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10108.pdf)\n\n- (arXiv 2022.02) PMP-Net++: **Point Cloud Completion** by Transformer-Enhanced Multi-step Point Moving Paths, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09507.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdiviswen\u002FPMP-Net)\n\n- (arXiv 2022.02) DataMUX: **Data Multiplexing** for Neural Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09318.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FDataMUX)\n\n- (arXiv 2022.02) On Guiding Visual **Attention** with **Language** Specification, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.08926.pdf)\n\n- (arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR **LIGHTING AGGREGATION** ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09206.pdf)\n\n- (arXiv 2022.02) **MISINFORMATION DETECTION** IN SOCIAL MEDIA **VIDEO** POSTS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07706.pdf)\n\n- (arXiv 2022.02) Can Deep Learning be Applied to Model-Based **Multi-Object Tracking**? [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07909.pdf)\n\n- (arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA **TOKEN REORGANIZATIONS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07800.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyouweiliang\u002Fevit)\n\n- (arXiv 2022.02) ActionFormer: **Localizing** Moments of **Actions** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07925.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhappyharrycn\u002Factionformer_release)\n\n- (arXiv 2022.02) One Step at a Time: Long-Horizon **Vision-and-Language Navigation** with Milestones, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07028.pdf)\n\n- (arXiv 2022.02) XAI for Transformers: Better **Explanations** through Conservative Propagation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07304.pdf)\n\n- (arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human **Pose** and **Mesh Reconstruction** for In-the-Wild Scenes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07228.pdf)\n\n- (arXiv 2022.02) ViNTER: **Image Narrative Generation** with Emotion-Arc-Aware Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07305.pdf)\n\n- (arXiv 2022.02) Hyper-relationship Learning Network for **Scene Graph** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07271.pdf)\n\n- (arXiv 2022.02) CommerceMM: Large-Scale Commerce **MultiModal Representation** Learning with Omni Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07247.pdf)\n\n- (arXiv 2022.02) Flowformer: **Linearizing** Transformers with Conservation Flows, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06258.pdf)\n\n- (arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for **Embodied** Instruction Following, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13330.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fanonrabit\u002FDialFRED)\n\n- (arXiv 2022.02) CATs++: Boosting **Cost Aggregation** with Convolutions and Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06817.pdf)\n\n- (arXiv 2022.02) Geometric Transformer for Fast and 
Robust **Point Cloud Registration**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06688.pdf)\n\n- (arXiv 2022.02) I-Tuning: Tuning Language Models with Image for **Caption** Generation, [[Paper]]()\n\n- (arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based **Pedestrian Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06014.pdf), [[Code]](https:\u002F\u002Fgit.openi.org.cn\u002Fzangxh\u002FPiT.git)\n\n- (arXiv 2022.02) **Visual Acoustic** Matching, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06875.pdf)\n\n- (arXiv 2022.02) LighTN: **Light**-weight Transformer Network for Performance-overhead Tradeoff in **Point Cloud Downsampling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06263.pdf)\n\n- (arXiv 2022.02) BViT: Broad **Attention** based Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06268.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDRL-CASIA\u002FDense_ViT)\n\n- (arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for **Few-Shot Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06498.pdf)\n\n- (arXiv 2022.02) Domain Adaptation via **Prompt** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06687.pdf)\n\n- (arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in **Vision MLPs**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06510.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJegZheng\u002FMS-MLP)\n\n- (arXiv 2022.02) Wukong: 100 Million Large-scale Chinese **Cross-modal Pre-training** Dataset and A Foundation Framework, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06767.pdf), [[Project]](https:\u002F\u002Fwukong-dataset.github.io\u002Fwukong-dataset\u002F)\n\n- (arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06709.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxxxnell\u002Fhow-do-vits-work)\n\n- (arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05451.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjiahuei\u002Fsparse-image-captioning)\n\n- (arXiv 2022.02) **CLIP**asso: Semantically-Aware **Object Sketching**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05822.pdf), [[Code]](https:\u002F\u002Fclipasso.github.io\u002Fclipasso\u002F)\n\n- (arXiv 2022.02) Towards Weakly-Supervised **Text Spotting** using a Multi-Task Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05508.pdf)\n\n- (arXiv 2022.02) DEEP **SOCCER CAPTIONING** WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05728.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fsoccercaptioning)\n\n- (arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED **IMAGE COMPRESSION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05492.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmx54039q\u002Fentroformer)\n\n- (arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04298.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyaolinli\u002FIDC)\n\n- (arXiv 2022.02) MaskGIT: Masked **Generative** **Image** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04200.pdf)\n\n- (arXiv 2022.02) Distillation with Contrast is All You Need for **Self-Supervised** **Point Cloud** Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04241.pdf)\n\n- (arXiv 2022.02) Motion-Aware Transformer For **Occluded Person Re-identification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04243.pdf)\n\n- (arXiv 2022.02) Conditional **Motion In-betweening**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04307.pdf)\n\n- (arXiv 2022.02) Memory-based **gaze prediction** in deep imitation learning for **robot manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04877.pdf)\n\n- (arXiv 2022.02) **Spherical** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04942.pdf)\n\n- (arXiv 2022.02) OWL (Observe, Watch, Listen): **Localizing Actions** in **Egocentric Video** via Audiovisual Temporal Context, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04947.pdf)\n\n- (arXiv 2022.02) The Abduction of Sherlock Holmes: A **Dataset** for **Visual Abductive Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04800.pdf), [[Project]](http:\u002F\u002Fwww.visualabduction.com\u002F)\n\n- (arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of **Text-to-Image** Generative Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04053.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fj-min\u002FDallEval)\n\n- (arXiv 2022.02) Pre-Trained Language Models for **Interactive Decision-Making**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.01771.pdf)\n\n- (arXiv 2022.02) TransFollower: Long-Sequence Car-Following **Trajectory Prediction** through Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03183.pdf)\n\n- (arXiv 
2022.02) The devil is in the labels: **Semantic segmentation** from sentences, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02002.pdf)\n\n- (arXiv 2022.02) Webly Supervised Concept Expansion for **General Purpose Vision Models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02317.pdf), [[Project]](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fgpv2)\n\n- (arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR **VISUAL DIALOG**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10787.pdf)\n\n- (arXiv 2022.02) **UNIFYING** ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03052.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FOFA)\n\n- (arXiv 2022.02) Transformers in Self-Supervised **Monocular Depth Estimation** with Unknown Camera Intrinsics, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03131.pdf)\n\n- (arXiv 2022.02) TRANSDREAMER: **REINFORCEMENT LEARNING** WITH TRANSFORMER WORLD MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09481.pdf)\n\n- (arXiv 2022.02) **Vision-Language** Pre-Training with Triple Contrastive Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10401.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Futa-smile\u002FTCL)\n\n- (arXiv 2022.02) Corrupted Image Modeling for **Self-Supervised** Visual **Pre-Training**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03382.pdf)\n\n- (arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified **Vision-Language** Understanding and Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12086.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FBLIP)\n\n- (arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in **DNN Accelerators**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11218.pdf)\n\n- (arXiv 2022.02) Interactron: **Embodied** Adaptive **Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00660.pdf)\n\n- (arXiv 2022.02) Local Feature Matching with Transformers for low-end devices **LoFTR** method adaptation approach, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00770.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FKolkir\u002FCoarse_LoFTR_TRT)\n\n- (arXiv 2022.02) Can Transformers be Strong **Treatment Effect Estimators**?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.01336.pdf)\n\n- (arXiv 2022.02) Improving **Sample Efficiency of Value** Based Models Using Attention and Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00710.pdf)\n\n- (arXiv 2022.02) Detecting **Human-Object Interactions** with Object-Guided Cross-Modal Calibrated Semantics, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00259.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJacobYuan7\u002FOCN-HOI-Benchmark)\n\n### 2022.01\n\n- (arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12133.pdf)\n\n- (arXiv 2022.01) DynaMixer: A Vision **MLP** Architecture with Dynamic Mixing, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12083.pdf)\n\n- (arXiv 2022.01) VRT: A **Video Restoration** Transformer, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12288.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJingyunLiang\u002FVRT)\n\n- (arXiv 2022.01) DAB-DETR: DYNAMIC **ANCHOR** BOXES ARE BETTER QUERIES FOR **DETR**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12329.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSlongLiu\u002FDAB-DETR)\n\n- (arXiv 2022.01) Plug-In Inversion: Model-Agnostic **Inversion** for Vision with Data Augmentations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12961.pdf)\n\n- (arXiv 2022.01) MVP: Multi-Stage **Vision-Language** Pre-Training via Multi-Level Semantic Alignment, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12596.pdf)\n\n- (arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative **Vision-and-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12723.pdf)\n\n- (arXiv 2022.01) BOAT: Bilateral Local **Attention** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.13027.pdf)\n\n- (arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING **GRAPH** REPRESENTATION WITH TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12787.pdf)\n\n- (arXiv 2022.01) Aggregating **Global** Features into **Local** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12903.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkrushi1992\u002FMOA-transformer)\n\n- (arXiv 2022.01) Transformer Module Networks for Systematic Generalization in **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11316.pdf)\n\n- (arXiv 2022.01) Generalised Image **Outpainting** with U-Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11403.pdf)\n\n- (arXiv 2022.01) RelTR: Relation Transformer for **Scene Graph Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11460.pdf)\n\n- (arXiv 2022.01) DocSegTr: An Instance-Level End-to-End **Document Image Segmentation** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11438.pdf)\n\n- (arXiv 2022.01) Pre-Trained **Language** Transformers are Universal **Image** Classifiers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10182.pdf)\n\n- (arXiv 2022.01) Explore and Match: End-to-End **Video Grounding** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10168.pdf)\n\n- (arXiv 2022.01) TGFuse: An **Infrared** and **Visible Image Fusion** Approach Based on Transformer and Generative Adversarial Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10147.pdf)\n\n- (arXiv 2022.01) ViT-HGR: Vision Transformer-based **Hand Gesture Recognition** from High Density Surface EMG Signals, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10060.pdf)\n\n- (arXiv 2022.01) ShapeFormer: Transformer-based **Shape Completion** via Sparse Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10326.pdf), [[Project]](https:\u002F\u002Fshapeformer.github.io\u002F)\n\n- (arXiv 2022.01) **CONVOLUTIONAL** XFORMERS FOR VISION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10271.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpranavphoenix\u002FCXV)\n\n- (arXiv 2022.01) DocEnTr: An End-to-End **Document Image Enhancement** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10252.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdali92002\u002FDocEnTR)\n\n- (arXiv 2022.01) Zero-Shot **Sketch** Based **Image Retrieval** using Graph Transformer, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10185.pdf)\n\n- (arXiv 2022.01) SA-**VQA**: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10654.pdf)\n\n- (arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR **BUILDING DAMAGE ASSESSMENT**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10953.pdf)\n\n- (arXiv 2022.01) When **Shift Operation** Meets Vision Transformer: An Extremely Simple Alternative to **Attention** Mechanism, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10801.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSPACH)\n\n- (arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for **Vision-and-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10788.pdf)\n\n- (arXiv 2022.01) **Training** Vision Transformers with Only 2040 Images, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10728.pdf)\n\n- (arXiv 2022.01) Learning To Recognize **Procedural Activities** with Distant Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10990.pdf)\n\n- (arXiv 2022.01) EVALUATING **LANGUAGE**-BIASED **IMAGE** CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11014.pdf)\n\n- (arXiv 2022.01) A Comprehensive Study of Vision Transformers on **Dense Prediction Tasks**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08683.pdf)\n\n- (arXiv 2022.01) UniFormer: Unifying **Convolution** and **Self-attention** for Visual Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09450.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniFormer)\n\n- (arXiv 2022.01) **Patches** Are All You Need? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09792.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flocuslab\u002Fconvmixer)\n\n- (arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for **Text-to-Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09168.pdf)\n\n- (arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE **MULTIMODAL** NEURAL **SLAM**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09862.pdf)\n\n- (arXiv 2022.01) Visual Information Guided **Zero-Shot Paraphrase Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09107.pdf)\n\n- (arXiv 2022.01) TerViT: An **Efficient** **Ternary** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08050.pdf)\n\n- (arXiv 2022.01) End-to-end Generative Pretraining for **Multimodal Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08264.pdf)\n\n- (arXiv 2022.01) OMNIVORE: A Single Model for **Many** Visual **Modalities**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08377.pdf), [[Project]](https:\u002F\u002Ffacebookresearch.github.io\u002Fomnivore\u002F)\n\n- (arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for **Efficient Long-Term Video Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08383.pdf)\n\n- (arXiv 2022.01) The CLEAR Benchmark: **Continual LEArning** on Real-World Imagery, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06289.pdf), [[Project]](https:\u002F\u002Fclear-benchmark.github.io\u002F)\n\n- (arXiv 2022.01) ProposalCLIP: **Unsupervised** Open-Category Object **Proposal** Generation via Exploiting **CLIP** Cues, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06696.pdf)\n\n- (arXiv 2022.01) Cross-modal Contrastive Distillation for **Instructional Activity Anticipation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06734.pdf)\n\n- (arXiv 2022.01) Transformers in Action: **Weakly Supervised Action Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05675.pdf)\n\n- (arXiv 2022.01) VAQF: Fully Automatic **Software-hardware Co-design** Framework for **Low-bit** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06618.pdf)\n\n- (arXiv 2022.01) CLIP-TD: **CLIP** Targeted Distillation for **Vision-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05729.pdf)\n\n- (arXiv 2022.01) **Domain Adaptation** via Bidirectional Cross-Attention Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05887.pdf)\n\n- (arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for **Online Inference**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06268.pdf)\n\n- (arXiv 2022.01) **Motion Inbetweening** via Deep ∆-Interpolator, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06701.pdf)\n\n- (arXiv 2022.01) RePre: Improving **Self-Supervised** Vision Transformer with Reconstructive Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06857.pdf)\n\n- (arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for **Nowcasting Extreme Events**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06717.pdf)\n\n- (arXiv 2022.01) TransFuse: A Unified Transformer-based **Image Fusion** Framework using Self-supervised Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07451.pdf)\n\n- (arXiv 2022.01) Q-ViT: Fully Differentiable **Quantization** for Vision 
Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07703.pdf)\n\n- (arXiv 2022.01) Disentangled Latent Transformer for **Interpretable Monocular Height Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06357.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002FShadowXZT\u002FDLT-Height-Estimation.pytorch)\n\n- (arXiv 2022.01) Poseur: Direct Human **Pose Regression** with Transformers*, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07412.pdf)\n\n- (arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP **TRAFFIC PREDICTION** USING SHIFTED WINDOW TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06390.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbojesomo\u002FTraffic4Cast2021-SwinUNet3D)\n\n- (arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN **POSE ESTIMATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07384.pdf)\n\n- (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for **Robotic Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07779.pdf), [[Project]](https:\u002F\u002Fjangirrishabh.github.io\u002Flookcloser\u002F)\n\n- (arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving **Hashing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05541.pdf)\n\n- (arXiv 2022.01) LANGUAGE-DRIVEN **SEMANTIC SEGMENTATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03546.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fisl-org\u002Flang-seg)\n\n- (arXiv 2022.01) **Pedestrian Detection**: Domain Generalization, CNNs, Transformers and Beyond, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03176.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhasanirtiza\u002FPedestron)\n\n- (arXiv 2022.01) ImageSubject: A Large-scale Dataset for **Subject Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03101.pdf)\n\n- (arXiv 2022.01) **Detecting** Twenty-thousand Classes using Image-level Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02605.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDetic)\n\n- (arXiv 2022.01) Generalized **Category Discovery**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02609.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsgvaze\u002Fgeneralized-category-discovery)\n\n- (arXiv 2022.01) Video **Summarization** Based on **Video-text** Modelling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02494.pdf)\n\n- (arXiv 2022.01) Spatio-Temporal Tuples Transformer for **Skeleton-Based Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02849.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fheleiqiu\u002FSTTFormer)\n\n- (arXiv 2022.01) **QUADTREE ATTENTION** FOR VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02767.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTangshitao\u002FQuadtreeAttention)\n\n- (arXiv 2022.01) A Comprehensive Empirical Study of **Vision-Language** Pre-trained Model for Supervised Cross-Modal Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02772.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Fzhixiongz\u002FCLIP4CMR)\n\n- (arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through **Vision and Language and Sound**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02639.pdf), [[Project]](https:\u002F\u002Frowanzellers.com\u002Fmerlotreserve)\n\n- (arXiv 2022.01) On the Efficacy of Co-Attention 
Transformer Layers in **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03965.pdf)\n\n- (arXiv 2022.01) Pyramid Fusion Transformer for **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04019.pdf)\n\n- (arXiv 2022.01) Multiview Transformers for **Video Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04288.pdf)\n\n- (arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED **FEW-SHOT LEARNING**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04182.pdf)\n\n- (arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR **EFFICIENT SPATIOTEMPORAL** REPRESENTATION LEARNING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04676.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniFormer)\n\n- (arXiv 2022.01) BridgeFormer: Bridging **Video-text** Retrieval with Multiple Choice Questions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04850.pdf), [[Project]](https:\u002F\u002Fgeyuying.github.io\u002FMCQ.html)\n\n- (arXiv 2022.01) TransVOD: End-to-end **Video Object Detection** with Spatial-Temporal Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05047.pdf)\n\n- (arXiv 2022.01) **CLIP**-Event: Connecting Text and Images with **Event** Structures, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05078.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flimanling\u002Fclip-event)\n\n- (arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular **Vision-Language** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04026.pdf)\n\n- (arXiv 2022.01) Lawin Transformer: Improving **Semantic Segmentation** Transformer with Multi-Scale Representations via Large Window Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01615.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyan-hao-tian\u002Flawin)\n\n- (arXiv 2022.01) **Self-Training** **Vision Language** BERTs with a Unified Conditional Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02010.pdf)\n\n- (arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02001.pdf)\n\n- (arXiv 2022.01) Compact Bidirectional Transformer for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01984.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYuanEZhou\u002FCBTrans)\n\n- (arXiv 2022.01) Flow-Guided Sparse Transformer for **Video Deblurring**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01893.pdf)\n\n- (arXiv 2022.01) **Stochastic Layers** in Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15111.pdf)\n\n- (arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR **BIDIRECTIONAL VISION-LANGUAGE GENERATION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15283.pdf)\n\n- (arXiv 2022.01) InverseMV: **Composing Piano Scores** with a Convolutional **Video-Music** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15320.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flinchintung\u002FVMT)\n\n- (arXiv 2022.01) CSformer: Bridging Convolution and Transformer for **Compressive Sensing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15299.pdf)\n\n- (arXiv 2022.01) Persformer: A Transformer Architecture for **Topological Machine Learning**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15210.pdf)\n\n- (arXiv 2022.01) Vision Transformer **Slimming**: Multi-Dimension Searching in Continuous Optimization Space, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00814.pdf)\n\n- (arXiv 2022.01) Language as Queries for **Referring Video Object Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00487.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwjn922\u002FReferFormer)\n\n- (arXiv 2022.01) PyramidTNT: Improved **Transformer-in-Transformer** Baselines with Pyramid Architecture, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00978.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FCV-Backbones\u002Ftree\u002Fmaster\u002Ftnt_pytorch)\n\n- (arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR **CHANGE DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01293.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwgcban\u002FChangeFormer)\n\n- (arXiv 2022.01) Vision Transformer with **Deformable Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00520.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FDAT)\n\n- (arXiv 2022.01) Splicing ViT Features for **Semantic Appearance Transfer**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00424.pdf), [[Project]](https:\u002F\u002Fsplice-vit.github.io\u002F)\n\n- (arXiv 2022.01) Detail-Preserving Transformer for **Light Field Image Super-Resolution**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00346.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBITszwang\u002FDPT)\n\n### 2021.12\n\n- (arXiv 2021.12) Multi-Dimensional **Model Compression** of Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00043.pdf)\n\n- (arXiv 2021.12) Siamese Network with Interactive Transformer for **Video Object Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13983.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLANMNG\u002FSITVOS)\n\n- (arXiv 2021.12) Pale Transformer: A General Vision Transformer **Backbone** with Pale-Shaped **Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14000.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBR-IDL\u002FPaddleViT)\n\n- (arXiv 2021.12) APRIL: Finding the Achilles’ Heel on **Privacy** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14087.pdf)\n\n- (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in **Video-to-Text Translation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14088.pdf)\n\n- (arXiv 2021.12) Does **CLIP** Benefit **Visual Question Answering** in the Medical Domain as Much as it Does in the General Domain?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13906.pdf)\n\n- (arXiv 2021.12) SPViT: Enabling **Faster** Vision Transformers via Soft Token Pruning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13890.pdf)\n\n- (arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13884.pdf)\n\n- (arXiv 2021.12) StyleGAN-V: A Continuous **Video** Generator with the Price, Image Quality and Perks of **StyleGAN2**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14683.pdf), [[Code]](https:\u002F\u002Funiversome.github.io\u002Fstylegan-v)\n\n- (arXiv 2021.12) A Simple Baseline for **Zero-shot Semantic Segmentation** with Pre-trained 
**Vision-language** Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14757.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMendelXu\u002Fzsseg.baseline)\n\n- (arXiv 2021.12) Miti-DETR: Object **Detection** based on Transformers with Mitigatory Self-Attention Convergence, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13310.pdf)\n\n- (arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH **SLIDING WINDOWS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13085.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fucasligang\u002FSimViT)\n\n- (arXiv 2021.12) SGTR: End-to-end **Scene Graph Generation** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12970.pdf)\n\n- (arXiv 2021.12) **Video** Joint Modelling Based on Hierarchical Transformer for **Co-summarization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13478.pdf)\n\n- (arXiv 2021.12) Vision Transformer for **Small-Size Datasets**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13492.pdf)\n\n- (arXiv 2021.12) Learning **Generative** Vision Transformer with Energy-Based Latent Space for **Saliency Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13528.pdf)\n\n- (arXiv 2021.12) ViR: the Vision **Reservoir**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13545.pdf)\n\n- (arXiv 2021.12) SeMask: Semantically Masked Transformers for **Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12782.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FSeMask-Segmentation)\n\n- (arXiv 2021.12) Open-Vocabulary Image **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12143.pdf)\n\n- (arXiv 2021.12) ELSA: Enhanced Local **Self-Attention** for Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12786.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdamo-cv\u002FELSA)\n\n- (arXiv 2021.12) LaTr: Layout-Aware Transformer for **Scene-Text** **VQA**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12494.pdf)\n\n- (arXiv 2021.12) **Multimodal Personality Recognition** using Cross-Attention Transformer and Behaviour Encoding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12180.pdf)\n\n- (arXiv 2021.12) Fine-grained **Multi-Modal Self-Supervised Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12182.pdf)\n\n- (arXiv 2021.12) SLIP: Self-supervision meets **Language-Image** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12750.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSLIP)\n\n- (arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for **Question Answering** in **3D Real-World Scenes**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11691.pdf)\n\n- (arXiv 2021.12) MIA-Former: **Efficient** and **Robust** Vision Transformers via Multi-grained Input Adaptation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11542.pdf)\n\n- (arXiv 2021.12) iSegFormer: Interactive Image **Segmentation** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11325.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fqinliuliuqin\u002FiSegFormer.git)\n\n- (arXiv 2021.12) Contrastive Object **Detection** Using Knowledge Graph Embeddings, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11366.pdf)\n\n- (arXiv 2021.12) RepMLPNet: Hierarchical Vision **MLP** with Re-parameterized **Locality**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11081.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FDingXiaoH\u002FRepMLP)\n\n- (arXiv 2021.12) **Lite** Vision Transformer with Enhanced **Self-Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10809.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChenglin-Yang\u002FLVT)\n\n- (arXiv 2021.12) MPViT : Multi-Path Vision Transformer for **Dense Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11010.pdf), [[Code]](https:\u002F\u002Fgit.io\u002FMPViT)\n\n- (arXiv 2021.12) SOIT: **Segmenting** Objects with Instance-Aware Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11037.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FyuxiaodongHRI\u002FSOIT)\n\n- (arXiv 2021.12) Learned Queries for Efficient Local **Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11435.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmoabarar\u002Fqna)\n\n- (arXiv 2021.12) On **Efficient** Transformer and Image Pre-training for **Low-level** Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10175.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffenglinglwb\u002FEDT)\n\n- (arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform **Temporal Moment Localization** on Long Untrimmed Videos With a Feature Sampling Approach, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10066.pdf)\n\n- (arXiv 2021.12) Tell me what you see: A zero-shot **action recognition** method based on natural language descriptions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09976.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvalterlej\u002Fzsarcap)\n\n- (arXiv 2021.12) Pre-Training Transformers for **Domain Adaptation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09965.pdf)\n\n- (arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10482.pdf)\n\n- (arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10740.pdf)\n\n- (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution **Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10762.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FStyleSwin)\n\n- (arXiv 2021.12) Mask2Former for **Video Instance Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10764.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FMask2Former)\n\n- (arXiv 2021.12) GLIDE: Towards Photorealistic **Image Generation** and **Editing** with **Text**-Guided Diffusion Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10741.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fglide-text2im)\n\n- (arXiv 2021.12) **Efficient** Visual **Tracking** with Exemplar Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09686.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvisionml\u002Fpytracking)\n\n- (arXiv 2021.12) **Neuromorphic Camera Denoising** using Graph Neural Network-driven Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09685.pdf)\n\n- (arXiv 2021.12) Align and Prompt: **Video-and-Language** Pre-training with Entity Prompts, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09583.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FALPRO)\n\n- (arXiv 2021.12) DATA **EFFICIENT** **LANGUAGE-SUPERVISED ZEROSHOT RECOGNITION** WITH OPTIMAL TRANSPORT DISTILLATION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09445.pdf)\n\n- (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame **Image Restoration** with Pre-Trained Siamese Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09426.pdf)\n\n- (arXiv 2021.12) Full Transformer Framework for Robust **Point Cloud Registration** with Deep Information Interaction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09385.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FCGuangyan-BIT\u002FDIT)\n\n- (arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning **Vision-Language** Representations with **Limited Resources**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09331.pdf)\n\n- (arXiv 2021.12) Towards End-to-End **Image Compression and Analysis** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09300.pdf)\n\n- (arXiv 2021.12) How to **augment** your ViTs? 
Consistency loss and StyleAug, a random style transfer augmentation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09260.pdf)\n\n- (arXiv 2021.12) Learning to Prompt for **Continual Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08654.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fl2p)\n\n- (arXiv 2021.12) Distilled Dual-Encoder Model for **Vision-Language** Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08723.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkugwzk\u002FDistilled-DualEncoder)\n\n- (arXiv 2021.12) Dense Video **Captioning** Using Unsupervised Semantic Information, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08455.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvalterlej\u002Fdvcusi)\n\n- (arXiv 2021.12) Looking Outside the Box to **Ground Language** in **3D** Scenes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08879.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fnickgkan\u002Fbeauty_detr)\n\n- (arXiv 2021.12) Region**CLIP**: Region-based **Language-Image** Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09106.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FRegionCLIP)\n\n- (arXiv 2021.12) DProST: **6-DoF Object Pose Estimation** Using Space Carving and Dynamic Projective Spatial Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08775.pdf)\n\n- (arXiv 2021.12) Masked Feature Prediction for **Self-Supervised** Visual Pre-Training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09133.pdf)\n\n- (arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for **Visual Commonsense Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08587.pdf)\n\n- (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for **Zero-Shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08643.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshiming-chen\u002FTransZero_pp)\n\n- (arXiv 2021.12) Vision Transformer Based **Video Hashing Retrieval** for Tracing the Source of Fake Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08117.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flajlksdf\u002Fvtl)\n\n- (arXiv 2021.12) Co-training Transformer with Videos and Images Improves **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07175.pdf)\n\n- (arXiv 2021.12) QAHOI: Query-Based Anchors for **Human-Object Interaction** Detection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08647.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fcjw2021\u002FQAHOI)\n\n- (arXiv 2021.12) AdaViT: Adaptive Tokens for **Efficient** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07658.pdf)\n\n- (arXiv 2021.12) **CLIP**-Lite: Information **Efficient** Visual Representation Learning from Textual Annotations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07133.pdf)\n\n- (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on **Unpaired Images and Text**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07074.pdf)\n\n- (arXiv 2021.12) Deep ViT Features as Dense Visual **Descriptors**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05814.pdf), [[Project]](https:\u002F\u002Fdino-vit-features.github.io\u002F)\n\n- (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07374.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Fmikecheninoulu\u002FCGT)\n\n- (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07338.pdf)\n\n- (arXiv 2021.12) COMPOSER: Compositional Learning of **Group Activity** in Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05892.pdf)\n\n- (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for **Micro-Expression Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05851.pdf)\n\n- (arXiv 2021.12) Improving and Diagnosing Knowledge-Based **Visual Question Answering** via Entity Enhanced Knowledge Injection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06888.pdf)\n\n- (arXiv 2021.12) SVIP: **Sequence VerIfication** for Procedures in **Videos**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06447.pdf)\n\n- (arXiv 2021.12) Improving Vision Transformers for **Incremental Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06103.pdf)\n\n- (arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for **Vision-and-Language** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06825.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fylsung\u002FVL_adapter)\n\n- (arXiv 2021.12) Embracing Single Stride **3D Object Detector** with Sparse Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06375.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FTuSimple\u002FSST)\n\n- (arXiv 2021.12) PartGlot: Learning **Shape Part Segmentation** from Language Reference Games, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06390.pdf)\n\n- (arXiv 2021.12) **Pedestrian Trajectory Prediction** via Spatial Interaction Transformer Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06624.pdf)\n\n- (arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR **TEXT-BASED PERSON SEARCH**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06714.pdf)\n\n- (arXiv 2021.12) L-Verse: Bidirectional **Generation** Between **Image** and **Text**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11133.pdf)\n\n- (arXiv 2021.12) **SELF-ATTENTION** DOES NOT NEED O(n^2) MEMORY, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05682.pdf)\n\n- (arXiv 2021.12) Are Vision Transformers **Robust** to Patch Perturbations? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10659.pdf)\n\n- (arXiv 2021.12) Mesa: A **Memory-saving Training** Framework for Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11124.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhuang-group\u002FMesa)\n\n- (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05230.pdf)\n\n- (arXiv 2021.12) MAGMA – Multimodal **Augmentation** of **Generative** Models through Adapter-based Finetuning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05253.pdf)\n\n- (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for **Weakly Supervised Object Localization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05291.pdf)\n\n- (arXiv 2021.12) FaceFormer: **Speech-Driven 3D Facial Animation** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05329.pdf)\n\n- (arXiv 2021.12) Rethinking the Two-Stage Framework for **Grounded Situation Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05375.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkellyiss\u002FSituFormer)\n\n- (arXiv 2021.12) **CLIP**2Style**GAN**: Unsupervised Extraction of StyleGAN Edit Directions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05219.pdf)\n\n- (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling **Attention** Map, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05425.pdf)\n\n- (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for **Vision-Language** Understanding and Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05587.pdf)\n\n- (arXiv 2021.12) Visual Transformers with Primal Object Queries for **Multi-Label Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05485.pdf)\n\n- (arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For **Large-Scale Parallel Training**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14883.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI)\n\n- (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for **Action Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03902.pdf)\n\n- (arXiv 2021.12) Grounded **Language-Image** Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03857.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP)\n\n- (arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for **Image Restoration**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02279.pdf)\n\n- (arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR **POINT CLOUD** ANALYSIS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2112\u002F2112.02507.pdf)\n\n- (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person **Re-identification** Based on Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02466.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FWangTaoAs\u002FPFD_Net)\n\n- (arXiv 2021.12) VT-CLIP: Enhancing **Vision-Language** Models with Visual-guided Texts, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02399.pdf)\n\n- (arXiv 2021.12) PointCLIP: **Point Cloud** Understanding by **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02413.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FPointCLIP)\n\n- (arXiv 2021.12) Learning **Tracking** Representations 
via Dual-Branch Fully Transformer Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02571.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fphiphiphi31\u002FDualTFR)\n\n- (arXiv 2021.12) DYNAMIC TOKEN **NORMALIZATION** IMPROVES VISION TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02624.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fwqshao126\u002FDTN)\n\n- (arXiv 2021.12) PTTR: Relational 3D **Point Cloud Object Tracking** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02857.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FJasonkks\u002FPTTR)\n\n- (arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for **Weakly-supervised Semantic segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02841.pdf)\n\n- (arXiv 2021.12) **Text2Mesh**: Text-Driven Neural Stylization for Meshes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03221.pdf), [[Project]](https:\u002F\u002Fthreedle.github.io\u002Ftext2mesh\u002F)\n\n- (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal **Emotion Recognition** from Unaligned Multimodal Sequences, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01697.pdf)\n\n- (arXiv 2021.12) Make A Long Image Short: Adaptive **Token** Length for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01686.pdf)\n\n- (arXiv 2021.12) FuseDream: Training-Free **Text-to-Image Generation** with Improved **CLIP**+GAN Space Optimization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01573.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgnobitab\u002FFuseDream)\n\n- (arXiv 2021.12) TransZero: Attribute-guided Transformer for **Zero-Shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01683.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshiming-chen\u002FTransZero)\n\n- (arXiv 2021.12) Learning Generalizable **Vision-Tactile** Robotic **Grasping** Strategy for Deformable Objects via Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06374.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGTLIDAR\u002FDeformableObjectsGrasping.git)\n\n- (arXiv 2021.12) Hformer: Hybrid CNN-Transformer for **Fringe Order Prediction** in Phase Unwrapping of Fringe Projection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2112\u002F2112.06759.pdf)\n\n- (arXiv 2021.12) Pre-training and Fine-tuning Transformers for **fMRI Prediction** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05761.pdf)\n\n- (arXiv 2021.12) Transformer based **trajectory prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04350.pdf)\n\n- (arXiv 2021.12) Evaluating Transformers for Lightweight **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09641.pdf)\n\n- (arXiv 2021.12) Contextualized Spatio-Temporal **Contrastive Learning** with Self-Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05181.pdf)\n\n- (arXiv 2021.12) CMA-CLIP: Cross-Modality Attention **CLIP** for **Image-Text** Classification, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03562.pdf)\n\n- (arXiv 2021.12) **Bootstrapping** ViTs: Towards Liberating Vision Transformers from Pre-training, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03552.pdf)\n\n- (arXiv 2021.12) Decision-based Black-box **Attack** Against Vision Transformers via Patch-wise Adversarial Removal, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03492.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FshiyuchengTJU\u002FPAR)\n\n- (arXiv 2021.12) DoodleFormer: Creative **Sketch Drawing** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03258.pdf)\n\n- (arXiv 2021.12) Creating **Multimodal Interactive Agents** with Imitation and Self-Supervised Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03763.pdf)\n\n- (arXiv 2021.12) **AUDIO-VISUAL** SYNCHRONISATION IN THE WILD, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04432.pdf), [[Project]](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Favs)\n\n- (arXiv 2021.12) **Classification**-Then-**Grounding**: Reformulating **Video** Scene Graphs as Temporal Bipartite Graphs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04222.pdf)\n\n- (arXiv 2021.12) Garment4D: **Garment Reconstruction** from Point Cloud Sequences, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04159.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhongfz16\u002FGarment4D)\n\n- (arXiv 2021.12) Locally Shifted **Attention** With Early Global Integration, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05080.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fshellysheynin\u002FLocally-SAG-Transformer)\n\n- (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable **Layout Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05112.pdf)\n\n- (arXiv 2021.12) PE-former: **Pose Estimation** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04981.pdf), [[Project]](https:\u002F\u002Fwww.ics.forth.gr\u002Fhccv\u002F)\n\n- (arXiv 2021.12) Hair**CLIP**: **Design** Your Hair by Text and Reference Image, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05142.pdf), [[Project]](https:\u002F\u002Fgithub.com\u002Fwty-ustc\u002FHairCLIP)\n\n- (arXiv 2021.12) **CLIP**-**NeRF**: Text-and-Image Driven Manipulation of Neural Radiance Fields, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05139.pdf), [[Code]](https:\u002F\u002Fcassiepython.github.io\u002Fclipnerf\u002F)\n\n- (arXiv 2021.12) A Bilingual, Open World Video Text **Dataset** and End-to-end **Video Text Spotter** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04888.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FTransVTSpotter), [[Dataset]](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FBOVText-Benchmark)\n\n- (arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for **Efficient Video Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04674.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fdualformer)\n\n- (arXiv 2021.12) Recurrent Glimpse-based Decoder for **Detection** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04632.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzhechen\u002FDeformable-DETR-REGO)\n\n- (arXiv 2021.12) Fast **Point** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04702.pdf)\n\n- (arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to **Collect Robotic Task Demonstrations**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05129.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fassistive-teleop)\n\n- (arXiv 2021.12) Cross-Modality Fusion Transformer for **Multispectral Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00273.pdf)\n\n- (arXiv 
2021.12) PatchFormer: An **Efficient** **Point** Transformer with Patch Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00207.pdf)\n\n- (arXiv 2021.12) Transformer-Based Approach for Joint **Handwriting** and **Named Entity Recognition** in Historical documents, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04189.pdf)\n\n- (arXiv 2021.12) **MLP** Architectures for **Vision-and-Language** Modeling: An Empirical Study, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04453.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Feasonnie\u002Fmlp-vil)\n\n- (arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for **Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04446.pdf)\n\n- (arXiv 2021.12) Prompting **Visual-Language** Models for Efficient Video Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04478.pdf), [[Project]](https:\u002F\u002Fju-chen.github.io\u002Fefficient-prompt\u002F)\n\n- (arXiv 2021.12) FLAVA: A Foundational **Language And Vision** Alignment Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04482.pdf)\n\n- (arXiv 2021.12) Embedding Arithmetic for **Text-driven Image Transformation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03162.pdf)\n\n- (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for **Referring Image Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02244.pdf)\n\n- (arXiv 2021.12) Look at What I’m Doing: Self-Supervised **Spatial Grounding** of Narrations in Instructional Videos, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10596.pdf), [[Project]](https:\u002F\u002Fcs-people.bu.edu\u002Frxtan\u002Fprojects\u002Fgrounding_narrations\u002F)\n\n- (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for **Generic Perception** for **Zero-shot and Few-shot** Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01522.pdf)\n\n- (arXiv 2021.12) Dense**CLIP**: Language-Guided **Dense** Prediction with Context-Aware Prompting, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01518.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fraoyongming\u002FDenseCLIP)\n\n- (arXiv 2021.12) Self-supervised **Video** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01514.pdf), [[Code]](https:\u002F\u002Fgit.io\u002FJ1juJ)\n\n- (arXiv 2021.12) OW-DETR: **Open-world Detection** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01513.pdf)\n\n- (arXiv 2021.12) Zero-Shot **Text-Guided Object Generation** with Dream Fields, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01455.pdf), [[Project]](https:\u002F\u002Fajayj.com\u002Fdreamfields)\n\n- (arXiv 2021.12) **Video-Text** Pre-training with Learned Regions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01194.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fruiyan1995\u002FRegion_Learner)\n\n- (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for **RGB-D Salient Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01177.pdf)\n\n- (arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for **Spatiotemporal** Predictive Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01085.pdf)\n\n- (arXiv 2021.12) DenseCLIP: Extract Free **Dense** Labels from **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01071.pdf)\n\n- (arXiv 2021.12) TransMEF: A Transformer-Based **Multi-Exposure Image Fusion** Framework using 
Self-Supervised Multi-Task Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01030.pdf)\n\n- (arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer **Tracking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00995.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLitingLin\u002FSwinTrack)\n\n- (arXiv 2021.12) Object-Centric Unsupervised Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00969.pdf)\n\n- (arXiv 2021.12) Vision Pair Learning: An **Efficient** Training Framework for Image **Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00965.pdf)\n\n- (arXiv 2021.12) Visual-Semantic Transformer for **Scene Text Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00948.pdf)\n\n- (arXiv 2021.12) Differentiable **Spatial Planning** using Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01010.pdf), [[Project]](https:\u002F\u002Fdevendrachaplot.github.io\u002Fprojects\u002Fspatial-planning-transformers)\n\n- (arXiv 2021.12) Improved **Multiscale** Vision Transformers for **Classification** and **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01526.pdf)\n\n- (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01527.pdf), [[Code]](https:\u002F\u002Fbowenc0221.github.io\u002Fmask2former)\n\n- (arXiv 2021.12) BEVT: BERT Pretraining of **Video** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01529.pdf)\n\n- (arXiv 2021.12) **Human-Object Interaction Detection** via Weak Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00492.pdf)\n\n- (arXiv 2021.12) Learning Transformer Features for **Image Quality Assessment**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00485.pdf)\n\n- (arXiv 2021.12) **CLIP**styler: **Image Style Transfer** with a Single Text Condition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00374.pdf)\n\n- (arXiv 2021.12) **Multi-View Stereo** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00336.pdf)\n\n- (arXiv 2021.12) VoRTX: **Volumetric 3D Reconstruction** With Transformers for Voxelwise View Selection and Fusion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00236.pdf), [[Code]](https:\u002F\u002Fnoahstier.github.io\u002Fvortx)\n\n- (arXiv 2021.12) Object-aware **Video-language** Pre-training for Retrieval, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00656.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFingerRec\u002FOA-Transformer)\n\n### 2021.11\n\n- (arXiv 2021.11) Multi-modal Transformers Excel at **Class-agnostic** Object **Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11430.pdf), [[Code]](https:\u002F\u002Fgit.io\u002FJ1HPY)\n\n- (arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled **Text-Driven Image Manipulation** Empowered by Pre-Trained Vision-Language Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13333.pdf)\n\n- (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for **Visual Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12994.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FNomMer1125\u002FNomMer)\n\n- (arXiv 2021.11) PolyViT: **Co-training** Vision Transformers on **Images**, **Videos** and **Audio**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12993.pdf)\n\n- (arXiv 2021.11) SWAT: Spatial 
Structure Within and Among Tokens, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13677.pdf)\n\n- (arXiv 2021.11) ADAPTIVE **FOURIER** NEURAL OPERATORS: **EFFICIENT** TOKEN MIXERS FOR TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13587.pdf)\n\n- (arXiv 2021.11) DyTox: Transformers for **Continual Learning** with DYnamic TOken eXpansion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11326.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Farthurdouillard\u002Fdytox)\n\n- (arXiv 2021.11) DABS: A Domain-Agnostic **Benchmark** for **Self-Supervised** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12062.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Falextamkin\u002Fdabs)\n\n- (arXiv 2021.11) Ice hockey **player identification** via transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11535.pdf)\n\n- (arXiv 2021.11) DBIA: Data-free Backdoor Injection **Attack** against Transformer Networks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11870.pdf), [[Code]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FDBIA-825D)\n\n- (arXiv 2021.11) Sparse Fusion for **Multimodal** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11992.pdf)\n\n- (arXiv 2021.11) PhysFormer: **Facial Video-based Physiological Measurement** with Temporal Difference Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12082.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FZitongYu\u002FPhysFormer)\n\n- (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person **Re-Identification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12084.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmichuanhaohao\u002FTransReID-SSL)\n\n- (arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER **ROBUSTNESS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10493.pdf)\n\n- (arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! 
Evaluating Cross-Modal Transfer of **Visio-Linguistic Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10756.pdf)\n\n- (arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified **Vision-Language** Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12085.pdf)\n\n- (arXiv 2021.11) **Semi-Supervised** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11067.pdf)\n\n- (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D **Point Cloud** Processing, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10866.pdf)\n\n- (arXiv 2021.11) ZERO-SHOT CERTIFIED **DEFENSE** AGAINST **ADVERSARIAL** PATCHES WITH VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10481.pdf)\n\n- (arXiv 2021.11) PointMixer: MLP-Mixer for **Point Cloud** Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11187.pdf)\n\n- (arXiv 2021.11) **MetaFormer** is Actually What You Need for Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11418.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fpoolformer)\n\n- (arXiv 2021.11) Florence: A New **Foundation Model** for Computer Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11432.pdf)\n\n- (arXiv 2021.11) Benchmarking **Detection Transfer Learning** with Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11429.pdf)\n\n- (arXiv 2021.11) Learning to **Compose Visual Relations**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09297.pdf), [[Project]](https:\u002F\u002Fcomposevisualrelations.github.io\u002F)\n\n- (arXiv 2021.11) REFERENCE-BASED **MAGNETIC RESONANCE IMAGE RECONSTRUCTION** USING TEXTURE TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09492.pdf)\n\n- (arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for **Instructional Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09276.pdf)\n\n- (arXiv 2021.11) **Swin Transformer V2**: Scaling Up Capacity and Resolution, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09883.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSwin-Transformer)\n\n- (arXiv 2021.11) SimMIM: A Simple Framework for **Masked Image Modeling**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09886.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSimMIM)\n\n- (arXiv 2021.11) Restormer: Efficient Transformer for **High-Resolution Image Restoration**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09881.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fswz30\u002FRestormer)\n\n- (arXiv 2021.11) Simple but Effective: **CLIP** Embeddings for **Embodied AI**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09888.pdf)\n\n- (arXiv 2021.11) ClipCap: CLIP Prefix for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09734.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Frmokady\u002FCLIP_prefix_caption)\n\n- (arXiv 2021.11) TransMix: Attend to **Mix** for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09833.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FBeckschen\u002FTransMix)\n\n- (arXiv 2021.11) TRIG: Transformer-Based **Text Recognizer** with Initial Embedding Guidance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08314.pdf)\n\n- (arXiv 2021.11) Multi-Grained **Vision Language** Pre-Training: Aligning Texts with Visual Concepts, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08276.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzengyan-97\u002FX-VLM)\n\n- (arXiv 2021.11) Explainable Semantic Space by **Grounding Language to Vision** with Cross-Modal Contrastive Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07180.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyizhen-zhang\u002FVG-Bert)\n\n- (arXiv 2021.11) Semantically Grounded Object Matching for Robust **Robotic Scene Rearrangement**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07975.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fapplied-ai-lab\u002Fobject_matching)\n\n- (arXiv 2021.11) **Tracking** People with **3D** Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07868.pdf), [[Code]](https:\u002F\u002Fbrjathu.github.io\u002FT3DP)\n\n- (arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-**image** **Text** **Tuning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07991.pdf)\n\n- (arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE **LANGUAGE-IMAGE** PRE-TRAINING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07783.pdf)\n\n- (arXiv 2021.11) Graph Relation Transformer: Incorporating **pairwise object features** into the Transformer architecture, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.06075.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fderikclive\u002Ftransformers)\n\n- (arXiv 2021.11) **Attention** Approximates Sparse Distributed Memory, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05498.pdf)\n\n- (arXiv 2021.11) SLICED **RECURSIVE** TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05297.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fszq0214\u002FSReT)\n\n- (arXiv 2021.11) HYBRID **BYOL-VIT**: EFFICIENT APPROACH TO DEAL WITH **SMALL DATASETS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.04845.pdf)\n\n- (arXiv 2021.11) Tip-Adapter: Training-free **CLIP**-Adapter for Better **Vision-Language** Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.03930.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgaopengcuhk\u002FTip-Adapter)\n\n- (arXiv 2021.11) Improving Visual Quality of **Image Synthesis** by A Token-based Generator with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.03481.pdf)\n\n- (arXiv 2021.11) Style**CLIP**Draw: Coupling Content and Style in **Text-to-Drawing Synthesis**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.03133.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpschaldenbrand\u002FStyleCLIPDraw)\n\n- (arXiv 2021.11) Revisiting **spatio-temporal** layouts for **compositional action recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01936.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgorjanradevski\u002Frevisiting-spatial-temporal-layouts)\n\n- (arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in **Referential Games**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01785.pdf), [[Code]](https:\u002F\u002Fkampta.github.io\u002Fpatch-game)\n\n- (arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM **CONVOLUTION**? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01353.pdf)\n\n- (arXiv 2021.11) Livestock Monitoring with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00801.pdf)\n\n- (arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal **Egocentric Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01024.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fekazakos\u002FMTCN)\n\n- (arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and **Visual Language Reasoning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13214.pdf), [[Project]](https:\u002F\u002Ficonqa.github.io\u002F)\n\n- (arXiv 2021.11) BoxeR: **Box-Attention** for 2D and 3D Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13087.pdf)\n\n- (arXiv 2021.11) VLDeformer: **Vision-Language** Decomposed Transformer for Fast **Cross-Modal Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11338.pdf)\n\n- (arXiv 2021.11) Multi-Person **3D Motion Prediction** with Multi-Range Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12073.pdf), [[Code]](https:\u002F\u002Fjiashunwang.github.io\u002FMRT\u002F)\n\n- (arXiv 2021.11) Scene Representation Transformer: Geometry-Free **Novel View Synthesis** Through Set-Latent Scene Representations, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13152.pdf), [[Project]](https:\u002F\u002Fsrt-paper.github.io\u002F)\n\n- (arXiv 2021.11) **Global Interaction Modelling** in Vision Transformer via Super Tokens, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13156.pdf)\n\n- (arXiv 2021.11) ML-Decoder: Scalable and Versatile **Classification Head**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12933.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FAlibaba-MIIL\u002FML_Decoder)\n\n- (arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for **Unsupervised Domain Adaptation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12941.pdf)\n\n- (arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13196.pdf)\n\n- (arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for **CLIP** in **Domain Generalization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12853.pdf)\n\n- (arXiv 2021.11) Universal Captioner: Long-Tail **Vision-and-Language** Model Training through Content-Style Separation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12727.pdf)\n\n- (arXiv 2021.11) **Sparse** is Enough in Scaling Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12763.pdf)\n\n- (arXiv 2021.11) An implementation of the “**Guess who**?” game using CLIP, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00599.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FArnauDIMAI\u002FCLIP-GuessWho)\n\n- (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for **Structured Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15143.pdf)\n\n- (arXiv 2021.11) A Unified **Pruning** Framework for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15127.pdf)\n\n- (arXiv 2021.11) Pyramid **Adversarial Training** Improves ViT Performance, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15121.pdf)\n\n- (arXiv 2021.11) AssistSR: Affordance-centric Question-driven **Video Segment Retrieval**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15050.pdf), [[Code & Data]](https:\u002F\u002Fgithub.com\u002FStanLei52\u002FAQVSR)\n\n- (arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for **Domain-Adaptive Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14887.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flhoyer\u002FDAFormer)\n\n- (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for **Efficient** Image Recognition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15668.pdf)\n\n- (arXiv 2021.11) ATS: Adaptive Token Sampling For **Efficient** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15667.pdf)\n\n- (arXiv 2021.11) **CLIP** Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15162.pdf)\n\n- (arXiv 2021.11) CRIS: **CLIP**-Driven Referring Image **Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15174.pdf)\n\n- (arXiv 2021.11) Shunted **Self-Attention** via Multi-Scale Token Aggregation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15193.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOliverRensu\u002FShunted-Transformer)\n\n- (arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept **Self-Supervised** Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15340.pdf)\n\n- (arXiv 2021.11) TransWeather: Transformer-based **Restoration of Images** Degraded by Adverse Weather Conditions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14813.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjeya-maria-jose\u002FTransWeather)\n\n- (arXiv 2021.11) Searching the **Search Space** of Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14725.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FCream)\n\n- (arXiv 2021.11) TransMVSNet: Global Context-aware **Multi-view Stereo** Network with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14600.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMegviiRobot\u002FTransMVSNet)\n\n- (arXiv 2021.11) **Recurrent** Vision Transformer for Solving Visual **Reasoning** Problems, [[Paper]]()\n\n- (arXiv 2021.11) **Video Frame Interpolation** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13817.pdf)\n\n- (arXiv 2021.11) FQ-ViT: Fully **Quantized** Vision Transformer without Retraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13824.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flinyang-zhh\u002FFQ-ViT)\n\n- (arXiv 2021.11) LAFITE : Towards Language-Free Training for **Text-to-Image Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13792.pdf)\n\n- (arXiv 2021.11) SPARSE DETR: **EFFICIENT** END-TO-END OBJECT **DETECTION** WITH LEARNABLE SPARSITY, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14330.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fsparse-detr)\n\n- (arXiv 2021.11) End-to-End **Referring Video Object Segmentation** with Multimodal Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14821.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmttr2021\u002FMTTR)\n\n- (arXiv 2021.11) Point-BERT: Pre-training 3D **Point Cloud** Transformers with Masked Point Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14819.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002Flulutang0608\u002FPoint-BERT)\n\n- (arXiv 2021.11) Zero-Shot **Image-to-Text Generation** for Visual-Semantic Arithmetic, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14447.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYoadTew\u002Fzero-shot-image-to-text)\n\n- (arXiv 2021.11) Blended Diffusion for **Text-driven Editing** of **Natural Images**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14818.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fomriav\u002Fblended-diffusion)\n\n- (arXiv 2021.11) Mask Transfiner for High-Quality **Instance Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13673.pdf), [[Code]](http:\u002F\u002Fvis.xyz\u002Fpub\u002Ftransfiner)\n\n- (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12707.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FVegetebird\u002FMHFormer)\n\n- (arXiv 2021.11) PeCo: Perceptual Codebook for **BERT Pre-training** of Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12710.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPeCo)\n\n- (arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast **High-Resolution Image Generation** from Vector-Quantized Codes, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12701.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsamb-t\u002Funleashing-transformers)\n\n- (arXiv 2021.11) Towards Tokenized **Human Dynamics** Representation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11433.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flikenneth\u002Facton)\n\n- (arXiv 2021.11) **Self-slimmed** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12624.pdf)\n\n- (arXiv 2021.11) VIOLET: End-to-End **Video-Language** Transformers with Masked Visual-token Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12681.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ftsujuifu\u002Fpytorch_violet)\n\n- (arXiv 2021.11) A Lightweight Graph Transformer Network for **Human Mesh Reconstruction** from 2D Human Pose, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12696.pdf)\n\n- (arXiv 2021.11) MorphMLP: A Self-Attention Free, **MLP**-Like Backbone for Image and Video, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12527.pdf)\n\n- (arXiv 2021.11) Octree Transformer: Autoregressive **3D Shape Generation** on Hierarchically Structured Sequences, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12480.pdf)\n\n- (arXiv 2021.11) Hierarchical Modular Network for **Video Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12476.pdf)\n\n- (arXiv 2021.11) NÜWA: **Visual Synthesis Pre-training** for Neural visUal World creAtion, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12417.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FNUWA)\n\n- (arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision **MLP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12294.pdf)\n\n- (arXiv 2021.11) PTQ4ViT: Post-Training **Quantization** Framework for Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12293.pdf)\n\n- (arXiv 2021.11) PU-Transformer: **Point Cloud Upsampling** Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12242.pdf)\n\n- (arXiv 2021.11) Scaling Up **Vision-Language 
Pre-training** for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12233.pdf)\n\n- (arXiv 2021.11) Cerberus Transformer: Joint **Semantic, Affordance and Attribute Parsing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12608.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FOPEN-AIR-SUN\u002FCerberus)\n\n- (arXiv 2021.11) Efficient **Video** Transformers with Spatial-Temporal Token Selection, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11591.pdf)\n\n- (arXiv 2021.11) RedCaps: Web-curated **image-text data** created by the people, for the people, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11431.pdf), [[Project]](https:\u002F\u002Fredcaps.xyz\u002F)\n\n- (arXiv 2021.11) EMScore: Evaluating **Video Captioning** via Coarse-Grained and Fine-Grained Embedding Matching, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08919.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FShiYaya\u002Femscore)\n\n- (arXiv 2021.11) Compositional Transformers for **Scene Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08960.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdorarad\u002Fgansformer)\n\n- (arXiv 2021.11) Vis-TOP: Visual Transformer **Overlay Processor**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10957.pdf)\n\n- (arXiv 2021.11) **Grounded Situation Recognition** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10135.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjhcho99\u002Fgsrtr)\n\n- (arXiv 2021.11) Rethinking **Query, Key, and Value** Embedding in Vision Transformer under **Tiny Model** Constraints, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10017.pdf)\n\n- (arXiv 2021.11) UFO: A UniFied TransfOrmer for **Vision-Language** Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10023.pdf)\n\n- (arXiv 2021.11) Advancing High-Resolution **Video-Language** Representation with Large-Scale Video Transcriptions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10337.pdf)\n\n- (arXiv 2021.11) Combined Scaling for **Zero-shot Transfer Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10050.pdf)\n\n- (arXiv 2021.11) Improved **Robustness** of Vision Transformer via PreLayerNorm in Patch Embedding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08413.pdf)\n\n- (arXiv 2021.11) IBOT: **IMAGE BERT PRE-TRAINING** WITH ONLINE TOKENIZER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07832.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fibot)\n\n- (arXiv 2021.11) **Masked Autoencoders** Are Scalable Vision Learners, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.06377.pdf)\n\n- (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient **Hyperspectral Image Reconstruction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07910.pdf)\n\n- (arXiv 2021.11) Are Transformers More **Robust** Than CNNs?, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05464.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fytongbai\u002FViTs-vs-CNNs)\n\n- (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for **Video-Text Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05610.pdf)\n\n- (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for 
**Vision-and-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05759.pdf)\n\n- (arXiv 2021.11) Improving **Visual Quality** of **Image Synthesis** by A Token-based Generator with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.03481)\n\n- (arXiv 2021.11) VLMO: Unified **Vision-Language** Pre-Training with Mixture-of-Modality-Experts, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02358.pdf), [[Code]](https:\u002F\u002Faka.ms\u002Fvlmo)\n\n- (arXiv 2021.11) LAION-400M: Open **Dataset** of **CLIP**-Filtered 400 Million **Image-Text** Pairs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02114.pdf), [[Project]](https:\u002F\u002Flaion.ai\u002Flaion-400-open-dataset\u002F)\n\n- (arXiv 2021.11) An Empirical Study of **Training** End-to-End **Vision-and-Language** Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02387.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fzdou0830\u002FMETER)\n\n- (arXiv 2021.11) HRViT: **Multi-Scale High-Resolution** Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01236.pdf)\n\n### 2021.10\n\n- (arXiv 2021.10) **Visual Keyword Spotting** with Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15957.pdf), [[Project]](Visual Keyword Spotting with Attention)\n\n- (arXiv 2021.10) Learning **Co-segmentation** by Segment Swapping for Retrieval and Discovery, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15904.pdf), [[Data & Code]](http:\u002F\u002Fimagine.enpc.fr\u002F~shenx\u002FSegSwap\u002F)\n\n- (arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal **Text-Video Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15609.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FLionel-Hing\u002FVSR-Net)\n\n- (arXiv 2021.10) Dispensed Transformer Network for **Unsupervised Domain Adaptation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14944.pdf)\n\n- (arXiv 2021.10) Scatterbrain: Unifying **Sparse** and **Low-rank Attention** Approximation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15343.pdf)\n\n- (arXiv 2021.10) **3D Object Tracking** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14921.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002F3bobo\u002Flttr)\n\n- (arXiv 2021.10) Blending **Anti-Aliasing** into Vision Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15156.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Famazon-research\u002Fanti-aliasing-transformer)\n\n- (arXiv 2021.10) UltraPose: **Synthesizing** Dense Pose with 1 Billion Points by **Human-body** Decoupling **3D** Model, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15267.pdf), [[Data & Code]](https:\u002F\u002Fgithub.com\u002FMomoAILab\u002Fultrapose)\n\n- (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for **Vision-and-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14143.pdf)\n\n- (arXiv 2021.10) Bangla Image **Caption Generation** through CNN-Transformer based Encoder-Decoder Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12442.pdf)\n\n- (arXiv 2021.10) History Aware Multimodal Transformer for **Vision-and-Language Navigation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13309.pdf), 
[[Project]](https:\u002F\u002Fcshizhe.github.io\u002Fprojects\u002Fvln_hamt.html)\n\n- (arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for **Visual Sound Separation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13412.pdf)\n\n- (arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED **EMOTION RECOGNITION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13708.pdf)\n\n- (arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for **Visual Re-ranking**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13430.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMCC-WH\u002FCSA)\n\n- (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for **Skeleton-Based Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13385.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fqtwang0035\u002FIIP-Transformer)\n\n- (arXiv 2021.10) IMAGE-BASED **CLIP**-GUIDED ESSENCE TRANSFER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12427.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fhila-chefer\u002FTargetCLIP)\n\n- (arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic **Attention**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11773.pdf)\n\n- (arXiv 2021.10) ILLITERATE **DALL·E** LEARNS TO COMPOSE, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11405.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fslate-autoencoder), [[Code]](https:\u002F\u002Fgithub.com\u002Fsinghgautam\u002Fslate)\n\n- (arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient **Cross-Modal Retrieval** with Deep Feature Engineering, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11592.pdf)\n\n- (arXiv 2021.10) SOFT: Softmax-free Transformer with **Linear Complexity**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11945.pdf), [[Code]](https:\u002F\u002Ffudan-zvg.github.io\u002FSOFT)\n\n- (arXiv 2021.10) Deep Two-Stream Video Inference for Human Body **Pose** and **Shape Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11680.pdf)\n\n- (arXiv 2021.10) TRANSFORMER **ACCELERATION** WITH DYNAMIC SPARSE ATTENTION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11299.pdf)\n\n- (arXiv 2021.10) CLOOB: MODERN **HOPFIELD** NETWORKS WITH INFOLOOB OUTPERFORM **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11316.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fml-jku\u002Fcloob)\n\n- (arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into **Story Visualization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10834.pdf)\n\n- (arXiv 2021.10) StructFormer: Learning Spatial Structure for **Language-Guided** Semantic **Rearrangement** of Novel Objects, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10189.pdf), [[Project]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fstructformer)\n\n- (arXiv 2021.10) Gophormer: Ego-**Graph** Transformer for **Node Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13094.pdf)\n\n- (arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN **GANS**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13107.pdf), [[Code]](https:\u002F\u002Fnbei.github.io\u002Fstransgan.html)\n\n- (arXiv 2021.10) MVT: Multi-view Vision Transformer for **3D Object Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13083.pdf)\n\n- (arXiv 2021.10) DocTr: 
**Document Image** Transformer for Geometric Unwarping and Illumination Correction, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12942.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ffh2019ustc\u002FDocTr)\n\n- (arXiv 2021.10) Bangla Image **Caption** Generation through CNN-Transformer based Encoder-Decoder Network, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12442.pdf)\n\n- (arXiv 2021.10) WAV2CLIP: LEARNING ROBUST **AUDIO REPRESENTATIONS** FROM **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11499.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Flyrebird-wav2clip)\n\n- (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for **Medical Image Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10403.pdf)\n\n- (arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11316.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fml-jku\u002Fcloob)\n\n- (arXiv 2021.10) AniFormer: Data-driven **3D Animation** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10533.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmikecheninoulu\u002FAniFormer)\n\n- (arXiv 2021.10) **Few-Shot Temporal Action Localization** with Query Adaptive Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10552.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsauradip\u002FfewshotQAT)\n\n- (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for **Hyperspectral Image Classification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11084.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxmm\u002F3D-ANAS-V2)\n\n- (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared **Person Re-identification**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08994.pdf)\n\n- (arXiv 2021.10) 3D-RETR: End-to-End **Single and Multi-View 3D Reconstruction** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08861.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FFomalhautB\u002F3D-RETR)\n\n- (arXiv 2021.10) HRFormer: **High-Resolution** Transformer for **Dense Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09408.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHRNet\u002FHRFormer)\n\n- (arXiv 2021.10) Leveraging MoCap Data for **Human Mesh Recovery**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09243.pdf)\n\n- (arXiv 2021.10) A Good **Prompt** Is Worth Millions of Parameters? 
Low-resource Prompt-based Learning for **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08484.pdf)\n\n- (arXiv 2021.10) ASFormer: Transformer for **Action Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08568.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChinaYi\u002FASFormer)\n\n- (arXiv 2021.10) Multimodal **Dialogue Response Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08515.pdf)\n\n- (arXiv 2021.10) Understanding **Procedural Knowledge** by Sequencing Multimodal Instructional Manuals, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08486.pdf)\n\n- (arXiv 2021.10) COMPOSITIONAL **ATTENTION**: DISENTANGLING SEARCH AND RETRIEVAL, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09419.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fsarthmit\u002FCompositional-Attention)\n\n- (arXiv 2021.10) Spatial-Temporal Transformer for 3D **Point Cloud Sequences**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09783.pdf)\n\n- (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09554.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FHowieMa\u002FTransFusion-Pose)\n\n- (arXiv 2021.10) Unifying Multimodal Transformer for **Bi-directional Image and Text Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09753.pdf)\n\n- (arXiv 2021.10) Transformer with a Mixture of **Gaussian Keys**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08678.pdf)\n\n- (arXiv 2021.10) DIFFUSIONCLIP: **TEXT-GUIDED IMAGE MANIPULATION** USING DIFFUSION MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02711.pdf)\n\n- (arXiv 2021.10) Adversarial **Robustness** Comparison of Vision Transformer and MLP-Mixer to CNNs, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02797.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fphibenz\u002Frobustness_comparison_vit_mlp-mixer_cnn)\n\n- (arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH **SUB-QUADRATIC COMPLEXITY**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02453.pdf)\n\n- (arXiv 2021.10) Certified Patch **Robustness** via Smoothed Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07719.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMadryLab\u002Fsmoothed-vit)\n\n- (arXiv 2021.10) CLIP-Forge: Towards Zero-Shot **Text-to-Shape** Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02624.pdf)\n\n- (arXiv 2021.10) Understanding and Improving **Robustness** of Vision Transformers through Patch-based Negative Augmentation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07858.pdf)\n\n- (arXiv 2021.10) SPARSE MOES MEET **EFFICIENT ENSEMBLES**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03360.pdf)\n\n- (arXiv 2021.10) Shared **Visual Representations** of Drawing for Communication: How do different **biases** affect human interpretability and intent? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08203.pdf)\n\n- (arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for **Sign Language Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05382.pdf)\n\n- (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in **Self-Supervised** Visual Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05340.pdf)\n\n- (arXiv 2021.10) Investigating **Transfer Learning Capabilities** of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2110\u002F2110.05270.pdf)\n\n- (arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE **LANGUAGE-IMAGE** PRE-TRAINING PARADIGM, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05208.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FSense-GVT\u002F)\n\n- (arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for **Video Caption**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05204.pdf)\n\n- (arXiv 2021.10) Transformer-based Dual Relation Graph for **Multi-label Image Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04722.pdf)\n\n- (arXiv 2021.10) VECTOR-QUANTIZED **IMAGE MODELING** WITH IMPROVED VQGAN, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04627.pdf)\n\n- (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for **3D Human Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05092.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Flelexx\u002FMTF-Transformer)\n\n- (arXiv 2021.10) NVIT: VISION TRANSFORMER **COMPRESSION** AND **PARAMETER REDISTRIBUTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04869.pdf)\n\n- (arXiv 2021.10) 6D-ViT: Category-Level **6D Object Pose Estimation** via Transformer-based Instance Representation Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04792.pdf)\n\n- (arXiv 2021.10) CLIP-Adapter: Better **Vision-Language** Models with Feature Adapters, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04544.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fgaopengcuhk\u002FCLIP-Adapter)\n\n- (arXiv 2021.10) ATISS: Autoregressive Transformers for **Indoor Scene Synthesis**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03675.pdf), [[Code]](https:\u002F\u002Fnv-tlabs.github.io\u002FATISS)\n\n- (arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND **MOBILE**-FRIENDLY VISION TRANSFORMER, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02178.pdf)\n\n- (arXiv 2021.10) **TOKEN POOLING** IN VISION TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03860.pdf)\n\n- (arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED **OBJECT DETECTOR**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03921.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fnaver-ai\u002Fvidt)\n\n- (arXiv 2021.10) CLIP4Caption: CLIP for **Video Caption**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06615.pdf)\n\n- (arXiv 2021.10) **OBJECT**-REGION **VIDEO** TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06915.pdf), [[Code]](https:\u002F\u002Froeiherz.github.io\u002FORViT\u002F)\n\n- (arXiv 2021.10) LEVERAGING **REDUNDANCY** IN ATTENTION WITH REUSE TRANSFORMERS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06821.pdf)\n\n- (arXiv 2021.10) **Dynamic Inference** with Neural Interpreters, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06399.pdf)\n\n- (arXiv 2021.10) A CLIP-Enhanced Method for **Video-Language** Understanding, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07137.pdf)\n\n- (arXiv 2021.10) **Visual Relationship Detection** Using Part-and-Sum Transformers with Composite Queries, [[Paper]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FDong_Visual_Relationship_Detection_Using_Part-and-Sum_Transformers_With_Composite_Queries_ICCV_2021_paper.pdf)\n\n- (arXiv 2021.10) Discovering Human **Interactions** with Large-Vocabulary Objects via Query and Multi-Scale Detection, [[Paper]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FWang_Discovering_Human_Interactions_With_Large-Vocabulary_Objects_via_Query_and_Multi-Scale_ICCV_2021_paper.pdf)\n\n- (arXiv 2021.10) Learning Structural Representations for **Recipe Generation** and **Food Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.01209.pdf)\n\n- (arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR **FINE-GRAINED VISUAL RECOGNITION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2110\u002F2110.01240.pdf)\n\n### 2021.09\n- (arXiv 2021.09) Joint Multimedia **Event Extraction** from Video and Article, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12776.pdf)\n\n- (arXiv 2021.09) Long-Range Transformers for **Dynamic Spatiotemporal Forecasting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12218.pdf)\n\n- (arXiv 2021.09) **Visually Grounded Concept** Composition, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14115.pdf)\n\n- (arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic **Event Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.15170.pdf)\n\n- (arXiv 2021.09) CCTrans: Simplifying and Improving **Crowd Counting** with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14483.pdf)\n\n- (arXiv 2021.09) UFO-ViT: High Performance **Linear** Vision Transformer **without Softmax**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14382.pdf)\n\n- (arXiv 2021.09) **Infrared Small-Dim Target Detection** with Transformer under Complex Backgrounds, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14379.pdf)\n\n- (arXiv 2021.09) **Localizing Objects** with Self-Supervised Transformers and no Labels, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14279.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FLOST)\n\n- (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14137.pdf)\n\n- (arXiv 2021.09) VideoCLIP: Contrastive Pre-training for **Zero-shot Video-Text Understanding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14084.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ffairseq\u002Fexamples\u002FMMPT)\n\n- (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of **State Variables in Ising Models**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.13925.pdf)\n\n- (arXiv 2021.09) CLIP-It! 
Language-Guided **Video Summarization**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.00650.pdf), [[Project]](https:\u002F\u002Fmedhini.github.io\u002Fclip_it)\n\n- (arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D **FACIAL EXPRESSION RECOGNITION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.13086.pdf)\n\n- (arXiv 2021.09) Sparse Spatial Transformers for **Few-Shot Learning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12932.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fchenhaoxing\u002FSSFormers)\n\n- (arXiv 2021.09) Vision Transformer Hashing for **Image Retrieval**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12564.pdf)\n\n- (arXiv 2021.09) PETA: **Photo Albums Event Recognition** using Transformers Attention, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12499.pdf)\n\n- (arXiv 2021.09) MLIM: **VISION-AND-LANGUAGE** MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12178.pdf)\n\n- (arXiv 2021.09) Dense Contrastive **Visual-Linguistic** Pretraining, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11778.pdf)\n\n- (arXiv 2021.09) CPT: COLORFUL **PROMPT TUNING** FOR PRE-TRAINED VISION-LANGUAGE MODELS, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11797.pdf)\n\n- (arXiv 2021.09) Localizing ∞-shaped fishes: **Sketch-guided object localization** in the wild, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11874.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpriba\u002Fsgol_wild)\n\n- (arXiv 2021.09) CLIPORT: What and Where Pathways for **Robotic Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12098.pdf), [[Project]](https:\u002F\u002Fcliport.github.io\u002F), [[Code]](https:\u002F\u002Fgithub.com\u002Fcliport\u002Fcliport)\n\n- (arXiv 2021.09) GraFormer: Graph Convolution Transformer for **3D Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08364.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGraformer\u002FGraFormer)\n\n- (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for **Visual Dialogue Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08478.pdf)\n\n- (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based **Facial Expression Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08409.pdf), [[Code]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FATSE-C58B)\n\n- (arXiv 2021.09) LOTR: **Face Landmark Localization** Using Localization Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10057.pdf)\n\n- (arXiv 2021.09) Dyadformer: A **Multi-modal** Transformer for Long-Range Modeling of Dyadic Interactions, [[Paper]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2109\u002F2109.09487.pdf)\n\n- (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for **Dense Image Prediction**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08963.pdf)\n\n- (arXiv 2021.09) KD-VLP: Improving End-to-End **Vision-and-Language Pretraining** with Object Knowledge Distillation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10504.pdf)\n\n- (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10948.pdf)\n\n- (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for **Person Re-Identification**, 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11159.pdf)\n\n- (arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR **OBJECT DETECTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10852.pdf)\n\n- (arXiv 2021.09) ActionCLIP: A New Paradigm for **Video Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08472.pdf)\n\n- (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for **Scene Graph Generation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.05346.pdf)\n\n- (arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for **Human Performance Rendering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07448.pdf), [[Code]](https:\u002F\u002Fyoungjoongunc.github.io\u002Fnhp\u002F)\n\n- (arXiv 2021.09) **Anchor DETR**: Query Design for Transformer-Based Detector, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07107.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmegvii-model\u002FAnchorDETR)\n\n- (arXiv 2021.09) An End-to-End Transformer Model for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08141.pdf), [[Code]](https:\u002F\u002Ffacebookresearch.github.io\u002F3detr)\n\n- (arXiv 2021.09) Hybrid Local-Global Transformer for **Image Dehazing**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07100.pdf)\n\n- (arXiv 2021.09) Semi-Supervised Wide-Angle **Portraits Correction** by Multi-Scale Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08024.pdf)\n\n- (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image **Captioning**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07799.pdf)\n\n- (arXiv 2021.09) Pose Transformers (POTR): **Human Motion Prediction** with Non-Autoregressive Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07531.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fidiap\u002Fpotr)\n\n- (arXiv 2021.09) PnP-DETR: Towards **Efficient** Visual Analysis with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07036.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Ftwangnh\u002Fpnp-detr)\n\n- (arXiv 2021.09) Learning to **Ground** Visual Objects for Visual Dialog, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.06013.pdf)\n\n- (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for **Video Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.06085.pdf), [[Code]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmengcao\u002Fpublication\u002Fgtr)\n\n- (arXiv 2021.09) CDTrans: Cross-domain Transformer for **Unsupervised Domain Adaptation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.06165.pdf)\n\n- (arXiv 2021.09) IS ATTENTION BETTER THAN **MATRIX DECOMPOSITION**? 
[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04553.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FGsunshine\u002FEnjoy-Hamburger)\n\n- (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for **Video Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04735.pdf)\n\n- (arXiv 2021.09) Line as a Visual Sentence: Context-aware **Line Descriptor** for Visual Localization, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04753.pdf)\n\n- (arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for **Temporal Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04872.pdf)\n\n- (arXiv 2021.09) LAViTeR: Learning Aligned **Visual and Textual** Representations Assisted by Image and Caption Generation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04993.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fmshaikh2\u002FLaViTeR)\n\n- (arXiv 2021.09) Panoptic Narrative **Grounding**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04988.pdf)\n\n- (arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based **VQA**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.05014.pdf)\n\n- (arXiv 2021.09) PlaTe: **Visually-Grounded Planning** with Transformers in Procedural Tasks, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04869.pdf), [[Project]](https:\u002F\u002Fwww.pair.toronto.edu\u002Fplate-planner\u002F)\n\n- (arXiv 2021.09) **EfficientCLIP**: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04699.pdf)\n\n- (arXiv 2021.09) **Scaled ReLU** Matters for **Training** Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.03810.pdf)\n\n- (arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for **Video Inpainting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02974.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fruiliu-ai\u002FFuseFormer)\n\n- (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for **Action Recognition**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02860.pdf)\n\n- (arXiv 2021.09) WHYACT: Identifying **Action Reasons** in Lifestyle **Vlogs**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02747.pdf)\n\n- (arXiv 2021.09) Zero-Shot **Open Set Detection** by Extending **CLIP**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02748.pdf)\n\n- (arXiv 2021.09) Towards Transferable **Adversarial Attacks** on Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04176.pdf)\n\n- (arXiv 2021.09) Learning to **Prompt** for **Vision-Language** Models, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.01134), [[Code]](https:\u002F\u002Fgithub.com\u002FKaiyangZhou\u002FCoOp)\n\n- (arXiv 2021.09) Improving **Video-Text Retrieval** by Multi-Stream Corpus Alignment and Dual Softmax Loss, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04290.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fstarmemda\u002FCAMoW\u002F)\n\n- (arXiv 2021.09) UCTransNet: Rethinking the **Skip Connections in U-Net** from a Channel-wise Perspective with Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04335.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FMcGregorWwww\u002FUCTransNet)\n\n- (arXiv 2021.09) ConvMLP: Hierarchical Convolutional **MLPs** for Vision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04454.pdf), 
[[Code]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FConvolutional-MLPs)\n\n- (arXiv 2021.09) TxT: **Crossmodal** End-to-End Learning with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04422.pdf)\n\n- (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On **Cross-Modal Influence** in Multimodal Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04448.pdf)\n\n- (arXiv 2021.09) **Sparse**-MLP: A Fully-**MLP** Architecture with Conditional Computation, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02008.pdf)\n\n- (arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential **Manipulation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.03891.pdf), [[Project]](https:\u002F\u002Fwentaoyuan.github.io\u002Fsornet)\n\n- (arXiv 2021.09) Audio-Visual Transformer Based **Crowd Counting**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.01926.pdf)\n\n- (arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for **Visual Question Answering**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.01934.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fpratyay-banerjee\u002Fweak_sup_vqa)\n\n- (arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE **SUPER-RESOLUTION**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02079.pdf)\n\n- (arXiv 2021.09) CTRL-C: **Camera calibration** TRansformer with Line-Classification, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02259.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fjwlee-vcl\u002FCTRL-C)\n\n- (arXiv 2021.09) Learning to Generate **Scene Graph** from Natural Language Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02227.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FYiwuZhong\u002FSGG_from_NLS)\n\n- (arXiv 2021.09) The Animation Transformer: Visual **Correspondence** via Segment Matching, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02614.pdf)\n\n- (arXiv 2021.09) Voxel Transformer for **3D Object Detection**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02497.pdf)\n\n- (ICCV 2021.09) **3D Human Texture Estimation** from a Single Image with Transformers, [[Paper]](http:\u002F\u002Fpersonal.ie.cuhk.edu.hk\u002F~ccloy\u002Ffiles\u002Ficcv_2021_texformer.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxuxy09\u002FTexformer)\n\n- (arXiv 2021.09) Encoder-decoder with Multi-level Attention for **3D Human Shape and Pose Estimation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02303.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fziniuwan\u002Fmaed)\n\n- (arXiv 2021.09) Joint Graph Learning and Matching for **Semantic Feature Correspondence**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.00240.pdf)\n\n- (arXiv 2021.09) Searching for **Efficient** Multi-Stage Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.00642.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fyilunliao\u002Fvit-search)\n\n### 2021.08\n- (arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for **Generalized Zero-shot Semantic Segmentation**, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.12517.pdf)\n\n- (arXiv 2021.08) GroupFormer: **Group Activity Recognition** with Clustered Spatial-Temporal Transformer, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.12630.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002Fxueyee\u002FGroupFormer)\n\n- (arXiv 2021.08) **A Battle of Network Structures**: An 
Empirical Study of CNN, Transformer, and MLP, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.13002.pdf)\n\n- (arXiv 2021.08) Exploring and Improving **Mobile** Level Vision Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.13015.pdf)\n\n- (arXiv 2021.08) Cross-category **Video Highlight Detection** via Set-based Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11770.pdf), [[Code]](https:\u002F\u002Fgithub.com\u002FChrisAllenMing\u002FCross_Category_Video_Highlight)\n\n- (arXiv 2021.08) Shifted Chunk Transformer for **Spatio-Temporal** Representational Learning, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11575.pdf)\n\n- (arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for **Vision-and-Language Navigation** in Continuous Environments, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11945.pdf)\n\n- (arXiv 2021.08) LocTex: Learning **Data-Efficient** Visual **Representations** from Localized Textual Supervision, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11950.pdf), [[Project]](https:\u002F\u002Floctex.mit.edu\u002F)\n\n- (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based **Detection** Heads, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.09691.pdf)\n\n- (arXiv 2021.08) SIMVLM: SIMPLE **VISUAL LANGUAGE** MODEL PRETRAINING WITH WEAK SUPERVISION, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.10904.pdf)\n\n- (arXiv 2021.08) TransFER: Learning Relation-aware **Facial Expression Representations** with Transformers, [[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11116)\n","# 视觉中的Transformer\n近期基于Transformer的计算机视觉及相关工作。欢迎评论\u002F贡献！\n\nTransformer如今已成为基础组件，几乎被所有AI模型所采用。保持更新 --> 更新不规律。\n\n新希望：[LLM-in-Vision](https:\u002F\u002Fgithub.com\u002FDirtyHarryLYL\u002FLLM-in-Vision)\n\n## 资源\n\n- **ChatGPT** 用于 **机器人技术**：设计原则与模型能力，[[论文]](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fuploads\u002Fprod\u002F2023\u002F02\u002FChatGPT___Robotics.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPromptCraft-Robotics)\n\n- DIFFUSIONDB [[页面]](https:\u002F\u002Fpoloclub.github.io\u002Fdiffusiondb)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14896.pdf)\n\n- LAION-5B [[页面]](https:\u002F\u002Flaion.ai\u002Flaion-5b-a-new-era-of-open-large-scale-multi-modal-datasets\u002F)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08402.pdf)\n\n- LAVIS [[页面]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09019.pdf)\n\n- Imagen Video [[页面]](https:\u002F\u002Fimagen.research.google\u002Fvideo\u002F)，[[论文]](https:\u002F\u002Fimagen.research.google\u002Fvideo\u002Fpaper.pdf)\n\n- Phenaki [[页面]](https:\u002F\u002Fphenaki.video\u002F)，[[论文]](https:\u002F\u002Fopenreview.net\u002Fpdf?id=vOEXS39nOF)\n\n- DREAMFUSION [[页面]](https:\u002F\u002Fdreamfusion3d.github.io\u002F)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14988.pdf)\n\n- MAKE-A-VIDEO [[页面]](https:\u002F\u002Fmake-a-video.github.io\u002F)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14792.pdf)\n\n- Stable Diffusion [[页面]](https:\u002F\u002Fommer-lab.com\u002Fresearch\u002Flatent-diffusion-models\u002F)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10752.pdf)\n\n- NUWA-Infinity [[页面]](https:\u002F\u002Fnuwa-infinity.microsoft.com\u002F#\u002F)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09814.pdf)\n\n- Parti 
[[页面]](https:\u002F\u002Fparti.research.google\u002F)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fparti)\n\n- Imagen [[页面]](https:\u002F\u002Fimagen.research.google\u002F)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11487.pdf)\n\n- Gato：一种通用智能体，[[论文]](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FA%20Generalist%20Agent\u002FGeneralist%20Agent.pdf)\n\n- PaLM：通过Pathways扩展语言建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02311.pdf)\n\n- DALL·E 2 [[页面]](https:\u002F\u002Fopenai.com\u002Fdall-e-2\u002F)，[[论文]](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-2.pdf)\n\n- SCENIC：一款用于计算机视觉研究及其他领域的JAX库，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic)\n\n- V-L联合学习研究（附有优秀表格）：[[METER]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02387.pdf)，[[Kaleido-BERT]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.16110.pdf)\n\n- “Attention is all you need”，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1706.03762.pdf)\n\n- CLIP [[页面]](https:\u002F\u002Fopenai.com\u002Fblog\u002Fclip\u002F)，[[论文]](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002FLearning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)，[[arXiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.00020.pdf)\n\n- DALL·E [[页面]](https:\u002F\u002Fopenai.com\u002Fblog\u002Fdall-e\u002F)，[[代码]](https:\u002F\u002Fgithub.com\u002Fopenai\u002FDALL-E)，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.12092.pdf)\n\n- [huggingface\u002Ftransformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n\n- [Kyubyong\u002Ftransformer](https:\u002F\u002Fgithub.com\u002FKyubyong\u002Ftransformer)，TF\n\n- [jadore801120\u002Fattention-is-all-you-need-pytorch](https:\u002F\u002Fgithub.com\u002Fjadore801120\u002Fattention-is-all-you-need-pytorch)，Torch\n\n- [krasserm\u002Ffairseq-image-captioning](https:\u002F\u002Fgithub.com\u002Fkrasserm\u002Ffairseq-image-captioning)\n\n- [PyTorch Transformers 教程](https:\u002F\u002Fgithub.com\u002Fabhimishra91\u002Ftransformers-tutorials)\n\n- [ictnlp\u002Fawesome-transformer](https:\u002F\u002Fgithub.com\u002Fictnlp\u002Fawesome-transformer)\n\n- [basicv8vc\u002Fawesome-transformer](https:\u002F\u002Fgithub.com\u002Fbasicv8vc\u002Fawesome-transformer)\n\n- [dk-liang\u002FAwesome-Visual-Transformer](https:\u002F\u002Fgithub.com\u002Fdk-liang\u002FAwesome-Visual-Transformer)\n\n- [yuewang-cuhk\u002Fawesome-vision-language-pretraining-papers](https:\u002F\u002Fgithub.com\u002Fyuewang-cuhk\u002Fawesome-vision-language-pretraining-papers)\n\n## 调查报告\n\n- (arXiv 2023.2) 基于 Transformer 的 **传感器融合** 在 **自动驾驶** 中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11481.pdf)，[[页面]](https:\u002F\u002Fgithub.com\u002FApoorvRoboticist\u002FTransformers-Sensor-Fusion)\n\n- (arXiv 2023.2) 深度学习在 **视频-文本检索** 中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12552.pdf)\n\n- (arXiv 2023.2) 大规模 **多模态预训练模型**：全面综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10035.pdf)\n\n- (arXiv 2023.2) 计算机视觉中基于 Transformer 的 **生成对抗网络**：全面综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08641.pdf)\n\n- (arXiv 2023.2) 视觉 Transformer 中的 **知识蒸馏**：批判性评论，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2302\u002F2302.02108.pdf)\n\n- (arXiv 2023.2) Transformer 高效训练方法综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01107.pdf)\n\n- (arXiv 2023.1) ChatGPT 并非万能。关于 **大型生成式 AI 模型** 
的最新进展综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04655.pdf)\n\n- (arXiv 2022.12) Transformer 在 **动作识别** 中的应用：时间建模综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01921.pdf)\n\n- (arXiv 2022.11) 视觉 Transformer 在 **医学影像** 中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.10043.pdf)\n\n- (arXiv 2022.11) 关于 **知识** 增强的 **多模态** 学习综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12328.pdf)\n\n- (arXiv 2022.10) 视觉-语言预训练：基础、最新进展及未来趋势，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09263.pdf)\n\n- (arXiv 2022.10) 计算机视觉中图神经网络与 **图** Transformer 的综述：任务导向视角，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13232.pdf)\n\n- (arXiv 2022.09) 视觉 Transformer 在 **动作识别** 中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05700.pdf)\n\n- (arXiv 2022.09) Transformer 在 **遥感** 中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01206.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVIROBO-15\u002FTransformer-in-Remote-Sensing)\n\n- (arXiv 2022.08) 使用 Transformer 进行 **3D 视觉**：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04309.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flahoud\u002F3d-vision-transformers)\n\n- (arXiv 2022.08) 自监督学习中用于视觉及其他领域的 **掩码自编码器** 综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00173.pdf)\n\n- (arXiv 2022.07) 视觉 Transformer：现状与研究挑战，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03041.pdf)\n\n- (arXiv 2022.07) **自监督** 学习在 **视频** 中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00419.pdf)\n\n- (arXiv 2022.06) Transformer 在 **多模态** 学习中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06488.pdf)\n\n- (arXiv 2022.05) 视觉 Transformer：**ViT** 及其 **衍生模型**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11239.pdf)\n\n- (arXiv 2022.05) Transformer 在 3D **点云** 数据中的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07417.pdf)\n\n- (arXiv 2022.04) 深度学习中 **视觉注意力** 方法的深入综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07756.pdf)\n\n- (arXiv 2022.04) **视觉-语言** 预训练模型综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07356.pdf)\n\n- (arXiv 2022.03) **大模型** 发展路线图，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14101.pdf)\n\n- (arXiv 2022.03) Transformer 与 **视觉** 学习理解的结合：全面回顾，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12944.pdf)\n\n- (arXiv 2022.03) 视觉 Transformer 的最新进展：综述及未来展望，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01536.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002Fkhawar512\u002FViT-Survey)\n\n- (arXiv 2022.02) **视觉-语言** 预训练模型综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10936.pdf)\n\n- (arXiv 2022.02) VLP：**视觉-语言** 预训练综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09061.pdf)\n\n- (arXiv 2022.02) 图 Transformer：从架构角度概述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.08455.pdf)\n\n- (arXiv 2022.01) **视频** Transformer：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05991.pdf)\n\n- (arXiv 2021.11) 我们是否已准备好迎接新的范式转变？关于视觉深度 **MLP** 的综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.04060.pdf)\n\n- (arXiv 2021.11) 视觉 Transformer 综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.06091.pdf)\n\n- (arXiv 2021.09) 基于 Transformer 的 **视频-语言** 预训练综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.09920.pdf)\n\n- (arXiv 2021.06) **Transformer** 综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.04554.pdf)\n\n- (arXiv 2021.06) 机器视觉中的 **注意力** 机制与深度学习：现状综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.07550.pdf)\n\n- (arXiv 2021.06) 
**预训练模型**：过去、现在与未来，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.07139.pdf)\n\n- (arXiv 2021.05) 注意力机制能否使 **MLP** 追赶上 CNN？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2105.15078.pdf)\n\n- (arXiv 2021.03) 关于更 **快速** 更 **轻量** 的 Transformer 的实用综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.14636.pdf)\n\n- (arXiv 2021.03) 跨模态任务中 **语言与视觉** 的 Transformer 架构视角与前景，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.04037.pdf)\n\n- (arXiv 2021.01) 视觉 Transformer 综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.12556.pdf)\n\n- (arXiv 2020.9) **高效** Transformer：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2009.06732.pdf)\n\n- (arXiv 2020.1) **Transformer 在视觉领域** 的应用：综述，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.01169.pdf)\n\n## 最新论文\n\n### 2023.8\n\n- (arXiv 2023.8) VL-PET：通过粒度控制实现视觉-语言参数 **高效微调**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.09804)，[[项目]](https:\u002F\u002Fhenryhzy.github.io\u002FVL-PET\u002F)\n\n### 2023.5\n\n- (arXiv 2023.5) 利用有效感受野理解视觉 Transformer 的高斯 **注意力** 偏置，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04722.pdf)\n\n### 2023.3\n\n- (arXiv 2023.3) 用于 **时刻检索** 和 **精彩片段检测** 的查询依赖型 **视频** 表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.13874.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwjun0830\u002FQD-DETR)\n\n### 2023.2\n\n- (arXiv 2023.2) **开放域视觉实体识别**：迈向识别数百万个维基百科实体，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11154.pdf)\n\n- (arXiv 2023.2) KS-DETR：用于 **检测** Transformer 的注意力学习中的知识共享，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11208.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fedocanonymous\u002FKS-DETR)\n\n- (arXiv 2023.2) HUMAN MOTIONFORMER：利用视觉 Transformer **迁移** 人体 **运动**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11306.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKumapowerLIU\u002FHuman-MotionFormer)\n\n- (arXiv 2023.2) 利用**人类反馈**对**文本到图像**模型进行对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12192.pdf)\n\n- (arXiv 2023.2) 基于扩散先验的可控与条件化**文本到图像**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11710.pdf)\n\n- (arXiv 2023.2) 预训练的视觉与语言模型能否回答**视觉信息查询问题**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11713.pdf)，[[代码]](https:\u002F\u002Fopen-vison-language.github.io\u002Finfoseek)\n\n- (arXiv 2023.2) 通过解耦物体动力学与交互实现以物体为中心的**视频预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11850.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Focvp-vp)\n\n- (arXiv 2023.2) 分布归一化：一种“无需努力”的对比学习型**视觉-语言**模型**测试时增强**方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11084.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffengyuli2002\u002Fdistribution-normalization)\n\n- (arXiv 2023.2) 教**CLIP**数到十，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12066.pdf)，[[项目]](https:\u002F\u002Fteaching-clip-to-count.github.io\u002F)\n\n- (arXiv 2023.2) 为**文本到图像**模型的快速个性化设计编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12228.pdf)，[[项目]](https:\u002F\u002Ftuning-encoder.github.io\u002F)\n\n- (arXiv 2023.2) 用于**开放词汇语义分割**的侧边适配网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12242.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMendelXu\u002FSAN)\n\n- (arXiv 2023.2) 通过**语言引导采样**学习视觉表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12248.pdf)\n\n- (arXiv 2023.2) VoxFormer：基于稀疏体素的变换器，用于基于摄像头的**3D语义场景补全**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12251.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer)\n\n- (arXiv 2023.2) 
面向**机器人技术**的语言驱动表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12766.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fvoltron-robotics)\n\n- (arXiv 2023.2) 用于侧扫**声呐**数据**语义分割**的卷积视觉变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12416.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhayatrajani\u002Fs3seg-vit)\n\n- (arXiv 2023.2) 具有高效变换器和CNN的**轻量级**实时语义**分割**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10484.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIVIPLab\u002FLETNet)\n\n- (arXiv 2023.2) VIEWCO：通过多视角语义一致性发现**文本监督**的**分割**掩码，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10307.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpzhren\u002FViewCo)\n\n- (arXiv 2023.2) CertViT：预训练视觉变换器的认证**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10287.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsagarverma\u002Ftransformer-lipschitz)\n\n- (arXiv 2023.2) Paparazzi：深入探讨语言与视觉模型在**视点描述定位**方面的能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10282.pdf)\n\n- (arXiv 2023.2) MaskedKD：使用**掩码**图像高效蒸馏视觉变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10494.pdf)\n\n- (arXiv 2023.2) 一种由全局亲和力引导的通用视觉表征框架，用于**弱监督显著目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10697.pdf)\n\n- (arXiv 2023.2) ViTA：面向**边缘**应用的视觉变换器**推理加速器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09108.pdf)\n\n- (arXiv 2023.2) 基于PSO-ConvNet Transformer的动力学协作学习进行**视频动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09187.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fleonlha\u002FVideo-Action-Recognition-via-PSO-ConvNet-Transformer-Collaborative-Learning-with-Dynamics)\n\n- (arXiv 2023.2) ChatGPT和DALL-E 2在**决策制定**和**空间推理**方面的试点**评估**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2302\u002F2302.09068.pdf)\n\n- (arXiv 2023.2) StyLIP：基于**CLIP**的跨域**领域泛化**的多尺度风格条件提示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09251.pdf)\n\n- (arXiv 2023.2) 用于跨域**少样本**学习的元风格对抗训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09309.pdf)\n\n- (arXiv 2023.2) HYNETER：用于物体**检测**的混合网络变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09365.pdf)\n\n- (arXiv 2023.2) STOA-VLP：用于**视频-语言**预训练的物体与动作时空建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09736.pdf)\n\n- (arXiv 2023.2) 部分监督下**时间句子定位**的约束与联合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09850.pdf)\n\n- (arXiv 2023.2) STB-VMM：基于Swin Transformer的**视频运动放大**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10001.pdf)\n\n- (arXiv 2023.2) 基于多粒度对齐的**时尚图像检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08902.pdf)\n\n- (arXiv 2023.2) LayoutDiffuse：将基础扩散模型应用于**布局到图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08908.pdf)\n\n- (arXiv 2023.2) CK-Transformer：用于**指代表达理解**的常识知识增强型变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.09027.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFightingFighting\u002FCK-Transformer)\n\n- (arXiv 2023.2) MaskSketch：无配对结构引导的掩码**图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05496.pdf)\n\n- (arXiv 2023.2) 单一**运动****扩散**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05905.pdf)，[[代码]](https:\u002F\u002Fsinmdm.github.io\u002FSinMDM-page)\n\n- (arXiv 2023.2) 用于基于视觉的**3D语义占用预测**的三视角观，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07817.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FTPVFormer)\n\n- (arXiv 2023.2) ANSEL Photobot：一款具有语义智能的**机器人****活动摄影师**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07931.pdf)\n\n- (arXiv 2023.2) 
ForceFormer：探索社会力与变换器在**行人轨迹预测**中的应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07583.pdf)\n\n- (arXiv 2023.2) 投影潜在空间中的**视频**概率**扩散**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07685.pdf)\n\n- (arXiv 2023.2) 数据集接口：利用可控反事实生成**诊断模型故障**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07865.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMadryLab\u002Fdataset-interfaces)\n\n- (arXiv 2023.2) 学习如何在**食谱**中替换食材，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07960.pdf)\n\n- (arXiv 2023.2) **能量**变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07253.pdf)\n\n- (arXiv 2023.2) Efficiency 360：高效的视觉变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08374.pdf)\n\n- (arXiv 2023.2) 自助式**提示调优**（APT）：通过可组合的`提示`结合不同数据，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07994.pdf)\n\n- (arXiv 2023.2) 基于扩散模型的有效数据增强，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07944.pdf)，[[项目]](https:\u002F\u002Fbtrabuc.co\u002Fda-fusion)\n\n- (arXiv 2023.2) PRedItOR：基于扩散先验的文本引导图像编辑，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07979.pdf)\n\n- (arXiv 2023.2) TcGAN：语义感知且保持结构的GAN，结合独立视觉Transformer实现快速任意单样本图像生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08047.pdf)\n\n- (arXiv 2023.2) 用于RGB-D显著性目标检测的层次化跨模态Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08052.pdf)\n\n- (arXiv 2023.2) MINOTAUR：多任务多模态查询视频定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08063.pdf)\n\n- (arXiv 2023.2) 通过结构重参数化实现高效的视觉适应，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08106.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fluogen1996\u002FRepAdapter)\n\n- (arXiv 2023.2) 利用视觉Transformer进行高效的3D物体重建，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08474.pdf)\n\n- (arXiv 2023.2) 检索增强型图像字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08268.pdf)\n\n- (arXiv 2023.2) 基于Transformer模型的鲁棒人体运动预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.08274.pdf)\n\n- (arXiv 2023.2) VQ3D：在ImageNet上学习一个3D感知的生成模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06833.pdf)，[[项目]](https:\u002F\u002Fkylesargent.github.io\u002Fvq3d)\n\n- (arXiv 2023.2) UKnow：一种用于常识推理和视觉-语言预训练的统一知识协议，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06891.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGongggg\u002FUKnow)\n\n- (arXiv 2023.2) 对浅层视觉Transformer的理论理解：学习、泛化与样本复杂度，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06015.pdf)\n\n- (arXiv 2023.2) 一种简单的零样本提示加权技术，用于改进文本-图像模型中的提示集成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06235.pdf)\n\n- (arXiv 2023.2) 基于对比混合适配器的广义少样本持续学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05936.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyawencui\u002FCMoA)\n\n- (arXiv 2023.2) 用于揭秘视觉-语言导航的动作原子概念学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06072.pdf)\n\n- (arXiv 2023.2) 面向图像字幕生成的局部视觉建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06098.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxmu-xiaoma666\u002FLSTNet)\n\n- (arXiv 2023.2) CLIP-RR：面向关系导向的跨模态信息检索的改进CLIP网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06350.pdf)\n\n- (arXiv 2023.2) 为第一人称视频预测下一个活跃对象，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06358.pdf)，[[代码]]()\n\n- (arXiv 2023.2) UniAdapter：用于跨模态建模的统一参数高效迁移学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.06605.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FRERV\u002FUniAdapter)\n\n- (arXiv 2023.2) TEAM 
DETR：在检测Transformer中将引导查询视为专业团队，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07116.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhorrible-dong\u002FTeamDETR)\n\n- (arXiv 2023.2) ConceptFusion：开放集多模态3D地图构建，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07241.pdf)，[[项目]](https:\u002F\u002Fconcept-fusion.github.io\u002F)\n\n- (arXiv 2023.2) Factify 2中的团队三重检查：用于多模态事实核查的特征表示参数高效大型基础模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07740.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwwweiwei\u002FPre-CoFactv2-AAAI-2023)\n\n- (arXiv 2023.2) PolyFormer：将参考图像分割视为序列多边形生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07387.pdf)\n\n- (arXiv 2023.2) 基于姿态的Transformer结合不确定性引导的精炼，用于2D到3D人体姿态估计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07408.pdf)\n\n- (arXiv 2023.2) TFormer：一种适用于物联网设备的传输友好型ViT模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07734.pdf)，[[代码]]()\n\n- (arXiv 2023.2) 基于视觉的3D语义占用预测的三视角视图，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07817.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FTPVFormer)\n\n- (arXiv 2023.2) 为文本到图像的扩散模型添加条件控制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05543.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flllyasviel\u002FControlNet)\n\n- (arXiv 2023.2) 不变槽注意力：以槽为中心的参考系进行目标发现，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04973.pdf)\n\n- (arXiv 2023.2) 多模态视觉监督对语言有益吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05016.pdf)\n\n- (arXiv 2023.2) 基于数据驱动的随机运动评估与优化，结合空间对齐的时间编码图像，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05041.pdf)\n\n- (arXiv 2023.2) 将视觉Transformer扩展至220亿参数，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05442.pdf)\n\n- (arXiv 2023.2) 通过权重膨胀将预训练的视觉Transformer从2D适配到3D，可提升医学图像分割性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04303.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyuhui-zh15\u002FTransSeg)\n\n- (arXiv 2023.2) 通过定向对齐缓解视觉Transformer中的偏差，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04358.pdf)\n\n- (arXiv 2023.2) IH-ViT：基于视觉Transformer的集成电路外观缺陷检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2302\u002F2302.04521.pdf)\n\n- (arXiv 2023.2) Re-ViLM：检索增强型视觉语言模型，用于零样本和少样本图像字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04858.pdf)\n\n- (arXiv 2023.2) 通过提问学习具身视觉导航与任务完成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04865.pdf)\n\n- (arXiv 2023.2) 可逆视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04869.pdf)，[[代码1]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fslowfast)，[[代码2]](https:\u002F\u002Fgithub.com\u002Fkarttikeya\u002FminREV)\n\n- (arXiv 2023.2) 神经凝结：将图像对齐到联合语义图谱，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03956.pdf)，[[项目]](https:\u002F\u002Fneural-congealing.github.io\u002F)\n\n- (arXiv 2023.2) 针对黑盒基础模型的对抗性提示攻击，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04237.pdf)\n\n- (arXiv 2023.2) 理解为什么ViT在小数据集上训练效果不佳：一种直观的视角，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03751.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBoyuanJackChen\u002FMiniProject2_VisTrans)\n\n- (arXiv 2023.2) 通过层注意力进行跨层回顾性检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03985.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjoyfang1106\u002FMRLA)\n\n- (arXiv 2023.2) 经过训练以**识别单词**的卷积神经网络，能够很好地解释视觉形式启动效应，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03992.pdf)\n\n- (arXiv 2023.2) 使用扩散模型从纯文本故事零样本**生成**连贯的**故事书**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03900.pdf)\n\n- (arXiv 2023.2) 
OSRT：基于失真感知Transformer的全向**图像超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03453.pdf)\n\n- (arXiv 2023.2) Pic2Word：将图片映射到单词，用于零样本的**组合式** **图像检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03084.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fcomposed_image_retrieval)\n\n- (arXiv 2023.2) 基于多视图的SimCon损失，用于文本监督下的**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03432.pdf)\n\n- (arXiv 2023.2) PhysFormer++：基于慢速-快速时序差异Transformer的**面部**视频**生理测量**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03548.pdf)\n\n- (arXiv 2023.2) 通过多视图注意力学习扩展**自监督**端到端**驾驶**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03198.pdf)\n\n- (arXiv 2023.2) HumanMAC：用于**人体运动预测**的掩码运动补全，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03665.pdf)，[[项目]](https:\u002F\u002Flhchen.top\u002FHuman-MAC\u002F)\n\n- (arXiv 2023.2) LAMPP：将**语言模型**作为概率先验用于**感知**和**行动**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02801.pdf)\n\n- (arXiv 2023.2) 从被动的人类视频中实现零样本**机器人操作**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02011.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fhuman-0shot-robot)\n\n- (arXiv 2023.2) MixFormer：基于迭代混合注意力的端到端**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02814.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FMixFormer)\n\n- (arXiv 2023.2) LexLIP：面向大规模**图像-文本检索**的词典瓶颈型语言-图像预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02908.pdf)\n\n- (arXiv 2023.2) V1T：使用视觉Transformer进行大规模**小鼠V1区反应预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03023.pdf)\n\n- (arXiv 2023.2) AIM：为高效**视频动作识别**而**适配图像模型**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03024.pdf)，[[项目]](https:\u002F\u002Fadapt-image-models.github.io\u002F)\n\n- (arXiv 2023.2) KDEformer：通过核密度估计加速Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02451.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmajid-daliri\u002Fkdeformer)\n\n- (arXiv 2023.2) 基于预训练模型的语义引导**图像增强**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02070.pdf)\n\n- (arXiv 2023.2) X-ReID：用于身份级**行人重识别**的跨实例Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02075.pdf)\n\n- (arXiv 2023.2) MOMA：从自监督教师那里进行**蒸馏**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02089.pdf)\n\n- (arXiv 2023.2) 学习在视觉注意力上达成一致，用于**视觉常识推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02117.pdf)\n\n- (arXiv 2023.2) 使用金字塔式多模态Transformer实现高效的端到端**视频问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02136.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTrunpm\u002FPMT-AAAI23)\n\n- (arXiv 2023.2) LipFormer：基于视觉地标Transformer学习对未见过说话者的**唇读**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02141.pdf)\n\n- (arXiv 2023.2) 无振荡的**量化**方法，适用于低比特视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02210.pdf)\n\n- (arXiv 2023.2) Design Booster：一种文本引导的扩散模型，用于在保持空间布局的前提下进行**图像翻译**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02284.pdf)\n\n- (arXiv 2023.2) 对比与重建：由生成式预训练引导的**对比式** **3D** 表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02318.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fqizekun\u002FReCon)\n\n- (arXiv 2023.2) 离开现实走向想象：通过**生成数据集**实现**鲁棒** **分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02503.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHritikbansal\u002Fgenerative-robustness)\n\n- (arXiv 2023.2) CHiLS：使用**分层**标签集进行零样本图像**分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02551.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Facmi-lab\u002FCHILS)\n\n- (arXiv 2023.2) 
零样本**图像到图像**翻译，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03027.pdf)，[[项目]](https:\u002F\u002Fpix2pixzero.github.io\u002F)\n\n- (arXiv 2023.2) 学习一种**傅里叶变换**，用于Transformer中的线性相对**位置编码**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01925.pdf)\n\n- (arXiv 2023.2) 显式框检测统一了端到端的**多人姿态估计**，[[论文]](http:\u002F\u002Fmy.sjtu.edu.cn\u002FTask)，[[代码]](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FED-Pose)\n\n- (arXiv 2023.2) CFFT-GAN：用于基于示例的**图像翻译**的跨域特征融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01608.pdf)\n\n- (arXiv 2023.2) DEVICE：深度与视觉概念感知Transformer，用于**TextCaps**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01540.pdf)\n\n- (arXiv 2023.2) CVTNet：一种基于**LiDAR**数据的跨视图Transformer网络，用于**场所识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01665.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBIT-MJY\u002FCVTNet)\n\n- (arXiv 2023.2) DilateFormer：用于视觉识别的**多尺度扩张**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01791.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJIAOJIAYUASD\u002Fdilateformer)\n\n- (arXiv 2023.2) HDFormer：用于**3D人体姿态估计**的高阶有向Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01825.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhyer\u002FHDFormer)\n\n- (arXiv 2023.2) IC^3：通过委员会共识进行图像字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01328.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDavidMChan\u002Fcaption-by-committee)\n\n- (arXiv 2023.2) 通过显著性提示的无监督预训练提升低数据量实例**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01171.pdf)\n\n- (arXiv 2023.2) QR-CLIP：引入明确的开放世界知识，用于**地点和时间推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00952.pdf)\n\n- (arXiv 2023.2) 基于视觉Transformer的特征提取，用于**广义零样本学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00875.pdf)\n\n- (arXiv 2023.2) 语言模型中的**多模态**思维链**推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00923.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Fmm-cot)\n\n- (arXiv 2023.2) CLIPood：将**CLIP**推广到**分布外场景**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00864.pdf)\n\n- (arXiv 2023.2) 语言量化自编码器：迈向无监督的**文本-图像**对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00902.pdf)\n\n- (arXiv 2023.2) 大型 Transformer 模型的**隐藏表示**几何结构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00294.pdf)\n\n- (arXiv 2023.2) 通过有偏提示对**视觉-语言**模型进行**去偏**处理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00070.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchingyaoc\u002Fdebias_vl)\n\n- (arXiv 2023.2) 基于运动线索的组合式提示微调用于**开放词汇视频关系检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00268.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDawn-LX\u002FOpenVoc-VidVRD)\n\n- (arXiv 2023.2) mPLUG-2：一个跨文本、图像和视频的模块化**多模态**基础模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00402.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Falibaba\u002FAliceMind\u002Ftree\u002Fmain\u002FmPLUG)\n\n- (arXiv 2023.2) 通过插值权重优化将**CLIP**转化为**开放词汇视频模型**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00624.pdf)\n\n- (arXiv 2023.2) ADAPT：动作感知的驾驶场景**字幕**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00673.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjxbbb\u002FADAPT)\n\n\n\n### 2023.1\n\n\u003C!-- - (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 
2023.1) ，[[论文]]()，[[代码]]()\n\n- (arXiv 2023.1) ，[[论文]]()，[[代码]]() -->\n\n- (arXiv 2023.1) Dyna-DepthFormer：用于动态场景下自监督深度估计的多帧Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05871.pdf)\n\n- (arXiv 2023.1) SwinDepth：基于Swin Transformer和密集级联网络的单目序列无监督深度估计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.06715.pdf)\n\n- (arXiv 2023.1) CLIPTER：在场景文本识别中着眼全局，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07464.pdf)\n\n- (arXiv 2023.1) 时间感知的视频-语言预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07463.pdf)\n\n- (arXiv 2023.1) 文本与三维点云的联合表示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07584.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FText4Point)\n\n- (arXiv 2023.1) 基于语义视觉损失的有效端到端视觉-语言预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07236.pdf)\n\n- (arXiv 2023.1) PTA-Det：用于3D目标检测的点云与图像关联点Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07301.pdf)\n\n- (arXiv 2023.1) CLIP时代与十亿级图像数据集下的人脸识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07315.pdf)\n\n- (arXiv 2023.1) HSTFormer：用于3D人体姿态估计的层次化时空Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07322.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fqianxiaoye825\u002FHSTFormer)\n\n- (arXiv 2023.1) 致力于能够“看”和“读”的模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07389.pdf)\n\n- (arXiv 2023.1) 用于高效探索和智能场景描述的具身智能体，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07150.pdf)\n\n- (arXiv 2023.1) 基于联合嵌入预测架构的图像自监督学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08243.pdf)\n\n- (arXiv 2023.1) 重访少样本动作识别中的空间和时间建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07944.pdf)\n\n- (arXiv 2023.1) 用于参数高效的视频文本检索的多模态视频适配器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07868.pdf)\n\n- (arXiv 2023.1) 自监督学习在大规模自然语言监督下并无助益，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07836.pdf)\n\n- (arXiv 2023.1) 基于Transformer的相机链接模型与时空信息的多目标多摄像头车辆跟踪，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07805.pdf)\n\n- (arXiv 2023.1) 
ATMAN：通过内存高效的注意力机制解释Transformer预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08110.pdf)\n\n- (arXiv 2023.1) DDS：解耦的动态场景图生成网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07666.pdf)，[[代码]]()\n\n- (arXiv 2023.1) 视觉写作提示：基于精选图像序列的角色驱动故事生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08571.pdf)\n\n- (arXiv 2023.1) 基于视觉Transformer的图像记忆性预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08647.pdf)\n\n- (arXiv 2023.1) 全面可解释的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08669.pdf)\n\n- (arXiv 2023.1) FlatFormer：用于高效点云Transformer的扁平化窗口注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08739.pdf)\n\n- (arXiv 2023.1) LEGO-Net：学习房间内物体的规则重新排列，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09629.pdf)，[[项目]](https:\u002F\u002Fivl.cs.brown.edu\u002Fprojects\u002Flego-net)\n\n- (arXiv 2023.1) Zorro：一种掩码多模态Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09595.pdf)\n\n- (arXiv 2023.1) 向具有时间感知的Transformer的鲁棒视频实例分割迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09416.pdf)\n\n- (arXiv 2023.1) 从自然语言监督中学习开放词汇语义分割模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09121.pdf)，[[项目]](https:\u002F\u002Fjazzcharles.github.io\u002FOVSegmentor\u002F)\n\n- (arXiv 2023.1) 总结过去以预测未来：上下文的自然语言描述有助于多模态对象交互预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09209.pdf)，[[代码]](https:\u002F\u002Feth-ait.github.io\u002Ftransfusion-proj\u002F)\n\n- (arXiv 2023.1) 联邦学习与图像加密的结合应用于保护隐私的视觉Transformer图像分类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09255.pdf)\n\n- (arXiv 2023.1) 切片Transformer与自监督学习用于三维点云地图中的6DoF定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08957.pdf)\n\n- (arXiv 2023.1) 使用手工特征提升零样本动作识别的准确性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08874.pdf)\n\n- (arXiv 2023.1) 学会观察：用于主动目标检测的决策Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09544.pdf)\n\n- (arXiv 2023.1) 用于图像字幕生成的视觉语义相关性数据集，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08784.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fahmedssabir\u002FTextual-Visual-Semantic-Dataset)\n\n- (arXiv 2023.1) 用于学习隐式神经表示的多功能神经过程，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.08883.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZongyuGuo\u002FVersatile-NP)\n\n- (arXiv 2023.1) RangeViT：面向自动驾驶中3D语义分割的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10222.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002Frangevit)\n\n- (arXiv 2023.1) 利用光流引导进行基于Transformer的视频修复，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10048.pdf)\n\n- (arXiv 2023.1) 使用高效条带窗口Transformer进行图像超分辨率，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09869.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFried-Rice-Lab\u002FFriedRiceLab)\n\n- (arXiv 2023.1) 当前最先进视觉模型的分布外性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10750.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsalman-lui\u002Fvision_course_project)\n\n- (arXiv 2023.1) 带有相关性掩码建模的紧凑型Transformer跟踪器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10938.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHUSTDML\u002FCTTrack)\n\n- (arXiv 2023.1) 执行零样本任务的视觉-语言模型表现出性别相关的差异，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11100.pdf)\n\n- (arXiv 2023.1) 用于无监督目标检测和实例分割的Cut and Learn，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11320.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FCutLER)\n\n- (arXiv 2023.1) 
通过生成字幕将视觉偏见转化为文字加以解释，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11104.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Falinlab\u002Fb2t)\n\n- (arXiv 2023.1) 重访基于CLIP的图像到视频知识迁移中的时间建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11116.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffarewellthree\u002FSTAN)\n\n- (arXiv 2023.1) **多视频时刻排序**，基于多模态线索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13606.pdf)\n\n- (arXiv 2023.1) SDF-FORMER：利用3D SDF变换器进行**单目场景重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13510.pdf)，[[项目]](https:\u002F\u002Fweihaosky.github.io\u002Fsdfformer)\n\n- (arXiv 2023.1) 将语言模型与图像对齐用于**多模态生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13823.pdf)\n\n- (arXiv 2023.1) 具有多级置信度优化的伪3D感知变换器，用于**视觉常识推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13335.pdf)\n\n- (arXiv 2023.1) 一种模块化多阶段轻量级图变换器网络，用于从2D人体姿态中进行**人体姿态和形状估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13403.pdf)\n\n- (arXiv 2023.1) 先验知识很强大：通过2D先验改进用于**多摄像头3D检测**的变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13592.pdf)\n\n- (arXiv 2023.1) UPop：统一且渐进式的剪枝方法，用于**压缩****视觉-语言**变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13741.pdf)\n\n- (arXiv 2023.1) 基于去偏自注意力的**公平性**感知视觉变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13803.pdf)\n\n- (arXiv 2023.1) 基于锚点的语言驱动对抗鲁棒**零样本学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13096.pdf)\n\n- (arXiv 2023.1) 将互联网规模的**视觉-语言**模型蒸馏到**具身**智能体中，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12507.pdf)\n\n- (arXiv 2023.1) 利用变换器实现6自由度机器人**抓取**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12476.pdf)\n\n- (arXiv 2023.1) 具身智能体会做像素化的羊梦吗？：使用语言引导的世界建模进行**具身决策**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12050.pdf)，[[项目]](https:\u002F\u002Fdeckardagent.github.io\u002F)\n\n- (arXiv 2023.1) GALIP：用于**文本到图像**合成的生成对抗CLIPs，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12959.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ftobran\u002FGALIP)\n\n- (arXiv 2023.1) STAIR：在接地标记中学习**稀疏**的**文本和图像**表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13081.pdf)\n\n- (arXiv 2023.1) 使用视觉变换器检测器（ViTDet）进行**航拍**图像目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2301\u002F2301.12058.pdf)\n\n- (arXiv 2023.1) 向视觉变换器展开固定点算法迈进：以**图像修复**为例，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12332.pdf)\n\n- (arXiv 2023.1) 通过**提示**正则化对**视觉-语言**模型进行去偏微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12429.pdf)，[[代码]]()\n\n- (arXiv 2023.1) BLIP-2：利用**冻结**的图像编码器和大型语言模型进行**语言-图像**预训练的自举方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12597.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2)\n\n- (arXiv 2023.1) 标注优先于对齐：为**视频-文本检索**整合多模态标签，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12644.pdf)\n\n- (arXiv 2023.1) SEAFORMER：面向移动设备语义**分割**的挤压增强轴向变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13156.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FSeaFormer)\n\n- (arXiv 2023.1) 基于部件效价接地学习6自由度细粒度**抓取检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11564.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Flang-shape)\n\n- (arXiv 2023.1) 多模态事件变换器，用于**图像引导的故事结局生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11357.pdf)\n\n- (arXiv 2023.1) 风格感知对比学习，用于多风格图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11367.pdf)\n\n- (arXiv 2023.1) 
3DShape2VecSet：一种用于神经场和生成扩散模型的**3D形状表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11445.pdf)\n\n- (arXiv 2023.1) 半参数化**视频接地文本生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11507.pdf)\n\n- (arXiv 2023.1) 具有局部归纳偏置和特征归一化的**鲁棒**变换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11553.pdf)\n\n- (arXiv 2023.1) 在**对比学习**中利用第三维度，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11790.pdf)\n\n- (arXiv 2023.1) 通过**部件**感知表示学习理解**自监督**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.11915.pdf)\n\n- (arXiv 2023.1) 超图变换器，用于**基于骨骼的动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09590.pdf)\n\n- (arXiv 2023.1) CPT-V：一种对比方法，用于视觉变换器的**量化**后处理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09643.pdf)\n\n- (arXiv 2023.1) InstructPix2Pix：学习遵循**图像编辑**指令，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09800.pdf)，[[代码]](http:\u002F\u002Ftimothybrooks.com\u002Finstruct-pix2pix)\n\n- (arXiv 2023.1) OvarNet：迈向开放词汇的目标**属性识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09506.pdf)，[[项目]](https:\u002F\u002Fkyanchen.github.io\u002FOvarNet)\n\n- (arXiv 2023.1) DDS：解耦的动态**场景图生成**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07666.pdf)\n\n- (arXiv 2023.1) **Token**变换器：类别token能否帮助基于窗口的变换器建立更好的**长距离交互**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06083.pdf)\n\n- (arXiv 2023.1) 向构建适用于语言、视觉以及视觉-语言理解任务的通用**基础模型**迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.05065.pdf)\n\n- (arXiv 2023.1) 基于知识的**视觉问答**的多模态逆填空任务？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04366.pdf)，[[代码]]()\n\n- (arXiv 2023.1) FGAHOI：用于**人-物体交互**检测的细粒度锚点，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04019.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxiaomabufei\u002FFGAHOI)\n\n- (arXiv 2023.1) 并行推理网络，用于**人-物体交互**检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.03510.pdf)\n\n- (arXiv 2023.1) 为**视频事件关系预测**中的结构化符号表示辩护，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.03410.pdf)\n\n- (arXiv 2023.1) 从人类**运动**中进行**场景合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.01424.pdf)，[[项目]](https:\u002F\u002Flijiaman.github.io\u002Fprojects\u002Fsummon\u002F)\n\n\n\n### 2022年12月\n\n\u003C!-- - (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]()\n\n- (arXiv 2022.12) , [[Paper]](), [[Code]]() -->\n\n- (arXiv 2022.12) EVA: 探索大规模下**掩码视觉表征**学习的极限，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07636.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEVA)\n\n- (arXiv 2022.12) OneFormer: 一个Transformer统治通用图像**分割**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06220.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FOneFormer)\n\n- (arXiv 2022.12) MMDialog: 
面向**多模态开放域对话**的大规模多轮对话数据集，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05719.pdf)，[[Project]](https:\u002F\u002Fgithub.com\u002Fvictorsungo\u002FMMDialog)\n\n- (arXiv 2022.12) 为什么Winoground很难？探究**视觉语言组合性**中的失败，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00768.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Fajd12342\u002Fwhy-winoground-hard)\n\n- (arXiv 2022.12) 多模态**信息瓶颈**：学习最小充分的单模态和**多模态**表征，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17444.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002FTmacMai\u002FMultimodal-Information-Bottleneck)\n\n- (arXiv 2022.12) CLIP-FLOW：基于半监督迭代伪标签的对比学习用于**光流估计**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14383.pdf)\n\n- (arXiv 2022.12) 使用联合预训练的**视觉-语言**模型的指令遵循**智能体**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13431.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Flhao499\u002Finstructrl)\n\n- (arXiv 2022.12) 视觉领域的MetaFormer**基准**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13452.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fmetaformer)\n\n- (arXiv 2022.12) ViTCoD：通过专用算法与加速器协同设计实现视觉Transformer**加速**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09573.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002FGATECH-EIC\u002FViTCoD)\n\n- (arXiv 2022.12) 从玩耍到策略：利用未经整理的**机器人**数据生成条件行为，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10047.pdf)，[[Project]](https:\u002F\u002Fplay-to-policy.github.io\u002F)\n\n- (arXiv 2022.12) 优化用于**文本到图像**生成的**提示词**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09611.pdf)，[[Code]](https:\u002F\u002Faka.ms\u002Fpromptist)\n\n- (arXiv 2022.12) 注意力驱动的**掩码CLIP**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08653.pdf)\n\n- (arXiv 2022.12) 用视觉Transformer重新思考**烹饪状态识别**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08586.pdf)\n\n- (arXiv 2022.12) 通过结构化知识和统一的检索-生成机制提升**多模态**和**多跳问答**性能，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08632.pdf)，[[Code]]()\n\n- (arXiv 2022.12) MM-SHAP：一种不依赖于性能的指标，用于衡量**视觉与语言**模型及任务中的多模态贡献，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08158.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002FHeidelberg-NLP\u002FMM-SHAP)\n\n- (arXiv 2022.12) RepQ-ViT：视觉Transformer训练后**量化**的尺度重参数化，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08254.pdf)\n\n- (arXiv 2022.12) WAVENHANCER：将小波与Transformer统一用于**图像增强**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08327.pdf)\n\n- (arXiv 2022.12) 自编码器作为跨模态教师：预训练的2D图像Transformer能否帮助**3D表征**学习？，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08320.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FACT)\n\n- (arXiv 2022.12) SceneGATE：基于场景图的协同注意力网络用于文本**视觉问答**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08283.pdf)\n\n- (arXiv 2022.12) 大型语言模型中的涌现**类比推理**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09196.pdf)\n\n- (arXiv 2022.12) 在像素级别释放**视觉提示**的力量，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10556.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FEVP)\n\n- (arXiv 2022.12) **CLIP**是否能绑定概念？探究大型图像模型中的**组合性**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10537.pdf)\n\n- (arXiv 2022.12) LayoutDETR：检测Transformer是优秀的**布局设计师**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09877.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLayoutDETR)\n\n- (arXiv 2022.12) 走向无监督**视觉推理**：现成的特征是否懂得如何推理？，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10292.pdf)\n\n- (arXiv 2022.12) 
**文本到图像**生成中**空间关系**的基准测试，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10015.pdf)，[[Project]](https:\u002F\u002Fvisort2i.github.io\u002F)\n\n- (arXiv 2022.12) MetaCLUE：迈向全面的**视觉隐喻**研究，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09898.pdf)，[[Project]](https:\u002F\u002Fmetaclue.github.io\u002F)\n\n- (arXiv 2022.12) 用图像解决歧义：改进的**多模态**机器**翻译**与对比评估，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10140.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002FMatthieuFP\u002FCoMMuTE.git)\n\n- (arXiv 2022.12) 跨模态注意力一致性正则化用于**视觉-语言****关系**对齐，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10549.pdf)\n\n- (arXiv 2022.12) 无监督**语法归纳**需要像素吗？，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10564.pdf)\n\n- (arXiv 2022.12) Hi-LASSIE：从稀疏**图像**集合中高保真地**发现**可变形形状和骨架，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11042.pdf)\n\n- (arXiv 2022.12) MAViC：用于**视频字幕**的多模态主动学习，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11109.pdf)\n\n- (arXiv 2022.12) 什么样的**分词器**适合视觉Transformer？[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11115.pdf)\n\n- (arXiv 2022.12) 不只是漂亮的图片：**文本到图像**生成器能够实现对**鲁棒**表征的可解释干预，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11237.pdf)，[[Code]]()\n\n- (arXiv 2022.12) 针对**像素**、**图像**和**语言**的通用解码，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11270.pdf)，[[Project]](https:\u002F\u002Fx-decoder-vl.github.io\u002F)\n\n- (arXiv 2022.12) METEOR引导的**视频字幕**多样性，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10690.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Fd-rothen\u002Fbmhrl)\n\n- (arXiv 2022.12) SLGTFORMER：一种基于注意力的**手语识别**方法，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10746.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Fneilsong\u002Fslt)\n\n- (arXiv 2022.12) 从图像到文本**提示**：使用冻结大型语言模型进行零样本**VQA**，[[Paper]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10846.pdf)，[[Code]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fimg2prompt-vqa)\n\n- (arXiv 2022.12) 3D Highlighter：通过**文本**描述定位**3D**形状上的区域，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11263.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fthreedle\u002F3DHighlighter)\n\n- (arXiv 2022.12) 基于网络爬取的多模态数据预训练的对比型**语言-视觉**AI模型表现出性对象化的**偏见**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11261.pdf)\n\n- (arXiv 2022.12) 超高清**低光图像增强**：基准与基于Transformer的方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11548.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTaoWangzj\u002FLLFormer)\n\n- (arXiv 2022.12) Tune-A-Video：用于**文本到视频**生成的图像扩散模型的一次性微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11565.pdf)，[[项目]](https:\u002F\u002Ftuneavideo.github.io\u002F)\n\n- (arXiv 2022.12) 不止于SOT：是时候同时**跟踪**多个通用**物体**了，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11920.pdf)\n\n- (arXiv 2022.12) 基于知识的场景先验用于语义视听**具身导航**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.11345.pdf)\n\n- (arXiv 2022.12) SegViT：使用普通视觉Transformer进行**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05844.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzbwxp\u002FSegVit)\n\n- (arXiv 2022.12) 使用现成的图像-文本特征进行开放词汇表的**时序动作检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10596.pdf)\n\n- (arXiv 2022.12) Point·E：一个可根据复杂**提示**生成**3D点云**的系统，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08751.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fpoint-e)\n\n- (arXiv 2022.12) 用于**视频动作预测**的归纳式注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08830.pdf)\n\n- (arXiv 2022.12) 
仅从像素理解**图像与语言**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08045.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbig_vision)\n\n- (arXiv 2022.12) FlexiViT：一种适用于所有**补丁大小**的模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08013.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbig_vision)\n\n- (arXiv 2022.12) **无监督**目标**定位**：通过观察背景来发现目标，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07834.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FFOUND)\n\n- (arXiv 2022.12) 视觉Transformer是参数高效的**视听**学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07983.pdf)，[[项目]](https:\u002F\u002Fgenjib.github.io\u002Fproject_page\u002FLAVISH\u002F)\n\n- (arXiv 2022.12) 用于**语义分割**中多分辨率Transformer的全上下文注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07890.pdf)\n\n- (arXiv 2022.12) DETR4D：利用稀疏注意力实现直接的多视角**3D目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07849.pdf)\n\n- (arXiv 2022.12) 通过选择性查询重召回提升基于查询的目标**检测**训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07593.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFangyi-Chen\u002FSQR)\n\n- (arXiv 2022.12) 文本引导的无遮罩局部**图像修饰**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07603.pdf)\n\n- (arXiv 2022.12) 面向摘要的视觉建模用于**多模态抽象摘要**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07672.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FXL2248\u002FSOV-MAS)\n\n- (arXiv 2022.12) 基于类感知跨域Transformer实现一次性的领域自适应且可泛化的**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07292.pdf)\n\n- (arXiv 2022.12) ConQueR：用于**3D目标检测**的查询对比体素DETR，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07289.pdf)\n\n- (arXiv 2022.12) 利用解释方法考察**Transformer**与**CNN**之间的**差异**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06872.pdf)\n\n- (arXiv 2022.12) 找某人：以人类为中心的**接地**中的视觉常识理解，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06971.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHxyou\u002FHumanCog)\n\n- (arXiv 2022.12) 用于**群体情感识别**的双分支跨补丁注意力学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07055.pdf)\n\n- (arXiv 2022.12) 基于跨模态相似性的课程学习用于图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07075.pdf)\n\n- (arXiv 2022.12) NLIP：抗噪声的**语言-图像**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07086.pdf)\n\n- (arXiv 2022.12) Lidar**CLIP**或：我如何学会与**点云**对话，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06858.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fatonderski\u002Flidarclip)\n\n- (arXiv 2022.12) **CLIP**SEP：利用带噪声的未标注视频学习文本查询的**声音分离**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07065.pdf)\n\n- (arXiv 2022.12) 对比语言-图像学习的可重复**缩放规律**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07143.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLAION-AI\u002Fscaling-laws-openclip)\n\n- (arXiv 2022.12) 视觉Transformer学到了什么？一次视觉**探索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06727.pdf)\n\n- (arXiv 2022.12) 自我博弈与自我描述：利用**视觉-语言**基础模型进行**策略适应**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07398.pdf)，[[项目]](https:\u002F\u002Fgeyuying.github.io\u002FSPLAYD)\n\n- (arXiv 2022.12) GPVIT：一种具有群体传播能力的**高分辨率**非层级视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06795.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChenhongyiYang\u002FGPViT)\n\n- (arXiv 2022.12) 通过**图像到点**掩码自编码器，从2D预训练模型学习3D表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06785.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FI2P-MAE)\n\n- (arXiv 2022.12) 
用于**人机交互检测**的并行查询，[[论文]](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3551626.3564944)\n\n- (arXiv 2022.12) 基于结构的**图像修复**，结合图像级和对象级语义判别器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06310.pdf)\n\n- (arXiv 2022.12) 用于**视觉-语言**模型**微调**的局部潜在更新，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06556.pdf)\n\n- (arXiv 2022.12) CamoFormer：用于**伪装目标检测**的掩码可分离注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06570.pdf)\n\n- (arXiv 2022.12) FastMIM：加速视觉领域的**掩码**图像建模预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06593.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fggjy\u002FFastMIM.pytorch)\n\n- (arXiv 2022.12) OAMixer：面向视觉Transformer的对象感知**混合**层，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06595.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Falinlab\u002FOAMixer)\n\n- (arXiv 2022.12) 双重正确的**目标识别**：为视觉**理由**提供“为什么”提示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06202.pdf)\n\n- (arXiv 2022.12) RT-1：用于大规模现实世界控制的**机器人**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06817.pdf)，[[项目]](https:\u002F\u002Frobotics-transformer.github.io\u002F)\n\n- (arXiv 2022.12) **第一人称视频**任务翻译，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06301.pdf)\n\n- (arXiv 2022.12) ScanEnts3D：利用短语与3D物体的对应关系提升**三维场景**中的**视觉-语言**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06250.pdf)，[[项目]](https:\u002F\u002Fscanents3d.github.io\u002F)\n\n- (arXiv 2022.12) **课程学习**邂逅弱监督的**模态相关性**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07619.pdf)\n\n- (arXiv 2022.12) IMoS：面向**人-物体交互**的意图驱动全身**运动合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.07555.pdf)\n\n- (arXiv 2022.12) MultiAct：基于多个动作标签的长期**3D人体运动生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.05897.pdf)\n\n- (arXiv 2022.12) 新路径：通过合成指令和模仿学习扩展**视觉-语言导航**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03112.pdf)\n\n- (arXiv 2022.12) 超越物体识别：迈向**物体概念学习**的新基准，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.02710.pdf)，[[项目]](https:\u002F\u002Fmvig-rhos.com\u002Focl)\n\n- (arXiv 2022.12) ViTPose+：用于通用身体**姿态估计**的视觉Transformer基础模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04246.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FViTPose)\n\n- (arXiv 2022.12) 面向**计算烹饪**的结构化**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04267.pdf)\n\n- (arXiv 2022.12) MIME：以人为本的**3D场景生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04360.pdf)，[[项目]](https:\u002F\u002Fmime.is.tue.mpg.de\u002F)\n\n- (arXiv 2022.12) OFASY S：用于构建**通用模型**的**多模态多任务**学习系统，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04408.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FOFASys)\n\n- (arXiv 2022.12) **视觉-语言**模型中的任务**偏差**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04412.pdf)\n\n- (arXiv 2022.12) 多概念定制的**文本到图像****扩散**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04488.pdf)，[[代码]](https:\u002F\u002Fwww.cs.cmu.edu\u002F~custom-diffusion\u002F)\n\n- (arXiv 2022.12) 在未知类别和相机位姿下进行少视角物体**重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04492.pdf)，[[项目]](https:\u002F\u002Fut-austin-rpl.github.io\u002FFORGE\u002F)\n\n- (arXiv 2022.12) 掩码视频蒸馏：重新思考用于**自监督****视频表示**学习的**掩码**特征建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04500.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fruiwang2021\u002Fmvd)\n\n- (arXiv 2022.12) 从**大型语言模型**中学习**视频**表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04501.pdf)，[[项目]](https:\u002F\u002Ffacebookresearch.github.io\u002FLaViLa)\n\n- (arXiv 2022.12) 
冻结的**CLIP**模型是高效的**点云**骨干网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04098.pdf)\n\n- (arXiv 2022.12) DialogCC：大规模**多模态对话**数据集，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04119.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002Fpassing2961\u002FDialogCC)\n\n- (arXiv 2022.12) 用于视觉Transformer的分组广义均值**池化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04114.pdf)\n\n- (arXiv 2022.12) 学习**视觉-语言**模型的领域不变**提示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04196.pdf)\n\n- (arXiv 2022.12) LLM-Planner：使用**大型语言模型**为**具身智能体**提供少样本接地的**规划**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04088.pdf)\n\n- (arXiv 2022.12) 双曲空间中的**对比学习**用于超越物体的视觉**表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.00653.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshlokk\u002FHCL\u002Ftree\u002Fmain\u002FHCL)\n\n\n\n
### 2022年11月\n\n\u003C!-- - (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]()\n\n- (arXiv 2022.11) , [[论文]](), [[代码]]() -->\n\n- (arXiv 2022.11) 在**多标签图像识别**的提示调优中将文本作为图像处理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12739.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fguozix\u002FTaI-DPT)\n\n- (arXiv 2022.11) 告诉我发生了什么：通过多模态掩码视频生成统一**文本引导的视频补全**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12824.pdf)\n\n- (arXiv 2022.11) InDiReCT：面向图像的零样本深度**度量学习**，由语言指导，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12760.pdf)\n\n- (arXiv 2022.11) VoP：用于**跨模态检索**的文本-视频协同提示调优，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12764.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbighuang624\u002FVoP)\n\n- (arXiv 2022.11) 利用水波斯特鲁克GAN和Transformer从少量点完成**点云重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.12746.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FWxfQjh\u002FStability-point-recovery.git)\n\n- (arXiv 2022.11) 整体预训练的Transformer**金字塔**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12735.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsunsmarterjie\u002FiTPN)\n\n- (arXiv 2022.11) 用于**细粒度图像分类**的数据增强视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12879.pdf)\n\n- (arXiv 2022.11) 具有协作式混合分配的**DETR**训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12860.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FCo-DETR)\n\n- (arXiv 2022.11) 开放词汇表的**属性检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12914.pdf)，[[项目]](https:\u002F\u002Fovad-benchmark.github.io\u002F)\n\n- (arXiv 2022.11) Lite-Mono：一种轻量级CNN和Transformer架构，用于自监督的**单目深度估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13202.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnoahzn\u002FLite-Mono)\n\n- (arXiv 2022.11) 基于扩散模型的反演式**创造力迁移**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13203.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FzyxElsa\u002Fcreativity-transfer)\n\n- (arXiv 2022.11) CODA-Prompt：基于持续分解注意力的提示方法，用于无回放的**持续学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13218.pdf)\n\n- (arXiv 2022.11) 
SVFormer：用于**动作识别**的半监督视频Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13222.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChenHsing\u002FSVFormer)\n\n- (arXiv 2022.11) 通过实例模式组合器实现可泛化的**隐式神经表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13223.pdf)\n\n- (arXiv 2022.11) 通过融合专家特征提升**视觉-文本情感分析**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12981.pdf)\n\n- (arXiv 2022.11) 基于热方程的**自监督**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13228.pdf)\n\n- (arXiv 2022.11) Peekaboo：**文本到图像**扩散模型可作为零样本分割器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13224.pdf)\n\n- (arXiv 2022.11) 以例作画：基于示例的扩散模型**图像编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13227.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFantasy-Studio\u002FPaint-by-Example)\n\n- (arXiv 2022.11) 人还是机器？针对视觉与语言的**图灵测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13087.pdf)，[[代码]](https:\u002F\u002Ftinyurl.com\u002F8x8nha7p)\n\n- (arXiv 2022.11) Teach-DETR：通过教师模型提升**DETR**的**训练**效果，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11953.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLeonHLJ\u002FTeach-DETR)\n\n- (arXiv 2022.11) Conv2Former：一种用于视觉识别的简单Transformer风格**卷积神经网络**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11943.pdf)\n\n- (arXiv 2022.11) X^2-VLM：面向**视觉-语言**任务的一体化预训练模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12402.pdf)，[[代码]](github.com\u002Fzengyan-97\u002FX2-VLM)\n\n- (arXiv 2022.11) 对齐源视觉域与目标语言域，用于无配对的**视频字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12148.pdf)\n\n- (arXiv 2022.11) 关于**广义零样本学习**中视觉特征的可迁移性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12494.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fuvavision\u002FTV-GZSL)\n\n- (arXiv 2022.11) 基于自感应视觉Transformer的可泛化工业视觉**异常检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12311.pdf)\n\n- (arXiv 2022.11) 基于Transformer的多粒度特征用于无监督的**行人重识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12280.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FRikoLi\u002FWACV23-workshop-TMGF)\n\n- (arXiv 2022.11) 高效的频域Transformer用于高质量图像**去模糊**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12250.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkkkls\u002FFFTformer)\n\n- (arXiv 2022.11) Event Transformer+：一种用于高效**事件数据处理**的多功能解决方案，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12222.pdf)\n\n- (arXiv 2022.11) MagicPony：在野外学习关节式**3D动物**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12497.pdf)，[[项目]](https:\u002F\u002F3dmagicpony.github.io\u002F)\n\n- (arXiv 2022.11) 具有级联特征漂移补偿的门控类注意力机制，用于视觉Transformer的无示例**持续学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12292.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOcraM17\u002FGCAB-CFDC)\n\n- (arXiv 2022.11) 基于期望最大化对比学习的紧凑**视频-语言**表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11427.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjpthu17\u002FEMCL)\n\n- (arXiv 2022.11) Swin Transformer中的N-gram用于高效的轻量级**图像超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11436.pdf)\n\n- (arXiv 2022.11) 通过视觉-语言模型的指令增强实现**机器人**技能获取，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11736.pdf)，[[代码]](https:\u002F\u002Finstructionaugmentation.github.io\u002F)\n\n- (arXiv 2022.11) 剥洋葱：分层减少数据冗余以实现**高效**的视觉Transformer**训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10801.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZLKong\u002FTri-Level-ViT)\n\n- (arXiv 2022.11) 使用单塔Transformer统一**视觉-语言**表示空间，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11153.pdf)\n\n- (arXiv 2022.11) 
DeepSolo：让带有显式点的Transformer解码器单独完成**文本定位**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10772.pdf)\n\n- (arXiv 2022.11) Castling-ViT：在视觉Transformer推理过程中切换至线性-角度注意力，从而**压缩自注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10526.pdf)\n\n- (arXiv 2022.11) CL-CrossVQA：一个用于**跨领域视觉问答**的持续学习基准，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10567.pdf)\n\n- (arXiv 2022.11) Normal Transformer：从结合视觉语义的**LiDAR**点云中提取表面几何信息，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10580.pdf)\n\n- (arXiv 2022.11) 一种用于理解**视频**并嵌入异构**知识图谱**数据的统一模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10624.pdf)\n\n- (arXiv 2022.11) 通过以运动为中心的标记选择进行掩码视频建模，实现高效的**视频表示**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10636.pdf)\n\n- (arXiv 2022.11) DiffStyler：可控的双扩散模型用于文本驱动的**图像风格化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10682.pdf)\n\n- (arXiv 2022.11) TORE：使用Transformer实现高效**人体网格恢复**的标记缩减，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10705.pdf)\n\n- (arXiv 2022.11) 利用自回归潜扩散模型**合成**连贯的**故事**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10950.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFlash-321\u002FARLDM)\n\n- (arXiv 2022.11) **分布外检测**方法可靠吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10892.pdf)\n\n- (arXiv 2022.11) GLT-T：用于点云中**3D单目标跟踪**的全局-局部Transformer投票机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10927.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhaooozi\u002FGLT-T)\n\n- (arXiv 2022.11) 针对**VQA**的跨模态对比学习以实现稳健推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11190.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fqizhust\u002Fcmcl_vqa_pl)\n\n- (arXiv 2022.11) LISA：通过隐式神经表示，利用音频实现本地化的**图像风格化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11381.pdf)\n\n- (arXiv 2022.11) MagicVideo：利用潜扩散模型高效**视频生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11018.pdf)，[[代码]](https:\u002F\u002Fmagicvideo.github.io\u002F#)\n\n- (arXiv 2022.11) DreamArtist：通过对比提示调优实现可控的一次性**文本到图像**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11337.pdf)\n\n- (arXiv 2022.11) 基于混合Transformer的特征融合用于自监督的**单目深度估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11066.pdf)\n\n- (arXiv 2022.11) 瓶中之语：由语言模型引导的概念瓶颈用于可解释的**图像分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11158.pdf)\n\n- (arXiv 2022.11) 用于改善视觉-语言**导航**中视觉表示的结构编码辅助任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11116.pdf)\n\n- (arXiv 2022.11) 多出口策略：动态早期退出以**加速**统一视觉语言模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11152.pdf)\n\n- (arXiv 2022.11) 超越注意力标记：结合标记重要性和多样性以实现**高效**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11315.pdf)\n\n- (arXiv 2022.11) FlowLens：通过流引导的Clip循环Transformer实现超越视场角的观测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11293.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMasterHow\u002FFlowLens)\n\n- (arXiv 2022.11) PS-Transformer：利用自注意力机制学习稀疏光度立体视觉网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11386.pdf)\n\n- (arXiv 2022.11) 关于形状-纹理去偏持续学习的鲁棒性、泛化能力和遗忘现象，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11174.pdf)\n\n- (arXiv 2022.11) 基于超采样Token的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11167.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhhb072\u002FSViT)\n\n- (arXiv 2022.11) 只检测你指定的内容：基于语言目标的对象检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11572.pdf)\n\n- (arXiv 2022.11) 视觉编程：无需训练的组合式视觉推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11559.pdf)，[[项目]](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fvisprog)\n\n- (arXiv 
2022.11) ClipCrop：由视觉-语言模型驱动的条件裁剪，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11492.pdf)\n\n- (arXiv 2022.11) SMAUG：用于高效视频-语言预训练的稀疏掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11446.pdf)\n\n- (arXiv 2022.11) 用于从模糊图像恢复真实世界运动的模糊插值Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11423.pdf)\n\n- (arXiv 2022.11) 均值漂移掩码Transformer用于未见物体实例分割，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11679.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYoungSean\u002FUnseenObjectsWithMeanShift)\n\n- (arXiv 2022.11) PointCLIP V2：适配CLIP以实现强大的3D开放世界学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11682.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyangyangyang127\u002FPointCLIP_V2)\n\n- (arXiv 2022.11) 探索用于图像字幕生成的离散扩散模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11694.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbuxiangzhiren\u002FDDCap)\n\n- (arXiv 2022.11) PERCEIVER-VL：通过迭代潜在注意力实现高效的视觉-语言建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11701.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzinengtang\u002FPerceiver_VL)\n\n- (arXiv 2022.11) 多任务视觉-语言提示调优，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11720.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FsIncerass\u002FMVLPT)\n\n- (arXiv 2022.11) 向视觉-语言模型教授结构化视觉与语言概念，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11733.pdf)\n\n- (arXiv 2022.11) 加权自监督集成学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09981.pdf)\n\n- (arXiv 2022.11) BEVFormer v2：通过透视监督将现代图像骨干网络适配到鸟瞰视角识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10439.pdf)\n\n- (arXiv 2022.11) 用于调优视觉-语言模型的任务残差，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10277.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgeekyutao\u002FTaskRes)\n\n- (arXiv 2022.11) α DARTS再出发：通过掩码图像建模增强可微架构搜索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10105.pdf)\n\n- (arXiv 2022.11) 深入研究用于增量语义分割的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10253.pdf)\n\n- (arXiv 2022.11) DETRDistill：一个适用于DETR系列的通用知识蒸馏框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10156.pdf)\n\n- (arXiv 2022.11) PromptCap：提示引导的任务感知图像字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09699.pdf)\n\n- (arXiv 2022.11) UNIFORMERV2：通过为图像Vision Transformer配备视频Uniformer实现时空学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09552.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FUniFormerV2)\n\n- (arXiv 2022.11) 基于信息瓶颈原理的掩码重建对比学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09013.pdf)\n\n- (arXiv 2022.11) 听，降噪，行动！基于扩散模型的音频驱动运动合成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09707.pdf)，[[项目]](https:\u002F\u002Fwww.speech.kth.se\u002Fresearch\u002Flisten-denoise-action\u002F)\n\n- (arXiv 2022.11) ConStruct-VL：无数据持续学习结构化视觉-语言概念，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09790.pdf)\n\n- (arXiv 2022.11) 如何用SGD微调视觉模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09359.pdf)\n\n- (arXiv 2022.11) 用于端到端图像字幕生成的渐进式树状原型网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09460.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNovaMind-Z\u002FPTSN)\n\n- (arXiv 2022.11) CapEnrich：通过跨模态预训练知识丰富网络图像的字幕语义，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09371.pdf)，[[代码]]()\n\n- (arXiv 2022.11) 面向视频字幕生成的视觉常识感知表示网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09469.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzchoi\u002FVCRN)\n\n- (arXiv 2022.11) 
用于3D物体定位的语言条件空间关系推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09646.pdf)，[[代码]](https:\u002F\u002Fcshizhe.github.io\u002Fprojects\u002Fvil3dref.html)\n\n- (arXiv 2022.11) HARDVS：用动态视觉传感器重新审视人类活动识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09648.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FEvent-AHU\u002FHARDVS)\n\n- (arXiv 2022.11) 通过最大化多模态互信息迈向一体化预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09807.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FM3I-Pretraining)\n\n- (arXiv 2022.11) Uni-Perceiver v2：面向大规模视觉和视觉-语言任务的通才模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09808.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffundamentalvision\u002FUni-Perceiver)\n\n- (arXiv 2022.11) D^3ETR：用于检测Transformer的解码器蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09768.pdf)\n\n- (arXiv 2022.11) CAE v2：带有CLIP目标的上下文自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09799.pdf)\n\n- (arXiv 2022.11) 用于文本-视频检索的跨模态适配器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09623.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FCross-Modal-Adapter)\n\n- (arXiv 2022.11) Token图灵机，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09119.pdf)\n\n- (arXiv 2022.11) 大规模生成模型会污染未来的数据集吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08095.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmoskomule\u002Fdataset-contamination)\n\n- (arXiv 2022.11) 从语义角度揭秘视觉Transformer中的自注意力：分析与应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08543.pdf)\n\n- (arXiv 2022.11) SATVSR：用于跨场景视频超分辨率的场景适应性Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.08703.pdf)\n\n- (arXiv 2022.11) TransCC：基于Transformer的**多光源颜色恒常性**，采用多任务学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08772.pdf)\n\n- (arXiv 2022.11) 凝视你所见：无需重建的**掩码图像建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08887.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenPerceptionX\u002Fmaskalign)\n\n- (arXiv 2022.11) HeatViT：面向视觉Transformer的硬件高效自适应**标记剪枝**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08110.pdf)\n\n- (arXiv 2022.11) 面向**CLIP**的跨域联邦自适应**提示调优**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07864.pdf)\n\n- (arXiv 2022.11) YORO——轻量级端到端**视觉定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07912.pdf)\n\n- (arXiv 2022.11) 基于一致蒸馏点采样的检测Transformer**知识蒸馏**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.08071.pdf)\n\n- (arXiv 2022.11) BiViT：极低**压缩比**的**二值化**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07091.pdf)\n\n- (arXiv 2022.11) ContextCLIP：在**CLIP**视觉表征上对**图文**对进行上下文对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07122.pdf)\n\n- (arXiv 2022.11) 基于锚点增强的视觉-语言空间对齐的零样本图像**描述生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07275.pdf)\n\n- (arXiv 2022.11) 超越**大脑**的视觉：用于**视觉解码**的稀疏掩码建模条件扩散模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06956.pdf)，[[项目]](https:\u002F\u002Fmind-vis.github.io\u002F)\n\n- (arXiv 2022.11) 利用余弦Transformer提升**少样本图像分类**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06828.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvinuni-vishc\u002FFew-Shot-Cosine-Transformer)\n\n- (arXiv 2022.11) SCOTCH和SODA：一种基于Transformer的**视频阴影检测**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06885.pdf)\n\n- (arXiv 2022.11) 面向有偏**面部表情识别**的AU感知视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06609.pdf)\n\n- (arXiv 2022.11) 
在向量量化潜在空间上快速进行文本条件离散**去噪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07292.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdome272\u002FPaella)\n\n- (arXiv 2022.11) 零样本图像**描述生成**的大规模双向训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06774.pdf)\n\n- (arXiv 2022.11) 将预训练模型嫁接用于多模态**标题生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07210.pdf)\n\n- (arXiv 2022.11) CabViT：视觉Transformer中块间的交叉**注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07198.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhkzhang91\u002FCabViT)\n\n- (arXiv 2022.11) 基于多粒度不确定性正则化的文本反馈**组合图像检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.07394.pdf)\n\n- (arXiv 2022.11) SSGVS：语义**场景图到视频**合成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06119.pdf)\n\n- (arXiv 2022.11) 针对异构客户端的一次性**模型适配**：一种客户端内与跨图像注意力设计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.06276.pdf)\n\n- (arXiv 2022.11) 一种基于Transformer自注意力的改进型端到端**多目标跟踪**方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.06001.pdf)\n\n- (arXiv 2022.11) 零样本视觉常识下的**不道德预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05521.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fku-vai\u002FZero-shot-Visual-Commonsense-Immorality-Prediction)\n\n- (arXiv 2022.11) 双曲余弦Transformer用于**LiDAR三维目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.05580.pdf)\n\n- (arXiv 2022.11) 使用1张GPU在不到24小时内从头开始训练视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05187.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBorealisAI\u002Fefficient-vit-training)\n\n- (arXiv 2022.11) ViTALiTy：通过线性泰勒注意力统一低秩与稀疏近似以加速视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05109.pdf)\n\n- (arXiv 2022.11) SimOn：一个用于**在线时序动作定位**的简单框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04905.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTuanTNG\u002FSimOn)\n\n- (arXiv 2022.11) ERNIE-UNIX^2：一个用于理解和生成的**跨语言跨模态**统一框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04861.pdf)\n\n- (arXiv 2022.11) SG-Shuffle：用于**场景图生成**的多视角洗牌Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04773.pdf)\n\n- (arXiv 2022.11) 理解生成**场景描述**的V&L模型中的跨模态交互，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04971.pdf)\n\n- (arXiv 2022.11) VieCap4H - VLSP 2021：ObjectAoA——通过注意力上的注意力提升对象关系Transformer在**越南语**图像**描述生成**中的性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05405.pdf)\n\n- (arXiv 2022.11) 观看新闻：迈向能够阅读的**VideoQA**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05588.pdf)，[[项目]](http:\u002F\u002Fcvit.iiit.ac.in\u002Fresearch\u002Fprojects\u002Fcvit-projects\u002Fvideoqa)\n\n- (arXiv 2022.11) 利用空间感知Transformer实现高效的联合**检测**和**多目标跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05654.pdf)\n\n- (arXiv 2022.11) **揭秘**现代图像深度网络中的Transformer与**卷积**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05781.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FSTM-Evaluation)\n\n- (arXiv 2022.11) InternImage：利用**可变形卷积**探索大规模视觉基础模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05778.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternImage)\n\n- (arXiv 2022.11) DEPTHFORMER：面向基于Transformer的**分割**网络的多模态位置编码与跨输入注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04188.pdf)\n\n- (arXiv 2022.11) 用于端到端**人员搜索**的序列Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.04323.pdf)\n\n- (arXiv 2022.11) 引导大型预训练视觉-语言模型进行**组合概念学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05077.pdf)\n\n- (arXiv 
2022.11) CASA：类别无关的**骨骼动物重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03568.pdf)\n\n- (arXiv 2022.11) ViT-CX：视觉Transformer的因果**解释**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03064.pdf)\n\n- (arXiv 2022.11) 解耦内容与运动以实现**基于文本的神经网络视频操控**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02980.pdf)\n\n- (arXiv 2022.11) **高效**多阶门控聚合网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03295.pdf)\n\n- (arXiv 2022.11) CLOP：结合知识正则化的**视频与语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03314.pdf)\n\n- (arXiv 2022.11) MSMG-Net：多尺度、多粒度监督的网络，用于多任务图像篡改的**检测**和**定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2211\u002F2211.03140.pdf)\n\n- (arXiv 2022.11) 理解并缓解**视觉-语言**模型中**提示**微调的过拟合问题，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02219.pdf)，[[代码]](https:\u002F\u002Ftinyurl.com\u002Fmpe64f89)\n\n- (arXiv 2022.11) 使用现成模型进行零样本**视频瞬间检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02178.pdf)\n\n- (arXiv 2022.11) 通过跨模态梯度协调实现**多模态**预训练的扩展，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02077.pdf)\n\n- (arXiv 2022.11) 一种用于在线数学表达式**手势识别**的Transformer架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02643.pdf)\n\n- (arXiv 2022.11) 评估和改进**多模态摘要生成**中的事实性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02580.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmeetdavidwan\u002Ffaithful-multimodal-summ)\n\n- (arXiv 2022.11) RCDPT：**雷达-相机融合**密集预测Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02432.pdf)\n\n- (arXiv 2022.11) 通过追踪论元的视觉状态进行**视频事件提取**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01781.pdf)\n\n- (arXiv 2022.11) 视觉Transformer的“彩票假说”，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01484.pdf)\n\n- (arXiv 2022.11) TEXTCRAFT：零样本生成高保真且多样化的**文本驱动形状**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01427.pdf)\n\n- (arXiv 2022.11) PolyBuilding：用于端到端**建筑物提取**的多边形Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01589.pdf)\n\n- (arXiv 2022.11) 重新思考预训练纯视觉Transformer中的**层次结构**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01785.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FHPViT)\n\n- (arXiv 2022.11) SAP-**DETR**：弥合显著点与基于查询的Transformer检测器之间的差距，以实现快速模型收敛，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02006.pdf)\n\n- (arXiv 2022.11) 巨型预训练图像模型能否提取**通用表征**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02043.pdf)\n\n- (arXiv 2022.11) MAEDAY：用于少样本和零样本**异常检测**的MAE，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14307.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FEliSchwartz\u002FMAEDAY)\n\n- (arXiv 2022.11) 化繁为简：无复杂操作的纯**窗口式**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14255.pdf)\n\n- (arXiv 2022.11) 环顾四周并指代：用于**3D视觉定位**的2D合成语义知识蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14241.pdf)，[[代码]](https:\u002F\u002Feslambakr.github.io\u002FLAR.github.io\u002F)\n\n- (arXiv 2022.11) SpaText：用于**可控图像生成**的空间-文本表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14305.pdf)，[[项目]](https:\u002F\u002Fomriavrahami.com\u002Fspatext)\n\n- (arXiv 2022.11) 在**2D**监督下学习**3D**场景先验，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14157.pdf)，[[项目]](https:\u002F\u002Fyinyunie.github.io\u002Fsceneprior-page\u002F)\n\n- (arXiv 2022.11) PoET：用于单视角、多目标**6D姿态估计**的姿态估计Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14125.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Faau-cns\u002Fpoet)\n\n- (arXiv 2022.11) 
用于**高光谱图像去噪**的空间-光谱Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14090.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMyuLi\u002FSST)\n\n- (arXiv 2022.11) 使用较少的双模态监督训练**视觉-语言**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00262.pdf)\n\n- (arXiv 2022.11) 利用噪声注入的**CLIP**进行仅文本训练的图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00575.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDavidHuji\u002FCapDec)\n\n- (arXiv 2022.11) 基于注意力的**神经元胞自动机**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01233.pdf)\n\n- (arXiv 2022.11) eDiff-I：采用专家去噪器集成的**文本到图像**扩散模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01324.pdf)，[[代码]](https:\u002F\u002Fdeepimagination.cc\u002FeDiff-I\u002F)\n\n- (arXiv 2022.11) 中文CLIP：在**中文**环境下进行对比学习的**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01335.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FChinese-CLIP)\n\n- (arXiv 2022.11) P^3OVD：细粒度的视觉-文本提示驱动的自训练方法，用于**开放词汇目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00849.pdf)\n\n- (arXiv 2022.11) tSF：基于Transformer的语义过滤器，用于**少样本学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00868.pdf)\n\n- (arXiv 2022.11) WITT：用于**语义通信**的无线图像传输Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.00937.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKeYang8\u002FWITT)\n\n- (arXiv 2022.11) Pair DETR：对比学习**加速**了**DETR**的训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16476.pdf)\n\n- (arXiv 2022.11) 用于**第一人称动作预测**的交互视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.14154.pdf)\n\n- (arXiv 2022.11) UDE：用于人类**运动生成**的统一驱动引擎，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.16016.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzixiangzhou916\u002FUDE\u002F)\n\n- (arXiv 2022.11) Action-**GPT**：利用大规模语言模型提升和泛化零样本**动作生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.15603.pdf)，[[项目]](https:\u002F\u002Factiongpt.github.io\u002F)\n\n- (arXiv 2022.11) 人还是机器？针对**视觉**和**语言**的**图灵测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13087.pdf)，[[代码]](https:\u002F\u002Ftinyurl.com\u002F8x8nha7p)\n\n- (arXiv 2022.11) 针对少样本**动作识别**的知识**提示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12030.pdf)\n\n- (arXiv 2022.11) UPainting：结合跨模态指导的统一文本到图像扩散生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16031.pdf)，[[项目]](https:\u002F\u002Fupainting.github.io\u002F)\n\n- (arXiv 2022.11) LVP-M^3：面向**多语言多模态机器翻译**的语言感知视觉提示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15461.pdf)\n\n- (arXiv 2022.11) PROCONTEXT：用于**跟踪**的渐进式上下文Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15511.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjp-lan\u002FProContEXT)\n\n- (arXiv 2022.11) 利用Transformer进行基于视频的对象**6D姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13540.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FApoorvaBeedu\u002FVideoPose)\n\n- (arXiv 2022.11) S2WAT：通过分层视觉Transformer和条带窗口注意力实现**图像风格迁移**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12381.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAlienZhang1996\u002FS2WAT)\n\n- (arXiv 2022.11) 用于**动作检测**的整体交互Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12686.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjoslefaure\u002FHIT)\n\n- (arXiv 2022.11) 基于先验数据的学习与检索用于技能型**模仿学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11435.pdf)，[[代码]](https:\u002F\u002Fut-austin-rpl.github.io\u002Fsail)\n\n- (arXiv 2022.11) 
SimpleClick：基于简单视觉Transformer的**交互式**图像**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11006.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Funcbiag\u002FSimpleClick)\n\n- (arXiv 2022.11) TANGO：通过光照分解实现的**文本驱动**、逼真且鲁棒的**3D风格化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11277.pdf)，[[代码]](https:\u002F\u002Fcyw-3d.github.io\u002Ftango\u002F)\n\n- (arXiv 2022.11) CPL：用于**视觉与语言**模型的反事实**提示**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10362.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Feric-ai-lab\u002FCPL)\n\n- (arXiv 2022.11) 即插即用VQA：无需训练即可通过结合大型预训练模型实现零样本**VQA**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08773.pdf)\n\n- (arXiv 2022.11) 面向**视频**语料库时刻**检索**的选择性查询引导去偏方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08714.pdf)\n\n- (arXiv 2022.11) 缩放与平移你的特征：一种新的**高效模型微调**基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08823.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdongzelian\u002FSSF)\n\n- (arXiv 2022.11) 去噪**掩码自编码器**是可认证鲁棒的视觉学习模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06983.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fquanlin-wu\u002Fdmae)\n\n- (arXiv 2022.11) 视觉Transformer中的**标记-标签对齐**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06455.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FEuphoria16\u002FTL-Align)\n\n- (arXiv 2022.11) **CLIP**-Fields：用于**机器人**记忆的弱监督语义场，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05663.pdf)，[[代码]](https:\u002F\u002Fmahis.life\u002Fclip-fields)\n\n- (arXiv 2022.11) 多尺度小波Transformer用于**人脸伪造检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03899.pdf)\n\n- (arXiv 2022.11) **CLIP**-PAE：投影增强嵌入，用于提取相关特征，实现解耦、可解释且可控的**文本引导图像操纵**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03919.pdf)\n\n- (arXiv 2022.11) 针对**测试时域适应**的视觉提示微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04831.pdf)\n\n- (arXiv 2022.11) Fast**CLIP**styler：利用风格表征实现无优化的基于文本的**图像风格迁移**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03461.pdf)\n\n- (arXiv 2022.11) 用于精细化**文本到图像**生成的渐进式去噪模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02291.pdf)\n\n- (arXiv 2022.11) **DALL-E**-Bot：将网络规模扩散模型引入**机器人技术**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02438.pdf)，[[项目]](https:\u002F\u002Fwww.robot-learning.uk\u002Fdall-e-bot)\n\n- (arXiv 2022.11) 分解软提示引导融合增强的**组合型零样本学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10681.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FForest-art\u002FDFSP.git)\n\n- (arXiv 2022.11) 带有注意力可收缩Transformer的精确**图像修复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01427.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgladzhang\u002FART)\n\n- (arXiv 2022.11) **扩张**邻域**注意力**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15001.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FNeighborhood-Attention-Transformer)\n\n- (arXiv 2022.11) 用于**视觉-语言检索**的统一成对相似性优化损失，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13869.pdf)\n\n- (arXiv 2022.11) TVLT：无文本的**视觉-语言**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14156.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzinengtang\u002FTVLT)\n\n\n\n### 2022年10月\n\n- (arXiv 2022.10) DiMBERT：使用解耦多模态注意力学习**视觉-语言**接地表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16431.pdf)\n\n- (arXiv 2022.10) TFORMER：利用几何引导Transformer在网格扫描中进行**3D牙齿分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16627.pdf)\n\n- (arXiv 2022.10) 使用StyleGAN并以**CLIP**为指导的**实时**目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16742.pdf)\n\n- (arXiv 2022.10) 
基于**CLIP**的无图像领域泛化方法，用于**3D手部姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16788.pdf)\n\n- (arXiv 2022.10) 用于**骨骼少样本动作识别**的时间-视角转换方案，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16820.pdf)\n\n- (arXiv 2022.10) 一种简单、高效且可扩展的对比**掩码自编码器**，用于学习视觉表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16870.pdf)\n\n- (arXiv 2022.10) 时间逆向扩散张量Transformer：一种新的**少样本目标检测**范式，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16897.pdf)\n\n- (arXiv 2022.10) 基于自监督定位和视觉Transformer的机场路面图像中**异物碎片检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16901.pdf)\n\n- (arXiv 2022.10) ViT-LSLA：具有**轻量自限制注意力**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17115.pdf)\n\n- (arXiv 2022.10) 用于持续**视觉-语言**预训练的生成式负文本重播，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.17322.pdf)\n\n- (arXiv 2022.10) PatchRot：一种用于训练视觉Transformer的**自监督**技术，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15722.pdf)\n\n- (arXiv 2022.10) 用于**音频-视觉**同步的多模态Transformer蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15563.pdf)\n\n- (arXiv 2022.10) 用于并行串联变分自编码器的**多模态**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.16174.pdf)\n\n- (arXiv 2022.10) 基于视觉Transformer的差分**隐私**CutMix用于分割学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15986.pdf)\n\n- (arXiv 2022.10) OHMG：零样本开放词汇人体**运动生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15929.pdf)\n\n- (arXiv 2022.10) VLT：用于**指代分割**的视觉-语言Transformer及查询生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15871.pdf)\n\n- (arXiv 2022.10) PSFORMER：用于**3D显著物体检测**的点Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15933.pdf)\n\n- (arXiv 2022.10) **嫁接**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15943.pdf)\n\n- (arXiv 2022.10) 端到端与神经符号型**视觉-语言**推理系统之间的泛化差异，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15037.pdf)\n\n- (arXiv 2022.10) FaD-VLP：面向统一检索和字幕生成的**时尚**领域的**视觉与语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15028.pdf)\n\n- (arXiv 2022.10) 时尚领域的掩码**视觉-语言**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15110.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGewelsJI\u002FMVLT)\n\n- (arXiv 2022.10) 基于视频的动作捕捉的变分运动先验学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15134.pdf)\n\n- (arXiv 2022.10) 使用冻结视觉-语言模型的开放词汇语义分割，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15138.pdf)，[[代码]](https:\u002F\u002Fyyh-rain-song.github.io\u002FFusioner_webpage\u002F)\n\n- (arXiv 2022.10) TEXT2MODEL：利用任务描述进行零样本泛化的模型归纳，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15182.pdf)\n\n- (arXiv 2022.10) 学习人类运动与语言的联合表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15187.pdf)\n\n- (arXiv 2022.10) ERNIE-ViLG 2.0：通过知识增强的去噪专家混合模型改进文本到图像扩散模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15257.pdf)\n\n- (arXiv 2022.10) MSF3DDETR：用于自动驾驶的多传感器融合3D检测Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15316.pdf)\n\n- (arXiv 2022.10) Li3DeTr：一种基于LiDAR的3D检测Transformer，[[论文]](Li3DeTr：一种基于LiDAR的3D检测Transformer)\n\n- (arXiv 2022.10) 用于图像异常定位的掩码Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15540.pdf)\n\n- (arXiv 2022.10) 发现CAD草图的设计概念，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14451.pdf)\n\n- (arXiv 2022.10) 针对视觉问答任务压缩和去偏置预训练的视觉-语言模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14558.pdf)\n\n- (arXiv 2022.10) 视频对话的端到端多模态表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14512.pdf)\n\n- (arXiv 2022.10) 
TPFNet：一种用于文本去除的新颖文本修复Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14461.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FCandleLabAI\u002FTPFNet)\n\n- (arXiv 2022.10) IMU2CLIP：基于第一人称视角视频和文字叙述的IMU运动传感器多模态对比学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14395.pdf)\n\n- (arXiv 2022.10) Transformer注意力分布能否揭示检测与跟踪目标的不确定性？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14391.pdf)\n\n- (arXiv 2022.10) SemFormer：面向弱监督语义分割的语义引导激活Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14618.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJLChen-C\u002FSemFormer)\n\n- (arXiv 2022.10) 基于多查询Transformer的端到端跟踪，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14601.pdf)\n\n- (arXiv 2022.10) 显式提高小数据集上视觉Transformer的输入信息密度，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14319.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxiangyu8\u002FDenseVT)\n\n- (arXiv 2022.10) TAMFORMER：具有可学习注意力掩码的多模态Transformer，用于早期意图预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14714.pdf)\n\n- (arXiv 2022.10) 基于跨模态相互知识迁移的视觉答案定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14823.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FWENGSYX\u002FMutualSL)\n\n- (arXiv 2022.10) 视觉语义解析：从图像到抽象意义表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14862.pdf)\n\n- (arXiv 2022.10) 用于压缩视频质量增强的端到端Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13827.pdf)\n\n- (arXiv 2022.10) PlanT：通过对象级表示实现可解释的规划Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14222.pdf)，[[项目]](https:\u002F\u002Fwww.katrinrenz.de\u002Fplant)\n\n- (arXiv 2022.10) Strong-TransCenter：基于密集表示的Transformer改进的多目标跟踪，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13570.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Famitgalor18\u002FSTC_Tracker)\n\n- (arXiv 2022.10) GliTr：具有时空一致性的凝视Transformer，用于在线动作预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13605.pdf)\n\n- (arXiv 2022.10) VLC-BERT：结合情境化常识知识的视觉问答，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13626.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Faditya10\u002FVLC-BERT)\n\n- (arXiv 2022.10) 幻觉式学习：弱监督下的视觉-语言预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13591.pdf)\n\n- (arXiv 2022.10) 利用视觉Transformer学习显式的以物体为中心的表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14139.pdf)\n\n- (arXiv 2022.10) 演绎式动作推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13984.pdf)\n\n- (arXiv 2022.10) 基于视觉Transformer的细节点引导指纹嵌入，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13994.pdf)，[[代码]](未提供)\n\n- (arXiv 2022.10) 3DALL-E：将文本到图像AI集成到3D设计工作流中，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11603.pdf)\n\n- (arXiv 2022.10) 通过迭代共识组合预训练模型的集成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11522.pdf)，[[代码]](https:\u002F\u002Fenergy-based-model.github.io\u002Fcomposing-pretrained-models)\n\n- (arXiv 2022.10) 视觉-语言Transformer是否学习了 grounded的谓词-名词依赖关系？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12079.pdf)\n\n- (arXiv 2022.10) 提升视觉Transformer在图像检索中的性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11909.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdealicious-inc\u002FDToP)\n\n- (arXiv 2022.10) LiteVL：通过增强的时空建模实现高效的视频-语言学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11929.pdf)\n\n- (arXiv 2022.10) 面向弱监督时间语言接地的细粒度语义对齐网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11933.pdf)\n\n- (arXiv 2022.10) 面部金字塔视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11974.pdf)，[[代码]](https:\u002F\u002Fkhawar-islam.github.io\u002Ffpvt\u002F)\n\n- (arXiv 2022.10) 
上下文增强的立体匹配Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11719.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fguoweiyu\u002FContext-Enhanced-Stereo-Transformer)\n\n- (arXiv 2022.10) CRT-6D：利用级联精炼Transformer实现快速的6D物体位姿估计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11718.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPedroCastro\u002FCRT-6D)\n\n- (arXiv 2022.10) 重新思考长期动作预测的学习方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11566.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNmegha2601\u002Fanticipatr)\n\n- (arXiv 2022.10) 在视觉对话中使用代词扩展短语接地，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12658.pdf)，[[代码]]()\n\n- (arXiv 2022.10) 在小数据集上，累积的琐碎注意力在视觉Transformer中至关重要，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12333.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxiangyu8\u002FSATA)\n\n- (arXiv 2022.10) 用于高空影像中识别的Transformer：一次现实检验，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12599.pdf)\n\n- (arXiv 2022.10) 面向多模态**动作预测**的前瞻特征融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12649.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzeyun-zhong\u002FAFFT)\n\n- (arXiv 2022.10) UIA-ViT：基于视觉Transformer的无监督不一致性感知方法，用于**人脸伪造检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12752.pdf)\n\n- (arXiv 2022.10) LCPFormer：通过Transformer中的局部上下文传播实现高效的3D**点云**分析，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12755.pdf)\n\n- (arXiv 2022.10) 基于**CLIP**引导的像素级优化，实现实时**文本到视频**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12826.pdf)，[[代码]](https:\u002F\u002Fpschaldenbrand.github.io\u002Ftext2video\u002F)\n\n- (arXiv 2022.10) 无需语言的零样本**视频定位**训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12977.pdf)，[[代码]]()\n\n- (arXiv 2022.10) 利用前景引导与多层特征融合的Transformer无监督**目标发现**方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13053.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVDIGPKU\u002FFORMULA)\n\n- (arXiv 2022.10) 向统一的**指代表达**生成与理解迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13076.pdf)\n\n- (arXiv 2022.10) 基于李群的鲁棒**自监督学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13356.pdf)\n\n- (arXiv 2022.10) VIOLA：基于视觉的物体提案先验的模仿学习，用于**操作**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11339.pdf)，[[代码]](https:\u002F\u002Fut-austin-rpl.github.io\u002FVIOLA)\n\n- (arXiv 2022.10) VTC：利用用户评论提升**视频-文本检索**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10820.pdf)，[[项目]](https:\u002F\u002Funitaryai.github.io\u002Fvtc-paper)\n\n- (arXiv 2022.10) 使用槽位Transformer解决**推理**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11394.pdf)，[[代码]]()\n\n- (arXiv 2022.10) 基于原型的提示学习：在预训练**视觉-语言**模型上进行原型驱动的**提示**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10841.pdf)\n\n- (arXiv 2022.10) 现实场景下的**视频情境识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10828.pdf)，[[项目]](https:\u002F\u002Fzeeshank95.github.io\u002Fgrvidsitu)\n\n- (arXiv 2022.10) 基于Swin Transformer的轻量级网络实现单张**图像超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11019.pdf)\n\n- (arXiv 2022.10) 视觉空间描述：可控的**空间**导向**图像到文本**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11109.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhaoyucs\u002FVSD)\n\n- (arXiv 2022.10) Movie**CLIP**：电影中的**场景识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11065.pdf)\n\n- (arXiv 2022.10) PointTAD：基于可学习查询点的多标签**时序动作检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11035.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FPointTAD)\n\n- (arXiv 2022.10) 
迈向可持续的**自监督**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11016.pdf)\n\n- (arXiv 2022.10) 少样本图像分类中的**视觉-语义**对比对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11000.pdf)\n\n- (arXiv 2022.10) i-MAE：**掩码自编码器**中的潜在表示是否线性可分？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11470.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvision-learning-acceleration-lab\u002Fi-mae)\n\n- (arXiv 2022.10) ECCV 2022挑战赛亚军方案：基于Transformer的**手物交互**场景下**动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11387.pdf)\n\n- (arXiv 2022.10) ECCV 2022 HBHA挑战赛冠军方案：基于Transformer的双手操作物体场景下的全局**3D手部姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11384.pdf)\n\n- (arXiv 2022.10) **DALLE-2**“眼花缭乱”：Text2Image模型中词—概念映射的**缺陷**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10606.pdf)\n\n- (arXiv 2022.10) 基于**CLIP**的细粒度文本-图像**人员重识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10276.pdf)\n\n- (arXiv 2022.10) 面向复杂组合推理的密集但高效的**VideoQA**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10300.pdf)\n\n- (arXiv 2022.10) 基于SiameseVisionTransformer的多视角**步态识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2210\u002F2210.10421.pdf)\n\n- (arXiv 2022.10) TOIST：面向任务的**实例分割**Transformer，结合名词-代词蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10775.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAIR-DISCOVER\u002FTOIST)\n\n- (arXiv 2022.10) CroCo：通过跨视图**补全**实现**3D**视觉任务的**自监督**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10716.pdf)，[[项目]](https:\u002F\u002Feurope.naverlabs.com\u002Fresearch\u002Fcomputer-vision\u002Fcroco\u002F)\n\n- (arXiv 2022.10) 对**掩码**图像建模的统一视角，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10615.pdf)，[[代码]](https:\u002F\u002Faka.ms\u002Funimim)\n\n- (arXiv 2022.10) 跨模态融合蒸馏用于细粒度的**基于草图的图像检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10486.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fabhrac\u002Fxmodal-vit)\n\n- (arXiv 2022.10) BOAT：双边局部**注意力**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.13027.pdf)\n\n- (arXiv 2022.12) **TOKEN MERGING**：你的ViT，更快！[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09461.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FToMe)\n\n- (arXiv 2022.10) 利用语言扩展至**未见领域**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09520.pdf)，[[代码]]()\n\n- (arXiv 2022.10) SWINV2-IMAGEN：用于**文本到图像**生成的层次化视觉Transformer扩散模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09549.pdf)\n\n- (arXiv 2022.10) HUMANISE：语言条件下的**3D场景**中人类**运动生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09729.pdf)，[[项目]](https:\u002F\u002Fsilverster98.github.io\u002FHUMANISE\u002F)\n\n- (arXiv 2022.10) **视频分类**的迁移学习：多领域上的Video Swin Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09969.pdf)\n\n- (arXiv 2022.10) **视觉-语言**模型中的**知觉分组**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09996.pdf)\n\n- (arXiv 2022.10) 掩码的重要性：迈向对**掩码自编码器**的**理论理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08344.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhangq327\u002FU-MAE)\n\n- (arXiv 2022.10) 带有特征固定的**线性****视频**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08164.pdf)，[[代码]]()\n\n- (arXiv 2022.10) 基于Transformer的**降维**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08288.pdf)\n\n- (arXiv 2022.10) 消除**多智能体感知**中的领域差距，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08451.pdf)\n\n- (arXiv 2022.10) 
TransVisDrone：用于空中视频中基于视觉的无人机间目标检测的时空Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08423.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ftusharsangam\u002FTransVisDrone)\n\n- (arXiv 2022.10) 用均匀注意力刮擦视觉Transformer的“后背”，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08457.pdf)\n\n- (arXiv 2022.10) 从信息论视角提升多模态神经机器翻译中的视觉感知，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08478.pdf)\n\n- (arXiv 2022.10) TLDW：新闻视频的极端多模态摘要，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08481.pdf)，[[代码]]()\n\n- (arXiv 2022.10) 基于视觉规划与标记对齐的人物中心故事可视化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08465.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPlusLabNLP\u002FVP-CSV)\n\n- (arXiv 2022.10) COFAR：图像搜索中的常识与事实推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08554.pdf)，[[代码]](https:\u002F\u002Fvl2g.github.io\u002Fprojects\u002Fcofar)\n\n- (arXiv 2022.10) 学习自正则化的自监督视觉Transformer对抗视图，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08458.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTrent-tangtao\u002FAutoView)\n\n- (arXiv 2022.10) 用于电视剧多机位剪辑的时序与上下文Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08737.pdf)\n\n- (arXiv 2022.10) 基于场景历史预测人类轨迹，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08732.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMaKaRuiNah\u002FSHENet)\n\n- (arXiv 2022.10) SGRAM：通过抽象语义表示改进场景图解析，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08675.pdf)\n\n- (arXiv 2022.10) 基于知识图谱的对比式语言-图像预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08901.pdf)\n\n- (arXiv 2022.10) 一种适用于通用目标检测的扫视型视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09220.pdf)\n\n- (arXiv 2022.10) 视觉Transformer可证明地学习空间结构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09221.pdf)\n\n- (arXiv 2022.10) oViT：视觉Transformer的精确二阶剪枝框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09223.pdf)\n\n- (arXiv 2022.10) 在粗粒度监督下利用层次加权自对比学习进行细粒度类别发现，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07733.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLackel\u002FHierarchical_Weighted_SCL)\n\n- (arXiv 2022.10) 非对比学习遇见语言-图像预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09304.pdf)\n\n- (arXiv 2022.10) 帧挖掘：从3D点云中学习机器人操作的免费午餐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07442.pdf)，[[项目]](https:\u002F\u002Fcolin97.github.io\u002FFrameMining\u002F)\n\n- (arXiv 2022.10) 预训练Transformer并不总是能提高鲁棒性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07663.pdf)\n\n- (arXiv 2022.10) 合理未必忠实：探究视觉-语言预训练中的对象幻觉，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07688.pdf)\n\n- (arXiv 2022.10) 对比式视听掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07839.pdf)\n\n- (arXiv 2022.10) SWFormer：用于点云中3D目标检测的稀疏窗口Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07372.pdf)\n\n- (arXiv 2022.10) Trailers12k：通过双模态图像和视频Transformer进行多标签电影预告片类型分类，以提升迁移学习效果，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07983.pdf)\n\n- (arXiv 2022.10) AVLEN：在3D环境中实现的视听语言具身导航，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07940.pdf)\n\n- (arXiv 2022.10) MOVE：无监督的可移动物体分割与检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07920.pdf)\n\n- (arXiv 2022.10) 生成模型合成的数据是否已准备好用于图像识别？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07574.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FSyntheticData)\n\n- (arXiv 2022.10) 朝着基于Transformer的Landsat-8与Sentinel-2卫星影像同质化方向发展，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07654.pdf)\n\n- (arXiv 2022.10) 
MCTNET：用于光学遥感图像变化检测的多尺度CNN-Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07601.pdf)\n\n- (arXiv 2022.10) 视觉Transformer可视化：神经元在说什么？它们的行为又是怎样的？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07646.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FbyM1902\u002FViT_visualization)\n\n- (arXiv 2022.10) TokenMixup：面向Transformer的高效注意力引导型标记级数据增强，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07562.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmlvlab\u002FTokenMixup)\n\n- (arXiv 2022.10) SQA3D：3D场景中的情境问答，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07474.pdf)\n\n- (arXiv 2022.10) 当对抗训练遇到视觉Transformer时：从训练到架构的实践指南，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07540.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmo666666\u002FWhen-Adversarial-Training-Meets-Vision-Transformers)\n\n- (arXiv 2022.10) STAR-Transformer：用于人体动作识别的时空交叉注意力Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07503.pdf)\n\n- (arXiv 2022.10) PedFormer：通过跨模态注意力调制与门控多任务学习进行行人行为预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07886.pdf)\n\n- (arXiv 2022.10) 一个模型搞定一切：基于语义调制的自由形式文本驱动图像编辑，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07883.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKumapowerLIU\u002FFFCLIP)\n\n- (arXiv 2022.10) IMAGINARYNET：无需真实图像和标注即可学习目标检测器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06886.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkodenii\u002FImaginaryNet)\n\n- (arXiv 2022.10) 用于少样本分割的特征代理Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06908.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJarvis73\u002FFPTrans)\n\n- (arXiv 2022.10) 基于内容感知损失和十字交叉Transformer块的场景文本图像超分辨率，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06924.pdf)\n\n- (arXiv 2022.10) 统一的视觉与语言提示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07225.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyuhangzang\u002FUPT)\n\n- (arXiv 2022.10) 探索长序列掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07224.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flong_seq_mae)\n\n- (arXiv 2022.10) MAPL：用于视觉-语言少样本提示的单模态预训练模型的参数高效适配，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07179.pdf)\n\n- (arXiv 2022.10) 交互式**语言**：与**机器人**实时对话，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06407.pdf)，[[项目]](https:\u002F\u002Finteractive-language.github.io\u002F)\n\n- (arXiv 2022.10) RTFormer：基于Transformer的高效实时**语义分割**设计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07124.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleSeg)\n\n- (arXiv 2022.10) 如何在**小规模数据集**上**训练**视觉Transformer？，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07240.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhananshafi\u002Fvits-for-small-scale-datasets)\n\n- (arXiv 2022.10) Hate-CLIPper：基于**CLIP**特征跨模态交互的多模态仇恨**表情包分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05916.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgokulkarthik\u002Fhateclipper)\n\n- (arXiv 2022.10) 大型模型是节俭的学习者：训练后的Transformer中的**激活稀疏性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06313.pdf)\n\n- (arXiv 2022.10) 视觉Transformer的**表示空间**弯曲特性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05742.pdf)\n\n- (arXiv 2022.10) **基础**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06423.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n- (arXiv 2022.10) **场景描述到图像生成**任务中的欠定性问题，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05815.pdf)\n\n- (arXiv 2022.10) 
基于神经过程的连续条件**视频合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05810.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNPVS\u002FNPVS)\n\n- (arXiv 2022.10) SAIT：通过自适应标记**剪枝**实现稀疏视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05832.pdf)\n\n- (arXiv 2022.10) ZITS++：通过改进结构先验上的增量Transformer进行**图像修复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05950.pdf)\n\n- (arXiv 2022.10) SLOTFORMER：基于以对象为中心的模型的无监督视觉**动态模拟**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05861.pdf)，[[项目]](https:\u002F\u002Fslotformer.github.io\u002F)\n\n- (arXiv 2022.10) 基于知识的新颖**物体识别**：通过提问学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05879.pdf)\n\n- (arXiv 2022.10) 缩小视觉Transformer与**卷积神经网络**在小数据集上的**差距**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05958.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FArieSeirack\u002FDHVT)\n\n- (arXiv 2022.10) GGViT：用于Face2Face**面部重演检测**的多流视觉Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05990.pdf)\n\n- (arXiv 2022.10) 从语言模型中蒸馏知识用于**基于视频的动作预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05991.pdf)\n\n- (arXiv 2022.10) 基于多模态时序对比学习的长视频**视频-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06031.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain)\n\n- (arXiv 2022.10) M3VIDEO：用于自监督**视频表示学习**的**掩码**运动建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06096.pdf)\n\n- (arXiv 2022.10) Uplift和Upsample：利用提升型Transformer进行高效的**3D人体姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06110.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoldbricklemon\u002Fuplift-upsample-3dhpe)\n\n- (arXiv 2022.10) FontTransformer：通过堆叠Transformer实现少样本高分辨率中文**字形图像合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06301.pdf)\n\n- (arXiv 2022.10) AISFormer：基于Transformer的无模态**实例分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06323.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FUARK-AICV\u002FAISFormer)\n\n- (arXiv 2022.10) ViewBirdiformer：从单一自我中心视角学习恢复地面平面的**人群轨迹**和**自身运动**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06332.pdf)\n\n- (arXiv 2022.10) 并不存在“一刀切”！关于视觉编码器在**视觉与语言**任务中的互补性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06379.pdf)\n\n- (arXiv 2022.10) 用于高效**适应**冻结视觉Transformer的**提示生成**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06466.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjochemloedeman\u002FPGN)\n\n- (arXiv 2022.10) 利用环境感知语言模型生成可执行的动作**计划**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04964.pdf)\n\n- (arXiv 2022.10) AVE-**CLIP**：基于AudioCLIP的多窗口时序Transformer，用于**视听事件定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05060.pdf)\n\n- (arXiv 2022.10) 通过密集负样本对改进密集**对比学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05063.pdf)\n\n- (arXiv 2022.10) 利用视觉Transformer进行细粒度**图像风格迁移**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05176.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FSTTR)\n\n- (arXiv 2022.10) 需要两者兼备：用于**自监督视频**Transformer预训练的**掩码**外观-运动建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05234.pdf)\n\n- (arXiv 2022.10) 基于细粒度帧采样的对比**视频-语言**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05039.pdf)\n\n- (arXiv 2022.10) 基于风格引导的Transformer推理用于高分辨率**图像合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05533.pdf)\n\n- (arXiv 2022.10) MAP：模态无关、不确定性感知的**视觉-语言**预训练模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05335.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIIGROUP\u002FMAP)\n\n- (arXiv 2022.10) 
学习如何使用**问题**在**视频**语料库中定位视觉**答案**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05423.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FWENGSYX\u002FCCGS)\n\n- (arXiv 2022.10) 使用触线Transformer理解**具身**指代，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05668.pdf)\n\n- (arXiv 2022.10) **点**Transformer V2：分组向量注意力与基于分区的池化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05666.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGofinge\u002FPointTransformerV2)\n\n- (arXiv 2022.10) 看见、计划、预测：结合视频预测的语言引导认知**规划**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03825.pdf)\n\n- (arXiv 2022.10) 同时利用演示和语言指令高效学习**机器人**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04476.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdel-taco-learning)\n\n- (arXiv 2022.10) 结合外部百科全书知识生成图像**标题**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04806.pdf)\n\n- (arXiv 2022.10) LOCL：利用定位学习**对象-属性组合**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03780.pdf)\n\n- (arXiv 2022.10) SVL适配器：用于**视觉-语言**预训练模型的自监督适配器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03794.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fomipan\u002Fsvl_adapter)\n\n- (arXiv 2022.10) ConTra：（Con）文本（Tra）nsformer用于**跨模态视频检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04341.pdf)\n\n- (arXiv 2022.10) 通过解耦时空建模学习针对**视频问答**的细粒度视觉理解，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03941.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshinying\u002Fdest)\n\n- (arXiv 2022.10) (Fusionformer): 利用基于Transformer的融合网络挖掘联合运动协同效应，用于**3D人体姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04006.pdf)\n\n- (arXiv 2022.10) Fast-ParC: 一种适用于ConvNet和ViT的全局位置感知**核**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04020.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyangtao2019yt\u002FFast_ParC.git)\n\n- (arXiv 2022.10) OGC: 基于点云刚体动力学的无监督**3D目标分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04458.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FvLAR-group\u002FOGC)\n\n- (arXiv 2022.10) 面向**遥感**领域的多模态融合Transformer用于**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04510.pdf)，[[代码]](https:\u002F\u002Fgit.tu-berlin.de\u002Frsim\u002Fmulti-modal-fusion-transformer-for-vqa-in-rs)\n\n- (arXiv 2022.10) 基于最优传输对齐实现语义一致的**跨域摘要**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04722.pdf)\n\n- (arXiv 2022.10) VOLTA：带有弱监督局部特征对齐的**视觉-语言**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04135.pdf)\n\n- (arXiv 2022.10) 开放词汇语义**分割**与掩码自适应**CLIP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04150.pdf)，[[项目]](https:\u002F\u002Fjeff-liangf.github.io\u002Fprojects\u002Fovseg)\n\n- (arXiv 2022.10) MAMO：用于细粒度**视觉-语言**表征学习的掩码多模态建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04183.pdf)\n\n- (arXiv 2022.10) 基于运动感知掩码自编码器的自监督**视频**表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04154.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhappy-hsy\u002FMotionMAE)\n\n- (arXiv 2022.10) 利用潜在**文本**提示学习分解**视觉**特征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04287.pdf)\n\n- (arXiv 2022.10) DCVQE：一种用于**视频质量评估**的层次化Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04377.pdf)\n\n- (arXiv 2022.10) 面向服务**机器人**的**细粒度物体**分类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04613.pdf)\n\n- (arXiv 2022.10) **CLIP**-扩散-LM：将扩散模型应用于图像**标题生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04559.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxu-shitong\u002Fdiffusion-image-captioning)\n\n- (arXiv 2022.10) 
用于**增量学习**的记忆Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04485.pdf)\n\n- (arXiv 2022.10) 通过潜在对齐连接**CLIP**和StyleGAN，用于**图像编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04506.pdf)\n\n- (arXiv 2022.10) LMQFormer：一种基于拉普拉斯先验引导的掩码查询Transformer，用于**轻量级除雪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04787.pdf)\n\n- (arXiv 2022.10) FS-DETR：一种无需重新训练、基于提示的**少样本**检测Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04845.pdf)\n\n- (arXiv 2022.10) 基于Transformer的具身对话定位，结合大规模预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04864.pdf)\n\n- (arXiv 2022.10) 使用标记**丢弃**进行加速训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.04889.pdf)\n\n- (arXiv 2022.10) Polyhistor：面向密集型视觉任务的参数**高效**多任务适配，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03265.pdf)\n\n- (arXiv 2022.10) C2KD：用于多语言**文本-视频检索**的跨语言跨模态知识蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03625.pdf)\n\n- (arXiv 2022.10) 基于**姿态**引导的部分解耦GAN用于**人物图像合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03627.pdf)\n\n- (arXiv 2022.10) 一个简单的插件，用于将**图像**转换为**任意尺度**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03417.pdf)，[[项目]](https:\u002F\u002Flipurple.github.io\u002FARIS_Webpage\u002F)\n\n- (arXiv 2022.10) 时空Transformer用于**视频全景分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03546.pdf)\n\n- (arXiv 2022.10) MOAT：交替使用**移动**卷积和注意力机制，可构建强大的视觉模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01820.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fdeeplab2)\n\n- (arXiv 2022.10) **IMAGEN** 视频：利用**扩散**模型生成高清晰度**视频**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02303.pdf)，[[项目]](https:\u002F\u002Fimagen.research.google\u002Fvideo\u002F)\n\n- (arXiv 2022.10) clip2latent：使用去噪扩散和**CLIP**驱动采样预训练的**StyleGAN**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02347.pdf)\n\n- (arXiv 2022.10) FQDet：一种快速收敛的基于查询的**检测器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02318.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FCedricPicron\u002FFQDet)\n\n- (arXiv 2022.10) 变分**提示**调优提升**视觉-语言**模型的泛化能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02390.pdf)\n\n- (arXiv 2022.10) 在非结构化数据上通过**视觉**可用性锚定**语言**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01911.pdf)，[[项目]](http:\u002F\u002Fhulc2.cs.uni-freiburg.de\u002F)\n\n- (arXiv 2022.10) 面向多任务的**图像检索**中的粒度感知适配，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02254.pdf)\n\n- (arXiv 2022.10) 何时以及为何**视觉-语言**模型表现得像词袋？又该如何应对？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01936.pdf)\n\n- (arXiv 2022.10) 多视角**人体**体格网格转换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01886.pdf)\n\n- (arXiv 2022.10) 探索均值教师在**自监督**掩码自编码器中的作用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02077.pdf)\n\n- (arXiv 2022.10) 基于位置到结构注意力Transformer的**点云识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02030.pdf)\n\n- (arXiv 2022.10) 用于长期**视频预测**的时序一致视频Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02396.pdf)，[[代码]](https:\u002F\u002Fwilson1yan.github.io\u002Fteco)\n\n- (arXiv 2022.10) PHENAKI：从开放领域文本描述中生成可变长度**视频**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02399.pdf)\n\n- (arXiv 2022.10) MuRAG：一种多模态检索增强生成器，用于基于图像和文本的**开放式问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02928.pdf)\n\n- (arXiv 2022.10) 基于掩码视觉预训练的真实世界**机器人学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03109.pdf)，[[项目]](https:\u002F\u002Ftetexiao.com\u002Fprojects\u002Freal-mvp)\n\n- (arXiv 2022.10) 
BaseTransformers：基于基础数据点的注意力机制，用于**单样本（one-shot）**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02476.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmayug\u002FBaseTransformers)\n\n- (arXiv 2022.10) 面向基于**骨骼**的**动作识别**的焦点与全局时空Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02693.pdf)\n\n- (arXiv 2022.10) 基于视觉Transformer的模型，用于将一组**图像**描述为一个故事，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02762.pdf)\n\n- (arXiv 2022.10) 基于内容感知查询的Transformer实现**视频指代表达理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02953.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmengcaopku\u002FContFormer)\n\n- (arXiv 2022.10) 在低算力网络上无需蒸馏的**有效**自监督预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02808.pdf)\n\n- (arXiv 2022.10) **CLIP**模型是一种高效的**持续学习**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03114.pdf)\n\n- (arXiv 2022.10) 针对深度**生成**模型的内容检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03116.pdf)\n\n- (arXiv 2022.10) MAPLE：**多模态****提示**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03117.pdf)，[[代码]](https:\u002F\u002Ftinyurl.com\u002F2dzs8f3w)\n\n- (arXiv 2022.10) 在**结构化任务**上训练的Transformer中的系统性泛化与涌现结构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00400.pdf)\n\n- (arXiv 2022.10) 宽视野**注意力**是Transformer的发展方向吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00640.pdf)\n\n- (arXiv 2022.10) DARTFORMER：寻找最佳类型的**注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00641.pdf)\n\n- (arXiv 2022.10) MOBILEVITV3：一种移动端友好的视觉Transformer，采用简单有效的局部、全局及输入特征融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15159.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FmicronDLA\u002FMobileViTv3)\n\n- (arXiv 2022.10) 针对**物体放置**的可微分解析与语言指令的视觉**定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00215.pdf)，[[项目]](https:\u002F\u002F1989ryan.github.io\u002Fprojects\u002Fparagon.html)\n\n- (arXiv 2022.10) EAPruning：用于视觉Transformer和CNN的进化式**剪枝**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00181.pdf)\n\n- (arXiv 2022.10) 视频中基于运动的自监督**目标发现**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00221.pdf)\n\n- (arXiv 2022.10) 用于**遥感**图像变化检测的全Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00757.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAI-Zhpp\u002FFTN)\n\n- (arXiv 2022.10) 朝向视觉参数**高效****迁移学习**的统一视角，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00788.pdf)\n\n- (arXiv 2022.10) 用于生成式**迁移学习**的视觉**提示**调优，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00990.pdf)\n\n- (arXiv 2022.10) 视觉Transformer中**RGB-D融合**的强大迁移基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00843.pdf)\n\n- (arXiv 2022.10) LPT：面向图像分类的**长尾****提示**调优，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01033.pdf)\n\n- (arXiv 2022.10) 加速大规模视觉Transformer用于**密集预测**而无需微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01035.pdf)\n\n- (arXiv 2022.10) CLIP2POINT：通过图像-深度预训练将**CLIP**迁移到**点云分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01055.pdf)\n\n- (arXiv 2022.10) Dual-former：用于高效**图像修复**的混合自注意力Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01069.pdf)\n\n- (arXiv 2022.10) 面向**视觉与语言**基础模型的语言感知**软提示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01115.pdf)\n\n- (arXiv 2022.10) ASIF：耦合数据无需训练即可将单模态模型变为**多模态**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01738.pdf)\n\n- (arXiv 2022.10) ImmFusion：在所有天气条件下进行鲁棒的毫米波-RGB融合，用于**三维人体重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01346.pdf)\n\n- (arXiv 2022.10) 
使用**最优传输**进行**视觉-语言**模型的提示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01253.pdf)\n\n- (arXiv 2022.10) 用于视觉和点云**三维目标检测**的桥接Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01391.pdf)\n\n- (arXiv 2022.10) 用于单目视觉**里程计**中尺度估计的密集预测Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.01723.pdf)\n\n- (arXiv 2022.10) 人类**运动****扩散**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14916.pdf)，[[项目]](https:\u002F\u002Fguytevet.github.io\u002Fmdm-page\u002F)\n\n- (arXiv 2022.10) TokenFlow：重新思考**视觉-语言检索**中的细粒度跨模态对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13822.pdf)\n\n- (arXiv 2022.10) Uni**CLIP**：对比学习型**语言-图像**预训练的统一框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13430.pdf)\n\n- (arXiv 2022.10) CrossDTR：用于**三维目标检测**的跨视图和深度引导的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13507.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsty61010\u002FCrossDTR)\n\n- (arXiv 2022.10) 用于鲁棒**动作识别**的多数据集训练Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12362.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJunweiLiang\u002FMultiTrain)\n\n- (arXiv 2022.10) 多尺度**人-物交互**检测器，[[论文]](https:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=9927451)\n\n- (arXiv 2022.10) LGDN：用于**视频-语言**建模的语言引导去噪网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11388.pdf)\n\n- (arXiv 2022.10) RaP：面向**文本-视频**检索的冗余感知视频-语言预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06881.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcaskcsg\u002FVLP\u002Ftree\u002Fmain\u002FRaP)\n\n- (arXiv 2022.10) 用于少样本**语义分割**的中间原型挖掘Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06780.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLIUYUANWEI98\u002FIPMT)\n\n- (arXiv 2022.10) 通过**脑-视觉-语言**特征的多模态学习解码视觉神经表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06756.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChangdeDu\u002FBraVL)\n\n- (arXiv 2022.10) Q-ViT：准确且完全**量化**的低比特视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06707.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYanjingLi0202\u002FQ-ViT)\n\n- (arXiv 2022.10) 前缀域Transformer：无需花哨技术的异构**人脸识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06529.pdf)\n\n- (arXiv 2022.10) 用于人类**视频动作推理**的视觉知识图谱，[[论文]](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3503161.3548257)\n\n- (arXiv 2022.10) 用于随机性**运动预测**的人体关节运动学扩散-精炼，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05976.pdf)\n\n- (arXiv 2022.10) VIMA：利用多模态**提示**进行通用**机器人操作**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03094.pdf)，[[项目]](https:\u002F\u002Fvimalabs.github.io\u002F)\n\n- (arXiv 2022.10) 系统接下来应该做什么？：用于估计系统行为的**操作性动作字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02735.pdf)\n\n- (arXiv 2022.10) DMMGAN：基于注意力机制的生成对抗网络实现的多样化多**运动预测**，应用于三维人体关节，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09124.pdf)\n\n- (arXiv 2022.10) PIZZA：一种强大的仅使用图像的零样本零CAD方法，用于**6自由度跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07589.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnv-nguyen\u002Fpizza)\n\n\n\n### 2022年9月\n\n- (arXiv 2022.09) 用于Transformer进一步**预训练**的自蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.02871.pdf)\n\n- (arXiv 2022.09) 用于操控任务的**视觉-触觉**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00121.pdf)，[[项目]](https:\u002F\u002Fwww.mmintlab.com\u002Fvtt)\n\n- (arXiv 2022.09) 
理解纯**CLIP**指导在**体素**网格**NeRF**模型中的应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15172.pdf)，[[项目]](https:\u002F\u002Fhanhung.github.io\u002FPureCLIPNeRF\u002F)\n\n- (arXiv 2022.09) 用于弱监督语义**分割**的双重渐进式变换，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15211.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhuodongjian0603\u002Fcrt)\n\n- (arXiv 2022.09) 用于大型**点云**中目标**检测**的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15258.pdf)\n\n- (arXiv 2022.09) 基于**扩散**模型的**图像翻译**，采用解耦的风格与内容表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15264.pdf)\n\n- (arXiv 2022.09) ERNIE-VIL 2.0：用于**图像-文本**预训练的多视角对比学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15270.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FERNIE\u002F)\n\n- (arXiv 2022.09) 从自然**脚本**知识中学习可迁移的**时空**表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15280.pdf)\n\n- (arXiv 2022.09) SMALLCAP：基于检索增强的轻量级图像**字幕生成**提示模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15323.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FRitaRamo\u002Fsmallcap)\n\n- (arXiv 2022.09) SPIKFORMER：当**脉冲神经网络**遇到Transformer时，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15425.pdf)\n\n- (arXiv 2022.09) F-VLM：基于冻结视觉和语言模型的开放词汇表目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15639.pdf)\n\n- (arXiv 2022.09) 对比语料库归因法用于解释表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00107.pdf)\n\n- (arXiv 2022.09) 对齐引导的时序注意力用于**视频动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00132.pdf)\n\n- (arXiv 2022.09) EDA：显式文本解耦与密集对齐技术，用于**3D视觉与语言**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14941.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyanmin-wu\u002FEDA)\n\n- (arXiv 2022.09) SPOTLIGHT：利用聚焦型视觉-语言模型进行**移动UI理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14927.pdf)\n\n- (arXiv 2022.09) DREAMFUSION：使用2D**扩散**模型实现**文本到3D**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14988.pdf)，[[项目]](https:\u002F\u002Fdreamfusion3d.github.io\u002F)\n\n- (arXiv 2022.09) REST：通过检索与自训练实现生成式**动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15000.pdf)\n\n- (arXiv 2022.09) **高效**视觉Transformer**训练**：数据驱动视角，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.15006.pdf)\n\n- (arXiv 2022.09) 利用BERT场景表示的人机协作机器人**抓取**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14026.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fhitl-grasping-bert)\n\n- (arXiv 2022.09) 从**因果**视角重新审视**少样本**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13816.pdf)\n\n- (arXiv 2022.09) 对压缩视觉Transformer的**攻击**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13785.pdf)\n\n- (arXiv 2022.09) 自适应稀疏ViT：通过充分挖掘自注意力机制，实现可学习的自适应**标记剪枝**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13802.pdf)\n\n- (arXiv 2022.09) DeViT：变形视觉Transformer在**视频修复**中的应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13925.pdf)\n\n- (arXiv 2022.09) Obj2Seq：将**对象**格式化为序列，并结合类别提示用于视觉任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13948.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FCASIA-IVA-Lab\u002FObj2Seq)\n\n- (arXiv 2022.09) Dynamic MDETR：用于视觉**定位**的动态多模态Transformer解码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13959.pdf)\n\n- (arXiv 2022.09) 无监督**图像动画**的运动Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14024.pdf)\n\n- (arXiv 2022.09) 加权对比**哈希**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14099.pdf)，[[代码]](http:\u002F\u002Fgithub.com\u002FRosieYuu\u002FWCH)\n\n- (arXiv 2022.09) 
CALIP：无需参数的注意力机制实现的**CLIP**零样本增强，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14169.pdf)\n\n- (arXiv 2022.09) 面向任务驱动具身智能体的对话行为，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12953.pdf)\n\n- (arXiv 2022.09) NEURAL MARIONETTE：基于Transformer的多动作人类**运动合成**系统，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13204.pdf)，[[代码]](https:\u002F\u002Fwjohnnyw.github.io\u002Fblog\u002Ftag2motion\u002F)\n\n- (arXiv 2022.09) 拥抱一致性：一种用于**时空视频定位**的单阶段方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13306.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FSTCAT)\n\n- (arXiv 2022.09) 文本自适应的多视觉原型匹配，用于**视频-文本**检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13307.pdf)\n\n- (arXiv 2022.09) 向参数高效的整合预训练**语言**模型以用于时间**视频**定位迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13359.pdf)\n\n- (arXiv 2022.09) 使用Transformer进行**航拍**视频中的**异常检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13363.pdf)，[[代码]](https:\u002F\u002Fyoutu.be\u002FancczYryOBY)\n\n- (arXiv 2022.09) AdaFocusV3：关于统一的时空动态**视频识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13465.pdf)\n\n- (arXiv 2022.09) 具有全局意图**定位**和局部运动细化的**运动**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13508.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsshaoshuai\u002FMTR)\n\n- (arXiv 2022.09) FREESEG：基于可解释的对比语言-图像预训练的免费掩码，用于**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13558.pdf)\n\n- (arXiv 2022.09) 从可听的**交互**中学习状态感知的视觉表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13583.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHimangiM\u002FRepLAI)\n\n- (arXiv 2022.09) 向可解释的**3D**接地**视觉问答**迈进：一个新的基准和强大的基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12028.pdf)\n\n- (arXiv 2022.09) 利用自监督训练进行**无意动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11870.pdf)\n\n- (arXiv 2022.09) NeRF-Loc：基于Transformer的**物体定位**方法，应用于**神经辐射场**中，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12068.pdf)\n\n- (arXiv 2022.09) 一切皆有价值：用于基于分数的**扩散模型**的ViT骨干网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12152.pdf)\n\n- (arXiv 2022.09) 改写就够了：用于新颖物体**图像描述**的方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12343.pdf)\n\n- (arXiv 2022.09) 预训练模型的协作使**少样本**学习效果更佳，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12255.pdf)\n\n- (arXiv 2022.09) 多模态**视频章节生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2209\u002F2209.12694.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fczt117\u002FMVCG)\n\n- (arXiv 2022.09) **文本到图像**模型的最佳提示词及其寻找方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11711)\n\n- (arXiv 2022.09) Swin2SR：用于**压缩图像超分辨率**和**修复**的SwinV2 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11345.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmv-lab\u002Fswin2sr)\n\n- (arXiv 2022.09) 3DPCT：具有双重自注意力机制的3D**点云**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11255.pdf)\n\n- (arXiv 2022.09) 用于移动设备上人体**行为识别**的**轻量级**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11750.pdf)\n\n- (arXiv 2022.09) PACT：用于**自回归机器人预训练**的感知-动作因果Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11133.pdf)\n\n- (arXiv 2022.09) UniColor：一种基于Transformer的多模态**彩色化**统一框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11223.pdf)，[[代码]](https:\u002F\u002Fluckyhzt.github.io\u002Funicolor)\n\n- (arXiv 2022.09) 基于上下文视觉Transformer的**交通事故风险预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.11180.pdf)\n\n- (arXiv 2022.09) 
CONE：一种高效的粗到精对齐框架，用于**长视频时间定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10918.pdf)\n\n- (arXiv 2022.09) 从未分割的烹饪视频中生成**食谱**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10134.pdf)\n\n- (arXiv 2022.09) PicT：一种轻量级弱监督视觉Transformer，用于**路面病害分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10074.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDearCaat\u002FPicT)\n\n- (arXiv 2022.09) 展示、解释与讲述：维基百科中基于实体感知的上下文化图像**描述**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10474.pdf)\n\n- (arXiv 2022.09) RNGDet++：结合实例分割和多尺度特征增强的Transformer驱动的**道路网络图检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10150.pdf)，[[代码]](https:\u002F\u002Ftonyxuqaq.github.io\u002Fprojects\u002FRNGDetPlusPlus\u002F)\n\n- (arXiv 2022.09) 朝着类人型基于文本的**视觉问答**的3D空间推理迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10326.pdf)\n\n- (arXiv 2022.09) I2DFormer：用于**零样本图像分类**的**图像**到**文档**注意力学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.10304.pdf)\n\n- (arXiv 2022.09) 基于Transformer模型的**整数**微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09815.pdf)\n\n- (arXiv 2022.09) 面向**现实世界规划**的开放词汇可查询场景表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09874.pdf)，[[代码]](https:\u002F\u002Fnlmap-saycan.github.io\u002F)\n\n- (arXiv 2022.09) Det**CLIP**：面向**开放世界检测**的字典增强型视觉概念并行预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09407.pdf)\n\n- (arXiv 2022.09) 用于从**第一视角**RGB视频中进行**3D手部姿态估计**和**动作识别**的层次化时间Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09484.pdf)\n\n- (arXiv 2022.09) 用于**图像解析**的**图**推理Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09545.pdf)\n\n- (arXiv 2022.09) **量子**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08167.pdf)\n\n- (arXiv 2022.09) 野外主动**视觉搜索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08803.pdf)\n\n- (arXiv 2022.09) PPT：用于单目和多视角人体**姿态估计**的令牌剪枝姿态Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08194.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHowieMa\u002FPPT)\n\n- (arXiv 2022.09) 学习图像**描述**中的独特且具有代表性的**模式**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08231.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbladewaltz1\u002FModeCap)\n\n- (arXiv 2022.09) TODE-Trans：基于Transformer的透明物体**深度估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08455.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyuchendoudou\u002FTODE)\n\n- (arXiv 2022.09) 基于树结构的**文本-视觉**BERT，用于百度视频广告中的视频搜索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08759.pdf)\n\n- (arXiv 2022.09) 利用Transformer进行集成特征和代价聚合以实现**稠密对应关系**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08742.pdf)\n\n- (arXiv 2022.09) 用于视觉Transformer中**局部-全局交互**的轴向扩展窗口，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08726.pdf)\n\n- (arXiv 2022.09) 适用于**无人机**平台的**目标再识别**任务的**不确定性感知**多任务金字塔视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08686.pdf)\n\n- (arXiv 2022.09) TASKED：基于Transformer的对抗性学习方法，通过**自我知识蒸馏**利用**可穿戴传感器**进行**人类活动识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09092.pdf)\n\n- (arXiv 2022.09) EcoFormer：具有**线性复杂度**的节能注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09004.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fziplab\u002FEcoFormer)\n\n- (arXiv 2022.09) 用于360°视频中**显著性检测**的**全景**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08956.pdf)\n\n- (arXiv 2022.09) 有偏见的艺术家：在**文本引导的图像生成模型**中利用同形异义词来放大**文化偏见**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08891.pdf)\n\n- (arXiv 2022.09) 
将**场景图修改**视为增量式结构扩展，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09093.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTHU-BPM\u002FSGM)\n\n- (arXiv 2022.09) 在自监督Transformer中对提议进行判别采样，用于**弱监督物体定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09209.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshakeebmurtaza\u002Fdips)\n\n- (arXiv 2022.09) 基于时间平滑Transformer的实时**在线视频检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09236.pdf)\n\n- (arXiv 2022.09) ViT-DD：用于半监督**驾驶员分心检测**的多任务视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09178.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPurdueDigitalTwin\u002FViT-DD)\n\n- (arXiv 2022.09) 代码即策略：用于**具身控制**的语言模型程序，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07753.pdf)，[[项目]](https:\u002F\u002Fcode-as-policies.github.io\u002F)\n\n- (arXiv 2022.09) SQ-Swin：用于**生菜褐变预测**的预训练暹罗二次Swin Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07683.pdf)\n\n- (arXiv 2022.09) 用于**高效**深度学习的自注意力池化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07659.pdf)\n\n- (arXiv 2022.09) 无源域泛化的领域统一提示表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14926.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmuse1998\u002FSource-Free-Domain-Generalization)\n\n- (arXiv 2022.09) 拉近与真实世界**以对象为中心的学习**的差距，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14860.pdf)\n\n- (arXiv 2022.09) 提示引导的**场景生成**用于**3D**零样本学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14690.pdf)\n\n- (arXiv 2022.09) RE-IMAGEN：检索增强型**文本到图像**生成器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14491.pdf)\n\n- (arXiv 2022.09) 面向条件自然语言生成的分布感知**指标**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07518.pdf)\n\n- (arXiv 2022.09) **CLIP**ping隐私：针对多模态机器学习模型的身份推断**攻击**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07341.pdf)\n\n- (arXiv 2022.09) 利用相关性信息瓶颈微调预训练**视觉-语言**模型以实现鲁棒的**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06954.pdf)\n\n- (arXiv 2022.09) PriorLane：一种基于Transformer的先验知识增强型**车道检测**方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06994.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvincentqqb\u002FPriorLane)\n\n- (arXiv 2022.09) 我们能否从一个**2D**视觉**Transformer**出发解决**3D**视觉任务？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07026.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FSimple3D-Former.git)\n\n- (arXiv 2022.09) 探索对比式**语言-图像**预训练的视觉可解释性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07046.pdf)\n\n- (arXiv 2022.09) OmniVL：一个适用于**图像-语言**和**视频-语言**任务的基础模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07526.pdf)\n\n- (arXiv 2022.09) 使用**掩码自编码器**进行测试时训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07522.pdf)，[[代码]](https:\u002F\u002Fyossigandelsman.github.io\u002Fttt_mae\u002Findex.html)\n\n- (arXiv 2022.09) 基于深度最近**质心**的视觉**识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07383.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChengHan111\u002FDNC)\n\n- (arXiv 2022.09) **可供性**区域的一次性迁移？AffCorrs！[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07147.pdf)，[[代码]](https:\u002F\u002Fsites.google.com\u002Fview\u002Faffcorrs)\n\n- (arXiv 2022.09) 视觉-语言模型中用于零样本泛化的测试时**提示调优**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07511.pdf)，[[代码]](https:\u002F\u002Fazshue.github.io\u002FTPT\u002F)\n\n- (arXiv 2022.09) 训练**鲁棒**视觉Transformer的轻量级方案，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07399.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdedeswim\u002Fvits-robustness-torch)\n\n- (arXiv 2022.09) 
关于Transformer在低标注数据下的**视频识别**中出人意料的有效性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07474.pdf)\n\n- (arXiv 2022.09) 计算机视觉中**注意力头**数量与**Transformer编码器**数量的关系，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07221.pdf)\n\n- (arXiv 2022.09) 用于**零样本动作识别**的全局语义描述符，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12061.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvalterlej\u002Fobjsentzsar)\n\n- (arXiv 2022.09) 重访语言与视觉领域的神经网络**缩放定律**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06640.pdf)\n\n- (arXiv 2022.09) 小型Transformer可以计算通用**度量嵌入**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06788.pdf)\n\n- (arXiv 2022.09) **CLIP**-ViP：将预训练的图像-文本模型适配到**视频-语言**表征对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06430.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain\u002Ftree\u002Fmain\u002FCLIP-ViP)\n\n- (arXiv 2022.09) CRAFT：基于时空上下文融合Transformer的相机-雷达**3D目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06535.pdf)\n\n- (arXiv 2022.09) Transformers和CNNs在**SBIR**任务上均胜过人类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06629.pdf)\n\n- (arXiv 2022.09) PaLI：一个联合规模化的**多语言****语言-图像**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06794.pdf)\n\n- (arXiv 2022.09) MUST-VQA：多语言场景文本**VQA**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06730.pdf)，[[代码]](https:\u002F\u002Fwww.ethnologue.com\u002Fenterprise-faq\u002Fhow-many-languages-world-are-unwritten-0)\n\n- (arXiv 2022.09) 利用大型语言模型进行**机器人3D场景理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05629.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMIT-SPARK\u002Fllm_scene_understanding)\n\n- (arXiv 2022.09) 一种基于Transformer的轻量级模型用于**鱼类地标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05777.pdf)\n\n- (arXiv 2022.09) PSAQ-ViT V2：迈向准确且通用的无数据**量化**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05687.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzkkli\u002FPSAQ-ViT)\n\n- (arXiv 2022.09) ComplETR：利用视觉Transformer降低密集场景下目标**检测**的标注成本，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05654.pdf)\n\n- (arXiv 2022.09) Semantic2Graph：基于图的多模态特征用于**视频中的动作分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2209\u002F2209.05653.pdf)\n\n- (arXiv 2022.09) CenterFormer：基于中心点的Transformer用于**3D目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05588.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTuSimple\u002Fcenterformer)\n\n- (arXiv 2022.09) PreSTU：用于**场景文本**理解的预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05534.pdf)\n\n- (arXiv 2022.09) OmDet：具有大规模**视觉-语言**多数据集预训练的语言感知型目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05946.pdf)\n\n- (arXiv 2022.09) DMTNet：基于Transformer的动态多尺度网络，用于双像素图像的**散焦去模糊**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06040.pdf)\n\n- (arXiv 2022.09) SeRP：使用扰动后的**点云**进行自监督表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06067.pdf)\n\n- (arXiv 2022.09) VL-Taboo：分析**视觉-语言**模型的基于属性的零样本能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06103.pdf)\n\n- (arXiv 2022.09) Story**DALL-E**：将预训练的文本到图像Transformer适配用于**故事续写**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.06192.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fadymaharana\u002Fstorydalle)\n\n- (arXiv 2022.09) 关于**自注意力的计算复杂性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04881.pdf)\n\n- (arXiv 2022.09) 基于指令的历史感知策略用于**机器人操作**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04899.pdf)，[[代码]](https:\u002F\u002Fguhur.github.io\u002Fhiveformer\u002F)\n\n- (arXiv 2022.09) 
面向多语言**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05401.pdf)\n\n- (arXiv 2022.09) PERCEIVER-ACTOR：用于**机器人操作**的多任务Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05451.pdf)，[[项目]](https:\u002F\u002Fperact.github.io\u002F)\n\n- (arXiv 2022.09) 用于增量式**视频亮点检测**的全局原型编码，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05166.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FForeverPs\u002FGPE)\n\n- (arXiv 2022.09) 用于**被动活动识别**的自监督多模态融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03765.pdf)\n\n- (arXiv 2022.09) FETA：面向**专家任务**应用的**基础模型**专业化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03648.pdf)\n\n- (arXiv 2022.09) 自监督视觉Transformer中的先验知识引导**注意力**机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03745.pdf)\n\n- (arXiv 2022.09) 探索**掩码自编码器**的目标表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03917.pdf)\n\n- (arXiv 2022.09) ISS：以图像为桥梁实现**文本引导的3D形状生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04145.pdf)\n\n- (arXiv 2022.09) 面向机器人应用的置信度引导**形状补全**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04300.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fandrearosasco\u002Fhyperpcr)\n\n- (arXiv 2022.09) 针对**开放词汇**任务的**图像-语言**Transformer预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04372.pdf)\n\n- (arXiv 2022.09) 基于Token-Critic改进的掩码**图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.04439.pdf)\n\n- (arXiv 2022.09) “按我能做的做，而非照我说的做”：将**语言**落地（grounding）到**机器人**的可供性上，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01691.pdf)，[[代码]](https:\u002F\u002Fsay-can.github.io\u002F)\n\n- (arXiv 2022.09) Uformer-ICS：一种用于**图像压缩感知**的专用U型Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01763.pdf)\n\n- (arXiv 2022.09) 带有**掩码**视觉建模的端到端**视频-语言**Transformer的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01540.pdf)\n\n- (arXiv 2022.09) 用于**视频快照压缩成像**的时空Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01578.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fucaswangls\u002FSTFormer)\n\n- (arXiv 2022.09) MAFormer：一种具有多尺度**注意力**融合的Transformer网络，用于视觉识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01620.pdf)\n\n- (arXiv 2022.09) SEFormer：用于**3D目标检测**的结构嵌入Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01745.pdf)\n\n- (arXiv 2022.09) ADTR：带有特征重建的**异常检测**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01816.pdf)\n\n- (arXiv 2022.09) 使用局部线性变换学习无监督形状**对应关系**的规范嵌入，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02152.pdf)\n\n- (arXiv 2022.09) Transformer-CNN联合体：结合两种方法优势的**半监督语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02178.pdf)\n\n- (arXiv 2022.09) PTSEFormer：面向**视频目标检测**的渐进式时空增强Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02242.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHon-Wong\u002FPTSEFormer)\n\n- (arXiv 2022.09) VITKD：VIT特征**知识蒸馏**的实用指南，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02432.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyzd-v\u002Fcls_KD)\n\n- (arXiv 2022.09) DPIT：用于人体**姿态估计**的双流集成Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02431.pdf)\n\n- (arXiv 2022.09) SkeletonMAE：用于自监督**骨骼动作识别**的时空**掩码自编码器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02399.pdf)\n\n- (arXiv 2022.09) 鸭嘴兽长什么样？为**零样本图像分类**生成定制化提示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03320.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsarahpratt\u002FCuPL)\n\n- (arXiv 2022.09) 
AI插画师：基于**提示**的跨模态**生成**，将原始描述转化为**图像**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03160.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FAI_Illustrator)\n\n- (arXiv 2022.09) MimCo：使用对比教师进行**掩码**图像建模预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.03063.pdf)\n\n- (arXiv 2022.09) **多模态**对比表征学习用于**实体对齐**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00891.pdf)\n\n- (arXiv 2022.09) 零样本多模态**艺术家控制**的**3D**对象集检索与探索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00682.pdf)\n\n- (arXiv 2022.09) 几何对齐变分Transformer用于**图像条件下的布局生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00852.pdf)\n\n- (arXiv 2022.09) 基于Transformer的实时**3D**单个物体**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00860.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshanjiayao\u002FPTT)\n\n- (arXiv 2022.09) 视频引导的课程学习用于**口语视频grounding**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00277.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmarmot-xy\u002FSpoken-Video-Grounding)\n\n- (arXiv 2022.09) FLAME：基于自由形式语言的**运动合成**与**编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00349.pdf)\n\n- (arXiv 2022.09) TOKENCUT：利用自监督Transformer和归一化割来**分割**图像和视频中的物体，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00383.pdf)，[[代码]](https:\u002F\u002Fwww.m-psi.fr\u002FPapers\u002FTokenCut2022\u002F)\n\n- (arXiv 2022.09) 通过序列到序列翻译实现统一的完全监督和带时间戳的**时序动作分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00638.pdf)\n\n- (arXiv 2022.09) MAPLE：用于半监督**点云动作识别**的掩码伪标签自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00407.pdf)，[[项目]](http:\u002F\u002Fxiaodongchen.cn\u002FMAPLE\u002F)\n\n- (arXiv 2022.09) 通过图像修复进行**视觉提示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00647.pdf)，[[项目]](https:\u002F\u002Fyossigandelsman.github.io\u002Fvisual_prompt)\n\n- (arXiv 2022.09) RLIP：用于**人-物体交互**检测的关联**语言-图像**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.01814.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJacobYuan7\u002FRLIP)\n\n\n\n### 2022年8月\n\n- (arXiv 2022.08) 关于使用语言模型进行具身任务的接地规划，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00465.pdf)，[[项目]](https:\u002F\u002Finklab.usc.edu\u002FG-PlanET\u002F)\n\n- (arXiv 2022.08) 篮球追踪数据中的**群体活动识别**——团队运动中的神经嵌入（NETS），[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00451.pdf)\n\n- (arXiv 2022.08) SWIN-TRANSFORMER-YOLOV5用于实时**葡萄酒葡萄串检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14508.pdf)\n\n- (arXiv 2022.08) SIM-Trans：用于**细粒度视觉分类**的结构信息建模Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14607.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPKU-ICST-MIPL\u002FSIM-Trans_ACMMM2022)\n\n- (arXiv 2022.08) 将**图像细节**注入**CLIP**的特征空间，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14649.pdf)\n\n- (arXiv 2022.08) 用于**时序句子定位**的层次化局部-全局Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14882.pdf)\n\n- (arXiv 2022.08) EViT：基于加密视觉Transformer的云环境下**隐私保护**的**图像检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14657.pdf)\n\n- (arXiv 2022.08) TRUST：一种基于分割的Transformer实现的准确、端到端的**表格结构识别器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14687.pdf)\n\n- (arXiv 2022.08) ELMformer：一种高效的原始**图像修复**方法，基于局部乘性Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14704.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fleonmakise\u002FELMformer)\n\n- (arXiv 2022.08) SoMoFormer：基于Transformer的**多人姿态预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14023.pdf)\n\n- (arXiv 2022.08) 
基于循环窗口的级联Transformer用于**在线动作检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14209.pdf)\n\n- (arXiv 2022.08) ASpanFormer：无检测器的自适应跨度Transformer用于**图像匹配**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14201.pdf)\n\n- (arXiv 2022.08) 鲁棒的声音引导**图像操纵**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14114.pdf)\n\n- (arXiv 2022.08) TrojViT：视觉Transformer中的**木马植入**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13049.pdf)\n\n- (arXiv 2022.08) 用户可控的潜在空间Transformer用于StyleGAN的**图像布局编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12408.pdf)\n\n- (arXiv 2022.08) 少样本学习与Transformer结合：用于**少样本分类**的统一查询-支持Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12398.pdf)\n\n- (arXiv 2022.08) JARVIS：面向对话式**具身智能体**的神经符号常识推理框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13266.pdf)\n\n- (arXiv 2022.08) TFusion：基于Transformer的N对1**多模态**融合模块，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12776.pdf)\n\n- (arXiv 2022.08) VMFormer：基于Transformer的端到端**视频抠图**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12801.pdf)，[[代码]](https:\u002F\u002Fchrisjuniorli.github.io\u002Fproject\u002FVMFormer\u002F)\n\n- (arXiv 2022.08) LOGICRANK：逻辑诱导重排序用于生成式**文本到图像**系统，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13518.pdf)\n\n- (arXiv 2022.08) CLUSTR：通过聚类探索视觉Transformer的**高效自注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13138.pdf)\n\n- (arXiv 2022.08) 基于中层语义知识迁移的**联邦零样本学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13465.pdf)\n\n- (arXiv 2022.08) 基于软上下文共享的**提示调优**用于**视觉-语言**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13474.pdf)\n\n- (arXiv 2022.08) 基于视觉概念和层次对齐的高效**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13628.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmshukor\u002FViCHA)\n\n- (arXiv 2022.08) CounTR：基于Transformer的通用视觉**计数**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13721.pdf)，[[代码]](https:\u002F\u002Fverg-avesta.github.io\u002FCounTR_Webpage\u002F)\n\n- (arXiv 2022.08) **开放集**半监督目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.13722.pdf)\n\n- (arXiv 2022.08) g**Swin**：具有移位窗口层次结构的门控**MLP**视觉模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11718.pdf)\n\n- (arXiv 2022.08) 用于**时序动作定位**的自适应感知Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11908.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSouperO\u002FAdaPerFormer)\n\n- (arXiv 2022.08) 符号回放：以**场景图**为提示的持续学习应用于**VQA**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12037.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FCLVQA)\n\n- (arXiv 2022.08) **掩码**自编码器助力高效**知识蒸馏**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12256.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FDMAE)\n\n- (arXiv 2022.08) LaTe**RF**：基于标签和**文本**的对象辐射场，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01583.pdf)\n\n- (arXiv 2022.08) Video Mobile-Former：采用**高效**全局时空建模的**视频识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12257.pdf)\n\n- (arXiv 2022.08) Pix4Point：用于3D**点云理解**的图像预训练Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12259.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fguochengqian\u002FPix4Point)\n\n- (arXiv 2022.08) Mask**CLIP**：**掩码**自蒸馏推进对比语言-图像预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.12262.pdf)\n\n- (arXiv 2022.08) 视觉字幕特征增强的**视频概要生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11307.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAopolin-Lv\u002FVSENet)\n\n- (arXiv 2022.08) 
CATS：互补的**CNN**和Transformer编码器用于**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11572.pdf)\n\n- (arXiv 2022.08) 用于多模态摘要的段落级**视觉-语言**语义对齐建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11303.pdf)\n\n- (arXiv 2022.08) Fashion**VQA**：一个领域特定的视觉问答系统，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11253.pdf)\n\n- (arXiv 2022.08) K阶图导向Transformer结合GRAATTENTION用于**3D姿态和形状估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11328.pdf)\n\n- (arXiv 2022.08) 朝着在基于Transformer的目标**检测器**中**高效**利用多尺度特征的方向发展，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11356.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FIMFA)\n\n- (arXiv 2022.08) 利用**多语言**知识迁移改进**视频检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11553.pdf)\n\n- (arXiv 2022.08) **高效**稀疏激活Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.14580.pdf)\n\n- (arXiv 2022.08) M2HF：用于**文本-视频检索**的多层次多模态混合融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07664.pdf)\n\n- (arXiv 2022.08) 通过补丁采样调度加速视觉Transformer训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09520.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002FBradMcDanel\u002Fpss)\n\n- (arXiv 2022.08) 一种双模态方法用于（零样本）**多标签分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09562.pdf)\n\n- (arXiv 2022.08) 利用对抗学习和Transformer进行离线**手写数学识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09662.pdf)\n\n- (arXiv 2022.08) 语义增强的**图像聚类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09849.pdf)\n\n- (arXiv 2022.08) DPTNet：用于**场景文本检测**的双路径Transformer架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09878.pdf)\n\n- (arXiv 2022.08) ProtoPFormer：聚焦于视觉Transformer中的原型部件以实现**可解释图像识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.10431.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzju-vipa\u002FProtoPFormer)\n\n- (arXiv 2022.08) 图像即外语：面向所有视觉及**视觉-语言**任务的BEIT预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.10442.pdf)，[[项目]](https:\u002F\u002Faka.ms\u002Fbeit-3)\n\n- (arXiv 2022.08) PoseBERT：用于**时序3D人体建模**的通用Transformer模块，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.10211.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fposebert)\n\n- (arXiv 2022.08) **高效**无注意力机制的**视频**移位Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11108.pdf)\n\n- (arXiv 2022.08) 用于**命名实体识别**的扁平化多模态交互Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.11039.pdf)\n\n- (arXiv 2022.08) 基于跨模态Transformer的**舞蹈风格迁移**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09406.pdf)\n\n- (arXiv 2022.08) 通过Token融合提升**图像分类**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2208\u002F2208.09183.pdf)\n\n- (arXiv 2022.08) VAuLT：通过深层语言表示的传播增强**视觉-语言**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09021.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgchochla\u002FVAuLT)\n\n- (arXiv 2022.08) **文本到图像生成**：不让任何语言被落下，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09333.pdf)\n\n- (arXiv 2022.08) 基于序列式跨模态语义图的方面级**情感分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09417.pdf)\n\n- (arXiv 2022.08) 通过自适应时空注意力实现多样化的**视频字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09266.pdf)\n\n- (arXiv 2022.08) VL**MAE**：**视觉-语言**掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09374.pdf)\n\n- (arXiv 2022.08) SoMoFormer：用于**多人运动预测**的社会感知运动Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.09224.pdf)\n\n- (arXiv 2022.08) ILLUME：通过与其“闲聊”来解释**视觉-语言**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08241.pdf)\n\n- 
(arXiv 2022.08) ViT-ReT：结合视觉与循环Transformer的神经网络，用于视频中的人类**活动识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07929.pdf)\n\n- (arXiv 2022.08) UniLayout：驯服统一的序列到序列Transformer以用于**图形布局生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08037.pdf)\n\n- (arXiv 2022.08) InterTrack：用于**3D多目标跟踪**的交互Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08041.pdf)\n\n- (arXiv 2022.08) 理解**视觉-语言**任务中的**注意力**机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08104.pdf)\n\n- (arXiv 2022.08) 基于提示微调实现**开放词汇场景图生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08165.pdf)\n\n- (arXiv 2022.08) 针对**视觉-语言**预训练模型的类别感知视觉提示调优，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08340.pdf)\n\n- (arXiv 2022.08) 通过可分散的**点**学习统一视觉**感知**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08630.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniHead)\n\n- (arXiv 2022.08) 基于隐式视觉引导和超网络的**文本到图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08493.pdf)\n\n- (arXiv 2022.08) ConMatch：基于置信度引导的一致性正则化的**半监督学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08631.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJiwonCocoder\u002FConMatch)\n\n- (arXiv 2022.08) 八点算法作为ViTs进行**相对位姿预测**的归纳偏置，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08988.pdf)\n\n- (arXiv 2022.08) 使用Mask**CLIP**实现开放词汇的**全景分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08984.pdf)\n\n- (arXiv 2022.08) 用于**领域泛化**的提示视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08914.pdf)\n\n- (arXiv 2022.08) GSRFormer：具有交替语义注意力精炼的**接地情境识别**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08965.pdf)\n\n- (arXiv 2022.08) CONVIFORMERS：**卷积引导**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08900.pdf)\n\n- (arXiv 2022.08) 学习空间-频率Transformer用于视觉对象**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.08829.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTchuanm\u002FSFTransT.git)\n\n- (arXiv 2022.08) 具有双层特征恢复的**高效**多模态Transformer，用于鲁棒的多模态情感分析，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07589.pdf)\n\n- (arXiv 2022.08) 您的ViT其实是一个混合型的**判别-生成****扩散**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07791.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsndnyang\u002FDiffusion_ViT)\n\n- (arXiv 2022.08) LLM.int8()：面向大规模Transformer的8位**矩阵乘法**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07339.pdf)\n\n- (arXiv 2022.08) ExpansionNet v2：在快速端到端训练中通过块静态扩展实现图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06551.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjchenghu\u002FExpansionNet_v2)\n\n- (arXiv 2022.08) 用于自动驾驶车辆的多模态Transformer**路径预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.07256.pdf)\n\n- (arXiv 2022.08) 流场引导的Transformer用于**视频修复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06768.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhitachinsk\u002FFGT)\n\n- (arXiv 2022.08) TL;DW？利用任务相关性和跨模态显著性对**教学视频**进行摘要，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06773.pdf)，[[项目]](https:\u002F\u002Fmedhini.github.io\u002Fivsum\u002F)\n\n- (arXiv 2022.08) HoW-3D：从单张图像中实现整体性的**3D线框感知**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06999.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FWenchao-M\u002FHoW-3D)\n\n- (arXiv 2022.08) **BEIT V2**：采用向量量化视觉标记的**掩码图像建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06366.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n- (arXiv 2022.08) 
MILAN：基于语言辅助表示的**掩码图像预训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06049.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzejiangh\u002FMILAN)\n\n- (arXiv 2022.08) 用于**深度伪造检测**的混合Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05820.pdf)\n\n- (arXiv 2022.08) 大规模下的**半监督**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05688.pdf)\n\n- (arXiv 2022.08) PPMN：用于单阶段**全景叙事定位**的像素-短语匹配网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05647.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdzh19990407\u002FPPMN)\n\n- (arXiv 2022.08) 探索基于锚点的检测在 **Ego4D** 自然语言查询中的应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05375.pdf)\n\n- (arXiv 2022.08) 基于骨架的动作识别的语言监督训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05318.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMartinXM\u002FLST)\n\n- (arXiv 2022.08) 探索使用Transformer进行3D点云目标跟踪时的点云与BEV融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05216.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJasonkks\u002FPTTR)\n\n- (arXiv 2022.08) 基于上下文感知Transformer的无鬼影 **高动态范围成像**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.05114.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmegvii-research\u002FHDR-Transformer)\n\n- (arXiv 2022.08) 基于 **CLIP** 的神经邻域 **风格迁移** 用于 **3D资产**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04370.pdf)\n\n- (arXiv 2022.08) 大规模数据上的 **体育视频分析**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04897.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjackwu502\u002FNSVA)\n\n- (arXiv 2022.08) 视觉Transformer (VTs) 在非自然图像领域的迁移效果如何？一项涉及 **艺术分类** 的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04693.pdf)\n\n- (arXiv 2022.08) Transformer之眼：全局-局部相关性在 **第一人称视角注视估计** 中的应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04464.pdf)，[[代码]](https:\u002F\u002Fbolinlai.github.io\u002FGLC-EgoGazeEst)\n\n- (arXiv 2022.08) **DALLE**-URBAN：捕捉大型 **文本到图像** 变换器的城市设计专长，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04139.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsachith500\u002FDALLEURBAN)\n\n- (arXiv 2022.08) PlaneFormers：从稀疏视图平面到 **3D重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04307.pdf)，[[代码]](https:\u002F\u002Fsamiragarwala.github.io\u002FPlaneFormers)\n\n- (arXiv 2022.08) 通过显式高层语义提升 **视频-文本检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04215.pdf)\n\n- (arXiv 2022.08) 基于 **CLIP** 引导的群体优化实现独特的图像 **字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04254.pdf)\n\n- (arXiv 2022.08) 通过学习遮挡不变特征理解 **掩码图像建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04164.pdf)\n\n- (arXiv 2022.08) GRIT-VLP：用于高效 **视觉与语言** 预训练的分组小批量采样，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.04060.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjaeseokbyun\u002FGRIT-VLP)\n\n- (arXiv 2022.08) 推进普通视觉Transformer向 **遥感** 基础模型发展，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03987.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FRemote-Sensing-RVSA)\n\n- (arXiv 2022.08) 域随机化增强的深度模拟与恢复技术，用于感知和 **抓取** 光泽及透明物体，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03792.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPKU-EPIC\u002FDREDS)\n\n- (arXiv 2022.08) Jointformer：具有误差预测与精修功能的单帧提升Transformer，用于 **3D人体姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03704.pdf)\n\n- (arXiv 2022.08) 冻结的 **CLIP** 模型是高效的 **视频** 学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03550.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002Fefficient-video-recognition)\n\n- (arXiv 2022.08) MonoViT：基于视觉Transformer的自监督 
**单目深度估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03543.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzxcqlf\u002FMonoViT)\n\n- (arXiv 2022.08) HaloAE：基于HaloNet的局部Transformer自编码器，用于 **异常检测** 和 **定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03486.pdf)，[[代码]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FHaloAE-E27B\u002FREADME.md)\n\n- (arXiv 2022.08) IVT：一种端到端的实例引导型视频Transformer，用于 **3D姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03431.pdf)\n\n- (arXiv 2022.08) 一草千言：结合 **文本** 和 **草图** 的 **图像检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03354.pdf)，[[代码]](https:\u002F\u002Fjanesjanes.github.io\u002Ftsbir\u002F)\n\n- (arXiv 2022.08) PointConvFormer： **基于点的卷积** 的反击，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02879.pdf)\n\n- (arXiv 2022.08) ChiQA：一个大规模的基于图像的真实世界 **问答数据集**，用于多模态理解，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03030.pdf)\n\n- (arXiv 2022.08) LaTTe： **语言** **轨迹** 转换器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02918.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Farthurfenderbucker\u002FNL_trajectory_reshaper)\n\n- (arXiv 2022.08) 学习时空频率Transformer用于 **压缩视频超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03012.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FFTVSR)\n\n- (arXiv 2022.08) TransMatting：利用Transformer提升 **透明物体抠图** 效果，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03007.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002FAceCHQ\u002FTransMatting)\n\n- (arXiv 2022.08) 字级细粒度 **故事可视化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02341.pdf)\n\n- (arXiv 2022.08) 细粒度语义对齐的 **视觉-语言** 预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02515.pdf)\n\n- (arXiv 2022.08) 扩展 **语言-图像** 预训练模型以用于通用 **视频识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02816.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVideoX\u002Ftree\u002Fmaster\u002FX-CLIP)\n\n- (arXiv 2022.08) P2P：通过点到像素提示微调预训练图像模型以用于 **点云分析**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02812.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwangzy22\u002FP2P)\n\n- (arXiv 2022.08) **Drop**Key，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02646.pdf)\n\n- (arXiv 2022.08) MVSFormer：结合预训练视觉Transformer和温度引导的深度进行 **多视图立体匹配**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02541.pdf)\n\n- (arXiv 2022.08) 基于CLIP的视频对象 **分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01924.pdf)\n\n- (arXiv 2022.08) XCon：通过专家学习进行 **细粒度类别发现**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01898.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYiXXin\u002FXCon)\n\n- (arXiv 2022.08) 结合CNN和Transformer的编码器，用于提升细粒度的人体 **动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01897.pdf)\n\n- (arXiv 2022.08) 用于弱监督 **目标定位** 的重注意力Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01838.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsu-hui-zz\u002FReAttentionTransformer)\n\n- (arXiv 2022.08) TAG：通过文本感知的视觉问答生成提升文本- **VQA** 性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01813.pdf)\n\n- (arXiv 2022.08) 用于 **长视频理解** 的双流Transformer架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01753.pdf)\n\n- (arXiv 2022.08) 一种快速的 **文本驱动** 方法用于 **生成艺术内容**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01748.pdf)\n\n- (arXiv 2022.08) DAHITRA：基于新型层次化Transformer架构的**损伤评估**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02205.pdf)\n\n- (arXiv 2022.08) 
MinVIS：一种无需视频训练的极简**视频实例分割**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02245.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FMinVIS)\n\n- (arXiv 2022.08) 针对多模态表征学习的**掩码**视觉与语言建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02131.pdf)\n\n- (arXiv 2022.08) SSformer：用于语义分割的**轻量级**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02034.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshiwt03\u002FSSformer)\n\n- (arXiv 2022.08) 基于时空图Transformer的**姿态**不确定性感知**动作同步性估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01161.pdf)\n\n- (arXiv 2022.08) 取长补短：面向领域的Transformer用于**无监督域适应**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01195.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBIT-DA\u002FDomain-Oriented-Transformer)\n\n- (arXiv 2022.08) 用于**加速**和**稳定**Transformer的统一归一化方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01313.pdf)\n\n- (arXiv 2022.08) 一张图胜过千言万语：利用文本反演个性化**文本到图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01618.pdf)，[[项目]](https:\u002F\u002Ftextual-inversion.github.io\u002F)\n\n- (arXiv 2022.08) 基于交叉注意力控制的提示到**提示**的**图像编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01626.pdf)\n\n- (arXiv 2022.08) 动量Transformer：缩小自注意力与其**线性化**之间的性能差距，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00579.pdf)\n\n- (arXiv 2022.08) 测试**文本引导图像生成**中的关系理解能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00005.pdf)\n\n- (arXiv 2022.08) UAVM：一种用于**视听**学习的统一模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00061.pdf)\n\n- (arXiv 2022.08) Meta-**DETR**：通过利用类间相关性实现图像级别的**少样本**目标检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00219.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FMeta-DETR)\n\n- (arXiv 2022.08) 面向长期**4D点云视频理解**的点基元Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00281.pdf)\n\n- (arXiv 2022.08) 一招通吃：基于动态推理的单阶段**指代表达理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00361.pdf)\n\n- (arXiv 2022.08) 走向理解WordArt：用于**场景文字识别**的角点引导Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00438.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxdxie\u002FWordArt)\n\n- (arXiv 2022.08) SdAE：自蒸馏的**掩码自动编码器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00449.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAbrahamYabo\u002FSdAE)\n\n- (arXiv 2022.08) 通过学习结合视觉语义的码本来增强**视觉语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00475.pdf)\n\n- (arXiv 2022.08) STrajNet：基于多模态Swin Transformer的**占用流预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00394.pdf)\n\n- (arXiv 2022.08) D^3Former：用于增量学习的去偏双蒸馏Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00777.pdf)，[[代码]](https:\u002F\u002Ftinyurl.com\u002Fd3former)\n\n- (arXiv 2022.08) 面向**空中跟踪**的局部感知Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00662.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvision4robotics\u002FLPAT)\n\n- (arXiv 2022.08) SIAMIXFORMER：用于从双时相**遥感**图像中进行建筑物检测和变化检测的暹罗Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00657.pdf)\n\n- (arXiv 2022.08) 将Transformer作为元学习器用于**隐式神经表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.02801.pdf)，[[代码]](https:\u002F\u002Fyinboc.github.io\u002Ftrans-inr\u002F)\n\n- (arXiv 2022.08) 基于迭代式视频-文本联合分词的**视频问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00934.pdf)，[[代码]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fvideoqa-cotokenization)\n\n- (arXiv 2022.08) 
通过柯西问题理解视觉Transformer的对抗**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.00906.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTrustAI\u002FODE4RobustViT)\n\n\n\n### 2022年7月\n\n- (arXiv 2022.07) Pro-tuning：面向视觉任务的统一**提示调优**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14381.pdf)\n\n- (arXiv 2022.07) ALADIN：蒸馏细粒度对齐分数以实现高效的**图像-文本**匹配与检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14757.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmesnico\u002FALADIN)\n\n- (arXiv 2022.07) 面向数据高效**视觉-语言**对齐的课程学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14525.pdf)\n\n- (arXiv 2022.07) DnSwin：通过连续小波滑动Transformer实现真实世界的**去噪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13861.pdf)\n\n- (arXiv 2022.07) 基于解耦模态的交叉注意力，利用Transformer进行**3D人体网格恢复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13820.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpostech-ami\u002FFastMETRO)\n\n- (arXiv 2022.07) AvatarPoser：从稀疏运动传感中实现关节式的**全身姿态跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13784.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002Feth-siplab\u002FAvatarPoser)\n\n- (arXiv 2022.07) 语义对齐匹配用于提升**DETR**收敛性和多尺度特征融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14172.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FSAM-DETR)\n\n- (arXiv 2022.07) 使用可解释的传感器融合Transformer增强**自动驾驶**安全性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14024.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FInterFuser)\n\n- (arXiv 2022.07) 视频掩码精炼器用于高质量的**视频实例分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14012.pdf)，[[项目]](http:\u002F\u002Fvis.xyz\u002Fpub\u002Fvmt)\n\n- (arXiv 2022.07) 一种适用于Transformer的**变分自编码器**，采用非参数化的变分信息瓶颈，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13529.pdf)\n\n- (arXiv 2022.07) 基于对比学习的视觉Transformer实现在线**持续学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13516.pdf)\n\n- (arXiv 2022.07) 用于图像**字幕生成**的检索增强Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13162.pdf)\n\n- (arXiv 2022.07) 基于时间片移位的时空自注意力建模用于**动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13259.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMartinXM\u002FTPS)\n\n- (arXiv 2022.07) 注意力就是**NeRF**所需的一切吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13298.pdf)，[[代码]](https:\u002F\u002Fvita-group.github.io\u002FGNT\u002F)\n\n- (arXiv 2022.07) **卷积嵌入**使层次化视觉Transformer更强大，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13317.pdf)\n\n- (arXiv 2022.07) SiRi：一种用于基于Transformer的**视觉定位**的简单选择性微调机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13325.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fqumengxue\u002Fsiri-vg.git)\n\n- (arXiv 2022.07) 基于**自监督**预训练特征的深度**聚类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13364.pdf)\n\n- (arXiv 2022.07) 对比型**掩码自编码器**是更强大的视觉学习模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13532.pdf)\n\n- (arXiv 2022.07) VICTOR：利用Transformer和时尚领域特定的对比预训练进行视觉**不兼容性检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13458.pdf)\n\n- (arXiv 2022.07) 基于语义控制的组合式**人-场景交互合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12824.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzkf1997\u002FCOINS)\n\n- (arXiv 2022.07) 用于**自监督**视频表征学习的静态与动态概念，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12795.pdf)\n\n- (arXiv 2022.07) 视频Transformer在**动作识别**中的无监督域适应，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12842.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvturrisi\u002FUDAVT)\n\n- (arXiv 2022.07) 
LaKo：通过后期知识到文本注入实现的知识驱动型**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12888.pdf)\n\n- (arXiv 2022.07) TransFiner：一种针对**多目标跟踪**的全尺度精炼方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12967.pdf)\n\n- (arXiv 2022.07) 使用预训练Transformer进行S-Prompts学习：面向**领域增量学习**的奥卡姆剃刀原则，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12819.pdf)\n\n- (arXiv 2022.07) WinoGAViL：一款旨在挑战**视觉-语言**模型的趣味化关联**基准测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12576.pdf)，[[项目]](https:\u002F\u002Fwinogavil.github.io\u002F)\n\n- (arXiv 2022.07) 面向事件级**视觉问答**的跨模态因果关系推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12647.pdf)\n\n- (arXiv 2022.07) 基于图神经网络和时空Transformer注意力的点云三维视频**目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12659.pdf)\n\n- (arXiv 2022.07) 从模态共享的对比型**语言-图像**预训练中学习视觉表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12661.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHxyou\u002FMSCLIP)\n\n- (arXiv 2022.07) V^2L：将视觉和**视觉-语言**模型应用于大规模**商品检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12994.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FWangWenhao0716\u002FV2L)\n\n- (arXiv 2022.07) NewsStories：用**视觉**摘要来配图说明**文章**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13061.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002FNewsStoriesData\u002Fnewsstories.github.io)\n\n- (arXiv 2022.07) 具有混合**匹配**策略的**DETR**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13080.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHDETR)\n\n- (arXiv 2022.07) GROUP **DETR**：采用解耦的一对多标签分配实现**快速**训练收敛，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.13085.pdf)\n\n- (arXiv 2022.07) 利用CNN和Vision Transformer改进MRI图像的**超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11748.pdf)\n\n- (arXiv 2022.07) Video Swin Transformer在2022年Ego4D挑战赛中用于**第一人称视频**理解，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2207\u002F2207.11329.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBCV-Uniandes\u002FPNR_OSCC)\n\n- (arXiv 2022.07) 对CNN与Vision Transformer在**鲁棒性**方面较量的公正看法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11347.pdf)\n\n- (arXiv 2022.07) **生成式**工匠：一款语义感知且可控的**CLIP**风格化工具，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11598.pdf)\n\n- (arXiv 2022.07) MAR：用于高效**动作识别**的掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11660.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Falibaba-mmai-research\u002FMasked-Action-Recognition)\n\n- (arXiv 2022.07) 基于分割时空注意力机制的**第一人称视频**中**物体状态变化分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11814.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FObjectStateChange)\n\n- (arXiv 2022.07) 每个领域背后都存在分布偏移：为**全景语义分割**适配畸变感知的Vision Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11860.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FTrans4PASS)\n\n- (arXiv 2022.07) 基于参考图像的变形注意力Transformer**超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11938.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcaojiezhang\u002FDATSR)\n\n- (arXiv 2022.07) JIGSAW-VIT：在Vision Transformer中学习**拼图游戏**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11971.pdf)，[[代码]](https:\u002F\u002Fyingyichen-cyy.github.io\u002FJigsaw-ViT)\n\n- (arXiv 2022.07) TransCL：Transformer实现强大而灵活的**压缩感知学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11972.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMC-E\u002FTransCL\u002F)\n\n- (arXiv 2022.07) 
用于单目标**点云跟踪**的3D暹罗Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11995.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffpthink\u002FSTNet)\n\n- (arXiv 2022.07) 基于意图的长期人类**第一人称动作预测**@EGO4D挑战赛2022，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12080.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FEvm7\u002Fego4dlta-icvae)\n\n- (arXiv 2022.07) IGFormer：用于基于**骨骼**的动作的**人体交互识别**的交互图Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12100.pdf)\n\n- (arXiv 2022.07) 在文化遗产领域的**视觉问答**中，是否只需要**GPT-3**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12101.pdf)\n\n- (arXiv 2022.07) 将时空注意力应用于Vision Transformer以**识别分心驾驶**和**疲劳驾驶**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12148.pdf)\n\n- (arXiv 2022.07) 使用Transformer进行**动作质量评估**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12318.pdf)\n\n- (arXiv 2022.07) 自蒸馏Vision Transformer用于**领域泛化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12392.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmaryam089\u002FSDViT)\n\n- (arXiv 2022.07) 探索使用**CLIP**来**评估****图像**的外观和感觉，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.12396.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIceClear\u002FCLIP-IQA)\n\n- (arXiv 2022.07) 带有隐式边的Transformer用于基于粒子的**物理模拟**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10860.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fftbabi\u002FTIE_ECCV2022.git)\n\n- (arXiv 2022.07) 带有集成量化功能的自回归**图像合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10776.pdf)\n\n- (arXiv 2022.07) 针对图像**字幕生成**的有效未来上下文建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10897.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffeizc\u002FFuture-Caption)\n\n- (arXiv 2022.07) 基于演化伪标记的零样本视频**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11100.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYoadTew\u002Fzero-shot-video-to-text)\n\n- (arXiv 2022.07) 全景场景图**生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11247.pdf)，[[项目]](https:\u002F\u002Fpsgdataset.org\u002F)，[[代码]](https:\u002F\u002Fgithub.com\u002FJingkang50\u002FOpenPSG)\n\n- (arXiv 2022.07) 使用带有MAE预训练的Vanilla ViT主干进行**面部表情识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11081.pdf)\n\n- (arXiv 2022.07) 面向**视觉-语言导航**的目标驱动结构化Transformer规划器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11201.pdf)\n\n- (arXiv 2022.07) **规模定律**与模型架构：归纳偏置如何影响扩展规律？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10551.pdf)\n\n- (arXiv 2022.07) 用于ABAW4挑战赛中**面部情感识别**的混合CNN-Transformer模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10201.pdf)\n\n- (arXiv 2022.07) Mesh**MAE**：用于3D**网格**数据分析的掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10228.pdf)\n\n- (arXiv 2022.07) SeedFormer：基于补丁种子的上采样Transformer点云补全，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10315.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhrzhou2\u002Fseedformer)\n\n- (arXiv 2022.07) LocVTP：用于时序定位的**视频-文本**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10362.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmengcaopku\u002FLocVTP)\n\n- (arXiv 2022.07) 用于**高效视频识别**的时序显著性查询网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10379.pdf)，[[代码]](https:\u002F\u002Flawrencexia2008.github.io\u002Fprojects\u002Ftsqnet)\n\n- (arXiv 2022.07) 姿态一切：迈向类别无关的**姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10387.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fluminxu\u002FPose-for-Everything)\n\n- (arXiv 2022.07) 
基于隐式空间校准的Transformer弱监督**目标定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10447.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002F164140757\u002FSCM)\n\n- (arXiv 2022.07) 一种用于**动作检测**的高效**时空**金字塔Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10448.pdf)\n\n- (arXiv 2022.07) 向视觉Transformer上的**高效对抗训练**迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10498.pdf)\n\n- (arXiv 2022.07) TinyViT：针对**小型**视觉Transformer的快速预训练蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10666.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FCream\u002Ftree\u002Fmain\u002FTinyViT)\n\n- (arXiv 2022.07) 用于人体**骨骼表示**学习的层次自监督Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09644.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyuxiaochen1103\u002FHi-TRS)\n\n- (arXiv 2022.07) 显式图像**字幕编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09625.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbaaaad\u002FECE)\n\n- (arXiv 2022.07) AiATrack：用于Transformer视觉**跟踪**的注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09603.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLittle-Podi\u002FAiATrack)\n\n- (arXiv 2022.07) Tip-Adapter：无需训练即可适配**CLIP**以进行**少样本分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09519.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgaopengcuhk\u002FTip-Adapter)\n\n- (arXiv 2022.07) 单帧**大气湍流抑制**：基准研究及一种受物理启发的新Transformer模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10040.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FTurbNet)\n\n- (arXiv 2022.07) HTNet：基于层次Transformer的无锚点**时序动作定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09662.pdf)\n\n- (arXiv 2022.07) GRIT：使用双视觉特征的更快更好的图像**字幕生成**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09666.pdf)\n\n- (arXiv 2022.07) OTPose：适用于稀疏标注视频的遮挡感知Transformer**姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09725.pdf)\n\n- (arXiv 2022.07) FaceFormer：具有尺度感知能力的盲人**人脸修复**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09790.pdf)\n\n- (arXiv 2022.07) 用于**自动3D标注**和物体**检测**的多模态Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09805.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FCliu2\u002FMTrans)\n\n- (arXiv 2022.07) 用于**音视频****零样本**学习的时序与跨模态注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09966.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FExplainableML\u002FTCAF-GZSL)\n\n- (arXiv 2022.07) 局部性指导用于提升视觉Transformer在**小型数据集**上的性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10026.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flkhl\u002Ftiny-transformers)\n\n- (arXiv 2022.07) 以物体为中心的视频表示是否有利于迁移学习？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10075.pdf)\n\n- (arXiv 2022.07) DUQIM-Net：用于多视角**操作**的概率性物体层级表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09105.pdf)\n\n- (arXiv 2022.07) 用于解释日常任务中可能发生的碰撞的**关系型未来**字幕生成模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09083.pdf)\n\n- (arXiv 2022.07) 条件式**DETR** V2：带边界框查询的**高效**检测Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08914.pdf)\n\n- (arXiv 2022.07) 利用未标注数据结合**视觉与语言**模型进行物体**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08954.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxiaofeng94\u002FVL-PLM)\n\n- (arXiv 2022.07) TTVFI：学习轨迹感知Transformer用于**视频帧插值**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09048.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FTTVFI.git)\n\n- (arXiv 2022.07) 
时间即关键：用于视频Transformer的**时序自监督**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09067.pdf)\n\n- (arXiv 2022.07) IDET：用于**高质量变化检测**的迭代差异增强Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09240.pdf)\n\n- (arXiv 2022.07) 不要停止学习：迈向**CLIP**模型的**持续学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09248.pdf)\n\n- (arXiv 2022.07) 带有时序解析Transformer的**动作质量评估**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09270.pdf)\n\n- (arXiv 2022.07) 基于Transformer的视觉**表征**学习：序列到序列的视角，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09339.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FSETR)\n\n- (arXiv 2022.07) 基于结构先验引导的生成对抗Transformer用于**低光照图像增强**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07828.pdf)\n\n- (arXiv 2022.07) TS2-Net：用于**文本-视频检索**的令牌转移与选择Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07852.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyuqi657\u002Fts2_net)\n\n- (arXiv 2022.07) Clover：迈向统一的**视频-语言**对齐与融合模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07885.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLeeYN-43\u002FClover)\n\n- (arXiv 2022.07) SatMAE：用于时序与多光谱**卫星影像**的Transformer预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08051.pdf)\n\n- (arXiv 2022.07) FashionViL：面向时尚领域的**视觉-语言**表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08150.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBrandonHanx\u002Fmmf)\n\n- (arXiv 2022.07) 基于视觉-语言提示的零样本**时序动作检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08184.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsauradip\u002FSTALE)\n\n- (arXiv 2022.07) 重新思考**视频超分辨率**Transformer中的对齐机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08494.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FXPixelGroup\u002FRethinkVSRAlignment)\n\n- (arXiv 2022.07) 通过对比与聚类视觉-语言嵌入实现开放世界**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08455.pdf)\n\n- (arXiv 2022.07) TokenMix：重新思考视觉Transformer中用于数据**增强**的图像混合方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08409.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FTokenMix)\n\n- (arXiv 2022.07) 探索人类全局上下文：**视觉-语言**模型真的能像人类一样做出判断吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08333.pdf)\n\n- (arXiv 2022.07) Defect Transformer：一种用于**表面缺陷检测**的高效混合Transformer架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08319.pdf)\n\n- (arXiv 2022.07) 基于关系推理的语义**新奇性检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08699.pdf)\n\n- (arXiv 2022.07) 通过预训练将**事件检测**和**字幕生成**统一为序列生成任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08625.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FQiQAng\u002FUEDVC)\n\n- (arXiv 2022.07) 视觉Transformer中的多流形**注意力**机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08569.pdf)\n\n- (arXiv 2022.07) UniFormer：用于**鸟瞰图**中时空表征的统一多视角融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08536.pdf)\n\n- (arXiv 2022.07) 将**位置预测**作为有效的预训练策略，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07611.pdf)\n\n- (arXiv 2022.07) 具有交叉特征注意力的**轻量级**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07268.pdf)\n\n- (arXiv 2022.07) 使用相对位置编码参数化视觉**MLP**中的**跨令牌关系**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07284.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZhicaiwww\u002FPosMLP)\n\n- (arXiv 2022.07) X-CLIP：用于**视频-文本检索**的端到端多粒度对比学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07285.pdf)\n\n- (arXiv 2022.07) 学习视差Transformer网络以去除**立体图像JPEG伪影**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07335.pdf)\n\n- 
(arXiv 2022.07) 一种双掩码自编码器，用于结合时空骨骼令牌补全实现**鲁棒运动捕捉**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07381.pdf)\n\n- (arXiv 2022.07) 一句**字幕**是否胜过千张**图片**？一项针对**表征**学习的对照研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07635.pdf)\n\n- (arXiv 2022.07) 基于预训练视觉和语言模型的多模态**开放词汇视频分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07646.pdf)\n\n- (arXiv 2022.07) 用于**视频插帧**的交叉注意力Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04132.pdf)\n\n- (arXiv 2022.07) 向能够生成非通用文本的多模态**视觉-语言**模型迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04174.pdf)\n\n- (arXiv 2022.07) QKVA网格：图像视角下的**注意力**与堆叠式DETR，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04313.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshengwenyuan\u002Fsdetr)\n\n- (arXiv 2022.07) Snipper：一种时空Transformer，可在视频片段上同时进行多人**3D姿态估计跟踪**和**预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04320.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJimmyZou\u002FSnipper)\n\n- (arXiv 2022.07) Transformer中的水平与垂直**注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04399.pdf)\n\n- (arXiv 2022.07) CoMER：基于Transformer的**手写数学表达式识别**中的覆盖建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04410.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGreen-Wood\u002FCoMER)\n\n- (arXiv 2022.07) DPText-DETR：借助Transformer中的动态点实现更好的**场景文本检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04491.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fymy-k\u002FDPText-DETR)\n\n- (arXiv 2022.07) DEPTHFORMER：用于**单目深度估计**的多尺度视觉Transformer，结合全局与局部信息融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04535.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fashutosh1807\u002FDepthformer.git)\n\n- (arXiv 2022.07) LaT：基于循环一致性的潜在翻译技术，用于**视频-文本检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04858.pdf)\n\n- (arXiv 2022.07) **双**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04976.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYehLi\u002FImageNetModel)\n\n- (arXiv 2022.07) Wave-ViT：将**小波**与Transformer统一用于视觉**表征**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04978.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYehLi\u002FImageNetModel)\n\n- (arXiv 2022.07) 利用弱监督检测Transformer扩展新型物体**检测**能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05205.pdf)\n\n- (arXiv 2022.07) 使用Transformer挖掘群体线索，用于社交**群体活动识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05254.pdf)\n\n- (arXiv 2022.07) 基于查询的**外延绘画**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05312.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKaiseem\u002FQueryOTR)\n\n- (arXiv 2022.07) IDEA：通过在线多标签识别提高文本多样性，用于**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05333.pdf)\n\n- (arXiv 2022.07) 视频图Transformer用于**视频问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05342.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FVGT)\n\n- (arXiv 2022.07) Next-ViT：下一代视觉Transformer，专为现实**工业**场景中的**高效部署**而设计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05501.pdf)\n\n- (arXiv 2022.07) UniNet：结合卷积、Transformer和MLP的统一**架构搜索**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05420.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniNet)\n\n- (arXiv 2022.07) 基于**密钥**的视觉Transformer图像与模型变换，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05366.pdf)\n\n- (arXiv 2022.07) eX-ViT：一种新颖的可解释视觉Transformer，用于**弱监督语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05358.pdf)\n\n- (arXiv 2022.07) 面向**少样本动作识别**的复合原型匹配，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05515.pdf)\n\n- 
(arXiv 2022.07) 长期跳跃注意力与短期周期性移位用于**视频分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05526.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVideoNetworks\u002FLAPS-transformer)\n\n- (arXiv 2022.07) LightViT：迈向**轻量级**且**无卷积**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05557.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhunto\u002FLightViT)\n\n- (arXiv 2022.07) 从人类情感中的**标签关系**中学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05577.pdf)\n\n- (arXiv 2022.07) MSP-Former：用于单张图像**去雪**的多尺度投影Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05621.pdf)\n\n- (arXiv 2022.07) 告诉我证据是什么？用于**答案定位**的双模态**视觉-语言**交互，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05703.pdf)\n\n- (arXiv 2022.07) 基于NeRF的、仅需单张输入图像的**视图合成**用视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05736.pdf)，[[代码]](https:\u002F\u002Fcseweb.ucsd.edu\u002F~viscomp\u002Fprojects\u002FVisionNeRF\u002F)\n\n- (arXiv 2022.07) COSIM：用于**反事实场景想象**的常识推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03961.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhyounghk\u002FCoSIm)\n\n- (arXiv 2022.07) 超越迁移学习：用于**动作定位**的协同微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03807.pdf)\n\n- (arXiv 2022.07) RePFormer：用于鲁棒**人脸关键点检测**的精炼金字塔Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03917.pdf)\n\n- (arXiv 2022.07) k-means**掩码**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04044.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fdeeplab2)\n\n- (arXiv 2022.07) **联合训练**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03481.pdf)，[[代码]](https:\u002F\u002Ftraining-transformers-together.github.io\u002F)\n\n- (arXiv 2022.07) 利用机器和用户生成的自然语言描述提升**少样本图像分类**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03133.pdf)\n\n- (arXiv 2022.07) MaiT：利用**注意力掩码**使图像Transformer更加**高效**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03006.pdf)\n\n- (arXiv 2022.07) 用于通用**事件边界字幕生成**的双流Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03038.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGX77\u002FDual-Stream-Transformer-for-Generic-Event-Boundary-Captioning)\n\n- (arXiv 2022.07) **无Softmax**线性Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03341.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FSOFT)\n\n- (arXiv 2022.07) 桥接对象级与图像级表征以实现**开放词汇检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.03482.pdf)，[[代码]](https:\u002F\u002Fbit.ly\u002F3byZoQp)\n\n- (arXiv 2022.07) Transformer是可适应的**任务规划器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02442.pdf)，[[代码]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002Ftemporal_task_planner-Paper148\u002F)\n\n- (arXiv 2022.07) 使用**物理感知Transformer**进行**阵列相机图像融合**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02250.pdf)\n\n- (arXiv 2022.07) OSFormer：基于Transformer的一阶段伪装实例**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02255.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPJLallen\u002FOSFormer)\n\n- (arXiv 2022.07) 视觉-语言Transformer中面向**VQA**的弱监督定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02334.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Faurooj\u002FWSG-VQA-VLTransformers)\n\n- (arXiv 2022.07) PIC第4次挑战：语义辅助的多特征编码与多头解码用于密集**视频字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02583.pdf)\n\n- (arXiv 2022.07) STVGFormer：结合静动态跨模态理解的时空**视频定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02756.pdf)\n\n- (arXiv 2022.07) 
通过**CLIP**实现反事实**图像操纵**的探索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02812.pdf)\n\n- (arXiv 2022.07) MatFormer：一种用于程序化**材料**的**生成式**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01044.pdf)\n\n- (arXiv 2022.07) 用于**视频摘要**的多模态帧评分Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01814.pdf)\n\n- (arXiv 2022.07) 基于实例编码Transformer生成**3D零件装配**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01779.pdf)\n\n- (arXiv 2022.07) 场景感知提示用于多模态**对话理解和生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01823.pdf)\n\n- (arXiv 2022.07) 通过自适应上下文池化实现**高效**表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01844.pdf)\n\n- (arXiv 2022.07) 受交互式注意力启发的**注视目标估计**，[[论文]](https:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=9828503)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnkuhzx\u002FVSG-IA)\n\n- (arXiv 2022.07) 可泛化的基于补丁的**神经渲染**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.10662.pdf)，[[项目]](https:\u002F\u002Fmohammedsuhail.net\u002Fgen_patch_neural_rendering\u002F)\n\n- (arXiv 2022.07) 用于人类**反应生成**的交互Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01685.pdf)\n\n- (arXiv 2022.07) TM2T：用于**3D人体动作与文本**相互生成的随机化与分词建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01696.pdf)，[[项目]](https:\u002F\u002Fericguo5513.github.io\u002FTM2T\u002F)\n\n- (arXiv 2022.07) FishFormer：基于环形切片的Transformer，用于**鱼眼校正**并探索其适用范围，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01925.pdf)\n\n- (arXiv 2022.07) 通过多模态**知识迁移**实现开放词汇多标签分类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01887.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fseanhe97\u002FMKT)\n\n- (arXiv 2022.07) 通过指代性文本短语实现可解释且细粒度的**3D定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01821.pdf)，[[代码]](https:\u002F\u002Fyanx27.github.io\u002Fphraserefer\u002F)\n\n- (arXiv 2022.07) 利用层次化层间注意力改进Transformer中的**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02126.pdf)\n\n- (arXiv 2022.07) 针对**语言与视觉**扰动的多模态**鲁棒性**分析，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02159.pdf)，[[项目]](https:\u002F\u002Fmaddy12.github.io\u002FMultiModalVideoRobustness\u002F)\n\n- (arXiv 2022.07) CoBEVT：使用稀疏Transformer进行协作式的**鸟瞰语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02202.pdf)\n\n- (arXiv 2022.07) 通过以物体为中心的分层表征来**分割运动物体**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02206.pdf)\n\n- (arXiv 2022.07) 对**视觉-语言**预训练模型中的**社会偏见**进行反事实测量与消除，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01056.pdf)\n\n- (arXiv 2022.07) 对比跨模态知识共享预训练用于**视觉-语言**表征学习与检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00733.pdf)\n\n- (arXiv 2022.07) 基于Transformer学习跨图像对象语义关系以进行**少样本细粒度**图像**分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00784.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJiakangYuan\u002FHelixFormer)\n\n- (arXiv 2022.07) 基于记忆的标签-文本调优用于**少样本**类**增量****学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01036.pdf)\n\n- (arXiv 2022.07) 利用上下文信息进行通用事件边界**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01050.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzjr2000\u002FContext-GEBC)\n\n- (arXiv 2022.07) 只需一个**检测器**：基于视觉Transformer的**不同模态**统一目标检测器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01071.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fliketheflower\u002FYONOD.git)\n\n- (arXiv 2022.07) 将更多注意力转向**视觉-语言跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01076.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJudasDie\u002FSOTS)\n\n- (arXiv 2022.07) 
**语言**能理解**深度**吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01077.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAdonis-galaxy\u002FDepthCLIP)\n\n- (arXiv 2022.07) TANet：基于Transformer的非对称网络用于**RGB-D显著性目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01172.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flc012463\u002FTANet)\n\n- (arXiv 2022.07) DUET：用于**对比零样本学习**的跨模态语义对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01328.pdf)\n\n- (arXiv 2022.07) 为视觉识别迁移**文本知识**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01297.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwhwu95\u002FText4Vis)\n\n- (arXiv 2022.07) R^2-VOS：通过关系循环一致性实现鲁棒的引用式**视频**目标**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01203.pdf)\n\n- (arXiv 2022.07) CRFormer：一种用于**阴影去除**的跨区域Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01600.pdf)\n\n- (arXiv 2022.07) 动态**空间稀疏化**用于**高效**视觉Transformer和卷积神经网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01580.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fraoyongming\u002FDynamicViT)\n\n- (arXiv 2022.07) 回归MLP：人类**运动预测**的简单基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01567.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdulucas\u002FsiMLPe)\n\n- (arXiv 2022.07) I-ViT：仅整数**量化**用于**高效**视觉Transformer推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01405.pdf)\n\n- (arXiv 2022.07) 重新思考视觉Transformer中的**查询-键**成对交互，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00188.pdf)\n\n- (arXiv 2022.07) 大规模**视频动作识别**模型的**鲁棒性**分析，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01398.pdf)，[[代码]](https:\u002F\u002Frose-ar.github.io\u002F)\n\n- (arXiv 2022.07) VL-CheckList：使用物体、属性和关系**评估**预训练的**视觉-语言**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00221.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FVL-CheckList)\n\n- (arXiv 2022.07) **掩码自编码器**用于汽车**点云**上的自监督学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00531.pdf)\n\n- (arXiv 2022.07) MotionMixer：基于**MLP**的**3D**人体**姿态预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00499.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMotionMLP\u002FMotionMixer)\n\n- (arXiv 2022.07) DALG：用于**图像检索**的深度注意局部与全局建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00287.pdf)\n\n- (arXiv 2022.07) PolarFormer：利用极坐标Transformer进行多摄像头**3D目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15398.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FPolarFormer)\n\n- (arXiv 2022.07) CTrGAN：用于**步态迁移**的循环Transformer GAN，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15248.pdf)\n\n- (arXiv 2022.07) LM-Nav：结合大型预训练的**语言**、**视觉**和**行动**模型的机器人**导航**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.04429.pdf)\n\n- (arXiv 2022.07) 用于视觉BERT预训练的自举式**掩码自编码器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07116.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLightDXY\u002FBootMAE)\n\n- (arXiv 2022.07) ReAct：利用关系查询进行**时序动作检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07097.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsssste\u002FReact)\n\n- (arXiv 2022.07) 通过视觉**领域**的视角对**全视界**表征进行**基准测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07106.pdf)，[[项目]](https:\u002F\u002Fzhangyuanhan-ai.github.io\u002FOmniBenchmark)\n\n- (arXiv 2022.07) **卷积旁路**是更好的视觉Transformer**适配器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07039.pdf)\n\n- (arXiv 2022.07) 使用像素进行**语言建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06991.pdf)\n\n- (arXiv 2022.07) 
基于Transformer的上下文压缩以增强目标**检测**中的特征金字塔，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06603.pdf)\n\n- (arXiv 2022.07) 使用时空丢弃Transformer进行**深度伪造视频检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06612.pdf)\n\n- (arXiv 2022.07) iColoriT：借助视觉Transformer，将局部提示传播到正确区域，以实现**交互式上色**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06831.pdf)\n\n- (arXiv 2022.07) 利用**湍流**抑制Transformer进行**大气**中的**成像**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06465.pdf)\n\n- (arXiv 2022.07) 对称感知Transformer用于**镜面检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06332.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ftyhuang0428\u002FSATNet)\n\n- (arXiv 2022.07) 金字塔Transformer用于**交通标志检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06067.pdf)\n\n- (arXiv 2022.07) 全局-局部运动Transformer用于无监督的基于**骨骼**的**动作**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06101.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBoeun-Kim\u002FGL-Transformer)\n\n- (arXiv 2022.07) DynaST：用于示例引导的**图像生成**的动态稀疏Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06124.pdf)\n\n- (arXiv 2022.07) Trans4Map：利用视觉Transformer从**自我中心图像**到**客体中心语义**的整体自顶向下映射重访，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06205.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FTrans4Map)\n\n- (arXiv 2022.07) 入口翻转Transformer用于**参与者行为**的推理与预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.06235.pdf)\n\n- (arXiv 2022.07) Wayformer：通过简单高效的注意力网络进行**运动预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05844.pdf)\n\n- (arXiv 2022.07) 利用关键帧和Transformer控制器进行多样化的**舞蹈合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05906.pdf)\n\n- (arXiv 2022.07) 学习估计视频中人体运动的**外力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05845.pdf)\n\n- (arXiv 2022.07) 用于**对比聚类**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12925.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJackKoLing\u002FVTCC)\n\n- (arXiv 2022.07) Pose2Room：从**人类活动**理解**三维场景**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03030.pdf)\n\n- (arXiv 2022.07) 面向基于DETR的**人-物体交互检测**的难正样本查询挖掘，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05293.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMuchHair\u002FHQM)\n\n- (arXiv 2022.07) 跨架构**知识蒸馏**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.05273.pdf)\n\n- (arXiv 2022.07) 距离在**人-物体交互检测**中至关重要，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.01869.pdf)\n\n\n\n### 2022年6月\n\n- (arXiv 2022.06) TENET：用于**运动预测**中有效时序流的Transformer编码网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00170.pdf)\n\n- (arXiv 2022.06) GaitForeMer：通过人类运动预测进行自监督预训练的Transformer，用于**少样本步态障碍严重程度估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00106.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmarkendo\u002FGaitForeMer)\n\n- (arXiv 2022.06) GSCLIP：一种解释自然**语言**中**分布偏移**的框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15007.pdf)\n\n- (arXiv 2022.06) 基于迁移学习的空间Transformer网络，用于小规模细粒度骨骼动作的太极拳**动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15002.pdf)\n\n- (arXiv 2022.06) 用于本质**可解释**Transformer的因果性：CAT-XPLAIN，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14841.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmvrl\u002FCAT-XPLAIN)\n\n- (arXiv 2022.06) 一种统一的端到端检索-阅读框架，用于基于知识的**VQA**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14989.pdf)\n\n- (arXiv 2022.06) 使用Transformer进行**持续学习**以实现**图像分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14085.pdf)\n\n- (arXiv 2022.06) 
ST-Adapter：参数**高效**的**图像到视频**迁移学习，用于**动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13559.pdf)\n\n- (arXiv 2022.06) 通过**测试时**类别条件特征对齐，在不从头重新训练的情况下增强视觉Transformer的**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13951.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkojima-takeshi188\u002FCFA)\n\n- (arXiv 2022.06) 利用**语言**加速**工具操作**的学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13074.pdf)\n\n- (arXiv 2022.06) RoME：面向角色感知的专家混合Transformer，用于**文本到视频检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12845.pdf)\n\n- (arXiv 2022.06) VLCAP：结合**视觉-语言**与对比学习，实现连贯的视频段落**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12972.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FUARK-AICV\u002FVLCAP)\n\n- (arXiv 2022.06) Video2**StyleGAN**：将**视频**编码到潜在空间以进行**操控**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13078.pdf)\n\n- (arXiv 2022.06) 文本驱动的**视频对象风格化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12396.pdf)，[[项目]](https:\u002F\u002Fsloeschcke.github.io\u002FText-Driven-Stylization-of-Video-Objects\u002F)\n\n- (arXiv 2022.06) 具有提案挖掘和预测均衡的**开放词汇**目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11134.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPealing\u002FMEDet)\n\n- (arXiv 2022.06) CMT-DeepLab：用于全景**分割**的聚类掩码Transformer，[[论文]](about:blank)\n\n- (arXiv 2022.06) 针对**视觉-语言**预训练模型的对抗性**攻击**研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09391.pdf)\n\n- (arXiv 2022.06) CLiMB：一个用于**视觉与语言**任务的**持续学习**基准测试，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09059.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGLAMOR-USC\u002FCLiMB)\n\n- (arXiv 2022.06) **可视化**并理解**自监督**视觉学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09753.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffawazsammani\u002Fxai-ssl)\n\n- (arXiv 2022.06) VReBERT：一种简单灵活的Transformer，用于**视觉关系检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09111.pdf)\n\n- (arXiv 2022.06) 将查询铭记于心：基于查询条件卷积的**视觉定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09114.pdf)\n\n- (arXiv 2022.06) **DALL-E**用于**检测**：基于语言的上下文图像合成用于目标检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09592.pdf)\n\n- (arXiv 2022.06) REVECA——丰富的编码器-解码器框架，用于**视频事件字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09178.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTooTouch\u002FREVECA)\n\n- (arXiv 2022.06) SAViR-T：基于Transformer的**空间注意力视觉推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09265.pdf)\n\n- (arXiv 2022.06) EATFormer：受**进化**算法启发改进的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09325.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhangzjn\u002FEATFormer)\n\n- (arXiv 2022.06) Dual**CoOp**：通过有限标注快速适应**多标签**识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09541.pdf)\n\n- (arXiv 2022.06) 捕捉并推断密集的全身**人-场景接触**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09553.pdf)，[[项目]](https:\u002F\u002Frich.is.tue.mpg.de\u002F)\n\n- (arXiv 2022.06) M&M Mix：一个多模态多视角Transformer**集成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09852.pdf)\n\n- (arXiv 2022.06) DisCoVQA：用于**视频质量评估**的时序扭曲内容Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09853.pdf)\n\n- (arXiv 2022.06) Voxel-**MAE**：用于大规模**点云**预训练的掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09900.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchaytonmin\u002FVoxel-MAE)\n\n- (arXiv 2022.06) 
**全局上下文**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.09959.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGCViT)\n\n- (arXiv 2022.06) 通过密度引导的自适应选择CNN和Transformer估算来**计数**不同密度的**人群**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10075.pdf)\n\n- (arXiv 2022.06) 单阶段**动作检测**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10080.pdf)\n\n- (arXiv 2022.06) Sem**MAE**：语义引导的掩码用于学习掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10207.pdf)\n\n- (arXiv 2022.06) 基于Transformer的多模态提案与重排序，用于维基百科**图片字幕**匹配，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10436.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmesnico\u002FWiki-Image-Caption-Matching)\n\n- (arXiv 2022.06) **邻域**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10552.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FVicinity-Vision-Transformer)\n\n- (arXiv 2022.06) EdgeNeXt：用于移动视觉应用的**高效**融合**CNN-Transformer**架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10589.pdf)，[[代码]](https:\u002F\u002Ft.ly\u002F_Vu9)\n\n- (arXiv 2022.06) 时序一致的语义**视频编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10590.pdf)\n\n- (arXiv 2022.06) VLMbench：面向**视觉-语言操控**的组合基准测试，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08522.pdf)\n\n- (arXiv 2022.06) MINEDOJO：利用互联网规模知识构建开放式**具身智能体**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08853.pdf)，[[项目]](https:\u002F\u002Fminedojo.org\u002F)\n\n- (arXiv 2022.06) IRISformer：用于室内场景单张图像**逆向渲染**的密集视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08423.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FViLab-UCSD\u002FIRISformer)\n\n- (arXiv 2022.06) 视觉Transformer中的后门**攻击**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08477.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FUCDvision\u002Fbackdoor_transformer.git)\n\n- (arXiv 2022.06) 通过视觉**显著性**纠正ViT的**捷径**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08567.pdf)\n\n- (arXiv 2022.06) 利用特权信息进行**零样本动作识别**的学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08632.pdf)\n\n- (arXiv 2022.06) Bridge-Tower：在**视觉-语言**表征学习中搭建编码器之间的桥梁，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08657.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBridgeTower)\n\n- (arXiv 2022.06) CtrlFormer：通过Transformer学习可迁移的**视觉控制**状态表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08883.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fctrlformer-icml\u002F)\n\n- (arXiv 2022.06) SimA：适用于视觉Transformer的简单**无Softmax注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08898.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FUCDvision\u002Fsima)\n\n- (arXiv 2022.06) UNIFIED-IO：一个用于**视觉、语言**及**多模态**任务的**统一模型**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08916.pdf)，[[项目]](https:\u002F\u002Funified-io.allenai.org\u002F)\n\n- (arXiv 2022.06) VLMixer：通过跨模态CutMix进行无配对的**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08919.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fttengwang\u002FVLMixer)\n\n- (arXiv 2022.06) ZJU-Alibaba团队提交至2022年**Ego4D**自然语言**查询**挑战赛的ReLER@ZJU-Alibaba方案，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00383.pdf)\n\n- (arXiv 2022.06) 基于Video + **CLIP**的**Ego4D**长期**动作预测**基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00579.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsrijandas07\u002Fclip_baseline_LTA_Ego4d)\n\n- (arXiv 2022.06) 什么使得**领域泛化**如此困难？，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07802.pdf)\n\n- (arXiv 2022.06) 
SAVi++：迈向从真实世界**视频**中进行端到端的**以对象为中心的学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07764.pdf)，[[代码]](https:\u002F\u002Fslot-attention-video.github.io\u002Fsavi++\u002F)\n\n- (arXiv 2022.06) 在**CLIP**中解耦**视觉**与**书面****概念**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07835.pdf)，[[项目]](https:\u002F\u002Fjoaanna.github.io\u002Fdisentangling_spelling_in_clip\u002F)\n\n- (arXiv 2022.06) 多尺度协作式多模态Transformer用于视频中的**情感分析**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07981.pdf)\n\n- (arXiv 2022.06) 自监督视觉Transformer的**patch**级别**表征**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07990.pdf)\n\n- (arXiv 2022.06) 通过冻结双向语言模型实现**零样本视频问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08155.pdf)，[[代码]](https:\u002F\u002Fantoyang.github.io\u002Ffrozenbilm.html)\n\n- (arXiv 2022.06) Omni**MAE**：针对图像和视频的单模型掩码预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08356.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fomnivore)\n\n- (arXiv 2022.06) 通过探查注意力条件下的**掩码一致性**来**适应**自监督视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08222.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvirajprabhu\u002FPACMAC)\n\n- (arXiv 2022.06) LAVENDER：将**视频-语言**理解统一为掩码语言建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07160.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLAVENDER)\n\n- (arXiv 2022.06) 多模态事件图：迈向对**多模态**世界的**事件中心理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07207.pdf)\n\n- (arXiv 2022.06) 重新思考**少样本分类**中的泛化问题，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07267.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmrkshllr\u002FFewTURE)\n\n- (arXiv 2022.06) VCT：一种**视频压缩**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07307.pdf)\n\n- (arXiv 2022.06) 利用Transformer和自监督进行**深度**和**自我运动**的**预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07435.pdf)\n\n- (arXiv 2022.06) 在主干网络中融合进行粗粒度到细粒度的**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07643.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FFIBER)\n\n- (arXiv 2022.06) SP-ViT：为视觉Transformer学习**2D空间先验**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07662.pdf)\n\n- (arXiv 2022.06) 一种简单的**数据混合**先验，用于提升**自监督**学习效果，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07692.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOliverRensu\u002FSDMP)\n\n- (arXiv 2022.06) 前缀语言模型是**统一的多模态学习者**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07699.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshizhediao\u002FDaVinci)\n\n- (arXiv 2022.06) 面向自监督视觉预训练的**掩码频率建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07706.pdf)，[[代码]](https:\u002F\u002Fwww.mmlab-ntu.com\u002Fproject\u002Fmfm\u002Findex.html)\n\n- (arXiv 2022.06) 可泛化的**神经辐射场**，用于借助Transformer进行新视角合成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05375.pdf)\n\n- (arXiv 2022.06) 一个多模态知识发现与预训练的统一**持续学习**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05555.pdf)\n\n- (arXiv 2022.06) 利用视觉Transformer学习估计**夏普利值**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05282.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsuinleelab\u002Fvit-shapley)\n\n- (arXiv 2022.06) 基于图的空间Transformer结合记忆重放技术，用于多未来**行人轨迹预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05712.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJacobieee\u002FST-MR)\n\n- (arXiv 2022.06) **GLIPv2**：统一定位与**VL理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05836.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP)\n\n- (arXiv 2022.06) 
INDIGO：用于**领域泛化**的内在多模态性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05912.pdf)\n\n- (arXiv 2022.06) 基于**类别条件**对比学习的**传导式CLIP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06177.pdf)\n\n- (arXiv 2022.06) SILVER-BULLET-3D 在 MANISKILL 2021 上：基于示范学习与启发式规则的方法用于**物体操作**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06289.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcaiqi\u002FSilver-Bullet-3D\u002F)\n\n- (arXiv 2022.06) MLP-3D：一种类似**MLP**的**3D**架构，带有分组时间混合机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06292.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZhaofanQiu\u002FMLP-3D)\n\n- (arXiv 2022.06) 用于目标**检测**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06323.pdf)\n\n- (arXiv 2022.06) 通过对象令牌的帧-片段一致性，将**图像**场景结构引入视频，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06346.pdf)，[[代码]](https:\u002F\u002Feladb3.github.io\u002FSViT\u002F)\n\n- (arXiv 2022.06) TransVG++：基于语言条件的视觉Transformer实现端到端**视觉定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06619.pdf)\n\n- (arXiv 2022.06) ReCo：检索与协同**分割**以实现**零样本**迁移，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07045.pdf)，[[项目]](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Freco)\n\n- (arXiv 2022.06) MAREO：基于**记忆**和**注意力**的**视觉推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04928.pdf)\n\n- (arXiv 2022.06) 用于**多动作运动合成**的递归Transformer变分自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06741.pdf)\n\n- (arXiv 2022.06) **物体场景表示**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06922.pdf)\n\n- (arXiv 2022.06) 理解与排序图像**字幕**中的语义，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06930.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYehLi\u002Fxmodaler\u002Ftree\u002Fmaster\u002Fconfigs\u002Fimage_caption\u002Fcosnet)\n\n- (arXiv 2022.06) 探索使用**DINO**训练的视觉Transformer中的**对抗攻击**与**防御**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06761.pdf)\n\n- (arXiv 2022.06) **周边**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06801.pdf)，[[代码]](http:\u002F\u002Fcvlab.postech.ac.kr\u002Fresearch\u002FPerViT\u002F)\n\n- (arXiv 2022.06) 使用Transformer实现高效的无解码器目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06829.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPealing\u002FDFFT.)\n\n- (arXiv 2022.06) 原型化的**对比语言图像预训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10996.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmegvii-research\u002Fprotoclip)\n\n- (arXiv 2022.06) SpA-Former：通过空间注意力进行图像**阴影检测与去除**的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10910.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhangbaijin\u002FSpA-Former-shadow-removal)\n\n- (arXiv 2022.06) 视觉Transformer的统一且生物合理的**关系图表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11073.pdf)\n\n- (arXiv 2022.06) 基础模型能谈论**因果关系**吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10591.pdf)\n\n- (arXiv 2022.06) 通过在3D空间中恢复令牌来学习**视角无关**的视觉**表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11895.pdf)，[[代码]](https:\u002F\u002Fwww3.cs.stonybrook.edu\u002F~jishang\u002F3dtrl\u002F3dtrl.html)\n\n- (arXiv 2022.06) MaskViT：用于**视频预测**的**掩码**视觉预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11894.pdf)\n\n- (arXiv 2022.06) PromptPose：语言**提示**有助于**动物姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11752.pdf)\n\n- (arXiv 2022.06) **视频预训练**（VPT）：通过观看**未标注**的**在线**视频来学习行动，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11795.pdf)\n\n- (arXiv 2022.06) MERLOT 
Reserve：通过**视觉、语言和声音**获取神经脚本知识，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02639.pdf)，[[项目]](https:\u002F\u002Frowanzellers.com\u002Fmerlotreserve)\n\n- (arXiv 2022.06) 构建时空Transformer用于**第一人称视角的3D姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04785.pdf)\n\n- (arXiv 2022.06) **位置**标签用于**自监督**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04981.pdf)\n\n- (arXiv 2022.06) 探索特征自相关性以用于**自监督**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05184.pdf)\n\n- (arXiv 2022.06) 基于补丁的对象中心Transformer用于高效**视频生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04003.pdf)，[[代码]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fpovt-public)\n\n- (arXiv 2022.06) 稀疏融合**专家混合**是领域泛化学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04046.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLuodian\u002FSF-MoE-DG)\n\n- (arXiv 2022.06) VN-Transformer：面向向量神经元的**旋转等变**注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04176.pdf)\n\n- (arXiv 2022.06) **CLIP**-Actor：基于**文本**驱动的推荐与风格化，用于**动画化人体网格**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04382.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYouwang-Kim\u002FCLIP-Actor)\n\n- (arXiv 2022.06) **OOD** 数据增强可能与**开放集识别**相冲突，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04242.pdf)\n\n- (arXiv 2022.06) 草稿与修订：利用上下文RQ-Transformer实现高效的**图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04452.pdf)\n\n- (arXiv 2022.06) cycle text2face：通过Transformer实现**文本到人脸GAN**的循环，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04503.pdf)\n\n- (arXiv 2022.06) 通过几何引导的核Transformer实现高效且鲁棒的**2D到BEV**表示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04584.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FGKT)\n\n- (arXiv 2022.06) 基于Transformer的乌尔都语手写**文字**光学字符识别器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04575.pdf)\n\n- (arXiv 2022.06) 用于视觉Transformer的**空间熵正则化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04636.pdf)\n\n- (arXiv 2022.06) 关于**掩码图像建模**中的数据缩放问题，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04664.pdf)\n\n- (arXiv 2022.06) 极端**掩码**用于学习实例级和分布式视觉**表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04667.pdf)\n\n- (arXiv 2022.06) GateHUB：带有背景抑制功能的门控历史单元，用于**在线动作检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04668.pdf)\n\n- (arXiv 2022.06) 利用基于Transformer的注意力模型对监控视频进行**异常检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01524.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkapildeshpande\u002FAnomaly-Detection-in-Surveillance-Videos)\n\n- (arXiv 2022.06) Contra**CLIP**：由对比句子对驱动的可解释**GAN**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02104.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchi0tzp\u002FContraCLIP)\n\n- (arXiv 2022.06) EAANet：**高效**注意力增强卷积网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01821.pdf)\n\n- (arXiv 2022.06) 视觉线索：弥合**视觉与语言**基础，用于图像段落**字幕**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01843.pdf)\n\n- (arXiv 2022.06) 基于引导形变注意力的循环**视频修复**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02146.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJingyunLiang\u002FRVRT)\n\n- (arXiv 2022.06) 重新思考**CLIP**的**开放性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01986.pdf)\n\n- (arXiv 2022.06) OrdinalCLIP：为**语言引导的序数回归**学习排序提示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02338.pdf)\n\n- (arXiv 2022.06) 面向多通道**视频-语言检索**的预训练对比模型快速适应研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02082.pdf)\n\n- (arXiv 2022.06) 
用于视频中**文本分类**的对比图多模态模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02343.pdf)\n\n- (arXiv 2022.06) 针对**移动端**视觉Transformer的可分离**自注意力**机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02680.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-cvnets)\n\n- (arXiv 2022.06) Mask **DINO**：迈向统一的基于Transformer的目标**检测**与**分割**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02777.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIDEACVR\u002FMaskDINO)\n\n- (arXiv 2022.06) 基于LIMoE的多模态对比学习：**语言-图像**专家混合模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02770.pdf)\n\n- (arXiv 2022.06) cViL：利用知识蒸馏进行**视觉-语言**模型的跨语言训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03354.pdf)\n\n- (arXiv 2022.06) 零样本图像分类的**掩码式**无监督自训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02967.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FMUST)\n\n- (arXiv 2022.06) DETR++：驯服你的多尺度**检测**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02977.pdf)\n\n- (arXiv 2022.06) 用于通用**事件边界检测**的结构化上下文Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02985.pdf)\n\n- (arXiv 2022.06) 揭示面向**视频-语言**学习的单帧偏差，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03428.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjayleicn\u002Fsingularity)\n\n- (arXiv 2022.06) Cerberus Transformer：联合**语义**、**可用性**和**属性**解析，[[论文]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FChen_Cerberus_Transformer_Joint_Semantic_Affordance_and_Attribute_Parsing_CVPR_2022_paper.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOPEN-AIR-SUN\u002FCerberus)\n\n- (arXiv 2022.06) CNN能否比Transformer更**鲁棒**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03452.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FRobustCNN)\n\n- (arXiv 2022.06) 检测中心：通过语言嵌入上的查询适配统一目标**检测**数据集，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03484.pdf)\n\n- (CVPR 2022) 关键点Transformer：解决复杂**手与物体交互**中的关节识别问题，以实现精确的**3D姿态估计**，[[论文]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FHampali_Keypoint_Transformer_Solving_Joint_Identification_in_Challenging_Hands_and_Object_CVPR_2022_paper.pdf)\n\n- (arXiv 2022.06) A-OKVQA：一个基于世界知识的**视觉问答**基准，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01718.pdf)，[[项目]](http:\u002F\u002Fa-okvqa.allenai.org\u002F)\n\n- (arXiv 2022.06) 重温**视频-语言**理解中的“视频”概念，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01720.pdf)，[[项目]](https:\u002F\u002Fstanfordvl.github.io\u002Fatp-revisit-video-lang\u002F)\n\n- (arXiv 2022.06) 基于局部**掩码式**重建的高效**自监督**视觉预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00790.pdf)\n\n- (arXiv 2022.06) 针对复杂**场景生成**的图像构图建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00923.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJohnDreamer\u002FTwFA)\n\n- (arXiv 2022.06) 面向**视频动作预测**的统一递归建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01009.pdf)\n\n- (arXiv 2022.06) **前缀条件化**统一了语言和标签监督，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01125.pdf)\n\n- (arXiv 2022.06) 优化视觉Transformer的相关性图可提升其**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01161.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhila-chefer\u002FRobustViT)\n\n- (arXiv 2022.06) VL-BEIT：生成式**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01127.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n- (arXiv 2022.06) 
EfficientFormer：以MobileNet速度运行的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01191.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FEfficientFormer)\n\n- (arXiv 2022.06) REVIVE：在基于知识的**视觉问答**中，区域视觉表征至关重要，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01201.pdf)\n\n- (arXiv 2022.06) 用于**自监督**视觉表征学习的暹罗图像建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01204.pdf)\n\n- (CVPR 2022) 基于Oracle查询的蒸馏方法用于基于Transformer的**人-物交互检测**，[[论文]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FQu_Distillation_Using_Oracle_Queries_for_Transformer-Based_Human-Object_Interaction_Detection_CVPR_2022_paper.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSherlockHolmes221\u002FDOQ)\n\n- (CVPR 2022) 探索基于结构感知的Transformer在交互提案上的应用，用于**人-物交互检测**，[[论文]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FZhang_Exploring_Structure-Aware_Transformer_Over_Interaction_Proposals_for_Human-Object_Interaction_Detection_CVPR_2022_paper.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzyong812\u002FSTIP)\n\n- (CVPR 2022) 基于瞬间观测的人类**轨迹预测**，[[论文]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FSun_Human_Trajectory_Prediction_With_Momentary_Observation_CVPR_2022_paper.pdf)\n\n- (arXiv 2022.06) 我的邻居在哪里？利用**自监督**视觉Transformer中的补丁关系，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00481.pdf)\n\n- (arXiv 2022.06) 将基于体素的表示与Transformer统一用于**3D目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00630.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FUVTR)\n\n- (arXiv 2022.06) 通过结构幻觉Transformer级联实现极端的**平面图重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00645.pdf)\n\n- (arXiv 2022.06) 跨视角语言建模：迈向统一的跨语言**跨模态预训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00621.pdf)\n\n- (arXiv 2022.06) VALHALLA：用于机器翻译的**视觉幻觉**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00100.pdf)，[[代码]](http:\u002F\u002Fwww.svcl.ucsd.edu\u002Fprojects\u002Fvalhalla)\n\n- (arXiv 2022.06) 利用Transformer学习序列上下文以进行**3D手部姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00171.pdf)\n\n- (arXiv 2022.06) CLIP4IDC：用于图像差异描述的**CLIP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00629.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsushizixin\u002FCLIP4IDC)\n\n- (arXiv 2022.06) 基于空间感知与语义感知标记对齐的跨域**检测**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00222.pdf)\n\n- (arXiv 2022.06) 视觉**GNN**：一张图胜过一组节点，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00272.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FCV-Backbones)\n\n- (arXiv 2022.06) 面向随机人类**运动预测**的弱监督动作过渡学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15608.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwei-mao-2019\u002FWAT)\n\n- (arXiv 2022.06) TubeFormer-**DeepLab**：视频掩码Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15361.pdf)\n\n- (arXiv 2022.06) 基于Tubelet标记的**视频**中**人-物体交互**检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01908.pdf)\n\n\n\n### 2022年5月\n\n- (arXiv 2022.05) HeatER：基于热图Transformer的高效统一网络，用于**人体重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15448.pdf)\n\n- (arXiv 2022.05) 基于Transformer的机器人**抓取检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2205\u002F2205.15112.pdf)\n\n- (arXiv 2022.05) 多模态**掩码自编码器**学习可迁移表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14204.pdf)\n\n- (arXiv 2022.05) 
基于**CLIP**引导学习的多模态**假新闻检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14304.pdf)\n\n- (arXiv 2022.05) WT-MVSNet：基于窗口的Transformer用于**多视图立体视觉**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14319.pdf)\n\n- (arXiv 2022.05) 面向**快速**预训练的对象级**掩码自编码器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14338.pdf)\n\n- (arXiv 2022.05) 更深入地研究**自监督**的**轻量级**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14443.pdf)\n\n- (arXiv 2022.05) 变分Transformer：一种超越准确性和多样性之间权衡的框架，用于图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14458.pdf)\n\n- (arXiv 2022.05) CY**CLIP**：循环对比式**语言-图像**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14459.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoel-shashank\u002FCyCLIP)\n\n- (arXiv 2022.05) MDMLP：使用**MLP**从小数据集开始进行图像分类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14477.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAmoza-Theodore\u002FMDMLP)\n\n- (arXiv 2022.05) SupMAE：**监督**的**掩码自编码器**是高效的视觉学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14540.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcmu-enyac\u002Fsupmae)\n\n- (arXiv 2022.05) 3D-C2FT：用于**多视角三维重建**的粗到精Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14575.pdf)\n\n- (arXiv 2022.05) 针对**提示**微调的提示对齐梯度，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14865.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBeierZhu\u002FPrompt-align)\n\n- (arXiv 2022.05) **光照**自适应Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14871.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcuiziteng\u002FIllumination-Adaptive-Transformer)\n\n- (arXiv 2022.05) HiViT：**层次化**视觉Transformer结合**掩码图像建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14949.pdf)\n\n- (arXiv 2022.05) GMML就是你需要的一切，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14986.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSara-Ahmed\u002FGMML)\n\n- (arXiv 2022.05) COMPLETEDT：利用密集增强推理Transformer进行**点云补全**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14999.pdf)\n\n- (arXiv 2022.05) 视觉Transformer在**密集预测**任务中的自监督预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15173.pdf)\n\n- (arXiv 2022.05) VLUE：一个用于评估**视觉-语言**模型的多任务**基准测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15237.pdf)，[[基准]](https:\u002F\u002Fvlue-benchmark.github.io\u002F)，[[代码]](https:\u002F\u002Fgithub.com\u002FMichaelZhouwang\u002FVLUE)\n\n- (arXiv 2022.05) 架构无关的**掩码图像建模**——从ViT回到CNN，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13943.pdf)\n\n- (arXiv 2022.05) 对比学习通过特征蒸馏在微调中与**掩码图像建模**相匹敌，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14141.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSwinTransformer\u002FFeature-Distillation)\n\n- (arXiv 2022.05) GIT：一个面向视觉和语言的生成式**图像到文本**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14100.pdf)\n\n- (arXiv 2022.05) 3DILG：用于**3D生成建模**的不规则潜在网格，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13914.pdf)\n\n- (arXiv 2022.05) 针对复杂且自然场景视频的简单**无监督**的**以对象为中心的学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14065.pdf)，[[代码]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fslot-transformer-for-videos)\n\n- (arXiv 2022.05) 长期**动作预测**的未来Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14022.pdf)，[[项目]](http:\u002F\u002Fcvlab.postech.ac.kr\u002Fresearch\u002FFUTR)\n\n- (arXiv 2022.05) X-ViT：高性能**线性**视觉Transformer，无需Softmax，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13805.pdf)\n\n- (arXiv 2022.05) 
通过目标感知Transformer进行**知识蒸馏**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10793.pdf)\n\n- (arXiv 2022.05) 用于快速视觉感知器的动态**查询**选择，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10873.pdf)\n\n- (arXiv 2022.05) MonoFormer：利用Transformer实现自监督单目**深度**估计的泛化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11083.pdf)\n\n- (arXiv 2022.05) PEVL：针对**视觉-语言**模型的位置增强预训练和提示微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11169.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FPEVL)\n\n- (arXiv 2022.05) 使用因果剪枝知识**提示**支持**视觉-语言**模型推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11100.pdf)\n\n- (arXiv 2022.05) 超级视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11397.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flmbxmu\u002FSuperViT)\n\n- (arXiv 2022.05) mPLUG：通过跨模态跳跃连接实现高效且有效的**视觉-语言**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12005.pdf)\n\n- (arXiv 2022.05) **VQA**-GNN：利用多模态语义图进行视觉问答**推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11501.pdf)\n\n- (arXiv 2022.05) UMSNet：用于人类**活动识别**的通用多传感器网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11756.pdf)\n\n- (arXiv 2022.05) 使用视觉Transformer进行**隐私**保护的图像**分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12041.pdf)\n\n- (arXiv 2022.05) HiVLP：用于快速图文检索的层次化**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12105.pdf)\n\n- (arXiv 2022.05) ASSET：基于Transformer的高分辨率自回归**语义场景编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12231.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDifanLiu\u002FASSET)\n\n- (arXiv 2022.05) HDGT：通过场景编码实现多智能体**轨迹预测**的异构驾驶图Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09753.pdf)\n\n- (arXiv 2022.05) **掩码**引导的视觉Transformer (MG-ViT) 用于**少样本**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09995.pdf)\n\n- (arXiv 2022.05) 面向**光谱压缩成像**的退化感知展开式半洗牌Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10102.pdf)\n\n- (arXiv 2022.05) 统一掩码：为具有局部性的基于**金字塔**的视觉Transformer提供**MAE**预训练能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10063.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fimplus\u002FUM-MAE)\n\n- (arXiv 2022.05) 视觉**概念**标记化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10093.pdf)\n\n- (arXiv 2022.05) MSTRIQ：基于Swin Transformer并采用多阶段融合的无参考**图像质量评估**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10101.pdf)\n\n- (arXiv 2022.05) CogVideo：基于Transformer的大规模**文本到视频**生成预训练，[[论文]](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo\u002Fblob\u002Fmain\u002Fpaper\u002FCogVideo-arxiv.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo)\n\n- (arXiv 2022.05) 视觉语义AI中**低定序**现象的证据，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10764.pdf)\n\n- (arXiv 2022.05) 利用双任务交互式Transformer提升伪装物体**检测**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10579.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fliuzywen\u002FCOD)\n\n- (arXiv 2022.05) muNet：将预训练深度神经网络演化为可扩展的**自动调优多任务系统**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10937.pdf)\n\n- (arXiv 2022.05) 大型语言模型是**零样本推理者**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11916.pdf)\n\n- (arXiv 2022.05) AdaptFormer：为**可扩展**视觉识别而**适配**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13535.pdf)，[[代码]](http:\u002F\u002Fwww.shoufachen.com\u002Fadaptformer-page)\n\n- (arXiv 2022.05) 面向**掩码**图像建模的**绿色**层次化视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13515.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLayneH\u002FGreenMIM)\n\n- (arXiv 2022.05) 
带有边界感知损失的高效U-Transformer用于**动作分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13425.pdf)\n\n- (arXiv 2022.05) 跨架构的**自监督视频**表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13313.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fguoshengcv\u002FCACL)\n\n- (arXiv 2022.05) 基于**提示**的学习用于无配对图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13125.pdf)\n\n- (arXiv 2022.05) MixMIM：用于**高效**视觉表征学习的**混合**与**掩码**图像建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13137.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FMixMIM)\n\n- (arXiv 2022.05) 具有HiLo**注意力**的**快速**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13213.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzip-group\u002FLITv2)\n\n- (arXiv 2022.05) 基于**CLIP**奖励的细粒度图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13115.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fj-min\u002FCLIP-Caption-Reward)\n\n- (arXiv 2022.05) 互信息散度：一种用于多模态**生成模型**的统一度量，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13445.pdf)\n\n- (arXiv 2022.05) MoCoViT：**移动****卷积**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12635.pdf)\n\n- (arXiv 2022.05) AO2-DETR：任意方向物体**检测**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12785.pdf)\n\n- (arXiv 2022.05) **Inception** Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12956.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FiFormer)\n\n- (arXiv 2022.05) VTP：用于多视角多人**3D姿态**估计的体积Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12602.pdf)\n\n- (arXiv 2022.05) UViM：一种利用学习引导代码进行视觉的**统一建模**方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10337.pdf)\n\n- (arXiv 2022.05) 带有图像描述符的语言模型是强大的少样本**视频-语言**学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.10747.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMikeWangWZHL\u002FVidIL)\n\n- (arXiv 2022.05) 仅从**字幕**训练视觉-语言Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09256.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fguilk\u002FVLC)\n\n- (arXiv 2022.05) **体素**信息驱动的**语言**接地，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09710.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Frcorona\u002Fvoxel_informed_language_grounding)\n\n- (arXiv 2022.05) 用于**动作分割**的交叉增强Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09445.pdf)\n\n- (arXiv 2022.05) TRT-ViT：面向**TensorRT**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09579.pdf)\n\n- (arXiv 2022.05) 用于视觉目标**检测**的积分迁移预训练Transformer编码器-解码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09613.pdf)\n\n- (arXiv 2022.05) 用于全玻片图像**分类**的图Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09671.pdf)\n\n- (arXiv 2022.05) VNT-Net：**旋转不变性**向量神经元Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09690.pdf)\n\n- (arXiv 2022.05) 基于去噪**对比**的**掩码**图像建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09616.pdf)\n\n- (arXiv 2022.05) 利用元学习和基于Transformer的关系建模进行跨主体**动作单元检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08787.pdf)\n\n- (arXiv 2022.05) **掩码自编码器**作为时空学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09113.pdf)\n\n- (arXiv 2022.05) BodyMap：学习全身密集**对应关系**地图，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09111.pdf)，[[代码]](https:\u002F\u002Fnsarafianos.github.io\u002Fbodymap)\n\n- (arXiv 2022.05) 通过凸对偶揭示**注意力**：视觉Transformer的分析与解释，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08078.pdf)\n\n- (arXiv 2022.05) 
Avatar**CLIP**：零样本文本驱动的3D**头像**生成与动画，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08535.pdf)\n\n- (arXiv 2022.05) 用于**密集预测**的视觉Transformer适配器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08534.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fczczup\u002FViT-Adapter)\n\n- (arXiv 2022.05) 演示：使用视觉Transformer实现实时**语义通信**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03886.pdf)\n\n- (arXiv 2022.05) MulT：一个端到端的**多任务**学习Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08303.pdf)，[[代码]](https:\u002F\u002Fivrl.github.io\u002FMulT\u002F)\n\n- (arXiv 2022.05) 一本关于长**视频检索**的**CLIP**搭车指南，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.08508.pdf)\n\n- (arXiv 2022.05) 基于Transformer的视频**帧插值**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07230.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FVFIformer)\n\n- (arXiv 2022.05) 用于图像**去噪**的密集残差Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06944.pdf)\n\n- (arXiv 2022.05) 使用AV-HuBERT学习基于嘴唇的**视听**说话人嵌入，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07180.pdf)\n\n- (arXiv 2022.05) 翻炒烹饪机器人：双臂对半流动物体的非抓取操作，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05960.pdf)\n\n- (arXiv 2022.05) 面向视频中语言驱动**动作定位**的实体感知与运动感知Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05854.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshuoyang129\u002FEAMAT)\n\n- (arXiv 2022.05) 通过提问学习**检索视频**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05739.pdf)\n\n- (arXiv 2022.05) 一个模型，**多种模态**：一种针对文本、声音、图像、视频和代码的稀疏激活方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06126.pdf)\n\n- (arXiv 2022.05) 使用视觉Transformer进行简单的开放词汇目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06230.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic\u002Ftree\u002Fmain\u002Fscenic\u002Fprojects\u002Fowl_vit)\n\n- (arXiv 2022.05) AggPose：用于婴儿**姿态估计**的深度聚合视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05277.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSZAR-LAB\u002FAggPose)\n\n- (arXiv 2022.05) 关于使用Transformer进行目标**检测**的自监督学习方法的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05543.pdf)，[[DETR代码]](https:\u002F\u002Fgithub.com\u002Fgokulkarthik\u002Fdetr)，[[Deform-DETR代码]](https:\u002F\u002Fgithub.com\u002Fgokulkarthik\u002FDeformable-DETR)\n\n- (arXiv 2022.05) 在面向多元化的图像**修复**中减少Transformer的信息损失，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05076.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fliuqk3\u002FPUT)\n\n- (arXiv 2022.05) 基于Transformer的大批量训练跨模态**食谱**嵌入，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04948.pdf)\n\n- (arXiv 2022.05) 用于野外动态**表情识别**的时空Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04749.pdf)\n\n- (arXiv 2022.05) 通过表征预训练实现可泛化的**任务规划**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07993.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fgentp)\n\n- (arXiv 2022.05) EdgeViTs：用视觉Transformer在**移动设备**上与轻量级CNN竞争，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03436.pdf)\n\n- (arXiv 2022.05) 在图像**超分辨率**Transformer中激活更多像素，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04437.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchxy95\u002FHAT)\n\n- (arXiv 2022.05) 视觉Transformer的逐行**加速器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03998.pdf)\n\n- (arXiv 2022.05) SparseTT：基于稀疏Transformer的视觉**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03776.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffzh0917\u002FSparseTT)\n\n- (arXiv 2022.05) 
RoViST：学习用于**视觉叙事**的鲁棒指标，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03774.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fusydnlp\u002Frovist)\n\n- (arXiv 2022.05) 超越边界框：面向目标**检测**的多模态知识学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04072.pdf)\n\n- (arXiv 2022.05) 多尺度采样多层次分层网络用于**视频问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04061.pdf)\n\n- (arXiv 2022.05) Incremental-DETR：通过自监督学习实现增量式少样本目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04042.pdf)\n\n- (arXiv 2022.05) Conv**MAE**：掩码**卷积**遇见掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03892.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAlpha-VL\u002FConvMAE)\n\n- (arXiv 2022.05) 使用Mixup进行**食谱检索**的跨语言适应，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03891.pdf)\n\n- (arXiv 2022.05) Zero和R2D2：大规模中文跨模态基准及**视觉-语言**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03860.pdf)\n\n- (arXiv 2022.05) 基于循环移位窗口注意力的Transformer**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03806.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSkyeSong38\u002FCSWinTT)\n\n- (arXiv 2022.05) 超越预训练目标检测器：用于图像**字幕生成**的跨模态文本和视觉上下文，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04363.pdf)\n\n- (arXiv 2022.05) 提示分布学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03340.pdf)\n\n- (arXiv 2022.05) CLIP-CLOP：**CLIP**引导的**拼贴画**和**照片蒙太奇**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03146.pdf)\n\n- (arXiv 2022.05) 用于**视频问答**的双层解耦Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.03039.pdf)\n\n- (arXiv 2022.05) 基于声明的提示调优用于**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02456.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FCCIIPLab\u002FDPT)\n\n- (arXiv 2022.05) P^3IV：利用弱监督从**教学视频**中进行概率性**程序规划**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02300.pdf)\n\n- (arXiv 2022.05) 语言模型也能“看”：在**文本生成**中接入**视觉**控件，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02655.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyxuansu\u002FMAGIC)\n\n- (arXiv 2022.05) YOLOPose：基于Transformer的关键点回归多目标**6D姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02536.pdf)\n\n- (arXiv 2022.05) 用于实时地图视图**语义分割**的跨视角Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02833.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbradyz\u002Fcross_view_transformers)\n\n- (arXiv 2022.05) i-Code：一个集成且可组合的**多模态**学习框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01818.pdf)\n\n- (arXiv 2022.05) 预训练单模态和多模态模型中的**视觉常识**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01850.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002FChenyuHeidiZhang\u002FVL-commonsense)\n\n- (arXiv 2022.05) 双重交叉注意力学习用于细粒度视觉**分类**和目标**再识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02151.pdf)\n\n- (arXiv 2022.05) RecipeSnap——一款轻量级的**图像转食谱**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.02141.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjianfa\u002FRecipeSnap-a-lightweight-image-to-recipe-model.git)\n\n- (arXiv 2022.05) CoCa：对比型字幕生成器是**图文**基础模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01917.pdf)\n\n- (arXiv 2022.05) 数据决定对比语言图像预训练（**CLIP**）中的分布**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01397.pdf)\n\n- (arXiv 2022.05) 面向**零样本动作识别**的跨模态表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2205\u002F2205.01657.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FResT)\n\n- (arXiv 2022.05) 基于均值教师Transformer的跨域目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01643.pdf)\n\n- (arXiv 2022.05) 
针对**ImageNet-1k**的更优纯ViT基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01580.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbig_vision)\n\n- (arXiv 2022.05) 用于**水下图像增强**的强化Swin-Convs Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00434.pdf)\n\n- (arXiv 2022.05) UTC：一种结合任务间对比学习的统一Transformer，用于**视觉对话**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00423.pdf)\n\n- (arXiv 2022.05) Answer-Me：多任务开放词汇**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00949.pdf)\n\n- (arXiv 2022.05) Center**CLIP**：用于高效**文本-视频检索**的标记聚类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00823.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmzhaoshuai\u002FCenterCLIP)\n\n- (arXiv 2022.05) 基于边界Transformer的任意形状**文本检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05320.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGXYM\u002FTextBPN-Puls-Plus)\n\n- (arXiv 2022.05) HULC：基于姿态流形采样和密集接触引导的**3D人体运动捕捉**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05677.pdf)，[[项目]](https:\u002F\u002Fvcai.mpi-inf.mpg.de\u002Fprojects\u002FHULC)\n\n\n\n### 2022年4月\n\n- (arXiv 2022.04) 学习理解**视频检索**中的否定，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00132.pdf)\n\n- (arXiv 2022.04) LayoutBERT：用于对象插入的掩码语言**布局**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00347.pdf)\n\n- (arXiv 2022.04) 利用视觉-语言验证与迭代推理提升**视觉定位**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00272.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyangli18\u002FVLTVG)\n\n- (arXiv 2022.04) 基于双阶段空间-通道Transformer的粗细联合**视频去噪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.00214.pdf)\n\n- (arXiv 2022.04) SideRT：一种用于单张图像**深度估计**的实时纯Transformer架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13892.pdf)\n\n- (arXiv 2022.04) 这张图片在哪里？基于Transformer的野外**地理定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13861.pdf)\n\n- (arXiv 2022.04) 基于简化Transformer的**深度估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13791.pdf)\n\n- (arXiv 2022.04) 对**DALL-E 2**的非常初步**分析**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2204\u002F2204.13807.pdf)\n\n- (arXiv 2022.04) CogView2：通过层次化Transformer实现更快更好的**文本到图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14217.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogView2)\n\n- (arXiv 2022.04) **CLIP**-Art：面向**细粒度艺术分类**的对比预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14244.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKeremTurgutlu\u002Fclip_art)\n\n- (arXiv 2022.04) TEMOS：根据文本描述**生成**多样化的人体**运动**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14109.pdf)，[[项目]](https:\u002F\u002Fimagine.enpc.fr\u002F~petrovim\u002Ftemos)\n\n- (arXiv 2022.04) PyramidCLIP：用于**视觉-语言**模型预训练的层次化特征对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14095.pdf)\n\n- (arXiv 2022.04) 基于对称Transformer的网络，用于**无监督图像配准**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13575.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMingR-Ma\u002FSymTrans)\n\n- (arXiv 2022.04) 悲剧加时间：从弱标签视频中捕捉**非预期人类活动**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13548.pdf)，[[代码]](https:\u002F\u002Fasu-apg.github.io\u002FTragedyPlusTime)\n\n- (arXiv 2022.04) CapOnImage：基于上下文的图像密集**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12974.pdf)\n\n- (arXiv 2022.04) 用于语义**分割**的对象部件自监督学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13101.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMkuuWaUjinga\u002Fleopart)\n\n- (arXiv 2022.04) 
DearKD：面向视觉Transformer的早期、**数据高效**的知识蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12997.pdf)\n\n- (arXiv 2022.04) CATrans：用于少样本**分割**的上下文与亲和力Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12817.pdf)\n\n- (arXiv 2022.04) 自动驾驶汽车**转向角预测**：让Transformer再次成为一辆车，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12748.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchingisooinar\u002FAI)\n\n- (arXiv 2022.04) ClothFormer：在所有模块中驾驭视频**虚拟试穿**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12151.pdf)\n\n- (arXiv 2022.04) 关于ViT对常见噪声的**鲁棒性**的更深入见解，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12143.pdf)\n\n- (arXiv 2022.04) VITPOSE：用于人体**姿态估计**的简单视觉Transformer基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12484.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FViTPose)\n\n- (arXiv 2022.04) 理解视觉Transformer的**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12451.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFAN)\n\n- (arXiv 2022.04) MILES：通过注入语言语义进行视觉BERT预训练，用于**视频-文本检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12408.pdf)\n\n- (arXiv 2022.04) 面向**时序定位**的对比语言-动作预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12293.pdf)\n\n- (arXiv 2022.04) 提升**MLP**-Mixer的**对抗迁移性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12204.pdf)\n\n- (arXiv 2022.04) 自适应**分裂-融合**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.12196.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fszx503045266\u002FASF-former)\n\n- (arXiv 2022.04) 基础模型能否实现**机器人操作**的零样本任务指定？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11134.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fzestproject)\n\n- (arXiv 2022.04) RELVIT：面向**视觉关系推理**的概念引导型视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11167.pdf)\n\n- (arXiv 2022.04) VISTA：由U-Net和图像色彩度滤波增强的视觉Transformer，用于自动**零售结账**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11024.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fistiakshihab\u002Fautomated-retail-checkout-aicity22)\n\n- (arXiv 2022.04) **CLIP**-DISSECT：深度视觉网络中**神经元**表征的自动**描述**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10965.pdf)\n\n- (arXiv 2022.04) TEMOS：根据文本描述**生成**多样化的人体**运动**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14109.pdf)，[[项目]](https:\u002F\u002Fimagine.enpc.fr\u002F~petrovim\u002Ftemos)\n\n- (arXiv 2022.04) 基于多视角共同分割与聚类Transformer的无监督层次化语义分割，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11432.pdf)\n\n- (arXiv 2022.04) SwinFuse：用于红外与可见光图像融合的残差Swin Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11436.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZhishe-Wang\u002FSwinFuse)\n\n- (arXiv 2022.04) OCFormer：用于单类图像分类的Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11449.pdf)\n\n- (arXiv 2022.04) DRT：一种轻量级单幅图像去雨递归Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11385.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYC-Liang\u002FDRT)\n\n- (arXiv 2022.04) 超图Transformer：基于知识的视觉问答中的弱监督多跳推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10448.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyujungheo\u002Fkbvqa-public)\n\n- (arXiv 2022.04) ParkPredict+：基于CNN和Transformer的停车场内车辆多模态意图与运动预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10777.pdf)\n\n- (arXiv 2022.04) iCAR：在视觉识别中连接图像分类与图文对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10760.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fweiyx16\u002FiCAR)\n\n- (arXiv 2022.04) 
多样化实例发现：面向实例感知的多标签图像识别的Vision-Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10731.pdf)\n\n- (arXiv 2022.04) 基于空间性的Transformer用于点云上的3D密集描述生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10688.pdf)，[[代码]](https:\u002F\u002Fspacap3d.github.io\u002F)\n\n- (arXiv 2022.04) DFAM-DETR：基于可变形特征注意力机制的DETR用于细长物体检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10667.pdf)\n\n- (arXiv 2022.04) NFormer：基于邻居Transformer的鲁棒人体再识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09331.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhaochenheheda\u002FNFormer)\n\n- (arXiv 2022.04) 通过单帧标注从文本查询中进行视频瞬间检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09409.pdf)\n\n- (arXiv 2022.04) GIMO：上下文中基于注视的人体运动预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09443.pdf)\n\n- (arXiv 2022.04) VQGAN-CLIP：利用自然语言指导的开放域图像生成与编辑，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08583.pdf)\n\n- (arXiv 2022.04) 连续环境中视觉-语言导航的Sim-2-Sim迁移，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09667.pdf)\n\n- (arXiv 2022.04) 并非所有Token都相等：基于Token聚类Transformer的人本视觉分析，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08680.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzengwang430521\u002FTCFormer.git)\n\n- (arXiv 2022.04) 视觉Transformer的多模态Token融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08721.pdf)\n\n- (arXiv 2022.04) 自校准高效Transformer用于轻量级超分辨率，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08913.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAlexZou14\u002FSCET)\n\n- (arXiv 2022.04) 搜索视觉Transformer的内在维度，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07722.pdf)\n\n- (arXiv 2022.04) 通过分组变换实现面向视觉-语言任务的轻量化Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07780.pdf)\n\n- (arXiv 2022.04) 基于元学习的跨模态提示进行多模态少样本目标检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07841.pdf)\n\n- (arXiv 2022.04) 基于Transformer的多帧自监督深度估计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07616.pdf)，[[代码]](https:\u002F\u002Fsites.google.com\u002Ftri.global\u002Fdepthformer)\n\n- (arXiv 2022.04) MST++：用于高效光谱重建的多阶段频谱级Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07908.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcaiyuanhao1998\u002FMST-plus-plus)\n\n- (arXiv 2022.04) 面向多模态基于方面的情感分析的视觉-语言预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07955.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNUSTM\u002FVLP-MABSA)\n\n- (arXiv 2022.04) 一个可扩展、高效且有效的基于Transformer的目标检测器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07962.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnaver-ai\u002Fvidt)\n\n- (arXiv 2022.04) VDTR：基于Transformer的视频去模糊，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08023.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fljzycmd\u002FVDTR)\n\n- (arXiv 2022.04) BSRT：利用Swin Transformer和流引导的可变形对齐提升突发序列超分辨率，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08332.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAlgolzw\u002FBSRT)\n\n- (arXiv 2022.04) 面向视频实例分割的时间高效视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08412.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FTeViT)\n\n- (arXiv 2022.04) VSA：在视觉Transformer中学习变尺寸窗口注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08446.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FViTAE-Transformer\u002FViTAE-VSA)\n\n- (arXiv 2022.04) XDBERT：从跨模态系统中将视觉信息蒸馏至BERT以提升语言理解能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07316.pdf)\n\n- (arXiv 2022.04) 通过对比学习提升视觉对话中的跨模态理解，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07302.pdf)\n\n- (arXiv 
2022.04) MVSTER：用于高效多视图立体视觉的极线Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07346.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJeffWang987\u002FMVSTER)\n\n- (arXiv 2022.04) 使用多模态交叉量化器进行无条件的图像-文本对生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07537.pdf)\n\n- (arXiv 2022.04) 推动简单流水线在少样本学习中的极限：外部数据与微调确实有显著效果，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07305.pdf)\n\n- (arXiv 2022.04) COTS：用于跨模态检索的协作式双流视觉-语言预训练模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07441.pdf)\n\n- (arXiv 2022.04) Transformer时代下的图像字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07374.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSjokerLily\u002Fawesome-image-captioning)\n\n- (arXiv 2022.04) ResT V2：更简单、更快、更强，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07366.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwofmanaf\u002FResT)\n\n- (arXiv 2022.04) 基于对称CNN和递归Transformer的轻量级双模态网络，用于单幅图像超分辨率，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13286.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIVIPLab\u002FLBNet)\n\n- (arXiv 2022.04) 用于早期动作预测的时间渐进式注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13340.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Falexandrosstergiou\u002Fprogressive-action-prediction)\n\n- (arXiv 2022.04) 保留图文信息：防止对比学习中的捷径效应——用于图文检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.13382.pdf)\n\n- (arXiv 2022.04) Flamingo：一种用于少样本学习的**视觉语言**模型，[[论文]](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FDeepMind.com\u002FBlog\u002Ftackling-multiple-tasks-with-a-single-visual-language-model\u002Fflamingo.pdf)\n\n- (arXiv 2022.04) RELVIT：面向**视觉关系推理**的概念引导型视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11167.pdf)\n\n- (arXiv 2022.04) 基于骨骼图拉普拉斯算子和自监督视角不变性的**无监督**人体**动作**识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10312.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIIT-PAVIS\u002FUHAR_Skeletal_Laplacian)\n\n- (arXiv 2022.04) 利用时空检测Transformer学习**未来物体预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10321.pdf)\n\n- (arXiv 2022.04) R^2-Trans：通过冗余减少实现**细粒度视觉分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10095.pdf)，[[代码]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FR-2-Trans)\n\n- (arXiv 2022.04) 用于**立体视频超分辨率**的新数据集及Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10039.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FH-deep\u002FTrans-SVSR\u002F)\n\n- (arXiv 2022.04) Transformer引导的卷积神经网络用于跨视角**地理定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09967.pdf)\n\n- (arXiv 2022.04) 基于多尺度特征与并行Transformer的**图像质量评估**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09779.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKomalPal9610\u002FIQA)\n\n- (arXiv 2022.04) BTranspose：采用自监督预训练的瓶颈Transformer用于**人体姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.10209.pdf)\n\n- (arXiv 2022.04) 通过解耦Transformer进行**人-物体交互检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.09290.pdf)\n\n- (arXiv 2022.04) ELEVATER：用于评估**语言增强型视觉模型**的**基准测试**与**工具包**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08790.pdf)\n\n- (arXiv 2022.04) **人-物体交互**中的交互性场，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07718.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FForuck\u002FInteractiveness-Field)\n\n- (arXiv 2022.04) DeiT III：**ViT**的复仇，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07118.pdf)\n\n- (arXiv 2022.04) 残差Swin Transformer通道注意力网络用于**图像去马赛克**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07098.pdf)\n\n- (arXiv 2022.04) 
邻域**注意力**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07143.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FNeighborhood-Attention-Transformer)\n\n- (arXiv 2022.04) MiniViT：利用权重复用压缩视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07154.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FCream)\n\n- (arXiv 2022.04) ViTOL：用于**弱监督目标定位**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06772.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSaurav-31\u002FViTOL)\n\n- (arXiv 2022.04) 在语言条件下的机器人**模仿学习**中，什么才是关键？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06252.pdf)，[[代码]](http:\u002F\u002Fhulc.cs.uni-freiburg.de\u002F)\n\n- (arXiv 2022.04) 一致性驱动的序列Transformer注意力模型用于**部分可观测场景**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00656.pdf)\n\n- (arXiv 2022.04) ReCLIP：**指代表达理解**的强大零样本基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05991.pdf)\n\n- (arXiv 2022.04) **多模态**Transformer对缺失模态是否**鲁棒**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05454.pdf)\n\n- (arXiv 2022.04) TopFormer：用于移动端**语义分割**的Token金字塔Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05525.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FTopFormer)\n\n- (arXiv 2022.04) X-DETR：一种适用于实例级**视觉-语言**任务的通用架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05626.pdf)\n\n- (arXiv 2022.04) **事件**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05172.pdf)\n\n- (arXiv 2022.04) 评估基于像素的**深度强化学习**中视觉Transformer方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04905.pdf)\n\n- (arXiv 2022.04) ManiTrans：通过Token级语义对齐与生成实现实体级**文本引导的图像编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04428.pdf)，[[代码]](https:\u002F\u002Fjawang19.github.io\u002Fmanitrans)\n\n- (arXiv 2022.04) 用于护理**活动识别**的多模态Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04564.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMomilijaz96\u002FMMT_for_NCRC)\n\n- (arXiv 2022.04) 基于渐进式自蒸馏的**鲁棒**跨模态表征学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04588.pdf)\n\n- (arXiv 2022.04) Stripformer：用于快速图像**去模糊**的条带Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04627.pdf)\n\n- (arXiv 2022.04) 不留任何Token：基于**可解释性**的图像分类与生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04908.pdf)\n\n- (arXiv 2022.04) Fashionformer：一种简单、有效且统一的**人体时尚分割与识别**基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04654.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxushilin1\u002FFashionFormer)\n\n- (arXiv 2022.04) Panoptic-PartFormer：学习用于**全景部件分割**的统一模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04655.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FPanoptic-PartFormer)\n\n- (arXiv 2022.04) DILEMMA：利用Transformer进行自监督的**形状与纹理**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04788.pdf)\n\n- (arXiv 2022.04) 学习轨迹感知Transformer用于**视频超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04216.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FTTVSR.git)\n\n- (arXiv 2022.04) 学习如何推断**因果**结构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04875.pdf)\n\n- (arXiv 2022.04) 通过解码路径增强进行Transformer的一致性学习，用于**人-物体交互检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04836.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmlvlab\u002FCPChoi)\n\n- (arXiv 2022.04) 类别感知Transformer网络用于提升**人-物体交互检测**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04911.pdf)\n\n- (arXiv 2022.04) 
ImageNet上的**鲁棒性**能否迁移到下游任务？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03934.pdf)\n\n- (arXiv 2022.04) POSTER：用于**面部表情识别**的金字塔交叉融合Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04083.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzczcwh\u002FPOSTER)\n\n- (arXiv 2022.04) 视觉Transformer用于单张图像**去雾**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03883.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIDKiro\u002FDehazeFormer)\n\n- (arXiv 2022.04) 基于预训练Transformer的水下**图像增强**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04199.pdf)\n\n- (arXiv 2022.04) Event Transformer：一种稀疏感知的高效**事件数据处理**方案，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03355.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAlbertoSabater\u002FEventTransformer)\n\n- (arXiv 2022.04) PSTR：基于Transformer的端到端一步式**人物搜索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03340.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJialeCao001\u002FPSTR)\n\n- (arXiv 2022.04) 在无需进一步训练的情况下将**CLIP**适配用于**短语定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03647.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpals-ttic\u002Fadapting-CLIP)\n\n- (arXiv 2022.04) FineDiving：用于程序感知的细粒度**动作质量评估**的**数据集**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03646.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002Fxujinglin\u002FFineDiving)\n\n- (arXiv 2022.04) DaViT：双**注意力**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03645.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdingmyu\u002Fdavit)\n\n- (arXiv 2022.04) 针对**视觉-语言**模型的无监督提示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03649.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ftonyhuang2022\u002FUPL)\n\n- (arXiv 2022.04) 使用时间无关的VQGAN和时间敏感的Transformer进行**长视频生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03638.pdf)，[[项目]](https:\u002F\u002Fsongweige.github.io\u002Fprojects\u002Ftats\u002Findex.html)\n\n- (arXiv 2022.04) 在**图像-文本-标签**空间中的统一对比学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03610.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FUniCL)\n\n- (arXiv 2022.04) HunYuan_tvr用于**文本-视频**检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03382.pdf)\n\n- (arXiv 2022.04) 学习为**组合式零样本学习**构建软提示，[[论文]]()\n\n- (arXiv 2022.04) 通过**视觉与语言**知识蒸馏实现端到端零样本**HOI**检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.03541.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmrwu-mac\u002FEoID)\n\n- (arXiv 2022.04) 用于长时视频的**时间对齐**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02968.pdf)，[[代码]](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Ftan\u002F)\n\n- (arXiv 2022.04) 利用掩码图像建模释放纯视觉Transformer在**目标检测**中的潜力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02964.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FMIMDet)\n\n- (arXiv 2022.04) MixFormer：跨窗口和维度的**特征混合**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02557.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleClas)\n\n- (arXiv 2022.04) CM3：一个**因果**掩码的**多模态**互联网模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07520.pdf)\n\n- (arXiv 2022.04) “照我所能做，而非照我说的做”：将语言锚定（grounding）到**机器人**可供性中，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01691.pdf)，[[项目]](https:\u002F\u002Fsay-can.github.io\u002F)\n\n- (arXiv 2022.04) TransGeo：对于**跨视角图像地理定位**，Transformer 就足够了，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00097.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJeff-Zilence\u002FTransGeo2022)\n\n- (arXiv 2022.04) 
苏格拉底模型：利用语言构建**零样本多模态推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00598.pdf)，[[项目]](https:\u002F\u002Fsocraticmodels.github.io\u002F)\n\n- (arXiv 2022.04) 带有时间移位交叉注意力的视觉Transformer，用于高效的**动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00452.pdf)\n\n- (arXiv 2022.04) 从图像字幕中学习**音频-视频**模态，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00679.pdf)\n\n- (arXiv 2022.04) 通过重新审视**高频成分**来改进视觉Transformer，[[论文]]()\n\n- (arXiv 2022.04) POS-BERT：**点云**单阶段BERT预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00989.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffukexue\u002FPOS-BERT)\n\n- (arXiv 2022.04) BinsFormer：重新审视用于**单目深度估计**的自适应分箱方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00987.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhyever\u002FMonocular-Depth-Estimation-Toolbox)\n\n- (arXiv 2022.04) BatchFormerV2：探索样本关系以进行**密集表征**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01254.pdf)\n\n- (arXiv 2022.04) TransRAC：使用Transformer编码多尺度时间相关性，用于**重复动作计数**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01018.pdf)\n\n- (arXiv 2022.04) 使用状态空间**视频**模型对**长**电影片段进行分类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01692.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FViS4mer)\n\n- (arXiv 2022.04) TALLFormer：带有长记忆Transformer的**时间动作定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01680.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fklauscc\u002FTALLFormer)\n\n- (arXiv 2022.04) Multi**MAE**：多模态多任务掩码自编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01678.pdf)，[[项目]](https:\u002F\u002Fmultimae.epfl.ch\u002F)\n\n- (arXiv 2022.04) “这是我的独角兽，Fluffy”：个性化冻结的**视觉-语言**表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01694.pdf)\n\n- (arXiv 2022.04) SE(3)-等变注意力网络，用于函数空间中的**形状重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02394.pdf)\n\n- (arXiv 2022.04) 多视角Transformer用于**3D视觉接地**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02174.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsega-hsj\u002FMVT-3DVG)\n\n- (arXiv 2022.04) 在**面部表情识别**任务上配备神经重缩放器的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02181.pdf)\n\n- (arXiv 2022.04) Dual-AI：用于**群体活动识别**的双路径演员交互学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02148.pdf)，[[项目]](https:\u002F\u002Fmingfei.info\u002FDual-AI\u002F)\n\n- (arXiv 2022.04) 无检测器的弱监督**群体活动识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02139.pdf)\n\n- (arXiv 2022.04) 从第一人称视频中联合预测**手部运动和交互热点**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01696.pdf)，[[项目]](https:\u002F\u002Fstevenlsw.github.io\u002Fhoi-forecast)\n\n- (arXiv 2022.04) 看什么、看哪里：用于检测**人-物体交互**的语义和空间细化Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00746.pdf)\n\n- (arXiv 2022.04) MaxViT：**多轴**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01697.pdf)\n\n\n\n### 2022年3月\n\n- (arXiv 2022.03) 一款面向2020年代的**卷积神经网络**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03545.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FConvNeXt)\n\n- (arXiv 2022.03) DeepNet：将Transformer扩展至**1,000层**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00555.pdf)\n\n- (arXiv 2022.03) 用于**臂-手动态估计**的空间-时间并行Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16202.pdf)\n\n- (arXiv 2022.03) ViSTA：用于跨模态检索的**视觉**与**场景文本**聚合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16778.pdf)\n\n- (arXiv 2022.03) 
ReSTR：基于Transformer的无卷积引用图像分割，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16768.pdf)，[[项目]](http:\u002F\u002Fcvlab.postech.ac.kr\u002Fresearch\u002Frestr\u002F)\n\n- (arXiv 2022.03) CREATE：中文短视频**检索**与**标题生成**的**基准测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16763.pdf)\n\n- (arXiv 2022.03) **可变形** **视频** Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16795.pdf)\n\n- (arXiv 2022.03) 基于占用栅格地图的端到端**轨迹**分布预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16910.pdf)\n\n- (arXiv 2022.03) CRAFT：用于鲁棒**光流**估计的交叉注意力流Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16896.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Faskerlee\u002Fcraft)\n\n- (arXiv 2022.03) VL-InterpreT：解释**视觉-语言**Transformer的交互式**可视化**工具，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17247.pdf)，[[应用]](http:\u002F\u002Fvlinterpretenv4env-env.eba-vmhhefup.us-east-2.elasticbeanstalk.com\u002F)\n\n- (arXiv 2022.03) TransEditor：基于Transformer的双空间**GAN**，用于高度可控的**面部编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17266.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBillyXYB\u002FTransEditor)\n\n- (arXiv 2022.03) BEVFormer：通过时空Transformer从多摄像头图像中学习**鸟瞰图**表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17270.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhiqi-li\u002FBEVFormer)\n\n- (arXiv 2022.03) **视觉提示**：通过修改像素空间来适配预训练模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17274.pdf)，[[代码]](https:\u002F\u002Fhjbahng.github.io\u002Fvisual_prompting\u002F)\n\n- (arXiv 2022.03) 让老**电影**重焕生机，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17276.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fraywzy\u002FBringing-Old-Films-Back-to-Life)\n\n- (arXiv 2022.03) 使用视觉-语言模型学习用于**开放词汇目标检测**的提示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14940.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdyabel\u002Fdetpro)\n\n- (arXiv 2022.03) SeqTR：一种简单而通用的**视觉定位**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16265.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsean-zhuh\u002FSeqTR)\n\n- (arXiv 2022.03) InstaFormer：基于Transformer的实例感知**图像到图像翻译**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16248.pdf)\n\n- (arXiv 2022.03) Omni-DETR：基于Transformer的**全监督**目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16089.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Famazon-research\u002Fomni-detr)\n\n- (arXiv 2022.03) 学习食物图像和烹饪食谱的**程序表示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16071.pdf)，[[项目]](http:\u002F\u002Fcookingprograms.csail.mit.edu\u002F)\n\n- (arXiv 2022.03) ITTR：基于Transformer的**无配对图像到图像翻译**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16015.pdf)\n\n- (arXiv 2022.03) VPTR：用于**视频预测**的**高效**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15836.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FXiYe20\u002FVPTR)\n\n- (arXiv 2022.03) 视觉Transformer的参数**高效**微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16329.pdf)\n\n- (arXiv 2022.03) TubeDETR：基于Transformer的时空视频**定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16434.pdf)，[[代码]](https:\u002F\u002Fantoyang.github.io\u002Ftubedetr.html)\n\n- (arXiv 2022.03) 探索用于目标**检测**的纯视觉Transformer骨干网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16527.pdf)\n\n- (arXiv 2022.03) PROMPTDET：用未筛选的图像扩展你的**检测器**词汇量，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16513.pdf)，[[代码]](https:\u002F\u002Ffcjian.github.io\u002Fpromptdet)\n\n- (arXiv 2022.03) 
全交叉Transformer支持的**少样本**目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15021.pdf)\n\n- (arXiv 2022.03) 用于目标**跟踪**的统一Transformer跟踪器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15175.pdf)\n\n- (arXiv 2022.03) X-Pool：用于文本-视频检索的跨模态**语言-视频**注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15086.pdf)，[[代码]](https:\u002F\u002Flayer6ai-labs.github.io\u002Fxpool\u002F)\n\n- (arXiv 2022.03) 使用可学习**记忆**微调图像Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15243.pdf)\n\n- (arXiv 2022.03) MAT：用于大孔洞图像**修复**的掩码感知Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15270.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffenglinglwb\u002FMAT)\n\n- (arXiv 2022.03) mc-BEiT：用于**图像BERT**预训练的多选离散化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15371.pdf)\n\n- (arXiv 2022.03) 基于Transformer的端到端图像**字幕生成**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15350.pdf)\n\n- (arXiv 2022.03) 用于**零样本学习**的混合路由Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15310.pdf)\n\n- (arXiv 2022.03) 用于**噪声图像分类**的治疗学习Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15529.pdf)\n\n- (arXiv 2022.03) **视觉-语言**预训练模型是否学习了**原始概念**？，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.17271.pdf)\n\n- (arXiv 2022.03) Transformer惯性姿态解算器：基于注意力的实时**人体运动重建**，使用稀疏**IMU**传感器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15720.pdf)\n\n- (arXiv 2022.03) SepViT：**可分离**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15380.pdf)\n\n- (arXiv 2022.03) MatteFormer：基于Transformer的先验标记进行**图像抠像**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.15662.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwebtoon\u002Fmatteformer)\n\n- (arXiv 2022.03) 用于**语义图像分割**的特征选择Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14124.pdf)\n\n- (arXiv 2022.03) 桥接提示：迈向教学视频中的**序数动作理解**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14104.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fttlmh\u002FBridge-Prompt)\n\n- (arXiv 2022.03) RSTT：用于时空**视频超分辨率**的实时时空Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14186.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fllmpass\u002FRSTT)\n\n- (arXiv 2022.03) 单流多级对齐用于**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14395.pdf)\n\n- (arXiv 2022.03) 超越掩码：揭秘视觉Transformer的**基于标记的预训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14313.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsunsmarterjie\u002Fbeyond_masking)\n\n- (arXiv 2022.03) 用于**情境感知识别**的协作Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.16518.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjhcho99\u002FCoFormer)\n\n- (arXiv 2022.03) 用于目标导向**导航**的对象记忆Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14708.pdf)\n\n- (arXiv 2022.03) 受大脑启发的基于**脉冲神经元**的**多层感知机**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14679.pdf)，[[代码]](https:\u002F\u002Fgitee.com\u002Fmindspore\u002Fmodels\u002Ftree\u002Fmaster\u002Fresearch\u002Fcv\u002Fsnn_mlp)\n\n- (arXiv 2022.03) HandOccNet：鲁棒遮挡的**三维手部网格估计**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14564.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnamepllet\u002FHandOccNet)\n\n- (arXiv 2022.03) REGTR：基于Transformer的端到端**点云对应关系**提取，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14517.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyewzijian\u002FRegTR)\n\n- (arXiv 2022.03) 面向视觉Transformer**高效训练**的自动化渐进式学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14509.pdf)\n\n- (arXiv 2022.03) 
用于三维**点云分割**的分层Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14508.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FStratified-Transformer)\n\n- (arXiv 2022.03) NOC-REK：利用外部知识库检索词汇进行的新物体**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14499.pdf)\n\n- (arXiv 2022.03) 基于Swin Transformer的**面部表情识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13472.pdf)\n\n- (arXiv 2022.03) 给我你的注意力：点积注意力被认为不利于对抗性补丁的**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13639.pdf)\n\n- (arXiv 2022.03) 基于层次化交叉注意力Transformer的高效视觉**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13537.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchenxin-dlut\u002FHCAT)\n\n- (arXiv 2022.03) 高性能Transformer**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13533.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchenxin-dlut\u002FTransT-M)\n\n- (arXiv 2022.03) RayTran：利用光线追踪Transformer从视频中实现多物体的**3D姿态估计**和**形状重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13296.pdf)\n\n- (arXiv 2022.03) 基于Transformer的多模态多标签**面部动作单元检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13301.pdf)\n\n- (arXiv 2022.03) MonoDETR：单目**3D**目标**检测**中的深度感知Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13310.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FMonoDETR.git)\n\n- (arXiv 2022.03) 无3D监督的**文本到网格**生成——基于极限细分方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13333.pdf)，[[项目]](https:\u002F\u002Fwww.nasir.lol\u002Fclipmesh)\n\n- (arXiv 2022.03) GEN-VLKT：简化关联并增强交互理解以提升**HOI检测**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13954.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYueLiao\u002Fgen-vlkt)\n\n- (arXiv 2022.03) CrossFormer：用于**3D人体姿态估计**的跨时空Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13387.pdf)\n\n- (arXiv 2022.03) FitCLIP：针对零样本视频理解任务，微调大规模预训练的**图像-文本**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13371.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbryant1410\u002F)\n\n- (arXiv 2022.03) 基于结构化剪枝与低秩近似的视觉Transformer**压缩**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13444.pdf)\n\n- (arXiv 2022.03) 基于多头融合Transformer的多模态学习用于**AU检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11441.pdf)\n\n- (arXiv 2022.03) MSTR：用于端到端**人-物交互检测**的多尺度Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.14709.pdf)\n\n- (arXiv 2022.03) 在视觉Transformer中学习补丁到聚类的**注意力机制**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11987.pdf)\n\n- (arXiv 2022.03) 视觉**提示调优**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12119.pdf)\n\n- (arXiv 2022.03) 无需训练的Transformer**架构搜索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12217.pdf)\n\n- (arXiv 2022.03) VideoMAE：掩码自编码器是数据高效的**自监督视频预训练**学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12602.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FVideoMAE)\n\n- (arXiv 2022.03) METAMORPH：使用Transformer学习**通用控制器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11931.pdf)，[[项目]](https:\u002F\u002Fmetamorph-iclr.github.io\u002Fsite\u002F)\n\n- (arXiv 2022.03) 提示数组驱散偏见：通过对抗性学习对**视觉-语言**模型进行**去偏**处理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11933.pdf)\n\n- (arXiv 2022.03) 使用自然**语言**指令重塑**机器人轨迹**：基于Transformer的多模态数据对齐研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13411.pdf)，[[项目]](https:\u002F\u002Farthurfenderbucker.github.io\u002FNL_trajectory_reshaper\u002F)\n\n- (arXiv 2022.03) 
利用可扩展Transformer将物体关联起来以实现**视频对象分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11442.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FAOT)\n\n- (arXiv 2022.03) HOP：面向**视觉-语言导航**的历史与顺序感知预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11591.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYanyuanQiao\u002FHOP-VLN)\n\n- (arXiv 2022.03) 学习如何**生成传达几何与语义信息的线稿**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12691.pdf)，[[项目]](https:\u002F\u002Fcarolineec.github.io\u002Finformative_drawings\u002F)\n\n- (arXiv 2022.03) UMT：用于联合**视频时刻检索**和**精彩片段检测**的统一多模态Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12745.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FUMT)\n\n- (arXiv 2022.03) AIMusicGuru：音乐辅助的**人体姿态矫正**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12829.pdf)\n\n- (arXiv 2022.03) 该对学生隐瞒什么：基于注意力的**掩码图像建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12719.pdf)\n\n- (arXiv 2022.03) 基于双重可瘦身Transformer，迈向高效且弹性的**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12814.pdf)\n\n- (arXiv 2022.03) ViT-FOD：一种基于视觉Transformer的**细粒度物体鉴别器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12816.pdf)\n\n- (arXiv 2022.03) 通过Transformer网络进行**关键点跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12848.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLexaNagiBator228\u002FKeypoints-Tracking-via-Transformer-Networks\u002F)\n\n- (arXiv 2022.03) 超越固定窗口：**动态窗口**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12856.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpzhren\u002FDW-ViT)\n\n- (arXiv 2022.03) Make-A-Scene：基于场景、结合人类先验知识的**文本到图像**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13131.pdf)\n\n- (arXiv 2022.03) 自监督的以视频为中心的Transformer用于**视频人脸聚类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13166.pdf)\n\n- (arXiv 2022.03) 向无范例的视觉Transformer**持续学习**迈进：关于注意力、功能性和权重正则化的探讨，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13167.pdf)\n\n- (arXiv 2022.03) 全局**跟踪**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13250.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxingyizhou\u002FGTR)\n\n- (arXiv 2022.03) 基于多尺度时空分割注意力Transformer的**视频实例分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13253.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOmkarThawakar\u002FMSSTS-VIS)\n\n- (arXiv 2022.03) QS-Craft：学习量化、拼字游戏与工艺以实现**条件化的人体运动动画**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11632.pdf)\n\n- (arXiv 2022.03) 寻找变化：从未剪辑的网络视频中学习**物体状态**和**改变状态的动作**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11637.pdf)，[[项目]](https:\u002F\u002Fdata.ciirc.cvut.cz\u002Fpublic\u002Fprojects\u002F2022LookForTheChange\u002F)\n\n- (arXiv 2022.03) GradViT：视觉Transformer的**梯度反演**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11894.pdf)，[[代码]](https:\u002F\u002Fgradvit.github.io\u002F)\n\n- (arXiv 2022.03) 利用迁移学习和数据增强的视觉Transformer进行**口罩使用识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11542.pdf)\n\n- (arXiv 2022.03) Transformer网络在**轨迹预测**中的内部机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11878.pdf)\n\n- (arXiv 2022.03) 带有条件匹配的**开放词汇DETR**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11876.pdf)\n\n- (arXiv 2022.03) 用于基于ViT的**持续学习**的元注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11684.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzju-vipa\u002FMEAT-TIL)\n\n- (arXiv 2022.03) 
CNN与Transformer对**混合图像**的感知与人类相似，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11678.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Faliborji\u002Fhybrid_images.git)\n\n- (arXiv 2022.03) Bailando：带有编舞记忆的演员-评论家GPT实现的**3D舞蹈生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13055.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flisiyao21\u002FBailando\u002F)\n\n- (arXiv 2022.03) 面向多模态**文本与图像**数据的情感反馈合成，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.12692.pdf)\n\n- (arXiv 2022.03) ViewFormer：利用Transformer从少量图像实现无NeRF的**神经渲染**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10157.pdf)\n\n- (arXiv 2022.03) 轮子上的**CLIP**：零样本目标**导航**即目标定位与探索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10421.pdf)\n\n- (arXiv 2022.03) 体素集合Transformer：一种基于集合到集合的方法用于从点云中进行**3D目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10314.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fskyhehe123\u002FVoxSeT)\n\n- (arXiv 2022.03) HIPA：用于单张图像**超分辨率**的层次化补丁Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10247.pdf)\n\n- (arXiv 2022.03) DirecFormer：一种基于定向注意力的Transformer方法，用于**鲁棒动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10233.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fuark-cviu\u002FDirecFormer)\n\n- (arXiv 2022.03) MixFormer：采用迭代混合注意力的端到端**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11082.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FMixFormer)\n\n- (arXiv 2022.03) PersFormer：通过透视Transformer和OpenLane基准测试实现**3D车道线检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11089.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOpenPerceptionX\u002FOpenLane)\n\n- (arXiv 2022.03) Relationformer：一个用于**图像到图**生成的统一框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10202.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsuprosanna\u002Frelationformer)\n\n- (arXiv 2022.03) **CLIP**遇上GamePhysics：利用零样本迁移学习，在游戏视频中实现**漏洞识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11096.pdf)，[[代码]](https:\u002F\u002Fasgaardlab.github.io\u002FCLIPxGamePhysics\u002F)\n\n- (arXiv 2022.03) **双曲**视觉Transformer：结合度量学习的改进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10833.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhtdt\u002Fhyp_metric)\n\n- (arXiv 2022.03) MonoDTR：具有深度感知的Transformer实现的单目**3D目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10981.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkuanchihhuang\u002FMonoDTR)\n\n- (arXiv 2022.03) 基于Transformer的**HTR**用于历史文献，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.11008.pdf)\n\n- (arXiv 2022.03) simCrossTrans：一种简单的**跨模态**迁移学习方法，用于使用ConvNet或视觉Transformer进行**目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10456.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fliketheflower\u002FsimCrossTrans)\n\n- (arXiv 2022.03) 使用Transformer实现端到端的**人眼注视目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10433.pdf)\n\n- (arXiv 2022.03) 使用Transformer实现端到端的**视频文本定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10539.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FTransDETR)\n\n- (arXiv 2022.03) 带有层次化**视觉-语言**知识蒸馏的开放词汇单阶段**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10593.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FmengqiDyangge\u002FHierKD)\n\n- (arXiv 2022.03) V2X-ViT：利用视觉Transformer实现**车辆**到万物的协同感知，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10638.pdf)\n\n- (arXiv 2022.03) LocATe：使用Transformer在3D空间中实现端到端的**动作定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10719.pdf)\n\n- 
(arXiv 2022.03) AnoViT：基于视觉Transformer的编码器-解码器实现**无监督异常检测与定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10808.pdf)\n\n- (arXiv 2022.03) ViM：通过虚拟logit匹配实现**分布外检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10807.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhaoqiwang\u002Fvim)\n\n- (arXiv 2022.03) ScalableViT：重新思考视觉Transformer的面向上下文的**泛化能力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10790.pdf)\n\n- (arXiv 2022.03) Iwin：通过带有不规则窗口的Transformer实现**人-物交互检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.10537.pdf)\n\n- (arXiv 2022.03) 带有卷积的视觉Transformer**架构搜索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.10435.pdf)\n\n- (arXiv 2022.03) 用于端到端**人员搜索**的级联Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09642.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKitware\u002FCOAT)\n\n- (arXiv 2022.03) CodedVTR：基于码本的稀疏**体素**Transformer，带有几何引导，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09887.pdf)\n\n- (arXiv 2022.03) MatchFormer：用于**特征匹配**的Transformer交错注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09645.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FMatchFormer)\n\n- (arXiv 2022.03) 用于**语言引导视频分割**的局部-全局上下文感知Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09773.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fleonnnop\u002FLocater)\n\n- (arXiv 2022.03) 每个人都应了解的关于视觉Transformer的**三件事**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09795.pdf)\n\n- (arXiv 2022.03) 视觉Transformer对虚假相关性是否**鲁棒**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09125.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fdeeplearning-wisc\u002Fvit-spurious-robustness)\n\n- (arXiv 2022.03) 用于**跨视图地理定位**的互生式Transformer学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09135.pdf)\n\n- (arXiv 2022.03) DU-VLG：通过双序列到序列预训练统一**视觉-语言**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09052.pdf)\n\n- (arXiv 2022.03) 语义对齐融合Transformer用于**单样本**目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09093.pdf)\n\n- (arXiv 2022.03) UNIMO-2：端到端统一的**视觉-语言**基础学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09067.pdf), [[代码]](https:\u002F\u002Funimo-ptm.github.io\u002F)\n\n- (arXiv 2022.03) Transformer中的属性替代学习与谱令牌池化用于**小样本学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09064.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FStomachCold\u002FHCTransformers)\n\n- (arXiv 2022.03) 仅用一个**CLIP**实现**GAN**的单样本适应，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09301.pdf)\n\n- (arXiv 2022.03) PanoFormer：用于室内360°**深度估计**的全景Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09283.pdf)\n\n- (arXiv 2022.03) PreTR：时空非自回归**轨迹预测**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09293.pdf)\n\n- (arXiv 2022.03) 走出房间之外：从单张图像**合成**一致的长期**3D场景视频**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09457.pdf), [[代码]](https:\u002F\u002Fxrenaa.github.io\u002Flook-outside-room\u002F)\n\n- (arXiv 2022.03) Transframer：利用生成模型进行任意**帧预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09494.pdf)\n\n- (arXiv 2022.03) 向数据**高效**的目标检测Transformer迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09507.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fencounter1997\u002FDE-DETRs)\n\n- (arXiv 2022.03) 用于**显著性**排序的双向目标-上下文优先级学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.09416.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FGrassBro\u002FOCOR)\n\n- (arXiv 2022.03) 
PATCH-FOOL：视觉Transformer是否始终对**对抗性**扰动具有鲁棒性？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08392.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FRICE-EIC\u002FPatch-Fool)\n\n- (arXiv 2022.03) WegFormer：用于弱监督**语义分割**的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08421.pdf)\n\n- (arXiv 2022.03) 使用带有额外检测头的视觉Transformer进行**开放集识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08441.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Ffeiyang-cai\u002Fosr_vit.git)\n\n- (arXiv 2022.03) 统一的视觉Transformer**压缩**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08243.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FUVC)\n\n- (arXiv 2022.03) 向基于视觉Transformer的实用**可认证补丁防御**迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08519.pdf)\n\n- (arXiv 2022.03) EDTER：使用Transformer进行**边缘检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08566.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FMengyangPu\u002FEDTER)\n\n- (arXiv 2022.03) ActFormer：面向通用动作条件下的**3D人体运动生成**的GAN Transformer框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07706.pdf)\n\n- (arXiv 2022.03) 用于**超分辨率**的丰富CNN-Transformer特征聚合网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07682.pdf)\n\n- (arXiv 2022.03) 重振区域特征以推动**视频-语言**预训练的民主化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07720.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FCuthbertCai\u002FDemoVLP)\n\n- (arXiv 2022.03) 用于**密集场景理解**的倒金字塔多任务Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07997.pdf)\n\n- (arXiv 2022.03) 平滑很重要：用于领域自适应**语义分割**的动量Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07988.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Falpc91\u002FTransDA)\n\n- (arXiv 2022.03) 用于**图像反演**和**编辑**的风格Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07932.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fsapphire497\u002Fstyle-transformer)\n\n- (arXiv 2022.03) MotionCLIP：将人类**运动生成**暴露于**CLIP**空间，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08063.pdf), [[项目]](https:\u002F\u002Fguytevet.github.io\u002Fmotionclip-page\u002F)\n\n- (arXiv 2022.03) **多样性**原则：训练更强的视觉Transformer需要减少各层次的**冗余**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06345.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FDiverse-ViT)\n\n- (arXiv 2022.03) 通过视觉-语言知识蒸馏在CLIP上实现**多模态生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06386.pdf)\n\n- (arXiv 2022.03) 用于鲁棒**人脸对齐**和**关键点内在关系**学习的稀疏局部补丁Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06541.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FJiahao-UTS\u002FSLPT-master)\n\n- (arXiv 2022.03) 通过弱监督学习联合CNN和Transformer网络以实现高效的**人群计数**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06388.pdf)\n\n- (arXiv 2022.03) DFTR：用于**显著性目标检测**的深度监督分层特征融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06429.pdf)\n\n- (arXiv 2022.03) DATR：用于**多领域地标检测**的领域自适应Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06433.pdf)\n\n- (arXiv 2022.03) EventFormer：用于**面部动作**单元事件检测的AU事件Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06355.pdf)\n\n- (arXiv 2022.03) 通过语义对齐匹配加速**DETR**的**收敛**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06883.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FZhangGongjie\u002FSAM-DETR)\n\n- (arXiv 2022.03) 一体化：探索统一的**视频-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07303.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fshowlab\u002Fall-in-one)\n\n- (arXiv 2022.03) 
CLIP模型是**小样本**学习者：关于VQA和视觉蕴含的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07190.pdf)\n\n- (arXiv 2022.03) EIT：将归纳偏置**高效**地引入ViT，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07116.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FMrHaiPi\u002FEIT)\n\n- (arXiv 2022.03) 面向**小样本**Transformer的自我促进监督，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07057.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FDongSky\u002Ffew-shot-vit)\n\n- (arXiv 2022.03) MDMMT-2：用于**视频检索**的多领域多模态Transformer，向着通用化又迈进了一步，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07086.pdf)\n\n- (arXiv 2022.03) 用于**文本-视频**检索的解耦表示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07111.pdf)\n\n- (arXiv 2022.03) TransCAM：基于Transformer注意力机制的CAM精炼方法，用于**弱监督语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07239.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fliruiwen\u002FTransCAM)\n\n- (arXiv 2022.03) 电影叙事概要：用于故事理解的**视频-语言数据集**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05711.pdf)，[[数据集]](https:\u002F\u002Fgithub.com\u002Finsundaycathy\u002FSYMON)\n\n- (arXiv 2022.03) 可视化与理解视觉Transformer中的**补丁交互**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05922.pdf)\n\n- (arXiv 2022.03) 通过傅里叶域分析在深度视觉Transformer中**防止过度平滑**：从理论到实践，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05962.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FViT-Anti-Oversmoothing)\n\n- (arXiv 2022.03) 民主化对比型**语言-图像**预训练：关于数据、模型和监督的CLIP**基准测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05796.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-GVT\u002FDeCLIP)\n\n- (arXiv 2022.03) ActiveMLP：一种类似**MLP**且具有主动标记混合器的架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06108.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FActiveMLP)\n\n- (arXiv 2022.03) 基于Transformer的视频语义嵌入实现**零样本动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05156.pdf)\n\n- (arXiv 2022.03) TrueType Transformer：轮廓格式下的**字符与字体风格识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05338.pdf)\n\n- (arXiv 2022.03) LOOPITR：结合双编码器与交叉编码器架构进行**图像-文本**检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05465.pdf)\n\n- (arXiv 2022.03) MVP：**多模态**引导的视觉预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05175.pdf)\n\n- (arXiv 2022.03) DEER：不依赖检测的端到端**场景文本定位**识别器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05122.pdf)\n\n- (arXiv 2022.03) **多模态**Mixup用于**鲁棒**微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03897.pdf)\n\n- (arXiv 2022.03) AssistQ：以可供性为中心、由问题驱动的任务完成系统，适用于**第一人称助手**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04203.pdf)，[[项目]](https:\u002F\u002Fshowlab.github.io\u002Fassistq\u002F)\n\n- (arXiv 2022.03) **粗粒度到细粒度**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03821.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChenMnZ\u002FCF-ViT)\n\n- (arXiv 2022.03) 基于自监督预训练视觉Transformer的单目机器人**导航**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03682.pdf)\n\n- (arXiv 2022.03) WAVEMIX：面向图像的**资源高效**标记混合方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03689.pdf)\n\n- (arXiv 2022.03) VOVIT：低延迟的基于图的**音频-视觉**语音分离Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04099.pdf)，[[代码]](https:\u002F\u002Fipcv.github.io\u002FVoViT\u002F)\n\n- (arXiv 2022.03) 图注意力Transformer网络用于**多标签**图像**分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04049.pdf)\n\n- (arXiv 2022.03) 
EDGEFORMER：通过向视觉Transformer学习来改进**轻量级卷积网络**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03952.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhkzhang91\u002FEdgeFormer)\n\n- (arXiv 2022.03) Skating-Mixer：用于**花样滑冰评分**的多模态**MLP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03990.pdf)\n\n- (arXiv 2022.03) 动态组Transformer：一种具有动态组**注意力**的通用视觉Transformer骨干网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03937.pdf)\n\n- (arXiv 2022.03) CP-ViT：通过渐进式稀疏性预测进行级联视觉Transformer**剪枝**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04570.pdf)\n\n- (arXiv 2022.03) 模型无关的多任务微调，用于少样本的**视觉-语言****迁移学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04904.pdf)\n\n- (arXiv 2022.03) ChiTransformer：基于线索实现可靠的**立体视觉**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04554.pdf)\n\n- (arXiv 2022.03) 用于基于群体分割的统一Transformer框架：**协同分割**、**协同显著性检测**和**视频显著目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04708.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsuyukun666\u002FUFO)\n\n- (arXiv 2022.03) 粗粒度到细粒度的稀疏Transformer用于**高光谱图像重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04845.pdf)\n\n- (arXiv 2022.03) CMX：使用Transformer进行**RGB-X语义分割**的跨模态融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04838.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhuaaaliu\u002FRGBX_Semantic_Segmentation)\n\n- (arXiv 2022.03) 多尺度Transformer用于**高光谱图像分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04771.pdf)\n\n- (arXiv 2022.03) 注意差距：理解**多模态对比表示**学习中的模态差距，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02053.pdf)，[[代码]](https:\u002F\u002Fmodalitygap.readthedocs.io\u002F)\n\n- (arXiv 2022.03) 使用残差量化进行自回归**图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01941.pdf)\n\n- (arXiv 2022.03) CONTEXTFORMER：一种具有空间-通道注意力的Transformer，用于学习型**图像压缩**中的上下文建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02452.pdf)\n\n- (arXiv 2022.03) 视觉Transformer的补丁相似性感知无数据**量化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02250.pdf)\n\n- (arXiv 2022.03) ViT-P：从局部性重新思考**数据高效**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02358.pdf)\n\n- (arXiv 2022.03) DIT：用于**文档图像**Transformer的自监督预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02378.pdf)\n\n- (arXiv 2022.03) 朝着**高效**且**可扩展**的锐度感知最小化方向努力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02714.pdf)\n\n- (arXiv 2022.03) HyperTransformer：一种纹理与光谱特征融合Transformer，用于**全色化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02503.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwgcban\u002FHyperTransformer)\n\n- (arXiv 2022.03) UVCGAN：基于UNet和视觉Transformer的循环一致性GAN，用于**未配对图像到图像转换**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02557.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLS4GAN\u002Fuvcgan)\n\n- (arXiv 2022.03) 让我看看什么，告诉我怎么做：通过多模态条件控制进行**视频合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02573.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FMMVID)\n\n- (arXiv 2022.03) PANFORMER：一种基于Transformer的**全色化**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02916.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhysora\u002FPanFormer)\n\n- (arXiv 2022.03) 多类别标记Transformer用于**弱监督语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02891.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxulianuwa\u002FMCTformer)\n\n- (arXiv 2022.03) 跨语言图像匹配用于**弱监督语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02668.pdf)\n\n- (arXiv 2022.03) 
从注意力中学习亲和力：基于Transformer的端到端**弱监督语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02664.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Frulixiang\u002Fafa)\n\n- (arXiv 2022.03) DINO：用于端到端目标**检测**的改进去噪锚框DETR，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03605.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FIDEACVR\u002FDINO)\n\n- (arXiv 2022.03) MetaFormer：面向**细粒度识别**的统一元框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02751.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdqshuai\u002FMetaFormer)\n\n- (arXiv 2022.03) 基于跨模态注意力与语言的**视听**广义零样本学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03598.pdf)\n\n- (arXiv 2022.03) 基于Transformer的目标**检测**知识融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03187.pdf)\n\n- (arXiv 2022.03) 针对模态特异性标注视频的**多模态动作识别**中的可学习无关模态丢弃，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.03014.pdf)\n\n- (arXiv 2022.03) **视觉对话**中的指代关系建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02986.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMingxiao-Li\u002FModeling-Coreference-Relations-in-Visual-Dialog)\n\n- (arXiv 2022.03) VITRANSPAD：结合卷积与自注意力的视频Transformer用于**人脸呈现攻击检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01562.pdf)\n\n- (arXiv 2022.03) 多尾视觉Transformer用于**高效推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01587.pdf)\n\n- (arXiv 2022.03) 弯曲现实：适应全景**语义分割**的畸变感知Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01452.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjamycheung\u002FTrans4PASS)\n\n- (arXiv 2022.03) 视觉Transformer集成作为**生态学**自动分类的新范式，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01726.pdf)\n\n- (arXiv 2022.03) LGT-Net：基于几何感知Transformer网络的室内全景房间**布局估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01824.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhigangjiang\u002FLGT-Net)\n\n- (arXiv 2022.03) LatentFormer：基于多智能体Transformer的**交互建模**与**轨迹预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01880.pdf)\n\n- (arXiv 2022.03) DCT-Former：基于离散余弦变换的**高效**自注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01178.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcscribano\u002FDCT-Former-Public)\n\n- (arXiv 2022.03) 基于检索的多粒度对齐的无监督**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00242.pdf)\n\n- (arXiv 2022.03) 时空Transformer注意力网络用于**点云**中3D体素级别的联合分割与运动预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.00138.pdf)\n\n- (arXiv 2022.03) **CLIP**-GEN：利用CLIP实现无语言训练的**文本到图像**生成器，[[论文]]()\n\n- (arXiv 2022.03) MixSTE：用于视频中**3D**人体**姿态**估计的序列到序列混合时空编码器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00859.pdf)\n\n- (arXiv 2022.03) X -Trans2Cap：利用Transformer进行跨模态知识迁移以实现**3D密集字幕**生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00843.pdf)\n\n- (arXiv 2022.03) 3DCTN：用于**点云**分类的3D卷积-Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00828.pdf)\n\n- (arXiv 2022.03) DeciWatch：一种用于10倍**高效**2D和3D**姿态**估计的简单基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08713.pdf)\n\n- (arXiv 2022.03) D_2ETR：具有计算高效的跨尺度注意力的**仅解码器DETR**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00860.pdf)\n\n- (arXiv 2022.03) 增量Transformer结构增强的图像**修复**，采用掩码位置编码，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00867.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDQiaole\u002FZITS_inpainting)\n\n- (arXiv 2022.03) 用于**深度伪造检测**的自监督Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01265.pdf)\n\n- (arXiv 2022.03) 
聚合**金字塔**视觉Transformer：无需卷积的图像识别拆分-转换-合并策略，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2203\u002F2203.00960.pdf)\n\n- (arXiv 2022.03) TransDARC：基于Transformer的**驾驶员活动识别**，结合潜在空间特征校准，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00927.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKPeng9510\u002FTransDARC)\n\n- (arXiv 2022.03) DN-DETR：通过引入查询去噪来**加速**DETR**训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01305.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFengLi-ust\u002FDN-DETR)\n\n- (arXiv 2022.03) 利用身份一致性Transformer**保护名人**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.01318.pdf)\n\n- (arXiv 2022.03) 面向**运动控制**的掩码视觉预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.06173.pdf)，[[项目]](https:\u002F\u002Ftetexiao.com\u002Fprojects\u002Fmvp)\n\n- (arXiv 2022.03) NLX-GPT：用于视觉及**视觉-语言**任务中自然语言解释的模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05081.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffawazsammani\u002Fnlxgpt)\n\n- (arXiv 2022.03) 视觉-语言模型的条件式提示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.05557.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKaiyangZhou\u002FCoOp)\n\n- (arXiv 2022.03) 基于多功能AtrousFormer和局部语义引导的**车道检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04067.pdf)\n\n- (arXiv 2022.03) DALL-EVAL：探究**文本到图像**生成式Transformer的推理能力和社会偏见，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04053.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fj-min\u002FDallEval)\n\n- (arXiv 2022.03) 人类行为的特征性**3D姿态**预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.15079.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchrdiller\u002Fcharacteristic3dposes)\n\n\n\n### 2022年2月\n\n- (arXiv 2022.02) 基于生成流网络的**贝叶斯结构学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13903.pdf)\n\n- (arXiv 2022.02) 通过领域Transformer迈向**无监督域适应**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13777.pdf)\n\n- (arXiv 2022.02) 用于**人群定位**的端到端Transformer模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13065.pdf)\n\n- (arXiv 2022.02) 利用视频Transformer进行瞬时**生理参数估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12368.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Frevanurambareesh\u002Finstantaneous_transformer)\n\n- (arXiv 2022.02) Style**CLIP**Draw：在**文本到绘画**翻译中耦合内容与风格，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12362.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpschaldenbrand\u002FStyleCLIPDraw)\n\n- (arXiv 2022.02) 注意力可实现零**近似**误差，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12166.pdf)\n\n- (arXiv 2022.02) 当Transformer遇见**机器人抓取**：利用上下文实现高效抓取检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11911.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FWangShaoSUN\u002Fgrasp-transformer)\n\n- (arXiv 2022.02) 无需训练的**自动缩放**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11921.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVITA-Group\u002FAsViT)\n\n- (arXiv 2022.02) 纵观全局，立足局部：用于**视觉-语言导航**的双尺度图Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11742.pdf)，[[项目]](https:\u002F\u002Fcshizhe.github.io\u002Fprojects\u002Fvln_duet.html)\n\n- (arXiv 2022.02) 学习在视觉Transformer中**合并Token**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12015.pdf)\n\n- (arXiv 2022.02) ProFormer：基于原型特征增强和视觉Transformer学习**数据高效**的**身体运动**表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11423.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKPeng9510\u002FProFormer)\n\n- (arXiv 2022.02) 
基于归一化割的**无监督目标发现**自监督Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11539.pdf)，[[项目]](https:\u002F\u002Fwww.m-psi.fr\u002FPapers\u002FTokenCut2022\u002F)\n\n- (arXiv 2022.02) 关注纹理：用于通用**纹理合成**的多阶段沙漏型视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11703.pdf)\n\n- (arXiv 2022.02) CaMEL：基于教师学习的图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10492.pdf)\n\n- (arXiv 2022.02) **层次化**Perceiver，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10890.pdf)\n\n- (arXiv 2022.02) **Movies2Scenes**：利用电影相似性学习场景表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10650.pdf)\n\n- (arXiv 2022.02) GroupViT：由文本监督涌现的**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11094.pdf)，[[代码]]\n\n- (arXiv 2022.02) 雪花点反卷积：基于跳跃Transformer的**点云**补全与生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09367.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAllenXiangX\u002FSnowflakeNet)\n\n- (arXiv 2022.02) 基于Transformer视频表征的视听场景感知**对话生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09979.pdf)\n\n- (arXiv 2022.02) ViTAEv2：通过探索**归纳偏置**提升视觉Transformer在图像识别等任务中的性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10108.pdf)\n\n- (arXiv 2022.02) PMP-Net++：基于Transformer增强的多步点移动路径实现**点云补全**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09507.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdiviswen\u002FPMP-Net)\n\n- (arXiv 2022.02) DataMUX：面向神经网络的**数据复用**技术，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09318.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FDataMUX)\n\n- (arXiv 2022.02) 关于如何用**语言**规范引导视觉**注意力**的研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.08926.pdf)\n\n- (arXiv 2022.02) 基于Transformer网络对图像序列进行**时空户外照明聚合**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09206.pdf)\n\n- (arXiv 2022.02) 社交媒体**视频**帖子中的**虚假信息检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07706.pdf)\n\n- (arXiv 2022.02) 深度学习能否应用于基于模型的**多目标跟踪**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07909.pdf)\n\n- (arXiv 2022.02) 并非所有Patch都适用：通过**Token重组**加速视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07800.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyouweiliang\u002Fevit)\n\n- (arXiv 2022.02) ActionFormer：利用Transformer定位**动作**时刻，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07925.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhappyharrycn\u002Factionformer_release)\n\n- (arXiv 2022.02) 循序渐进：基于里程碑的长时程**视觉-语言导航**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07028.pdf)\n\n- (arXiv 2022.02) Transformer的可解释性：通过保守传播获得更好的**解释**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07304.pdf)\n\n- (arXiv 2022.02) MeshLeTemp：利用可学习的顶点间关系，将人体**姿态**和**网格重建**推广至野外场景，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07228.pdf)\n\n- (arXiv 2022.02) ViNTER：带有情感弧感知的Transformer用于**图像叙事生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07305.pdf)\n\n- (arXiv 2022.02) 用于**场景图**生成的超关系学习网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07271.pdf)\n\n- (arXiv 2022.02) CommerceMM：基于全维检索的大规模商业**多模态表征**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07247.pdf)\n\n- (arXiv 2022.02) Flowformer：通过保体积流线性化Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06258.pdf)\n\n- (arXiv 2022.02) DialFRED：用于**具身**指令遵循的对话式智能体，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13330.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fanonrabit\u002FDialFRED)\n\n- (arXiv 2022.02) 
CATs++：结合卷积和Transformer提升**代价聚合**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06817.pdf)\n\n- (arXiv 2022.02) 几何Transformer用于快速稳健的**点云配准**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06688.pdf)，[[代码]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06688.pdf)\n\n- (arXiv 2022.02) I-Tuning：利用图像微调语言模型以生成**字幕**，[[论文]]()\n\n- (arXiv 2022.02) 用于视频驱动的**行人检索**的多方向、多尺度金字塔Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06014.pdf)，[[代码]](https:\u002F\u002Fgit.openi.org.cn\u002Fzangxh\u002FPiT.git)\n\n- (arXiv 2022.02) **视觉声学**匹配，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06875.pdf)\n\n- (arXiv 2022.02) LighTN：用于**点云下采样**时性能与开销权衡的**轻量级**Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06263.pdf)\n\n- (arXiv 2022.02) BViT：基于广域**注意力**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06268.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDRL-CASIA\u002FDense_ViT)\n\n- (arXiv 2022.02) 具有语义增强的任务适应性特征Transformer用于**少样本分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06498.pdf)\n\n- (arXiv 2022.02) 通过**提示**学习进行领域适应，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06687.pdf)\n\n- (arXiv 2022.02) 混合与位移：挖掘**视觉MLP**中的全局和局部依赖关系，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06510.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJegZheng\u002FMS-MLP)\n\n- (arXiv 2022.02) Wukong：包含1亿条大规模中文**跨模态预训练**数据集及基础框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06767.pdf)，[[项目]](https:\u002F\u002Fwukong-dataset.github.io\u002Fwukong-dataset\u002F)\n\n- (arXiv 2022.02) 视觉Transformer是如何工作的？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06709.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fxxxnell\u002Fhow-do-vits-work)\n\n- (arXiv 2022.02) ACORT：一种用于参数高效图像**字幕生成**的紧凑型对象关系Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05451.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fjiahuei\u002Fsparse-image-captioning)\n\n- (arXiv 2022.02) **CLIP**asso：语义感知的**物体草图绘制**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05822.pdf), [[代码]](https:\u002F\u002Fclipasso.github.io\u002Fclipasso\u002F)\n\n- (arXiv 2022.02) 基于多任务Transformer的弱监督**文本检测**研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05508.pdf)\n\n- (arXiv 2022.02) 使用Transformer进行深度**足球比赛解说**：数据集、语义相关损失与多层次评估，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05728.pdf), [[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fsoccercaptioning)\n\n- (arXiv 2022.02) ENTROFORMER：基于Transformer的熵模型，用于学习型**图像压缩**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05492.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fmx54039q\u002Fentroformer)\n\n- (arXiv 2022.02) 基于预训练和对比学习的图像差异字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04298.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fyaolinli\u002FIDC)\n\n- (arXiv 2022.02) MaskGIT：掩码式**生成式** **图像** Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04200.pdf)\n\n- (arXiv 2022.02) 对比蒸馏是**自监督** **点云**表征学习的全部需求，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04241.pdf)\n\n- (arXiv 2022.02) 运动感知Transformer用于**遮挡下的人体重识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04243.pdf)\n\n- (arXiv 2022.02) 条件化**运动插帧**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04307.pdf), [[代码]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04307.pdf)\n\n- (arXiv 2022.02) 基于记忆的**注视点预测**在机器人操作的深度模仿学习中，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04877.pdf)\n\n- (arXiv 2022.02) 
**球面**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04942.pdf)\n\n- (arXiv 2022.02) OWL（观察、观看、聆听）：通过视听时序上下文对**第一人称视频**中的动作进行定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04947.pdf)\n\n- (arXiv 2022.02) 夏洛克·福尔摩斯的绑架案：一个用于**视觉溯因推理**的**数据集**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04800.pdf), [[项目]](http:\u002F\u002Fwww.visualabduction.com\u002F)\n\n- (arXiv 2022.02) DALL-EVAL：探测**文本到图像**生成式Transformer的推理能力和社会偏见，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04053.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fj-min\u002FDallEval)\n\n- (arXiv 2022.02) 预训练语言模型在**交互式决策**中的应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.01771.pdf)\n\n- (arXiv 2022.02) TransFollower：通过Transformer进行长序列车辆跟驰**轨迹预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03183.pdf)\n\n- (arXiv 2022.02) 魔鬼藏在标签里：从句子中进行**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02002.pdf)\n\n- (arXiv 2022.02) 面向**通用视觉模型**的网络监督概念扩展，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02317.pdf), [[项目]](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fgpv2)\n\n- (arXiv 2022.02) VU-BERT：一个用于**视觉对话**的统一框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10787.pdf)\n\n- (arXiv 2022.02) 通过简单的序列到序列学习框架**统一**架构、任务和模态，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03052.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FOFA)\n\n- (arXiv 2022.02) 在相机内参未知的情况下，使用自监督**单目深度估计**中的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03131.pdf)\n\n- (arXiv 2022.02) TRANSDREAMER：基于Transformer世界模型的**强化学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09481.pdf)\n\n- (arXiv 2022.02) 基于三重对比学习的**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.10401.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Futa-smile\u002FTCL)\n\n- (arXiv 2022.02) 针对**自监督**视觉**预训练**的损坏图像建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.03382.pdf)\n\n- (arXiv 2022.02) BLIP：为统一的**视觉-语言**理解和生成而构建的语言-图像预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12086.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FBLIP)\n\n- (arXiv 2022.02) DNNFuser：将生成式预训练Transformer作为通用映射器，用于**DNN加速器**中的层融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11218.pdf)\n\n- (arXiv 2022.02) Interactron：**具身**自适应**目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00660.pdf)\n\n- (arXiv 2022.02) 针对低配置设备的局部特征匹配方法LoFTR的适配方案，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00770.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FKolkir\u002FCoarse_LoFTR_TRT)\n\n- (arXiv 2022.02) 预训练语言模型在**交互式决策**中的应用，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.01771.pdf)\n\n- (arXiv 2022.02) Transformer能否成为强大的**治疗效应估计器**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.01336.pdf)\n\n- (arXiv 2022.02) 利用注意力机制和视觉Transformer提升基于价值函数模型的**样本效率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00710.pdf)\n\n- (arXiv 2022.02) 借助对象引导的跨模态校准语义检测**人-物体交互**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00259.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FJacobYuan7\u002FOCN-HOI-Benchmark)\n\n\n\n### 2022年1月\n\n- (arXiv 2022.01) O-ViT：正交视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12133.pdf)\n\n- (arXiv 2022.01) DynaMixer：具有动态混合功能的视觉**MLP**架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12083.pdf)\n\n- (arXiv 2022.01) VRT：一种用于**视频修复**的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12288.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FJingyunLiang\u002FVRT)\n\n- (arXiv 2022.01) 
DAB-DETR：动态**锚框**是DETR更好的查询方式，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12329.pdf), [[代码]](https:\u002F\u002Fgithub.com\u002FSlongLiu\u002FDAB-DETR)\n\n- (arXiv 2022.01) 插件反演：一种与模型无关的视觉反演方法，结合数据增强，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12961.pdf)\n\n- (arXiv 2022.01) MVP：通过多级语义对齐进行多阶段**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12596.pdf)\n\n- (arXiv 2022.01) VC-GPT：面向端到端生成式**视觉-语言**预训练的视觉条件GPT，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12723.pdf)\n\n- (arXiv 2022.01) BOAT：双边局部**注意力**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.13027.pdf)\n\n- (arXiv 2022.01) 基于图自注意力的Transformer图表示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12787.pdf)\n\n- (arXiv 2022.01) 将全局特征聚合到局部视觉Transformer中，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12903.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkrushi1992\u002FMOA-transformer)\n\n- (arXiv 2022.01) 用于视觉问答系统性泛化的Transformer模块网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11316.pdf)\n\n- (arXiv 2022.01) 基于U-Transformer的广义图像外延生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11403.pdf)\n\n- (arXiv 2022.01) RelTR：用于场景图生成的关系Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11460.pdf)\n\n- (arXiv 2022.01) DocSegTr：一种实例级端到端文档图像分割Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11438.pdf)\n\n- (arXiv 2022.01) 预训练的语言Transformer是通用的图像分类器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10182.pdf)\n\n- (arXiv 2022.01) 探索与匹配：基于Transformer的端到端视频定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10168.pdf)\n\n- (arXiv 2022.01) TGFuse：一种基于Transformer和生成对抗网络的红外与可见光图像融合方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10147.pdf)\n\n- (arXiv 2022.01) ViT-HGR：基于视觉Transformer的高密度表面肌电图信号手势识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10060.pdf)\n\n- (arXiv 2022.01) ShapeFormer：基于Transformer的稀疏表示形状补全，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10326.pdf)，[[项目]](https:\u002F\u002Fshapeformer.github.io\u002F)\n\n- (arXiv 2022.01) 用于视觉任务的卷积型XFORMER，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10271.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpranavphoenix\u002FCXV)\n\n- (arXiv 2022.01) DocEnTr：一种端到端文档图像增强Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10252.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdali92002\u002FDocEnTR)\n\n- (arXiv 2022.01) 基于图Transformer的零样本草图图像检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10185.pdf)\n\n- (arXiv 2022.01) SA-VQA：面向视觉问答的视觉与语义表征结构化对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10654.pdf)\n\n- (arXiv 2022.01) 用于建筑物损伤评估的双任务孪生Transformer框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10953.pdf)\n\n- (arXiv 2022.01) 当移位操作遇到视觉Transformer时：一种极其简单的替代注意力机制的方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10801.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSPACH)\n\n- (arXiv 2022.01) 自监督3D语义表示学习用于视觉-语言导航，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10788.pdf)\n\n- (arXiv 2022.01) 仅用2040张图像训练视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10728.pdf)\n\n- (arXiv 2022.01) 利用远程监督学习识别程序性活动，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.10990.pdf)\n\n- (arXiv 2022.01) 基于语义表征评估语言偏置的图像分类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.11014.pdf)\n\n- (arXiv 2022.01) 视觉Transformer在密集预测任务上的综合研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08683.pdf)\n\n- (arXiv 2022.01) 
UniFormer：统一卷积与自注意力用于视觉识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09450.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniFormer)\n\n- (arXiv 2022.01) 只需要补丁吗？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09792.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flocuslab\u002Fconvmixer)\n\n- (arXiv 2022.01) 受阅读策略启发的视觉表征学习用于文本到视频检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09168.pdf)\n\n- (arXiv 2022.01) 学习以具身感知为导向的多模态神经SLAM行动，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09862.pdf)\n\n- (arXiv 2022.01) 视觉信息引导的零样本释义生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09107.pdf)\n\n- (arXiv 2022.01) TerViT：一种高效的三值视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08050.pdf)\n\n- (arXiv 2022.01) 多模态视频字幕生成的端到端生成式预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08264.pdf)\n\n- (arXiv 2022.01) OMNIVORE：一个适用于多种视觉模态的单一模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08377.pdf)，[[项目]](https:\u002F\u002Ffacebookresearch.github.io\u002Fomnivore\u002F)\n\n- (arXiv 2022.01) MeMViT：内存增强型多尺度视觉Transformer，用于高效长期视频识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.08383.pdf)\n\n- (arXiv 2022.01) CLEAR基准测试：真实世界图像上的持续学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06289.pdf)，[[项目]](https:\u002F\u002Fclear-benchmark.github.io\u002F)\n\n- (arXiv 2022.01) ProposalCLIP：利用CLIP线索进行无监督的开放类别目标提案生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06696.pdf)\n\n- (arXiv 2022.01) 跨模态对比蒸馏用于指令性活动预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06734.pdf)\n\n- (arXiv 2022.01) Transformer的应用：弱监督动作分割，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05675.pdf)\n\n- (arXiv 2022.01) VAQF：用于低比特视觉Transformer的全自动软硬件协同设计框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06618.pdf)\n\n- (arXiv 2022.01) CLIP-TD：针对视觉-语言任务的CLIP定向蒸馏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05729.pdf)\n\n- (arXiv 2022.01) 基于双向交叉注意力Transformer的领域适应，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05887.pdf)\n\n- (arXiv 2022.01) 持续Transformer：用于在线推理的无冗余注意力机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06268.pdf)\n\n- (arXiv 2022.01) 基于深度∆-插值器的动作中间帧生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06701.pdf)\n\n- (arXiv 2022.01) RePre：通过重建式预训练改进自监督视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06857.pdf)\n\n- (arXiv 2022.01) GTrans：带有图嵌入的时空自回归Transformer，用于极端事件的临近预报，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06717.pdf)\n\n- (arXiv 2022.01) TransFuse：一种基于Transformer的统一图像融合框架，采用自监督学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07451.pdf)\n\n- (arXiv 2022.01) Q-ViT：视觉Transformer的全可微量化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07703.pdf)\n\n- (arXiv 2022.01) 解耦潜在空间变换器用于**可解释的单目高度估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06357.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002FShadowXZT\u002FDLT-Height-Estimation.pytorch)\n\n- (arXiv 2022.01) Poseur：基于Transformer的直接人体**姿态回归***，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07412.pdf)\n\n- (arXiv 2022.01) SWINUNET3D——一种使用移位窗口Transformer的分层架构，用于深度**交通预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.06390.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbojesomo\u002FTraffic4Cast2021-SwinUNet3D)\n\n- (arXiv 2022.01) SWIN-POSE：基于Swin Transformer的人体**姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07384.pdf)\n\n- (arXiv 2022.01) 
看得更近：利用Transformer融合第一人称与第三人称视角实现**机器人操作**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.07779.pdf)，[[项目]](https:\u002F\u002Fjangirrishabh.github.io\u002Flookcloser\u002F)\n\n- (arXiv 2022.01) ViT2Hash：无监督的信息保持**哈希**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05541.pdf)\n\n- (arXiv 2022.01) 语言驱动的**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03546.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fisl-org\u002Flang-seg)\n\n- (arXiv 2022.01) **行人检测**：领域泛化、CNN、Transformer及其他方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03176.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhasanirtiza\u002FPedestron)\n\n- (arXiv 2022.01) ImageSubject：一个用于**主体检测**的大规模数据集，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03101.pdf)\n\n- (arXiv 2022.01) 使用图像级监督**检测**两万个类别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02605.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDetic)\n\n- (arXiv 2022.01) 广义**类别发现**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02609.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsgvaze\u002Fgeneralized-category-discovery)\n\n- (arXiv 2022.01) 基于**视频-文本**建模的视频**摘要**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02494.pdf)\n\n- (arXiv 2022.01) 时空元组Transformer用于**基于骨骼的动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02849.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fheleiqiu\u002FSTTFormer)\n\n- (arXiv 2022.01) 视觉Transformer的**四叉树注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02767.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTangshitao\u002FQuadtreeAttention)\n\n- (arXiv 2022.01) 监督跨模态检索中**视觉-语言**预训练模型的全面实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02772.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002Fzhixiongz\u002FCLIP4CMR)\n\n- (arXiv 2022.01) MERLOT Reserve：通过**视觉、语言和声音**获取神经脚本知识，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02639.pdf)，[[项目]](https:\u002F\u002Frowanzellers.com\u002Fmerlotreserve)\n\n- (arXiv 2022.01) 关于协同注意力Transformer层在**视觉问答**中的有效性，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.03965.pdf)\n\n- (arXiv 2022.01) 金字塔融合Transformer用于**语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04019.pdf)\n\n- (arXiv 2022.01) 多视角Transformer用于**视频识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04288.pdf)\n\n- (arXiv 2022.01) HYPERTRANSFORMER：用于监督和半监督**少样本学习**的模型生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04182.pdf)\n\n- (arXiv 2022.01) UNIFORMER：用于**高效时空**表征学习的统一Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04676.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-X\u002FUniFormer)\n\n- (arXiv 2022.01) BridgeFormer：通过多项选择题桥接**视频-文本**检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04850.pdf)，[[项目]](https:\u002F\u002Fgeyuying.github.io\u002FMCQ.html)\n\n- (arXiv 2022.01) TransVOD：基于时空Transformer的端到端**视频目标检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05047.pdf)\n\n- (arXiv 2022.01) **CLIP**-Event：通过**事件**结构连接文本与图像，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05078.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flimanling\u002Fclip-event)\n\n- (arXiv 2022.01) Uni-EDEN：通过多粒度**视觉-语言**预训练构建的通用编码器-解码器网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.04026.pdf)\n\n- (arXiv 2022.01) Lawin Transformer：通过大窗口注意力的多尺度表征改进**语义分割**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01615.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyan-hao-tian\u002Flawin)\n\n- (arXiv 2022.01) 
使用统一条件模型对**视觉语言**BERT进行**自训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02010.pdf)\n\n- (arXiv 2022.01) TransVPR：基于Transformer的场所识别，采用多层级注意力聚合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02001.pdf)\n\n- (arXiv 2022.01) 用于图像**字幕生成**的紧凑双向Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01984.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYuanEZhou\u002FCBTrans)\n\n- (arXiv 2022.01) 流场引导的稀疏Transformer用于**视频去模糊**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01893.pdf)\n\n- (arXiv 2022.01) 视觉Transformer中的**随机层**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15111.pdf)\n\n- (arXiv 2022.01) ERNIE-VILG：用于**双向视觉-语言生成**的统一生成式预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15283.pdf)\n\n- (arXiv 2022.01) InverseMV：使用卷积**视频-音乐**Transformer来**编排钢琴乐谱**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15320.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flinchintung\u002FVMT)\n\n- (arXiv 2022.01) CSformer：为**压缩感知**架起卷积与Transformer之间的桥梁，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15299.pdf)\n\n- (arXiv 2022.01) Persformer：一种用于**拓扑机器学习**的Transformer架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.15210.pdf)\n\n- (arXiv 2022.01) 视觉Transformer**瘦身**：在连续优化空间中进行多维度搜索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00814.pdf)\n\n- (arXiv 2022.01) 将语言作为查询用于**指代性视频目标分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00487.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwjn922\u002FReferFormer)\n\n- (arXiv 2022.01) PyramidTNT：利用金字塔架构改进**Transformer-in-Transformer**基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00978.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FCV-Backbones\u002Ftree\u002Fmaster\u002Ftnt_pytorch)\n\n- (arXiv 2022.01) 一种基于Transformer的**变化检测**暹罗网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01293.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwgcban\u002FChangeFormer)\n\n- (arXiv 2022.01) 具有**可变形注意力**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00520.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FDAT)\n\n- (arXiv 2022.01) 基于ViT特征拼接的**语义外观迁移**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00424.pdf)，[[项目]](https:\u002F\u002Fsplice-vit.github.io\u002F)\n\n- (arXiv 2022.01) 用于**光场图像超分辨率**的细节保持Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00346.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBITszwang\u002FDPT)\n\n\n\n### 2021年12月\n\n- (arXiv 2021.12) 视觉Transformer的多维**模型压缩**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.00043.pdf)\n\n- (arXiv 2021.12) 基于交互式Transformer的孪生网络用于**视频目标分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13983.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLANMNG\u002FSITVOS)\n\n- (arXiv 2021.12) Pale Transformer：一种具有淡色形状**注意力机制**的通用视觉Transformer**骨干网络**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14000.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBR-IDL\u002FPaddleViT)\n\n- (arXiv 2021.12) APRIL：寻找视觉Transformer在**隐私保护**方面的阿喀琉斯之踵，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14087.pdf)\n\n- (arXiv 2021.12) 基于分数位置编码的音视频帧同步技术，用于**视频到文本翻译**中的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14088.pdf)\n\n- (arXiv 2021.12) **CLIP**在医学领域是否像在通用领域一样有益于**视觉问答**任务？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13906.pdf)\n\n- (arXiv 2021.12) SPViT：通过软令牌剪枝实现更**快速**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13890.pdf)\n\n- (arXiv 2021.12) 
一捧词汇：从词袋监督中学习可迁移的视觉模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13884.pdf)\n\n- (arXiv 2021.12) StyleGAN-V：一款连续**视频**生成器，兼具**StyleGAN2**的价格、图像质量和优势，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14683.pdf)，[[代码]](https:\u002F\u002Funiversome.github.io\u002Fstylegan-v)\n\n- (arXiv 2021.12) 基于预训练**视觉-语言**模型的**零样本语义分割**简单基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.14757.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMendelXu\u002Fzsseg.baseline)\n\n- (arXiv 2021.12) Miti-DETR：基于Transformer的物体**检测**模型，采用缓解自注意力收敛问题的设计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13310.pdf)\n\n- (arXiv 2021.12) SIMVIT：探索一种带有**滑动窗口**的简单视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13085.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fucasligang\u002FSimViT)\n\n- (arXiv 2021.12) SGTR：基于Transformer的端到端**场景图生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12970.pdf)\n\n- (arXiv 2021.12) 基于层次化Transformer的**视频**联合建模用于**协同摘要**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13478.pdf)\n\n- (arXiv 2021.12) 面向**小规模数据集**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13492.pdf)\n\n- (arXiv 2021.12) 基于能量型潜在空间的**生成式**视觉Transformer用于**显著性预测**的学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13528.pdf)\n\n- (arXiv 2021.12) ViR：视觉**资源库**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.13545.pdf)\n\n- (arXiv 2021.12) SeMask：用于**语义分割**的语义掩码Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12782.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FSeMask-Segmentation)\n\n- (arXiv 2021.12) 开放词汇图像**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12143.pdf)\n\n- (arXiv 2021.12) ELSA：面向视觉Transformer的增强型局部**自注意力**机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12786.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdamo-cv\u002FELSA)\n\n- (arXiv 2021.12) LaTr：面向**场景文本**的布局感知Transformer，用于**视觉问答**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12494.pdf)\n\n- (arXiv 2021.12) 使用交叉注意力Transformer和行为编码进行**多模态人格识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12180.pdf)\n\n- (arXiv 2021.12) 细粒度的**多模态自监督学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12182.pdf)\n\n- (arXiv 2021.12) SLIP：自监督学习与**语言-图像**预训练的结合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.12750.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSLIP)\n\n- (arXiv 2021.12) CLEVR3D：组合式语言与基础视觉推理，用于**3D真实场景**中的**问答**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11691.pdf)\n\n- (arXiv 2021.12) MIA-Former：通过多粒度输入适配实现**高效**且**鲁棒**的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11542.pdf)\n\n- (arXiv 2021.12) iSegFormer：基于Transformer的交互式图像**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11325.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fqinliuliuqin\u002FiSegFormer.git)\n\n- (arXiv 2021.12) 利用知识图嵌入进行对比式物体**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11366.pdf)\n\n- (arXiv 2021.12) RepMLPNet：具有重参数化**局部性**的层次化视觉**MLP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11081.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FDingXiaoH\u002FRepMLP)\n\n- (arXiv 2021.12) **轻量级**视觉Transformer，配备增强型**自注意力**机制，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10809.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChenglin-Yang\u002FLVT)\n\n- (arXiv 2021.12) MPViT：用于**密集预测**的多路径视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11010.pdf)，[[代码]](https:\u002F\u002Fgit.io\u002FMPViT)\n\n- (arXiv 2021.12) 
SOIT：基于实例感知Transformer进行物体**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11037.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FyuxiaodongHRI\u002FSOIT)\n\n- (arXiv 2021.12) 用于高效局部**注意力**的学习查询，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.11435.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmoabarar\u002Fqna)\n\n- (arXiv 2021.12) 关于**低层视觉**的高效Transformer及图像预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10175.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffenglinglwb\u002FEDT)\n\n- (arXiv 2021.12) LOCFORMER：通过特征采样方法使Transformer能够在长篇未修剪视频中执行**时间片段定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10066.pdf)\n\n- (arXiv 2021.12) 告诉我你看到了什么：一种基于自然语言描述的零样本**动作识别**方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09976.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvalterlej\u002Fzsarcap)\n\n- (arXiv 2021.12) 用于**领域适应**的Transformer预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09965.pdf)\n\n- (arXiv 2021.12) ScanQA：用于空间场景理解的3D问答，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10482.pdf)\n\n- (arXiv 2021.12) 自监督预训练是否需要大规模数据集？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10740.pdf)\n\n- (arXiv 2021.12) StyleSwin：基于Transformer的GAN，用于高分辨率**图像生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10762.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FStyleSwin)\n\n- (arXiv 2021.12) Mask2Former用于**视频实例分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10764.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FMask2Former)\n\n- (arXiv 2021.12) GLIDE：基于文本引导的扩散模型实现逼真的**图像生成**与**编辑**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10741.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fglide-text2im)\n\n- (arXiv 2021.12) 基于示例Transformer的高效视觉**跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09686.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvisionml\u002Fpytracking)\n\n- (arXiv 2021.12) 基于图神经网络驱动的Transformer进行**事件相机去噪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09685.pdf)\n\n- (arXiv 2021.12) 对齐与提示：使用实体提示进行**视频与语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09583.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FALPRO)\n\n- (arXiv 2021.12) 基于最优传输蒸馏的**数据高效**语言监督零样本识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09445.pdf)\n\n- (arXiv 2021.12) SiamTrans：利用预训练的双塔Transformer实现零样本多帧**图像修复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09426.pdf)\n\n- (arXiv 2021.12) 全Transformer框架用于鲁棒的**点云配准**，结合深度信息交互，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09385.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FCGuangyan-BIT\u002FDIT)\n\n- (arXiv 2021.12) ZeroVL：一种在资源有限情况下对齐**视觉-语言**表示的强大基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09331.pdf)\n\n- (arXiv 2021.12) 基于Transformer实现端到端的**图像压缩与分析**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09300.pdf)\n\n- (arXiv 2021.12) 如何增强你的ViT？一致性损失与StyleAug——一种随机风格迁移增强方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09260.pdf)\n\n- (arXiv 2021.12) 用于**持续学习**的提示学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08654.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fl2p)\n\n- (arXiv 2021.12) 用于**视觉-语言**理解的蒸馏双编码器模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08723.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkugwzk\u002FDistilled-DualEncoder)\n\n- (arXiv 2021.12) 利用无监督语义信息进行密集视频**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08455.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvalterlej\u002Fdvcusi)\n\n- (arXiv 2021.12) 
超出常规视角，在**3D场景**中将语言接地，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08879.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnickgkan\u002Fbeauty_detr)\n\n- (arXiv 2021.12) RegionCLIP：基于区域的**语言-图像**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09106.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FRegionCLIP)\n\n- (arXiv 2021.12) DProST：利用空间雕刻和动态投影空间Transformer进行**6自由度物体位姿估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08775.pdf)\n\n- (arXiv 2021.12) 用于**自监督**视觉预训练的掩码特征预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09133.pdf)\n\n- (arXiv 2021.12) SGEITL：面向**视觉常识推理**的场景图增强图像-文本学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08587.pdf)\n\n- (arXiv 2021.12) TransZero++：跨属性引导的Transformer用于**零样本学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08643.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshiming-chen\u002FTransZero_pp)\n\n- (arXiv 2021.12) 基于Vision Transformer的**视频哈希检索**，用于追踪虚假视频的来源，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08117.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flajlksdf\u002Fvtl)\n\n- (arXiv 2021.12) 视频与图像联合训练Transformer可提升**动作识别**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07175.pdf)\n\n- (arXiv 2021.12) QAHOI：基于查询的锚点用于检测**人-物体交互**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.08647.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcjw2021\u002FQAHOI)\n\n- (arXiv 2021.12) AdaViT：用于**高效**视觉Transformer的自适应Token，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07658.pdf)\n\n- (arXiv 2021.12) **CLIP**-Lite：从文本标注中学习信息**高效**的视觉表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07133.pdf)\n\n- (arXiv 2021.12) 向统一的基础模型迈进：联合预训练Transformer处理**未配对的图像与文本**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07074.pdf)\n\n- (arXiv 2021.12) 深度ViT特征作为密集的视觉**描述子**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05814.pdf)，[[项目]](https:\u002F\u002Fdino-vit-features.github.io\u002F)\n\n- (arXiv 2021.12) 几何对比Transformer用于广义的3D姿态迁移，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07374.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmikecheninoulu\u002FCGT)\n\n- (arXiv 2021.12) 自监督时序Transformer网络用于**动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.07338.pdf)\n\n- (arXiv 2021.12) COMPOSER：视频中**群体活动**的组合式学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05892.pdf)\n\n- (arXiv 2021.12) 基于短程与长程关系的时空Transformer用于**微表情识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05851.pdf)\n\n- (arXiv 2021.12) 通过实体增强知识注入改进并诊断基于知识的**视觉问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06888.pdf)\n\n- (arXiv 2021.12) SVIP：用于**视频**中流程的**序列验证**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06447.pdf)\n\n- (arXiv 2021.12) 改进Vision Transformer以支持**增量学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06103.pdf)\n\n- (arXiv 2021.12) VL-ADAPTER：用于**视觉-语言**任务的参数高效迁移学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06825.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fylsung\u002FVL_adapter)\n\n- (arXiv 2021.12) 接纳稀疏Transformer的单步法**3D目标检测器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06375.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FTuSimple\u002FSST)\n\n- (arXiv 2021.12) PartGlot：通过语言参考游戏学习**形状部件分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06390.pdf)\n\n- (arXiv 2021.12) 基于空间交互Transformer网络进行**行人轨迹预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06624.pdf)\n\n- (arXiv 2021.12) 学习语义对齐的特征表示用于**基于文本的人脸搜索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06714.pdf)\n\n- (arXiv 2021.12) 
L-Verse：在**图像**与**文本**之间进行双向**生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11133.pdf)\n\n- (arXiv 2021.12) **自注意力机制**并不需要O(n^2)的内存，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05682.pdf)\n\n- (arXiv 2021.12) Vision Transformer对补丁扰动是否**鲁棒**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10659.pdf)\n\n- (arXiv 2021.12) Mesa：一种用于 Transformer 的**节省内存训练**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11124.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhuang-group\u002FMesa)\n\n- (arXiv 2021.12) 将语义概念注入端到端图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05230.pdf)\n\n- (arXiv 2021.12) MAGMA——通过基于适配器的微调对**生成模型**进行多模态**增强**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05253.pdf)\n\n- (arXiv 2021.12) LCTR：关于唤醒 Transformer 的局部连续性以实现**弱监督目标定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05291.pdf)\n\n- (arXiv 2021.12) FaceFormer：基于 Transformer 的**语音驱动 3D 面部动画**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05329.pdf)\n\n- (arXiv 2021.12) 重新思考用于**接地情境识别**的两阶段框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05375.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkellyiss\u002FSituFormer)\n\n- (arXiv 2021.12) **CLIP**2Style**GAN**：无监督提取 StyleGAN 编辑方向，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05219.pdf)\n\n- (arXiv 2021.12) Couplformer：用耦合**注意力**图重新思考视觉 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05425.pdf)\n\n- (arXiv 2021.12) 统一的多模态预训练与基于提示的调优，用于**视觉-语言**理解与生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05587.pdf)\n\n- (arXiv 2021.12) 带原始对象查询的视觉 Transformer 用于**多标签图像分类**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05485.pdf)\n\n- (arXiv 2021.12) Colossal-AI：一个用于**大规模并行训练**的统一深度学习系统，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14883.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI)\n\n- (arXiv 2021.12) MS-TCT：用于**动作检测**的多尺度时序 ConvTransformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03902.pdf)\n\n- (arXiv 2021.12) 接地的**语言-图像**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03857.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP)\n\n- (arXiv 2021.12) U^2-Former：一种用于**图像修复**的嵌套 U 形 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02279.pdf)\n\n- (arXiv 2021.12) 用于**点云**分析的自适应通道编码 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2112\u002F2112.02507.pdf)\n\n- (arXiv 2021.12) 基于 Transformer 的遮挡行人**重识别**中的姿态引导特征解耦，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02466.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FWangTaoAs\u002FPFD_Net)\n\n- (arXiv 2021.12) VT-CLIP：利用视觉引导文本增强**视觉-语言**模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02399.pdf)\n\n- (arXiv 2021.12) PointCLIP：通过**CLIP**实现**点云**理解，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02413.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FPointCLIP)\n\n- (arXiv 2021.12) 通过双分支全 Transformer 网络学习**跟踪**表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02571.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fphiphiphi31\u002FDualTFR)\n\n- (arXiv 2021.12) 动态标记**归一化**提升视觉 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02624.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fwqshao126\u002FDTN)\n\n- (arXiv 2021.12) PTTR：基于 Transformer 的关系型 3D **点云目标跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02857.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FJasonkks\u002FPTTR)\n\n- (arXiv 2021.12) GETAM：用于**弱监督语义分割**的梯度加权逐元素 Transformer 
注意力图，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02841.pdf)\n\n- (arXiv 2021.12) **Text2Mesh**：基于文本的网格神经风格化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03221.pdf)，[[项目]](https:\u002F\u002Fthreedle.github.io\u002Ftext2mesh\u002F)\n\n- (arXiv 2021.12) LMR-CBT：使用 CB-Transformer 学习模态融合表征，用于从非对齐多模态序列中进行多模态**情绪识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01697.pdf)\n\n- (arXiv 2021.12) 让长图像变短：视觉 Transformer 的自适应**标记**长度，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01686.pdf)\n\n- (arXiv 2021.12) FuseDream：无需训练的**文本到图像生成**，结合改进的**CLIP**+GAN 空间优化，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01573.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgnobitab\u002FFuseDream)\n\n- (arXiv 2021.12) TransZero：基于属性的零样本学习 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01683.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshiming-chen\u002FTransZero)\n\n- (arXiv 2021.12) 通过 Transformer 学习可泛化的针对可变形物体的**视觉-触觉**机器人**抓取**策略，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.06374.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGTLIDAR\u002FDeformableObjectsGrasping.git)\n\n- (arXiv 2021.12) Hformer：混合 CNN-Transformer 模型，用于条纹投影相位解包裹中的**条纹级次预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2112\u002F2112.06759.pdf)\n\n- (arXiv 2021.12) 用于**fMRI 预测**任务的 Transformer 预训练与微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05761.pdf)\n\n- (arXiv 2021.12) 基于 Transformer 的**轨迹预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04350.pdf)\n\n- (arXiv 2021.12) 评估轻量级 Transformer 在**动作识别**中的表现，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09641.pdf)\n\n- (arXiv 2021.12) 基于自我监督的上下文化时空**对比学习**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05181.pdf)\n\n- (arXiv 2021.12) CMA-CLIP：用于**图像-文本**分类的跨模态注意力**CLIP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03562.pdf)\n\n- (arXiv 2021.12) **自举**ViT：迈向解放视觉 Transformer 的预训练束缚，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03552.pdf)\n\n- (arXiv 2021.12) 基于决策的黑盒**攻击**：通过逐块移除对抗样本攻击视觉 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03492.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FshiyuchengTJU\u002FPAR)\n\n- (arXiv 2021.12) DoodleFormer：用 Transformer 进行创意**素描绘制**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03258.pdf)\n\n- (arXiv 2021.12) 利用模仿和自我监督学习创建**多模态交互式智能体**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03763.pdf)\n\n- (arXiv 2021.12) 野外环境下的**音频-视觉**同步，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04432.pdf)，[[项目]](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Favs)\n\n- (arXiv 2021.12) 先**分类**后**接地**：将**视频**场景图重新表述为时间二分图，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04222.pdf)\n\n- (arXiv 2021.12) Garment4D：从点云序列中重建**服装**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04159.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhongfz16\u002FGarment4D)\n\n- (arXiv 2021.12) 局部移位的**注意力**机制，结合早期全局融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05080.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fshellysheynin\u002FLocally-SAG-Transformer)\n\n- (arXiv 2021.12) BLT：用于可控**布局生成**的双向布局Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05112.pdf)\n\n- (arXiv 2021.12) PE-former：**姿态估计**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04981.pdf)，[[项目]](https:\u002F\u002Fwww.ics.forth.gr\u002Fhccv\u002F)\n\n- (arXiv 2021.12) 
Hair**CLIP**：通过文本和参考图像设计你的发型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05142.pdf)，[[项目]](https:\u002F\u002Fgithub.com\u002Fwty-ustc\u002FHairCLIP)\n\n- (arXiv 2021.12) **CLIP**-**NeRF**：基于文本和图像驱动的神经辐射场操控，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05139.pdf)，[[代码]](https:\u002F\u002Fcassiepython.github.io\u002Fclipnerf\u002F)\n\n- (arXiv 2021.12) 双语、开放世界视频文本**数据集**及基于Transformer的端到端**视频文本检测器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04888.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FTransVTSpotter)，[[数据集]](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FBOVText-Benchmark)\n\n- (arXiv 2021.12) DualFormer：用于**高效视频识别**的局部-全局分层Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04674.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fdualformer)\n\n- (arXiv 2021.12) 基于循环凝视的Decoder，用于Transformer的**检测**任务，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04632.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzhechen\u002FDeformable-DETR-REGO)\n\n- (arXiv 2021.12) 快速**点云**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04702.pdf)\n\n- (arXiv 2021.12) 辅助远程操作：利用Transformer来**收集机器人任务演示**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.05129.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fassistive-teleop)\n\n- (arXiv 2021.12) 用于**多光谱目标检测**的跨模态融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00273.pdf)\n\n- (arXiv 2021.12) PatchFormer：一种带有Patch注意力的**高效****点云**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00207.pdf)\n\n- (arXiv 2021.12) 基于Transformer的方法，用于历史文献中联合进行**手写文字**和**命名实体识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04189.pdf)\n\n- (arXiv 2021.12) 用于**视觉-语言**建模的**MLP**架构：一项经验研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04453.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Feasonnie\u002Fmlp-vil)\n\n- (arXiv 2021.12) 一次性搞定一切——用于**视频检索**的多模态融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04446.pdf)\n\n- (arXiv 2021.12) 通过提示优化**视觉-语言**模型以实现高效的视频理解，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04478.pdf)，[[项目]](https:\u002F\u002Fju-chen.github.io\u002Fefficient-prompt\u002F)\n\n- (arXiv 2021.12) FLAVA：一个基础性的**语言与视觉**对齐模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04482.pdf)\n\n- (arXiv 2021.12) 用于**文本驱动图像变换**的嵌入算术，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.03162.pdf)\n\n- (arXiv 2021.12) LAVT：面向**指代式图像分割**的语言感知视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02244.pdf)\n\n- (arXiv 2021.12) 看我在做什么：教学视频中叙述的自监督**空间定位**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10596.pdf)，[[项目]](https:\u002F\u002Fcs-people.bu.edu\u002Frxtan\u002Fprojects\u002Fgrounding_narrations\u002F)\n\n- (arXiv 2021.12) Uni-Perceiver：用于**零样本和少样本**任务的**通用感知**统一预训练架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01522.pdf)\n\n- (arXiv 2021.12) Dense**CLIP**：基于语言引导的**密集**预测，采用上下文感知的提示策略，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01518.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fraoyongming\u002FDenseCLIP)\n\n- (arXiv 2021.12) 自监督**视频**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01514.pdf)，[[代码]](https:\u002F\u002Fgit.io\u002FJ1juJ)\n\n- (arXiv 2021.12) OW-DETR：**开放世界检测**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01513.pdf)\n\n- (arXiv 2021.12) 使用Dream Fields实现零样本**文本引导对象生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01455.pdf)，[[项目]](https:\u002F\u002Fajayj.com\u002Fdreamfields)\n\n- (arXiv 2021.12) 
带有学习区域的**视频-文本**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01194.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fruiyan1995\u002FRegion_Learner)\n\n- (arXiv 2021.12) MTFNet：用于**RGB-D显著性目标检测**的互Transformer融合网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01177.pdf)\n\n- (arXiv 2021.12) TCTN：一种用于**时空**预测学习的3D-时间卷积Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01085.pdf)\n\n- (arXiv 2021.12) DenseCLIP：从**CLIP**中提取无需标注的**密集**标签，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01071.pdf)\n\n- (arXiv 2021.12) TransMEF：一种基于Transformer的**多曝光图像融合**框架，采用自监督多任务学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01030.pdf)\n\n- (arXiv 2021.12) SwinTrack：一个简单而强大的Transformer**跟踪**基线，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00995.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLitingLin\u002FSwinTrack)\n\n- (arXiv 2021.12) 以目标为中心的无监督图像**标题生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00969.pdf)\n\n- (arXiv 2021.12) 视觉配对学习：一种用于图像**分类**的**高效**训练框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00965.pdf)\n\n- (arXiv 2021.12) 用于**场景文本识别**的视觉-语义Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00948.pdf)\n\n- (arXiv 2021.12) 利用Transformer进行可微**空间规划**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01010.pdf)，[[项目]](https:\u002F\u002Fdevendrachaplot.github.io\u002Fprojects\u002Fspatial-planning-transformers)\n\n- (arXiv 2021.12) 改进的**多尺度**视觉Transformer，用于**分类**和**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01526.pdf)\n\n- (arXiv 2021.12) 用于通用图像**分割**的掩码注意力Mask Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01527.pdf)，[[代码]](https:\u002F\u002Fbowenc0221.github.io\u002Fmask2former)\n\n- (arXiv 2021.12) BEVT：**视频**Transformer的BERT预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01529.pdf)\n\n- (arXiv 2021.12) 通过弱监督进行**人-物体交互检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00492.pdf)\n\n- (arXiv 2021.12) 学习Transformer特征用于**图像质量评估**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00485.pdf)\n\n- (arXiv 2021.12) **CLIP**风格化器：仅需单个文本条件即可实现**图像风格迁移**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00374.pdf)\n\n- (arXiv 2021.12) 基于Transformer的**多视角立体视觉**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00336.pdf)\n\n- (arXiv 2021.12) VoRTX：基于Transformer的**体素级视图选择与融合**的**体积3D重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00236.pdf)，[[代码]](https:\u002F\u002Fnoahstier.github.io\u002Fvortx)\n\n- (arXiv 2021.12) 面向检索的对象感知**视频-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00656.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFingerRec\u002FOA-Transformer)\n\n\n\n### 2021年11月\n\n- (arXiv 2021.11) 多模态Transformer在**类别无关**的目标**检测**中表现卓越，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11430.pdf)，[[代码]](https:\u002F\u002Fgit.io\u002FJ1HPY)\n\n- (arXiv 2021.11) 预测、预防与评估：由预训练视觉-语言模型赋能的解耦式**文本驱动图像操纵**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13333.pdf)\n\n- (arXiv 2021.11) NomMer：在视觉Transformer中提名协同上下文以用于**视觉识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12994.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FNomMer1125\u002FNomMer)\n\n- (arXiv 2021.11) PolyViT：在**图像**、**视频**和**音频**上对视觉Transformer进行**联合训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12993.pdf)\n\n- (arXiv 2021.11) SWAT：标记内部及标记之间的空间结构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13677.pdf)\n\n- (arXiv 2021.11) 自适应**傅里叶**神经算子：用于Transformer的**高效**标记混合器，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13587.pdf)\n\n- 
(arXiv 2021.11) DyTox：具有动态标记扩展功能的用于**持续学习**的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11326.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Farthurdouillard\u002Fdytox)\n\n- (arXiv 2021.11) DABS：一种领域无关的**自监督学习基准测试**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12062.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Falextamkin\u002Fdabs)\n\n- (arXiv 2021.11) 通过Transformer进行冰球**球员识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11535.pdf)\n\n- (arXiv 2021.11) DBIA：针对Transformer网络的无数据后门注入**攻击**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11870.pdf)，[[代码]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FDBIA-825D)\n\n- (arXiv 2021.11) 用于**多模态**Transformer的稀疏融合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11992.pdf)\n\n- (arXiv 2021.11) PhysFormer：基于**面部视频的生理测量**，采用时差Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12082.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FZitongYu\u002FPhysFormer)\n\n- (arXiv 2021.11) 基于Transformer的人员**重识别**的自监督预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12084.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmichuanhaohao\u002FTransReID-SSL)\n\n- (arXiv 2021.11) 离散表示增强视觉Transformer的**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10493.pdf)\n\n- (arXiv 2021.11) TRAVLR：时隐时现！评估**视觉-语言推理**的跨模态迁移能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10756.pdf)\n\n- (arXiv 2021.11) 跨越文本与边界框的格式界限：迈向统一的**视觉-语言**建模，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12085.pdf)\n\n- (arXiv 2021.11) **半监督**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11067.pdf)\n\n- (arXiv 2021.11) CpT：用于3D**点云**处理的卷积点Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10866.pdf)\n\n- (arXiv 2021.11) 使用视觉Transformer实现零样本认证的**对抗补丁防御**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10481.pdf)\n\n- (arXiv 2021.11) PointMixer：用于**点云**理解的MLP-Mixer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11187.pdf)\n\n- (arXiv 2021.11) **MetaFormer** 才是视觉任务的真正所需，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11418.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fpoolformer)\n\n- (arXiv 2021.11) Florence：一种新的计算机视觉**基础模型**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11432.pdf)\n\n- (arXiv 2021.11) 使用视觉Transformer对**检测迁移学习**进行基准测试，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11429.pdf)\n\n- (arXiv 2021.11) 学习如何**组合视觉关系**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09297.pdf)，[[项目]](https:\u002F\u002Fcomposevisualrelations.github.io\u002F)\n\n- (arXiv 2021.11) 基于参考的**磁共振图像重建**，使用纹理Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09492.pdf)\n\n- (arXiv 2021.11) 引导、编辑、检索：面向**教学视频检索**的语言接地型多模态模式，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09276.pdf)\n\n- (arXiv 2021.11) **Swin Transformer V2**：提升容量与分辨率，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09883.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSwin-Transformer)\n\n- (arXiv 2021.11) SimMIM：一个简单的**掩码图像建模**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09886.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSimMIM)\n\n- (arXiv 2021.11) Restormer：高效的Transformer用于**高分辨率图像修复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09881.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fswz30\u002FRestormer)\n\n- (arXiv 2021.11) 简单而有效：将**CLIP**嵌入用于**具身AI**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09888.pdf)\n\n- (arXiv 2021.11) 
ClipCap：用于图像**字幕生成**的CLIP前缀，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09734.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Frmokady\u002FCLIP_prefix_caption)\n\n- (arXiv 2021.11) TransMix：为视觉Transformer提供注意力到**混合**的能力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09833.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FBeckschen\u002FTransMix)\n\n- (arXiv 2021.11) TRIG：基于Transformer的**文本识别器**，带有初始嵌入引导，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08314.pdf)\n\n- (arXiv 2021.11) 多粒度**视觉语言**预训练：使文本与视觉概念对齐，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08276.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzengyan-97\u002FX-VLM)\n\n- (arXiv 2021.11) 通过跨模态对比学习将语言接地于视觉，构建可解释的语义空间，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07180.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyizhen-zhang\u002FVG-Bert)\n\n- (arXiv 2021.11) 语义接地的对象匹配，用于鲁棒的**机器人场景重排**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07975.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fapplied-ai-lab\u002Fobject_matching)\n\n- (arXiv 2021.11) 使用**3D**表示进行**人员跟踪**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07868.pdf)，[[代码]](https:\u002F\u002Fbrjathu.github.io\u002FT3DP)\n\n- (arXiv 2021.11) LiT：通过锁定**图像**和**文本**微调实现零样本迁移，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07991.pdf)\n\n- (arXiv 2021.11) FILIP：细粒度交互式**语言-图像**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07783.pdf)\n\n- (arXiv 2021.11) 图关系Transformer：将**成对对象特征**融入Transformer架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.06075.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fderikclive\u002Ftransformers)\n\n- (arXiv 2021.11) **注意力机制**近似稀疏分布式记忆，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05498.pdf)\n\n- (arXiv 2021.11) 切片式**递归**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05297.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fszq0214\u002FSReT)\n\n- (arXiv 2021.11) 混合**BYOL-VIT**：处理**小数据集**的有效方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.04845.pdf)\n\n- (arXiv 2021.11) Tip-Adapter：无需训练的**CLIP**适配器，用于提升**视觉-语言**建模效果，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.03930.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgaopengcuhk\u002FTip-Adapter)\n\n- (arXiv 2021.11) 基于Transformer的令牌生成器提升**图像合成**的视觉质量，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.03481.pdf)\n\n- (arXiv 2021.11) Style**CLIP**Draw：在**文本到绘图合成**中耦合内容与风格，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.03133.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpschaldenbrand\u002FStyleCLIPDraw)\n\n- (arXiv 2021.11) 重访用于**组合性动作识别**的**时空**布局，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01936.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgorjanradevski\u002Frevisiting-spatial-temporal-layouts)\n\n- (arXiv 2021.11) PatchGame：在**指代游戏**中学习传递中层特征图信号，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01785.pdf)，[[代码]](https:\u002F\u002Fkampta.github.io\u002Fpatch-game)\n\n- (arXiv 2021.11) 视觉Transformer能否执行**卷积**操作？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01353.pdf)\n\n- (arXiv 2021.11) 使用Transformer进行牲畜监测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00801.pdf)\n\n- (arXiv 2021.11) 在时间上下文的帮助下：多模态**第一人称视角动作识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01024.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fekazakos\u002FMTCN)\n\n- (arXiv 2021.11) IconQA：一个新的抽象图表理解和**视觉语言推理**基准，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13214.pdf)，[[项目]](https:\u002F\u002Ficonqa.github.io\u002F)\n\n- (arXiv 2021.11) BoxeR：用于2D和3D 
Transformer的**框注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13087.pdf)\n\n- (arXiv 2021.11) VLDeformer：用于快速**跨模态检索**的**视觉-语言**分解Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11338.pdf)\n\n- (arXiv 2021.11) 多范围Transformer用于多人**3D运动预测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12073.pdf)，[[代码]](https:\u002F\u002Fjiashunwang.github.io\u002FMRT\u002F)\n\n- (arXiv 2021.11) 场景表示Transformer：通过集合潜变量场景表示实现无几何约束的**新视图合成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13152.pdf)，[[项目]](https:\u002F\u002Fsrt-paper.github.io\u002F)\n\n- (arXiv 2021.11) 通过超级令牌在视觉Transformer中进行**全局交互建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13156.pdf)\n\n- (arXiv 2021.11) ML-Decoder：可扩展且多功能的**分类头**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12933.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FAlibaba-MIIL\u002FML_Decoder)\n\n- (arXiv 2021.11) 通过双赢Transformer同时利用领域特定和不变知识，实现**无监督领域适应**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12941.pdf)\n\n- (arXiv 2021.11) SWINBERT：具有稀疏注意力的端到端Transformer，用于**视频字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13196.pdf)\n\n- (arXiv 2021.11) 摊销提示：用于**领域泛化**中**CLIP**的轻量级微调，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12853.pdf)\n\n- (arXiv 2021.11) 通用字幕生成器：通过内容-风格分离进行长尾**视觉-语言**模型训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12727.pdf)\n\n- (arXiv 2021.11) 在Transformer扩展中，**稀疏**就足够了，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12763.pdf)\n\n- (arXiv 2021.11) 使用CLIP实现“**猜谁**？”游戏，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00599.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FArnauDIMAI\u002FCLIP-GuessWho)\n\n- (arXiv 2021.11) HEAT：用于**结构化重建**的全息边缘注意力Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15143.pdf)\n\n- (arXiv 2021.11) 视觉Transformer的统一**剪枝**框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15127.pdf)\n\n- (arXiv 2021.11) 金字塔式**对抗训练**提升ViT性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15121.pdf)\n\n- (arXiv 2021.11) AssistSR：以可用性为中心、由问题驱动的**视频片段检索**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15050.pdf)，[[代码及数据]](https:\u002F\u002Fgithub.com\u002FStanLei52\u002FAQVSR)\n\n- (arXiv 2021.11) DAFormer：改进网络架构和训练策略，用于**领域自适应语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14887.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flhoyer\u002FDAFormer)\n\n- (arXiv 2021.11) AdaViT：用于**高效**图像识别的自适应视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15668.pdf)\n\n- (arXiv 2021.11) ATS：用于**高效**视觉Transformer的自适应令牌采样，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15667.pdf)\n\n- (arXiv 2021.11) **CLIP**与视频字幕生成器相遇：属性感知表征学习促进准确的**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15162.pdf)\n\n- (arXiv 2021.11) CRIS：基于**CLIP**的引用式图像**分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15174.pdf)\n\n- (arXiv 2021.11) 通过多尺度令牌聚合实现分流式**自注意力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15193.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOliverRensu\u002FShunted-Transformer)\n\n- (arXiv 2021.11) MC-SSL0.0：迈向多概念**自监督**学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.15340.pdf)\n\n- (arXiv 2021.11) TransWeather：基于Transformer的恶劣天气条件下退化图像**恢复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14813.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjeya-maria-jose\u002FTransWeather)\n\n- (arXiv 2021.11) 
搜索视觉Transformer的**搜索空间**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14725.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FCream)\n\n- (arXiv 2021.11) TransMVSNet：具有全局上下文感知的基于Transformer的**多视图立体**网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14600.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMegviiRobot\u002FTransMVSNet)\n\n- (arXiv 2021.11) 用于解决视觉**推理**问题的**循环**视觉Transformer，[[论文]]()\n\n- (arXiv 2021.11) **视频帧插值**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13817.pdf)\n\n- (arXiv 2021.11) FQ-ViT：完全**量化**且无需重新训练的视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13824.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flinyang-zhh\u002FFQ-ViT)\n\n- (arXiv 2021.11) LAFITE：迈向无语言指导的**文本到图像生成**训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13792.pdf)\n\n- (arXiv 2021.11) SPARSE DETR：具有可学习稀疏性的**高效**端到端目标**检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14330.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fsparse-detr)\n\n- (arXiv 2021.11) 基于多模态Transformer的端到端**引用视频对象分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14821.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmttr2021\u002FMTTR)\n\n- (arXiv 2021.11) Point-BERT：通过掩码点建模预训练3D**点云**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14819.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flulutang0608\u002FPoint-BERT)\n\n- (arXiv 2021.11) 零样本**图像到文本生成**用于视觉语义算术，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14447.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYoadTew\u002Fzero-shot-image-to-text)\n\n- (arXiv 2021.11) 用于**自然图像**的文本驱动编辑的混合扩散模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14818.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fomriav\u002Fblended-diffusion)\n\n- (arXiv 2021.11) Mask Transfiner：用于高质量**实例分割**的模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.13673.pdf)，[[代码]](http:\u002F\u002Fvis.xyz\u002Fpub\u002Ftransfiner)\n\n- (arXiv 2021.11) MHFormer：用于**3D人体姿态估计**的多假设Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12707.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FVegetebird\u002FMHFormer)\n\n- (arXiv 2021.11) PeCo：用于视觉Transformer的**BERT预训练**的感知码本，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12710.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPeCo)\n\n- (arXiv 2021.11) 解放Transformer：基于离散吸收扩散的并行标记预测，用于从向量量化编码快速生成**高分辨率图像**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12701.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsamb-t\u002Funleashing-transformers)\n\n- (arXiv 2021.11) 向标记化的**人体动态**表示迈进，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11433.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flikenneth\u002Facton)\n\n- (arXiv 2021.11) **自剪枝**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12624.pdf)\n\n- (arXiv 2021.11) VIOLET：带有掩码视觉标记建模的端到端**视频-语言**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12681.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ftsujuifu\u002Fpytorch_violet)\n\n- (arXiv 2021.11) 一种轻量级图Transformer网络，用于从2D人体姿态重建**人体网格**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12696.pdf)\n\n- (arXiv 2021.11) MorphMLP：一种无自注意力、类似**MLP**的图像和视频骨干网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12527.pdf)\n\n- (arXiv 2021.11) 八叉树Transformer：基于分层结构序列的自回归**3D形状生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12480.pdf)\n\n- (arXiv 2021.11) 用于**视频字幕生成**的分层模块化网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12476.pdf)\n\n- (arXiv 2021.11) 
NÜWA：用于神经视觉世界创造的**视觉合成预训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12417.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FNUWA)\n\n- (arXiv 2021.11) 图像块即波：相位感知视觉**MLP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12294.pdf)\n\n- (arXiv 2021.11) PTQ4ViT：视觉Transformer的**量化**后训练框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12293.pdf)\n\n- (arXiv 2021.11) PU-Transformer：**点云上采样**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12242.pdf)\n\n- (arXiv 2021.11) 扩展**视觉-语言预训练**以用于图像**字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12233.pdf)\n\n- (arXiv 2021.11) Cerberus Transformer：联合进行**语义、可用性和属性解析**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.12608.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FOPEN-AIR-SUN\u002FCerberus)\n\n- (arXiv 2021.11) 基于时空标记选择的高效**视频**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11591.pdf)\n\n- (arXiv 2021.11) RedCaps：由人民创建、为人民服务的网络精选**图像-文本数据**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.11431.pdf)，[[项目]](https:\u002F\u002Fredcaps.xyz\u002F)\n\n- (arXiv 2021.11) EMScore：通过粗粒度和细粒度嵌入匹配评估**视频字幕生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08919.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FShiYaya\u002Femscore)\n\n- (arXiv 2021.11) 用于**场景生成**的组合式Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08960.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdorarad\u002Fgansformer)\n\n- (arXiv 2021.11) Vis-TOP：视觉Transformer**叠加处理器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10957.pdf)\n\n- (arXiv 2021.11) 基于Transformer的**情境识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10135.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjhcho99\u002Fgsrtr)\n\n- (arXiv 2021.11) 在**小型模型**约束下重新思考视觉Transformer中的**查询、键和值**嵌入，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10017.pdf)\n\n- (arXiv 2021.11) UFO：用于**视觉-语言**表征学习的统一Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10023.pdf)\n\n- (arXiv 2021.11) 利用大规模视频转录推进高分辨率**视频-语言**表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10337.pdf)\n\n- (arXiv 2021.11) 零样本迁移学习的综合扩展，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.10050.pdf)\n\n- (arXiv 2021.11) 通过在补丁嵌入中使用PreLayerNorm提升视觉Transformer的**鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08413.pdf)\n\n- (arXiv 2021.11) IBOT：带有在线分词器的**IMAGE BERT预训练**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07832.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fibot)\n\n- (arXiv 2021.11) **掩码自编码器**是可扩展的视觉学习者，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.06377.pdf)\n\n- (arXiv 2021.11) 掩码引导的光谱域Transformer，用于高效的**高光谱图像重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.07910.pdf)\n\n- (arXiv 2021.11) Transformer是否比CNN更**鲁棒**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05464.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fytongbai\u002FViTs-vs-CNNs)\n\n- (arXiv 2021.11) CLIP2TV：关于基于Transformer的方法用于**视频-文本检索**的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05610.pdf)\n\n- (arXiv 2021.11) 具有可变长度记忆的多模态Transformer，用于**视觉-语言导航**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.05759.pdf)\n\n- (arXiv 2021.11) 
VLMO：具有多模态专家混合的统一**视觉-语言**预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02358.pdf)，[[代码]](https:\u002F\u002Faka.ms\u002Fvlmo)\n\n- (arXiv 2021.11) LAION-400M：由CLIP筛选的4亿对图像-文本的开放数据集，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02114.pdf)，[[项目]](https:\u002F\u002Flaion.ai\u002Flaion-400-open-dataset\u002F)\n\n- (arXiv 2021.11) 端到端视觉-语言Transformer训练的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02387.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fzdou0830\u002FMETER)\n\n- (arXiv 2021.11) HRViT：多尺度高分辨率视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.01236.pdf)\n\n\n\n### 2021年10月\n\n- (arXiv 2021.10) 基于注意力机制的视觉关键词检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15957.pdf)\n\n- (arXiv 2021.10) 通过分割交换学习共同分割以用于检索和发现，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15904.pdf)，[[数据与代码]](http:\u002F\u002Fimagine.enpc.fr\u002F~shenx\u002FSegSwap\u002F)\n\n- (arXiv 2021.10) 用于跨模态文本-视频检索的视觉时空关系增强网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15609.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FLionel-Hing\u002FVSR-Net)\n\n- (arXiv 2021.10) 用于无监督域适应的分散Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14944.pdf)\n\n- (arXiv 2021.10) Scatterbrain：统一稀疏与低秩注意力近似，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15343.pdf)\n\n- (arXiv 2021.10) 基于Transformer的3D目标跟踪，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14921.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002F3bobo\u002Flttr)\n\n- (arXiv 2021.10) 将抗锯齿技术融入视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15156.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Famazon-research\u002Fanti-aliasing-transformer)\n\n- (arXiv 2021.10) UltraPose：通过人体解耦3D模型合成包含10亿个点的密集姿态，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15267.pdf)，[[数据与代码]](https:\u002F\u002Fgithub.com\u002FMomoAILab\u002Fultrapose)\n\n- (arXiv 2021.10) SOAT：一种场景与对象感知的Transformer，用于视觉-语言导航，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.14143.pdf)\n\n- (arXiv 2021.10) 基于CNN-Transformer编码器-解码器网络的孟加拉语图像标题生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12442.pdf)\n\n- (arXiv 2021.10) 历史感知多模态Transformer用于视觉-语言导航，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13309.pdf)，[[项目]](https:\u002F\u002Fcshizhe.github.io\u002Fprojects\u002Fvln_hamt.html)\n\n- (arXiv 2021.10) TriBERT：面向全身的人体中心音频-视觉表征学习，用于视觉声音分离，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13412.pdf)\n\n- (arXiv 2021.10) TNTC：基于Transformer互补性的双流网络，用于基于步态的情绪识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13708.pdf)\n\n- (arXiv 2021.10) 基于自注意力的上下文相似性聚合用于视觉重排序，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13430.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMCC-WH\u002FCSA)\n\n- (arXiv 2021.10) IIP-Transformer：用于基于骨骼的动作识别的内部-外部部件Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13385.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fqtwang0035\u002FIIP-Transformer)\n\n- (arXiv 2021.10) 基于图像的CLIP引导本质迁移，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12427.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fhila-chefer\u002FTargetCLIP)\n\n- (arXiv 2021.10) Sinkformers：具有双重随机注意力的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11773.pdf)\n\n- (arXiv 2021.10) 
文盲版DALL·E学会创作，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11405.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fslate-autoencoder)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsinghgautam\u002Fslate)\n\n- (arXiv 2021.10) 通过深度特征工程学习文本-图像联合嵌入，以实现高效的跨模态检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11592.pdf)\n\n- (arXiv 2021.10) SOFT：无Softmax的线性复杂度Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11945.pdf)，[[代码]](https:\u002F\u002Ffudan-zvg.github.io\u002FSOFT)\n\n- (arXiv 2021.10) 面向人体姿态和形状估计的深度双流视频推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11680.pdf)\n\n- (arXiv 2021.10) 基于动态稀疏注意力的Transformer加速，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11299.pdf)\n\n- (arXiv 2021.10) CLOOB：现代霍普菲尔德网络结合InfoLoob性能超越CLIP，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11316.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fml-jku\u002Fcloob)\n\n- (arXiv 2021.10) 将视觉空间、语言和常识结构整合到故事可视化中，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10834.pdf)\n\n- (arXiv 2021.10) StructFormer：为新型物体的语言引导语义重组学习空间结构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10189.pdf)，[[项目]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fstructformer)\n\n- (arXiv 2021.10) Gophormer：用于节点分类的自我图Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13094.pdf)\n\n- (arXiv 2021.10) STRANSGAN：关于Transformer在GAN中应用的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13107.pdf)，[[代码]](https:\u002F\u002Fnbei.github.io\u002Fstransgan.html)\n\n- (arXiv 2021.10) MVT：用于3D物体识别的多视角视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.13083.pdf)\n\n- (arXiv 2021.10) DocTr：用于几何去畸变和光照校正的文档图像Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12942.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ffh2019ustc\u002FDocTr)\n\n- (arXiv 2021.10) WAV2CLIP：从CLIP中学习鲁棒的音频表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11499.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Flyrebird-wav2clip)\n\n- (arXiv 2021.10) AFTer-UNet：轴向融合Transformer UNet用于医学图像分割，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10403.pdf)\n\n- (arXiv 2021.10) AniFormer：基于Transformer的数据驱动3D动画，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10533.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmikecheninoulu\u002FAniFormer)\n\n- (arXiv 2021.10) 基于查询自适应Transformer的少样本时间动作定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.10552.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsauradip\u002FfewshotQAT)\n\n- (arXiv 2021.10) 3D-ANAS v2：在自动设计的卷积网络上嫁接Transformer模块，用于高光谱图像分类，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.11084.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxmm\u002F3D-ANAS-V2)\n\n- (arXiv 2021.10) CMTR：用于可见光-红外 **行人重识别** 的跨模态 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08994.pdf)\n\n- (arXiv 2021.10) 3D-RETR：基于 Transformer 的端到端 **单视图和多视图 3D 重建**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08861.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FFomalhautB\u002F3D-RETR)\n\n- (arXiv 2021.10) HRFormer：用于 **密集预测** 的 **高分辨率** Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09408.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHRNet\u002FHRFormer)\n\n- (arXiv 2021.10) 利用动作捕捉数据进行 
**人体网格恢复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09243.pdf)\n\n- (arXiv 2021.10) 一个好的 **提示词** 是否胜过数百万参数？面向 **视觉-语言** 模型的低资源提示词学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08484.pdf)\n\n- (arXiv 2021.10) ASFormer：用于 **动作分割** 的 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08568.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChinaYi\u002FASFormer)\n\n- (arXiv 2021.10) 多模态 **对话回复生成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08515.pdf)\n\n- (arXiv 2021.10) 通过编排多模态说明手册理解 **程序性知识**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08486.pdf)\n\n- (arXiv 2021.10) 组合式 **注意力机制**：解耦搜索与检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09419.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fsarthmit\u002FCompositional-Attention)\n\n- (arXiv 2021.10) 用于 3D **点云序列** 的时空 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09783.pdf)\n\n- (arXiv 2021.10) TransFusion：基于 Transformer 的跨视角融合方法，用于 **3D 人体姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09554.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FHowieMa\u002FTransFusion-Pose)\n\n- (arXiv 2021.10) 用于 **双向图像和文本生成** 的统一多模态 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09753.pdf)\n\n- (arXiv 2021.10) 基于 **高斯键** 混合的 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08678.pdf)\n\n- (arXiv 2021.10) DIFFUSIONCLIP：利用扩散模型进行 **文本引导的图像操控**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02711.pdf)\n\n- (arXiv 2021.10) 视觉 Transformer 和 MLP-Mixer 对抗 CNN 的 **鲁棒性** 比较，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02797.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fphibenz\u002Frobustness_comparison_vit_mlp-mixer_cnn)\n\n- (arXiv 2021.10) 用于视觉感知的涟漪注意力机制，具有 **次二次复杂度**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02453.pdf)\n\n- (arXiv 2021.10) 通过平滑化视觉 Transformer 实现认证的补丁 **鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07719.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMadryLab\u002Fsmoothed-vit)\n\n- (arXiv 2021.10) CLIP-Forge：迈向零样本 **文本到形状** 生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02624.pdf)\n\n- (arXiv 2021.10) 通过基于补丁的负向增强来理解和提升视觉 Transformer 的 **鲁棒性**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07858.pdf)\n\n- (arXiv 2021.10) 稀疏 MoE 遇上 **高效集成**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03360.pdf)\n\n- (arXiv 2021.10) 用于交流的绘画共享 **视觉表征**：不同 **偏见** 如何影响人类的可解释性和意图？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.08203.pdf)\n\n- (arXiv 2021.10) SignBERT：为 **手语识别** 预训练的手部模型感知表征，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05382.pdf)\n\n- (arXiv 2021.10) 在 **自监督** 视觉表征学习中，通过 Transformer 重振 CNN 注意力，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05340.pdf)\n\n- (arXiv 2021.10) 通过微调单个可训练模块来探究视觉 Transformer 和 CNN 的 **迁移学习能力**，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2110\u002F2110.05270.pdf)\n\n- (arXiv 2021.10) 监督无处不在：一种数据高效的对比 **语言-图像** 预训练范式，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05208.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSense-GVT\u002F)\n\n- (arXiv 2021.10) CLIP4Caption ++：用于 **视频字幕** 的多 CLIP 方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05204.pdf)\n\n- (arXiv 2021.10) 基于 Transformer 的双关系图用于 **多标签图像识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04722.pdf)\n\n- (arXiv 2021.10) 改进版 VQGAN 的 **矢量量化图像建模**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04627.pdf)\n\n- (arXiv 2021.10) 用于 **3D 人体姿态估计** 的自适应多视图和时序融合 
Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05092.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Flelexx\u002FMTF-Transformer)\n\n- (arXiv 2021.10) NVIT：视觉 Transformer 的 **压缩** 和 **参数再分配**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04869.pdf)\n\n- (arXiv 2021.10) 6D-ViT：基于 Transformer 的实例表征学习实现类别级 **6D 物体姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04792.pdf)\n\n- (arXiv 2021.10) CLIP-Adapter：通过特征适配器提升 **视觉-语言** 模型性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04544.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fgaopengcuhk\u002FCLIP-Adapter)\n\n- (arXiv 2021.10) ATISS：用于 **室内场景合成** 的自回归 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03675.pdf)，[[代码]](https:\u002F\u002Fnv-tlabs.github.io\u002FATISS)\n\n- (arXiv 2021.10) MOBILEVIT：轻量级、通用且 **移动端** 友好的视觉 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02178.pdf)\n\n- (arXiv 2021.10) 视觉 Transformer 中的 **标记池化**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03860.pdf)\n\n- (arXiv 2021.10) VIDT：高效且有效的全 Transformer 基础 **目标检测器**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.03921.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fnaver-ai\u002Fvidt)\n\n- (arXiv 2021.10) CLIP4Caption：用于 **视频字幕** 的 CLIP 方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06615.pdf)\n\n- (arXiv 2021.10) **对象**-区域 **视频** Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06915.pdf)，[[代码]](https:\u002F\u002Froeiherz.github.io\u002FORViT\u002F)\n\n- (arXiv 2021.10) 利用注意力中的 **冗余** 进行重用 Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06821.pdf)\n\n- (arXiv 2021.10) 使用神经解释器进行 **动态推理**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06399.pdf)\n\n- (arXiv 2021.10) 一种增强 CLIP 的 **视频-语言** 理解方法，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07137.pdf)\n\n- (arXiv 2021.10) 使用部件-求和 Transformer 结合复合查询进行 **视觉关系检测**，[[论文]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FDong_Visual_Relationship_Detection_Using_Part-and-Sum_Transformers_With_Composite_Queries_ICCV_2021_paper.pdf)\n\n- (arXiv 2021.10) 通过查询和多尺度检测发现人类与大词汇量物体的交互，[[论文]](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FWang_Discovering_Human_Interactions_With_Large-Vocabulary_Objects_via_Query_and_Multi-Scale_ICCV_2021_paper.pdf)\n\n- (arXiv 2021.10) 学习用于食谱生成和食物检索的结构化表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.01209.pdf)\n\n- (arXiv 2021.10) 来自ViT的免费午餐：用于细粒度视觉识别的自适应注意力多尺度融合Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2110\u002F2110.01240.pdf)\n\n### 2021年9月\n- (arXiv 2021.09) 从视频和文章中联合进行多媒体事件提取，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12776.pdf)\n\n- (arXiv 2021.09) 用于动态时空预测的长距离Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12218.pdf)\n\n- (arXiv 2021.09) 视觉接地的概念组合，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14115.pdf)\n\n- (arXiv 2021.09) CoSeg：受认知启发的无监督通用事件分割，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.15170.pdf)\n\n- (arXiv 2021.09) CCTrans：用Transformer简化并改进人群计数，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14483.pdf)\n\n- (arXiv 2021.09) UFO-ViT：高性能线性视觉Transformer，无需Softmax，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14382.pdf)\n\n- (arXiv 2021.09) 复杂背景下基于Transformer的红外小尺寸目标检测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14379.pdf)\n\n- (arXiv 2021.09) 
使用自监督Transformer在无标签情况下定位物体，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14279.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FLOST)\n\n- (arXiv 2021.09) 几何纠缠的视觉语义Transformer用于图像字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14137.pdf)\n\n- (arXiv 2021.09) VideoCLIP：用于零样本视频文本理解的对比预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.14084.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ffairseq\u002Fexamples\u002FMMPT)\n\n- (arXiv 2021.09) 微调视觉Transformer以预测伊辛模型中的状态变量，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.13925.pdf)\n\n- (arXiv 2021.09) CLIP-It！语言引导的视频摘要，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.00650.pdf)，[[项目]](https:\u002F\u002Fmedhini.github.io\u002Fclip_it)\n\n- (arXiv 2021.09) MFEVIT：一种鲁棒轻量级基于Transformer的多模态2D+3D面部表情识别网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.13086.pdf)\n\n- (arXiv 2021.09) 稀疏空间Transformer用于少样本学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12932.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fchenhaoxing\u002FSSFormers)\n\n- (arXiv 2021.09) 视觉Transformer哈希用于图像检索，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12564.pdf)\n\n- (arXiv 2021.09) PETA：使用Transformer注意力进行相册事件识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12499.pdf)\n\n- (arXiv 2021.09) MLIM：结合掩码语言和图像建模的视觉-语言模型预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12178.pdf)\n\n- (arXiv 2021.09) 密集对比视觉-语言预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11778.pdf)\n\n- (arXiv 2021.09) CPT：为预训练视觉-语言模型提供的彩色提示调整，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11797.pdf)\n\n- (arXiv 2021.09) 定位∞形鱼类：野外草图引导的目标定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11874.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpriba\u002Fsgol_wild)\n\n- (arXiv 2021.09) CLIPORT：用于机器人操作的“是什么”和“在哪里”路径，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.12098.pdf)，[[项目]](https:\u002F\u002Fcliport.github.io\u002F)，[[代码]](https:\u002F\u002Fgithub.com\u002Fcliport\u002Fcliport)\n\n- (arXiv 2021.09) GraFormer：用于3D姿态估计的图卷积Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08364.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGraformer\u002FGraFormer)\n\n- (arXiv 2021.09) 带有视觉接地的多模态增量Transformer用于视觉对话生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08478.pdf)\n\n- (arXiv 2021.09) 表情片段Transformer用于稳健的基于视频的面部表情识别，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08409.pdf)，[[代码]](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FATSE-C58B)\n\n- (arXiv 2021.09) LOTR：使用定位Transformer进行人脸关键点定位，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10057.pdf)\n\n- (arXiv 2021.09) Dyadformer：一种用于二元互动长程建模的多模态Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F2109\u002F2109.09487.pdf)\n\n- (arXiv 2021.09) SDTP：面向密集图像预测的语义感知解耦Transformer金字塔，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08963.pdf)\n\n- (arXiv 2021.09) KD-VLP：通过对象知识蒸馏改进端到端视觉-语言预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10504.pdf)\n\n- (arXiv 2021.09) T6D-Direct：用于多目标6D姿态直接回归的Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10948.pdf)\n\n- (arXiv 2021.09) OH-Former：用于人员再识别的全关系高阶Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11159.pdf)\n\n- (arXiv 2021.09) PIX2SEQ：一个用于目标检测的语言建模框架，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.10852.pdf)\n\n- (arXiv 2021.09) ActionCLIP：视频动作识别的新范式，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08472.pdf)\n\n- (arXiv 2021.09) BGT-Net：用于场景图生成的双向GRU 
Transformer网络，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.05346.pdf)\n\n- (arXiv 2021.09) 神经人表演者：学习可泛化的辐射场以实现人体表演渲染，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07448.pdf)，[[代码]](https:\u002F\u002Fyoungjoongunc.github.io\u002Fnhp\u002F)\n\n- (arXiv 2021.09) 锚定DETR：基于Transformer的检测器的查询设计，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07107.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmegvii-model\u002FAnchorDETR)\n\n- (arXiv 2021.09) 用于3D目标检测的端到端Transformer模型，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08141.pdf)，[[代码]](https:\u002F\u002Ffacebookresearch.github.io\u002F3detr)\n\n- (arXiv 2021.09) 混合局部-全局Transformer用于图像去雾，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07100.pdf)\n\n- (arXiv 2021.09) 半监督多尺度Transformer用于广角人像校正，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.08024.pdf)\n\n- (arXiv 2021.09) 带有几何一致对象的标签注意力Transformer用于图像字幕生成，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07799.pdf)\n\n- (arXiv 2021.09) 姿势Transformer（POTR）：使用非自回归Transformer进行人体运动预测，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07531.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fidiap\u002Fpotr)\n\n- (arXiv 2021.09) PnP-DETR：迈向基于Transformer的**高效**视觉分析，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07036.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Ftwangnh\u002Fpnp-detr)\n\n- (arXiv 2021.09) 学习为视觉对话中的视觉对象进行**接地**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.06013.pdf)\n\n- (arXiv 2021.09) 关于设计用于**视频接地**的多模态Transformer的研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.06085.pdf)，[[代码]](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmengcao\u002Fpublication\u002Fgtr)\n\n- (arXiv 2021.09) CDTrans：用于**无监督域适应**的跨域Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.06165.pdf)\n\n- (arXiv 2021.09) 注意力是否优于**矩阵分解**？[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04553.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FGsunshine\u002FEnjoy-Hamburger)\n\n- (arXiv 2021.09) 具有多模态交互的时序金字塔Transformer用于**视频问答**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04735.pdf)\n\n- (arXiv 2021.09) 线条作为视觉句子：用于视觉定位的上下文感知**线条描述符**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04753.pdf)\n\n- (arXiv 2021.09) 负样本很重要：面向**时序接地**的距离度量学习的复兴，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04872.pdf)\n\n- (arXiv 2021.09) LAViTeR：在图像和字幕生成辅助下学习对齐的**视觉与文本**表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04993.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fmshaikh2\u002FLaViTeR)\n\n- (arXiv 2021.09) 全景叙事**接地**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04988.pdf)\n\n- (arXiv 2021.09) GPT-3在少样本知识型**VQA**中的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.05014.pdf)\n\n- (arXiv 2021.09) PlaTe：在程序化任务中使用Transformer进行**视觉接地规划**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04869.pdf)，[[项目]](https:\u002F\u002Fwww.pair.toronto.edu\u002Fplate-planner\u002F)\n\n- (arXiv 2021.09) **EfficientCLIP**：通过集成置信学习和语言建模实现高效的跨模态预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04699.pdf)\n\n- (arXiv 2021.09) **缩放ReLU**对**训练**视觉Transformer至关重要，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.03810.pdf)\n\n- (arXiv 2021.09) FuseFormer：在Transformer中融合细粒度信息以用于**视频修复**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02974.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fruiliu-ai\u002FFuseFormer)\n\n- (arXiv 2021.09) GCsT：用于**动作识别**的图卷积骨架Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02860.pdf)\n\n- (arXiv 2021.09) 
WHYACT：在生活方式**vlog**中识别**动作原因**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02747.pdf)\n\n- (arXiv 2021.09) 通过扩展**CLIP**实现零样本**开放集检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02748.pdf)\n\n- (arXiv 2021.09) 向视觉Transformer发起可迁移的**对抗攻击**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04176.pdf)\n\n- (arXiv 2021.09) 学习为**视觉-语言**模型生成**提示词**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.01134.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FKaiyangZhou\u002FCoOp)\n\n- (arXiv 2021.09) 通过多流语料库对齐和双Softmax损失提升**视频-文本检索**性能，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04290.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fstarmemda\u002FCAMoW\u002F)\n\n- (arXiv 2021.09) UCTransNet：从通道视角出发，利用Transformer重新思考**U-Net中的跳跃连接**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04335.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FMcGregorWwww\u002FUCTransNet)\n\n- (arXiv 2021.09) ConvMLP：用于视觉的分层卷积**MLP**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04454.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FConvolutional-MLPs)\n\n- (arXiv 2021.09) TxT：基于Transformer的**跨模态**端到端学习，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04422.pdf)\n\n- (arXiv 2021.09) 视觉与语言，还是视觉为语言？关于多模态Transformer中的**跨模态影响**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04448.pdf)\n\n- (arXiv 2021.09) **稀疏**-MLP：一种具有条件计算的全**MLP**架构，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02008.pdf)\n\n- (arXiv 2021.09) SORNet：用于序列化**操作**的空间对象中心表示，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.03891.pdf)，[[项目]](https:\u002F\u002Fwentaoyuan.github.io\u002Fsornet)\n\n- (arXiv 2021.09) 基于音频-视觉Transformer的**人群计数**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.01926.pdf)\n\n- (arXiv 2021.09) 面向**视觉问答**的弱监督相对空间推理，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.01934.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fpratyay-banerjee\u002Fweak_sup_vqa)\n\n- (arXiv 2021.09) FUSFORMER：一种基于Transformer的融合方法，用于高光谱图像的**超分辨率**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02079.pdf)\n\n- (arXiv 2021.09) CTRL-C：带有线条分类功能的**相机标定**Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02259.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fjwlee-vcl\u002FCTRL-C)\n\n- (arXiv 2021.09) 学习从自然语言监督中生成**场景图**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02227.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FYiwuZhong\u002FSGG_from_NLS)\n\n- (arXiv 2021.09) 动画Transformer：通过片段匹配实现视觉**对应关系**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02614.pdf)\n\n- (arXiv 2021.09) 用于**3D目标检测**的体素Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02497.pdf)\n\n- (ICCV 2021.09) 使用Transformer从单张图像估计**3D人体纹理**，[[论文]](http:\u002F\u002Fpersonal.ie.cuhk.edu.hk\u002F~ccloy\u002Ffiles\u002Ficcv_2021_texformer.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxuxy09\u002FTexformer)\n\n- (arXiv 2021.09) 具有多层次注意力的编码器-解码器用于**3D人体形状和姿态估计**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.02303.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fziniuwan\u002Fmaed)\n\n- (arXiv 2021.09) 用于**语义特征对应**的联合图学习与匹配，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.00240.pdf)\n\n- (arXiv 2021.09) 寻找**高效**的多阶段视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.00642.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fyilunliao\u002Fvit-search)\n\n### 2021年8月\n- (arXiv 2021.08) SIGN：融入空间信息的生成网络，用于**广义零样本语义分割**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.12517.pdf)\n\n- (arXiv 2021.08) 
GroupFormer：基于聚类时空Transformer的**群体活动识别**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.12630.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002Fxueyee\u002FGroupFormer)\n\n- (arXiv 2021.08) **网络结构之争**：CNN、Transformer和MLP的实证研究，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.13002.pdf)\n\n- (arXiv 2021.08) 探索与改进**移动端**视觉Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.13015.pdf)\n\n- (arXiv 2021.08) 基于集合学习的跨类别**视频精彩片段检测**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11770.pdf)，[[代码]](https:\u002F\u002Fgithub.com\u002FChrisAllenMing\u002FCross_Category_Video_Highlight)\n\n- (arXiv 2021.08) 用于**时空**表征学习的移位分块Transformer，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11575.pdf)\n\n- (arXiv 2021.08) SASRA：面向连续环境中**视觉-语言导航**的语义感知时空推理智能体，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11945.pdf)\n\n- (arXiv 2021.08) LocTex：从局部化文本监督中学习**数据高效**的视觉**表征**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11950.pdf)，[[项目]](https:\u002F\u002Floctex.mit.edu\u002F)\n\n- (arXiv 2021.08) 引导查询位置并执行相似注意力机制，用于基于Transformer的**检测**头，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.09691.pdf)\n\n- (arXiv 2021.08) SIMVLM：利用弱监督进行的简单**视觉语言**模型预训练，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.10904.pdf)\n\n- (arXiv 2021.08) TransFER：使用Transformer学习关系感知的**面部表情表征**，[[论文]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11116)","# Transformer-in-Vision 快速上手指南\n\n**工具简介**：\n`Transformer-in-Vision` 并非一个单一的代码库或可安装软件包，而是一个**精选（curated）的资源列表**。它汇集了基于 Transformer 架构的计算机视觉（CV）前沿论文、开源代码库、数据集以及综述文章。本指南将指导开发者如何利用该列表快速定位所需资源，并搭建基础的 Vision Transformer (ViT) 开发环境。\n\n## 1. 环境准备\n\n由于该列表涵盖的项目多基于 PyTorch 或 TensorFlow\u002FJAX，建议准备以下基础环境：\n\n*   **操作系统**：Linux (Ubuntu 20.04+) 或 macOS (M1\u002FM2\u002FIntel)。Windows 用户建议使用 WSL2。\n*   **Python 版本**：3.8 - 3.10 (推荐 3.9，兼容性最佳)。\n*   **硬件要求**：\n    *   训练\u002F微调：推荐 NVIDIA GPU (显存 ≥ 8GB)，支持 CUDA 11.7+。\n    *   推理\u002F学习：CPU 即可运行部分轻量级示例，但建议使用 GPU 以获得流畅体验。\n*   **前置依赖**：\n    *   Git (用于克隆代码库)\n    *   pip 或 conda (包管理工具)\n\n## 2. 安装步骤\n\n由于 `Transformer-in-Vision` 是资源索引，**无需安装该列表本身**。你需要根据列表中感兴趣的具体项目（如 ViT, CLIP, DETR 等）进行安装。\n\n以下是搭建通用 Vision Transformer 开发环境的标准步骤，以目前最主流的 **Hugging Face Transformers** 库为例（生态成熟、文档完善）：\n\n### 步骤 2.1: 创建虚拟环境\n```bash\nconda create -n vit-env python=3.9\nconda activate vit-env\n```\n\n### 步骤 2.2: 安装深度学习框架\n使用 PyTorch 官方索引源安装 PyTorch 和 torchvision（下方命令对应 CUDA 11.8 版本）：\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n# 若无需 GPU 支持，可使用 CPU 版本:\n# pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\n```\n\n### 步骤 2.3: 安装核心 Transformer 库\n安装 `transformers` 及相关视觉处理库（如需加速，可选用清华等国内镜像源）：\n```bash\npip install transformers accelerate timm pillow -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n*注：`timm` (PyTorch Image Models) 是列表中多次提及的重要库，包含大量预训练的 ViT 模型。*\n\n### 步骤 2.4: 获取特定项目代码\n从列表中选择一个具体项目（例如 `CLIP` 或 `DETR`），直接克隆其官方仓库。例如克隆 OpenAI 的 CLIP：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\ncd CLIP\npip install -e .\n```\n\n## 3. 基本使用\n\n以下示例展示如何使用已安装的 `transformers` 库，加载一个预训练的 Vision Transformer 模型进行图像分类。这是理解列表中大多数 CV 模型的基础用法。\n\n### 3.1 最简单的使用示例 (图像分类)\n\n创建一个名为 `quick_start.py` 的文件，写入以下代码：\n\n```python\nimport torch\nfrom PIL import Image\nimport requests\nfrom transformers import ViTImageProcessor, ViTForImageClassification\n\n# 1. 
加载预处理器和模型 (自动下载权重，首次运行需联网)\n# 这里使用的是 google\u002Fvit-base-patch16-224，源自列表中提到的 ViT 系列\nmodel_name = \"google\u002Fvit-base-patch16-224\"\nprocessor = ViTImageProcessor.from_pretrained(model_name)\nmodel = ViTForImageClassification.from_pretrained(model_name)\n\n# 2. 准备输入图像 (从网络加载示例图片)\nurl = \"http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000000039.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# 3. 预处理图像\ninputs = processor(images=image, return_tensors=\"pt\")\n\n# 4. 推理 (Inference)\nwith torch.no_grad():\n    outputs = model(**inputs)\n    logits = outputs.logits\n\n# 5. 获取预测结果\npredicted_class_idx = logits.argmax(-1).item()\nclass_label = model.config.id2label[predicted_class_idx]\n\nprint(f\"预测类别: {class_label}\")\nprint(f\"置信度得分: {logits[0][predicted_class_idx].item():.4f}\")\n```\n\n### 3.2 运行示例\n在终端执行：\n```bash\npython quick_start.py\n```\n\n### 3.3 进阶探索\n根据 `Transformer-in-Vision` 列表中的 **Recent Papers** 或 **Resource** 部分：\n*   **多模态任务**：参考 `CLIP` 或 `LAVIS` 章节，尝试图文检索。\n*   **目标检测**：参考 `KS-DETR` 或 `Detection Transformer` 相关论文，克隆对应 GitHub 仓库运行检测 demo。\n*   **视频理解**：参考 `Video Transformers` 综述，寻找对应的视频动作识别代码库。\n\n通过此流程，你已成功利用该资源列表构建了基础环境并运行了第一个 Vision Transformer 模型。后续可根据列表中的论文链接深入研读算法细节。","某自动驾驶初创团队正在研发新一代感知系统，急需整合摄像头、激光雷达等多源传感器数据以提升复杂路况下的识别准确率。\n\n### 没有 Transformer-in-Vision 时\n- **技术选型迷茫**：面对海量分散的论文与代码库，工程师难以快速定位如 CLIP、DALL·E 2 或专用传感器融合（Sensor Fusion）等前沿模型的最新实现。\n- **多模态融合困难**：传统 CNN 架构在处理视觉与文本联合学习（如路标识别结合导航指令）时效果受限，缺乏成熟的 Vision-Language 预训练方案参考。\n- **复现成本高昂**：从零复现 Imagen Video 或 Phenaki 等视频生成模型需耗费数周调试环境，且容易因版本过时而失败。\n- **创新方向模糊**：团队难以洞察从纯视觉 Transformer 向“大语言模型 + 视觉”（LLM-in-Vision）演进的技术趋势，导致产品规划滞后。\n\n### 使用 Transformer-in-Vision 后\n- **资源一站获取**：直接利用整理好的清单，快速访问 LAVIS、Stable Diffusion 及自动驾驶传感器融合的官方代码与论文，将调研时间从数天缩短至几小时。\n- **架构升级顺畅**：参考列表中成熟的 V-L 联合学习研究（如 METER、Kaleido-BERT），迅速构建出能理解自然语言指令的高精度感知模块。\n- **开发效率倍增**：基于提供的 Hugging Face 及 PyTorch 教程链接，团队快速搭建了视频预测原型，避免了重复造轮子的底层陷阱。\n- **紧跟技术前沿**：通过追踪列表中新增的 LLM-in-Vision 方向，及时调整技术路线，确保系统具备接入通用大模型能力的扩展性。\n\nTransformer-in-Vision 通过聚合全球最前沿的视觉 Transformer 资源，帮助研发团队打破信息壁垒，大幅加速了从理论验证到落地部署的全过程。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDirtyHarryLYL_Transformer-in-Vision_d6885a15.png","DirtyHarryLYL","Yong-Lu Li","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FDirtyHarryLYL_ddf533a9.jpg","Associate Professor at SJTU,\r\nML CV Rob","@MVIG-SJTU","Shanghai","yonglu_li@sjtu.edu.cn","DirtyHa96142091","https:\u002F\u002Fdirtyharrylyl.github.io\u002F","https:\u002F\u002Fgithub.com\u002FDirtyHarryLYL",null,1339,142,"2026-04-01T19:13:24",5,"","未说明",{"notes":91,"python":89,"dependencies":92},"该仓库并非一个具体的可运行软件工具，而是一个关于视觉领域 Transformer 相关论文、代码库和资源的文章列表（Awesome List）。README 中列出的项目（如 CLIP, DALL-E, Stable Diffusion 等）各自拥有独立的代码仓库和环境需求。用户若需运行其中提到的具体模型，请访问 README 中提供的对应链接以获取详细的安装和运行说明。",[],[14,15,35],[95,96,97,98,99,100,101,102],"transformer","vision-transformers","computer-vision","self-attention","multi-modal","visual-language","deep-learning","paper","2026-03-27T02:49:30.150509","2026-04-12T03:23:30.180559",[],[]]
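
**补充示例（非官方草稿，仅供参考）**：上文《快速上手指南》的"进阶探索"部分提到可参考 CLIP 做多模态图文检索。下面给出一个最小化的零样本图文匹配示例草稿，假设你已按前述步骤安装了 `torch` 与 `transformers`；其中的候选文本、示例图片 URL 以及 `openai/clip-vit-base-patch32` 权重名称均为演示用选择，实际使用时可替换为自己的数据与模型。

```python
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# 加载 CLIP 模型与预处理器（首次运行会从 Hugging Face Hub 下载权重）
model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# 示例图片：沿用快速上手指南中的 COCO 验证集图片
url = "http://images.cocodataset.org/val2017/000000000039.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# 候选文本描述（仅作演示，可替换为任意句子）
texts = ["a photo of a cat", "a photo of a dog", "a photo of a vase with flowers"]

# 预处理：同时编码图像与文本
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# 推理：logits_per_image 为该图像与每条文本的相似度得分
with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)[0]

# 打印每条文本的匹配概率
for text, p in zip(texts, probs.tolist()):
    print(f"{text}: {p:.4f}")
```

该草稿展示的是"以图打分多条文本"的检索方向；若要做"以文本检索图像"，可改用 `outputs.logits_per_text` 并在图像维度上做 softmax。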