[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-louisfb01--best_AI_papers_2022":3,"tool-louisfb01--best_AI_papers_2022":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",146793,2,"2026-04-08T23:32:35",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":78,"stars":82,"forks":83,"last_commit_at":84,"license":85,"difficulty_score":86,"env_os":87,"env_gpu":88,"env_ram":88,"env_deps":89,"category_tags":92,"github_topics":94,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":112,"updated_at":113,"faqs":114,"releases":115},5758,"louisfb01\u002Fbest_AI_papers_2022","best_AI_papers_2022","A curated list of the latest breakthroughs in AI (in 2022) by release date with a clear video explanation, link to a more in-depth article, and code.","best_AI_papers_2022 是一份精心整理的 2022 年人工智能领域突破性论文清单。它按发布时间顺序，收录了当年最具影响力的研究成果，涵盖图像生成、语言模型、3D 神经渲染及多模态对齐等前沿方向。\n\n面对 AI 技术飞速迭代、新论文层出不穷的现状，研究人员和开发者往往难以高效筛选出真正有价值的成果并快速理解其核心逻辑。best_AI_papers_2022 
通过为每篇论文提供清晰的视频讲解、深度文章链接以及可运行的代码资源，极大地降低了学习门槛，帮助用户在掌握理论的同时也能动手实践。此外，清单还特别关注了伦理、偏见治理等关键议题，引导用户审慎思考技术的应用边界。\n\n这份资源非常适合 AI 研究人员、工程师、学生以及对技术趋势感兴趣的设计师使用。无论是希望紧跟学术前沿的学者，还是寻找灵感与代码实现的开发人员，都能从中获益。其独特的“视频 + 文章 + 代码”三位一体呈现方式，让复杂的算法原理变得通俗易懂，是回顾 2022 年 AI 发展历程、构建系统化知识体系的理想指南。","# 2022: A Year Full of Amazing AI Papers - A Review 🚀\n## A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.\n\nWhile the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, such as ethics, bias, governance, transparency and much more. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technologies we choose to apply.\n\n>\"Science cannot tell us what we ought to do, only what we can do.\"\u003Cbr\u002F>- Jean-Paul Sartre, Being and Nothingness\n\nHere's a curated list of the latest breakthroughs in AI and Data Science by release date with a clear video explanation, link to a more in-depth article, and code (if applicable). 
Enjoy the read!\n\n**The complete reference to each paper is listed at the end of this repository.** *Star this repository to stay up to date and stay tuned for next year!* ⭐️\n\nMaintainer: [louisfb01](https:\u002F\u002Fgithub.com\u002Flouisfb01), also active on [YouTube](https:\u002F\u002Fwww.youtube.com\u002F@whatsai) and as a [Podcaster](https:\u002F\u002Fopen.spotify.com\u002Fshow\u002F4rKRJXaXlClkDyInjHkxq3) if you want to see\u002Fhear more about AI!\n\n[![Twitter](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttps\u002Ftwitter.com\u002Fcloudposse.svg?style=social&label=Follow%20%40whats_ai)](https:\u002F\u002Ftwitter.com\u002FWhats_AI)\n\n\nSubscribe to my [newsletter](https:\u002F\u002Flouisbouchard.substack.com\u002F) - The latest updates in AI explained every week.\n\n\n*Feel free to [message me](https:\u002F\u002Fwww.louisbouchard.ai\u002Fcontact\u002F) about any interesting paper I may have missed, so I can add it to this repository.*\n\n*Tag me on **Twitter** [@Whats_AI](https:\u002F\u002Ftwitter.com\u002FWhats_AI) or **LinkedIn** [@Louis (What's AI) Bouchard](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fwhats-ai\u002F) if you share the list!* And come chat with us in our [Learn AI Together Discord community](https:\u002F\u002Fwww.louisbouchard.ai\u002Flearn-ai-together\u002F)!\n\n👀 **If you'd like to support my work**, you can choose to [Sponsor](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Flouisfb01) this repository or support me on [Patreon](https:\u002F\u002Fwww.patreon.com\u002Fwhatsai).\n\n### Watch a complete 2022 rewind in 8 minutes\n\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FMGt3APx.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FtYRTzWHOQio)\n\n----\n\n## The Full List\n- [Resolution-robust Large Mask Inpainting with Fourier Convolutions [1]](#1)\n- [Stitch it in Time: GAN-Based Facial Editing of Real Videos [2]](#2)\n- [NeROIC: Neural Rendering of Objects from Online Image Collections [3]](#3)\n- [SpeechPainter: 
Text-conditioned Speech Inpainting [4]](#4)\n- [Towards real-world blind face restoration with generative facial prior [5]](#5)\n- [4D-Net for Learned Multi-Modal Alignment [6]](#6)\n- [Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [7]](#7)\n- [Hierarchical Text-Conditional Image Generation with CLIP Latents [8]](#8)\n- [MyStyle: A Personalized Generative Prior [9]](#9)\n- [OPT: Open Pre-trained Transformer Language Models [10]](#10)\n- [BlobGAN: Spatially Disentangled Scene Representations [11]](#11)\n- [A Generalist Agent [12]](#12)\n- [Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [13]](#13)\n- [Dalle mini [14]](#14)\n- [No Language Left Behind: Scaling Human-Centered Machine Translation [15]](#15)\n- [Dual-Shutter Optical Vibration Sensing [16]](#16)\n- [Make-a-scene: Scene-based text-to-image generation with human priors [17]](#17)\n- [BANMo: Building Animatable 3D Neural Models from Many Casual Videos [18]](#18)\n- [High-resolution image synthesis with latent diffusion models [19]](#19)\n- [Panoptic Scene Graph Generation [20]](#20)\n- [An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion [21]](#21)\n- [Expanding Language-Image Pretrained Models for General Video Recognition [22]](#22)\n- [MAKE-A-VIDEO: TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA [23]](#23)\n- [Robust Speech Recognition via Large-Scale Weak Supervision [24]](#24)\n- [DreamFusion: Text-to-3D using 2D Diffusion [25]](#25)\n- [Imagic: Text-Based Real Image Editing with Diffusion Models [26]](#26)\n- [eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [27]](#27)\n- [InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images [28]](#28)\n- [Galactica: A Large Language Model for Science [29]](#29)\n- [Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition [30]](#30)\n- [ChatGPT: Optimizing Language Models for 
Dialogue [31]](#31)\n- [Production-Ready Face Re-Aging for Visual Effects [32]](#32)\n- [Paper references](#references)\n\n---\n\n## Resolution-robust Large Mask Inpainting with Fourier Convolutions [1]\u003Ca name=\"1\">\u003C\u002Fa>\nYou’ve most certainly experienced this situation once: You take a great picture with your friend, and someone is photobombing behind you, ruining your future Instagram post. Well, that’s no longer an issue. Whether it is a person or a trashcan you forgot to remove before taking your selfie that’s ruining your picture, this AI will just automatically remove the undesired object or person in the image and save your post. It’s just like having a professional Photoshop designer in your pocket, with a simple click!\n\nThis task of removing part of an image and replacing it with what should appear behind it has been tackled by many AI researchers for a long time. It is called image inpainting, and it’s extremely challenging...\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Fd5ClyqP.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FIa79AvGzveQ)\n* Short read: [This AI Removes Unwanted Objects From your Images!](https:\u002F\u002Fwww.louisbouchard.ai\u002Flama\u002F)\n* Paper: [Resolution-robust Large Mask Inpainting with Fourier Convolutions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07161.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Fsaic-mdal\u002Flama)\n* [Colab Demo](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsaic-mdal\u002Flama\u002Fblob\u002Fmaster\u002Fcolab\u002FLaMa_inpainting.ipynb)\n* [Product using LaMa](https:\u002F\u002Fcleanup.pictures\u002F)\n\n\n## Stitch it in Time: GAN-Based Facial Editing of Real Videos [2]\u003Ca name=\"2\">\u003C\u002Fa>\nYou've most certainly seen movies like the recent Captain Marvel or Gemini Man where Samuel L. Jackson and Will Smith appeared to look like they were much younger. 
This requires hundreds if not thousands of hours of work from professionals manually editing the scenes they appeared in.\nInstead, you could use a simple AI and do it within a few minutes. Indeed, many techniques allow you to add smiles or make someone look younger or older, all automatically using AI-based algorithms. This is called AI-based face manipulation in videos, and here's the current state-of-the-art in 2022!\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FlvgMjzS.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FmqItu9XoUgk)\n* Short read: [AI Facial Editing of Real Videos ! Stitch it in Time Explained](https:\u002F\u002Fwww.louisbouchard.ai\u002Fstitch-it-in-time\u002F)\n* Paper: [Stitch it in Time: GAN-Based Facial Editing of Real Videos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.08361)\n* [Code](https:\u002F\u002Fgithub.com\u002Frotemtzaban\u002FSTIT)\n\n\n## NeROIC: Neural Rendering of Objects from Online Image Collections [3]\u003Ca name=\"3\">\u003C\u002Fa>\nNeural rendering is the ability to generate a photorealistic model in space, just like this one, from pictures of the object, person, or scene of interest. In this case, you’d have a handful of pictures of this sculpture and ask the machine to understand what the object in these pictures should look like in space. You are basically asking a machine to understand physics and shapes from images. This is quite easy for us since we only know the real world and its depths, but it’s a whole other challenge for a machine that only sees pixels.\nIt’s great that the generated model looks accurate with realistic shapes, but what about how it blends into the new scene? And what if the lighting conditions vary across the pictures taken, so that the generated model looks different depending on the angle you look at it? This would automatically seem weird and unrealistic to us. 
These are the challenges Snapchat and the University of Southern California attacked in this new research.\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FxTpuwcN.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F88Pl9zD1Z78)\n* Short read: [Create Realistic 3D Renderings with AI !](https:\u002F\u002Fwww.louisbouchard.ai\u002Fneroic\u002F)\n* Paper: [NeROIC: Neural Rendering of Objects from Online Image Collections](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02533.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FNeROIC)\n\n\n## SpeechPainter: Text-conditioned Speech Inpainting [4]\u003Ca name=\"4\">\u003C\u002Fa>\nWe’ve seen image inpainting, which aims to remove an undesirable object from a picture. The machine learning-based techniques do not simply remove the objects, but they also understand the picture and fill the missing parts of the image with what the background should look like.\nThe recent advancements are incredible, just like the results, and this inpainting task can be quite useful for many applications like advertisements or improving your future Instagram post. We also covered an even more challenging task: video inpainting, where the same process is applied to videos to remove objects or people.\n\nThe challenge with videos comes with staying consistent from frame to frame without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work.\n\nThis is where a task I never covered on my channel comes in: speech inpainting. You heard it right, researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive. Okay, we might rather hear than see the results, but you get the point. It can correct your grammar, pronunciation or even remove background noise. 
All things I definitely need to keep working on, or… simply use their new model… Listen to the examples in my video!\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FJyQ41Qv.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FzIIc4bRf5Hg)\n* Short read: [Speech Inpainting with AI !](https:\u002F\u002Fwww.louisbouchard.ai\u002Fspeech-inpainting-with-ai\u002F)\n* Paper: [SpeechPainter: Text-conditioned Speech Inpainting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07273.pdf)\n* [Listen to more examples](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Fspeechpainter\u002Fexamples\u002F)\n\n\n## Towards real-world blind face restoration with generative facial prior [5]\u003Ca name=\"5\">\u003C\u002Fa>\nDo you also have old pictures of yourself or close ones that didn’t age well or that you, or your parents, took before we could produce high-quality images? I do, and I felt like those memories were damaged forever. Boy, was I wrong!\n\nThis new and completely free AI model can fix most of your old pictures in a split second. It works well even with very low or high-quality inputs, which is typically quite the challenge.\n\nThis week’s paper called Towards Real-World Blind Face Restoration with Generative Facial Prior tackles the photo restoration task with outstanding results. What’s even cooler is that you can try it yourself and in your preferred way. They have open-sourced their code, created a demo and online applications for you to try right now. 
If the results you’ve seen above aren’t convincing enough, just watch the video and let me know what you think in the comments; I know it will blow your mind!\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FDxxFRLI.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FnLDVtzcSeqM)\n* Short read: [Impressive photo restoration by AI !](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgfp-gan\u002F)\n* Paper: [Towards real-world blind face restoration with generative facial prior](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.04061.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FGFPGAN)\n* [Colab Demo](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1sVsoBd9AjckIXThgtZhGrHRfFI6UUYOo)\n* [Online app](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fakhaliq\u002FGFPGAN)\n\n\n## 4D-Net for Learned Multi-Modal Alignment [6]\u003Ca name=\"6\">\u003C\u002Fa>\nHow do autonomous vehicles see?\n\nYou’ve probably heard of LiDAR sensors or other weird cameras they are using. But how do they work, how can they see the world, and what do they see exactly compared to us? Understanding how they work is essential if we want to put them on the road, especially if you work in government or are shaping the next regulations, but also as a client of these services.\n\nWe previously covered how Tesla Autopilot sees and works, but it is different from conventional autonomous vehicles. Tesla only uses cameras to understand the world, while most of them, like Waymo, use regular cameras and 3D LiDAR sensors. These LiDAR sensors are pretty simple to understand: they won’t produce images like regular cameras but 3D point clouds. LiDAR sensors measure the distance to objects by calculating the travel time of the laser pulses they project onto them.\n\nStill, how can we efficiently combine this information and have the vehicle understand it? And what does the vehicle end up seeing? Only points everywhere? 
Is it enough for driving on our roads? We will look into this with a new research paper by Waymo and Google Research...\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FAxGLy7p.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F0nJMnw1Ldks)\n* Short read: [Combine Lidar and Cameras for 3D object detection - Waymo](https:\u002F\u002Fwww.louisbouchard.ai\u002Fwaymo-lidar\u002F)\n* Paper: [4D-Net for Learned Multi-Modal Alignment](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FPiergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf)\n\n\n## Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [7]\u003Ca name=\"7\">\u003C\u002Fa>\nAs if taking a picture wasn’t a challenging enough technological prowess, we are now doing the opposite: modeling the world from pictures. I’ve covered amazing AI-based models that could take images and turn them into high-quality scenes. A challenging task that consists of taking a few images in the 2-dimensional picture world to create how the object or person would look in the real world.\n\nTake a few pictures and instantly have a realistic model to insert into your product. How cool is that?!\n\nThe results have dramatically improved upon the first model I covered in 2020, called NeRF. And this improvement isn’t only about the quality of the results. 
NVIDIA made it even better.\n\nNot only is the quality comparable, if not better, but it is also more than 1,000 times faster, after less than two years of research.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002F8PilczV.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FUHQZBQOVAIU)\n* Short read: [NVIDIA Turns Photos into 3D Scenes in Milliseconds](https:\u002F\u002Fwww.louisbouchard.ai\u002Fnvidia-photos-into-3d-scenes\u002F)\n* Paper: [Instant Neural Graphics Primitives with a Multiresolution Hash Encoding](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp)\n\n\n## Hierarchical Text-Conditional Image Generation with CLIP Latents [8]\u003Ca name=\"8\">\u003C\u002Fa>\nLast year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2. And you won’t believe the progress in a single year! DALL·E 2 is not only better at generating photorealistic images from text. The results are four times the resolution!\n\nAs if it wasn’t already impressive enough, the recent model learned a new skill: image inpainting.\n\nDALL·E could generate images from text inputs.\n\nDALL·E 2 can do it better, but it doesn’t stop there. It can also edit those images and make them look even better! Or simply add a feature you want, like some flamingos in the background.\n\nSounds interesting? 
Learn more in the video or read more below!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FxZlfsJO.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FrdGVbPI42sA)\n* Short read: [OpenAI's new model DALL·E 2 is amazing!](https:\u002F\u002Fwww.louisbouchard.ai\u002Fopenais-new-model-dall-e-2-is-amazing\u002F)\n* Paper: [Hierarchical Text-Conditional Image Generation with CLIP Latents](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-2.pdf)\n\n\n## MyStyle: A Personalized Generative Prior [9]\u003Ca name=\"9\">\u003C\u002Fa>\nThis new model by Google Research and Tel-Aviv University is incredible. You can see it as a very, very powerful deepfake that can do anything. \n\nTake a hundred pictures of any person and you have their persona encoded, letting you fix, edit or create any realistic picture you want.\n\nThis is both amazing and scary if you ask me, especially when you look at the results. Watch the video to see more results and understand how the model works!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FFAhVBzM.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FBNWAEvFfFvQ)\n* Short read: [Your Personal Photoshop Expert with AI!](https:\u002F\u002Fwww.louisbouchard.ai\u002Fmystyle\u002F)\n* Paper: [MyStyle: A Personalized Generative Prior](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.17272)\n* [Code (coming soon)](https:\u002F\u002Fmystyle-personalized-prior.github.io\u002F)\n\n\n> Check out [the What's AI podcast](https:\u002F\u002Fopen.spotify.com\u002Fshow\u002F4rKRJXaXlClkDyInjHkxq3) for more AI content in the form of interviews with experts in the field! 
An invited AI expert and I will cover specific topics, sub-fields, and roles related to AI to teach and share knowledge from the people who worked hard to gather it.\n\n\n## OPT: Open Pre-trained Transformer Language Models [10]\u003Ca name=\"10\">\u003C\u002Fa>\nWe’ve all heard about GPT-3 and have somewhat of a clear idea of its capabilities. You’ve most certainly seen some applications born strictly due to this model, some of which I covered in a previous video about the model. GPT-3 is a model developed by OpenAI that you can access through a paid API but have no access to the model itself.\n\nWhat makes GPT-3 so strong is both its architecture and size. It has 175 billion parameters. Twice the number of neurons we have in our brains! This immense network was pretty much trained on the whole internet to understand how we write, exchange, and understand text. This week, Meta has taken a big step forward for the community. They just released a model that is just as powerful, if not more so, and has completely open-sourced it.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FZBHHYaQ.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FEjg0OunCi9U)\n* Short read: [Meta's new model OPT is GPT-3's closest competitor! (and is open source)](https:\u002F\u002Fwww.louisbouchard.ai\u002Fopt-meta\u002F)\n* Paper: [OPT: Open Pre-trained Transformer Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01068.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmetaseq)\n\n\n## BlobGAN: Spatially Disentangled Scene Representations [11]\u003Ca name=\"11\">\u003C\u002Fa>\nBlobGAN allows for unreal manipulation of images, made super easy by controlling simple blobs. All these small blobs represent an object, and you can move them around or make them bigger, smaller, or even remove them, and it will have the same effect on the object it represents in the image. 
This is so cool!\n\nAs the authors shared in their results, you can even create novel images by duplicating blobs, creating unseen images in the dataset [like a room with two ceiling fans](https:\u002F\u002Fyoutu.be\u002FmnEzjpiA_4E)! Correct me if I’m wrong, but I believe it is one of, if not the first, paper to make the modification of images as simple as moving blobs around and allowing for edits that were unseen in the training dataset. \n\nAnd you can actually play with this one compared to some companies we all know! They shared their code publicly and a Colab Demo you can try right away. Even more exciting is how BlobGAN works. Learn more in the video!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002F9ouN5ta.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FmnEzjpiA_4E)\n* Short read: [This is a BIG step for GANs! BlobGAN Explained](https:\u002F\u002Fwww.louisbouchard.ai\u002Fblobgan\u002F)\n* Paper: [BlobGAN: Spatially Disentangled Scene Representations](https:\u002F\u002Fdave.ml\u002Fblobgan\u002F)\n* [Code](https:\u002F\u002Fgithub.com\u002Fdave-epstein\u002Fblobgan)\n* [Colab Demo](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1clvh28Yds5CvKsYYENGLS3iIIrlZK4xO?usp=sharing#scrollTo=0QuVIyVplOKu)\n\n\n## A Generalist Agent [12]\u003Ca name=\"12\">\u003C\u002Fa>\nGato from DeepMind was just published! It is a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more! Indeed, it is trained once and uses the same weights to achieve all those tasks. And as per Deepmind, this is not only a transformer but also an agent. This is what happens when you mix Transformers with progress on multi-task reinforcement learning agents.\n\nGato is a multi-modal agent. Meaning that it can create captions for images or answer questions as a chatbot. 
You’d say that GPT-3 can already do that, but Gato can do more… The multi-modality comes from the fact that Gato can also play Atari games at the human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics...\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Frr9VUXn.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FxZKSWNv6Esc)\n* Short read: [Deepmind's new model Gato is amazing!](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdeepmind-gato\u002F)\n* Paper: [A Generalist Agent](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FA%20Generalist%20Agent\u002FGeneralist%20Agent.pdf)\n\n\n## Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [13]\u003Ca name=\"13\">\u003C\u002Fa>\nIf you thought Dall-e 2 had great results, wait until you see what this new model from Google Brain can do. \n\nDall-e 2 is amazing but often lacks realism, and this is what the team attacked with this new model called Imagen. \n\nThey share a lot of results on their project page as well as a benchmark, which they introduced for comparing text-to-image models, where they clearly outperform Dall-E 2 and previous image generation approaches. 
Learn more in the video...\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FIpwaSvZ.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FqhtYPhPWCsI)\n* Short read: [Google Brain's Answer to Dalle-e 2: Imagen](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgoogle-brain-imagen\u002F)\n* Paper: [Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding](https:\u002F\u002Fimagen.research.google\u002Fpaper.pdf)\n* [Project page with results](https:\u002F\u002Fimagen.research.google\u002F)\n\n## DALL·E Mini [14]\u003Ca name=\"14\">\u003C\u002Fa>\nDalle mini is amazing — and YOU can use it!\n\nI'm sure you've seen pictures like those in your Twitter feed in the past few days.\nIf you wondered what they were, they are images generated by an AI called DALL·E mini.\nIf you've never seen those, you need to watch this video because you are missing out.\nIf you wonder how this is possible, well, you have found the perfect video and will know the answer in less than five minutes.\n\nDalle mini is a free, open-source AI that produces amazing images from text inputs.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FUx4ItPo.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FqOxde_JV0vI)\n* Short read: [How does dalle-mini work?](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdalle-mini\u002F)\n* [Code](https:\u002F\u002Fgithub.com\u002Fborisdayma\u002Fdalle-mini)\n* [Huggingface official demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdalle-mini\u002Fdalle-mini)\n\n\n## No Language Left Behind: Scaling Human-Centered Machine Translation [15]\u003Ca name=\"15\">\u003C\u002Fa>\nMeta AI’s most recent model, called “No Language Left Behind”, does exactly 
that: translates across 200 different languages with state-of-the-art quality.\nA single model can handle 200 languages. How incredible is that?\n\nWe find it difficult to achieve great results in English alone, while Meta is tackling 200 different languages with the same model, including some of the most complicated and underrepresented ones that even Google Translate struggles with...\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FOHV5bTU.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F2G4NeG17Eis)\n* Short read: [No Language Left Behind](https:\u002F\u002Fwww.louisbouchard.ai\u002Fno-language-left-behind\u002F)\n* [Code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fnllb)\n* Paper: [No Language Left Behind](https:\u002F\u002Fai.facebook.com\u002Fresearch\u002Fno-language-left-behind\u002F)\n\n\n## Dual-Shutter Optical Vibration Sensing [16]\u003Ca name=\"16\">\u003C\u002Fa>\nThey reconstruct sound using cameras and a laser beam on any vibrating surface, allowing them to isolate musical instruments, focus on a specific speaker, remove ambient noises, and enable many more amazing applications.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FkkS7tGw.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Fn1M8ZVspJcs)\n* Short read: [CVPR 2022 Best Paper Honorable Mention: Dual-Shutter Optical Vibration Sensing](https:\u002F\u002Fwww.louisbouchard.ai\u002Fcvpr-2022-best-paper\u002F)\n* [Project page](https:\u002F\u002Fimaging.cs.cmu.edu\u002Fvibration\u002F)\n* Paper: [Dual-Shutter Optical Vibration Sensing](https:\u002F\u002Fwww.marksheinin.com\u002F_files\u002Fugd\u002Fa41a28_7d370603fafd419da387de85d8ecb5b4.pdf?index=true)\n\n\n## Make-a-scene: Scene-based text-to-image generation with human priors [17]\u003Ca name=\"17\">\u003C\u002Fa>\nMake-A-Scene is not “just another DALL·E”.
The goal of this new model isn’t to let users generate random images following a text prompt as DALL·E does — which is really cool — since that restricts the user's control over the generations.\n\nInstead, Meta wanted to push creative expression forward, merging this text-to-image trend with previous sketch-to-image models, leading to “Make-A-Scene”: a fantastic blend between text- and sketch-conditioned image generation.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FbivyUmD.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FK3bZXXjW788)\n* Short read: [Produce Amazing Artworks with Text and Sketches!](https:\u002F\u002Fwww.louisbouchard.ai\u002Fmake-a-scene\u002F)\n* Paper: [Make-a-scene: Scene-based text-to-image generation with human priors](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13131.pdf)\n\n\n## BANMo: Building Animatable 3D Neural Models from Many Casual Videos [18]\u003Ca name=\"18\">\u003C\u002Fa>\nCreate deformable 3D models from pictures with BANMo!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FulRCcMS.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FjDTy-liFoCQ)\n* Short read: [Build Animatable 3D Models with AI](https:\u002F\u002Fwww.louisbouchard.ai\u002Fbanmo\u002F)\n* Paper: [BANMo: Building Animatable 3D Neural Models from Many Casual Videos](https:\u002F\u002Fbanmo-www.github.io\u002Fbanmo-cvpr.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbanmo)\n\n\n## High-resolution image synthesis with latent diffusion models [19]\u003Ca name=\"19\">\u003C\u002Fa>\nWhat do all recent super powerful image models like DALL·E, Imagen, or Midjourney have in common?
Other than their high computing costs, huge training time, and shared hype, they are all based on the same mechanism: diffusion.\nDiffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALL·E, but many other image generation-related tasks too, like image inpainting, style transfer, or image super-resolution.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FPanqNAf.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FRGBNdD3Wn-g)\n* Short read: [Latent Diffusion Models: The Architecture behind Stable Diffusion](https:\u002F\u002Fwww.louisbouchard.ai\u002Flatent-diffusion-models\u002F)\n* Paper: [High-resolution image synthesis with latent diffusion models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10752.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion)\n\n\n👀 **If you'd like to support my work**, you can [Sponsor](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Flouisfb01) this repository or support me on [Patreon](https:\u002F\u002Fwww.patreon.com\u002Fwhatsai).\n\n\n## Panoptic Scene Graph Generation [20]\u003Ca name=\"20\">\u003C\u002Fa>\nPanoptic scene graph generation, or PSG, is a new task aiming to generate a more comprehensive graph representation of an image or scene based on panoptic segmentation rather than bounding boxes. It can be used to understand images and generate sentences describing what's happening. This may be the most challenging task for an AI!
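Concretely, the output of a PSG-style model is a set of relation triplets whose nodes are panoptic segments (both "thing" and "stuff" classes) rather than bounding boxes. Here is a minimal, runnable sketch of that output structure; the labels and relations are hypothetical illustrations, not taken from the paper:

```python
# Minimal sketch of a panoptic scene graph: segments (panoptic regions)
# plus (subject, predicate, object) relation triplets. Hypothetical labels.
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    id: int
    category: str        # e.g. "person" (thing) or "grass" (stuff)

@dataclass(frozen=True)
class Relation:
    subject: int         # Segment id
    predicate: str       # e.g. "standing on"
    object: int          # Segment id

# Hypothetical scene: a person standing on grass, holding a frisbee.
segments = [Segment(0, "person"), Segment(1, "grass"), Segment(2, "frisbee")]
relations = [Relation(0, "standing on", 1), Relation(0, "holding", 2)]

def describe(segments, relations):
    """Render each triplet as a short sentence fragment."""
    by_id = {s.id: s.category for s in segments}
    return [f"{by_id[r.subject]} {r.predicate} {by_id[r.object]}" for r in relations]

print(describe(segments, relations))
# → ['person standing on grass', 'person holding frisbee']
```

Real PSG predictions additionally attach a pixel-level mask to every segment; the sketch only shows the triplet structure that a downstream captioning step could verbalize.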
Learn more below...\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FQRQnydw.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FcSsE_H_0Cr8)\n* Short read: [One of the Most Challenging Tasks for AI](https:\u002F\u002Fwww.louisbouchard.ai\u002Fpsg\u002F)\n* Paper: [Panoptic Scene Graph Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.11247)\n* [Code](https:\u002F\u002Fgithub.com\u002FJingkang50\u002FOpenPSG)\n* [Dataset](https:\u002F\u002Fpsgdataset.org\u002F)\n\n\n## An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion [21]\u003Ca name=\"21\">\u003C\u002Fa>\nText-to-Image models like DALL·E or Stable Diffusion are really cool and allow us to generate fantastic pictures with a simple text input. But wouldn't it be even cooler to give them a picture of you and ask them to turn it into a painting? Imagine being able to send any picture of an object, a person, or even your cat, and ask the model to transform it into another style, like turning yourself into a cyborg, rendering yourself in your preferred artistic style, or adding the subject to a new scene.\n\nBasically, how cool would it be to have a version of DALL·E we can use to photoshop our pictures instead of having random generations? Having a personalized DALL·E, while making it much simpler to control the generation, as “an image is worth a thousand words”. It would be like having a DALL·E model that is just as personalized and addictive as the TikTok algorithm.\n\nWell, this is what researchers from Tel Aviv University and NVIDIA worked on. They developed an approach for conditioning text-to-image models, like Stable Diffusion, which I covered last week, with a few images to represent any object or concept through the words you send along with your images.
Transforming the object of your input images into whatever you want!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FtpwcGVK.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Ff3oXa7_SYek)\n* Short read: [Guiding Stable Diffusion with your Images](https:\u002F\u002Fwww.louisbouchard.ai\u002Fimageworthoneword\u002F)\n* Paper: [An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01618v1.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Frinongal\u002Ftextual_inversion)\n\n\n## Expanding Language-Image Pretrained Models for General Video Recognition [22]\u003Ca name=\"22\">\u003C\u002Fa>\nWe’ve seen AI generate text, then images, and most recently even short videos, though those still need work. The results are incredible when you consider that no one is actually involved in the creation process of these pieces, and that such a model only has to be trained once before being used by thousands of people, as Stable Diffusion is. Still, do these models really understand what they are doing? Do they know what the picture or video they just produced really represents?
What does such a model understand when it sees such a picture or, even more complex, a video?\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002F65Vz6if.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Fseb4lmVPEe8)\n* Short read: [General Video Recognition with AI](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgeneral-video-recognition\u002F)\n* Paper: [Expanding Language-Image Pretrained Models for General Video Recognition](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.02816)\n* [Code](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVideoX\u002Ftree\u002Fmaster\u002FX-CLIP)\n\n\n## MAKE-A-VIDEO: TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA [23]\u003Ca name=\"23\">\u003C\u002Fa>\nMeta AI’s new model make-a-video is out and in a single sentence: it generates videos from text. It’s not only able to generate videos, but it’s also the new state-of-the-art method, producing higher quality and more coherent videos than ever before!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FZAv9FBH.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FMWwESVyHWto)\n* Short read: [Make-a-video: The AI Film Maker!](https:\u002F\u002Fwww.louisbouchard.ai\u002Fmake-a-video\u002F)\n* Paper: [MAKE-A-VIDEO: TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA](https:\u002F\u002Fmakeavideo.studio\u002FMake-A-Video.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fmake-a-video-pytorch)\n\n\n## Robust Speech Recognition via Large-Scale Weak Supervision [24]\u003Ca name=\"24\">\u003C\u002Fa>\nHave you ever dreamed of a good transcription tool that would accurately understand what you say and write it down? Not like the automatic YouTube translation tools… I mean, they are good but far from perfect. 
Just try it out and turn the feature on for the video, and you’ll see what I’m talking about.\n\nLuckily, OpenAI just released and open-sourced a pretty powerful AI model just for that: Whisper.\n\nIt understands stuff I can’t even comprehend, not being a native English speaker (listen in the video), and it works for language translation too!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FmPbvHfl.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FuFOkMme19Zs)\n* Short read: [OpenAI's Most Recent Model: Whisper (explained)](https:\u002F\u002Fwww.louisbouchard.ai\u002Fwhisper\u002F)\n* Paper: [Robust Speech Recognition via Large-Scale Weak Supervision](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fwhisper.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)\n\n\n## DreamFusion: Text-to-3D using 2D Diffusion [25]\u003Ca name=\"25\">\u003C\u002Fa>\nWe’ve seen models able to take a sentence and generate images. Then came other approaches that manipulate the generated images by learning specific concepts like an object or a particular style.\n\nLast week Meta published the Make-A-Video model that I covered, which allows you to generate a short video, also from a text sentence. The results aren’t perfect yet, but the progress we’ve made in the field since last year is just incredible.\n\nThis week we make another step forward.\n\nHere’s DreamFusion, a new Google Research model that can understand a sentence enough to generate a 3D model of it. You can see this as DALL·E or Stable Diffusion but in 3D.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FkgvlHXu.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FepuU0VRIcjE)\n* Short read: [3D Models from Text!
DreamFusion Explained](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdreamfusion\u002F)\n* Paper: [DreamFusion: Text-to-3D using 2D Diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.14988)\n\n\n## Imagic: Text-Based Real Image Editing with Diffusion Models [26]\u003Ca name=\"26\">\u003C\u002Fa>\nIf you think the recent image generation models like DALL·E or Stable Diffusion are cool, you just won’t believe how incredible this one is.\n\"This one\" is Imagic. Imagic takes a diffusion-based model able to generate images from text and adapts it to edit those images. You can generate an image and then teach the model to edit it any way you want.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Frws0FYl.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FgbpPQ5kVJhM)\n* Short read: [AI Image Editing from Text! Imagic Explained](https:\u002F\u002Fwww.louisbouchard.ai\u002Fimagic\u002F)\n* Paper: [Imagic: Text-Based Real Image Editing with Diffusion Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.09276)\n* [Stable Diffusion implementation](https:\u002F\u002Fgithub.com\u002Fjustinpinkney\u002Fstable-diffusion\u002Fblob\u002Fmain\u002Fnotebooks\u002Fimagic.ipynb)\n\n\n## eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [27]\u003Ca name=\"27\">\u003C\u002Fa>\neDiffi, NVIDIA's most recent model, generates better-looking and more accurate images than all previous approaches like DALL·E 2 or Stable Diffusion.
eDiffi better understands the text you send and is more customizable, adding a feature we saw in a previous paper from NVIDIA: the painter tool.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Fe0g0rNe.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Fgrwp-ht_ixo)\n* Short read: [eDiffi explained: New SOTA Image Synthesis model!](https:\u002F\u002Fwww.louisbouchard.ai\u002Fediffi\u002F)\n* Paper: [eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01324)\n\n\n👀 **If you'd like to support my work**, you can [Sponsor](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Flouisfb01) this repository or support me on [Patreon](https:\u002F\u002Fwww.patreon.com\u002Fwhatsai).\n\n\n## InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images [28]\u003Ca name=\"28\">\u003C\u002Fa>\nGenerate infinite new frames as if you were flying into your image!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Fhza3mLh.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FFQzGhukV-l0)\n* Short read: [InfiniteNature-Zero: Fly Into Your Pictures With AI!](https:\u002F\u002Fwww.louisbouchard.ai\u002Finfinitenature-zero\u002F)\n* Paper: [InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images](https:\u002F\u002Finfinite-nature-zero.github.io\u002Fstatic\u002Fpdfs\u002FInfiniteNatureZero.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Finfinite_nature_zero)\n\n\n## Galactica: A Large Language Model for Science [29]\u003Ca name=\"29\">\u003C\u002Fa>\nGalactica is a large language model with a size comparable to GPT-3, but specialized in scientific knowledge. The model can write whitepapers, reviews, Wikipedia pages, and code. It knows how to cite and how to write equations.
It’s kind of a big deal for AI and science.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FHVEKpOY.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F2GfxkCWWzLU)\n* Short read: [Galactica: What is it and What Happened?](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgalactica\u002F)\n* Paper: [Galactica: A Large Language Model for Science](https:\u002F\u002Fgalactica.org\u002Fstatic\u002Fpaper.pdf)\n\n\n## Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition [30]\u003Ca name=\"30\">\u003C\u002Fa>\nFrom a single video, they can synthesize the person talking for pretty much any word or sentence in real time, and with better quality than previous approaches. You can animate a talking head following any audio track in real time.\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FSk6fDKu.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FJUqnLN6Q4B0)\n* Short read: [From Audio to Talking Heads in Real-Time with AI! RAD-NeRF explained](https:\u002F\u002Fwww.louisbouchard.ai\u002Frad-nerf\u002F)\n* Paper: [Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.12368)\n\n\n## ChatGPT: Optimizing Language Models for Dialogue [31]\u003Ca name=\"31\">\u003C\u002Fa>\nChatGPT has taken over Twitter and pretty much the whole internet, thanks to its power and the meme potential it provides. We all know how being able to generate memes is the best way to conquer the internet, and so it worked.\n\nSince you’ve seen numerous examples, you might already know that ChatGPT is an AI recently released to the public by OpenAI that you can chat with. It is also called a chatbot, meaning you can interact with it conversationally, imitating a one-on-one human discussion.\n\nWhat you might not know is what it is and how it works...
Watch the video or read the article or blog post below to learn more!\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FRpH5S2f.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FAsFgn8vU-tQ)\n* Short read: [What is ChatGPT?](https:\u002F\u002Fwww.louisbouchard.ai\u002Fchatgpt\u002F)\n* Blog Post: [ChatGPT: Optimizing Language Models for Dialogue](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt\u002F)\n\n\n## Production-Ready Face Re-Aging for Visual Effects [32]\u003Ca name=\"32\">\u003C\u002Fa>\nWhether it be for fun in a Snapchat filter, for a movie, or even to remove a few wrinkles, we can all think of a use for changing our age in a picture.\n\nThis is usually done by skilled artists using Photoshop or a similar tool to edit your pictures. Worse, in a video, they have to do this kind of manual editing for every frame! Just imagine the amount of work needed for that. Well, here’s a solution to this situation, and a new problem that comes with it... 👇\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FQOo0O5N.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FWC03N0NFfwk)\n* Short read: [Automatic Re-Aging with AI!
Disney’s FRAN Model Explained](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdisney-re-age\u002F)\n* Blog Post: [Production-Ready Face Re-Aging for Visual Effects](https:\u002F\u002Fstudios.disneyresearch.com\u002F2022\u002F11\u002F30\u002Fproduction-ready-face-re-aging-for-visual-effects\u002F)\n\n\n---\n\n\n>If you would like to read more papers and have a broader view, here is another great repository for you covering 2021:\n[2021: A Year Full of Amazing AI papers - A Review](https:\u002F\u002Fgithub.com\u002Flouisfb01\u002Fbest_AI_papers_2021) and feel free to subscribe to my weekly [newsletter](https:\u002F\u002Flouisbouchard.substack.com\u002F) and stay up-to-date with new publications in AI for 2022!\n\n\n*Tag me on **Twitter** [@Whats_AI](https:\u002F\u002Ftwitter.com\u002FWhats_AI) or **LinkedIn** [@Louis (What's AI) Bouchard](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fwhats-ai\u002F) if you share the list!*\n\n---\n\n## Paper references\u003Ca name=\"references\">\u003C\u002Fa>\n\n[1] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K. and Lempitsky, V., 2022. Resolution-robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the IEEE\u002FCVF Winter Conference on Applications of Computer Vision (pp. 2149–2159), https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07161.pdf\n\n[2] Tzaban, R., Mokady, R., Gal, R., Bermano, A.H. and Cohen-Or, D., 2022. Stitch it in Time: GAN-Based Facial Editing of Real Videos. https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.08361\n\n[3] Kuang, Z., Olszewski, K., Chai, M., Huang, Z., Achlioptas, P. and Tulyakov, S., 2022. NeROIC: Neural Rendering of Objects from Online Image Collections. https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02533.pdf\n\n[4] Borsos, Z., Sharifi, M. and Tagliasacchi, M., 2022. SpeechPainter: Text-conditioned Speech Inpainting. https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07273.pdf\n\n[5] Wang, X., Li, Y., Zhang, H.
and Shan, Y., 2021. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (pp. 9168–9178), https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.04061.pdf\n\n[6] Piergiovanni, A.J., Casser, V., Ryoo, M.S. and Angelova, A., 2021. 4d-net for learned multi-modal alignment. In Proceedings of the IEEE\u002FCVF International Conference on Computer Vision (pp. 15435–15445), https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FPiergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf.\n\n[7] Thomas Muller, Alex Evans, Christoph Schied and Alexander Keller, 2022, \"Instant Neural Graphics Primitives with a Multiresolution Hash Encoding\", https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf\n\n[8] A. Ramesh et al., 2022, \"Hierarchical Text-Conditional Image Generation with CLIP Latents\", https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-2.pdf\n\n[9] Nitzan, Y., Aberman, K., He, Q., Liba, O., Yarom, M., Gandelsman, Y., Mosseri, I., Pritch, Y. and Cohen-Or, D., 2022. MyStyle: A Personalized Generative Prior. arXiv preprint arXiv:2203.17272.\n\n[10] Zhang, Susan et al. “OPT: Open Pre-trained Transformer Language Models.” https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.01068\n\n[11] Epstein, D., Park, T., Zhang, R., Shechtman, E. and Efros, A.A., 2022. BlobGAN: Spatially Disentangled Scene Representations. arXiv preprint arXiv:2205.02837.\n\n[12] Reed S. 
et al., 2022, DeepMind: Gato - A generalist agent, https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FA%20Generalist%20Agent\u002FGeneralist%20Agent.pdf\n\n[13] Saharia et al., 2022, Google Brain, Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, https:\u002F\u002Fgweb-research-imagen.appspot.com\u002Fpaper.pdf\n\n[14] Dayma et al., 2021, DALL·E Mini, doi:10.5281\u002Fzenodo.5146400\n\n[15] NLLB Team et al., 2022, No Language Left Behind: Scaling Human-Centered Machine Translation\n\n[16] Sheinin, Mark and Chan, Dorian and O’Toole, Matthew and Narasimhan, Srinivasa G., 2022, Dual-Shutter Optical Vibration Sensing, Proc. IEEE CVPR.\n\n[17] Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022. Make-a-scene: Scene-based text-to-image generation with human priors. https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13131.pdf\n\n[18] Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A. and Joo, H., 2022. Banmo: Building animatable 3d neural models from many casual videos. In Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (pp. 2863-2873).\n\n[19] Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (pp. 10684–10695), https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10752.pdf\n\n[20] Yang, J., Ang, Y.Z., Guo, Z., Zhou, K., Zhang, W. and Liu, Z., 2022. Panoptic Scene Graph Generation. arXiv preprint arXiv:2207.11247.\n\n[21] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G. and Cohen-Or, D., 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion.\n\n[22] Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S. and Ling, H., 2022. Expanding Language-Image Pretrained Models for General Video Recognition.
arXiv preprint arXiv:2208.02816.\n\n[23] Singer et al. (Meta AI), 2022, “MAKE-A-VIDEO: TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA”, https:\u002F\u002Fmakeavideo.studio\u002FMake-A-Video.pdf\n\n[24] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. and Sutskever, I., Robust Speech Recognition via Large-Scale Weak Supervision.\n\n[25] Poole, B., Jain, A., Barron, J.T. and Mildenhall, B., 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv preprint arXiv:2209.14988.\n\n[26] Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I. and Irani, M., 2022. Imagic: Text-Based Real Image Editing with Diffusion Models. arXiv preprint arXiv:2210.09276.\n\n[27] Balaji, Y. et al., 2022, eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01324\n\n[28] Li, Z., Wang, Q., Snavely, N. and Kanazawa, A., 2022. InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images. In European Conference on Computer Vision (pp. 515–534). Springer, Cham, https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.11148\n\n[29] Taylor et al., 2022: Galactica: A Large Language Model for Science, https:\u002F\u002Fgalactica.org\u002F\n\n[30] Tang, J., Wang, K., Zhou, H., Chen, X., He, D., Hu, T., Liu, J., Zeng, G. and Wang, J., 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. 
arXiv preprint arXiv:2211.12368.\n\n[31] OpenAI, 2022: ChatGPT: Optimizing Language Models for Dialogue, https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt\u002F\n\n[32] Zoss et al., DisneyResearch, 2022: FRAN, https:\u002F\u002Fstudios.disneyresearch.com\u002F2022\u002F11\u002F30\u002Fproduction-ready-face-re-aging-for-visual-effects\u002F\n","# 2022: A Year Full of Amazing AI Papers - A Review 🚀\n## A curated list of the latest AI breakthroughs by release date, with a clear video explanation, link to a more in-depth article, and code.\n\nWhile the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. More, many important aspects were highlighted this year, such as ethical considerations, important biases, governance, transparency, and much more. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technologies we choose to apply.\n\n>\"Science cannot tell us what we ought to do, only what we can do.\"\u003Cbr\u002F>- Jean-Paul Sartre, Being and Nothingness\n\nHere is a curated list of the latest breakthroughs in AI and data science by release date, each with a clear video explanation, a link to a more in-depth article, and code (when applicable). Enjoy the read!\n\n**The complete reference to each paper is listed at the end of this repository.** *Star this repository to stay up to date, and stay tuned for next year!* ⭐️\n\nMaintainer: [louisfb01](https:\u002F\u002Fgithub.com\u002Flouisfb01), also active on [YouTube](https:\u002F\u002Fwww.youtube.com\u002F@whatsai) and as a [podcaster](https:\u002F\u002Fopen.spotify.com\u002Fshow\u002F4rKRJXaXlClkDyInjHkxq3); follow along if you want to learn more about AI!\n\n[![Twitter](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttps\u002Ftwitter.com\u002Fcloudposse.svg?style=social&label=Follow%20%40whats_ai)](https:\u002F\u002Ftwitter.com\u002FWhats_AI)\n\n\nSubscribe to my [newsletter](https:\u002F\u002Flouisbouchard.substack.com\u002F): the latest AI news explained every week.\n\n\n*Feel free to [message me](https:\u002F\u002Fwww.louisbouchard.ai\u002Fcontact\u002F) any interesting paper I may have missed; I will add it to this repository.*\n\n*Tag me on **Twitter** [@Whats_AI](https:\u002F\u002Ftwitter.com\u002FWhats_AI) or **LinkedIn** [@Louis (What's AI) Bouchard](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fwhats-ai\u002F) if you share the list! And come chat with us in our [Learn AI Together Discord community](https:\u002F\u002Fwww.louisbouchard.ai\u002Flearn-ai-together\u002F)!*\n\n👀 **If you'd like to support my work**, you can [Sponsor](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Flouisfb01) this repository or support me on [Patreon](https:\u002F\u002Fwww.patreon.com\u002Fwhatsai).\n\n### Watch a complete rewind of 2022 in 8 minutes\n\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FMGt3APx.png\" 
width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FtYRTzWHOQio)\n\n----\n\n## 完整列表\n- [基于傅里叶卷积的分辨率鲁棒大型掩码修复 [1]](#1)\n- [时空缝合：基于GAN的真实视频人脸编辑 [2]](#2)\n- [NeROIC：从在线图像集合中进行物体神经渲染 [3]](#3)\n- [SpeechPainter：文本条件下的语音修复 [4]](#4)\n- [利用生成式人脸先验实现真实场景中的盲人面部修复 [5]](#5)\n- [用于学习型多模态对齐的4D-Net [6]](#6)\n- [具有多分辨率哈希编码的即时神经图形基元 [7]](#7)\n- [基于CLIP潜在空间的层次化文本条件图像生成 [8]](#8)\n- [MyStyle：个性化生成先验 [9]](#9)\n- [OPT：开放的预训练Transformer语言模型 [10]](#10)\n- [BlobGAN：空间解耦的场景表征 [11]](#11)\n- [通用智能体 [12]](#12)\n- [具备深度语言理解能力的逼真文生图扩散模型 [13]](#13)\n- [Dalle mini [14]](#14)\n- [不让任何一种语言掉队：以人为本的机器翻译规模化 [15]](#15)\n- [双快门光学振动感知 [16]](#16)\n- [Make-a-scene：基于场景和人类先验的文生图生成 [17]](#17)\n- [BANMo：从大量日常视频中构建可动画化的3D神经模型 [18]](#18)\n- [使用潜在扩散模型进行高分辨率图像合成 [19]](#19)\n- [全景场景图生成 [20]](#20)\n- [一张图片胜过千言万语：利用文本反转个性化文生图生成 [21]](#21)\n- [扩展语言-图像预训练模型以实现通用视频识别 [22]](#22)\n- [MAKE-A-VIDEO：无需文本-视频数据的文生视频生成 [23]](#23)\n- [通过大规模弱监督实现稳健的语音识别 [24]](#24)\n- [DreamFusion：利用2D扩散模型进行文生3D [25]](#25)\n- [Imagic：基于扩散模型的文本驱动真实图像编辑 [26]](#26)\n- [eDiffi：采用专家去噪器集成的文生图扩散模型 [27]](#27)\n- [InfiniteNature-Zero：从单张图像中学习自然场景的永久视图生成 [28]](#28)\n- [Galactica：面向科学领域的大型语言模型 [29]](#29)\n- [基于音频-空间分解的实时神经辐射说话肖像合成 [30]](#30)\n- [ChatGPT：针对对话优化的语言模型 [31]](#31)\n- [用于视觉特效的生产级人脸逆龄技术 [32]](#32)\n- [论文参考文献](#references)\n\n---\n\n## 基于傅里叶卷积的分辨率鲁棒大型掩码修复 [1]\u003Ca name=\"1\">\u003C\u002Fa>\n你一定遇到过这样的情况：和朋友拍了一张很棒的照片，结果背后突然冒出个“抢镜”的人，毁了你原本完美的Instagram帖子。现在再也不用担心了！无论是不小心出现在照片里的路人，还是随手丢弃的垃圾桶，这款AI都能自动帮你把它们移除，拯救你的照片。它就像随身携带的专业修图师，只需轻轻一点即可完成！\n\n将图像的一部分移除并用背景内容填补的任务，长期以来一直被众多AI研究人员所研究。这项技术被称为图像修复，而它实际上非常具有挑战性...\n\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Fd5ClyqP.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FIa79AvGzveQ)\n* 简介文章：[这款AI能帮你移除照片中的多余元素！](https:\u002F\u002Fwww.louisbouchard.ai\u002Flama\u002F)\n* 论文：[基于傅里叶卷积的分辨率鲁棒大型掩码修复](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07161.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002Fsaic-mdal\u002Flama)\n* 
[Colab Demo](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsaic-mdal\u002Flama\u002Fblob\u002Fmaster\u002Fcolab\u002FLaMa_inpainting.ipynb)\n* [Product using LaMa](https:\u002F\u002Fcleanup.pictures\u002F)\n\n## Stitch it in Time: GAN-Based Facial Editing of Real Videos [2]\u003Ca name=\"2\">\u003C\u002Fa>\nYou've surely seen movies like the recent Captain Marvel or Gemini Man, where Samuel L. Jackson and Will Smith looked much younger. Achieving this requires professionals to spend hundreds or even thousands of hours manually editing every single frame in which they appear.\n\nInstead, you could use a simple AI tool and get the same result within minutes. In fact, many techniques can now automatically add smiles or make you look younger or older, all powered by AI-based algorithms. This is called AI-based face manipulation in videos, and here is the state of the art in 2022!\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FlvgMjzS.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FmqItu9XoUgk)\n* Short read: [AI Facial Editing of Real Videos! \"Stitch it in Time\" Explained](https:\u002F\u002Fwww.louisbouchard.ai\u002Fstitch-it-in-time\u002F)\n* Paper: [Stitch it in Time: GAN-Based Facial Editing of Real Videos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.08361)\n* [Code](https:\u002F\u002Fgithub.com\u002Frotemtzaban\u002FSTIT)\n\n\n## NeROIC: Neural Rendering of Objects from Online Image Collections [3]\u003Ca name=\"3\">\u003C\u002Fa>\nNeural rendering. Neural rendering means generating a photorealistic model of an object, person, or scene in three-dimensional space from nothing but photos of it, just like this. Say you have a handful of pictures of this sculpture and you ask a machine to understand what the object in these pictures should look like in space. In essence, you are asking the machine to understand physics and shapes from images. This isn't hard for us humans, since we are familiar with the real world and with depth, but it is an entirely new challenge for a machine that only sees pixels.\n\nIt's great that the generated model looks accurate, with realistic shapes, but how does it blend naturally into a new scene? And if the lighting conditions vary across the photos taken, the generated model will look different from different angles, which would instantly seem weird and unrealistic. These are the problems Snapchat and the University of Southern California tackled with novel solutions in this new research.\n\n\n* Short Video Explanation:\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FxTpuwcN.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F88Pl9zD1Z78)\n* Short read: [Create Realistic 3D Renderings with AI!](https:\u002F\u002Fwww.louisbouchard.ai\u002Fneroic\u002F)\n* Paper: [NeROIC: Neural Rendering of Objects from Online Image Collections](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02533.pdf)\n* [Code](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FNeROIC)\n\n\n## SpeechPainter: Text-conditioned Speech Inpainting [4]\u003Ca 
name=\"4\">\u003C\u002Fa>\n我们之前介绍过图像修复技术，其目标是从照片中移除不需要的物体。基于机器学习的修复方法不仅会简单地抹去这些物体，还会理解整张图片的内容，并根据背景特征补全缺失的部分。\n\n近年来的相关进展令人惊叹，修复效果也非常出色。这种技术在广告制作或优化你的Instagram帖子等方面有着广泛的应用前景。此外，我们还探讨过更具挑战性的视频修复任务——通过类似的方法去除视频中的物体或人物。\n\n然而，处理视频时最大的难点在于确保每一帧之间的一致性，避免出现任何瑕疵或异常。但如果我们在一部电影中成功移除了某个人物，而声音却依然存在、毫无变化，那会怎样呢？那样的话，我们可能会听到一个“幽灵般”的声音，从而破坏整个修复成果。\n\n这就引出了我频道此前从未涉及的一个领域——语音修复。没错，谷歌的研究人员刚刚发表了一篇关于语音修复的论文，结果相当惊人。虽然这次我们更多是“听”而不是“看”，但核心思想是一样的：它可以纠正语法、发音，甚至消除背景噪音。这些都是我一直在努力改进的地方，或者……干脆直接使用他们的新模型吧！快来看看我视频里的示例吧！\n\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FJyQ41Qv.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FzIIc4bRf5Hg)\n* 简短文章：[AI语音修复！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fspeech-inpainting-with-ai\u002F)\n* 论文：[SpeechPainter：文本条件下的语音修复](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07273.pdf)\n* [收听更多示例](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Fspeechpainter\u002Fexamples\u002F)\n\n\n## 基于生成式人脸先验的真实世界盲人像修复 [5]\u003Ca name=\"5\">\u003C\u002Fa>\n你是否也保存着一些自己或亲人的老照片——那些因为年代久远而质量不佳，或是父母在我们还无法拍摄高质量照片的时代拍下的？我也有这样的照片，曾经以为那些珍贵的回忆已经永远失去了。可事实证明，我大错特错了！\n\n这款全新的免费AI模型，能在瞬间修复你大部分的老照片。它对低质量或高噪点的输入同样表现出色，而这通常是最棘手的问题之一。\n\n本周发布的论文《基于生成式人脸先验的真实世界盲人像修复》以卓越的效果解决了照片修复这一难题。更酷的是，你可以亲自尝试，而且方式多种多样：他们不仅开源了代码，还搭建了一个演示平台和在线应用，供你立即体验。如果你对前面展示的结果还不太信服，不妨观看视频，并在评论区告诉我你的看法——我相信它一定会让你大开眼界！\n\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FDxxFRLI.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FnLDVtzcSeqM)\n* 简短文章：[AI惊艳的图像修复！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgfp-gan\u002F)\n* 论文：[基于生成式人脸先验的真实世界盲人像修复](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.04061.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FGFPGAN)\n* [Colab演示](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1sVsoBd9AjckIXThgtZhGrHRfFI6UUYOo)\n* [在线应用](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fakhaliq\u002FGFPGAN)\n\n## 4D-Net：用于学习多模态对齐的网络 [6]\u003Ca 
name=\"6\">\u003C\u002Fa>\n自动驾驶汽车是如何“看”世界的呢？\n\n你可能听说过激光雷达传感器，以及它们使用的其他奇特摄像头。但这些设备究竟是如何工作的？它们怎样感知周围的世界？与人类相比，它们看到的内容又有哪些不同呢？要让自动驾驶汽车真正上路，理解这些技术的工作原理至关重要——无论你是政府工作人员、法规制定者，还是相关服务的用户。\n\n我们之前曾介绍过特斯拉自动辅助驾驶系统的视觉感知与工作方式，但它的技术路径与其他传统自动驾驶车辆有所不同。特斯拉仅依靠摄像头来理解环境，而大多数公司，比如Waymo，则同时使用普通摄像头和3D激光雷达传感器。激光雷达的工作原理相对简单：它不会像普通相机那样生成图像，而是构建出三维点云数据。激光雷达通过测量发射到物体上的激光脉冲往返时间，从而计算出物体之间的距离。\n\n然而，如何高效地融合这些信息，并让车辆真正“理解”它们呢？最终车辆看到的仅仅是满屏的点吗？这样的信息是否足以支持在现实道路上的安全行驶？接下来，我们将结合Waymo与谷歌研究院的一篇最新研究论文，深入探讨这一问题……\n\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FAxGLy7p.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F0nJMnw1Ldks)\n* 简短阅读：[将激光雷达与摄像头结合用于3D目标检测——Waymo](https:\u002F\u002Fwww.louisbouchard.ai\u002Fwaymo-lidar\u002F)\n* 论文：[4D-Net：用于学习多模态对齐的网络](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FPiergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf)\n\n\n## 基于多分辨率哈希编码的即时神经图形基元 [7]\u003Ca name=\"7\">\u003C\u002Fa>\n拍照本身已经是一项极具挑战性的技术成就，而现在我们更进一步，反其道而行之：从照片中重建世界！我曾介绍过一些基于人工智能的优秀模型，能够根据图片生成高质量的场景。这项任务的核心在于，仅需几张二维平面图像，便能还原出物体或人物在真实三维空间中的形态。\n\n只需拍摄几张照片，就能立即获得一个逼真的数字模型，直接应用于你的产品设计中。这该有多酷啊！\n\n与我在2020年首次介绍的NeRF模型相比，如今的效果有了质的飞跃。而这种进步不仅体现在结果的质量上，NVIDIA更是将其提升到了新的高度。新方法不仅效果堪比甚至超越了之前的版本，而且速度提升了超过1000倍，这一切仅仅是在短短两年的研究之后实现的。\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002F8PilczV.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FUHQZBQOVAIU)\n* 简短阅读：[NVIDIA可在毫秒级内将照片转化为3D场景](https:\u002F\u002Fwww.louisbouchard.ai\u002Fnvidia-photos-into-3d-scenes\u002F)\n* 论文：[基于多分辨率哈希编码的即时神经图形基元](https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Finstant-ngp)\n\n\n## 基于CLIP潜在表示的层次化文本条件图像生成 [8]\u003Ca name=\"8\">\u003C\u002Fa>\n去年，我曾分享过OpenAI推出的DALL·E模型，它能够根据文本输入生成令人惊叹的图像。如今，它的升级版——DALL·E 2——正式登场。而这一年间的进步之大，简直让人难以置信！DALL·E 
2不仅能生成更加逼真的照片级图像，其分辨率更是提升了四倍！\n\n更令人惊喜的是，这款新模型还掌握了一项全新技能：图像修复（inpainting）。\n\nDALL·E可以根据文本生成图像。\n\n而DALL·E 2不仅做得更好，还能进一步编辑这些图像，让它们看起来更加完美；或者干脆添加你想要的元素，比如在背景中加入几只火烈鸟。\n\n听起来很有趣吧？快来观看视频或阅读下方内容，了解更多详情吧！\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FxZlfsJO.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FrdGVbPI42sA)\n* 简短阅读：[OpenAI的新模型DALL·E 2太棒了！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fopenais-new-model-dall-e-2-is-amazing\u002F)\n* 论文：[基于CLIP潜在表示的层次化文本条件图像生成](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-2.pdf)\n\n\n## MyStyle：个性化生成先验模型 [9]\u003Ca name=\"9\">\u003C\u002Fa>\n由谷歌研究院和特拉维夫大学联合开发的这款新模型堪称惊艳。你可以把它看作一种功能极其强大的深度伪造工具，几乎无所不能。\n\n只要提供一百张任意人物的照片，就能将该人物的特征编码进模型中，随后无论是修复、编辑，还是完全原创一张逼真的人像，都轻而易举。\n\n在我看来，这既令人赞叹又不免心生担忧，尤其是当你看到实际生成的效果时。不妨观看视频，进一步了解模型的工作机制及更多示例！\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FFAhVBzM.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FBNWAEvFfFvQ)\n* 简短阅读：[用AI打造你的专属Photoshop专家！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fmystyle\u002F)\n* 论文：[MyStyle：个性化生成先验模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.17272)\n* [代码即将发布](https:\u002F\u002Fmystyle-personalized-prior.github.io\u002F)\n\n\n> 欢迎收听[What's AI播客](https:\u002F\u002Fopen.spotify.com\u002Fshow\u002F4rKRJXaXlClkDyInjHkxq3)，获取更多关于人工智能的深度内容！每期节目都会邀请一位业内专家，与我共同探讨人工智能领域的特定主题、子领域及职业方向，分享来自一线从业者的专业知识与见解。\n\n\n## OPT：开放的预训练Transformer语言模型 [10]\u003Ca name=\"10\">\u003C\u002Fa>\n我们对GPT-3早已耳熟能详，对其能力也大致有所了解。许多应用正是依托于这一模型而诞生的，其中部分我也曾在关于GPT-3的前一期视频中提到过。GPT-3是由OpenAI开发的语言模型，用户可通过付费API调用，但无法直接访问模型本身。\n\nGPT-3之所以如此强大，一方面得益于其精妙的架构，另一方面则归功于庞大的参数规模——高达1750亿个！这一超大规模网络几乎接受了整个互联网的数据训练，从而深刻理解人类的写作、交流及文本处理方式。本周，Meta公司为社区迈出了重要一步：他们发布了一款同样强大、甚至更为先进的语言模型，并将其完全开源。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FZBHHYaQ.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FEjg0OunCi9U)\n* 
简短阅读：[Meta的新模型OPT是GPT-3最有力的竞争对手！（且已开源）](https:\u002F\u002Fwww.louisbouchard.ai\u002Fopt-meta\u002F)\n* 论文：[OPT：开放的预训练Transformer语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.01068.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmetaseq)\n\n## BlobGAN：空间解耦的场景表征 [11]\u003Ca name=\"11\">\u003C\u002Fa>\nBlobGAN 能以极其简单的方式操控图像——只需移动几个简单的色块即可实现。这些小色块分别代表图像中的某个物体，你可以随意移动它们、调整大小，甚至删除它们，而这些操作都会对图像中对应的物体产生同样的效果。这真是太酷了！\n\n正如作者在结果中所展示的，你甚至可以通过复制色块来生成全新的图像，创造出数据集中从未出现过的场景，比如一个装有两台吊扇的房间！如果我没记错的话，这可能是最早一批（如果不是第一篇）将图像编辑简化到只需移动色块、并允许进行训练集中未曾出现过的修改的论文之一。\n\n而且相比我们熟知的一些公司，这个项目还真的可以动手玩一玩呢！他们不仅公开了代码，还提供了一个 Colab 演示，你可以立即尝试。更令人兴奋的是 BlobGAN 的工作原理。更多细节请观看视频！\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002F9ouN5ta.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FmnEzjpiA_4E)\n* 简短阅读：[这是 GAN 的一大步！BlobGAN 解析](https:\u002F\u002Fwww.louisbouchard.ai\u002Fblobgan\u002F)\n* 论文：[BlobGAN：空间解耦的场景表征](https:\u002F\u002Fdave.ml\u002Fblobgan\u002F)\n* [代码](https:\u002F\u002Fgithub.com\u002Fdave-epstein\u002Fblobgan)\n* [Colab 演示](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1clvh28Yds5CvKsYYENGLS3iIIrlZK4xO?usp=sharing#scrollTo=0QuVIyVplOKu)\n\n\n## 通用智能体 [12]\u003Ca name=\"12\">\u003C\u002Fa>\nDeepMind 的 Gato 刚刚发布！它是一个单一的 Transformer 模型，能够玩 Atari 游戏、为图片添加说明文字、与人聊天、控制真实的机械臂等等！确实，它只经过一次训练，便能用同一组权重完成所有这些任务。据 DeepMind 称，这不仅仅是一个 Transformer，更是一个智能体。这就是当 Transformer 与多任务强化学习智能体的研究进展相结合时所产生的成果。\n\nGato 是一个多模态智能体。这意味着它可以为图片生成描述，也可以像聊天机器人一样回答问题。你可能会说 GPT-3 已经能做到这一点，但 Gato 还能做更多……它的多模态特性体现在，它不仅能以人类水平玩 Atari 游戏，还能完成现实世界的任务，例如精确地控制机械臂搬运物体。它理解文字、图像，甚至物理规律……\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Frr9VUXn.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FxZKSWNv6Esc)\n* 简短阅读：[DeepMind 新模型 Gato 太棒了！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdeepmind-gato\u002F)\n* 
论文：[通用智能体](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FA%20Generalist%20Agent\u002FGeneralist%20Agent.pdf)\n\n\n## 具备深度语言理解能力的逼真文生图扩散模型 [13]\u003Ca name=\"13\">\u003C\u002Fa>\n如果你觉得 Dall-e 2 的效果已经很出色了，那一定要看看 Google Brain 推出的新模型能做到什么程度。\n\nDall-e 确实很厉害，但往往缺乏真实感，而 Google Brain 团队正是针对这一问题开发了名为 Imagen 的新模型。\n\n他们在项目页面上分享了大量的实验结果，并且还推出了一项用于比较文生图模型的基准测试。结果显示，Imagen 显著优于 Dall-e 2 以及其他先前的图像生成方法。更多内容请观看视频……\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FIpwaSvZ.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FqhtYPhPWCsI)\n* 简短阅读：[Google Brain 对抗 Dall-e 2 的答案：Imagen](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgoogle-brain-imagen\u002F)\n* 论文：[具备深度语言理解能力的逼真文生图扩散模型](https:\u002F\u002Fimagen.research.google\u002Fpaper.pdf)\n* [包含结果的项目页面](https:\u002F\u002Fimagen.research.google\u002F)\n\n[![Twitter](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttps\u002Ftwitter.com\u002Fcloudposse.svg?style=social&label=Follow%20%40whats_ai)](https:\u002F\u002Ftwitter.com\u002FWhats_AI)\n\n## DALL·E Mini [14]\u003Ca name=\"14\">\u003C\u002Fa>\nDalle mini 非常棒——而且你也可以使用它！\n\n相信你在过去几天的 Twitter 动态里一定见过类似这样的图片。\n如果你好奇它们是怎么来的，那就是由名为 DALL·E mini 的 AI 生成的。\n如果你还没见过这类图片，那就一定要看看这段视频，因为你真的错过了一些精彩的内容。\n如果你想知道这是如何做到的，那么你来对地方了——短短不到五分钟，你就能找到答案。\n\nDalle mini 是一款免费的开源 AI，可以根据文本输入生成令人惊叹的图像。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FUx4ItPo.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FqOxde_JV0vI)\n* 简短阅读：[Dalle-mini 是如何工作的？](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdalle-mini\u002F)\n* [代码](https:\u002F\u002Fgithub.com\u002Fborisdayma\u002Fdalle-mini)\n* [Huggingface 官方演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdalle-mini\u002Fdalle-mini)\n\n\n## 不落下任何一种语言：以人为本的机器翻译规模化 [15]\u003Ca name=\"15\">\u003C\u002Fa>\nMeta AI 最新推出的“不落下任何一种语言”模型，顾名思义，能够以最先进的质量跨 200 种不同语言进行翻译。\n仅凭一个模型就能处理 200 种语言，这该有多了不起啊？\n\n我们常常发现，即使是在英语领域取得优异成果都颇为不易，而 Meta 却用同一个模型攻克了多达 200 
种语言，其中还包括许多复杂且语料稀缺的语言，甚至连谷歌翻译都难以应对……\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FOHV5bTU.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F2G4NeG17Eis)\n* 简短阅读：[不落下任何一种语言](https:\u002F\u002Fwww.louisbouchard.ai\u002Fno-language-left-behind\u002F)\n* [代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fnllb)\n* 论文：[不落下任何一种语言](https:\u002F\u002Fai.facebook.com\u002Fresearch\u002Fno-language-left-behind\u002F)\n\n\n## 双快门光学振动传感 [16]\u003Ca name=\"16\">\u003C\u002Fa>\n这项技术利用摄像头和激光束，在任何振动表面上重建声音，从而实现分离乐器音轨、聚焦特定说话者、消除环境噪音等多种神奇的应用。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FkkS7tGw.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Fn1M8ZVspJcs)\n* 简短阅读：[CVPR 2022 最佳论文荣誉提及：双快门光学振动传感](https:\u002F\u002Fwww.louisbouchard.ai\u002Fcvpr-2022-best-paper\u002F)\n* [项目页面](https:\u002F\u002Fimaging.cs.cmu.edu\u002Fvibration\u002F)\n* 论文：[双快门光学振动传感](https:\u002F\u002Fwww.marksheinin.com\u002F_files\u002Fugd\u002Fa41a28_7d370603fafd419da387de85d8ecb5b4.pdf?index=true)\n\n## Make-a-scene：基于场景的文本到图像生成，融入人类先验知识 [17]\u003Ca name=\"17\">\u003C\u002Fa>\nMake-A-Scene 并非“又一个 DALL·E”。这款新模型的目标并不是像 DALL·E 那样让用户仅根据文本提示生成随机图像——尽管这确实很酷——但这种方式会限制用户对生成内容的控制。\n\n相反，Meta 希望推动创意表达的发展，将文本到图像的趋势与先前的草图到图像模型相结合，从而诞生了“Make-A-Scene”：一种将文本和草图条件结合的卓越图像生成技术。\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FbivyUmD.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FK3bZXXjW788)\n* 简短文章：[用文本和草图创作惊艳艺术作品！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fmake-a-scene\u002F)\n* 论文：[Make-a-scene：基于场景的文本到图像生成，融入人类先验知识](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13131.pdf)\n\n\n## BANMo：从大量日常视频中构建可动画化的 3D 神经网络模型 [18]\u003Ca name=\"18\">\u003C\u002Fa>\n使用 BANMo，只需几张照片就能创建可变形的 3D 模型！\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FulRCcMS.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FjDTy-liFoCQ)\n* 简短文章：[用 AI 构建可动画化的 3D 
模型](https:\u002F\u002Fwww.louisbouchard.ai\u002Fbanmo\u002F)\n* 论文：[BANMo：从大量日常视频中构建可动画化的 3D 神经网络模型](https:\u002F\u002Fbanmo-www.github.io\u002Fbanmo-cvpr.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbanmo)\n\n\n## 基于潜在扩散模型的高分辨率图像合成 [19]\u003Ca name=\"19\">\u003C\u002Fa>\n最近那些功能强大的图像生成模型，比如 DALL·E、Imagen 或 MidJourney，它们之间有什么共同点呢？除了高昂的计算成本、漫长的训练时间以及备受关注之外，它们都基于同一种机制：扩散模型。\n\n近年来，扩散模型在大多数图像任务中都取得了最先进的成果，不仅包括 DALL·E 的文本到图像生成，还涵盖了图像修复、风格迁移和图像超分辨率等多种相关任务。\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FPanqNAf.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FRGBNdD3Wn-g)\n* 简短文章：[潜在扩散模型：Stable Diffusion 背后的架构](https:\u002F\u002Fwww.louisbouchard.ai\u002Flatent-diffusion-models\u002F)\n* 论文：[基于潜在扩散模型的高分辨率图像合成](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10752.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion)\n\n\n👀 **如果您想支持我的工作**，可以考虑为本仓库 [赞助](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Flouisfb01) 或在 [Patreon](https:\u002F\u002Fwww.patreon.com\u002Fwhatsai) 上支持我。\n\n\n## 全景场景图生成 [20]\u003Ca name=\"20\">\u003C\u002Fa>\n全景场景图生成（PSG）是一项全新的任务，旨在基于全景分割而非边界框，生成更全面的图像或场景图表示。它可用于理解图像并生成描述其中发生事件的句子。这或许是人工智能面临的最具挑战性的任务之一！更多信息请见下文……\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FQRQnydw.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FcSsE_H_0Cr8)\n* 简短文章：[人工智能最具挑战性的任务之一](https:\u002F\u002Fwww.louisbouchard.ai\u002Fpsg\u002F)\n* 论文：[全景场景图生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.11247)\n* [代码](https:\u002F\u002Fgithub.com\u002FJingkang50\u002FOpenPSG)\n* [数据集](https:\u002F\u002Fpsgdataset.org\u002F)\n\n\n## 一图胜千言：利用文本反转实现文本到图像生成的个性化 [21]\u003Ca name=\"21\">\u003C\u002Fa>\n像 DALL·E 或 Stable Diffusion 这样的文本到图像模型非常酷，只需简单的文本输入就能生成精美的图片。但如果给它们一张你的照片，让它们将其转化为一幅画作，岂不是更酷吗？想象一下，你可以上传任何物体、人物，甚至是你的猫的照片，然后让模型将其转换成另一种风格——比如把你变成赛博格，或者按照你喜欢的艺术风格进行创作，甚至将它融入一个新的场景中。\n\n简单来说，如果能拥有一款类似于 Photoshop 的 DALL·E 
版本，而不是随机生成图像，那该有多棒？通过“一图胜千言”的方式，我们可以轻松地控制生成过程，打造个性化的 DALL·E。这就像拥有一个既个性化又让人欲罢不能的 TikTok 算法版 DALL·E。\n\n事实上，特拉维夫大学和 NVIDIA 的研究人员正是致力于这一方向的研究。他们开发了一种方法，能够以少量图像作为条件，来表征任意对象或概念，并通过你随图像发送的文本指令，将输入图像中的对象转换为你想要的任何形式。\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FtpwcGVK.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Ff3oXa7_SYek)\n* 简短文章：[用你的图像引导 Stable Diffusion](https:\u002F\u002Fwww.louisbouchard.ai\u002Fimageworthoneword\u002F)\n* 论文：[一图胜千言：利用文本反转实现文本到图像生成的个性化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.01618v1.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002Frinongal\u002Ftextual_inversion)\n\n\n## 扩展语言-图像预训练模型用于通用视频识别 [22]\u003Ca name=\"22\">\u003C\u002Fa>\n我们已经见证了人工智能先生成文本，再生成图像，最近甚至还能生成短视频，尽管这些成果仍有待完善。考虑到这些作品的创作过程中几乎无需人工参与，且只需一次训练便可供成千上万的人使用，例如 Stable Diffusion，其效果确实令人惊叹。然而，这些模型真的理解自己在做什么吗？它们是否清楚自己刚刚生成的图片或视频究竟代表什么？当这样的模型看到一张图片，甚至更为复杂的视频时，它又能理解些什么呢？\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002F65Vz6if.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Fseb4lmVPEe8)\n* 简短文章：[AI 的通用视频识别](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgeneral-video-recognition\u002F)\n* 论文：[扩展语言-图像预训练模型用于通用视频识别](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.02816)\n* [代码](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVideoX\u002Ftree\u002Fmaster\u002FX-CLIP)\n\n\n## MAKE-A-VIDEO：无需文本-视频数据的文本到视频生成 [23]\u003Ca name=\"23\">\u003C\u002Fa>\nMeta AI 的新模型 Make-a-Video 已发布，一句话概括就是：它可以根据文本生成视频。不仅如此，它还是目前最先进的方法，能够生成比以往更高品质、更加连贯的视频！\n\n* 简短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FZAv9FBH.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FMWwESVyHWto)\n* 简短文章：[Make-a-Video：AI 影片制作人！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fmake-a-video\u002F)\n* 论文：[MAKE-A-VIDEO：无需文本-视频数据的文本到视频生成](https:\u002F\u002Fmakeavideo.studio\u002FMake-A-Video.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fmake-a-video-pytorch)\n\n## 基于大规模弱监督的鲁棒语音识别 [24]\u003Ca 
name=\"24\">\u003C\u002Fa>\n你是否曾梦想过一款优秀的转录工具，能够准确理解你说的话并将其记录下来？而不是像 YouTube 的自动翻译那样……我的意思是，那些工具确实不错，但远称不上完美。不妨亲自试一试，打开视频的字幕功能，你就会明白我在说什么了。\n\n幸运的是，OpenAI 刚刚发布并开源了一款非常强大的 AI 模型，专门为此而设计：Whisper。\n\n作为非英语母语者，它甚至能理解我都不太明白的内容（请在视频中收听），而且还能用于语言翻译！\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FmPbvHfl.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FuFOkMme19Zs)\n* 简短阅读：[OpenAI 最新模型：Whisper（详解）](https:\u002F\u002Fwww.louisbouchard.ai\u002Fwhisper\u002F)\n* 论文：[基于大规模弱监督的鲁棒语音识别](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fwhisper.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)\n\n\n## DreamFusion：利用 2D 扩散模型实现文本到 3D 的转换 [25]\u003Ca name=\"25\">\u003C\u002Fa>\n我们已经见过一些模型，能够根据一句话生成图像。随后又出现了其他方法，通过学习特定概念（如某个物体或某种风格）来操控生成的图像。\n\n上周，Meta 发布了我曾报道过的 Make-A-Video 模型，该模型同样可以根据一段文字生成短视频。目前的结果还并不完美，但自去年以来，我们在这一领域取得的进步实在令人惊叹。\n\n而本周，我们又迈出了新的一步。\n\n这就是 DreamFusion，一款由 Google Research 开发的新模型，它能够理解一句话，并据此生成一个 3D 模型。你可以把它看作是 DALL·E 或 Stable Diffusion 的 3D 版本。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FkgvlHXu.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FepuU0VRIcjE)\n* 简短阅读：[从文本生成 3D 模型！DreamFusion 详解](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdreamfusion\u002F)\n* 论文：[DreamFusion：利用 2D 扩散模型实现文本到 3D 的转换](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.14988)\n\n\n## Imagic：基于扩散模型的文本驱动真实图像编辑 [26]\u003Ca name=\"26\">\u003C\u002Fa>\n如果你觉得最近的图像生成模型，比如 DALL·E 或 Stable Diffusion 很酷，那你一定会对这款模型感到更加震撼。\n“这款”就是 Imagic。Imagic 基于扩散模型，能够根据文本生成图像，并进一步改造该模型以实现图像编辑功能。你可以先生成一张图像，然后教会模型按照你的需求随意修改它。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Frws0FYl.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FgbpPQ5kVJhM)\n* 简短阅读：[文本驱动的 AI 图像编辑！Imagic 详解](https:\u002F\u002Fwww.louisbouchard.ai\u002Fimagic\u002F)\n* 论文：[Imagic：基于扩散模型的文本驱动真实图像编辑](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.09276)\n* [Stable Diffusion 
实现](https:\u002F\u002Fgithub.com\u002Fjustinpinkney\u002Fstable-diffusion\u002Fblob\u002Fmain\u002Fnotebooks\u002Fimagic.ipynb)\n\n\n## eDiffi：采用专家级去噪器集成的文本到图像扩散模型 [27]\u003Ca name=\"27\">\u003C\u002Fa>\nNVIDIA 最新推出的 eDiffi 模型，生成的图像不仅外观更佳、细节更丰富，准确性也远超此前的 DALL·E 2 或 Stable Diffusion 等方案。eDiffi 对用户输入的文本理解更为深入，且可定制性更强，同时还引入了 NVIDIA 在先前论文中提出的一项功能——画家工具。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Fe0g0rNe.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002Fgrwp-ht_ixo)\n* 简短阅读：[eDiffi 详解：全新 SOTA 图像合成模型！](https:\u002F\u002Fwww.louisbouchard.ai\u002Fediffi\u002F)\n* 论文：[eDiffi：采用专家级去噪器集成的文本到图像扩散模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01324)\n\n\n👀 **如果你想支持我的工作**，可以前往 [Sponsor](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Flouisfb01) 为这个仓库赞助，或者在 [Patreon](https:\u002F\u002Fwww.patreon.com\u002Fwhatsai) 上支持我。\n\n\n## InfiniteNature-Zero：从单张图像学习自然场景的无限视图生成 [28]\u003Ca name=\"28\">\u003C\u002Fa>\n仿佛你正飞入自己的照片之中，就能生成无穷无尽的新画面！\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002Fhza3mLh.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FFQzGhukV-l0)\n* 简短阅读：[InfiniteNature-Zero：用 AI 飞进你的照片！](https:\u002F\u002Fwww.louisbouchard.ai\u002Finfinitenature-zero\u002F)\n* 论文：[InfiniteNature-Zero：从单张图像学习自然场景的无限视图生成](https:\u002F\u002Finfinite-nature-zero.github.io\u002Fstatic\u002Fpdfs\u002FInfiniteNatureZero.pdf)\n* [代码](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Finfinite_nature_zero)\n\n\n## Galactica：面向科学领域的大型语言模型 [29]\u003Ca name=\"29\">\u003C\u002Fa>\nGalactica 是一款规模与 GPT-3 相当的大型语言模型，但专精于科学知识。该模型能够撰写白皮书、综述、维基百科条目以及代码，懂得如何引用文献和书写公式。对于人工智能和科学领域而言，这无疑是一件大事。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FHVEKpOY.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002F2GfxkCWWzLU)\n* 简短阅读：[Galactica：它是什么？后来发生了什么？](https:\u002F\u002Fwww.louisbouchard.ai\u002Fgalactica\u002F)\n* 
论文：[Galactica：面向科学领域的大型语言模型](https:\u002F\u002Fgalactica.org\u002Fstatic\u002Fpaper.pdf)\n\n\n## 基于音频-空间分解的实时神经辐射场对话肖像合成 [30]\u003Ca name=\"30\">\u003C\u002Fa>\n仅需一段视频，他们就能以更高的质量实时合成人物说出几乎任何单词或句子的画面。你可以让一个人物头像随着任意音频轨道同步进行实时动画化。\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FSk6fDKu.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FJUqnLN6Q4B0)\n* 简短阅读：[用 AI 将音频实时转化为会说话的头像！RAD-NeRF 详解](https:\u002F\u002Fwww.louisbouchard.ai\u002Frad-nerf\u002F)\n* 论文：[基于音频-空间分解的实时神经辐射场对话肖像合成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.12368)\n\n\n## ChatGPT：针对对话优化的语言模型 [31]\u003Ca name=\"31\">\u003C\u002Fa>\nChatGPT 凭借其强大的功能和极高的 meme 潜力，迅速席卷了 Twitter 乃至整个互联网。众所周知，能够生成 meme 是征服互联网的最佳方式，而 ChatGPT 正是这样做的。\n\n既然你已经看过无数相关示例，或许已经知道 ChatGPT 是 OpenAI 最近向公众开放的一款 AI，你可以与之聊天。它也被称为聊天机器人，意味着你可以像与真人一对一交流一样与之互动。\n\n不过，你可能还不清楚它究竟是什么、又是如何工作的……观看下方的视频或阅读文章、博客，了解更多吧！\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FRpH5S2f.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FAsFgn8vU-tQ)\n* 简短阅读：[ChatGPT 是什么？](https:\u002F\u002Fwww.louisbouchard.ai\u002Fchatgpt\u002F)\n* 博客文章：[ChatGPT：针对对话优化的语言模型](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt\u002F)\n\n## 适用于视觉特效的生产就绪人脸重龄技术 [32]\u003Ca name=\"32\">\u003C\u002Fa>\n无论是为了在 Snapchat 滤镜中玩乐、用于电影制作，还是仅仅想消除几道皱纹，我们每个人心中都有一项实用需求——能够在照片中改变自己的年龄。\n\n通常，这项工作需要由经验丰富的艺术家使用 Photoshop 或类似工具来手动编辑照片。更糟糕的是，在视频中，他们必须对每一帧都进行这样的手动处理！试想一下，这该需要多少人力和时间啊。不过，现在有一个解决方案，同时也带来了一个新的挑战…… 👇\n\n* 短视频讲解：\u003Cbr\u002F>\n[\u003Cimg src=\"https:\u002F\u002Fimgur.com\u002FQOo0O5N.png\" width=\"512\"\u002F>](https:\u002F\u002Fyoutu.be\u002FWC03N0NFfwk)\n* 简短阅读：[利用 AI 自动实现人脸重龄！迪士尼 FRAN 模型详解](https:\u002F\u002Fwww.louisbouchard.ai\u002Fdisney-re-age\u002F)\n* 博客文章：[适用于视觉特效的生产就绪人脸重龄技术](https:\u002F\u002Fstudios.disneyresearch.com\u002F2022\u002F11\u002F30\u002Fproduction-ready-face-re-aging-for-visual-effects\u002F)\n\n\n---\n\n\n> 如果你想阅读更多论文并获得更广阔的视野，这里还有一个很棒的资源库，涵盖了 2021 年的内容：\n[2021：充满惊人 AI 
论文的一年——综述](https:\u002F\u002Fgithub.com\u002Flouisfb01\u002Fbest_AI_papers_2021)，同时欢迎订阅我的每周[通讯](https:\u002F\u002Flouisbouchard.substack.com\u002F)，及时了解 2022 年的最新 AI 研究成果！\n\n\n* 如果你分享这份列表，请在 **Twitter** 上标记我 [@Whats_AI](https:\u002F\u002Ftwitter.com\u002FWhats_AI) 或在 **LinkedIn** 上标记我 [@Louis (What's AI) Bouchard](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fwhats-ai\u002F)！\n\n---\n\n## 论文参考文献\u003Ca name=\"references\">\u003C\u002Fa>\n\n[1] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K. 和 Lempitsky, V.，2022年。基于傅里叶卷积的分辨率鲁棒性大尺寸掩码修复。载于IEEE\u002FCVF冬季计算机视觉应用会议论文集（第2149–2159页）。https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.07161.pdf\n\n[2] Tzaban, R., Mokady, R., Gal, R., Bermano, A.H. 和 Cohen-Or, D.，2022年。及时缝合：基于GAN的真实视频人脸编辑。https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.08361\n\n[3] Kuang, Z., Olszewski, K., Chai, M., Huang, Z., Achlioptas, P. 和 Tulyakov, S.，2022年。NeROIC：从在线图像集合中进行物体的神经渲染。https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02533.pdf\n\n[4] Borsos, Z., Sharifi, M. 和 Tagliasacchi, M.，2022年。SpeechPainter：文本条件化的语音修复。https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.07273.pdf\n\n[5] Wang, X., Li, Y., Zhang, H. 和 Shan, Y.，2021年。利用生成式人脸先验实现真实场景下的盲态人脸修复。载于IEEE\u002FCVF计算机视觉与模式识别会议论文集（第9168–9178页），https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.04061.pdf\n\n[6] Piergiovanni, A.J., Casser, V., Ryoo, M.S. 和 Angelova, A.，2021年。用于学习多模态对齐的4D-Net。载于IEEE\u002FCVF国际计算机视觉会议论文集（第15435–15445页），https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2021\u002Fpapers\u002FPiergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf。\n\n[7] Thomas Muller、Alex Evans、Christoph Schied 和 Alexander Keller，2022年，“具有多分辨率哈希编码的即时神经图形基元”，https:\u002F\u002Fnvlabs.github.io\u002Finstant-ngp\u002Fassets\u002Fmueller2022instant.pdf\n\n[8] A. 
Ramesh 等，2022年，“基于CLIP潜在空间的层次化文本条件图像生成”，https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-2.pdf\n\n[9] Nitzan, Y., Aberman, K., He, Q., Liba, O., Yarom, M., Gandelsman, Y., Mosseri, I., Pritch, Y. 和 Cohen-Or, D.，2022年。MyStyle：个性化生成先验。arXiv预印本 arXiv:2203.17272。\n\n[10] Zhang, Susan 等人。“OPT：开放的预训练Transformer语言模型”。https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.01068\n\n[11] Epstein, D., Park, T., Zhang, R., Shechtman, E. 和 Efros, A.A.，2022年。BlobGAN：空间解耦的场景表示。arXiv预印本 arXiv:2205.02837。\n\n[12] Reed S. 等，2022年，DeepMind：Gato——一个通用智能体，https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FA%20Generalist%20Agent\u002FGeneralist%20Agent.pdf\n\n[13] Saharia 等，2022年，Google Brain，具有深度语言理解能力的逼真文本到图像扩散模型，https:\u002F\u002Fgweb-research-imagen.appspot.com\u002Fpaper.pdf\n\n[14] Dayma 等，2021年，DALL·E Mini，doi:10.5281\u002Fzenodo.5146400\n\n[15] NLLB团队等，2022年，“不落下任何一种语言：以人为本的机器翻译规模化”\n\n[16] Sheinin, Mark 和 Chan, Dorian 以及 O’Toole, Matthew 和 Narasimhan, Srinivasa G.，2022年，双快门光学振动传感，IEEE CVPR会议论文。\n\n[17] Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. 和 Taigman, Y.，2022年。Make-a-scene：基于场景和人类先验的文本到图像生成。https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13131.pdf\n\n[18] Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A. 和 Joo, H.，2022年。Banmo：从大量日常视频中构建可动画化的3D神经模型。载于IEEE\u002FCVF计算机视觉与模式识别会议论文集（第2863–2873页）。\n\n[19] Rombach, R., Blattmann, A., Lorenz, D., Esser, P. 和 Ommer, B.，2022年。基于潜在扩散模型的高分辨率图像合成。载于IEEE\u002FCVF计算机视觉与模式识别会议论文集（第10684–10695页），https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10752.pdf\n\n[20] Yang, J., Ang, Y.Z., Guo, Z., Zhou, K., Zhang, W. 和 Liu, Z.，2022年。全景场景图生成。arXiv预印本 arXiv:2207.11247。\n\n[21] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G. 和 Cohen-Or, D.，2022年。一图胜千言：利用文本反转实现文本到图像生成的个性化。\n\n[22] Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S. 
和 Ling, H.，2022年。扩展语言-图像预训练模型以用于通用视频识别。arXiv预印本 arXiv:2208.02816。\n\n[23] Singer等人（Meta AI），2022年，“MAKE-A-VIDEO：无需文本-视频数据的文本到视频生成”，https:\u002F\u002Fmakeavideo.studio\u002FMake-A-Video.pdf\n\n[24] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. 和 Sutskever, I.，通过大规模弱监督实现稳健的语音识别。\n\n[25] Poole, B., Jain, A., Barron, J.T. 和 Mildenhall, B.，2022年。DreamFusion：利用2D扩散模型进行文本到3D生成。arXiv预印本 arXiv:2209.14988。\n\n[26] Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I. 和 Irani, M.，2022年。Imagic：基于扩散模型的文本驱动真实图像编辑。arXiv预印本 arXiv:2210.09276。\n\n[27] Balaji, Y. 等，2022年，eDiffi：采用专家去噪器集成的文本到图像扩散模型，https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01324\n\n[28] Li, Z., Wang, Q., Snavely, N. 和 Kanazawa, A.，2022年。InfiniteNature-Zero：从单张图像中学习自然场景的永久视图生成。载于欧洲计算机视觉会议（第515–534页）。Springer, Cham，https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.11148\n\n[29] Taylor等人，2022年：Galactica：面向科学的大规模语言模型，https:\u002F\u002Fgalactica.org\u002F\n\n[30] Tang, J., Wang, K., Zhou, H., Chen, X., He, D., Hu, T., Liu, J., Zeng, G. 
和 Wang, J.，2022年。基于音频-空间分解的实时神经辐射场对话肖像合成。arXiv预印本 arXiv:2211.12368。\n\n[31] OpenAI，2022年：ChatGPT：针对对话优化的语言模型，https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt\u002F\n\n[32] Zoss 等人，Disney Research，2022年：FRAN，https:\u002F\u002Fstudios.disneyresearch.com\u002F2022\u002F11\u002F30\u002Fproduction-ready-face-re-aging-for-visual-effects\u002F","# best_AI_papers_2022 快速上手指南\n\n`best_AI_papers_2022` 并非单一的 AI 工具或代码库，而是一个由社区维护的**2022 年突破性 AI 论文精选清单**。该仓库整理了全年最重要的 AI 研究成果，并为每项研究提供了视频讲解、深度文章链接、原始论文以及对应的开源代码实现地址。\n\n本指南将帮助您快速浏览该列表，并定位到您感兴趣的具体项目（如图像修复、语音处理、3D 生成等）进行部署和使用。\n\n## 环境准备\n\n由于本仓库是论文索引，运行具体模型前需根据您选择的项目配置环境。以下是通用的前置要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+) 或 macOS。Windows 用户建议使用 WSL2。\n*   **Python**: 版本 3.8 或更高（具体版本视子项目要求而定）。\n*   **GPU**: 大多数深度学习模型（如扩散模型、GANs）需要 NVIDIA GPU 及 CUDA 支持。\n*   **依赖管理**: 建议安装 `conda` 或 `venv` 以隔离不同项目的依赖环境。\n*   **网络环境**: 访问 GitHub、ArXiv 及 Hugging Face 可能需要稳定的网络连接。国内用户可配置以下加速源：\n    *   PyPI: `https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n    *   Conda: `https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fpkgs\u002Fmain\u002F`\n\n## 安装步骤\n\n本仓库本身无需“安装”，只需克隆即可获取列表。若要运行列表中具体的模型，请遵循以下流程：\n\n### 1. 克隆论文清单仓库\n获取完整的论文列表和说明文档：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Flouisfb01\u002Fbest_AI_papers_2022.git\ncd best_AI_papers_2022\n```\n\n### 2. 
选择并部署具体模型\n在 `README.md` 的 \"The Full List\" 部分找到您感兴趣的项目（例如第 1 项：图像修复模型 **LaMa**），点击其对应的 **Code** 链接跳转至官方实现仓库。\n\n以 **LaMa (Resolution-robust Large Mask Inpainting)** 为例，安装步骤如下：\n\n```bash\n# 克隆具体项目代码 (以 LaMa 为例)\ngit clone https:\u002F\u002Fgithub.com\u002Fsaic-mdal\u002Flama.git\ncd lama\n\n# 创建虚拟环境 (推荐)\npython -m venv venv\nsource venv\u002Fbin\u002Factivate  # Windows 用户使用: venv\\Scripts\\activate\n\n# 安装依赖 (国内用户建议使用清华源加速)\npip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 下载预训练模型权重\n# 注意：具体下载命令请参考该项目 README 中的 \"Get pretrained models\" 章节\nbash scripts\u002Fget_pretrained_models.sh\n```\n\n> **提示**：列表中其他项目（如 GFPGAN, Stable Diffusion 变体等）均需在各自的原生仓库中执行类似的安装流程。部分项目提供 Google Colab 演示，无需本地安装即可直接运行。\n\n## 基本使用\n\n使用方式完全取决于您选择的具体模型。以下以 **LaMa 图像修复模型** 为例，展示最简单的推理流程。\n\n### 示例：使用 LaMa 移除图片中的物体\n\n1.  **准备输入**: 准备一张需要修复的图片（例如包含路人或杂物的照片）。\n2.  **运行推理**: 在项目根目录下执行以下命令（假设已准备好 `image.png` 和对应的掩码 `mask.png`）：\n\n```bash\npython predict.py model.path=big-lama indir=.\u002Finput_images outdir=.\u002Foutput_images\n```\n\n或者使用提供的 Colab 演示进行交互式操作：\n*   访问项目页面提供的 [Colab Demo](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsaic-mdal\u002Flama\u002Fblob\u002Fmaster\u002Fcolab\u002FLaMa_inpainting.ipynb)。\n*   上传图片，涂抹需要去除的区域，点击运行即可生成结果。\n\n### 探索更多项目\n回到主目录 `best_AI_papers_2022`，查阅 `README.md` 中的详细列表：\n*   **视频讲解**: 点击每个条目下的视频链接，快速理解算法原理。\n*   **代码链接**: 点击 `[Code]` 标签进入具体项目的 GitHub 页面查看详细 API 和训练指南。\n*   **在线体验**: 部分条目提供 `[Product using ...]` 链接，可直接在网页端体验效果。\n\n通过此清单，您可以高效地追踪 2022 年最前沿的 AI 技术并快速复现。","某初创公司的算法工程师团队正致力于开发一款面向电商的虚拟试衣功能，急需在两周内找到可落地的最新图像生成与编辑方案。\n\n### 没有 best_AI_papers_2022 时\n- **信息检索效率低下**：团队成员需分别在 ArXiv、GitHub 和 YouTube 上碎片化搜索，耗费数天筛选 2022 年的关键论文，极易遗漏如 \"Instant Neural Graphics Primitives\" 等突破性成果。\n- **复现门槛过高**：找到论文后，往往缺乏清晰的视频原理解读或配套的开源代码链接，导致非核心算法的理解成本极高，难以评估工程可行性。\n- **技术选型盲目**：由于缺乏按发布时间梳理的结构化清单，团队难以判断哪些技术（如潜在扩散模型）已成熟到足以商用，哪些仍存在伦理或偏差风险。\n- **知识更新滞后**：仅关注传统顶会论文，忽略了像 \"Dalle mini\" 或 \"No Language Left 
Behind\" 这类在社区引发热议但未被即时收录进综述的快速迭代项目。\n\n### 使用 best_AI_papers_2022 后\n- **一站式精准获取**：直接通过按日期排序的清单，快速定位到 \"Stitch it in Time\"（视频面部编辑）和 \"High-resolution image synthesis\" 等与试衣场景高度相关的顶尖研究。\n- **多维深度解析**：每项成果均附带直观的视频讲解、深度文章链接及可用代码库，工程师能在几小时内理解核心逻辑并启动本地测试。\n- **决策依据充分**：借助清单中对技术突破点的清晰标注，团队迅速锁定基于潜在扩散模型的高分辨率合成方案作为核心技术栈，大幅缩短调研周期。\n- **紧跟前沿动态**：通过维护者提供的通讯和社区链接，团队能持续追踪从论文到实战的最新转化案例，确保技术路线不偏离行业主流。\n\nbest_AI_papers_2022 将原本需要数周的分散式文献调研压缩为几天的高效技术验证，成为研发团队把握 AI 年度变革红利的加速器。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Flouisfb01_best_AI_papers_2022_62c84699.png","louisfb01","Louis-François Bouchard","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Flouisfb01_d681bed6.png","Making AI accessible on YouTube, Newsletter, Spotify, Apple podcasts.\r\n\r\nCo-Founder at Towards AI.\r\nex-PhD student at Mila, Polytechnique Montréal","Mila\u002FPolytechnique Montréal & @towardsai","montreal",null,"Whats_AI","https:\u002F\u002Fwww.louisbouchard.ai\u002F","https:\u002F\u002Fgithub.com\u002Flouisfb01",3187,199,"2026-04-08T16:41:48","MIT",1,"","未说明",{"notes":90,"python":88,"dependencies":91},"该仓库并非单一可运行的 AI 工具，而是 2022 年优秀 AI 论文的精选列表。列表中每项研究（如 LaMa, GFPGAN, NeROIC 等）都有独立的代码仓库链接和运行环境要求。用户需针对感兴趣的具体论文，访问其对应的 GitHub 页面查看详细的操作系统、GPU、Python 版本及依赖库需求。",[],[13,15,14,93],"其他",[95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111],"python","computer-vision","computer-science","deep-learning","machine-learning","artificial-intelligence","ai","paper","technology","innovation","machinelearning","papers","neural-network","sota","state-of-the-art","state-of-art","2022","2026-03-27T02:49:30.150509","2026-04-09T10:26:59.930411",[],[]]