[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-FreddeFrallan--Multilingual-CLIP":3,"tool-FreddeFrallan--Multilingual-CLIP":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":79,"owner_url":80,"languages":81,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":23,"env_os":102,"env_gpu":103,"env_ram":102,"env_deps":104,"category_tags":112,"github_topics":79,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":113,"updated_at":114,"faqs":115,"releases":156},2825,"FreddeFrallan\u002FMultilingual-CLIP","Multilingual-CLIP","OpenAI CLIP text encoders for multiple languages!","Multilingual-CLIP 是一个强大的开源项目，旨在让 OpenAI 著名的 CLIP 模型能够理解全球多种语言。原始的 CLIP 模型虽然能出色地连接图像与文本，但主要局限于英语环境；Multilingual-CLIP 通过替换并微调文本编码器，成功打破了这一语言壁垒，让用户可以用中文、瑞典语、俄语等一百多种语言直接检索图片，无需依赖翻译。\n\n该项目核心解决了跨语言图文匹配难题，使得非英语用户也能充分利用大规模图像数据集（如 LAION-400M）进行高效搜索与分析。它特别适合 AI 开发者、研究人员以及需要构建多语言图像检索系统的设计师使用。开发者可以轻松调用其提供的 PyTorch 或 TensorFlow 代码，快速集成预训练模型到自己的应用中。\n\n技术亮点在于，它巧妙地将 Hugging Face 上成熟的多语言 Transformer 模型（如 XLM-RoBERTa 和 LaBSE）作为文本编码器，与 OpenAI 的视觉编码器相结合，并在顶部添加线性层进行适配。这不仅保留了原模型强大的视觉理解能力，还赋予了其真正的全球化语言视野。无论是用于学术研究还是实际产品","Multilingual-CLIP 是一个强大的开源项目，旨在让 OpenAI 著名的 CLIP 模型能够理解全球多种语言。原始的 CLIP 模型虽然能出色地连接图像与文本，但主要局限于英语环境；Multilingual-CLIP 
通过替换并微调文本编码器，成功打破了这一语言壁垒，让用户可以用中文、瑞典语、俄语等一百多种语言直接检索图片，无需依赖翻译。\n\n该项目核心解决了跨语言图文匹配难题，使得非英语用户也能充分利用大规模图像数据集（如 LAION-400M）进行高效搜索与分析。它特别适合 AI 开发者、研究人员以及需要构建多语言图像检索系统的设计师使用。开发者可以轻松调用其提供的 PyTorch 或 TensorFlow 代码，快速集成预训练模型到自己的应用中。\n\n技术亮点在于，它巧妙地将 Hugging Face 上成熟的多语言 Transformer 模型（如 XLM-RoBERTa 和 LaBSE）作为文本编码器，与 OpenAI 的视觉编码器相结合，并在顶部添加线性层进行适配。这不仅保留了原模型强大的视觉理解能力，还赋予了其真正的全球化语言视野。无论是用于学术研究还是实际产品开发，Multilingual-CLIP 都为多模态人工智能的普及提供了便捷高效的解决方案。","\u003Cbr \u002F>\n\u003Cp align=\"center\">\n  \u003Ch1 align=\"center\">Multilingual-CLIP\u003C\u002Fh1>\n  \u003Ch3 align=\"center\">OpenAI CLIP text encoders for any language\u003C\u002Fh3>\n  \n  \u003Cp align=\"center\">  \n    \u003Ca href=\"https:\u002F\u002From1504.github.io\u002Fclip-retrieval\u002F?back=https%3A%2F%2Fknn5.laion.ai&index=laion_400m&useMclip=true\">Live Demo\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FM-CLIP\">Pre-trained Models\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FContrastive-Tension\u002Fissues\">Report Bug\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fp>\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmaster\u002FMultilingual_CLIP.ipynb)\n[![pypi](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fmultilingual-clip.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fmultilingual-clip)\n\n\n\u003C!-- ABOUT THE PROJECT -->\n## Overview\n![Alt text](Images\u002FMultilingual-CLIP.png?raw=true \"Title\")\n\n[OpenAI](https:\u002F\u002Fopenai.com\u002F) recently released the paper [Learning Transferable Visual Models From Natural Language Supervision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.00020) in which they present the CLIP (Contrastive Language–Image Pre-training) model. 
This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective.\nCLIP consists of two separate models, a visual encoder and a text encoder. These were trained on a whopping 400 million images and corresponding captions. \nOpenAI has since released a set of their smaller CLIP models, which can be found on the [official CLIP GitHub](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP).\n\n## Demo\nA live demonstration of multilingual Text-Image retrieval using M-CLIP can be found [here!](https:\u002F\u002From1504.github.io\u002Fclip-retrieval\u002F?back=https%3A%2F%2Fknn5.laion.ai&index=laion_400m&useMclip=true) This demo was created by [Rom1504](https:\u002F\u002Fgithub.com\u002From1504), and it allows you to search the LAION-400M dataset in various languages using M-CLIP.\n\n#### This repository contains\n* Pre-trained CLIP-Text encoders for multiple languages\n* PyTorch & TensorFlow inference code\n* TensorFlow training code\n\n### Requirements\nWhile other versions may work equally well, we have worked with the following:\n\n* Python = 3.6.9\n* Transformers = 4.8.1\n\n## Install\n\n`pip install multilingual-clip torch`\n\nYou can also choose to `pip install tensorflow` instead of torch.\n\n\n## Inference Usage\n\nInference code for TensorFlow is also available in [inference_example.py](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmain\u002Finference_example.py)\n\n```python\nfrom multilingual_clip import pt_multilingual_clip\nimport transformers\n\ntexts = [\n    'Three blind horses listening to Mozart.',\n    'Älgen är skogens konung!',\n    'Wie leben Eisbären in der Antarktis?',\n    'Вы знали, что все белые медведи левши?'\n]\nmodel_name = 'M-CLIP\u002FXLM-Roberta-Large-Vit-L-14'\n\n# Load Model & Tokenizer\nmodel = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)\ntokenizer = 
transformers.AutoTokenizer.from_pretrained(model_name)\n\nembeddings = model.forward(texts, tokenizer)\nprint(embeddings.shape)\n```\n\n## Install for development\n\nSetup a virtualenv:\n\n```\npython3 -m venv .env\nsource .env\u002Fbin\u002Factivate\npip install -e .\n```\n\n## Pre-trained Models\nEvery text encoder is a [Huggingface](https:\u002F\u002Fhuggingface.co\u002F) available transformer, with an additional linear layer on top. For more information of a specific model, click the Model Name to see its model card.\n\u003Cbr>\n\u003Cbr>\n\n| Name |Model Base|Vision Model | Vision Dimensions | Pre-trained Languages | #Parameters|\n| ----------------------------------|:-----: |:-----: |:-----: |:-----: | :-----: |\n| [LABSE Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FLABSE-Vit-L-14)| [LaBSE](https:\u002F\u002Fhuggingface.co\u002Fsentence-transformers\u002FLaBSE)|  [OpenAI ViT-L\u002F14](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) | 768 | [109 Languages](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.01852.pdf) | 110 M|\n| [XLM-R Large Vit-B\u002F32](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-32)| [XLM-Roberta-Large](https:\u002F\u002Fhuggingface.co\u002Fxlm-roberta-large)|  [OpenAI ViT-B\u002F32](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) | 512 | [100 Languages](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fmain\u002Fexamples\u002Fxlmr#Introduction) | 344 M|\n| [XLM-R Large Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-L-14)| [XLM-Roberta-Large](https:\u002F\u002Fhuggingface.co\u002Fxlm-roberta-large)|  [OpenAI ViT-L\u002F14](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) | 768 | [100 Languages](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fmain\u002Fexamples\u002Fxlmr#Introduction)|  344 M|\n| [XLM-R Large 
Vit-B\u002F16+](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-16Plus)| [XLM-Roberta-Large](https:\u002F\u002Fhuggingface.co\u002Fxlm-roberta-large)|  [Open CLIP ViT-B-16-plus-240](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) | 640 | [100 Languages](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fmain\u002Fexamples\u002Fxlmr#Introduction)| 344 M|\n\n### Validation & Training Curves\nThe following table reports the \u003Cb>Txt2Img @10-Recall\u003C\u002Fb> on the human-translated [MS-COCO test set](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.07622).\n\n| Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |\n| ----------------------------------|:-----: |:-----: |:-----: |:-----: | :-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |\n| [OpenAI CLIP Vit-B\u002F32](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)| 90.3 | - | - | - | - | - | - | - | - | - | - |\n| [OpenAI CLIP Vit-L\u002F14](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)| 91.8 | - | - | - | - | - | - | - | - | - | - |\n| [OpenCLIP ViT-B-16+-](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)| 94.3 | - | - | - | - | - | - | - | - | - | - |\n| [LABSE Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FLABSE-Vit-L-14)| 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |\n| [XLM-R Large Vit-B\u002F32](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-32)| 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8| 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |\n| [XLM-R Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-L-14)| 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |\n| [XLM-R Large Vit-B\u002F16+](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-16Plus)| \u003Cb>95.0\u003C\u002Fb> | \u003Cb>93.0\u003C\u002Fb> | \u003Cb>93.6\u003C\u002Fb> | \u003Cb>93.1\u003C\u002Fb> | 
\u003Cb>94.0\u003C\u002Fb> | \u003Cb>93.1\u003C\u002Fb> | \u003Cb>94.4\u003C\u002Fb> | \u003Cb>89.0\u003C\u002Fb> | \u003Cb>90.0\u003C\u002Fb> | \u003Cb>93.0\u003C\u002Fb> | \u003Cb>84.2\u003C\u002Fb> |\n\nThe training curves for these models are available in this [Weights and Biases Report](https:\u002F\u002Fwandb.ai\u002Ffreddefrallan\u002FM-CLIP\u002Freports\u002FM-CLIP-2-6-2022--VmlldzoyMTE1MjU1\u002Fedit?firstReport&runsetFilter). Results for other unsuccessful and ongoing experiments can be found in the [Weights and Biases Project](https:\u002F\u002Fwandb.ai\u002Ffreddefrallan\u002FM-CLIP?workspace=user-freddefrallan).\n\n## Legacy Usage and Models\nOlder versions of M-CLIP stored the linear weights separately from Huggingface, whereas the new models have them incorporated directly in the Huggingface repository. More information about these older models can be found in this section. \n\n\u003Cdetails>\n  \u003Csummary>Click for more information\u003C\u002Fsummary>\n  \n##### Download CLIP Model\n```bash\n$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0\n$ pip install ftfy regex tqdm\n$ pip install git+https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\n```\nReplace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine or `cpuonly` when installing on a machine without a GPU.\nFor more information please see the official [CLIP repository](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP).\n##### Download Linear Weights\n```bash\n# Linear Model Weights\n$ bash legacy_get-weights.sh\n```\n\n### Inference\n```python\nfrom multilingual_clip import multilingual_clip\n\nprint(multilingual_clip.AVAILABLE_MODELS.keys())\n\nmodel = multilingual_clip.load_model('M-BERT-Distil-40')\n\nembeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])\nprint(embeddings.shape)\n# Yields: torch.Size([3, 640])\n```\n\n\u003C!--- For a more 
elaborate example, see this [Google Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmaster\u002FMultilingual_CLIP.ipynb). --->\n\nFor a more elaborate example that compares the textual embeddings to the CLIP image embeddings, see this [colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmaster\u002FMultilingual_CLIP.ipynb).\n\n\u003C!-- GETTING STARTED -->\n## Legacy Pre-trained Models\nEvery text encoder is a [Huggingface](https:\u002F\u002Fhuggingface.co\u002F) available transformer, with an additional linear layer on top. None of the models has been extensively tested, but for more information and qualitative test results for a specific model, click the Model Name to see its model card.\n\u003Cbr>\n\u003Cbr>\n\u003Cb>*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the Linear Weights. 
*** \u003C\u002Fb>\n\n\n| Name |Model Base|Vision Model | Pre-trained Languages | Target Languages | #Parameters|\n| ----------------------------------|:-----: |:-----: |:-----: |:-----: |:-----: |\n|**Multilingual**    ||\n| [M-BERT Distil 40](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Distil%2040) | [M-BERT Distil](https:\u002F\u002Fhuggingface.co\u002Fbert-base-multilingual-uncased)|  RN50x4 | [101 Languages](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md#list-of-languages) | [40 Languages](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Distil%2040\u002FFine-Tune-Languages.md) | 66 M|\n| [M-BERT Base 69](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Base%2069) | [M-BERT Base](https:\u002F\u002Fhuggingface.co\u002Fbert-base-multilingual-uncased)|RN50x4 | [101 Languages](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md#list-of-languages) | 68 Languages | 110 M|\n| [M-BERT Base ViT-B](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Base%20ViT-B) | [M-BERT Base](https:\u002F\u002Fhuggingface.co\u002Fbert-base-multilingual-uncased)|ViT-B\u002F32 | [101 Languages](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md#list-of-languages) | 68 Languages | 110 M|\n|**Monolingual**    ||\n|[Swe-CLIP 500k](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FSwe-CLIP%20500k)| [KB-BERT](https:\u002F\u002Fhuggingface.co\u002FKB\u002Fbert-base-swedish-cased)|  RN50x4 | Swedish | Swedish | 110 M|\n|[Swe-CLIP 
2M](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FSwe-CLIP%202M)| [KB-BERT](https:\u002F\u002Fhuggingface.co\u002FKB\u002Fbert-base-swedish-cased)|  RN50x4 | Swedish | Swedish | 110 M|\n\n  \u003C\u002Fdetails>\n  \n## Training a new model\n[This folder](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002Fmultilingual_clip\u002FTeacherLearning) contains the code used for training the above models. If you wish to train your own model, you must do the following:\n\n* Prepare a set of translated sentence pairs from English -> Your Language(s)\n* Compute regular CLIP-Text embeddings for the English sentences.\n* Edit [Training.py](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmain\u002Fmultilingual_clip\u002FTeacherLearning\u002FTraining.py) to load your data.\n* Train a new CLIP-Text encoder via Teacher Learning \n\n### Pre-computed CLIP Embeddings & Translation Data\n[This Google Drive folder](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1I9a7naSZubUATWzLFv61DQMWyFlF7wR5?usp=sharing) contains pre-computed CLIP-Text embeddings for a large portion of the image captions of [GCC](https:\u002F\u002Fai.google.com\u002Fresearch\u002FConceptualCaptions\u002F) + [MSCOCO](https:\u002F\u002Fcocodataset.org\u002F#home) + [VizWiz](https:\u002F\u002Fvizwiz.org\u002Ftasks-and-datasets\u002Fimage-captioning\u002F).\n\nThe Google Drive folder also contains the translation data used to train the currently available models.\nGood luck!\n\n## Contribution\nIf you have trained a CLIP Text encoder specific to your language, or another model covering a language not supported here, please feel free to contact us and we will either upload your model and credit you, or simply link to your already uploaded model.\n\n\u003C!-- CONTACT -->\n## Contact\nIf you have questions regarding the code or otherwise related to 
this Github page, please open an [issue](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FContrastive-Tension\u002Fissues).\n\nFor other purposes, feel free to contact me directly at: Fredrik.Carlsson@ri.se\n\n\u003C!-- ACKNOWLEDGEMENTS -->\n## Acknowledgements\n* [Stability.ai](https:\u002F\u002Fstability.ai\u002F) for providing much appreciated compute during training.\n* [CLIP](https:\u002F\u002Fopenai.com\u002Fblog\u002Fclip\u002F)\n* [OpenAI](https:\u002F\u002Fopenai.com\u002F)\n* [Huggingface](https:\u002F\u002Fhuggingface.co\u002F)\n* [Best Readme Template](https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template)\n* [\"Two Cats\" Image by pl1602](https:\u002F\u002Fsearch.creativecommons.org\u002Fphotos\u002F8dfd802b-58e5-4cc5-889d-96abba540de1)\n\n\u003C!-- LICENSE -->\n## License\nDistributed under the MIT License. See `LICENSE` for more information.\n\n\u003C!-- CITATION -->\n## Citing\nIf you found this repository useful, please consider citing:\n\n```bibtex\n@InProceedings{carlsson-EtAl:2022:LREC,\n  author    = {Carlsson, Fredrik  and  Eisen, Philipp  and  Rekathati, Faton  and  Sahlgren, Magnus},\n  title     = {Cross-lingual and Multilingual CLIP},\n  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},\n  month          = {June},\n  year           = {2022},\n  address        = {Marseille, France},\n  publisher      = {European Language Resources Association},\n  pages     = {6848--6854},\n  abstract  = {The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough, as OpenAI released CLIP. This model distinguishes how well an English text corresponds with a given image with unprecedented accuracy. Trained via a contrastive learning objective over a huge dataset of 400M of images and captions, it is a work that is not easily replicated, especially for low resource languages. 
Capitalizing on the modularization of the CLIP architecture, we propose to use cross-lingual teacher learning to re-train the textual encoder for various non-English languages. Our method requires no image data and relies entirely on machine translation which removes the need for data in the target language. We find that our method can efficiently train a new textual encoder with relatively low computational cost, whilst still outperforming previous baselines on multilingual image-text retrieval.},\n  url       = {https:\u002F\u002Faclanthology.org\u002F2022.lrec-1.739}\n}\n```\n\n\n\u003C!-- MARKDOWN LINKS & IMAGES -->\n\u003C!-- https:\u002F\u002Fwww.markdownguide.org\u002Fbasic-syntax\u002F#reference-style-links -->\n[contributors-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[contributors-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fgraphs\u002Fcontributors\n[forks-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[forks-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fnetwork\u002Fmembers\n[stars-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[stars-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fstargazers\n[issues-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[issues-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fissues\n[license-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[license-url]: 
https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fblob\u002Fmaster\u002FLICENSE.txt\n[linkedin-shield]: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555\n[linkedin-url]: https:\u002F\u002Flinkedin.com\u002Fin\u002Fothneildrew\n[product-screenshot]: images\u002Fscreenshot.png\n","\u003Cbr \u002F>\n\u003Cp align=\"center\">\n  \u003Ch1 align=\"center\">多语言CLIP\u003C\u002Fh1>\n  \u003Ch3 align=\"center\">适用于任何语言的OpenAI CLIP文本编码器\u003C\u002Fh3>\n  \n  \u003Cp align=\"center\">  \n    \u003Ca href=\"https:\u002F\u002From1504.github.io\u002Fclip-retrieval\u002F?back=https%3A%2F%2Fknn5.laion.ai&index=laion_400m&useMclip=true\">在线演示\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FM-CLIP\">预训练模型\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FContrastive-Tension\u002Fissues\">报告问题\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fp>\n\n[![在Colab中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmaster\u002FMultilingual_CLIP.ipynb)\n[![pypi](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fmultilingual-clip.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fmultilingual-clip)\n\n\n\u003C!-- 关于项目 -->\n## 概述\n![Alt text](Images\u002FMultilingual-CLIP.png?raw=true \"标题\")\n\n[OpenAI](https:\u002F\u002Fopenai.com\u002F) 最近发布了论文 [从自然语言监督中学习可迁移的视觉模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.00020)，其中他们介绍了CLIP（对比语言-图像预训练）模型。该模型通过使用对比学习目标来匹配文本和图像的相应向量表示，从而实现文本与图像之间的关联。\nCLIP由两个独立的模型组成：视觉编码器和文本编码器。这两个模型是在惊人的4亿张图片及其对应的标题上进行训练的。\n自那以来，OpenAI发布了一系列较小的CLIP模型，这些模型可以在 [官方CLIP GitHub](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) 上找到。\n\n## 
演示\n使用M-CLIP进行多语言文本-图像检索的实时演示可以在这里找到！[点击此处](https:\u002F\u002From1504.github.io\u002Fclip-retrieval\u002F?back=https%3A%2F%2Fknn5.laion.ai&index=laion_400m&useMclip=true)。这个演示由 [Rom1504](https:\u002F\u002Fgithub.com\u002From1504) 创建，它允许你使用M-CLIP以多种语言搜索LAION-400M数据集。\n\n#### 本仓库包含\n* 多种语言的预训练CLIP文本编码器\n* PyTorch 和 TensorFlow 推理代码\n* TensorFlow 训练代码\n\n### 需求\n虽然其他版本也可能同样适用，但我们主要使用了以下环境：\n\n* Python = 3.6.9\n* Transformers = 4.8.1\n\n## 安装\n\n`pip install multilingual-clip torch`\n\n你也可以选择安装 `tensorflow` 而不是 `torch`。\n\n\n## 推理使用\n\nTensorFlow 的推理代码也包含在 [inference_example.py](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmain\u002Finference_example.py) 中。\n\n```python\nfrom multilingual_clip import pt_multilingual_clip\nimport transformers\n\ntexts = [\n    '三匹盲马正在聆听莫扎特。',\n    '麋鹿是森林之王！',\n    '北极熊如何在南极生活？',\n    '你知道所有的北极熊都是左撇子吗？'\n]\nmodel_name = 'M-CLIP\u002FXLM-Roberta-Large-Vit-L-14'\n\n# 加载模型和分词器\nmodel = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_name)\n\nembeddings = model.forward(texts, tokenizer)\nprint(embeddings.shape)\n```\n\n## 开发环境安装\n\n设置一个虚拟环境：\n\n```\npython3 -m venv .env\nsource .env\u002Fbin\u002Factivate\npip install -e .\n```\n\n## 预训练模型\n每个文本编码器都是一个可在 [Huggingface](https:\u002F\u002Fhuggingface.co\u002F) 上获取的Transformer模型，并在其顶部添加了一个线性层。有关特定模型的更多信息，请点击模型名称查看其模型卡片。\n\u003Cbr>\n\u003Cbr>\n\n| 名称 | 模型基础 | 视觉模型 | 视觉维度 | 预训练语言 | 参数量 |\n| ----------------------------------|:-----: |:-----: |:-----: |:-----: | :-----: |\n| [LABSE Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FLABSE-Vit-L-14)| [LaBSE](https:\u002F\u002Fhuggingface.co\u002Fsentence-transformers\u002FLaBSE)|  [OpenAI ViT-L\u002F14](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) | 768 | [109种语言](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.01852.pdf) | 1.1亿|\n| [XLM-R Large 
Vit-B\u002F32](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-32)| [XLM-Roberta-Large](https:\u002F\u002Fhuggingface.co\u002Fxlm-roberta-large)|  [OpenAI ViT-B\u002F32](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) | 512 | [100种语言](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fmain\u002Fexamples\u002Fxlmr#Introduction) | 3.44亿|\n| [XLM-R Large Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-L-14)| [XLM-Roberta-Large](https:\u002F\u002Fhuggingface.co\u002Fxlm-roberta-large)|  [OpenAI ViT-L\u002F14](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) | 768 | [100种语言](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fmain\u002Fexamples\u002Fxlmr#Introduction)|  3.44亿|\n| [XLM-R Large Vit-B\u002F16+](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-16Plus)| [XLM-Roberta-Large](https:\u002F\u002Fhuggingface.co\u002Fxlm-roberta-large)|  [开放CLIP ViT-B-16-plus-240](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) | 640 | [100种语言](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Ftree\u002Fmain\u002Fexamples\u002Fxlmr#Introduction)| 3.44亿|\n\n### 验证与训练曲线\n以下是针对人工翻译的 [MS-COCO 测试集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.07622) 的 \u003Cb>Txt2Img @10-Recall\u003C\u002Fb> 表格。\n\n| 名称 | 英语 | 德语 | 西班牙语 | 法语 | 中文 | 意大利语 | 波兰语 | 韩语 | 俄语 | 土耳其语 | 日语 |\n| ----------------------------------|:-----: |:-----: |:-----: |:-----: | :-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |\n| [OpenAI CLIP Vit-B\u002F32](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)| 90.3 | - | - | - | - | - | - | - | - | - | - |\n| [OpenAI CLIP Vit-L\u002F14](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)| 91.8 | - | - | - | - | - | - | - | - | - | - |\n| [OpenCLIP ViT-B-16+-](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)| 94.3 | - | - | - | - | - | - | - | - | - | - |\n| [LABSE 
Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FLABSE-Vit-L-14)| 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |\n| [XLM-R Large Vit-B\u002F32](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-32)| 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8| 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |\n| [XLM-R Vit-L\u002F14](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-L-14)| 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |\n| [XLM-R Large Vit-B\u002F16+](https:\u002F\u002Fhuggingface.co\u002FM-CLIP\u002FXLM-Roberta-Large-Vit-B-16Plus)| \u003Cb>95.0\u003C\u002Fb> | \u003Cb>93.0\u003C\u002Fb> | \u003Cb>93.6\u003C\u002Fb> | \u003Cb>93.1\u003C\u002Fb> | \u003Cb>94.0\u003C\u002Fb> | \u003Cb>93.1\u003C\u002Fb> | \u003Cb>94.4\u003C\u002Fb> | \u003Cb>89.0\u003C\u002Fb> | \u003Cb>90.0\u003C\u002Fb> | \u003Cb>93.0\u003C\u002Fb> | \u003Cb>84.2\u003C\u002Fb> |\n\n这些模型的训练曲线可以在这份 [Weights and Biases 报告](https:\u002F\u002Fwandb.ai\u002Ffreddefrallan\u002FM-CLIP\u002Freports\u002FM-CLIP-2-6-2022--VmlldzoyMTE1NTI1NS\u002Fedit?firstReport&runsetFilter) 中找到。其他未成功及正在进行的实验结果则可在 [Weights and Biases 项目](https:\u002F\u002Fwandb.ai\u002Ffreddefrallan\u002FM-CLIP?workspace=user-freddefrallan) 中查阅。\n\n## 遗留用法与模型\n较早版本的 M-CLIP 会将线性权重单独存储在 Huggingface 之外。而新模型则直接将其整合到 Huggingface 仓库中。有关这些旧模型的更多信息可在本节中找到。\n\n\u003Cdetails>\n  \u003Csummary>点击查看更多信息\u003C\u002Fsummary>\n  \n##### 下载 CLIP 模型\n```bash\n$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0\n$ pip install ftfy regex tqdm\n$ pip install git+https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\n```\n请将上述命令中的 `cudatoolkit=11.0` 替换为你机器上对应的 CUDA 版本，或者在无 GPU 的机器上安装时使用 `cpuonly`。\n更多信息请参阅官方 [CLIP 仓库](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP)。\n\n##### 下载线性权重\n```bash\n# 线性模型权重\n$ bash legacy_get-weights.sh\n```\n\n### 推理\n```python\nfrom multilingual_clip import 
multilingual_clip\n\nprint(multilingual_clip.AVAILABLE_MODELS.keys())\n\nmodel = multilingual_clip.load_model('M-BERT-Distil-40')\n\nembeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])\nprint(embeddings.shape)\n# Output: torch.Size([3, 640])\n```\n\n\u003C!--- For a more elaborate example, see this [Google Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmaster\u002FMultilingual_CLIP.ipynb). --->\n\nFor a more elaborate example that compares the text embeddings with the CLIP image embeddings, see this [Colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmaster\u002FMultilingual_CLIP.ipynb).\n\n\u003C!-- GETTING STARTED -->\n## Legacy Pre-trained Models\nEvery text encoder is a Transformer model available on [Huggingface](https:\u002F\u002Fhuggingface.co\u002F), with an additional linear layer on top. None of the models has been extensively tested; for more information and qualitative test results for a specific model, click the model name to see its model card.\n\u003Cbr>\n\u003Cbr>\n\u003Cb>*** When downloading a new model, make sure to update to the latest version of the repository and re-run the shell script to download the linear weights. *** \u003C\u002Fb>\n\n\n| Name | Model Base | Vision Model | Pre-trained Languages | Target Languages | Parameters |\n| ----------------------------------|:-----: |:-----: |:-----: |:-----: |:-----: |\n|**Multilingual**    ||\n| [M-BERT Distil 40](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Distil%2040) | [M-BERT Distil](https:\u002F\u002Fhuggingface.co\u002Fbert-base-multilingual-uncased)|  RN50x4 | [101 languages](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md#list-of-languages) | [40 languages](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Distil%2040\u002FFine-Tune-Languages.md) | 66 million|\n| [M-BERT Base 69](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Base%2069) | [M-BERT Base](https:\u002F\u002Fhuggingface.co\u002Fbert-base-multilingual-uncased)|RN50x4 | [101 
languages](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md#list-of-languages) | 68 languages | 110 million|\n| [M-BERT Base ViT-B](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FM-BERT%20Base%20ViT-B) | [M-BERT Base](https:\u002F\u002Fhuggingface.co\u002Fbert-base-multilingual-uncased)|ViT-B\u002F32 | [101 languages](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md#list-of-languages) | 68 languages | 110 million|\n|**Monolingual**    ||\n|[Swe-CLIP 500k](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FSwe-CLIP%20500k)| [KB-BERT](https:\u002F\u002Fhuggingface.co\u002FKB\u002Fbert-base-swedish-cased)|  RN50x4 | Swedish | Swedish | 110 million|\n|[Swe-CLIP 2M](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002FModel%20Cards\u002FSwe-CLIP%202M)| [KB-BERT](https:\u002F\u002Fhuggingface.co\u002FKB\u002Fbert-base-swedish-cased)|  RN50x4 | Swedish | Swedish | 110 million|\n\n  \u003C\u002Fdetails>\n  \n## Training a New Model\n[This folder](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Ftree\u002Fmain\u002Fmultilingual_clip\u002FTeacherLearning) contains the code used to train the models above. To train your own model, you need to:\n\n* Prepare a set of sentence pairs translated from English into your language(s).\n* Compute regular CLIP text embeddings for the English sentences.\n* Edit [Training.py](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fblob\u002Fmain\u002Fmultilingual_clip\u002FTeacherLearning\u002FTraining.py) to load your data.\n* Train a new CLIP text encoder via teacher learning.\n\n### Pre-computed CLIP Embeddings and Translation Data\n[This Google Drive folder](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1I9a7naSZubUATWzLFv61DQMWyFlF7wR5?usp=sharing) contains, for a large number of [GCC](https:\u002F\u002Fai.google.com\u002Fresearch\u002FConceptualCaptions\u002F) + [MSCOCO](https:\u002F\u002Fcocodataset.org\u002F#home) + [VizWiz](https:\u002F\u002Fvizwiz.org\u002Ftasks-and-datasets\u002Fimage-captioning\u002F) image captions, precomputed CLIP 
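text embeddings.\n\nThe Google Drive folder also contains the translation data used to train the currently available models.\nGood luck!\n\n## Contribution\nIf you have trained a CLIP text encoder specific to your language, or another model covering a language not supported here, feel free to contact us. We will either upload your model and credit you, or link to your already uploaded model.\n\n\u003C!-- CONTACT -->\n## Contact\nIf you have questions about the code or anything else related to this GitHub page, please open an [issue](https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FContrastive-Tension\u002Fissues).\n\nFor other matters, feel free to contact me directly: Fredrik.Carlsson@ri.se\n\n\u003C!-- ACKNOWLEDGEMENTS -->\n## Acknowledgements\n* [Stability.ai](https:\u002F\u002Fstability.ai\u002F) for providing invaluable compute during training.\n* [CLIP](https:\u002F\u002Fopenai.com\u002Fblog\u002Fclip\u002F)\n* [OpenAI](https:\u002F\u002Fopenai.com\u002F)\n* [Huggingface](https:\u002F\u002Fhuggingface.co\u002F)\n* [Best Readme Template](https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template)\n* [\"Two cats\" image by pl1602](https:\u002F\u002Fsearch.creativecommons.org\u002Fphotos\u002F8dfd802b-58e5-4cc5-889d-96abba540de1)\n\n\u003C!-- LICENSE -->\n## License\nDistributed under the MIT License. See the `LICENSE` file for more information.\n\n\u003C!-- CITATION -->\n\n## Citing\nIf you find this repository useful, please consider citing:\n\n```bibtex\n@InProceedings{carlsson-EtAl:2022:LREC,\n  author    = {Carlsson, Fredrik  and  Eisen, Philipp  and  Rekathati, Faton  and  Sahlgren, Magnus},\n  title     = {Cross-lingual and Multilingual CLIP},\n  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},\n  month          = {June},\n  year           = {2022},\n  address        = {Marseille, France},\n  publisher      = {European Language Resources Association},\n  pages     = {6848--6854},\n  abstract  = {The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough, as OpenAI released CLIP. This model distinguishes how well an English text corresponds with a given image with unprecedented accuracy. Trained via a contrastive learning objective over a huge dataset of 400M images and captions, it is a work that is not easily replicated, especially for low-resource languages. Capitalizing on the modularization of the CLIP architecture, we propose to use cross-lingual teacher learning to re-train the textual encoder for various non-English languages. Our method requires no image data and relies entirely on machine translation, which removes the need for data in the target language. We find that our method can efficiently train a new textual encoder with relatively low computational cost, whilst still outperforming previous baselines on multilingual image-text retrieval.},\n  url       = {https:\u002F\u002Faclanthology.org\u002F2022.lrec-1.739}\n}\n```\n\n\n\u003C!-- MARKDOWN LINKS & IMAGES -->\n\u003C!-- https:\u002F\u002Fwww.markdownguide.org\u002Fbasic-syntax\u002F#reference-style-links -->\n[contributors-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[contributors-url]: 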
https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fgraphs\u002Fcontributors\n[forks-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[forks-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fnetwork\u002Fmembers\n[stars-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[stars-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fstargazers\n[issues-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[issues-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fissues\n[license-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fothneildrew\u002FBest-README-Template.svg?style=for-the-badge\n[license-url]: https:\u002F\u002Fgithub.com\u002Fothneildrew\u002FBest-README-Template\u002Fblob\u002Fmaster\u002FLICENSE.txt\n[linkedin-shield]: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555\n[linkedin-url]: https:\u002F\u002Flinkedin.com\u002Fin\u002Fothneildrew\n[product-screenshot]: images\u002Fscreenshot.png","# Multilingual-CLIP Quick Start Guide\n\nMultilingual-CLIP (M-CLIP) provides multilingual text encoders built on the OpenAI CLIP architecture. They map text in many languages into the same vector space as images, which enables cross-lingual image-text retrieval.\n\n## Prerequisites\n\nBefore you start, make sure your development environment meets the following requirements:\n\n*   **Operating system**: Linux, macOS, or Windows\n*   **Python version**: Python 3.6.9 or later recommended\n*   **Core dependencies**:\n    *   `torch` (PyTorch) or `tensorflow`\n    *   `transformers` (Hugging Face)\n\n> **Note**: other versions may also work, but the officially tested environment is Python 3.6.9 with Transformers 4.8.1.\n\n## Installation\n\nInstalling with `pip` is recommended. You can install either the PyTorch or the TensorFlow variant.\n\n### Option 1: PyTorch (recommended)\n\n```bash\npip install multilingual-clip torch\n```\n\n### Option 2: TensorFlow\n\nIf you prefer TensorFlow, run:\n\n```bash\npip 
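install multilingual-clip tensorflow\n```\n\n> **Mirror tip (mainland China)**: if downloads are slow, consider a domestic mirror such as the Tsinghua or Aliyun mirror:\n> ```bash\n> pip install multilingual-clip torch -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## Basic Usage\n\nBelow is a minimal PyTorch inference example that loads a pre-trained model and produces multilingual text embeddings.\n\n### Code Example\n\n```python\nfrom multilingual_clip import pt_multilingual_clip\nimport transformers\n\n# Prepare a list of texts in several languages\ntexts = [\n    'Three blind horses listening to Mozart.',\n    'Älgen är skogens konung!',\n    'Wie leben Eisbären in der Antarktis?',\n    'Вы знали, что все белые медведи левши?'\n]\n\n# Choose the model (XLM-Roberta-Large + ViT-L-14 is used here as an example)\nmodel_name = 'M-CLIP\u002FXLM-Roberta-Large-Vit-L-14'\n\n# Load the model and tokenizer\nmodel = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_name)\n\n# Compute the embeddings\nembeddings = model.forward(texts, tokenizer)\n\n# Print the shape of the embedding tensor\nprint(embeddings.shape)\n```\n\n### Notes\n*   **Model choice**: `model_name` can be any other available model from the Hugging Face Model Hub (for example `LABSE-Vit-L-14`, which covers more languages).\n*   **Output**: `embeddings` is a tensor of shape `[number of texts, embedding dimension]` that can be used directly to compute similarity with image embeddings.","A cross-border e-commerce company serving a global market wants users of any language to find matching product images quickly from a natural-language description.\n\n### Without Multilingual-CLIP\n- **Language barrier**: the system only supports English search, so French, German, or Chinese users must first translate their queries into English or get no results.\n- **High development cost**: the team has to train or fine-tune a separate image-text model for each target language and maintain several independent retrieval indexes, which consumes a lot of server resources.\n- **Semantic mismatch**: simple machine translation often loses context and nuance (such as the subtle difference between the Chinese term “复古风” (retro style) and the English “vintage”), so the returned images are not relevant.\n- **High latency**: each query goes through two hops, a translation API call followed by an English search, which adds network requests and noticeably slows page loads.\n\n### With Multilingual-CLIP\n- **Native multilingual support**: with the pre-trained models, a description typed in French, Swedish, or Russian is matched against images in the same vector space, with no translation step.\n- **One unified architecture**: a single model and a single index cover roughly 100 languages, which reduces operational complexity and cloud cost.\n- **Cross-lingual semantic alignment**: the text encoder is trained via teacher learning to reproduce CLIP's text embeddings, so the same visual concept expressed in different languages lands close together and relevant product images are recalled even when the wording differs.\n- **Faster retrieval**: with the intermediate translation step removed, a query is resolved in one hop, which lowers latency and makes search feel immediate.\n\nBy removing the language barrier, Multilingual-CLIP lets users query visual data directly in their native language, which can improve search efficiency and user experience for international businesses.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFreddeFrallan_Multilingual-CLIP_d853f0f4.png","FreddeFrallan","Fredrik 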
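Carlsson","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FFreddeFrallan_d0ec439c.jpg",null,"https:\u002F\u002Fgithub.com\u002FFreddeFrallan",[82,86,90,94],{"name":83,"color":84,"percentage":85},"Jupyter Notebook","#DA5B0B",95,{"name":87,"color":88,"percentage":89},"Python","#3572A5",4.8,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0.1,{"name":95,"color":96,"percentage":97},"Makefile","#427819",0,828,68,"2026-03-15T19:09:45","MIT","Not specified","Optional (runs on CPU). When using a GPU, set up the CUDA environment that matches your installed PyTorch\u002FTensorFlow version. The README example mentions cudatoolkit=11.0 (for the legacy models); the newer models do not require a specific GPU model or amount of VRAM, but sufficient VRAM is recommended for large models such as ViT-L\u002F14.",{"notes":105,"python":106,"dependencies":107},"This tool mainly provides multilingual text encoders, which can be used on their own to compute text embeddings (no GPU required). Either the PyTorch or the TensorFlow backend can be chosen at install time. To reproduce the legacy models or use specific training scripts, you may need to install the matching CUDA toolkit version manually (for example 11.0). The pre-trained models are hosted on Hugging Face and are downloaded automatically on first use.","3.6.9",[108,109,110,111],"multilingual-clip","torch","transformers==4.8.1","tensorflow (optional)",[26,14,54],"2026-03-27T02:49:30.150509","2026-04-06T08:17:46.519941",[116,121,126,131,136,141,146,151],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},13062,"Is training or fine-tuning code provided?","Yes. The author has uploaded the training code (based on TensorFlow 2). The code has not been fully cleaned up and you need to supply your own parallel corpus, but it works as a starting point for training. The author also provides precomputed data in a Google Drive folder to speed up training.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F1",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},13063,"Is the ViT-B\u002F32 vision model supported?","Yes. The maintainer has uploaded a multilingual text encoder for ViT-B\u002F32. Note, however, that apart from this model, the quality of the other models has not been thoroughly tested.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F2",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},13064,"Is there an XLM-Roberta model that performs better on lower-resource languages?","Yes. An XLM-Roberta model (paired with ViT-B) that outperforms multilingual BERT on lower-resource languages has been released.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F5",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},13065,"How do I train a multilingual text encoder for ViT-L\u002F14?","Follow the training guide in the repository. The key to better results is more translation data (for example, going from 500 thousand to 
2 million samples gives a clear improvement). Machine-translate as much of your collected dataset as you can to increase the amount of data. The multilingual subset of the LAION-5B dataset can be used for training.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F10",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},13066,"Is there a recommended BibTeX citation for this project?","The work was accepted at LREC 2022 as a short paper. No official BibTeX entry had been published at the time of that reply, but citing the conference paper is recommended. Users can check the LREC 2022 conference page for the official citation format.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F11",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},13067,"What is the project's license? Does it also apply to the fine-tuned weights?","The project uses the MIT License. The maintainer has confirmed that the license file has been added and that the fine-tuned weights (mainly the linear transformation) are also released under the MIT License.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F27",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},13068,"Can a large language model such as Vicuna be used as the text encoder to compute embeddings?","Usually not. Vicuna is a decoder-only model, so it cannot be used out of the box to compute embeddings. Mainstream approaches are still based on encoder architectures.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F30",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},13069,"How can I get a 1024-dimensional embedding model (for example, to match AudioCLIP)?","Most of the officially released models are 768-dimensional. If you need a 1024-dimensional model (for example, to match 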
AudioCLIP), consider training a model yourself by following the instructions in the repository. If you run into problems during training, you can ask the community for help, and you are welcome to share the model once it is trained.","https:\u002F\u002Fgithub.com\u002FFreddeFrallan\u002FMultilingual-CLIP\u002Fissues\u002F23",[157,161,165,169,173,177,181,185,189,193],{"id":158,"version":159,"summary_zh":79,"released_at":160},71732,"1.0.10","2022-06-02T22:56:52",{"id":162,"version":163,"summary_zh":79,"released_at":164},71733,"1.0.8","2022-06-02T21:26:58",{"id":166,"version":167,"summary_zh":79,"released_at":168},71734,"1.0.7","2022-06-02T21:14:54",{"id":170,"version":171,"summary_zh":79,"released_at":172},71735,"1.0.6","2022-06-02T21:06:28",{"id":174,"version":175,"summary_zh":79,"released_at":176},71736,"1.0.5","2022-06-02T20:52:17",{"id":178,"version":179,"summary_zh":79,"released_at":180},71737,"1.0.4","2022-06-02T20:37:17",{"id":182,"version":183,"summary_zh":79,"released_at":184},71738,"1.0.3","2022-06-02T20:32:04",{"id":186,"version":187,"summary_zh":79,"released_at":188},71739,"1.0.2","2022-06-02T08:09:37",{"id":190,"version":191,"summary_zh":79,"released_at":192},71740,"1.0.1","2022-06-02T08:05:23",{"id":194,"version":195,"summary_zh":79,"released_at":196},71741,"1.0.0","2022-06-02T06:51:20"]