[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-facebookresearch--large_concept_model":3,"tool-facebookresearch--large_concept_model":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":78,"owner_url":79,"languages":80,"stars":85,"forks":86,"last_commit_at":87,"license":88,"difficulty_score":89,"env_os":90,"env_gpu":91,"env_ram":92,"env_deps":93,"category_tags":105,"github_topics":106,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":112,"updated_at":113,"faqs":114,"releases":142},8891,"facebookresearch\u002Flarge_concept_model","large_concept_model","Large Concept Models: Language modeling in a sentence representation space","large_concept_model 是 Meta 开源的一种新型语言建模框架，旨在让 AI 在“概念”而非单纯的词汇层面理解和生成语言。传统大模型通常逐字预测下一个 token，而 large_concept_model 则是在句子级的语义空间中进行自回归预测。它将整个句子抽象为一个与语言和模态无关的“概念”向量（基于 SONAR 嵌入空间），支持涵盖 200 种文本语言和 57 种语音语言的跨模态处理。\n\n这一架构主要解决了传统模型在处理长程语义依赖和多语言混合场景时的局限性，通过更高层级的语义表示，提升了模型对整体句意的把握能力。项目提供了基于 16 亿参数模型的训练方案，涵盖了均方误差回归和扩散生成等多种技术路径，并使用了万亿级 token 数据进行验证。\n\nlarge_concept_model 
非常适合人工智能研究人员、算法工程师以及对下一代语言模型架构感兴趣的开发者使用。其独特的技术亮点在于跳出了传统的离散词符建模，探索了在连续语义空间中进行序列生成的可能性，为构建更通用、更高效的智能体提供了新的研究范式。虽然目前主要面向科研实验，但其展现出的跨语言与跨模态潜力，预示着","large_concept_model 是 Meta 开源的一种新型语言建模框架，旨在让 AI 在“概念”而非单纯的词汇层面理解和生成语言。传统大模型通常逐字预测下一个 token，而 large_concept_model 则是在句子级的语义空间中进行自回归预测。它将整个句子抽象为一个与语言和模态无关的“概念”向量（基于 SONAR 嵌入空间），支持涵盖 200 种文本语言和 57 种语音语言的跨模态处理。\n\n这一架构主要解决了传统模型在处理长程语义依赖和多语言混合场景时的局限性，通过更高层级的语义表示，提升了模型对整体句意的把握能力。项目提供了基于 16 亿参数模型的训练方案，涵盖了均方误差回归和扩散生成等多种技术路径，并使用了万亿级 token 数据进行验证。\n\nlarge_concept_model 非常适合人工智能研究人员、算法工程师以及对下一代语言模型架构感兴趣的开发者使用。其独特的技术亮点在于跳出了传统的离散词符建模，探索了在连续语义空间中进行序列生成的可能性，为构建更通用、更高效的智能体提供了新的研究范式。虽然目前主要面向科研实验，但其展现出的跨语言与跨模态潜力，预示着未来在人机交互领域的广阔应用前景。","# Large Concept Models\n## Language Modeling in a Sentence Representation Space\n\n[[Blog]](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fmeta-fair-updates-agents-robustness-safety-architecture\u002F) [[Paper]](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Flarge-concept-models-language-modeling-in-a-sentence-representation-space\u002F)\n\nThis repository provides the official implementations and experiments for [Large Concept Models](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Flarge-concept-models-language-modeling-in-a-sentence-representation-space\u002F) (**LCM**).\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"space.svg\" width=\"50%\">\n\u003C\u002Fp>\n\n\n\nThe LCM operates on an explicit higher-level semantic representation,\nwhich we name a \"concept\". Concepts are language- and modality-agnostic and represent a higher\nlevel idea. In this work, a concept corresponds to a sentence, and we use the [SONAR](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSONAR)\nembedding space, which supports up to 200 languages in text and 57 languages in speech. 
See the list of supported languages [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSONAR?tab=readme-ov-file#supported-languages-and-download-links).\n\n\n## Approach\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"lcm.svg\" width=\"70%\">\n\u003C\u002Fp>\n\n\n\nThe LCM is a sequence-to-sequence model in the concepts space trained to perform auto-regressive sentence prediction.\nWe explore multiple approaches:\n- MSE regression (`base_lcm` in this code).\n- Variants of diffusion-based generation (we include `two_tower_diffusion_lcm` in this release).\n- Models operating in a quantized SONAR space (coming soon).\n\nThese explorations are performed using 1.6B parameter models and training data on the order of 1.3T tokens. We include in this repository recipes to reproduce the training and finetuning of 1.6B MSE LCM and Two-tower diffusion LCM. See instructions [below](#usage).\n\n## Installing\n\n### Using UV\n\nThe LCM repository relies on fairseq2. If you have `uv` installed on your system, you can install a virtual environment with all the necessary packages by running the following commands:\n```bash\nuv sync --extra cpu --extra eval --extra data\n```\n\nYou can also use `uv run` to run the demo commands with the correct environment.\n\nNote that we only provide requirements for `cpu` dependencies; if you want GPU support, you will have to choose the variants of torch and fairseq2 that work for your system.\nFor example, for torch 2.5.1 with CUDA 12.1, you would do something like:\n```\nuv pip install torch==2.5.1 --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121 --upgrade\nuv pip install fairseq2==v0.3.0rc1 --pre --extra-index-url  https:\u002F\u002Ffair.pkg.atmeta.com\u002Ffairseq2\u002Fwhl\u002Frc\u002Fpt2.5.1\u002Fcu121 --upgrade\n```\n\nCheck [fairseq2 variants](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq2?tab=readme-ov-file#variants) for possible variants. 
Note that LCM currently relies on the release candidate for fairseq2 0.3.0 rc1.\n\n### Using pip\n\nTo install with pip, the commands are very similar, but you will have to manage your own environment and make sure to install fairseq2 manually first. For instance, for a `cpu` install.\n\n```bash\npip install --upgrade pip\npip install fairseq2==v0.3.0rc1 --pre --extra-index-url  https:\u002F\u002Ffair.pkg.atmeta.com\u002Ffairseq2\u002Fwhl\u002Frc\u002Fpt2.5.1\u002Fcpu\npip install -e \".[data,eval]\"\n```\n\nIf [fairseq2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq2) does not provide a build for your machine, check the readme of that project to build it locally.\n\n## Usage\n\n> [!NOTE]\n> If using `uv` prefix all commands with `uv run` to use the environment created by default in `.venv`, e.g.,\n> `uv run torchrun --standalone`.\n> Alternatively, you can activate the environment once and for all with `source .venv\u002Fbin\u002Factivate`.\n\n### Preparing data\n\nThe LCM can be trained and evaluated using textual data split in sentences and embedded with [SONAR](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSONAR\u002F). We provide a sample processing pipeline that can be used to prepare such training data, you can run it with:\n\n```\n uv run --extra data scripts\u002Fprepare_wikipedia.py \u002Foutput\u002Fdir\u002Ffor\u002Fthe\u002Fdata\n ```\n\n This pipeline shows how to get a dataset from huggingface and process it with SONAR and [SaT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678). Check out the file for more details on processing your own data. While the script provides an example pulling data from huggingface, we also provide [APIs](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fstopes\u002Ftree\u002Fmain\u002Fstopes\u002Futils\u002Fsharding) to process jsonl, parquet and CSV.\n\n### Datacards\n\nThe trainer described below relies on datacards configuring the datasets. 
These datacards are yaml files with pointers to the dataset files (locally or on s3) and information on its schema. We provide some sample datacards in [`lcm\u002Fdatacards\u002Fdatacards.yaml`](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flarge_concept_model\u002Fblob\u002Fmain\u002Flcm\u002Fdatacards\u002Fdatacards.yaml). Once you have processed some data, you can update the datacards with your paths.\n\n#### Fitting a normalizer\nTo fit a new embedding space normalizer on a given weighted mixture of datasets\none can use the following command :\n```bash\npython scripts\u002Ffit_embedding_normalizer.py --ds dataset1:4 dataset2:1 dataset3:10 --save_path \"path\u002Fto\u002Fnew\u002Fnormalizer.pt\" --max_nb_samples 1000000\n```\nHere, `dataset1`, `dataset2`, `dataset3` are the names of datasets declared in the datacards as shown above\nand `(4, 1, 10)` their respective relative weights.\nThe resulting normalizer can be next declared as a model as shown in `lcm\u002Fcards\u002Fsonar_normalizer.yaml`\nand referenced in all model training configs.\n\n\n### Pre-training models\n\n#### Base MSE LCM\n\nTo train an MSE LCM, we will use one of the following commands:\n\n**Option 1.** Training with SLURM using [submitit](https:\u002F\u002Fgithub.com\u002Ffacebookincubator\u002Fsubmitit) via [stopes](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fstopes\u002Ftree\u002Fmain)'s launcher:\n```sh\npython -m lcm.train \\\n    +pretrain=mse \\\n    ++trainer.output_dir=\"checkpoints\u002Fmse_lcm\" \\\n    ++trainer.experiment_name=training_mse_lcm \\\n```\nWith this command, we will submit a slurm job named `training_mse_lcm` with the recipe's requirements, in this case:\n```yaml\nrequirements:\n  nodes: 4\n  tasks_per_node: 8\n  gpus_per_node: 8\n  cpus_per_task: 32\n  mem_gb: 0\n  timeout_min: 10000\n```\nYou can override the job's requirements like the timeout limit and the launcher's slurm partition with:\n```sh\npython -m lcm.train \\\n    +pretrain=mse 
\\\n    ++trainer.output_dir=\"checkpoints\u002Fmse_lcm\" \\\n    ++trainer.experiment_name=training_mse_lcm \\\n    ++trainer.requirements.timeout_min=100 \\\n    ++trainer.requirements.cpus_per_task=8 \\\n    ++launcher.partition=$partition_name\n```\n\n**Option 2.** Training locally with `torchrun` (e.g. using only 2 GPUs) with a smaller batch size (overriding `++trainer.data_loading_config.max_tokens=1000`):\n```sh\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train launcher=standalone \\\n    +pretrain=mse \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    ++trainer.output_dir=\"checkpoints\u002Fmse_lcm\" \\\n    +trainer.use_submitit=false \\\n```\n> [!IMPORTANT]\n> Since we're changing the number of GPUs required by the recipe, this will not reproduce the experimental setup of the paper.\n\nThe checkpoints directory `checkpoints\u002Fmse_lcm` will be structured as:\n```\n.\n├── checkpoints\n│   ├── step_2000\n│   ├── ...\n│   └── step_250000\n├── config_logs\n├── executor_logs\n├── model_card.yaml\n├── tb   # tensorboard logs\n└── wandb  # W&B logs\n```\nNote that W&B logging is skipped unless `wandb` is available.\nYou can install `wandb` with `uv pip install wandb`.\nW&B arguments can be changed by overriding Hydra config values in the recipe:\n\n```sh\n++trainer.wandb_project=$project_name\n++trainer.wandb_run_name=$run_name\n```\n\n#### Two-tower diffusion LCM\n\nSimilar to the base MSE LCM we can submit a training job following the recipe in [.\u002Frecipes\u002Ftrain\u002Fpretrain\u002Ftwo_tower.yaml](.\u002Frecipes\u002Ftrain\u002Fpretrain\u002Ftwo_tower.yaml) via:\n\n```sh\npython -m lcm.train \\\n    +pretrain=two_tower \\\n    ++trainer.output_dir=\"checkpoints\u002Ftwo_tower_lcm\" \\\n    ++trainer.experiment_name=training_two_tower_lcm \\\n```\n\n> [!TIP]\n> To understand the different ingredients of training recipes, check [this README](.\u002Frecipes\u002Ftrain\u002FREADME.md).\n\n\n### 
Finetuning models\nTo finetune the previously pre-trained two-tower diffusion LCM on supervised data,  follow these steps:\n\n**Step 1.** Register the pre-trained checkpoint as a fairseq2 asset.\n\nYou can finetune the final checkpoint with the card `checkpoints\u002Ftwo_tower_lcm\u002Fmodel_card.yaml` or any checkpoint after a specific number of training steps, e.g., `checkpoints\u002Ftwo_tower_lcm\u002Fcheckpoints\u002Fstep_2000\u002Fmodel_card.yaml`.\nTo register the selected checkpoint, copy the automatically created yaml file to `.\u002Flcm\u002Fcards\u002Fmycards.yaml` and rename the model to replace the default `on_the_fly_lcm`.\n`.\u002Flcm\u002Fcards\u002Fmycards.yaml` will look like:\n```yaml\n__source__: inproc\n checkpoint: file:\u002F\u002Fpath_to\u002Flarge_concept_model\u002Fcheckpoints\u002Ftwo_tower_lcm\u002Fcheckpoints\u002Fstep_2000\u002Fmodel.pt\n model_arch: two_tower_diffusion_lcm_1_6B\n model_family: two_tower_diffusion_lcm\n name: my_pretrained_two_tower\n```\nFor more on how to manage fairseq2 assets, see [documentation](https:\u002F\u002Ffacebookresearch.github.io\u002Ffairseq2\u002Fnightly\u002Fbasics\u002Fassets.html).\n\n**Step 2.** Launch a finetuning job pointing to the model to finetune, in this instance `my_pretrained_two_tower`:\n```sh\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train launcher=standalone \\\n    +finetune=two_tower \\\n    ++trainer.output_dir=\"checkpoints\u002Ffinetune_two_tower_lcm\" \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    +trainer.use_submitit=false \\\n    ++trainer.model_config_or_name=my_pretrained_two_tower\n```\nor\n\n```sh\npython -m lcm.train \\\n    +finetune=two_tower \\\n    ++trainer.output_dir=\"checkpoints\u002Ffinetune_two_tower_lcm\" \\\n    ++trainer.experiment_name=finetune_two_tower_lcm \\\n    ++trainer.model_config_or_name=my_pretrained_two_tower\n```\n\nSimilarly, to finetune an MSE LCM, follow the same instructions for 
registering a pre-trained checkpoint and submit a finetuning job with the appropriate recipe ([.\u002Frecipes\u002Ftrain\u002Ffinetune\u002Fmse.yaml](.\u002Frecipes\u002Ftrain\u002Ffinetune\u002Fmse.yaml)) via:\n```sh\npython -m lcm.train \\\n    +finetune=mse \\\n    ++trainer.output_dir=\"checkpoints\u002Ffinetune_mse_lcm\" \\\n    ++trainer.experiment_name=finetune_mse_lcm \\\n    ++trainer.model_config_or_name=my_pretrained_mse_lcm\n```\n### Evaluating models\n\n\n> [!NOTE]\n> For advanced evaluation (benchmarking different tasks, comparing results with LLMs, etc.), check [the evaluation documentation](.\u002Fexamples\u002Fevaluation\u002FREADME.md).\n\n\n**Step 0.** Download NLTK data required for evaluating ROUGE:\n```sh\npython -m nltk.downloader punkt_tab\n```\n\n**Step 1.**\nGenerate and score outputs of a model either by pointing to its `model_card` yaml file or after registering it as a fairseq2 asset (the same way we registered `my_pretrained_two_tower`):\n```sh\nmodel_card=.\u002Fcheckpoints\u002Ffinetune_two_tower_lcm\u002Fcheckpoints\u002Fstep_1000\u002Fmodel_card.yaml\nOUTPUT_DIR=evaluation_outputs\u002Ftwo_tower\n\ntorchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation  \\\n  --predictor two_tower_diffusion_lcm  \\\n  --show_progress true \\\n  --data_loading.max_samples 100 \\\n  --model_card ${model_card} \\\n  --launcher standalone \\\n  --dataset.source_suffix_text '[MODEL]:' \\\n  --tasks finetuning_data_lcm.validation \\\n   --task_args '{\"max_gen_len\": 10, \"eos_config\": {\"text\": \"End of text.\"}}' \\\n  --data_loading.batch_size 4  --generator_batch_size 4 \\\n  --dump_dir ${OUTPUT_DIR} \\\n  --inference_timesteps 40 \\\n  --initial_noise_scale 0.6 \\\n  --guidance_scale 3 \\\n  --guidance_rescale 0.7\n```\nwhere in the example we are evaluating 100 samples only (`--data_loading.max_samples 100`) and limiting the model output length to 10 sentences (`--task_args '{\"max_gen_len\": 10}'`).\n\nOutputs dumped in 
`.\u002Fevaluation_outputs\u002Ftwo_tower` will be structured as:\n```\n.\n├── metadata.jsonl\n├── metrics.eval.jsonl\n├── raw_results\n├── results\n└── tb\n```\nwhere `metrics.eval.jsonl` contains corpus-level scores.\n\n\nTo evaluate an MSE LCM, we use the associated predictor (`base_lcm`) and evaluate with:\n\n```sh\nmodel_card=.\u002Fcheckpoints\u002Ffinetune_mse_lcm\u002Fcheckpoints\u002Fstep_1000\u002Fmodel_card.yaml\nOUTPUT_DIR=evaluation_outputs\u002Fmse_lcm\n\ntorchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation  \\\n  --predictor base_lcm --sample_latent_variable False \\\n  --show_progress true \\\n  --data_loading.max_samples 100 \\\n  --model_card ${model_card} \\\n  --launcher standalone \\\n  --dataset.source_suffix_text '[MODEL]:' \\\n  --tasks finetuning_data_lcm.validation \\\n   --task_args '{\"max_gen_len\": 10, \"eos_config\": {\"text\": \"End of text.\"}}' \\\n  --data_loading.batch_size 4  --generator_batch_size 4 \\\n  --dump_dir ${OUTPUT_DIR} \\\n```\n\nNote that in this example, we only show how to evaluate the LCM on the same finetuning dataset (validation split). To evaluate in a downstream task, and compare results with the LLM, refer to the [Evaluation documentation](.\u002Fexamples\u002Fevaluation\u002FREADME.md).\n\n## Contributing\n\nSee the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.\n\n## Citation\n\nIf you use this codebase, please cite:\n```\n@article{lcm2024,\n  author = {{LCM team}, Lo\\\"{i}c Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. 
Costa-juss\\`{a}, David Dale, Hady Elsahar, Kevin Heffernan, Jo\\~{a}o Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk},\n  title = {{Large Concept Models}: Language Modeling in a Sentence Representation Space},\n  publisher = {arXiv},\n  year = {2024},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.08821},\n}\n```\n\n## License\n\nThis code is released under the MIT license (see [LICENSE](.\u002FLICENSE)).\n","# 大概念模型\n## 句子表示空间中的语言建模\n\n[[博客]](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fmeta-fair-updates-agents-robustness-safety-architecture\u002F) [[论文]](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Flarge-concept-models-language-modeling-in-a-sentence-representation-space\u002F)\n\n本仓库提供了[大概念模型](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Flarge-concept-models-language-modeling-in-a-sentence-representation-space\u002F)（**LCM**）的官方实现与实验。\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"space.svg\" width=\"50%\">\n\u003C\u002Fp>\n\n\n\nLCM 在一种显式的高层语义表示上运行，我们将其称为“概念”。概念具有语言和模态无关性，代表更高层次的思想。在本工作中，一个概念对应于一句话，我们使用 [SONAR](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSONAR) 嵌入空间，该空间支持文本中的 200 种语言以及语音中的 57 种语言。受支持的语言列表请参见 [此处](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSONAR?tab=readme-ov-file#supported-languages-and-download-links)。\n\n\n## 方法\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"lcm.svg\" width=\"70%\">\n\u003C\u002Fp>\n\n\n\nLCM 是一个在概念空间中进行自回归句子预测的序列到序列模型。我们探索了多种方法：\n- 均方误差回归（此代码中的 `base_lcm`）。\n- 基于扩散的生成变体（本次发布包含 `two_tower_diffusion_lcm`）。\n- 在量化 SONAR 空间中运行的模型（即将推出）。\n\n这些探索均采用 16 亿参数的模型，并以约 1.3 万亿个 token 的数据进行训练。本仓库包含了重现 16 亿参数 MSE LCM 和双塔扩散 LCM 训练及微调的脚本。具体说明请参见 [下方](#usage)。\n\n## 安装\n\n### 使用 UV\n\nLCM 仓库依赖于 fairseq2。如果你的系统已安装 `uv`，可以通过运行以下命令来创建一个包含所有必要包的虚拟环境：\n```bash\nuv sync --extra cpu --extra eval --extra data\n```\n\n你也可以使用 `uv run` 来在正确的环境中运行演示命令。\n\n请注意，我们仅提供 CPU 依赖项的要求；若需使用 GPU 
支持，则需要选择适合你系统的 PyTorch 和 fairseq2 版本。例如，对于 PyTorch 2.5.1 和 CUDA 12.1，你可以执行如下操作：\n```\nuv pip install torch==2.5.1 --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121 --upgrade\nuv pip install fairseq2==v0.3.0rc1 --pre --extra-index-url  https:\u002F\u002Ffair.pkg.atmeta.com\u002Ffairseq2\u002Fwhl\u002Frc\u002Fpt2.5.1\u002Fcu121 --upgrade\n```\n\n请参考 [fairseq2 版本](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq2?tab=readme-ov-file#variants) 以了解可能的版本。需要注意的是，LCM 目前依赖于 fairseq2 0.3.0 rc1 发布候选版本。\n\n### 使用 pip\n\n使用 pip 安装时，命令非常相似，但你需要自行管理环境，并确保先手动安装 fairseq2。例如，对于 CPU 安装：\n\n```bash\npip install --upgrade pip\npip install fairseq2==v0.3.0rc1 --pre --extra-index-url  https:\u002F\u002Ffair.pkg.atmeta.com\u002Ffairseq2\u002Fwhl\u002Frc\u002Fpt2.5.1\u002Fcpu\npip install -e \".[data,eval]\"\n```\n\n如果 [fairseq2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq2) 没有为你使用的机器提供构建版本，请查阅该项目的 README 文件，以在本地编译构建。\n\n## 使用\n\n> [!NOTE]\n> 如果使用 `uv`，请在所有命令前加上 `uv run`，以使用默认在 `.venv` 中创建的环境，例如：\n> `uv run torchrun --standalone`。\n> 或者，你也可以通过 `source .venv\u002Fbin\u002Factivate` 一次性激活环境。\n\n### 数据准备\n\nLCM 可以使用按句子分割并用 [SONAR](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSONAR\u002F) 嵌入的文本数据进行训练和评估。我们提供了一个可用于准备此类训练数据的示例处理流程，你可以通过以下命令运行它：\n```\n uv run --extra data scripts\u002Fprepare_wikipedia.py \u002Foutput\u002Fdir\u002Ffor\u002Fthe\u002Fdata\n ```\n\n该流程展示了如何从 Hugging Face 获取数据，并使用 SONAR 和 [SaT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678) 进行处理。请查看文件以获取更多关于处理自有数据的详细信息。虽然该脚本提供了从 Hugging Face 拉取数据的示例，但我们还提供了 [API](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fstopes\u002Ftree\u002Fmain\u002Fstopes\u002Futils\u002Fsharding) 来处理 jsonl、parquet 和 CSV 文件。\n\n### 数据卡片\n\n下文所述的训练器依赖于用于配置数据集的数据卡片。这些数据卡片是 YAML 文件，其中包含指向数据文件（本地或 S3 上）的指针以及有关其模式的信息。我们在 
[`lcm\u002Fdatacards\u002Fdatacards.yaml`](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flarge_concept_model\u002Fblob\u002Fmain\u002Flcm\u002Fdatacards\u002Fdatacards.yaml) 中提供了一些示例数据卡片。当你处理完一些数据后，可以更新数据卡片中的路径。\n\n#### 拟合归一化器\n要为给定的加权数据集混合物拟合一个新的嵌入空间归一化器，可以使用以下命令：\n```bash\npython scripts\u002Ffit_embedding_normalizer.py --ds dataset1:4 dataset2:1 dataset3:10 --save_path \"path\u002Fto\u002Fnew\u002Fnormalizer.pt\" --max_nb_samples 1000000\n```\n其中，`dataset1`、`dataset2`、`dataset3` 是如上所示数据卡片中声明的数据集名称，而 `(4, 1, 10)` 则是它们各自的相对权重。\n生成的归一化器随后可以按照 `lcm\u002Fcards\u002Fsonar_normalizer.yaml` 中所示的方式声明为模型，并在所有模型训练配置中引用。\n\n### 预训练模型\n\n#### 基础 MSE LCM\n\n要训练一个 MSE LCM，我们可以使用以下命令之一：\n\n**选项 1.** 使用 SLURM 和 [submitit](https:\u002F\u002Fgithub.com\u002Ffacebookincubator\u002Fsubmitit) 通过 [stopes](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fstopes\u002Ftree\u002Fmain) 的启动器进行训练：\n```sh\npython -m lcm.train \\\n    +pretrain=mse \\\n    ++trainer.output_dir=\"checkpoints\u002Fmse_lcm\" \\\n    ++trainer.experiment_name=training_mse_lcm \\\n```\n使用此命令，我们将提交一个名为 `training_mse_lcm` 的 SLURM 作业，并按照配方的要求执行，在本例中为：\n```yaml\nrequirements:\n  nodes: 4\n  tasks_per_node: 8\n  gpus_per_node: 8\n  cpus_per_task: 32\n  mem_gb: 0\n  timeout_min: 10000\n```\n您可以通过以下方式覆盖作业的要求，例如超时限制和启动器的 SLURM 分区：\n```sh\npython -m lcm.train \\\n    +pretrain=mse \\\n    ++trainer.output_dir=\"checkpoints\u002Fmse_lcm\" \\\n    ++trainer.experiment_name=training_mse_lcm \\\n    ++trainer.requirements.timeout_min=100 \\\n    ++trainer.requirements.cpus_per_task=8 \\\n    ++launcher.partition=$partition_name\n```\n\n**选项 2.** 使用 `torchrun` 在本地进行训练（例如仅使用 2 张 GPU），并采用较小的批大小（覆盖 `++trainer.data_loading_config.max_tokens=1000`）：\n```sh\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train launcher=standalone \\\n    +pretrain=mse \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    ++trainer.output_dir=\"checkpoints\u002Fmse_lcm\" \\\n    
+trainer.use_submitit=false \\\n```\n> [!重要]\n> 由于我们更改了配方所需的 GPU 数量，因此这将无法复现论文中的实验设置。\n\n检查点目录 `checkpoints\u002Fmse_lcm` 的结构如下：\n```\n.\n├── checkpoints\n│   ├── step_2000\n│   ├── ...\n│   └── step_250000\n├── config_logs\n├── executor_logs\n├── model_card.yaml\n├── tb   # tensorboard 日志\n└── wandb  # W&B 日志\n```\n请注意，除非安装了 `wandb`，否则将跳过 W&B 日志记录。\n您可以使用 `uv pip install wandb` 来安装 `wandb`。\n可以通过覆盖配方中的 Hydra 配置值来更改 W&B 参数：\n\n```sh\n++trainer.wandb_project=$project_name\n++trainer.wandb_run_name=$run_name\n```\n\n#### 双塔扩散 LCM\n\n与基础 MSE LCM 类似，我们也可以按照 [.\u002Frecipes\u002Ftrain\u002Fpretrain\u002Ftwo_tower.yaml](.\u002Frecipes\u002Ftrain\u002Fpretrain\u002Ftwo_tower.yaml) 中的配方提交训练作业：\n\n```sh\npython -m lcm.train \\\n    +pretrain=two_tower \\\n    ++trainer.output_dir=\"checkpoints\u002Ftwo_tower_lcm\" \\\n    ++trainer.experiment_name=training_two_tower_lcm \\\n```\n\n> [!提示]\n> 要了解训练配方中的不同组成部分，请查看 [此 README](.\u002Frecipes\u002Ftrain\u002FREADME.md)。\n\n\n### 微调模型\n要对先前预训练的双塔扩散 LCM 进行监督数据上的微调，请按照以下步骤操作：\n\n**步骤 1.** 将预训练的检查点注册为 fairseq2 资产。\n\n您可以使用带有卡片 `checkpoints\u002Ftwo_tower_lcm\u002Fmodel_card.yaml` 的最终检查点，或任何经过特定训练步数后的检查点，例如 `checkpoints\u002Ftwo_tower_lcm\u002Fcheckpoints\u002Fstep_2000\u002Fmodel_card.yaml`。要注册选定的检查点，将自动生成的 YAML 文件复制到 `.\u002Flcm\u002Fcards\u002Fmycards.yaml`，并重命名模型以替换默认的 `on_the_fly_lcm`。`.\u002Flcm\u002Fcards\u002Fmycards.yaml` 将如下所示：\n```yaml\n__source__: inproc\n checkpoint: file:\u002F\u002Fpath_to\u002Flarge_concept_model\u002Fcheckpoints\u002Ftwo_tower_lcm\u002Fcheckpoints\u002Fstep_2000\u002Fmodel.pt\n model_arch: two_tower_diffusion_lcm_1_6B\n model_family: two_tower_diffusion_lcm\n name: my_pretrained_two_tower\n```\n有关如何管理 fairseq2 资产的更多信息，请参阅 [文档](https:\u002F\u002Ffacebookresearch.github.io\u002Ffairseq2\u002Fnightly\u002Fbasics\u002Fassets.html)。\n\n**步骤 2.** 启动指向要微调模型的微调作业，在本例中为 `my_pretrained_two_tower`：\n```sh\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train 
launcher=standalone \\\n    +finetune=two_tower \\\n    ++trainer.output_dir=\"checkpoints\u002Ffinetune_two_tower_lcm\" \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    +trainer.use_submitit=false \\\n    ++trainer.model_config_or_name=my_pretrained_two_tower\n```\n或者\n\n```sh\npython -m lcm.train \\\n    +finetune=two_tower \\\n    ++trainer.output_dir=\"checkpoints\u002Ffinetune_two_tower_lcm\" \\\n    ++trainer.experiment_name=finetune_two_tower_lcm \\\n    ++trainer.model_config_or_name=my_pretrained_two_tower\n```\n\n同样，要微调一个 MSE LCM，可以按照相同的预训练检查点注册说明，并使用相应的配方（[.\u002Frecipes\u002Ftrain\u002Ffinetune\u002Fmse.yaml](.\u002Frecipes\u002Ftrain\u002Ffinetune\u002Fmse.yaml)）提交微调作业：\n```sh\npython -m lcm.train \\\n    +finetune=mse \\\n    ++trainer.output_dir=\"checkpoints\u002Ffinetune_mse_lcm\" \\\n    ++trainer.experiment_name=finetune_mse_lcm \\\n    ++trainer.model_config_or_name=my_pretrained_mse_lcm\n```\n\n### 评估模型\n\n\n> [!NOTE]\n> 对于高级评估（例如不同任务的基准测试、与大语言模型的结果比较等），请查看[评估文档](.\u002Fexamples\u002Fevaluation\u002FREADME.md)。\n\n\n**步骤 0.** 下载用于评估 ROUGE 所需的 NLTK 数据：\n```py\npython -m nltk.downloader punkt_tab\n```\n\n**步骤 1.**\n可以通过指向模型的 `model_card` YAML 文件，或者在将其注册为 fairseq2 资产之后（与我们注册 `my_pretrained_two_tower` 的方式相同），生成并评分模型的输出：\n```sh\nmodel_card=.\u002Fcheckpoints\u002Ffinetune_two_tower_lcm\u002Fcheckpoints\u002Fstep_1000\u002Fmodel_card.yaml\nOUTPUT_DIR=evaluation_outputs\u002Ftwo_tower\n\ntorchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation  \\\n  --predictor two_tower_diffusion_lcm  \\\n  --show_progress true \\\n  --data_loading.max_samples 100 \\\n  --model_card ${model_card} \\\n  --launcher standalone \\\n  --dataset.source_suffix_text '[MODEL]:' \\\n  --tasks finetuning_data_lcm.validation \\\n   --task_args '{\"max_gen_len\": 10, \"eos_config\": {\"text\": \"End of text.\"}}' \\\n  --data_loading.batch_size 4  --generator_batch_size 4 \\\n  --dump_dir ${OUTPUT_DIR} \\\n  --inference_timesteps 40 \\\n  
--initial_noise_scale 0.6 \\\n  --guidance_scale 3 \\\n  --guidance_rescale 0.7\n```\n其中，在示例中我们仅评估 100 个样本（`--data_loading.max_samples 100`），并将模型输出长度限制为 10 句话（`--task_args '{\"max_gen_len\": 10}'`）。\n\n输出将被转储到 `.\u002Fevaluation_outputs\u002Ftwo_tower` 目录下，其结构如下：\n```\n.\n├── metadata.jsonl\n├── metrics.eval.jsonl\n├── raw_results\n├── results\n└── tb\n```\n其中 `metrics.eval.jsonl` 包含语料级别的评分。\n\n要评估 MSE LCM，我们使用相应的预测器 (`base_lcm`) 并按以下方式评估：\n\n```sh\nmodel_card=.\u002Fcheckpoints\u002Ffinetune_mse_lcm\u002Fcheckpoints\u002Fstep_1000\u002Fmodel_card.yaml\nOUTPUT_DIR=evaluation_outputs\u002Fmse_lcm\n\ntorchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation  \\\n  --predictor base_lcm --sample_latent_variable False \\\n  --show_progress true \\\n  --data_loading.max_samples 100 \\\n  --model_card ${model_card} \\\n  --launcher standalone \\\n  --dataset.source_suffix_text '[MODEL]:' \\\n  --tasks finetuning_data_lcm.validation \\\n   --task_args '{\"max_gen_len\": 10, \"eos_config\": {\"text\": \"End of text.\"}}' \\\n  --data_loading.batch_size 4  --generator_batch_size 4 \\\n  --dump_dir ${OUTPUT_DIR} \\\n```\n\n请注意，在此示例中，我们仅展示了如何在相同的微调数据集（验证集）上评估 LCM。若要在下游任务中进行评估，并与大语言模型的结果进行比较，请参阅[评估文档](.\u002Fexamples\u002Fevaluation\u002FREADME.md)。\n\n## 贡献\n有关如何参与贡献，请参阅[CONTRIBUTING](CONTRIBUTING.md)文件。\n\n## 引用\n如果您使用本代码库，请引用以下内容：\n```\n@article{lcm2024,\n  author = {{LCM 团队}, Lo\\\"{i}c Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. 
Costa-juss\\`{a}, David Dale, Hady Elsahar, Kevin Heffernan, Jo\\~{a}o Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk},\n  title = {{大型概念模型}：基于句子表示空间的语言建模},\n  publisher = {arXiv},\n  year = {2024},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.08821},\n}\n```\n\n## 许可证\n本代码以 MIT 许可证发布（详见[LICENSE](.\u002FLICENSE)）。","# Large Concept Model (LCM) 快速上手指南\n\nLarge Concept Models (LCM) 是 Meta AI 开源的一种新型语言模型架构。它不在传统的 token 空间操作，而是在显式的“概念”（Concept）空间中进行建模。在本实现中，一个概念对应一个句子，利用 [SONAR](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSONAR) 嵌入空间，支持多达 200 种语言的文本和 57 种语言的语音处理。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu 20.04+) 或 macOS\n- **Python**: 3.10 或更高版本\n- **GPU (可选)**: 如需加速训练或推理，需安装 CUDA 驱动。本仓库默认提供 CPU 依赖配置，GPU 用户需手动指定 PyTorch 和 fairseq2 的 CUDA 版本。\n- **磁盘空间**: 建议预留至少 50GB 空间用于存放模型权重、数据集及中间缓存。\n\n### 前置依赖\n- **包管理工具**: 推荐使用 [`uv`](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) (极速 Python 包管理器)，也可使用 `pip`。\n- **核心框架**: 项目强依赖于 `fairseq2` (当前版本需 `v0.3.0rc1`) 和 `torch`。\n\n> **注意**：国内开发者若遇到网络问题，建议在配置 pip\u002Fuv 源时使用清华或阿里镜像（见下文安装步骤中的注释）。\n\n---\n\n## 安装步骤\n\n### 方式一：使用 UV（推荐）\n\n`uv` 能自动管理虚拟环境并解析依赖，速度极快。\n\n1. **安装 uv** (如果尚未安装):\n   ```bash\n   curl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n   ```\n\n2. **克隆仓库并同步环境**:\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flarge_concept_model.git\n   cd large_concept_model\n   \n   # 安装 CPU 版本依赖（包含数据处理和评估工具）\n   # 国内用户可添加 --index-url 指定镜像源，例如：--index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n   uv sync --extra cpu --extra eval --extra data\n   ```\n\n3. 
**配置 GPU 支持 (可选)**:\n   如果需要 GPU 加速，需手动升级 `torch` 和 `fairseq2` 到对应的 CUDA 版本。以下以 CUDA 12.1 为例：\n   ```bash\n   # 替换为适合你系统的 CUDA 版本索引地址\n   uv pip install torch==2.5.1 --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121 --upgrade\n   uv pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https:\u002F\u002Ffair.pkg.atmeta.com\u002Ffairseq2\u002Fwhl\u002Frc\u002Fpt2.5.1\u002Fcu121 --upgrade\n   ```\n\n### 方式二：使用 Pip\n\n如果你习惯使用原生 `pip`，需自行管理虚拟环境。\n\n```bash\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\n\npip install --upgrade pip\n# 安装 fairseq2 (CPU 版本示例)\npip install fairseq2==v0.3.0rc1 --pre --extra-index-url https:\u002F\u002Ffair.pkg.atmeta.com\u002Ffairseq2\u002Fwhl\u002Frc\u002Fpt2.5.1\u002Fcpu\n# 安装 LCM 及其额外依赖\npip install -e \".[data,eval]\"\n```\n\n---\n\n## 基本使用\n\n以下流程展示如何准备数据、预训练模型以及进行简单的推理评估。所有命令若在 `uv` 环境下运行，请在命令前加 `uv run`。\n\n### 1. 准备数据\n\nLCM 需要将被分割为句子的文本数据转换为 SONAR 嵌入向量。以下脚本演示了如何从 HuggingFace 拉取维基百科数据并进行处理。\n\n```bash\n# 将处理后的数据保存到 \u002Foutput\u002Fdir\u002Ffor\u002Fthe\u002Fdata\nuv run --extra data scripts\u002Fprepare_wikipedia.py \u002Foutput\u002Fdir\u002Ffor\u002Fthe\u002Fdata\n```\n\n*提示：处理完成后，你需要更新 `lcm\u002Fdatacards\u002Fdatacards.yaml` 文件，将路径指向你新生成的数据目录。*\n\n### 2. 预训练模型 (Base MSE LCM)\n\n你可以选择使用 SLURM 集群或在本地单机多卡上进行训练。\n\n**本地单机训练示例 (使用 2 张 GPU):**\n为了快速验证，我们减小 batch size 并在本地运行：\n\n```bash\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train launcher=standalone \\\n    +pretrain=mse \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    ++trainer.output_dir=\"checkpoints\u002Fmse_lcm\" \\\n    +trainer.use_submitit=false\n```\n\n训练完成后，检查点将保存在 `checkpoints\u002Fmse_lcm` 目录下。\n\n### 3. 
微调模型 (Finetuning)\n\n在预训练基础上，可以使用监督数据进行微调。首先需要注册预训练模型卡片。\n\n**步骤 1: 注册模型**\n复制生成的 `model_card.yaml` 到 `.\u002Flcm\u002Fcards\u002Fmycards.yaml`，修改内容如下（假设使用 two-tower 模型的某个检查点）：\n\n```yaml\n__source__: inproc\ncheckpoint: file:\u002F\u002F\u003C绝对路径>\u002Flarge_concept_model\u002Fcheckpoints\u002Ftwo_tower_lcm\u002Fcheckpoints\u002Fstep_2000\u002Fmodel.pt\nmodel_arch: two_tower_diffusion_lcm_1_6B\nmodel_family: two_tower_diffusion_lcm\nname: my_pretrained_two_tower\n```\n\n**步骤 2: 启动微调**\n```bash\nCUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \\\n    -m lcm.train launcher=standalone \\\n    +finetune=two_tower \\\n    ++trainer.output_dir=\"checkpoints\u002Ffinetune_two_tower_lcm\" \\\n    ++trainer.data_loading_config.max_tokens=1000 \\\n    +trainer.use_submitit=false \\\n    ++trainer.model_config_or_name=my_pretrained_two_tower\n```\n\n### 4. 模型评估与推理\n\n在评估前，需下载 NLTK 数据以支持 ROUGE 指标计算：\n```bash\npython -m nltk.downloader punkt_tab\n```\n\n运行评估脚本生成结果并计算指标：\n```bash\nmodel_card=.\u002Fcheckpoints\u002Ffinetune_two_tower_lcm\u002Fcheckpoints\u002Fstep_1000\u002Fmodel_card.yaml\nOUTPUT_DIR=evaluation_outputs\u002Ftwo_tower\n\ntorchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \\\n  --predictor two_tower_diffusion_lcm \\\n  --show_progress true \\\n  --data_loading.max_samples 100 \\\n  --model_card ${model_card} \\\n  --launcher standalone \\\n  --dataset.source_suffix_text '[MODEL]:' \\\n  --tasks finetuning_data_lcm.validation \\\n  --task_args '{\"max_gen_len\": 10, \"eos_config\": {\"text\": \"End of text.\"}}' \\\n  --data_loading.batch_size 4 --generator_batch_size 4 \\\n  --dump_dir ${OUTPUT_DIR} \\\n  --inference_timesteps 40 \\\n  --initial_noise_scale 0.6 \\\n  --guidance_scale 3 \\\n  --guidance_rescale 0.7\n```\n\n评估结果（包括指标分数和生成文本）将输出至 `evaluation_outputs\u002Ftwo_tower` 目录。","某跨国电商团队正在构建一个支持全球 200 种语言的智能客服系统，需要实时理解并生成多语言回复以处理用户咨询。\n\n### 没有 large_concept_model 时\n- 
**多语言维护成本极高**：传统方案需为每种语言单独训练或微调模型，导致算力资源浪费且更新滞后。\n- **语义理解碎片化**：不同语言间的细微语义差异常被忽略，导致英语中的“紧急”与斯瓦希里语中的对应词在向量空间中距离过远，影响意图识别准确率。\n- **跨模态扩展困难**：若想增加语音客服功能，需重新搭建一套独立的语音识别与合成链路，无法复用现有的文本模型能力。\n- **长上下文逻辑断裂**：基于 Token 的传统建模在处理长句时容易丢失整体句意，导致生成的回复虽然语法正确但逻辑不通。\n\n### 使用 large_concept_model 后\n- **统一语义空间降本增效**：large_concept_model 将 200 种语言和 57 种语音映射到统一的 SONAR 概念空间，只需维护一个模型即可覆盖所有语种，大幅降低训练与部署成本。\n- **深层语义精准对齐**：通过在“句子表示空间”进行自回归预测，模型能捕捉跨语言的深层概念一致性，确保不同语言下的用户意图被同等精准地理解。\n- **天然支持跨模态交互**：得益于语言与模态无关的特性，团队可直接利用同一套架构无缝接入语音输入输出，无需重复造轮子。\n- **句子级逻辑更连贯**：直接对完整句子的概念进行建模，避免了 Token 级别的碎片化问题，生成的多语言回复在逻辑连贯性和信息密度上显著提升。\n\nlarge_concept_model 通过将语言建模从\"Token 预测”升级为“概念预测”，彻底打破了多语言与跨模态应用的壁垒，让全球化 AI 服务变得高效且统一。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_large_concept_model_886b4156.png","facebookresearch","Meta Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ffacebookresearch_449342bd.png","",null,"https:\u002F\u002Fopensource.fb.com","https:\u002F\u002Fgithub.com\u002Ffacebookresearch",[81],{"name":82,"color":83,"percentage":84},"Python","#3572A5",100,2351,206,"2026-04-16T15:54:59","MIT",4,"Linux","非必需（支持 CPU），若使用 GPU 需自行安装匹配的 PyTorch CUDA 版本（示例为 CUDA 12.1）。训练示例配置要求每节点 8 张 GPU，但未指定具体显存大小。","未说明（训练示例配置要求每任务 32 CPU 核心，内存设为 0 表示由调度器决定或无限制）",{"notes":94,"python":95,"dependencies":96},"1. 该项目强依赖 fairseq2 的发布候选版本 (v0.3.0rc1)，安装时需指定特定的索引源。\n2. 默认仅提供 CPU 版本的依赖安装命令，使用 GPU 需手动安装对应 CUDA 版本的 torch 和 fairseq2。\n3. 训练流程主要设计用于 SLURM 集群环境，本地运行需使用 torchrun 并调整批次大小。\n4. 
数据预处理需要 SONAR 嵌入空间支持，涉及多语言文本和语音处理。","未说明（需通过 uv 或 pip 管理环境）",[97,98,99,100,101,102,103,104],"fairseq2==v0.3.0rc1","torch>=2.5.1","SONAR","stopes","hydra-core","submitit","nltk","wandb (可选)",[35,14],[107,108,109,110,111],"language-models","nlp","pytorch","seq2seq","sequence-to-sequence","2026-03-27T02:49:30.150509","2026-04-18T14:25:45.450300",[115,120,125,130,134,138],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},39866,"运行评估脚本时出现 'Missing _source_text_column or article' 错误怎么办？","如果在准备数据步骤（prepare_data）中指定了 `prompt_prefix` 和 `prompt_suffix`，数据集会被转换，列名会重命名为 `prompt` 和 `answer`。在运行评估脚本时，必须通过 `--dataset.source_text_column` 和 `--dataset.target_text_column` 参数显式指定这些新列名，而不是使用原始列名。例如：\n\n```bash\n!uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \\\n  --predictor gemma \\\n  --model_name google\u002Fgemma-2-2b-it \\\n  --generator_batch_size 16 \\\n  --tasks cnn_dailymail_llm.test \\\n  --task_args '{\"max_gen_len\": 200}' \\\n  --dataset_dir \u002Fcontent\u002Fjsonl_dataset\u002Fcnn_dailymail\u002Fcnn_dailymail \\\n  --dataset.source_text_column prompt \\\n  --dataset.target_text_column answer \\\n  --dump_dir \u002Fcontent\u002Foutput_results_llm\n```","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flarge_concept_model\u002Fissues\u002F17",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},39867,"如何处理 `_tokenize_batch` 方法中因 `pyarrow.ListArray` 不支持 `.to()` 方法而导致的类型错误？","该错误是因为代码试图对 `pyarrow.ListArray` 类型直接调用 `.to()` 方法。建议的解决方案有两种：\n1. 在调用 `.to()` 之前，显式将数据转换为 `torch.Tensor`。例如：`embs = [torch.Tensor(x.as_py()).to(self.gang.device).to(dtype) for x in batch[col_name]]`。\n2. 
推荐直接使用 `pyarrow` 或 `polars` 库，利用它们的 `.cast` 方法在数据处理阶段将数据集转换为正确的类型，避免在 DataLoader 中出现类型不匹配。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flarge_concept_model\u002Fissues\u002F12",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},39868,"运行训练或标准化脚本时提示 'Partition filters (split == \"train\") is set but dataset has NO partition columns' 错误如何解决？","此错误表明代码试图根据 `split` 列（如 train\u002Fvalidation\u002Ftest）过滤数据，但你的 Parquet 数据集中缺少该分区列。解决方法是检查你的数据集 schema，确保包含名为 `split` 的列，或者在配置文件（yaml）中移除相关的分区过滤设置。如果使用的是自定义数据集，需要在生成 Parquet 文件时添加 `split` 列，或者修改加载逻辑以适配没有分区列的数据结构。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flarge_concept_model\u002Fissues\u002F9",{"id":131,"question_zh":132,"answer_zh":133,"source_url":129},39869,"在使用 `fit_embedding_normalizer.py` 脚本时遇到数据架构或路径配置错误怎么办？","确保你的 YAML 配置文件正确指向了包含 Parquet 文件的路径，并且 `source_column` 名称与 Parquet 文件中的实际列名完全一致。例如，如果 Parquet 文件中列名为 `text_sentences_sonar_emb`，YAML 中也必须如此书写。此外，检查 Python 版本兼容性（建议使用 Python 3.11），并确保 `parquet_path` 配置正确（本地路径或 S3 路径）。如果数据结构是嵌套列表（如 `list_(list_(float32))`），请确认脚本是否支持该格式或需先展平数据。",{"id":135,"question_zh":136,"answer_zh":137,"source_url":119},39870,"如何正确安装与 PyTorch 2.5.1 兼容的 fairseq2 依赖？","对于 PyTorch 2.5.1 和 CUDA 12.1 环境，需要使用特定的预发布版本和索引 URL 进行安装。命令如下：\n\n```bash\npip install fairseq2==v0.3.0rc1 --pre --extra-index-url https:\u002F\u002Ffair.pkg.atmeta.com\u002Ffairseq2\u002Fwhl\u002Frc\u002Fpt2.5.1\u002Fcu121\n```\n\n安装完 fairseq2 后，再安装项目其他依赖：\n```bash\npip install -e \".[data,eval]\"\n```",{"id":139,"question_zh":140,"answer_zh":141,"source_url":119},39871,"评估过程中出现 Tokenizer 类型不匹配的警告（GemmaTokenizer vs PreTrainedTokenizerFast）会影响结果吗？","这是一个常见的 Hugging Face Transformers 库的警告，通常表示加载的检查点分词器类与当前调用的类不完全一致（例如一个是慢速分词器，一个是快速分词器）。在大多数情况下，这不会导致严重的功能错误，分词结果通常是可用的。如果未出现后续的运行时错误（如形状不匹配或崩溃），可以忽略此警告。若需消除警告，可尝试强制指定分词器类或使用 `use_fast=False` 参数加载分词器，但这通常不是必须的。",[]]