[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-fixie-ai--ultravox":3,"tool-fixie-ai--ultravox":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":104,"forks":105,"last_commit_at":106,"license":107,"difficulty_score":10,"env_os":108,"env_gpu":109,"env_ram":110,"env_deps":111,"category_tags":122,"github_topics":123,"view_count":128,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":129,"updated_at":130,"faqs":131,"releases":162},768,"fixie-ai\u002Fultravox","ultravox","A fast multimodal LLM for real-time voice","Ultravox 是一款专为实时语音交互设计的高速多模态大语言模型。它打破了传统语音系统需先经过独立语音识别（ASR）再送入大模型的流程，而是直接将音频转化为大模型可理解的高维向量。这一创新架构不仅显著降低了响应延迟，还让模型能够原生捕捉人类语音中的时间节奏与情感色彩，实现更自然的对话体验。当前 Ultravox 支持音频流输入并输出文本流，基于 Llama 3、Mistral 等开源基座模型训练，提供 8B 至 70B 多种规格版本。未来它将进一步支持直接生成语音令牌，实现真正的语音对语音交互。对于致力于开发实时语音 AI 应用的开发者、探索多模态技术的科研人员，以及追求低延迟交互体验的产品团队而言，Ultravox 都是一个灵活且高效的起点。无论是本地部署还是构建自定义语音智能体，都能从中获得强大支持。","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Cimg alt=\"Ultravox\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffixie-ai_ultravox_readme_01f12dae8236.png\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch3 align=\"center\">\nA fast multimodal LLM designed for real-time voice interactions\n\u003C\u002Fh3>\n\n_Latest News_\n* 2025\u002F12 - [Ultravox 0.7](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Ffixie-ai\u002Fultravox-v07) available\n* 2025\u002F06 — [Ultravox 0.6](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Ffixie-ai\u002Fultravox-v06-6865e7885cf904e21c31da03) available\n* 2025\u002F02 — [Ultravox 0.5](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.5) available\n* 2024\u002F11 — [Ultravox 0.4.1](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.4.1) available\n* 2024\u002F08 — [Ultravox 0.4](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.4) available\n* 2024\u002F08 — [Ultravox 0.3](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.3) available\n* 2024\u002F08 — Preview of Ultravox APIs available, more information [here](https:\u002F\u002Ffixie-ai.github.io\u002Fultradox\u002F)\n\n_Key Links_\n* [Ultravox Realtime](https:\u002F\u002Fultravox.ai) — Build real-time Voice AI agents on top of the Ultravox model\n* [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai) — Our 
Hugging Face page\n\n---\n\n# About\n\nUltravox is a new kind of multimodal LLM that can understand text as well as human speech, without the need for a separate Automatic Speech Recognition (ASR) stage. Building on research like [AudioLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.03143), [SeamlessM4T](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fseamless-m4t\u002F), [Gazelle](https:\u002F\u002Ftincans.ai\u002Fslm), [SpeechGPT](https:\u002F\u002Fgithub.com\u002F0nutation\u002FSpeechGPT\u002Ftree\u002Fmain\u002Fspeechgpt), and others, Ultravox is able to extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. We've trained versions on Llama 3, Mistral, and Gemma. This direct coupling allows Ultravox to respond much more quickly than systems that combine separate ASR and LLM components. In the future this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.\n\nUltravox currently takes in audio and emits streaming text. As we evolve the model, we'll train it to be able to emit a stream of speech tokens that can then be converted directly into raw audio by an appropriate unit vocoder.\n\nOur default model is built on top of Llama 3.3 70B. We also have an 8B variant available on Hugging Face.\n\nUltravox can be trained against any open-weight model. See below for more details on training.\n\n### Demo\n\nSee Ultravox in action on our [demo page](https:\u002F\u002Fdemo.ultravox.ai). You can build your own voice-to-voice agents on our Realtime platform at ultravox.ai.\n\n### Discord\n\nJoin us on our Discord server [here](https:\u002F\u002Fdiscord.gg\u002FQw6KHxv8YB).\n\n### Jobs\n\nIf you're interested in working on Ultravox full-time, we're hiring! Check out our jobs page [here](https:\u002F\u002Fcareers.fixie.ai).\n\n### Inference Server\n\nYou can try out Ultravox using your own audio content (as a WAV file) by spinning up an Ultravox instance on our partner, BaseTen: [https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Fultravox\u002F](https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Fultravox\u002F). They offer free credits to get started.\n\nIf you're interested in running Ultravox in a real-time capacity, we offer a set of managed APIs as well. You can learn more about getting access to those [here](https:\u002F\u002Fdocs.ultravox.ai).\n\n### Model\n\nYou can download the latest weights from the [Ultravox Hugging Face page](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai\u002F).\n\n
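Once you have the weights, a quick way to smoke-test them is the `transformers` pipeline interface shown on the model cards. A minimal sketch (the model id, audio file, and `librosa` dependency here are illustrative assumptions, not pinned by this repo):\n\n```python\nimport transformers\nimport librosa\n\n# The checkpoint ships its own pipeline code, so trust_remote_code is required.\n# The model id is a placeholder: pick any Ultravox checkpoint from the Hugging Face page.\npipe = transformers.pipeline(model='fixie-ai\u002Fultravox-v0_5-llama-3_1-8b', trust_remote_code=True)\n\n# Decode a local clip to a 16 kHz mono waveform (placeholder path).\naudio, sr = librosa.load('my_clip.wav', sr=16000)\n\nturns = [{'role': 'system', 'content': 'You are a friendly and helpful assistant.'}]\nprint(pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30))\n```\n\n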
### Architecture\n\n[![architecture diagram](https:\u002F\u002Fraw.githubusercontent.com\u002Ffixie-ai\u002Fultravox\u002Fmain\u002Fdocs\u002Fassets\u002FUltravox%20Model%20Architecture.svg)](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1ey81xuuMzrJaBwztb_Rq24Cit37GQokD2aAes_KkGVI\u002Fedit)\n\n# Contributing\n\nRead on if you're interested in training your own version of Ultravox.\n\n## Environment Setup (Mac)\n\nInstall the basic tools:\n\n- [`Homebrew`](https:\u002F\u002Fbrew.sh) is a package manager for MacOS that also mostly works for Linux. If you're running Debian or Ubuntu Linux, you can alternatively get by with apt.\n- [`Just`](https:\u002F\u002Fjust.systems\u002Fman\u002Fen\u002F) simplifies our shell workflows. It frequently functions as our interface to all the other tools.\n\n```bash\n\u002Fbin\u002Fbash -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FHomebrew\u002Finstall\u002FHEAD\u002Finstall.sh)\"\nbrew update\nbrew install just\n```\n\nBecause the project uses Poetry, it's recommended to use pyenv for managing Python environments:\n\n```bash\nbrew install xz\nbrew install pyenv\npyenv init\npyenv install 3.11\npyenv global 3.11\n\n# Optional\npyenv shell 3.11\n```\n\n>**Note**: Use of conda is NOT recommended with Poetry\n\nAfter creating a virtual environment, install required packages using `just` and `poetry`:\n\n```bash\njust install\n```\n\nIf you plan to use augmentations (optional), you may also want to install the system packages necessary for augmentations. You can do that with `just install-augs-system`. Read more about augmentations [here](ultravox\u002Fdata\u002Faug\u002FAUGMENTATIONS.md).\n\nWe're using Poetry to manage the Python virtual environment. You can inspect your environment with `poetry env info`.\n\n## Training\n\nCurrently, we keep both the LLM and the audio encoder frozen and only train the adapter\u002Fprojector. Training Ultravox v0.4 took 2-3 hours on 8xH100 GPUs for 14K training steps.\n\n### Use-Cases for Training Ultravox\n\nWhy would you want to (re-)train Ultravox? Here are a few scenarios:\n\n1. You want to use a different LLM or audio encoder backbone.\n\n   a. In this case you need to re-train the adapter. You can use `example_config.yaml`, which contains the config for our latest release, and you should be able to simply change the base LLM or encoder by specifying `--text-model \u003Chf-model-id-for-llm>` and\u002For `--audio-model \u003Chf-model-id-for-encoder>`.\n\n2. You want to improve the knowledge of the model.\n\n    a. We suggest either using RAG on the fly (no training needed) or fine-tuning the LLM backbone instead. Fine-tuning the LLM backbone does not require re-training Ultravox (i.e., the existing adapter will still work).\n\n3. You want to use your own audio data, for example to add support for a new language.\n\n   a. First step, prepare your dataset: at bare minimum, the samples should have an `audio` and a text `continuation` field (see the sketch after this list).\n\n   b. Take a look at [`ds_tool.py`](ultravox\u002Ftools\u002Fds_tool\u002Fds_tool.py) and [`continuation.jinja`](ultravox\u002Ftools\u002Fds_tool\u002Fcontinuation.jinja) as well as [our variant of Common Voice](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ffixie-ai\u002Fcommon_voice_17_0\u002Fviewer\u002Ffr) that was created using `ds_tool` to add the `continuation` field.\n\n   c. Add your dataset to the dataset mix in `example_config.yaml` and train.\n\n
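A minimal sketch of that bare-minimum dataset shape, using the Hugging Face `datasets` library (already among the project's dependencies); the file path and continuation text below are placeholders:\n\n```python\nfrom datasets import Audio, Dataset\n\n# Two required columns: the recording and the text the model should produce after it.\nds = Dataset.from_dict({\n    'audio': ['clip_0001.wav'],  # placeholder; point at real files before reading rows\n    'continuation': ['and that wraps up the forecast for tomorrow.'],\n})\n\n# Declare the column as 16 kHz audio (the rate Whisper-style encoders expect);\n# files are decoded lazily when a row is accessed.\nds = ds.cast_column('audio', Audio(sampling_rate=16000))\nprint(ds.features)\n```\n\n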
There's no one-size-fits-all recipe. If you need help you can find us on our Discord server [here](https:\u002F\u002Fdiscord.gg\u002FQw6KHxv8YB).\n\n### How to Train\n\nWe do most of our training on the [MosaicML platform](https:\u002F\u002Fdocs.mosaicml.com) and therefore most of our tooling and docs are Mosaic-related. The MosaicML platform is being shut down at the end of July 2025, but we've left the Mosaic configs here for reference. You can do the same training on your own GPU without much difficulty. Here we assume you have the environment set up (run `just install`). You can also take a look at [`setup.sh`](setup.sh).\n\nTo kick off a training run you can do:\n\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\n```\n\nFor DDP training, make sure to use `torchrun`. We also recommend prefetching weights in advance:\n\n```bash\nTRAIN_ARGS=\"--config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\"\npoetry run python -m ultravox.training.helpers.prefetch_weights $TRAIN_ARGS\npoetry run torchrun --nproc_per_node=8 -m ultravox.training.train $TRAIN_ARGS\n```\n\nFor a debug run, you can use smaller models, datasets, or batch sizes. Here's a config that uses TinyLlama as the LLM backbone:\n\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fasr_tinyllama_100s.yaml --batch_size 1 --report_logs_to tensorboard\n```\n\nWe use [SimpleParsing](https:\u002F\u002Fgithub.com\u002Flebrice\u002Fsimpleparsing\u002F) for configs. Configs are composable (i.e., you can specify zero or many configs), and `meta_config.yaml` is always used as the default.\nSee [`config_base.py`](ultravox\u002Ftraining\u002Fconfig_base.py) to find the parameters you can modify, such as `--text-model`, `--device`, `--exp-name`, etc.\n\n#### Multi-node training\n\n- For multi-node training, all you need to do is update the `compute.gpus` line in `mcli_train.yaml` to get more GPUs for training\n  - All multiples of 8 are supported\n- For more than 4 nodes, you might need to increase `val_dataset_args.max_samples`\n\n**NOTE:** W&B doesn't currently support multiple nodes. You'll only get info from the main node. It's possible to support this with grouped runs, so let us know if this is important for you.\n\n### Running evaluations\n\nFor inference or evaluations, you can use:\n\n```bash\njust eval --config_path ultravox\u002Fevaluation\u002Fconfigs\u002Feval_config.yaml\n```\n\nwhere `eval_config.yaml` is a config file that specifies the model, datasets, and configurations to use for inference or evaluation. If your dataset is not already defined in ultravox, you need to create a config file for your dataset in `ultravox\u002Fdata\u002Fconfigs\u002F` (with the appropriate `eval_config` field to specify evaluation metrics and arguments), and register it in `ultravox\u002Fdata\u002Fregistry.py`. Please refer to the examples in `ultravox\u002Fdata\u002Fconfigs\u002F`.\n\n## Misc\n\nThe [Justfile](Justfile) is a good resource for finding popular commands. 
Here are a few:\n\n```bash\njust update    # update dependencies\njust format    # run formatting (black, isort, autoflake)\njust test      # run tests\njust python    # activate venv and run python\n```\n","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Cimg alt=\"Ultravox\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffixie-ai_ultravox_readme_01f12dae8236.png\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch3 align=\"center\">\n一种专为实时语音交互设计的快速多模态大语言模型\n\u003C\u002Fh3>\n\n_最新动态_\n* 2025\u002F12 - [Ultravox 0.7](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Ffixie-ai\u002Fultravox-v07) 已发布\n* 2025\u002F06 — [Ultravox 0.6](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Ffixie-ai\u002Fultravox-v06-6865e7885cf904e21c31da03) 已发布\n* 2025\u002F02 — [Ultravox 0.5](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.5) 已发布\n* 2024\u002F11 — [Ultravox 0.4.1](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.4.1) 已发布\n* 2024\u002F08 — [Ultravox 0.4](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.4) 已发布\n* 2024\u002F08 — [Ultravox 0.3](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.3) 已发布\n* 2024\u002F08 — Ultravox API 预览版已可用，更多信息请查看 [此处](https:\u002F\u002Ffixie-ai.github.io\u002Fultradox\u002F)\n\n_关键链接_\n* [Ultravox Realtime](https:\u002F\u002Fultravox.ai) — 在 Ultravox 模型之上构建实时语音 AI 代理\n* [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai) — 我们的 Hugging Face 页面\n\n---\n\n# 关于\n\nUltravox 是一种新型的多模态大语言模型（**LLM**），能够理解文本以及人类语音，无需单独的音频语音识别（**ASR**）阶段。基于 [AudioLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.03143)、[SeamlessM4T](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fseamless-m4t\u002F)、[Gazelle](https:\u002F\u002Ftincans.ai\u002Fslm)、[SpeechGPT](https:\u002F\u002Fgithub.com\u002F0nutation\u002FSpeechGPT\u002Ftree\u002Fmain\u002Fspeechgpt) 等研究，Ultravox 能够通过多模态投影器将音频直接转换为 LLM 使用的高维空间，从而扩展任何开放权重大语言模型（**open-weight LLM**）。我们在 Llama 3、Mistral 和 Gemma 上训练了版本。这种直接耦合使得 Ultravox 的响应速度比结合独立 ASR 和 LLM 组件的系统快得多。未来，这将使 Ultravox 能够原生理解人类语音中普遍存在的节奏和情感等副语言线索。\n\nUltravox 目前接收音频并输出流式文本。随着模型的演进，我们将训练它能够输出一串语音令牌，这些令牌随后可以由适当的单元声码器直接转换为原始音频。\n\n我们的默认模型基于 Llama 3.3 70B 构建。我们也在 Hugging Face 上提供了 8B 变体。\n\nUltravox 可以针对任何开放权重模型进行训练。有关训练的更多详细信息，请参见下文。\n\n### 演示\n\n在我们的 [演示页面](https:\u002F\u002Fdemo.ultravox.ai) 上查看 Ultravox 的实际应用。您可以在 ultravox.ai 上的实时平台构建自己的语音到语音代理。\n\n### Discord\n\n加入我们的 Discord 服务器 [这里](https:\u002F\u002Fdiscord.gg\u002FQw6KHxv8YB)。\n\n### 招聘\n\n如果您有兴趣全职从事 Ultravox 相关工作，我们正在招聘！请查看我们的职位页面 [这里](https:\u002F\u002Fcareers.fixie.ai)。\n\n### 推理服务器\n\n您可以通过在我们合作伙伴 BaseTen 上启动 Ultravox 实例来试用 Ultravox（使用您自己的音频内容，如 WAV 文件）：[https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Fultravox\u002F](https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Fultravox\u002F)。他们提供免费的初始积分。\n\n如果您有兴趣以实时容量运行 Ultravox，我们也提供一组托管 API。您可以在此处了解如何获取访问权限 [这里](https:\u002F\u002Fdocs.ultravox.ai)。\n\n### 模型\n\n您可以从 [Ultravox Hugging Face 页面](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai\u002F) 下载最新的权重。\n\n### 架构\n\n[![架构图](https:\u002F\u002Fraw.githubusercontent.com\u002Ffixie-ai\u002Fultravox\u002Fmain\u002Fdocs\u002Fassets\u002FUltravox%20Model%20Architecture.svg)](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1ey81xuuMzrJaBwztb_Rq24Cit37GQokD2aAes_KkGVI\u002Fedit)\n\n# 贡献\n\n如果您对训练自己版本的 Ultravox 感兴趣，请继续阅读。\n\n## 环境设置（Mac）\n\n安装基本工具：\n\n- [`Homebrew`](https:\u002F\u002Fbrew.sh) 是 MacOS 的软件包管理器，在 Linux 上也基本适用。如果您运行的是 
Debian or Ubuntu Linux, you can alternatively get by with apt.\n- [`Just`](https:\u002F\u002Fjust.systems\u002Fman\u002Fen\u002F) 简化了我们的 shell 工作流。它通常是我们与其他所有工具交互的接口。\n\n```bash\n\u002Fbin\u002Fbash -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FHomebrew\u002Finstall\u002FHEAD\u002Finstall.sh)\"\nbrew update\nbrew install just\n```\n\n由于项目使用了 Poetry，建议使用 pyenv 来管理 Python 环境：\n\n```bash\nbrew install xz\nbrew install pyenv\npyenv init\npyenv install 3.11\npyenv global 3.11\n\n# 可选\npyenv shell 3.11\n```\n\n>**注意**: 不建议将 conda 与 Poetry 一起使用\n\n创建虚拟环境后，使用 `just` 和 `poetry` 安装所需的软件包：\n\n```bash\njust install\n```\n\n如果您计划使用数据增强（可选），您可能还需要安装用于增强的系统软件包。您可以使用 `just install-augs-system` 完成此操作。有关增强的更多信息请查看 [此处](ultravox\u002Fdata\u002Faug\u002FAUGMENTATIONS.md)。\n\n我们使用 Poetry 来管理 Python 虚拟环境。您可以使用 `poetry env info` 查看您的环境。\n\n## 训练\n\n目前，我们保持 LLM 和音频编码器冻结，仅训练适配器\u002F投影器。训练 Ultravox v0.4 在 8xH100 GPU 上耗时 2-3 小时，共 14K 个训练步骤。\n\n### Ultravox 训练用例\n\n为什么您要（重新）训练 Ultravox？以下是几种场景：\n\n1. 您想使用不同的 LLM 或音频编码器骨干网络。\n\n   a. 在这种情况下，您需要重新训练适配器。您可以使用 `example_config.yaml`，其中包含我们最新发布的配置，您应该只需通过指定 `--text-model \u003Chf-model-id-for-llm>` 和\u002F或 `--audio-model \u003Chf-model-id-for-encoder>` 即可更改基础 LLM 或编码器。\n\n2. 您想提升模型的知识。\n\n    a. 我们建议要么即时使用检索增强生成（**RAG**）（无需训练），要么微调 LLM 骨干网络。微调 LLM 骨干网络不需要重新训练 Ultravox（即现有的适配器仍然有效）。\n\n3. 您想使用自己的音频数据，例如添加对新语言的支持。\n\n   a. 第一步，准备您的数据集：至少，样本应包含 `audio` 和文本 `continuation` 字段。\n\n   b. 查看 [`ds_tool.py`](ultravox\u002Ftools\u002Fds_tool\u002Fds_tool.py) 和 [`continuation.jinja`](ultravox\u002Ftools\u002Fds_tool\u002Fcontinuation.jinja)，以及使用 `ds_tool` 创建的 [我们的 Common Voice 变体](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ffixie-ai\u002Fcommon_voice_17_0\u002Fviewer\u002Ffr) 以添加 `continuation` 字段。\n\n   c. 将您的数据集添加到 `example_config.yaml` 的数据集组合中并进行训练。\n\n没有万能的方法。如果您需要帮助，可以在我们的 Discord 服务器上找到我们 [这里](https:\u002F\u002Fdiscord.gg\u002FQw6KHxv8YB)。\n\n### 如何训练\n\n我们主要在 [MosaicML 平台](https:\u002F\u002Fdocs.mosaicml.com) 上进行大部分训练，因此我们的工具和文档大多与 Mosaic 相关。MosaicML 平台将于 2025 年 7 月底关闭，但我们仍保留了相关配置以供参考。您可以在自己的 GPU 上轻松完成相同的训练。此处假设您已设置好环境（运行 `just install`）。您也可以查看 [`setup.sh`](setup.sh)。\n\n要启动训练任务，您可以执行：\n\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\n```\n\n对于 DDP（分布式数据并行）训练，请确保使用 `torchrun`。我们还建议提前预取权重：\n\n```bash\nTRAIN_ARGS=\"--config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\"\npoetry run python -m ultravox.training.helpers.prefetch_weights $TRAIN_ARGS\npoetry run torchrun --nproc_per_node=8 -m ultravox.training.train $TRAIN_ARGS\n```\n\n对于调试运行，您可以使用较小的模型、数据集或批次大小。这里有一个使用 TinyLlama 作为大语言模型（LLM）骨干的配置：\n\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fasr_tinyllama_100s.yaml --batch_size 1 --report_logs_to tensorboard\n```\n\n我们使用 [SimpleParsing](https:\u002F\u002Fgithub.com\u002Flebrice\u002Fsimpleparsing\u002F) 来处理配置。配置文件是可组合的（即您可以指定零个或多个配置），并且 `meta_config.yaml` 始终作为默认配置使用。请参见 [`config_base.py`](ultravox\u002Ftraining\u002Fconfig_base.py) 以查找您可以修改的参数，例如 `--text-model`、`--device`、`--exp-name` 等。\n\n#### 多节点训练\n\n- 对于多节点训练，您只需更新 `mcli_train.yaml` 中的 `compute.gpus` 行即可获取更多 GPU 进行训练\n  - 支持 8 的整数倍\n- 对于超过 4 个节点的情况，您可能需要增加 `val_dataset_args.max_samples`\n\n**注意：** W&B（Weights & Biases）目前不支持多个节点。您只能从主节点获取信息。通过分组运行是有可能支持的，如果您认为这很重要，请告诉我们。\n\n### 运行评估\n\n对于推理或评估，您可以使用：\n\n```bash\njust eval --config_path ultravox\u002Fevaluation\u002Fconfigs\u002Feval_config.yaml\n```\n\n其中 `eval_config.yaml` 是一个配置文件，指定了用于推理或评估的模型、数据集和配置。如果您的数据集尚未在 ultravox 中定义，则需要在 `ultravox\u002Fdata\u002Fconfigs\u002F` 中为您的数据集创建一个配置文件（包含适当的 `eval_config` 字段以指定评估指标和参数），并在 `ultravox\u002Fdata\u002Fregistry.py` 中注册它。请参考 `ultravox\u002Fdata\u002Fconfigs\u002F` 中的示例。\n\n## 其他\n\n[Justfile](Justfile) 是查找常用命令的好资源。以下是一些示例：\n\n```bash\njust update    # update dependencies\njust format    # run formatting (black, isort, autoflake)\njust test      # run tests\njust python    # activate venv and run python\n```","# Ultravox 快速上手指南\n\nUltravox 是一款专为实时语音交互设计的多模态大语言模型（Multimodal LLM）。它无需独立的语音识别（ASR）阶段，可直接理解人类语音并输出文本。默认模型基于 Llama 3.3 70B 构建，同时也提供 8B 轻量版。\n\n## 1. 环境准备\n\n本工具主要支持 MacOS 和 Linux 环境。由于涉及大量模型权重下载，建议提前配置国内加速源。\n\n### 前置依赖\n- **包管理器**: [Homebrew](https:\u002F\u002Fbrew.sh) (MacOS) 或 apt (Linux)\n- **任务管理**: [`Just`](https:\u002F\u002Fjust.systems\u002Fman\u002Fen\u002F)\n- **Python 环境**: 推荐使用 [`pyenv`](https:\u002F\u002Fgithub.com\u002Fpyenv\u002Fpyenv) 管理，需 Python 3.11\n- **虚拟环境**: 使用 [`Poetry`](https:\u002F\u002Fpython-poetry.org\u002F) 管理依赖\n\n> **注意**: 请勿将 Conda 与 Poetry 混用。\n\n### 网络优化（中国开发者推荐）\n在运行以下命令前，建议设置环境变量以加速 Hugging Face 模型下载：\n```bash\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n```\n\n## 2. 安装步骤\n\n### 第一步：安装基础工具\n```bash\n\u002Fbin\u002Fbash -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FHomebrew\u002Finstall\u002FHEAD\u002Finstall.sh)\"\nbrew update\nbrew install just\n```\n\n### 第二步：配置 Python 环境\n```bash\nbrew install xz\nbrew install pyenv\npyenv init\npyenv install 3.11\npyenv global 3.11\n\n# 可选：为当前 shell 会话指定版本\npyenv shell 3.11\n```\n\n### 第三步：安装项目依赖\n进入项目目录后，执行以下命令自动安装所需包：\n```bash\njust install\n```\n\n如需使用数据增强功能（可选），可额外安装系统级依赖：\n```bash\njust install-augs-system\n```\n\n## 3. 基本使用\n\n### 模型评估\n安装完成后，可使用预定义配置对模型进行评估测试：\n```bash\njust eval --config_path ultravox\u002Fevaluation\u002Fconfigs\u002Feval_config.yaml\n```\n若需自定义数据集，需在 `ultravox\u002Fdata\u002Fconfigs\u002F` 中创建配置文件并在 `ultravox\u002Fdata\u002Fregistry.py` 中注册。\n\n### 训练微调\nUltravox 目前采用冻结 LLM 和音频编码器、仅训练适配器（Adapter\u002FProjector）的方式。\n\n**单节点训练示例：**\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\n```\n\n**分布式训练 (DDP) 示例：**\n```bash\nTRAIN_ARGS=\"--config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\"\npoetry run python -m ultravox.training.helpers.prefetch_weights $TRAIN_ARGS\npoetry run torchrun --nproc_per_node=8 -m ultravox.training.train $TRAIN_ARGS\n```\n\n**调试模式（使用小模型）：**\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fasr_tinyllama_100s.yaml --batch_size 1 --report_logs_to tensorboard\n```\n\n### 其他使用方式\n- **托管推理**: 可通过 BaseTen 平台启动实例试用 ([链接](https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Fultravox\u002F))。\n- **API 服务**: 如需生产级实时语音能力，可访问官方文档申请托管 API ([链接](https:\u002F\u002Fdocs.ultravox.ai))。\n- **在线 Demo**: 查看实时效果请访问 [demo.ultravox.ai](https:\u002F\u002Fdemo.ultravox.ai)。","某智能汽车团队正在开发车载语音诊断系统，需要让驾驶员在行驶中通过自然对话快速查询车辆故障信息，同时确保操作零干扰。\n\n### 没有 ultravox 时\n- 传统方案需先调用独立 ASR 服务转写语音，再送入大模型，端到端延迟常超过 2 秒，严重影响驾驶时的决策效率\n- 集成多个 API 导致架构复杂，维护成本高且容易出现单点故障，一旦 ASR 服务波动会导致整个功能不可用\n- 语音转文字过程丢失语调情绪，无法识别用户焦急或困惑的语气，导致回复机械生硬缺乏同理心\n- 多次网络请求增加流量消耗，且在弱网环境下难以部署运行，无法满足车内封闭环境的稳定性要求\n\n### 使用 ultravox 后\n- 音频直接输入模型生成文本，响应速度提升至毫秒级，实现真正实时对话，用户几乎感觉不到等待时间\n- 移除独立 ASR 模块简化了技术栈，降低了服务器负载与运维复杂度，大幅减少了系统集成的工作量\n- 原生支持语音理解，能捕捉停顿和语气变化，提供更人性化的交互体验，让机器更懂人类说话的节奏\n- 支持本地化部署，减少云端依赖，保障数据隐私并降低长期运营成本，特别适合对延迟敏感的车载场景\n\nultravox 
通过端到端的音频处理能力，将复杂的语音交互链路简化为单一模型推理，显著提升了实时语音应用的流畅度与安全性。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffixie-ai_ultravox_949036a2.png","fixie-ai","Fixie.ai","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ffixie-ai_dd230893.png","Fixie is the platform for building and managing LLM powered applications",null,"hello@fixie.ai","fixieai","https:\u002F\u002Fwww.fixie.ai","https:\u002F\u002Fgithub.com\u002Ffixie-ai",[85,89,93,97,100],{"name":86,"color":87,"percentage":88},"Python","#3572A5",98.8,{"name":90,"color":91,"percentage":92},"Just","#384d54",0.5,{"name":94,"color":95,"percentage":96},"Shell","#89e051",0.4,{"name":98,"color":91,"percentage":99},"Dockerfile",0.2,{"name":101,"color":102,"percentage":103},"Jinja","#a52a22",0.1,4391,370,"2026-04-05T08:12:06","MIT","Linux, macOS","需要 NVIDIA GPU，训练示例使用 8xH100，具体显存及 CUDA 版本未明确说明","未说明",{"notes":112,"python":113,"dependencies":114},"建议使用 Poetry 管理环境，不建议使用 Conda；默认模型为 Llama 3.3 70B，对算力要求高；训练示例在 8xH100 上耗时 2-3 小时；MosaicML 平台将于 2025 年 7 月关闭；支持自定义数据集训练。","3.11",[115,116,117,118,119,120,121],"torch","transformers","datasets","simple-parsing","wandb","pyyaml","poetry",[14,55,13,15,26],[124,125,126,127],"ai","llm","slm","speech",4,"2026-03-27T02:49:30.150509","2026-04-06T05:37:36.077038",[132,137,142,147,152,157],{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},3298,"Ultravox 0.4.1 与 vLLM 集成时报错如何处理？","这是由于 vLLM 中硬编码了 Llama3 的 Token ID 导致的。需要手动更新 vLLM 源码中的配置。请修改 `vllm\u002Fmodel_executor\u002Fmodels\u002Fultravox.py` 文件中的以下行：\n\n```python\n_AUDIO_PLACEHOLDER_OVERRIDE = \"\u003C|reserved_special_token_0|>\"\n_AUDIO_PLACEHOLDER_TOKEN = 128002\n```\n\n注意：官方曾提交过修复 PR 但因其他问题搁置，建议根据上述参数自行调整。","https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fissues\u002F272",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},3299,"如何在 Mac (Apple Silicon) 设备上运行 Ultravox？","推理功能支持 Apple Silicon（使用 cpu 或 mps），但速度显著慢于 Nvidia GPU。运行 Gradio 演示可直接执行命令 `just gradio`（默认使用 mps）。\n\n为降低内存占用，可设置 `--data_type=float16`，此时 RAM 使用约为 20GB。训练时使用 `--device=mps` 仍需进一步改进。","https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fissues\u002F88",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},3300,"复现实验时训练 Loss 下降后又急剧上升怎么办？","这通常与 Batch Size 过小或梯度累积不足有关。建议增大梯度累积步数 `grad_accum_steps`，例如设置为 8（默认为 1），以帮助稳定训练过程并防止 Loss 反弹。","https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fissues\u002F170",{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},3301,"Ultravox 推理时返回乱码或无效字符是什么原因？","这可能与硬件环境有关。有用户反馈在 CPU 上运行 PyTorch 时会出现大量感叹号等乱码，切换到 GPU 后恢复正常。建议优先使用 GPU 进行推理以确保稳定性。","https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fissues\u002F277",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},3302,"Ultravox 是否支持文本转语音（TTS）或情感语音生成？","目前该功能正在开发中。维护者表示如果一切按计划进行，预计今年夏天会有初步版本发布。","https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fissues\u002F300",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},3303,"Ultravox 是否支持 Llama 3.2 模型？","支持情况请参考相关 Pull Request #127。社区也在讨论未来是否可以直接发送音频到模型而不依赖 Whisper 以提升性能。","https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fissues\u002F161",[163,168,173,178,183],{"id":164,"version":165,"summary_zh":166,"released_at":167},102859,"v0.6","We're releasing **Ultravox v0.6** today. The [weights](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai) have been pushed to Hugging Face. 
If you're using the [Ultravox Realtime APIs](https:\u002F\u002Fdocs.ultravox.ai), v0.6 is the new default.\r\n\r\n## What's New\r\nv0.6 improves upon 0.5 in the following ways:\r\n* Improvements on Hindi language understanding.\r\n* Produces a \u003Cnoise> token on noisy or non-human audio.\r\n* Improved background noise and audio quality robustness in all languages.\r\n* New gemma3 and qwen3 variants in addition to our base llama 3.3 models.\r\n\r\n## Evals\r\nNew eval support for [VoiceBench](https:\u002F\u002Fgithub.com\u002FMatthewCYM\u002FVoiceBench). The benchmark tests speech-language models on 9 different tasks ranging from open-form text generation to question-answering and instruction following.\r\n\r\n## Training\r\nThis version of Ultravox continues to use a frozen Llama pre-trained core (3.1 for 8B and 3.3 for 70B), along with new gemma3 27b and qwen3 32b variants.\r\n\r\n## What's Changed\r\n* Training Stability: Patch HF Hub and Datasets methods and update `datasets.py` by @farzadab #280\r\n* General improvement by @zqhuang211 #281\r\n* Ultravox v0.6 + general improvements by @liPatrick #309\r\n* Allow response generation with no user message by @matthewclso #310\r\n* Add voicebench evaluation suite by @zqhuang211 #312\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fcompare\u002Fv0.5...v0.6","2025-08-18T15:48:28",{"id":169,"version":170,"summary_zh":171,"released_at":172},102860,"v0.5","We're releasing **Ultravox v0.5** today. The [weights](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai) have been pushed to Hugging Face. If you're using the [Ultravox Realtime APIs](https:\u002F\u002Fdocs.ultravox.ai), v0.5 is the new default.\r\n\r\n## What's New\r\nv0.5 improves upon 0.4.1 in the following ways:\r\n* 60% improvement in transcription accuracy, with lower word error rates (WER) across 82 evaluation sets from LibriSpeech, CommonVoice, and Fleurs.\r\n* 18% improvement in speech-based web question answering, particularly in handling named entities and fine-grained speech details.\r\n* 24% improvement in X-to-English translation, as measured by BLEU across 19 languages.\r\n* Expanded language support from 15 to 42 languages, making it significantly more accessible for global applications.\r\n\r\n#### 42 Languages Supported\r\nArabic, Belarusian, Bengali, Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hindi, Hungarian, Italian, Japanese, Latvian, Lithuanian, Macedonian, Marathi, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.\r\n\r\n## Evals\r\nOur primary method of evaluation is speech translation, measured by [BLEU](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBLEU) and, newly for v0.5, Big Bench Audio for general reasoning in response to audio input.\r\n\r\n### Ultravox 70B\r\n|  | Ultravox 0.4.1 70B | Ultravox 0.5 70B |\r\n| --- | ---: | ---: |\r\n| **covost2 en_ar** | 19.64 | 20.21 |\r\n| **covost2 en_de** | 32.47 | 34.53 |\r\n| **covost2 es_en** | 40.76 | 43.29 |\r\n| **covost2 ru_en** | 45.07 | 48.99 |\r\n| **covost2 en_ca** | 37.58 | 40.01 |\r\n| **covost2 zh_en** | 17.98 | 21.37 |\r\n| **big bench audio** | 76.20 | 82.70 |\r\n\r\n### Ultravox 8B\r\n|  | Ultravox 0.4.1 8B | Ultravox 0.5 8B |\r\n| --- | ---: | ---: |\r\n| **covost2 en_ar** | 12.28 | 12.99 |\r\n| **covost2 en_ca** | 29.94 | 31.54 |\r\n| **covost2 en_de** | 27.13 | 28.70 |\r\n| **covost2 es_en** | 39.16 | 40.19 |\r\n| **covost2 ru_en** | 39.65 | 42.13 |\r\n| **covost2 zh_en** | 14.55 | 17.22 |\r\n| **big bench audio** | 63.20 | 66.54 |\r\n\r\n## Training\r\nThis version of Ultravox continues to use a frozen Llama pre-trained core (3.1 for 8B and 3.3 for 70B), but we've significantly increased the size of the data and the overall training time. The training time on 8xH100s is about 100 hours for the 8B model and 150 hours for the 70B model.\r\n\r\n## What's Changed\r\n* Audio streaming training with masking by @saeeddhqan in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F148\r\n* Defining block size in UltravoxConfig, and solving assertions by @saeeddhqan in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F157\r\n* Gradio demo for real-time conversations with WebRTC by @freddyaboulton in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F150\r\n* Fix \"AttributeError: 'NoneType' object has no attribute 'tokenizer'\" by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F173\r\n* docs: update README.md by @eltociear in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F174\r\n* Update ultravox model and config for v0.5 by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F276\r\n\r\n## New Contributors\r\n* @saeeddhqan made their first contribution in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F148\r\n* @freddyaboulton made their first contribution in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F150\r\n* @eltociear made their first contribution in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F174\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fcompare\u002Fv0.4.1...v0.5","2025-02-11T01:35:49",{"id":174,"version":175,"summary_zh":176,"released_at":177},102861,"v0.4.1","We're releasing **Ultravox 0.4.1** today. The [weights](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai) have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox Realtime APIs, v0.4.1 is the new default.\r\n\r\nWe'd love to hear feedback on your experience with Ultravox, along with feature suggestions.\r\n\r\n## What's New\r\nv0.4.1 improves upon 0.4 in the following ways:\r\n* We've upgraded the Whisper encoder from Whisper-medium to Whisper-large-v3-turbo. This has led to quality improvements (see the table below).\r\n* We're adding six new languages: Chinese, Dutch, Hindi, Swedish, Turkish, and Ukrainian. That brings the total supported languages to 15 (see table below).\r\n* Increased the amount of training data for English.\r\n\r\n#### 15 Languages Supported\r\n| Language | ISO Code |\r\n| --- | --- |\r\n| Arabic | ar |\r\n| Chinese | zh |\r\n| Dutch | nl |\r\n| English | en |\r\n| French | fr |\r\n| German | de |\r\n| Hindi | hi |\r\n| Italian | it |\r\n| Japanese | ja |\r\n| Portuguese | pt |\r\n| Russian | ru |\r\n| Spanish | es |\r\n| Swedish | sv |\r\n| Turkish | tr |\r\n| Ukrainian | uk |\r\n\r\n## Evals\r\nOur primary method of evaluation is speech translation, measured by [BLEU](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBLEU), as a proxy for general instruction-following capability (the higher the number the better). `ca` is an example of model performance for languages not included in training.\r\n\r\n### Ultravox 70B\r\n|  | Ultravox 0.4 70B | Ultravox 0.4.1 70B |\r\n| --- | ---: | ---: |\r\n| **en_ar** | 14.97 | 19.64 |\r\n| **en_de** | 30.30 | 32.47 |\r\n| **es_en** | 39.55 | 40.76 |\r\n| **ru_en** | 44.16 | 45.07 |\r\n| **en_ca** | 35.02 | 37.58 |\r\n| **zh_en** | 12.16 | 17.98 |\r\n\r\n### Ultravox 8B\r\n|  | Ultravox 0.4 8B | Ultravox 0.4.1 8B |\r\n| --- | ---: | ---: |\r\n| **en_ar** | 11.17 | 12.28 |\r\n| **en_de** | 25.47 | 27.13 |\r\n| **es_en** | 37.11 | 39.16 |\r\n| **ru_en** | 38.96 | 39.65 |\r\n| **en_ca** | 27.46 | 29.94 |\r\n| **zh_en** | 10.08 | 14.55 |\r\n\r\n## Training\r\nThis version of Ultravox continues to use a frozen Llama 3.1 pre-trained core (for both 8B and 70B), but we've significantly increased the size of the data and the overall training time. The speech adapter was trained on >10k hours of multilingual speech data. 
The training time on 8xH100s is about 24 hours for the 8B model and 3 days for the 70B model.\r\n\r\n## What's Changed\r\n* Bugfix: push_to_hub to use correct model to test by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F98\r\n* Integrating OAI evals post training by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F85\r\n* Make sure do_eval works without do_train by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F100\r\n* Add AutoProcessor registration by @petersalas in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F102\r\n* Support num epochs in config by @liPatrick in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F90\r\n* Assert dataset length when using epochs by @liPatrick in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F104\r\n* Add chunking to ds_tool by @liPatrick in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F97\r\n* max_duration for Mosaic jobs by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F112\r\n* Not uploading text_config when text_model_id is present by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F108\r\n* [70B-Part1] Prefetch weights separately by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F106\r\n* [Bugfix] Dot in output_dir causes evals to fail by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F115\r\n* Update oaieval dependency by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F114\r\n* Bugfix for path replace by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F116\r\n* [70B-Part2] Improved save model (that can work with FSDP) by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F107\r\n* [70B-Part3] FSDP Training by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F109\r\n* [70B-Part4] Config and init_empty_weights by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F117\r\n* Update README: use cases for Ultravox training by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F118\r\n* Create test for config_base.py by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F119\r\n* Using fixie-ai version of peoples_speech by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F125\r\n* Dataset Tool to add Timestamps by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F121\r\n\r\n## New Contributors\r\n* @petersalas made their first contribution in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F102\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fcompare\u002Fv0.4...v0.4.1","2024-11-12T22:48:06",{"id":179,"version":180,"summary_zh":181,"released_at":182},102862,"v0.4","Hey everyone,\r\n\r\nWe're releasing **Ultravox 0.4** today. The [weights](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai\u002Fultravox-v0_4) have been pushed to Hugging Face (along with updated datasets for training). 
If you're using the Ultravox APIs, v0.4 is the new default.\r\n\r\nThere are two key differences between 0.3 and 0.4:\r\n* We've upgraded the Whisper encoder from Whisper-small to Whisper-medium\r\n* We've trained on a larger set of multi-lingual data. Previous versions of Ultravox were only trained on English. Supported languages are now `ar`, `de`, `en`, `es`, `fr`, `it`, `ja`, `pt`, `ru`.\r\n\r\nv0.4 builds upon the work in 0.3 and continues to show improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by [BLEU](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBLEU), as a proxy for general instruction-following capability (the higher the number the better). `ca` and `zh` are examples of model performance for languages not included in training.\r\n\r\n|  | Ultravox 0.3 | Ultravox 0.4 |\r\n| --- | ---: | ---: |\r\n| **en_ar** | 9.07 | 28.07 |\r\n| **en_de** | 22.67 | 25.60 |\r\n| **es_en** | 24.10 | 31.03 |\r\n| **ru_en** | 22.52 | 38.96 |\r\n| **en_ca** | 24.87 | 27.49 |\r\n| **zh_en** | 4.26 | 10.08 |\r\n\r\nThis version of Ultravox continues to use a frozen Llama 3.1 8B pre-trained core, but we've roughly doubled the size of the data and the overall training time. The speech adapter was trained on ~5k hours of speech from LibriSpeech, Common Voice, Peoples Speech, and AnyInstruct. The training time on 8xH100s is roughly 170 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.3 was trained on ~2.5k hours of audio.\r\n\r\nWe'd love to hear feedback on your experience with Ultravox, along with feature suggestions. Roadmap coming soon.\r\n\r\n## What's Changed\r\n* Update gradio demo to support text\u002Fvoice conversation by @zqhuang211 in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F75\r\n* Offline batch inference mode by @liPatrick in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F82\r\n* Live reload for Gradio demo by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F89\r\n* Working AutoProcessor.from_pretrained by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F92\r\n* Use bfloat16 by default on MPS by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F95\r\n* Add retry and filter in ds tool by @liPatrick in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F81\r\n* Change tokenizer padding_side to left for eval by @zqhuang211 in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F96\r\n* Make v0.4 release by @zqhuang211 in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F99\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fcompare\u002Fv0.3...v0.4","2024-08-27T01:12:23",{"id":184,"version":185,"summary_zh":186,"released_at":187},102863,"v0.3","Hey everyone,\r\n\r\nWe're officially making **Ultravox 0.3** available today. The [weights](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai\u002Fultravox-v0_3) have been pushed to Hugging Face (along with updated datasets for training), and the model training code has been updated as well. We're also opening up early preview access to our Ultravox APIs through our managed service. For more information on that, please go here: [https:\u002F\u002Ffixie-ai.github.io\u002Fultradox\u002F](https:\u002F\u002Ffixie-ai.github.io\u002Fultradox\u002F)\r\n\r\nv0.3 demonstrates substantially improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by [BLEU](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBLEU), as a proxy for general instruction-following capability (the higher the number the better):\r\n\r\n|  | Ultravox 0.2 | Ultravox 0.3 |\r\n| --- | --- | --- |\r\n| **en_de** | 12.07 | 22.68 |\r\n| **es_en** | 15.17 | 24.10 |\r\n\r\nThis version of Ultravox uses a frozen Llama 3.1 8B pre-trained core. The speech adapter was trained on 2.5k hours of speech from both LibriSpeech and CommonVoice. The training time on 8xH100s is roughly 80 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.2 was trained on ~1.5k hours of audio.\r\n\r\nIn addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we're augmenting the ASR data sets with synthetic data in the form of generated continuations. The second change is that we've migrated to a Knowledge Distillation approach for calculating loss. Combined, both of these approaches result in much higher speech-to-text alignment in the adapter. You can learn more in their [respective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00916) [papers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19041).\r\n\r\nThe key benefit of better adapter alignment is that it makes it easier to customize Ultravox to particular needs and use cases by allowing it to extend any pre-trained LLM (including fine-tuned versions) with speech capabilities while retaining core capabilities across modalities. If this is something that interests you, please [get in touch](mailto:hello@fixie.ai).\r\n\r\nWe'd love feedback on the model, so please let us know what works well and what doesn't. To make testing easier, we built a new Gradio demo. 
To run it, simply run `just gradio` inside of the Ultravox folder.\r\n\r\n## What's Changed\r\n* Remove legacy directory by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F1\r\n* Improved Evaluations by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F2\r\n* Audio Encoder to bfloat16 by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F4\r\n* Whisper encoder + No 30 second padding by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F5\r\n* Optionally include \"passage\" in BoolQ samples by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F6\r\n* Add tts_tool, for converting a HF dataset to audio by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F12\r\n* Add logging code by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F19\r\n* Fixes the default HF model name by @cezarc1 in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F13\r\n* Update Hugging Face link by @simonw in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F17\r\n* Don't run tests on docs changes by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F21\r\n* Local tokenizer and processor for more consistent CI by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F16\r\n* Tool for uploading to HF Hub by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F15\r\n* Remove mlflow dependency by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F23\r\n* Switch from Pip to Poetry by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F24\r\n* Tool for adding new synthetic columns by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F14\r\n* entails -> provides a rationale for by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F27\r\n* Add @file syntax to ds_tool by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F28\r\n* datasets: Handle converting `int16` audio data in `VoiceSample`. by @shaper in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F26\r\n* Allow for toggling training and eval on\u002Foff by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F29\r\n* Add Eleven and Fireworks support to ds_tool by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F31\r\n* Don't fail basic inference due to missing OAI key by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F34\r\n* BoolQ for Training and Eval by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F30\r\n* Extending `ds_tool` for SODA conversational dataset by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F32\r\n* Add streaming support, using HF TextStreamer by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F46\r\n* Minor fixes to ds_tool and infer_tool by @juberti in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F36\r\n* SODA Dataset for Training by @farzadab in https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Fpull\u002F35\r\n*","2024-08-23T00:02:16"]