[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-GetStream--Vision-Agents":3,"tool-GetStream--Vision-Agents":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":80,"owner_website":82,"owner_url":83,"languages":84,"stars":89,"forks":90,"last_commit_at":91,"license":92,"difficulty_score":23,"env_os":93,"env_gpu":94,"env_ram":93,"env_deps":95,"category_tags":104,"github_topics":105,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":117,"updated_at":118,"faqs":119,"releases":149},3537,"GetStream\u002FVision-Agents","Vision-Agents","Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.","Vision-Agents 是由 Stream 推出的开源框架，旨在帮助开发者快速构建能够“看、听、理解”视频的多模态 AI 智能体。它解决了传统方案中视频处理延迟高、多模型整合困难以及实时交互体验不佳的痛点，让创建低延迟的视频 AI 应用变得简单高效。\n\n这款工具特别适合需要开发实时视频交互应用的开发者，例如打造运动教练助手、无人机火情监测、物理治疗指导或互动游戏等场景。其核心优势在于极致的低延迟表现：利用 Stream 的边缘网络，用户可在 500 毫秒内快速加入会话，并将音视频延迟控制在 30 毫秒以内，确保对话流畅自然。\n\n在技术架构上，Vision-Agents 具有高度的开放性与灵活性。它不仅支持原生调用 OpenAI、Gemini 和 Claude 等主流大模型的最新能力，还允许开发者灵活集成 YOLO、Roboflow 等目标检测模型，形成自定义的处理流水线。此外，它提供了覆盖 React、iOS、Android、Unity 等多平台的 SDK，并内置了语音活动检测（VAD）和智能轮转机制，让智能体能像真人一样进行自然的实时对话与工具调用。无论是初创团队还是资深工程师，","Vision-Agents 是由 Stream 推出的开源框架，旨在帮助开发者快速构建能够“看、听、理解”视频的多模态 AI 智能体。它解决了传统方案中视频处理延迟高、多模型整合困难以及实时交互体验不佳的痛点，让创建低延迟的视频 AI 应用变得简单高效。\n\n这款工具特别适合需要开发实时视频交互应用的开发者，例如打造运动教练助手、无人机火情监测、物理治疗指导或互动游戏等场景。其核心优势在于极致的低延迟表现：利用 Stream 的边缘网络，用户可在 500 毫秒内快速加入会话，并将音视频延迟控制在 30 毫秒以内，确保对话流畅自然。\n\n在技术架构上，Vision-Agents 具有高度的开放性与灵活性。它不仅支持原生调用 OpenAI、Gemini 和 Claude 等主流大模型的最新能力，还允许开发者灵活集成 YOLO、Roboflow 等目标检测模型，形成自定义的处理流水线。此外，它提供了覆盖 React、iOS、Android、Unity 等多平台的 SDK，并内置了语音活动检测（VAD）和智能轮转机制，让智能体能像真人一样进行自然的实时对话与工具调用。无论是初创团队还是资深工程师，都能借助 Vision-Agents 轻松将创意转化为现实的实时视频 AI 产品。","![VisionAgents](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_e6426b09ae44.png)\n\n# Open Vision Agents by Stream\n\n[![build](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Factions)\n[![PyPI version](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fvision-agents.svg)](http:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fvision-agents)\n![PyPI - Python Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fvision-agents.svg)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FGetStream\u002FVision-Agents)](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fblob\u002Fmain\u002FLICENSE)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1108586339550638090)](https:\u002F\u002Fdiscord.gg\u002FRkhX9PxMS6)\n[![X (Twitter)](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FX-@visionagents__ai-000000?logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002Fvisionagents_ai)\n\n### Multi-modal AI agents that watch, listen, and understand video.\n\n[Vision Agents](https:\u002F\u002Fvisionagents.ai\u002F) give you the building blocks to create intelligent, low-latency video experiences powered by your models,\nyour infrastructure, and your use cases.\n\n### Key Highlights\n\n- **Video AI:** Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini\u002FOpenAI in real-time.\n- **Low Latency:** Join quickly (500ms) and maintain audio\u002Fvideo latency under 30ms\n  using [Stream's edge network](https:\u002F\u002Fgetstream.io\u002Fvideo\u002F?utm_source=github.com&utm_medium=referral&utm_campaign=vision_agents).\n- **Open:** Built by Stream, but works with any video edge network.\n- **Native APIs:** Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (\n  `create message`) — always access the latest LLM capabilities.\n- **SDKs:** SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency\n  network.\n\n## Getting Started\n\n**Step 1: Install via uv**\n\n`uv add vision-agents`\n\n**Step 2: (Optional) Install with extra integrations**\n\n`uv add \"vision-agents[getstream, openai, elevenlabs, deepgram]\"`\n\n**Step 3: Obtain your Stream API credentials**\n\nGet a free API key from [Stream](https:\u002F\u002Fgetstream.io\u002Ftry-for-free\u002F?utm_source=github.com&utm_medium=referral&utm_campaign=vision_agents). Developers receive **333,000 participant minutes** per month,\nplus extra credits via the Maker Program.\n\nFollow the [quickstart guide](https:\u002F\u002Fvisionagents.ai\u002Fintroduction\u002Fquickstart) to build your first agent.\n\n## See It In Action\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd1258ac2-ca98-4019-80e4-41ec5530117e\n\nThis example shows you how to build golf coaching AI with YOLO and Gemini Live.\nCombining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use\ncases.\nFor example: Drone fire detection, sports\u002Fvideo game coaching, physical therapy, workout coaching, just dance style\ngames etc.\n\n```python\n# partial example, full example: examples\u002F02_golf_coach_example\u002Fgolf_coach_example.py\nagent = Agent(\n    edge=getstream.Edge(),\n    agent_user=agent_user,\n    instructions=\"Read @golf_coach.md\",\n    llm=gemini.Realtime(fps=10),\n    processors=[ultralytics.YOLOPoseProcessor(model_path=\"yolo11n-pose.pt\", device=\"cuda\")],\n)\n```\n\n## Features\n\n| **Feature**              | **Description**                                                                                         |\n|--------------------------|---------------------------------------------------------------------------------------------------------|\n| **Real-time WebRTC**     | Stream video directly to model providers for instant visual understanding.                              |\n| **Video Processing**     | Pluggable processor pipeline for YOLO, Roboflow, or custom PyTorch\u002FONNX models before\u002Fafter LLM calls. |\n| **Turn Detection**       | Natural conversation flow with VAD, diarization, and smart turn-taking.                                 |\n| **Tool Calling & MCP**   | Execute code and APIs mid-conversation — Linear issues, weather, telephony, or any MCP server.          |\n| **Phone Integration**    | Inbound and outbound voice calls via Twilio with bidirectional audio streaming.                         |\n| **RAG**                  | Retrieval-augmented generation with TurboPuffer vector search or Gemini FileSearch.                     |\n| **Memory**               | Agents recall context across turns and sessions via Stream Chat.                                        |\n| **Text Back-channel**    | Message the agent silently during a call — coaching overlays, silent instructions, etc.                 |\n| **Production Ready**     | Built-in HTTP server, Prometheus metrics, horizontal scaling, and Kubernetes deployment.                |\n\n## Out-of-the-Box Integrations\n\n**LLMs:** [OpenAI](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fopenai) · [Gemini](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fgemini) · [xAI](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fxai) · [OpenRouter](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fopenrouter) · [Hugging Face](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fhuggingface) · [Kimi AI](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fkimi)\n\n**Realtime:** [OpenAI Realtime](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fopenai) · [Gemini Live](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fgemini) · [AWS Nova Sonic](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Faws-bedrock) · [Qwen](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fqwen)\n\n**STT:** [Deepgram](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fdeepgram) · [AssemblyAI](https:\u002F\u002Fwww.assemblyai.com\u002Fdocs\u002Fstreaming\u002Funiversal-3-pro) · [Fast-Whisper](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Ffast-whisper) · [Fish Audio](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Ffish) · [Wizper](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fwizper) · [Mistral Voxtral](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fmistral)\n\n**TTS:** [ElevenLabs](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Felevenlabs) · [Cartesia](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fcartesia) · [Deepgram](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fdeepgram) · [AWS Polly](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Faws-polly) · [Pocket](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fpocket) · [Kokoro](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fkokoro) · [Inworld](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Finworld) · [Fish Audio](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Ffish)\n\n**Vision:** [Ultralytics](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fultralytics) · [Roboflow](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Froboflow) · [Moondream](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fmoondream) · [NVIDIA Cosmos](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fnvidia) · [Decart](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fdecart)\n\n**Avatars:** [LemonSlice](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Flemonslice)\n\n**Turn Detection:** [Vogent](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fvogent) · [Smart Turn](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fsmart-turn)\n\n**Other:** [Twilio](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F03_phone_and_rag_example) · [TurboPuffer](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Frag)\n\n## Documentation\n\nCheck out the full docs at [VisionAgents.ai](https:\u002F\u002Fvisionagents.ai\u002F).\n\n**Quickstart:** [Voice AI](https:\u002F\u002Fvisionagents.ai\u002Fintroduction\u002Fvoice-agents) · [Video AI](https:\u002F\u002Fvisionagents.ai\u002Fintroduction\u002Fvideo-agents)\n\n**Guides:** [MCP & Function Calling](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fmcp-tool-calling) · [Video Processors](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fvideo-processors) · [Phone Calling](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fcalling) · [RAG](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Frag) · [Testing](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Ftesting)\n\n**Production:** [HTTP Server](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fhttp-server) · [Deployment](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fdeployment) · [Kubernetes](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fkubernetes-deployment) · [Horizontal Scaling](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fhorizontal-scaling) · [Prometheus Metrics](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fprometheus-metrics)\n\n## Examples\n\n| 🔮 Demo Applications                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                                                                                         |\n|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|\n| \u003Cbr>\u003Ch3>Voice Agents (Low Latency + RAG + File Search)\u003C\u002Fh3>Build fast voice agents that can reason over knowledge, search files, and respond in real time.\u003Cbr>\u003Cbr>• Low-latency voice interactions\u003Cbr>• Retrieval-augmented responses\u003Cbr>• File and knowledge search\u003Cbr>\u003Cbr> [>Source Code and tutorial](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fplugins\u002Fcartesia\u002Fexample)                                                                                                                                                    | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_1fec746562d7.gif\" width=\"320\" alt=\"Voice Agent Demo\">               |\n| \u003Cbr>\u003Ch3>Realtime Coaching and Video Understanding\u003C\u002Fh3>Power interactive coaching flows with live pose tracking and processor pipelines for frame-by-frame understanding.\u003Cbr>\u003Cbr>• Real-time pose tracking\u003Cbr>• Actionable coaching feedback\u003Cbr>• Video processor pipeline support\u003Cbr>\u003Cbr> [>Source Code and tutorial](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F02_golf_coach_example)                                                     | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_89ba3c6e9d27.gif\" width=\"320\" alt=\"Realtime Coaching Demo\">                 |\n| \u003Cbr>\u003Ch3>Video Restyling and Avatars\u003C\u002Fh3>Use models like Decart Lucy to build virtual try-ons, stylized scenes, or give your agents a visual identity.\u003Cbr>\u003Cbr>• Real-time video restyling\u003Cbr>• Virtual try-on experiences\u003Cbr>• Avatar-like visual presence\u003Cbr>\u003Cbr> [>Source Code and tutorial](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fplugins\u002Fdecart\u002Fexample)                                                                                                    | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_f91f9bc25c0b.gif\" width=\"320\" alt=\"Video Restyling Demo\">           |\n| \u003Cbr>\u003Ch3>Custom Video Models (Roboflow, YOLO, and More)\u003C\u002Fh3>Train and run custom computer vision models for security monitoring, moderation, and other domain-specific workflows.\u003Cbr>\u003Cbr>• Bring your own CV models\u003Cbr>• Real-time moderation pipelines\u003Cbr>• Security and detection use cases\u003Cbr>\u003Cbr> [>Source Code and tutorial](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F11_moderation_example) | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_67b1d19bebb5.gif\" width=\"320\" alt=\"Custom Video Models Demo\">          |\n| \u003Cbr>\u003Ch3>Tools, MCP, and Phone Calling\u003C\u002Fh3>Connect external APIs and services so agents can validate data and take real-world actions during live conversations.\u003Cbr>\u003Cbr>• MCP and function calling support\u003Cbr>• Twilio-based phone workflows\u003Cbr>• Real-time fraud response automation\u003Cbr>\u003Cbr> [>Phone + RAG example](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F03_phone_and_rag_example) · [>Fraud workflow example](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fplugins\u002Fopenai\u002Fexamples\u002Fnemotron_example) | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_e247ac44ce51.gif\" width=\"320\" alt=\"Tools and Phone Demo\"> |\n\n## Development\n\nSee [DEVELOPMENT.md](DEVELOPMENT.md)\n\nWant to add your platform or provider? See [Create Your Own Plugin](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fcreate-your-own-plugin) or reach out to **nash@getstream.io**.\n\n## Current Limitations\n\n- Video AI struggles with small text — models may hallucinate scores, signs, etc.\n- Context degrades on longer sessions (~30s+) for continuous video understanding\n- Most use cases need a mix of specialized models (YOLO, Roboflow) with larger LLMs\n- Real-time models require audio\u002Ftext to trigger responses — video alone won't prompt output\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_2ba2cd5ae162.png)](https:\u002F\u002Fwww.star-history.com\u002F#GetStream\u002Fvision-agents&type=timeline&legend=top-left)\n","![VisionAgents](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_e6426b09ae44.png)\n\n# Stream 开放的视觉智能体\n\n[![构建](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Factions)\n[![PyPI 版本](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fvision-agents.svg)](http:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fvision-agents)\n![PyPI - Python 版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fvision-agents.svg)\n[![许可证](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FGetStream\u002FVision-Agents)](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fblob\u002Fmain\u002FLICENSE)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1108586339550638090)](https:\u002F\u002Fdiscord.gg\u002FRkhX9PxMS6)\n[![X (Twitter)](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FX-@visionagents__ai-000000?logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002Fvisionagents_ai)\n\n### 多模态 AI 智能体，能够观看、聆听并理解视频。\n\n[Vision Agents](https:\u002F\u002Fvisionagents.ai\u002F) 为您提供构建智能化、低延迟视频体验所需的基石，这些体验由您的模型、基础设施和应用场景驱动。\n\n### 核心亮点\n\n- **视频 AI：** 专为实时视频 AI 打造。可将 YOLO、Roboflow 等与 Gemini\u002FOpenAI 实时结合。\n- **低延迟：** 快速加入（500 毫秒），并通过 [Stream 的边缘网络](https:\u002F\u002Fgetstream.io\u002Fvideo\u002F?utm_source=github.com&utm_medium=referral&utm_campaign=vision_agents) 将音视频延迟保持在 30 毫秒以内。\n- **开放：** 由 Stream 构建，但可与任何视频边缘网络配合使用。\n- **原生 API：** 提供来自 OpenAI (`create response`)、Gemini (`generate`) 和 Claude (`create message`) 的原生 SDK 方法——始终访问最新的 LLM 能力。\n- **SDK：** 面向 React、Android、iOS、Flutter、React Native 和 Unity 的 SDK，由 Stream 的超低延迟网络提供支持。\n\n## 开始使用\n\n**步骤 1：通过 uv 安装**\n\n`uv add vision-agents`\n\n**步骤 2：（可选）安装包含额外集成的版本**\n\n`uv add \"vision-agents[getstream, openai, elevenlabs, deepgram]\"`\n\n**步骤 3：获取您的 Stream API 凭证**\n\n从 [Stream](https:\u002F\u002Fgetstream.io\u002Ftry-for-free\u002F?utm_source=github.com&utm_medium=referral&utm_campaign=vision_agents) 获取免费的 API 密钥。开发者每月可获得 **333,000 分钟参与者时长**，并通过 Maker 计划获得更多积分。\n\n按照 [快速入门指南](https:\u002F\u002Fvisionagents.ai\u002Fintroduction\u002Fquickstart) 构建您的第一个智能体。\n\n## 实际演示\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd1258ac2-ca98-4019-80e4-41ec5530117e\n\n此示例展示了如何使用 YOLO 和 Gemini Live 构建高尔夫教练 AI。将快速目标检测模型（如 YOLO）与完整的实时 AI 结合，适用于多种不同的视频 AI 应用场景。例如：无人机火灾检测、体育\u002F电子游戏指导、物理治疗、健身教练、Just Dance 类型的游戏等。\n\n```python\n# 部分示例，完整示例请参见 examples\u002F02_golf_coach_example\u002Fgolf_coach_example.py\nagent = Agent(\n    edge=getstream.Edge(),\n    agent_user=agent_user,\n    instructions=\"阅读 @golf_coach.md\",\n    llm=gemini.Realtime(fps=10),\n    processors=[ultralytics.YOLOPoseProcessor(model_path=\"yolo11n-pose.pt\", device=\"cuda\")],\n)\n```\n\n## 功能特性\n\n| **功能**              | **描述**                                                                                         |\n|--------------------------|---------------------------------------------------------------------------------------------------------|\n| **实时 WebRTC**     | 直接将视频流传输至模型提供商，实现即时视觉理解。                              |\n| **视频处理**     | 可插拔的处理器流水线，用于 YOLO、Roboflow 或自定义 PyTorch\u002FONNX 模型，在 LLM 调用前后进行处理。 |\n| **轮次检测**       | 通过 VAD、说话人分离和智能轮次管理，实现自然的对话流程。                                 |\n| **工具调用 & MCP**   | 在对话过程中执行代码和 API —— 解决线性问题、查询天气、进行电话通信，或调用任何 MCP 服务器。          |\n| **电话集成**    | 通过 Twilio 实现呼入和呼出语音通话，并支持双向音频流。                         |\n| **RAG**                  | 基于 TurboPuffer 向量检索或 Gemini FileSearch 的增强型生成技术。                     |\n| **记忆**               | 智能体可通过 Stream Chat 在不同轮次和会话中回忆上下文信息。                                        |\n| **文本回传通道**    | 在通话期间静默地向智能体发送消息——例如教练叠加层、静默指令等。                 |\n| **生产就绪**     | 内置 HTTP 服务器、Prometheus 指标、水平扩展和 Kubernetes 部署。                |\n\n## 即插即用的集成\n\n**LLMs：** [OpenAI](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fopenai) · [Gemini](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fgemini) · [xAI](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fxai) · [OpenRouter](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fopenrouter) · [Hugging Face](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fhuggingface) · [Kimi AI](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fkimi)\n\n**实时服务：** [OpenAI Realtime](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fopenai) · [Gemini Live](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fgemini) · [AWS Nova Sonic](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Faws-bedrock) · [Qwen](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fqwen)\n\n**STT：** [Deepgram](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fdeepgram) · [AssemblyAI](https:\u002F\u002Fwww.assemblyai.com\u002Fdocs\u002Fstreaming\u002Funiversal-3-pro) · [Fast-Whisper](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Ffast-whisper) · [Fish Audio](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Ffish) · [Wizper](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fwizper) · [Mistral Voxtral](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fmistral)\n\n**TTS：** [ElevenLabs](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Felevenlabs) · [Cartesia](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fcartesia) · [Deepgram](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fdeepgram) · [AWS Polly](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Faws-polly) · [Pocket](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fpocket) · [Kokoro](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fkokoro) · [Inworld](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Finworld) · [Fish Audio](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Ffish)\n\n**视觉：** [Ultralytics](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fultralytics) · [Roboflow](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Froboflow) · [Moondream](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fmoondream) · [NVIDIA Cosmos](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fnvidia) · [Decart](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fdecart)\n\n**虚拟形象：** [LemonSlice](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Flemonslice)\n\n**轮次检测：** [Vogent](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fvogent) · [Smart Turn](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fsmart-turn)\n\n**其他：** [Twilio](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F03_phone_and_rag_example) · [TurboPuffer](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Frag)\n\n## 文档\n\n请在 [VisionAgents.ai](https:\u002F\u002Fvisionagents.ai\u002F) 查看完整文档。\n\n**快速入门：** [语音 AI](https:\u002F\u002Fvisionagents.ai\u002Fintroduction\u002Fvoice-agents) · [视频 AI](https:\u002F\u002Fvisionagents.ai\u002Fintroduction\u002Fvideo-agents)\n\n**指南：** [MCP 与函数调用](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fmcp-tool-calling) · [视频处理器](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fvideo-processors) · [电话呼叫](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fcalling) · [RAG](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Frag) · [测试](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Ftesting)\n\n**生产环境：** [HTTP 服务器](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fhttp-server) · [部署](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fdeployment) · [Kubernetes](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fkubernetes-deployment) · [水平扩展](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fhorizontal-scaling) · [Prometheus 指标](https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fprometheus-metrics)\n\n## 示例\n\n| 🔮 演示应用                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                                                                                         |\n|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|\n| \u003Cbr>\u003Ch3>语音代理（低延迟 + RAG + 文件搜索）\u003C\u002Fh3>构建能够基于知识进行推理、搜索文件并实时响应的高速语音代理。\u003Cbr>\u003Cbr>• 低延迟语音交互\u003Cbr>• 增强检索式响应\u003Cbr>• 文件与知识搜索\u003Cbr>\u003Cbr> [>源代码与教程](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fplugins\u002Fcartesia\u002Fexample)                                                                                                                                                    | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_1fec746562d7.gif\" width=\"320\" alt=\"语音代理演示\">               |\n| \u003Cbr>\u003Ch3>实时教练与视频理解\u003C\u002Fh3>利用实时姿态跟踪和逐帧理解的处理器流水线，赋能互动式教练流程。\u003Cbr>\u003Cbr>• 实时姿态跟踪\u003Cbr>• 可操作的教练反馈\u003Cbr>• 视频处理器流水线支持\u003Cbr>\u003Cbr> [>源代码与教程](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F02_golf_coach_example)                                                     | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_89ba3c6e9d27.gif\" width=\"320\" alt=\"实时教练演示\">                 |\n| \u003Cbr>\u003Ch3>视频重制与虚拟形象\u003C\u002Fh3>使用 Decart Lucy 等模型，构建虚拟试穿、风格化场景，或为您的代理赋予视觉形象。\u003Cbr>\u003Cbr>• 实时视频重制\u003Cbr>• 虚拟试穿体验\u003Cbr>• 类似虚拟形象的视觉呈现\u003Cbr>\u003Cbr> [>源代码与教程](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fplugins\u002Fdecart\u002Fexample)                                                                                                    | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_f91f9bc25c0b.gif\" width=\"320\" alt=\"视频重制演示\">           |\n| \u003Cbr>\u003Ch3>自定义视频模型（Roboflow、YOLO 等）\u003C\u002Fh3>训练并运行自定义计算机视觉模型，用于安全监控、内容审核及其他领域特定的工作流。\u003Cbr>\u003Cbr>• 使用您自己的 CV 模型\u003Cbr>• 实时内容审核流水线\u003Cbr>• 安全与检测应用场景\u003Cbr>\u003Cbr> [>源代码与教程](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F11_moderation_example) | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_67b1d19bebb5.gif\" width=\"320\" alt=\"自定义视频模型演示\">          |\n| \u003Cbr>\u003Ch3>工具、MCP 与电话呼叫\u003C\u002Fh3>连接外部 API 和服务，使代理能够在实时对话中验证数据并采取现实世界行动。\u003Cbr>\u003Cbr>• 支持 MCP 和函数调用\u003Cbr>• 基于 Twilio 的电话工作流\u003Cbr>• 实时欺诈响应自动化\u003Cbr>\u003Cbr> [>电话 + RAG 示例](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fexamples\u002F03_phone_and_rag_example) · [>欺诈工作流示例](https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Ftree\u002Fmain\u002Fplugins\u002Fopenai\u002Fexamples\u002Fnemotron_example) | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_e247ac44ce51.gif\" width=\"320\" alt=\"工具与电话演示\"> |\n\n## 开发\n\n请参阅 [DEVELOPMENT.md](DEVELOPMENT.md)\n\n希望添加您的平台或提供商？请参阅 [创建您自己的插件](https:\u002F\u002Fvisionagents.ai\u002Fintegrations\u002Fcreate-your-own-plugin)，或联系 **nash@getstream.io**。\n\n## 当前限制\n\n- 视频 AI 在处理小尺寸文本时表现不佳——模型可能会产生幻觉，例如误读分数、标志等。\n- 对于连续视频理解，较长会话（约 30 秒以上）会导致上下文质量下降。\n- 大多数用例需要将专用模型（如 YOLO、Roboflow）与大型 LLM 结合使用。\n- 实时模型需要音频或文本触发响应——仅靠视频本身无法生成输出。\n\n## 星级历史\n\n[![星级历史图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_readme_2ba2cd5ae162.png)](https:\u002F\u002Fwww.star-history.com\u002F#GetStream\u002Fvision-agents&type=timeline&legend=top-left)","# Vision-Agents 快速上手指南\n\nVision-Agents 是一个由 Stream 开源的多模态 AI 代理框架，专为实时视频和语音交互设计。它支持低延迟（\u003C30ms）的视频流处理，可轻松集成 YOLO、Roboflow 等视觉模型与 Gemini、OpenAI 等大语言模型，适用于体育教练、安防监控、虚拟化身等场景。\n\n## 环境准备\n\n在开始之前，请确保满足以下系统要求：\n\n- **操作系统**: Linux, macOS, 或 Windows (WSL 推荐)\n- **Python 版本**: Python 3.9 - 3.12\n- **包管理器**: 推荐使用 [`uv`](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) (极速 Python 包安装器)，也可使用 pip\n- **API 凭证**: \n  - [Stream API Key](https:\u002F\u002Fgetstream.io\u002Ftry-for-free\u002F) (用于低延迟视频网络，新用户每月赠送 333,000 参与分钟数)\n  - 对应的大模型 API Key (如 Google Gemini, OpenAI 等)\n- **硬件加速 (可选)**: 若运行本地视觉模型 (如 YOLO)，建议配备 NVIDIA GPU 并安装 CUDA 驱动\n\n## 安装步骤\n\n### 1. 安装核心库\n\n推荐使用 `uv` 进行安装，速度更快且依赖解析更精准：\n\n```bash\nuv add vision-agents\n```\n\n若使用 `pip`：\n\n```bash\npip install vision-agents\n```\n\n### 2. 安装额外集成组件 (可选)\n\n根据需求安装特定的服务商集成（如 Stream 视频网络、OpenAI、ElevenLabs 语音、Deepgram 转录等）：\n\n```bash\nuv add \"vision-agents[getstream, openai, elevenlabs, deepgram]\"\n```\n\n> **提示**：国内开发者若遇到网络问题，可配置 uv\u002Fpip 使用国内镜像源（如清华源、阿里源）加速下载。\n\n## 基本使用\n\n以下是一个构建**实时高尔夫教练 AI**的最小化示例。该示例结合了 Ultralytics YOLO 进行姿态检测，并使用 Google Gemini Live 进行实时多模态交互。\n\n### 代码示例\n\n```python\n# 完整示例参考：examples\u002F02_golf_coach_example\u002Fgolf_coach_example.py\nfrom vision_agents import Agent\nfrom vision_agents.edge import getstream\nfrom vision_agents.llm import gemini\nfrom vision_agents.processors import ultralytics\n\n# 初始化 Agent\nagent = Agent(\n    edge=getstream.Edge(),  # 使用 Stream 边缘网络实现低延迟传输\n    agent_user=agent_user,  # 用户上下文对象\n    instructions=\"Read @golf_coach.md\", # 系统指令文件\n    llm=gemini.Realtime(fps=10), # 使用 Gemini Realtime 模式，每秒处理 10 帧\n    processors=[\n        ultralytics.YOLOPoseProcessor(\n            model_path=\"yolo11n-pose.pt\", \n            device=\"cuda\" # 指定使用 GPU 加速\n        )\n    ],\n)\n\n# 启动代理逻辑 (具体启动方式视应用场景而定，如 WebRTC 连接)\n# await agent.run() \n```\n\n### 核心流程说明\n\n1.  **配置 Edge**: 通过 `getstream.Edge()` 接入低延迟视频网络，确保音视频延迟低于 30ms。\n2.  **选择 LLM**: 使用 `gemini.Realtime` 或其他支持的实时模型（如 OpenAI Realtime），实现“边看边听边说”。\n3.  **挂载处理器**: 在 `processors` 列表中注入计算机视觉模型（如 YOLO），可在发送给 LLM 前对视频帧进行预处理（如提取骨骼关键点、检测物体）。\n4.  **运行**: 结合前端 SDK (React\u002FiOS\u002FAndroid 等) 建立 WebRTC 连接，即可开始实时互动。\n\n更多详细用法、MCP 工具调用及电话集成示例，请访问 [VisionAgents.ai 官方文档](https:\u002F\u002Fvisionagents.ai\u002F)。","一家智能健身初创公司正在开发一款基于摄像头的实时动作纠正应用，旨在通过视频分析指导用户完成标准的深蹲和硬拉动作。\n\n### 没有 Vision-Agents 时\n- **延迟过高导致反馈滞后**：传统架构需先将视频上传至云端处理再返回结果，端到端延迟往往超过 500ms，用户做完动作后才收到错误提示，失去纠正意义。\n- **多模型集成复杂**：开发者需自行编写胶水代码串联 YOLO 姿态识别模型与大语言模型（LLM），维护不同 SDK 的兼容性耗费大量精力。\n- **并发成本高昂**：随着用户量增加，中心化服务器带宽和算力成本呈指数级上升，难以支撑大规模实时视频流分析。\n- **交互体验生硬**：缺乏原生的语音打断和自然对话机制，AI 教练只能单向播报，无法在用户提问时即时响应。\n\n### 使用 Vision-Agents 后\n- **毫秒级实时反馈**：利用 Stream 的边缘网络，视频流直接接入模型，将音视频延迟控制在 30ms 以内，用户在动作变形瞬间即可听到纠正指令。\n- **流水线式快速构建**：通过内置的 `YOLOPoseProcessor` 插件，只需几行代码即可将姿态识别与 Gemini 实时大模型无缝结合，大幅缩短开发周期。\n- **弹性边缘架构**：借助分布式边缘节点处理视频流，显著降低中心服务器负载，以更低成本支撑高并发用户同时在线训练。\n- **拟人化互动体验**：原生支持 VAD（语音活动检测）和智能轮转机制，AI 教练能像真人一样倾听用户疑问并即时插话指导，交互自然流畅。\n\nVision-Agents 通过边缘计算与多模态模型的深度整合，将高延迟的视频分析任务转化为低延迟、可交互的实时智能体验。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FGetStream_Vision-Agents_e6426b09.png","GetStream","Stream","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FGetStream_e8ed76a4.png","Build scalable in-app chat, feeds, and live video with AI moderation capabilities in hours, not weeks.",null,"support@getstream.io","https:\u002F\u002Fgetstream.io","https:\u002F\u002Fgithub.com\u002FGetStream",[85],{"name":86,"color":87,"percentage":88},"Python","#3572A5",100,7639,624,"2026-04-04T19:04:57","Apache-2.0","未说明","可选。若使用 YOLO 等视觉处理器需 NVIDIA GPU（示例代码显示 device=\"cuda\"），具体型号和显存未说明；若仅使用云端 LLM 可不依赖本地 GPU。",{"notes":96,"python":97,"dependencies":98},"该工具主要作为编排框架，重度依赖外部 API（如 OpenAI, Gemini, Deepgram, ElevenLabs 等）和本地视觉模型插件。安装推荐使用 'uv' 包管理器。若运行本地视觉处理（如 YOLO），需自行配置 CUDA 环境；若仅调用云端多模态模型，则对本地硬件要求较低。支持通过插件扩展集成 Roboflow、Twilio 等服务。","3.8+ (根据 PyPI badge 推断，具体版本需参考 PyPI 页面)",[99,100,101,102,103],"vision-agents","ultralytics (用于 YOLO)","torch (PyTorch, 用于自定义模型)","onnx (可选，用于模型推理)","stream-video-sdk (隐含，用于 Stream 网络)",[13,15,14,52,55],[106,107,108,109,110,111,112,113,114,115,116],"ai","ai-agents","vision-ai","voice-ai","video-agents","agentic-ai","agents","realtime","stt","tts","video-ai","2026-03-27T02:49:30.150509","2026-04-06T05:16:50.871366",[120,125,130,135,140,145],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},16211,"为什么无法从 vision_agents.core.llm.events 导入 RealtimeTranscriptEvent？","如果您使用的是实时 LLM（Realtime LLM），目前可能无法直接获取完整的转录文本。解决方案是：如果使用的是普通 LLM，可以接入任何 STT（语音转文本）提供商来替代。此外，您可以尝试将不同的部分转录片段（partial transcripts）拼接起来以获得完整的转录内容，而不是依赖单一的实时事件导入。","https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fissues\u002F222",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},16212,"为什么 Vision Agent 的 process_image 或 process_video 方法没有被调用，导致无法检测视频中的物体？","这通常是由时序问题引起的。常见原因及解决方法包括：\n1. **工具调用过早**：在帧到达之前就调用了分析工具，导致获取不到数据。解决方法是添加重试循环。\n2. **问候语发送过早**：代理在浏览器发布视频轨道之前就发送了问候语。解决方法是在加入通话后添加一个帧等待循环（例如最多等待 10 秒），确保检测到帧后再发送初始问候。\n3. **检测结果被重置**：即使未检测到物体，`latest_detections` 也会在每一帧被重置。解决方法是仅当检测到新物体时才更新该变量，并添加超时机制（如 3 秒）。\n4. **会话污染**：上一次呼叫的检测数据影响了新会话。解决方法是在处理器中添加 `reset()` 方法，在每次新会话开始时清除旧数据。","https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fissues\u002F147",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},16213,"使用 uv 安装 vision-agents 时遇到 scipy 构建失败错误怎么办？","该问题通常与构建环境或代码中的异步语法有关。首先尝试检查代码中是否有错误的异步用法，例如将 `async with await agent.join(call)` 修改为 `with await agent.join(call)`（去掉多余的 async）。如果问题依然存在且涉及 scipy 等底层库的编译错误，请确保您的系统已安装必要的构建工具（如 macOS 上的 Xcode Command Line Tools 或 Linux 上的 build-essential），或者尝试使用标准的 pip 进行安装以绕过特定的 uv 构建缓存问题。","https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fissues\u002F125",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},16214,"在 Windows 上安装 Vision Agents 时遇到 NumPy 编译错误如何解决？","Windows 用户遇到此类错误通常是因为缺少编译 Python C 扩展所需的构建工具。解决方法是安装 **Visual Studio Build Tools**。安装完成后，重新运行安装命令即可成功构建 NumPy 及其他依赖包。确保您的 Python 版本与预编译包的兼容性，必要时升级或降级 Python 版本。","https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fissues\u002F206",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},16215,"遇到 JWTAuth 错误 'token used before issue at (iat)' 是什么原因？","此错误通常表示生成的 JWT 令牌中的 'iat'（issued at，签发时间）戳记晚于当前服务器时间，或者客户端与服务器之间存在显著的时间不同步。解决方法包括：\n1. 校准您的系统时钟，确保其与网络时间同步。\n2. 检查生成令牌的代码逻辑，确保 'iat' 字段使用的是当前的准确时间戳。\n3. 如果是在容器或虚拟机中运行，请检查宿主机的时间设置是否正确。","https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fissues\u002F112",{"id":146,"question_zh":147,"answer_zh":148,"source_url":124},16216,"如何在前端接收转录文本而不是使用 call.startTranscription？","如果您使用的是实时 LLM，直接控制转录流的能力有限。建议的替代方案是：切换到使用普通 LLM，并集成第三方 STT（语音转文本）服务提供商。通过这种方式，您可以更灵活地处理转录数据，并将其发送到前端，而不是依赖内置的 `call.startTranscription` 方法。同时，可以考虑在后端拼接部分转录片段后再发送给前端，以保证文本的完整性。",[150,155,160,165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245],{"id":151,"version":152,"summary_zh":153,"released_at":154},94488,"v0.5.0","## 变更内容\n* 修复：在会话结束时关闭 AsyncStream 客户端，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F457 中完成\n* Anam 头像插件，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F445 中完成\n* 更新 README 中的链接，由 @Nash0x7E2 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F462 中完成\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.7...v0.5.0","2026-04-01T17:38:54",{"id":156,"version":157,"summary_zh":158,"released_at":159},94489,"v0.4.7","## 变更内容\n### 错误修复\n* 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F456 中将 Deepgram 版本锁定为 \u003C6.1.0\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.6...v0.4.7","2026-03-27T11:53:07",{"id":161,"version":162,"summary_zh":163,"released_at":164},94490,"v0.4.6","## 变更内容\n### 修复\n* 由 @maxkahan 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F451 中降低 Deepgram 延迟\n\n### 文档\n* 由 @maxkahan 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F450 中添加内容审核演示\n* 由 @Nash0x7E2 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F454 中更新 Gemini 插件的 README\n\n### 依赖更新\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F427 中将 pyasn1 从 0.6.2 升级到 0.6.3\n* 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F452 中修复 Slack 通知中的负载转义问题\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.5...v0.4.6","2026-03-27T11:37:59",{"id":166,"version":167,"summary_zh":168,"released_at":169},94491,"v0.4.5","## 变更内容\n\n\n### 功能\n* 功能：@Jagdeep1 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F415 中添加了对 AWS 配置文件认证的支持。\n* 新增：@dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F447 中添加了显示当前核心版本的启动画面。\n* 优化：@dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F449 中实现了在非交互式终端或传入 `--no-splash` 参数时隐藏启动提示的功能。\n\n### 修复\n* 修复内存泄漏：@aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F444 中停止在事件日志中将 NumPy 数组序列化为字符串。\n* 修复：@aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F437 中实现了 GeminiRealtime 中工具执行的非阻塞行为。\n* 修复：@aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F438 中实现了在 AWS 实时会话中跟踪后台工具任务的功能。\n* 修复：@aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F439 中实现了 OpenAI 和 XAI 实时会话中工具执行的非阻塞行为。\n* 修复：@dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F442 中修复了 Openrouter 函数调用及集成测试问题。\n\n### 其他\n* 增加：@DaemonLoki 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F410 中将 Slack 发布功能添加到了 `run_tests` 操作中。\n* 优化：@d3xvn 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F441 中简化了 README 文档以提高清晰度，并新增了 `ROADMAP.md` 文件。\n* 重构：@aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F443 中将 `_run_tool_in_background` 方法移动到了基础 Realtime 类中。\n\n\n## 新贡献者\n* @Jagdeep1 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F415 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.4...v0.4.5","2026-03-25T14:43:51",{"id":171,"version":172,"summary_zh":173,"released_at":174},94492,"v0.4.4","## 变更内容\n\n### 功能\n* 支持本地设备 由 @Nash0x7E2 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F347 中实现\n\n### Bug修复\n* 修复(aws): 发送转录事件并处理 Nova Sonic 的打断行为 由 @prettyprettyprettygood 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F408 中实现\n* 修复 EventManager.shutdown() 永久挂起的问题 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F421 中实现\n* 使用纪元跟踪跳过在轮次切换时处理过时的 TTS 事件 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F430 中实现\n* 将 `elevenlabs.STT` 切换为使用 VAD 模式，而非手动提交 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F435 中实现\n* 修复 GeminiRealtime 中的事件处理问题 由 @Nash0x7E2 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F436 中实现\n\n### 示例与文档\n* 升级部署示例，采用生产就绪的 Helm Chart 由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F396 中实现\n* 重构：将 07_deploy_example 重命名为 07_k8s_deploy_example 由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F416 中实现\n* 从示例 Grafana 仪表板中移除“轮次持续时间”面板 由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F418 中实现\n* 文档：使 K8s 部署示例与云无关 由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F425 中实现\n* 添加 NVIDIA Nemotron 示例 由 @maxkahan 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F426 中实现\n* 添加使用 OpenRouter 的 Mimo 示例 由 @maxkahan 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F432 中实现\n\n### 依赖项\n* 将最低 GetStream 版本提升至 3.0.1 由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F434 中实现\n\n### 杂项\n* 跳过并修复失败的集成测试 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F413 中实现\n* 移除 PRODUCTION.md 文件 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F424 中实现\n* 向 Inworld TTS 请求添加 X-User-Agent 和 X-Request-Id 头部 由 @ianbbqzy 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F428 中实现\n* 格式化 inworld\u002Ftts.py 文件 由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F431 中实现\n* 将 pyjwt 从 2.11.0 升级至 2.12.0 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F420 中实现\n* 将 pyopenssl 从 25.3.0 升级至 26.0.0 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F423 中实现\n\n## 新贡献者\n* @prettyprettyprettygood 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F408 中完成了首次贡献\n* @ianbbqzy 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F428 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.3...v0.4.4","2026-03-23T12:05:52",{"id":176,"version":177,"summary_zh":178,"released_at":179},94493,"v0.4.3","## 变更内容\n### Bug修复\n* 修复内存泄漏：在代理关闭时清理处理程序任务和闭包，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F407 中完成\n* 修复 StreamEdge 连接的竞态条件，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F412 中完成\n\n### 依赖更新\n* 将 getstream 插件升级至 v3.0.0，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F400 中完成\n* 更新 S2-Pro 的 Fish Audio 库，由 @Nash0x7E2 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F405 中完成\n* 将插件依赖更新至最新兼容版本，由 @cursor[bot] 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F406 中完成\n* chore(deps)：为关键依赖添加主要版本上限，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F404 中完成\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.2...v0.4.3","2026-03-11T15:23:48",{"id":181,"version":182,"summary_zh":183,"released_at":184},94494,"v0.4.2","## 变更内容\n\n### 功能\n* 由 @maxkahan 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F377 中添加了 HF 检测处理器\n* 由 @Nash0x7E2 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F394 中为 AssemblyAI STT 添加了说话人分离支持\n* 由 @maxkahan 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F402 中添加了 Gemini VLM 示例并更新了 Gemini 模型\n\n### Bug 修复\n* 修复：将实时转录内容缓冲为单条聊天消息，由 @d3xvn 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F383 中完成\n* 修复：急于检测发言时导致转录内容顺序错乱的问题，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F401 中完成\n* 修复：将 getstream 固定到 v2 版本以防止升级时出现破坏性变更，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F403 中完成\n* 在初始化时一次性设置 LLMJudge 指令，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F399 中完成\n\n### 示例\n* 添加销售助理示例——实时 AI 会议教练，由 @d3xvn 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F351 中完成\n\n### 杂项任务\n* CC token 使用与测试改进，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F392 中完成\n* CI 构建提速，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F393 中完成\n* 从 TestResponse 中移除死代码，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F398 中完成\n* 统一所有插件的“安装”说明，并为 Anthropic 插件添加 README 文件，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F395 中完成\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.1...v0.4.2","2026-03-10T14:07:26",{"id":186,"version":187,"summary_zh":188,"released_at":189},94495,"v0.4.1","## 变更内容\n* 增加对 AssemblyAI 流式语音转文本的支持，由 @Nash0x7E2 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F389 中实现。\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.4.0...v0.4.1","2026-03-04T20:52:57",{"id":191,"version":192,"summary_zh":193,"released_at":194},94496,"v0.4.0","## 变更内容\n* 由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F373 中实现，强制在 FunctionRegistry 中仅注册异步函数。\n* 由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F376 中移除旧版 mock_tools，改用 mock_functions。\n* EventManager：修复事件处理器指定了返回类型时出现的失败问题，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F381 中完成。\n* 更新 Agent 认证流程，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F380 中完成。\n* 支持基于 Redis 的会话存储，以便在多个节点间运行，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F374 中实现。\n* 添加 py.typed 标记以符合 PEP 561 规范，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F378 中完成。\n* 更新 agent_server_example 的 README 文件，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F385 中完成。\n* 修复在未安装 Redis 时可选的 RedisSessionKVStore 导入问题，由 @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F384 中完成。\n* 文档：在 README 中澄清 Cartesia 的角色（修复 #268），由 @aniruddhaadak80 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F366 中完成。\n* 修复 Agent 指标存储问题，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F387 中完成。\n* 添加 CHANGELOG.md 文件及更新说明，由 @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F388 中完成。\n\n## 新贡献者\n* @aniruddhaadak80 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F366 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.8...v0.4.0","2026-03-03T16:39:45",{"id":196,"version":197,"summary_zh":198,"released_at":199},94497,"v0.3.8","## 变更内容\n* @dangusev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F372 中将 Deepgram 插件更新为使用 SDK v6 版本。\n* @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F364 中添加了代理测试框架。\n\n## 新贡献者\n* @aliev 在 https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F364 中完成了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.7...v0.3.8","2026-02-24T19:45:34",{"id":201,"version":202,"summary_zh":203,"released_at":204},94498,"v0.3.7","## What's Changed\r\n* add new model to qwen example and add new openrouter vlm example by @maxkahan in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F360\r\n* LemonSlice Avatar plugin by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F365 \r\n* Update default GPT-Realtime to 1.5 by @Nash0x7E2 in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F371\r\n* Add huggingface transformers plugin by @maxkahan in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F350  \r\n* Add CLAUDE.md by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F358\r\n* fix anthropic messages not added to conversation history by @maxkahan in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F359\r\n* Pass missing secrets to github action by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F361\r\n* fix message duplication in gemini and nvidia vlm by @maxkahan in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F362\r\n* Fixes for TTS and audio publishing by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F363\r\n\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.6...v0.3.7","2026-02-23T21:42:46",{"id":206,"version":207,"summary_zh":208,"released_at":209},94499,"v0.3.6","## What's Changed\r\n\r\n### Fixes\r\n* Add missing onnxruntime dependency to the core package by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F355\r\n\r\n### Dependencies\r\n* Bump langchain-core from 1.2.6 to 1.2.11 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F354\r\n* Bump cryptography from 46.0.3 to 46.0.5 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F352\r\n* Update cartesia tts plugin to use v3.0.0+ by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F356\r\n\r\n### Chores\r\n* Fix\u002Fintegration test fixes by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F357\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.5...v0.3.6","2026-02-13T18:29:48",{"id":211,"version":212,"summary_zh":213,"released_at":214},94500,"v0.3.5","## What's Changed\r\n\r\n### Features \r\n* Support multiple speakers on the same call by @dangusev in #348 #349  \r\nDocs - https:\u002F\u002Fvisionagents.ai\u002Fguides\u002Fmultiple-speakers\r\n\r\n### Dependencies\r\n* Bump fonttools from 4.60.1 to 4.60.2 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F344\r\n* Bump langsmith from 0.6.1 to 0.6.3 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F346\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.4...v0.3.5","2026-02-10T16:12:29",{"id":216,"version":217,"summary_zh":218,"released_at":219},94501,"v0.3.4","## What's Changed\r\n* added Mistral Voxtral integration on Readme by @brookesanchez-del in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F341\r\n* Gemini 3 vision VLM API by @Nash0x7E2 in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F328\r\n* Decouple vision agents from getstream by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F330\r\n* Remove uv.lock files from examples and add them to .gitignore by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F342\r\n* Bump virtualenv from 20.35.4 to 20.36.1 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F287\r\n* Bump authlib from 1.6.5 to 1.6.6 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F293\r\n* Bump protobuf from 6.33.0 to 6.33.5 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F335\r\n* Bump aiohttp from 3.13.2 to 3.13.3 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F343\r\n* Bump pip from 25.3 to 26.0 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F329\r\n* Bump filelock from 3.20.0 to 3.20.3 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F288\r\n* Bump python-multipart from 0.0.21 to 0.0.22 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F317\r\n* Bump marshmallow from 3.26.1 to 3.26.2 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F345\r\n\r\n## New Contributors\r\n* @brookesanchez-del made their first contribution in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F341\r\n* @dependabot[bot] made their first contribution in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F287\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.3...v0.3.4","2026-02-06T18:53:56",{"id":221,"version":222,"summary_zh":223,"released_at":224},94502,"v0.3.3","## What's Changed\r\n* Fix twilio plugin build by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F318\r\n* feat: add custom events and metrics broadcasting via Stream Video by @d3xvn in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F325\r\n* Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F324\r\n* Add mistral by @maxkahan in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F334\r\n\r\n## New Contributors\r\n* @salmanmkc made their first contribution in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F324\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.2...v0.3.3","2026-02-05T14:50:53",{"id":226,"version":227,"summary_zh":228,"released_at":229},94503,"v0.3.2","## What's Changed\r\n\r\n### Features \r\n* Add limits to AgentLauncher by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F302\r\n* XAI Realtime model support  by @tschellenbach in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F296\r\n\r\n### Bugfixes\r\n* Fix Agent warnings by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F310\r\n* fix: SDK usage statistics tracking for vision-agents by @tjirab in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F304\r\n\r\n### Docs & Examples\r\n* Add Hugging Face integration to README by @Wauplin in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F305\r\n* feat: add Grafana dashboard to prometheus metrics example by @d3xvn in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F295\r\n* fix: prometheus metrics example documentation by @d3xvn in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F311\r\n\r\n\r\n## New Contributors\r\n* @Wauplin made their first contribution in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F305\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.3.1...v0.3.2","2026-01-28T23:29:04",{"id":231,"version":232,"summary_zh":233,"released_at":234},94504,"v0.3.0","## What's Changed\r\n* Release blog post: https:\u002F\u002Fgetstream.io\u002Fblog\u002Fvision-agents-v0-3\u002F\r\n\r\n* Suppress errors on `Agent.join` if the agent is closed or closing by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F291\r\n* Spring cleaning jan 15 by @tschellenbach in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F292\r\n* Security camera example by @Nash0x7E2 in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F247\r\n* Agent HTTP server by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F284\r\n* Update urllib to 2.6.3 by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F297\r\n* Bump mcp version to >=1.23.3 by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F298\r\n* Various fixes by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F299\r\n* Remove print by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F300\r\n* Migrate examples to the new Runner API by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F301\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.2.10...v0.3.0","2026-01-20T16:29:07",{"id":236,"version":237,"summary_zh":238,"released_at":239},94505,"v0.2.10","## What's Changed\r\n* fix: stop sending video frames to realtime LLMs and stop processors when participant leaves by @d3xvn in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F283\r\n* Prod prep by @tschellenbach in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F285\r\n* Fix SIP example runner  by @Nash0x7E2 in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F286\r\n* Support for Pocket TTS  by @Nash0x7E2 in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F289\r\n* feat: added metrics and example by @d3xvn in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F278\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.2.9...v0.2.10","2026-01-14T23:48:49",{"id":241,"version":242,"summary_zh":243,"released_at":244},94506,"v0.2.9","## What's Changed\r\n* Fix mypy by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F281\r\n* Add support for Cosmos 2 VLM  by @Nash0x7E2 in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F282\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.2.8...v0.2.9","2026-01-09T20:50:01",{"id":246,"version":247,"summary_zh":248,"released_at":249},94507,"v0.2.8","Hotfix for the regression in Deepgram 5.3.1\r\n\r\n- from deepgram.extensions.types.sockets import ListenV2ControlMessage was removed\r\n- the new from deepgram.listen.v2.types import ListenV2CloseStream is available in >5.3.1\r\n\r\n## What's Changed\r\n* Close idle agents after some timeout & Agent clean up by @dangusev in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F279\r\n* fix(plugins:getstream): Support setting user avatars (#233) by @m0reA1 in https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fpull\u002F234\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FGetStream\u002FVision-Agents\u002Fcompare\u002Fv0.2.7...v0.2.8","2026-01-08T20:12:36"]