[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ray-project--kuberay":3,"tool-ray-project--kuberay":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":118,"forks":119,"last_commit_at":120,"license":121,"difficulty_score":122,"env_os":123,"env_gpu":124,"env_ram":125,"env_deps":126,"category_tags":137,"github_topics":138,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":144,"updated_at":145,"faqs":146,"releases":172},3277,"ray-project\u002Fkuberay","kuberay","A toolkit to run Ray applications on Kubernetes","KubeRay 是一款强大的开源 Kubernetes 算子，旨在简化 Ray 应用在 Kubernetes 集群上的部署与管理。对于希望利用分布式计算进行机器学习训练、大模型推理或批量数据处理的用户来说，直接在 Kubernetes 上手动配置 Ray 集群往往复杂且容易出错，而 KubeRay 正是为了解决这一痛点而生。\n\n它主要面向开发者、数据科学家及研究人员，让他们能更专注于算法与业务逻辑，而非底层基础设施的运维。KubeRay 的核心亮点在于提供了三种自定义资源：RayCluster 负责集群的全生命周期管理与自动扩缩容；RayJob 可实现任务提交自动化，并在结束后清理资源以节省成本；RayService 则支持零停机升级和高可用部署，特别适合对服务连续性要求严格的在线推理场景。此外，KubeRay 还拥有丰富的生态工具，如简化的命令行插件和实验性可视化仪表盘，并能无缝集成 Prometheus、Grafana 等主流监控及调度系统。无论是构建大规模模型训练平台，还是部署高并发 AI 服务，KubeRay 都能提供稳定、高效的支撑。","\u003C!-- markdownlint-disable MD013 -->\n# KubeRay\n\n[![Build Status](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fworkflows\u002FGo-build-and-test\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Factions)\n[![Release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fray-project\u002Fkuberay)](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Freleases)\n[![Go Report Card](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fray-project_kuberay_readme_c1743a0ee95f.png)](https:\u002F\u002Fgoreportcard.com\u002Freport\u002Fgithub.com\u002Fray-project\u002Fkuberay)\n\nKubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray) applications on Kubernetes. It offers several key components:\n\n**KubeRay core**: This is the official, fully-maintained component of KubeRay that provides three custom resource definitions, RayCluster, RayJob, and RayService. 
These resources are designed to help you run a wide range of workloads with ease.\n\n* **RayCluster**: KubeRay fully manages the lifecycle of RayCluster, including cluster creation\u002Fdeletion, autoscaling, and ensuring fault tolerance.\n\n* **RayJob**: With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the job finishes.\n\n* **RayService**: RayService is made up of two parts: a RayCluster and a Ray Serve deployment graph. RayService offers zero-downtime upgrades for RayCluster and high availability.\n\n**KubeRay ecosystem**: Some optional components.\n\n* **Kubectl Plugin** (Beta): Starting from KubeRay v1.3.0, you can use the `kubectl ray` plugin to simplify\ncommon workflows when deploying Ray on Kubernetes. If you aren’t familiar with Kubernetes, this\nplugin simplifies running Ray on Kubernetes. See [kubectl-plugin](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkubectl-plugin.html#kubectl-plugin) for more details.\n\n* **KubeRay APIServer** (Alpha): It provides a layer of simplified configuration for KubeRay resources. The KubeRay API server is used internally\nby some organizations to back user interfaces for KubeRay resource management. See [KubeRay APIServer V2](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fblob\u002Fmaster\u002Fapiserversdk\u002FREADME.md) for more details.\n\n* **KubeRay Dashboard** (Experimental): Starting from KubeRay v1.4.0, we have introduced a new dashboard that enables users to view and manage KubeRay resources.\nWhile it is not yet production-ready, we welcome your feedback. See [KubeRay Dashboard](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkuberay-dashboard.html) for more details.\n\n## Documentation\n\nFrom September 2023, all user-facing KubeRay documentation will be hosted on the [Ray documentation](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Findex.html).\nThe KubeRay repository only contains documentation related to the development and maintenance of KubeRay.\n\n## Quick Start\n\n* [RayCluster Quickstart](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Fraycluster-quick-start.html)\n* [RayJob Quickstart](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Frayjob-quick-start.html)\n* [RayService Quickstart](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Frayservice-quick-start.html)\n\n## Examples\n\nKubeRay examples are hosted on the [Ray documentation](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Fexamples.html).\nExamples span a wide range of use cases, including training, LLM online inference, batch inference, and more.\n\n## Kubernetes Ecosystem\n\nKubeRay integrates with the Kubernetes ecosystem, including observability tools (e.g., Prometheus, Grafana, py-spy), queuing systems (e.g., Volcano, Apache YuniKorn, Kueue), ingress controllers (e.g., Nginx), and more.\nSee [KubeRay Ecosystem](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Fk8s-ecosystem.html) for more details.\n\n## Blog Posts\n\n* [Scaling Ray to 10K Models and 
Beyond](https:\u002F\u002Fmedium.com\u002Fworkday-engineering\u002Fscaling-ray-to-10k-models-and-beyond-92799b4c9fc3) Workday\n* [How Klaviyo built a robust model serving platform with Ray Serve](https:\u002F\u002Fklaviyo.tech\u002Fhow-klaviyo-built-a-robust-model-serving-platform-with-ray-serve-c02ec65788b3) Klaviyo\n* [Evolving Niantic AR Mapping Infrastructures with Ray](https:\u002F\u002Fnianticlabs.com\u002Fnews\u002Fray) Niantic\n* [Building a Modern Machine Learning Platform with Ray at Samsara](https:\u002F\u002Fwww.samsara.com\u002Fblog\u002Fbuilding-a-modern-machine-learning-platform-with-ray) Samsara\n* [Using Ray on Kubernetes with KubeRay at Google Cloud](https:\u002F\u002Fcloud.google.com\u002Fblog\u002Fproducts\u002Fcontainers-kubernetes\u002Fuse-ray-on-kubernetes-with-kuberay) Google\n* [How DoorDash Built an Ensemble Learning Model for Time Series Forecasting with KubeRay](https:\u002F\u002Fdoordash.engineering\u002F2023\u002F06\u002F20\u002Fhow-doordash-built-an-ensemble-learning-model-for-time-series-forecasting\u002F) Doordash\n* [AI\u002FML Models Batch Training at Scale with Open Data Hub](https:\u002F\u002Fcloud.redhat.com\u002Fblog\u002Fai\u002Fml-models-batch-training-at-scale-with-open-data-hub) Red Hat\n* [Distributed Machine Learning at Instacart](https:\u002F\u002Ftech.instacart.com\u002Fdistributed-machine-learning-at-instacart-4b11d7569423) Instacart\n* [Unleashing ML Innovation at Spotify with Ray](https:\u002F\u002Fengineering.atspotify.com\u002F2023\u002F02\u002Funleashing-ml-innovation-at-spotify-with-ray\u002F) Spotify\n* [Best Practices For Ray Cluster On ACK](https:\u002F\u002Fwww.alibabacloud.com\u002Fblog\u002Fbest-practices-for-ray-clusters---ray-on-ack_600925) Alibaba Cloud\n\n## Talks\n\n* [Advanced Model Serving Techniques with Ray on Kubernetes | KubeCon 2024 NA](https:\u002F\u002Fyoutu.be\u002FmASxYpfWUNU?si=iCuXakrP7ORAg37z) Anyscale + Google\n* [Building Scalable AI Infrastructure with Kuberay and Kubernetes | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FbbKpBTGf_AU?si=BkdCL7FGOde71t_P) Anyscale + Google\n* [Ray at Scale: Apple's Approach to Elastic GPU Management | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FZCRZQVt-r3g?si=1Gxkpy8CNVVDDBP0) Apple\n* [Scaling Ray Train to 10K Kubernetes Nodes on GKE | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F9S5WznGnIpE?si=O6Rqpor9QmAvdv6u) Google\n* [KubeSecRay: Fortifying Multi-Tenant Ray Clusters on Kubernetes | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FY-kLmZ3nklQ?si=N9FIc5Nk_rWwKBRp) Microsoft\n* [Scaling LLM Inference: AWS Inferentia Meets Ray Serve on EKS | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F6rNfYlm6s1k?si=WZeXZXrMDtRbbVKO) AWS\n* [How Roblox Scaled Machine Learning by Leveraging Ray for Efficient Batch Inference | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FBN1CVDZjQRE?si=9pN9A3bReSL26Pc-) Roblox\n* [Airbnb's LLM Evolution: Fine-Tuning with Ray | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FjYQ9ry8uXY0?si=3P56QNo8Qwovv4Vf) Airbnb\n* [Ray @ eBay: Pioneering a Next-Gen AI Platform | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F5KuTdRq9Zto?si=8m485B1411ixfdlx) eBay\n* [Spotify Harnesses Ray for Next-Gen AI Infrastructure | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F4kw3EYBz1Gs?si=PswsNR88xe6Mxuas) Spotify\n* [Spotify's Approach to Distributed LLM Training with Ray on GKE | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F2l1lVBdmNIQ?si=PwCeZD1-XajPNLam) Spotify\n* [Reddit's ML Evolution: Scaling with Ray and KubeRay | Ray Summit 
2024](https:\u002F\u002Fyoutu.be\u002FXwrGk0SM6ls?si=xNMQo548lOonKLiK) Reddit\n* [IBM's Approach to Building a Cloud-Native AI Platform | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FQ27JFtLE6b4?si=QQhVMZyBRelkLC13) IBM\n* [Exploring Hinge's ML Platform Evolution with Ray | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F_nsTcYtfnXU?si=dKNasWOxiTRJgyvj) Hinge\n* [How Rubrik Unlocked AI at Scale with Ray Serve | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FMd5vww4ardo?si=leiuvNkDy2fKeK8r) Rubrik\n* [Supercharge Your AI Platform with KubeRay | KubeCon 2023 NA](https:\u002F\u002Fyoutu.be\u002FDgfJR6wR4BQ?si=QuK3j7VEkteSwglA) Anyscale + Google\n* [Sailing Ray Workloads with KubeRay and Kueue in Kubernetes](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Q-sQLDMeJ8M) Volcano + DaoCloud\n* [Serving Large Language Models with KubeRay on TPUs](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=RK_u6cfPnnw) Google\n\n## Development\n\nPlease read our [CONTRIBUTING](CONTRIBUTING.md) guide before making a pull request. Refer to our [DEVELOPMENT](.\u002Fray-operator\u002FDEVELOPMENT.md) to build and run tests locally.\n\n## Getting Involved\n\nJoin [Ray's Slack workspace](https:\u002F\u002Fdocs.google.com\u002Fforms\u002Fd\u002Fe\u002F1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig\u002Fviewform), and search the following public channels:\n\n* `#kuberay`: This channel aims to help KubeRay users with their questions. The messages will be closely monitored by the Ray and KubeRay maintainers.\n\nKubeRay contributors are welcome to join the bi-weekly KubeRay community meetings.\n\n* Add the [Ray\u002FKubeRay Google calendar](https:\u002F\u002Fcalendar.google.com\u002Fcalendar\u002Fu\u002F1?cid=Y19iZWIwYTUxZDQyZTczMTFmZWFmYTY5YjZiOTY1NjAxMTQ3ZTEzOTAxZWE0ZGU5YzA1NjFlZWQ5OTljY2FiOWM4QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20) to your calendar.\n\n## Security\n\nIf you discover a potential security issue in this project, or think you may\nhave discovered a security issue, we ask that you notify KubeRay Security via our\n[Slack Channel](https:\u002F\u002Fray-distributed.slack.com\u002Farchives\u002FC02GFQ82JPM).\nPlease do **not** create a public GitHub issue.\n\n## License\n\nThis project is licensed under the [Apache-2.0 License](LICENSE).\n","\u003C!-- markdownlint-disable MD013 -->\n# KubeRay\n\n[![构建状态](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fworkflows\u002FGo-build-and-test\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Factions)\n[![发布版本](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fray-project\u002Fkuberay)](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Freleases)\n[![Go Report Card](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fray-project_kuberay_readme_c1743a0ee95f.png)](https:\u002F\u002Fgoreportcard.com\u002Freport\u002Fgithub.com\u002Fray-project\u002Fkuberay)\n\nKubeRay 是一个功能强大的开源 Kubernetes 运算符，可简化在 Kubernetes 上部署和管理 [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray) 应用程序的过程。它提供多个关键组件：\n\n**KubeRay 核心**：这是 KubeRay 的官方、完全维护的组件，提供了三种自定义资源定义：RayCluster、RayJob 和 RayService。这些资源旨在帮助您轻松运行各种工作负载。\n\n* **RayCluster**：KubeRay 完全管理 RayCluster 的生命周期，包括集群的创建与删除、自动扩缩容以及确保容错性。\n\n* **RayJob**：借助 RayJob，KubeRay 会在集群准备就绪时自动创建 RayCluster 并提交作业。您还可以配置 RayJob，使 RayCluster 在作业完成后自动删除。\n\n* **RayService**：RayService 由两部分组成：一个 RayCluster 和一个 Ray Serve 部署图。RayService 提供 RayCluster 的零停机升级和高可用性。\n\n**KubeRay 生态系统**：一些可选组件。\n\n* **kubectl 插件**（测试版）：从 KubeRay v1.3.0 开始，您可以使用 `kubectl ray` 插件来简化在 
Kubernetes 上部署 Ray 的常见工作流程。如果您不熟悉 Kubernetes，此插件可以大大简化在 Kubernetes 上运行 Ray 的过程。更多详情请参阅 [kubectl-plugin](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkubectl-plugin.html#kubectl-plugin)。\n\n* **KubeRay API 服务器**（Alpha 版）：它为 KubeRay 资源提供了一层简化的配置接口。KubeRay API 服务器目前被一些组织内部用于支持 KubeRay 资源管理的用户界面。更多详情请参阅 [KubeRay APIServer V2](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fblob\u002Fmaster\u002Fapiserversdk\u002FREADME.md)。\n\n* **KubeRay 仪表板**（实验版）：从 KubeRay v1.4.0 开始，我们推出了一款新的仪表板，使用户能够查看和管理 KubeRay 资源。尽管目前尚未达到生产就绪状态，但我们欢迎您的反馈。更多详情请参阅 [KubeRay Dashboard](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkuberay-dashboard.html)。\n\n## 文档\n\n自 2023 年 9 月起，所有面向用户的 KubeRay 文档都将托管在 [Ray 文档](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Findex.html)上。KubeRay 仓库仅包含与 KubeRay 开发和维护相关的文档。\n\n## 快速入门\n\n* [RayCluster 快速入门](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Fraycluster-quick-start.html)\n* [RayJob 快速入门](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Frayjob-quick-start.html)\n* [RayService 快速入门](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Frayservice-quick-start.html)\n\n## 示例\n\nKubeRay 示例托管在 [Ray 文档](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Fexamples.html)上。示例涵盖了广泛的使用场景，包括模型训练、LLM 在线推理、批量推理等。\n\n## Kubernetes 生态系统\n\nKubeRay 与 Kubernetes 生态系统无缝集成，包括可观测性工具（如 Prometheus、Grafana、py-spy）、队列管理系统（如 Volcano、Apache YuniKorn、Kueue）、入口控制器（如 Nginx）等。更多详情请参阅 [KubeRay 生态系统](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Fk8s-ecosystem.html)。\n\n## 博客文章\n\n* [将 Ray 扩展至 1 万个模型及以上](https:\u002F\u002Fmedium.com\u002Fworkday-engineering\u002Fscaling-ray-to-10k-models-and-beyond-92799b4c9fc3) Workday\n* [Klaviyo 如何利用 Ray Serve 构建稳健的模型服务平台](https:\u002F\u002Fklaviyo.tech\u002Fhow-klaviyo-built-a-robust-model-serving-platform-with-ray-serve-c02ec65788b3) Klaviyo\n* [Niantic 如何利用 Ray 演进 AR 地图基础设施](https:\u002F\u002Fnianticlabs.com\u002Fnews\u002Fray) Niantic\n* [Samsara 如何利用 Ray 构建现代化机器学习平台](https:\u002F\u002Fwww.samsara.com\u002Fblog\u002Fbuilding-a-modern-machine-learning-platform-with-ray) Samsara\n* [Google Cloud 如何通过 KubeRay 在 Kubernetes 上使用 Ray](https:\u002F\u002Fcloud.google.com\u002Fblog\u002Fproducts\u002Fcontainers-kubernetes\u002Fuse-ray-on-kubernetes-with-kuberay) Google\n* [DoorDash 如何利用 KubeRay 构建时间序列预测的集成学习模型](https:\u002F\u002Fdoordash.engineering\u002F2023\u002F06\u002F20\u002Fhow-doordash-built-an-ensemble-learning-model-for-time-series-forecasting\u002F) DoorDash\n* [Red Hat 利用 Open Data Hub 大规模进行 AI\u002FML 模型批量训练](https:\u002F\u002Fcloud.redhat.com\u002Fblog\u002Fai\u002Fml-models-batch-training-at-scale-with-open-data-hub) Red Hat\n* [Instacart 如何实现分布式机器学习](https:\u002F\u002Ftech.instacart.com\u002Fdistributed-machine-learning-at-instacart-4b11d7569423) Instacart\n* [Spotify 如何利用 Ray 解放 ML 创新潜力](https:\u002F\u002Fengineering.atspotify.com\u002F2023\u002F02\u002Funleashing-ml-innovation-at-spotify-with-ray\u002F) Spotify\n* [ACK 上 Ray 集群的最佳实践](https:\u002F\u002Fwww.alibabacloud.com\u002Fblog\u002Fbest-practices-for-ray-clusters---ray-on-ack_600925) Alibaba Cloud\n\n## 演讲\n\n* [在 Kubernetes 上使用 Ray 的高级模型推理技术 | KubeCon 2024 
北美](https:\u002F\u002Fyoutu.be\u002FmASxYpfWUNU?si=iCuXakrP7ORAg37z) Anyscale + Google\n* [使用 Kuberay 和 Kubernetes 构建可扩展的 AI 基础设施 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FbbKpBTGf_AU?si=BkdCL7FGOde71t_P) Anyscale + Google\n* [大规模 Ray：苹果公司弹性 GPU 管理的方法 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FZCRZQVt-r3g?si=1Gxkpy8CNVVDDBP0) Apple\n* [在 GKE 上将 Ray Train 扩展到 1 万个 Kubernetes 节点 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F9S5WznGnIpE?si=O6Rqpor9QmAvdv6u) Google\n* [KubeSecRay：强化 Kubernetes 上的多租户 Ray 集群 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FY-kLmZ3nklQ?si=N9FIc5Nk_rWwKBRp) Microsoft\n* [LLM 推理的规模化：AWS Inferentia 搭配 EKS 上的 Ray Serve | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F6rNfYlm6s1k?si=WZeXZXrMDtRbbVKO) AWS\n* [Roblox 如何利用 Ray 实现高效的批量推理，从而扩展机器学习规模 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FBN1CVDZjQRE?si=9pN9A3bReSL26Pc-) Roblox\n* [Airbnb 的 LLM 发展历程：使用 Ray 进行微调 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FjYQ9ry8uXY0?si=3P56QNo8Qwovv4Vf) Airbnb\n* [eBay 的 Ray：开创下一代 AI 平台 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F5KuTdRq9Zto?si=8m485B1411ixfdlx) eBay\n* [Spotify 利用 Ray 打造下一代 AI 基础设施 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F4kw3EYBz1Gs?si=PswsNR88xe6Mxuas) Spotify\n* [Spotify 在 GKE 上使用 Ray 进行分布式 LLM 训练的方法 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F2l1lVBdmNIQ?si=PwCeZD1-XajPNLam) Spotify\n* [Reddit 的机器学习演进：借助 Ray 和 KubeRay 扩展规模 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FXwrGk0SM6ls?si=xNMQo548lOonKLiK) Reddit\n* [IBM 构建云原生 AI 平台的方法 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FQ27JFtLE6b4?si=QQhVMZyBRelkLC13) IBM\n* [探索 Hinge 的 ML 平台如何借助 Ray 不断演进 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002F_nsTcYtfnXU?si=dKNasWOxiTRJgyvj) Hinge\n* [Rubrik 如何通过 Ray Serve 实现大规模 AI 应用 | Ray Summit 2024](https:\u002F\u002Fyoutu.be\u002FMd5vww4ardo?si=leiuvNkDy2fKeK8r) Rubrik\n* [使用 KubeRay 加速你的 AI 平台 | KubeCon 2023 北美](https:\u002F\u002Fyoutu.be\u002FDgfJR6wR4BQ?si=QuK3j7VEkteSwglA) Anyscale + Google\n* [在 Kubernetes 中使用 KubeRay 和 Kueue 管理 Ray 工作负载](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Q-sQLDMeJ8M) Volcano + DaoCloud\n* [在 TPU 上使用 KubeRay 提供大型语言模型服务](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=RK_u6cfPnnw) Google\n\n## 开发\n\n在提交拉取请求之前，请阅读我们的 [CONTRIBUTING](CONTRIBUTING.md) 指南。请参考我们的 [DEVELOPMENT](.\u002Fray-operator\u002FDEVELOPMENT.md) 文档，以在本地构建并运行测试。\n\n## 参与方式\n\n加入 [Ray 的 Slack 工作区](https:\u002F\u002Fdocs.google.com\u002Fforms\u002Fd\u002Fe\u002F1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig\u002Fviewform)，并在以下公共频道中查找相关信息：\n\n* `#kuberay`：此频道旨在帮助 KubeRay 用户解答问题。Ray 和 KubeRay 的维护者会密切关注该频道的消息。\n  \n欢迎 KubeRay 的贡献者参加每两周一次的 KubeRay 社区会议。\n\n* 将 [Ray\u002FKubeRay Google 日历](https:\u002F\u002Fcalendar.google.com\u002Fcalendar\u002Fu\u002F1?cid=Y19iZWIwYTUxZDQyZTczMTFmZWFmYTY5YjZiOTY1NjAxMTQ3ZTEzOTAxZWE0ZGU5YzA1NjFlZWQ5OTljY2FiOWM4QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20) 添加到你的日历中。\n\n## 安全性\n\n如果您在此项目中发现潜在的安全漏洞，或认为自己可能发现了安全问题，请通过我们的 [Slack 频道](https:\u002F\u002Fray-distributed.slack.com\u002Farchives\u002FC02GFQ82JPM) 向 KubeRay 安全团队报告。请**不要**创建公开的 GitHub 问题。\n\n## 许可证\n\n本项目采用 [Apache-2.0 许可证](LICENSE) 许可。","# KubeRay 快速上手指南\n\nKubeRay 是一个强大的开源 Kubernetes Operator，旨在简化 Ray 应用在 Kubernetes 上的部署与管理。它支持自动扩缩容、容错处理以及零停机升级，是运行大规模机器学习（如 LLM 训练与推理）的理想选择。\n\n## 环境准备\n\n在开始之前，请确保满足以下系统要求和前置依赖：\n\n*   **Kubernetes 集群**：版本需为 **1.23+**。\n*   **kubectl**：已安装并配置好与集群的连接。\n*   **Helm**：推荐用于安装 KubeRay（版本 **3.10+**）。\n    *   *国内加速建议*：如果访问官方 Helm 源较慢，可配置国内镜像源（如阿里云）：\n      ```bash\n      helm repo add kuberay 
https:\u002F\u002Fray-project.github.io\u002Fkuberay-helm\u002F\n      # 若需使用国内镜像加速拉取 Operator 镜像，可在 values.yaml 中指定 image repository 为阿里云地址\n      ```\n*   **权限**：拥有在目标命名空间创建 Custom Resource Definitions (CRD) 和部署应用的权限。\n\n## 安装步骤\n\n推荐使用 Helm 进行安装，以下是标准安装流程：\n\n1.  **添加 KubeRay Helm 仓库**\n    ```bash\n    helm repo add kuberay https:\u002F\u002Fray-project.github.io\u002Fkuberay-helm\u002F\n    helm repo update\n    ```\n\n2.  **安装 KubeRay Operator**\n    此命令将在 `default` 命名空间部署 KubeRay Operator。\n    ```bash\n    helm install kuberay-operator kuberay\u002Fkuberay-operator --version 1.2.0\n    ```\n    > **注意**：请将 `--version` 替换为最新的稳定版本号。安装完成后，Operator 会自动注册 `RayCluster`、`RayJob` 和 `RayService` 三种自定义资源。\n\n3.  **验证安装**\n    确认 Pod 处于运行状态：\n    ```bash\n    kubectl get pods\n    # 应看到类似 kuberay-operator-xxxxx 的 Pod 状态为 Running\n    ```\n\n## 基本使用\n\n安装完成后，你可以通过定义 YAML 文件来部署 Ray 应用。以下是最简单的 **RayCluster** 使用示例。\n\n### 1. 创建 RayCluster\n\n创建一个名为 `sample-cluster.yaml` 的文件，内容如下：\n\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayCluster\nmetadata:\n  name: sample-cluster\nspec:\n  rayVersion: '2.9.0' # 指定 Ray 版本\n  headGroupSpec:\n    rayStartParams:\n      dashboard-host: '0.0.0.0'\n    template:\n      spec:\n        containers:\n        - name: ray-head\n          image: rayproject\u002Fray:2.9.0\n          ports:\n          - containerPort: 6379\n            name: gcs-server\n          - containerPort: 8265\n            name: dashboard\n          - containerPort: 10001\n            name: client\n          resources:\n            limits:\n              cpu: \"1\"\n              memory: \"2Gi\"\n            requests:\n              cpu: \"1\"\n              memory: \"2Gi\"\n  workerGroupSpecs:\n  - replicas: 1\n    minReplicas: 1\n    maxReplicas: 5\n    groupName: small-group\n    rayStartParams: {}\n    template:\n      spec:\n        containers:\n        - name: ray-worker\n          image: rayproject\u002Fray:2.9.0\n          resources:\n            limits:\n              cpu: \"1\"\n              memory: \"2Gi\"\n            requests:\n              cpu: \"1\"\n              memory: \"2Gi\"\n```\n\n应用该配置：\n```bash\nkubectl apply -f sample-cluster.yaml\n```\n\n### 2. 验证与连接\n\n查看集群状态：\n```bash\nkubectl get raycluster\n# 输出应显示 sample-cluster 状态为 ready\n```\n\n获取 Head 节点的服务信息并进行端口转发，以便本地访问 Ray Dashboard：\n```bash\nkubectl port-forward svc\u002Fsample-cluster-head-svc 8265:8265\n```\n随后在浏览器访问 `http:\u002F\u002Flocalhost:8265` 即可看到 Ray 监控面板。\n\n### 3. 
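验证任务执行（可选）\n\n在保持上述端口转发开启的情况下，可以用 Ray Jobs CLI 向集群提交一个简单任务，确认集群能够真正执行代码。以下命令仅为示意，假设本地已通过 \`pip install \"ray[default]\"\` 安装 Ray CLI，且地址与上文端口转发一致：\n\n\`\`\`bash\n# 通过本地转发的 Dashboard 地址提交一个打印集群资源的小任务（示意）\nray job submit --address http:\u002F\u002Flocalhost:8265 -- python -c \"import ray; ray.init(); print(ray.cluster_resources())\"\n\`\`\`\n\n若输出包含集群的 CPU 与内存资源信息，说明集群已可正常执行任务。\n\n### 4. 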
提交任务 (可选)\n\n如果你需要运行一次性任务，可以使用 `RayJob` 资源。创建一个 `sample-job.yaml`：\n\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayJob\nmetadata:\n  name: sample-job\nspec:\n  submissionMode: K8sJobMode # 默认值；也可选 HTTPMode 或 InteractiveMode\n  entrypoint: python \u002Fhome\u002Fray\u002Fsamples\u002Fsample_code.py\n  rayClusterSpec:\n    # 此处可复用上述 RayCluster 的配置，或者引用现有的集群\n    # 为简洁起见，实际使用时通常在此处完整定义集群规格或引用已有集群\n    rayVersion: '2.9.0'\n    headGroupSpec:\n      rayStartParams: {dashboard-host: '0.0.0.0'}\n      template:\n        spec:\n          containers:\n          - name: ray-head\n            image: rayproject\u002Fray:2.9.0\n            resources:\n              limits:\n                cpu: \"1\"\n                memory: \"2Gi\"\n              requests:\n                cpu: \"1\"\n                memory: \"2Gi\"\n    workerGroupSpecs:\n    - replicas: 1\n      groupName: small-group\n      rayStartParams: {}\n      template:\n        spec:\n          containers:\n          - name: ray-worker\n            image: rayproject\u002Fray:2.9.0\n            resources:\n              limits:\n                cpu: \"1\"\n                memory: \"2Gi\"\n              requests:\n                cpu: \"1\"\n                memory: \"2Gi\"\n```\n\n应用任务：\n```bash\nkubectl apply -f sample-job.yaml\n```\nKubeRay 将自动创建集群、提交代码，并在任务完成后根据配置决定是否销毁集群。","某金融科技公司数据团队需要在 Kubernetes 集群上部署大规模实时反欺诈模型，该模型基于 Ray 构建，需应对早晚高峰流量的剧烈波动。\n\n### 没有 kuberay 时\n- **运维极其繁琐**：工程师必须手动编写复杂的 Kubernetes YAML 文件来编排 Ray Head 和 Worker 节点，每次调整资源都容易出错且耗时。\n- **弹性响应滞后**：面对突发流量，缺乏自动扩缩容机制，人工介入扩容速度慢，导致交易请求排队甚至超时失败。\n- **更新伴随中断**：发布新模型版本时，需要手动销毁旧集群再创建新集群，服务不可避免地出现分钟级中断，影响用户体验。\n- **任务生命周期难管**：批量离线计算任务结束后，残留的 Ray 集群无法自动清理，长期占用昂贵算力资源，造成成本浪费。\n\n### 使用 kuberay 后\n- **部署一键化**：通过定义简单的 `RayCluster` 或 `RayJob` 自定义资源，kuberay 自动处理底层所有 Pod 的创建、配置与网络连接，大幅降低门槛。\n- **智能自动伸缩**：利用内置的 Autoscaler，kuberay 能根据队列负载秒级动态增减 Worker 节点，轻松扛住早晚高峰流量冲击。\n- **零停机升级**：借助 `RayService` 组件，kuberay 支持蓝绿部署策略，在后台无缝切换新版本模型，确保线上服务全天候高可用。\n- **资源自动回收**：配置 `RayJob` 后，任务一旦运行完成，kuberay 会自动销毁关联集群，彻底杜绝资源闲置，显著优化云成本。\n\nkuberay 将原本需要资深运维专家数天才能稳定的分布式系统，转化为开发人员可自助管理的标准化服务，真正实现了 AI 应用在云原生环境的高效落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fray-project_kuberay_689bbfac.png","ray-project","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fray-project_341db8ae.png","",null,"raydistributed","https:\u002F\u002Fdocs.ray.io","https:\u002F\u002Fgithub.com\u002Fray-project",[83,87,91,94,98,102,106,109,112,115],{"name":84,"color":85,"percentage":86},"Go","#00ADD8",87.8,{"name":88,"color":89,"percentage":90},"Python","#3572A5",7.2,{"name":92,"color":93,"percentage":10},"TypeScript","#3178c6",{"name":95,"color":96,"percentage":97},"Makefile","#427819",1,{"name":99,"color":100,"percentage":101},"Shell","#89e051",0.4,{"name":103,"color":104,"percentage":105},"Mustache","#724b3b",0.2,{"name":107,"color":108,"percentage":105},"Dockerfile","#384d54",{"name":110,"color":85,"percentage":111},"Go Template",0,{"name":113,"color":114,"percentage":111},"Awk","#c30e9b",{"name":116,"color":117,"percentage":111},"JavaScript","#f1e05a",2415,733,"2026-04-03T22:19:48","Apache-2.0",4,"Linux, macOS","未说明（取决于运行的 Ray 工作负载需求，KubeRay 本身作为 Kubernetes Operator 不直接依赖特定 GPU）","未说明（取决于 Kubernetes 集群节点配置及运行的 Ray 工作负载需求）",{"notes":127,"python":128,"dependencies":129},"KubeRay 是一个 Kubernetes Operator，需部署在现有的 Kubernetes 集群中。核心组件包括 RayCluster、RayJob 和 RayService。从 v1.3.0 起提供 kubectl 插件简化操作。文档已迁移至 Ray 官方文档站点。具体资源需求（CPU\u002FGPU\u002F内存）完全取决于用户部署的 Ray 应用类型（如训练、推理等）。","未说明（作为 Kubernetes Operator 运行，主要依赖 Go 语言环境；Ray 工作负载的 Python 
版本由用户定义的镜像决定）",[130,131,132,133,134,135,136],"Kubernetes","kubectl","Go (用于构建)","Prometheus (可选，用于监控)","Grafana (可选，用于监控)","Volcano\u002FKueue\u002FApache YuniKorn (可选，用于队列调度)","Nginx Ingress (可选，用于入口控制)",[13,53],[139,140,141,142,143],"machine-learning","kubernetes","apache","ray","deep-learning","2026-03-27T02:49:30.150509","2026-04-06T11:30:48.866596",[147,152,157,162,167],{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},15046,"KubeRay Operator 频繁重启并报错\"leader election lost\"（领导选举丢失）怎么办？","这通常是由于 Kubernetes API Server 对请求进行了限流（throttling），导致 Operator 无法及时更新租约。解决方案包括：\n1. 检查 API Server 的监控指标，特别是包含 `flowcontrol` 的指标，确认是否存在限流。\n2. 查看控制平面日志（Control Plane Logs）以确认 API 请求是否被拒绝。\n3. 如果是 GKE 集群，注意默认的每分钟 3000 次请求配额限制，可能需要优化 Operator 的请求频率或升级集群配置。\n4. 有用户反馈将集群从区域级（zonal）迁移到多区域（regional）集群后解决了该问题。","https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fissues\u002F2252",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},15047,"当 RayCluster 的 Head Pod 因节点磁盘压力（Disk Pressure）或内存不足被 Kubelet 驱逐后，为什么没有自动重建？","这是一个已知问题，在某些版本中 RayCluster Controller 未能正确处理非 API 发起的驱逐（如 Kubelet 因资源压力发起的驱逐）。\n解决方案：\n1. 临时方案：手动删除处于 \"Evicted\" 状态的 Head Pod，Controller 随后会创建新的 Pod。\n2. 根本解决：该问题已在后续版本（参考 PR #2217）中修复。请确保将 `ray-operator` 升级到包含该修复的版本（建议 1.2.0 或更高）。\n3. 验证：可以通过模拟节点资源压力（如使用 `fallocate` 填满磁盘或限制内存）来测试升级后的行为是否符合预期。","https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fissues\u002F2125",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},15048,"Ray Autoscaler 在不健康的节点上反复创建大量处于 Pending 状态的 Pod 怎么办？","当 Autoscaler 尝试将 Pod 调度到不健康节点失败后，可能会陷入重试循环，导致产生数百个 Pending Pod。\n解决方案：\n1. 这是一个已知缺陷，社区已提交过相关修复（参考 PR #662）。\n2. 检查您的 KubeRay 版本，确保已应用了针对 Autoscaler 重试逻辑的补丁。\n3. 如果问题依然存在，可能需要手动干预清理 Pending Pod，并检查节点状态，移除或修复不健康的节点，防止 Autoscaler 继续向这些节点调度。","https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fissues\u002F1108",{"id":163,"question_zh":164,"answer_zh":165,"source_url":166},15049,"KubeRay 是否支持 RayCluster 的滚动升级（Rolling Upgrade）？","早期版本不支持原生的滚动升级，用户通常需要手动删除 Head Pod 来触发更新。\n现状与方案：\n1. 社区已经设计了滚动升级方案（参考 Ray Enhancement Proposal #58）。\n2. 如果您使用的是较新版本（如 1.2.2+），请查阅最新文档确认该功能是否已默认启用。\n3. 对于旧版本，目前的变通方法是：更新镜像后，手动删除 Head Pod，Operator 会基于新配置重建集群，但这会导致短暂的服务中断，并非平滑的滚动更新。建议关注官方发布的最新版本以获取原生支持。","https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fissues\u002F527",{"id":168,"question_zh":169,"answer_zh":170,"source_url":171},15050,"如何将 RayService 与 Kueue MultiKueue 集成以实现多集群作业调度？","为了支持 Kueue MultiKueue，RayService 需要添加 `spec.managedBy` 字段。\n实施步骤：\n1. 确保您的 KubeRay 和 Kueue 版本已包含相关支持（MultiKueue 支持 PR 已合并）。\n2. 在 RayService 的 YAML 定义中添加 `spec.managedBy` 字段，指定管理该服务的控制器。\n3. 
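一个最小的字段示意如下（资源名与 managedBy 取值均为示例，实际请以所用 Kueue\u002FKubeRay 版本的文档为准）：\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayService\nmetadata:\n  name: my-rayservice  # 示例名称\nspec:\n  managedBy: kueue.x-k8s.io\u002Fmultikueue  # 示例取值：交由 MultiKueue 控制器管理\n```\n4. 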
该逻辑与 RayJob 中的实现类似。目前该功能已在最新版本中可用，无需额外开发，只需正确配置 YAML 即可实现一个集群管理并分发作业到多个工作集群的功能。","https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fissues\u002F4486",[173,178,182,187,192,197,201,206,211,216,221,226,231,236,241,246,251,256,261,266],{"id":174,"version":175,"summary_zh":176,"released_at":177},81869,"v1.6.0","## 亮点\n\n### Ray 历史服务器（Alpha）\n\nKubeRay v1.6 引入了对 Ray 历史服务器的 **Alpha 支持**。该项目使用户能够从 Ray 集群中收集和聚合事件，并通过回放这些事件来恢复集群状态的历史快照。通过为 Ray 仪表板提供替代后端，历史服务器允许用户在临时集群（例如通过 RayJob 管理的集群）终止后，仍然查看 Ray 仪表板并进行调试。\n\n**在此尝试历史服务器：[历史服务器快速入门指南](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fblob\u002Fmaster\u002Fhistoryserver\u002Fdocs\u002Fset_up_historyserver.md)。**\n\n> :warning: **警告:** 此功能目前处于 Alpha 阶段，这意味着未来的 KubeRay 版本可能会包含破坏性更新。我们非常期待听到您对该功能的使用体验！请在[此跟踪问题](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fissues\u002F3966)中留下您的反馈，以帮助我们改进其开发。\n\n### 使用 Kubernetes RBAC 的 Ray 令牌认证\n\n自 KubeRay v1.6 和 Ray v2.55 起，您可以使用 Kubernetes RBAC 来管理对启用了 [令牌认证](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fray-security\u002Ftoken-auth.html)的 Ray 集群的用户访问控制。启用此功能后，Ray 将被配置为将令牌认证委托给 Kubernetes。这意味着您可以使用与 Kubernetes 相同的凭据来访问 Ray 集群，而平台运维人员则可以使用标准的 Kubernetes RBAC 来控制对 Ray 集群的访问权限。有关分步指南，请参阅[配置 Ray 集群以使用 Kubernetes RBAC 认证](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkuberay-auth-rbac.html)。\n\n现在，您还可以引用包含静态认证令牌的 Secret，用于 Ray 集群的令牌认证。  \n```yaml\napiVersion: v1\nkind: Secret\nmetadata:\n  name: ray-cluster-token\ntype: Opaque\nstringData:\n  auth_token: \"super-secret-example-token\"\n---\napiVersion: ray.io\u002Fv1\nkind: RayCluster\nmetadata:\n  name: ray-cluster-with-auth\nspec:\n  authOptions:\n    mode: token\n    secretName: ray-cluster-token\n  rayVersion: '2.53.0'\n  headGroupSpec:\n    rayStartParams: {}\n```\n\n\n### RayCronJob\n\nKubeRay v1.6 引入了 `RayCronJob` 自定义资源定义 (CRD)，使用户能够使用标准的 cron 表达式按周期性计划调度 RayJob。这对于定期批处理、计划训练任务或重复的数据管道非常有用。\n\n> :warning: **警告:** RayCronJob 是一项 **Alpha** 功能，默认情况下是 **禁用** 的。要启用它，请在 kuberay-operator 上设置功能门控：\n> ```\n> --feature-gates=RayCronJob=true\n> ```\n\n以下是新自定义资源的示例：\n\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayCronJob\nmetadata:\n  name: raycronjob-sample\nspec:\n  schedule: \"* * * * *\"\n  jobTemplate:\n    entrypoint: python \u002Fhome\u002Fray\u002Fsamples\u002Fsample_code.py\n    shutdownAfterJobFinishes: true\n    ttlSecondsAfterFinished: 600\n    runtime","2026-03-19T05:07:17",{"id":179,"version":180,"summary_zh":78,"released_at":181},81870,"v1.6.0-rc.0","2026-03-11T04:38:20",{"id":183,"version":184,"summary_zh":185,"released_at":186},81871,"v1.5.1","# v1.5.1\n\n## 亮点 \n\n此版本新增了对使用 RayCluster、RayJob 和 RayService 进行 Ray 令牌认证的支持。 \n\n您可以通过以下 API 启用 Ray 令牌认证：\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayCluster\nmetadata:\n  name: ray-cluster-with-auth\nspec:\n  rayVersion: '2.52.0'\n  authOptions:\n    mode: token\n``` \n\n必须将 `spec.rayVersion` 指定为 `2.52.0` 或更高版本。完整示例请参见 [ray-cluster.auth.yaml](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fblob\u002Fmaster\u002Fray-operator\u002Fconfig\u002Fsamples\u002Fray-cluster.auth.yaml)。\n\n## 错误修复\n\n* 修复 NewClusterWithIncrementalUpgrade 策略中的一个 bug：在升级过程中，活跃（旧）集群的 Serve 配置缓存被错误地更新为新的 ServeConfigV2。#4212 @ryanaoleary \n* 将 Pod 层面的容器失败信息暴露到 RayCluster 状态中 #4196 @spencer-p \n* 修复 RayJob 状态在初始化阶段发生失败时不会更新的 bug #4191 @spencer-p \n* 修复 RayCluster 状态并非始终会传播到 RayJob 状态的 bug #4192 @machichima","2025-11-21T18:02:42",{"id":188,"version":189,"summary_zh":190,"released_at":191},81872,"v1.5.0","# 亮点\n\n## Ray 
标签选择器 API\n\nRay v2.49 引入了 [标签选择器 API](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fray-core\u002Fscheduling\u002Flabels.html)。相应地，KubeRay v1.5 现在提供了一个顶级 API，用于定义 Ray 标签和资源。这一新的顶级 API 是今后的首选方式，将取代之前在 `rayStartParams` 中设置标签和自定义资源的做法。\n\n新 API 将由 Ray 自动伸缩器使用，从而基于任务和 Actor 的标签选择器改进自动伸缩决策。此外，通过此 API 配置的标签会直接镜像到 Pod 中。这种镜像机制使用户在管理和交互 Ray 集群时，能够更无缝地将 Ray 标签选择器与标准 Kubernetes 标签选择器相结合。\n\n您可以按以下方式使用新 API：\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayCluster\nspec:\n  ...\n  headGroupSpec:\n    rayStartParams: {}\n    resources:\n      Custom1: \"1\"\n    labels:\n      ray.io\u002Fzone: us-west-2a\n      ray.io\u002Fregion: us-west-2\n  workerGroupSpec:\n  - replicas: 1\n    rayStartParams: {}\n    resources:\n      Custom1: \"1\"\n    labels:\n      ray.io\u002Fzone: us-west-2a\n      ray.io\u002Fregion: us-west-2\n```\n\n## RayJob Sidecar 提交模式\n\nRayJob 资源现在支持 `spec.submissionMode` 的一个新值，即 `SidecarMode`。Sidecar 模式直接解决了 `K8sJobMode` 和 `HttpMode` 中的一个关键限制：提交作业需要外部 Pod 或 KubeRay 运算符具备网络连通性。而使用 Sidecar 模式时，作业提交将通过向 Head Pod 注入一个 Sidecar 容器来编排。这一方案无需外部客户端处理提交过程，并能减少因网络故障导致的作业提交失败。\n\n要使用此功能，请在您的 RayJob 中将 `spec.submissionMode` 设置为 `SidecarMode`：\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayJob\nmetadata:\n  name: my-rayjob\nspec:\n  submissionMode: \"SidecarMode\"\n  ...\n```\n\n## RayJob 的高级删除策略\n\nKubeRay 现在支持一种更为先进且灵活的 API，用于在 RayJob 规范中表达删除策略。这一新设计不再局限于单一的布尔字段 `spec.shutdownAfterJobFinishes`，而是允许用户根据 Ray 作业的状态，利用可配置的 TTL 值来定义不同的清理策略。\n\n该 API 开启了多种新用例，这些用例要求在作业完成或失败后保留特定资源。例如，用户现在可以实施如下策略：\n* 在作业失败后，仅保留 Head Pod 一段时间以方便调试。\n* 在作业成功运行后，延长整个 Ray 集群的 TTL，以便进行事后分析或数据提取。\n\n通过将特定的 TTL 与 Ray 作业状态（如成功、失败）以及清理策略（如删除工作节点、删除集群、删除自身）相关联，用户可以对资源清理和成本管理实现细粒度控制。\n\n以下是使用 t 的示例：","2025-11-04T14:58:29",{"id":193,"version":194,"summary_zh":195,"released_at":196},81873,"v1.5.0-rc.1","\r\n","2025-11-03T15:08:50",{"id":198,"version":199,"summary_zh":78,"released_at":200},81874,"v1.5.0-rc.0","2025-10-30T03:00:40",{"id":202,"version":203,"summary_zh":204,"released_at":205},81875,"v1.4.2","## 更改日志\n* cc344f14fa5d7dae80bbd696f7f88bfe1da8479f [修复][发布] 修复 Krew 发布时的缩进错误 (#3823) (#3877)\n* c78bdcff48a03363687614c9667e069214ed6fb5 [RayCluster] 将 headpod 名称改回非确定性 (#3872) (#3876)\n* 34ea80e0f51f80fb092cdc17ca75d4139449edef [发布] 更新 1.4.2 版本的 KubeRay 版本引用 (#3879)\n\n","2025-07-16T23:41:07",{"id":207,"version":208,"summary_zh":209,"released_at":210},81876,"v1.4.1","## 更改日志\n* 0e6b46457b4892ca7594f94ebf4c154d1ff5be0f Cherry-pick：在 Go 模块中使用 Go 1.24.0 (#3835) (#3846)\n* 600a346f8626415817272f8c32314ab7c8ad4af5 修复 Ray 夜间构建镜像的环境变量设置 (#3826) (#3828)\n* 3d138cf3cecd0de449a2c860a85008d086791b5f [发布] 更新 KubeRay 1.4.1 版本引用 (#3848)\n\n","2025-07-07T20:19:42",{"id":212,"version":213,"summary_zh":214,"released_at":215},81877,"v1.4.0","## 亮点\n\n### Kubectl 插件增强\n\nKubeRay v1.4.0 对 Kubectl 插件进行了重大改进：\n\n* 新增了 `scale` 命令，用于扩缩 `RayCluster` 中的工作节点组。\n* 扩展了 `get` 命令，支持列出 Ray 节点和工作节点组。\n* 改进了 `create` 命令：\n  * 允许覆盖配置文件中的默认值。\n  * 支持更多字段，如 Kubernetes 标签和注解、节点选择器、临时存储、`ray start` 参数、TPU、自动伸缩器版本等。\n\n更多详情请参阅 [使用 Kubectl 插件（beta）](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkubectl-plugin.html)。\n\n### KubeRay 仪表板（alpha）\n\n自 v1.4.0 起，您可以使用 KubeRay 的开源仪表板 UI。该组件目前仍处于实验阶段，不建议用于生产环境，但欢迎提供反馈。\n\nKubeRay 仪表板是一个基于 Web 的界面，允许您查看和管理在 Kubernetes 集群上运行的 KubeRay 资源。它与 Ray 集群本身的 Ray 仪表板不同。KubeRay 仪表板提供了所有 KubeRay 资源的集中视图。\n\n更多信息请参阅 [使用 KubeRay 仪表板（实验性）](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkuberay-dashboard.html)。（PR 
合并后，链接将更新至文档网站）\n\n### 与 `kubernetes-sigs\u002Fscheduler-plugins` 集成\n\n从 v1.4.0 开始，KubeRay 集成了另一个调度插件 [`kubernetes-sigs\u002Fscheduler-plugins`](https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fscheduler-plugins)，以支持 `RayCluster` 资源的团伙调度。目前仅支持单调度器模式。\n\n详细信息请参阅 [KubeRay 与调度插件集成](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fk8s-ecosystem\u002Fscheduler-plugins.html)。\n\n### KubeRay APIServer V2（alpha）\n\n新的 APIServer v2 提供了一个与 Kubernetes API 兼容的 HTTP 代理接口。它使用户能够使用标准的 Kubernetes 客户端来管理 Ray 资源。\n\n主要特性：\n\n* 完全兼容 Kubernetes OpenAPI 规范和 CRD。\n* 作为 Go 库提供，可用于构建带有可插拔 HTTP 中间件的自定义代理。\n\nAPIServer v1 现已进入维护模式，不再添加新功能。v2 仍处于 alpha 阶段，欢迎贡献代码和反馈。\n\n### 服务级别指标（SLI）指标\n\nKubeRay 现在包含 SLI 指标，以帮助监控 KubeRay 资源的状态和性能。\n\n详细信息请参阅 [KubeRay 指标参考](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fk8s-ecosystem\u002Fmetrics-references.html)。\n\n## 破坏性变更\n\n### 默认使用非登录 Bash Shell\n\n在 v1.4.0 之前，KubeRay 大多数命令都是通过登录 Shell 来执行的。从 v1.4.0 开始，默认 Shell 变更为 **非登录** Bash Shell。您可以通过设置 `ENABLE_LOGIN_SHELL` 环境变量暂时恢复登录 Shell 的行为，但使用登录 Shell 是 n","2025-06-21T16:12:48",{"id":217,"version":218,"summary_zh":219,"released_at":220},81878,"v1.3.2","## Bug 修复\n\n*  [RayJob] 在提交作业时使用 --no-wait，以避免将错误返回码传递到日志尾部输出 (#3216)\n*  [kubectl 插件] kubectl ray job submit：提供入口点，以保持与 v1.2.2 的兼容性 (#3186)\n\n## 功能改进\n\n*  [kubectl 插件] 添加 head\u002Fworker 节点选择器选项 (#3228)\n*  [kubectl 插件] 为 kubectl 插件的创建 worker 组功能添加节点选择器选项 (#3235)\n\n## 更改日志\n*  [RayJob][修复] 在提交作业时使用 --no-wait，以避免将错误返回码传递到日志尾部输出 (#3216)\n*  kubectl ray job submit：提供入口点 (#3186)\n*  [kubectl 插件] 添加 head\u002Fworker 节点选择器选项 (#3228)\n*  为 kubectl 插件的创建 worker 组功能添加节点选择器选项 (#3235)\n*  [杂项][CI] 将 release-image-build GitHub 工作流的输入限制为仅接受标签 (#3117)\n*  [CI] 从发布流程中移除创建标签步骤 (#3249)\n\n","2025-04-03T02:05:03",{"id":222,"version":223,"summary_zh":224,"released_at":225},81879,"v1.3.1","## 亮点\n\n本次发布更新了 Go 依赖，以解决在使用较新版本的 `k8s.io\u002Fcomponent-base` 时出现的不兼容问题。\n\n## 更改日志\n\n更新 component-base 后，需要进行构建才能使更改生效（#3163，[mszadkow](https:\u002F\u002Fgithub.com\u002Fmszadkow)）\n","2025-03-18T12:24:37",{"id":227,"version":228,"summary_zh":229,"released_at":230},81880,"v1.3.0","## 亮点\n\n### RayCluster 条件 API\n\nRayCluster 的条件 API 在 v1.3 中正式进入 Beta 阶段。新 API 提供了关于 RayCluster 可观测状态的更多详细信息，而这些信息在旧版 API 中无法表达。v1.3 版本支持以下条件：`AllPodRunningAndReadyFirstTime`、`RayClusterPodsProvisioning`、`HeadPodNotFound` 和 `HeadPodRunningAndReady`。我们将在未来的版本中继续增加更多条件。\n\n### Ray Kubectl 插件\n\nRay Kubectl 插件也已升级为 Beta 状态。KubeRay v1.3 支持以下命令：\n* `kubectl ray logs \u003Ccluster-name>`：将 Ray 日志下载到本地目录\n* `kubectl ray session \u003Ccluster-name>`：启动到 Ray Head 节点的端口转发会话\n* `kubectl ray create \u003Ccluster>`：创建一个 Ray 集群\n* `kubectl ray job submit`：使用本地工作目录创建并提交一个 RayJob\n\n更多详细信息请参阅 [Ray Kubectl 插件文档](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fcluster\u002Fkubernetes\u002Fuser-guides\u002Fkubectl-plugin.html)。\n\n### RayJob 稳定性改进\n\n为了提升长时间运行的 RayJob 的稳定性，我们进行了多项改进。特别是在使用 `submissionMode=K8sJobMode` 时，作业提交不会再因重复 ID 而失败。现在，如果提交 ID 已存在，系统将直接获取现有作业的日志，而不会重新提交。\n\n### RayService API 改进\n\nRayService 致力于实现零宕机的服务部署。当 RayService 规范中的更改无法原地应用时，它会尝试在后台将流量迁移到新的 RayCluster。然而，用户并不总是具备创建新 RayCluster 所需的资源。从 KubeRay 1.3 开始，用户可以通过 RayServiceSpec 中的新 UpgradeStrategy 选项来自定义这一行为。\n\n此前，RayService 中的 `serviceStatus` 字段并不一致，无法准确反映实际状态。自 KubeRay v1.3.0 起，我们为 RayService 引入了两个条件：`Ready` 和 `UpgradeInProgress`。借鉴 RayCluster 的做法，我们决定弃用 `serviceStatus` 字段。未来，`serviceStatus` 将被移除，条件将成为确定服务状态的权威来源。目前，`serviceStatus` 仍可使用，但其取值仅限于“Running”或空字符串。\n\n### GCS 容错 API 
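配置示意\n\nKubeRay v1.3 在 RayCluster 规范中新增了 `gcsFaultToleranceOptions` 字段（其背景见下一小节）。下面是一个最小示意，其中的 Redis 地址与存储命名空间均为假设取值：\n\n```yaml\napiVersion: ray.io\u002Fv1\nkind: RayCluster\nmetadata:\n  name: raycluster-gcs-ft  # 示例集群名\nspec:\n  gcsFaultToleranceOptions:\n    redisAddress: \"redis:6379\"  # 假设的外部 Redis 地址\n    externalStorageNamespace: \"my-raycluster-storage\"  # 可选：Redis 中的存储命名空间（示例取值）\n```\n\n### GCS 容错 API 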
改进\n\nRayCluster 中新增的 `GcsFaultToleranceOptions` 字段，为用户提供了在 RayCluster 上启用 GCS 容错功能的更 streamlined 方式。这消除了以往需要将相关配置分散到 Pod 注解、容器环境变量以及 RayStartParams 中的麻烦。此外，用户现在还可以在该新字段中指定 Redis 用户名（需配合 Ray 2.4.1 或更高版本）。如需查看此更改对 YAML 配置的影响，请参阅 [示例清单](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fblob\u002Fmaster\u002Fray-operator\u002Fconfig","2025-02-19T00:39:12",{"id":232,"version":233,"summary_zh":234,"released_at":235},81881,"v1.2.2","## 亮点\n\n* (alpha) **Ray kubectl 插件**\n  * `get`、`session`、`log`、`job submit`\n* (alpha) **Kubernetes 事件**：为 KubeRay 与 Kubernetes API 服务器交互中的重要信息创建 Kubernetes 事件\n* (alpha) **Apache YuniKorn 集成**\n\n## 更改日志\n* [发布] 将 Ray 镜像更新至 2.34.0 ([#2303](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2303)，@kevin85421)\n* 撤销 “[发布] 将 Ray 镜像更新至 2.34.0 (#2303)” ([#2413](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2413)，@kevin85421)\n* 撤销 “[发布] 将 Ray 镜像更新至 2.34.0 (#2303)” (#2413) ([#2415](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2415)，@kevin85421)\n* [构建][kubectl-plugin] 为 kubectl 插件添加发布脚本 ([#2407](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2407)，@MortalHappiness)\n* [功能][kubectl-plugin] 为 kubectl ray log 添加 Long、Example 和 shell 补全 ([#2405](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2405)，@MortalHappiness)\n* 支持使用 Apache YuniKorn 进行团伙调度 ([#2396](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2396)，@yangwwei)\n* [功能][Kubectl-Plugin] 实现 kubectl ray job submit ([#2394](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2394)，@chiayi)\n* 添加 1K、5K 和 10K RayCluster\u002FRayJob 的可扩展性测试结果 ([#2218](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2218)，@andrewsykim)\n* [功能][kubectl-plugin] 为 kubectl ray session 添加动态 shell 补全 ([#2390](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2390)，@MortalHappiness)\n* [功能][RayJob]：生成提交者以及 RayCluster 的创建\u002F删除事件 ([#2389](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2389)，@rueian)\n* [RayJob] 为失败的 k8s 创建任务添加失败反馈（日志和事件） ([#2306](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2306)，@tinaxfwu)\n* [功能][Kubectl-Plugin] 为 RayJob 和 RayService 实现 kubectl session ([#2379](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2379)，@MortalHappiness)\n* [功能][kubectl-plugin] 添加静态 shell 补全的说明 ([#2384](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2384)，@MortalHappiness)\n* [功能][RayJob] 用户模式提交方式 ([#2364](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2364)，@MortalHappiness)\n* [功能] 在 pre-commit 中添加 Kubernetes 清单验证。([#2380](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2380)，@LeoLiao123)\n* [功能][RayCluster]：生成 GCS FT Redis 清理作业的创建事件 ([#2382](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2382)，@rueian)\n* [杂项][轻微] 为 kubectl-plugin 添加 .gitignore ([#2383](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2383)，@MortalHappiness)\n* 移除批处理调度器名称的默认选项 ([#2371](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2371)，@yangwwei)\n* RayCluster 无头工作节点服务应发布 NotReadyAddresses ([#2375](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2375)，@ryanaoleary)\n* [CI][GitHub-Actions] 将 actions\u002Fupload-artifact 升级至 v4 
([#2373","2024-09-29T08:40:24",{"id":237,"version":238,"summary_zh":239,"released_at":240},81882,"v1.2.1","Compared to KubeRay v1.2.0, KubeRay v1.2.1 includes an additional commit (#2243). This commit fixes the issue where a RayService created by a KubeRay version older than v1.2.0 does not support zero-downtime upgrades after upgrading to KubeRay v1.2.0.\r\n\r\n* [RayService] Use original ClusterIP for new head service (#2343, @kevin85421)\r\n\r\n","2024-08-31T06:43:49",{"id":242,"version":243,"summary_zh":244,"released_at":245},81883,"v1.2.0","# Highlights\r\n\r\n* **RayCluster CRD status observability improvement**: [design doc](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY\u002Fedit)\r\n* **Support retry in RayJob**: #2192 \r\n* **Coding style improvement**\r\n\r\n# RayCluster\r\n* [RayCluster][Fix] evicted head-pod can be recreated or restarted ([#2217](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2217), @JasonChen86899)\r\n* [Test][RayCluster] Add tests for RestartPolicyOnFailure for eviction ([#2302](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2302), @MortalHappiness)\r\n* kuberay autoscaler pod use same command and args as ray head container ([#2268](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2268), @cswangzheng)\r\n* Updated default timeout seconds for probes ([#2265](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2265), @HarshAgarwal11)\r\n* Buildkite autoscaler e2e ([#2199](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2199), @rueian)\r\n* [Test][Autoscaler][2\u002Fn] Add Ray Autoscaler e2e tests for GPU workers ([#2181](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2181), @rueian)\r\n* [Test][Autoscaler][1\u002Fn] Add Ray Autoscaler e2e tests ([#2168](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2168), @kevin85421)\r\n* [Bug] Fix RayCluster with an overridden app.kubernetes.io\u002Fname (#2147) ([#2166](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2166), @rueian)\r\n* [Feat][RayCluster] Make the Head service headless ([#2117](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2117), @rueian)\r\n* [Refactor][RayCluster] Make ray.io\u002Fgroup=headgroup be constant ([#1970](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1970), @rueian)\r\n* [Feature][autoscaler v2] Set RAY_NODE_TYPE_NAME when starting ray node ([#1973](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1973), @kevin85421)\r\n* feat: add `RayCluster.status.readyWorkerReplicas` ([#1930](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1930), @davidxia)\r\n* [Chore][Samples] Rename ray-cluster.mini.yaml and add workerGroupSpecs ([#2100](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2100), @MortalHappiness)\r\n* [Chore] Delete redundant pod existance checking ([#2113](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2113), @MortalHappiness)\r\n* [Autoscaler V2] Polish Autoscaler V2 YAML ([#2064](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2064), @kevin85421)\r\n* [Refactor] Use RayClusterHeadPodsAssociationOptions to replace MatchingLabels ([#2056](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2056), @evalaiyc98)\r\n* 
[Sample][autoscaler v2] Add sample yaml for autosclaer v2  ([#1974](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1974), @rickyyx)\r\n* Allow configuration of restartPolicy ([#2197](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2197), @c0dearm)\r\n* [Chore][Log] Delete error loggings right before returned errors ([#2103](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2103), @MortalHappiness)\r\n* [Refactor] Follow-up for PR 1930 ([#2124](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2124), @MortalHappiness)\r\n* [Test] Move StateTransitionTimes envtest to a better place ([#2111](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2111), @kevin85421)\r\n* support using proxy subresources when connecting to Ray head node ([#1980](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1980), @andrewsykim)\r\n* [Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image   ([#2087](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2087), @kevin85421)\r\n* [Bug] KubeRay operator failed to watch endpoint ([#2080](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2080), @kevin85421)\r\n* [Refactor] Remove `cleanupInvalidVolumeMounts` ([#2104](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2104), @kevin85421)\r\n* support using proxy subresources when connecting to Ray head node ([#1980](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1980), @andrewsykim)\r\n* [Chore] Run operator outside the cluster ([#2090](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2090), @MortalHappiness)\r\n* [Feat] Deprecate ForcedClusterUpgrade ([#2075](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2075), @MortalHappiness)\r\n* [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests ([#2077](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2077), @kevin85421)\r\n\r\n# RayCluster CRD status improvement\r\n\r\n* RayClusterProvisioned status should be set while cluster is being provisioned for the first time ([#2304](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2304), @andrewsykim)\r\n* Add RayClusterProvisioned Condition Type ([#2301](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2301), @Yicheng-Lu-llll)\r\n* [Test][RayCluster] Add envtests for RayCluster conditions ([#2283](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2283), @MortalHappiness)\r\n* [Fix][RayCluster] Make the RayClusterReplicaFailureReason to capture the correct reason ([#2282](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2282), @rueian)\r\n* Add RayClusterReady Condition Type  ([#2271](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F2271), @Yicheng-Lu-llll)\r\n* [Feature][RayCluster]: Implement the HeadReady condition  ([#2261](","2024-08-29T21:44:24",{"id":247,"version":248,"summary_zh":249,"released_at":250},81884,"v1.1.1","Compared to KubeRay v1.1.0, KubeRay v1.1.1 includes four cherry-picked commits.\r\n\r\n* [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)\r\n* [CI] Pin kustomize to v5.3.0 (#2067, @kevin85421)\r\n* [Bug] All worker 
Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)\r\n* [Hotfix][CI] Pin setup-envtest dep (#2038, @kevin85421)","2024-05-08T20:14:00",{"id":252,"version":253,"summary_zh":254,"released_at":255},81885,"v1.1.0","# Highlights\r\n\r\n* **RayJob improvements**\r\n  * **Gang \u002F Priority scheduling with Kueue:** \r\n     * [Gang scheduling doc](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fexamples\u002Frayjob-kueue-gang-scheduling.html#kuberay-kueue-gang-scheduling-example)\r\n     * [Priority scheduling doc](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fexamples\u002Frayjob-kueue-priority-scheduling.html)\r\n     * [Kueue \u002F RayJob - tutorial with toy example that everyone can try it locally](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fk8s-ecosystem\u002Fkueue.html#kuberay-kueue)\r\n  * **ActiveDeadlineSeconds (new field)**: A feature to control the lifecycle of a RayJob. See [this doc](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Frayjob-quick-start.html) and #1933 for more details.\r\n  * **submissionMode (new field)**: Users can specify “K8sJobMode” or “HTTPMode”. The default value is “K8sJobMode”. In HTTPMode, the submitter K8s Job will not be created. Instead, KubeRay sends a HTTP request to the Ray head Pod to create a Ray job. See [this doc](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Fmaster\u002Fcluster\u002Fkubernetes\u002Fgetting-started\u002Frayjob-quick-start.html) and #1893  for more details.\r\n  * Fix a lot of stability issues.\r\n\r\n* **Structured logging**\r\n  * In KubeRay v1.1.0, we have changed the KubeRay logs to JSON format, and each log message includes context information such as the custom resource’s name and reconcileID. Hence, users can filter out logs associated with a RayCluster, RayJob, or RayService CR by its name.\r\n\r\n* **RayService improvements**\r\n  * Refactor health check mechanism to improve the stability.\r\n  * Deprecate the `deploymentUnhealthySecondThreshold` and `serviceUnhealthySecondThreshold` to avoid unintentional preparation of new RayCluster custom resource.\r\n* **TPU multi-host PodSlice support**\r\n  * The KubeRay team is actively working with the Google GKE and TPU teams on integration. The required changes in KubeRay have already been completed. The GKE team will complete some tasks on their side this week or next. 
Then, users should be able to use multi-host TPU PodSlice with a static RayCluster (without autoscaling).\r\n* **Stop publishing images on DockerHub; instead, we will only publish on Quay.**\r\n  * https:\u002F\u002Fquay.io\u002Frepository\u002Fkuberay\u002Foperator?tab=tags\r\n  * Users should use docker pull `quay.io\u002Fkuberay\u002Foperator:v1.1.0` instead of docker pull `kuberay\u002Foperator:v1.1.0`.\r\n\r\n# RayJob\r\n\r\n## RayJob state machine refactor\r\n\r\n* [RayJob][Status][1\u002Fn] Redefine the definition of JobDeploymentStatusComplete ([#1719](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1719), @kevin85421)\r\n* [RayJob][Status][2\u002Fn] Redefine `ready` for RayCluster to avoid using HTTP requests to check dashboard status ([#1733](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1733), @kevin85421)\r\n* [RayJob][Status][3\u002Fn] Define JobDeploymentStatusInitializing ([#1737](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1737), @kevin85421)\r\n* [RayJob][Status][4\u002Fn] Remove some JobDeploymentStatus and updateState function calls ([#1743](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1743), @kevin85421)\r\n* [RayJob][Status][5\u002Fn] Refactor getOrCreateK8sJob ([#1750](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1750), @kevin85421)\r\n* [RayJob][Status][6\u002Fn] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL ([#1762](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1762), @kevin85421)\r\n* [RayJob][Status][7\u002Fn] Define JobDeploymentStatusNew explicitly ([#1772](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1772), @kevin85421)\r\n* [RayJob][Status][8\u002Fn] Only a RayJob with the status Running can transition to Complete at this moment ([#1774](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1774), @kevin85421)\r\n* [RayJob][Status][9\u002Fn] RayJob should not pass any changes to RayCluster ([#1776](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1776), @kevin85421)\r\n* [RayJob][10\u002Fn] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew ([#1780](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1780), @kevin85421)\r\n* [RayJob][Status][11\u002Fn] Refactor the suspend operation ([#1782](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1782), @kevin85421)\r\n* [RayJob][Status][12\u002Fn] Resume suspended RayJob ([#1783](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1783), @kevin85421)\r\n* [RayJob][Status][13\u002Fn] Make suspend operation atomic by introducing the new status `Suspending` ([#1798](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1798), @kevin85421)\r\n* [RayJob][Status][14\u002Fn] Decouple the Initializing status and Running status ([#1801](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1801), @kevin85421)\r\n* [RayJob][Status][15\u002Fn] Unify the codepath for the status transition to `Suspended` ([#1805](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1805), @kevin85421)\r\n* [RayJob][Status][16\u002Fn] Refactor `Running` status ([#1807](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1807), @kevin85421)\r\n* [RayJob][Status][17\u002Fn] Unify the codepath for status updates 
","2024-03-23T04:05:08",{"id":257,"version":258,"summary_zh":259,"released_at":260},81886,"v1.0.0","# KubeRay is officially in General Availability!\r\n\r\n* Bump the CRD version from v1alpha1 to v1.\r\n* Relocate almost all documentation to the Ray website.\r\n* Improve RayJob UX.\r\n* Improve GCS fault tolerance.\r\n\r\n# GCS fault tolerance\r\n\r\n* [GCS FT] Improve GCS FT cleanup UX ([#1592](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1592), @kevin85421)\r\n* [Bug][RayCluster] Fix RAY_REDIS_ADDRESS parsing with redis scheme and… ([#1556](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1556), @rueian)\r\n* [Bug] RayService with GCS FT HA issue ([#1551](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1551), @kevin85421)\r\n* [Test][GCS FT] End-to-end test for cleanup_redis_storage (#1422)(#1459) ([#1466](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1466), @rueian)\r\n* [Feature][GCS FT] Clean up Redis once a GCS FT-enabled RayCluster is deleted ([#1412](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1412), @kevin85421)\r\n* Update GCS fault tolerance YAML ([#1404](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1404), @kevin85421)\r\n* [GCS FT] Consider the case of sidecar containers ([#1386](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1386), @kevin85421)\r\n* [GCS FT] Give readiness \u002F liveness probes good default values ([#1364](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1364), @kevin85421)\r\n* [GCS FT][Refactor] Redefine the behavior for deleting Pods and stop listening to Kubernetes events ([#1341](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1341), @kevin85421)
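\r\n\r\nFor context on the fixes above, GCS fault tolerance is enabled by pointing the head Pod at an external Redis. A minimal sketch assuming the `ray.io\u002Fft-enabled` annotation style KubeRay used at this time; the cluster name, image, and Redis address are placeholders, and the password wiring is omitted:\r\n\r\n```yaml\r\napiVersion: ray.io\u002Fv1\r\nkind: RayCluster\r\nmetadata:\r\n  name: raycluster-gcsft  # placeholder name\r\n  annotations:\r\n    ray.io\u002Fft-enabled: "true"  # opt in to GCS fault tolerance\r\nspec:\r\n  headGroupSpec:\r\n    rayStartParams: {}\r\n    template:\r\n      spec:\r\n        containers:\r\n        - name: ray-head\r\n          image: rayproject\u002Fray:2.9.0  # placeholder image\r\n          env:\r\n          - name: RAY_REDIS_ADDRESS\r\n            value: redis:6379  # external Redis; cleaned up on deletion per #1412\r\n```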
\r\n\r\n# CRD versioning\r\n\r\n* [CRD] Inject CRD version to the Autoscaler sidecar container ([#1496](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1496), @kevin85421)\r\n* [CRD][2\u002Fn] Update from CRD v1alpha1 to v1 ([#1482](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1482), @kevin85421)\r\n* [CRD][1\u002Fn] Create v1 CRDs ([#1481](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1481), @kevin85421)\r\n* [CRD] Set maxDescLen to 0 ([#1449](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1449), @kevin85421)\r\n\r\n# RayService\r\n\r\n* [Hotfix][Bug] Avoid unnecessary zero-downtime upgrades ([#1581](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1581), @kevin85421)\r\n* [Feature] Add an example for RayService high availability ([#1566](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1566), @kevin85421)\r\n* [Feature] Add a flag to make zero-downtime upgrades optional ([#1564](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1564), @kevin85421)\r\n* [Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure ([#1420](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1420), @kevin85421)\r\n* [Bug] Fix the filename of the text summarizer YAML ([#1415](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1415), @kevin85421)\r\n* [serve] Change text ml yaml to use French in user config ([#1403](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1403), @zcin)\r\n* [serve] Add text ml RayService yaml ([#1402](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1402), @zcin)\r\n* [Bug] Fix flakiness of RayService e2e tests ([#1385](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1385), @kevin85421)\r\n* Add RayService sample test ([#1377](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1377), @Darren221)\r\n* [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold ([#1293](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1293), @kevin85421)\r\n* [RayService][Observability] Add more logging about networking issues ([#1282](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1282), @kevin85421)\r\n\r\n# RayJob\r\n\r\n* [Feature] Improve observability for flaky RayJob test ([#1587](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1587), @kevin85421)\r\n* [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running ([#1583](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1583), @architkulkarni)\r\n* [RayJob] Fix RayJob status reconciliation ([#1539](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1539), @astefanutti)\r\n* [RayJob] Always use the target RayCluster image as the default RayJob submitter image ([#1548](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1548), @astefanutti)\r\n* [RayJob] Add default CPU and memory for the job submitter pod ([#1319](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1319), @architkulkarni)\r\n* [Bug][RayJob] Check dashboard readiness before creating the job pod (#1381) ([#1429](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1429), @rueian)\r\n* [Feature][RayJob] Use RayContainerIndex instead of 0 (#1397) ([#1427](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1427), @rueian)\r\n* [RayJob] Enable job log streaming by setting `PYTHONUNBUFFERED` in the job container ([#1375](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1375), @architkulkarni)\r\n* Add a field to expose entrypoint num cpus in RayJob ([#1359](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1359), @shubhscoder)\r\n* [RayJob] Add runtime env YAML field ([#1338](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1338), @architkulkarni)\r\n* [Bug][RayJob] RayJob with custom head service name ([#1332](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1332), @kevin85421)\r\n* [RayJob] Add e2e sample yaml
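\r\n\r\nSeveral of the RayJob additions above (#1359, #1338) surface Ray job-submission options directly on the CR. A minimal sketch of the relevant spec fragment; the values are placeholders, and the field names (`entrypointNumCpus`, `runtimeEnvYAML`) are inferred from the PR titles rather than quoted from docs:\r\n\r\n```yaml\r\n# RayJob spec fragment (illustrative).\r\nspec:\r\n  entrypoint: python \u002Fhome\u002Fray\u002Fjob.py  # placeholder entrypoint\r\n  entrypointNumCpus: 1  # CPUs reserved for the entrypoint script (#1359)\r\n  runtimeEnvYAML: |  # runtime environment as an inline YAML string (#1338)\r\n    pip:\r\n      - requests==2.31.0\r\n    env_vars:\r\n      EXAMPLE_FLAG: "1"  # hypothetical variable\r\n```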
","2023-11-07T06:12:05",{"id":262,"version":263,"summary_zh":264,"released_at":265},81887,"v0.6.0","# Highlights\r\n\r\n* RayService\r\n  * RayService now supports the Ray Serve multi-app API (#1136, #1156)\r\n  * RayService stability improvements (#1231, #1207, #1173)\r\n  * RayService observability (#1230)\r\n  * RayService examples\r\n    * [RayService] Stable Diffusion example ([#1181](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1181), @kevin85421)\r\n    * MobileNet example ([#1175](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1175), @kevin85421)\r\n  * RayService troubleshooting handbook (#1221)\r\n\r\n* RayJob refactoring (#1177)\r\n* Autoscaler stability improvements (#1251, #1253)\r\n\r\n# RayService\r\n\r\n* [RayService][Observability] Add more logging for RayService troubleshooting ([#1230](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1230), @kevin85421)\r\n* [Bug] A long image pull time triggers a blue-green upgrade after the head is ready ([#1231](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1231), @kevin85421)\r\n* [RayService] Stable Diffusion example ([#1181](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1181), @kevin85421)\r\n* [RayService] Update docs to use multi-app ([#1179](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1179), @zcin)\r\n* [RayService] Change runtime env for e2e autoscaling test ([#1178](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1178), @zcin)\r\n* [RayService] Add e2e tests ([#1167](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1167), @zcin)\r\n* [RayService][docs] Improve explanation of the config file and in-place updates ([#1229](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1229), @zcin)\r\n* [RayService][Doc] RayService troubleshooting handbook ([#1221](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1221), @kevin85421)\r\n* [Doc] Improve RayService doc ([#1235](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1235), @kevin85421)\r\n* [Doc] Improve FAQ page and RayService troubleshooting guide ([#1225](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1225), @kevin85421)\r\n* [RayService] Add RayService ALB ingress CR ([#1169](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1169), @sihanwang41)\r\n* [RayService] Add support for multi-app config in yaml-string format ([#1156](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1156), @zcin)\r\n* [RayService] Add support for getting multi-app status ([#1136](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1136), @zcin)\r\n* [Refactor] Remove the Dashboard Agent service ([#1207](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1207), @kevin85421)\r\n* [Bug] KubeRay operator fails to get Serve deployment status due to a 500 Internal Server Error ([#1173](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1173), @kevin85421)\r\n* MobileNet example ([#1175](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1175), @kevin85421)\r\n* [Bug] Fix the RayActorOptionSpec.items.spec.serveConfig.deployments.rayActorOptions.memory int32 data type ([#1220](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1220), @kevin85421)
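\r\n\r\nThe multi-app work above (#1136, #1156) accepts the Serve config as a YAML string on the RayService CR. A minimal sketch of the relevant fragment, assuming the `serveConfigV2` field name; the application names, route prefixes, and import paths are placeholders:\r\n\r\n```yaml\r\napiVersion: ray.io\u002Fv1alpha1  # CRD version current as of v0.6.0\r\nkind: RayService\r\nmetadata:\r\n  name: rayservice-multi-app  # placeholder name\r\nspec:\r\n  serveConfigV2: |  # yaml-string multi-app format (#1156)\r\n    applications:\r\n      - name: app1  # placeholder application\r\n        route_prefix: \u002Fapp1\r\n        import_path: module_a.graph  # hypothetical import path\r\n      - name: app2\r\n        route_prefix: \u002Fapp2\r\n        import_path: module_b.graph\r\n```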
\r\n\r\n# RayJob\r\n\r\n* [RayJob] Submit job using a K8s Job instead of checking Status and using DashboardHTTPClient ([#1177](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1177), @architkulkarni)\r\n* [Doc][RayJob] Add documentation for submitterPodTemplate ([#1228](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1228), @architkulkarni)\r\n\r\n# Autoscaler\r\n\r\n* [release blocker][Feature] Only the Autoscaler can make decisions to delete Pods ([#1253](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1253), @kevin85421)\r\n* [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster ([#1251](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1251), @kevin85421)\r\n\r\n# Helm\r\n\r\n* [Helm][RBAC] Introduce the option crNamespacedRbacEnable to enable or disable the creation of Role\u002FRoleBinding for RayCluster preparation ([#1162](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1162), @kevin85421)\r\n* [Bug] Allow zero replicas for workers in Helm ([#968](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F968), @ducviet00)\r\n* [Bug] KubeRay tries to create a ClusterRoleBinding when singleNamespaceInstall and rbacEnable are set to true ([#1190](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1190), @kevin85421)\r\n\r\n# KubeRay API Server\r\n\r\n* Add support for OpenShift routes ([#1183](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1183), @blublinsky)\r\n* Add API server support for service accounts ([#1148](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1148), @blublinsky)\r\n\r\n# Documentation\r\n\r\n* [release v0.6.0] Update tags and versions ([#1270](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1270), @kevin85421)\r\n* [release v0.6.0-rc.1] Update tags and versions ([#1264](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1264), @kevin85421)\r\n* [release v0.6.0-rc.0] Update tags and versions ([#1237](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1237), @kevin85421)\r\n* [Doc] Develop Ray Serve Python scripts on KubeRay ([#1250](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1250), @kevin85421)\r\n* [Doc] Fix the order of comments in the sample Job YAML file ([#1242](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1242), @architkulkarni)\r\n* [Doc] Upload a screenshot of the Serve page in the Ray dashboard ([#1236](https:\u002F\u002Fray-project","2023-07-26T22:35:32",{"id":267,"version":268,"summary_zh":269,"released_at":270},81888,"v0.5.2","# Changelog for v0.5.2\r\n\r\n## Highlights\r\n\r\nThe KubeRay 0.5.2 patch release includes the following improvements:\r\n* Allow specifying the entire headService and serveService YAML spec.
Previously, only certain special fields such as `labels` and `annotations` were exposed to the user.\r\n  * Expose entire head pod Service to the user ([#1040](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1040), [@architkulkarni](https:\u002F\u002Fgithub.com\u002Farchitkulkarni))\r\n  * Expose the Serve Service ([#1117](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1117), [@kodwanis](https:\u002F\u002Fgithub.com\u002Fkodwanis))\r\n* RayService stability improvements\r\n  * RayService object’s Status is being updated due to frequent reconciliation ([#1065](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1065), [@kevin85421](https:\u002F\u002Fgithub.com\u002Fkevin85421))\r\n  * [RayService] Submit requests to the Dashboard after the head Pod is running and ready ([#1074](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1074), [@kevin85421](https:\u002F\u002Fgithub.com\u002Fkevin85421))\r\n  * Fix the HeadPod Service generation logic that was causing frequent reconciliation ([#1056](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1056), [@msumitjain](https:\u002F\u002Fgithub.com\u002Fmsumitjain))\r\n* Allow watching multiple namespaces (see the sketch after this list)\r\n  * [Feature] Watch CRs in multiple namespaces with namespaced RBAC resources ([#1106](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1106), [@kevin85421](https:\u002F\u002Fgithub.com\u002Fkevin85421))\r\n* Autoscaler stability improvements\r\n  * [Bug] RayService restarts repeatedly with the Autoscaler ([#1037](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1037), [@kevin85421](https:\u002F\u002Fgithub.com\u002Fkevin85421))\r\n  * [Bug] Autoscaler not working properly in RayJob ([#1064](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1064), [@Yicheng-Lu-llll](https:\u002F\u002Fgithub.com\u002FYicheng-Lu-llll))\r\n  * [Bug][Autoscaler] Operator does not remove workers ([#1139](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1139), [@kevin85421](https:\u002F\u002Fgithub.com\u002Fkevin85421))
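\r\n\r\nA minimal sketch of how the multi-namespace watching above (#1106) is typically wired up through the operator's Helm values; the namespace names are placeholders, and the exact keys should be checked against the chart version you install (`singleNamespaceInstall` also appears in the v0.6.0 notes above, while `watchNamespace` is an assumed key):\r\n\r\n```yaml\r\n# kuberay-operator values.yaml fragment (illustrative).\r\nsingleNamespaceInstall: true  # create namespaced Role\u002FRoleBinding instead of cluster-scoped RBAC\r\nwatchNamespace:  # namespaces whose CRs the operator reconciles (placeholders)\r\n  - team-a\r\n  - team-b\r\n```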
\r\n\r\n## Contributors\r\n\r\nWe'd like to thank the following contributors for their contributions to this release:\r\n\r\n@ByronHsu, @Yicheng-Lu-llll, @anishasthana, @architkulkarni, @blublinsky, @chrisxstyles, @dirtyValera, @ecurtin, @jasoonn, @jjyao, @kevin85421, @kodwanis, @msumitjain, @oginskis, @psschwei, @scarlet25151, @sihanwang41, @tedhtchang, @varungup90, @xubo245\r\n\r\n## Features\r\n\r\n* Add a flag to enable\u002Fdisable worker init container injection ([#1069](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1069), @ByronHsu)\r\n* Add a warning to discourage users from launching a KubeRay-incompatible autoscaler ([#1102](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1102), @kevin85421)\r\n* Add a consistency check for deepcopy-generated files ([#1127](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1127), @varungup90)\r\n* Add the kubernetes dependency to the Python client library ([#998](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F998), @jasoonn)\r\n* Add support for PVCs to the API server ([#1118](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1118), @psschwei)\r\n* Add support for tolerations, env, annotations, and labels ([#1070](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1070), @blublinsky)\r\n* Align the init container's ImagePullPolicy with the Ray container's ImagePullPolicy ([#1080](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1080), @Yicheng-Lu-llll)\r\n* Connect the Ray client with TLS using Nginx Ingress on a kind cluster (#729) ([#1051](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1051), @tedhtchang)\r\n* Expose entire head pod Service to the user ([#1040](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1040), @architkulkarni)\r\n* Expose the Serve Service ([#1117](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1117), @kodwanis)\r\n* [Test] Add e2e test for sample RayJob yaml on kind ([#935](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F935), @architkulkarni)\r\n* Parametrize the ray-operator Makefile ([#1121](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1121), @anishasthana)\r\n* RayService object's Status is being updated due to frequent reconciliation ([#1065](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1065), @kevin85421)\r\n* [Feature] Support suspend in RayJob ([#926](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F926), @oginskis)\r\n* [Feature] Watch CRs in multiple namespaces with namespaced RBAC resources ([#1106](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1106), @kevin85421)\r\n* [RayService] Submit requests to the Dashboard after the head Pod is running and ready ([#1074](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1074), @kevin85421)\r\n* feat: Rename instances of rayiov1alpha1 to rayv1alpha1 ([#1112](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1112), @anishasthana)\r\n* ray-operator: Reuse contexts across ray operator reconcilers ([#1126](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1126), @anishasthana)\r\n\r\n## Fixes\r\n\r\n* Fix CI ([#1145](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1145), @kevin85421)\r\n* Fix frequent config updates ([#1014](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fkuberay\u002Fpull\u002F1014), @sihanwang41)\r\n* Fix for Sample YAML Config Test - 2.4.0 failure due to the 'suspend' field ([#","2023-06-14T21:10:12"]