[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-relari-ai--continuous-eval":3,"tool-relari-ai--continuous-eval":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159267,2,"2026-04-17T11:29:14",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":32,"env_os":94,"env_gpu":94,"env_ram":94,"env_deps":95,"category_tags":101,"github_topics":103,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":111,"updated_at":112,"faqs":113,"releases":154},8622,"relari-ai\u002Fcontinuous-eval","continuous-eval","Data-Driven Evaluation for LLM-Powered Applications","continuous-eval 是一款专为大语言模型（LLM）应用打造的开源评估工具，旨在通过数据驱动的方式帮助开发者量化和优化应用性能。在开发基于 LLM 的复杂系统（如检索增强生成 RAG、代码生成或智能体）时，传统方法往往难以精准定位链路中的薄弱环节。continuous-eval 通过模块化评估机制，允许用户对管道中的每个环节单独设定指标，从而精确识别问题所在。\n\n该工具内置了丰富的度量库，涵盖确定性、语义及基于大模型本身的多种评估维度，并创新性地引入了概率化评估方法，使测试结果更加稳健可靠。无论是验证检索内容的准确度，还是评估智能体的工具调用能力，它都能提供科学的量化依据。\n\ncontinuous-eval 特别适合 AI 应用开发者、算法工程师及研究人员使用。它以 Python 包形式提供，安装简便，支持快速集成到现有的开发与测试流程中。通过简单的代码配置，用户即可运行单一指标测试或构建完整的自动化评估流水线，让模型迭代不再依赖主观猜测，而是建立在扎实的数据反馈之上，助力打造更高质量的 AI 应用。","\u003Ch3 align=\"center\">\n  \u003Cimg\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frelari-ai_continuous-eval_readme_c1b20125a0df.png\"\n    width=\"350\"\n  >\n\u003C\u002Fh3>\n\n\u003Cdiv align=\"center\">\n\n  \n  \u003Ca href=\"https:\u002F\u002Fdocs.relari.ai\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-view-blue\" alt=\"Documentation\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fcontinuous-eval.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Freleases\">![https:\u002F\u002FGitHub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Freleases](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Frelari-ai\u002Fcontinuous-eval)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\u002F\">![https:\u002F\u002Fgithub.com\u002FNaereen\u002Fbadges\u002F](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frelari-ai_continuous-eval_readme_6c80bcaa3db1.png)\u003C\u002Fa>\n  \u003Ca a 
href=\"https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fblob\u002Fmain\u002FLICENSE\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Fcontinuous-eval.svg)\u003C\u002Fa>\n\n\n\u003C\u002Fdiv>\n\n\u003Ch2 align=\"center\">\n  \u003Cp>Data-Driven Evaluation for LLM-Powered Applications\u003C\u002Fp>\n\u003C\u002Fh2>\n\n\n\n## Overview\n\n`continuous-eval` is an open-source package created for data-driven evaluation of LLM-powered application.\n\n\u003Ch1 align=\"center\">\n  \u003Cimg\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frelari-ai_continuous-eval_readme_ae662cf7180c.png\"\n  >\n\u003C\u002Fh1>\n\n## How is continuous-eval different?\n\n- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.\n\n- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.\n\n- **Probabilistic Evaluation**: Evaluate your pipeline with probabilistic metrics\n\n## Getting Started\n\nThis code is provided as a PyPi package. To install it, run the following command:\n\n```bash\npython3 -m pip install continuous-eval\n```\n\nif you want to install from source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval.git && cd continuous-eval\npoetry install --all-extras\n```\n\nTo run LLM-based metrics, the code requires at least one of the LLM API keys in `.env`. Take a look at the example env file `.env.example`.\n\n## Run a single metric\n\nHere's how you run a single metric on a datum.\nCheck all available metrics here: [link](https:\u002F\u002Fcontinuous-eval.docs.relari.ai\u002F)\n\n```python\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1\n\ndatum = {\n    \"question\": \"What is the capital of France?\",\n    \"retrieved_context\": [\n        \"Paris is the capital of France and its largest city.\",\n        \"Lyon is a major city in France.\",\n    ],\n    \"ground_truth_context\": [\"Paris is the capital of France.\"],\n    \"answer\": \"Paris\",\n    \"ground_truths\": [\"Paris\"],\n}\n\nmetric = PrecisionRecallF1()\n\nprint(metric(**datum))\n```\n\n## Run an evaluation\n\nIf you want to run an evaluation on a dataset, you can use the `EvaluationRunner` class.\n\n```python\nfrom time import perf_counter\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import EvaluationRunner, SingleModulePipeline\nfrom continuous_eval.eval.tests import GreaterOrEqualThan\nfrom continuous_eval.metrics.retrieval import (\n    PrecisionRecallF1,\n    RankedRetrievalMetrics,\n)\n\n\ndef main():\n    # Let's download the retrieval dataset example\n    dataset = example_data_downloader(\"retrieval\")\n\n    # Setup evaluation pipeline (i.e., dataset, metrics and tests)\n    pipeline = SingleModulePipeline(\n        dataset=dataset,\n        eval=[\n            PrecisionRecallF1().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n            RankedRetrievalMetrics().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n        ],\n        tests=[\n            GreaterOrEqualThan(\n                
test_name=\"Recall\", metric_name=\"context_recall\", min_value=0.8\n            ),\n        ],\n    )\n\n    # Start the evaluation manager and run the metrics (and tests)\n    tic = perf_counter()\n    runner = EvaluationRunner(pipeline)\n    eval_results = runner.evaluate()\n    toc = perf_counter()\n    print(\"Evaluation results:\")\n    print(eval_results.aggregate())\n    print(f\"Elapsed time: {toc - tic:.2f} seconds\\n\")\n\n    print(\"Running tests...\")\n    test_results = runner.test(eval_results)\n    print(test_results)\n\n\nif __name__ == \"__main__\":\n    # It is important to run this script in a new process to avoid\n    # multiprocessing issues\n    main()\n```\n\n## Run evaluation on a pipeline (modular evaluation)\n\nSometimes the system is composed of multiple modules, each with its own metrics and tests.\nContinuous-eval supports this use case by allowing you to define modules in your pipeline and select corresponding metrics.\n\n```python\nfrom typing import Any, Dict, List\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import (\n    Dataset,\n    EvaluationRunner,\n    Module,\n    ModuleOutput,\n    Pipeline,\n)\nfrom continuous_eval.eval.result_types import PipelineResults\nfrom continuous_eval.metrics.generation.text import AnswerCorrectness\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics\n\n\ndef page_content(docs: List[Dict[str, Any]]) -> List[str]:\n    # Extract the content of the retrieved documents from the pipeline results\n    return [doc[\"page_content\"] for doc in docs]\n\n\ndef main():\n    dataset: Dataset = example_data_downloader(\"graham_essays\u002Fsmall\u002Fdataset\")\n    results: Dict = example_data_downloader(\"graham_essays\u002Fsmall\u002Fresults\")\n\n    # Simple 3-step RAG pipeline with Retriever->Reranker->Generation\n    retriever = Module(\n        name=\"retriever\",\n        input=dataset.question,\n        output=List[str],\n        eval=[\n            PrecisionRecallF1().use(\n                retrieved_context=ModuleOutput(page_content),  # specify how to extract what we need (i.e., page_content)\n                ground_truth_context=dataset.ground_truth_context,\n            ),\n        ],\n    )\n\n    reranker = Module(\n        name=\"reranker\",\n        input=retriever,\n        output=List[Dict[str, str]],\n        eval=[\n            RankedRetrievalMetrics().use(\n                retrieved_context=ModuleOutput(page_content),\n                ground_truth_context=dataset.ground_truth_context,\n            ),\n        ],\n    )\n\n    llm = Module(\n        name=\"llm\",\n        input=reranker,\n        output=str,\n        eval=[\n            AnswerCorrectness().use(\n                question=dataset.question,\n                answer=ModuleOutput(),\n                ground_truth_answers=dataset.ground_truth_answers,\n            ),\n        ],\n    )\n\n    pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)\n    print(pipeline.graph_repr())  # visualize the pipeline in marmaid format\n\n    runner = EvaluationRunner(pipeline)\n    eval_results = runner.evaluate(PipelineResults.from_dict(results))\n    print(eval_results.aggregate())\n\n\nif __name__ == \"__main__\":\n    main()\n```\n\n> Note: it is important to wrap your code in a main function (with the `if __name__ == \"__main__\":` guard) to make sure the parallelization works properly.\n\n## Custom Metrics\n\nThere are several ways to create custom metrics, see 
the [Custom Metrics](https:\u002F\u002Fcontinuous-eval.docs.relari.ai\u002Fv0.3\u002Fmetrics\u002Foverview) section in the docs.\n\nThe simplest way is to leverage the `CustomMetric` class to create an LLM-as-a-Judge metric.\n\n```python\nfrom continuous_eval.metrics.base.metric import Arg, Field\nfrom continuous_eval.metrics.custom import CustomMetric\nfrom typing import List\n\ncriteria = \"Check that the generated answer does not contain PII or other sensitive information.\"\nrubric = \"\"\"Use the following rubric to assign a score to the answer:\n- Yes: The answer contains PII or other sensitive information.\n- No: The answer does not contain PII or other sensitive information.\n\"\"\"\n\nmetric = CustomMetric(\n    name=\"PIICheck\",\n    criteria=criteria,\n    rubric=rubric,\n    arguments={\"answer\": Arg(type=str, description=\"The answer to evaluate.\")},\n    response_format={\n        \"reasoning\": Field(\n            type=str,\n            description=\"The reasoning for the score given to the answer\",\n        ),\n        \"score\": Field(\n            type=str, description=\"The score of the answer: Yes or No\"\n        ),\n        \"identifies\": Field(\n            type=List[str],\n            description=\"The PII or other sensitive information identified in the answer\",\n        ),\n    },\n)\n\n# Calculate the metric on a sample answer\nprint(metric(answer=\"John Doe resides at 123 Main Street, Springfield.\"))\n```\n\n## 💡 Contributing\n\nInterested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.\n\n## Resources\n\n- **Docs:** [link](https:\u002F\u002Fcontinuous-eval.docs.relari.ai\u002F)\n- **Examples Repo**: [end-to-end example repo](https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fexamples)\n- **Blog Posts:**\n  - Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https:\u002F\u002Fmedium.com\u002Frelari\u002Fa-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https:\u002F\u002Fmedium.com\u002Frelari\u002Fa-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)\n  - How important is a Golden Dataset for LLM evaluation? [(link)](https:\u002F\u002Fmedium.com\u002Frelari\u002Fhow-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)\n  - How to evaluate complex GenAI Apps: a granular approach [(link)](https:\u002F\u002Fmedium.com\u002Frelari\u002Fhow-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)\n  - How to Make the Most Out of LLM Production Data: Simulated User Feedback [(link)](https:\u002F\u002Fmedium.com\u002Ftowards-data-science\u002Fhow-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)\n  - Generate Synthetic Data to Test LLM Applications [(link)](https:\u002F\u002Fmedium.com\u002Frelari\u002Fgenerate-synthetic-data-to-test-llm-applications-4bffeb51b80e)\n- **Discord:** Join our community of LLM developers [Discord](https:\u002F\u002Fdiscord.gg\u002FGJnM8SRsHr)\n- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https:\u002F\u002Fcal.com\u002Frelari\u002Fintro)\n\n## License\n\nThis project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.\n\n## Open Analytics\n\nWe monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement.
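\nTracking is controlled by the `CONTINUOUS_EVAL_DO_NOT_TRACK` environment variable described below. For example, to opt out for a single process, a minimal sketch (the variable name is documented; setting it in-process before import is an assumption):\n\n```python\nimport os\n\n# Documented opt-out flag; set it before continuous_eval is imported\nos.environ[\"CONTINUOUS_EVAL_DO_NOT_TRACK\"] = \"true\"\n\nimport continuous_eval  # telemetry is disabled for this process\n```\n\nYou can take a look at exactly what we track in the [telemetry 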
code](continuous_eval\u002Futils\u002Ftelemetry.py)\n\nTo disable usage-tracking you set the `CONTINUOUS_EVAL_DO_NOT_TRACK` flag to `true`.\n","\u003Ch3 align=\"center\">\n  \u003Cimg\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frelari-ai_continuous-eval_readme_c1b20125a0df.png\"\n    width=\"350\"\n  >\n\u003C\u002Fh3>\n\n\u003Cdiv align=\"center\">\n\n  \n  \u003Ca href=\"https:\u002F\u002Fdocs.relari.ai\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-view-blue\" alt=\"文档\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fcontinuous-eval.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Freleases\">![https:\u002F\u002FGitHub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Freleases](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Frelari-ai\u002Fcontinuous-eval)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\u002F\">![https:\u002F\u002Fgithub.com\u002FNaereen\u002Fbadges\u002F](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frelari-ai_continuous-eval_readme_6c80bcaa3db1.png)\u003C\u002Fa>\n  \u003Ca a href=\"https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fblob\u002Fmain\u002FLICENSE\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontinuous-eval\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Fcontinuous-eval.svg)\u003C\u002Fa>\n\n\n\u003C\u002Fdiv>\n\n\u003Ch2 align=\"center\">\n  \u003Cp>面向大语言模型应用的数据驱动评估\u003C\u002Fp>\n\u003C\u002Fh2>\n\n\n\n## 概述\n\n`continuous-eval` 是一个开源工具包，专为基于大语言模型的应用程序提供数据驱动的评估而设计。\n\n\u003Ch1 align=\"center\">\n  \u003Cimg\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frelari-ai_continuous-eval_readme_ae662cf7180c.png\"\n  >\n\u003C\u002Fh1>\n\n## continuous-eval 有何不同？\n\n- **模块化评估**：使用量身定制的指标来衡量流水线中的每个模块。\n  \n- **全面的指标库**：涵盖检索增强生成（RAG）、代码生成、智能体工具使用、分类等多种大语言模型应用场景。支持混合使用确定性、语义和基于大语言模型的指标。\n\n- **概率性评估**：通过概率性指标对您的流水线进行评估。\n\n## 快速入门\n\n该代码以 PyPI 包的形式提供。要安装它，请运行以下命令：\n\n```bash\npython3 -m pip install continuous-eval\n```\n\n如果您想从源码安装：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval.git && cd continuous-eval\npoetry install --all-extras\n```\n\n要运行基于大语言模型的指标，代码需要在 `.env` 文件中至少配置一个大语言模型的 API 密钥。请参阅示例 `.env.example` 文件。\n\n## 运行单个指标\n\n以下是针对单个数据点运行单个指标的方法。所有可用指标请参见：[链接](https:\u002F\u002Fcontinuous-eval.docs.relari.ai\u002F)\n\n```python\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1\n\ndatum = {\n    \"question\": \"法国的首都是什么？\",\n    \"retrieved_context\": [\n        \"巴黎是法国的首都和最大城市。\",\n        \"里昂是法国的主要城市。\",\n    ],\n    \"ground_truth_context\": [\"巴黎是法国的首都。\"],\n    \"answer\": \"巴黎\",\n    \"ground_truths\": [\"巴黎\"],\n}\n\nmetric = PrecisionRecallF1()\n\nprint(metric(**datum))\n```\n\n## 运行评估\n\n如果您想对一个数据集进行评估，可以使用 `EvaluationRunner` 类。\n\n```python\nfrom time import perf_counter\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import EvaluationRunner, SingleModulePipeline\nfrom continuous_eval.eval.tests import GreaterOrEqualThan\nfrom continuous_eval.metrics.retrieval import (\n    PrecisionRecallF1,\n    RankedRetrievalMetrics,\n)\n\n\ndef main():\n    # 下载检索数据集示例\n    dataset = 
example_data_downloader(\"retrieval\")\n\n    # 设置评估流水线（即数据集、指标和测试）\n    pipeline = SingleModulePipeline(\n        dataset=dataset,\n        eval=[\n            PrecisionRecallF1().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n            RankedRetrievalMetrics().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n        ],\n        tests=[\n            GreaterOrEqualThan(\n                test_name=\"召回率\", metric_name=\"context_recall\", min_value=0.8\n            ),\n        ],\n    )\n\n    # 启动评估管理器并运行指标（及测试）\n    tic = perf_counter()\n    runner = EvaluationRunner(pipeline)\n    eval_results = runner.evaluate()\n    toc = perf_counter()\n    print(\"评估结果：\")\n    print(eval_results.aggregate())\n    print(f\"耗时：{toc - tic:.2f}秒\\n\")\n\n    print(\"正在运行测试...\")\n    test_results = runner.test(eval_results)\n    print(test_results)\n\n\nif __name__ == \"__main__\":\n    # 为了避免多进程问题，务必在新进程中运行此脚本\n    main()\n```\n\n## 在流水线上运行评估（模块化评估）\n\n有时，系统由多个模块组成，每个模块都有自己的指标和测试。\nContinuous-eval 通过允许你在流水线中定义模块并选择相应的指标来支持这种用例。\n\n```python\nfrom typing import Any, Dict, List\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import (\n    Dataset,\n    EvaluationRunner,\n    Module,\n    ModuleOutput,\n    Pipeline,\n)\nfrom continuous_eval.eval.result_types import PipelineResults\nfrom continuous_eval.metrics.generation.text import AnswerCorrectness\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics\n\n\ndef page_content(docs: List[Dict[str, Any]]) -> List[str]:\n    # 从流水线结果中提取检索到的文档内容\n    return [doc[\"page_content\"] for doc in docs]\n\n\ndef main():\n    dataset: Dataset = example_data_downloader(\"graham_essays\u002Fsmall\u002Fdataset\")\n    results: Dict = example_data_downloader(\"graham_essays\u002Fsmall\u002Fresults\")\n\n    # 简单的三步 RAG 流水线：检索器->重排序器->生成\n    retriever = Module(\n        name=\"retriever\",\n        input=dataset.question,\n        output=List[str],\n        eval=[\n            PrecisionRecallF1().use(\n                retrieved_context=ModuleOutput(page_content),  # 指定如何提取所需内容（即 page_content）\n                ground_truth_context=dataset.ground_truth_context,\n            ),\n        ],\n    )\n\n    reranker = Module(\n        name=\"reranker\",\n        input=retriever,\n        output=List[Dict[str, str]],\n        eval=[\n            RankedRetrievalMetrics().use(\n                retrieved_context=ModuleOutput(page_content),\n                ground_truth_context=dataset.ground_truth_context,\n            ),\n        ],\n    )\n\n    llm = Module(\n        name=\"llm\",\n        input=reranker,\n        output=str,\n        eval=[\n            AnswerCorrectness().use(\n                question=dataset.question,\n                answer=ModuleOutput(),\n                ground_truth_answers=dataset.ground_truth_answers,\n            ),\n        ],\n    )\n\n    pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)\n    print(pipeline.graph_repr())  # 以 Marmaid 格式可视化流水线\n\n    runner = EvaluationRunner(pipeline)\n    eval_results = runner.evaluate(PipelineResults.from_dict(results))\n    print(eval_results.aggregate())\n\n\nif __name__ == \"__main__\":\n    main()\n```\n\n> 注意：将代码包裹在 `main` 函数中（使用 `if __name__ == \"__main__\":` 保护）非常重要，以确保并行化正常工作。\n\n## 
自定义指标\n\n创建自定义指标有多种方法，请参阅文档中的[自定义指标](https:\u002F\u002Fcontinuous-eval.docs.relari.ai\u002Fv0.3\u002Fmetrics\u002Foverview)部分。\n\n最简单的方法是利用 `CustomMetric` 类来创建一个“LLM 作为裁判”的自定义指标。\n\n```python\nfrom continuous_eval.metrics.base.metric import Arg, Field\nfrom continuous_eval.metrics.custom import CustomMetric\nfrom typing import List\n\ncriteria = \"检查生成的答案是否包含 PII 或其他敏感信息。\"\nrubric = \"\"\"请使用以下评分标准对答案进行打分：\n- 是：答案包含 PII 或其他敏感信息。\n- 否：答案不包含 PII 或其他敏感信息。\n\"\"\"\n\nmetric = CustomMetric(\n    name=\"PIICheck\",\n    criteria=criteria,\n    rubric=rubric,\n    arguments={\"answer\": Arg(type=str, description=\"待评估的答案。\")},\n    response_format={\n        \"reasoning\": Field(\n            type=str,\n            description=\"给出该答案评分的理由\",\n        ),\n        \"score\": Field(\n            type=str, description=\"答案的评分：是或否\"\n        ),\n        \"identifies\": Field(\n            type=List[str],\n            description=\"在答案中识别出的 PII 或其他敏感信息\",\n        ),\n    },\n)\n\n# 对示例答案计算该指标\nprint(metric(answer=\"John Doe 居住在斯普林菲尔德市主街 123 号。\"))\n```\n\n## 💡 贡献\n\n有兴趣参与贡献吗？请查看我们的[贡献指南](CONTRIBUTING.md)，了解更多详情。\n\n## 资源\n\n- **文档**：[链接](https:\u002F\u002Fcontinuous-eval.docs.relari.ai\u002F)\n- **示例仓库**：[端到端示例仓库](https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fexamples)\n- **博客文章**：\n  - RAG 流水线评估实用指南：[第 1 部分：检索](https:\u002F\u002Fmedium.com\u002Frelari\u002Fa-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893)、[第 2 部分：生成](https:\u002F\u002Fmedium.com\u002Frelari\u002Fa-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)\n  - 黄金数据集对 LLM 评估有多重要？[(链接)](https:\u002F\u002Fmedium.com\u002Frelari\u002Fhow-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)\n  - 如何评估复杂的 GenAI 应用程序：一种细粒度的方法 [(链接)](https:\u002F\u002Fmedium.com\u002Frelari\u002Fhow-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)\n  - 如何充分利用 LLM 生产数据：模拟用户反馈 [(链接)](https:\u002F\u002Fmedium.com\u002Ftowards-data-science\u002Fhow-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)\n  - 生成合成数据以测试 LLM 应用程序 [(链接)](https:\u002F\u002Fmedium.com\u002Frelari\u002Fgenerate-synthetic-data-to-test-llm-applications-4bffeb51b80e)\n- **Discord**：加入我们的 LLM 开发者社区 [Discord](https:\u002F\u002Fdiscord.gg\u002FGJnM8SRsHr)\n- **联系创始人**：[电子邮件](mailto:founders@relari.ai) 或 [预约聊天](https:\u002F\u002Fcal.com\u002Frelari\u002Fintro)\n\n## 许可证\n\n本项目采用 Apache 2.0 许可证——详情请参阅 [LICENSE](LICENSE) 文件。\n\n## 开放式分析\n\n我们监控基本的匿名使用统计数据，以了解用户的偏好、为新功能提供参考，并识别可能需要改进的领域。\n你可以在[遥测代码](continuous_eval\u002Futils\u002Ftelemetry.py)中查看我们具体跟踪的内容。\n\n要禁用使用情况跟踪，只需将 `CONTINUOUS_EVAL_DO_NOT_TRACK` 标志设置为 `true`。","# continuous-eval 快速上手指南\n\n`continuous-eval` 是一个专为 LLM 驱动应用设计的开源数据驱动评估工具。它支持模块化评估、丰富的指标库（涵盖 RAG、代码生成、Agent 等场景）以及概率性评估。\n\n## 环境准备\n\n- **系统要求**：Python 3.8+\n- **前置依赖**：\n  - 若需运行基于 LLM 的评估指标，需准备至少一个 LLM API Key（如 OpenAI, Anthropic 等）。\n  - 请在项目根目录创建 `.env` 文件并配置密钥，参考示例文件 `.env.example`。\n\n## 安装步骤\n\n### 方式一：通过 PyPI 安装（推荐）\n\n```bash\npython3 -m pip install continuous-eval\n```\n\n> **国内加速建议**：如遇下载缓慢，可使用清华或阿里镜像源：\n> ```bash\n> python3 -m pip install continuous-eval -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n### 方式二：从源码安装\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval.git && cd continuous-eval\npoetry install --all-extras\n```
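\n安装完成后，可以用 Python 标准库快速确认包已正确安装（通用自检方法，并非项目自带命令）：\n\n```python\n# 打印已安装的 continuous-eval 版本，确认安装成功\nfrom importlib.metadata import version\n\nprint(version(\"continuous-eval\"))\n```\n\n## 基本使用\n\n### 1. 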
运行单个指标\n\n以下示例演示如何对单条数据运行 `PrecisionRecallF1` 检索指标：\n\n```python\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1\n\ndatum = {\n    \"question\": \"What is the capital of France?\",\n    \"retrieved_context\": [\n        \"Paris is the capital of France and its largest city.\",\n        \"Lyon is a major city in France.\",\n    ],\n    \"ground_truth_context\": [\"Paris is the capital of France.\"],\n    \"answer\": \"Paris\",\n    \"ground_truths\": [\"Paris\"],\n}\n\nmetric = PrecisionRecallF1()\n\nprint(metric(**datum))\n```\n\n### 2. 运行数据集评估\n\n使用 `EvaluationRunner` 对整个数据集进行评估，并运行预设测试（如召回率是否达标）：\n\n```python\nfrom time import perf_counter\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import EvaluationRunner, SingleModulePipeline\nfrom continuous_eval.eval.tests import GreaterOrEqualThan\nfrom continuous_eval.metrics.retrieval import (\n    PrecisionRecallF1,\n    RankedRetrievalMetrics,\n)\n\n\ndef main():\n    # 下载示例检索数据集\n    dataset = example_data_downloader(\"retrieval\")\n\n    # 配置评估流水线（数据集、指标和测试条件）\n    pipeline = SingleModulePipeline(\n        dataset=dataset,\n        eval=[\n            PrecisionRecallF1().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n            RankedRetrievalMetrics().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n        ],\n        tests=[\n            GreaterOrEqualThan(\n                test_name=\"Recall\", metric_name=\"context_recall\", min_value=0.8\n            ),\n        ],\n    )\n\n    # 启动评估管理器并运行\n    tic = perf_counter()\n    runner = EvaluationRunner(pipeline)\n    eval_results = runner.evaluate()\n    toc = perf_counter()\n    print(\"Evaluation results:\")\n    print(eval_results.aggregate())\n    print(f\"Elapsed time: {toc - tic:.2f} seconds\\n\")\n\n    print(\"Running tests...\")\n    test_results = runner.test(eval_results)\n    print(test_results)\n\n\nif __name__ == \"__main__\":\n    # 必须使用 main 函数包裹以确保多进程正常工作\n    main()\n```\n\n### 3. 
自定义指标（LLM-as-a-Judge）\n\n通过 `CustomMetric` 快速创建基于 LLM 的自定义评估逻辑，例如检查回答中是否包含敏感信息（PII）：\n\n```python\nfrom continuous_eval.metrics.base.metric import Arg, Field\nfrom continuous_eval.metrics.custom import CustomMetric\nfrom typing import List\n\ncriteria = \"Check that the generated answer does not contain PII or other sensitive information.\"\nrubric = \"\"\"Use the following rubric to assign a score to the answer:\n- Yes: The answer contains PII or other sensitive information.\n- No: The answer does not contain PII or other sensitive information.\n\"\"\"\n\nmetric = CustomMetric(\n    name=\"PIICheck\",\n    criteria=criteria,\n    rubric=rubric,\n    arguments={\"answer\": Arg(type=str, description=\"The answer to evaluate.\")},\n    response_format={\n        \"reasoning\": Field(\n            type=str,\n            description=\"The reasoning for the score given to the answer\",\n        ),\n        \"score\": Field(\n            type=str, description=\"The score of the answer: Yes or No\"\n        ),\n        \"identifies\": Field(\n            type=List[str],\n            description=\"The PII or other sensitive information identified in the answer\",\n        ),\n    },\n)\n\n# 对示例答案计算该指标\nprint(metric(answer=\"John Doe resides at 123 Main Street, Springfield.\"))\n```\n\n> **注意**：涉及多进程评估时，请务必将执行代码包裹在 `if __name__ == \"__main__\":` 块中。","某电商公司的 AI 团队正在迭代其智能客服系统，该系统基于 RAG（检索增强生成）架构，需要从海量商品文档中检索信息并回答用户咨询。\n\n### 没有 continuous-eval 时\n- **评估黑盒化**：团队只能凭感觉或人工抽检判断效果，无法量化每个模块（如检索器、生成器）的具体性能瓶颈。\n- **指标单一僵化**：仅依赖简单的字符串匹配准确率，无法识别语义相似但措辞不同的正确回答，导致误判率高。\n- **回归风险大**：每次更新模型或提示词后，缺乏自动化测试基准，难以发现新引入的隐性错误，常引发线上事故。\n- **调试效率低**：面对坏案例（Bad Cases），开发人员需手动编写脚本分析，耗时耗力且难以复现问题根源。\n\n### 使用 continuous-eval 后\n- **模块化透视**：利用模块化评估功能，团队能分别监控“检索召回率”和“答案忠实度”，精准定位是文档没找对还是模型答偏了。\n- **多维语义度量**：引入语义相似度及 LLM 驱动的评估指标，自动识别含义正确但表述不同的回答，评估结果更符合人类直觉。\n- **持续集成防护**：将评估流水线嵌入 CI\u002FCD，设定阈值测试（如召回率不得低于 0.8），一旦指标下滑自动阻断发布，确保版本稳定。\n- **数据驱动迭代**：通过概率性评估快速扫描大规模测试集，自动生成坏案例报告，指导团队针对性优化数据或策略。\n\ncontinuous-eval 将原本模糊的模型效果转化为可度量、可监控的数据指标，让 LLM 应用的迭代从“凭经验猜”转变为“靠数据改”。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frelari-ai_continuous-eval_7e44353d.png","relari-ai","Relari.ai","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Frelari-ai_aa75cc12.png","",null,"founders@relari.ai","relariai","relari.ai","https:\u002F\u002Fgithub.com\u002Frelari-ai",[82,86],{"name":83,"color":84,"percentage":85},"Python","#3572A5",95,{"name":87,"color":88,"percentage":89},"Jinja","#a52a22",5,516,38,"2026-04-17T07:21:48","Apache-2.0","未说明",{"notes":96,"python":97,"dependencies":98},"该工具是一个用于评估 LLM 应用的 Python 包。若需运行基于 LLM 的评估指标，必须在 .env 文件中配置至少一个 LLM 服务的 API 密钥。建议使用 poetry 进行源码安装以获取所有额外功能。代码包含并行处理逻辑，运行脚本时需包裹在 if __name__ == '__main__': 保护块中。","3.x (通过 PyPI badge 推断支持 Python 3)",[99,100],"poetry (用于源码安装)","LLM API Key (运行基于 LLM 的指标时必需)",[35,102,14],"其他",[104,105,106,107,108,109,110],"evaluation-framework","evaluation-metrics","information-retrieval","llm-evaluation","llmops","rag","retrieval-augmented-generation","2026-03-27T02:49:30.150509","2026-04-18T02:20:45.491490",[114,119,124,129,134,139,144,149],{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},38613,"是否支持 Cohere LLM 模型？","是的，Cohere LLM 的支持已被添加。该功能请求已在 Issue #46 中得到解决并合并到主分支中。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F35",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},38611,"是否支持使用 Azure OpenAI 端点而不是默认的 OpenAI 端点？","是的，项目已添加对 Azure OpenAI 端点的支持。维护者确认该功能已在相关 PR（#32）中实现并合并，用户现在可以通过 LLMFactory 配置使用 
Azure OpenAI。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F31",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},38612,"如何创建自定义评估指标（Custom Metric）？","项目已支持自定义指标的创建。该功能已在提交记录 `b9457ba` 中实现，用户可以参考相关代码或文档来定义自己的评估逻辑。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F13",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},38610,"遇到 LangChain 依赖报错 'Not Found' 或提示需要安装 langchain-community 怎么办？","这是因为 LangChain 更新了包结构。解决方案是将导入语句切换为 `langchain_community`。具体操作是运行命令 `pip install -U langchain-community` 进行安装，并在代码中使用 `from langchain_community.chat_models import AzureChatOpenAI` 等新的导入路径。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F52",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},38614,"DebertaAnswerScores 的 batch_calculate 方法输出不一致怎么办？","这是一个已知问题，`calculate` 方法返回归一化概率，而 `batch_calculate` 此前返回的是 logits。该问题已被修复，现在确保两个方法的输出保持一致（均返回归一化概率\u002FSoftmax 输出）。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F6",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},38615,"没有 OpenAI API Key 时程序是否会强制报错？","不会强制报错。修复后，系统仅在真正需要使用 OpenAI 功能时才会检查并抛出缺少 Key 的错误，如果使用的是其他不需要 OpenAI Key 的功能或模型，则不会报错。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F17",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},38616,"哪里可以找到项目的贡献指南（CONTRIBUTING.md）？","项目已添加 CONTRIBUTING.md 文件。该文档已在提交记录 `9794f32` 中发布，用户可以查看该文件了解如何参与项目贡献。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F40",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},38617,"生成确定性指标中的 'Blue' 是指什么？","这是一个拼写错误，原本指的是 'BLEU' (Bilingual Evaluation Understudy) 指标。该拼写错误已在 Issue #20 中被修复，代码中已更正为 BLEU。","https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fissues\u002F18",[155,160,165,170,175,180,185,190,195,200,205],{"id":156,"version":157,"summary_zh":158,"released_at":159},314519,"v0.3.14","## 变更内容\n* llm factory 更新，由 @yisz 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F76 中完成\n* feature\u002Fprobabilistic metrics，由 @pantonante 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F77 中完成\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.3.13...v0.3.14","2025-01-10T19:54:47",{"id":161,"version":162,"summary_zh":163,"released_at":164},314520,"v0.3.13","## 变更内容\n* 由 @pantonante 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F74 中添加了 TokenCount 指标\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.3.11...v0.3.13","2024-08-04T23:02:36",{"id":166,"version":167,"summary_zh":168,"released_at":169},314521,"v0.3.11","## 变更内容\n* 修复了 issue #69，由 @kelvinchanwh 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F73 中完成\n* 修复了 issue #71，由 @kelvinchanwh 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F72 中完成\n* 修复了数据集评估示例中 retrieved_contexts 的键名问题，由 @jmartisk 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F68 中完成\n\n## 新贡献者\n* @kelvinchanwh 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F73 中完成了首次贡献\n* @jmartisk 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F68 中完成了首次贡献\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.3.10...v0.3.11","2024-06-18T18:44:53",{"id":171,"version":172,"summary_zh":173,"released_at":174},314522,"v0.3.10","## 变更内容\n* 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F66 中为 LLM 接口添加了 `max_tokens` 字段\n* 修复进度条总是少 1 的问题 | 更新示例中的生成器为 4o，详见 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F67\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.3.9...v0.3.10","2024-06-02T16:12:04",{"id":176,"version":177,"summary_zh":178,"released_at":179},314523,"v0.3.9","## 变更内容\n* 修复了 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F58 中示例文档中的上下文属性名称\n* 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F61 中过渡到评估运行器\n* 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F63 中添加了 SQL 指标\n\n## 新贡献者\n* @LucasLeRay 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F58 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.3.7...v0.3.9","2024-05-23T19:53:21",{"id":181,"version":182,"summary_zh":183,"released_at":184},314524,"v0.3.7","## 变更内容\n* 修复了 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F55 中精确率\u002F平均精确率的双重计数边缘情况。\n* 修复了 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F56 中代码字符串必须使用关键字 `ground_truth_answers` 的问题。\n\n## 新贡献者\n* @stantonius 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F56 中完成了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.3.5...v0.3.7","2024-04-25T01:42:26",{"id":186,"version":187,"summary_zh":188,"released_at":189},314525,"v0.3.5","- 添加基础大模型提供商\r\n- 修复 bug","2024-03-27T06:06:42",{"id":191,"version":192,"summary_zh":193,"released_at":194},314526,"v0.3.4","## 变更内容\n* 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F46 中添加了 Cohere LLM\n* 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F48 中新增了 LLM 自定义指标功能\n* 功能：在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F49 中添加了 `vllm` 的 OpenAI API 端点\n\n## 新贡献者\n* @joennlae 在 https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F49 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.3.2...v0.3.4","2024-03-20T05:26:25",{"id":196,"version":197,"summary_zh":198,"released_at":199},314527,"v0.3.2","- 指标批量执行现在默认使用线程\n- 修复 bug","2024-03-08T09:00:52",{"id":201,"version":202,"summary_zh":203,"released_at":204},314528,"v0.3.1","要点：\n\n- 在 `Dataset` 类中添加了 `from_data` 类方法\n- 修复了 `EvaluationResults`、`MetricsResults` 和 `TestResults` 中的 `is_empty` 方法\n- 在基于大模型的指标中添加了错误处理","2024-02-29T19:38:02",{"id":206,"version":207,"summary_zh":208,"released_at":209},314529,"v0.2.7","## What's Changed\r\n* Added Code Evaluation Metrics in https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fpull\u002F29\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Frelari-ai\u002Fcontinuous-eval\u002Fcompare\u002Fv0.2.6...v0.2.7","2024-02-16T22:19:28"]