[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-potsawee--selfcheckgpt":3,"tool-potsawee--selfcheckgpt":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 
50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":10,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":105,"github_topics":82,"view_count":10,"oss_zip_url":82,"oss_zip_packed_at":82,"status":16,"created_at":106,"updated_at":107,"faqs":108,"releases":139},1022,"potsawee\u002Fselfcheckgpt","selfcheckgpt","SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models","SelfCheckGPT 是一个专门检测生成式大语言模型\"幻觉\"的开源工具。所谓\"幻觉\"，就是 AI 在回答问题时可能生成看似合理但实际错误或无中生有的信息。这个工具最大的亮点是**零资源**和**黑盒**检测——既不需要外部知识库，也无需访问模型的内部参数，就能识别出哪些句子可能不靠谱。\n\n它的工作原理很巧妙：让同一个模型对相同问题生成多个回答，通过检查这些回答之间的一致性来判断信息可信度。比如，如果多次回答都提到\"迈克尔·韦纳出生于1942年3月31日\"，这个事实就很可能可靠；如果只出现一次，就需要打个问号。\n\nSelfCheckGPT 提供了多种检测方法（包括问答、BERTScore、n-gram 等），能输出句子级别的可信度评分（0到1之间，分数越高越可能不真实）。工具已被 EMNLP 2023 会议接收，使用也非常简单，只需一行 pip 命令即可安装。\n\n这个工具特别适合 AI 开发者和研究人员，帮助他们评估和提升大语言模型生成内容的准确性，构建更可靠的 AI 应用。","SelfCheckGPT\n=====================================================\n[![arxiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2303.08896-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08896)\n[![PyPI version selfcheckgpt](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fselfcheckgpt.svg?kill_cache=1)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fselfcheckgpt\u002F)\n[![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpotsawee_selfcheckgpt_readme_22f09fee23f8.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fselfcheckgpt)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n- Project page for our paper \"[SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08896)\"\n- We investigated several variants of the selfcheck approach: BERTScore, Question-Answering, n-gram, NLI, and LLM-Prompting. 
\n- [Nov 2023] SelfCheckGPT-NLI Calibration Analysis thanks to Daniel Huynh [\\[Link to Article\\]](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fdhuynh95\u002Fautomatic-hallucination-detection)\n- [Oct 2023] The paper is accepted and to appear at EMNLP 2023 [\\[Poster\\]](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1EzQ3MdmrF0gM-83UV2OQ6_QR1RuvhJ9h\u002Fview?usp=drive_link)\n- [Aug 2023] Slides from ML Collective Talk [\\[Link to Slides\\]](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13LUBPUm4y1nlKigZxXHn7Cl2lw5KuGbc\u002Fview)\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpotsawee_selfcheckgpt_readme_b243cc0c4174.png)\n\n## Code\u002FPackage\n\n### Installation\n\n    pip install selfcheckgpt\n\n### SelfCheckGPT Usage: BERTScore, QA, n-gram\n\nThere are three variants of SelfCheck scores in this package as described in the paper: `SelfCheckBERTScore()`, `SelfCheckMQAG()`, `SelfCheckNgram()`. All of the variants have `predict()` which will output the sentence-level scores w.r.t. sampled passages. You can use packages such as spacy to split passage into sentences. For reproducibility, you can set `torch.manual_seed` before calling this function. See more details in Jupyter Notebook [```demo\u002FSelfCheck_demo1.ipynb```](demo\u002FSelfCheck_demo1.ipynb)\n\n```python\n# Include necessary packages (torch, spacy, ...)\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nselfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available\nselfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)\nselfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.\n\n# LLM's text (e.g. GPT-3 response) to be evaluated at the sentence level  & Split it into sentences\npassage = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation.\"\nsentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization\nprint(sentences)\n['Michael Alan Weiner (born March 31, 1942) is an American radio host.', 'He is the host of The Savage Nation.']\n\n# Other samples generated by the same LLM to perform self-check for consistency\nsample1 = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.\"\nsample2 = \"Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.\"\nsample3 = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT.\"\n\n# --------------------------------------------------------------------------------------------------------------- #\n# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual\n# Additional params for each scoring_method:\n# -> counting: AT (answerability threshold, i.e. 
questions with answerability_score \u003C AT are rejected)\n# -> bayes: AT, beta1, beta2\n# -> bayes_with_alpha: beta1, beta2\nsent_scores_mqag = selfcheck_mqag.predict(\n    sentences = sentences,               # list of sentences\n    passage = passage,                   # passage (before sentence-split)\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n    num_questions_per_sent = 5,          # number of questions to be drawn  \n    scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'\n    beta1 = 0.8, beta2 = 0.8,            # additional params depending on scoring_method\n)\nprint(sent_scores_mqag)\n# [0.30990949 0.42376232]\n\n# --------------------------------------------------------------------------------------------------------------- #\n# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual\nsent_scores_bertscore = selfcheck_bertscore.predict(\n    sentences = sentences,                          # list of sentences\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n)\nprint(sent_scores_bertscore)\n# [0.0695562  0.45590915]\n\n# --------------------------------------------------------------------------------------------------------------- #\n# SelfCheck-Ngram: Score at sentence- and document-level where value is in [0.0, +inf) and high value means non-factual\n# as opposed to SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded\nsent_scores_ngram = selfcheck_ngram.predict(\n    sentences = sentences,   \n    passage = passage,\n    sampled_passages = [sample1, sample2, sample3],\n)\nprint(sent_scores_ngram)\n# {'sent_level': { # sentence-level score similar to MQAG and BERTScore variant\n#     'avg_neg_logprob': [3.184312, 3.279774],\n#     'max_neg_logprob': [3.476098, 4.574710]\n#     },\n#  'doc_level': {  # document-level score such that avg_neg_logprob is computed over all tokens\n#     'avg_neg_logprob': 3.218678904916201,\n#     'avg_max_neg_logprob': 4.025404834169327\n#     }\n# }\n```\n\n### SelfCheckGPT Usage: NLI (recommended)\n\nEntailment (or Contradiction) score with input being the sentence and a sampled passage can be used as the selfcheck score. We use DeBERTa-v3-large fine-tuned to Multi-NLI, and we normalize the probability of \"entailment\" or \"contradiction\" classes, and take Prob(contradiction) as the score.\n\n```python\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckNLI\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nselfcheck_nli = SelfCheckNLI(device=device) # set device to 'cuda' if GPU is available\n\nsent_scores_nli = selfcheck_nli.predict(\n    sentences = sentences,                          # list of sentences\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n)\nprint(sent_scores_nli)\n# [0.334014 0.975106 ] -- based on the example above\n```\n\n### SelfCheckGPT Usage: LLM Prompt\n\nPrompting an LLM (Llama2, Mistral, OpenAI's GPT) to assess information consistency in a zero-shot setup. We query an LLM to assess whether the i-th sentence is supported by the sample (as the context). Similar to other methods, a higher score indicates higher chance of being hallucination. 
An example when using Mistral is below:\n\n\n```python\n# Option1: open-source model\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nllm_model = \"mistralai\u002FMistral-7B-Instruct-v0.2\"\nselfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)\n\n# Option2: API access \n# (currently only support OpenAI and Groq)\n# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt\n# selfcheck_prompt = SelfCheckAPIPrompt(client_type=\"openai\", model=\"gpt-3.5-turbo\")\n# selfcheck_prompt = SelfCheckAPIPrompt(client_type=\"groq\", model=\"llama3-70b-8192\", api_key=\"your-api-key\")\n\nsent_scores_prompt = selfcheck_prompt.predict(\n    sentences = sentences,                          # list of sentences\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n    verbose = True, # whether to show a progress bar\n)\nprint(sent_scores_prompt)\n# [0.33333333, 0.66666667] -- based on the example above\n```\n\nThe LLM can be any model available on HuggingFace. The default prompt template is `Context: {context}\\n\\nSentence: {sentence}\\n\\nIs the sentence supported by the context above? Answer Yes or No.\\n\\nAnswer: `, but you can change it using `selfcheck_prompt.set_prompt_template(new_prompt)`.\n\n\nMost models (gpt-3.5-turbo, Llama2, Mistral) will output either 'Yes' or 'No' >95% of the time, while any remaining outputs can be set to N\u002FA. The output is converted to score: Yes -> 0.0, No -> 1.0, N\u002FA -> 0.5. The inconsistency score is then calculated by averaging.\n\n## Dataset\nThe `wiki_bio_gpt3_hallucination` dataset currently consists of 238 annotated passages (`v3`). You can find more information in the paper or our data card on HuggingFace: https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fpotsawee\u002Fwiki_bio_gpt3_hallucination. To use this dataset, you can either load it through HuggingFace dataset API, or download it directly from below in the JSON format.\n\n### Update\nWe've annotated GPT-3 wikibio passages further, and now the dataset consists of 238 annotated passages. Here is [the link](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1N3_ZQmr9yBbsOP2JCpgiea9oiNIu78Xw\u002Fview?usp=sharing) for the IDs of the first 65 passages in the `v1`.\n\n### Option1: HuggingFace\n\n```python\nfrom datasets import load_dataset\ndataset = load_dataset(\"potsawee\u002Fwiki_bio_gpt3_hallucination\")\n```\n\n### Option2: Manual Download\nDownload from our [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1AyQ7u9nYlZgUZLm5JBDx6cFFWB__EsNv\u002Fview?usp=share_link), then you can load it in python:\n\n```python\nimport json\nwith open(\"dataset.json\", \"r\") as f:\n    content = f.read()\ndataset = json.loads(content)\n```\n\nEach instance consists of:\n- `gpt3_text`: GPT-3 generated passage\n- `wiki_bio_text`: Actual Wikipedia passage (first paragraph)\n- `gpt3_sentences`: `gpt3_text` split into sentences using `spacy`\n- `annotation`: human annotation at the sentence level\n-  `wiki_bio_test_idx`: ID of the concept\u002Findividual from the original wikibio dataset (testset)\n-  `gpt3_text_samples`: list of sampled passages (do_sample = True & temperature = 1.0)\n\n## Experiments\n\n### Probability-based baselines (e.g. GPT-3's probabilities)\n\nAs described in our paper, probabilities (and generation entropies) of the generative LLM can be used to measure its confidence. 
Check our example\u002Fimplementation of this approach in [```demo\u002Fexperiments\u002Fprobability-based-baselines.ipynb```](demo\u002Fexperiments\u002Fprobability-based-baselines.ipynb)\n\n### Experimental Results\n- Full details can be found in our paper.\n- Note that our new results show that LLMs such as GPT-3 (text-davinci-003) or ChatGPT (gpt-3.5-turbo) are good at text inconsistency assessment. Based on this finding, we try **SelfCheckGPT-Prompt** where each sentence (to be evaluated) is compared against each and every sampled_passage by prompting ChatGPT. SelfCheckGPT-Prompt is the best-performing method.\n\nResults on the `wiki_bio_gpt3_hallucination` dataset.\n\n| Method               |  NonFact (AUC-PR)  |  Factual (AUC-PR)  |   Ranking (PCC)   |\n|----------------------|:------------------:|:------------------:|:-----------------:|\n| Random Guessing      |        72.96       |        27.04       |         -         |\n| GPT-3 Avg(-logP)     |        83.21       |        53.97       |       57.04       |\n| SelfCheck-BERTScore  |        81.96       |        44.23       |       58.18       |\n| SelfCheck-QA         |        84.26       |        48.14       |       61.07       |\n| SelfCheck-Unigram    |        85.63       |        58.47       |       64.71       |\n| SelfCheck-NLI        |        92.50       |        66.08       |       74.14       |\n| SelfCheck-Prompt (Llama2-7B-chat)        |        89.05       |        63.06       |       61.52       |\n| SelfCheck-Prompt (Llama2-13B-chat)        |        91.91       |        64.34       |       75.44       |\n| SelfCheck-Prompt (Mistral-7B-Instruct-v0.2)        |        91.31       |        62.76       |       74.46       |\n| **SelfCheck-Prompt (gpt-3.5-turbo)** |      **93.42**     |      **67.09**     |     **78.32**     |\n\n## Miscellaneous\n[MQAG (Multiple-choice Question Answering and Generation)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12307) was proposed in our previous work. Our MQAG implementation is included in this package, which can be used to: (1) generate multiple-choice questions, (2) answer multiple-choice questions, (3) obtain MQAG score.\n\n### MQAG Usage\n\n```python\nfrom selfcheckgpt.modeling_mqag import MQAG\nmqag_model = MQAG()\n```\n\nIt has three main functions: `generate()`, `answer()`, `score()`. 
We show an example usage in [```demo\u002FMQAG_demo1.ipynb```](demo\u002FMQAG_demo1.ipynb)\n\n## Acknowledgements\nThis work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge, and the Cambridge Commonwealth, European & International Trust.\n\n## Citation\n\n```\n@article{manakul2023selfcheckgpt,\n  title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},\n  author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},\n  journal={arXiv preprint arXiv:2303.08896},\n  year={2023}\n}\n```\n","SelfCheckGPT\n=====================================================\n[![arxiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2303.08896-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08896)\n[![PyPI version selfcheckgpt](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fselfcheckgpt.svg?kill_cache=1)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fselfcheckgpt\u002F)\n[![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpotsawee_selfcheckgpt_readme_22f09fee23f8.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fselfcheckgpt)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n- 我们论文 \"[SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08896)\" 的项目页面\n- 我们研究了 selfcheck 方法的几种变体：BERTScore（基于BERT的相似度评分）、Question-Answering（问答）、n-gram（n元语法）、NLI（自然语言推理）和 LLM-Prompting（大语言模型提示）。\n- [2023年11月] 感谢 Daniel Huynh 的 SelfCheckGPT-NLI 校准分析 [\\[链接到文章\\]](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fdhuynh95\u002Fautomatic-hallucination-detection)\n- [2023年10月] 论文已被 EMNLP 2023 接收并即将发表 [\\[海报\\]](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1EzQ3MdmrF0gM-83UV2OQ6_QR1RuvhJ9h\u002Fview?usp=drive_link)\n- [2023年8月] ML Collective 演讲的幻灯片 [\\[链接到幻灯片\\]](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13LUBPUm4y1nlKigZxXHn7Cl2lw5KuGbc\u002Fview)\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpotsawee_selfcheckgpt_readme_b243cc0c4174.png)\n\n## 代码\u002F包\n\n### 安装\n\n    pip install selfcheckgpt\n\n### SelfCheckGPT 使用方法：BERTScore、QA、n-gram\n\n本包中有三种 SelfCheck 分数变体，如论文所述：`SelfCheckBERTScore()`、`SelfCheckMQAG()`、`SelfCheckNgram()`。所有变体都有 `predict()` 方法，该方法会输出句子级别（sentence-level）的分数，用于衡量与采样段落（sampled passages）的一致性。您可以使用 spacy（一个自然语言处理库）等包将段落拆分为句子。为了可复现性（reproducibility），您可以在调用此函数前设置 `torch.manual_seed`。更多详情见 Jupyter Notebook（交互式笔记本）[```demo\u002FSelfCheck_demo1.ipynb```](demo\u002FSelfCheck_demo1.ipynb)\n\n```python\n# Include necessary packages (torch, spacy, ...)\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nselfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available\nselfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)\nselfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.\n\n# LLM's text (e.g. GPT-3 response) to be evaluated at the sentence level  & Split it into sentences\npassage = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. 
He is the host of The Savage Nation.\"\nsentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization\nprint(sentences)\n['Michael Alan Weiner (born March 31, 1942) is an American radio host.', 'He is the host of The Savage Nation.']\n\n# Other samples generated by the same LLM to perform self-check for consistency\nsample1 = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.\"\nsample2 = \"Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.\"\nsample3 = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT.\"\n\n# --------------------------------------------------------------------------------------------------------------- #\n# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual\n# Additional params for each scoring_method:\n# -> counting: AT (answerability threshold, i.e. questions with answerability_score \u003C AT are rejected)\n# -> bayes: AT, beta1, beta2\n# -> bayes_with_alpha: beta1, beta2\nsent_scores_mqag = selfcheck_mqag.predict(\n    sentences = sentences,               # list of sentences\n    passage = passage,                   # passage (before sentence-split)\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n    num_questions_per_sent = 5,          # number of questions to be drawn  \n    scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'\n    beta1 = 0.8, beta2 = 0.8,            # additional params depending on scoring_method\n)\nprint(sent_scores_mqag)\n# [0.30990949 0.42376232]\n\n# --------------------------------------------------------------------------------------------------------------- #\n# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual\nsent_scores_bertscore = selfcheck_bertscore.predict(\n    sentences = sentences,                          # list of sentences\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n)\nprint(sent_scores_bertscore)\n# [0.0695562  0.45590915]\n\n# --------------------------------------------------------------------------------------------------------------- #\n# SelfCheck-Ngram: Score at sentence- and document-level where value is in [0.0, +inf) and high value means non-factual\n# as opposed to SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded\nsent_scores_ngram = selfcheck_ngram.predict(\n    sentences = sentences,   \n    passage = passage,\n    sampled_passages = [sample1, sample2, sample3],\n)\nprint(sent_scores_ngram)\n# {'sent_level': { # sentence-level score similar to MQAG and BERTScore variant\n#     'avg_neg_logprob': [3.184312, 3.279774],\n#     'max_neg_logprob': [3.476098, 4.574710]\n#     },\n#  'doc_level': {  # document-level score such that avg_neg_logprob is computed over all tokens\n#     'avg_neg_logprob': 3.218678904916201,\n#     'avg_max_neg_logprob': 4.025404834169327\n#     }\n# }\n```\n\n### SelfCheckGPT 使用方法：NLI（推荐）\n\n以句子和采样段落作为输入的蕴含（Entailment）或矛盾（Contradiction）分数可以用作 selfcheck 分数。我们使用在 Multi-NLI（多体裁自然语言推理数据集）上微调的 DeBERTa-v3-large（一种预训练语言模型），对“蕴含”或“矛盾”类别的概率进行归一化（normalize），并将 Prob(contradiction) 作为分数。\n\n```python\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckNLI\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nselfcheck_nli = SelfCheckNLI(device=device) # 
set device to 'cuda' if GPU is available\n\nsent_scores_nli = selfcheck_nli.predict(\n    sentences = sentences,                          # list of sentences\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n)\nprint(sent_scores_nli)\n# [0.334014 0.975106 ] -- based on the example above\n```\n\n### SelfCheckGPT 使用方法：LLM 提示\n\n提示 LLM（Llama2、Mistral、OpenAI 的 GPT）在 zero-shot（零样本）设置下评估信息一致性。我们查询 LLM 以评估第 i 个句子是否得到样本（作为上下文）的支持。与其他方法类似，分数越高表示越有可能是幻觉。下面是一个使用 Mistral 的示例：\n\n```python\n# Option1: open-source model\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nllm_model = \"mistralai\u002FMistral-7B-Instruct-v0.2\"\nselfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)\n\n# Option2: API access \n# (currently only support OpenAI and Groq)\n# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt\n# selfcheck_prompt = SelfCheckAPIPrompt(client_type=\"openai\", model=\"gpt-3.5-turbo\")\n# selfcheck_prompt = SelfCheckAPIPrompt(client_type=\"groq\", model=\"llama3-70b-8192\", api_key=\"your-api-key\")\n\nsent_scores_prompt = selfcheck_prompt.predict(\n    sentences = sentences,                          # list of sentences\n    sampled_passages = [sample1, sample2, sample3], # list of sampled passages\n    verbose = True, # whether to show a progress bar\n)\nprint(sent_scores_prompt)\n# [0.33333333, 0.66666667] -- based on the example above\n```\n\nLLM（大型语言模型）可以是 HuggingFace 上可用的任何模型。默认的 prompt template（提示模板）为 `Context: {context}\\n\\nSentence: {sentence}\\n\\nIs the sentence supported by the context above? Answer Yes or No.\\n\\nAnswer: `，但您可以使用 `selfcheck_prompt.set_prompt_template(new_prompt)` 进行更改。\n\n大多数模型（gpt-3.5-turbo、Llama2、Mistral）>95% 的时间会输出 'Yes' 或 'No'，而任何剩余输出可设为 N\u002FA。输出将转换为分数：Yes -> 0.0，No -> 1.0，N\u002FA -> 0.5。inconsistency score（不一致性分数）随后通过取平均值计算得出。\n\n## 数据集\n`wiki_bio_gpt3_hallucination` 数据集目前包含 238 篇标注过的段落（`v3`）。您可以在论文中或我们的 HuggingFace 数据卡片上找到更多信息：https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fpotsawee\u002Fwiki_bio_gpt3_hallucination。要使用此数据集，您可以通过 HuggingFace 数据集 API 加载，也可以直接从下方以 JSON 格式下载。\n\n### 更新\n我们进一步标注了 GPT-3 wikibio 段落，现在数据集包含 238 篇标注过的段落。以下是 `v1` 中前 65 篇段落的 ID [链接](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1N3_ZQmr9yBbsOP2JCpgiea9oiNIu78Xw\u002Fview?usp=sharing)。\n\n### 选项1：HuggingFace\n\n```python\nfrom datasets import load_dataset\ndataset = load_dataset(\"potsawee\u002Fwiki_bio_gpt3_hallucination\")\n```\n\n### 选项2：手动下载\n从我们的 [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1AyQ7u9nYlZgUZLm5JBDx6cFFWB__EsNv\u002Fview?usp=share_link) 下载，然后您可以在 Python 中加载：\n\n```python\nimport json\nwith open(\"dataset.json\", \"r\") as f:\n    content = f.read()\ndataset = json.loads(content)\n```\n\n每个实例包含：\n- `gpt3_text`：GPT-3 生成的段落\n- `wiki_bio_text`：实际的 Wikipedia 段落（第一段）\n- `gpt3_sentences`：使用 `spacy` 将 `gpt3_text` 拆分为句子\n- `annotation`：句子级别的人工标注\n- `wiki_bio_test_idx`：来自原始 wikibio 数据集（测试集）的概念\u002F个体 ID\n- `gpt3_text_samples`：采样段落列表（do_sample = True & temperature = 1.0）\n\n## 实验\n\n### 基于概率的基线方法（例如 GPT-3 的概率）\n\n如我们的论文所述，生成式 LLM 的概率（和生成熵）可用于衡量其置信度。请查看我们在 [```demo\u002Fexperiments\u002Fprobability-based-baselines.ipynb```](demo\u002Fexperiments\u002Fprobability-based-baselines.ipynb) 中的示例\u002F实现。\n\n### 实验结果\n- 完整细节可在我们的论文中找到。\n- 请注意，我们的新结果表明，LLM（如 GPT-3 (text-davinci-003) 或 ChatGPT (gpt-3.5-turbo)）擅长文本不一致性评估。基于这一发现，我们尝试了 **SelfCheckGPT-Prompt**，其中每个待评估的句子都与每个 
sampled_passage 进行比较，通过提示 ChatGPT 完成。SelfCheckGPT-Prompt 是表现最佳的方法。\n\n在 `wiki_bio_gpt3_hallucination` 数据集上的结果。\n\n| 方法               |  非事实 (AUC-PR)  |  事实 (AUC-PR)  |   排序 (PCC)   |\n|----------------------|:------------------:|:------------------:|:-----------------:|\n| 随机猜测      |        72.96       |        27.04       |         -         |\n| GPT-3 Avg(-logP)     |        83.21       |        53.97       |       57.04       |\n| SelfCheck-BERTScore  |        81.96       |        44.23       |       58.18       |\n| SelfCheck-QA         |        84.26       |        48.14       |       61.07       |\n| SelfCheck-Unigram    |        85.63       |        58.47       |       64.71       |\n| SelfCheck-NLI        |        92.50       |        66.08       |       74.14       |\n| SelfCheck-Prompt (Llama2-7B-chat)        |        89.05       |        63.06       |       61.52       |\n| SelfCheck-Prompt (Llama2-13B-chat)        |        91.91       |        64.34       |       75.44       |\n| SelfCheck-Prompt (Mistral-7B-Instruct-v0.2)        |        91.31       |        62.76       |       74.46       |\n| **SelfCheck-Prompt (gpt-3.5-turbo)** |      **93.42**     |      **67.09**     |     **78.32**     |\n\n## 其他\n[MQAG（多项选择题问答与生成）](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12307) 在我们之前的工作中提出。我们的 MQAG 实现包含在此包中，可用于：(1) 生成多项选择题，(2) 回答多项选择题，(3) 获取 MQAG 分数。\n\n### MQAG 用法\n\n```python\nfrom selfcheckgpt.modeling_mqag import MQAG\nmqag_model = MQAG()\n```\n\n它有三个主要功能：`generate()`、`answer()`、`score()`。我们在 [```demo\u002FMQAG_demo1.ipynb```](demo\u002FMQAG_demo1.ipynb) 中展示了示例用法。\n\n## 致谢\n本工作由剑桥大学出版社与考评部（Cambridge University Press & Assessment, CUP&A）——剑桥大学的一个部门，以及剑桥英联邦、欧洲与国际信托基金支持。\n\n## 引用\n\n```\n@article{manakul2023selfcheckgpt,\n  title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},\n  author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},\n  journal={arXiv preprint arXiv:2303.08896},\n  year={2023}\n}\n```","# SelfCheckGPT 快速上手指南\n\n## 环境准备\n\n- **Python 版本**: 3.7 或更高版本\n- **操作系统**: Linux、macOS 或 Windows\n- **硬件要求**: 建议使用 GPU（CUDA）加速计算，CPU 也可运行\n- **前置依赖**: 需要安装 PyTorch\n\n```bash\n# 安装 PyTorch（根据你的 CUDA 版本选择）\n# 访问 https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F 获取最新安装命令\npip install torch\n```\n\n## 安装步骤\n\n```bash\n# 安装 SelfCheckGPT 主包（推荐国内镜像源加速）\npip install selfcheckgpt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 安装 SpaCy 及其语言模型（用于句子分割）\npip install spacy\npython -m spacy download en_core_web_sm\n```\n\n## 基本使用（NLI 方法）\n\nNLI（自然语言推理）是推荐的方法，使用 DeBERTa 模型检测文本一致性。\n\n```python\nimport torch\nimport spacy\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckNLI\n\n# 1. 初始化模型和工具\nnlp = spacy.load(\"en_core_web_sm\")\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nselfcheck_nli = SelfCheckNLI(device=device)\n\n# 2. 准备待检测文本（GPT 生成内容）\npassage = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation.\"\nsentences = [sent.text.strip() for sent in nlp(passage).sents]\nprint(\"待检测句子:\", sentences)\n\n# 3. 准备同一模型生成的其他样本（用于一致性对比）\nsample1 = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.\"\nsample2 = \"Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.\"\nsample3 = \"Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT.\"\n\n# 4. 
执行幻觉检测\nsent_scores = selfcheck_nli.predict(\n    sentences=sentences,\n    sampled_passages=[sample1, sample2, sample3],\n)\n\nprint(\"幻觉检测分数:\", sent_scores)\n# 输出示例: [0.334014, 0.975106]\n```\n\n### 结果解读\n\n- **分数范围**: [0.0, 1.0]\n- **分数含义**: 分数越高，表示该句子越可能是幻觉（非事实）\n- **建议阈值**: 通常 0.5 以上需要警惕，可根据实际场景调整\n\n## 其他检测方法\n\n快速切换其他检测方法：\n\n```python\nfrom selfcheckgpt.modeling_selfcheck import SelfCheckBERTScore, SelfCheckLLMPrompt\n\n# BERTScore 方法（无需 GPU）\nselfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)\nscores = selfcheck_bertscore.predict(\n    sentences=sentences, \n    sampled_passages=[sample1, sample2, sample3]\n)\n\n# LLM Prompt 方法（需要 GPU 或 API）\nselfcheck_prompt = SelfCheckLLMPrompt(\"mistralai\u002FMistral-7B-Instruct-v0.2\", device)\nscores = selfcheck_prompt.predict(\n    sentences=sentences, \n    sampled_passages=[sample1, sample2, sample3]\n)\n```\n\n更多详细示例请参考：[demo\u002FSelfCheck_demo1.ipynb](https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Fblob\u002Fmain\u002Fdemo\u002FSelfCheck_demo1.ipynb)","某互联网医疗平台的AI团队正在开发智能问诊助手，基于GPT-4为患者提供常见疾病咨询和用药指导。每天有数千名患者提问，系统需要实时生成专业回答。\n\n### 没有 selfcheckgpt 时\n- **医生审核成为瓶颈**：每份AI回答都必须由值班医生逐字审阅，高峰期积压严重，患者平均等待40分钟才能获得回复\n- **幻觉风险难以防范**：系统曾将\"阿司匹林每日用量\"错误写成\"单次服用2g\"，直到患者投诉才发现，险些造成医疗事故\n- **问题定位效率低下**：当发现回答有误时，只能整段废弃重写，无法快速定位是\"病因分析\"还是\"用药建议\"部分出错\n- **可信度无法量化**：产品团队无法判断哪些回答可以高置信度直接推送，哪些必须强制人工介入，策略制定缺乏数据支撑\n\n### 使用 selfcheckgpt 后\n- **智能分层审核**：SelfCheckMQAG自动为每个句子打分，仅将置信度低于0.3的句子标记给医生复核，审核工作量降低70%，患者等待时间缩短至12分钟\n- **实时风险拦截**：检测到\"单次服用2g\"这类异常表述时，系统自动触发二次验证流程，拦截高风险内容，上线后医疗投诉下降90%\n- **精准定位修改**：BERTScore版本精确定位到具体错误句子，医生只需修改\"用药剂量\"这一句，无需重写整个回答，单次修正时间从5分钟降至30秒\n- **量化置信度策略**：基于n-gram一致性评分，产品团队设定>0.8分的回答直接推送，0.5-0.8分加警示标签，\u003C0.5分转人工，策略效果可测量可优化\n\nselfcheckgpt让医疗AI在保障安全的前提下实现高效响应，将医生从重复审核中解放出来，专注于真正需要专业判断的复杂病例。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpotsawee_selfcheckgpt_9904dd6a.png","potsawee","Potsawee (Punpun) Manakul","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fpotsawee_40bfdbd8.jpg","Building Speech & NLP models","University of Cambridge","Cambridge, GB","m.potsawee@gmail.com",null,"https:\u002F\u002Fpotsawee.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fpotsawee",[86],{"name":87,"color":88,"percentage":89},"Python","#3572A5",100,609,71,"2026-04-04T07:25:35","MIT","Linux, macOS, Windows","可选但强烈推荐，运行本地LLM模型需8GB+显存，CUDA 11.7+","未说明",{"notes":98,"python":96,"dependencies":99},"1. 支持本地模型和API两种方式运行LLM Prompt方法；2. 首次运行需从HuggingFace下载多个大模型（如DeBERTa-v3-large、Mistral-7B等），总大小约5-10GB；3. 本地运行LLM（如Mistral-7B）需要足够显存，建议使用量化版本；4. 
使用OpenAI或Groq API需要相应的API密钥",[100,101,102,103,104],"torch","transformers","spacy","datasets","accelerate",[26,13,54],"2026-03-27T02:49:30.150509","2026-04-06T09:46:57.849184",[109,114,119,124,129,134],{"id":110,"question_zh":111,"answer_zh":112,"source_url":113},4548,"示例结果中的两个数字是什么意思？","这两个数字是 SelfCheck 为每个句子计算的幻觉检测分数。示例中有两句话需要评估，因此输出两个分数。第一个数字对应第一句话，第二个数字对应第二句话。分数范围在 0.0 到 1.0 之间，数值越高表示该句子越有可能是幻觉（非事实性内容）。这些分数是通过将待检测句子与多个采样段落（sampled_passages）进行比较计算得出的，因此采样段落是计算分数的必要输入。在实际使用中，采样段落类似于 Google Bard 对同一提示生成的其他草稿版本。","https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Fissues\u002F29",{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},4549,"如何将 SelfCheckGPT 应用于特定领域的事实核查？","SelfCheckGPT 提供多种实现方式：QA、BERTScore、n-gram、NLI 和 LLM-prompt。除 LLM-prompt 需要手动编码外，其他变体均已内置在包中。LLM-prompt 的实现参考：https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Ftree\u002Fmain\u002Fdemo\u002Fexperiments\u002Fselfcheck_prompt。核心流程：1) 对同一提示多次采样生成 $S_0, S_1, ..., S_N$；2) 用 $S_1$ 到 $S_N$ 作为证据检查 $S_0$ 中的每个句子；3) 将\"Yes\u002FNo\"回答转换为分数（Yes=0.0, No=1.0）并平均。推荐提示模板：`Context: {sample}\\n\\nSentence: {sentence_to_be_assessed}\\n\\nIs the sentence supported by the context above? Answer Yes or No:`","https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Fissues\u002F14",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},4550,"如何构建数据集？gpt3_text_samples 是如何生成的？需要自行爬取数据吗？","对于现代指令模型（如 Llama3），建议使用提示：`write an article in a Wikipedia style about {concept} with {attributes}`。温度参数推荐设为 0.8 左右。原始项目使用预训练模型，采用下一个词预测风格提示：`This is a Wikipedia passage about {concept}:`。gpt3_text 和 gpt3_text_samples 都是通过 GPT-3 和上述提示从 wiki_bio_text 生成的，区别在于后者使用更高的温度或采样参数以获得多样性。原始数据中的段落排序和句子处理需要自行实现，项目不提供完整的爬取脚本。","https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Fissues\u002F34",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},4551,"是否支持自动生成响应 R 和样本 S，而不是手动提供？","是的，项目已合并相关 PR 并提供了自动生成 Notebook。该功能允许用户输入查询和 Huggingface LLM 模型，系统自动执行：1) 用 temperature=0 生成确定性响应 R；2) 用 temperature=1 和采样技术生成 N 个样本 S1...SN；3) 使用选定方法计算 R 的幻觉分数。具体实现可参考项目中的示例 Notebook，该 Notebook 封装了整个流程，用户只需提供查询和模型即可得到带幻觉分数的响应。","https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Fissues\u002F25",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},4552,"论文中的 β 参数是什么意思？为什么设置为 0.8？","β 表示在内容出现幻觉（F）时，QA 模型给出不同答案（a≠a_R）的概率，即 β = P(a≠a_R|F)。在贝叶斯框架中用于调整先验概率。设置为 0.8 是基于初步实验的经验选择，并非最优值。由于 QG 和 QA 模型是固定的，理论上 β 应该是固定但未知的值，理想情况下应通过额外数据调优。本研究因数据有限，仅尝试了少数几个值并固定为 0.8，这种方法存在局限性。","https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Fissues\u002F11",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},4553,"SelfCheckGPT 能否用于其他领域？有其他测试数据集吗？","方法本身具有通用性，但目前项目主要基于 WikiBio 数据集进行评估，暂无扩展其他领域标注数据的计划。要在其他领域使用，需要自行构建评估数据集并标注幻觉。代码库中的方法实现可以直接应用于新领域，但效果需要重新验证。项目不提供其他领域的预构建数据集，用户需自行准备数据。","https:\u002F\u002Fgithub.com\u002Fpotsawee\u002Fselfcheckgpt\u002Fissues\u002F3",[140,145,150,155,160,165],{"id":141,"version":142,"summary_zh":143,"released_at":144},113697,"0.1.7","- Add SelfCheckGPT LLM-Prompt with OpenAI API in addition to using open-source model that has already been supported ","2024-03-10T05:06:50",{"id":146,"version":147,"summary_zh":148,"released_at":149},113698,"0.1.6","Add SelfCheckGPT with LLM prompting using an open-source model (e.g., Mistral, Llama)","2024-02-06T16:05:02",{"id":151,"version":152,"summary_zh":153,"released_at":154},113699,"0.1.4","- Add SelfCheck-NLI (based on DeBERTa-v3 fine-tuned to Multi-NLI), which shows good results\r\n- Add `rescale_with_baseline` (default=True) to 
SelfCheck-BERTScore","2023-07-18T17:05:17",{"id":156,"version":157,"summary_zh":158,"released_at":159},113700,"0.1.3","- add SelfCheck-Ngram module into the selfcheckgpt package","2023-05-25T15:08:41",{"id":161,"version":162,"summary_zh":163,"released_at":164},113701,"0.1.2","Refactor and Add MQAG features\r\n- refactor code into SelfCheck (both SelfCheckMQAG and SelfCheckBERTScore) and MQAG\r\n- add MQAG as many functions of MQAG are used by SelfCheckMQAG\r\n- fix max_length of T5 model (there is no length limit)\r\n- add version.py","2023-03-23T23:34:31",{"id":166,"version":167,"summary_zh":168,"released_at":169},113702,"0.1.1","Currently, SelfCheckGPT includes the two variants: (1) SelfCheckGPT-MQAG and (2) SelfCheckGPT-BERTScore as described in our preprint on arxiv (dated 15 March 2023) ","2023-03-22T15:29:22"]