[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-jianzhnie--awesome-instruction-datasets":3,"tool-jianzhnie--awesome-instruction-datasets":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":82,"stars":85,"forks":86,"last_commit_at":87,"license":88,"difficulty_score":89,"env_os":90,"env_gpu":90,"env_ram":90,"env_deps":91,"category_tags":94,"github_topics":95,"view_count":10,"oss_zip_url":82,"oss_zip_packed_at":82,"status":16,"created_at":103,"updated_at":104,"faqs":105,"releases":106},715,"jianzhnie\u002Fawesome-instruction-datasets","awesome-instruction-datasets","A collection of awesome-prompt-datasets, awesome-instruction-dataset, to train ChatLLM such as chatgpt 收录各种各样的指令数据集, 用于训练  ChatLLM 模型。","awesome-instruction-datasets 是一个专为大语言模型领域打造的开源资源库，汇集了训练 ChatLLM 所需的高质量指令数据集。在微调类似 ChatGPT 或 Llama 的模型时，数据往往是最大的瓶颈之一。awesome-instruction-datasets 解决了数据源分散、筛选困难的问题，将全球范围内优质的指令微调与 RLHF 数据集集中整理，极大降低了获取成本。\n\n这里适合自然语言处理研究人员、AI 工程师以及深度学习学生使用。无论你是想复现经典模型，还是探索新的训练方法，都能在此找到灵感。其独特之处在于分类清晰，不仅包含 Alpaca、OpenAssistant 等知名英文数据集，还收录了 Belle、Firefly 等中文资源，并明确标注了语言标签。此外，它还涵盖了 RLHF 相关数据，为模型对齐提供关键支持。通过整合这些核心资源，awesome-instruction-datasets 助力社区加速研发，让大模型训练变得更加高效和便捷。","\n\u003Cdiv align=\"center\">\n\n# Awesome Instruction Datasets 
\n[![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re)\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n[中文](README_zh.md) | English\n\u003C\u002Fdiv>\n\n# Contents\n- [Awesome Prompt datasets](#awesome-prompt-datasets)\n- [Contents](#contents)\n- [Introduction](#introduction)\n- [Prompt Datasets](#prompt-datasets)\n  - [Statistics](#statistics)\n- [RLHF Datasets](#rlhf-datasets)\n  - [Statistics](#statistics-1)\n- [The template](#the-template)\n- [The Prompt Datasets List](#the-prompt-datasets-list)\n  - [Alpaca -Stanford](#alpaca--stanford)\n  - [Instruction in the Wild](#instruction-in-the-wild)\n  - [JosephusCheung\u002FGuanacoDataset](#josephuscheungguanacodataset)\n  - [Stanford Human Preferences Dataset (SHP)](#stanford-human-preferences-dataset-shp)\n  - [Hello-SimpleAI\u002FHC3](#hello-simpleaihc3)\n  - [Hello-SimpleAI\u002FHC3-Chinese](#hello-simpleaihc3-chinese)\n  - [allenai\u002Fprosocial-dialog](#allenaiprosocial-dialog)\n  - [allenai\u002Fnatural-instructions](#allenainatural-instructions)\n  - [PhoebusSi\u002FAlpaca-CoT](#phoebussialpaca-cot)\n  - [nomic-ai\u002Fgpt4all](#nomic-aigpt4all)\n  - [bigscience\u002FxP3](#bigsciencexp3)\n  - [teknium1\u002FGPTeacher](#teknium1gpteacher)\n  - [thunlp\u002FUltraChat](#thunlpultrachat)\n  - [cascip\u002FChatAlpaca](#cascipchatalpaca)\n  - [YeungNLP\u002Ffirefly-train-1.1M)](#yeungnlpfirefly-train-11m)\n  - [orhonovich\u002Funnatural-instructions](#orhonovichunnatural-instructions)\n  - [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](#instruction-tuning-with-gpt-4gpt-4-llm)\n  - [databrickslabs\u002Fdolly](#databrickslabsdolly)\n  - [OpenAssistant\u002Foasst1](#openassistantoasst1)\n  - [BELLE\u002Fdata\u002F1.5M](#belledata15m)\n  - [alpaca\\_chinese\\_dataset](#alpaca_chinese_dataset)\n  - [Med-ChatGLM\u002Fdata](#med-chatglmdata)\n  - [pCLUE](#pclue)\n  - [COIG](#coig)\n- [The RLHF Datasets List](#the-rlhf-datasets-list)\n  - 
[Anthropic\u002Fhh-rlhf](#anthropichh-rlhf)\n  - [HuggingFaceH4\u002Fstack-exchange-preferences](#huggingfaceh4stack-exchange-preferences)\n  - [stanfordnlp\u002FSHP](#stanfordnlpshp)\n  - [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](#instruction-tuning-with-gpt-4gpt-4-llm-1)\n  - [Natural Instruction \u002F Super-Natural Instruction](#natural-instruction--super-natural-instruction)\n  - [BigScience\u002FP3](#bigsciencep3)\n  - [xMTF - BigScience](#xmtf---bigscience)\n  - [HH-RLHF - Anthropic](#hh-rlhf---anthropic)\n  - [Unnatural Instruction](#unnatural-instruction)\n  - [Self-Instruct](#self-instruct)\n  - [UnifiedSKG - HKU](#unifiedskg---hku)\n  - [Google\u002FFlan Collection](#googleflan-collection)\n  - [InstructDial](#instructdial)\n  - [ChatGPT Distillation Data](#chatgpt-distillation-data)\n  - [Open Instruction Generalist (OIG).](#open-instruction-generalist-oig)\n  - [OpenAI WebGPT.](#openai-webgpt)\n  - [OpenAI Summarization.](#openai-summarization)\n- [Datasets without license information](#datasets-without-license-information)\n  - [alespalla\u002Fchatbot\\_instruction\\_prompts](#alespallachatbot_instruction_prompts)\n- [Contributing](#contributing)\n- [License](#license)\n\n\n# Introduction\n\"Welcome to 'awesome-prompt-datasets', a comprehensive collection of high-quality open-source instruction tuning datasets to train chat-based LLMs (ChatGPT,LLaMA,Alpaca)。\n\nInstruction Tuning \u002F Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources.\n\nWith 'awesome-prompt-dataset', you can accelerate your research and development in NLP and unlock new opportunities for innovation. 
Let's explore the possibilities together!\"\n\n# Prompt Datasets\n\nReferring to [this](https:\u002F\u002Fgithub.com\u002FyaodongC\u002Fawesome-instruction-dataset) ([@yaodongC](https:\u002F\u002Fgithub.com\u002FyaodongC)), we labeled each collected dataset according to the following rules:\n\n**(Lang)Lingual-Tags**:\n\n- EN: Instruction datasets in English\n- CN: Instruction datasets in Chinese\n- ML: [Multi-lingual] Instruction datasets in multiple languages\n\n**(Task)Task-Tags**:\n\n- MT: [Multi-task] Datasets containing multiple tasks\n- TS: [Task-specific] Datasets tailored for specific tasks\n\n**(Gen)Generation-method**:\n\n- HG: [Human Generated Dataset] Datasets created by humans\n- SI: [Self-Instruct] Datasets generated using self-instruct methods\n- MIX: [Mixed Dataset] Dataset contains both human and machine generated data\n- COL: [Collection of Dataset] Dataset made from a collection of other datasets\n\n## Statistics\n\n| Project                                                      |                           Datasets                           | Org                        | Nums      | Lang  | Task  | Gen  | Type                                                         | Src                                                          | Url                                                          |\n| :----------------------------------------------------------- | :----------------------------------------------------------: | -------------------------- | :-------- | :---- | :---- | :--- | :----------------------------------------------------------- | :----------------------------------------------------------- | :----------------------------------------------------------- |\n| [Chain of Thought](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN)  | [cot_data](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2\u002Fcot_data) 
\\|[few_shot_data](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2\u002Fniv2_few_shot_data) | Google                     | 74771     | EN\u002FCN | MT    | HG   | instruct with cot reasoning                                  | annotating CoT on existing data                              | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FChain-of-Thought) |\n| [GPT4all](https:\u002F\u002Fgithub.com\u002Fnomic-ai\u002Fgpt4all)               | [nomic-ai\u002Fgpt4all-j-prompt-generations](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnomic-ai\u002Fgpt4all-j-prompt-generations) | nomic-ai                   | 806199    | EN    | MT    | COL  | code, storys and dialogs                                     | distillation from GPT-3.5-turbo                              | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGPT4all) |\n| [GPTeacher](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher)           | [GPT-4 General-Instruct ](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FInstruct)\\|[Roleplay-Instruct](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FRoleplay) \\|[Code-Instruct ](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FCodegen)\\| [Toolformer](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FToolformer) | teknium1                   | 29013     | EN    | MT    | SI   | general, roleplay, toolformer                                | GPT-4 & toolformer                                           | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGPTeacher) |\n| [Guanaco](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | 
[JosephusCheung\u002FGuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | JosephusCheung             | 534610    | ML    | MT    | SI   | various linguistic tasks                                     | text-davinci-003                                             | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGuanaco) |\n| [HC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3)    | [Hello-SimpleAI\u002FHC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3) | Hello-SimpleAI \\| 万得资讯 | 37175     | EN\u002FCN | TS    | MIX  | dialogue evaluation                                          | human or ChatGPT                                             | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FHC3) |\n| [HC3-Chinese](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3-Chinese) | [Hello-SimpleAI\u002FHC3-Chinese](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3-Chinese) | Hello-SimpleAI\\|万得资讯   | 13k       | CN    | TS    | MIX  | dialogue evaluation                                          | human or ChatGPT                                             |                                                              |\n| [alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)       | [tatsu-lab\u002Falpaca](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca) | tatsu-lab                  | 52002     | EN    | MT    | SI   | general instruct                                             | text-davinci-003                                             | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Falpaca) |\n| 
[AlpacaDataCleaned](https:\u002F\u002Fgithub.com\u002Fgururise\u002FAlpacaDataCleaned) | [yahma\u002Falpaca-cleaned](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyahma\u002Falpaca-cleaned) | yahma                      | 52k       | EN    | MT    | SI   | general instruct                                             | text-davinci-003                                             | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Falpaca) |\n| [Chinese-LLaMA-Alpaca](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca) | [alpaca_data_zh_51k](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca\u002Fblob\u002Fmain\u002Fdata\u002Falpaca_data_zh_51k.json) | ymcui(讯飞)                | 51k       | CN    | MT    | SI   | general instruct                                             | text-davinci-003                                             |                                                              |\n| [Luotuo-Chinese-LLM](https:\u002F\u002Fgithub.com\u002FLC1332\u002FLuotuo-Chinese-LLM)  骆驼 | [trans_chinese_alpaca_data](https:\u002F\u002Fgithub.com\u002FLC1332\u002FLuotuo-Chinese-LLM\u002Fblob\u002Fmain\u002Fdata\u002Ftrans_chinese_alpaca_data.json) | LC1332(商汤)               | 52k       | CN    | MT    | SI   | general instruct                                             | text-davinci-003                                             |                                                              |\n| [Natural Instructions](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fnatural-instructions) | [Allen AI 61 task](https:\u002F\u002Finstructions.apps.allenai.org\u002F#:~:text=Download%20Natural%2DInstructions%20%2D%20v1.1)\\|[1.5k task](https:\u002F\u002Finstructions.apps.allenai.org\u002F#:~:text=Natural%2DInstructions%20%2D%20v2-,.,-x) | Allen AI                   | 5040134   | ML    | MT    | COL  | diverse nlp tasks                                            | human 
annotated datasets collection                          | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FNatural-Instructions) |\n| [belle_cn](https:\u002F\u002Fhuggingface.co\u002FBelleGroup)                | [BelleGroup\u002Ftrain_1M_CN](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbellegroup\u002Ftrain_1M_CN) \\|[BelleGroup\u002Ftrain_0.5M_CN](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbellegroup\u002Ftrain_0.5M_CN) | BelleGroup(链家)           | 1079517   | CN    | TS\u002FMT | SI   | general, mathematical reasoning, dialogue                    | text-davinci-003                                             | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fbelle_cn) |\n| [instinwild](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild)   | [instinwild_ch](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild\u002Ftree\u002Fmain\u002Fdata) \\| [instinwild_en](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild\u002Ftree\u002Fmain\u002Fdata) |                            | 52191     | EN\u002FCN | MT    | SI   | generation, open-qa, mind-storm                              | text-davinci-003                                             | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Finstinwild) |\n| [华驼(HuaTuo)](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FHuatuo-Llama-Med-Chinese) | [中文医学知识](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FHuatuo-Llama-Med-Chinese\u002Fblob\u002Fmain\u002Fdata\u002Fllama_data.json) \\|[肝癌](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FHuatuo-Llama-Med-Chinese\u002Fblob\u002Fmain\u002Fdata-literature\u002Fliver_cancer.json) | SCIR-HI(哈工大)            | 8K        | CN    | TS    | SI   | 公开和自建的中文医学知识库                                   | GPT3.5                                                  
     |                                                              |\n| [prosocial dialog](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fprosocial-dialog) | [allenai\u002Fprosocial-dialog](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fprosocial-dialog) | allenai                    | 165681    | EN    | TS    | MIX  | dialogue                                                     | GPT-3 rewrites questions + humans feedback manually          | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fprosocial-dialog) |\n| [finance_en](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fgbharti\u002Ffinance-alpaca) | [gbharti\u002Ffinance-alpaca](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fprosocial-dialog) |                            | 68912     | EN    | TS    | COL  | financial related qa                                         | GPT3.5                                                       | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002F) |\n| [xP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FxP3)        | [bigscience\u002FxP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FxP3) | bigscience                 | 78883588  | ML    | MT    | COL  | a collection of prompts & datasets across 46 of languages & 16 NLP tasks | human annotated datasets collection                          | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FxP3) |\n| [firefly](https:\u002F\u002Fgithub.com\u002Fyangjianxin1\u002FFirefly)           | [YeungNLP\u002Ffirefly-train-1.1M](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYeungNLP\u002Ffirefly-train-1.1M) |                            | 1649398   | CN    | MT    | COL  | 23 nlp tasks                                                 | human annotated 
datasets collection                          | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Ffirefly) |\n| [instruct](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fswype\u002Finstruct)   | [swype\u002Finstruct](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fswype\u002Finstruct) |                            | 888969    | EN    | MT    | COL  | augmented of GPT4All, Alpaca, open-source Meta datasets      | augmentation performed using the advanced NLP tools provided by AllenAI | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Finstruct) |\n| [Code Alpaca](https:\u002F\u002Fgithub.com\u002Fsahil280114\u002Fcodealpaca)     | [sahil280114\u002Fcodealpaca](https:\u002F\u002Fgithub.com\u002Fsahil280114\u002Fcodealpaca\u002Fblob\u002Fmaster\u002Fdata\u002Fcode_alpaca_20k.json) |                            | 20022     | EN    | TS    | SI   | code generation, editing, optimization                       | text-davinci-003                                             | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FCodeAlpaca) |\n| [Alpaca_GPT4](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM) | [alpaca_gpt4_data](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM\u002Fblob\u002Fmain\u002Fdata\u002Falpaca_gpt4_data.json)\\|[alpaca_gpt4_data_zh](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM\u002Fblob\u002Fmain\u002Fdata\u002Falpaca_gpt4_data_zh.json) \\|[comparison_data_v2](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM\u002Fblob\u002Fmain\u002Fdata\u002Fcomparison_data_v2.json) | 微软                       | 52002     | EN\u002FCN | MT    | SI   | general instruct                                             | generated by GPT-4 using Alpaca              
                | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FalpacaGPT4) |\n| [webGPT](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons) | [openai\u002Fwebgpt_comparisons](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons) | openai                     | 18994     | EN    | TS    | MIX  | information retrieval (IR) QA                                | fine-tuned GPT-3, each instruction has two outputs, select better one | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FwebGPT) |\n| [dolly 2.0](https:\u002F\u002Fgithub.com\u002Fdatabrickslabs\u002Fdolly)         | [databricks\u002Fdatabricks-dolly-15k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdatabricks\u002Fdatabricks-dolly-15k) | databricks                 | 15015     | EN    | TS    | HG   | closed QA , summarization and etc, Wikipedia as references   | human annotated                                              | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fdolly) |\n| [mosaicml\u002Fllm-foundry](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fllm-foundry) | [mosaicml\u002Fdolly_hhrlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmosaicml\u002Fdolly_hhrlhf) | mosaicml                   | 59.3K     | EN    | TS    | HG   | This dataset is a combination of [Databrick's dolly-15k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdatabricks\u002Fdatabricks-dolly-15k) dataset and a filtered subset of [Anthropic's HH-RLHF](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf). 
| human annotated                                              |                                                              |\n| [baize](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot) 白泽 | [alpaca_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Ftree\u002Fmain\u002Fdata) \\|[medical_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Fblob\u002Fmain\u002Fdata\u002Fmedical_chat_data.json) \\| [quora_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Fblob\u002Fmain\u002Fdata\u002Fquora_chat_data.json) \\|[stackoverflow_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Fblob\u002Fmain\u002Fdata\u002Fstackoverflow_chat_data.json) | project-baize              | 653699    | EN    | MT    | COL  | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection                          | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fbaize) |\n| [hh-rlhf](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fhh-rlhf)             | [Anthropic\u002Fhh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fanthropic\u002Fhh-rlhf) | Anthropic                  | 284517    | EN    | TS    | MIX  | dialogue                                                     | dialog between human and RLHF models                         | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fhh-rlhf) |\n| [OIG(part)](https:\u002F\u002Flaion.ai\u002Fblog\u002Foig-dataset\u002F)              |    [laion\u002FOIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flaion\u002Foig)    | laion                      | 49237     | EN    | MT    | COL  | created from various tasks, such as question and answering   | using data augmentation, human annotated datasets 
collection | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FOIG) |\n| [GAOKAO](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench)          | [Fill-in-the-blank_Questions](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench\u002Ftree\u002Fmain\u002Fdata\u002FFill-in-the-blank_Questions) \\| [Multiple-choice_Questions](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench\u002Ftree\u002Fmain\u002Fdata\u002FMultiple-choice_Questions) \\| [Open-ended_Questions](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench\u002Ftree\u002Fmain\u002Fdata\u002FOpen-ended_Questions) | OpenLMLab                  | 2785      | CN    | MT    | COL  | Multiple-choice, Fill-in-the-blank and Open-ended questions from examination | human annotated                                              | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGAOKAO) |\n| [camel](https:\u002F\u002Fgithub.com\u002Flightaime\u002Fcamel) \\| 骆驼          | [camel-ai\u002Fcode](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fai_society)\\|[camel-ai\u002Fbiology](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fbiology) \\|[camel-ai\u002Fphysics](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fphysics) \\|[camel-ai\u002Fchemistry](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fchemistry) \\|[camel-ai\u002Fmath](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fmath) | camel-ai                   | 760620    | EN    | MT    | SI   | Role-Playing conversations in AI Society, Code, Math, Physics, Chemistry, Biolog | gpt-3.5-turbo                                                | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fcamel) |\n| 
[FLAN-Muffin](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMuennighoff\u002Fflan) | [Muennighoff\u002Fflan](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMuennighoff\u002Fflan) |                            | 1764800   | EN    | MT    | COL  | 60 nlp tasks                                                 | human annotated datasets collection                          | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FFLAN-Muffin) |\n| [COIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG)            |      [COIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG)       | BAAI\\|智源                 | 298428    | CN    | MT    | COL  | collect fron Exam, Translated, Human Value Alignment Instructions and Counterfactural Correction Multi-round Chat | using automatic tool and manual verification                 | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FCOIG) |\n| [GPT4Tools](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools)        | [gpt4tools_71k.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1JKIT-Or1of7TJuWvmrJpPoOx0cLdcWry\u002Fview?usp=share_link) | StevenGrove                | 71446     | EN    | MT    | SI   | a collection of tool-related instructions                    | gpt-3.5-turbo                                                | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fgpt4tools) |\n| [ShareChat](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRyokoAI\u002FShareGPT52K) | [RyokoAI\u002FShareGPT52K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRyokoAI\u002FShareGPT52K) | RyokoAI                    | 1663241   | EN    | MT    | MIX  | general instruct                                             | crowdsourcing to collect conversations between people and ChatGPT (ShareGPT) 
| [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FShareGPT) |\n| [Auto CoT](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Fauto-cot)       | [kojima-takeshi188\u002Fzero_shot_cot\u002Fdataset](https:\u002F\u002Fgithub.com\u002Fkojima-takeshi188\u002Fzero_shot_cot\u002Ftree\u002Fmain\u002Fdataset) \\|[kojima-takeshi188\u002Fzero_shot_cot\u002Flog](https:\u002F\u002Fgithub.com\u002Fkojima-takeshi188\u002Fzero_shot_cot\u002Ftree\u002Fmain\u002Flog) | amazon-science             |           | EN    |       |      |                                                              |                                                              | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FAuto-CoT) |\n| [MOSS](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FMOSS)（复旦 Moss）       | [fnlp\u002Fmoss-002-sft-data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ffnlp\u002Fmoss-002-sft-data)\\| [moss-003-sft-data](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FMOSS\u002Ftree\u002Fmain\u002FSFT_data\u002Fconversations\u002Fconversation_without_plugins) | fnlp                       | 1583595   | EN\u002FCN | SI    |      |                                                              |                                                              | [download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FMOSS) |\n| [ultrachat](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FUltraChat)             | [stingning\u002Fultrachat](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstingning\u002Fultrachat) | thnlp                      | 28247446  | EN    |       |      |                                                              |                                                              | 
[download](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fultrachat) |\n| [StackLLaMA](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flvwerra\u002Fstack-exchange-paired) | [lvwerra\u002Fstack-exchange-paired](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flvwerra\u002Fstack-exchange-paired) |                            | todo      | EN    |       | HG   |                                                              |                                                              |                                                              |\n| [Self-Instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct)   | [yizhongw\u002Fself-instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct\u002Fblob\u002Fmain\u002Fdata\u002Fgpt3_generations\u002Fbatch_221203\u002Fall_instances_82K.jsonl) |                            | 82 K      | EN    | SI    | SI   |                                                              |                                                              |                                                              |\n| [Zhihu-KOL](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fwangrui6\u002FZhihu-KOL) | [Zhihu-KOL](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fwangrui6\u002FZhihu-KOL) | OpenAssistant               | 1M     |       | SI    | HG   | Zhihu data for training Open Assistant                        |                                                              |                                                              |\n| [stanfordnlp\u002FSHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP) | [stanfordnlp\u002FSHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP) | stanfordnlp                | 385 k     | EN    | MT    | HG   |                                                              | human preferences over responses                             |                                                              |\n| 
[LAION-AI\u002FOpen-Assistant](https:\u002F\u002Fgithub.com\u002FLAION-AI\u002FOpen-Assistant) | [OpenAssistant\u002Foasst1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1) | OpenAssistant               | 84.4k     | EN    | MT    | HG   | OpenAssistant Conversations Dataset (OASST1)                 | human-generated, human-annotated                             |                                                              |\n| [akoksal\u002FLongForm](https:\u002F\u002Fgithub.com\u002Fakoksal\u002FLongForm)      | [akoksal\u002FLongForm](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fakoksal\u002FLongForm) | akoksal\u002FLongForm           | 30k       | EN    | SI    | HG   |                                                              | A diverse set of human-written documents is selected from existing corpora (e.g., C4 and Wikipedia), and an LLM generates an instruction for each document. |                                                              |\n| [sail-sg\u002Fsymbolic-instruction-tuning](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fsymbolic-instruction-tuning) | [sail\u002Fsymbolic-instruction-tuning](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsail\u002Fsymbolic-instruction-tuning) | sail-sg                    | 800K      | ML    | SI    |      |                                                              | Human Synthetic Examples                                     |                                                              |\n| Medical QA [michael-wzhu\u002FPromptCBLUE](https:\u002F\u002Fgithub.com\u002Fmichael-wzhu\u002FPromptCBLUE) | [michaelwzhu\u002FChatMed_Consult_Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmichaelwzhu\u002FChatMed_Consult_Dataset) | michael-wzhu               | 110113    | CN    | SI    |      |                                                              | Online medical consultation questions (110,113) reflecting the real-world needs of different users\u002Fpatients; responses are currently generated by OpenAI's \`GPT-3.5\` engine. |                                                              |\n| 
[mbzuai-nlp\u002FLaMini-LM](https:\u002F\u002Fgithub.com\u002Fmbzuai-nlp\u002FLaMini-LM) | [MBZUAI\u002FLaMini-instruction](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FLaMini-instruction) | MBZUAI\u002FLaMini-instruction  | **2.58M** | EN    | MT    | SI   |                                                              | knowledge extracted from large language models via offline distillation                         |                                                              |\n| [pCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE)              |       [pCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE)        |                            | 1.2M    |       |       |      |                                                              |                                                              |                                                              |\n| [WizardLM](https:\u002F\u002Fgithub.com\u002Fnlpxucan\u002FWizardLM)             | [victor123\u002Fevol_instruct_70k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fvictor123\u002Fevol_instruct_70k) | WizardLM                   | 70k       | EN    | MT    |      |                                                              |                                                              |                                                              |\n\n# RLHF Datasets\n\n## Statistics\n\n|                           Project                            | Links                                                        |              Org              | Nums   |  Lang   | Summary                                                      |\n| 
:----------------------------------------------------------: | ------------------------------------------------------------ | :---------------------------: | ------ | :-----: | ------------------------------------------------------------ |\n| [webgpt_comparisons](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons) |                                                              |            Openai             | 19,578 | English | In the [WebGPT paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332), the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total. |\n|    [SHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)    |                                                              |          stanfordnlp          | 349 K  | English | SHP is a dataset of 385K collective human preferences over responses to questions\u002Finstructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., [SteamSHP](https:\u002F\u002Fhuggingface.co\u002Fstanfordnlp\u002FSteamSHP-flan-t5-xl)). 
|\n| [rlhf-reward-datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyitingxie\u002Frlhf-reward-datasets) |                                                              |           yitingxie           | 76.3 k | English |                                                              |\n| [Dahoas\u002Ffull-hh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Ffull-hh-rlhf) |                                                              |            Dahoas             | 112 k  | English | Anthropic's HH dataset reformatted into prompt, chosen, rejected samples. |\n| [Dahoas\u002Fsynthetic-instruct-gptj-pairwise](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Fsynthetic-instruct-gptj-pairwise) |                                                              |            Dahoas             |        | English |                                                              |\n| [Dahoas\u002Frm-static](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Frm-static) |                                                              |            Dahoas             | 76.3k  | English | Split of [hh-static](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Fstatic-hh) used for training reward models after supervised fine-tuning. |\n| [Anthropic\u002Fhh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf) |                                                              |           Anthropic           | 22k    | English | This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data. 
|\n| [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM) |                                                              | Instruction-Tuning-with-GPT-4 | 52k    | English | Ranked responses (Note: Data is evaluated by `GPT-4` model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes \"GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses\" |\n| [thu-coai\u002FSafety-Prompts](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafety-Prompts) | [thu-coai\u002FSafety-Prompts](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fthu-coai\u002FSafety-Prompts) |           thu-coai            | 100k   | Chinese | 中文安全prompts，用于评测和提升大模型的安全性，将模型的输出与人类的价值观对齐。 |\n| [Chatgpt-Comparison-Detection project](https:\u002F\u002Fgithub.com\u002FHello-SimpleAI\u002Fchatgpt-comparison-detection) | [Hello-SimpleAI\u002FHC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3) |                               | 24.3K  | English | Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions. |\n\n# Open ChatLLMs\n\n| Release    | Model_name                                                   | Base          | Model_Size | Datasets                                                     | Number of Instances | Language    |\n| ---------- | ------------------------------------------------------------ | ------------- | ---------- | ------------------------------------------------------------ | ------------------- | ----------- |\n| 2022-12    | GPT-3 Self Inst.                                             
| GPT-3         | 175B       | Self-Instruct                                                | 82 k                | En          |\n| 2023-03-03 | [alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)       | LLaMA         | 7B         | [alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json) | 52 k                | En          |\n| 2023-03-19 | [alpaca-lora](https:\u002F\u002Fgithub.com\u002Ftloen\u002Falpaca-lora\u002Fcommits\u002Fmain) | LLaMA         | 7B 13B 30B | [alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json)、[alpaca_data_cleaned](https:\u002F\u002Fgithub.com\u002Ftloen\u002Falpaca-lora\u002Fblob\u002Fmain\u002Falpaca_data_cleaned.json) | 52 k                | En          |\n| 2023-03-23 | [Chinese-Vicuna](https:\u002F\u002Fgithub.com\u002FFacico\u002FChinese-Vicuna)   | LLaMA         | 7B 13B     | [BELLE](https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE)、[GuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | 1M                  | Zh          |\n| 2023-03-24 | [Alpaca-CoT](https:\u002F\u002Fgithub.com\u002FPhoebusSi\u002FAlpaca-CoT)        | LLaMA         | 7B         | [dataset](https:\u002F\u002Fgithub.com\u002FPhoebusSi\u002FAlpaca-CoT#statistics) | ----                | En Zh       |\n| 2023-03-25 | [dolly](https:\u002F\u002Fgithub.com\u002Fdatabrickslabs\u002Fdolly)             | dolly         | 6B         | [alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json) | 52 k                | En          |\n| 2023-03-25 | [guanaco](https:\u002F\u002Fhuggingface.co\u002FKBlueLeaf\u002Fguanaco-7B-leh)   | LLaMA         | 7B         | [GuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | 534 k               | En Zh Ja De |\n| 
2023-03-28 | [Chinese-LLaMA-Alpaca](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca) | LLaMA         | 7B         | [alpaca_data_zh](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca\u002Ftree\u002Fmain\u002Fdata), [pCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE), [translation2019zh](https:\u002F\u002Fgithub.com\u002Fbrightmart\u002Fnlp_chinese_corpus#5%E7%BF%BB%E8%AF%91%E8%AF%AD%E6%96%99translation2019zh), [alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json), Self-Instruct | 2M                  | Zh          |\n| 2023-03-29 | [ColossalChat](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI)      | LLaMA         | 7B 13B     | [InstructionWild](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild) | 104 k               | En Zh       |\n| 2023-03-31 | [Luotuo](https:\u002F\u002Fgithub.com\u002FLC1332\u002FLuotuo-Chinese-LLM)       | LLaMA ChatGLM | 7B 6B      | [trans_chinese_alpaca_data](https:\u002F\u002Fgithub.com\u002FLC1332\u002FChinese-alpaca-lora\u002Fblob\u002Fmain\u002Fdata\u002Ftrans_chinese_alpaca_data.json) | 52k                 | Zh          |\n| 2023-03-31 | [cerebras-lora-alpaca](https:\u002F\u002Fgithub.com\u002Flxe\u002Fcerebras-lora-alpaca) | Cerebras-GPT  | 2.7B       | [AlpacaDataCleaned](https:\u002F\u002Fgithub.com\u002Fgururise\u002FAlpacaDataCleaned) | 52k                 | En          |\n\n# The template\n\nAppend the new project at the end of the file:\n```markdown\n\n[{Project-name}\u002F{Dataset-name}](https:\u002F\u002Fgithub.com\u002Flink\u002Fto\u002Fproject)\n\n- [paper\u002Fproject link](link)\n- [dataset link](link)\n- Related work: (if applicable)\n\nSome introductions ...\n\n```\n\n# The Prompt Datasets List\n\n## [Alpaca - Stanford](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n\n- [Paper\u002FProject Link](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n- 
[Dataset Link](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n- Data generation model: text-davinci-003\n- Cost: $600\n\nAlpaca, released by Stanford, is an instruction-tuned model based on Meta AI's LLaMA model.\n\nAlpaca automatically generated 52k instruction examples using text-davinci-003 and used them to fine-tune the LLaMA model. Experimental results show that it can reach or even exceed the performance of GPT-3.5 on some tasks.\n\n## [Instruction in the Wild](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild)\n\n- [Paper\u002FProject Link](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossalChat)\n- [Dataset Link](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild)\n- Data generation model: text-davinci-003\n\nInstruction tuning is a key component of ChatGPT. OpenAI used their user-based instruction dataset, but unfortunately, this dataset is not open-sourced. Self-Instruct released a small instruction dataset including 175 instructions written by human labelers. The Stanford Alpaca team generated 52K instructions with the text-davinci-003 model based on the 175 seed instructions above.\n\nThis project targets a larger and more diverse instruction dataset. To this end, we collected 429 instructions from ChatGPT usage screenshots and released both English and Chinese versions. We found these instructions are very diverse even if the scale is still small. We follow Alpaca to generate 52K instructions and their responses. All data can be found in the data dir.\n\nNote: This is an ongoing project. We are still collecting and improving our data. We release this dataset as early as possible to speed up our LLM research. 
We will also release a whitepaper soon.\n\n## [JosephusCheung\u002FGuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset)\n\n- Data generation model: text-davinci-003\n- Cost: $6000\n\n52K instruction examples generated by a modified self-instruct pipeline from 429 human-written seed tasks.\n\n\n## [Stanford Human Preferences Dataset (SHP)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)\n\n- [Data Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)\n\nSHP is a dataset of 385K collective human preferences over responses to questions\u002Finstructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., [SteamSHP](https:\u002F\u002Fhuggingface.co\u002Fstanfordnlp\u002FSteamSHP-flan-t5-xl)).\n\nEach example is a Reddit post with a question\u002Finstruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively). SHP exploits the fact that if comment A was written after comment B but has a higher score nonetheless, then A is ostensibly more preferred to B. If A had been written before B, then we could not conclude this, since its higher score could have been the result of more visibility. We chose data where the preference label is intended to reflect which response is more helpful rather than which is less harmful, the latter being the focus of much past work.\n\nHow is SHP different from [Anthropic's HH-RLHF dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf)? 
Most notably, all the data in SHP is naturally occurring and human-written, whereas the responses in HH-RLHF are machine-written, giving us two very different distributions that can complement each other.\n\n\n## [Hello-SimpleAI\u002FHC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3)\n\n- Summary: The first human-ChatGPT comparison corpus (English version), named the HC3 dataset\n- Data generation model: `gpt-3.5`, `human generated`\n- paper: [How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.07597)\n- Cost: N\u002FA\n\n## [Hello-SimpleAI\u002FHC3-Chinese](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3-Chinese)\n\n- Summary: The first human-ChatGPT comparison corpus (Chinese version), named the HC3 dataset\n- Data generation model: `gpt-3.5`, `human generated`\n- paper: [How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.07597)\n- Cost: N\u002FA\n\n\n## [allenai\u002Fprosocial-dialog](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fprosocial-dialog)\n\n- Summary: ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms.\n- Data generation model: `gpt-3.5`, `human generated`\n- paper: [ProsocialDialog: A Prosocial Backbone for Conversational Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.12688)\n- Cost: N\u002FA\n\n## [allenai\u002Fnatural-instructions](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fnatural-instructions)\n\n- Summary: A community effort to create a large collection of `1,616 diverse NLP tasks` and their natural language definitions\u002Finstructions.\n- Data generation model: `Human generated`\n- paper: [Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP 
Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.07705)\n- Cost: N\u002FA\n\n\n## [PhoebusSi\u002FAlpaca-CoT](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT)\n\n- Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: Their repository will continuously collect various instruction tuning datasets. [Github Repo](https:\u002F\u002Fgithub.com\u002FPhoebusSi\u002FAlpaca-CoT)\n- paper: N\u002FA\n- Cost: N\u002FA\n\n## [nomic-ai\u002Fgpt4all](https:\u002F\u002Fgithub.com\u002Fnomic-ai\u002Fgpt4all)\n\n- Summary: gpt4all leverages three publicly available datasets: 1. [laion\u002FOIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flaion\u002FOIG), 2. [pacovaldez\u002Fstackoverflow-questions](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fpacovaldez\u002Fstackoverflow-questions), 3. a subset of [bigscience\u002Fbloomz-p3](https:\u002F\u002Fhuggingface.co\u002Fbigscience\u002Fbloomz-p3)\n- Data generation model: N\u002FA\n- paper: [GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo](https:\u002F\u002Fs3.amazonaws.com\u002Fstatic.nomic.ai\u002Fgpt4all\u002F2023_GPT4All_Technical_Report.pdf)\n- Cost: $500\n\n\n## [bigscience\u002FxP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FxP3)\n\n- Summary: [Prompt-resource] xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 languages & 16 NLP tasks.\n- Data generation model: N\u002FA\n- paper: [Crosslingual Generalization through Multitask Finetuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01786)\n- Cost: N\u002FA\n\n\n\n## [teknium1\u002FGPTeacher](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher)\n\n- Summary: A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer\n- Data generation model: `GPT-4`\n- paper: N\u002FA\n- Cost: N\u002FA\n\n## 
[thunlp\u002FUltraChat](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FUltraChat)\n\n- Summary: UltraChat aims to construct an open-source, large-scale, multi-round dialogue dataset. The first part of UltraChat (i.e., the Questions about the World sector) is released, which contains 280k diverse and informative dialogues. More dialogues about writing and creation, and assistance with existing materials, are to come.\n- Data generation model: `GPT-3.5-turbo`\n- paper: N\u002FA\n- Cost: N\u002FA\n\n## [cascip\u002FChatAlpaca](https:\u002F\u002Fgithub.com\u002Fcascip\u002FChatAlpaca)\n\n- Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and a Chinese-translated version are to come.\n- Data generation model: `GPT-3.5-turbo`\n- paper: N\u002FA\n- Cost: N\u002FA\n- Related: [(tatsu-lab\u002FAlpaca)|52K|EN|MT|SI](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n\n## [YeungNLP\u002Ffirefly-train-1.1M](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYeungNLP\u002Ffirefly-train-1.1M)\n- Summary: Chinese datasets of 23 tasks combined with human-written instruction templates. \n- Data generation model: N\u002FA\n- paper: N\u002FA\n- Cost: N\u002FA\n\n## [orhonovich\u002Funnatural-instructions](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n- Summary: 64K examples by prompting a language model with three seed examples of instructions and eliciting a fourth. 
Then the set is expanded to 240K by prompting the model to rephrase each instruction.\n- Data generation model: `text-davinci-002`\n- paper: [Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09689)\n- Cost: N\u002FA\n\n## [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM)\n- Summary: 52K instruction-following examples generated by GPT-4 from the original Alpaca prompts and from Alpaca prompts translated into Chinese by ChatGPT, plus 9K instruction-following examples generated by GPT-4 from Unnatural Instructions prompts.\n- Data generation model: `GPT-4`\n- paper: [Instruction Tuning with GPT-4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03277)\n- Cost: N\u002FA\n- Related: \n    - [(tatsu-lab\u002FAlpaca)|52K|EN|MT|SI](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n    - [(orhonovich\u002Funnatural-instructions)|240K|EN|MT|MIX](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n\n## [databrickslabs\u002Fdolly](https:\u002F\u002Fgithub.com\u002Fdatabrickslabs\u002Fdolly\u002Ftree\u002Fmaster\u002Fdata)\n- Summary: This dataset was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.\n- Data generation model: N\u002FA\n- paper: [Free Dolly](https:\u002F\u002Fwww.databricks.com\u002Fblog\u002F2023\u002F04\u002F12\u002Fdolly-first-open-commercially-viable-instruction-tuned-llm)\n- Cost: N\u002FA\n\n## [OpenAssistant\u002Foasst1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1)\n- Summary: OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 
different languages, annotated with 461,292 quality ratings. \n- Data generation model: N\u002FA\n- paper: [OpenAssistant Conversations - Democratizing Large Language Model Alignment](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F10iR5hKwFqAKhL3umx8muOWSRm7hs5FqX\u002Fview)\n- Cost: N\u002FA\n\n## BELLE\u002Fdata\u002F1.5M\n\n- Download: [https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Ftree\u002Fmain\u002Fdata\u002F1.5M](https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Ftree\u002Fmain\u002Fdata\u002F1.5M)\n- Size: 1.5M\n- Generation method: self-instruct, using Chinese seed tasks and OpenAI's text-davinci-003 API\n- Tasks: based on 175 seed tasks, [https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Fblob\u002Fmain\u002Fdata\u002F1.5M\u002Fzh_seed_tasks.json](https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Fblob\u002Fmain\u002Fdata\u002F1.5M\u002Fzh_seed_tasks.json)\n- Data sample: [https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBelleGroup\u002Ftrain_0.5M_CN](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBelleGroup\u002Ftrain_0.5M_CN)\n\n## alpaca_chinese_dataset\n\n- Download: [https:\u002F\u002Fgithub.com\u002Fhikariming\u002Falpaca_chinese_dataset](https:\u002F\u002Fgithub.com\u002Fhikariming\u002Falpaca_chinese_dataset)\n- Size: 52k\n- Generation method: machine translation of the original [stanford_alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca) data via ChatGPT, with manual verification to ensure quality\n- Tasks: same as the original [stanford_alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca); the full task list is in the original project's [seed_task.json](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Fseed_tasks.jsonl)\n\n## Med-ChatGLM\u002Fdata\n\n- Download: [https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FMed-ChatGLM](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FMed-ChatGLM)\n- Size: 7k\n- Generation method: QA data built around a medical knowledge base using the GPT-3.5 API, with multiple prompt formats designed to fully exploit the knowledge\n- Tasks: medical question answering, covering complications, risk factors, histological examination, clinical symptoms, drug treatment, and adjuvant therapy\n\n## pCLUE\n\n- Download: 
[https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE)\n- Size: 1.2M\n- Generation method: generated from existing NLP task datasets combined with task-specific prompt templates\n- Tasks: 9 NLP datasets, covering text classification, natural language inference, semantic matching, coreference resolution, keyword recognition, and reading comprehension\n\n## COIG\n\n- Download: [https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG)\n\n- Size:\n\n  - Translated Instructions (67,798)\n  - Exam Instructions (63,532)\n  - Human Value Alignment Instructions (34,471)\n  - Counterfactual Correction Multi-round Chat (13,653)\n  - Leetcode Instructions (11,737)\n\n- Generation method: merges data from multiple domains; see the paper [Chinese Open Instruction Generalist: A Preliminary Release](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.07987) for details\n\n- https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FInstructionZoo\n- https:\u002F\u002Fgithub.com\u002Flightaime\u002Fcamel\n\n# The RLHF Datasets List\n\n## [Anthropic\u002Fhh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf)\n\n- Summary: This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data. 
\n- Data generation model: `Anthropic RL-CAI 52B`\n- paper: [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.05862)\n- Cost: N\u002FA\n\n## [HuggingFaceH4\u002Fstack-exchange-preferences](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002Fstack-exchange-preferences)\n\n- Summary: This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.\n- Data generation model: N\u002FA\n- paper: [A General Language Assistant as a Laboratory for Alignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.00861)\n- Cost: N\u002FA\n\n## [stanfordnlp\u002FSHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)\n\n- Summary: Each example is a Reddit post with a question\u002Finstruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).\n- Data generation model: N\u002FA\n- paper: N\u002FA\n- Cost: N\u002FA\n\n## [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM)\n\n- Summary: Ranked responses (Note: Data is evaluated by `GPT-4` model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. 
The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" \n- Data generation model: `GPT-4`\n- paper: [Instruction Tuning with GPT-4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03277)\n- Cost: N\u002FA\n- Related: \n    - [(tatsu-lab\u002FAlpaca)|52K|EN|MT|SI](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n\n\n## Natural Instruction \u002F Super-Natural Instruction\n\n- [Paper\u002FProject link](https:\u002F\u002Faclanthology.org\u002F2022.acl-long.244.pdf)\n- [Dataset link](https:\u002F\u002Finstructions.apps.allenai.org\u002F)\n\nAllen AI was the first organization to use instructions as prompts for fine-tuning LLMs. The Natural Instructions paper lays out the annotation ideas behind the instructions.\n\nIts proposed dataset includes 61 different NLP tasks.\n\nSuper-Natural Instruction is a greatly expanded version of Natural Instruction, which contains more than 1,600 different NLP tasks across more than 76 task types (such as classification, extraction, and sequence labeling).\n\n## [BigScience\u002FP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FP3)\n\n- [Paper\u002FProject Link](https:\u002F\u002Fgithub.com\u002Fbigscience-workshop\u002Fpromptsource)\n- [Dataset Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FP3)\n\nBigScience is jointly organized by Hugging Face and French institutions such as CNRS, IDRIS, and GENCI. It is one of the largest open-source LLM organizations.\n\nBigScience developed the PromptSource project at the end of 2021 and open-sourced a series of toolkits to help researchers build prompts based on existing NLP tasks. So far, the PromptSource project contains more than 2000 prompt templates for 270 NLP tasks.\n\nOn this basis, BigScience constructed the P3 dataset. 
You can find the P3 data on the Hugging Face Hub; its size is in the 100M-1B range.\n\n## xMTF - BigScience\n\n- [Project Link](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01786.pdf)\n- [Dataset Link](https:\u002F\u002Fgithub.com\u002Fbigscience-workshop\u002Fxmtf)\n\nStarting from its English prompts, BigScience extended them to multiple non-English languages.\n\nThe project covers 13 NLP tasks and is available in 46 different languages; the prompts themselves are written in a varying number of languages.\n\nAfter multilingual fine-tuning, both BLOOM and T0 achieved strong multilingual ability.\n\n## HH-RLHF - Anthropic\n\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05862.pdf)\n- [Dataset Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf)\n\nClaude, developed by Anthropic, is one of ChatGPT's main competitors.\n\nAnthropic has open-sourced the RLHF dataset used in its own product line.\n\nThe HH-RLHF project was created to train Helpful and Harmless (HH) LLMs; its human feedback therefore reflects not only the quality of a response but also whether it contains harmful information.\n\nThe paper describes how the RLHF data is used to align model behavior with human values, and documents how the dataset was constructed and by what standards.\n\n## [Unnatural Instruction](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09689.pdf)\n- [Dataset Link](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n\nUsing LLMs to independently generate instruction data is an active direction in the field of instruction tuning.\n\nUnnatural Instruction uses GPT-3 (text-davinci-002) to generate 64k instruction prompts. 
The same model is then used to rewrite the 64k prompts, yielding 240k instruction examples in total.\n\nThe paper shows that LLM-generated prompts work well for instruction tuning, even surpassing models such as T0 that were fine-tuned on P3 and other data.\n\n## [Self-Instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct)\n\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10560.pdf)\n- [Dataset Link](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct)\n\nSelf-Instruct likewise uses LLMs to generate prompts for instruction tuning, but with a more fine-grained generation process.\n\nConcepts such as a task pool and quality filtering were introduced to partially alleviate the noise problem of self-instruct-style data.\n\n## [UnifiedSKG - HKU](https:\u002F\u002Funifiedskg.com\u002F)\n\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05966.pdf)\n\n- [DataSet Link](https:\u002F\u002Funifiedskg.com\u002F)\n\nUnifiedSKG adds knowledge grounding to the text-to-text framework: within the prompt-output setup, structured data is included as auxiliary context.\n\nAs an example, some NLP tasks rely heavily on structured knowledge bases\u002Fdatabases. The idea of UnifiedSKG is to serialize the required database and embed it into the prompt. 
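The serialization step can be illustrated with a small sketch. The flattened "col : ... row 1 : ..." layout below follows the table-linearization style used by UnifiedSKG-like systems, but the helper names and the exact separator format are assumptions for illustration, not the project's actual API.

```python
def linearize_table(header, rows):
    """Flatten a structured table into a single string so it can be
    embedded directly into a text-to-text prompt."""
    parts = ["col : " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(str(cell) for cell in row))
    return " ".join(parts)

def build_grounded_prompt(question, header, rows):
    """Start from the question, then append the serialized table as
    structured-knowledge context."""
    return f"{question} ; structured knowledge : {linearize_table(header, rows)}"

# Toy table (illustrative data).
header = ["city", "population"]
rows = [["Paris", 2_100_000], ["Lyon", 516_000]]

prompt = build_grounded_prompt("Which city has the larger population?", header, rows)
```

A text-to-text model fine-tuned on such prompts can then answer table questions without any task-specific architecture, which is the point of the unified framework.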
UnifiedSKG represents a direction in the field of LLMs that attempts to use structured knowledge to enhance performance.\n\n## [Google\u002FFlan Collection](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2)\n\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13688.pdf)\n- [Dataset Link](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2)\n\nIn this project, Google merged its own Flan 2021 data with open-source instruction data (P3, Super-Natural Instructions, etc.).\n\nIn the Flan Collection paper, Google also summarizes key points of training and inference for the Flan series of models, which may be a useful reference.\n\nThe Flan Collection compiles datasets from Flan 2021, P3 and Super-Natural Instructions, along with dozens more datasets, into one place, and formats them into a mix of zero-shot, few-shot and chain-of-thought templates.\n\n## InstructDial\n\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12673.pdf)\n- [Dataset Link](https:\u002F\u002Fgithub.com\u002Fprakharguptaz\u002FInstructdial\u002Ftree\u002Fmain\u002Fdatasets)\n\nInstructDial is an attempt at instruction tuning on a specific task type. Experimental results show that a model fine-tuned on dialogue instruction data performs better on dialogue tasks than models fine-tuned on very large-scale general task sets.\n\n\n## ChatGPT Distillation Data\n\nPublic User-Shared Dialogues with ChatGPT (ShareGPT): around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated on the user-query level and removed any non-English conversations. 
This leaves approximately 30K examples.\n\nHuman ChatGPT Comparison Corpus (HC3): we use both the human and ChatGPT responses from the [HC3 English dataset](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.07597), which contains around 60K human answers and 27K ChatGPT answers for around 24K questions, for a total of around 87K question-answer examples.\n\n\n## Open Instruction Generalist (OIG)\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.03300)\n- [Dataset Link](https:\u002F\u002Flaion.ai\u002Fblog\u002Foig-dataset\u002F)\n\nWe use a manually-selected subset of components from the [Open Instruction Generalist dataset](https:\u002F\u002Flaion.ai\u002Fblog\u002Foig-dataset\u002F) curated by LAION. Specifically, we use the grade-school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.\n\n\n## OpenAI WebGPT\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.03300)\n- [Dataset Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons)\n\nIn the [WebGPT paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332), the authors trained a reward model from human feedback. They used the reward model to train a long-form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.\n\nEach example in the dataset contains a pair of model answers for a question, and the associated metadata. Each answer has a preference score from humans that can be used to determine which of the two answers is better. \n\n## OpenAI Summarization
\n- [Paper\u002FProject Link](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.03300)\n- [Dataset Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fsummarization)\n\nThe OpenAI summarization dataset contains ~93K examples; each example consists of human feedback on summarizations generated by a model. Human evaluators chose the superior summary from two options.\n\n# Datasets without license information\n\n## [alespalla\u002Fchatbot_instruction_prompts](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falespalla\u002Fchatbot_instruction_prompts)\n\n- Summary: A compilation of `tatsu-lab\u002Falpaca`, `Dahoas\u002Finstruct-human-assistant-prompt` and `allenai\u002Fprosocial-dialog`\n- Data generation model: N\u002FA\n- paper: N\u002FA\n- Cost: N\u002FA\n\n# Contributing\n\nOur purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.\n\n# License\n\n`Awesome-Prompt-Dataset` is released under the Apache 2.0 license.\n\n## Reference\n- https:\u002F\u002Fgithub.com\u002FZjh-819\u002FLLMDataHub\n- https:\u002F\u002Fgithub.com\u002Fraunak-agarwal\u002Finstruction-datasets\n- https:\u002F\u002Fgithub.com\u002Fzhilizju\u002FAwesome-instruction-tuning\n- https:\u002F\u002Fgithub.com\u002FRenzeLou\u002Fawesome-instruction-learning\n- https:\u002F\u002Fgithub.com\u002Fneuml\u002Ftxtinstruct\n","\u003Cdiv align=\"center\">\n\n# 优秀指令数据集 \n[![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re)\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n[中文](README_zh.md) | 英文\n\u003C\u002Fdiv>\n\n# 目录\n- [优秀提示词数据集](#awesome-prompt-datasets)\n- [目录](#contents)\n- [简介](#introduction)\n- [提示词数据集](#prompt-datasets)\n  - [统计信息](#statistics)\n- [人类反馈强化学习 (RLHF) 数据集](#rlhf-datasets)\n  - [统计信息](#statistics-1)\n- [模板](#the-template)\n- [提示词数据集列表](#the-prompt-datasets-list)\n  - [Alpaca - Stanford](#alpaca--stanford)\n  - [Instruction in the 
Wild](#instruction-in-the-wild)\n  - [JosephusCheung\u002FGuanacoDataset](#josephuscheungguanacodataset)\n  - [Stanford Human Preferences Dataset (SHP)](#stanford-human-preferences-dataset-shp)\n  - [Hello-SimpleAI\u002FHC3](#hello-simpleaihc3)\n  - [Hello-SimpleAI\u002FHC3-Chinese](#hello-simpleaihc3-chinese)\n  - [allenai\u002Fprosocial-dialog](#allenaiprosocial-dialog)\n  - [allenai\u002Fnatural-instructions](#allenainatural-instructions)\n  - [PhoebusSi\u002FAlpaca-CoT](#phoebussialpaca-cot)\n  - [nomic-ai\u002Fgpt4all](#nomic-aigpt4all)\n  - [bigscience\u002FxP3](#bigsciencexp3)\n  - [teknium1\u002FGPTeacher](#teknium1gpteacher)\n  - [thunlp\u002FUltraChat](#thunlpultrachat)\n  - [cascip\u002FChatAlpaca](#cascipchatalpaca)\n  - [YeungNLP\u002Ffirefly-train-1.1M](#yeungnlpfirefly-train-11m)\n  - [orhonovich\u002Funnatural-instructions](#orhonovichunnatural-instructions)\n  - [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](#instruction-tuning-with-gpt-4gpt-4-llm)\n  - [databrickslabs\u002Fdolly](#databrickslabsdolly)\n  - [OpenAssistant\u002Foasst1](#openassistantoasst1)\n  - [BELLE\u002Fdata\u002F1.5M](#belledata15m)\n  - [alpaca\\_chinese\\_dataset](#alpaca_chinese_dataset)\n  - [Med-ChatGLM\u002Fdata](#med-chatglmdata)\n  - [pCLUE](#pclue)\n  - [COIG](#coig)\n- [人类反馈强化学习 (RLHF) 数据集列表](#the-rlhf-datasets-list)\n  - [Anthropic\u002Fhh-rlhf](#anthropichh-rlhf)\n  - [HuggingFaceH4\u002Fstack-exchange-preferences](#huggingfaceh4stack-exchange-preferences)\n  - [stanfordnlp\u002FSHP](#stanfordnlpshp)\n  - [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](#instruction-tuning-with-gpt-4gpt-4-llm-1)\n  - [Natural Instruction \u002F Super-Natural Instruction](#natural-instruction--super-natural-instruction)\n  - [BigScience\u002FP3](#bigsciencep3)\n  - [xMTF - BigScience](#xmtf---bigscience)\n  - [HH-RLHF - Anthropic](#hh-rlhf---anthropic)\n  - [Unnatural Instruction](#unnatural-instruction)\n  - [Self-Instruct](#self-instruct)\n  - [UnifiedSKG - HKU](#unifiedskg---hku)\n 
 - [Google\u002FFlan Collection](#googleflan-collection)\n  - [InstructDial](#instructdial)\n  - [ChatGPT Distillation Data](#chatgpt-distillation-data)\n  - [Open Instruction Generalist (OIG).](#open-instruction-generalist-oig)\n  - [OpenAI WebGPT.](#openai-webgpt)\n  - [OpenAI Summarization.](#openai-summarization)\n- [无许可信息的数据集](#datasets-without-license-information)\n  - [alespalla\u002Fchatbot\\_instruction\\_prompts](#alespallachatbot_instruction_prompts)\n- [贡献](#contributing)\n- [许可证](#license)\n\n\n# 简介\n欢迎来到 \"awesome-prompt-datasets\"，这是一个全面的高质量开源指令微调数据集集合，用于训练基于对话的大型语言模型（ChatGPT, LLaMA, Alpaca）。\n\n指令微调 (Instruction Tuning) \u002F 人类反馈强化学习 (Reinforcement Learning from Human Feedback, RLHF) 数据集是像 ChatGPT 这样遵循指令的大型语言模型 (LLMs) 的关键组成部分。本仓库致力于提供一份全面的指令微调数据集列表，这些数据集被用于各种大型语言模型中，使研究人员和开发者更容易访问和利用这些资源。\n\n通过 \"awesome-prompt-dataset\"，您可以加速在自然语言处理 (Natural Language Processing, NLP) 领域的研发工作，并解锁创新的新机遇。让我们一起探索无限可能！\n\n# 提示词数据集\n\n参考 [此链接](https:\u002F\u002Fgithub.com\u002FyaodongC\u002Fawesome-instruction-dataset) ([@yaodongC](https:\u002F\u002Fgithub.com\u002FyaodongC))，我们根据以下规则对每个收集到的数据集进行了标记：\n\n**(语言) 语言标签**:\n\n- EN: 英语指令数据集\n- CN: 中文指令数据集\n- ML: [多语言] 多种语言的指令数据集\n\n**(任务) 任务标签**:\n\n- MT: [多任务] 包含多个任务的数据集\n- TS: [特定任务] 针对特定任务定制的数据集\n\n**(生成) 生成方法**:\n\n- HG: [人工生成数据集] 由人类创建的数据集\n- SI: [自指令] 使用自指令 (Self-Instruct) 方法生成的数据集\n- MIX: [混合数据集] 包含人工和机器生成数据的数据集\n- COL: [数据集集合] 由其他数据集集合而成的数据集\n\n## 统计信息\n\n| 项目                                                      |                           数据集                           | 组织                        | 数量      | 语言  | 任务  | 生成 | 类型                                                         | 来源                                                          | 链接                                                          |\n| :----------------------------------------------------------- | :----------------------------------------------------------: | -------------------------- | :-------- | :---- | :---- | :--- | 
:----------------------------------------------------------- | :----------------------------------------------------------- | :----------------------------------------------------------- |\n| [思维链 (Chain of Thought)](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN)  | [cot_data](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2\u002Fcot_data) \\|[few_shot_data](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2\u002Fniv2_few_shot_data) | Google                     | 74771     | EN\u002FCN | MT    | HG   | 使用思维链 (CoT) 推理的指令                                  | 在现有数据上标注 CoT                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FChain-of-Thought) |\n| [GPT4all](https:\u002F\u002Fgithub.com\u002Fnomic-ai\u002Fgpt4all)               | [nomic-ai\u002Fgpt4all-j-prompt-generations](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnomic-ai\u002Fgpt4all-j-prompt-generations) | nomic-ai                   | 806199    | EN    | MT    | COL  | 代码、故事和对话                                     | 从 GPT-3.5-turbo 蒸馏                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGPT4all) |\n| [GPTeacher](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher)           | [GPT-4 General-Instruct ](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FInstruct)\\|[Roleplay-Instruct](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FRoleplay) \\|[Code-Instruct ](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FCodegen)\\| [Toolformer](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher\u002Ftree\u002Fmain\u002FToolformer) | teknium1                   | 29013     | EN    | MT    | SI   | 通用、角色扮演、Toolformer                       
         | GPT-4 & Toolformer                                           | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGPTeacher) |\n| [Guanaco](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | [JosephusCheung\u002FGuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | JosephusCheung             | 534610    | ML    | MT    | SI   | 各种语言任务                                     | text-davinci-003                                             | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGuanaco) |\n| [HC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3)    | [Hello-SimpleAI\u002FHC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3) | Hello-SimpleAI \\| 万得资讯 | 37175     | EN\u002FCN | TS    | MIX  | 对话评估                                          | 人类或 ChatGPT                                             | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FHC3) |\n| [HC3-Chinese](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3-Chinese) | [Hello-SimpleAI\u002FHC3-Chinese](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3-Chinese) | Hello-SimpleAI\\|万得资讯   | 13k       | CN    | TS    | MIX  | 对话评估                                          | 人类或 ChatGPT                                             |                                                              |\n| [alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)       | [tatsu-lab\u002Falpaca](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca) | tatsu-lab                  | 52002     | EN    | MT    | SI   | 通用指令                                             | text-davinci-003                               
              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Falpaca) |\n| [AlpacaDataCleaned](https:\u002F\u002Fgithub.com\u002Fgururise\u002FAlpacaDataCleaned) | [yahma\u002Falpaca-cleaned](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyahma\u002Falpaca-cleaned) | yahma                      | 52k       | EN    | MT    | SI   | 通用指令                                             | text-davinci-003                                             | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Falpaca) |\n| [Chinese-LLaMA-Alpaca](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca) | [alpaca_data_zh_51k](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca\u002Fblob\u002Fmain\u002Fdata\u002Falpaca_data_zh_51k.json) | ymcui(讯飞)                | 51k       | CN    | MT    | SI   | 通用指令                                             | text-davinci-003                                             |                                                              |\n| [Luotuo-Chinese-LLM](https:\u002F\u002Fgithub.com\u002FLC1332\u002FLuotuo-Chinese-LLM)  骆驼 | [trans_chinese_alpaca_data](https:\u002F\u002Fgithub.com\u002FLC1332\u002FLuotuo-Chinese-LLM\u002Fblob\u002Fmain\u002Fdata\u002Ftrans_chinese_alpaca_data.json) | LC1332(商汤)               | 52k       | CN    | MT    | SI   | 通用指令                                             | text-davinci-003                                             |                                                              |\n| [Natural Instructions](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fnatural-instructions) | [Allen AI 61 task](https:\u002F\u002Finstructions.apps.allenai.org\u002F#:~:text=Download%20Natural%2DInstructions%20%2D%20v1.1)\\|[1.5k task](https:\u002F\u002Finstructions.apps.allenai.org\u002F#:~:text=Natural%2DInstructions%20%2D%20v2-,.,-x) | Allen AI                   | 5040134   | ML   
 | MT    | COL  | 多样的自然语言处理 (NLP) 任务                                            | 人工标注数据集集合                          | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FNatural-Instructions) |\n| [belle_cn](https:\u002F\u002Fhuggingface.co\u002FBelleGroup)                | [BelleGroup\u002Ftrain_1M_CN](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbellegroup\u002Ftrain_1M_CN) \\|[BelleGroup\u002Ftrain_0.5M_CN](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbellegroup\u002Ftrain_0.5M_CN) | BelleGroup(链家)           | 1079517   | CN    | TS\u002FMT | SI   | 通用、数学推理、对话                    | text-davinci-003                                             | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fbelle_cn) |\n| [instinwild](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild)   | [instinwild_ch](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild\u002Ftree\u002Fmain\u002Fdata) \\| [instinwild_en](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild\u002Ftree\u002Fmain\u002Fdata) |                            | 52191     | EN\u002FCN | MT    | SI   | 生成、开放问答 (QA)、头脑风暴                              | text-davinci-003                                             | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Finstinwild) |\n| [华驼 (HuaTuo)](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FHuatuo-Llama-Med-Chinese) | [中文医学知识](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FHuatuo-Llama-Med-Chinese\u002Fblob\u002Fmain\u002Fdata\u002Fllama_data.json) \\|[肝癌](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FHuatuo-Llama-Med-Chinese\u002Fblob\u002Fmain\u002Fdata-literature\u002Fliver_cancer.json) | SCIR-HI(哈工大)            | 8K        | CN    | TS    | SI   | 公开和自建的中文医学知识库                                   | GPT-3.5                                                 
      |                                                              |\n| [prosocial dialog](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fprosocial-dialog) | [allenai\u002Fprosocial-dialog](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fprosocial-dialog) | allenai                    | 165681    | EN    | TS    | MIX  | 对话                                                     | GPT-3 重写问题 + 人工手动反馈          | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fprosocial-dialog) |\n| [finance_en](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fgbharti\u002Ffinance-alpaca) | [gbharti\u002Ffinance-alpaca](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fgbharti\u002Ffinance-alpaca) |                            | 68912     | EN    | TS    | COL  | 金融相关问答                                         | GPT-3.5                                                       | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002F) |\n| [xP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FxP3)        | [bigscience\u002FxP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FxP3) | bigscience                 | 78883588  | ML    | MT    | COL  | 涵盖 46 种语言和 16 个自然语言处理 (NLP) 任务的提示与数据集集合 | 人工标注数据集集合                          | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FxP3) |\n| [firefly](https:\u002F\u002Fgithub.com\u002Fyangjianxin1\u002FFirefly)           | [YeungNLP\u002Ffirefly-train-1.1M](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYeungNLP\u002Ffirefly-train-1.1M) |                            | 1649398   | CN    | MT    | COL  | 23 个自然语言处理 (NLP) 任务                                                 | 人工标注数据集集合                          | 
[下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Ffirefly) |\n| [instruct](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fswype\u002Finstruct)   | [swype\u002Finstruct](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fswype\u002Finstruct) |                            | 888969    | EN    | MT    | COL  | GPT4All, Alpaca、开源 Meta 数据集的增强版      | 使用 AllenAI 提供的先进自然语言处理 (NLP) 工具进行增强 | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Finstruct) |\n| [Code Alpaca](https:\u002F\u002Fgithub.com\u002Fsahil280114\u002Fcodealpaca)     | [sahil280114\u002Fcodealpaca](https:\u002F\u002Fgithub.com\u002Fsahil280114\u002Fcodealpaca\u002Fblob\u002Fmaster\u002Fdata\u002Fcode_alpaca_20k.json) |                            | 20022     | EN    | TS    | SI   | 代码生成、编辑、优化                       | text-davinci-003                                             | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FCodeAlpaca) |\n| [Alpaca_GPT4](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM) | [alpaca_gpt4_data](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM\u002Fblob\u002Fmain\u002Fdata\u002Falpaca_gpt4_data.json)\\|[alpaca_gpt4_data_zh](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM\u002Fblob\u002Fmain\u002Fdata\u002Falpaca_gpt4_data_zh.json) \\|[comparison_data_v2](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM\u002Fblob\u002Fmain\u002Fdata\u002Fcomparison_data_v2.json) | 微软                       | 52002     | EN\u002FCN | MT    | SI   | 通用指令                                             | 使用 Alpaca 由 GPT-4 生成                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FalpacaGPT4) |\n| 
[webGPT](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons) | [openai\u002Fwebgpt_comparisons](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons) | openai                     | 18994     | EN    | TS    | MIX  | 信息检索 (IR) 问答                                | 微调后的 GPT-3，每条指令有两个输出，选择更好的一个 | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FwebGPT) |\n| [dolly 2.0](https:\u002F\u002Fgithub.com\u002Fdatabrickslabs\u002Fdolly)         | [databricks\u002Fdatabricks-dolly-15k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdatabricks\u002Fdatabricks-dolly-15k) | databricks                 | 15015     | EN    | TS    | HG   | 封闭问答、摘要等，以维基百科为参考   | 人工标注                                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fdolly) |\n| [mosaicml\u002Fllm-foundry](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fllm-foundry) | [mosaicml\u002Fdolly_hhrlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmosaicml\u002Fdolly_hhrlhf) | mosaicml                   | 59.3K     | EN    | TS    | HG   | 该数据集是 [Databrick's dolly-15k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdatabricks\u002Fdatabricks-dolly-15k) 数据集和 [Anthropic's HH-RLHF](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf) 过滤子集的组合。 | 人工标注                                              |                                                              |\n| [baize](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot) 白泽 | [alpaca_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Ftree\u002Fmain\u002Fdata) \\|[medical_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Fblob\u002Fmain\u002Fdata\u002Fmedical_chat_data.json) \\| 
[quora_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Fblob\u002Fmain\u002Fdata\u002Fquora_chat_data.json) \\|[stackoverflow_chat_data.json](https:\u002F\u002Fgithub.com\u002Fproject-baize\u002Fbaize-chatbot\u002Fblob\u002Fmain\u002Fdata\u002Fstackoverflow_chat_data.json) | project-baize              | 653699    | EN    | MT    | COL  | 来自 Alpaca、Quora、StackOverflow 和 MedQuAD 问题的集合 | 人工标注数据集集合                          | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fbaize) |\n| [hh-rlhf](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fhh-rlhf)             | [Anthropic\u002Fhh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fanthropic\u002Fhh-rlhf) | Anthropic                  | 284517    | EN    | TS    | MIX  | 对话                                                     | 人类与 RLHF 模型之间的对话                         | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fhh-rlhf) |\n| [OIG(part)](https:\u002F\u002Flaion.ai\u002Fblog\u002Foig-dataset\u002F)              |    [laion\u002FOIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flaion\u002Foig)    | laion                      | 49237     | EN    | MT    | COL  | 源自各种任务，例如问答   | 使用数据增强，人工标注数据集集合 | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FOIG) |\n| [GAOKAO](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench)          | [Fill-in-the-blank_Questions](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench\u002Ftree\u002Fmain\u002Fdata\u002FFill-in-the-blank_Questions) \\| [Multiple-choice_Questions](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench\u002Ftree\u002Fmain\u002Fdata\u002FMultiple-choice_Questions) \\| [Open-ended_Questions](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FGAOKAO-Bench\u002Ftree\u002Fmain\u002Fdata\u002FOpen-ended_Questions) | 
OpenLMLab                  | 2785      | CN    | MT    | COL  | 考试中的选择题、填空题和开放式问题 | 人工标注                                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FGAOKAO) |\n| [camel](https:\u002F\u002Fgithub.com\u002Flightaime\u002Fcamel) \\| 骆驼          | [camel-ai\u002Fai_society](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fai_society)\\|[camel-ai\u002Fbiology](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fbiology) \\|[camel-ai\u002Fphysics](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fphysics) \\|[camel-ai\u002Fchemistry](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fchemistry) \\|[camel-ai\u002Fmath](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcamel-ai\u002Fmath) | camel-ai                   | 760620    | EN    | MT    | SI   | AI 社会、代码、数学、物理、化学、生物领域的角色扮演对话 | gpt-3.5-turbo                                                | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fcamel) |\n| [FLAN-Muffin](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMuennighoff\u002Fflan) | [Muennighoff\u002Fflan](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMuennighoff\u002Fflan) |                            | 1764800   | EN    | MT    | COL  | 60 个自然语言处理 (NLP) 任务                                                 | 人工标注数据集集合                          | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FFLAN-Muffin) |\n| [COIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG)            |      [COIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG)       | BAAI\\|智源                 | 298428    | CN    | MT    | COL  | 收集自考试、翻译、人类价值对齐指令和反事实修正多轮对话 | 使用自动工具和人工验证                 | 
[下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FCOIG) |\n| [GPT4Tools](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools)        | [gpt4tools_71k.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1JKIT-Or1of7TJuWvmrJpPoOx0cLdcWry\u002Fview?usp=share_link) | StevenGrove                | 71446     | EN    | MT    | SI   | 一系列工具相关指令                    | gpt-3.5-turbo                                                | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fgpt4tools) |\n| [ShareChat](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRyokoAI\u002FShareGPT52K) | [RyokoAI\u002FShareGPT52K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRyokoAI\u002FShareGPT52K) | RyokoAI                    | 1663241   | EN    | MT    | MIX  | 通用指令                                             | 众包收集人与 ChatGPT 之间的对话 (ShareGPT) | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FShareGPT) |\n| [Auto CoT](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Fauto-cot)       | [kojima-takeshi188\u002Fzero_shot_cot\u002Fdataset](https:\u002F\u002Fgithub.com\u002Fkojima-takeshi188\u002Fzero_shot_cot\u002Ftree\u002Fmain\u002Fdataset) \\|[kojima-takeshi188\u002Fzero_shot_cot\u002Flog](https:\u002F\u002Fgithub.com\u002Fkojima-takeshi188\u002Fzero_shot_cot\u002Ftree\u002Fmain\u002Flog) | amazon-science             |           | EN    |       |      |                                                              |                                                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FAuto-CoT) |\n| [MOSS](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FMOSS)（复旦 Moss）       | [fnlp\u002Fmoss-002-sft-data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ffnlp\u002Fmoss-002-sft-data)\\| 
[moss-003-sft-data](https:\u002F\u002Fgithub.com\u002FOpenLMLab\u002FMOSS\u002Ftree\u002Fmain\u002FSFT_data\u002Fconversations\u002Fconversation_without_plugins) | fnlp                       | 1583595   | EN\u002FCN | SI    |      |                                                              |                                                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002FMOSS) |\n| [ultrachat](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FUltraChat)             | [stingning\u002Fultrachat](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstingning\u002Fultrachat) | thnlp                      | 28247446  | EN    |       |      |                                                              |                                                              | [下载](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT\u002Ftree\u002Fmain\u002Fultrachat) |\n| [StackLLaMA](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flvwerra\u002Fstack-exchange-paired) | [lvwerra\u002Fstack-exchange-paired](lvwerra\u002Fstack-exchange-paired) |                            | todo      | EN    |       | HG   |                                                              |                                                              |                                                              |\n| [Self-Instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct)   | [yizhongw\u002Fself-instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct\u002Fblob\u002Fmain\u002Fdata\u002Fgpt3_generations\u002Fbatch_221203\u002Fall_instances_82K.jsonl) |                            | 82 K      | EN    | SI    | SI   |                                                              |                                                              |                                                              |\n| 
[Zhihu-KOL](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fwangrui6\u002FZhihu-KOL) | [Zhihu-KOL](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fwangrui6\u002FZhihu-KOL) | Openassisent               | 100 w     |       | SI    | HG   | 用于训练 Open Assistant 的知乎数据                        |                                                              |                                                              |\n| [stanfordnlp\u002FSHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP) | [stanfordnlp\u002FSHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP) | stanfordnlp                | 385 k     | EN    | MT    | HG   |                                                              | 对回复的人类偏好                             |                                                              |\n| [LAION-AI\u002FOpen-Assistant](https:\u002F\u002Fgithub.com\u002FLAION-AI\u002FOpen-Assistant) | [OpenAssistant\u002Foasst1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1) | Openassisent               | 84.4k     | EN    | MT    | HG   | OpenAssistant 对话数据集 (OASST1)                 | 人类生成，人工标注                             |                                                              |\n| [akoksal\u002FLongForm](https:\u002F\u002Fgithub.com\u002Fakoksal\u002FLongForm)      | [akoksal\u002FLongForm](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fakoksal\u002FLongForm) | akoksal\u002FLongForm           | 30k       | EN    | SI    | HG   |                                                              | 从现有语料库（如 C4 和维基百科）中选择一组多样化的人工文档，并通过大语言模型 (LLM) 为给定文档生成指令。 |                                                              |\n| [sail-sg\u002Fsymbolic-instruction-tuning](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fsymbolic-instruction-tuning) | [sail\u002Fsymbolic-instruction-tuning](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsail\u002Fsymbolic-instruction-tuning) | sail-sg             
       | 800K      | ML    | SI    |      |                                                              | 人类合成示例                                     |                                                              |\n| 医疗问答 [michael-wzhu\u002FPromptCBLUE](https:\u002F\u002Fgithub.com\u002Fmichael-wzhu\u002FPromptCBLUE) | [michaelwzhu\u002FChatMed_Consult_Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmichaelwzhu\u002FChatMed_Consult_Dataset) | michael-wzhu               | 110113    | CN    | SI    |      |                                                              | 互联网上的医疗问诊问题 (110,113)，反映了真实世界的不同用户\u002F患者的医疗问诊需求。目前 response 都是由 OpenAI `GPT-3.5`引擎回答的。 |                                                              |\n| [mbzuai-nlp\u002FLaMini-LM](https:\u002F\u002Fgithub.com\u002Fmbzuai-nlp\u002FLaMini-LM) | [MBZUAI\u002FLaMini-instruction](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FLaMini-instruction) | MBZUAI\u002FLaMini-instruction  | **2.58M** | EN    | MT    | SI   |                                                              | 通过离线蒸馏从大型语言模型中提取知识                         |                                                              |\n| [pCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE)              |       [pCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE)        |                            | 120 万    |       |       |      |                                                              |                                                              |                                                              |\n| [WizardLM](https:\u002F\u002Fgithub.com\u002Fnlpxucan\u002FWizardLM)             | [victor123\u002Fevol_instruct_70k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fvictor123\u002Fevol_instruct_70k) | WizardLM                   | 70k       | EN    | MT    |      |                                                              |                                                 
             |                                                              |\n|                                                              |                                                              |                            |           |       |       |      |                                                              |                                                              |                                                              |\n\n# RLHF（人类反馈强化学习）数据集\n\n## 统计信息\n\n|                           Project                            | Links                                                        |              Org              | Nums   |  Lang   | Summary                                                      |\n| :----------------------------------------------------------: | ------------------------------------------------------------ | :---------------------------: | ------ | :-----: | ------------------------------------------------------------ |\n| [webgpt_comparisons](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons) |                                                              |            Openai             | 19,578 | 英语 | 在 [WebGPT 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332) 中，作者从人类反馈中训练了一个奖励模型（Reward Model）。他们使用该奖励模型训练了一个长文本回答问答模型，以与人类偏好对齐。这是 WebGPT 项目结束时标记为适合奖励建模的所有比较的集合。总共有 19,578 个比较样本。 |\n|    [SHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)    |                                                              |          stanfordnlp          | 349 K  | 英语 | SHP 是一个包含 38.5 万个集体人类偏好的数据集，涉及 18 个不同主题领域（从烹饪到法律咨询）的问题\u002F指令的回答。这些偏好旨在反映一个回答相对于另一个回答的帮助程度，并用于训练 RLHF 奖励模型和自然语言生成（NLG）评估模型（例如：[SteamSHP](https:\u002F\u002Fhuggingface.co\u002Fstanfordnlp\u002FSteamSHP-flan-t5-xl)）。 |\n| [rlhf-reward-datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyitingxie\u002Frlhf-reward-datasets) |                                                              |           
yitingxie           | 76.3 k | 英语 |                                                              |\n| [Dahoas\u002Ffull-hh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Ffull-hh-rlhf) |                                                              |            Dahoas             | 112 k  | 英语 | 将 Anthropic 的 HH 数据集重新格式化为提示词（Prompt）、被选择（Chosen）、被拒绝（Rejected）样本。 |\n| [Dahoas\u002Fsynthetic-instruct-gptj-pairwise](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Fsynthetic-instruct-gptj-pairwise) |                                                              |            Dahoas             |        | 英语 |                                                              |\n| [Dahoas\u002Frm-static](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Frm-static) |                                                              |            Dahoas             | 76.3k  | 英语 | [hh-static](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDahoas\u002Fstatic-hh) 的划分版本，用于监督微调（Supervised Fine-tuning）后训练奖励模型。 |\n| [Anthropic\u002Fhh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf) |                                                              |           Anthropic           | 22k    | 英语 | 此 RLHF 数据集是一个迭代的“在线”数据集，包含来自 520 亿参数语言模型的数据。它包含 2.2 万个帮助性比较数据，且不含红队测试（Red-teaming）数据。 |\n| [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM) |                                                              | Instruction-Tuning-with-GPT-4 | 52k    | 英语 | 对三个模型（GPT-4、GPT-3.5 和 OPT-IML）生成的 Alpaca 提示词响应进行排名（注意：数据由 `GPT-4` 模型评估，而非人工）。通过要求 GPT-4 对质量进行评分来实现。作者认为\"GPT-4 能够识别并修正自己的错误，并能准确判断响应的质量”。 |\n| [thu-coai\u002FSafety-Prompts](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafety-Prompts) | [thu-coai\u002FSafety-Prompts](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fthu-coai\u002FSafety-Prompts) |           thu-coai            | 100k   | 中文 
| 中文安全提示词（Prompts），用于评估和提升大模型的安全性，使模型输出与人类价值观对齐。 |\n| [Chatgpt-Comparison-Detection project](https:\u002F\u002Fgithub.com\u002FHello-SimpleAI\u002Fchatgpt-comparison-detection) | [Hello-SimpleAI\u002FHC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3) |                               | 24.3K  | 英语 | 人类 ChatGPT 对比语料库，针对约 2.4 万个问题，包含 6 万个人类回答和 2.7 万个 ChatGPT 回答。 |\n\n# Open ChatLLMs\n\n| 发布时间 | 模型名称                                                   | 基座          | 模型规模 | 数据集                                                     | 实例数量 | 语言    |\n| ---------- | ------------------------------------------------------------ | ------------- | ---------- | ------------------------------------------------------------ | ------------------- | ----------- |\n| 2022-12    | GPT-3 Self Inst.                                             | GPT-3         | 175B       | Self-Instruct                                                | 82 k                | 英文          |\n| 2023-03-03 | [alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)       | LLaMA         | 7B         | [alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json) | 52 k                | 英文          |\n| 2023-03-19 | [alpaca-lora](https:\u002F\u002Fgithub.com\u002Ftloen\u002Falpaca-lora\u002Fcommits\u002Fmain) | LLaMA         | 7B 13B 30B | [alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json)、[alpaca_data_cleaned](https:\u002F\u002Fgithub.com\u002Ftloen\u002Falpaca-lora\u002Fblob\u002Fmain\u002Falpaca_data_cleaned.json) | 52 k                | 英文          |\n| 2023-03-23 | [Chinese-Vicuna](https:\u002F\u002Fgithub.com\u002FFacico\u002FChinese-Vicuna)   | LLaMA         | 7B 13B     | 
[BELLE](https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE)、[GuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | 1M                  | 中文          |\n| 2023-03-24 | [Alpaca-CoT](https:\u002F\u002Fgithub.com\u002FPhoebusSi\u002FAlpaca-CoT)        | LLaMA         | 7B         | [dataset](https:\u002F\u002Fgithub.com\u002FPhoebusSi\u002FAlpaca-CoT#statistics) | ----                | 英文 中文       |\n| 2023-03-25 | [dolly](https:\u002F\u002Fgithub.com\u002Fdatabrickslabs\u002Fdolly)             | dolly         | 6B         | [alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json) | 52 k                | 英文          |\n| 2023-03-25 | [guanaco](https:\u002F\u002Fhuggingface.co\u002FKBlueLeaf\u002Fguanaco-7B-leh)   | LLaMA         | 7B         | [GuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset) | 534 k               | 英文 中文 日文 德文 |\n| 2023-03-28 | [Chinese-LLaMA-Alpaca](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca) | LLaMA         | 7B         | [alpaca_data_zh](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-LLaMA-Alpaca\u002Ftree\u002Fmain\u002Fdata)、[pCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE)、[translation2019zh](https:\u002F\u002Fgithub.com\u002Fbrightmart\u002Fnlp_chinese_corpus#5%E7%BF%BB%E8%AF%91%E8%AF%AD%E6%96%99translation2019zh)、[alpaca_data](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Falpaca_data.json)、Self-Instruct | 2M                  | 中文          |\n| 2023-03-29 | [ColossalChat](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI)      | LLaMA         | 7B 13B     | [InstructionWild](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild) | 104 k               | 英文 中文       |\n| 2023-03-31 | [Luotuo](https:\u002F\u002Fgithub.com\u002FLC1332\u002FLuotuo-Chinese-LLM)      
 | LLaMA ChatGLM | 7B 6B      | [trans_chinese_alpaca_data](https:\u002F\u002Fgithub.com\u002FLC1332\u002FChinese-alpaca-lora\u002Fblob\u002Fmain\u002Fdata\u002Ftrans_chinese_alpaca_data.json) | 52k                 | 中文          |\n| 2023-03-31 | [cerebras-lora-alpaca](https:\u002F\u002Fgithub.com\u002Flxe\u002Fcerebras-lora-alpaca) | Cerebras-GPT  | 2.7B       | [AlpacaDataCleaned](https:\u002F\u002Fgithub.com\u002Fgururise\u002FAlpacaDataCleaned) | 52k                 | 英文          |\n\n# 模板\n\n将新项目追加到文件末尾\n```shell\n\n[{Project-name}\u002F{Dataset-name}]{https:\u002F\u002Fgithub.com\u002Flink\u002Fto\u002Fproject}\n\n- [paper\u002Fproject link](link)\n- [dataset link](link)\n- Related work: (if applicable)\n\nSome introductions ...\n\n```\n\n# 提示词数据集列表\n\n## [Alpaca -Stanford](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n\n- [论文\u002F项目链接](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n- [数据集链接](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n- 数据生成模型：text-davinci-003\n- 成本：$600\n\n斯坦福发布的 Alpaca 是一个基于 Meta AI LLaMA 模型的用于指令微调 (instruct-tuning) 的微调 (fine-tuning) 模型。\n\nAlpaca 使用 GPT-3.5 自动生成了 52k 条指令数据，并用于微调 LLaMA 模型。实验结果表明，它在某些任务上可以达到甚至超越 GPT-3.5 的性能。\n\n## [Instruction in the Wild](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild)\n\n- [论文\u002F项目链接](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossalChat)\n- [数据集链接](https:\u002F\u002Fgithub.com\u002FXueFuzhao\u002FInstructionWild)\n- 数据生成模型：text-davinci-003\n  \n指令微调 (Instruction Tuning) 是 ChatGPT 的关键组成部分。OpenAI 使用了他们基于用户的指令数据集，但不幸的是，该数据集并未开源。Self-Instruct 发布了一个小型指令数据集，包含由人工编写的 175 条指令。斯坦福 Alpaca 团队基于上述 175 条种子指令 (seed instructions)，通过 text-davinci-003 模型生成了 52K 条指令。\n\n本项目旨在构建一个更大且更多样化的指令数据集。为此，我们从 ChatGPT 的使用截图中收集了 429 条指令，并发布了中英文版本。我们发现这些指令非常多样化，尽管规模仍然较小。我们遵循 Alpaca 的方法生成了 52K 条指令及其回复。所有数据均可在 data 目录中找到。\n\n注意：这是一个进行中的项目。我们仍在收集和整理我们的数据。我们尽早发布此数据集以加速我们的 LLM (大型语言模型) 
研究。我们也将在不久后发布一份白皮书。\n\n## [JosephusCheung\u002FGuanacoDataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJosephusCheung\u002FGuanacoDataset)\n\n- 数据生成模型：text-davinci-003\n- 成本：$6000\n\n52K 条指令数据是通过修改后的 self-instruct 流程生成的，包含人工编写的 429 个种子任务。\n\n## [斯坦福人类偏好数据集 (SHP)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)\n\n- [数据链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)\n\nSHP 是一个包含 38.5 万条关于 18 个不同主题领域（从烹饪到法律咨询）的问题\u002F指令回复的集体人类偏好数据集。这些偏好旨在反映一个回复相较于另一个回复的帮助程度，并 intended to be used for training RLHF（人类反馈强化学习）奖励模型和 NLG（自然语言生成）评估模型 (e.g., [SteamSHP](https:\u002F\u002Fhuggingface.co\u002Fstanfordnlp\u002FSteamSHP-flan-t5-xl))。\n\n每个示例都是一个 Reddit 帖子，包含一个问题\u002F指令以及该帖子的两个顶级评论，其中一条评论被 Reddit 用户（集体）更偏好。SHP 利用了这样一个事实：如果评论 A 是在评论 B 之后撰写的，但仍然拥有更高的得分，那么 A 显然比 B 更受偏好。如果 A 是在 B 之前撰写的，我们就不能得出这个结论，因为其较高的得分可能是由于可见性更高所致。我们选择的数据中，偏好标签旨在反映哪个回复更有用，而不是哪个危害更小，后者是许多过去工作的重点。\n\nSHP 与 [Anthropic 的 HH-RLHF 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf) 有何不同？最显著的是，SHP 中的所有数据都是自然发生且由人类编写的，而 HH-RLHF 中的回复是由机器编写的，这为我们提供了两种可以互补的不同分布。\n\n\n## [Hello-SimpleAI\u002FHC3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3)\n\n- 摘要：首个真人 -ChatGPT 对比语料库（英文版），名为 HC3 数据集\n- 数据生成模型：`gpt-3.5`, `human generated`\n- 论文：[How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.07597)\n- 成本：无\n\n## [Hello-SimpleAI\u002FHC3-Chinese](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHello-SimpleAI\u002FHC3-Chinese)\n\n- 摘要：首个真人 -ChatGPT 对比语料库（中文版），名为 HC3 数据集\n- 数据生成模型：`gpt-3.5`, `human generated`\n- 论文：[How Close is ChatGPT to Human Experts? 
Comparison Corpus, Evaluation, and Detection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.07597)\n- 成本：无\n\n\n## [allenai\u002Fprosocial-dialog](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fprosocial-dialog)\n\n- 摘要：ProsocialDialog 是首个大规模多轮英语对话数据集，旨在教导对话代理（Conversational Agents）根据社会规范对问题内容进行回应。\n- 数据生成模型：`gpt-3.5`, `human generated`\n- 论文：[ProsocialDialog: A Prosocial Backbone for Conversational Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.12688)\n- 成本：无\n\n## [allenai\u002Fnatural-instructions](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fnatural-instructions)\n\n- 摘要：一项社区努力，旨在创建大量 `1,616 个多样化的 NLP（自然语言处理）任务` 及其自然语言定义\u002F指令。\n- 数据生成模型：`Human generated`\n- 论文：[Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.07705)\n- 成本：无\n\n\n## [PhoebusSi\u002FAlpaca-CoT](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQingyiSi\u002FAlpaca-CoT)\n\n- 摘要：一个基于 LLaMA 和 Alpaca 的 Chain-of-Thoughts（思维链）推理数据集。注意：他们的仓库将持续收集各种指令微调（Instruction Tuning）数据集。[Github Repo](https:\u002F\u002Fgithub.com\u002FPhoebusSi\u002FAlpaca-CoT)\n- 论文：无\n- 成本：无\n\n## [nomic-ai\u002Fgpt4all](https:\u002F\u002Fgithub.com\u002Fnomic-ai\u002Fgpt4all)\n\n- 摘要：gpt4all 利用三个公开可用的数据集：1.[laion\u002FOIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flaion\u002FOIG), 2.[pacovaldez\u002Fstackoverflow-questions](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fpacovaldez\u002Fstackoverflow-questions) 3. 
[bigscience\u002Fbloomz-p3](https:\u002F\u002Fhuggingface.co\u002Fbigscience\u002Fbloomz-p3) 的子集\n- 数据生成模型：无\n- 论文：[GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo](https:\u002F\u002Fs3.amazonaws.com\u002Fstatic.nomic.ai\u002Fgpt4all\u002F2023_GPT4All_Technical_Report.pdf)\n- 成本：$500\n\n\n## [bigscience\u002FxP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FxP3)\n\n- 摘要：[提示词资源（Prompt-resource）] xP3（跨语言公共提示池）是一个涵盖 46 种语言和 16 个 NLP 任务的提示词 & 数据集集合。\n- 数据生成模型：无\n- 论文：[Crosslingual Generalization through Multitask Finetuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01786)\n- 成本：无\n\n\n\n## [teknium1\u002FGPTeacher](https:\u002F\u002Fgithub.com\u002Fteknium1\u002FGPTeacher)\n\n- 摘要：一组由 GPT-4 生成的模块化数据集集合，包括 General-Instruct、Roleplay-Instruct、Code-Instruct 和 Toolformer\n- 数据生成模型：`GPT-4`\n- 论文：无\n- 成本：无\n\n## [thunlp\u002FUltraChat](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FUltraChat)\n\n- 摘要：UltraChat 旨在构建一个开源、大规模、多轮的对话数据。UltraChat 的第一部分（即“关于世界的问题”板块）已发布，包含 28 万个多样且信息丰富的对话。更多关于写作和创作、现有材料协助的对话即将推出。\n- 数据生成模型：`GPT-3.5-turbo`\n- 论文：无\n- 成本：无\n\n## [cascip\u002FChatAlpaca](https:\u002F\u002Fgithub.com\u002Fcascip\u002FChatAlpaca)\n\n- 摘要：基于 Stanford Alpaca 数据，ChatAlpaca 将数据扩展到多轮指令及其相应的回复。更多数据（2 万条）及中文翻译版即将推出。\n- 数据生成模型：`GPT-3.5-turbo`\n- 论文：无\n- 成本：无\n- 相关：[(tatsu-lab\u002FAlpaca)|52K|EN|MT|SI](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n\n## [YeungNLP\u002Ffirefly-train-1.1M)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYeungNLP\u002Ffirefly-train-1.1M)\n- 摘要：结合人工编写指令模板的 23 个任务的中文数据集。 \n- 数据生成模型：无\n- 论文：无\n- 成本：无\n\n## [orhonovich\u002Funnatural-instructions](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n- 摘要：通过向语言模型提供三条指令种子示例并诱导出第四条，生成 6.4 万条示例。然后通过提示模型重写每条指令，将集合扩展至 24 万条。\n- 数据生成模型：`text-davinci-002`\n- 论文：[Unnatural Instructions: Tuning Language Models with (Almost) No Human 
Labor](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09689)\n- 成本：无\n\n## [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM)\n- 摘要：5.2 万条由 GPT-4 生成的指令跟随（Instruction-following）数据，使用原始 Alpaca 提示词及 ChatGPT 翻译成中文的 Alpaca 提示词 + 9 千条由 GPT-4 使用 Unnatural Instruction 中的提示词生成的指令跟随数据。\n- 数据生成模型：`GPT-4`\n- 论文：[Instruction Tuning with GPT-4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03277)\n- 成本：无\n- 相关： \n    -[(tatsu-lab\u002FAlpaca)|52K|EN|MT|SI](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n    -[(orhonovich\u002Funnatural-instructions)|240K|EN|MT|MIX](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n\n## [databrickslabs\u002Fdolly](https:\u002F\u002Fgithub.com\u002Fdatabrickslabs\u002Fdolly\u002Ftree\u002Fmaster\u002Fdata)\n- 简介：该数据集由数千名 Databricks 员工生成，涵盖了 InstructGPT 论文中概述的几种行为类别，包括头脑风暴、分类、封闭式问答、生成、信息提取、开放式问答和摘要。\n- 数据生成模型：N\u002FA\n- 论文：[Free Dolly](https:\u002F\u002Fwww.databricks.com\u002Fblog\u002F2023\u002F04\u002F12\u002Fdolly-first-open-commercially-viable-instruction-tuned-llm)\n- 成本：N\u002FA\n\n## [OpenAssistant\u002Foasst1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1)\n- 简介：OpenAssistant 对话（OASST1），一个由人类生成、人工标注的助手风格对话语料库，包含 161,443 条消息，分布在 66,497 个对话树中，涵盖 35 种不同语言，并标注了 461,292 个质量评分。\n- 数据生成模型：N\u002FA\n- 论文：[OpenAssistant Conversations - Democratizing Large Language Model Alignment](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F10iR5hKwFqAKhL3umx8muOWSRm7hs5FqX\u002Fview)\n- 成本：N\u002FA\n\n## BELLE\u002Fdata\u002F1.5M\n\n- 下载地址：[https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Ftree\u002Fmain\u002Fdata\u002F1.5M](https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Ftree\u002Fmain\u002Fdata\u002F1.5M)\n- 数据量：1.5M\n- 生成方式：self-instruct，使用了中文种子任务，以及 openai 的 text-davinci-003 接口\n- 涉及任务：包含 175 
个种子任务，[https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Fblob\u002Fmain\u002Fdata\u002F1.5M\u002Fzh_seed_tasks.json](https:\u002F\u002Fgithub.com\u002FLianjiaTech\u002FBELLE\u002Fblob\u002Fmain\u002Fdata\u002F1.5M\u002Fzh_seed_tasks.json)\n- 数据示例：[https:\u002F\u002Fhuggingface.co\u002Fdatasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBelleGroup\u002Ftrain_0.5M_CN)\n\n## alpaca_chinese_dataset\n\n- 下载地址：[https:\u002F\u002Fgithub.com\u002Fhikariming\u002Falpaca_chinese_dataset](https:\u002F\u002Fgithub.com\u002Fhikariming\u002Falpaca_chinese_dataset)\n- 数据量：52k\n- 生成方式：借助 chatgpt 对原始的 [stanford_alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca) 做机器翻译，并加入人工校验来保证质量\n- 涉及任务：与原始的 [stanford_alpaca](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca) 一致，可以在原项目的 [seed_task.json](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002Fseed_tasks.jsonl) 中查到全部任务\n\n## Med-ChatGLM\u002Fdata\n\n- 下载地址：[https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FMed-ChatGLM](https:\u002F\u002Fgithub.com\u002FSCIR-HI\u002FMed-ChatGLM)\n- 数据量：7k\n- 生成方式：利用 GPT3.5 接口围绕医学知识库构建问答数据，并设置了多种 Prompt 形式来充分利用知识\n- 涉及任务：医学领域相关的问答，包含并发症，高危因素，组织学检查，临床症状，药物治疗，辅助治疗\n\n## pCLUE\n\n- 下载地址：[https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE](https:\u002F\u002Fgithub.com\u002FCLUEbenchmark\u002FpCLUE)\n- 数据量：1.2M\n- 生成方式：通过原有的 NLP 任务数据集，结合特定的 [prompt](https:\u002F\u002Fwww.zhihu.com\u002Fsearch?q=prompt&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={\"sourceType\"%3A\"article\"%2C\"sourceId\"%3A\"624084039\"}) 模板生成\n- 涉及任务：包含 9 个 NLP 数据集，涉及的 NLP 任务有 
[文本分类](https:\u002F\u002Fwww.zhihu.com\u002Fsearch?q=文本分类&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={\"sourceType\"%3A\"article\"%2C\"sourceId\"%3A\"624084039\"})\u002F自然语言推理\u002F语义匹配\u002F[指代消解](https:\u002F\u002Fwww.zhihu.com\u002Fsearch?q=指代消解&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={\"sourceType\"%3A\"article\"%2C\"sourceId\"%3A\"624084039\"})\u002F关键词识别\u002F阅读理解\n\n## COIG\n\n- 下载地址：[https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FCOIG)\n\n- 数据量： \n\n- - Translated Instructions (67,798)\n  - Exam Instructions (63,532)\n  - Human Value Alignment Instructions (34,471)\n  - Counterfactural Correction Multi-round Chat (13,653)\n  - Leetcode Instructions (11,737)\n\n- 生成方式：融合了多个领域的数据，具体可以参考论文 [Chinese Open Instruction Generalist: A Preliminary Release](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.07987)\n\nhttps:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FInstructionZoo\n\nhttps:\u002F\u002Fgithub.com\u002Flightaime\u002Fcamel\n\n# RLHF 数据集列表 (Reinforcement Learning from Human Feedback)\n\n## [Anthropic\u002Fhh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf)\n\n- 简介：此 RLHF 数据集是一个迭代的“在线”数据集，包含来自 52B 语言模型的数据。它包含 22k 有用性比较和无红队测试数据。\n- 数据生成模型：`Anthropic RL-CAI 52B`\n- 论文：[Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.05862)\n- 成本：N\u002FA\n\n## [HuggingFaceH4\u002Fstack-exchange-preferences](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002Fstack-exchange-preferences)\n\n- 简介：此数据集包含来自 Stack Overflow 数据转储的问题和答案，用于偏好模型训练。\n- 数据生成模型：N\u002FA\n- 论文：[A General Language Assistant as a Laboratory for Alignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.00861)\n- 成本：N\u002FA\n\n## 
[stanfordnlp\u002FSHP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fstanfordnlp\u002FSHP)\n\n- 简介：每个示例都是一个带有问题\u002F指令的 Reddit 帖子及其一对顶级评论，其中一条评论更受 Reddit 用户（集体）青睐。\n- 数据生成模型：N\u002FA\n- 论文：N\u002FA\n- 成本：N\u002FA\n\n## [Instruction-Tuning-with-GPT-4\u002FGPT-4-LLM](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM)\n\n- 简介：排名响应（注意：数据由 `GPT-4` 模型评估而非人类）的 Alpaca 提示词来自三个模型（GPT-4, GPT-3.5 和 OPT-IML），通过要求 GPT-4 评估质量。作者认为\"GPT-4 能够识别并修复自己的错误，并能准确判断响应的质量”。\n- 数据生成模型：`GPT-4`\n- 论文：[Instruction Tuning with GPT-4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03277)\n- 成本：N\u002FA\n- 相关： \n    -[(tatsu-lab\u002FAlpaca)|52K|EN|MT|SI](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca)\n\n\n## Natural Instruction \u002F Super-Natural Instruction\n\n- [论文\u002F项目链接](https:\u002F\u002Faclanthology.org\u002F2022.acl-long.244.pdf)\n- [数据集链接](https:\u002F\u002Finstructions.apps.allenai.org\u002F)\n\nAllen AI 是第一个尝试将指令作为提示词并微调大语言模型（LLM）的组织。在 Natural Instruction 论文中，你可以基本理解指令的标注思路。\n\n在其提出的数据集中，包含了 61 种不同的 NLP 任务。\n\nSuper-Natural Instruction 是 Natural Instruction 的超密集版本，包含超过 1,600 种不同的 NLP 任务，并且有超过 76 种不同类型的 NLP 任务（例如：分类、提取、序列标注）。\n\n## [BigScience\u002FP3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FP3)\n\n- [论文\u002F项目链接](https:\u002F\u002Fgithub.com\u002Fbigscience-workshop\u002Fpromptsource)\n- [数据集链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbigscience\u002FP3)\n\nBigScience 由 Hugging Face 和法国国家科学研究中心 (CNRS)、IDRIS、GENCI 等联合组织。它是最大的开源大语言模型组织之一。\n\nBigScience 于 2021 年底开发了 PromptSource 项目，并开源了一系列工具包以帮助研究人员基于现有的 NLP 任务构建提示词。到目前为止，PromptSource 项目包含针对 270 个 NLP 任务的 2000 多个提示模板。\n\n在此基础上，BigScience 构建了 P3 数据集。你可以在 Hugging Face Hub 上找到 P3 数据，P3 的数据规模在 100M-1B 之间。\n\n## xMTF - BigScience\n\n- [项目链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01786.pdf)\n- [数据集链接](https:\u002F\u002Fgithub.com\u002Fbigscience-workshop\u002Fxmtf)\n\n基于英文提示词（Prompt），BigScience 将其提示词扩展到了多种非英语语言。\n\n该项目包含 13 
个自然语言处理（NLP）任务，并支持 46 种不同的语言。对应的提示词包含不定数量的语言。\n\n在基于多语言的基础上进行微调（fine-tuning）后，BLOOM 和 T0 都实现了理想的跨语言能力。\n\n## HH-RLHF - Anthropic\n\n- [论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.05862.pdf)\n- [数据集链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf)\n\nAnthropic 旗下的 Claude 是 ChatGPT 的主要竞争对手之一。\n\nAnthropic 开源了其在自家产品线中使用的 RLHF（基于人类反馈的强化学习）数据集。\n\nHH-RLHF 项目的初衷是训练有益且无害（Helpful and Harmless, HH）的大语言模型（LLM）。因此，除了项目回复的质量外，是否包含有害信息也体现在其人类反馈中。\n\n该论文记录了如何利用 RLHF 数据的行为来使模型与人类价值观对齐，并记录了数据集的构建方法和标准。\n\n## [Unnatural Instruction](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n\n- [论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09689.pdf)\n- [数据集链接](https:\u002F\u002Fgithub.com\u002Forhonovich\u002Funnatural-instructions)\n\n使用 LLM 独立生成指令数据是指令微调（instruction-tuning）领域的活跃方向。\n\nUnnatural Instruction 使用 GPT3 (text-davinci-002) 生成 64k 指令提示词数据。并使用同一模型重写这 64k 提示词，最终获得 240k 指令数据。\n\n论文表明，Instruct-Tuning 中由 LLM 生成的提示词表现良好，甚至超越了在 P3 等数据上微调的 T0 等模型。\n\n## [Self-Instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct)\n\n- [论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10560.pdf)\n- [数据集链接](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct)\n\nSelf-Instruct 也是利用 LLM 为指令微调生成提示词的想法。不过，它使用了更细粒度的生成过程。\n\n引入了任务池（Task pool）和质量过滤（Quality filtering）等概念，以部分缓解自指类型数据的噪声问题。\n\n## [UnifiedSKG - HKU](https:\u002F\u002Funifiedskg.com\u002F)\n\n- [论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.05966.pdf)\n\n- [数据集链接](https:\u002F\u002Funifiedskg.com\u002F)\n\nUnifiedSKG 在文本到文本（Text-to-Text）框架中增加了知识定位（knowledge grounding），即在提示词 - 输出（prompt-output）框架中，增加了结构化数据作为辅助。\n\n例如，一些 NLP 任务严重依赖结构化知识库\u002F数据库。UnifiedSKG 的思路是将所需的数据库序列化并嵌入到提示词中。如下图所示。\n\nUnifiedSKG 代表了 LLM 领域的一个方向，试图利用结构化知识来提升性能。\n\n## [Google\u002FFlan Collection](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2)\n\n- 
[论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13688.pdf)\n- [数据集链接](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002FFLAN\u002Ftree\u002Fmain\u002Fflan\u002Fv2)\n\n在此项目中，Google 将其自身的 Flan 2021 数据与一些开源指令数据（P3, super-natural instruction 等）合并。\n\n在 Flan Collection 的论文中，Google 还总结了 Flan 系列模型训练\u002F推理的一些关键点，可能具有良好的参考价值。\n\nFlan Collection 将来自 Flan 2021、P3、Super-Natural Instructions 以及数十个其他数据集编译到一个地方，并将它们格式化为零样本（zero-shot）、少样本（few-shot）和思维链（chain-of-thought）模板的混合体。\n\n- \n## InstructDial\n\n- [论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.12673.pdf)\n- [数据集链接](https:\u002F\u002Fgithub.com\u002Fprakharguptaz\u002FInstructdial\u002Ftree\u002Fmain\u002Fdatasets)\n\nInstructDial 尝试在特定任务类型上进行指令微调。实验结果表明，在对话指令数据上微调后，模型在对话任务上的表现优于非常大的任务集。\n\n\n## ChatGPT Distillation Data\n公共用户共享对话（ShareGPT）：使用公共 API 收集了 ShareGPT 上用户分享的约 60K 个对话。为了保持数据质量，我们在用户查询级别进行了去重，并移除了任何非英语对话。这留下了大约 30K 个示例。\n\n人类 ChatGPT 对比语料库（HC3）：我们使用了 [HC3 英文数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.07597) 中的人类和 ChatGPT 回答，其中包含约 60K 个人类回答和 27K 个 ChatGPT 回答，针对约 24K 个问题，总共约有 87K 个问题 - 回答示例。\n\n\n## Open Instruction Generalist (OIG). \n- [论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.03300)\n- [数据集链接](https:\u002F\u002Flaion.ai\u002Fblog\u002Foig-dataset\u002F)\n\n我们使用了 LAION 整理的开放 [指令通用数据集](https:\u002F\u002Flaion.ai\u002Fblog\u002Foig-dataset\u002F) 的手动选择组件子集。具体来说，我们使用了小学数学指令、诗歌转歌曲、以及剧情剧本书籍对话数据集。这总共产生了约 30k 个示例。\n\n\n## OpenAI WebGPT. \n- [论文\u002F项目链接](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.03300)\n- [数据集链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fwebgpt_comparisons)\n\n在 [WebGPT 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332) 中，作者从人类反馈中训练了一个奖励模型。他们使用该奖励模型训练了一个长文问答模型，以符合人类偏好。这是 WebGPT 项目结束时标记为适合奖励建模的所有比较数据集。总共有 19,578 个比较。\n\n数据集中的每个示例包含一个问题的两个模型答案及其关联的元数据。每个答案都有人类的偏好评分，可用于确定哪个答案更好。 \n\n## OpenAI Summarization. 
\n- [Paper\u002FProject link](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.01325)\n- [Dataset link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenai\u002Fsummarize_from_feedback)\n\nThe OpenAI summarization dataset contains about 93K examples, each consisting of human feedback on model-generated summaries: human evaluators chose the better of two candidate summaries.\n\n# Datasets without license information\n\n## [alespalla\u002Fchatbot_instruction_prompts](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falespalla\u002Fchatbot_instruction_prompts)\n\n- Summary: a compilation of `tatsu-lab\u002Falpaca`, `Dahoas\u002Finstruct-human-assistant-prompt`, and `allenai\u002Fprosocial-dialog`\n- Generating model: N\u002FA\n- Paper: N\u002FA\n- Cost: N\u002FA\n\n# Contributing\n\nOur goal is to make this repository even better. If you are interested in contributing, please see here for contribution instructions.\n\n# License\n\n`Awesome-Prompt-Dataset` is released under the Apache 2.0 license.\n\n## References\n\n- https:\u002F\u002Fgithub.com\u002FZjh-819\u002FLLMDataHub\n- https:\u002F\u002Fgithub.com\u002Fraunak-agarwal\u002Finstruction-datasets\n- https:\u002F\u002Fgithub.com\u002Fzhilizju\u002FAwesome-instruction-tuning\n- https:\u002F\u002Fgithub.com\u002FRenzeLou\u002Fawesome-instruction-learning\n- https:\u002F\u002Fgithub.com\u002Fneuml\u002Ftxtinstruct","# Awesome Instruction Datasets Quick-Start Guide\n\n## Introduction\n`awesome-instruction-datasets` is a curated collection of high-quality open-source instruction-tuning datasets, built to help researchers and developers quickly find and use the data resources behind dialogue LLMs such as ChatGPT, LLaMA, and Alpaca. This guide walks through obtaining the resource list and loading datasets from it.\n\n## Prerequisites\nBefore starting, make sure your development environment meets the following requirements:\n- **Operating system**: Linux \u002F Windows \u002F macOS\n- **Python version**: >= 3.8\n- **Core dependencies**:\n  ```bash\n  pip install transformers datasets torch accelerate\n  ```\n- **Git**: for cloning the repository\n\n## Installation\nSince this project is a dataset index, the main steps are cloning the repository and configuring a data download source.\n\n### 1. Clone the repository\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FyaodongC\u002Fawesome-instruction-dataset.git\ncd awesome-instruction-dataset\n```\n\n### 2. Configure a mirror for faster downloads (recommended)\nTo speed up HuggingFace dataset downloads in mainland China, set the endpoint environment variable to a mirror before running:\n```bash\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n# Or use ModelScope\n# export HF_ENDPOINT=https:\u002F\u002Fmodelscope.cn\u002Fapi\u002Fv1\u002Fstaging\u002Fhub\n```\n\n### 3. 
Verify the environment\nMake sure the `datasets` library is installed correctly and can reach the network:\n```python\nimport datasets\nprint(datasets.__version__)\n```\n\n## Basic usage\n\n### 1. Browse the dataset list\nOpen the `README.md` in the project root and check the category tags:\n- **Lang Tags**: EN (English), CN (Chinese), ML (multilingual)\n- **Task Tags**: MT (multi-task), TS (task-specific)\n- **Gen Tags**: HG (human-generated), SI (self-instruct), MIX (mixed)\n\nFor example, the tables list concrete datasets such as `BelleGroup\u002Ftrain_1M_CN` and `yahma\u002Falpaca-cleaned`.\n\n### 2. Load and use a dataset\nAny dataset recommended in the list can be loaded directly with the HuggingFace `datasets` library. The following example loads the cleaned Alpaca dataset:\n\n```python\nfrom datasets import load_dataset\n\n# Load an example dataset\ndataset = load_dataset(\"yahma\u002Falpaca-cleaned\")\n\n# Inspect the data format\nprint(dataset['train'][0])\n\n# Sample a small subset for quick experiments\ndata = dataset['train'].shuffle().select(range(100))\n```\n\n### 3. Use with a training framework\nOnce loaded, the data can be passed to LoRA, LLaMA-Factory, or another fine-tuning framework for model training. For specific parameters, consult each dataset's original paper or official documentation.","A startup's technical team plans to fine-tune a legal-domain question-answering bot on LLaMA and urgently needs to build a high-quality instruction-tuning dataset.\n\n### Without awesome-instruction-datasets\n- They must manually search multiple code-hosting platforms for scattered datasets; information is fragmented and gathering it is inefficient.\n- Faced with a flood of open-source projects, it is hard to quickly filter high-quality instruction data suited to Chinese-language scenarios.\n- The open-source licenses of individual datasets are unclear, so using them directly may pose legal compliance risks.\n- Large amounts of time go into data cleaning and format conversion, crowding out model training and tuning.\n\n### With awesome-instruction-datasets\n- awesome-instruction-datasets aggregates official links to well-known datasets such as Alpaca and BELLE, so no secondary searching is needed.\n- Language tags (CN\u002FEN) and type categories (Prompt\u002FRLHF) make it quick to pin down Chinese instruction sets suitable for the legal scenario.\n- Licensed and unlicensed data sources are clearly separated, keeping later commercial use safe.\n- Standardized data formats greatly reduce preprocessing work, letting the team validate model results faster.\n\nBy consolidating quality resources, awesome-instruction-datasets significantly lowers the data barrier and development cost of LLM fine-tuning.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fjianzhnie_awesome-instruction-datasets_03855af3.png","jianzhnie","Robin","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fjianzhnie_4cd098bb.png","Machine learning, Reinforcement Learning, Transformers","LLMTech","Shen zhen","jianzhnie@gmail.com",null,"https:\u002F\u002Fjianzhnie.github.io\u002Fllmtech\u002F","https:\u002F\u002Fgithub.com\u002Fjianzhnie",728,41,"2026-04-04T16:54:08","Apache-2.0",1,"Not specified",{"notes":92,"python":90,"dependencies":93},"This repository is a resource aggregation list (Awesome List) of instruction-tuning datasets; it contains no executable code or install scripts itself, so it has no particular runtime requirements. Users need to set up a suitable deep-learning environment themselves, based on the datasets they actually select and the downstream model-training task (e.g., LLaMA, ChatGLM
, etc.).",[90],[51,26,13],[96,97,98,99,100,101,102],"chatgpt","datasets","llm","prompts","instruction","llama","self-instruct","2026-03-27T02:49:30.150509","2026-04-06T08:46:19.234638",[],[]]