[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-argilla-io--distilabel":3,"tool-argilla-io--distilabel":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",144730,2,"2026-04-07T23:26:32",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":32,"env_os":102,"env_gpu":103,"env_ram":104,"env_deps":105,"category_tags":118,"github_topics":119,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":129,"updated_at":130,"faqs":131,"releases":160},5365,"argilla-io\u002Fdistilabel","distilabel","Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.","distilabel 是一个专为工程师打造的开源框架，旨在高效构建基于验证研究的合成数据生成与 AI 反馈流水线。在大型语言模型开发中，高质量数据的匮乏往往导致算力浪费和输出效果不佳，distilabel 正是为了解决这一核心痛点而生。它帮助用户通过程序化方式快速合成多样化数据集，并利用 AI 
对数据进行自动评估与筛选，从而从源头提升数据质量，让开发者能将宝贵时间集中在模型优化而非繁琐的数据准备上。\n\n该工具特别适合从事自然语言处理（NLP）和大模型应用的开发者及研究人员，无论是传统的文本分类任务，还是复杂的指令遵循、对话生成场景，都能轻松应对。distilabel 的独特亮点在于其高度的灵活性与兼容性：它通过统一的 API 集成市面上任意大模型提供商的 AI 反馈能力，让用户真正掌握微调自有模型所需的数据所有权。同时，框架内置了对最新研究成果的支持，确保流水线具备可扩展性和容错性。尽管原核心团队已转向新项目，但活跃的社区成员正积极维护并推动其迭代，使其成为当前追求高效、可靠数据工程流程的理想选择。","> [!IMPORTANT]  \nThe original authors have moved on to other projects. A group of community members have recently joined the GitHub project as collaborators to maintain the project and are actively working towards the next release. Check out the `develop` branch for access to the latest fixes and improvements in the meantime. \n> \n\u003Cdiv align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fblob\u002Fmain\u002Fdocs\u002Fassets\u002Fdistilabel-white.png?raw=true\">\n    \u003Cimg alt=\"Distilabel Logo\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_bbabba678acb.png\">\n  \u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\n\u003Ch3 align=\"center\">Synthesize data for AI and add feedback on the fly!\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\n  \u003Ca  href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fdistilabel\u002F\">\n    \u003Cimg alt=\"CI\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fdistilabel.svg?style=flat-round&logo=pypi&logoColor=white\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Fdistilabel\">\n    \u003Cimg alt=\"CI\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_6073406518a4.png\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fargilla_io\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftwitter-black?logo=x\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Fargilla-io\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flinkedin-blue?logo=linkedin\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"http:\u002F\u002Fhf.co\u002Fjoin\u002Fdiscord\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-7289DA?&logo=discord&logoColor=white\"\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\nDistilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.\n\nIf you just want to get started, we recommend you check the [documentation](http:\u002F\u002Fdistilabel.argilla.io\u002F). Curious, and want to know more? Keep reading!\n\u003C!-- ![overview](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_de6eb6cbfa3a.png)  -->\n\n## Why use distilabel?\n\nDistilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.\n\n### Improve your AI output quality through data quality\n\nCompute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. 
Distilabel helps you to synthesize and judge data to let you spend your valuable time **achieving and keeping high-quality standards for your data**.\n\n### Take control of your data and models\n\n**Ownership of data for fine-tuning your own LLMs** is not easy but Distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.\n\n### Improve efficiency by quickly iterating on the right research and LLMs\n\nSynthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.\n\n## Community\n\nWe are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:\n\n- [Community Meetup](https:\u002F\u002Flu.ma\u002Fembed-checkout\u002Fevt-IQtRiSuXZCIW6FB): listen in or present during one of our bi-weekly events.\n\n- [Discord](http:\u002F\u002Fhf.co\u002Fjoin\u002Fdiscord): get direct support from the community in #argilla-general and #argilla-help.\n\n- [Roadmap](https:\u002F\u002Fgithub.com\u002Forgs\u002Fargilla-io\u002Fprojects\u002F10\u002Fviews\u002F1): plans change but we love to discuss those with our community so feel encouraged to participate.\n\n## What do people build with Distilabel?\n\nThe Argilla community uses distilabel to create amazing [datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets?other=distilabel) and [models](https:\u002F\u002Fhuggingface.co\u002Fmodels?other=distilabel).\n\n- The [1M OpenHermesPreference](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fargilla\u002FOpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium\u002FOpenHermes-2.5. 
It shows how we can use Distilabel to **synthesize data on an immense scale**.\n- Our [distilabeled Intel Orca DPO dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fargilla\u002Fdistilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https:\u002F\u002Fhuggingface.co\u002Fargilla\u002Fdistilabeled-OpenHermes-2.5-Mistral-7B), show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.\n- The [haiku DPO data](https:\u002F\u002Fgithub.com\u002Fdavanstrien\u002Fhaiku-dpo) outlines how anyone can create a **dataset for a specific task** and **the latest research papers** to improve the quality of the dataset.\n\n## Installation\n\n```sh\npip install distilabel --upgrade\n```\n\nRequires Python 3.9+\n\nIn addition, the following extras are available:\n\n### LLMs\n\n- `anthropic`: for using models available in [Anthropic API](https:\u002F\u002Fwww.anthropic.com\u002Fapi) via the `AnthropicLLM` integration.\n- `cohere`: for using models available in [Cohere](https:\u002F\u002Fcohere.ai\u002F) via the `CohereLLM` integration.\n- `argilla`: for exporting the generated datasets to [Argilla](https:\u002F\u002Fargilla.io\u002F).\n- `groq`: for using models available in [Groq](https:\u002F\u002Fgroq.com\u002F) using [`groq`](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fgroq-python) Python client via the `GroqLLM` integration.\n- `hf-inference-endpoints`: for using the [Hugging Face Inference Endpoints](https:\u002F\u002Fhuggingface.co\u002Finference-endpoints) via the `InferenceEndpointsLLM` integration.\n- `hf-transformers`: for using models available in [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) package via the `TransformersLLM` integration.\n- `litellm`: for using [`LiteLLM`](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) to call any LLM using OpenAI format via the `LiteLLM` integration.\n- `llama-cpp`: for using 
[llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.\n- `mistralai`: for using models available in [Mistral AI API](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fla-plateforme\u002F) via the `MistralAILLM` integration.\n- `ollama`: for using [Ollama](https:\u002F\u002Follama.com\u002F) and their available models via `OllamaLLM` integration.\n- `openai`: for using [OpenAI API](https:\u002F\u002Fopenai.com\u002Fblog\u002Fopenai-api) models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.\n- `vertexai`: for using [Google Vertex AI](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai) proprietary models via the `VertexAILLM` integration.\n- `vllm`: for using [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) serving engine via the `vLLM` integration.\n- `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers).\n- `mlx`: for using [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx) models via the `MlxLLM` integration.\n\n### Structured generation\n\n- `outlines`: for using structured generation of LLMs with [outlines](https:\u002F\u002Fgithub.com\u002Foutlines-dev\u002Foutlines).\n- `instructor`: for using structured generation of LLMs with [Instructor](https:\u002F\u002Fgithub.com\u002Fjxnl\u002Finstructor\u002F).\n\n### Data processing\n\n- `ray`: for scaling and distributing a pipeline with [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray).\n- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using [faiss](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss).\n- `text-clustering`: for using text clustering with [UMAP](https:\u002F\u002Fgithub.com\u002Flmcinnes\u002Fumap) and 
[Scikit-learn](https:\u002F\u002Fgithub.com\u002Fscikit-learn\u002Fscikit-learn).\n- `minhash`: for using minhash for duplicate detection with [datasketch](https:\u002F\u002Fgithub.com\u002Fdatasketch\u002Fdatasketch) and [nltk](https:\u002F\u002Fgithub.com\u002Fnltk\u002Fnltk).\n\n### Example\n\nTo run the following example you must install `distilabel` with the `hf-inference-endpoints` extra:\n\n```sh\npip install \"distilabel[hf-inference-endpoints]\" --upgrade\n```\n\nThen run:\n\n```python\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline() as pipeline:\n    TextGeneration(\n        llm=InferenceEndpointsLLM(\n            model_id=\"meta-llama\u002FMeta-Llama-3.1-8B-Instruct\",\n            generation_kwargs={\"temperature\": 0.7, \"max_new_tokens\": 512},\n        ),\n    )\n\nif __name__ == \"__main__\":\n    dataset = load_dataset(\"distilabel-internal-testing\u002Finstructions\", split=\"test\")\n    distiset = pipeline.run(dataset=dataset)\n    distiset.push_to_hub(repo_id=\"distilabel-example\")\n```\n\n## Badges\n\nIf you build something cool with `distilabel` consider adding one of these badges to your dataset or model card.\n\n    [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_25ce9e17a758.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"\u002F>](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)\n\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_25ce9e17a758.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"\u002F>](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)\n\n    [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_8c6e162fcafd.png\" alt=\"Built with Distilabel\" width=\"200\" 
height=\"32\"\u002F>](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)\n\n## Contribute\n\nTo directly contribute with `distilabel`, check our [good first issues](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues\u002Fnew\u002Fchoose).\n\n## Citation\n\n```bibtex\n@misc{distilabel-argilla-2024,\n  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},\n  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel}}\n}\n```\n","> [!IMPORTANT]  \n原作者已转向其他项目。近期，一群社区成员加入了 GitHub 项目，成为协作者以维护该项目，并正在积极筹备下一个版本的发布。在此期间，请查看 `develop` 分支，以获取最新的修复和改进。\n> \n\u003Cdiv align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fblob\u002Fmain\u002Fdocs\u002Fassets\u002Fdistilabel-white.png?raw=true\">\n    \u003Cimg alt=\"Distilabel Logo\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_bbabba678acb.png\">\n  \u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\n\u003Ch3 align=\"center\">为 AI 合成数据，并即时添加反馈！\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\n  \u003Ca  href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fdistilabel\u002F\">\n    \u003Cimg alt=\"CI\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fdistilabel.svg?style=flat-round&logo=pypi&logoColor=white\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Fdistilabel\">\n    \u003Cimg alt=\"CI\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_6073406518a4.png\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fargilla_io\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftwitter-black?logo=x\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Fargilla-io\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flinkedin-blue?logo=linkedin\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"http:\u002F\u002Fhf.co\u002Fjoin\u002Fdiscord\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-7289DA?&logo=discord&logoColor=white\"\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\nDistilabel 是一个面向工程师的合成数据与 AI 反馈框架，旨在提供基于可靠研究论文、快速、可靠且可扩展的数据流水线。\n\n如果您只想快速上手，建议您查阅[文档](http:\u002F\u002Fdistilabel.argilla.io\u002F)。如果您感到好奇并想了解更多？请继续阅读！\n\u003C!-- ![overview](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_de6eb6cbfa3a.png)  -->\n\n## 为什么使用 Distilabel？\n\nDistilabel 可用于生成各种项目的合成数据和 AI 反馈，包括传统的预测性 NLP（分类、抽取等）以及生成式和大型语言模型场景（指令遵循、对话生成、评价等）。Distilabel 的程序化方法使您能够构建可扩展的数据生成和 AI 反馈流水线。Distilabel 的目标是通过快速生成高质量、多样化的数据集，并结合经过验证的 AI 反馈生成与评估方法，来加速您的 AI 开发进程。\n\n### 通过数据质量提升 AI 输出质量\n\n计算资源昂贵，而输出质量至关重要。我们帮助您**专注于数据质量**，从而一次性解决这两个问题的根本原因。Distilabel 可帮助您合成和评估数据，让您将宝贵的时间投入到**实现并保持高质量的数据标准**上。\n\n### 掌控您的数据与模型\n\n**拥有用于微调自有 LLM 的数据**并非易事，但 Distilabel 可以助您入门。我们通过统一的 API 集成了来自任何 LLM 提供商的**AI 反馈**。\n\n### 通过快速迭代正确的研究与 LLM 提升效率\n\n利用**最新研究论文**合成并评估数据，同时确保**灵活性、可扩展性和容错性**。这样，您就可以专注于优化数据并训练模型。\n\n## 社区\n\n我们是一个由社区驱动的开源项目，非常欢迎您的参与。以下是几种参与方式：\n\n- 
[社区聚会](https:\u002F\u002Flu.ma\u002Fembed-checkout\u002Fevt-IQtRiSuXZCIW6FB)：在我们的双周活动中聆听或发表演讲。\n\n- [Discord](http:\u002F\u002Fhf.co\u002Fjoin\u002Fdiscord)：在 #argilla-general 和 #argilla-help 频道中获得社区的直接支持。\n\n- [路线图](https:\u002F\u002Fgithub.com\u002Forgs\u002Fargilla-io\u002Fprojects\u002F10\u002Fviews\u002F1)：计划可能会变化，但我们乐于与社区讨论这些计划，欢迎您积极参与。\n\n## 人们用 Distilabel 构建了什么？\n\nArgilla 社区使用 Distilabel 创建了许多精彩的[数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets?other=distilabel)和[模型](https:\u002F\u002Fhuggingface.co\u002Fmodels?other=distilabel)。\n\n- [1M OpenHermesPreference](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fargilla\u002FOpenHermesPreferences) 是一个约 100 万条 AI 偏好数据的数据集，源自 teknium\u002FOpenHermes-2.5。它展示了我们如何使用 Distilabel 在**大规模范围内合成数据**。\n- 我们的[distilabeled Intel Orca DPO 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fargilla\u002Fdistilabel-intel-orca-dpo-pairs)以及[改进后的 OpenHermes 模型](https:\u002F\u002Fhuggingface.co\u002Fargilla\u002Fdistilabeled-OpenHermes-2.5-Mistral-7B)，表明我们如何通过**AI 反馈过滤掉原始数据集中的 50%**，从而**提升模型性能**。\n- [haiku DPO 数据](https:\u002F\u002Fgithub.com\u002Fdavanstrien\u002Fhaiku-dpo) 描述了任何人都可以如何创建针对**特定任务**的**数据集**，并结合**最新研究论文**来提高数据质量。\n\n## 安装\n\n```sh\npip install distilabel --upgrade\n```\n\n需要 Python 3.9 或更高版本。\n\n此外，还提供了以下附加选项：\n\n### 大语言模型\n\n- `anthropic`: 用于通过 `AnthropicLLM` 集成调用 [Anthropic API](https:\u002F\u002Fwww.anthropic.com\u002Fapi) 中可用的模型。\n- `cohere`: 用于通过 `CohereLLM` 集成调用 [Cohere](https:\u002F\u002Fcohere.ai\u002F) 中可用的模型。\n- `argilla`: 用于将生成的数据集导出到 [Argilla](https:\u002F\u002Fargilla.io\u002F)。\n- `groq`: 用于通过 `GroqLLM` 集成，使用 [`groq`](https:\u002F\u002Fgithub.com\u002Fgroq\u002Fgroq-python) Python 客户端调用 [Groq](https:\u002F\u002Fgroq.com\u002F) 中可用的模型。\n- `hf-inference-endpoints`: 用于通过 `InferenceEndpointsLLM` 集成调用 [Hugging Face 推理端点](https:\u002F\u002Fhuggingface.co\u002Finference-endpoints)。\n- `hf-transformers`: 用于通过 `TransformersLLM` 集成调用 
[transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 包中可用的模型。\n- `litellm`: 用于通过 `LiteLLM` 集成，使用 [`LiteLLM`](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) 按照 OpenAI 格式调用任何大语言模型。\n- `llama-cpp`: 用于通过 `LlamaCppLLM` 集成，使用 [llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python) Python 绑定调用 `llama.cpp`。\n- `mistralai`: 用于通过 `MistralAILLM` 集成调用 [Mistral AI API](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fla-plateforme\u002F) 中可用的模型。\n- `ollama`: 用于通过 `OllamaLLM` 集成调用 [Ollama](https:\u002F\u002Follama.com\u002F) 及其提供的模型。\n- `openai`: 用于通过 `OpenAILLM` 集成调用 [OpenAI API](https:\u002F\u002Fopenai.com\u002Fblog\u002Fopenai-api) 中的模型；其他基于 OpenAI 的集成，如 `AnyscaleLLM`、`AzureOpenAILLM` 和 `TogetherLLM`，也依赖于 OpenAI 客户端。\n- `vertexai`: 用于通过 `VertexAILLM` 集成调用 [Google Vertex AI](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai) 的专有模型。\n- `vllm`: 用于通过 `vLLM` 集成调用 [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) 服务引擎。\n- `sentence-transformers`: 用于使用 [sentence-transformers](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers) 生成句子嵌入。\n- `mlx`: 用于通过 `MlxLLM` 集成调用 [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx) 模型。\n\n### 结构化生成\n\n- `outlines`: 用于通过 [outlines](https:\u002F\u002Fgithub.com\u002Foutlines-dev\u002Foutlines) 实现大语言模型的结构化生成。\n- `instructor`: 用于通过 [Instructor](https:\u002F\u002Fgithub.com\u002Fjxnl\u002Finstructor\u002F) 实现大语言模型的结构化生成。\n\n### 数据处理\n\n- `ray`: 用于通过 [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray) 扩展和分布式处理流水线。\n- `faiss-cpu` 和 `faiss-gpu`: 用于使用 [faiss](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss) 生成句子嵌入。\n- `text-clustering`: 用于结合 [UMAP](https:\u002F\u002Fgithub.com\u002Flmcinnes\u002Fumap) 和 [Scikit-learn](https:\u002F\u002Fgithub.com\u002Fscikit-learn\u002Fscikit-learn) 进行文本聚类。\n- `minhash`: 用于结合 [datasketch](https:\u002F\u002Fgithub.com\u002Fdatasketch\u002Fdatasketch) 和 
[nltk](https:\u002F\u002Fgithub.com\u002Fnltk\u002Fnltk) 使用 minhash 进行重复检测。\n\n### 示例\n\n要运行以下示例，您需要安装带有 `hf-inference-endpoints` 附加组件的 `distilabel`：\n\n```sh\npip install \"distilabel[hf-inference-endpoints]\" --upgrade\n```\n\n然后运行：\n\n```python\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline() as pipeline:\n    TextGeneration(\n        llm=InferenceEndpointsLLM(\n            model_id=\"meta-llama\u002FMeta-Llama-3.1-8B-Instruct\",\n            generation_kwargs={\"temperature\": 0.7, \"max_new_tokens\": 512},\n        ),\n    )\n\nif __name__ == \"__main__\":\n    dataset = load_dataset(\"distilabel-internal-testing\u002Finstructions\", split=\"test\")\n    distiset = pipeline.run(dataset=dataset)\n    distiset.push_to_hub(repo_id=\"distilabel-example\")\n```\n\n## 徽章\n\n如果您使用 `distilabel` 构建了酷炫的东西，请考虑在您的数据集或模型卡片上添加以下徽章之一。\n\n    [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_25ce9e17a758.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"\u002F>](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)\n\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_25ce9e17a758.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"\u002F>](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)\n\n    [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_8c6e162fcafd.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"\u002F>](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)\n\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_readme_8c6e162fcafd.png\" alt=\"Built with Distilabel\" width=\"200\" 
height=\"32\"\u002F>](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)\n\n## 贡献\n\n如需直接为 `distilabel` 做贡献，请查看我们的 [初次贡献问题](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) 或 [创建一个新的问题](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues\u002Fnew\u002Fchoose)。\n\n## 引用\n\n```bibtex\n@misc{distilabel-argilla-2024,\n  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},\n  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel}}\n}\n```","# Distilabel 快速上手指南\n\nDistilabel 是一个专为工程师设计的合成数据生成与 AI 反馈框架。它基于经过验证的研究方法，帮助你快速构建可扩展的流水线，用于生成高质量数据集或对模型输出进行评判，从而提升大语言模型（LLM）的微调效果。\n\n## 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS 或 Windows\n*   **Python 版本**：3.9 或更高版本\n*   **前置依赖**：建议先更新 `pip` 包管理工具\n\n> **注意**：Distilabel 支持多种 LLM 提供商（如 OpenAI, Hugging Face, Ollama 等）。根据你计划使用的模型服务，后续可能需要安装额外的可选依赖（extras）。\n\n## 安装步骤\n\n### 1. 基础安装\n使用 pip 安装核心库：\n\n```sh\npip install distilabel --upgrade\n```\n\n*国内用户加速建议*：如果下载速度较慢，推荐使用国内镜像源：\n```sh\npip install distilabel --upgrade -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2. 
安装可选依赖（按需）\nDistilabel 采用模块化设计，你需要根据使用的具体模型后端安装对应的扩展包。例如，若需使用 **Hugging Face Inference Endpoints**，请执行：\n\n```sh\npip install \"distilabel[hf-inference-endpoints]\" --upgrade\n```\n\n其他常用扩展包括：\n*   `openai`: 使用 OpenAI API 及兼容接口\n*   `ollama`: 本地运行 Ollama 模型\n*   `vllm`: 使用 vLLM 推理引擎\n*   `argilla`: 将结果导出至 Argilla 平台\n\n完整列表请参考官方文档中的 \"Installation\" 章节。\n\n## 基本使用\n\n以下示例展示了一个最基础的流水线：加载指令数据集，使用 Hugging Face Inference Endpoints 上的 Llama 3.1 模型进行文本生成，并将结果推送到 Hugging Face Hub。\n\n**前提**：已安装 `distilabel[hf-inference-endpoints]` 并配置好 `HF_TOKEN` 环境变量。\n\n```python\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline() as pipeline:\n    TextGeneration(\n        llm=InferenceEndpointsLLM(\n            model_id=\"meta-llama\u002FMeta-Llama-3.1-8B-Instruct\",\n            generation_kwargs={\"temperature\": 0.7, \"max_new_tokens\": 512},\n        ),\n    )\n\nif __name__ == \"__main__\":\n    # 加载测试数据集\n    dataset = load_dataset(\"distilabel-internal-testing\u002Finstructions\", split=\"test\")\n    \n    # 运行流水线\n    distiset = pipeline.run(dataset=dataset)\n    \n    # 将生成的数据集推送到 Hugging Face Hub\n    distiset.push_to_hub(repo_id=\"distilabel-example\")\n```\n\n**代码说明**：\n1.  **定义流水线**：使用 `Pipeline` 上下文管理器创建流程。\n2.  **配置任务**：`TextGeneration` 步骤指定了要使用的 LLM（此处为 `InferenceEndpointsLLM`）及其生成参数。\n3.  
**执行与输出**：调用 `pipeline.run()` 处理输入数据集，生成的 `distiset` 对象可直接上传至 Hugging Face Hub 供后续微调使用。","某初创团队正在为垂直领域的法律问答大模型构建高质量的微调数据集，急需大量多样化的合成指令数据及对应的质量评估反馈。\n\n### 没有 distilabel 时\n- **流程割裂且低效**：工程师需手动编写脚本串联数据生成、清洗和评估环节，每次调整策略都要重写大量胶水代码，迭代周期长达数天。\n- **评估标准不一致**：依赖人工标注或简单的规则匹配来评判合成数据质量，缺乏基于前沿研究论文的标准化 AI 反馈机制，导致数据噪声大。\n- **扩展性差**：当数据量从千级扩展到百万级时，原有脚本缺乏容错机制和并行处理能力，任务频繁中断且难以恢复。\n- **模型供应商锁定**：切换不同的 LLM 提供商（如从 OpenAI 切到本地部署模型）需要重构整个 API 调用逻辑，无法统一管控。\n\n### 使用 distilabel 后\n- **流水线编排自动化**：利用 distilabel 的程序化接口快速搭建可扩展的数据生成与反馈闭环，将迭代周期从数天缩短至几小时。\n- **引入科研级评估**：直接调用内置的基于验证论文的研究方法，让强大的 LLM 自动对合成数据进行打分和筛选，显著提升数据纯净度。\n- **弹性伸缩与容错**：原生支持大规模分布式运行，任务失败可自动重试或断点续传，轻松处理百万级数据合成任务而不崩溃。\n- **统一 API 集成多模型**：通过 distilabel 的统一抽象层，无缝切换任意 LLM 提供商进行数据生成或评判，无需修改核心业务逻辑。\n\ndistilabel 通过标准化的合成数据与 AI 反馈流水线，帮助团队以最低成本实现了高质量专属数据集的快速构建与迭代。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_distilabel_3ac7c98e.png","argilla-io","Argilla","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fargilla-io_5953542f.png","Building the open-source feedback layer for LLMs",null,"contact@argilla.io","argilla_io","https:\u002F\u002Fargilla.io","https:\u002F\u002Fgithub.com\u002Fargilla-io",[82,86,90,94],{"name":83,"color":84,"percentage":85},"Python","#3572A5",98.7,{"name":87,"color":88,"percentage":89},"Jinja","#a52a22",1.2,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0.1,{"name":95,"color":96,"percentage":97},"Makefile","#427819",0,3156,233,"2026-04-07T18:26:24","Apache-2.0","未说明","非必需。取决于所选的 LLM 集成：若使用本地模型（如 vllm, llama-cpp, transformers, mlx），则需要 GPU（具体型号和显存取决于模型大小）；若使用云端 API（如 OpenAI, Anthropic, HF Inference Endpoints），则无需本地 GPU。","未说明（取决于处理的数据集大小和运行的模型）",{"notes":106,"python":107,"dependencies":108},"该工具是一个框架，核心安装仅需 Python 3.9+。具体的硬件和软件依赖完全取决于用户选择的‘额外组件’（extras）。例如，使用云端 API（OpenAI, Anthropic 等）无需特殊硬件；若在本地运行开源模型，则需根据模型大小配置相应的 GPU 显存和计算资源。支持通过 Ray 进行管道扩展和容错。","3.9+",[109,110,111,112,113,114,115,116,117],"datasets","transformers (可选，用于 hf-transformers)","vllm (可选，用于 
vllm)","llama-cpp-python (可选，用于 llama-cpp)","ray (可选，用于分布式扩展)","sentence-transformers (可选，用于嵌入)","faiss-cpu\u002Ffaiss-gpu (可选，用于向量搜索)","outlines (可选，用于结构化生成)","instructor (可选，用于结构化生成)",[14,13,15,35,16],[120,121,122,123,124,125,126,127,128],"ai","huggingface","llms","openai","python","rlaif","rlhf","synthetic-data","synthetic-dataset-generation","2026-03-27T02:49:30.150509","2026-04-08T13:02:19.499680",[132,137,142,147,151,156],{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},24331,"在使用 UltraFeedback 任务时遇到报错，提示处理批次失败并发送空批次，如何解决？","这通常是因为数据集格式不符合 UltraFeedback 的预期。'generations' 字段必须是一个包含字符串的列表。你需要更新数据集以符合该格式。示例代码如下：\n\nfrom distilabel.steps import LoadDataFromDicts\nloader = LoadDataFromDicts(\n    data=[\n        {\n            \"instruction\": \"你的指令内容\",\n            \"generations\": [\n                \"模型生成的回答 1\",\n                \"模型生成的回答 2\"\n            ]\n        }\n    ]\n)\n\n确保传入的数据中 generations 是列表形式，而不是单个字符串或其他格式。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues\u002F879",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},24332,"运行旧版教程代码时出现 ValidationError 或兼容性问题怎么办？","这是因为教程是基于旧版本（如 0.5.0 或 0.6.0）编写的，而当前最新版本（1.0.0）存在重大变更。解决方法是安装与教程匹配的旧版本：\n\n!pip install -q -U distilabel==0.6.0 \"farm-haystack[preprocessing]\"\n\n或者查阅对应版本的文档：\n- 0.5.0 版：https:\u002F\u002Fdistilabel.argilla.io\u002F0.5.0\u002F\n- 0.6.0 版：https:\u002F\u002Fdistilabel.argilla.io\u002F0.6.0\u002F\n\n建议等待官方更新教程以适配 v1.0.0。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues\u002F612",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},24333,"如何在偏好数据集生成过程中同时使用多个模型？","可以使用 LLMPool 和 ProcessLLM 来组合多个模型。示例代码如下：\n\nfrom distilabel.llm import LLMPool, ProcessLLM\nfrom distilabel.tasks import TextGenerationTask\n\ndef load_model_1(task):\n    return TogetherInferenceLLM(model=\"model-name-1\", api_key=\"key\", task=task, num_threads=4)\n\ndef load_model_2(task):\n    return 
TogetherInferenceLLM(model=\"model-name-2\", api_key=\"key\", task=task, num_threads=4)\n\npool = LLMPool(\n    llms=[\n        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_model_1),\n        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_model_2),\n    ]\n)\n\n这样可以在一个流程中并行调用多个模型生成数据。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues\u002F304",{"id":148,"question_zh":149,"answer_zh":150,"source_url":146},24334,"如何将已有的私有 JSONL 文件用于 Distilabel 进行标注？","你可以将 JSONL 文件加载为 Hugging Face Dataset 对象，并确保其列名和格式符合任务要求。例如，若使用 PreferenceTask，需有一个名为 'generations' 的列，其值为 LLM 输出的字符串列表。列表长度应与 pipeline.generate() 中的 num_generations 参数一致。示例结构：\n\n[\n  {\"instruction\": \"问题\", \"generations\": [\"回答 1\", \"回答 2\"]},\n  ...\n]\n\n然后使用 LoadDataFromDicts 或直接构造 Dataset 对象传入 Pipeline。",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},24335,"PreferenceToArgilla 在使用自定义 input_mappings 时报错，如何处理？","该问题可能是由于配置中 input_columns 参数未被正确识别所致。检查是否使用了正确的字段映射，例如：\n\npush_to_argilla = PreferenceToArgilla(\n    name=\"push_to_argilla\",\n    api_url=\"your_api_url\",\n    api_key=\"your_api_key\",\n    dataset_name=\"ultrallama3\",\n    dataset_workspace=\"admin\",\n    input_columns={\"generations\": \"generation\"},  # 确保源列名正确\n    num_generations=2,\n)\n\n如果仍报错，可能是 Pydantic 模型配置允许了额外属性而未抛出异常，建议升级 distilabel 或检查输入数据结构是否完全匹配预期 schema。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fissues\u002F576",{"id":157,"question_zh":158,"answer_zh":159,"source_url":141},24336,"Distilabel 不同版本之间 API 变化大吗？如何避免兼容性问题？","是的，从 0.x 到 1.0.0 版本之间存在大量破坏性变更（breaking changes）。为避免问题：\n1. 明确你所用教程或示例对应的 distilabel 版本；\n2. 使用 pip install distilabel==\u003C版本号> 安装指定版本；\n3. 查阅对应版本的官方文档（如 https:\u002F\u002Fdistilabel.argilla.io\u002F0.6.0\u002F）；\n4. 
关注官方发布的迁移指南或更新日志。\n\n推荐在新项目中直接使用最新版并参考最新文档，旧项目则锁定版本以确保稳定性。",[161,166,171,176,181,186,191,196,201,206,211,216,221,226,231,236,241,246,251,256],{"id":162,"version":163,"summary_zh":164,"released_at":165},153851,"1.5.3","## 变更内容\n* 修复 @Riezebos 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1111 中的拼写错误\n* 仅在可用时使用 PIL 检查图像，由 @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1112 中实现\n* 修复多步副本导致管道卡住的问题，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1113 中完成\n\n## 新贡献者\n* @Riezebos 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1111 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.5.2...1.5.3","2025-01-28T10:08:05",{"id":167,"version":168,"summary_zh":169,"released_at":170},153852,"1.5.2","## 变更内容\n* 由 @rolshoven 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1105 中修复了结构化输出 JSON，使其与 `pydantic.BaseModel` 和 `LiteLLM` 异步完成客户端兼容。\n\n## 新贡献者\n* @rolshoven 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1105 中完成了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.5.1...1.5.2","2025-01-22T10:48:45",{"id":172,"version":173,"summary_zh":174,"released_at":175},153853,"1.5.1","## 变更内容\n* 移除已弃用的 `CombineColumns` 步骤，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1101 中完成\n* 修复图像导入处理并更新 MlxLLM 的初始化，由 @davidberenstein1957 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1102 中完成\n* 通过使其与 `mlx-lm>=0.21` 保持一致，修复 `MlxLLM`，由 @davidberenstein1957 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1103 中完成\n* `1.5.1` 版本，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1104 中发布\n\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.5.0...1.5.1","2025-01-17T14:39:19",{"id":177,"version":178,"summary_zh":179,"released_at":180},153854,"1.5.0","# ✨ 发布亮点\n\n## 🖼️ 图像生成支持\n\n我们很高兴地推出 `ImageGenerationModel`，这是一个用于处理图像生成模型的新抽象层。这一新增功能使得与能够将文本提示转换为图像的模型无缝集成成为可能。\n\n#### 可用服务\n- 🤗 `InferenceEndpointsImageGeneration`: 与 Hugging Face 的推理端点集成\n- `OpenAIImageGeneration`: 与 OpenAI 的 DALL-E 集成\n\n#### 架构\n正如 `LLM` 被 `Task` 使用一样，我们也引入了 `ImageTask`，作为图像生成工作流的高级抽象。`ImageTask` 定义了某个步骤应如何使用 `ImageGenerationModel` 来完成特定的图像生成任务。\n\n我们的第一个实现是 `ImageGeneration` 任务，它提供了一个简洁的接口：给定一个文本提示，它会利用任何受支持的图像生成模型生成相应的图像。\n\n我们还添加了一个关于如何使用 `distilabel` 生成图像的小教程：[distilabel - 教程 - 使用 `distilabel` 进行图像生成](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fsections\u002Fpipeline_samples\u002Fexamples\u002Fimage_generation\u002F#image-generation-with-distilabel)\n\n## 将图像作为 `LLM` 的输入\n\n我们通过新的 [`TextGenerationWithImage`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Ftextgenerationwithimage\u002F) 任务，增加了对向 `LLM` 提供图像输入的初步支持。我们已经更新并测试了 `InferenceEndpointsLLM` 和 `OpenAILLM` 以支持这一新任务，但在接下来的版本中，我们还将为其他如 `vLLM` 等模型添加图像输入兼容性。\n\n请查看教程 [distilabel - 教程 - 在 `distilabel` 中使用图像进行文本生成](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fsections\u002Fpipeline_samples\u002Fexamples\u002Ftext_generation_with_image\u002F) 来开始使用吧！\n\n## 💻 新增 `MlxLLM` 集成\n\n我们已将 [mlx-lm](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-examples) 包与新的 `MlxLLM` 类集成，从而在 Apple Silicon Mac 上实现原生的机器学习加速。这一集成通过利用 MLX 专为 M 系列芯片设计的高度优化框架，极大地提升了合成数据的生成效率。\n\n## 新的 `InstructionResponsePipeline` 模板\n\n为了使 `distilabel` 从一开始就能更易于使用，我们已经开始进行一些改动。我们将逐步添加预设或模板，以便用户能够快速搭建一个针对特定任务、带有合理预配置默认值的数据生成流水线。我们首先开发的任务是 SFT 或指令响应微调流水线，您可以这样使用：\n\n```python\nfrom distilabel.pipeline import InstructionResponsePipeline\n\npipeline = InstructionResponsePipeline()\ndistiset = pipeline.run()\n```\n\n## 
定义加载阶段\n\n我们增加了一种方式，允许用户定义流水线中的哪些步骤应一起加载，从而实现更高效的资源管理和对执行流程的更好控制。这一新特性在以下场景中尤为有用：","2025-01-17T08:28:15",{"id":182,"version":183,"summary_zh":184,"released_at":185},153855,"1.4.2","## 变更内容\n* 修复 `TransformersLLM` 中聊天模板未生效的问题，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1083 中完成。\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.4.1...1.4.2","2024-12-18T16:42:52",{"id":187,"version":188,"summary_zh":189,"released_at":190},153856,"1.4.1","## 变更内容\n* 修复了 `SignatureMixin` 中未能正确处理所有原始类型列表的问题，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F1037 中完成。\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.4.0...1.4.1","2024-10-16T07:30:00",{"id":192,"version":193,"summary_zh":194,"released_at":195},153857,"1.4.0","# ✨ 发布亮点\n\n## 离线批量生成与 OpenAI 批量 API\n\n我们更新了 `LLM` 接口，现在可以将使用提供批量服务的外部平台的 `LLM` 集成到 `distilabel` 中。此外，`OpenAILLM` 也进行了更新，能够使用 OpenAI 的批量 API，从而实现 50% 的成本降低。\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9a559ae1-099b-47a4-9f92-37a3171dfbff\n\n## 改进的缓存机制，提升输出复用性\n\n众所周知，运行 `LLM` 成本较高，因此我们通常希望尽可能多地复用其生成的输出。在本次发布之前，`distilabel` 的缓存机制仅支持恢复未完成的管道执行，以及重新创建已完成并再次执行的 `Distiset`。\n\n而在本次发布中，我们大幅改进了缓存功能，使得所有 `Step` 的输出都会被缓存，从而可以在其他管道执行中复用，即使该管道已经发生了变化：\n\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F03d6c110-e98a-463e-8876-62c3733d3ef0)\n\n此外，我们在 `Step` 中新增了一个 `use_cache` 属性，允许在步骤级别控制是否启用缓存。\n\n## 步骤可生成附加工件\n\n在某些情况下，`Step` 会生成一些用于生成其输出的附加工件。这些工件可能需要较长时间才能生成，并且未来还可以被复用。因此，我们新增了一个名为 `Step.save_artifact` 的方法，可以在步骤内部调用以存储由该步骤生成的工件。这些工件也会被上传到 Hugging Face Hub。\n\n```python\nfrom typing import List, TYPE_CHECKING\nfrom distilabel.steps import GlobalStep, StepInput, StepOutput\nimport matplotlib.pyplot as plt\n\nif TYPE_CHECKING:\n    from distilabel.steps import StepOutput\n\n\nclass CountTextCharacters(GlobalStep):\n    @property\n    def 
inputs(self) -> List[str]:\n        return [\"text\"]\n\n    @property\n    def outputs(self) -> List[str]:\n        return [\"text_character_count\"]\n\n    def process(self, inputs: StepInput) -> \"StepOutput\":  # type: ignore\n        character_counts = []\n\n        for input in inputs:\n            text_character_count = len(input[\"text\"])\n            input[\"text_character_count\"] = text_character_count\n            character_counts.append(text_character_count)\n\n        # 生成文本字符数分布的图表\n        plt.figure(figsize=(10, 6))\n        plt.hist(character_counts, bins=30, edgecolor=\"black\")\n        plt.title(\"文本字符数分布\")\n        plt.xlabel(\"字符数\")\n        plt.ylabel(\"频率\")\n\n        # 将图表保存为步骤的工件\n        self.save_artifact(\n            name=\"text_character_count_distribution\",\n            write_function=lambda path: plt.savefig(path \u002F \"figure.png\"),\n            metadata={\"type\": \"image\", \"library\": \"matplotlib\"\n","2024-10-08T14:53:40",{"id":197,"version":198,"summary_zh":199,"released_at":200},153858,"1.3.2","## 变更内容\n* DeepSeek 定理证明任务，由 @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F733 中实现\n* 不再取消正在进行的文档工作流，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F919 中实现\n* 修复为 vLLM 创建 Ray 部署组的问题，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F918 中修复\n* 修复在 `InferenceEndpointsLLM` 中将 `base_url` 传递到 `model_id` 的问题，由 @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F924 中修复\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.3.1...1.3.2","2024-08-23T13:15:00",{"id":202,"version":203,"summary_zh":204,"released_at":205},153859,"1.3.1","## 变更内容\n* 创建新的 `distilabel.constants` 模块，用于存储常量，并通过 @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F861 中避免循环导入。\n* 添加 OpenAI 请求超时功能，由 @ashim-mahara 在 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F858 中实现。\n\n## 新贡献者\n* @ashim-mahara 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F858 中完成了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.3.0...1.3.1","2024-08-07T09:09:37",{"id":207,"version":208,"summary_zh":209,"released_at":210},153860,"1.3.0","## 变更内容\n* @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F747 中添加了新步骤 `CombineKeys`\n* @davidberenstein1957 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F758 中重构了列命名步骤，包括 `combinecolumns`、`combinekeys` 和 `expandcolumns`\n* @davidberenstein1957 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F759 中移除了已弃用的 `LoadHubDataset`\n* @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F720 中为 `Pipeline` 添加了 `requirements` 列表\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F750 中为 `Pipeline` 添加了 `StepResources` 和步骤副本\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F760 中添加了加载阶段\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F770 中将最低要求版本更新为 `python==3.9`\n* @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F762 中增加了在推送 DistiSet 时可选地将流水线脚本包含到 Hub 的功能\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F774 中添加了 `docs-pr.yml` 和 `docs-pr-close.yml` 工作流\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F769 中添加了 `RayPipeline` 类\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F776 中修复了已关闭 PR 的工作流\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F778 中添加了 `Magpie` 和 `MagpieGenerator` 任务\n* @gabrielmbmb 在 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F783 中修复了与 `Magpie` 任务相关的一些问题\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F784 中为 `Magpie` 任务添加了 `end_with_user` 和 `include_system_prompt` 标志，并处理了 `None` 值的情况。\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F796 中为文档发布添加了工作流并发组\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F795 中为 `CudaDevicePlacementMixin` 添加了 `_desired_num_gpus` 属性\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F805 中实现了与 `vLLM` 的兼容性，并支持 `tensor_parallel_size` 参数\n* @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F808 中更新了 `GroupColumns` 的默认名称\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F828 中提出，如果流水线中只有一个步骤，则应向 `GeneratorStep` 请求批次\n* @plaguss 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F809 中为流水线添加了默认名称\n* @davidberenstein1957 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F821 中根据 PR 更新了基于 Hugging Face Hub 的 DistiLabel 表述\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F833 中进行了更多 `Magpie` 改进\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F830 中添加了 `Embeddings` 基类、`SentenceTransformerEmbeddings` 类、`EmbeddingGeneration` 和 `FaissNearestNeighbour` 步骤\n* @gabrielmbmb 在 https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F814 中为 `CudaDevicePlacementMixin` 按主机名创建单独文件\n* 从…创建一个 `GeneratorStep`","2024-08-06T14:16:22",{"id":212,"version":213,"summary_zh":214,"released_at":215},153861,"1.2.4","## What's Changed\r\n* Update `InferenceEndpointsLLM` to use `chat_completion` method by @gabrielmbmb in 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F815\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.2.3...1.2.4","2024-07-23T16:03:11",{"id":217,"version":218,"summary_zh":219,"released_at":220},153862,"1.2.3","## What's Changed\r\n* Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F786\r\n* Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F791\r\n* docs: update script for issue dashboard by @sdiazlor in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F775\r\n* Fix 404 model not found for private Serverless IE by @dvsrepo in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F806\r\n\r\n## New Contributors\r\n* @Hassaan-Qaisar made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F786\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.2.2...1.2.3","2024-07-23T08:02:06",{"id":222,"version":223,"summary_zh":224,"released_at":225},153863,"1.2.2","## What's Changed\r\n* Fix passing `input` to `format_output` function by @gabrielmbmb in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F781\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.2.1...1.2.2","2024-07-12T11:09:43",{"id":227,"version":228,"summary_zh":229,"released_at":230},153864,"1.2.1","## What's Changed\r\n* Fix docs for distiset.save_to_disk kwargs by @fpreiss in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F745\r\n* docs: change references by @sdiazlor in 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F754\r\n* Fix `response_format` for `TogetherLLM` and `AnyScaleLLM` by @gabrielmbmb in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F764\r\n\r\n## New Contributors\r\n* @fpreiss made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F745\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.2.0...1.2.1","2024-07-01T08:58:30",{"id":232,"version":233,"summary_zh":234,"released_at":235},153865,"1.2.0","# ✨ Release highlights\r\n\r\n## Structured generation with `instructor`, `InferenceEndpointsLLM` now supports structured generation and `StructuredGeneration` task\r\n\r\n* [`instructor`](https:\u002F\u002Fgithub.com\u002Fjxnl\u002Finstructor) has been integrated bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`:\r\n\r\n  \u003Cdetails>\r\n    \u003Csummary>Structured generation with `instructor` example\u003C\u002Fsummary>\r\n  \r\n  ```python\r\n  from typing import List\r\n  \r\n  from distilabel.llms import MistralLLM\r\n  from distilabel.pipeline import Pipeline\r\n  from distilabel.steps import LoadDataFromDicts\r\n  from distilabel.steps.tasks import TextGeneration\r\n  from pydantic import BaseModel, Field\r\n  \r\n  \r\n  class Node(BaseModel):\r\n      id: int\r\n      label: str\r\n      color: str\r\n  \r\n  \r\n  class Edge(BaseModel):\r\n      source: int\r\n      target: int\r\n      label: str\r\n      color: str = \"black\"\r\n  \r\n  \r\n  class KnowledgeGraph(BaseModel):\r\n      nodes: List[Node] = Field(..., default_factory=list)\r\n      edges: List[Edge] = Field(..., default_factory=list)\r\n  \r\n  \r\n  with Pipeline(\r\n      name=\"Knowledge-Graphs\",\r\n      description=(\r\n          \"Generate knowledge graphs to answer questions, this type of dataset 
can be used to \"\r\n          \"steer a model to answer questions with a knowledge graph.\"\r\n      ),\r\n  ) as pipeline:\r\n      sample_questions = [\r\n          \"Teach me about quantum mechanics\",\r\n          \"Who is who in The Simpsons family?\",\r\n          \"Tell me about the evolution of programming languages\",\r\n      ]\r\n  \r\n      load_dataset = LoadDataFromDicts(\r\n          name=\"load_instructions\",\r\n          data=[\r\n              {\r\n                  \"system_prompt\": \"You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.\",\r\n                  \"instruction\": f\"{question}\",\r\n              }\r\n              for question in sample_questions\r\n          ],\r\n      )\r\n  \r\n      text_generation = TextGeneration(\r\n          name=\"knowledge_graph_generation\",\r\n          llm=MistralLLM(\r\n              model=\"open-mixtral-8x22b\",\r\n              structured_output={\"schema\": KnowledgeGraph}\r\n          ),\r\n      )\r\n      load_dataset >> text_generation\r\n  ```\r\n  \u003C\u002Fdetails>\r\n* `InferenceEndpointsLLM` now supports structured generation\r\n* New [`StructuredGeneration`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fstructuredgeneration\u002F) task that allows defining the schema of the structured generation per input row.\r\n\r\n## New tasks for generating datasets for training embedding models\r\n\r\n[`sentence-transformers`](https:\u002F\u002Fsbert.net\u002F) v3 was recently released and we couldn't resist the urge of adding a few new tasks to allow creating datasets for training embedding models!\r\n\r\n* New [`GenerateSentencePair`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fgeneratesentencepair\u002F) task that allows to generate a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. 
The task allows creating different kinds of data by specifying the `action` to perform with respect to the `anchor`: paraphrasing, generating a semantically-similar sentence, generating a query, or generating an answer.\r\n* Implemented [Improving Text Embeddings with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00368) and added the following tasks derived from the paper:\r\n  * [`EmbeddingTaskGenerator`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fembeddingtaskgenerator\u002F) which allows generating new embedding-related tasks using an `LLM`.\r\n  * [`GenerateTextRetrievalData`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fgeneratetextretrievaldata\u002F) which allows creating text retrieval data with an `LLM`.\r\n  * [`GenerateShortTextMatchingData`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fgenerateshorttextmatchingdata\u002F) which allows creating short texts matching the input data.\r\n  * [`GenerateLongTextMatchingData`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fgeneratelongtextmatchingdata\u002F) which allows creating long texts matching the input data.\r\n  * [`GenerateTextClassificationData`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fgeneratetextclassificationdata\u002F) which allows creating text classification data from the input data.\r\n  * [`MonolingualTripletGenerator`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fmonolingualtripletgenerator\u002F) which allows creating monolingual triplets from the input data.\r\n  * [`BitextRetrievalGenerator`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gallery\u002Ftasks\u002Fbitextretrievalgenerator) which allows creating bitext retrieval data from the input data.\r\n  \r\n## New 
`Step`s for loading data from different sources and saving\u002Floading `Distiset` to disk\r\n\r\nWe've added a few new steps for loading data from different sources:\r\n\r\n* [`LoadDataFromDisk`](https:\u002F\u002Fdistilabel.argilla.io\u002Flatest\u002Fcomponents-gall","2024-06-18T12:40:40",{"id":237,"version":238,"summary_zh":239,"released_at":240},153866,"1.1.1","## What's Changed\r\n* Fix crash when using vLLM without structured generation by @cg123 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F658\r\n* Fix error on `Pipeline.dry_run` without `parameters` by @plaguss in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F655\r\n\r\n## New Contributors\r\n* @cg123 made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F658\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.1.0...1.1.1","2024-05-22T06:29:39",{"id":242,"version":243,"summary_zh":244,"released_at":245},153867,"1.1.0","## Distilabel 1.1.0\r\n\r\n### Two new tasks implemented!\r\n\r\n#### `Genstruct` task (https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F600)\r\n\r\nYou can now use `Genstruct` task as described in https:\u002F\u002Fhuggingface.co\u002FNousResearch\u002FGenstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:\r\n\r\n```python\r\nfrom distilabel.llms import TransformersLLM\r\nfrom distilabel.pipeline import Pipeline\r\nfrom distilabel.steps import KeepColumns, LoadDataFromDicts\r\nfrom distilabel.steps.tasks import Genstruct\r\n\r\nwith Pipeline(name=\"harry-potter-genstruct\") as pipeline:\r\n    load_hub_dataset = LoadDataFromDicts(\r\n        name=\"load_dataset\",\r\n        data=[\r\n            {\r\n                \"title\": \"Harry Potter and the Sorcerer's Stone\",\r\n                \"content\": \"An orphaned boy enrolls in a school of wizardry, 
where he learns the truth about himself, his family and the terrible evil that haunts the magical world.\",\r\n            },\r\n            {\r\n                \"title\": \"Harry Potter and the Chamber of Secrets\",\r\n                \"content\": \"Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.\",\r\n            },\r\n        ],\r\n    )\r\n\r\n    task = Genstruct(\r\n        name=\"task\",\r\n        llm=TransformersLLM(\r\n            model=\"NousResearch\u002FGenstruct-7B\",\r\n            torch_dtype=\"float16\",\r\n            chat_template=\"{{ messages[0]['content'] }}\",\r\n            device=\"cuda:0\",\r\n        ),\r\n        num_generations=2,\r\n        group_generations=False,\r\n        output_mappings={\"model_name\": \"model\"},\r\n    )\r\n```\r\n\r\n#### `PrometheusEval` task (https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F610)\r\n\r\nA new `PrometheusEval` task, based on the recently published paper [\"Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.01535):\r\n\r\n```python\r\nfrom distilabel.steps.tasks import PrometheusEval\r\n\r\nwith Pipeline(name=\"prometheus\") as pipeline:\r\n    load_dataset = LoadHubDataset(\r\n        name=\"load_dataset\",\r\n        repo_id=\"HuggingFaceH4\u002Finstruction-dataset\",\r\n        split=\"test\",\r\n        output_mappings={\"prompt\": \"instruction\", \"completion\": \"generation\"},\r\n    )\r\n\r\n    task = PrometheusEval(\r\n        name=\"task\",\r\n        llm=vLLM(\r\n            model=\"prometheus-eval\u002Fprometheus-7b-v2.0\",\r\n            chat_template=\"[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[\u002FINST]\",\r\n        ),\r\n        
mode=\"absolute\",\r\n        rubric=\"factual-validity\",\r\n        reference=False,\r\n        num_generations=1,\r\n        group_generations=False,\r\n    )\r\n    \r\n    load_dataset >> task\r\n```\r\n\r\n### Connect the steps in the pipeline with `>>` (https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F490)\r\n\r\nNow you can connect your steps using the *binary shift* operator in python:\r\n\r\n```python\r\nfrom distilabel.pipeline import Pipeline\r\nfrom distilabel.steps.generators.huggingface import LoadHubDataset\r\nfrom distilabel.steps.task.evol_instruct.base import EvolInstruct\r\nfrom distilabel.steps.combine import CombineColumns\r\n\r\nwith Pipeline(name=\"Pipe name\") as pipeline:\r\n    load_hub_dataset = LoadHubDataset(name=\"load_dataset\", batch_size=8)\r\n    evol_instruction_complexity_1 = EvolInstruct(\r\n        llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\r\n    )\r\n    evol_instruction_complexity_2 = EvolInstruct(\r\n        llm=InferenceEndpointsLLM(model_id=\"mistralai\u002FMixtral-8x7B-Instruct-v0.1\"),\r\n    )\r\n\r\n    combine_columns = CombineColumns(\r\n        columns=[\"response\"],\r\n        output_columns=[\"candidates\"],\r\n    )\r\n\r\n    (\r\n        load_hub_dataset \r\n        >> [evol_instruction_complexity_1, evol_instruction_complexity_2] \r\n        >> combine_columns\r\n    )\r\n```\r\n\r\n### Routing batch function (https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F595)\r\n\r\nThanks to the new `routing_batch_function`, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. 
In addition, we have included a `sample_n_steps` routing batch function, making easier replicating the definition of the original UltraFeedback paper:\r\n\r\n```python\r\nimport random\r\nfrom distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM\r\nfrom distilabel.pipeline import Pipeline, routing_batch_function\r\nfrom distilabel.steps import CombineColumns, LoadHubDataset\r\nfrom distilabel.steps.tasks import TextGeneration\r\n\r\n@routing_batch_function()\r\ndef sample_two_steps(steps: list[str]) -> list[str]:\r\n    return random.sample(steps, 2)\r\n\r\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\r\n    load_dataset = LoadHubDataset(\r\n        name=\"load_dataset\",\r\n        output_mappings={\"prompt\": \"instruction\"},\r\n    )\r\n\r\n    tasks = []\r\n    for llm in (\r\n        OpenAILLM(model=\"gpt-4-0125-preview\"),\r\n        MistralLLM(model=\"mistral-large-2402\"),\r\n        VertexAILLM(model=\"gemini-1.0-pro\"),\r\n ","2024-05-20T14:02:10",{"id":247,"version":248,"summary_zh":249,"released_at":250},153868,"1.0.3","## What's Changed\r\n* Add `stop` and `stop_sequences` in `LLM.generate` subclasses by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F585\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.0.2...1.0.3","2024-04-25T12:48:19",{"id":252,"version":253,"summary_zh":254,"released_at":255},153869,"1.0.2","## What's Changed\r\n\r\n* Fix `RuntimeParamater` validation when provided as `_Step` attr by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F564\r\n* Add `seed` with `random.randint` to ensure cache is not used by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F571\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.0.1...1.0.2","2024-04-24T11:43:32",{"id":257,"version":258,"summary_zh":259,"released_at":260},153870,"1.0.1","## What's Changed\r\n* Fix typo in readme and remove the ToArgilla step by @dvsrepo in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F548\r\n* Fix `model_validator` in `InferenceEndpoints` due to `Pipeline` pickling by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fpull\u002F552\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel\u002Fcompare\u002F1.0.0...1.0.1","2024-04-19T10:11:54"]