[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-argilla-io--synthetic-data-generator":3,"tool-argilla-io--synthetic-data-generator":64},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[15,33,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":45,"last_commit_at":62,"category_tags":63,"status":16},2181,"OpenHands","OpenHands\u002FOpenHands","OpenHands 是一个专注于 AI 
驱动开发的开源平台，旨在让智能体（Agent）像人类开发者一样理解、编写和调试代码。它解决了传统编程中重复性劳动多、环境配置复杂以及人机协作效率低等痛点，通过自动化流程显著提升开发速度。\n\n无论是希望提升编码效率的软件工程师、探索智能体技术的研究人员，还是需要快速原型验证的技术团队，都能从中受益。OpenHands 提供了灵活多样的使用方式：既可以通过命令行（CLI）或本地图形界面在个人电脑上轻松上手，体验类似 Devin 的流畅交互；也能利用其强大的 Python SDK 自定义智能体逻辑，甚至在云端大规模部署上千个智能体并行工作。\n\n其核心技术亮点在于模块化的软件智能体 SDK，这不仅构成了平台的引擎，还支持高度可组合的开发模式。此外，OpenHands 在 SWE-bench 基准测试中取得了 77.6% 的优异成绩，证明了其解决真实世界软件工程问题的能力。平台还具备完善的企业级功能，支持与 Slack、Jira 等工具集成，并提供细粒度的权限管理，适合从个人开发者到大型企业的各类用户场景。",70626,"2026-04-05T22:51:36",[15,14,13,36],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":45,"env_os":100,"env_gpu":101,"env_ram":101,"env_deps":102,"category_tags":113,"github_topics":79,"view_count":45,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":114,"updated_at":115,"faqs":116,"releases":146},1197,"argilla-io\u002Fsynthetic-data-generator","synthetic-data-generator","Build datasets using natural language","生成数据集的工具，通过自然语言描述即可创建高质量的数据集。它解决了传统数据集构建过程繁琐、耗时的问题，尤其适合需要定制化数据的开发者和研究人员。可以用于文本分类、对话数据生成以及增强生成任务等场景。支持与Hugging Face Hub和Argilla集成，方便数据管理和使用。基于LLM和distilabel技术，能够快速生成符合需求的合成数据，提升AI开发效率。适合希望高效构建数据集的技术人员使用。","---\ntitle: Synthetic Data Generator\nshort_description: Build datasets using natural language\nemoji: 🧬\ncolorFrom: yellow\ncolorTo: pink\nsdk: gradio\nsdk_version: 5.8.0\napp_file: app.py\npinned: true\nlicense: apache-2.0\nhf_oauth: true\n#header: mini\nhf_oauth_scopes:\n- read-repos\n- write-repos\n- manage-repos\n- inference-api\n---\n> [!IMPORTANT]  \nCheck the successor of this project:\nhttps:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faisheets\n> \n\u003Cbr>\n\n> [!IMPORTANT]  \nThe 
original authors have moved on to other projects. While the code might still be functional for its original purpose, please be aware that the original team does not plan to develop new features, bug fixes, or updates. If you'd like to become a maintainer, please open an issue to discuss.\n>\n> \n\u003Cbr>\n\n\u003Ch2 align=\"center\">\n  \u003Ca href=\"\">\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fmain\u002Fassets\u002Flogo.svg\" alt=\"Synthetic Data Generator Logo\" width=\"80%\">\u003C\u002Fa>\n\u003C\u002Fh2>\n\u003Ch3 align=\"center\">Build datasets using natural language\u003C\u002Fh3>\n\n![Synthetic Data Generator](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fargilla\u002Fsynthetic-data-generator\u002Fresolve\u002Fmain\u002Fassets\u002Fui-full.png)\n\n## Introduction\n\nSynthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs. 
[The announcement blog](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fsynthetic-data-generator) goes over a practical example of how to use it, but you can also watch the [video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nXjVtnGeEss) to see it in action.\n\nSupported Tasks:\n\n- Text Classification\n- Chat Data for Supervised Fine-Tuning\n- Retrieval Augmented Generation\n\nThis tool simplifies the process of creating custom datasets, enabling you to:\n\n- Describe the characteristics of your desired application\n- Iterate on sample datasets\n- Produce full-scale datasets\n- Push your datasets to the [Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets?other=datacraft) and\u002For [Argilla](https:\u002F\u002Fdocs.argilla.io\u002F)\n\nBy using the Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fargilla_io\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftwitter-black?logo=x\"\u002F>\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Fargilla-io\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flinkedin-blue?logo=linkedin\"\u002F>\n\u003C\u002Fa>\n\u003Ca href=\"http:\u002F\u002Fhf.co\u002Fjoin\u002Fdiscord\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-7289DA?&logo=discord&logoColor=white\"\u002F>\n\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Installation\n\nYou can simply install the package with:\n\n```bash\npip install synthetic-dataset-generator\n```\n\n### Quickstart\n\n```python\nfrom synthetic_dataset_generator import launch\n\nlaunch()\n```\n\n### Environment Variables\n\n- `HF_TOKEN`: Your [Hugging Face 
token](https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens\u002Fnew?ownUserPermissions=repo.content.read&ownUserPermissions=repo.write&globalPermissions=inference.serverless.write&tokenType=fineGrained) to push your datasets to the Hugging Face Hub and generate free completions from Hugging Face Inference Endpoints. You can find some configuration examples in the [examples](examples\u002F) folder.\n\nYou can set the following environment variables to customize the generation process.\n\n- `MAX_NUM_TOKENS`: The maximum number of tokens to generate, defaults to `2048`.\n- `MAX_NUM_ROWS`: The maximum number of rows to generate, defaults to `1000`.\n- `DEFAULT_BATCH_SIZE`: The default batch size to use for generating the dataset, defaults to `5`.\n\nOptionally, you can use different API providers and models.\n\n- `MODEL`: The model to use for generating the dataset, e.g. `meta-llama\u002FMeta-Llama-3.1-8B-Instruct`, `gpt-4o`, `llama3.1`.\n- `API_KEY`: The API key to use for the generation API, e.g. `hf_...`, `sk-...`. If not provided, it will default to the `HF_TOKEN` environment variable.\n- `OPENAI_BASE_URL`: The base URL for any OpenAI compatible API, e.g. `https:\u002F\u002Fapi.openai.com\u002Fv1\u002F`.\n- `OLLAMA_BASE_URL`: The base URL for any Ollama compatible API, e.g. `http:\u002F\u002F127.0.0.1:11434\u002F`.\n- `HUGGINGFACE_BASE_URL`: The base URL for any Hugging Face compatible API, e.g. TGI server or Dedicated Inference Endpoints. If you want to use serverless inference, only set the `MODEL`.\n- `VLLM_BASE_URL`: The base URL for any VLLM compatible API, e.g. `http:\u002F\u002Flocalhost:8000\u002F`.\n\nTo use a specific model exclusively for generating completions, set the corresponding environment variables by appending `_COMPLETION` to the ones mentioned earlier. For example, you can use `MODEL_COMPLETION` and `OPENAI_BASE_URL_COMPLETION`.\n\nSFT and Chat Data generation is not supported with OpenAI Endpoints. 
Additionally, you need to configure it per model family based on their prompt templates using the right `TOKENIZER_ID` and `MAGPIE_PRE_QUERY_TEMPLATE` environment variables.\n\n- `TOKENIZER_ID`: The tokenizer ID to use for the magpie pipeline, e.g. `meta-llama\u002FMeta-Llama-3.1-8B-Instruct`.\n- `MAGPIE_PRE_QUERY_TEMPLATE`: Enforce setting the pre-query template for Magpie, which is only supported with Hugging Face Inference Endpoints. `llama3` and `qwen2` are supported out of the box and will use `\"\u003C|begin_of_text|>\u003C|start_header_id|>user\u003C|end_header_id|>\\n\\n\"` and `\"\u003C|im_start|>user\\n\"`, respectively. For other models, you can pass a custom pre-query template string.\n\nOptionally, you can also push your datasets to Argilla for further curation by setting the following environment variables:\n\n- `ARGILLA_API_KEY`: Your Argilla API key to push your datasets to Argilla.\n- `ARGILLA_API_URL`: Your Argilla API URL to push your datasets to Argilla.\n\nTo save the generated datasets to a local directory instead of pushing them to the Hugging Face Hub, set the following environment variable:\n\n- `SAVE_LOCAL_DIR`: The local directory to save the generated datasets to.\n\nYou can use our environment template as a starting point:\n\n```bash\ncp .env.local.template .env\n```\n\n### Argilla integration\n\nArgilla is an open source tool for data curation. It allows you to annotate and review datasets, and push curated datasets to the Hugging Face Hub. 
You can easily get started with Argilla by following the [quickstart guide](https:\u002F\u002Fdocs.argilla.io\u002Flatest\u002Fgetting_started\u002Fquickstart\u002F).\n\n![Argilla integration](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fargilla\u002Fsynthetic-data-generator\u002Fresolve\u002Fmain\u002Fassets\u002Fargilla.png)\n\n## Custom synthetic data generation?\n\nEach pipeline is based on distilabel, so you can easily change the LLM or the pipeline steps.\n\nCheck out the [distilabel library](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel) for more information.\n\n## Development\n\nInstall the dependencies:\n\n```bash\n# Create a virtual environment\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\n\n# Install the dependencies\npip install -e . # pdm install\n```\n\nRun the app:\n\n```bash\npython app.py\n```\n\n## 🐳 Docker Setup\n\nThe containerized tool uses Ollama for local LLM inference and Argilla for data curation. Here's the architecture:\n\n![Container Structure](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_synthetic-data-generator_readme_1aa7ef539fd5.png)\n\nQuick setup with all services (App + Ollama + Argilla):\n\n```bash\n# Copy environment template\ncp docker\u002F.env.docker.template .env # Add your HF_TOKEN in .env\n\n# Build all services (this may take a few minutes)\ndocker compose -f docker-compose.yml -f docker\u002Follama\u002Fcompose.yml -f docker\u002Fargilla\u002Fcompose.yml build\n\n# Start all services\ndocker compose -f docker-compose.yml -f docker\u002Follama\u002Fcompose.yml -f docker\u002Fargilla\u002Fcompose.yml up -d\n```\n\n> For more detailed Docker configurations and setups, check [docker\u002FREADME.md](docker\u002FREADME.md)\n","---\ntitle: 合成数据生成器\nshort_description: 使用自然语言构建数据集\nemoji: 🧬\ncolorFrom: yellow\ncolorTo: pink\nsdk: gradio\nsdk_version: 5.8.0\napp_file: app.py\npinned: true\nlicense: apache-2.0\nhf_oauth: true\n#header: mini\nhf_oauth_scopes:\n- read-repos\n- 
write-repos\n- manage-repos\n- inference-api\n---\n> [!IMPORTANT]  \n请查看该项目的后续版本：\nhttps:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faisheets\n> \n\u003Cbr>\n\n> [!IMPORTANT]  \n原作者已转向其他项目。尽管代码可能仍能用于其原始目的，但请注意，原团队不计划开发新功能、修复错误或进行更新。如果您希望成为维护者，请提交一个问题以进行讨论。\n>\n> \n\u003Cbr>\n\n\u003Ch2 align=\"center\">\n  \u003Ca href=\"\">\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fmain\u002Fassets\u002Flogo.svg\" alt=\"合成数据生成器Logo\" width=\"80%\">\u003C\u002Fa>\n\u003C\u002Fh2>\n\u003Ch3 align=\"center\">使用自然语言构建数据集\u003C\u002Fh3>\n\n![合成数据生成器](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fargilla\u002Fsynthetic-data-generator\u002Fresolve\u002Fmain\u002Fassets\u002Fui-full.png)\n\n## 引言\n\n合成数据生成器是一款工具，可帮助您创建高质量的数据集，用于训练和微调语言模型。它利用 distilabel 和大型语言模型的强大功能，根据您的具体需求生成定制化的合成数据。[公告博客](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fsynthetic-data-generator)详细介绍了如何实际使用该工具，您也可以观看[视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nXjVtnGeEss)，直观地了解其操作过程。\n\n支持的任务：\n\n- 文本分类\n- 监督式微调的聊天数据\n- 检索增强生成\n\n该工具简化了自定义数据集的创建流程，使您能够：\n\n- 描述所需应用的特点\n- 迭代样本数据集\n- 生成全规模数据集\n- 将数据集推送到 [Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets?other=datacraft) 和\u002F或 [Argilla](https:\u002F\u002Fdocs.argilla.io\u002F)\n\n通过使用合成数据生成器，您可以快速原型化并创建数据集，从而加速您的 AI 开发进程。\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fargilla_io\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftwitter-black?logo=x\"\u002F>\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Fargilla-io\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flinkedin-blue?logo=linkedin\"\u002F>\n\u003C\u002Fa>\n\u003Ca href=\"http:\u002F\u002Fhf.co\u002Fjoin\u002Fdiscord\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-7289DA?&logo=discord&logoColor=white\"\u002F>\n\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 
安装\n\n您只需运行以下命令即可安装该包：\n\n```bash\npip install synthetic-dataset-generator\n```\n\n### 快速入门\n\n```python\nfrom synthetic_dataset_generator import launch\n\nlaunch()\n```\n\n### 环境变量\n\n- `HF_TOKEN`：您的 [Hugging Face 令牌](https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens\u002Fnew?ownUserPermissions=repo.content.read&ownUserPermissions=repo.write&globalPermissions=inference.serverless.write&tokenType=fineGrained)，用于将数据集推送到 Hugging Face Hub，并从 Hugging Face 推理端点获取免费补全结果。您可以在 [examples](examples\u002F) 文件夹中找到一些配置示例。\n\n您还可以设置以下环境变量来定制生成过程：\n\n- `MAX_NUM_TOKENS`：生成的最大标记数，默认为 `2048`。\n- `MAX_NUM_ROWS`：生成的最大行数，默认为 `1000`。\n- `DEFAULT_BATCH_SIZE`：生成数据集时使用的默认批次大小，默认为 `5`。\n\n此外，您还可以选择不同的 API 提供商和模型：\n\n- `MODEL`：用于生成数据集的模型，例如 `meta-llama\u002FMeta-Llama-3.1-8B-Instruct`、`gpt-4o`、`llama3.1`。\n- `API_KEY`：用于生成 API 的 API 密钥，例如 `hf_...`、`sk-...`。如果未提供，则默认使用 `HF_TOKEN` 环境变量。\n- `OPENAI_BASE_URL`：任何兼容 OpenAI 的 API 的基础 URL，例如 `https:\u002F\u002Fapi.openai.com\u002Fv1\u002F`。\n- `OLLAMA_BASE_URL`：任何兼容 Ollama 的 API 的基础 URL，例如 `http:\u002F\u002F127.0.0.1:11434\u002F`。\n- `HUGGINGFACE_BASE_URL`：任何兼容 Hugging Face 的 API 的基础 URL，例如 TGI 服务器或专用推理端点。如果您想使用无服务器推理，只需设置 `MODEL` 即可。\n- `VLLM_BASE_URL`：任何兼容 VLLM 的 API 的基础 URL，例如 `http:\u002F\u002Flocalhost:8000\u002F`。\n\n若要专门使用特定模型进行补全生成，可在上述环境变量后添加 `_COMPLETION` 后缀。例如，可以使用 `MODEL_COMPLETION` 和 `OPENAI_BASE_URL_COMPLETION`。\n\nOpenAI 端点不支持 SFT 和聊天数据的生成。此外，您需要根据各模型系列的提示模板，通过正确的 `TOKENIZER_ID` 和 `MAGPIE_PRE_QUERY_TEMPLATE` 环境变量进行配置。\n\n- `TOKENIZER_ID`：用于 magpie 流程的分词器 ID，例如 `meta-llama\u002FMeta-Llama-3.1-8B-Instruct`。\n- `MAGPIE_PRE_QUERY_TEMPLATE`：强制设置 Magpie 的预查询模板，此功能仅在 Hugging Face 推理端点上支持。`llama3` 和 `qwen2` 已开箱即用，分别使用 `\"\u003C|begin_of_text|>\u003C|start_header_id|>user\u003C|end_header_id|>\\n\\n\"` 和 `\"\u003C|im_start|>user\\n\"`。对于其他模型，您可以传递自定义的预查询模板字符串。\n\n另外，您还可以通过设置以下环境变量，将数据集推送到 Argilla 进行进一步的整理：\n\n- `ARGILLA_API_KEY`：您的 Argilla API 密钥，用于将数据集推送到 Argilla。\n- `ARGILLA_API_URL`：您的 Argilla API 地址，用于将数据集推送到 
Argilla。\n\n若要将生成的数据集保存到本地目录而非推送到 Hugging Face Hub，可设置以下环境变量：\n\n- `SAVE_LOCAL_DIR`：用于保存生成数据集的本地目录。\n\n您可以使用我们的环境模板作为起点：\n\n```bash\ncp .env.local.template .env\n```\n\n### Argilla 集成\n\nArgilla 是一款开源的数据整理工具。它允许您对数据集进行标注和审查，并将整理好的数据集推送到 Hugging Face Hub。您可以按照 [快速入门指南](https:\u002F\u002Fdocs.argilla.io\u002Flatest\u002Fgetting_started\u002Fquickstart\u002F) 轻松开始使用 Argilla。\n\n![Argilla 集成](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fargilla\u002Fsynthetic-data-generator\u002Fresolve\u002Fmain\u002Fassets\u002Fargilla.png)\n\n## 自定义合成数据生成？\n\n每个管道都基于 distilabel，因此您可以轻松更改 LLM 或管道步骤。\n\n有关更多信息，请参阅 [distilabel 库](https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fdistilabel)。\n\n## 开发\n\n安装依赖项：\n\n```bash\n# 创建虚拟环境\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\n\n# 安装依赖项\npip install -e . # pdm install\n```\n\n运行应用程序：\n\n```bash\npython app.py\n```\n\n## 🐳 Docker 部署\n\n该容器化工具使用 Ollama 进行本地大模型推理，并使用 Argilla 进行数据整理。其架构如下：\n\n![容器结构](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_synthetic-data-generator_readme_1aa7ef539fd5.png)\n\n快速部署所有服务（应用 + Ollama + Argilla）：\n\n```bash\n# 复制环境模板\ncp docker\u002F.env.docker.template .env # 在 .env 文件中添加你的 HF_TOKEN\n\n# 构建所有服务（可能需要几分钟）\ndocker compose -f docker-compose.yml -f docker\u002Follama\u002Fcompose.yml -f docker\u002Fargilla\u002Fcompose.yml build\n\n# 启动所有服务\ndocker compose -f docker-compose.yml -f docker\u002Follama\u002Fcompose.yml -f docker\u002Fargilla\u002Fcompose.yml up -d\n```\n\n> 如需更详细的 Docker 配置和部署说明，请参阅 [docker\u002FREADME.md](docker\u002FREADME.md)","# Synthetic Data Generator 快速上手指南\n\n## 环境准备\n\n### 系统要求\n- Python 3.8 或更高版本\n- 支持 Linux、macOS 或 Windows 操作系统\n\n### 前置依赖\n- pip（Python 包管理工具）\n- Hugging Face 账户（用于访问模型和推送数据集）\n\n> 推荐使用国内镜像源加速安装，例如：\n```bash\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple synthetic-dataset-generator\n```\n\n## 安装步骤\n\n使用 pip 安装工具包：\n\n```bash\npip install synthetic-dataset-generator\n```\n\n## 基本使用\n\n运行工具的最简单方式如下：\n\n```python\nfrom 
synthetic_dataset_generator import launch\n\nlaunch()\n```\n\n该命令将启动一个基于 Gradio 的 Web 界面，您可以通过浏览器进行交互式数据集生成。\n\n### 环境变量配置（可选）\n\n根据需要设置以下环境变量以自定义生成过程：\n\n```bash\nexport HF_TOKEN=your_huggingface_token\nexport MAX_NUM_TOKENS=2048\nexport MAX_NUM_ROWS=1000\nexport DEFAULT_BATCH_SIZE=5\n```\n\n如需使用特定模型或 API，请设置相应变量，例如：\n\n```bash\nexport MODEL=meta-llama\u002FMeta-Llama-3.1-8B-Instruct\nexport API_KEY=your_api_key\n```\n\n如需将生成的数据集保存到本地目录，可以设置：\n\n```bash\nexport SAVE_LOCAL_DIR=.\u002Fgenerated_datasets\n```\n\n如需集成 Argilla 数据标注工具，请设置：\n\n```bash\nexport ARGILLA_API_KEY=your_argilla_api_key\nexport ARGILLA_API_URL=https:\u002F\u002Fapi.argilla.io\n```","某医疗AI公司正在开发一个用于病历分类的自然语言处理模型，需要大量标注好的病历文本数据来训练模型。由于真实病历数据涉及隐私，无法直接获取，团队需要自行生成符合临床场景的合成数据。\n\n### 没有 synthetic-data-generator 时  \n- 数据生成过程繁琐，需要手动编写大量病历样本，耗时且容易出错  \n- 缺乏统一的数据结构和标签规范，导致后续模型训练效率低下  \n- 难以快速迭代数据集，每次调整需求都需要重新编写大量内容  \n- 生成的数据质量参差不齐，难以满足模型训练的高标准要求  \n\n### 使用 synthetic-data-generator 后  \n- 通过自然语言描述即可快速生成结构化病历数据，节省大量人工时间  \n- 自动遵循预设的标签体系，确保数据一致性与可训练性  \n- 支持快速调整生成参数，实现数据集的灵活迭代与优化  \n- 生成的数据质量高，更贴近真实场景，提升模型训练效果  \n\nsynthetic-data-generator 通过自然语言驱动的数据生成方式，显著提升了医疗AI团队在数据准备阶段的效率与质量。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargilla-io_synthetic-data-generator_9a68cbf5.png","argilla-io","Argilla","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fargilla-io_5953542f.png","Building the open-source feedback layer for LLMs",null,"contact@argilla.io","argilla_io","https:\u002F\u002Fargilla.io","https:\u002F\u002Fgithub.com\u002Fargilla-io",[85,89,93],{"name":86,"color":87,"percentage":88},"Python","#3572A5",99.1,{"name":90,"color":91,"percentage":92},"Dockerfile","#384d54",0.5,{"name":94,"color":95,"percentage":92},"Shell","#89e051",573,64,"2026-04-03T09:31:35","Apache-2.0","Linux, macOS, Windows","未说明",{"notes":103,"python":104,"dependencies":105},"需要安装 Hugging Face Token，使用 Docker 时需配置环境变量 
HF_TOKEN。建议使用虚拟环境进行开发。","3.8+",[106,107,108,109,110,111,112],"gradio>=5.8.0","distilabel","huggingface_hub","argilla","torch","transformers","accelerate",[15,34],"2026-03-27T02:49:30.150509","2026-04-06T08:52:26.363771",[117,122,127,131,136,141],{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},5461,"如何在本地使用不支持 SYSTEM prompt 的模型（如 Gemma）生成问答数据？","可以设置 `MODEL_COMPLETIONS` 来仅生成问题，然后使用另一个模型（如 Gemma）从文件中回答问题。需要确保配置正确的 Ollama 基础 URL 和模型 ID，例如：`os.environ[\"OLLAMA_BASE_URL\"] = \"http:\u002F\u002F127.0.0.1:11434\u002F\"` 和 `os.environ[\"MODEL\"] = \"gemma2-9b\"`。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fissues\u002F27",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},5462,"如何解决本地 LLM 集成失败的问题？","确保正确配置 `BASE_URL` 和 `MODEL`，并安装所需的依赖项。例如，使用 Ollama 时，需设置 `OLLAMA_BASE_URL` 和 `MODEL` 环境变量，并运行 `ollama serve` 和 `ollama run \u003Cmodel>`。此外，可参考 PR #20 和 #1084 来改进本地部署支持。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fissues\u002F15",{"id":128,"question_zh":129,"answer_zh":130,"source_url":126},5463,"如何避免 Hugging Face API 调用？","可以通过修改 `constants.py` 文件，将 `SFT_AVAILABLE` 设置为 `True` 并将 `MAGPIE_PRE_QUERY_TEMPLATE` 设置为 `None`，以禁用 Hugging Face API 调用。同时，建议使用本地 LLM（如 Ollama 或 llama-cpp）进行数据生成。",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},5464,"如何解决创建数据集时的错误？","确保设置 `ARGILLA_API_URL` 和 `ARGILLA_API_KEY`，或者使用本地 LLM 配置。可以尝试通过以下命令安装修复后的版本：`pip install git+https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator.git@feat\u002Fimprove-support-local-deployment`，并运行 Ollama 模型。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fissues\u002F14",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},5465,"如何使用现有数据生成结构化聊天数据？","可以将现有数据（如私有代码仓库或文档）作为输入，但需确保其格式与工具要求的格式一致。如果遇到问题，可能需要检查数据集的格式是否符合预期，例如参考 `extended_train.json` 
文件的结构。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fissues\u002F11",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},5466,"如何处理多标签分类时出现的错误？","该问题可能与数据集格式或模型配置有关。请确保数据格式正确，并检查是否有任何异常或缺失字段。若问题持续，建议提供更详细的错误信息以便进一步排查。","https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fissues\u002F37",[147,152,157,162,167,172,177],{"id":148,"version":149,"summary_zh":150,"released_at":151},114653,"0.1.3","## What's Changed\r\n* feat enable usage without argilla by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F3\r\n* add support for custom BASE_URL, MODEL, API_KEY by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F4\r\n* add logic to push pipeline code to hub by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F5\r\n* add temperature to response generation by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F7\r\n* update logic for creating samples within the textcat pipeline by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F6\r\n* Feat\u002Fadd multi label by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F8\r\n\r\n## New Contributors\r\n* @davidberenstein1957 made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F3\r\n\r\n## What's Changed\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fcommits\u002F0.1.3\r\n","2024-12-04T10:44:16",{"id":153,"version":154,"summary_zh":155,"released_at":156},114647,"0.2.0","## What's Changed\r\n* Add Docker Support by @mcdaqc in 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F26\r\n* Bump version to 0.2.0 by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F40\r\n\r\n## New Contributors\r\n* @mcdaqc made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F26\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fcompare\u002F0.1.8...0.2.0","2025-02-21T07:41:54",{"id":158,"version":159,"summary_zh":160,"released_at":161},114648,"0.1.8","## What's Changed\r\n* Bug\u002Ffix bugs by @sdiazlor in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F28\r\n* add deepseek example by @sdiazlor in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F30\r\n* Delete .DS_Store by @sebaxakerhtc in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F33\r\n* Fix gradio UI by @sebaxakerhtc in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F34\r\n* feat: add seed data for chat data by @sdiazlor in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F32\r\n* feat: different model completion by @sdiazlor in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F31\r\n* Wrong tokenizer by @sebaxakerhtc in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F36\r\n* Added local saving to CSV and JSON by @sebaxakerhtc in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F38\r\n\r\n## New Contributors\r\n* @sebaxakerhtc made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F33\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fcompare\u002F0.1.7...0.1.8","2025-02-10T12:43:37",{"id":163,"version":164,"summary_zh":165,"released_at":166},114649,"0.1.7","**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fcompare\u002F0.1.6...0.1.7","2025-01-20T15:59:28",{"id":168,"version":169,"summary_zh":170,"released_at":171},114650,"0.1.6","## What's Changed\r\n* Enable launch in single command by @Riezebos in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F17\r\n* add example on fine-tuning ModernBERT by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F21\r\n* Add inline script metadata for users to launch example apps easily by @ftnext in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F23\r\n* update deployment with API providers by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F20\r\n* Add RAG generation by @sdiazlor in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F19\r\n\r\n## New Contributors\r\n* @Riezebos made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F17\r\n* @ftnext made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F23\r\n* @sdiazlor made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F19\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fcompare\u002F0.1.5...0.1.6","2025-01-17T10:01:49",{"id":173,"version":174,"summary_zh":175,"released_at":176},114651,"0.1.5","## What's Changed\r\n* docs: update README.md by @eltociear in 
https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F12\r\n* fix define constants on launch by @davidberenstein1957 in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F16\r\n\r\n## New Contributors\r\n* @eltociear made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F12\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fcompare\u002F0.1.4...0.1.5","2024-12-17T16:46:47",{"id":178,"version":179,"summary_zh":180,"released_at":181},114652,"0.1.4","## What's Changed\r\n* Update utils.py by @CharlesCNorton in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F13\r\n* QoL - remove `src` import requirement from install\r\n* QoL - fix behavior pre-query template\r\n* QoL - improve validation argilla imports\r\n* QoL - improve validation HF dependency\r\n* QoL - add option to enforce custom magpie template\r\n* QoL - fix empty values are passed during textcat generation\r\n\r\n## New Contributors\r\n* @CharlesCNorton made their first contribution in https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fpull\u002F13\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargilla-io\u002Fsynthetic-data-generator\u002Fcompare\u002F0.1.3...0.1.4","2024-12-17T07:11:25"]