[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-simplescaling--s1":3,"tool-simplescaling--s1":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":78,"owner_url":79,"languages":80,"stars":101,"forks":102,"last_commit_at":103,"license":104,"difficulty_score":105,"env_os":106,"env_gpu":107,"env_ram":108,"env_deps":109,"category_tags":117,"github_topics":78,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":118,"updated_at":119,"faqs":120,"releases":149},3927,"simplescaling\u002Fs1","s1","s1: Simple test-time scaling","s1 是一个专注于“测试时扩展”（Test-time Scaling）的开源 AI 项目，旨在通过极简的方法显著提升大模型的推理能力。它的核心目标是让模型在回答问题时进行更深入的思考，从而在数学、逻辑等复杂任务上达到媲美顶尖闭源模型（如 o1-preview）的性能，而无需昂贵的训练成本。\n\n传统大模型往往受限于固定的生成长度，难以处理需要多步推导的难题。s1 巧妙地解决了这一痛点，它提出了一套最小化的训练方案：仅使用 1000 个高质量示例，配合独特的“预算强制”（Budget Forcing）技术，就能教会模型自主规划思考过程。所谓“预算强制”，是指在推理阶段动态控制模型的思考令牌数量，甚至允许忽略早期的停止信号，强制模型进行更充分的演算后再输出最终答案。\n\n该项目非常适合 AI 研究人员、开发者以及对大模型推理机制感兴趣的技术爱好者。对于研究者，s1 提供了完整的论文、数据集和训练脚本，是探索高效推理范式的绝佳案例；对于开发者，项目集成了 vLLM 等主流推理框架，支持快速部署和微调，方便将强大的推理能力集成到实际应用中。凭借极低的资源门槛和出色的效果，s1 为构建高智商 AI ","s1 是一个专注于“测试时扩展”（Test-time Scaling）的开源 AI 项目，旨在通过极简的方法显著提升大模型的推理能力。它的核心目标是让模型在回答问题时进行更深入的思考，从而在数学、逻辑等复杂任务上达到媲美顶尖闭源模型（如 o1-preview）的性能，而无需昂贵的训练成本。\n\n传统大模型往往受限于固定的生成长度，难以处理需要多步推导的难题。s1 巧妙地解决了这一痛点，它提出了一套最小化的训练方案：仅使用 1000 
个高质量示例，配合独特的“预算强制”（Budget Forcing）技术，就能教会模型自主规划思考过程。所谓“预算强制”，是指在推理阶段动态控制模型的思考令牌数量，甚至允许忽略早期的停止信号，强制模型进行更充分的演算后再输出最终答案。\n\n该项目非常适合 AI 研究人员、开发者以及对大模型推理机制感兴趣的技术爱好者。对于研究者，s1 提供了完整的论文、数据集和训练脚本，是探索高效推理范式的绝佳案例；对于开发者，项目集成了 vLLM 等主流推理框架，支持快速部署和微调，方便将强大的推理能力集成到实际应用中。凭借极低的资源门槛和出色的效果，s1 为构建高智商 AI 助手提供了一条简单而高效的新路径。","\u003Cdiv align=\"center\">\n  \u003Ch1>s1: Simple test-time scaling\u003C\u002Fh1>\n  \u003Cp>Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing\n \u003C\u002Fp>\n\u003C\u002Fdiv>\n\u003Cbr>\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsimplescaling_s1_readme_170f11b19eea.png)\n\n****************************************************************\n\n**Updates:**\n\n* 2025-03: Released 2 videos on s1: [TWIML Podcast (Sam Charrington & Niklas Muennighoff)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=kEfUaLBlSHc) & [Microsoft GenAI Talk (Niklas Muennighoff)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=EEkxuqlvCss)\n* 2025-02: We released [s1.1](https:\u002F\u002Fhuggingface.co\u002Fsimplescaling\u002Fs1.1-32B) a better model than s1 by reusing the same s1K questions but with reasoning traces generated by r1 instead of Gemini: [s1K-1.1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1). 
Check [this tweet](https:\u002F\u002Fx.com\u002FMuennighoff\u002Fstatus\u002F1889310803746246694) for details\n* 2025-01: We released [our paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393) announced via [this tweet](https:\u002F\u002Fx.com\u002FMuennighoff\u002Fstatus\u002F1886405528777073134).\n\n****************************************************************\n\nThis repository provides an overview of all resources for the paper [\"s1: Simple test-time scaling\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393).\n\n- [Artifacts](#artifacts)\n- [Structure](#structure)\n- [Inference](#inference)\n  - [vLLM](#vllm)\n  - [vLLM with budget forcing](#vllm-with-budget-forcing)\n  - [transformers](#transformers)\n- [Training](#training)\n- [Evaluation](#evaluation)\n- [Data](#data)\n- [Visuals](#visuals)\n- [Known Issues](#known-issues)\n- [Citation](#citation)\n\n### Artifacts\n\n- **Paper**: https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393\n- **Model**: https:\u002F\u002Fhf.co\u002Fsimplescaling\u002Fs1.1-32B (Old: https:\u002F\u002Fhf.co\u002Fsimplescaling\u002Fs1-32B)\n- **Data**: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1 (Old: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K)\n    - s1-prob: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1-prob\n    - s1-teasers: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1-teasers\n    - Full 59K: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fdata_ablation_full59K\n\n### Structure\n\n- `eval\u002F`: Evaluation scripts\n- `data\u002F`: Synthetic data creation scripts & co\n- `train\u002F`: Training scripts\n\n### Inference\n\n#### vLLM\n\nInstall the `vllm` library and run:\n```python\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\nmodel = LLM(\n    \"simplescaling\u002Fs1.1-32B\",\n    tensor_parallel_size=2,\n)\ntok = 
AutoTokenizer.from_pretrained(\"simplescaling\u002Fs1-32B\")\n\nstop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\n\nsampling_params = SamplingParams(\n    max_tokens=32768,\n    min_tokens=0,\n    stop_token_ids=stop_token_ids,\n)\n\nprompt = \"How many r in raspberry\"\nprompt = \"\u003C|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.\u003C|im_end|>\\n\u003C|im_start|>user\\n\" + prompt + \"\u003C|im_end|>\\n\u003C|im_start|>assistant\\n\"\n\no = model.generate(prompt, sampling_params=sampling_params)\nprint(o[0].outputs[0].text)\n```\n\n#### vLLM with budget forcing\n\n```python\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\n# Decide on a token limit for thinking; As the model's max tokens is 32768, 32000 usually ensures there is enough space for the model to still answer\nMAX_TOKENS_THINKING = 32000\n# Decide how often to ignore end-of-thinking token\nNUM_IGNORE = 1\n\nmodel = LLM(\n    \"simplescaling\u002Fs1-32B\", # s1 originally gets this prompt wrong but with budget forcing it fixes it\n    tensor_parallel_size=2,\n)\ntok = AutoTokenizer.from_pretrained(\n    \"simplescaling\u002Fs1-32B\"\n)\n\nstop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\nsampling_params = SamplingParams(\n    max_tokens=32768,\n    min_tokens=0,\n    stop_token_ids=stop_token_ids,\n    skip_special_tokens=False,\n    temperature=0.0,\n)\n\n# For the exact raspberry sample in the paper see\nprompts = [\n    \"How many r in raspberry\",\n]\n\nfor i, p in enumerate(prompts):\n    prompt = \"\u003C|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\u003C|im_end|>\\n\u003C|im_start|>user\\n\" + p + \"\u003C|im_end|>\\n\u003C|im_start|>assistant\\n\"\n    stop_token_ids = tok(\"\u003C|im_start|>\u003C|im_end|>\")[\"input_ids\"]\n    sampling_params = SamplingParams(\n        max_tokens=MAX_TOKENS_THINKING,\n        min_tokens=0,\n        stop_token_ids=stop_token_ids,\n        skip_special_tokens=False,\n        temperature=0.0,\n    )\n    prompt += \"\u003C|im_start|>think\"\n    o = model.generate(\n        prompt,\n        sampling_params=sampling_params\n    )\n    ignore_str = \"Wait\"\n    max_tokens_thinking_tmp = MAX_TOKENS_THINKING\n    for i in range(NUM_IGNORE): # Num of times to skip stop token\n        max_tokens_thinking_tmp -= len(o[0].outputs[0].token_ids)\n        if max_tokens_thinking_tmp > 0:\n            prompt += o[0].outputs[0].text + ignore_str\n            sampling_params = SamplingParams(\n                max_tokens=max_tokens_thinking_tmp,\n                min_tokens=1,\n                stop_token_ids=stop_token_ids,\n                skip_special_tokens=False,\n                temperature=0.0,\n            )\n            o = model.generate(\n                prompt,\n                sampling_params=sampling_params\n            )\n    ### Final answer ###\n    prompt += o[0].outputs[0].text # You can also append \"Final Answer:\" here like we do for some evaluations to prevent the model from just continuing to reason in its answer when early exiting\n    stop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\n    sampling_params = SamplingParams(\n        max_tokens=32768,\n        min_tokens=0,\n        stop_token_ids=stop_token_ids,\n        skip_special_tokens=False,\n        temperature=0.0,\n    )\n    o = model.generate(\n        prompt,\n        sampling_params=sampling_params,\n    )\n    print(\"With budget forcing:\") # You will see that after the \"Wait\" in the reasoning trace it fixes its answer\n    print(prompt + 
o[0].outputs[0].text)\n```\n\n#### transformers\n\nInstall the `transformers` & `torch` libraries and run:\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nDEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel_name = \"simplescaling\u002Fs1.1-32B\"\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\nprompt = \"How many r in raspberry\"\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.\"},\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=512\n)\ngenerated_ids = [\n    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n]\n\nresponse = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n```\n\n### Training\n\n\nTo run training, you can find our script at `train\u002Fsft.py` which you can invoke via one of the `train\u002Fsft*sh` scripts which in turn you can launch via `train\u002Flaunch.sh` if you are on a SLURM cluster (requires editing the file for your cluster setup).\n\nTo train s1-32B\u002Fs1.1-32B, we recommend 16 H100 GPUs i.e. 2 nodes with 8 each. 
For s1.1, we set the block size to 20000 to avoid OOM (https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1\u002Fblob\u002F0ad4b3de32507b4aa0d4be28f336276ee99b2315\u002Ftrain\u002Fsft.sh#L17); Check the wandb logs [here](https:\u002F\u002Fwandb.ai\u002Fhashimoto-group\u002Fo1\u002Fruns\u002Fm1ilia77\u002Foverview).\n\nQuick start:\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1.git\ncd s1\npip3 install -r requirements.txt\nbash train\u002Fsft.sh\n```\n*Note: If you encounter an out-of-memory (OOM) issue with 8 GPUs, consider enabling gradient checkpointing by adding the following line to your script: `--gradient_checkpointing=True`.*\n\n### Evaluation\n\nWe cloned [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) at commit `4cec66e4e468d15789473d6d63c3a61a751fa524` and modified it. Setup:\n```bash\ncd eval\u002Flm-evaluation-harness\npip install -e .[math,vllm]\n```\n\nAll commands are in `eval\u002Fcommands.sh`. For AIME24 we always pick the `aime24_nofigures` result, which uses a dataset that only contains the AIME24 figures if they are important for the task.\n\nIf you want to compute statistics (avg thinking tokens etc) for an evaluation run you can use \n`python eval\u002Fcompute_sample_stats.py path_to_samples_file.jsonl`\n\nAll our evaluation result files are at: https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fresults\n\nTo run REBASE: commands are in `eval\u002Frebase\u002Frun.sh`\nNote that for the evaluations in the Discussion section with REBASE we used https:\u002F\u002Fhuggingface.co\u002Fsimplescaling\u002Fstep-conditional-control-old trained on an older version of our dataset https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-step-conditional-control-old and run on an older version of our evaluation using https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMaxwell-Jia\u002FAIME_2024.\n\n### Data\n\nTo recreate s1K follow the steps below. 
In various files you will have to replace the organizations `simplescaling` and `qfq` with an organization that you own. **Note that [s1K-1.1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1) is a better dataset generated with r1 traces instead of Gemini traces.**\n1. Run `data\u002Fcollect_data.py` followed by `data\u002Ffix_gpqa.py` & `data\u002Fadd_aime.py` to collect the questions; make sure to change the hub path in the respective files to one of your own.\n2. Generate traces with Gemini via `python data\u002Fgemini.py`. This step will use https:\u002F\u002Fhf.co\u002Fdatasets\u002Fqfq\u002Ftrain which should be roughly equivalent to the dataset you have produced in 1.\n3. Generate answers with Qwen via `python data\u002Fbulk_inference.py` that can be launched with `data\u002Fbulk_inference.sh`.\n4. Add features by running `python data\u002Ffeaturization.py`.\n5. Run the final filtering by going through `data\u002Ffilter.ipynb`.\n6. If you want to run grading on the final questions to produce e.g. a gemini_grade column as in [this dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1), you can use `data\u002Fgrading.ipynb`.\n\n### Visuals\n\nAll figures and some tables are created via [this colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1GAfwbJs2Y1dgGGsxrQyQg2G7CRH5NgN3?usp=sharing) equivalent to `visuals\u002Fvisuals.ipynb`. Some are subsequently edited via the `visuals\u002Fs1.fig` file, which you can load in Figma. The output figures are in `visuals\u002F` in pdf or png format.\n\n### Known Issues\n\n- vLLM throws `ValueError: Token id XXXXX is out of vocabulary`\n  - This can happen with budget forcing, especially when running with temperature 1, where the model will sometimes do crazy stuff and predict a vocab id that is larger than its max token id but still within its embedding size i.e. 
anything \u003C152064, >151664. When we re-feed the model's previous outputs to it, which is done when setting e.g. max_thinking_tokens in the evaluation, this causes the error because vLLM performs this check even though it would only be an issue for IDs >152064. To fix it you can comment out the vLLM ValueError check (it is the line `if max_input_id > tokenizer.max_token_id:` in `vllm\u002Fengine\u002Fllm_engine.py`)\n\n### Citation\n\n```bibtex\n@misc{muennighoff2025s1simpletesttimescaling,\n      title={s1: Simple test-time scaling}, \n      author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},\n      year={2025},\n      eprint={2501.19393},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393}, \n}\n```\n","\u003Cdiv align=\"center\">\n  \u003Ch1>s1：简单的测试时缩放\u003C\u002Fh1>\n  \u003Cp>仅需1,000个示例和预算强制，即可实现测试时缩放并达到与o1-preview相当的强大推理性能的极简方案\u003C\u002Fp>\n\u003C\u002Fdiv>\n\u003Cbr>\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsimplescaling_s1_readme_170f11b19eea.png)\n\n****************************************************************\n\n**更新：**\n\n* 2025年3月：发布了关于s1的2段视频：[TWIML播客（Sam Charrington & Niklas Muennighoff）](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=kEfUaLBlSHc) 和 [Microsoft GenAI演讲（Niklas Muennighoff）](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=EEkxuqlvCss)\n* 2025年2月：我们发布了[s1.1](https:\u002F\u002Fhuggingface.co\u002Fsimplescaling\u002Fs1.1-32B)，这是一个比s1更好的模型，它复用了相同的s1K问题，但推理轨迹是由r1生成的，而不是Gemini：[s1K-1.1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1)。详情请参阅[这条推文](https:\u002F\u002Fx.com\u002FMuennighoff\u002Fstatus\u002F1889310803746246694)\n* 
2025年1月：我们发布了[我们的论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393)，并通过[这条推文](https:\u002F\u002Fx.com\u002FMuennighoff\u002Fstatus\u002F1886405528777073134)进行了公告。\n\n****************************************************************\n\n本仓库提供了论文[\"s1：简单的测试时缩放\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393)的所有资源概览。\n\n- [成果](#artifacts)\n- [结构](#structure)\n- [推理](#inference)\n  - [vLLM](#vllm)\n  - [带预算强制的vLLM](#vllm-with-budget-forcing)\n  - [transformers](#transformers)\n- [训练](#training)\n- [评估](#evaluation)\n- [数据](#data)\n- [可视化](#visuals)\n- [已知问题](#known-issues)\n- [引用](#citation)\n\n### 成果\n\n- **论文**：https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393\n- **模型**：https:\u002F\u002Fhf.co\u002Fsimplescaling\u002Fs1.1-32B（旧版：https:\u002F\u002Fhf.co\u002Fsimplescaling\u002Fs1-32B）\n- **数据**：https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1（旧版：https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K）\n    - s1-prob：https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1-prob\n    - s1-teasers：https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fs1-teasers\n    - 完整59K：https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fdata_ablation_full59K\n\n### 结构\n\n- `eval\u002F`：评估脚本\n- `data\u002F`：合成数据生成脚本等\n- `train\u002F`：训练脚本\n\n### 推理\n\n#### vLLM\n\n安装`vllm`库并运行：\n```python\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\nmodel = LLM(\n    \"simplescaling\u002Fs1.1-32B\",\n    tensor_parallel_size=2,\n)\ntok = AutoTokenizer.from_pretrained(\"simplescaling\u002Fs1-32B\")\n\nstop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\n\nsampling_params = SamplingParams(\n    max_tokens=32768,\n    min_tokens=0,\n    stop_token_ids=stop_token_ids,\n)\n\nprompt = \"How many r in raspberry\"\nprompt = \"\u003C|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\u003C\u003C|im_end|>>\\n\u003C\u003C|im_start|>>user\\n\" + prompt + \"\u003C\u003C|im_end|>>\\n\u003C\u003C|im_start|>>assistant\\n\"\n\no = model.generate(prompt, sampling_params=sampling_params)\nprint(o[0].outputs[0].text)\n```\n\n#### 带预算强制的vLLM\n\n```python\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\n# 决定思考时的token限制；由于模型的最大token数为32768，通常设置为32000可以确保模型仍有足够的空间来回答\nMAX_TOKENS_THINKING = 32000\n# 决定忽略结束思考标记的次数\nNUM_IGNORE = 1\n\nmodel = LLM(\n    \"simplescaling\u002Fs1-32B\", # s1原本会错误地回答这个问题，但在预算强制下则能纠正\n    tensor_parallel_size=2，\n)\ntok = AutoTokenizer.from_pretrained(\n    \"simplescaling\u002Fs1-32B\"\n)\n\nstop_token_ids = tok(\"\u003C\u003C|im_end|>>\")[\"input_ids\"]\nsampling_params = SamplingParams(\n    max_tokens=32768，\n    min_tokens=0，\n    stop_token_ids=stop_token_ids，\n    skip_special_tokens=False，\n    temperature=0.0，\n)\n\n# 对于论文中提到的覆盆子具体例子，请参见\nprompts = [\n    \"How many r in raspberry\",\n]\n\nfor i, p in enumerate(prompts):\n    prompt = \"\u003C\u003C|im_start|>>system\\nYou are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\u003C\u003C|im_end|>>\\n\u003C\u003C|im_start|>>user\\n\" + p + \"\u003C\u003C|im_end|>>\\n\u003C\u003C|im_start|>>assistant\\n\"\n    stop_token_ids = tok(\"\u003C\u003C|im_start|>>\u003C\u003C|im_end|>>\")[\"input_ids\"]\n    sampling_params = SamplingParams(\n        max_tokens=MAX_TOKENS_THINKING，\n        min_tokens=0，\n        stop_token_ids=stop_token_ids，\n        skip_special_tokens=False，\n        temperature=0.0，\n    )\n    prompt += \"\u003C\u003C|im_start|>>think\"\n    o = model.generate(\n        prompt，\n        sampling_params=sampling_params\n    )\n    ignore_str = \"Wait\"\n    max_tokens_thinking_tmp = MAX_TOKENS_THINKING\n    for i in range(NUM_IGNORE): # 忽略结束标记的次数\n        max_tokens_thinking_tmp -= len(o[0].outputs[0].token_ids)\n        if max_tokens_thinking_tmp > 0:\n            prompt += o[0].outputs[0].text + ignore_str\n            sampling_params = SamplingParams(\n                max_tokens=max_tokens_thinking_tmp，\n                min_tokens=1，\n                stop_token_ids=stop_token_ids，\n                skip_special_tokens=False，\n                temperature=0.0，\n            )\n            o = model.generate(\n                prompt，\n                sampling_params=sampling_params\n            )\n    ### 最终答案 ###\n    prompt += o[0].outputs[0].text # 你也可以在这里加上“Final Answer:”，就像我们在某些评估中做的那样，以防止模型在提前退出时继续在其答案中进行推理\n    stop_token_ids = tok(\"\u003C\u003C|im_end|>>\")[\"input_ids\"]\n    sampling_params = SamplingParams(\n        max_tokens=32768，\n        min_tokens=0，\n        stop_token_ids=stop_token_ids，\n        skip_special_tokens=False，\n        temperature=0.0，\n    )\n    o = model.generate(\n        prompt，\n        sampling_params=sampling_params，\n    )\n    print(\"With budget forcing:\") # 你会看到，在推理轨迹中的“Wait”之后，模型会修正其答案\n    print(prompt + o[0].outputs[0].text)\n```\n\n#### transformers\n\n安装`transformers`和`torch`库并运行：\n\n```python\nfrom transformers import AutoModelForCausalLM, 
AutoTokenizer\nimport torch\n\nDEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel_name = \"simplescaling\u002Fs1.1-32B\"\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\nprompt = \"How many r in raspberry\"\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.\"},\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=512\n)\ngenerated_ids = [\n    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n]\n\nresponse = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n```\n\n### 训练\n\n\n要运行训练，您可以在 `train\u002Fsft.py` 中找到我们的脚本，可以通过 `train\u002Fsft*sh` 脚本之一调用它。如果您在 SLURM 集群上，还可以通过 `train\u002Flaunch.sh` 启动这些脚本（需要根据您的集群设置编辑该文件）。\n\n对于 s1-32B 和 s1.1-32B 的训练，我们建议使用 16 张 H100 GPU，即两台节点，每台配备 8 张卡。对于 s1.1，我们将块大小设置为 20000，以避免内存溢出（OOM）问题（https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1\u002Fblob\u002F0ad4b3de32507b4aa0d4be28f336276ee99b2315\u002Ftrain\u002Fsft.sh#L17）；请在此处查看 wandb 日志：[链接](https:\u002F\u002Fwandb.ai\u002Fhashimoto-group\u002Fo1\u002Fruns\u002Fm1ilia77\u002Foverview)。\n\n快速入门：\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1.git\ncd s1\npip3 install -r requirements.txt\nbash train\u002Fsft.sh\n```\n*注意：如果您在使用 8 张 GPU 时遇到内存溢出（OOM）问题，可以考虑启用梯度检查点功能，在脚本中添加以下行：`--gradient_checkpointing=True`。*\n\n### 评估\n\n我们克隆了 [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness)，版本为 
`4cec66e4e468d15789473d6d63c3a61a751fa524`，并对其进行了修改。设置步骤如下：\n```bash\ncd eval\u002Flm-evaluation-harness\npip install -e .[math,vllm]\n```\n\n所有命令都位于 `eval\u002Fcommands.sh` 中。对于 AIME24，我们始终选择 `aime24_nofigures` 结果，该结果使用的数据集仅包含对任务重要的 AIME24 图形。\n\n如果您想计算某次评估运行的统计信息（如平均思考 token 数等），可以使用以下命令：\n`python eval\u002Fcompute_sample_stats.py path_to_samples_file.jsonl`\n\n我们所有的评估结果文件都位于：https:\u002F\u002Fhf.co\u002Fdatasets\u002Fsimplescaling\u002Fresults\n\n要运行 REBASE：相关命令位于 `eval\u002Frebase\u002Frun.sh` 中。\n\n请注意，在讨论部分使用 REBASE 进行的评估中，我们使用的是 https:\u002F\u002Fhuggingface.co\u002Fsimplescaling\u002Fstep-conditional-control-old 模型，该模型是在旧版数据集 https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-step-conditional-control-old 上训练的，并且是在旧版评估流程 https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMaxwell-Jia\u002FAIME_2024 下运行的。\n\n### 数据\n\n\n要重现 s1K 数据集，请按照以下步骤操作。在多个文件中，您需要将组织名称 `simplescaling` 和 `qfq` 替换为您自己的组织名称。**请注意，[s1K-1.1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1) 是一个更好的数据集，它是基于 r1 跟踪日志生成的，而非 Gemini 跟踪日志。**\n\n1. 先运行 `data\u002Fcollect_data.py`，然后依次运行 `data\u002Ffix_gpqa.py` 和 `data\u002Fadd_aime.py` 来收集题目；请务必在相应文件中将 hub 路径替换为您自己的路径。\n2. 使用 `python data\u002Fgemini.py` 通过 Gemini 生成跟踪日志。此步骤将使用 https:\u002F\u002Fhf.co\u002Fdatasets\u002Fqfq\u002Ftrain 数据集，该数据集应大致与您在第一步中生成的数据集相当。\n3. 使用 Qwen 通过 `python data\u002Fbulk_inference.py` 生成答案，可通过 `data\u002Fbulk_inference.sh` 启动批量推理。\n4. 运行 `python data\u002Ffeaturization.py` 添加特征。\n5. 最后通过 `data\u002Ffilter.ipynb` 进行最终筛选。\n6. 
如果您希望对最终题目进行评分，以生成例如 [此数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fsimplescaling\u002Fs1K-1.1) 中的 `gemini_grade` 列，可以使用 `data\u002Fgrading.ipynb`。\n\n### 可视化\n\n\n所有图表和部分表格均通过 [此 Colab 笔记本](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1GAfwbJs2Y1dgGGsxrQyQg2G7CRH5NgN3?usp=sharing) 生成，该笔记本相当于 `visuals\u002Fvisuals.ipynb`。部分图表随后通过 `visuals\u002Fs1.fig` 文件进行编辑，您可以将其导入 Figma。最终生成的图表以 PDF 或 PNG 格式保存在 `visuals\u002F` 目录中。\n\n### 已知问题\n\n\n- vLLM 抛出 `ValueError: Token id XXXXX is out of vocabulary` 错误。\n  - 这种情况可能发生在预算强制优化时，尤其是在使用温度 1 运行时，模型有时会做出一些异常行为，预测超出其最大 token ID 但仍在嵌入空间范围内的词汇 ID，即小于 152064、大于 151664 的 ID。当我们重新输入模型之前的输出作为输入时（例如在评估中设置 `max_thinking_tokens` 时），就会触发此错误，因为 vLLM 会执行这种检查，尽管只有超过 152064 的 ID 才会导致实际问题。要解决这个问题，您可以注释掉 vLLM 中的 ValueError 检查代码（位于 `vllm\u002Fengine\u002Fllm_engine.py` 中的 `if max_input_id > tokenizer.max_token_id:` 这一行）。\n\n### 引用\n\n\n```bibtex\n@misc{muennighoff2025s1simpletesttimescaling,\n      title={s1: Simple test-time scaling}, \n      author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},\n      year={2025},\n      eprint={2501.19393},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393}, \n}\n```","# s1 快速上手指南\n\ns1 是一个专注于“测试时扩展”（Test-time Scaling）的开源项目，仅需 1,000 个示例即可实现强大的推理性能，效果媲美 o1-preview。本指南将帮助您快速部署并使用 s1.1-32B 模型。\n\n## 环境准备\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **硬件要求**:\n    *   **推理**: 建议显存充足以运行 32B 模型（如使用 vLLM 多卡并行）。\n    *   **训练**: 官方推荐 16 张 H100 GPU (2 节点，每节点 8 卡)。若显存受限，需开启梯度检查点。\n*   **软件依赖**:\n    *   Python 3.8+\n    *   CUDA 兼容的 PyTorch\n    *   Git\n\n## 安装步骤\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1.git\n    cd s1\n    ```\n\n2.  
**安装依赖**\n    推荐使用国内镜像源加速安装（如清华源或阿里源）：\n    ```bash\n    pip3 install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n3.  **安装推理引擎 (可选但推荐)**\n    为了获得最佳推理性能，建议安装 `vllm`：\n    ```bash\n    pip install vllm -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n    若使用原生 `transformers`，请确保已安装：\n    ```bash\n    pip install transformers torch -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n## 基本使用\n\n以下提供两种最常用的推理方式：基于 **vLLM** 的高性能推理和基于 **Transformers** 的原生推理。\n\n### 方式一：使用 vLLM 推理 (推荐)\n\n此方式支持高并发和更快的生成速度。\n\n```python\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\n# 加载模型 (自动从 HuggingFace 下载 simplescaling\u002Fs1.1-32B)\n# tensor_parallel_size 根据显卡数量调整，单卡设为 1\nmodel = LLM(\n    \"simplescaling\u002Fs1.1-32B\",\n    tensor_parallel_size=2,\n)\ntok = AutoTokenizer.from_pretrained(\"simplescaling\u002Fs1-32B\")\n\n# 设置停止符\nstop_token_ids = tok(\"\u003C|im_end|>\")[\"input_ids\"]\n\nsampling_params = SamplingParams(\n    max_tokens=32768,\n    min_tokens=0,\n    stop_token_ids=stop_token_ids,\n)\n\n# 构建提示词 (遵循 Qwen 对话格式)\nprompt_text = \"How many r in raspberry\"\nprompt = \"\u003C|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\u003C|im_end|>\\n\u003C|im_start|>user\\n\" + prompt_text + \"\u003C|im_end|>\\n\u003C|im_start|>assistant\\n\"\n\n# 生成回答\no = model.generate(prompt, sampling_params=sampling_params)\nprint(o[0].outputs[0].text)\n```\n\n### 方式二：使用 Transformers 推理\n\n适用于单机单卡或调试场景。\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nDEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel_name = \"simplescaling\u002Fs1.1-32B\"\n\n# 加载模型和分词器\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# 构建对话消息\nprompt = \"How many r in raspberry\"\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.\"},\n    {\"role\": \"user\", \"content\": prompt}\n]\n\n# 应用聊天模板\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# 生成回答\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=512\n)\ngenerated_ids = [\n    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n]\n\nresponse = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\nprint(response)\n```\n\n### 进阶：预算强制 (Budget Forcing)\n\ns1 的核心特性之一是“预算强制”，即通过控制思考 token 的数量来引导模型修正错误答案。若需体验此功能，请参考仓库中的 `vLLM with budget forcing` 代码段，其核心逻辑是在生成过程中多次忽略停止符并追加 \"Wait\" 指令，迫使模型进行更长时间的推理。","某教育科技公司的算法团队正在开发一款面向中学生的 AI 数学辅导助手，需要模型不仅能给出答案，更要展示严谨的逐步推导过程以帮助学生理解。\n\n### 没有 s1 时\n- **推理深度不足**：面对复杂的几何证明或奥数题，普通模型往往直接跳跃步骤给出结论，甚至因缺乏深思熟虑而产生“幻觉”错误。\n- **训练成本高昂**：为了提升逻辑能力，团队通常需要收集数十万条高质量思维链数据进行微调，数据清洗与标注耗时数月且费用昂贵。\n- **响应不可控**：模型有时过早停止思考输出不完整方案，有时又冗长啰嗦，难以在有限的算力预算内平衡思考时间与回答质量。\n- **效果天花板明显**：即使增加参数量，模型在处理需要多步回溯的高难度问题时，准确率仍远低于人类专家水平，无法满足教学需求。\n\n### 使用 
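s1\n- **Stronger test-time scaling**: Through its budget forcing mechanism, s1 pushes the model to make full use of a preset thinking-token budget before it produces an answer, which improves derivation accuracy on complex problems.\n- **Very low data requirements**: The team only needs to fine-tune on the roughly 1,000 curated examples that s1 provides to give the model strong reasoning ability comparable to o1-preview, cutting data preparation from months to days.\n- **Controllable thinking**: Developers can set an exact maximum number of thinking tokens (for example, 32,000), so the model can still complete deep reasoning and output a full solution in resource-constrained environments instead of being cut off midway.\n- **Performance above its size class**: Hard problems that previously called for models with hundreds of billions of parameters can now be handled efficiently by the 32B s1 model, which lowers the hardware requirements and operating cost of inference deployment.\n\nWith a minimal training recipe and a novel test-time scaling strategy, s1 gives small and mid-sized models top-tier deep reasoning ability and lowers the barrier to deploying high-performance AI assistants.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsimplescaling_s1_170f11b1.png","simplescaling","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fsimplescaling_4f7839fb.png",null,"https:\u002F\u002Fgithub.com\u002Fsimplescaling",[81,85,89,93,97],{"name":82,"color":83,"percentage":84},"Python","#3572A5",75.8,{"name":86,"color":87,"percentage":88},"Jupyter Notebook","#DA5B0B",23,{"name":90,"color":91,"percentage":92},"Shell","#89e051",1,{"name":94,"color":95,"percentage":96},"C++","#f34b7d",0.2,{"name":98,"color":99,"percentage":100},"Dockerfile","#384d54",0,6644,764,"2026-04-04T09:31:24","Apache-2.0",4,"Linux","Training: 16 NVIDIA H100 GPUs recommended (2 nodes x 8); inference: requires a CUDA-capable NVIDIA GPU with enough VRAM to load the 32B model (80GB+ recommended, or multiple GPUs with tensor_parallel_size); CUDA version not specified","Not specified (depends on how the model is loaded; a 32B model typically needs a large amount of system memory or VRAM)",{"notes":110,"python":111,"dependencies":112},"1. The recommended setup for training s1-32B\u002Fs1.1-32B is 16 H100 GPUs; if you run out of memory (OOM), set the block size to 20000 or enable gradient checkpointing (--gradient_checkpointing=True). 2. For 32B inference, vLLM with tensor_parallel_size configured is recommended (the example uses 2). 3. Known issue: with budget forcing and temperature=1, vLLM may raise 'Token id out of vocabulary'; the workaround is to edit the vLLM source and comment out the relevant check. 4. The evaluation module is based on a modified lm-evaluation-harness. 5. 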
To reproduce the data, replace the HuggingFace organization name with your own organization.","Not specified",[113,114,115,116],"vllm","transformers","torch","lm-evaluation-harness",[26,13],"2026-03-27T02:49:30.150509","2026-04-06T09:43:33.899014",[121,126,131,136,141,145],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},17950,"Why is my model's accuracy on the AIME 2024 benchmark lower than the result reported in the paper?","The gap usually comes from the dtype used during evaluation. The maintainers evaluate in fp32 for reproducibility, but users report that bf16 can yield higher scores (for example, 60.0% instead of 56.7%). Also make sure you use the official evaluation command and parameters exactly, in particular `max_gen_toks=32768` and `dtype=float32`. If the gap is still large, check that you are using the correct system prompt and chat template (apply_chat_template).","https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1\u002Fissues\u002F41",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},17951,"What is the exact command for reproducing the Qwen2.5-32B-Instruct baseline evaluation?","Use the official `lm_eval` command below, making sure to set `dtype=float32`, `tensor_parallel_size`, and the generation argument `max_gen_toks=32768`:\n\nOPENAI_API_KEY=YOUR_OPENAI_KEY PROCESSOR=gpt-4o-mini lm_eval --model vllm --model_args pretrained=Qwen\u002FQwen2.5-32B-Instruct,tokenizer=Qwen\u002FQwen2.5-32B-Instruct,dtype=float32,tensor_parallel_size=8 --tasks aime24_figures,aime24_nofigures,openai_math,gpqa_diamond_openai --batch_size auto --apply_chat_template --output_path qwen --log_samples --gen_kwargs \"max_gen_toks=32768\"","https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1\u002Fissues\u002F58",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},17952,"What should I do if evaluation fails with \"please provide at least one prompt\" or an empty-prompt-list error?","This usually happens when a user runs out of GPU memory (OOM) and shrinks `max_len` to avoid it, which truncates the context down to nothing. The fix is not to reduce the sequence length too aggressively, because doing so harms model behavior; if possible, increase `max_gen_toks` or use GPUs with more memory (such as the officially recommended 8xH100). If you must use GPUs with less memory, adjust the truncation logic carefully so that it never produces an empty prompt list.","https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1\u002Fissues\u002F35",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},17953,"How do I fix the error \"min_tokens_thinking only supports until_thinking tokens that are 1 token long\" when reproducing test-time scaling for s1.1?","The error occurs because the default `until_thinking` argument is set to `\u003C|im_start|>`, which the tokenizer encodes as multiple tokens, whereas the code asserts that it must be a single 
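token. The fix is to explicitly pass an `until_thinking` value that is a single token when running the command, or to modify the code to support multi-token stop strings, so that the default value does not trigger the assertion.","https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1\u002Fissues\u002F113",{"id":142,"question_zh":143,"answer_zh":144,"source_url":125},17954,"What causes randomness or nondeterminism during evaluation?","Small differences in evaluation results may stem from nondeterminism in vLLM. The maintainers note in the appendix that, although they use fp32 where possible to improve reproducibility, the underlying inference engine (such as vLLM) can still behave nondeterministically. If your results fluctuate, check whether you are mixing precisions (such as bf16 and fp32), and pay attention to the exact version and behavior of the inference backend.",{"id":146,"question_zh":147,"answer_zh":148,"source_url":125},17955,"Does the system prompt (versus the user prompt) have a large effect on performance?","Yes, prompt format has a significant effect on performance. The official evaluation uses a specific chat template (applied automatically via the `--apply_chat_template` flag), in which the system prompt defaults to \"You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\" If you customize the system prompt (for example, changing it to a math-specific assistant) and do not apply the chat template correctly, performance on benchmarks such as AIME can drop sharply (for example, from the expected 50%+ to 30%). Follow the official message format strictly.",[]]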