[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-carlini--yet-another-applied-llm-benchmark":3,"tool-carlini--yet-another-applied-llm-benchmark":65},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159267,2,"2026-04-17T11:29:14",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":10,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,51,52,53,14,54,15,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,51,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":80,"owner_website":82,"owner_url":83,"languages":84,"stars":101,"forks":102,"last_commit_at":103,"license":104,"difficulty_score":23,"env_os":105,"env_gpu":106,"env_ram":106,"env_deps":107,"category_tags":115,"github_topics":80,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":116,"updated_at":117,"faqs":118,"releases":148},8518,"carlini\u002Fyet-another-applied-llm-benchmark","yet-another-applied-llm-benchmark","A benchmark to evaluate language models on questions I've previously asked them to solve.","yet-another-applied-llm-benchmark 是一个专注于评估大语言模型在实际应用场景中表现的个人化测试基准。它并非严谨的学术评测，而是收录了作者过去一年中真实向模型提出过的近 100 个具体问题，涵盖代码优化、字节码反编译、混淆脚本解释、数据编码识别及 SQL 生成等实用任务，旨在回答“模型能否解决我真正关心的工作难题”。\n\n该工具特别适合开发者和技术人员使用，帮助他们直观了解不同模型在处理日常编程与工程任务时的真实能力，而非仅仅关注理论分数。其核心亮点在于构建了一套简洁的数据流领域特定语言（DSL），允许用户通过类似\"LLM 生成代码 >> 自动运行 >> 结果验证”的链式语法，轻松自定义并自动化执行复杂的评估流程。这种设计不仅支持在沙箱环境中安全运行生成的代码，还能结合其他模型进行辅助评判，极大地扩展了测试的深度与多样性。虽然作者未对提示词进行刻意优化以模拟最自然的交互状态，但这反而为观察模型在“无修饰”指令下的原生表现提供了独特视角。","# Yet Another Applied LLM Benchmark\n\nThis is a benchmark I made, for me, to test how well language models perform\non tasks I care about. I know I care about them because each test is directly\nderived from something I've asked a LLM to perform for me in the last year.\n\nFor example, there are tests in this benchmark that evaluate the ability of a model to:\n- convert a python function to an equivalent-but-faster c function;\n- decompile python bytecode to functional source code;\n- explain the functionality of minified javascript;\n- identify the encoding format (in this case, uuencoded) of some data;\n- write a parser from a BNF-like grammar;\n- convert some english sentences to SQL queries; or,\n- write some bash oneliners\n\nThere are two defining features of this benchmark that make it interesting:\n\n1. I've implemented a simple dataflow domain specific language to make it easy for\nme (or anyone else!) to add new tests that realistically evaluate model capabilities.\n\n2. 
Directly as a result of this, I've written nearly 100 tests for different\nsituations I've actually encountered when working with LLMs as assistants.\n\nFor example, here's a complete test to evaluate if a model can write a \"hello world\" program.\n\n```python\n'Write a \"hello world\" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator(\"hello world\")\n```\n\nI first ask the model to write hello world, then run that program (in a docker\ncontainer, seamlessly behind the scenes), and then check if stdout contains \"hello world\".\n(You should read the `>>` operator as \"and then do\". So \"a >> b\" means \"do a, and then do b\".)\n\nMore interestingly, here's a test case that first asks a model to produce code that\ndraws a flag. To evaluate it, I run the code, partially judge it with another\nlanguage model, and then fully evaluate the output by comparing to a reference solution.\n\n```python\n\"Write a C program that draws an american flag to stdout.\" >> LLMRun() >> CRun() >> \\\n    LLMRun(\"What flag is shown in this image?\") >> \\\n        (SubstringEvaluator(\"United States\") | SubstringEvaluator(\"USA\") | SubstringEvaluator(\"America\"))\n```\n\nThis DSL makes it easy for me to evaluate significantly more diverse and\nmore sophisticated behavior than any other evaluation benchmark I'm aware of.\nThis is helpful for determining whether or not models are capable of performing tasks I actually care about.\n\n\n## Results\n\nI've evaluated a few models on this benchmark. Here's how they perform:\n* o1-mini: 62% passed\n* Claude 3.5 Sonnet: 56% passed\n* GPT 4o: 48% passed\n* Gemini 1.5 Pro: 43% passed\n* Claude 3 Opus: 42% passed\n* GPT 4o Mini: 36% passed\n* Mistral Large: 28% passed\n* GPT 3.5: 26% passed\n\nA complete evaluation grid is available [here](https:\u002F\u002Fnicholas.carlini.com\u002Fwriting\u002F2024\u002Fevaluation_examples\u002Findex.html).\n\n\n## What this is not\n\nA serious academic benchmark.\n\nIn more words: this is not meant to try to rigorously evaluate the capabilities of\nmodels on any particular task. It's not meant to be something you can use to decide\nwhich model is more capable, more knowledgeable, more factual, less biased, less\nharmful, more aligned, more helpful, or anything else.\n\nThe questions are not optimally prompt-engineered. It is entirely\npossible---and indeed likely!---that a better phrasing of some of the questions\nwould allow the model to give a better answer.\n\nBut I am lazy.\n\nI do not want to remind the model it is AN EXPERT IN PYTHON\nand tell it that I'll give it a $100,000 tip for giving the right answer\nOR I WILL MURDER A KITTEN but please pause....take a deep breath....and think step\nby step by step before answering.\n(Or whatever the current incantation is people use to get models to work best.)\n\nI just want to type my question and get the right answer.\nSo this benchmark tests for that,\non types of questions I've actually cared about having answered.\n\n### Failing a question doesn't mean much\n\nAs a result of my (often intentional) lack of prompt engineering,\nwhen a model fails a question, you won't learn very much. Maybe my question was\njust poorly worded. Maybe it was ambiguous in some way.\n\nInstead, these tests are designed so that I learn something when the model passes.\nYou don't luck your way into correctly compiling Rust programs\nwithout having some skill at the language. 
But you might luck your\nway into failing by naming the function something I didn't expect and so your\ncorrect code just is never invoked.\n\n\n## What this is\n\nAgain, it's just a collection of questions I've actually asked language models to solve for me\nto help with various programming tasks,\ninterspersed with a few questions I've asked language models just for fun.\nThe questions are, for the most part, unmodified questions as I typed them.\nThis means they may not be the most clearly worded\n(e.g., `In python what __thing__ do I use for ~, kind of like how __add__ is for +`,\nwhere the answer I'm expecting is `__inv__`).\nOther questions are \"unfair\" because they require recent knowledge\n(e.g., \"what is the hidden dimension of llama-2 70b?\").\nBut I care if a model can answer these correctly for me.\n\n\n# Installing\n\nGetting this benchmark up and running is fairly straightforward.\n\n## Python requirements\n\nOn the python side you'll just need to run\n`pip install -r requirements.txt` to install the python dependencies.\n\nIf you want to run it and evaluate a wide range of models you'll also need\n`pip install -r requirements-extra.txt` to install the other models.\n\n\n## Podman (preferred)\n\nI want to run things in a container to keep them basically safe.\nDocker is nicer and has slightly better security controls (and so you can\nuse that if you want below) but on linux you need to be root or give your\nuser almost-root permissions to start new docker jobs. This scares me a bit.\n\nSo I prefer to use podman. Install it however you're supposed to for your\nsystem.\n\n\n## Docker (optional)\n\nAgain this is fairly system dependent so you'll have to go somewhere else to find\nout how to install it for your system.\n\n\n## Why do I need docker\u002Fpodman?\n\nThe test cases in this benchmark are evaluated by directly\nexecuting code that comes out of a language model.\nSome tests ask the model to rename files, move files around, or make other\nstate-changing operations to your machine.\n\nWhile I don't think these models have it out for us and will emit `rm -rf \u002F` out of\nmalice or spite, it's entirely possible (and even likely!) that they'll produce buggy\ncode that will just accidentally trash your computer.\nSo, to safeguard against this, all LLM output is evaluated from within a\ntemporary docker container that gets deleted immediately after the test is complete.\n\n(There's also another reason, though: some of the tests assume a fresh install of\nUbuntu with particular dependencies in various places. These tests might behave\ndifferently on your local machine than they do from within the docker VM.)\n\nIf you like to live dangerously (VERY MUCH NOT RECOMMENDED) then there is\na flag in the code\n`I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE`\nthat you can set to True and then this will just eval() everything that comes\nout of the LLMs on your machine directly.\n\n\n# Setup\n\nOnce you've installed everything,\nthere are a few setup steps before you can run the benchmark.\n\n## Add API keys\n\nYou should add API keys for any model you want to evaluate. The keys are stored\nin the config.json file. You can find a template at [config.json.example](config.json.example)\n\nWhatever model you are testing, you will also need to load API keys for OpenAI as the default\nevaluation model. 
This is because a few of the questions require evaluation by a second language model\nto judge correctness.\nThese secondary evaluations are as simple as possible, but using a high-quality model\nhere is helpful to ensure consistency in the results.\n\nI have had good success using gpt-4-turbo as the evaluation model, but you can configure\nany model that you want as the evaluator. In my experiments, I had almost identical\nresults with the (cheaper) gpt-3.5-turbo, but in a few cases having the more capable\nevaluation model gives more reliable results.\n\n## Set up docker\u002Fpodman container [highly recommended]\n\nTo start you'll need to create the docker container where the tests will run.\nThis will first require that you install docker on your machine.\nOnce you've done that, you can then build the image:\n\n```bash\ndocker build -t llm-benchmark-image . # if you're using docker\npodman build -t llm-benchmark-image . # if you're using podman\n```\n\n## Set up selenium\u002Fchrome\n\nA few test cases require Selenium and Chrome to test if models can generate valid\nhtml\u002Fjavascript programs. Installing the requirements file should install selenium\nfor you, but you'll also need to make sure you install chrome. If you're on\nubuntu then you can just run\n\n```bash\nwget https:\u002F\u002Fdl.google.com\u002Flinux\u002Fdirect\u002Fgoogle-chrome-stable_current_amd64.deb\nsudo dpkg -i google-chrome-stable_current_amd64.deb\n```\n\n\n# Running the benchmark\n\nOnce you've set up your environment, you can run the entire benchmark in just one line:\n\n```bash\npython main.py --model gpt-3.5-turbo --run-tests --generate-report\n```\n\nThis command will run every single test that's configured on one model.\nIt will therefore take some time, and also will cost you a few dollars in\nlanguage model queries. Afterwards you can view the full result html file in the\ndirectory `evaluation_examples`.\n\nIt will also save a cache of this run, so that the next time you can run\na new model and view the two results side-by-side. These are saved by\ndefault in the directory results\u002F[current git commit hash]\u002F[model name].\n\nIf you want to run individual test cases, you can do that too in two ways.\nOne is to just directly run the test\n\n```bash\nPYTHONPATH='.' python tests\u002Fprint_hello.py\n```\n* Explore the `run_a_simple_testcase.ipynb` notebook to quickly run a sample test case on Colab. \n\nThe other, if you want to save the result of this run so you can load it later,\nis to run the main script and specify which test(s) you want to run.\n(Be careful if you do this, though, because it will overwrite any prior run.)\n\n\n```bash\npython main.py --run-tests --test print_hello --model gpt-3.5-turbo-0125\n```\n\nIf you've generated many saved runs previously, you can load them into one grid with\n\n```bash\npython main.py --load-saved --generate-report --model [model1] --model [model2]\n```\n\nAnd finally, if you've run the tests previously at one git commit, and want to just run any tests that have changed since then, you can run\n\n```bash\npython main.py --run-tests --only-changed [commit hash] --model gpt-3.5-turbo-0125\n```\n\n# Modifying the benchmark\n\n## Adding your own LLM\n\nThis framework is fairly modular.\nIf you'd like to test a new LLM, you'll just need to add a new file to llms\u002F[model_name]_model.py\nwhich implements an inference function. 
From there, modify [llm.py](llm.py) to include the\nappropriate model.\n\n## Adding new test cases\n\nAs discussed above, I've built a small dataflow DSL to implement test cases.\nThis should let anyone add new test cases fairly easily.\nLet me repeat the initial test case examples in full:\n\n```python\nTestHelloWorld = 'Print \"hello world\" in python' >> LLMRun() >> \\\n      ExtractCode() >> PythonRun() >> SubstringEvaluator(\"hello world\")\n```\n\nRecall that you should read the `>>` operator as \"and then do\".\nIf you're a bash person, it's like a `|` pipe.\nIf you're a Haskell person, it's like the `$` operator.\n- We start by running the LLM we want to test with the prompt: Print \"hello world\" in python. \n- The model will probably give some code, but might also give an explanation or start by saying \"Sure! I can answer your question.\" So we take whatever output came out of the model and pass it through a function to just extract the first code block.\n- We then actually go and run this python code, whatever it is. To be somewhat safe we do this by spawning a new docker env and running the code there.\n- And finally, we verify that the code was correct, and the output of the python contains a substring of \"hello world\".\n\nThis dataflow pipeline also lets you do more complex verification.\nBelow is the same example as from the top where I evaluate a model's\nability to understand how to draw .bmp images.\n\n\n```python\nTestFlagDraw = \"Write a C program that draws an american flag to stdout as a bmp.\" >> \\\n       ExtractCode() >> CRun() >> LLMVisionRun(\"What flag is shown in this image?\") >> \\\n          (SubstringEvaluator(\"United States\") | \\\n           SubstringEvaluator(\"USA\") | \\\n           SubstringEvaluator(\"America\"))\n```\n\nHere, after asking the model to draw the flag and running the resulting C code,\nI evaluate the model by asking *another* model what flag has been drawn,\nand checking if it says something like the US flag.\nIs this a perfect check? No.\nBut verification is usually easier than generation, and so it's probably a good\nenough approximation of what I want.\n\n\n# Contributing\n\nIf you'd like to add your own tests to this benchmark feel free to open a PR!\nI'd be happy to accept basically anything interesting.\n\n## Adding new tests\n\nThere are only a few requirements for adding a test.\n\n1. Test cases must be mechanistically verifiable. This is very limiting, I know. A whole lot of\nwhat I use LLMs for isn't verifiable in this way. Especially when I'm giving them large\nblocks of code and asking for specific changes that are hard to unit test. But in order for\nthese to be useful your test must be easy to verify.\n\n2. Test cases should complete quickly. I don't want to wait several minutes just for one test to run.\n\n3. Tests should not be evaluated against LLMs during construction. Don't modify the test because\nthe model gave an answer you didn't like. Most LLMs are stochastic enough that there is *some*\nway to elicit most behavior with enough trial and error. I want to see how the model answers\nwith a human-written test, as they are normally asked, before LM refinement.\n\n4. Tests should be designed so that *passing* demonstrates some interesting model capability.\nMaking \"gotcha\" tests that are designed to show models fail in some way is not useful\nin this setup.\n\n5. 
Test cases must not download large amounts of data from the internet.\nSomeone else shouldn't have to pay for each run of this benchmark.\nIf you need to test a library add it to the Dockerfile.\n\n\n## Fixing tests\n\nAre there any tests here that are broken? I tried my best to make them all correct but can't\nguarantee correctness for sure. If so I'd be happy to accept fixes.\n\nBut please note: a broken test means one where the answer is **objectively wrong**. Like a\ntest that says 6 is prime. A test that just expects a specific answer to an ambiguous question\nis not wrong. For example, one test asks\n\"What do I do to fix AutoModel.from_pretrained to make it auto model with lm head\"\nand expects the model to tell me that I should be using the class \"AutoModelForCausalLM\";\neven though the class \"AutoModelWithLMHead\" exists, that's not what I was looking for.\n\n# I want to cite this in an academic paper\n\nNo you probably don't. At least, you probably don't if you're trying to compare\nwhy your new model is better or something.\nThis is not meant to be something for academic papers and only evaluates\na very specific set of capabilities.\nFor all the reasons mentioned earlier I don't think this benchmark\nwill accurately capture what academic people should care about for their models.\nGood for \"useful for me?\": yes. Good for \"is my model better?\": I don't think so.\nBut I've now had at least a few people ask me about this who appear unswayed by the\nabove argument.\n\nSo here's my answer: if you want to use this in a paper,\nthen link to this github project AND INCLUDE THE GIT COMMIT HASH YOU USED.\nI make NO GUARANTEES that I won't just arbitrarily edit test cases without\nwarning. In fact, it's already happened in #1! And #3! And #6.\nSo if you want your paper\nto be at all scientific make sure to include the git commit hash.\n\n\n# License\n\nCopyright (C) 2024, Nicholas Carlini \u003Cnicholas@carlini.com>.\n\nThis program is free software: you can redistribute it and\u002For modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program.  If not, see \u003Chttp:\u002F\u002Fwww.gnu.org\u002Flicenses\u002F>.\n","# 又一个应用型大语言模型基准测试\n\n这是我为自己制作的一个基准测试，用于评估语言模型在我关心的任务上的表现。我之所以关注这些任务，是因为每一个测试都直接源自于过去一年中我曾要求大语言模型为我完成的事情。\n\n例如，这个基准测试中包含一些测试用例，用来评估模型是否能够：\n- 将 Python 函数转换为等效但更快的 C 函数；\n- 将 Python 字节码反编译为可正常运行的源代码；\n- 解释经过压缩的 JavaScript 代码的功能；\n- 识别某些数据的编码格式（比如 uuencode）；\n- 根据类似 BNF 的语法编写解析器；\n- 将一些英文句子转换为 SQL 查询；或者\n- 编写一些 Bash 单行脚本。\n\n这个基准测试有两个显著的特点，使其颇具吸引力：\n\n1. 我实现了一种简单的数据流领域特定语言，使得我（或任何其他人！）可以轻松添加新的测试用例，以更真实地评估模型的能力。\n\n2. 
正是由于这一点，我为实际工作中作为助手使用大语言模型时遇到的不同场景编写了近 100 个测试用例。\n\n例如，下面是一个完整的测试用例，用于评估模型能否写出“Hello, World”程序：\n\n```python\n'用 Python 写一个 \"Hello, World\" 程序' >> LLMRun() >> PythonRun() >> SubstringEvaluator(\"hello world\")\n```\n\n首先，我让模型写出“Hello, World”程序，然后在后台无缝地在一个 Docker 容器中运行该程序，并检查标准输出中是否包含“hello world”。（这里的 `>>` 操作符可以理解为“然后执行”，即“a >> b”表示先执行 a，再执行 b。）\n\n更具趣味性的是，下面这个测试用例首先要求模型生成一段绘制美国国旗的代码。为了评估其结果，我会先运行这段代码，再用另一个语言模型对其进行初步判断，最后通过与参考答案进行比较来全面评估输出结果。\n\n```python\n\"用 C 语言编写一个将美国国旗输出到 stdout 的程序。\" >> LLMRun() >> CRun() >> \\\n    LLMRun(\"这张图片上显示的是什么国旗？\") >> \\\n        (SubstringEvaluator(\"United States\") | SubstringEvaluator(\"USA\") | SubstringEvaluator(\"America\"))\n```\n\n这种领域特定语言使我能够评估比目前我所知的任何其他评估基准都要更加多样化和复杂的行为。这对于判断模型是否能够完成我真正关心的任务非常有帮助。\n\n## 测试结果\n\n我已经在这个基准测试上评估了几款模型。它们的表现如下：\n* o1-mini：62% 通过率\n* Claude 3.5 Sonnet：56% 通过率\n* GPT-4o：48% 通过率\n* Gemini 1.5 Pro：43% 通过率\n* Claude 3 Opus：42% 通过率\n* GPT-4o Mini：36% 通过率\n* Mistral Large：28% 通过率\n* GPT-3.5：26% 通过率\n\n完整的评估表格可以在 [这里](https:\u002F\u002Fnicholas.carlini.com\u002Fwriting\u002F2024\u002Fevaluation_examples\u002Findex.html) 查看。\n\n## 这不是什么\n\n这并不是一个严肃的学术基准测试。\n\n换句话说，它并非旨在严格评估模型在任何特定任务上的能力。它也不是用来决定哪款模型更有能力、更博学、更准确、更少偏见、更安全、更对齐或更有帮助的工具。\n\n这些问题并没有经过最优的提示工程设计。完全有可能——事实上也很可能——如果对其中一些问题进行更好的措辞，模型就能给出更好的答案。\n\n但我就是懒。\n\n我不想提醒模型：“你可是 Python 方面的专家！”也不想告诉它：“如果你答对了，我就给你 10 万美元的小费；但如果答错了，我就杀了只小猫。”当然，我也不会真的这么做……我只是希望它能深呼吸，一步一步地思考后再作答。（或者现在大家常用的那些能让模型发挥最佳水平的咒语。）\n\n我只想简单地输入一个问题，然后得到正确的答案。因此，这个基准测试正是针对这类问题设计的，而且这些问题都是我曾经真正关心并希望得到解答的类型。\n\n### 未能通过某个问题并不意味着什么\n\n由于我在提示工程方面的不足（通常是故意的），当模型未能通过某个问题时，我们并不能从中获得太多有用的信息。也许只是我的问题表述得不够清晰，或者存在某种歧义。\n\n相反，这些测试的设计目的是让我在模型通过时有所收获。毕竟，你不可能靠运气就正确编译 Rust 程序，这需要一定的编程技能。然而，你却可能因为把函数名取成了我不期望的名字，导致你的正确代码根本无法被调用而失败。\n\n## 这是什么\n\n再次强调，这不过是我曾经要求语言模型帮我解决的各种编程任务的问题集合，其中也穿插了一些纯粹出于娱乐目的提出的问题。这些问题大多未经修改，直接按照我当时输入的内容呈现。这意味着它们可能并不是最清晰的表述方式（例如，“在 Python 中，我应该用什么 __东西__ 来表示 ~，就像 __add__ 代表 + 一样？”而我期待的答案是 `__inv__`）。此外，还有一些问题因为需要最新的知识而显得有些“不公平”（比如，“Llama-2 70B 的隐藏维度是多少？”）。但对我来说，重要的是模型能否正确回答这些问题。\n\n# 安装说明\n\n要让这个基准测试运行起来其实相当简单。\n\n## Python 依赖\n\n在 Python 端，你只需运行 `pip install -r requirements.txt` 来安装所需的依赖包。\n\n如果你想运行并评估多种不同的模型，还需要额外安装其他模型的依赖包，可以通过运行 `pip install -r requirements-extra.txt` 来完成。\n\n## Podman（推荐）\n\n我希望在容器中运行所有内容，以确保基本的安全性。Docker 虽然界面更好，安全性控制也略胜一筹（所以如果你愿意也可以使用 Docker），但在 Linux 系统上，启动新的 Docker 任务通常需要 root 权限，或者至少赋予当前用户接近 root 的权限。这让我有些担心。\n\n因此，我更倾向于使用 Podman。请根据你的系统情况，按照官方文档安装即可。\n\n## Docker（可选）\n\n同样，Docker 的安装方法因系统而异，你需要查阅相关文档来了解如何在你的系统上安装它。\n\n## 为什么需要 Docker\u002FPodman？\n\n本基准测试中的测试用例是通过直接执行语言模型生成的代码来评估的。\n某些测试会要求模型重命名文件、移动文件，或对您的机器进行其他改变状态的操作。\n\n虽然我不认为这些模型对我们怀有恶意，会出于故意或报复而输出 `rm -rf \u002F` 命令，但它们确实有可能生成有缺陷的代码，从而意外地破坏您的计算机。因此，为了防止这种情况发生，所有 LLM 的输出都会在一个临时的 Docker 容器中进行评估，测试完成后该容器会立即被删除。\n\n（还有一个原因：部分测试假设系统是一个装有特定依赖项的全新 Ubuntu 系统。在您的本地机器上，这些测试的行为可能与在 Docker 虚拟机中运行时有所不同。）\n\n如果您喜欢冒险（强烈不推荐），可以在代码中设置一个标志：\n\n```python\nI_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE\n```\n\n将其设置为 `True` 后，LLM 生成的所有内容将直接在您的机器上通过 `eval()` 执行。\n\n# 贡献\n\n如果你想为这个基准测试添加自己的测试，请随时提交一个 PR！只要内容有趣，我都很乐意接受。\n\n## 添加新测试\n\n添加测试只需满足几个要求：\n\n1. 测试用例必须在机制上可验证。我知道这一点限制很大。我使用大语言模型的很多场景都无法以这种方式验证，尤其是当我给它们提供大量代码并要求进行难以单元测试的具体修改时。但为了使这些测试有用，你的测试必须易于验证。\n\n2. 测试用例应快速完成。我不希望仅仅运行一个测试就等待几分钟。\n\n3. 在构建测试时，不应根据大语言模型的输出来调整测试。不要因为模型的回答不符合你的预期而修改测试。大多数大语言模型都具有一定的随机性，通过多次尝试和错误，总能找到某种方式来诱导出特定行为。我希望看到模型在面对人类编写的、正常提问方式的测试时如何回答，而不是在经过模型优化之后的表现。\n\n4. 测试的设计应确保**通过**能够展示模型的一些有趣能力。那些专门设计用来让模型失败的“陷阱”式测试，在这种设置中并不实用。\n\n5. 
测试用例不得从互联网下载大量数据。不应该让其他人为每次运行这个基准测试买单。如果需要测试某个库，可以将其添加到 Dockerfile 中。\n\n## 修复测试\n\n这里是否有损坏的测试？我已尽力确保所有测试都是正确的，但无法完全保证其正确性。如果有，请随时提交修复。\n\n请注意：损坏的测试是指答案**客观上错误**的测试。例如，一个声称 6 是质数的测试。但如果测试只是对一个模糊问题期望特定答案，则不能算作错误。比如，有一个测试问：“我该如何修改 `AutoModel.from_pretrained` 使其成为带有 LM 头的自动模型？”并期望模型告诉我应该使用 `AutoModelForCausalLM` 类；尽管存在 `AutoModelWithLMHead` 类，但这并不是我想要的答案。\n\n# 我想在学术论文中引用这个项目\n\n恐怕你不需要这样做。至少，如果你试图比较你的新模型为何更好之类的问题，就不需要了。这个基准测试并非为学术论文设计，它只评估非常特定的一组能力。基于前面提到的原因，我认为这个基准测试并不能准确反映学术界人士关心的模型特性。对于“对我是否有用？”这个问题，答案是肯定的；但对于“我的模型是否更好？”这个问题，我认为未必。不过，我已经遇到过几位对此感兴趣的人，他们似乎并不受上述观点的影响。\n\n所以我的建议是：如果你确实想在论文中使用这个基准测试，请链接到这个 GitHub 项目，并**包含你所使用的 Git 提交哈希值**。我无法保证不会在未事先通知的情况下随意修改测试用例。事实上，这种情况已经在第 1、3 和 6 个测试中发生过了。因此，如果你想让你的论文具有一定的科学性，请务必注明所使用的 Git 提交哈希值。\n\n# 许可证\n\n版权所有 © 2024，尼古拉斯·卡林尼 \u003Cnicholas@carlini.com>。\n\n本程序是自由软件，您可以按照自由软件基金会发布的 GNU 通用公共许可证条款重新分发和修改它，无论是该许可证的第 3 版，还是（由您选择）任何后续版本。\n\n本程序以“按原样”提供，不提供任何形式的担保，包括对适销性或特定用途适用性的默示担保。有关详细信息，请参阅 GNU 通用公共许可证。\n\n您应随本程序收到一份 GNU 通用公共许可证的副本。如果没有，请访问 \u003Chttp:\u002F\u002Fwww.gnu.org\u002Flicenses\u002F> 查看。","# Yet Another Applied LLM Benchmark 快速上手指南\n\n本指南旨在帮助开发者快速部署并运行 `yet-another-applied-llm-benchmark`，用于评估大语言模型在实际编程任务（如代码转换、反编译、SQL 生成等）中的表现。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**：推荐 Linux (Ubuntu) 或 macOS。Windows 用户建议使用 WSL2。\n*   **Python 环境**：Python 3.8+。\n*   **容器运行时（必需）**：为了安全地执行模型生成的代码，必须安装 **Docker** 或 **Podman**。\n    *   *推荐*：**Podman**（无需 root 权限即可运行容器，更安全）。\n    *   *备选*：Docker（需配置用户权限以避免每次使用 `sudo`）。\n*   **浏览器驱动**：部分测试涉及 HTML\u002FJS 生成，需安装 **Google Chrome** 和 **Selenium**。\n    *   Ubuntu 安装 Chrome:\n        ```bash\n        wget https:\u002F\u002Fdl.google.com\u002Flinux\u002Fdirect\u002Fgoogle-chrome-stable_current_amd64.deb\n        sudo dpkg -i google-chrome-stable_current_amd64.deb\n        ```\n\n## 安装步骤\n\n### 1. 克隆项目与安装依赖\n\n```bash\ngit clone \u003Crepository-url>\ncd yet-another-applied-llm-benchmark\n\n# 安装基础 Python 依赖\npip install -r requirements.txt\n\n# (可选) 如果需要评估更多种类的模型，安装额外依赖\npip install -r requirements-extra.txt\n```\n> **提示**：国内用户如遇 pip 下载缓慢，可添加清华或阿里镜像源：\n> `pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 2. 配置 API 密钥\n\n复制配置文件模板并填入您的 API Key（支持 OpenAI, Claude, Gemini 等）：\n\n```bash\ncp config.json.example config.json\n```\n\n编辑 `config.json`，填入对应模型的密钥。**注意**：即使只测试非 OpenAI 模型，也需配置 OpenAI Key，因为部分测试结果需要调用另一个高质量模型（默认 `gpt-4-turbo`）进行自动化评判。\n\n### 3. 构建容器镜像\n\n构建用于隔离运行生成代码的 Docker\u002FPodman 镜像：\n\n```bash\n# 如果使用 Podman (推荐)\npodman build -t llm-benchmark-image .\n\n# 如果使用 Docker\ndocker build -t llm-benchmark-image .\n```\n\n## 基本使用\n\n### 运行完整基准测试\n\n以下命令将对指定模型运行所有测试用例，并生成 HTML 报告：\n\n```bash\npython main.py --model gpt-3.5-turbo --run-tests --generate-report\n```\n\n*   `--model`: 指定要测试的模型名称（需在 `config.json` 中配置）。\n*   `--run-tests`: 执行测试套件。\n*   `--generate-report`: 测试完成后生成对比报告。\n\n运行结束后，可在 `evaluation_examples` 目录下查看详细的 HTML 结果报告。测试结果缓存将自动保存在 `results\u002F` 目录中，便于后续多模型对比。\n\n### 运行单个测试用例\n\n若只想快速验证某个特定功能（例如 \"hello world\" 测试）：\n\n```bash\nPYTHONPATH='.' 
python tests\u002Fprint_hello.py\n```\n\n或者通过主脚本指定特定测试：\n\n```bash\npython main.py --run-tests --test print_hello --model gpt-3.5-turbo-0125\n```\n\n### 对比不同模型结果\n\n如果您已经运行过多次测试，可以加载保存的结果生成对比网格：\n\n```bash\npython main.py --load-saved --generate-report --model [model1] --model [model2]\n```","一位全栈开发者在日常工作中频繁依赖大模型协助完成代码转换、逆向分析及脚本编写等具体任务，急需验证不同模型在真实场景下的可靠性。\n\n### 没有 yet-another-applied-llm-benchmark 时\n- **评估脱离实际**：依赖学术榜单选择模型，却发现模型在处理“将 Python 函数转为高效 C 代码”或“反编译字节码”等具体需求时表现不佳。\n- **验证流程繁琐**：每次测试新模型都需要人工编写提示词、手动运行代码并核对输出，难以批量复现过去一年内遇到的近百种真实难题。\n- **过度依赖提示工程**：为了得到正确结果，不得不花费大量时间研究复杂的提示词技巧（如“逐步思考”），而非直接获得所需答案。\n- **缺乏自动化反馈**：无法自动判断模型生成的 Bash 单行命令或 SQL 查询是否真正可执行且结果正确，只能靠肉眼检查。\n\n### 使用 yet-another-applied-llm-benchmark 后\n- **场景高度契合**：直接利用源自真实工作流的近 100 个测试用例（如解释混淆 JS、识别编码格式），精准评估模型解决实际问题的能力。\n- **自动化闭环测试**：通过内置的数据流 DSL，一键实现“提问 - 执行代码 - 自动比对结果”的全流程，例如自动运行生成的 C 程序并验证绘制的国旗图像。\n- **回归自然交互**：无需刻意设计复杂提示词，直接用日常语言提问即可测试模型水平，还原最真实的使用体验。\n- **量化决策依据**：基于 o1-mini 或 Claude 3.5 Sonnet 等模型在特定任务上的通过率数据，快速选出最适合当前开发任务的助手。\n\nyet-another-applied-llm-benchmark 的核心价值在于它将抽象的模型能力评估转化为对开发者真实痛点的自动化验收测试。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcarlini_yet-another-applied-llm-benchmark_f548af91.png","carlini","Nicholas Carlini","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fcarlini_94385bac.jpg","I break things",null,"nicholas@carlini.com","nicholas.carlini.com","https:\u002F\u002Fgithub.com\u002Fcarlini",[85,89,93,97],{"name":86,"color":87,"percentage":88},"Python","#3572A5",85.2,{"name":90,"color":91,"percentage":92},"Jupyter Notebook","#DA5B0B",14.4,{"name":94,"color":95,"percentage":96},"Dockerfile","#384d54",0.4,{"name":98,"color":99,"percentage":100},"Shell","#89e051",0,1053,79,"2026-04-12T21:32:17","GPL-3.0","Linux","未说明",{"notes":108,"python":106,"dependencies":109},"该工具主要通过 API 调用外部大模型进行测试，本地无需 GPU。必须安装 Docker 或 Podman 以在隔离容器中安全执行模型生成的代码（防止恶意或错误代码破坏主机）。部分测试用例需要安装 Google Chrome 和 Selenium 以验证生成的 HTML\u002FJS。运行前需在 config.json 中配置各模型的 API Key，且必须配置 OpenAI API Key 用于部分结果的二次评估。",[110,111,112,113,114],"requirements.txt 中的依赖","requirements-extra.txt 中的依赖","selenium","docker 或 podman","Google Chrome",[15,54],"2026-03-27T02:49:30.150509","2026-04-18T00:45:52.208133",[119,124,129,134,139,143],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},38134,"如何查看多个模型的评估结果对比表？","运行多次测试后，默认页面可能只显示最新模型的结果。维护者已提交更新以澄清 README 中的相关说明。通常你需要确保在生成报告时正确配置了输出路径或参数，以便累积显示所有已评估模型的数据。请查阅最新的 README 文档获取具体操作指南。","https:\u002F\u002Fgithub.com\u002Fcarlini\u002Fyet-another-applied-llm-benchmark\u002Fissues\u002F14",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},38135,"自定义评估器类（继承 Node）时报错，如何解决？","在自定义类中创建 `Reason` 对象时，`node` 参数应传递 `type(self)` 而不是 `self`。例如：`Reason(node=type(self), children=[...])`。这是框架内部处理节点类型的方式，参考 `evaluator.py` 中其他类的实现（如第 225 行）即可修正此问题。","https:\u002F\u002Fgithub.com\u002Fcarlini\u002Fyet-another-applied-llm-benchmark\u002Fissues\u002F23",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},38136,"模型生成的代码提取失败或包含噪音怎么办？","如果模型未能正确提取自身生成的代码（例如缺少 `disconnectedchildren`），这通常被视为模型本身的失败而非框架错误，因为提示词已明确要求模型执行该操作。目前框架接受这种“失败模式”。未来可能会引入“自我修正”模式来自动修复此类小错误，但当前需视为评估结果的一部分。","https:\u002F\u002Fgithub.com\u002Fcarlini\u002Fyet-another-applied-llm-benchmark\u002Fissues\u002F18",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},38137,"在 Mac M1 上使用 Podman 构建镜像失败怎么办？","该问题已通过社区提交的 PR 修复。如果你遇到类似的构建问题（如在 Macbook Pro M1 Sonoma 14.5 上运行 `podman build` 失败），请拉取仓库的最新代码，其中包含了针对 ARM64 
架构和依赖安装的修复补丁。","https:\u002F\u002Fgithub.com\u002Fcarlini\u002Fyet-another-applied-llm-benchmark\u002Fissues\u002F20",{"id":140,"question_zh":141,"answer_zh":142,"source_url":123},38138,"评估结果的误差范围是多少？多少分的差距才算有意义？","目前框架未自动计算统计置信区间。维护者指出，+3% 的提升可能仅相当于两个测试题的差异，未必具有统计学意义；通常 +\u002F- 10% 的波动才被认为是有意义的。若需更公平的比较，建议对每个模型运行 N 次测试取平均值，并考虑不同参数的影响。",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},38139,"是否可以使用 DSPy 重构或增强此基准测试？","虽然 DSPy 是一个优秀的项目，但它设计为通用框架，复杂度较高。本基准测试专为特定评估目的构建，结构更简单直接，便于快速编写和运行测试用例。因此，目前暂无计划使用 DSPy 进行重构，以保持项目的轻量性和针对性。","https:\u002F\u002Fgithub.com\u002Fcarlini\u002Fyet-another-applied-llm-benchmark\u002Fissues\u002F12",[]]