[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-JShollaj--awesome-llm-interpretability":3,"tool-JShollaj--awesome-llm-interpretability":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 
都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析流水线的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器处理而优化。",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":75,"owner_location":76,"owner_email":75,"owner_twitter":75,"owner_website":75,"owner_url":77,"languages":75,"stars":78,"forks":79,"last_commit_at":80,"license":75,"difficulty_score":81,"env_os":82,"env_gpu":83,"env_ram":83,"env_deps":84,"category_tags":87,"github_topics":89,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":92,"updated_at":93,"faqs":94,"releases":95},9282,"JShollaj\u002Fawesome-llm-interpretability","awesome-llm-interpretability","A curated list of Large Language Model (LLM) Interpretability resources.","awesome-llm-interpretability 是一个精心整理的开源资源清单，专注于大语言模型（LLM）的可解释性研究。面对大模型常被视为“黑盒”、内部决策逻辑难以捉摸的痛点，它汇聚了丰富的工具、学术论文、技术文章及专业社区，旨在帮助用户深入理解模型如何思考、知识如何在训练中演化，以及定位并修正模型中的事实错误。\n\n这份资源特别适合 AI 研究人员、算法工程师及开发者使用。无论是需要调试神经网络的研究者，还是希望评估和优化模型表现的开发团队，都能从中找到得力助手。其独特亮点在于收录了多样化的前沿工具：从支持可视化注意力机制和神经元激活的 The Learning Interpretability Tool、TransformerLens，到能自动生成功能解释的 Automated Interpretability，再到用于编辑模型事实关联的 Rome 项目。这些资源覆盖了从底层的机械可解释性分析到高层的应用评估，为揭开大模型神秘面纱提供了全方位的技术支持，是推动可信 AI 发展的重要参考库。","\n# Awesome LLM Interpretability ![](https:\u002F\u002Fgithub.com\u002Fyour-username\u002Fawesome-llm-interpretability\u002Fworkflows\u002FAwesome%20Bot\u002Fbadge.svg)\n\nA curated list of amazingly awesome tools, papers, articles, and communities focused on Large Language Model (LLM) Interpretability.\n\n\n## Table of Contents\n- [Awesome LLM Interpretability](#awesome-llm-interpretability)\n    - [Tools](#llm-interpretability-tools)\n    - [Papers](#llm-interpretability-papers)\n    - [Articles](#llm-interpretability-articles)\n    - [Groups](#llm-interpretability-groups)\n\n### LLM Interpretability Tools\n*Tools and libraries for LLM interpretability and analysis.*\n\n1. [The Learning Interpretability Tool](https:\u002F\u002Fpair-code.github.io\u002Flit\u002F) - An open-source platform for visualization and understanding of ML models; supports classification, regression, and generative models (text & image data); includes saliency methods, attention attribution, counterfactuals, TCAV, embedding visualizations, and Facets-style data analysis.\n1. [Comgra](https:\u002F\u002Fgithub.com\u002FFlorianDietz\u002Fcomgra) - Comgra helps you analyze and debug neural networks in PyTorch.\n1. 
[Pythia](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia) - Interpretability analysis to understand how knowledge develops and evolves during training in autoregressive transformers.\n1. [Phoenix](https:\u002F\u002Fgithub.com\u002FArize-ai\u002Fphoenix) - AI Observability & Evaluation - Evaluate, troubleshoot, and fine tune your LLM, CV, and NLP models in a notebook.\n1. [Floom](https:\u002F\u002Fgithub.com\u002FFloomAI\u002FFloom) - AI gateway and marketplace for developers; enables streamlined integration of AI features into products.\n1. [Automated Interpretability](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fautomated-interpretability) - Code for automatically generating, simulating, and scoring explanations of neuron behavior.\n1. [Fmr.ai](https:\u002F\u002Fgithub.com\u002FBizarreCake\u002Ffmr.ai) - AI interpretability and explainability platform.\n1. [Attention Analysis](https:\u002F\u002Fgithub.com\u002Fclarkkev\u002Fattention-analysis) - Analyzing attention maps from the BERT transformer.\n1. [SpellGPT](https:\u002F\u002Fgithub.com\u002Fmwatkins1970\u002FSpellGPT) - Explores GPT-3’s ability to spell its own token strings.\n1. [SuperICL](https:\u002F\u002Fgithub.com\u002FJetRunner\u002FSuperICL) - Super In-Context Learning code which allows black-box LLMs to work with locally fine-tuned smaller models.\n1. [Git Re-Basin](https:\u002F\u002Fgithub.com\u002Fsamuela\u002Fgit-re-basin) - Code release for \"Git Re-Basin: Merging Models modulo Permutation Symmetries.\"\n1. [Functionary](https:\u002F\u002Fgithub.com\u002FMeetKai\u002Ffunctionary) - Chat language model that can interpret and execute functions\u002Fplugins.\n1. [Sparse Autoencoder](https:\u002F\u002Fgithub.com\u002Fai-safety-foundation\u002Fsparse_autoencoder) - Sparse Autoencoder for Mechanistic Interpretability.\n1. [Rome](https:\u002F\u002Fgithub.com\u002Fkmeng01\u002Frome) - Locating and editing factual associations in GPT.\n1. [Inseq](https:\u002F\u002Fgithub.com\u002Finseq-team\u002Finseq) - Interpretability for sequence generation models.\n1. [Neuron Viewer](https:\u002F\u002Fopenaipublic.blob.core.windows.net\u002Fneuron-explainer\u002Fneuron-viewer\u002Findex.html) - Tool for viewing neuron activations and explanations.\n1. [LLM Visualization](https:\u002F\u002Fbbycroft.net\u002Fllm) - Visualizing LLMs at a low level.\n1. [Vanna](https:\u002F\u002Fgithub.com\u002Fvanna-ai\u002Fvanna) - Abstractions to use RAG to generate SQL with any LLM.\n1. [Copy Suppression](https:\u002F\u002Fcopy-suppression.streamlit.app\u002F) - Designed to help explore different prompts for GPT-2 Small, as part of a research project regarding copy-suppression in LLMs.\n1. [TransformerViz](https:\u002F\u002Ftransformervis.github.io\u002Ftransformervis\u002F) - Interactive tool to visualize transformer models via their latent space.\n1. [TransformerLens](https:\u002F\u002Fgithub.com\u002Fneelnanda-io\u002FTransformerLens) - A Library for Mechanistic Interpretability of Generative Language Models.\n1. [Awesome-Attention-Heads](https:\u002F\u002Fgithub.com\u002FIAAR-Shanghai\u002FAwesome-Attention-Heads) - A carefully compiled list that summarizes the diverse functions of the attention heads.\n1. [ecco](https:\u002F\u002Fgithub.com\u002Fjalammar\u002Fecco) - A Python library for exploring and explaining Natural Language Processing models using interactive visualizations.\n\n---\n\n### LLM Interpretability Papers\n*Academic and industry papers on LLM interpretability.*\n\n1. 
[Interpretability Illusions in the Generalization of Simplified Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03656) – Shows how interpretability methods based on simplified models (e.g., linear probes) can be prone to generalisation illusions.\n1. [Self-Influence Guided Data Reweighting for Language Model Pre-training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00913) - An application of training data attribution methods to re-weight training data and improve performance.\n1. [Data Similarity is Not Enough to Explain Language Model Performance](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.695\u002F) - Discusses the limits of embedding models in explaining effective data selection.\n1. [Post Hoc Explanations of Language Models Can Improve Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.11426.pdf) - Evaluates the ability of language-model-generated explanations to also improve model quality.\n1. [Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.04213), [Tweet Summary](https:\u002F\u002Ftwitter.com\u002Fpeterbhase\u002Fstatus\u002F1613573825932697600) (NeurIPS 2023 Spotlight) - Highlights the limits of Causal Tracing: how a fact is stored in an LLM can be changed by editing weights in a different location than where Causal Tracing suggests.\n1. [Finding Neurons in a Haystack: Case Studies with Sparse Probing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.01610) - Explores the representation of high-level human-interpretable features within neuron activations of large language models (LLMs).\n1. [Copy Suppression: Comprehensively Understanding an Attention Head](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04625) - Investigates a specific attention head in GPT-2 Small, revealing its primary role in copy suppression.\n1. [Linear Representations of Sentiment in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.15154) - Shows how sentiment is represented in Large Language Models (LLMs), finding that sentiment is linearly represented in these models.\n1. [Emergent world representations: Exploring a sequence model trained on a synthetic task](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.13382) - Explores emergent internal representations in a GPT variant trained to predict legal moves in the board game Othello.\n1. [Towards Automated Circuit Discovery for Mechanistic Interpretability](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.14997) - Introduces the Automatic Circuit Discovery (ACDC) algorithm for identifying important units in neural networks.\n1. [A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03025) - Examines small neural networks to understand how they learn group compositions, using representation theory.\n1. [Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.12265) - Causal mediation analysis as a method for interpreting neural models in natural language processing.\n1. [The Quantization Model of Neural Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13506) - Proposes the Quantization Model for explaining neural scaling laws in neural networks.\n1. 
[Discovering Latent Knowledge in Language Models Without Supervision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.03827) - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.\n1. [How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.00586) - Analyzes mathematical capabilities of GPT-2 Small, focusing on its ability to perform the 'greater-than' operation.\n1. [Towards Monosemanticity: Decomposing Language Models With Dictionary Learning](https:\u002F\u002Ftransformer-circuits.pub\u002F2023\u002Fmonosemantic-features\u002Findex.html) - Using a sparse autoencoder to decompose the activations of a one-layer transformer into interpretable, monosemantic features.\n1. [Language models can explain neurons in language models](https:\u002F\u002Fopenaipublic.blob.core.windows.net\u002Fneuron-explainer\u002Fpaper\u002Findex.html) - Explores how language models like GPT-4 can be used to explain the functioning of neurons within similar models.\n1. [Emergent Linear Representations in World Models of Self-Supervised Sequence Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00941) - Linear representations in a world model of Othello-playing sequence models.\n1. [\"Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=VJEcAnFPqC) - Explores stepwise inference in autoregressive language models using a synthetic task based on navigating directed acyclic graphs.\n1. [\"Successor Heads: Recurring, Interpretable Attention Heads In The Wild\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=kvcbV8KQsi) - Introduces 'successor heads,' attention heads that increment tokens with a natural ordering, such as numbers and days, in LLMs.\n1. [\"Large Language Models Are Not Robust Multiple Choice Selectors\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=shr9PXz7T0) - Analyzes the bias and robustness of LLMs in multiple-choice questions, revealing their vulnerability to option position changes due to inherent \"selection bias\".\n1. [\"Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=4bSQ3lsfEV) - Presents a novel approach to understanding neural networks by examining feature complexity through category theory.\n1. [\"Let's Verify Step by Step\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=v8L0pN6EOi) - Focuses on improving the reliability of LLMs in multi-step reasoning tasks using step-level human feedback.\n1. [\"Interpretability Illusions in the Generalization of Simplified Models\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=v675Iyu0ta) - Examines the limitations of simplified representations (like SVD) used to interpret deep learning systems, especially in out-of-distribution scenarios.\n1. [\"The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=SQGUDc9tC8) - Presents a novel approach for identifying and mitigating social biases in language models, introducing the concept of 'Social Bias Neurons'.\n1. 
[\"Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=VpCqrMMGVm) - Investigates how LLMs perform the task of mathematical addition.\n1. [\"Measuring Feature Sparsity in Language Models\"](https:\u002F\u002Fopenreview.net\u002Fforum?id=SznHfMwmjG) - Develops metrics to evaluate the success of sparse coding techniques in language model activations.\n1. [Toy Models of Superposition](https:\u002F\u002Ftransformer-circuits.pub\u002F2022\u002Ftoy_model\u002Findex.html) - Investigates how models represent more features than dimensions, especially when features are sparse.\n1. [Spine: Sparse interpretable neural embeddings](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.08792) - Presents SPINE, a method transforming dense word embeddings into sparse, interpretable ones using denoising autoencoders.\n1. [Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.15949) - Introduces a novel method for visualizing transformer networks using dictionary learning.\n1. [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11207) - Introduces Pythia, a toolset designed for analyzing the training and scaling behaviors of LLMs.\n1. [On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron](https:\u002F\u002Fzuva.ai\u002Fscience\u002Finterpretability-and-feature-representations\u002F) - Critically examines the effectiveness of the \"Sentiment Neuron”.\n1. [Engineering monosemanticity in toy models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09169) - Explores engineering monosemanticity in neural networks, where individual neurons correspond to distinct features.\n1. [Polysemanticity and capacity in neural networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.01892) - Investigates polysemanticity in neural networks, where individual neurons represent multiple features.\n1. [An Overview of Early Vision in InceptionV1](https:\u002F\u002Fdistill.pub\u002F2020\u002Fcircuits\u002Fearly-vision\u002F) - A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.\n1. [Visualizing and measuring the geometry of BERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.02715) - Delves into BERT's internal representation of linguistic information, focusing on both syntactic and semantic aspects.\n1. [Neurons in Large Language Models: Dead, N-gram, Positional](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04827) - An analysis of neurons in large language models, focusing on the OPT family.\n1. [Can Large Language Models Explain Themselves?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11207) - Evaluates the effectiveness of self-explanations generated by LLMs in sentiment analysis tasks.\n1. [Interpretability in the Wild: GPT-2 small (arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.00593) - Provides a mechanistic explanation of how GPT-2 small performs indirect object identification (IOI) in natural language processing.\n1. [Sparse Autoencoders Find Highly Interpretable Features in Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11207) - Explores the use of sparse autoencoders to extract more interpretable and less polysemantic features from LLMs.\n1. 
[Emergent and Predictable Memorization in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11158) - Investigates which training sequences LLMs memorize, showing that memorization can be predicted by extrapolating from smaller models or partially trained checkpoints.\n1. [Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.01429) - Demonstrates that focusing only on specific parts like attention heads or weight matrices in Transformers can lead to misleading interpretability claims.\n1. [The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True\u002FFalse Datasets](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06824) - This paper investigates the representation of truth in Large Language Models (LLMs) using true\u002Ffalse datasets.\n1. [Interpretability at Scale: Identifying Causal Mechanisms in Alpaca](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.08809) - This study presents Boundless Distributed Alignment Search (Boundless DAS), an advanced method for interpreting LLMs like Alpaca.\n1. [Representation Engineering: A Top-Down Approach to AI Transparency](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01405) - Introduces Representation Engineering (RepE), a novel approach for enhancing AI transparency, focusing on high-level representations rather than neurons or circuits.\n1. [Explaining black box text modules in natural language with language models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.09863) - Natural language explanations for LLM attention heads, evaluated using synthetic text.\n1. [N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12918) - Explains each LLM neuron as a graph.\n1. [Augmenting Interpretable Models with LLMs during Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.11799) - Uses LLMs to build interpretable classifiers of text data.\n1. [ChainPoll: A High Efficacy Method for LLM Hallucination Detection](https:\u002F\u002Fwww.rungalileo.io\u002Fblog\u002Fchainpoll) - ChainPoll, a novel hallucination detection methodology that substantially outperforms existing alternatives, and RealHall, a carefully curated suite of benchmark datasets for evaluating hallucination detection metrics proposed in recent literature.\n1. [A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11917) - Identifies backward chaining circuits in a transformer trained to perform pathfinding in a tree.\n1. [Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition](https:\u002F\u002Fdoi.org\u002F10.1162\u002Fcoli_a_00416) - Introduces instance-based, metric-learner approximations of neural network models and hard-attention mechanisms that can be constructed with task-specific inductive biases for effective semi-supervised learning (i.e., feature detection). These mechanisms combine to yield effective methods for interpretability-by-exemplar over the representation space of neural models.\n1. [Similarity-Distance-Magnitude Universal Verification](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.20167) - Introduces SDM activation functions, SDM calibration, and SDM networks, which are neural networks (e.g., LLMs) with uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties. 
See the blog post [\"The Determinants of Controllable AGI\"](https:\u002F\u002Fraw.githubusercontent.com\u002Fallenschmaltz\u002FResolute_Resolutions\u002Fmaster\u002Fvolume5\u002Fvolume5.pdf) for a high-level overview of the broader implications.\n1. [Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph](https:\u002F\u002Fopenreview.net\u002Fforum?id=dWYRjT501w) - A framework based on the technique of activation patching to represent the factual knowledge embedded in the vector space of LLMs as dynamic knowledge graphs.\n\n---\n\n### LLM Interpretability Articles\n*Insightful articles and blog posts on LLM interpretability.*\n\n1. [Do Machine Learning Models Memorize or Generalize?](https:\u002F\u002Fpair.withgoogle.com\u002Fexplorables\u002Fgrokking\u002F) - An interactive visualization exploring the phenomenon known as grokking (VISxAI hall of fame).\n1. [What Have Language Models Learned?](https:\u002F\u002Fpair.withgoogle.com\u002Fexplorables\u002Ffill-in-the-blank\u002F) - An interactive visualization to understand how large language models work and the nature of their biases (VISxAI hall of fame).\n1. [A New Approach to Computation Reimagines Artificial Intelligence](https:\u002F\u002Fwww.quantamagazine.org\u002Fa-new-approach-to-computation-reimagines-artificial-intelligence-20230413\u002F?mc_cid=ad9a93c472&mc_eid=506130a407) - Discusses hyperdimensional computing, a novel method involving hyperdimensional vectors (hypervectors) for more efficient, transparent, and robust artificial intelligence.\n1. [Interpreting GPT: the logit lens](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002FAcKRB8wDpdaN6v6ru\u002Finterpreting-gpt-the-logit-lens) - Explores how the logit lens reveals a gradual convergence of GPT's probabilistic predictions across its layers, from initial nonsensical or shallow guesses to more refined predictions.\n1. [A Mechanistic Interpretability Analysis of Grokking](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002FN6WM6hs7RQMKDhYjB\u002Fa-mechanistic-interpretability-analysis-of-grokking) - Explores the phenomenon of 'grokking' in deep learning, where models suddenly shift from memorization to generalization during training.\n1. [200 Concrete Open Problems in Mechanistic Interpretability](https:\u002F\u002Fwww.lesswrong.com\u002Fs\u002FyivyHaCAmMJ3CqSyj) - Series of posts discussing open research problems in the field of Mechanistic Interpretability (MI), which focuses on reverse-engineering neural networks.\n1. [Evaluating LLMs is a minefield](https:\u002F\u002Fwww.cs.princeton.edu\u002F~arvindn\u002Ftalks\u002Fevaluating_llms_minefield\u002F) - Challenges in assessing the performance and biases of large language models (LLMs) like GPT.\n1. [Attribution Patching: Activation Patching At Industrial Scale](https:\u002F\u002Fwww.neelnanda.io\u002Fmechanistic-interpretability\u002Fattribution-patching) - Method that uses gradients for a linear approximation of activation patching in neural networks.\n1. [Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002FJvZhhzycHu2Yd57RN\u002Fcausal-scrubbing-a-method-for-rigorously-testing) - Introduces causal scrubbing, a method for evaluating the quality of mechanistic interpretations in neural networks.\n1. 
[A circuit for Python docstrings in a 4-layer attention-only transformer](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002Fu6KXXmKFbXfWzoAXn\u002Fa-circuit-for-python-docstrings-in-a-4-layer-attention-only) - Examines a specific neural circuit within a 4-layer attention-only transformer responsible for generating Python docstrings.\n1. [Discovering Latent Knowledge in Language Models Without Supervision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.03827) - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.\n1. [Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.13243) - Survey on mechanistic interpretability.\n\n---\n\n### LLM Interpretability Groups\n*Communities and groups dedicated to LLM interpretability.*\n\n1. [PAIR](https:\u002F\u002Fpair.withgoogle.com) - Team at Google working on [open-source tools](https:\u002F\u002Fgithub.com\u002Fpair-code), [interactive explorable visualizations](https:\u002F\u002Fpair.withgoogle.com\u002Fexplorables), and [interpretability research](https:\u002F\u002Fpair.withgoogle.com\u002Fresearch\u002F).\n1. [Alignment Lab AI](https:\u002F\u002Fdiscord.com\u002Finvite\u002Fad27GQgc7K) - Group of researchers focusing on AI alignment.\n1. [Nous Research](https:\u002F\u002Fdiscord.com\u002Finvite\u002FAVB9jkHZ) - Research group discussing various topics on interpretability.\n1. [EleutherAI](https:\u002F\u002Fdiscord.com\u002Finvite\u002F4pEBpVTN) - Non-profit AI research lab that focuses on interpretability and alignment of large models.\n\n---\n\n### LLM Survey Papers\n*Survey papers for LLMs.*\n1. [A Survey of Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.18223) - This survey paper provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.\n\n## Contributing and Collaborating\nPlease see [CONTRIBUTING](https:\u002F\u002Fgithub.com\u002FJShollaj\u002Fawesome-llm-interpretability\u002Fblob\u002Fmaster\u002FCONTRIBUTING.md) and [CODE-OF-CONDUCT](https:\u002F\u002Fgithub.com\u002FJShollaj\u002Fawesome-llm-interpretability\u002Fblob\u002Fmaster\u002FCODE-OF-CONDUCT.md) for details.\n\n","# 令人惊叹的LLM可解释性 ![](https:\u002F\u002Fgithub.com\u002Fyour-username\u002Fawesome-llm-interpretability\u002Fworkflows\u002FAwesome%20Bot\u002Fbadge.svg)\n\n一份精心整理的清单，汇集了专注于大型语言模型（LLM）可解释性的超棒工具、论文、文章和社区。\n\n\n## 目录\n- [令人惊叹的LLM可解释性](#awesome-llm-interpretability)\n    - [工具](#llm-interpretability-tools)\n    - [论文](#llm-interpretability-papers)\n    - [文章](#llm-interpretability-articles)\n    - [社群](#llm-interpretability-groups)\n\n### LLM可解释性工具\n*用于LLM可解释性和分析的工具与库。*\n\n1. [The Learning Interpretability Tool](https:\u002F\u002Fpair-code.github.io\u002Flit\u002F) - 一个开源的机器学习模型可视化与理解平台，支持分类、回归以及生成模型（文本与图像数据）；包含显著性方法、注意力归因、反事实分析、TCAV、嵌入可视化及类似Facets的数据分析功能。\n1. [Comgra](https:\u002F\u002Fgithub.com\u002FFlorianDietz\u002Fcomgra) - Comgra帮助你在PyTorch中分析和调试神经网络。\n1. [Pythia](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia) - 可解释性分析工具，用于理解知识在自回归Transformer训练过程中如何发展与演变。\n1. [Phoenix](https:\u002F\u002Fgithub.com\u002FArize-ai\u002Fphoenix) - AI可观测性与评估平台——在Notebook中评估、排查并微调你的LLM、CV和NLP模型。\n1. [Floom](https:\u002F\u002Fgithub.com\u002FFloomAI\u002FFloom) - 面向开发者的AI网关与市场，可简化AI功能集成到产品中的流程。\n1. [Automated Interpretability](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fautomated-interpretability) - 自动化生成、模拟并评分神经元行为解释的代码。\n1. 
[Fmr.ai](https:\u002F\u002Fgithub.com\u002FBizarreCake\u002Ffmr.ai) - AI可解释性与透明度平台。\n1. [Attention Analysis](https:\u002F\u002Fgithub.com\u002Fclarkkev\u002Fattention-analysis) - 分析BERT Transformer的注意力图。\n1. [SpellGPT](https:\u002F\u002Fgithub.com\u002Fmwatkins1970\u002FSpellGPT) - 探索GPT-3拼写自身标记串的能力。\n1. [SuperICL](https:\u002F\u002Fgithub.com\u002FJetRunner\u002FSuperICL) - 超级上下文学习代码，允许黑盒LLM与本地微调的小型模型协同工作。\n1. [Git Re-Basin](https:\u002F\u002Fgithub.com\u002Fsamuela\u002Fgit-re-basin) - “Git Re-Basin：基于置换对称性的模型合并”相关代码发布。\n1. [Functionary](https:\u002F\u002Fgithub.com\u002FMeetKai\u002Ffunctionary) - 一款能够解析并执行函数\u002F插件的聊天语言模型。\n1. [Sparse Autoencoder](https:\u002F\u002Fgithub.com\u002Fai-safety-foundation\u002Fsparse_autoencoder) - 用于机制性可解释性的稀疏自编码器。\n1. [Rome](https:\u002F\u002Fgithub.com\u002Fkmeng01\u002Frome) - 在GPT中定位并编辑事实关联。\n1. [Inseq](https:\u002F\u002Fgithub.com\u002Finseq-team\u002Finseq) - 针对序列生成模型的可解释性工具。\n1. [Neuron Viewer](https:\u002F\u002Fopenaipublic.blob.core.windows.net\u002Fneuron-explainer\u002Fneuron-viewer\u002Findex.html) - 用于查看神经元激活及解释的工具。\n1. [LLM Visualization](https:\u002F\u002Fbbycroft.net\u002Fllm) - 从底层视角可视化LLM。\n1. [Vanna](https:\u002F\u002Fgithub.com\u002Fvanna-ai\u002Fvanna) - 提供抽象层，使任何LLM都能通过RAG生成SQL。\n1. [Copy Suppression](https:\u002F\u002Fcopy-suppression.streamlit.app\u002F) - 专为探索GPT-2 Small的不同提示而设计，作为LLM中复制抑制研究项目的一部分。\n1. [TransformerViz](https:\u002F\u002Ftransformervis.github.io\u002Ftransformervis\u002F) - 交互式工具，可通过潜空间可视化Transformer模型。\n1. [TransformerLens](https:\u002F\u002Fgithub.com\u002Fneelnanda-io\u002FTransformerLens) - 用于生成式语言模型机制性可解释性的库。\n1. [Awesome-Attention-Heads](https:\u002F\u002Fgithub.com\u002FIAAR-Shanghai\u002FAwesome-Attention-Heads) - 精心编纂的列表，总结了注意力头的多样化功能。\n1. [ecco](https:\u002F\u002Fgithub.com\u002Fjalammar\u002Fecco) - 使用交互式可视化探索和解释自然语言处理模型的Python库。\n\n---\n\n### LLM可解释性论文\n*关于LLM可解释性的学术与行业论文。*\n\n1. [简化模型泛化中的可解释性幻象](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03656) – 展示了基于简化模型的可解释性方法（例如线性探测等）如何容易产生泛化幻象。\n1. [面向语言模型预训练的自影响引导数据重加权](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00913) – 将训练数据归因方法应用于重新加权训练数据，以提升性能。\n1. [数据相似性不足以解释语言模型性能](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.695\u002F) – 讨论了嵌入模型在解释数据有效选择方面的局限性。\n1. [语言模型的事后解释可以改进语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.11426.pdf) – 评估了由语言模型生成的解释是否也能提升模型质量。\n1. [定位信息能指导编辑吗？基于因果追踪的定位与语言模型知识编辑之间的惊人差异](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.04213)，[推文摘要](https:\u002F\u002Ftwitter.com\u002Fpeterbhase\u002Fstatus\u002F1613573825932697600)（NeurIPS 2023 Spotlight）– 强调了因果追踪的局限性：LLM中事实的存储方式可能被修改于因果追踪所指示位置之外的权重上。\n1. [在干草堆中寻找神经元：稀疏探测案例研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.01610) – 探索大型语言模型（LLMs）中神经元激活里高层人类可解释特征的表征。\n1. [复制抑制：全面理解一个注意力头](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04625) – 研究GPT-2 Small中的特定注意力头，揭示其在复制抑制中的主要作用。\n1. [大型语言模型中情感的线性表征](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.15154) – 展示了情感在大型语言模型（LLMs）中的表征方式，发现情感在这些模型中呈线性表征。\n1. [涌现的世界表征：探索在合成任务上训练的序列模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.13382) – 探讨在一款用于预测棋盘游戏奥赛罗合法走法的GPT变体中涌现的内部表征。\n1. [迈向机制性可解释性的自动化回路发现](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.14997) – 介绍了用于识别神经网络中重要单元的自动回路发现（ACDC）算法。\n1. [普适性的玩具模型：逆向工程网络如何学习群运算](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03025) – 使用表示理论研究小型神经网络，以理解它们如何学习群的复合结构。\n1. [用于解释神经NLP的因果中介分析：以性别偏见为例](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.12265) – 将因果中介分析作为一种方法用于解释自然语言处理中的神经模型。\n1. [神经缩放的量化模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13506) – 提出量化模型来解释神经网络中的神经缩放规律。\n1. 
[无需监督地从语言模型中发现潜在知识](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.03827) – 提出一种方法，在无需监督的情况下从语言模型的内部激活中提取对是非问题的准确答案。\n1. [GPT-2如何计算“大于”？：解读预训练语言模型的数学能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.00586) – 分析GPT-2 Small的数学能力，重点关注其执行“大于”操作的能力。\n1. [迈向单义性：用字典学习分解语言模型](https:\u002F\u002Ftransformer-circuits.pub\u002F2023\u002Fmonosemantic-features\u002Findex.html) – 利用稀疏自编码器将单层Transformer的激活分解为可解释的、单义性特征。\n1. [语言模型可以解释语言模型中的神经元](https:\u002F\u002Fopenaipublic.blob.core.windows.net\u002Fneuron-explainer\u002Fpaper\u002Findex.html) – 探讨如何利用GPT-4等语言模型来解释同类模型中神经元的功能。\n1. [自监督序列模型世界模型中的涌现线性表征](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00941) – 在一款玩奥赛罗的序列模型的世界模型中出现的线性表征。\n1. [“迈向对Transformer逐步推理的机制性理解：一种合成图导航模型”](https:\u002F\u002Fopenreview.net\u002Fforum?id=VJEcAnFPqC) – 通过基于有向无环图导航的合成任务，探讨自回归语言模型中的逐步推理过程。\n1. [“后继头：野外常见的可解释注意力头”](https:\u002F\u002Fopenreview.net\u002Fforum?id=kvcbV8KQsi) – 介绍“后继头”，即能够在具有自然顺序的标记（如数字和日期）上递增的注意力头。\n1. [大型语言模型并非稳健的多项选择题选择器](https:\u002F\u002Fopenreview.net\u002Fforum?id=shr9PXz7T0) – 分析大型语言模型在多项选择题中的偏差与鲁棒性，揭示其因固有的“选择偏见”而易受选项位置变化的影响。\n1. [超越神经网络特征相似性：网络特征复杂度及其范畴论解释](https:\u002F\u002Fopenreview.net\u002Fforum?id=4bSQ3lsfEV) – 提出一种新颖的方法，通过范畴论考察特征复杂度来理解神经网络。\n1. [让我们一步步验证](https:\u002F\u002Fopenreview.net\u002Fforum?id=v8L0pN6EOi) – 专注于利用逐步骤的人工反馈来提高大型语言模型在多步推理任务中的可靠性。\n1. [简化模型泛化中的可解释性幻象](https:\u002F\u002Fopenreview.net\u002Fforum?id=v675Iyu0ta) – 考察用于解释深度学习系统的简化表征（如SVD）的局限性，尤其是在分布外场景中。\n1. [魔鬼藏在神经元里：解释并缓解语言模型中的社会偏见](https:\u002F\u002Fopenreview.net\u002Fforum?id=SQGUDc9tC8) – 提出一种新方法来识别和缓解语言模型中的社会偏见，引入“社会偏见神经元”的概念。\n1. [解释大型语言模型在数学加法中的内在机制](https:\u002F\u002Fopenreview.net\u002Fforum?id=VpCqrMMGVm) – 研究大型语言模型如何完成数学加法任务。\n1. [测量语言模型中的特征稀疏性](https:\u002F\u002Fopenreview.net\u002Fforum?id=SznHfMwmjG) – 开发用于评估语言模型激活中稀疏编码技术成功程度的指标。\n1. [叠加现象的玩具模型](https:\u002F\u002Ftransformer-circuits.pub\u002F2022\u002Ftoy_model\u002Findex.html) – 研究模型如何表征比维度更多的特征，尤其是在特征稀疏的情况下。\n1. [SPINE：稀疏可解释的神经嵌入](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.08792) – 介绍SPINE方法，利用去噪自编码器将稠密词嵌入转化为稀疏且可解释的嵌入。\n1. [通过字典学习可视化Transformer：上下文嵌入作为Transformer因子的线性叠加](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.15949) – 提出一种利用字典学习来可视化Transformer网络的新方法。\n1. [Pythia：一套用于分析大型语言模型训练与扩展行为的工具](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373) – 介绍专为分析大型语言模型训练和扩展行为设计的工具集Pythia。\n1. [关于可解释性和特征表征：对“情感神经元”的分析](https:\u002F\u002Fzuva.ai\u002Fscience\u002Finterpretability-and-feature-representations\u002F) – 批判性地审视“情感神经元”的有效性。\n1. [在玩具模型中实现单义性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09169) – 探讨在神经网络中实现单义性，使每个神经元对应于不同的特征。\n1. [神经网络中的多义性与容量](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.01892) – 研究神经网络中单个神经元代表多个特征的现象。\n1. [InceptionV1早期视觉概览](https:\u002F\u002Fdistill.pub\u002F2020\u002Fcircuits\u002Fearly-vision\u002F) – 全面探讨InceptionV1神经网络的前五层，重点聚焦早期视觉。\n1. [BERT几何结构的可视化与测量](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.02715) – 深入研究BERT对语言信息的内部表征，同时关注句法和语义方面。\n1. [大型语言模型中的神经元：死神经元、N-gram神经元、位置神经元](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04827) – 对大型语言模型中的神经元进行分析，重点关注OPT系列。\n1. [大型语言模型能否自我解释？](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11207) – 评估大型语言模型在情感分析任务中自动生成的解释的有效性。\n1. [野外的可解释性：GPT-2 small（arXiv）](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.00593) – 提供了GPT-2 small在自然语言处理中执行间接宾语识别（IOI）的机制性解释。\n1. [稀疏自编码器在语言模型中发现高度可解释的特征](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.08600) – 探讨使用稀疏自编码器从大型语言模型中提取更具可解释性且多义性较低的特征。\n1. [大型语言模型中涌现且可预测的记忆现象](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11158) – 研究大型语言模型会记忆哪些训练序列，并表明可以基于较小模型或训练早期的检查点预测记忆现象。\n1. 
[仅凭短视方法无法解释Transformer：以有界Dyck语法为例](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.01429) – 表明仅关注Transformer中的特定部分，如注意力头或权重矩阵，可能导致误导性的可解释性声明。\n1. [真理的几何：大型语言模型对真\u002F假数据集的表征中的涌现线性结构](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06824) – 本文研究了大型语言模型（LLMs）使用真\u002F假数据集时对真理的表征。\n1. [规模化可解释性：识别Alpaca中的因果机制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.08809) – 本研究提出了无边界分布式对齐搜索（Boundless DAS），这是一种先进的方法，用于解释Alpaca等大型语言模型。\n1. [表征工程：一种自上而下的AI透明度方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01405) – 介绍表征工程（RepE），这是一种新颖的方法，旨在提升AI透明度，侧重于高层表征而非神经元或电路。\n1. [用语言模型解释自然语言中的黑盒文本模块](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.09863) – 使用合成文本评估大型语言模型注意力头的自然语言解释。\n1. [N2G：一种可扩展的方法，用于量化大型语言模型中可解释神经元的表征](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12918) – 将每个大型语言模型神经元表示为一张图。\n1. [训练过程中用大型语言模型增强可解释模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.11799) – 利用大型语言模型构建文本数据的可解释分类器。\n1. [ChainPoll：一种高效的大型语言模型幻觉检测方法](https:\u002F\u002Fwww.rungalileo.io\u002Fblog\u002Fchainpoll) – ChainPoll是一种新型的幻觉检测方法，其效果显著优于现有替代方案；此外还有RealHall，这是一套精心策划的基准数据集，用于评估近期文献中提出的幻觉检测指标。\n1. [对一个经过符号式多步推理任务训练的Transformer的机制性分析](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11917) – 在一个接受过树形路径规划训练的Transformer中识别出反向链式回路。\n1. [从全局标签中检测局部洞察：基于卷积分解的监督与零样本序列标注](https:\u002F\u002Fdoi.org\u002F10.1162\u002Fcoli_a_00416) – 提出基于实例、度量学习的神经网络模型近似方法，以及可结合特定任务归纳偏好的硬注意力机制，从而实现有效的半监督学习（即特征检测）。这些机制相结合，可在神经网络的表征空间内提供基于示例的高效可解释性方法。\n1. [相似性-距离-幅度通用验证](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.20167) – 介绍SDM激活函数、SDM校准及SDM网络，这些是具备不确定性感知验证和基于示例可解释性等内在属性的神经网络（如大型语言模型）。有关更广泛影响的高层次概述，请参阅博文《可控AGI的决定因素》（https:\u002F\u002Fraw.githubusercontent.com\u002Fallenschmaltz\u002FResolute_Resolutions\u002Fmaster\u002Fvolume5\u002Fvolume5.pdf）。\n1. [揭开LLM的面纱：动态知识图谱中潜在表征的演变](https:\u002F\u002Fopenreview.net\u002Fforum?id=dWYRjT501w) – 基于激活补丁技术构建框架，将嵌入在LLM向量空间中的事实性知识表示为动态知识图谱。\n\n---\n\n\n\n### 大型语言模型可解释性相关文章\n*关于大型语言模型可解释性的深入文章和博客帖子。*\n\n1. [机器学习模型是记忆还是泛化？](https:\u002F\u002Fpair.withgoogle.com\u002Fexplorables\u002Fgrokking\u002F) - 一个交互式可视化工具，用于探索被称为“Grokking”的现象（VISxAI名人堂）。\n1. [语言模型学到了什么？](https:\u002F\u002Fpair.withgoogle.com\u002Fexplorables\u002Ffill-in-the-blank\u002F) - 一个交互式可视化工具，帮助理解大型语言模型的工作机制及其偏见的本质（VISxAI名人堂）。\n1. [一种新的计算方法重新构想人工智能](https:\u002F\u002Fwww.quantamagazine.org\u002Fa-new-approach-to-computation-reimagines-artificial-intelligence-20230413\u002F?mc_cid=ad9a93c472&mc_eid=506130a407) - 讨论了超维计算，这是一种利用超维向量（hypervectors）的新方法，旨在实现更高效、透明且鲁棒的人工智能。\n1. [解读GPT：对数几率视角](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002FAcKRB8wDpdaN6v6ru\u002Finterpreting-gpt-the-logit-lens) - 探讨了对数几率视角如何揭示GPT在不同层中概率预测的逐步收敛过程，从最初的无意义或浅层猜测逐渐过渡到更为精细的预测。\n1. [Grokking现象的机制性可解释性分析](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002FN6WM6hs7RQMKDhYjB\u002Fa-mechanistic-interpretability-analysis-of-grokking) - 探讨深度学习中的“grokking”现象，即模型在训练过程中突然从记忆转向泛化的转变。\n1. [机制性可解释性领域的200个具体开放问题](https:\u002F\u002Fwww.lesswrong.com\u002Fs\u002FyivyHaCAmMJ3CqSyj) - 一系列讨论机制性可解释性（MI）领域开放研究问题的文章，该领域专注于对神经网络进行逆向工程。\n1. [评估大型语言模型犹如雷区](https:\u002F\u002Fwww.cs.princeton.edu\u002F~arvindn\u002Ftalks\u002Fevaluating_llms_minefield\u002F) - 评估像GPT这样的大型语言模型性能和偏见所面临的挑战。\n1. [归因打补丁：工业规模的激活打补丁](https:\u002F\u002Fwww.neelnanda.io\u002Fmechanistic-interpretability\u002Fattribution-patching) - 一种利用梯度对神经网络中的激活打补丁进行线性近似的方法。\n1. [因果擦除：一种严格检验可解释性假设的方法 [Redwood Research]](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002FJvZhhzycHu2Yd57RN\u002Fcausal-scrubbing-a-method-for-rigorously-testing) - 介绍了因果擦除方法，用于评估神经网络中机制性解释的质量。\n1. 
[一个仅含注意力机制的四层Transformer中的Python文档字符串电路](https:\u002F\u002Fwww.lesswrong.com\u002Fposts\u002Fu6KXXmKFbXfWzoAXn\u002Fa-circuit-for-python-docstrings-in-a-4-layer-attention-only) - 研究了一个仅含注意力机制的四层Transformer模型中负责生成Python文档字符串的特定神经回路。\n1. [无需监督的情况下发现语言模型中的潜在知识](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.03827) - 提出一种方法，无需监督即可从语言模型的内部激活中提取对是非问题的准确答案。\n1. [迈向透明的人工智能：深度神经网络内部结构解释综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.13243) - 关于机制性可解释性的综述。\n\n---\n\n### 大型语言模型可解释性相关社群\n*致力于大型语言模型可解释性的社区和小组。*\n\n1. [PAIR](https:\u002F\u002Fpair.withgoogle.com) - 谷歌旗下的团队，致力于开发[开源工具](https:\u002F\u002Fgithub.com\u002Fpair-code)、[交互式探索性可视化](https:\u002F\u002Fpair.withgoogle.com\u002Fexplorables)以及[可解释性研究方法](https:\u002F\u002Fpair.withgoogle.com\u002Fresearch\u002F)。\n1. [Alignment Lab AI](https:\u002F\u002Fdiscord.com\u002Finvite\u002Fad27GQgc7K) - 专注于人工智能对齐的研究者群体。\n1. [Nous Research](https:\u002F\u002Fdiscord.com\u002Finvite\u002FAVB9jkHZ) - 讨论可解释性相关话题的研究小组。\n1. [EleutherAI](https:\u002F\u002Fdiscord.com\u002Finvite\u002F4pEBpVTN) - 非营利性人工智能研究实验室，专注于大型模型的可解释性和对齐问题。\n\n---\n\n### 大型语言模型综述论文\n*关于大型语言模型的综述论文。*\n1. [大型语言模型综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.18223) - 本综述论文提供了关于大型语言模型文献的最新回顾，对于研究人员和工程师而言都是很有价值的参考资料。\n\n## 贡献与合作\n详情请参阅[CONTRIBUTING](https:\u002F\u002Fgithub.com\u002FJShollaj\u002Fawesome-llm-interpretability\u002Fblob\u002Fmaster\u002FCONTRIBUTING.md)和[CODE-OF-CONDUCT](https:\u002F\u002Fgithub.com\u002FJShollaj\u002Fawesome-llm-interpretability\u002Fblob\u002Fmaster\u002FCODE-OF-CONDUCT.md)。","# Awesome LLM Interpretability 快速上手指南\n\n`awesome-llm-interpretability` 并非单一的软件库，而是一个精选的资源列表，汇集了用于大语言模型（LLM）可解释性研究的工具、论文和社区。本指南将指导你如何从该列表中选择核心工具并进行环境配置与基础使用。\n\n## 环境准备\n\n由于列表中包含多个独立的 Python 项目，建议为每个工具创建独立的虚拟环境以避免依赖冲突。\n\n*   **操作系统**：Linux (推荐), macOS, Windows (部分工具可能需要 WSL2)\n*   **Python 版本**：建议安装 `Python 3.9` 或 `3.10`（大多数现代 LLM 工具对此支持最佳）\n*   **前置依赖**：\n    *   `git`: 用于克隆仓库\n    *   `pip` 或 `conda`: 包管理工具\n    *   `CUDA` (可选): 若需本地运行大型模型推理或训练，建议安装 NVIDIA CUDA Toolkit 及对应驱动\n\n**国内加速建议**：\n在安装 Python 依赖时，推荐使用清华或阿里镜像源以提升下载速度：\n```bash\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003Cpackage_name>\n```\n\n## 安装步骤\n\n由于这是一个资源列表，你需要根据需求选择具体的工具进行安装。以下以列表中两个最具代表性的工具 **TransformerLens** (机制可解释性) 和 **LIT** (可视化分析) 为例演示安装过程。\n\n### 1. 创建虚拟环境\n```bash\npython -m venv llm-interp-env\nsource llm-interp-env\u002Fbin\u002Factivate  # Windows 用户请使用: llm-interp-env\\Scripts\\activate\n```\n\n### 2. 安装 TransformerLens (用于机制可解释性研究)\n该工具专为分析生成式语言模型的内部机制设计。\n\n```bash\npip install transformer-lens\n# 或者从源码安装最新开发版\n# pip install git+https:\u002F\u002Fgithub.com\u002Fneelnanda-io\u002FTransformerLens.git\n```\n\n### 3. 安装 LIT (The Learning Interpretability Tool)\n该工具提供通用的模型可视化和分析平台。\n\n```bash\npip install lit-nlp\n```\n\n> **注意**：列表中其他工具（如 `Pythia`, `Rome`, `Inseq` 等）请前往其对应的 GitHub 仓库页面，按照各自的 `README` 指示进行安装。\n\n## 基本使用\n\n以下展示如何使用上述两个核心工具进行最简单的可解释性分析。\n\n### 场景一：使用 TransformerLens 分析模型注意力\n此示例加载一个小型 GPT-2 模型，并查看其注意力模式。\n\n```python\nfrom transformer_lens import HookedTransformer\n\n# 加载预训练模型 (自动下载)\nmodel = HookedTransformer.from_pretrained(\"gpt2-small\")\n\n# 运行前向传播并获取 logits\ntext = \"The quick brown fox\"\nlogits = model(text)\n\n# 使用内置功能查看注意力头 (Attention Heads) 的激活情况\n# 这里仅展示如何调用，具体可视化通常结合 Jupyter Notebook 使用\nprint(f\"Model output shape: {logits.shape}\")\n\n# 示例：提取特定层的注意力模式\n# 注意：run_with_cache 返回 (logits, cache) 二元组\nlogits, cache = model.run_with_cache(text)\nattn_pattern = cache[\"pattern\", 0]  # 获取第 0 层的注意力模式\nprint(f\"Attention pattern shape: {attn_pattern.shape}\")\n```
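\n\n作为场景一的延伸，README 文章部分收录的《Interpreting GPT: the logit lens》介绍了“逐层读出中间预测”的分析思路。下面给出一个最小的 logit lens 草图，仅供参考（假设 gpt2-small 权重可正常下载，`resid_post`、`unembed` 等调用以 TransformerLens 官方文档为准）：\n\n```python\nfrom transformer_lens import HookedTransformer\n\nmodel = HookedTransformer.from_pretrained(\"gpt2-small\")\ntext = \"The Eiffel Tower is in the city of\"\nlogits, cache = model.run_with_cache(text)\n\n# logit lens：对每一层之后的残差流应用最终 LayerNorm 与解嵌入矩阵，\n# 观察各层的“中间预测”如何逐层收敛到模型的最终输出\nfor layer in range(model.cfg.n_layers):\n    resid = cache[\"resid_post\", layer]  # 形状: [batch, pos, d_model]\n    layer_logits = model.unembed(model.ln_final(resid))\n    top_token = layer_logits[0, -1].argmax()\n    print(f\"Layer {layer}: {model.tokenizer.decode(top_token)}\")\n```\n\n这一“逐层读出”只是演示性质的草图，但它与列表中多篇论文（如残差流分析、字典学习）所依赖的基本观测手段是一致的。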
\n\n### 场景二：使用 LIT 启动可视化服务\n此示例启动一个本地 Web 服务，用于交互式探索模型预测结果。\n\n```bash\n# 假设你已经有一个训练好的 HuggingFace 模型或预测文件\n# 以下为示意命令：LIT 的实际启动入口因版本与模型封装而异，请以官方文档为准\nlit_nlp_demo --dataset=\u003Cpath_to_your_data> --model=\u003Cpath_to_your_model>\n```\n\n在浏览器中访问终端输出的地址（通常为 `http:\u002F\u002Flocalhost:7007`），即可通过图形界面分析样本的显著性（Saliency）、注意力归因（Attention Attribution）等指标。\n\n### 进阶资源获取\n若要深入研究列表中提到的具体算法（如自动化神经元解释、稀疏自编码器等），请直接克隆对应仓库：\n\n```bash\n# 示例：克隆自动化可解释性工具\ngit clone https:\u002F\u002Fgithub.com\u002Fopenai\u002Fautomated-interpretability.git\ncd automated-interpretability\npip install -r requirements.txt\n```\n\n建议查阅原仓库中的 `examples\u002F` 目录或相关论文（列表第二部分）以获取针对特定研究任务的详细代码实现。","某金融科技公司的算法团队正在调试一个用于自动审批贷款申请的 LLM 系统，近期发现模型会毫无征兆地拒绝信用良好的少数族裔申请人，急需定位偏差根源。\n\n### 没有 awesome-llm-interpretability 时\n- 团队面对黑盒模型束手无策，只能靠猜测调整提示词，无法确定是训练数据偏见还是注意力机制出错。\n- 缺乏统一的分析工具库，工程师需花费数周在海量论文和零散代码中筛选适用的解释性方法，研发效率极低。\n- 无法可视化神经元激活或注意力图谱，导致向合规部门汇报时只能提供模糊的“可能原因”，难以通过审计。\n- 尝试手动修改模型内部参数以修正错误时，因缺乏如 Rome 或 Sparse Autoencoder 等精准编辑工具，极易破坏模型其他能力。\n\n### 使用 awesome-llm-interpretability 后\n- 团队快速锁定 TransformerLens 和 LIT 等工具，直接可视化注意力头，发现模型过度关注申请人姓名中的特定字符而非财务数据。\n- 借助清单中收录的 Automated Interpretability 代码，自动生成神经元行为解释，将原本数周的排查工作缩短至两天。\n- 利用 Neuron Viewer 生成直观的激活热力图作为证据，清晰地向监管机构展示了偏差产生的具体路径，顺利通过合规审查。\n- 应用 Rome 工具精准定位并编辑了存储错误事实关联的神经元，在不重新训练模型的情况下修复了歧视问题，且未影响整体性能。\n\nawesome-llm-interpretability 将原本如同“盲人摸象”的模型调试过程，转变为可观测、可解释且可精准干预的科学工程流程。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FJShollaj_awesome-llm-interpretability_8573fe9d.png","JShollaj","Xhoni Shollaj","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FJShollaj_131863a5.jpg",null,"Singapore","https:\u002F\u002Fgithub.com\u002FJShollaj",1488,110,"2026-04-11T23:30:26",1,"","未说明",{"notes":85,"python":83,"dependencies":86},"该仓库是一个 curated list（精选列表），主要收集了关于大语言模型可解释性的工具、论文、文章和社区链接，本身不是一个可直接运行的单一软件工具。因此，README 中未包含具体的操作系统、GPU、内存、Python 版本或依赖库的安装需求。用户若需使用列表中提到的具体工具（如 TransformerLens, Pythia, Rome 等），需前往各工具独立的 GitHub 仓库查看其特定的环境配置要求。",[],[35,14,88],"其他",[90,91],"awesome","awesome-list","2026-03-27T02:49:30.150509","2026-04-19T06:02:48.270412",[],[]]