[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-yandex-research--tabm":3,"tool-yandex-research--tabm":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",158594,2,"2026-04-16T23:34:05",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":77,"owner_website":78,"owner_url":79,"languages":80,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":32,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":105,"github_topics":76,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":107,"updated_at":108,"faqs":109,"releases":145},8126,"yandex-research\u002Ftabm","tabm","(ICLR 2025) TabM: Advancing Tabular Deep Learning With Parameter-Efficient Ensembling","TabM 是一款专为表格数据深度学习设计的开源模型，旨在通过高效的集成学习策略提升预测性能。在传统机器学习任务中，表格数据处理往往依赖梯度提升树（GBDT），而深度学习方法虽潜力巨大却常受限于训练成本高或泛化能力不足。TabM 巧妙地在单个神经网络架构内模拟了多个多层感知机（MLP）的集成效果，既保留了集成学习的强大表现力，又大幅降低了资源消耗。\n\n其核心技术亮点在于“并行训练”与“权重共享”。不同于传统集成方法需独立训练多个模型，TabM 允许在训练过程中实时监控整体集成效果并适时停止，同时所有子网络共享大部分参数，这不仅显著提升了运行速度和内存效率，还起到了一种有效的正则化作用，从而在真实工业数据集上取得了优于现有深度学习方法的成果。该模型已在多个 Kaggle 竞赛中助力团队夺冠，并能从容应对包含数千万甚至上亿条记录的大规模数据场景。\n\nTabM 非常适合从事数据挖掘、预测建模的算法工程师、数据科学家以及学术研究人员使用。如果你正在寻找一种既能媲美顶级集成模型性能，又具备深度学习灵活性与扩展性的解决方案，TabM 值得尝试。","# TabM: Advancing Tabular Deep Learning With Parameter-Efficient Ensembling\" (ICLR 2025)\u003C!-- omit in toc -->\n\n:scroll: [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.24210)\n&nbsp; :books: [Other tabular DL projects](https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Frtdl)\n\nThis is the official repository of the paper \"TabM: Advancing Tabular Deep Learning With\nParameter-Efficient Ensembling\".\nIt consists of two parts:\n- [**Python package**](#python-package) described in this document.\n- [**Paper-related content**](.\u002Fpaper\u002FREADME.md) (code, metrics, hyperparameters, etc.) described in `paper\u002FREADME.md`.\n\n\u003Cbr>\n\n\u003Cdetails>\n\u003Csummary>TabM on \u003Cb>Kaggle\u003C\u002Fb> (as of June 2025)\u003C\u002Fsummary>\n\n- TabM was used in [the winning solution](https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fum-game-playing-strength-of-mcts-variants\u002Fdiscussion\u002F549801) in the competition by UM.\n- TabM was used in [the winning solution](https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fequity-post-HCT-survival-predictions\u002Fdiscussion\u002F566550), as well as in the top-3, top-4, top-5 and many other solutions in the competition by CIBMTR. 
Later, it turned out that it was possible to achieve the [25-th place](https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fequity-post-HCT-survival-predictions\u002Fdiscussion\u002F567863) out of 3300+ with only TabM, without ensembling it with other models.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>TabM on \u003Cb>TabReD\u003C\u002Fb> (a challenging benchmark)\u003C\u002Fsummary>\n\n[TabReD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.19380) is a benchmark based on **real-world industrial datasets** with **time-related distribution drifts** and **hundreds of features**, which makes it more challenging than traditional benchmarks. The figure below shows that TabM achieves higher performance on TabReD (plus one more real-world dataset) compared to prior tabular DL methods.\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyandex-research_tabm_readme_7801a7e73aed.png\" width=35% display=block margin=auto>\n\n*One dot represents a performance score on one dataset. For a given model, a diamond represents the mean value across the datasets.*\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Training and inference efficiency\u003C\u002Fsummary>\n\nTabM is a simple and reasonably efficient model, which makes it suitable for **real-world applications**, including large datasets. The biggest dataset used in the paper contains **13M objects**, and we are aware of a successful training run on **100M+ objects**, though training takes more time in such cases.\n\nThe figure below shows that TabM is relatively slower than MLPs and GBDT, but faster than prior tabular DL methods. Note that (1) the inference throughput was measured on a single CPU thread and *without any optimizations*, in particular without the TabM-specific acceleration technique described later in this document; (2) the left plot uses the *logarithmic* scale.\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyandex-research_tabm_readme_0a7088993bca.png\" display=block margin=auto>\n\n*One dot represents a measurement on one dataset. For a given model, a diamond represents the mean value across the datasets. In the left plot,* $\\mathrm{TabM_{mini}^{\\dagger*}}$ *denotes* $\\mathrm{TabM_{mini}^{\\dagger}}$ *trained with mixed precision and `torch.compile`.*\n\n\u003C\u002Fdetails>\n\n# TL;DR\u003C!-- omit in toc -->\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyandex-research_tabm_readme_711045dca83f.png\" width=65% display=block margin=auto>\n\n**TabM** (**Tab**ular DL model that makes **M**ultiple predictions) is a simple and powerful tabular DL architecture that efficiently imitates an ensemble of MLPs. The two main differences of TabM compared to a regular ensemble of MLPs:\n- **Parallel training** of the MLPs. This allows monitoring the performance of the ensemble during the training and stopping the training when it is optimal for the ensemble, not for individual MLPs.\n- **Weight sharing** between the MLPs. In fact, the whole TabM fits in just *one* MLP-like model. Not only this significantly improves the runtime and memory efficiency, but also turns out to be an effective regularization leading to better task performance.\n\n# Reproducing experiments and browsing results\u003C!-- omit in toc -->\n\n> [!IMPORTANT]\n> To use TabM in practice and for future work, use the `tabm` package described below.\n\nThe [paper-related content](.\u002Fpaper\u002FREADME.md) (code, metrics, hyperparameters, etc.) 
is located in the `paper\u002F` directory and is described in `paper\u002FREADME.md`.\n\n# Python package\u003C!-- omit in toc -->\n\n`tabm` is a PyTorch-based Python package providing the TabM model, as well as layers and tools for building custom TabM-like architectures (i.e. efficient ensembles of MLP-like models).\n\n- [**Installation**](#installation)\n- [**Basic usage**](#basic-usage)\n    - [Creating TabM](#creating-tabm)\n    - [Creating TabM with feature embeddings](#creating-tabm-with-feature-embeddings)\n    - [Using TabM with custom inputs and input modules](#using-tabm-with-custom-inputs-and-input-modules)\n    - [Training](#training)\n    - [Inference](#inference)\n- [**Examples**](#examples)\n- [Advanced usage](#advanced-usage)\n    - [Intuition](#intuition)\n    - [`EnsembleView`](#ensembleview)\n    - [MLP ensembles](#mlp-ensembles)\n    - [Important implementation details](#important-implementation-details)\n    - [Example: Simple ensemble without weight sharing](#example-simple-ensemble-without-weight-sharing)\n    - [Example: MiniEnsemble](#example-miniensemble)\n    - [Example: BatchEnsemble](#example-batchensemble)\n    - [Example: a custom architecture](#example-a-custom-architecture)\n    - [Turning an existing model to an efficient ensemble](#turning-an-existing-model-to-an-efficient-ensemble)\n- [Hyperparameters](#hyperparameters)\n    - [Default model](#default-model)\n    - [Default optimizer](#default-optimizer)\n    - [`arch_type`](#arch_type)\n    - [`k`](#k)\n    - [`num_embeddings`](#num_embeddings)\n    - [Initialization](#initialization)\n    - [Hyperparameter tuning](#hyperparameter-tuning)\n- [Practical notes](#practical-notes)\n    - [Inference efficiency](#inference-efficiency)\n- [API](#api)\n\n# **Installation**\n\n```\npip install tabm\n```\n\n# **Basic usage**\n\nThis section shows how to create a model in typical use cases, and gives high-level comments on\ntraining and inference.\n\n## Creating TabM\n\nThe below example showcases the basic version of TabM without feature embeddings.\nFor better performance, `num_embeddings` should usually be passed as explained in the next section.\n\n> [!NOTE]\n> `TabM.make(...)` used below adds default hyperparameters based on the provided arguments.\n\n\u003C!-- test main -->\n```python\nimport torch\nfrom tabm import TabM\n\n# >>> Common setup for all subsequent sections.\nd_out = 1  # For example, one regression task.\nbatch_size = 256\n\n# The dataset has 24 numerical (continuous) features.\nn_num_features = 24\n\n# The dataset has 2 categorical features.\n# The first categorical feature has 3 unique categories.\n# The second categorical feature has 7 unique categories.\ncat_cardinalities = [3, 7]\n# \u003C\u003C\u003C\n\nmodel = TabM.make(\n    n_num_features=n_num_features,\n    cat_cardinalities=cat_cardinalities,  # One-hot encoding will be used.\n    d_out=d_out,\n)\nx_num = torch.randn(batch_size, n_num_features)\nx_cat = torch.column_stack([\n    # The i-th categorical features must take values in range(0, cat_cardinalities[i]).\n    torch.randint(0, c, (batch_size,)) for c in cat_cardinalities\n])\ny_pred = model(x_num, x_cat)\n\n# TabM represents an ensemble of k models, hence k predictions per object.\nassert y_pred.shape == (batch_size, model.k, d_out)\n```\n\n## Creating TabM with feature embeddings\n\nOn typical tabular tasks, the best performance is usually achieved by passing feature embedding\nmodules as `num_embeddings` (in the paper, TabM with embeddings is denoted as 
$\\mathrm{TabM^\\dagger}$). `TabM` supports several feature embedding modules from the\n[`rtdl_num_embeddings`](https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Frtdl-num-embeddings\u002Fblob\u002Fmain\u002Fpackage\u002FREADME.md)\npackage. The below example showcases the simplest embedding module `LinearReLUEmbeddings`.\n\n> [!TIP]\n> The best performance is usually achieved with more advanced embeddings, such as\n> `PiecewiseLinearEmbeddings` and `PeriodicEmbeddings`. Their usage is covered in the end-to-end usage [example](#examples).\n\n\u003C!-- test main _ -->\n```python\nfrom rtdl_num_embeddings import LinearReLUEmbeddings\n\nmodel = TabM.make(\n    n_num_features=n_num_features,\n    num_embeddings=LinearReLUEmbeddings(n_num_features),\n    d_out=d_out\n)\nx_num = torch.randn(batch_size, n_num_features)\ny_pred = model(x_num)\n\nassert y_pred.shape == (batch_size, model.k, d_out)\n```\n\n## Using TabM with custom inputs and input modules\n\n> [!TIP]\n> The implementation of `tabm.TabM` is a good example of defining inputs and input modules in\n> TabM-based models.\n\nAssume that you want to change what input TabM takes or how TabM handles the input, but you still\nwant to use TabM as the backbone. Then, a typical usage looks as follows:\n\n```python\nfrom tabm import EnsembleView, make_tabm_backbone, LinearEnsemble\n\n\nclass Model(nn.Module):\n    def __init__(self, ...):\n        # >>> Create any custom modules.\n        ...\n        # \u003C\u003C\u003C\n\n        # Create the ensemble input module.\n        self.ensemble_view = EnsembleView(...)\n        # Create the backbone.\n        self.backbone = make_tabm_backbone(...)\n        # Create the prediction head.\n        self.output = LinearEnsemble(...)\n\n    def forward(self, arg1, arg2, ...):\n        # Transform the input as needed to one tensor.\n        # This step can include feature embeddings\n        # and all other kinds of feature transformations.\n        # `handle_input` is a hypothetical user-defined function.\n        x = handle_input(arg1, arg2, ...)  # -> (B, D) or (B, k, D)\n\n        # The only difference from conventional models is\n        # the call of self.ensemble_view.\n        x = self.ensemble_view(x)  # -> (B, k, D)\n        x = self.backbone(x)\n        x = self.output(x)\n        return x  # -> (B, k, d_out)\n```\n\n> [!NOTE]\n> Regarding the shape of `x` in the line `x = handle_input(...)`:\n> - TabM can be used as a conventional MLP-like backbone, which corresponds to `x` having the\n>   standard shape `(B, D)` during both training and inference. This approach is recommended by\n>   default due to its simplicity and better efficiency.\n> - There is also an advanced training strategy, where the shape of `x` is `(B, k, D)`\n>   during training and `(B, D)` during inference.\n>\n> The end-to-end usage [example](#examples) covers both approaches.\n\n## Training\n\n**It is crucial to train the `k` predictions of TabM independently without averaging them.**\nIn other words, the *mean loss* must be optimized, *not* the loss of the mean prediction. The\nend-to-end usage [example](#examples) provides a complete reference on how to train TabM.\n\n## Inference\n\nAt inference time, to obtain a prediction for a given object, average the `k` predictions. The exact\naveraging strategy depends on the task and loss function. For example, on classification tasks,\n*probabilities* should usually be averaged, not logits. 
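\n\nA minimal sketch of both points for a regression task (`y_pred` comes from the snippets above; `y_true` with the shape `(batch_size,)` is a hypothetical target tensor):\n\n```python\nimport torch.nn.functional as F\n\n# Training: optimize the mean loss over all k predictions,\n# i.e. broadcast the targets to the shape of y_pred.\nloss = F.mse_loss(y_pred, y_true[:, None, None].expand_as(y_pred))\n\n# Inference: average the k predictions per object.\ny_final = y_pred.mean(dim=1)  # (batch_size, d_out)\n# For classification, average probabilities: y_pred.softmax(-1).mean(dim=1)\n```\n\n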
The end-to-end usage [example](#examples) shows how to make predictions with TabM.\n\n# **Examples**\n\n`example.ipynb` provides an end-to-end example of training TabM:\n- [View on GitHub](.\u002Fexample.ipynb)\n- [Open in Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fyandex-research\u002Ftabm\u002Fblob\u002Fmain\u002Fexample.ipynb)\n\n# Advanced usage\n\n> [!TIP]\n> Try and tune TabM before building custom models. The simplicity of TabM can be deceptive, while in\n> fact it is a strong baseline despite using the vanilla MLP as the base model.\n\nThis part of the package goes beyond the TabM paper and provides building blocks for creating\ncustom TabM-like architectures, including:\n- Efficient ensembles of MLPs, linear layers and normalization layers.\n- Other layers useful for efficient ensembles, such as `EnsembleView` and `ElementwiseAffine`.\n- Functions for converting single models to efficient ensembles in-place.\n- And other tools.\n\nSome things are discussed in dedicated sections, and the rest is shown in the examples below.\n\n> [!IMPORTANT]\n> Understanding the implementation details of TabM is important for constructing *correct* and\n> *effective* TabM-like models. Some of them are discussed later in this section. Other important\n> references:\n> - The source code of this package, in particular the `TabM` model and the `make_tabm_backbone`\n>   function.\n> - The TabM paper, in particular the $\\mathrm{TabM_{mini}}$ paragraph of Section 3.3 (arXiv v3).\n\n## Intuition\n\nRecall that, in conventional MLP-like models, a typical module represents one layer applied to\na tensor of the shape `(B, D)`, where `B` is the batch size and `D` is the latent representation\nsize. By contrast, a typical module in this package:\n- Represents an ensemble of `k` layers applied in parallel to `k` inputs.\n- Operates on tensors of the shape `(B, k, D)` representing `k` inputs (one per layer).\n\nExamples:\n- `LinearEnsemble` is an ensemble of `k` independent linear layers.\n- `LinearBatchEnsemble` is an ensemble of `k` linear layers sharing most of their weights. Note that\n  weight sharing does not change how the module is applied: it still represents `k` layers operating\n  in parallel over `k` inputs.\n\n## `EnsembleView`\n\n`EnsembleView` is a special lightweight module doing one simple thing:\n- Tensors of the shape `(B, D)` are turned to tensors of the shape `(B, k, D)` storing `k` identical\n  views of the original tensor. This is a cheap copy-free operation.\n- Tensors of the shape `(B, k, D)` are propagated as-is without any changes.\n\n## MLP ensembles\n\nThe package provides the following efficient MLP ensembles:\n- `MLPBackboneBatchEnsemble` (used by $\\mathrm{TabM}$)\n- `MLPBackboneMiniEnsemble` (used by $\\mathrm{TabM_{mini}}$)\n- `MLPBackboneEnsemble` (used by $\\mathrm{TabM_{packed}}$)\n\n> [!NOTE]\n> The difference between creating the above ensembles directly or with the `tabm.make_tabm_backbone`\n> function is that certain details in `make_tabm_backbone` are optimized for TabM. Those details\n> may be useful outside of TabM, too, but this is not explored.\n\nContrary to `TabM`, they accept only one three-dimensional input of the shape `(batch_size, k, d)`.\nThus, a user is responsible for converting the input to one tensor (e.g. using embeddings, one-hot\nencoding, etc.) storing either `k` views of the same object or `k` full-fledged batches. 
A basic\nusage example:\n\n\u003C!-- test main -->\n```python\nimport tabm\nimport torch\nimport torch.nn as nn\n\nd_in = 24\nd_out = 1\nk = 32\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    tabm.MLPBackboneBatchEnsemble(\n        d_in=d_in,\n        n_blocks=3,\n        d_block=512,\n        dropout=0.1,\n        k=k,\n        tabm_init=True,\n        scaling_init='normal',\n        start_scaling_init_chunks=None,\n    ),\n    tabm.LinearEnsemble(512, d_out, k=k)\n)\nx = torch.randn(batch_size, d_in)\ny_pred = model(x)\nassert y_pred.shape == (batch_size, k, d_out)\n```\n\n## Important implementation details\n\nThis section covers implementation details that are important for building custom TabM-like models.\n\n**The order of layers.** The most important guideline is that **the $k$ different object representations\nshould be created before the tabular features are mixed with linear layers**. This follows directly\nfrom the $\\mathrm{TabM_{mini}}$ paragraph of Section 3.3 of the paper (arXiv V3).\n\n```python\nd_in = 24\nd = 512\nk = 16\n\n# GOOD: collectively, the first two modules create k\n# different object representations before the first\n# linear layer.\nscaling_init = \"normal\"  # or \"random-signs\"\ngood_model = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    tabm.ElementwiseAffine((k, d_in), bias=False, scaling_init=scaling_init),\n    nn.Linear(d_in, d),\n    ...\n)\n\n# GOOD: internally, LinearBatchEnsemble starts with a\n# non-shared elementwise scaling, which diversifies\n# the k object representations before the\n# linear transformation.\nscaling_init = \"normal\"  # or \"random-signs\"\n                         # or (\"normal\", \"ones\")\n                         # or (\"random-signs\", \"ones\")\ngood_model = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    tabm.LinearBatchEnsemble(d_in, d, k=k, scaling_init=scaling_init),\n    ...\n)\n\n# BAD: the tabular features are mixed before\n# the ensemble starts.\nbad_model = nn.Sequential(\n    nn.Linear(d_in, d),\n    tabm.EnsembleView(k=k),\n    nn.ReLU(),\n    ...\n)\n\n# BAD: the k representations are created before the\n# first linear transformations, but these representations\n# are not different. Mathematically, the below snippet is\n# equivalent to the previous one.\nbad_model = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    nn.Linear(d_in, d),\n    nn.ReLU(),\n    ...\n)\n```\n\n**Weight sharing.** When choosing between `torch.nn.Linear` (fully sharing linear layers between\nensemble members), `tabm.LinearBatchEnsemble` (sharing most of the weights) and\n`tabm.LinearEnsemble` (no weight sharing), keep in mind that parameter-efficient ensembling\nstrategies based on weight sharing (e.g. BatchEnsemble and MiniEnsemble) not only significantly\nimprove the efficiency of TabM, but also improve its task performance. Thus, weight sharing seems\nto be an effective regularization. 
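\n\nAs a rough way to see how much each option shares, one can compare parameter counts (a sketch only; the exact numbers depend on the constructor options, and `scaling_init='ones'` is just one valid choice):\n\n```python\nimport torch.nn as nn\nimport tabm\n\nd, k = 512, 32\nlayers = {\n    'nn.Linear (fully shared)': nn.Linear(d, d),\n    'LinearBatchEnsemble (mostly shared)': tabm.LinearBatchEnsemble(d, d, k=k, scaling_init='ones'),\n    'LinearEnsemble (no sharing)': tabm.LinearEnsemble(d, d, k=k),\n}\nfor name, layer in layers.items():\n    print(name, sum(p.numel() for p in layer.parameters()))\n```\n\n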
However, it remains underexplored what is the optimal \"amount\" of\nthis regularization and how it depends on a task.\n\n## Example: Simple ensemble without weight sharing\n\nThe following code is a reimplementation of `tabm.MLPBackboneEnsemble`:\n\n\u003C!-- test main _ -->\n```python\nk = 32\nd_in = 24\nd = 512\nd_out = 1\ndropout = 0.1\n\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n\n    # >>> MLPBackboneEnsemble(n_blocks=2)\n    tabm.LinearEnsemble(d_in, d, k=k),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n\n    tabm.LinearEnsemble(d, d, k=k),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n    # \u003C\u003C\u003C\n\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\n```\n\n## Example: MiniEnsemble\n\nMiniEnsemble is a simple parameter-efficient ensembling strategy:\n1. Create $k$ different representations of an object by passing it through $k$ non-shared randomly\n   initialized affine transformations.\n2. Pass the $k$ representations in parallel through one shared backbone. Any backbone can be used.\n3. Make predictions with non-shared heads.\n\nThe code below is a reimplementation of `tabm.MLPBackboneMiniEnsemble`. In fact,\n`backbone` can be any MLP-like model. The only requirement for `backbone` is to support an arbitrary\nnumber of batch dimensions, since `EnsembleView` adds a new dimension. Alternatively, one can\nreshape the representation before and after the backbone.\n\n\u003C!-- test main _ -->\n```python\nd_in = 24\nd = 512\nd_out = 1\nk = 32\n\n# Any MLP-like backbone can be used.\nbackbone = tabm.MLPBackbone(\n    d_in=d_in, n_blocks=2, d_block=d, dropout=0.1\n)\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n\n    # >>> MLPBackboneMiniEnsemble\n    tabm.ElementwiseAffine((k, d_in), bias=False, scaling_init='normal'),\n    backbone,\n    # \u003C\u003C\u003C\n\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\n```\n\n## Example: BatchEnsemble\n\nThe following code is a reimplementation of `tabm.MLPBackboneBatchEnsemble` with `n_blocks=2`:\n\n\u003C!-- test main _ -->\n```python\nk = 32\nd_in = 24\nd = 512\ndropout = 0.1\ntabm_init = True  # TabM-style initialization\nscaling_init = 'normal'  # or 'random-signs'\n\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n\n    # >>> MLPBackboneBatchEnsemble(n_blocks=2)\n    tabm.LinearBatchEnsemble(\n        d_in, d, k=k,\n        scaling_init=(scaling_init, 'ones') if tabm_init else scaling_init,\n    ),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n\n    tabm.LinearBatchEnsemble(\n        d, d, k=k,\n        scaling_init='ones' if tabm_init else scaling_init\n    ),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n    # \u003C\u003C\u003C\n\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\n```\n\n## Example: a custom architecture\n\nA random and most likely **bad** architecture showcasing various layers available in the package:\n\n\u003C!-- test main _ -->\n```python\nd_in = 24\nd = 512\nd_out = 1\nk = 16\ndropout = 0.1\n\nmodel = nn.Sequential(\n                                         #    (B, d_in) or (B, k, d_in)\n    tabm.EnsembleView(k=k),              # -> (B, k, d_in)\n\n    # Most of the weights are shared\n    tabm.LinearBatchEnsemble(            # -> (B, k, d)\n        d_in, d, k=k, scaling_init=('random-signs', 'ones')\n    ),\n    nn.ReLU(),                           # -> (B, k, d)\n    nn.Dropout(dropout),                 # -> (B, k, d)\n\n    # No weight sharing\n    tabm.BatchNorm1dEnsemble(d, k=k),    # -> (B, k, d)\n\n    # No weight sharing\n    tabm.LinearEnsemble(                 # -> (B, k, d)\n        d, d, k=k\n    
),\n    nn.ReLU(),                           # -> (B, k, d)\n    nn.Dropout(dropout),                 # -> (B, k, d)\n\n    # The weights are fully shared\n    nn.Linear(                           # -> (B, k, d)\n        d, d,\n    ),\n    nn.ReLU(),                           # -> (B, k, d)\n    nn.Dropout(dropout),                 # -> (B, k, d)\n\n    # No weight sharing\n    tabm.ElementwiseAffine(              # -> (B, k, d)\n        (k, d), bias=True, scaling_init='normal'\n    ),\n\n    # The weights are fully shared\n    tabm.MLPBackbone(                    # -> (B, k, d)\n        d_in=d, n_blocks=2, d_block=d, dropout=0.1\n    ),\n\n    # Almost all the weights are shared\n    tabm.MLPBackboneMiniEnsemble(        # -> (B, k, d)\n        d_in=d, n_blocks=2, d_block=d, dropout=0.1,\n        k=k, affine_bias=False, affine_scaling_init='normal',\n    ),\n\n    # No weight sharing\n    tabm.LinearEnsemble(d, d_out, k=k),  # -> (B, k, d_out)\n)\n\nx = torch.randn(batch_size, d_in)\ny_pred = model(x)\nassert y_pred.shape == (batch_size, k, d_out)\n```\n\n## Turning an existing model to an efficient ensemble\n\n> [!WARNING]\n> The approach discussed below requires full understanding of all implementation details of both\n> the original model and efficient ensembling, because it does not provide any guarantees on\n> correctness and performance of the obtained models.\n\nAssume that you have an MLP-like model, and you want to quickly evaluate the potential of efficient\nensembling applied to your model without changing the model's source code. Then, you can create a\nsingle-model instance and turn it to an efficient ensemble by replacing its layers with their\nensembled versions. To that end, the package provides the following functions:\n- `tabm.batchensemble_linear_layers_`\n- `tabm.ensemble_linear_layers_`\n- `tabm.ensemble_batchnorm1d_layers_`\n- `tabm.ensemble_layernorm_layers_`\n\nFor example, the following code creates a full-fledged ensemble of $k$ independent MLPs:\n\n\u003C!-- test main _ -->\n```python\nd_in = 24\nd = 512\nd_out = 1\n\n# Create one standard MLP backbone.\nbackbone = tabm.MLPBackbone(\n    d_in=d_in, n_blocks=3, d_block=d, dropout=0.1\n)\n\n# Turn the one backbone into an efficient ensemble.\nk = 32\ntabm.ensemble_linear_layers_(backbone, k=k)\n\n# Compose the final model.\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    backbone,\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\nassert model(torch.randn(batch_size, d_in)).shape == (batch_size, k, d_out)\n```\n\n# Hyperparameters\n\n## Default model\n\n`TabM.make` allows one to create TabM with the default hyperparameters and overwrite them as needed:\n\n> [!NOTE]\n> The default hyperparameters are not \"constant\", i.e. 
they depend on the provided arguments.\n\n\u003C!-- test main _ -->\n```python\n# TabM with default hyperparameters.\nmodel = TabM.make(n_num_features=16, d_out=1)\n\n# TabM with custom n_blocks\n# and all other hyperparameters set to their default values.\nmodel = TabM.make(n_num_features=16, d_out=1, n_blocks=2)\n```\n\n## Default optimizer\n\nCurrently, for the default TabM, the default optimizer is\n`torch.optim.AdamW(..., lr=0.002, weight_decay=0.0003)`.\n\n## `arch_type`\n\n*TL;DR: by default, use TabM.*\n\n- `'tabm'` is the default value and is expected to provide the best performance in most cases.\n- `'tabm-mini'` may result in faster training and\u002For inference without a significant performance\n  drop, though this option may require a bit more precision when choosing `d_block` and `n_blocks`\n  for a given `k`. `'tabm-mini'` can occasionally provide slightly better performance if the\n  higher degree of regularization turns out to be beneficial for a given task.\n- `'tabm-packed'` is implemented only for completeness. In most cases, it results in a slower,\n  heavier and weaker model.\n\n## `k`\n\n> [!TIP]\n> All the following points about `k`, except for the first one, are inspired by Figure 7 from the\n> paper (arXiv v3).\n\n- If you want to tune `k`, consider running independent hyperparameter tuning runs with different,\n  but *fixed* values of `k`. The motivation is that changing `k` can affect the optimal\n  values of other hyperparameters, which can hinder the hyperparameter tuning process if `k` is\n  tuned together with other hyperparameters.\n- For a given depth and width, increasing `k` up to a certain threshold can improve performance.\n  Beyond the threshold, the performance may not improve and can even become worse. This effect is\n  more pronounced for `arch_type='tabm-mini'` (not shown in the figure).\n- For exploration purposes, one can use lower values of `k` (e.g. 24 or 16) and still get\n  competitive results (though changing `k` can require retuning other hyperparameters).\n- As a rule of thumb, if you increase `k`, consider increasing `d_block` or `n_blocks` (or both).\n  The intuition is that, because of the weight sharing, a larger ensemble size may require a\n  larger base architecture to successfully accommodate the `k` submodels.\n- If the size of your dataset is similar to the datasets used in the paper,\n  `n_blocks=1` should usually be avoided unless you have a high budget for hyperparameter tuning.\n\n## `num_embeddings`\n\n- Historically, the piecewise-linear embeddings seem to be the more popular choice among users.\n  However, on some tasks, the periodic embeddings can be a better choice.\n- The documentation of the `rtdl_num_embeddings` package provides recommendations on hyperparameter\n  tuning for embeddings.\n\n## Initialization\n\nThis section provides an overview of the initialization-related options that one has to specify when\nusing an API other than `TabM.make`.\n\n> [!NOTE]\n> The extent to which the initialization-related settings affect the task performance depends on\n> the task at hand. Usually, these settings are about uncovering the full potential of TabM, not\n> about avoiding some failure modes.\n\n**TabM-style initialization**. `tabm_init` is a flag triggering the TabM-style initialization of\nBatchEnsemble-based models. In short, this is a conservative initialization strategy making the $k$\nsubmodels different from each other only in the very first ensembled layer, but equal in all other\nlayers *at initialization*. 
On benchmarks, `tabm_init=True` proved to be the better default\nstrategy. If the $k$ submodels collapse to the same model during training, try `tabm_init=False`.\n\n**Initialization of scaling parameters**. The arguments `start_scaling_init`, `scaling_init` and `affine_scaling_init` are all about the same thing in slightly different contexts. TabM uses an\ninformal heuristic rule that can be roughly summarized as follows: use `\"normal\"` if there are\n\"non-trivial\" modules before the TabM backbone (e.g. `num_embeddings`), and `\"random-signs\"`\notherwise. This is not a well-explored aspect of TabM.\n\n**Initialization chunks.** Arguments like `start_scaling_init_chunks` or `scaling_init_chunks`\ntrigger a heuristic chunk-based initialization of scaling parameters, where the chunk sizes must sum\nexactly to the backbone input size (i.e. `d_in`). By default, in TabM, the \"chunks\" are defined\nsimply as the feature representation sizes (see the `d_features` variable in `tabm.TabM.__init__`),\nwhich means that, during the initialization, exactly one random scalar will be sampled per feature.\nIn other contexts, it may be unclear how to define chunks. In such cases, one possible approach is\nto start with `None` and, if needed, look for a better approach by trial and error.\n\n## Hyperparameter tuning\n\n> [!TIP]\n> The general notes provided above will help you in automatic hyperparameter tuning as well.\n\nIn the paper, to tune TabM's hyperparameters, the\n[TPE sampler from Optuna](https:\u002F\u002Foptuna.readthedocs.io\u002Fen\u002Fstable\u002Freference\u002Fsamplers\u002Fgenerated\u002Foptuna.samplers.TPESampler.html)\nwas used with 100 iterations on smaller datasets, and 50 iterations on larger datasets.\nIf achieving the highest possible performance is not critical, then 30-50 iterations should result\nin a somewhat reasonable configuration.\n\nThe table below provides the hyperparameter distributions used in the paper.\n**Consider changing them based on your setup** and taking the previous sections into\naccount, especially if a lower number of iterations is used.\n\n| Hyperparameter | TabM                            | TabM w.\u002F `PiecewiseLinearEmbeddings` |\n| :------------- | :------------------------------ | :----------------------------------- |\n| `k`            | `Const[32]` (not tuned)         | Same as for TabM                     |\n| `n_blocks`     | `UniformInt[1, 5]`              | `UniformInt[1, 4]`                   |\n| `d_block`      | `UniformInt[64, 1024, step=16]` | Same as for TabM                     |\n| `lr`           | `LogUniform[1e-4, 5e-3]`        | Same as for TabM                     |\n| `weight_decay` | `{0, LogUniform[1e-4, 1e-1]}`   | Same as for TabM                     |\n| `n_bins`       | N\u002FA                             | `UniformInt[2, 128]`                 |\n| `d_embedding`  | N\u002FA                             | `UniformInt[8, 32, step=4]`          |\n\n# Practical notes\n\n## Inference efficiency\n\nAs shown in Section 5.2 of the paper, one can prune a significant portion of TabM's submodels\n(i.e. reduce `k`) *after* training at the cost of a minor performance drop. In theory, there are\nmore advanced algorithms for selecting the best subset of submodels, such as the\n[one by Caruana et al.](https:\u002F\u002Fautoml.github.io\u002Famltk\u002Flatest\u002Fapi\u002Famltk\u002Fensembling\u002Fweighted_ensemble_caruana\u002F),\nbut they were not analyzed in the paper. 
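\n\nA minimal illustration of the simplest variant, which only evaluates prefixes of the `k` submodels on held-out predictions and does not remove any weights from the model (tensor names are hypothetical):\n\n```python\nimport torch\n\ndef subset_rmse(val_pred: torch.Tensor, val_true: torch.Tensor, m: int) -> float:\n    # val_pred: (n_objects, k, 1) raw predictions of a trained TabM on a validation set.\n    # val_true: (n_objects,) regression targets.\n    # Averages only the first m submodels and reports the validation RMSE.\n    mean_pred = val_pred[:, :m, :].mean(dim=1).squeeze(-1)\n    return (mean_pred - val_true).pow(2).mean().sqrt().item()\n```\n\n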
Also, keep in mind that selecting, say, `k=8` submodels\nafter training with `k=32` will result in a better model than simply training with `k=8`.\n\n# API\n\nTo explore the package API and docstrings, do one of the following:\n- Clone this repository and run `make docs`.\n- On GitHub, open the source code and use the symbols panel.\n- In VSCode, open the source code and use the Outline view.\n\nTo list all available items without cloning or installing anything, run the following snippet:\n\n```\nuv run --no-project --with tabm python -c \"\"\"\nimport tabm\n\nfor x in sorted(\n    x\n    for x in dir(tabm)\n    if getattr(getattr(tabm, x), '__module__', None) == 'tabm'\n    and not x.startswith('_')\n):\n    print(x)\n\"\"\"\n```\n","# TabM：通过参数高效的集成推进表格深度学习》（ICLR 2025）\u003C!-- omit in toc -->\n\n:scroll: [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.24210)\n&nbsp; :books: [其他表格DL项目](https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Frtdl)\n\n这是论文《TabM：通过参数高效的集成推进表格深度学习》的官方仓库。它由两部分组成：\n- 本文档中介绍的[**Python软件包**](#python-package)。\n- `paper\u002FREADME.md` 中描述的[**与论文相关的内容**](.\u002Fpaper\u002FREADME.md)（代码、指标、超参数等）。\n\n\u003Cbr>\n\n\u003Cdetails>\n\u003Csummary>TabM在\u003Cb>Kaggle\u003C\u002Fb>上的应用（截至2025年6月）\u003C\u002Fsummary>\n\n- TabM被用于UM举办的竞赛中的[冠军方案](https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fum-game-playing-strength-of-mcts-variants\u002Fdiscussion\u002F549801)。\n- TabM也被用于CIBMTR举办的竞赛中的[冠军方案](https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fequity-post-HCT-survival-predictions\u002Fdiscussion\u002F566550)，以及前三名、前四名、前五名和其他许多参赛方案中。后来发现，仅使用TabM而不与其他模型集成，也能在3300多个参赛者中获得第25名[567863号讨论](https:\u002F\u002Fwww.kaggle.com\u002Fcompetitions\u002Fequity-post-HCT-survival-predictions\u002Fdiscussion\u002F567863)。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>TabM在\u003Cb>TabReD\u003C\u002Fb>（一项具有挑战性的基准测试）上的表现\u003C\u002Fsummary>\n\n[TabReD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.19380)是一个基于**真实世界工业数据集**的基准测试，这些数据集存在**与时间相关的分布漂移**和**数百个特征**，因此比传统基准更具挑战性。下图显示，与先前的表格深度学习方法相比，TabM在TabReD（加上另一个真实世界数据集）上取得了更高的性能。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyandex-research_tabm_readme_7801a7e73aed.png\" width=35% display=block margin=auto>\n\n*每个点代表一个数据集上的性能得分。对于给定的模型，菱形表示各数据集上的平均值。*\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>训练与推理效率\u003C\u002Fsummary>\n\nTabM是一种简单且效率合理的模型，因此非常适合**实际应用**，包括处理大规模数据集。论文中使用的最大数据集包含**1300万个样本**，我们还了解到曾成功地在**超过1亿个样本**上进行过训练，尽管在这种情况下训练所需的时间会更长。\n\n下图显示，TabM的速度相对MLP和GBDT较慢，但比之前的表格深度学习方法更快。请注意：(1) 推理吞吐量是在单线程CPU上测量的，且*未进行任何优化*，特别是未采用本文后续介绍的TabM专用加速技术；(2) 左侧图表采用了*对数尺度*。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyandex-research_tabm_readme_0a7088993bca.png\" display=block margin=auto>\n\n*每个点代表在一个数据集上的测量结果。对于给定的模型，菱形表示各数据集上的平均值。在左图中，*$\\mathrm{TabM_{mini}^{\\dagger*}}$ *表示使用混合精度和`torch.compile`训练的*$\\mathrm{TabM_{mini}^{\\dagger}}$*。*\n\n\u003C\u002Fdetails>\n\n# TL;DR\u003C!-- omit in toc -->\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyandex-research_tabm_readme_711045dca83f.png\" width=65% display=block margin=auto>\n\n**TabM**（即**Tab**ular DL模型，能够进行**M**ultiple预测）是一种简单而强大的表格深度学习架构，可高效模拟MLP集成的效果。与常规MLP集成相比，TabM的两个主要区别在于：\n- **并行训练**MLP。这使得可以在训练过程中监控集成的整体性能，并在对整个集成最优时停止训练，而不是分别针对每个MLP进行训练。\n- **权重共享**。实际上，整个TabM可以整合为一个类似MLP的单一模型。这不仅显著提升了运行时间和内存效率，还起到了有效的正则化作用，从而提升任务性能。\n\n# 复现实验与浏览结果\u003C!-- omit in toc -->\n\n> [!IMPORTANT]\n> 
若要在实践中及未来工作中使用TabM，请使用下方介绍的`tabm`软件包。\n\n与论文相关的内容（代码、指标、超参数等）位于`paper\u002F`目录下，并在`paper\u002FREADME.md`中进行了详细说明。\n\n# Python软件包\u003C!-- omit in toc -->\n\n`tabm`是一个基于PyTorch的Python软件包，提供了TabM模型，以及用于构建自定义TabM-like架构（即高效MLP-like模型集成）的层和工具。\n\n- [**安装**](#installation)\n- [**基本用法**](#basic-usage)\n    - [创建TabM](#creating-tabm)\n    - [创建带有特征嵌入的TabM](#creating-tabm-with-feature-embeddings)\n    - [将TabM与自定义输入及输入模块结合使用](#using-tabm-with-custom-inputs-and-input-modules)\n    - [训练](#training)\n    - [推理](#inference)\n- [**示例**](#examples)\n- [高级用法](#advanced-usage)\n    - [直观理解](#intuition)\n    - [`EnsembleView`](#ensembleview)\n    - [MLP集成](#mlp-ensembles)\n    - [重要的实现细节](#important-implementation-details)\n    - [示例：无权重共享的简单集成](#example-simple-ensemble-without-weight-sharing)\n    - [示例：MiniEnsemble](#example-miniensemble)\n    - [示例：BatchEnsemble](#example-batchensemble)\n    - [示例：自定义架构](#example-a-custom-architecture)\n    - [将现有模型转化为高效集成](#turning-an-existing-model-to-an-efficient-ensemble)\n- [超参数](#hyperparameters)\n    - [默认模型](#default-model)\n    - [默认优化器](#default-optimizer)\n    - [`arch_type`](#arch_type)\n    - [`k`](#k)\n    - [`num_embeddings`](#num_embeddings)\n    - [初始化](#initialization)\n    - [超参数调优](#hyperparameter-tuning)\n- [实用提示](#practical-notes)\n    - [推理效率](#inference-efficiency)\n- [API](#api)\n\n# **安装**\n\n```\npip install tabm\n```\n\n# **基本用法**\n\n本节展示了在典型应用场景下如何创建模型，并对训练和推理给出了高层次的说明。\n\n## 创建TabM\n\n以下示例展示了不带特征嵌入的基本版TabM。为了获得更好的性能，通常应按照下一节的说明传递`num_embeddings`参数。\n\n> [!NOTE]\n> 下文使用的`TabM.make(...)`会根据提供的参数添加默认超参数。\n\n\u003C!-- test main -->\n```python\nimport torch\nfrom tabm import TabM\n\n# >>> 所有后续章节的通用设置。\nd_out = 1  # 例如，一个回归任务。\nbatch_size = 256\n\n# 数据集包含24个数值型（连续）特征。\nn_num_features = 24\n\n# 数据集包含2个分类特征。\n# 第一个分类特征有3个唯一类别。\n# 第二个分类特征有7个唯一类别。\ncat_cardinalities = [3, 7]\n\n# \u003C\u003C\u003C\n\nmodel = TabM.make(\n    n_num_features=n_num_features,\n    cat_cardinalities=cat_cardinalities,  # 将使用独热编码。\n    d_out=d_out,\n)\nx_num = torch.randn(batch_size, n_num_features)\nx_cat = torch.column_stack([\n    # 第i个类别特征的取值范围为[0, cat_cardinalities[i])。\n    torch.randint(0, c, (batch_size,)) for c in cat_cardinalities\n])\ny_pred = model(x_num, x_cat)\n\n# TabM表示由k个模型组成的集成，因此每个样本会有k个预测。\nassert y_pred.shape == (batch_size, model.k, d_out)\n```\n\n## 使用特征嵌入创建TabM\n\n在典型的表格数据任务中，通常通过将特征嵌入模块作为`num_embeddings`传入来获得最佳性能（在论文中，带有嵌入的TabM被记作$\\mathrm{TabM^\\dagger}$）。`TabM`支持来自[`rtdl_num_embeddings`](https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Frtdl-num-embeddings\u002Fblob\u002Fmain\u002Fpackage\u002FREADME.md)包中的多种特征嵌入模块。下面的例子展示了最简单的嵌入模块`LinearReLUEmbeddings`。\n\n> [!TIP]\n> 通常，更高级的嵌入方法，如`PiecewiseLinearEmbeddings`和`PeriodicEmbeddings`，能够带来更好的性能。它们的使用将在端到端示例中介绍（见#examples）。\n\n\u003C!-- test main _ -->\n```python\nfrom rtdl_num_embeddings import LinearReLUEmbeddings\n\nmodel = TabM.make(\n    n_num_features=n_num_features,\n    num_embeddings=LinearReLUEmbeddings(n_num_features),\n    d_out=d_out\n)\nx_num = torch.randn(batch_size, n_num_features)\ny_pred = model(x_num)\n\nassert y_pred.shape == (batch_size, model.k, d_out)\n```\n\n## 在自定义输入和输入模块中使用TabM\n\n> [!TIP]\n> `tabm.TabM`的实现是定义基于TabM的模型中输入和输入模块的一个良好示例。\n\n假设您希望改变TabM接收的输入内容或处理方式，但仍想以TabM作为骨干网络。那么典型的用法如下：\n\n```python\nfrom tabm import EnsembleView, make_tabm_backbone, LinearEnsemble\n\n\nclass Model(nn.Module):\n    def __init__(self, ...):\n        # >>> 创建任何自定义模块。\n        ...\n        # \u003C\u003C\u003C\n\n        # 创建集成输入模块。\n        self.ensemble_view = EnsembleView(...)\n        # 
创建骨干网络。\n        self.backbone = make_tabm_backbone(...)\n        # 创建预测头。\n        self.output = LinearEnsemble(...)\n\n    def forward(self, arg1, arg2, ...):\n        # 根据需要将输入转换为一个张量。\n        # 这一步骤可以包括特征嵌入和其他各种特征变换。\n        # `handle_input`是一个假想的用户定义函数。\n        x = handle_input(arg1, arg2, ...)  # -> (B, D) 或 (B, k, D)\n\n        # 与传统模型唯一的区别在于调用了self.ensemble_view。\n        x = self.ensemble_view(x)  # -> (B, k, D)\n        x = self.backbone(x)\n        x = self.output(x)\n        return x  # -> (B, k, d_out)\n```\n\n> [!NOTE]\n> 关于`x = handle_input(...)`这行中`x`的形状：\n> - TabM可以用作传统的类似MLP的骨干网络，此时`x`在训练和推理阶段都具有标准形状`(B, D)`。由于其简单性和更高的效率，这种做法是默认推荐的。\n> - 另一种高级训练策略是，在训练时`x`的形状为`(B, k, D)`，而在推理时为`(B, D)`。\n>\n> 端到端示例（见#examples）涵盖了这两种方法。\n\n## 训练\n\n**至关重要的是要独立训练TabM的`k`个预测结果，而不要对它们进行平均。** 换句话说，必须优化*平均损失*，而不是平均预测结果的损失。端到端示例（见#examples）提供了关于如何训练TabM的完整参考。\n\n## 推理\n\n在推理阶段，为了得到某个样本的预测结果，需要对`k`个预测结果进行平均。具体的平均策略取决于任务和损失函数。例如，在分类任务中，通常应该对*概率*进行平均，而不是对logit进行平均。端到端示例（见#examples）展示了如何使用TabM进行预测。\n\n# **示例**\n\n`example.ipynb`提供了一个训练TabM的端到端示例：\n- [在GitHub上查看](.\u002Fexample.ipynb)\n- [在Colab中打开](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fyandex-research\u002Ftabm\u002Fblob\u002Fmain\u002Fexample.ipynb)\n\n# 高级用法\n\n> [!TIP]\n> 在构建自定义模型之前，请先尝试并调整TabM。尽管TabM使用的是普通的MLP作为基础模型，但它的简单性可能会让人误以为它不够强大，实际上它是一个非常强的基线。\n\n本部分的内容超出了TabM论文的范围，提供了用于构建自定义TabM类架构的组件，包括：\n- MLP、线性层和归一化层的高效集成；\n- 其他对高效集成有用的层，如`EnsembleView`和`ElementwiseAffine`；\n- 将单个模型就地转换为高效集成的函数；\n- 以及其他工具。\n\n部分内容将在专门章节中讨论，其余则在下面的示例中展示。\n\n> [!IMPORTANT]\n> 理解TabM的实现细节对于构建*正确*且*有效*的TabM类模型至关重要。其中一些内容将在本节稍后讨论。其他重要参考资料包括：\n> - 本包的源代码，特别是`TabM`模型和`make_tabm_backbone`函数；\n> - TabM论文，尤其是第3.3节中的$\\mathrm{TabM_{mini}}$段落（arXiv v3版本）。\n\n## 直觉\n\n回想一下，在传统的类似MLP的模型中，一个典型的模块代表一层作用于形状为`(B, D)`的张量上，其中`B`是批次大小，`D`是潜在表示的维度。相比之下，本包中的典型模块：\n- 代表`k`个层的集合，这些层并行地作用于`k`个输入上；\n- 操作的对象是形状为`(B, k, D)`的张量，该张量表示`k`个输入（每个层对应一个输入）。\n\n示例：\n- `LinearEnsemble`是`k`个独立线性层的集合；\n- `LinearBatchEnsemble`是`k`个共享大部分权重的线性层的集合。需要注意的是，权重共享并不会改变模块的应用方式：它仍然代表`k`个层并行作用于`k`个输入上。\n\n## `EnsembleView`\n\n`EnsembleView`是一个特殊的轻量级模块，只做一件简单的事情：\n- 将形状为`(B, D)`的张量转换为形状为`(B, k, D)`的张量，其中存储了原始张量的`k`个完全相同的视图。这是一个廉价且无需复制的操作。\n- 形状为`(B, k, D)`的张量会原样传递，不做任何更改。\n\n## MLP 集成模型\n\n该包提供了以下高效的 MLP 集成模型：\n- `MLPBackboneBatchEnsemble`（由 $\\mathrm{TabM}$ 使用）\n- `MLPBackboneMiniEnsemble`（由 $\\mathrm{TabM_{mini}}$ 使用）\n- `MLPBackboneEnsemble`（由 $\\mathrm{TabM_{packed}}$ 使用）\n\n> [!NOTE]\n> 直接创建上述集成模型与使用 `tabm.make_tabm_backbone` 函数之间的区别在于，`make_tabm_backbone` 中的某些细节是为 TabM 优化的。这些细节在 TabM 之外也可能有用，但目前尚未深入探讨。\n\n与 `TabM` 不同，它们仅接受形状为 `(batch_size, k, d)` 的三维输入。因此，用户需要负责将输入转换为一个张量（例如使用嵌入、独热编码等），其中存储的是同一对象的 `k` 个视图，或者 `k` 个完整的批次。一个基本的使用示例：\n\n\u003C!-- test main -->\n```python\nimport tabm\nimport torch\nimport torch.nn as nn\n\nd_in = 24\nd_out = 1\nk = 32\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    tabm.MLPBackboneBatchEnsemble(\n        d_in=d_in,\n        n_blocks=3,\n        d_block=512,\n        dropout=0.1,\n        k=k,\n        tabm_init=True,\n        scaling_init='normal',\n        start_scaling_init_chunks=None,\n    ),\n    tabm.LinearEnsemble(512, d_out, k=k)\n)\nx = torch.randn(batch_size, d_in)\ny_pred = model(x)\nassert y_pred.shape == (batch_size, k, d_out)\n```\n\n## 重要实现细节\n\n本节介绍构建自定义 TabM 类模型时需要注意的实现细节。\n\n**层的顺序。** 最重要的指导原则是：**在用线性层混合表格特征之前，应先生成 $k$ 种不同的对象表示**。这直接源自论文（arXiv V3）第 3.3 节中关于 $\\mathrm{TabM_{mini}}$ 的说明。\n\n```python\nd_in = 24\nd = 512\nk = 16\n\n# 好的：前两个模块共同在第一个线性层之前生成 k 种不同的对象表示。\nscaling_init = \"normal\"  # 或 \"random-signs\"\ngood_model = 
nn.Sequential(\n    tabm.EnsembleView(k=k),\n    tabm.ElementwiseAffine((k, d_in), bias=False, scaling_init=scaling_init),\n    nn.Linear(d_in, d),\n    ...\n)\n\n# 好的：LinearBatchEnsemble 内部首先进行非共享的逐元素缩放，从而在进行线性变换之前使 k 种对象表示多样化。\nscaling_init = \"normal\"  # 或 \"random-signs\"\n                         # 或 (\"normal\", \"ones\")\n                         # 或 (\"random-signs\", \"ones\")\ngood_model = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    tabm.LinearBatchEnsemble(d_in, d, k=k, scaling_init=scaling_init),\n    ...\n)\n\n# 不好的：表格特征在线性层之前就被混合了。\nbad_model = nn.Sequential(\n    nn.Linear(d_in, d),\n    tabm.EnsembleView(k=k),\n    nn.ReLU(),\n    ...\n)\n\n# 不好的：虽然在第一次线性变换之前生成了 k 种表示，但这些表示并不不同。从数学上讲，下面的代码与上一个例子等价。\nbad_model = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    nn.Linear(d_in, d),\n    nn.ReLU(),\n    ...\n)\n```\n\n**权重共享。** 在选择 `torch.nn.Linear`（在集成成员之间完全共享线性层）、`tabm.LinearBatchEnsemble`（共享大部分权重）和 `tabm.LinearEnsemble`（不共享权重）时，请注意基于权重共享的参数高效集成策略（如 BatchEnsemble 和 MiniEnsemble）不仅能显著提高 TabM 的效率，还能提升其任务性能。因此，权重共享似乎是一种有效的正则化方法。然而，这种正则化的“最佳程度”以及它如何依赖于具体任务，仍有待进一步研究。\n\n## 示例：无权重共享的简单集成模型\n\n以下代码是对 `tabm.MLPBackboneEnsemble` 的重新实现：\n\n\u003C!-- test main _ -->\n```python\nk = 32\nd_in = 24\nd = 512\nd_out = 1\ndropout = 0.1\n\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n\n    # >>> MLPBackboneEnsemble(n_blocks=2)\n    tabm.LinearEnsemble(d_in, d, k=k),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n\n    tabm.LinearEnsemble(d, d, k=k),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n    # \u003C\u003C\u003C\n\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\n```\n\n## 示例：MiniEnsemble\n\nMiniEnsemble 是一种简单的参数高效集成策略：\n1. 通过将对象分别经过 $k$ 个非共享且随机初始化的仿射变换，生成 $k$ 种不同的表示。\n2. 将这 $k$ 种表示并行地送入一个共享的主干网络。可以使用任何主干网络。\n3. 使用非共享的输出头进行预测。\n\n以下代码是对 `tabm.MLPBackboneMiniEnsemble` 的重新实现。实际上，`backbone` 可以是任何类似 MLP 的模型。对 `backbone` 的唯一要求是能够支持任意数量的批量维度，因为 `EnsembleView` 会增加一个新的维度。另一种方法是在主干网络前后对表示进行重塑。\n\n\u003C!-- test main _ -->\n```python\nd_in = 24\nd = 512\nd_out = 1\nk = 32\n\n# 可以使用任何类似 MLP 的主干网络。\nbackbone = tabm.MLPBackbone(\n    d_in=d_in, n_blocks=2, d_block=d, dropout=0.1\n)\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n\n    # >>> MLPBackboneMiniEnsemble\n    tabm.ElementwiseAffine((k, d_in), bias=False, scaling_init='normal'),\n    backbone,\n    # \u003C\u003C\u003C\n\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\n```\n\n## 示例：BatchEnsemble\n\n以下代码是对 `tabm.MLPBackboneBatchEnsemble` 在 `n_blocks=2` 情况下的重新实现：\n\n\u003C!-- test main _ -->\n```python\nk = 32\nd_in = 24\nd = 512\ndropout = 0.1\ntabm_init = True  # TabM 风格的初始化\nscaling_init = 'normal'  # 或 'random-signs'\n\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n\n    # >>> MLPBackboneBatchEnsemble(n_blocks=2)\n    tabm.LinearBatchEnsemble(\n        d_in, d, k=k,\n        scaling_init=(scaling_init, 'ones') if tabm_init else scaling_init,\n    ),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n\n    tabm.LinearBatchEnsemble(\n        d, d, k=k,\n        scaling_init='ones' if tabm_init else scaling_init\n    ),\n    nn.ReLU(),\n    nn.Dropout(dropout),\n    # \u003C\u003C\u003C\n\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\n```\n\n## 示例：自定义架构\n\n以下是一个随机且很可能**糟糕**的架构，展示了该包中可用的各种层：\n\n\u003C!-- test main _ -->\n```python\nd_in = 24\nd = 512\nd_out = 1\nk = 16\ndropout = 0.1\n\nmodel = nn.Sequential(\n                                         #    (B, d_in) 或 (B, k, d_in)\n    tabm.EnsembleView(k=k),              # -> (B, k, d_in)\n\n    \u002F\u002F 大部分权重共享\n    tabm.LinearBatchEnsemble(            # -> (B, k, d)\n        d_in, d, 
k=k, scaling_init=('random-signs', 'ones')\n    ),\n    nn.ReLU(),                           # -> (B, k, d)\n    nn.Dropout(dropout),                 # -> (B, k, d)\n\n    \u002F\u002F 不共享权重\n    tabm.BatchNorm1dEnsemble(d, k=k),    # -> (B, k, d)\n\n    \u002F\u002F 不共享权重\n    tabm.LinearEnsemble(                 # -> (B, k, d)\n        d, d, k=k\n    ),\n    nn.ReLU(),                           # -> (B, k, d)\n    nn.Dropout(dropout),                 # -> (B, k, d)\n\n    \u002F\u002F 权重完全共享\n    nn.Linear(                           # -> (B, k, d)\n        d, d,\n    ),\n    nn.ReLU(),                           # -> (B, k, d)\n    nn.Dropout(dropout),                 # -> (B, k, d)\n\n    \u002F\u002F 不共享权重\n    tabm.ElementwiseAffine(              # -> (B, k, d)\n        (k, d), bias=True, scaling_init='normal'\n    ),\n\n    \u002F\u002F 权重完全共享\n    tabm.MLPBackbone(                    # -> (B, k, d)\n        d_in=d, n_blocks=2, d_block=d, dropout=0.1\n    ),\n\n    \u002F\u002F 几乎所有权重共享\n    tabm.MLPBackboneMiniEnsemble(        # -> (B, k, d)\n        d_in=d, n_blocks=2, d_block=d, dropout=0.1,\n        k=k, affine_bias=False, affine_scaling_init='normal',\n    ),\n\n    \u002F\u002F 不共享权重\n    tabm.LinearEnsemble(d, d_out, k=k),  # -> (B, k, d_out)\n)\n\nx = torch.randn(batch_size, d_in)\ny_pred = model(x)\nassert y_pred.shape == (batch_size, k, d_out)\n```\n\n## 将现有模型转换为高效集成模型\n\n> [!WARNING]\n> 下文讨论的方法需要充分理解原始模型和高效集成的所有实现细节，因为它无法保证所得到模型的正确性和性能。\n\n假设您有一个类似 MLP 的模型，并希望在不修改模型源代码的情况下快速评估高效集成对您的模型的潜力。那么，您可以创建一个单模型实例，并通过将其层替换为对应的集成层来将其转换为高效集成模型。为此，该包提供了以下函数：\n- `tabm.batchensemble_linear_layers_`\n- `tabm.ensemble_linear_layers_`\n- `tabm.ensemble_batchnorm1d_layers_`\n- `tabm.ensemble_layernorm_layers_`\n\n例如，以下代码创建了一个由 $k$ 个独立 MLP 组成的完整集成模型：\n\n\u003C!-- test main _ -->\n```python\nd_in = 24\nd = 512\nd_out = 1\n\n\u002F\u002F 创建一个标准的 MLP 主干。\nbackbone = tabm.MLPBackbone(\n    d_in=d_in, n_blocks=3, d_block=d, dropout=0.1\n)\n\n\u002F\u002F 将该主干转换为高效集成。\nk = 32\ntabm.ensemble_linear_layers_(backbone, k=k)\n\n\u002F\u002F 组装最终模型。\nmodel = nn.Sequential(\n    tabm.EnsembleView(k=k),\n    backbone,\n    tabm.LinearEnsemble(d, d_out, k=k),\n)\nassert model(torch.randn(batch_size, d_in)).shape == (batch_size, k, d_out)\n```\n\n# 超参数\n\n## 默认模型\n\n`TabM.make` 允许用户使用默认超参数创建 TabM，并根据需要进行覆盖：\n\n> [!NOTE]\n> 默认超参数并非“常量”，即它们会根据提供的参数而变化。\n\n\u003C!-- test main _ -->\n```python\n\u002F\u002F 使用默认超参数的 TabM。\nmodel = TabM.make(n_num_features=16, d_out=1)\n\n\u002F\u002F 自定义 n_blocks，其余超参数保持默认值的 TabM。\nmodel = TabM.make(n_num_features=16, d_out=1, n_blocks=2)\n```\n\n## 默认优化器\n\n目前，对于默认的 TabM，其默认优化器是 `torch.optim.AdamW(..., lr=0.002, weight_decay=0.0003)`。\n\n## `arch_type`\n\n*TL;DR：默认情况下使用 TabM。*\n\n- `'tabm'` 是默认值，在大多数情况下应能提供最佳性能。\n- `'tabm-mini'` 可能在训练和\u002F或推理速度上更快，同时性能下降不明显，不过此选项可能需要更精确地选择适合给定 `k` 值的 `d_block` 和 `n_blocks`。在某些情况下，如果更高的正则化程度对特定任务有益，`'tabm-mini'` 甚至可能带来略微更好的性能。\n- `'tabm-packed'` 仅出于完整性而实现。在大多数情况下，它会导致模型更慢、更重且性能较差。\n\n## `k`\n\n> [!TIP]\n> 除第一点外，以下关于 `k` 的内容均受论文（arXiv v3）图 7 的启发。\n\n- 如果您打算调整 `k`，建议分别运行多个独立的超参数调优实验，每次使用不同的但*固定*的 `k` 值。原因是改变 `k` 可能会影响其他超参数的最佳取值，若与其他超参数同时调整，则可能使超参数调优过程更加复杂。\n- 在深度和宽度一定的情况下，适当增加 `k` 可以提升性能；但超过某个阈值后，性能可能不再提升，甚至有所下降。这种现象在 `arch_type='tabm-mini'` 时更为明显（未在图中显示）。\n- 为了探索性实验，可以尝试使用较小的 `k` 值（如 24 或 16），通常仍能获得具有竞争力的结果（尽管更改 `k` 后可能需要重新调整其他超参数）。\n- 一般而言，当增大 `k` 时，也应考虑相应增加 `d_block` 或 `n_block`（或两者）。其背后的直觉是，由于权重共享机制的存在，较大的集成规模可能需要更大的基础架构，才能有效容纳 `k` 个子模型。\n- 如果您的数据集规模与论文中使用的数据集相近，除非您有充足的预算进行超参数调优，否则通常应避免使用 `n_blocks=1`。\n\n## `num_embeddings`\n\n- 
从历史来看，分段线性嵌入通常是用户的更常见选择。然而，在某些任务上，周期性嵌入可能是更好的选择。\n- `rtdl_num_embeddings` 包的文档提供了关于嵌入超参数调优的建议。\n\n## 初始化\n\n本节概述了在使用 `TabM.make` 之外的 API 时，需要指定的初始化相关选项。\n\n> [!NOTE]\n> 初始化相关设置对任务性能的影响程度取决于具体任务。通常，这些设置旨在充分发挥 TabM 的潜力，而非避免某些失败模式。\n\n**TabM 风格的初始化**。`tabm_init` 是一个标志，用于触发基于 BatchEnsemble 模型的 TabM 风格初始化。简而言之，这是一种保守的初始化策略，使得 $k$ 个子模型仅在第一个集成层中有所不同，而在其他所有层中则完全相同（初始化时）。在基准测试中，`tabm_init=True` 表现出更好的默认策略。如果训练过程中 $k$ 个子模型坍缩为同一个模型，请尝试将 `tabm_init=False`。\n\n**缩放参数的初始化**。`start_scaling_init`、`scaling_init` 和 `affine_scaling_init` 这些参数在略微不同的上下文中指代同一内容。TabM 使用一种非正式的经验法则，大致可以总结为：如果 TabM 主干之前存在“非平凡”的模块（例如 `num_embeddings`），则使用 `\"normal\"`；否则使用 `\"random-signs\"`。这仍然是 TabM 中尚未充分探索的一个方面。\n\n**初始化分块**。类似 `start_scaling_init_chunks` 或 `scaling_init_chunks` 的参数会触发缩放参数的启发式分块初始化，其中各分块的大小之和必须精确等于主干输入大小（即 `d_in`）。默认情况下，在 TabM 中，“分块”被简单定义为特征表示的大小（参见 `tabm.TabM.__init__` 中的 `d_features` 变量），这意味着在初始化过程中，每个特征恰好会采样一个随机标量。在其他场景下，如何定义分块可能并不明确。在这种情况下，一种可行的方法是先从 `None` 开始，必要时再通过试错找到更优的方案。\n\n## 超参数调优\n\n> [!TIP]\n> 上文提供的通用注意事项同样适用于自动超参数调优。\n\n在论文中，为了调优 TabM 的超参数，使用了 Optuna 的 TPE 采样器（[Optuna 文档](https:\u002F\u002Foptuna.readthedocs.io\u002Fen\u002Fstable\u002Freference\u002Fsamplers\u002Fgenerated\u002Foptuna.samplers.TPESampler.html)），在较小的数据集上进行了 100 次迭代，在较大的数据集上进行了 50 次迭代。如果对达到最高性能并非至关重要，则进行 30–50 次迭代即可得到一个较为合理的配置。\n\n下表提供了论文中使用的超参数分布。\n**请根据您的实际设置并结合前文内容调整这些分布**，尤其是在迭代次数较少的情况下。\n\n| 超参数         | TabM                            | 带有 `PiecewiseLinearEmbeddings` 的 TabM |\n| :------------- | :------------------------------ | :----------------------------------- |\n| `k`            | `Const[32]`（不调优）           | 与 TabM 相同                         |\n| `n_blocks`     | `UniformInt[1, 5]`              | `UniformInt[1, 4]`                   |\n| `d_block`      | `UniformInt[64, 1024, step=16]` | 与 TabM 相同                         |\n| `lr`           | `LogUniform[1e-4, 5e-3]`        | 与 TabM 相同                         |\n| `weight_decay` | `{0, LogUniform[1e-4, 1e-1]}`   | 与 TabM 相同                         |\n| `n_bins`       | 无                              | `UniformInt[2, 128]`                 |\n| `d_embedding`  | 无                              | `UniformInt[8, 32, step=4]`          |\n\n# 实用提示\n\n## 推理效率\n\n如论文第 5.2 节所示，可以在训练完成后大幅剪枝 TabM 的子模型数量（即降低 `k`），代价是性能略有下降。理论上，还有更先进的算法可用于选择最佳子模型子集，例如 Caruana 等人提出的算法（[AMLTK 文档](https:\u002F\u002Fautoml.github.io\u002Famltk\u002Flatest\u002Fapi\u002Famltk\u002Fensembling\u002Fweighted_ensemble_caruana\u002F)），但这些算法并未在论文中进行分析。此外，请注意，先以 `k=32` 训练后再选择 `k=8` 子模型，其效果会优于直接以 `k=8` 训练。\n\n# API\n\n要浏览包的 API 和文档字符串，可采取以下任一方法：\n- 克隆此仓库并运行 `make docs`。\n- 在 GitHub 上打开源代码并使用符号面板。\n- 在 VSCode 中打开源代码并使用大纲视图。\n\n若想在不克隆或安装任何东西的情况下列出所有可用项，可运行以下代码片段：\n\n```\nuv run --no-project --with tabm python -c \"\"\"\nimport tabm\n\nfor x in sorted(\n    x\n    for x in dir(tabm)\n    if getattr(getattr(tabm, x), '__module__', None) == 'tabm'\n    and not x.startswith('_')\n):\n    print(x)\n\"\"\"\n```","# TabM 快速上手指南\n\nTabM (Tabular Deep Learning with Parameter-Efficient Ensembling) 是一个基于 PyTorch 的强大表格深度学习模型。它通过参数高效的方式模拟多个 MLP（多层感知机）的集成，在保持高预测性能的同时，显著提升了训练和推理的效率。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux, macOS 或 Windows\n*   **Python 版本**: 3.8 或更高\n*   **核心依赖**:\n    *   `torch` (PyTorch)\n    *   `tabm` (本工具包)\n    *   `rtdl-num-embeddings` (可选但推荐，用于提升数值特征的处理性能)\n\n> **提示**：国内用户建议使用清华源或阿里源加速 Python 包的安装。\n\n## 安装步骤\n\n使用 `pip` 进行安装。为了获得最佳下载速度，推荐使用国内镜像源：\n\n```bash\npip install tabm -i 
","某金融风控团队正在处理包含数百万用户交易记录的高维表格数据，急需构建一个能精准预测违约概率且适应数据分布漂移的深度学习模型。\n\n### 没有 tabm 时\n- **集成成本高昂**：为了提升准确率，工程师不得不训练多个独立的 MLP 模型进行集成，导致显存占用翻倍，训练时间成倍增加。\n- **调优目标错位**：只能单独监控每个子模型的损失，无法直接观测整体集成效果，往往在单模型最优时停止训练，却错过了集成性能的最佳点。\n- **难以应对漂移**：面对工业界常见的随时间变化的数据分布漂移（Distribution Drift），传统表格深度学习模型泛化能力不足，线上预测稳定性差。\n- **推理延迟高**：多个模型串行推理导致单次预测耗时过长，难以满足实时风控系统对低延迟的严苛要求。\n\n### 使用 tabm 后\n- **参数高效集成**：tabm 通过权重共享机制，将多个 MLP 集成压缩进一个类 MLP 结构中，在保持集成优势的同时，大幅降低了显存需求和训练时长。\n- **并行协同优化**：支持并行训练多个预测头，团队可直接监控整体集成指标并在最佳时刻停止训练，显著提升了最终模型的预测精度。\n- **鲁棒性增强**：在真实的工业数据集基准测试中，tabm 展现出更强的抗漂移能力，有效解决了因数据分布变化导致的模型性能下降问题。\n- **推理速度提升**：得益于紧凑的架构设计，tabm 的推理吞吐量远超传统表格深度学习方法，轻松支撑大规模数据的实时在线预测。\n\ntabm 通过参数高效的并行集成架构，让表格深度学习在保持高精度的同时，具备了落地大规模真实业务所需的效率与鲁棒性。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyandex-research_tabm_0a708899.png","yandex-research","Yandex Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fyandex-research_2132bbf2.png","",null,"YandexResearch","research.yandex.com","https:\u002F\u002Fgithub.com\u002Fyandex-research",[81,85,89],{"name":82,"color":83,"percentage":84},"Python","#3572A5",88.8,{"name":86,"color":87,"percentage":88},"Jupyter Notebook","#DA5B0B",10.9,{"name":90,"color":91,"percentage":92},"Makefile","#427819",0.3,993,89,"2026-04-16T05:41:17","Apache-2.0","未说明","未说明 (基于 PyTorch，支持 CPU 和 GPU；论文实验提及使用混合精度训练和 torch.compile 加速，暗示兼容 NVIDIA GPU)","未说明 (文中提及可处理 1300 万至 1 亿以上样本的大型数据集，实际内存需求取决于数据规模)",{"notes":101,"python":97,"dependencies":102},"该工具是一个基于 PyTorch 的 Python 包 (tabm)，主要用于表格数据的深度学习。核心特性是参数高效集成 (Parameter-Efficient Ensembling)，即在一个模型中模拟多个 MLP 的集成。安装只需运行 'pip install tabm'。可选依赖 
'rtdl-num-embeddings' 用于提升数值特征嵌入效果。推理测试表明单 CPU 线程即可运行，但训练大型数据集时耗时较长。",[103,104],"torch","rtdl-num-embeddings",[14,106],"其他","2026-03-27T02:49:30.150509","2026-04-17T09:54:07.490803",[110,115,120,125,130,135,140],{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},36628,"如何获取论文中所有模型和数据集的完整评估日志（如 macro-F1, AUC, R²）？","可以通过运行以下命令下载包含所有指标的 JSON 文件：\n\nwget https:\u002F\u002Fstorage.yandexcloud.net\u002Fyandex-research\u002Ftabm_paper_exp_metrics.json\n\n文件结构为嵌套字典：DatasetName -> ModelName -> ModelCount -> [训练\u002F验证\u002F测试指标列表]。注意数据集名称可能包含基准前缀（如 tabred\u002Fcooking-time）和分割 ID。","https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Ftabm\u002Fissues\u002F14",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},36629,"如果只想在项目中直接使用 TabM 模型，是需要安装完整包还是仅使用独立实现？","如果您只是想导入并使用 TabM 模型而不需要复现论文中的超参数调优等实验设置，推荐使用独立实现（standalone implementation）。完整包（包含 bin, lib 等目录）仅用于复现论文中的完整实验流程。独立实现是功能完整的 TabM 版本。目前暂无计划提供 Scikit-learn API，但可以参考项目中的 example.ipynb 笔记本快速上手。","https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Ftabm\u002Fissues\u002F2",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},36630,"Covertype 数据集的预处理代码在哪里？为什么特征数量与原始数据不同？","所有预处理逻辑都已公开在 `lib\u002Fdata.py` 文件中，没有私有代码。关于 Covertype 数据集特征数量的变化（从 54 个变为 15 数值 +4 二元 +1 类别），处理过程大致如下：\n1. 最初使用特定代码下载、分割并保存数据集；\n2. 后续项目中将部分二元特征移至单独类别；\n3. 在 TabM 项目中，40 个二元特征被合并为一个类别特征。\n具体细节可参考 `lib\u002Fdata.py` 及相关历史代码库。","https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Ftabm\u002Fissues\u002F5",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},36631,"TabM 是否支持导出为 ONNX 格式？","TabM 是一个标准的 PyTorch 模块，因此理论上适用官方的 PyTorch 转 ONNX 文档（https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fonnx.html）。本仓库未提供额外的导出工具。需要注意的是，目前 PyTorch 到 ONNX 的导出并非“开箱即用”，在实际操作中可能会遇到一些兼容性问题或需要额外调整。","https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Ftabm\u002Fissues\u002F12",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},36632,"可以将 TabM 集成模型中的多个适配器（adapters）权重合并为一个单一模型吗？效果如何？","这是一个有效的问题，但维护者尚未亲自尝试过。直观来看，现有的模型合并技术（如 Model Soups）旨在优化单个模型，而 TabM 本身代表多个模型（多个 MLP）的集成。将子模型合并为单一模型可能会导致性能下降，因为这是从“多模型”到“单模型”的转变。如果子模型在训练结束时差异较大，合并后的单一 MLP 效果可能较差；如果子模型相似，则可能通过模型平均机制获得不错的结果，但这需要在推理效率提升和性能潜在损失之间进行权衡。","https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Ftabm\u002Fissues\u002F16",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},36633,"example.ipynb 中的梯度裁剪（gradient clipping）实现是否有误？","是的，之前 example.ipynb 中存在一个 Bug：梯度裁剪被错误地应用在 backward() 之前，导致其无效，且优化器解锁逻辑也有问题。该问题已在提交 28e47ae301c92ec37787dde1ce923a0793f405b4 中修复。请注意，论文实验中使用的实际代码（paper\u002Fbin\u002Fmodel.py）一直是正确实现梯度裁剪的。","https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Ftabm\u002Fissues\u002F20",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},36634,"论文公式中的符号 r 和 s 位置是否标反了？","是的，用户指出的公式中绿色部分的 r 和 s 符号位置确实有误。该错误已在 arXiv PDF 的第 3 版（v3）中修正。此外，关于图中输入维度表示的不一致问题，经阅读代码和全文后确认为符号表示差异，实际逻辑无误。","https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Ftabm\u002Fissues\u002F4",[146],{"id":147,"version":148,"summary_zh":149,"released_at":150},289397,"v0.0.3","# 错误修复\n\n- 修复了在 `MLPBackboneEnsemble.__init__` 中引发异常的 bug。具体来说，现在可以将 `arch_type='tabm-packed'` 传递给 `TabM`、`TabM.make` 和 `make_tabm_backbone`。","2025-08-14T16:25:02"]