[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NX-AI--xlstm":3,"tool-NX-AI--xlstm":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 
token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":76,"owner_url":77,"languages":78,"stars":99,"forks":100,"last_commit_at":101,"license":102,"difficulty_score":10,"env_os":103,"env_gpu":104,"env_ram":105,"env_deps":106,"category_tags":115,"github_topics":116,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":123,"updated_at":124,"faqs":125,"releases":155},9291,"NX-AI\u002Fxlstm","xlstm","Official repository of the xLSTM.","xLSTM 是一种基于经典长短期记忆网络（LSTM）理念革新而来的新型循环神经网络架构。它旨在解决传统 LSTM 在处理长序列时存在的记忆容量限制及训练不稳定问题，同时在语言建模等任务中展现出与当前主流的 Transformer 及状态空间模型相媲美的卓越性能。\n\n该项目的核心亮点在于引入了“指数门控”机制，配合先进的归一化与稳定化技术，并首创了“矩阵记忆”模块，显著提升了模型对长程依赖的捕捉能力。团队已成功利用该架构训练出拥有 70 亿参数的 xLSTM Large 大语言模型，在高达 2.3 万亿 token 的数据集上验证了其高效的训练吞吐量和推理速度，特别适合需要快速、高效推理的场景。\n\nxLSTM 非常适合 AI 研究人员、深度学习开发者以及对高效序列建模感兴趣的技术专家使用。它不仅提供了基于 PyTorch 的易用接口和详细的论文复现代码，还针对大规模应用推出了优化后的内核支持。无论是希望探索超越 Transformer 的新架构的研究者，还是寻求在资源受限环境下部署高性能语言模型的工程师，xLSTM 都是一个值得尝试的强大开源选择。","\u003Cdiv align=\"center\">\n\n# xLSTM: Extended Long Short-Term Memory\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Paper&message=2405.04517&color=B31B1B&logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04517)\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fxlstm?color=blue)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fxlstm\u002F)\n[![PyPI Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNX-AI_xlstm_readme_b096ee60bd26.png)](https:\u002F\u002Fpepy.tech\u002Fprojects\u002Ftirex-ts)\n![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNX-AI\u002Fxlstm)\n[![License: Apache-2.0](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache--2.0-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n\n![xLSTM Figure](.\u002Fres\u002Fdesc_xlstm_overview.svg)\n\n\u003C\u002Fdiv>\n\n> **Paper:** https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04517\n>\n> **Authors:** Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter\n\n## About\n\nxLSTM is a new Recurrent Neural Network architecture based on ideas of the original LSTM.\nThrough Exponential Gating with appropriate normalization and stabilization techniques and a new 
Matrix Memory, it overcomes the limitations of the original LSTM \nand shows promising performance on Language Modeling when compared to Transformers or State Space Models.\n\n:rotating_light: We trained a 7B parameter xLSTM Language Model on 2.3T tokens! :rotating_light:\n\nWe refer to the optimized architecture for our xLSTM 7B as xLSTM Large.\n\n## Minimal Installation\n\nCreate a conda environment from the file `environment_pt240cu124.yaml`, then install the model code only (i.e. the module `xlstm`) as a package.\n\nFor using the xLSTM Large 7B model, first install [`mlstm_kernels`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fmlstm_kernels) via:\n```bash\npip install mlstm_kernels\n```\nThen install the xlstm package via pip:\n```bash\npip install xlstm\n```\nOr clone it from GitHub:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm.git\ncd xlstm\npip install -e .\n```\n\n## Requirements\n\nThis package is based on PyTorch and was tested with versions `>=1.8`. For a well-tested environment, create the environment from `environment_pt240cu124.yaml`:\n```bash\nconda env create -n xlstm -f environment_pt240cu124.yaml\nconda activate xlstm\n```\n\nFor the xLSTM Large 7B model we require our [`mlstm_kernels`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fmlstm_kernels) package, which provides fast kernels for the xLSTM.\n\n\u003Cdiv align=\"center\">\n\n# xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Paper&message=2503.13427&color=B31B1B&logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13427)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-xLSTM_7B-yellow?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FNX-AI\u002FxLSTM-7b)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-nxai_community-green)](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Ftirex-internal\u002Fblob\u002Fmain\u002FLICENSE)\n\n> **Paper:** https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13427\n>\n> **Authors:** Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter\n\n![xLSTM Figure](.\u002Fres\u002Fxlstm_7b_poster.svg)\n\n\u003C\u002Fdiv>\n\nWe have optimized the xLSTM architecture in terms of training throughput and stability. \nThe code for the updated architecture is located in `xlstm\u002Fxlstm_large`.\n\nThe model weights are available on Huggingface at https:\u002F\u002Fhuggingface.co\u002FNX-AI\u002FxLSTM-7b.
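\n\nTo try the released weights directly, one option is loading the checkpoint through the `transformers` library. A minimal sketch, assuming the Hugging Face checkpoint supports `AutoModelForCausalLM` with remote code enabled (the loading API here is an assumption; check the model card for the exact supported usage):\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\n# Assumption: the checkpoint ships custom model code, so\n# trust_remote_code=True is required; a CUDA device is assumed.\ntokenizer = AutoTokenizer.from_pretrained(\"NX-AI\u002FxLSTM-7b\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"NX-AI\u002FxLSTM-7b\", trust_remote_code=True, device_map=\"auto\"\n)\n\ninputs = tokenizer(\"The xLSTM architecture\", return_tensors=\"pt\").to(model.device)\noutput_ids = model.generate(**inputs, max_new_tokens=32)\nprint(tokenizer.decode(output_ids[0], skip_special_tokens=True))\n```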
\n\n## How to use the xLSTM Large 7B and its architecture\n\nWe provide a standalone single-file implementation of the xLSTM Large architecture in [`xlstm\u002Fxlstm_large\u002Fmodel.py`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fblob\u002Fmain\u002Fxlstm\u002Fxlstm_large\u002Fmodel.py).\nThis implementation requires our [`mlstm_kernels`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fmlstm_kernels) package and, apart from that, has no dependency on the NeurIPS xLSTM architecture implementation.\n\nFor a quick start, we provide a [`demo.ipynb`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fblob\u002Fmain\u002Fnotebooks\u002Fxlstm_large\u002Fdemo.ipynb) notebook for the xLSTM Large architecture at `notebooks\u002Fxlstm_large\u002Fdemo.ipynb`.\n\nIn this notebook we import our config and model class, initialize a random model, and perform a forward pass, like so:\n\n```python\nimport torch\nfrom xlstm.xlstm_large.model import xLSTMLargeConfig, xLSTMLarge\n\n# configure the model with TFLA Triton kernels\nxlstm_config = xLSTMLargeConfig(\n    embedding_dim=512,\n    num_heads=4,\n    num_blocks=6,\n    vocab_size=2048,\n    return_last_states=True,\n    mode=\"inference\",\n    chunkwise_kernel=\"chunkwise--triton_xl_chunk\", # xl_chunk == TFLA kernels\n    sequence_kernel=\"native_sequence__triton\",\n    step_kernel=\"triton\",\n)\n# instantiate the model\nxlstm = xLSTMLarge(xlstm_config)\nxlstm = xlstm.to(\"cuda\")\n# create inputs\ninput = torch.randint(0, 2048, (3, 256)).to(\"cuda\")\n# run a forward pass\nout = xlstm(input)\nout.shape[1:] == (256, 2048)\n```\n\n## Recommendation for other hardware\n\nWe have tested our model mostly on NVIDIA GPUs; however, our Triton kernels should also run on AMD GPUs. \nFor other platforms, like Apple Metal, we recommend using the native PyTorch implementations for now:\n\n```python\nxlstm_config = xLSTMLargeConfig(\n    embedding_dim=512,\n    num_heads=4,\n    num_blocks=6,\n    vocab_size=2048,\n    return_last_states=True,\n    mode=\"inference\",\n    chunkwise_kernel=\"chunkwise--native_autograd\", # no Triton kernels\n    sequence_kernel=\"native_sequence__native\", # no Triton kernels\n    step_kernel=\"native\", # no Triton kernels\n)\n```\n\nIf you are working inside Apple's MLX ecosystem, check out the community-driven\n[xLSTM-metal](https:\u002F\u002Fgithub.com\u002FMLXPorts\u002FxLSTM-metal) port which provides an\nMLX-native implementation of xLSTM targeting Apple Silicon.\n\n# Models from the xLSTM NeurIPS Paper\n\nThis section explains how to use the models from the xLSTM paper.\n\n## How to use the xLSTM architecture from our NeurIPS paper\n\nFor non-language applications or for integrating into other architectures you can use the `xLSTMBlockStack`, and for language modeling or other token-based applications you can use the `xLSTMLMModel`.\n\n### Using the sLSTM CUDA kernels\n\nFor the CUDA version of sLSTM, you need a GPU with Compute Capability >= 8.0; see [https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus). If you have problems with the compilation, please try (thanks to [@zia1138](https:\u002F\u002Fgithub.com\u002Fzia1138) for pointing this out):\n```bash\nexport TORCH_CUDA_ARCH_LIST=\"8.0;8.6;9.0\"\n```\n\nFor all kinds of custom setups with torch and CUDA, keep in mind that the versions have to match. Also, to make sure the correct CUDA libraries are included, you can use the `XLSTM_EXTRA_INCLUDE_PATHS` environment variable to inject additional include paths, e.g.:\n\n```bash\nexport XLSTM_EXTRA_INCLUDE_PATHS='\u002Fusr\u002Flocal\u002Finclude\u002Fcuda\u002F:\u002Fusr\u002Finclude\u002Fcuda\u002F'\n```\n\nor within Python:\n\n```python\nimport os\nos.environ['XLSTM_EXTRA_INCLUDE_PATHS']='\u002Fusr\u002Flocal\u002Finclude\u002Fcuda\u002F:\u002Fusr\u002Finclude\u002Fcuda\u002F'\n```\n\nFor standalone, even faster sLSTM kernels, feel free to use the [FlashRNN](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fflashrnn) library.
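\n\nBefore selecting the CUDA backend in a config, it can help to probe the device at runtime. A minimal sketch; the `\"vanilla\"` fallback name assumes the package's native PyTorch sLSTM backend, so verify it against your installed version:\n\n```python\nimport torch\n\n# The sLSTM CUDA kernels require Compute Capability >= 8.0\n# (e.g. A100, H100, RTX 30xx\u002F40xx); fall back to the native backend otherwise.\nif torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0):\n    slstm_backend = \"cuda\"\nelse:\n    slstm_backend = \"vanilla\"  # assumed native PyTorch fallback\nprint(f\"using sLSTM backend: {slstm_backend}\")\n```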
\n\n### xLSTM Block Stack\n\nThe `xLSTMBlockStack` is meant for use as an alternative backbone in existing projects. It is similar to a stack of Transformer blocks, but uses xLSTM blocks:\n\n```python\nimport torch\n\nfrom xlstm import (\n    xLSTMBlockStack,\n    xLSTMBlockStackConfig,\n    mLSTMBlockConfig,\n    mLSTMLayerConfig,\n    sLSTMBlockConfig,\n    sLSTMLayerConfig,\n    FeedForwardConfig,\n)\n\ncfg = xLSTMBlockStackConfig(\n    mlstm_block=mLSTMBlockConfig(\n        mlstm=mLSTMLayerConfig(\n            conv1d_kernel_size=4, qkv_proj_blocksize=4, num_heads=4\n        )\n    ),\n    slstm_block=sLSTMBlockConfig(\n        slstm=sLSTMLayerConfig(\n            backend=\"cuda\",\n            num_heads=4,\n            conv1d_kernel_size=4,\n            bias_init=\"powerlaw_blockdependent\",\n        ),\n        feedforward=FeedForwardConfig(proj_factor=1.3, act_fn=\"gelu\"),\n    ),\n    context_length=256,\n    num_blocks=7,\n    embedding_dim=128,\n    slstm_at=[1],\n)\n\nxlstm_stack = xLSTMBlockStack(cfg)\n\nx = torch.randn(4, 256, 128).to(\"cuda\")\nxlstm_stack = xlstm_stack.to(\"cuda\")\ny = xlstm_stack(x)\ny.shape == x.shape\n```\n\nIf you are working with YAML strings \u002F files for configuration, you can also use dacite to create the config dataclasses. The following is equivalent to the snippet above:\n\n```python\nimport torch\n\nfrom omegaconf import OmegaConf\nfrom dacite import from_dict\nfrom dacite import Config as DaciteConfig\nfrom xlstm import xLSTMBlockStack, xLSTMBlockStackConfig\n\nxlstm_cfg = \"\"\"\nmlstm_block:\n  mlstm:\n    conv1d_kernel_size: 4\n    qkv_proj_blocksize: 4\n    num_heads: 4\nslstm_block:\n  slstm:\n    backend: cuda\n    num_heads: 4\n    conv1d_kernel_size: 4\n    bias_init: powerlaw_blockdependent\n  feedforward:\n    proj_factor: 1.3\n    act_fn: gelu\ncontext_length: 256\nnum_blocks: 7\nembedding_dim: 128\nslstm_at: [1]\n\"\"\"\ncfg = OmegaConf.create(xlstm_cfg)\ncfg = from_dict(data_class=xLSTMBlockStackConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))\nxlstm_stack = xLSTMBlockStack(cfg)\n\nx = torch.randn(4, 256, 128).to(\"cuda\")\nxlstm_stack = xlstm_stack.to(\"cuda\")\ny = xlstm_stack(x)\ny.shape == x.shape\n```\n\n### xLSTM Language Model\n\nThe `xLSTMLMModel` is a wrapper around the `xLSTMBlockStack` that adds the token embedding and LM head.\n\n```python\nimport torch\n\nfrom omegaconf import OmegaConf\nfrom dacite import from_dict\nfrom dacite import Config as DaciteConfig\nfrom xlstm import xLSTMLMModel, xLSTMLMModelConfig\n\nxlstm_cfg = \"\"\"\nvocab_size: 50304\nmlstm_block:\n  mlstm:\n    conv1d_kernel_size: 4\n    qkv_proj_blocksize: 4\n    num_heads: 4\nslstm_block:\n  slstm:\n    backend: cuda\n    num_heads: 4\n    conv1d_kernel_size: 4\n    bias_init: powerlaw_blockdependent\n  feedforward:\n    proj_factor: 1.3\n    act_fn: gelu\ncontext_length: 256\nnum_blocks: 7\nembedding_dim: 128\nslstm_at: [1]\n\"\"\"\ncfg = OmegaConf.create(xlstm_cfg)\ncfg = from_dict(data_class=xLSTMLMModelConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))\nxlstm_stack = xLSTMLMModel(cfg)\n\nx = torch.randint(0, 50304, size=(4, 256)).to(\"cuda\")\nxlstm_stack = xlstm_stack.to(\"cuda\")\ny = xlstm_stack(x)\ny.shape[1:] == (256, 50304)\n```
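\n\nSince `xLSTMLMModel` outputs next-token logits, a standard language-modeling loss can be attached directly. A minimal sketch continuing the snippet above; the shifted-target setup is an assumed training recipe, not part of the library:\n\n```python\nimport torch.nn.functional as F\n\n# Next-token prediction: feed x[:, :-1] and predict x[:, 1:].\nlogits = xlstm_stack(x[:, :-1])            # (4, 255, 50304)\nloss = F.cross_entropy(\n    logits.reshape(-1, logits.size(-1)),   # (4*255, 50304)\n    x[:, 1:].reshape(-1),                  # (4*255,)\n)\nloss.backward()\n```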
\n\n## Experiments\n\nThe synthetic experiments that best showcase the respective strengths of sLSTM and mLSTM are the Parity task and the Multi-Query Associative Recall task. The Parity task can only be solved with the state-tracking capabilities provided by the memory-mixing of sLSTM, while the Multi-Query Associative Recall task measures memorization capabilities, where the matrix memory and state expansion of mLSTM are very beneficial.\nIn combination, they do well on both tasks.\n\nTo run each experiment, run `main.py` in the experiments folder, e.g.:\n```bash\nPYTHONPATH=. python experiments\u002Fmain.py --config experiments\u002Fparity_xlstm01.yaml   # xLSTM[0:1], sLSTM only\nPYTHONPATH=. python experiments\u002Fmain.py --config experiments\u002Fparity_xlstm10.yaml   # xLSTM[1:0], mLSTM only\nPYTHONPATH=. python experiments\u002Fmain.py --config experiments\u002Fparity_xlstm11.yaml   # xLSTM[1:1], mLSTM and sLSTM\n```\n\nNote that the training loop does not contain early stopping or test evaluation.\n\n## Citation\n\nIf you use this codebase, or otherwise find our work valuable, please cite the xLSTM papers:\n```\n@inproceedings{beck:24xlstm,\n  title = {xLSTM: Extended Long Short-Term Memory},\n  author = {Maximilian Beck and Korbinian Pöppel and Markus Spanring and Andreas Auer and Oleksandra Prudnikova and Michael Kopp and Günter Klambauer and Johannes Brandstetter and Sepp Hochreiter},\n  booktitle = {Thirty-eighth Conference on Neural Information Processing Systems},\n  year = {2024},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04517},\n}\n\n@inproceedings{beck:25xlstm7b,\n  title = {{xLSTM 7B}: A Recurrent LLM for Fast and Efficient Inference},\n  author = {Maximilian Beck and Korbinian Pöppel and Phillip Lippe and Richard Kurle and Patrick M. Blies and Günter Klambauer and Sebastian Böck and Sepp Hochreiter},\n  booktitle = {Forty-second International Conference on Machine Learning},\n  year = {2025},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13427}\n}\n```\n","\u003Cdiv align=\"center\">\n\n# xLSTM：扩展型长短期记忆网络\n\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Paper&message=2405.04517&color=B31B1B&logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04517)\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fxlstm?color=blue)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fxlstm\u002F)\n[![PyPI 下载量](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNX-AI_xlstm_readme_b096ee60bd26.png)](https:\u002F\u002Fpepy.tech\u002Fprojects\u002Ftirex-ts)\n![GitHub 仓库星标数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNX-AI\u002Fxlstm)\n[![许可证：Apache-2.0](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache--2.0-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n\n![xLSTM 图示](.\u002Fres\u002Fdesc_xlstm_overview.svg)\n\n\u003C\u002Fdiv>\n\n> **论文:** https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04517\n>\n> **作者:** 马克西米利安·贝克、科尔比尼安·珀佩尔、马库斯·施潘林、安德烈亚斯·奥尔、奥莱克桑德拉·普鲁德尼科娃、迈克尔·科普、京特·克拉姆鲍尔、约翰内斯·布兰德施泰特、塞普·霍赫赖特\n\n## 关于\n\nxLSTM 是一种基于原始 LSTM 思想的新型循环神经网络架构。通过采用指数门控机制，并结合适当的归一化和稳定化技术，以及引入新的矩阵记忆单元，xLSTM 克服了原始 LSTM 的局限性，在语言建模任务上表现出与 Transformer 或状态空间模型相媲美的优异性能。\n\n:rotating_light: 我们使用 2.3T 个 token 训练了一个 7B 参数的 xLSTM 语言模型！ :rotating_light:\n\n我们将其优化后的 7B 参数架构称为 xLSTM Large。\n\n## 最小安装步骤\n\n首先从 `environment_pt240cu124.yaml` 文件创建一个 conda 环境，然后仅将模型代码（即 `xlstm` 模块）以包的形式安装。\n\n若要使用 xLSTM Large 7B 模型，请先通过以下命令安装 [`mlstm_kernels`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fmlstm_kernels)：\n```bash\npip install mlstm_kernels\n```\n接着再通过 pip 安装 xlstm 包：\n```bash\npip install xlstm\n```\n或者直接从 GitHub 克隆：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm.git\ncd xlstm\npip install -e .\n```\n\n## 系统要求\n\n本包基于 PyTorch 开发，经测试适用于 `>=1.8` 版本。为确保环境稳定运行，可按如下方式基于已验证的环境配置文件创建环境：\n```bash\nconda env create -n xlstm -f environment_pt240cu124.yaml\nconda activate xlstm\n```\n\n对于 xLSTM Large 7B 模型，我们需要使用我们的 [`mlstm_kernels`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fmlstm_kernels) 包，它提供了针对 xLSTM 的高效计算内核。\n\n\u003Cdiv align=\"center\">\n\n# xLSTM 7B：用于快速高效推理的循环式大型语言模型\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Paper&message=2503.13427&color=B31B1B&logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13427)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-xLSTM_7B-yellow?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FNX-AI\u002FxLSTM-7b)\n[![许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-nxai_community-green)](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Ftirex-internal\u002Fblob\u002Fmain\u002FLICENSE)\n\n> **论文:** https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13427\n>\n> **作者:** 马克西米利安·贝克、科尔比尼安·珀佩尔、菲利普·利佩、理查德·库尔勒、帕特里克·M·布利斯、京特·克拉姆鲍尔、塞巴斯蒂安·博克、塞普·霍赫赖特\n\n![xLSTM 图示](.\u002Fres\u002Fxlstm_7b_poster.svg)\n\n\u003C\u002Fdiv>\n\n我们在训练吞吐量和稳定性方面对 xLSTM 架构进行了优化。更新后的架构代码位于 `xlstm\u002Fxlstm_large` 目录中。\n\n该模型的权重已在 Hugging Face 上发布，地址为：https:\u002F\u002Fhuggingface.co\u002FNX-AI\u002FxLSTM-7b。\n\n## 如何使用 xLSTM Large 7B 及其架构\n\n我们在 [`xlstm\u002Fxlstm_large\u002Fmodel.py`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fblob\u002Fmain\u002Fxlstm\u002Fxlstm_large\u002Fmodel.py) 中提供了一个独立的单文件实现版本，用于 xLSTM Large 架构。此实现依赖于我们的 [`mlstm_kernels`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fmlstm_kernels) 包，除此之外不依赖于 NeurIPS 论文中的 xLSTM 架构实现。\n\n为了方便快速上手，我们还提供了一个用于 xLSTM Large 架构的笔记本文件：[`demo.ipynb`](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fblob\u002Fmain\u002Fnotebooks\u002Fxlstm_large\u002Fdemo.ipynb)，路径为 `notebooks\u002Fxlstm_large\u002Fdemo.ipynb`。\n\n在该笔记本中，我们导入配置和模型类，初始化一个随机模型并执行前向传播，具体操作如下：\n\n```python\nimport torch\nfrom xlstm.xlstm_large.model import xLSTMLargeConfig, xLSTMLarge\n\n# 使用 TFLA Triton 内核配置模型\nxlstm_config = xLSTMLargeConfig(\n    embedding_dim=512,\n    num_heads=4,\n    num_blocks=6,\n    vocab_size=2048,\n    return_last_states=True,\n    mode=\"inference\",\n    chunkwise_kernel=\"chunkwise--triton_xl_chunk\", # xl_chunk == TFLA 内核\n    sequence_kernel=\"native_sequence__triton\",\n    step_kernel=\"triton\",\n)\n# 实例化模型\nxlstm = xLSTMLarge(xlstm_config)\nxlstm = xlstm.to(\"cuda\")\n# 创建输入\ninput = torch.randint(0, 2048, (3, 256)).to(\"cuda\")\n# 运行前向传播\nout = xlstm(input)\nout.shape[1:] == (256, 2048)\n```\n\n## 其他硬件平台的建议\n\n我们主要在 NVIDIA GPU 上测试了该模型，不过我们的 Triton 内核也应该能在 AMD GPU 上运行。对于其他平台，比如 Apple Metal，目前建议使用原生 PyTorch 实现：\n\n```python\nxlstm_config = xLSTMLargeConfig(\n    embedding_dim=512,\n    num_heads=4,\n    num_blocks=6,\n    vocab_size=2048,\n    return_last_states=True,\n    mode=\"inference\",\n    chunkwise_kernel=\"chunkwise--native_autograd\", # 不使用 Triton 内核\n    sequence_kernel=\"native_sequence__native\", # 不使用 Triton 内核\n    step_kernel=\"native\", # 不使用 Triton 内核\n)\n```\n\n如果您正在使用 Apple 的 MLX 生态系统，可以尝试社区驱动的 [xLSTM-metal](https:\u002F\u002Fgithub.com\u002FMLXPorts\u002FxLSTM-metal) 移植项目，它提供了面向 Apple Silicon 的 MLX 原生 xLSTM 实现。\n\n# 来自 xLSTM NeurIPS 论文的模型\n\n本节介绍如何使用 xLSTM 论文中提出的模型。\n\n## 如何使用我们 NeurIPS 论文中的 xLSTM 架构\n\n对于非语言类应用或需要与其他架构集成的情况，您可以使用 `xLSTMBlockStack`；而对于语言建模或其他基于 token 的应用，则可以使用 `xLSTMLMModel`。\n\n### 使用 sLSTM CUDA 内核\n\n对于 sLSTM 的 CUDA 版本，您的设备需具备 Compute Capability >= 8.0，请参阅 [https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus)。如果编译过程中遇到问题，您可以尝试以下方法（感谢 [@zia1138](https:\u002F\u002Fgithub.com\u002Fzia1138) 的提示）：
\n```bash\nexport TORCH_CUDA_ARCH_LIST=\"8.0;8.6;9.0\"\n```\n\n对于所有涉及 PyTorch 和 CUDA 的自定义设置，请务必确保版本匹配。此外，为确保正确加载 CUDA 库，您现在可以使用 `XLSTM_EXTRA_INCLUDE_PATHS` 环境变量来注入不同的包含路径，例如：\n\n```bash\nexport XLSTM_EXTRA_INCLUDE_PATHS='\u002Fusr\u002Flocal\u002Finclude\u002Fcuda\u002F:\u002Fusr\u002Finclude\u002Fcuda\u002F'\n```\n\n或者在 Python 中设置：\n```python\nimport os\nos.environ['XLSTM_EXTRA_INCLUDE_PATHS']='\u002Fusr\u002Flocal\u002Finclude\u002Fcuda\u002F:\u002Fusr\u002Finclude\u002Fcuda\u002F'\n```\n\n如需更加快速的 sLSTM 独立内核，欢迎使用 [FlashRNN](https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fflashrnn) 库。\n\n### xLSTM 块堆栈\n\n`xLSTMBlockStack` 旨在作为现有项目中的替代骨干网络。它类似于 Transformer 块的堆栈，但使用的是 xLSTM 块：\n\n```python\nimport torch\n\nfrom xlstm import (\n    xLSTMBlockStack,\n    xLSTMBlockStackConfig,\n    mLSTMBlockConfig,\n    mLSTMLayerConfig,\n    sLSTMBlockConfig,\n    sLSTMLayerConfig,\n    FeedForwardConfig,\n)\n\ncfg = xLSTMBlockStackConfig(\n    mlstm_block=mLSTMBlockConfig(\n        mlstm=mLSTMLayerConfig(\n            conv1d_kernel_size=4, qkv_proj_blocksize=4, num_heads=4\n        )\n    ),\n    slstm_block=sLSTMBlockConfig(\n        slstm=sLSTMLayerConfig(\n            backend=\"cuda\",\n            num_heads=4,\n            conv1d_kernel_size=4,\n            bias_init=\"powerlaw_blockdependent\",\n        ),\n        feedforward=FeedForwardConfig(proj_factor=1.3, act_fn=\"gelu\"),\n    ),\n    context_length=256,\n    num_blocks=7,\n    embedding_dim=128,\n    slstm_at=[1],\n)\n\nxlstm_stack = xLSTMBlockStack(cfg)\n\nx = torch.randn(4, 256, 128).to(\"cuda\")\nxlstm_stack = xlstm_stack.to(\"cuda\")\ny = xlstm_stack(x)\ny.shape == x.shape\n```\n\n如果你使用 YAML 字符串或文件进行配置，也可以使用 dacite 来创建配置数据类。以下配置与上面的代码片段等价：\n\n```python\nimport torch\n\nfrom omegaconf import OmegaConf\nfrom dacite import from_dict\nfrom dacite import Config as DaciteConfig\nfrom xlstm import xLSTMBlockStack, xLSTMBlockStackConfig\n\nxlstm_cfg = \"\"\"\nmlstm_block:\n  mlstm:\n    conv1d_kernel_size: 4\n    qkv_proj_blocksize: 4\n    num_heads: 4\nslstm_block:\n  slstm:\n    backend: cuda\n    num_heads: 4\n    conv1d_kernel_size: 4\n    bias_init: powerlaw_blockdependent\n  feedforward:\n    proj_factor: 1.3\n    act_fn: gelu\ncontext_length: 256\nnum_blocks: 7\nembedding_dim: 128\nslstm_at: [1]\n\"\"\"\ncfg = OmegaConf.create(xlstm_cfg)\ncfg = from_dict(data_class=xLSTMBlockStackConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))\nxlstm_stack = xLSTMBlockStack(cfg)\n\nx = torch.randn(4, 256, 128).to(\"cuda\")\nxlstm_stack = xlstm_stack.to(\"cuda\")\ny = xlstm_stack(x)\ny.shape == x.shape\n```\n\n### xLSTM 语言模型\n\n`xLSTMLMModel` 是一个围绕 `xLSTMBlockStack` 的封装器，添加了词嵌入和语言模型头。\n\n```python\nimport torch\n\nfrom omegaconf import OmegaConf\nfrom dacite import from_dict\nfrom dacite import Config as DaciteConfig\nfrom xlstm import xLSTMLMModel, xLSTMLMModelConfig\n\nxlstm_cfg = \"\"\"\nvocab_size: 50304\nmlstm_block:\n  mlstm:\n    conv1d_kernel_size: 4\n    qkv_proj_blocksize: 4\n    num_heads: 4\nslstm_block:\n  slstm:\n    backend: cuda\n    num_heads: 4\n    conv1d_kernel_size: 4\n    bias_init: powerlaw_blockdependent\n  feedforward:\n    proj_factor: 1.3\n    act_fn: gelu\ncontext_length: 256\nnum_blocks: 7\nembedding_dim: 128\nslstm_at: [1]\n\"\"\"\ncfg = OmegaConf.create(xlstm_cfg)\ncfg = from_dict(data_class=xLSTMLMModelConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))\nxlstm_stack = xLSTMLMModel(cfg)\n\nx = torch.randint(0, 50304, size=(4, 256)).to(\"cuda\")\nxlstm_stack = xlstm_stack.to(\"cuda\")\ny = xlstm_stack(x)\ny.shape[1:] == (256, 50304)\n```
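\n\n由于 `xLSTMLMModel` 输出的是下一个 token 的 logits，可以直接在其上接一个标准的语言建模损失。以下为一个最小示例（接上面的代码片段；移位目标的训练设置是这里的假设写法，并非库自带功能）：\n\n```python\nimport torch.nn.functional as F\n\n# 下一个 token 预测：输入 x[:, :-1]，预测 x[:, 1:]\nlogits = xlstm_stack(x[:, :-1])            # (4, 255, 50304)\nloss = F.cross_entropy(\n    logits.reshape(-1, logits.size(-1)),   # (4*255, 50304)\n    x[:, 1:].reshape(-1),                  # (4*255,)\n)\nloss.backward()\n```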
256)).to(\"cuda\")\nxlstm_stack = xlstm_stack.to(\"cuda\")\ny = xlstm_stack(x)\ny.shape[1:] == (256, 50304)\n```\n\n\n## 实验\n\n展示 sLSTM 相对于 mLSTM 优势以及反之亦然的最佳合成实验是奇偶校验任务和多查询联想回忆任务。奇偶校验任务只能通过 sLSTM 的内存混合提供的状态跟踪能力来解决。而多查询联想回忆任务则衡量记忆能力，此时 mLSTM 的矩阵内存和状态扩展非常有益。\n两者结合在两项任务中都表现出色。\n\n要运行每个实验，请在 experiments 文件夹中运行 `main.py`，例如：\n```\nPYTHONPATH=. python experiments\u002Fmain.py --config experiments\u002Fparity_xlstm01.yaml   # xLSTM[0:1], 仅 sLSTM\nPYTHONPATH=. python experiments\u002Fmain.py --config experiments\u002Fparity_xlstm10.yaml   # xLSTM[1:0], 仅 mLSTM\nPYTHONPATH=. python experiments\u002Fmain.py --config experiments\u002Fparity_xlstm11.yaml   # xLSTM[1:1], mLSTM 和 sLSTM\n```\n\n请注意，训练循环不包含早停或测试评估。\n\n\n## 引用\n\n如果您使用此代码库，或者以其他方式认为我们的工作有价值，请引用 xLSTM 论文：\n```\n@inproceedings{beck:24xlstm,\n  title = {xLSTM: 扩展的长短期记忆}, \n  author = {Maximilian Beck 和 Korbinian Pöppel、Markus Spanring、Andreas Auer、Oleksandra Prudnikova、Michael Kopp、Günter Klambauer、Johannes Brandstetter 和 Sepp Hochreiter},\n  booktitle = {第三十八届神经信息处理系统会议},\n  year = {2024},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04517}, \n}\n\n@article{beck:25xlstm7b,\n  title = {{xLSTM 7B}: 用于快速高效推理的循环 LLM},\n  author = {Maximilian Beck、Korbinian Pöppel、Phillip Lippe、Richard Kurle、Patrick M. Blies、Günter Klambauer、Sebastian Böck 和 Sepp Hochreiter},\n  booktitle = {第四十二届国际机器学习会议},\n  year = {2025},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13427}\n}\n\n```","# xLSTM 快速上手指南\n\nxLSTM（Extended Long Short-Term Memory）是一种基于原始 LSTM 理念改进的新型循环神经网络架构。它通过指数门控、归一化稳定技术及新的矩阵记忆机制，克服了传统 LSTM 的局限性，在语言建模任务中展现出媲美 Transformer 和状态空间模型（SSM）的性能。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐), macOS, Windows\n- **Python 版本**: 建议 Python 3.8+\n- **GPU 支持**: \n  - NVIDIA GPU (推荐): 需安装 CUDA。若使用 sLSTM CUDA 加速内核，要求 Compute Capability >= 8.0 (如 A100, RTX 3090\u002F4090 等)。\n  - AMD GPU: 支持 Triton 内核。\n  - Apple Silicon: 建议使用原生 PyTorch 实现或社区驱动的 `xLSTM-metal` (MLX) 版本。\n\n### 前置依赖\n- **PyTorch**: 版本 >= 1.8 (推荐使用较新版本以获得最佳性能)。\n- **可选加速库**: \n  - `mlstm_kernels`: 运行 xLSTM Large 7B 模型或追求极致推理速度时必需。\n  - `flashrnn`: 用于独立的更快 sLSTM 内核。\n\n## 安装步骤\n\n### 方法一：使用 Conda 环境（推荐）\n这是最稳定的安装方式，可确保所有依赖版本匹配。\n\n```bash\n# 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm.git\ncd xlstm\n\n# 创建并激活 conda 环境 (基于 environment_pt240cu124.yaml)\nconda env create -n xlstm -f environment_pt240cu124.yaml\nconda activate xlstm\n\n# 以开发模式安装 xlstm 包\npip install -e .\n```\n\n> **提示**：国内用户若下载 conda 包较慢，可配置清华或中科大镜像源：\n> ```bash\n> conda config --add channels https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fpkgs\u002Fmain\u002F\n> conda config --add channels https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fpkgs\u002Ffree\u002F\n> ```\n\n### 方法二：使用 Pip 直接安装\n如果你已有合适的 PyTorch 环境，可直接通过 pip 安装。\n\n```bash\n# 安装基础 xlstm 包\npip install xlstm\n\n# 【可选】若需使用 xLSTM Large 7B 模型，请额外安装加速内核\npip install mlstm_kernels\n```\n\n### 特殊配置：CUDA 编译问题\n若在使用 sLSTM CUDA 内核时遇到编译错误，请尝试设置以下环境变量指定 GPU 架构列表：\n\n```bash\nexport TORCH_CUDA_ARCH_LIST=\"8.0;8.6;9.0\"\n```\n\n若需指定额外的 CUDA 头文件路径：\n```bash\nexport XLSTM_EXTRA_INCLUDE_PATHS='\u002Fusr\u002Flocal\u002Finclude\u002Fcuda\u002F:\u002Fusr\u002Finclude\u002Fcuda\u002F'\n```\n\n## 基本使用\n\n### 场景一：使用 xLSTM Large 7B 架构 (推理优化版)\n适用于需要高性能推理的场景。此实现依赖 `mlstm_kernels` 和 Triton。\n\n```python\nimport torch\nfrom xlstm.xlstm_large.model import xLSTMLargeConfig, xLSTMLarge\n\n# 1. 
配置模型 (使用 TFLA Triton 内核加速)\nxlstm_config = xLSTMLargeConfig(\n    embedding_dim=512,\n    num_heads=4,\n    num_blocks=6,\n    vocab_size=2048,\n    return_last_states=True,\n    mode=\"inference\",\n    chunkwise_kernel=\"chunkwise--triton_xl_chunk\", # xl_chunk == TFLA kernels\n    sequence_kernel=\"native_sequence__triton\",\n    step_kernel=\"triton\",\n)\n\n# 2. 实例化模型并移至 GPU\nxlstm = xLSTMLarge(xlstm_config)\nxlstm = xlstm.to(\"cuda\")\n\n# 3. 创建输入数据 (batch_size=3, seq_len=256)\ninput_ids = torch.randint(0, 2048, (3, 256)).to(\"cuda\")\n\n# 4. 执行前向传播\noutput = xlstm(input_ids)\n\n# 输出形状验证: (batch, seq_len, vocab_size) -> (3, 256, 2048)\nprint(output.shape) \n```\n\n> **非 NVIDIA 硬件用户注意**：若在 Apple Metal 或其他不支持 Triton 的平台，请将配置中的 kernel 选项改为 native 版本：\n> ```python\n> chunkwise_kernel=\"chunkwise--native_autograd\",\n> sequence_kernel=\"native_sequence__native\",\n> step_kernel=\"native\",\n> ```\n\n### 场景二：使用标准 xLSTM 模块 (研究与自定义)\n适用于将 xLSTM 作为骨干网络集成到其他项目中，或进行学术研究。支持混合使用 sLSTM 和 mLSTM 块。\n\n```python\nimport torch\nfrom xlstm import (\n    xLSTMBlockStack,\n    xLSTMBlockStackConfig,\n    mLSTMBlockConfig,\n    mLSTMLayerConfig,\n    sLSTMBlockConfig,\n    sLSTMLayerConfig,\n    FeedForwardConfig,\n)\n\n# 1. 定义配置\ncfg = xLSTMBlockStackConfig(\n    mlstm_block=mLSTMBlockConfig(\n        mlstm=mLSTMLayerConfig(\n            conv1d_kernel_size=4, qkv_proj_blocksize=4, num_heads=4\n        )\n    ),\n    slstm_block=sLSTMBlockConfig(\n        slstm=sLSTMLayerConfig(\n            backend=\"cuda\", # 使用 CUDA 加速后端\n            num_heads=4,\n            conv1d_kernel_size=4,\n            bias_init=\"powerlaw_blockdependent\",\n        ),\n        feedforward=FeedForwardConfig(proj_factor=1.3, act_fn=\"gelu\"),\n    ),\n    context_length=256,\n    num_blocks=7,\n    embedding_dim=128,\n    slstm_at=[1], # 在第 1 层使用 sLSTM，其余默认为 mLSTM\n)\n\n# 2. 初始化模型\nxlstm_stack = xLSTMBlockStack(cfg)\nxlstm_stack = xlstm_stack.to(\"cuda\")\n\n# 3. 准备输入 (batch=4, seq_len=256, dim=128)\nx = torch.randn(4, 256, 128).to(\"cuda\")\n\n# 4. 
前向传播\ny = xlstm_stack(x)\n\n# 输出形状应与输入一致 (除了可能的内部状态处理)\nassert y.shape == x.shape\n```\n\n### 场景三：构建语言模型 (LM)\n`xLSTMLMModel` 是包含 Token 嵌入层和 LM Head 的完整封装，适合直接进行语言建模训练。\n\n```python\nfrom omegaconf import OmegaConf\nfrom dacite import from_dict\nfrom dacite import Config as DaciteConfig\nfrom xlstm import xLSTMLMModel, xLSTMLMModelConfig\nimport torch\n\n# 使用 YAML 字符串配置模型\nxlstm_cfg_str = \"\"\" \nvocab_size: 50304\nmlstm_block:\n  mlstm:\n    conv1d_kernel_size: 4\n    qkv_proj_blocksize: 4\n    num_heads: 4\nslstm_block:\n  slstm:\n    backend: cuda\n    num_heads: 4\n    conv1d_kernel_size: 4\n    bias_init: powerlaw_blockdependent\n  feedforward:\n    proj_factor: 1.3\n    act_fn: gelu\ncontext_length: 256\nnum_blocks: 7\nembedding_dim: 128\nslstm_at: [1]\n\"\"\"\n\n# 解析配置并创建模型\ncfg_dict = OmegaConf.create(xlstm_cfg_str)\ncfg = from_dict(\n    data_class=xLSTMLMModelConfig, \n    data=OmegaConf.to_container(cfg_dict), \n    config=DaciteConfig(strict=True)\n)\n\nmodel = xLSTMLMModel(cfg)\nmodel = model.to(\"cuda\")\n\n# 模拟输入 token IDs\ninput_ids = torch.randint(0, 50304, size=(4, 256)).to(\"cuda\")\n\n# 前向传播\nlogits = model(input_ids)\n\n# 输出形状: (batch, seq_len, vocab_size)\nprint(logits.shape) # torch.Size([4, 256, 50304])\n```","某边缘计算团队正在为工业物联网网关开发一款本地化日志异常检测系统，需在资源受限设备上实时分析海量设备日志流。\n\n### 没有 xlstm 时\n- **显存占用过高**：传统 Transformer 架构随着序列长度增加，显存消耗呈平方级增长，导致普通工控机无法加载大模型。\n- **推理延迟波动大**：在处理长上下文历史日志时，注意力机制计算耗时剧烈波动，难以满足毫秒级实时报警需求。\n- **长程依赖捕捉弱**：标准 LSTM 在分析跨越数小时的故障前兆时，容易遗忘早期关键信号，导致漏报率高。\n- **训练收敛缓慢**：在大规模日志数据上训练时，梯度不稳定，需要极长的调参周期才能达到可用精度。\n\n### 使用 xlstm 后\n- **显存效率显著提升**：xLSTM 凭借线性复杂度的矩阵记忆机制，将长序列处理的显存占用降低了一个数量级，顺利部署于边缘设备。\n- **推理速度稳定快速**：指数门控与优化内核让推理时间随序列长度线性增长，实现了稳定且低延迟的实时流式分析。\n- **长程记忆能力增强**：新的存储结构完美保留了数小时前的微弱异常特征，大幅提升了针对隐蔽性故障的召回率。\n- **训练稳定性提高**：内置的归一化与稳定技术使得模型在 2.3T token 量级数据上也能快速收敛，缩短了研发迭代周期。\n\nxLSTM 通过突破传统循环神经网络的记忆瓶颈，让高性能长序列建模得以在低成本边缘设备上高效运行。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNX-AI_xlstm_aaebffbe.png","NX-AI","NXAI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNX-AI_c7af33e1.png","",null,"https:\u002F\u002Fgithub.com\u002FNX-AI",[79,83,87,91,95],{"name":80,"color":81,"percentage":82},"Python","#3572A5",45.7,{"name":84,"color":85,"percentage":86},"Cuda","#3A4E3A",27.8,{"name":88,"color":89,"percentage":90},"Jupyter Notebook","#DA5B0B",15.8,{"name":92,"color":93,"percentage":94},"C++","#f34b7d",10.3,{"name":96,"color":97,"percentage":98},"C","#555555",0.4,2150,179,"2026-04-16T23:15:15","Apache-2.0","Linux, macOS","NVIDIA GPU (必需，用于 CUDA 内核加速，计算能力 >= 8.0，如 A100\u002FH100 等；支持 AMD GPU 通过 Triton 内核；Apple Silicon 需使用社区移植版或原生 PyTorch 模式)","未说明 (训练 7B 模型通常需要大量显存和系统内存)",{"notes":107,"python":108,"dependencies":109},"1. 官方推荐使用提供的 conda 环境文件 (environment_pt240cu124.yaml) 以确保 PyTorch 和 CUDA 版本匹配。2. 若使用 sLSTM CUDA 内核，显卡计算能力必须 >= 8.0，并可能需要设置 TORCH_CUDA_ARCH_LIST 环境变量。3. 在非 NVIDIA 平台（如 Apple Metal）运行时，需手动配置使用原生 PyTorch 内核而非 Triton 内核，或使用社区提供的 MLX 版本。4. 
可通过 XLSTM_EXTRA_INCLUDE_PATHS 环境变量指定 CUDA 头文件路径以解决编译问题。","未说明 (通过 conda 环境文件 environment_pt240cu124.yaml 管理，通常对应 Python 3.10+)",[110,111,112,113,114],"torch>=1.8","mlstm_kernels (可选，用于 xLSTM Large 7B 加速)","triton (隐含依赖，用于自定义内核)","dacite","omegaconf",[14,35],[117,118,119,120,121,122],"deep-learning","deep-learning-architecture","llm","machine-learning","nlp","rnn","2026-03-27T02:49:30.150509","2026-04-19T06:02:50.153991",[126,131,136,141,146,150],{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},41716,"遇到 'RuntimeError: Error building extension' 或 CUDA 编译错误（如 __halves2bfloat16 未定义）怎么办？","这通常是由于缺少必要的 CUDA 编译器库或环境配置不一致导致的。解决方案包括：\n1. 尝试安装 cccl 包：运行命令 `conda install cccl`。\n2. 确保使用 Conda 环境进行安装，因为项目测试依赖 Conda 来管理头文件和依赖，避免使用系统自带的不同版本 PyTorch 或 CUDA。\n3. 如果问题依旧，可以考虑使用 Docker 容器以保证环境的一致性。","https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fissues\u002F19",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},41717,"在 PyTorch 2.6.0 及以上版本运行时出现 'TypeError: include_paths() got an unexpected keyword argument cuda' 错误如何解决？","这是一个已知的兼容性问题。该问题在较新的代码版本中已被修复。如果您使用的是旧版本（如 v1.0.8 或 v2.0.2），请升级到最新的代码版本以解决此错误。维护者已确认并修复了针对 Torch 2.6.0+ 的适配问题。","https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fissues\u002F66",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},41718,"运行代码时提示 'RuntimeError: Ninja is required to load C++ extensions' 是什么意思？","这意味着您的系统中缺少 'ninja' 构建工具，而加载 C++\u002FCUDA 扩展必须依赖它。请在您的环境中安装 ninja，例如使用 pip 安装：`pip install ninja`，或者通过 conda 安装：`conda install ninja`。安装完成后重新运行代码即可。","https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fissues\u002F12",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},41719,"遇到 LayerNorm 形状不匹配错误 'expected input with shape [*, 128], but got input of size[64, 10, 6]' 怎么办？","这个错误通常是因为输入数据的维度与模型预期的归一化形状不匹配。检查您的数据预处理步骤，确保输入张量的最后一个维度与 `normalized_shape` 一致。此外，确认您是否按照示例正确重塑了数据维度。建议参考官方提供的测试设置，通常推荐使用 Conda 环境来避免此类隐式的维度处理差异。","https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fissues\u002F52",{"id":147,"question_zh":148,"answer_zh":149,"source_url":135},41720,"如何获取可在中等硬件上运行的 xLSTM 推理示例代码？","官方目前主要提供训练和基础测试代码。社区用户已经基于官方片段构建了基本的流式推理示例，可以在 GitHub 仓库 `ai-bits\u002Fxlstm-test` 中找到相关资源。这些示例包含了在普通硬件环境下进行推理所需的技巧（如 token 拼接等），适合想要快速上手体验的用户。",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},41721,"代码上周还能正常运行，本周突然报错且未做任何更改，可能是什么原因？","这种情况通常是由底层依赖库（如 PyTorch、CUDA 驱动或编译器）的自动更新引起的环境变化。建议检查您的 Python 环境中的包版本是否发生了变动。如果是 CUDA 扩展编译错误，尝试清理之前的编译缓存（删除 `build` 目录或 `__pycache__`），并确保安装了正确的 `cccl` 包（`conda install cccl`）。保持环境与项目推荐的 Conda 配置一致通常能解决此类突发问题。","https:\u002F\u002Fgithub.com\u002FNX-AI\u002Fxlstm\u002Fissues\u002F54",[156],{"id":157,"version":158,"summary_zh":159,"released_at":160},333762,"v2.0.4","修复标准sLSTM单元的稳定性问题。","2025-05-28T22:40:53"]