[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-libxsmm--libxsmm":3,"tool-libxsmm--libxsmm":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":67,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":119,"forks":120,"last_commit_at":121,"license":122,"difficulty_score":123,"env_os":124,"env_gpu":125,"env_ram":126,"env_deps":127,"category_tags":135,"github_topics":136,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":156,"updated_at":157,"faqs":158,"releases":189},2679,"libxsmm\u002Flibxsmm","libxsmm","Library for specialized dense and sparse matrix operations, and deep learning primitives.","libxsmm 是一款专为高性能计算设计的开源库，专注于加速密集与稀疏矩阵运算，以及深度学习中的基础操作（如小型卷积）。它主要解决了在 Intel 架构处理器上，如何无需重新编译即可自动适配不同指令集（如 SSE、AVX-512 及未来的 AMX）以榨取极致性能的问题。\n\n传统数学库往往需要针对特定硬件设置复杂的编译标志，而 libxsmm 通过独特的即时（JIT）代码生成技术，能够在运行时动态生成最优化的机器码。这意味着开发者只需构建一次程序，即可在各种支持的 Intel CPU 上自动获得最佳执行效率，真正实现了“一次构建，处处高效”。此外，它还广泛支持 FP64、FP32、bfloat16 以及 int8\u002Fint16 等多种数据类型，灵活满足从科学计算到量化神经网络的不同需求。\n\n这款工具非常适合需要底层性能优化的 C\u002FC++ 或 Fortran 开发者、高性能计算研究人员以及从事深度学习框架开发的工程师使用。如果你正在开发对矩阵乘法速度极其敏感的应用，或者希望在异构计算环境中简化部署流程，libxsmm 提供了一个强大且编译器无关的解决方案，帮助你将硬件潜力发挥到极致。","# LIBXSMM\n\n[![BSD 3-Clause License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD3-blue.svg \"BSD 3-Clause License\")](LICENSE.md) [![GCC Build Status](https:\u002F\u002Fbadge.buildkite.com\u002F2e962d4cfc7ddb10a6cd6c27b0d8033edf179a799e156cb363.svg?branch=main \"GCC Build Status\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![Clang Build Status](https:\u002F\u002Fbadge.buildkite.com\u002Fdafe7b363a2e66f7d5c9087f074f3eceb69b9aae4278202fd7.svg?branch=main \"Clang Build Status\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![Intel Build Status](https:\u002F\u002Fbadge.buildkite.com\u002F63b5dc4095f460f1c011ae782f8e67ec0b8a6a9732d8abe3c7.svg?branch=main \"Intel Build Status\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![Mixed Build Status](https:\u002F\u002Fbadge.buildkite.com\u002Ffad67b2fcad79e07ddfe9141974f360e9eca6223cd89e3593f.svg?branch=main \"Mixed Build Status\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![Static Analysis Status](https:\u002F\u002Fscan.coverity.com\u002Fprojects\u002F7405\u002Fbadge.svg \"Static Analysis Status\")](https:\u002F\u002Fscan.coverity.com\u002Fprojects\u002Fhfp-libxsmm) [![Read the 
Docs](https:\u002F\u002Freadthedocs.org\u002Fprojects\u002Flibxsmm\u002Fbadge\u002F?version=latest \"Read the Docs\")](https:\u002F\u002Flibxsmm.readthedocs.io\u002F)\n\nLIBXSMM is a library for specialized dense and sparse matrix operations as well as for deep learning primitives such as small convolutions. The library is targeting Intel Architecture with \u003Cspan>Intel&#160;SSE\u003C\u002Fspan>, \u003Cspan>Intel&#160;AVX\u003C\u002Fspan>, \u003Cspan>Intel&#160;AVX2\u003C\u002Fspan>, \u003Cspan>Intel&#160;AVX&#8209;512\u003C\u002Fspan> (with VNNI and Bfloat16), and \u003Cspan>Intel&#160;AMX\u003C\u002Fspan> (Advanced Matrix Extensions) supported by future Intel processor code-named Sapphire Rapids. Code generation is mainly based on \u003Cspan>Just&#8209;In&#8209;Time (JIT)\u003C\u002Fspan> code specialization for compiler-independent performance (matrix multiplications, matrix transpose\u002Fcopy, sparse functionality, and deep learning). LIBXSMM is suitable for \"build once and deploy everywhere\", i.e., no special target flags are needed to exploit the available performance. Supported GEMM datatypes are: `FP64`, `FP32`, `bfloat16`, `int16`, and `int8`.\n\nFor a list of questions and answers, please also have a look at [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FQ&A](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FQ&A).\n\n**Where to go for documentation?**\n\n* **ReadtheDocs**: [main](https:\u002F\u002Flibxsmm.readthedocs.io\u002F) and [sample](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_samples\u002F) documentation with full text search.\n* **PDF**: [main](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fraw\u002Fmain\u002Fdocumentation\u002Flibxsmm.pdf) documentation file, and separate [sample](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fraw\u002Fmain\u002Fdocumentation\u002Flibxsmm_samples.pdf) documentation.\n* **Articles**: [magazine article](https:\u002F\u002Fsoftware.intel.com\u002Fsites\u002Fdefault\u002Ffiles\u002Fparallel-universe-issue-34.pdf) incl. [sample code](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fmagazine) (full list of [Articles](#articles)).\n\n\u003Ca name=\"getting-started\">\u003C\u002Fa>\u003Ca name=\"hello-libxsmm\">\u003C\u002Fa>**Getting Started**: The following C++ code is focused on a specific functionality but may be considered as [Hello LIBXSMM](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fhello). 
Build the example with `cd \u002Fpath\u002Fto\u002Flibxsmm; make STATIC=0` (shared library), save the code under `hello.cpp` (below) and compile with `g++ -I\u002Fpath\u002Fto\u002Flibxsmm\u002Finclude hello.cpp -L\u002Fpath\u002Fto\u002Flibxsmm\u002Flib -lxsmm -lblas -o hello` (GNU GCC), and finally execute with `LD_LIBRARY_PATH=\u002Fpath\u002Fto\u002Flibxsmm\u002Flib LIBXSMM_VERBOSE=2 .\u002Fhello`.\n\n```cpp\n#include \u003Clibxsmm.h>\n#include \u003Cvector>\nint main(int argc, char* argv[]) {\n  typedef double T;\n  int batchsize = 1000, m = 13, n = 5, k = 7;\n  std::vector\u003CT> a(batchsize * m * k), b(batchsize * k * n), c(m * n, 0);\n  \u002F* C\u002FC++ and Fortran interfaces are available *\u002F\n  typedef libxsmm_mmfunction\u003CT> kernel_type;\n  \u002F* generates and dispatches a matrix multiplication kernel (C++ functor) *\u002F\n  kernel_type kernel(LIBXSMM_GEMM_FLAG_NONE, m, n, k, 1.0 \u002F*alpha*\u002F, 1.0 \u002F*beta*\u002F);\n  assert(kernel);\n  for (int i = 0; i \u003C batchsize; ++i) { \u002F* initialize input *\u002F\n    for (int ki = 0; ki \u003C k; ++ki) {\n      for (int j = 0; j \u003C m; ++j) a[i * j * ki] = static_cast\u003CT>(1) \u002F ((i + j + ki) % 25);\n      for (int j = 0; j \u003C n; ++j) b[i * j * ki] = static_cast\u003CT>(7) \u002F ((i + j + ki) % 75);\n    }\n  }\n  \u002F* kernel multiplies and accumulates matrices: C += Ai * Bi *\u002F\n  for (int i = 0; i \u003C batchsize; ++i) kernel(&a[i * m * k], &b[i * k * n], &c[0]);\n}\n```\n\nPlain [C code](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsamples\u002Fhello\u002Fhello.c) as well as [Fortran code](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsamples\u002Fhello\u002Fhello.f) resemble the same [example](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fhello).\n\n\u003Ca name=\"what-is-a-small-matrix-multiplication\">\u003C\u002Fa>**What is a small matrix multiplication?** When characterizing the problem-size by using the M, N, and K parameters, a problem-size suitable for LIBXSMM falls approximately within \u003Ci>(M&#160;N&#160;K)\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&#160;64\u003C\u002Fi> (which illustrates that non-square matrices or even \"tall and skinny\" shapes are covered as well). The library does not employ multi-level K, M, N blocking. Using LIBXSMM for much larger sizes may generate excessive amounts of code (due to unrolling in the M or K dimension) and also lacks a tiling scheme to effectively utilize the cache hierarchy. In terms of GEMM, the supported kernels are limited to *Alpha := 1*, *Beta := \\{ 1, 0 \\}*, and *TransA := 'N'*.\n\n## Interfaces and Domains\u003Ca name=\"interfaces\">\u003C\u002Fa>\n\n### Overview\u003Ca name=\"general-interface\">\u003C\u002Fa>\n\nPlease have a look at [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Finclude](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Finclude) for all published functions. 
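Complementing the C++ functor above, the same dispatch-and-call pattern is exposed through the C interface. The following is a minimal sketch assuming the classic typed dispatch API (`libxsmm_dmmdispatch` returning a `libxsmm_dmmfunction`, with NULL arguments requesting the defaults for leading dimensions, alpha\u002Fbeta, flags, and prefetch); the generated `libxsmm.h` of a given checkout documents the exact signatures.\n\n```c\n#include \u003Clibxsmm.h>\nint main(void) {\n  const libxsmm_blasint m = 13, n = 5, k = 7;\n  double a[13 * 7], b[7 * 5], c[13 * 5] = { 0 };\n  \u002F* trivial input initialization *\u002F\n  for (int i = 0; i \u003C m * k; ++i) a[i] = 1.0 \u002F (1.0 + i);\n  for (int i = 0; i \u003C k * n; ++i) b[i] = 2.0 \u002F (1.0 + i);\n  \u002F* JIT-generate (or fetch from the code registry) an m-by-n-by-k kernel *\u002F\n  const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(m, n, k,\n    NULL \u002F*lda*\u002F, NULL \u002F*ldb*\u002F, NULL \u002F*ldc*\u002F,\n    NULL \u002F*alpha*\u002F, NULL \u002F*beta*\u002F, NULL \u002F*flags*\u002F, NULL \u002F*prefetch*\u002F);\n  \u002F* the dispatched kernel is a plain function pointer and can be called repeatedly *\u002F\n  if (NULL != kernel) kernel(a, b, c); \u002F* C += A * B *\u002F\n  return NULL != kernel ? 0 : 1;\n}\n```\n\nAs with the C++ example, a dispatched kernel can be reused for any number of calls with the same shape and precision.\n\n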
Get started with the following list of available domains and documented functionality:\n\n* MM: [Matrix Multiplication](#matrix-multiplication)\n* TPP: [Tensor Processing Primitives](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fdocumentation\u002Flibxsmm_tpp.md)\n* DNN: [Deep Neural Networks](#deep-neural-networks)\n* AUX: [Service Functions](#service-functions)\n* PERF: [Performance](#performance)\n* BE: [Backend](#jit-backend)\n\nTo initialize library internal resources, an explicit initialization routine helps to avoid lazy initialization overhead when calling LIBXSMM for the first time. The library deallocates internal resources at program exit, but also provides a companion of the afore mentioned initialization (finalize).\n\n```C\n\u002F** Initialize the library; pay for setup cost at a specific point. *\u002F\nvoid libxsmm_init(void);\n\u002F** De-initialize the library and free internal memory (optional). *\u002F\nvoid libxsmm_finalize(void);\n```\n\n### Matrix Multiplication\u003Ca name=\"interface-for-matrix-multiplication\">\u003C\u002Fa>\n\nThis domain (MM) supports Small Matrix Multiplications (SMM), batches of multiple multiplications as well as the industry-standard interface for GEneral Matrix Matrix multiplication (GEMM).\n\nThe [Matrix Multiplication domain (MM)](documentation\u002Flibxsmm_mm.md) contains routines for:\n\n* [Small, tiled, and parallelized matrix multiplications](documentation\u002Flibxsmm_mm.md#overview)\n* [Manual code dispatch (customized matrix batches)](documentation\u002Flibxsmm_mm.md#manual-code-dispatch)\n\n### Deep Learning\u003Ca name=\"interface-for-dl\">\u003C\u002Fa>\n\nHere we demonstrate how common operators in deep learning applications (GEMM with activation function fusion, Convolutions with activation function fusion, various norming operators, and pooling operators, etc.) can be implemented using the Tensor Processing Primitive provided by LIBXSMM. Example drivers for performance evaluation are provided as part of [LIBXSMM_DNN](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm-dnn\u002Ftree\u002Fmain\u002Ftests).\n\n### Service Functions\n\nFor convenient operation of the library and to ease integration, some service routines are available. These routines may not belong to the core functionality of LIBXSMM (SMM or DNN domain), but users are encouraged to use this domain (AUX). There are two categories: \u003Cspan>(1)&#160;routines\u003C\u002Fspan> which are available for C and FORTRAN, and \u003Cspan>(2)&#160;routines\u003C\u002Fspan> that are only available per C interface.\n\nThe [service function domain (AUX)](documentation\u002Flibxsmm_aux.md) contains routines for:\n\n* [Getting and setting the target architecture](documentation\u002Flibxsmm_aux.md#getting-and-setting-the-target-architecture)\n* [Getting and setting the verbosity](documentation\u002Flibxsmm_aux.md#getting-and-setting-the-verbosity)\n* [Measuring time durations (timer)](documentation\u002Flibxsmm_aux.md#timer-facility)\n* [Dispatching user-data and multiple kernels](documentation\u002Flibxsmm_aux.md#user-data-dispatch)\n* [Allocating memory](documentation\u002Flibxsmm_aux.md#memory-allocation)\n\n### Backend\u003Ca name=\"jit-backend\">\u003C\u002Fa>\n\nMore information about the JIT-backend and the code generator can be found in a separate [document](documentation\u002Flibxsmm_be.md). 
The [encoder sample collection](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fencoder) can help to get started writing a kernel using LIBXSMM. Please note, LIBXSMM's stand-alone \u003Ca name=\"generator-driver\">\u003C\u002Fa>[generator-driver](documentation\u002Flibxsmm_be.md#generator-driver) is considered legacy (deprecated).\n\n## Build Instructions\n\n### Overview\n\nThe main interface file is *generated*, and it is therefore **not** stored in the code repository. To inspect the interface for [C\u002FC++](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsrc\u002Ftemplate\u002Flibxsmm.h) and [FORTRAN](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsrc\u002Ftemplate\u002Flibxsmm.f), one can take a look at the template files used to generate the actual interface. There are two general ways to build and use LIBXSMM:\n\n* [Classic Library (ABI)](#classic-library-abi) and [Link Instructions](#link-instructions) (C\u002FC++ and FORTRAN)\n* [Header-Only](#header-only) (C and C++)\n\n**Note**: LIBXSMM is available as prebuilt package for Fedora\u002FRedHat\u002FCentOS, Debian\u002FUbuntu, FreeBSD, and others. Further, LIBXSMM can be installed with the [Spack Package Manager](https:\u002F\u002Fcomputing.llnl.gov\u002Fprojects\u002Fspack-hpc-package-manager) or per [EasyBuild+EasyConfig](https:\u002F\u002Fgithub.com\u002Feasybuilders).\n\n### Classic Library (ABI)\n\nThere are two ways to rely on prebuilt code for a given project: \u003Cspan>(1)&#160;using\u003C\u002Fspan> LIBXSMM's Makefile based build system, \u003Cspan>(2)&#160;or\u003C\u002Fspan> using another build system and writing own [rules for building LIBXSMM](#rules-for-building-libxsmm). The Makefile based build system relies on \u003Cspan>GNU&#160;Make\u003C\u002Fspan> (typically associated with the `make` command, but e.g. FreeBSD is calling it `gmake`). The build can be customized by using \u003Cspan>key&#8209;value\u003C\u002Fspan> pairs. \u003Cspan>Key&#8209;value\u003C\u002Fspan> pairs can be supplied in two ways: \u003Cspan>(1)&#160;after\u003C\u002Fspan> the \"make\" command, or \u003Cspan>(2)&#160;prior\u003C\u002Fspan> to the \"make\" command (`env`) which is effectively the same as exporting the \u003Cspan>key&#8209;value\u003C\u002Fspan> pair as an environment variable (`export`, or `setenv`). Both methods can be mixed (the second method may require make's `-e` flag).\n\n\u003Ca name=\"zero-config-abi\">\u003C\u002Fa>In contrast to [header-only](#zero-config) which does not require configuration by default, 3rd-party build systems can compile and link LIBXSMM's sources but still avoid configuring the library (per `libxsmm_config.py`). The prerequisite to omit configuration is to opt-in by defining LIBXSMM_DEFAULT_CONFIG (`-D`). The zero-config feature is not available for LIBXSMM's Fortran interface.\n\n**Note**: By default, C\u002FC++ and FORTRAN compilers are needed (some sample code is written in C++). Beside of specifying the compilers (`make CXX=g++ CC=gcc FC=gfortran` and maybe `AR=ar`), the need for a FORTRAN compiler can be relaxed (`make FC=` or `make FORTRAN=0`). The latter affects the availability of the MODule file and the corresponding `libxsmm.f` library (the interface `libxsmm.f` is still generated).\n\nThe build system considers a set of given key-value pairs as a single unique build and triggers a rebuild for a distinct set of flags. 
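To make the two ways of supplying key-value pairs concrete, here is a short sketch; the keys shown (AVX, OMP, and the compiler variables) are merely examples of the kind of settings discussed above:\n\n```bash\n# (1) key-value pairs given after the \"make\" command\nmake CXX=g++ CC=gcc FC=gfortran AVX=3 OMP=1\n\n# (2) key-value pairs given prior to the \"make\" command (environment);\n#     this form may require make's -e flag\nAVX=3 OMP=1 make -e CXX=g++ CC=gcc FC=gfortran\n```\n\n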
For more advanced builds or additional background, please consult the section about [Customization](documentation\u002Flibxsmm_tune.md). To generate the interface of the library inside of the `include` directory and to build the static library (by default, STATIC=1 is activated). Run any (or both) of the following command(s):\n\n```bash\nmake STATIC=0\nmake\n```\n\nOn CRAY systems, the CRAY Compiling Environment (CCE) should be used regardless of utilizing the CRAY compiler, the Intel Compiler, or the \u003Cspan>GNU&#160;Compiler Collection (GCC)\u003C\u002Fspan>. The CCE is eventually suppressing to build shared libraries (STATIC=0). In any case, \u003Cspan>(1)&#160;switch\u003C\u002Fspan> to the desired compiler (module load\u002Fswitch), and \u003Cspan>(2)&#160;rely\u003C\u002Fspan> on:\n\n```bash\nmake CXX=CC CC=cc FC=ftn\n```\n\nA variety of build environments is out-of-the-box compatible, see [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FCompatibility](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FCompatibility). If the build process is not successful, it may help to avoid advanced GCC flags. This is useful with a tool chain, which pretends to be GCC-compatible (and is treated as such) but fails to consume the afore mentioned flags:\n\n```bash\nmake COMPATIBLE=1\n```\n\n\u003Ca name=\"outdated-binutils\">\u003C\u002Fa>In case of outdated Binutils, compilation can fail to assemble code when building the library (this has nothing to do with JIT-generated code and it does not affect how JIT-code is targeting the system). LIBXSMM implements some functionality using compiler-intrinsics and multiple code-paths which are scheduled according to CPUID. In contrast to `INTRINSICS=2` (default), `INTRINSICS=1` enables a fully static code path according to the desired target. If no target is given (e.g., `AVX=3`, or `AVX=2`), instruction set extensions cannot be leveraged for such code-paths. Try to fix failing compilation by building the latest GNU Binutils (and `export PATH=\u002Fpath\u002Fto\u002Fbinutils\u002Fbin:${PATH}`). Binutils are versioned independently of \u003Cspan>GNU&#160;GCC\u003C\u002Fspan> and other compilers. If one cannot update Binutils, work around with a CPUID-value as tabulated in [libxsmm_cpuid.h](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Finclude\u002Flibxsmm_cpuid.h): start at the upper end (less than 1999) and decrement until compilation passes (make INTRINSICS=_CPUID_, e.g., `make INTRINSICS=1021`). As a last resort, rely on a fully static code path:\n\n```bash\nmake INTRINSICS=1\n```\n\nTo test and validate a build, please consult [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FValidation](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FValidation). To run some basic sanity checks, remember that each set of given key-value pairs represents a different build (and test):\n\n```bash\nmake STATIC=0 tests\n```\n\nTo remove intermediate files, or to remove all generated files and folders (including the interface and the library archives), run one of the make-targets below. 
An additional distclean-target recursively cleans the entire tree (after \u003Cspan>version&#160;1.9\u003C\u002Fspan>).\n\n```bash\nmake clean\nmake realclean\n```\n\n\u003Ca name=\"fortran\">\u003C\u002Fa>FORTRAN code can make use of LIBXSMM:\n\n* By using the module and linking with `libxsmmf`, `libxsmm`, and `libxsmmext`,\n* \u003Ca name=\"header-only-fortran\">\u003C\u002Fa>By including `libxsmm.f` and linking with `libxsmm`, and `libxsmmext`, or\n* By (implicitly) calling a SUBROUTINE and linking with `libxsmm`, and `libxsmmext`.\n\n**Note**: `libxsmmf` requires `libxsmmext` (starting with LIBXSMM&#160;2.0), and thereby requires to link with the OpenMP runtime as well.\n\nUsing the Fortran module (or including the interface), requires at least a \u003Cspan>Fortran&#160;2003\u003C\u002Fspan> compiler (F2K3). \u003Cspan>FORTRAN&#160;77\u003C\u002Fspan> compatibility is only implicitly available (no interface), and the available subset of routines is documented in `libxsmm.f` and marked with [comments](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fsearch?q=implementation+provided+for+Fortran+77+compatibility) (part of the implementation).\n\n### Header-Only\n\n\u003Cspan>Version&#160;1.4.4\u003C\u002Fspan> introduced support for \"header-only\" usage in C and C++. By only including `libxsmm_source.h` allows to get around building the library. However, this gives up on a clearly defined application binary interface (ABI). An ABI may allow for hot-fixes after deploying an application (when relying on the shared library form), and it may also ensure to only rely on the public interface of LIBXSMM. In contrast, the header-only form not only exposes the internal implementation of LIBXSMM but can also increase the turnaround time during development of an application (due to longer compilation times). The header file is intentionally named \"libxsmm_**source**.h\" since this header file relies on the [src](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsrc) directory (with the implications as noted earlier).\n\n\u003Ca name=\"zero-config\">\u003C\u002Fa>The header-only form depends on `libxsmm_source.h` which is *generated* according to the content of the source folder (`src`). \u003Cspan>LIBXSMM&#160;1.16\u003C\u002Fspan> (and later) provides header-only support without invoking a make-target (zero configuration) for any given checkout of LIBXSMM. To use configured header-only (non-default), LIBXSMM_CONFIGURED must be defined (`-D`). Previously, it was necessary to invoke `make header-only` (v1.6.2 or later), `make cheader` (prior to v1.6.2), or any target building the library (`make`). The zero-config feature allows 3rd-party build systems an easier integration of LIBXSMM, which also holds true if the system builds LIBXSMM from source (see [classic ABI](#zero-config-abi)). Fortran code may [include](#header-only-fortran) `libxsmm.f` but still requires that interface to be generated.\n\n**Note**: building an application applies the same build settings to LIBXSMM! For instance, to omit debug code inside of LIBXSMM `NDEBUG` must be defined (`-DNDEBUG`).\n\n## Rules for building LIBXSMM\n\nLIBXSMM can be used as header-only library, i.e., no source code must be (pre-)built. However, it can be desirable to build LIBXSMM as an intermediate library using a custom setup or build system. The latter can still implement custom build rules to configure LIBXSMM's interface before building the code. 
More likely, building LIBXSMM from source in a custom fashion can still omit configuring the interface and rely on \"[zero-config](#zero-config-abi)\", i.e., defining LIBXSMM_DEFAULT_CONFIG (`-DLIBXSMM_DEFAULT_CONFIG`). For example, a CMake module for LIBXSMM can look like:\n\n```cmake\ninclude(FetchContent)\nFetchContent_Declare(\n  xsmm\n  URL https:\u002F\u002Fgithub.com\u002Fchelini\u002Flibxsmm\u002Farchive\u002F\u003Cyour-preferred-revision>.tar.gz\n  URL_HASH SHA256=\u003Csha256sum-corresponding-to-above-revision>\n)\nFetchContent_GetProperties(xsmm)\nif(NOT xsmm_POPULATED)\n  FetchContent_Populate(xsmm)\nendif()\n\nset(LIBXSMMROOT ${xsmm_SOURCE_DIR})\nfile(GLOB _GLOB_XSMM_SRCS LIST_DIRECTORIES false CONFIGURE_DEPENDS ${LIBXSMMROOT}\u002Fsrc\u002F*.c)\nlist(REMOVE_ITEM _GLOB_XSMM_SRCS ${LIBXSMMROOT}\u002Fsrc\u002Flibxsmm_generator_gemm_driver.c)\nset(XSMM_INCLUDE_DIRS ${LIBXSMMROOT}\u002Finclude)\n\nadd_library(xsmm STATIC ${_GLOB_XSMM_SRCS})\ntarget_include_directories(xsmm PUBLIC ${XSMM_INCLUDE_DIRS})\ntarget_compile_definitions(xsmm PUBLIC\n  LIBXSMM_DEFAULT_CONFIG\n)\n```\n\n## Link Instructions\n\nUsing the [classic ABI](#classic-library-abi) (including [Fortran](#fortran) code) requires linking LIBXSMM against the application. The library is agnostic with respect to the threading-runtime, and therefore an application is free to use any threading runtime (e.g., OpenMP). The library is also thread-safe, and multiple application threads can call LIBXSMM's routines concurrently. Enabling OpenMP for LIBXSMM's main library is supported as well (OMP=1), and mostly affects the synchronization primitives used inside of the library. All the \"omp\" functionality (function postfix) is served by the `libxsmmext` library, which is automatically built with OpenMP enabled. When using this \"omp\" functionality, `libxsmmext` needs to be present at the link line.\n\n\u003Ca name=\"table-of-libraries\">\u003C\u002Fa>Library | Purpose\n:-------------|---------\nlibxsmm       | Thread-safe core functions (same routine can be called concurrently). Contains routines that can take a thread-ID and the number of library-external threads.\nlibxsmmf      | Necessary when using the Fortran MODule but not when including `libxsmm.f` or relying on implicit interfaces ([Fortran 77](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fsearch?q=implementation+provided+for+Fortran+77+compatibility)).\n\n\u003Ca name=\"pkg-config\">\u003C\u002Fa>To ease linking with LIBXSMM, `pkg-config` can be used. For example:\n\n```bash\nexport PKG_CONFIG_PATH=\u002Fpath\u002Fto\u002Flibxsmm\u002Flib\npkg-config libxsmm --libs\n```\n\nSimilarly, an application is free to choose any BLAS or LAPACK library (if the link model available on the OS supports this), and therefore linking GEMM routines when linking LIBXSMM itself (by supplying BLAS=1&#124;2) may prevent a user from making this decision at the time of linking the actual application. To use LIBXSMM without GEMM-related functionality, any BLAS-dependency can be removed in two ways: \u003Cspan>(1)&#160;building\u003C\u002Fspan> a special library with `make BLAS=0`, or \u003Cspan>(2)&#160;linking\u003C\u002Fspan> the application against the `libxsmmnoblas` library. 
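As an illustrative sketch (paths and the myapp sources are placeholders, and the shared library built with `make STATIC=0` is assumed), an explicit link line, the `pkg-config` variant, and the BLAS-free variant could look like:\n\n```bash\n# explicit link line against the shared library (cf. the hello example)\ngcc myapp.c -I\u002Fpath\u002Fto\u002Flibxsmm\u002Finclude -L\u002Fpath\u002Fto\u002Flibxsmm\u002Flib -lxsmm -lblas -o myapp\n\n# the same, resolving LIBXSMM via pkg-config (PKG_CONFIG_PATH set as shown above)\ngcc myapp.c -I\u002Fpath\u002Fto\u002Flibxsmm\u002Finclude $(pkg-config libxsmm --libs) -o myapp\n\n# without GEMM-related functionality: libxsmmnoblas stands in for the BLAS dependency\ngcc myapp.c -I\u002Fpath\u002Fto\u002Flibxsmm\u002Finclude -L\u002Fpath\u002Fto\u002Flibxsmm\u002Flib -lxsmm -lxsmmnoblas -o myapp\n```\n\n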
If an application however uses BLAS already, the [Call Wrapper](documentation\u002Flibxsmm_mm.md#call-wrapper) can be used to intercept existing BLAS calls (and to rely on LIBXSMM instead).\n\n**Note**: LIBXSMM does not support to dynamically link `libxsmm` or `libxsmmext` (\"so\") when BLAS is linked statically (\"a\"). If BLAS is linked statically, the static version of LIBXSMM must be used!\n\n### Installation\n\nThere are two main mechanisms to install LIBXSMM (both mechanisms can be combined): \u003Cspan>(1)&#160;building\u003C\u002Fspan> the library in an \u003Cspan>out&#8209;of&#8209;tree\u003C\u002Fspan> fashion, and \u003Cspan>(2)&#160;installing\u003C\u002Fspan> into a certain location. \u003Ca name=\"install-build\">\u003C\u002Fa>Building in an \u003Cspan>out&#8209;of&#8209;tree\u003C\u002Fspan> fashion looks like:\n\n```bash\ncd libxsmm-install\nmake -f \u002Fpath\u002Fto\u002Flibxsmm\u002FMakefile\n```\n\n\u003Ca name=\"install-prefix\">\u003C\u002Fa>Installation into a specific location looks like (`PREFIX` or `DESTDIR`):\n\n```bash\nmake MNK=\"1 2 3 4 5\" PREFIX=\u002Fpath\u002Fto\u002Flibxsmm-install install\n```\n\n\u003Ca name=\"install-destdir\">\u003C\u002Fa>Both `PREFIX` and `DESTDIR` are equivalent and can be relative or absolute paths. An installation can be repeated for different locations without triggering a rebuild. The prefix directory *inside* of each of the [package configuration files](#pkg-config) is set to where LIBXSMM is built (staging folder) unless `PREFIX` or `DESTDIR` is specified. The effect of `PREFIX` (or `DESTDIR`) with respect to the pkg-config files is independent of whether the install-target is invoked or not (make).\n\nFurther, performing `make install-minimal` omits the documentation (default: `PREFIX\u002Fshare\u002Flibxsmm`). Moreover, PINCDIR, POUTDIR, PBINDIR, and PDOCDIR allow to customize the locations underneath of the PREFIX location. To build a general package for an unpredictable audience (Linux distribution, or similar), it is advised to not over-specify or customize the build step, i.e., JIT, SSE, AVX, OMP, BLAS, etc. should not be used. The following is building and installing a complete set of libraries where the generated interface matches both the static and the shared libraries:\n\n```bash\nmake PREFIX=\u002Fpath\u002Fto\u002Flibxsmm-install STATIC=0 install\nmake PREFIX=\u002Fpath\u002Fto\u002Flibxsmm-install install\n```\n\n## Runtime Control\u003Ca name=\"running\">\u003C\u002Fa>\n\n### Handling Errors\n\nThe library handles errors with mechanisms available to the C programming language (no exceptions). The backend uses result codes passed by an argument rather than an actual return value. Such an argument is often a descriptor (struct) guiding and covering the state of the code generation. The frontend however may not hand-out any error state, which can be a big relief on the call-side. Instead, the frontend implements a [verbose mode](#verbose-mode) to inform about unexpected input or an error captured from the backend. Guiding principles of LIBXSMM are muted operation by default (non-verbose) and no unexpected exit from execution.\n\n### Verbose Mode\n\nThe [verbose mode](documentation\u002Flibxsmm_aux.md#getting-and-setting-the-verbosity) (level of verbosity) allows for an insight into the code dispatch mechanism by receiving a small, tabulated statistic as soon as the library terminates. 
The design point for this functionality is to not impact the performance of any critical code path, i.e., verbose mode is always enabled and does not require symbols (SYM=1) or debug code (DBG=1). The statistics appears (`stderr`) when the environment variable LIBXSMM_VERBOSE is set to a non-zero value. For example:\n\n```bash\nLIBXSMM_VERBOSE=1 .\u002Fmyapplication\n[... application output]\n\nHSW\u002FSP      TRY    JIT    STA    COL\n   0..13      0      0      0      0\n  14..23      0      0      0      0\n 24..128      3      3      0      0\n```\n\nThe tables are distinct between single-precision and double-precision, but either table is pruned if all counters are zero. If both tables are pruned, the library shows the code path which would have been used for JIT'ting the code: `LIBXSMM_TARGET=hsw` (otherwise the code path is shown in the table's header). The actual counters are collected for three buckets: small kernels (\u003Cspan>MNK\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&#160;13\u003C\u002Fspan>), medium-sized kernels (\u003Cspan>13&#160;&lt;&#160;MNK\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&#160;23\u003C\u002Fspan>), and larger kernels (\u003Cspan>23&#160;&lt;&#160;MNK\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&#160;64\u003C\u002Fspan>; the actual upper bound depends on LIBXSMM_MAX_MNK as selected at compile-time). Keep in mind, that \"larger\" is supposedly still small in terms of arithmetic intensity (which grows linearly with the kernel size). Unfortunately, the arithmetic intensity depends on the way a kernel is used (which operands are loaded\u002Fstored into main memory), and it is not performance-neutral to collect this information.\n\nThe TRY counter represents all attempts to register statically generated kernels, and all attempts to dynamically generate and register kernels. The TRY counter includes rejected JIT requests due to unsupported GEMM arguments. The JIT and STA counters distinct the successful cases of the afore mentioned event (TRY) into dynamically (JIT) and statically (STA) generated code. In case the capacity (\u003Cspan>O(*n*)&#160;=&#160;10\u003Csup>5\u003C\u002Fsup>\u003C\u002Fspan>) of the code registry is exhausted, no more kernels can be registered although further attempts are not prevented. Registering many kernels (\u003Cspan>O(*n*)&#160;=&#160;10\u003Csup>3\u003C\u002Fsup>\u003C\u002Fspan>) may ramp the number of hash key collisions (COL), which can degrade performance. The latter is prevented if the small thread-local cache is utilized effectively.\n\n\n```bash\nRegistry: 20 MB (gemm=0 mcopy=14 tcopy=0)\n```\n\nIf the call-wrapper is used, an additional runtime statistic becomes available (see [Call Wrapper](documentation\u002Flibxsmm_mm.md#call-wrapper)).\n\n\u003Ca name=\"objdump\">\u003C\u002Fa>**Note**: Setting LIBXSMM_VERBOSE to a negative value dumps each generated JIT kernel to a file (binary) with each file being named like the function name shown in [Intel VTune](documentation\u002Flibxsmm_prof.md#intelvtuneamplifier). Disassembly of the raw binary files can be accomplished by:\n\n```bash\nobjdump -D -b binary -m i386 -M x86-64 [JIT-dump-file]\n```\n\n### Call Trace\n\nDuring the initial steps of employing the LIBXSMM API, one may rely on a debug version of the library (`make DBG=1`). The latter also implies console output (`stderr`) in case of an error\u002Fwarning condition inside of the library. 
It is also possible to print the execution flow (call trace) inside of LIBXSMM (can be combined with DBG=1 or OPT=0):\n\n```bash\nmake TRACE=1\n```\n\nBuilding an application which traces calls (inside of the library) requires the shared library of LIBXSMM, alternatively the application is required to link the static library of LIBXSMM in a dynamic fashion (GNU tool chain: `-rdynamic`). Tracing calls (without debugger) can be then accomplished by an environment variable called LIBXSMM_TRACE.\n\n```bash\nLIBXSMM_TRACE=1 .\u002Fmyapplication\n```\n\nSyntactically up to three arguments separated by commas (which allows to omit arguments) are taken (*tid*,*i*,*n*): *tid* signifies the ID of the thread to be traced with 1...NTHREADS being valid and where LIBXSMM_TRACE=1 is filtering for the \"main thread\" (in fact the first thread running into the trace facility); grabbing all threads (no filter) can be achieved by supplying a negative id (which is also the default when omitted). The second argument is pruning higher levels of the call-tree with *i=1* being the default (level zero is the highest at the same level as the main function). The last argument is taking the number of inclusive call levels with *n=-1* being the default (signifying no filter).\n\nAlthough the `ltrace` (Linux utility) provides similar insight, the trace facility might be useful due to the afore mentioned filtering expressions. Please note that the trace facility is severely impacting the performance (even with LIBXSMM_TRACE=0), and this is not just because of console output but rather since inlining (internal) functions might be prevented along with additional call overhead on each function entry and exit. Therefore, debug symbols can be also enabled separately (`make SYM=1`; implied by TRACE=1 or DBG=1) which might be useful when profiling an application.\n\n### Verification\n\nThis section refers to testing correctness of an application using LIBXSMM utilities, i.e., using `libxsmm_matdiff` or `libxsmm_matdiff_epsilon` in particular. The former function (`libxsmm_matdiff`) compares two matrices (which can degenerate to vector shape), and yields a structure with information about the difference of both matrices (gold vs. test). The latter function (`libxsmm_matdiff_epsilon`) combines absolute and relative norms (given by afore mentioned structure) and calculates a scalar \"epsilon\" which can be used to check against a margin.\n\nUsing `libxsmm_matdiff_epsilon` in an application exposes an environment variable `LIBXSMM_MATDIFF` which can specify a file or directory path (`LIBXSMM_MATDIFF=1` simply uses some filename as default). In any case, the application appends one line to the respective file for each call of `libxsmm_matdiff_epsilon`. A data record consists of the epsilon and the command line used to launch the application. A generated file can be further evaluated, e.g., `sort -gk1 libxsmm_matdiff.log | tail -n 10` which yields the largest ten epsilon values discovered along with the application's command line.\n\nThe environment variable `LIBXSMM_MATDIFF` can carry optional space-separated arguments to amend each file entry like `export LIBXSMM_MATDIFF=\"libxsmm_matdiff.log hello world\"`. In sophisticated cases this can be used to amend a value only known at runtime, e.g., the actual margin which is used to judge the epsilon (`putenv`).\n\n## Performance\n\n\u003Ca name=\"profiling\">\u003C\u002Fa>Profiling an application, which uses LIBXSMM's JIT-code is well-supported. 
The library supports \u003Cspan>Intel&#160;VTune&#160;Amplifier\u003C\u002Fspan> and \u003Cspan>Linux&#160;perf\u003C\u002Fspan>. Details are given on how to include profiler support, and how to run the application.\n\n* [Profiling using Intel VTune Amplifier](documentation\u002Flibxsmm_prof.md#intelvtuneamplifier)\n* [Profiling using Linux perf](documentation\u002Flibxsmm_prof.md#linuxperf)\n\n\u003Ca name=\"tuning\">\u003C\u002Fa>At build time, a variety of options exist to customize LIBXSMM. The library is setup for a broad range of use cases, which include sophisticated defaults for typical use.\n\n* [Customizing performance](documentation\u002Flibxsmm_tune.md#tuning)\n* \u003Ca name=\"auto-dispatch\">\u003C\u002Fa>[Tuning auto-dispatch](documentation\u002Flibxsmm_tune.md#auto-dispatch)\n\n\u003Ca name=\"results\">\u003C\u002Fa>To find performance results of applications or performance reproducers, the repository provides an orphaned branch called \"results\" which collects collateral material such as measured performance results along with explanatory figures. The results can be found at [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fresults#libxsmm-results](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fresults#libxsmm-results), or the results can be cloned as shown below.\n\n```bash\ngit clone --branch results \\\n  https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm.git \\\n  libxsmm-results\n```\n\nPlease note that comparing performance results depends on whether the operands of the matrix multiplication are streamed or not. For example, multiplying with all matrices covered by the L1 cache may have an emphasis towards an implementation which perhaps performs worse for the real workload (if this real workload needs to stream some or all matrices from the main memory). Most of the [code samples](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples) are aimed to reproduce performance results, and it is encouraged to model the exact case or to look at real [applications](#applications).\n\n## Applications\n\n### High Performance Computing (HPC)\n\n\u003Cb>[1]&#160;\u003C\u002Fb>[https:\u002F\u002Fcp2k.org\u002F](https:\u002F\u002Fcp2k.org\u002F): Open Source Molecular Dynamics and the [DBCSR library](https:\u002F\u002Fgithub.com\u002Fcp2k\u002Fdbcsr), which processes batches of small matrix multiplications. The batches originate from a distributed block-sparse matrix with problem-specific small matrices. Starting with [CP2K&#160;3.0](https:\u002F\u002Fwww.cp2k.org\u002Fversion_history), LIBXSMM can substitute CP2K's `libsmm` library.\n\n\u003Cb>[2]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002FSeisSol\u002FSeisSol\u002F](https:\u002F\u002Fgithub.com\u002FSeisSol\u002FSeisSol\u002F): SeisSol is one of the leading codes for earthquake scenarios, for simulating dynamic rupture processes. LIBXSMM provides highly optimized assembly kernels which form the computational back-bone of SeisSol (see [https:\u002F\u002Fgithub.com\u002FTUM-I5\u002Fseissol_kernels\u002F](https:\u002F\u002Fgithub.com\u002FTUM-I5\u002Fseissol_kernels\u002F).\n\n\u003Cb>[3]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002FNekBox\u002FNekBox](https:\u002F\u002Fgithub.com\u002FNekBox\u002FNekBox): NekBox is a highly scalable and portable spectral element code, which is inspired by the [Nek5000](https:\u002F\u002Fnek5000.mcs.anl.gov\u002F) code. 
NekBox is specialized for box geometries and intended to prototype new methods as well as to leverage FORTRAN beyond the FORTRAN&#160;77 standard. LIBXSMM can be used to substitute the [MXM_STD](https:\u002F\u002Fgithub.com\u002FNek5000\u002FNekBox\u002Fblob\u002Fbox\u002Fmxm_std.F90) code. Please also note LIBXSMM's [NekBox reproducer](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fnek#nek-sample-collection).\n\n\u003Cb>[4]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002FNek5000\u002FNek5000](https:\u002F\u002Fgithub.com\u002FNek5000\u002FNek5000): Nek5000 is the open-source, highly-scalable, always-portable spectral element code from [https:\u002F\u002Fnek5000.mcs.anl.gov\u002F](https:\u002F\u002Fnek5000.mcs.anl.gov\u002F). The development branch of the Nek5000 code [incorporates](https:\u002F\u002Fgithub.com\u002FNek5000\u002FNek5000\u002Fblob\u002Fmaster\u002Fcore\u002Fmxm_wrapper.f) LIBXSMM.\n\n\u003Cb>[5]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.pyfr.org\u002F](https:\u002F\u002Fwww.pyfr.org\u002F): PyFR is an open-source Python based framework for solving advection-diffusion type problems on streaming architectures by using the flux reconstruction approach. PyFR incorporates LIBXSMM as a matrix multiplication provider for the OpenMP backend.\n\n\u003Cb>[6]&#160;\u003C\u002Fb>[http:\u002F\u002Fdial3343.org\u002Fabout\u002F](http:\u002F\u002Fdial3343.org\u002Fabout\u002F): The Extreme-scale Discontinuous Galerkin Environment (EDGE) is a solver for hyperbolic partial differential equations with emphasis on seismic simulations. The EDGE [source code](https:\u002F\u002Fgithub.com\u002F3343\u002Fedge) optionally relies on LIBXSMM, but for high performance LIBXSMM's kernels are highly recommended.\n\n\u003Cb>[7]&#160;\u003C\u002Fb>[https:\u002F\u002Fsxs-collaboration.github.io\u002Fspectre\u002F](https:\u002F\u002Fsxs-collaboration.github.io\u002Fspectre\u002F): SpECTRE is an open-source code for multi-scale, multi-physics problems in astrophysics and gravitational physics which runs at Petascale and is designed for Exascale computers. In the future, SpECTRE may be applied to problems across discipline boundaries in fluid dynamics, geoscience, plasma physics, nuclear physics, and engineering.\n\n\u003Cb>[8]&#160;\u003C\u002Fb>[https:\u002F\u002Fceed.exascaleproject.org\u002Fceed-code\u002F](https:\u002F\u002Fceed.exascaleproject.org\u002Fceed-code\u002F): The Center for Efficient Exascale Discretizations (CEED) is building on the efforts of the Nek5000, MFEM, MAGMA, OCCA and PETSc projects to develop application program interfaces (APIs), both at high-level and at low-level to enable applications to take advantage of high-order methods. 
The CEED low-level API, [libCEED](https:\u002F\u002Fceed.exascaleproject.org\u002Flibceed\u002F), uses LIBXSMM as a [backend](https:\u002F\u002Fgithub.com\u002FCEED\u002FlibCEED#backends) for high performance on CPUs.\n\n\u003Cb>[9]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Fromeric\u002FFastor](https:\u002F\u002Fgithub.com\u002Fromeric\u002FFastor): Fastor is a lightweight high performance tensor algebra framework for modern C++ and can optionally use LIBXSMM as [JIT-backend](https:\u002F\u002Fgithub.com\u002Fromeric\u002FFastor\u002Fwiki\u002F9.-Using-the-LIBXSMM-MKL-JIT-backend).\n\n### Machine and Deep Learning (AI)\n\n\u003Cb>[10]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Fplaidml\u002Fplaidml](https:\u002F\u002Fgithub.com\u002Fplaidml\u002Fplaidml): PlaidML is an open source tensor compiler aiming for performance portability across a wide range of CPUs, GPUs and other accelerators. Combined with Intel’s nGraph compiler, PlaidML is targeting popular deep learning frameworks such as PyTorch, Keras (TensorFlow), and OpenVino. [PlaidML\u002Fv1](https:\u002F\u002Fgithub.com\u002Fplaidml\u002Fplaidml\u002Ftree\u002Fplaidml-v1) (development branch) adopted [MLIR](https:\u002F\u002Fmlir.llvm.org\u002F), an extensible compiler infrastructure gaining industry-wide adoption. PlaidML\u002Fv1 started using LIBXSMM as backend for targeting CPUs.\n\n\u003Cb>[11]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Fintel\u002Fintel-extension-for-pytorch](https:\u002F\u002Fgithub.com\u002Fintel\u002Fintel-extension-for-pytorch): Intel Extension for PyTorch aims for a smooth user experience of PyTorch on CPUs by means of good performance. The extension pack started to rely on [LIBXSMM for achieving high performance on CPUs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.04680).\n\n\u003Cb>[12]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Ftpp-pytorch-extension](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Ftpp-pytorch-extension): Intel(R) Tensor Processing Primitive Extension for pytorch is an open source software library that integrates Tensor Processing Primitives ([TPP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.05755)) into pytorch. It is aiming for a smooth user experience of PyTorch on CPUs by means of good performance. Intel's MLPerf Training submission codes leverage this [project](https:\u002F\u002Fgithub.com\u002Fmlcommons\u002Ftraining_results.1\u002Ftree\u002Fmain\u002FIntel\u002Fbenchmarks\u002Fbert\u002Fimplementations\u002Fpytorch-cpu).\n\n\u003Cb>[13]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm-dnn](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm-dnn): LIBXSMM-DNN is an open source software library that demonstrates how Tensor Processing Primitives ([TPP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.05755)) can be used to implement various deep learning primitives such as convolutions, linear layers or even pooling and norming. Due to the use of TPP, not a single line of platform-specific code is needed. \n\n### Automated Driving (AD)\n\n\u003Cb>[15]&#160;\u003C\u002Fb>[https:\u002F\u002Fsoftware.seek.intel.com\u002Faccelerating-eigen-math-library](https:\u002F\u002Fsoftware.seek.intel.com\u002Faccelerating-eigen-math-library): Accelerating The Eigen Math Library for Automated Driving Workloads: The Need for Speed in Kalman Filtering. 
An article in [Issue&#160;31](https:\u002F\u002Fsoftware.intel.com\u002Fcontent\u002Fwww\u002Fus\u002Fen\u002Fdevelop\u002Fdownload\u002Fparallel-universe-magazine-issue-31-january-2018.html) of The Parallel Universe magazine ([pdf](https:\u002F\u002Fsoftware.intel.com\u002Fcontent\u002Fdam\u002Fdevelop\u002Fpublic\u002Fus\u002Fen\u002Fdocuments\u002Fparallel-universe-issue-31.pdf)).\n\n## References\n\n\u003Cb>[1]&#160;\u003C\u002Fb>[https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Ftech_poster_pages\u002Frpost244.html](https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Ftech_poster_pages\u002Frpost244.html): High-Performance Deep Learning via a Single Building Block ([poster](https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Fposter_files\u002Frpost244s2-file2.pdf) and [abstract](https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Fposter_files\u002Frpost244s2-file3.pdf)), SC’19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Denver (Colorado).\n\n\u003Cb>[2]&#160;\u003C\u002Fb>[https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1109\u002FSC.2018.00069](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1109\u002FSC.2018.00069): Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures ([paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1808.05567.pdf)). SC'18: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas (Texas).\n\n\u003Cb>[3]&#160;\u003C\u002Fb>[https:\u002F\u002Fpasc17.pasc-conference.org\u002Ffileadmin\u002Fuser_upload\u002Fpasc17\u002Fprogram\u002Fpost116s2.pdf](https:\u002F\u002Fpasc17.pasc-conference.org\u002Ffileadmin\u002Fuser_upload\u002Fpasc17\u002Fprogram\u002Fpost116s2.pdf): DBCSR: A Sparse Matrix Multiplication Library for Electronic Structure Codes (poster), PASC’17: The PASC17 Conference, Lugano (Switzerland).\n\n\u003Cb>[4]&#160;\u003C\u002Fb>[https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Ftech_poster_pages\u002Fpost190.html](https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Ftech_poster_pages\u002Fpost190.html): Understanding the Performance of Small Convolution Operations for CNN on Intel Architecture ([poster](https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Fposter_files\u002Fpost190s2-file2.pdf) and [abstract](https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Fposter_files\u002Fpost190s2-file3.pdf)), SC’17: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Denver (Colorado).\n\n\u003Cb>[5]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fproceedings-article\u002Fsc\u002F2016\u002F8815a981\u002F12OmNCeaQ1D](https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fproceedings-article\u002Fsc\u002F2016\u002F8815a981\u002F12OmNCeaQ1D): LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation. 
SC'16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City (Utah).\n\n\u003Cb>[6]&#160;\u003C\u002Fb>[http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u002Ftech_poster\u002Ftech_poster_pages\u002Fpost137.html](http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u002Ftech_poster\u002Ftech_poster_pages\u002Fpost137.html): LIBXSMM: A High Performance Library for Small Matrix Multiplications ([poster](http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u002Ftech_poster\u002Fposter_files\u002Fpost137s2-file2.pdf) and [abstract](http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u002Ftech_poster\u002Fposter_files\u002Fpost137s2-file3.pdf)). SC'15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Austin (Texas).\n\n\u003Cb>[7]&#160;\u003C\u002Fb>[Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.05755) SC'21: The International Conference for High Performance Computing, Networking, Storage and Analysis, St Louis.\n\n## Articles\n\n\u003Cb>[1]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F10\u002F09\u002Fcloudy-supercomputers-join-the-hpc-petascale-club\u002F](https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F10\u002F09\u002Fcloudy-supercomputers-join-the-hpc-petascale-club\u002F): Cloudy Supercomputers Join the HPC Petascale Club. An article written by Rob Farber, 2019. The article covers LIBXSMM in a separate section.\n\n\u003Cb>[2]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F](https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F): Counting The Cost Of Scaling HPC Applications. An article written by Timothy Prickett Morgan, 2019. This article is about CP2K Open Source Molecular Dynamics and not about LIBXSMM. However, LIBXSMM was key for application performance.\n\n\u003Cb>[3]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F](https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F): Azure Benchmarks HC-series Across Twenty-thousand Cores for HPC. An article written by John Russell, 2019. This article is about CP2K Open Source Molecular Dynamics and not about LIBXSMM. However, LIBXSMM was key for application performance.\n\n\u003Cb>[4]&#160;\u003C\u002Fb>[https:\u002F\u002Fsoftware.intel.com\u002Fsites\u002Fdefault\u002Ffiles\u002Fparallel-universe-issue-34.pdf](https:\u002F\u002Fsoftware.intel.com\u002Fcontent\u002Fwww\u002Fus\u002Fen\u002Fdevelop\u002Fdownload\u002Fparallel-universe-magazine-issue-34-october-2018.html): LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel ([pdf](https:\u002F\u002Fsoftware.intel.com\u002Fcontent\u002Fdam\u002Fdevelop\u002Fpublic\u002Fus\u002Fen\u002Fdocuments\u002Fparallel-universe-issue-34.pdf)). 
An article written by Hans Pabst, Greg Henry, and Alexander Heinecke, 2018.\n\n\u003Cb>[5]&#160;\u003C\u002Fb>[https:\u002F\u002Fmedium.com\u002F@rmfarber\u002Flibxsmm-brings-deep-learning-lessons-learned-to-many-hpc-applications-9143c6c93125](https:\u002F\u002Fmedium.com\u002F@rmfarber\u002Flibxsmm-brings-deep-learning-lessons-learned-to-many-hpc-applications-9143c6c93125): LIBXSMM Brings Deep-learning \"Lessons Learned\" to Many HPC Applications. An article written by Rob Farber, 2018.\n\n\u003Cb>[6]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.rdworldonline.com\u002Flargest-supercomputer-simulation-of-sumatra-andaman-earthquake\u002F](https:\u002F\u002Fwww.rdworldonline.com\u002Flargest-supercomputer-simulation-of-sumatra-andaman-earthquake\u002F): Largest Supercomputer Simulation of Sumatra-Andaman Earthquake. An article written by Linda Barney, 2018.\n\n","# LIBXSMM\n\n[![BSD 3-Clause 许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD3-blue.svg \"BSD 3-Clause 许可证\")](LICENSE.md) [![GCC 构建状态](https:\u002F\u002Fbadge.buildkite.com\u002F2e962d4cfc7ddb10a6cd6c27b0d8033edf179a799e156cb363.svg?branch=main \"GCC 构建状态\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![Clang 构建状态](https:\u002F\u002Fbadge.buildkite.com\u002Fdafe7b363a2e66f7d5c9087f074f3eceb69b9aae4278202fd7.svg?branch=main \"Clang 构建状态\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![Intel 构建状态](https:\u002F\u002Fbadge.buildkite.com\u002F63b5dc4095f460f1c011ae782f8e67ec0b8a6a9732d8abe3c7.svg?branch=main \"Intel 构建状态\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![混合构建状态](https:\u002F\u002Fbadge.buildkite.com\u002Ffad67b2fcad79e07ddfe9141974f360e9eca6223cd89e3593f.svg?branch=main \"混合构建状态\")](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FStatus) [![静态分析状态](https:\u002F\u002Fscan.coverity.com\u002Fprojects\u002F7405\u002Fbadge.svg \"静态分析状态\")](https:\u002F\u002Fscan.coverity.com\u002Fprojects\u002Fhfp-libxsmm) [![Read the Docs](https:\u002F\u002Freadthedocs.org\u002Fprojects\u002Flibxsmm\u002Fbadge\u002F?version=latest \"Read the Docs\")](https:\u002F\u002Flibxsmm.readthedocs.io\u002F)\n\nLIBXSMM 是一个用于专用密集和稀疏矩阵运算以及深度学习原语（如小型卷积）的库。该库主要面向英特尔架构，支持 \u003Cspan>Intel&#160;SSE\u003C\u002Fspan>、\u003Cspan>Intel&#160;AVX\u003C\u002Fspan>、\u003Cspan>Intel&#160;AVX2\u003C\u002Fspan>、\u003Cspan>Intel&#160;AVX&#8209;512\u003C\u002Fspan>（含 VNNI 和 Bfloat16）以及未来代号为 Sapphire Rapids 的英特尔处理器所支持的 \u003Cspan>Intel&#160;AMX\u003C\u002Fspan>（高级矩阵扩展）。代码生成主要基于 \u003Cspan>即时编译 (JIT)\u003C\u002Fspan> 技术，以实现与编译器无关的高性能（包括矩阵乘法、矩阵转置\u002F复制、稀疏功能和深度学习）。LIBXSMM 适合“一次构建，随处部署”，即无需使用特定的目标标志即可充分利用现有性能。支持的 GEMM 数据类型有：`FP64`、`FP32`、`bfloat16`、`int16` 和 `int8`。\n\n有关常见问题解答，请参阅 [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FQ&A](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FQ&A)。\n\n**文档在哪里可以找到？**\n\n* **ReadtheDocs**：[主文档](https:\u002F\u002Flibxsmm.readthedocs.io\u002F) 和 [示例文档](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_samples\u002F) 提供全文搜索功能。\n* **PDF**：[主文档](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fraw\u002Fmain\u002Fdocumentation\u002Flibxsmm.pdf) 文件，以及单独的 [示例文档](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fraw\u002Fmain\u002Fdocumentation\u002Flibxsmm_samples.pdf)。\n* **文章**：[杂志文章](https:\u002F\u002Fsoftware.intel.com\u002Fsites\u002Fdefault\u002Ffiles\u002Fparallel-universe-issue-34.pdf) 包括 
[示例代码](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fmagazine)（完整文章列表见 [Articles](#articles)）。\n\n\u003Ca name=\"getting-started\">\u003C\u002Fa>\u003Ca name=\"hello-libxsmm\">\u003C\u002Fa>**入门**：以下 C++ 代码专注于特定功能，但也可被视为 [Hello LIBXSMM](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fhello)。使用 `cd \u002Fpath\u002Fto\u002Flibxsmm; make STATIC=0`（共享库）编译示例，将代码保存为 `hello.cpp`（如下所示），并使用 `g++ -I\u002Fpath\u002Fto\u002Flibxsmm\u002Finclude hello.cpp -L\u002Fpath\u002Fto\u002Flibxsmm\u002Flib -lxsmm -lblas -o hello`（GNU CCC）进行编译，最后通过 `LD_LIBRARY_PATH=\u002Fpath\u002Fto\u002Flibxsmm\u002Flib LIBXSMM_VERBOSE=2 .\u002Fhello` 运行。\n\n```cpp\n#include \u003Clibxsmm.h>\n#include \u003Cvector>\nint main(int argc, char* argv[]) {\n  typedef double T;\n  int batchsize = 1000, m = 13, n = 5, k = 7;\n  std::vector\u003CT> a(batchsize * m * k), b(batchsize * k * n), c(m * n, 0);\n  \u002F* C\u002FC++ 和 Fortran 接口均可使用 *\u002F\n  typedef libxsmm_mmfunction\u003CT> kernel_type;\n  \u002F* 生成并调度一个矩阵乘法内核（C++ 函数对象） *\u002F\n  kernel_type kernel(LIBXSMM_GEMM_FLAG_NONE, m, n, k, 1.0 \u002F*alpha*\u002F, 1.0 \u002F*beta*\u002F);\n  assert(kernel);\n  for (int i = 0; i \u003C batchsize; ++i) { \u002F* 初始化输入 *\u002F\n    for (int ki = 0; ki \u003C k; ++ki) {\n      for (int j = 0; j \u003C m; ++j) a[i * j * ki] = static_cast\u003CT>(1) \u002F ((i + j + ki) % 25);\n      for (int j = 0; j \u003C n; ++j) b[i * j * ki] = static_cast\u003CT>(7) \u002F ((i + j + ki) % 75);\n    }\n  }\n  \u002F* 内核执行矩阵乘法并累加结果：C += Ai * Bi *\u002F\n  for (int i = 0; i \u003C batchsize; ++i) kernel(&a[i * m * k], &b[i * k * n], &c[0]);\n}\n```\n\n纯 [C 代码](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsamples\u002Fhello\u002Fhello.c) 以及 [Fortran 代码](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsamples\u002Fhello\u002Fhello.f) 与上述 [示例](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fhello) 类似。\n\n\u003Ca name=\"what-is-a-small-matrix-multiplication\">\u003C\u002Fa>**什么是小型矩阵乘法？** 当使用 M、N 和 K 参数来表征问题规模时，适合 LIBXSMM 的问题规模大致位于 \u003Ci>(M&#160;N&#160;K)\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&#160;64\u003C\u002Fi> 范围内（这表明非方阵甚至“高而瘦”的形状也包含在内）。该库不采用多级 K、M、N 分块策略。如果将 LIBXSMM 用于更大的规模，可能会生成过多的代码（由于在 M 或 K 维度上的展开），同时也无法实现有效的分块方案来充分利用缓存层次结构。就 GEMM 而言，支持的内核仅限于 *Alpha := 1*、*Beta := \\{ 1, 0 \\}* 以及 *TransA := 'N'*。\n\n## 接口与领域\u003Ca name=\"interfaces\">\u003C\u002Fa>\n\n### 概述\u003Ca name=\"general-interface\">\u003C\u002Fa>\n\n请参阅 [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Finclude](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Finclude)，了解所有已发布的函数。以下列出可用的领域及已记录的功能：\n\n* MM：[矩阵乘法](#matrix-multiplication)\n* TPP：[张量处理原语](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fdocumentation\u002Flibxsmm_tpp.md)\n* DNN：[深度神经网络](#deep-neural-networks)\n* AUX：[服务函数](#service-functions)\n* PERF：[性能](#performance)\n* BE：[后端](#jit-backend)\n\n为了初始化库内部资源，显式初始化例程有助于避免首次调用 LIBXSMM 时的延迟初始化开销。库会在程序退出时释放内部资源，但也提供了与上述初始化配套的终止例程。\n\n```C\n\u002F** 初始化库；在特定时刻承担设置成本。 *\u002F\nvoid libxsmm_init(void);\n\u002F** 反初始化库并释放内部内存（可选）。 *\u002F\nvoid libxsmm_finalize(void);\n```\n\n### 矩阵乘法\u003Ca name=\"interface-for-matrix-multiplication\">\u003C\u002Fa>\n\n该领域（MM）支持小型矩阵乘法（SMM）、多组矩阵乘法批处理，以及业界标准的通用矩阵-矩阵乘法（GEMM）接口。\n\n[矩阵乘法领域（MM）](documentation\u002Flibxsmm_mm.md) 包含以下例程：\n\n* 
[小型、分块及并行化的矩阵乘法](documentation\u002Flibxsmm_mm.md#overview)\n* [手动代码调度（自定义矩阵批处理）](documentation\u002Flibxsmm_mm.md#manual-code-dispatch)\n\n### 深度学习\u003Ca name=\"interface-for-dl\">\u003C\u002Fa>\n\n在此我们展示如何利用 LIBXSMM 提供的张量处理原语来实现深度学习应用中的常见算子，例如带有激活函数融合的 GEMM、带有激活函数融合的卷积、各类归一化算子以及池化算子等。用于性能评估的示例驱动程序作为 [LIBXSMM_DNN](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm-dnn\u002Ftree\u002Fmain\u002Ftests) 的一部分提供。\n\n### 服务函数\n\n为了方便库的使用和集成，提供了一些服务例程。这些例程可能不属于 LIBXSMM 的核心功能（SMM 或 DNN 领域），但建议用户使用此领域（AUX）。它们分为两类：(1) 同时适用于 C 和 FORTRAN 的例程，以及 (2) 仅通过 C 接口可用的例程。\n\n[服务函数领域（AUX）](documentation\u002Flibxsmm_aux.md) 包含以下例程：\n\n* [获取和设置目标架构](documentation\u002Flibxsmm_aux.md#getting-and-setting-the-target-architecture)\n* [获取和设置详细程度](documentation\u002Flibxsmm_aux.md#getting-and-setting-the-verbosity)\n* [测量时间间隔（计时器）](documentation\u002Flibxsmm_aux.md#timer-facility)\n* [调度用户数据和多个内核](documentation\u002Flibxsmm_aux.md#user-data-dispatch)\n* [分配内存](documentation\u002Flibxsmm_aux.md#memory-allocation)\n\n### 后端\u003Ca name=\"jit-backend\">\u003C\u002Fa>\n\n有关 JIT 后端和代码生成器的更多信息，请参阅单独的 [文档](documentation\u002Flibxsmm_be.md)。[编码器示例集](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fencoder) 可帮助您开始使用 LIBXSMM 编写内核。请注意，LIBXSMM 的独立 \u003Ca name=\"generator-driver\">\u003C\u002Fa>[生成器驱动程序](documentation\u002Flibxsmm_be.md#generator-driver) 已被视为遗留（已弃用）。\n\n## 构建说明\n\n### 概述\n\n主接口文件是 *生成的*，因此 **不** 存储在代码仓库中。要查看 [C\u002FC++](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsrc\u002Ftemplate\u002Flibxsmm.h) 和 [FORTRAN](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Fsrc\u002Ftemplate\u002Flibxsmm.f) 的接口，可以参考用于生成实际接口的模板文件。构建和使用 LIBXSMM 的一般方法有两种：\n\n* [经典库（ABI）](#classic-library-abi) 和 [链接说明](#link-instructions)（C\u002FC++ 和 FORTRAN）\n* [仅头文件](#header-only)（C 和 C++）\n\n**注**：LIBXSMM 以预编译包的形式提供给 Fedora\u002FRedHat\u002FCentOS、Debian\u002FUbuntu、FreeBSD 等系统。此外，LIBXSMM 还可以通过 [Spack 包管理器](https:\u002F\u002Fcomputing.llnl.gov\u002Fprojects\u002Fspack-hpc-package-manager) 或 [EasyBuild+EasyConfig](https:\u002F\u002Fgithub.com\u002Feasybuilders) 进行安装。\n\n### 经典库（ABI）\n\n对于给定的项目，有两种方式可以依赖预构建的代码：\u003Cspan>(1)&#160;使用\u003C\u002Fspan> LIBXSMM 的基于 Makefile 的构建系统，\u003Cspan>(2)&#160;或者\u003C\u002Fspan> 使用其他构建系统并编写自己的 [LIBXSMM 构建规则](#rules-for-building-libxsmm)。基于 Makefile 的构建系统依赖于 \u003Cspan>GNU&#160;Make\u003C\u002Fspan>（通常与 `make` 命令相关联，但例如 FreeBSD 将其称为 `gmake`）。可以通过使用 \u003Cspan>键&#8209;值\u003C\u002Fspan> 对来定制构建。\u003Cspan>键&#8209;值\u003C\u002Fspan> 对可以通过两种方式提供：\u003Cspan>(1)&#160;在\u003C\u002Fspan> “make” 命令之后，或\u003Cspan>(2)&#160;在\u003C\u002Fspan> “make” 命令之前（`env`），这实际上等同于将 \u003Cspan>键&#8209;值\u003C\u002Fspan> 对作为环境变量导出（`export` 或 `setenv`）。这两种方法可以混合使用（第二种方法可能需要 make 的 `-e` 标志）。\n\n\u003Ca name=\"zero-config-abi\">\u003C\u002Fa>与默认情况下无需配置的 [仅头文件](#zero-config) 不同，第三方构建系统可以编译和链接 LIBXSMM 的源代码，但仍可避免对库进行配置（通过 `libxsmm_config.py`）。省略配置的前提是通过定义 LIBXSMM_DEFAULT_CONFIG (`-D`) 来选择启用该功能。零配置功能不适用于 LIBXSMM 的 Fortran 接口。\n\n**注意**：默认情况下需要 C\u002FC++ 和 FORTRAN 编译器（部分示例代码是用 C++ 编写的）。除了指定编译器（`make CXX=g++ CC=gcc FC=gfortran`，也许还需要 `AR=ar`）之外，还可以放宽对 FORTRAN 编译器的要求（`make FC=` 或 `make FORTRAN=0`）。后者会影响 MODule 文件以及相应的 `libxsmm.f` 库的可用性（接口 `libxsmm.f` 仍然会被生成）。\n\n构建系统会将一组给定的键值对视为一个唯一的构建，并为不同的标志组合触发重新构建。如需更高级的构建或更多背景信息，请参阅关于 [自定义](documentation\u002Flibxsmm_tune.md) 的章节。要在 `include` 目录内生成库的接口并构建静态库（默认情况下，STATIC=1 已启用），请运行以下任一命令（或两者）：\n\n```bash\nmake STATIC=0\nmake\n```\n\n在 CRAY 系统上，无论使用 CRAY 编译器、Intel 编译器还是 
\u003Cspan>GNU&#160;编译器集合（GCC）\u003C\u002Fspan>,都应使用 CRAY 编译环境（CCE）。CCE 最终会禁止构建共享库（STATIC=0）。无论如何，\u003Cspan>(1)&#160;切换\u003C\u002Fspan>到所需的编译器（加载\u002F切换模块），然后\u003Cspan>(2)&#160;依赖\u003C\u002Fspan>以下命令：\n\n```bash\nmake CXX=CC CC=cc FC=ftn\n```\n\n多种构建环境开箱即用即可兼容，详情请参阅 [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FCompatibility](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FCompatibility)。如果构建过程失败，尝试避免使用高级 GCC 标志可能会有所帮助。这对于那些假装与 GCC 兼容（并被当作 GCC 兼容处理）但实际上无法识别上述标志的工具链尤其有用：\n\n```bash\nmake COMPATIBLE=1\n```\n\n\u003Ca name=\"outdated-binutils\">\u003C\u002Fa>如果 Binutils 版本过旧，在构建库时可能会导致汇编失败（这与 JIT 生成的代码无关，也不会影响 JIT 代码对系统的优化）。LIBXSMM 实现了一些功能，使用了编译器内置函数和多个根据 CPUID 调度的代码路径。与 `INTRINSICS=2`（默认值）不同，`INTRINSICS=1` 会启用完全静态的代码路径，以适应目标架构。如果没有指定目标（例如 `AVX=3` 或 `AVX=2`），则无法针对这些代码路径利用指令集扩展。尝试通过安装最新版本的 GNU Binutils 来修复编译失败（并执行 `export PATH=\u002Fpath\u002Fto\u002Fbinutils\u002Fbin:${PATH}`）。Binutils 的版本独立于 \u003Cspan>GNU&#160;GCC\u003C\u002Fspan> 和其他编译器。如果无法更新 Binutils，则可以使用 [libxsmm_cpuid.h](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fblob\u002Fmain\u002Finclude\u002Flibxsmm_cpuid.h) 中列出的 CPUID 值作为替代方案：从较高的值开始（小于 1999），逐步递减，直到编译成功为止（例如，`make INTRINSICS=1021`）。作为最后的手段，可以采用完全静态的代码路径：\n\n```bash\nmake INTRINSICS=1\n```\n\n要测试和验证构建结果，请参阅 [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FValidation](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fwiki\u002FValidation)。为了运行一些基本的健全性检查，需要注意的是，每组给定的键值对代表不同的构建（及测试）：\n\n```bash\nmake STATIC=0 tests\n```\n\n要删除中间文件，或删除所有生成的文件和文件夹（包括接口和库归档文件），请运行以下其中一个 make 目标。额外的 distclean 目标会递归地清理整个目录树（自 \u003Cspan>版本&#160;1.9\u003C\u002Fspan> 之后）。\n\n```bash\nmake clean\nmake realclean\n```\n\n\u003Ca name=\"fortran\">\u003C\u002Fa>FORTAN 代码可以使用 LIBXSMM：\n\n* 通过使用模块并与 `libxsmmf`、`libxsmm` 和 `libxsmmext` 链接，\n* \u003Ca name=\"header-only-fortran\">\u003C\u002Fa>通过包含 `libxsmm.f` 并与 `libxsmm` 和 `libxsmmext` 链接，或\n* 通过（隐式）调用子程序并与 `libxsmm` 和 `libxsmmext` 链接。\n\n**注意**：`libxsmmf` 需要 `libxsmmext`（自 LIBXSMM 2.0 开始），因此也需要链接 OpenMP 运行时库。\n\n使用 Fortran 模块（或包含接口）至少需要 \u003Cspan>Fortran&#160;2003\u003C\u002Fspan> 编译器（F2K3）。\u003Cspan>FORTAN&#160;77\u003C\u002Fspan> 兼容性仅以隐式方式提供（无接口），且可用的例程子集记录在 `libxsmm.f` 中，并用 [注释](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fsearch?q=implementation+provided+for+Fortran+77+compatibility) 标记（属于实现的一部分）。\n\n### 仅头文件\n\n\u003Cspan>版本&#160;1.4.4\u003C\u002Fspan> 引入了对 C 和 C++ 中“仅头文件”用法的支持。只需包含 `libxsmm_source.h` 即可避免构建整个库。然而，这种方式放弃了明确定义的应用二进制接口（ABI）。使用共享库形式时，ABI 允许在应用部署后进行热修复，并确保仅依赖 LIBXSMM 的公共接口。相比之下，仅头文件形式不仅暴露了 LIBXSMM 的内部实现，还可能因编译时间延长而增加应用开发的周转时间。该头文件被有意命名为 “libxsmm_**source**.h”，因为它依赖于 [src](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsrc) 目录（如前所述）。\n\n\u003Ca name=\"zero-config\">\u003C\u002Fa>仅头文件形式依赖于根据源代码目录（`src`）内容*生成*的 `libxsmm_source.h`。\u003Cspan>LIBXSMM&#160;1.16\u003C\u002Fspan>（及更高版本）提供了无需调用 make 目标即可实现的仅头文件支持（零配置），适用于任何 LIBXSMM 检出版本。若要使用非默认的已配置仅头文件模式，必须定义 `LIBXSMM_CONFIGURED`（`-D`）。此前，需要调用 `make header-only`（v1.6.2 或更高版本）、`make cheader`（v1.6.2 之前）或任何用于构建库的目标（`make`）。零配置特性使第三方构建系统能够更轻松地集成 LIBXSMM，即使系统从源码构建 LIBXSMM 也是如此（参见 [经典 ABI](#zero-config-abi)）。Fortran 代码可以[包含](#header-only-fortran) `libxsmm.f`，但仍需生成相应接口。\n\n**注意**：构建应用程序时，会将相同的构建设置应用于 LIBXSMM！例如，若要省略 LIBXSMM 内部的调试代码，必须定义 `NDEBUG`（`-DNDEBUG`）。\n\n## 构建 LIBXSMM 的规则\n\nLIBXSMM 可以作为仅头文件库使用，即无需预先构建任何源代码。然而，有时也可能希望使用自定义设置或构建系统将 LIBXSMM 构建成中间库。在这种情况下，仍可实施自定义构建规则，在编译代码之前配置 LIBXSMM 的接口。更常见的是，以自定义方式从源码构建 LIBXSMM 
时，可能会选择不配置接口，而直接采用“(零配置)[#zero-config-abi]”，即定义 `LIBXSMM_DEFAULT_CONFIG`（`-DLIBXSMM_DEFAULT_CONFIG`）。例如，LIBXSMM 的 CMake 模块可能如下所示：\n\n```cmake\ninclude(FetchContent)\nFetchContent_Declare(\n  xsmm\n  URL https:\u002F\u002Fgithub.com\u002Fchelini\u002Flibxsmm\u002Farchive\u002F\u003Cyour-preferred-revision>.tar.gz\n  URL_HASH SHA256=\u003Csha256sum-corresponding-to-above-revision>\n)\nFetchContent_GetProperties(xsmm)\nif(NOT xsmm_POPULATED)\n  FetchContent_Populate(xsmm)\nendif()\n\nset(LIBXSMMROOT ${xsmm_SOURCE_DIR})\nfile(GLOB _GLOB_XSMM_SRCS LIST_DIRECTORIES false CONFIGURE_DEPENDS ${LIBXSMMROOT}\u002Fsrc\u002F*.c)\nlist(REMOVE_ITEM _GLOB_XSMM_SRCS ${LIBXSMMROOT}\u002Fsrc\u002Flibxsmm_generator_gemm_driver.c)\nset(XSMM_INCLUDE_DIRS ${LIBXSMMROOT}\u002Finclude)\n\nadd_library(xsmm STATIC ${_GLOB_XSMM_SRCS})\ntarget_include_directories(xsmm PUBLIC ${XSMM_INCLUDE_DIRS})\ntarget_compile_definitions(xsmm PUBLIC\n  LIBXSMM_DEFAULT_CONFIG\n)\n```\n\n## 链接说明\n\n使用[经典 ABI](#classic-library-abi)（包括[Fortran](#fortran)代码）时，需要将 LIBXSMM 链接到应用程序中。该库与线程运行时无关，因此应用程序可以自由使用任何线程运行时（例如 OpenMP）。该库也是线程安全的，多个应用线程可以同时调用 LIBXSMM 的例程。为 LIBXSMM 主库启用 OpenMP 也受支持（OMP=1），这主要影响库内使用的同步原语。所有“omp”功能（函数后缀）均由 `libxsmmext` 库提供，该库会自动以启用 OpenMP 的方式构建。使用这些“omp”功能时，链接行中必须包含 `libxsmmext`。\n\n\u003Ca name=\"table-of-libraries\">\u003C\u002Fa>库 | 用途\n:-------------|---------\nlibxsmm       | 线程安全的核心函数（同一例程可被并发调用）。包含可接收线程 ID 和库外线程数量的例程。\nlibxsmmf      | 使用 Fortran MODule 时必需，但包含 `libxsmm.f` 或依赖隐式接口（[Fortran 77](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fsearch?q=implementation+provided+for+Fortran+77+compatibility))时则不需要。\n\n\u003Ca name=\"pkg-config\">\u003C\u002Fa>为简化与 LIBXSMM 的链接，可以使用 `pkg-config`。例如：\n\n```bash\nexport PKG_CONFIG_PATH=\u002Fpath\u002Fto\u002Flibxsmm\u002Flib\npkg-config libxsmm --libs\n```\n\n同样，应用程序可以自由选择任何 BLAS 或 LAPACK 库（前提是操作系统上的链接模型支持），因此在链接 LIBXSMM 本身时链接 GEMM 例程（通过指定 BLAS=1&#124;2）可能会妨碍用户在最终链接应用程序时做出这一决定。若要使用不包含 GEMM 相关功能的 LIBXSMM，可以通过两种方式移除对 BLAS 的依赖：\u003Cspan>(1)&#160;构建\u003C\u002Fspan>一个特殊库，使用 `make BLAS=0`；或 \u003Cspan>(2)&#160;链接\u003C\u002Fspan>应用程序至 `libxsmmnoblas` 库。然而，如果应用程序已经使用 BLAS，则可以使用[调用包装器](documentation\u002Flibxsmm_mm.md#call-wrapper)来拦截现有的 BLAS 调用，并改用 LIBXSMM。\n\n**注意**：当 BLAS 以静态方式链接（“.a”）时，LIBXSMM 不支持动态链接 `libxsmm` 或 `libxsmmext`（“.so”）。如果 BLAS 是静态链接的，则必须使用 LIBXSMM 的静态版本！\n\n### 安装\n\n安装 LIBXSMM 有两种主要方式（这两种方式可以结合使用）：(1) 以\u003Cspan>树外\u003C\u002Fspan>的方式构建库，以及 (2) 将其\u003Cspan>安装\u003C\u002Fspan>到指定位置。 \u003Ca name=\"install-build\">\u003C\u002Fa>树外构建的方式如下：\n\n```bash\ncd libxsmm-install\nmake -f \u002Fpath\u002Fto\u002Flibxsmm\u002FMakefile\n```\n\n\u003Ca name=\"install-prefix\">\u003C\u002Fa>安装到特定位置的方式如下（使用 `PREFIX` 或 `DESTDIR`）：\n\n```bash\nmake MNK=\"1 2 3 4 5\" PREFIX=\u002Fpath\u002Fto\u002Flibxsmm-install install\n```\n\n\u003Ca name=\"install-destdir\">\u003C\u002Fa>`PREFIX` 和 `DESTDIR` 是等效的，可以是相对路径或绝对路径。可以在不同的位置重复进行安装操作，而无需重新构建。除非指定了 `PREFIX` 或 `DESTDIR`，否则每个 [包配置文件](#pkg-config) 中的前缀目录都会被设置为 LIBXSMM 的构建目录（暂存文件夹）。`PREFIX`（或 `DESTDIR`）对 pkg-config 文件的影响与是否调用了安装目标无关（即通过 `make` 命令执行时）。\n\n此外，执行 `make install-minimal` 会跳过文档的安装（默认情况下，文档会被安装到 `PREFIX\u002Fshare\u002Flibxsmm`）。另外，`PINCDIR`、`POUTDIR`、`PBINDIR` 和 `PDOCDIR` 允许自定义在 `PREFIX` 目录下的子目录位置。为了构建一个面向不确定用户的通用软件包（例如 Linux 发行版或其他类似环境），建议不要过度指定或自定义构建步骤，即不应启用 JIT、SSE、AVX、OMP、BLAS 等选项。以下示例展示了如何构建并安装一套完整的库，使生成的接口同时匹配静态库和共享库：\n\n```bash\nmake PREFIX=\u002Fpath\u002Fto\u002Flibxsmm-install STATIC=0 install\nmake PREFIX=\u002Fpath\u002Fto\u002Flibxsmm-install install\n```\n\n## 运行时控制\u003Ca 
name=\"running\">\u003C\u002Fa>\n\n### 错误处理\n\n该库使用 C 语言提供的机制来处理错误（不使用异常）。后端通过参数传递结果代码，而不是直接返回值。这个参数通常是一个描述符结构体，用于指示和记录代码生成的状态。然而，前端可能不会报告任何错误状态，这在调用端可能会带来很大的便利。相反，前端实现了[详细模式](#verbose-mode)，用于提示意外输入或从后端捕获的错误。LIBXSMM 的设计原则是默认静默运行（非详细模式），并且不会导致程序意外退出。\n\n### 详细模式\n\n[详细模式](documentation\u002Flibxsmm_aux.md#getting-and-setting-the-verbosity)（详细程度）允许用户在库结束运行时，通过接收一份简短的表格式统计信息，深入了解代码调度机制。此功能的设计目标是不影响任何关键代码路径的性能，因此详细模式始终启用，且不需要符号表（SYM=1）或调试代码（DBG=1）。当环境变量 `LIBXSMM_VERBOSE` 被设置为非零值时，统计信息会输出到标准错误流 (`stderr`)。例如：\n\n```bash\nLIBXSMM_VERBOSE=1 .\u002Fmyapplication\n[... 应用程序输出]\n\nHSW\u002FSP      TRY    JIT    STA    COL\n   0..13      0      0      0      0\n  14..23      0      0      0      0\n 24..128      3      3      0      0\n```\n\n这些表格会根据单精度和双精度分别显示，但如果所有计数器都为零，则表格会被省略。如果两个表格都被省略，库会显示原本计划用于 JIT 编译的代码路径：`LIBXSMM_TARGET=hsw`（否则代码路径会显示在表格标题中）。实际的计数器分为三个类别：小型核（\u003Cspan>MNK\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&#160;13\u003C\u002Fspan>）、中型核（\u003Cspan>13&#160;&lt;&#160;MNK\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&#160;23\u003C\u002Fspan>）以及大型核（\u003Cspan>23&#160;&lt;&#160;MNK\u003Csup>1\u002F3\u003C\u002Fsup>&#160;&lt;=&&#160;64\u003C\u002Fspan>；实际上限取决于编译时选择的 `LIBXSMM_MAX_MNK`）。需要注意的是，“大型”核在算术强度方面仍然属于较小规模，而算术强度会随着核大小线性增长。遗憾的是，算术强度取决于核的具体使用方式（哪些操作数被加载或存储到主内存中），收集这些信息会对性能产生一定影响。\n\n`TRY` 计数器表示所有尝试注册静态生成的核，以及所有尝试动态生成并注册核的行为。它还包括因 GEMM 参数不支持而被拒绝的 JIT 请求。`JIT` 和 `STA` 计数器则将上述事件（`TRY`）的成功案例进一步区分为动态生成的代码（`JIT`）和静态生成的代码（`STA`）。如果代码注册表的容量（\u003Cspan>O(*n*)&#160;=&#160;10\u003Csup>5\u003C\u002Fsup>\u003C\u002Fspan>）已满，即使继续尝试也无法再注册新的核。不过，多次注册核（\u003Cspan>O(*n*)&#160;=&#160;10\u003Csup>3\u003C\u002Fsup>\u003C\u002Fspan>）可能会增加哈希冲突次数（`COL`），从而降低性能。如果有效利用小型线程局部缓存，则可以避免这种情况。\n\n\n```bash\n注册表：20 MB（gemm=0 mcopy=14 tcopy=0）\n```\n\n如果使用调用包装器，还可以获得额外的运行时统计信息（参见 [调用包装器](documentation\u002Flibxsmm_mm.md#call-wrapper)）。\n\n\u003Ca name=\"objdump\">\u003C\u002Fa>**注意**：将 `LIBXSMM_VERBOSE` 设置为负值会使每个生成的 JIT 核被转储到一个二进制文件中，文件名与 [Intel VTune](documentation\u002Flibxsmm_prof.md#intelvtuneamplifier) 中显示的函数名相同。可以通过以下命令反汇编这些原始二进制文件：\n\n```bash\nobjdump -D -b binary -m i386 -M x86-64 [JIT-dump-file]\n```\n\n### 调用跟踪\n\n在初次使用 LIBXSMM API 时，可以依赖库的调试版本（`make DBG=1`）。该版本在库内部出现错误或警告时还会输出到控制台（stderr）。此外，还可以打印 LIBXSMM 内部的执行流程（调用跟踪），这可以与 DBG=1 或 OPT=0 结合使用：\n\n```bash\nmake TRACE=1\n```\n\n构建一个能够跟踪库内调用的应用程序需要 LIBXSMM 的共享库；或者，应用程序也可以以动态方式链接 LIBXSMM 的静态库（GNU 工具链：`-rdynamic`）。无需调试器即可实现调用跟踪的方法是通过名为 LIBXSMM_TRACE 的环境变量。\n\n```bash\nLIBXSMM_TRACE=1 .\u002Fmyapplication\n```\n\n语法上最多可接受三个由逗号分隔的参数（允许省略参数）：*tid* 表示要跟踪的线程 ID，取值范围为 1 到 NTHREADS；当 LIBXSMM_TRACE=1 时，默认会筛选“主线程”（即首次进入跟踪设施的线程）。若要捕获所有线程（不加筛选），可以提供一个负数 ID（这也是省略参数时的默认值）。第二个参数用于修剪调用树的更高层级，默认值为 *i=1*（其中零级是最顶层，与主函数处于同一级别）。最后一个参数指定包含的调用层级数，默认值为 *n=-1*（表示无筛选）。\n\n尽管 `ltrace`（Linux 工具）也能提供类似的洞察，但上述提到的筛选表达式使得 LIBXSMM 的跟踪功能更具实用性。请注意，即使在 LIBXSMM_TRACE=0 的情况下，跟踪功能也会严重降低性能，这不仅是因为控制台输出，还因为可能会阻止（内部）函数的内联，并且每次函数进入和退出时都会增加额外的调用开销。因此，也可以单独启用调试符号（`make SYM=1`；TRACE=1 或 DBG=1 会自动启用），这在对应用程序进行性能分析时可能会很有用。\n\n### 验证\n\n本节介绍如何使用 LIBXSMM 工具测试应用程序的正确性，特别是使用 `libxsmm_matdiff` 或 `libxsmm_matdiff_epsilon` 函数。前者（`libxsmm_matdiff`）比较两个矩阵（也可以退化为向量形式），并返回一个结构体，其中包含关于两矩阵差异的信息（黄金结果与测试结果）。后者（`libxsmm_matdiff_epsilon`）结合了绝对范数和相对范数（由上述结构体给出），计算出一个标量“epsilon”，可用于与容差进行比较。\n\n在应用程序中使用 `libxsmm_matdiff_epsilon` 时，会暴露一个名为 LIBXSMM_MATDIFF 的环境变量，该变量可以指定文件或目录路径（如果仅设置为 `LIBXSMM_MATDIFF=1`，则会使用默认文件名）。无论哪种情况，应用程序每次调用 `libxsmm_matdiff_epsilon` 时，都会向相应文件追加一行记录，内容包括 epsilon 值以及启动应用程序时使用的命令行。生成的文件可以进一步分析，例如运行 `sort -gk1 libxsmm_matdiff.log | tail -n 10`，即可得到发现的前十个最大 epsilon 值及其对应的命令行。\n\n环境变量 LIBXSMM_MATDIFF 
还可以携带可选的空格分隔参数，用于补充每个文件条目的信息，例如 `export LIBXSMM_MATDIFF=\"libxsmm_matdiff.log hello world\"`。在复杂情况下，这可用于补充仅在运行时才知道的值，比如用于判断 epsilon 的实际容差（通过 `putenv` 设置）。\n\n## 性能\n\n\u003Ca name=\"profiling\">\u003C\u002Fa>对于使用 LIBXSMM JIT 代码的应用程序，性能剖析得到了很好的支持。该库支持 \u003Cspan>Intel VTune Amplifier\u003C\u002Fspan> 和 \u003Cspan>Linux perf\u003C\u002Fspan>。文中详细介绍了如何集成性能剖析工具以及如何运行应用程序。\n\n* [使用 Intel VTune Amplifier 进行性能剖析](documentation\u002Flibxsmm_prof.md#intelvtuneamplifier)\n* [使用 Linux perf 进行性能剖析](documentation\u002Flibxsmm_prof.md#linuxperf)\n\n\u003Ca name=\"tuning\">\u003C\u002Fa>在编译时，有多种选项可以对 LIBXSMM 进行定制。该库针对广泛的使用场景进行了配置，其中包括针对典型用途的高级默认设置。\n\n* [自定义性能](documentation\u002Flibxsmm_tune.md#tuning)\n* \u003Ca name=\"auto-dispatch\">\u003C\u002Fa>[调整自动调度](documentation\u002Flibxsmm_tune.md#auto-dispatch)\n\n\u003Ca name=\"results\">\u003C\u002Fa>为了查找应用程序或性能复现工具的性能结果，仓库提供了一个名为 “results”的孤立分支，用于收集相关材料，如实测的性能结果及说明性图表。这些结果可以在 [https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fresults#libxsmm-results](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fresults#libxsmm-results) 找到，也可以按如下方式克隆：\n\n```bash\ngit clone --branch results \\\n  https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm.git \\\n  libxsmm-results\n```\n\n请注意，比较性能结果时，需要考虑矩阵乘法的运算对象是否被流式加载。例如，在所有矩阵都位于 L1 缓存的情况下进行乘法运算，可能会更倾向于某种实现方式，而这种实现方式在实际工作负载中表现可能较差（如果实际工作负载需要从主内存中流式加载部分或全部矩阵）。大多数 [代码示例](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples) 都旨在重现性能结果，建议尽可能模拟实际情况，或参考真实的 [应用程序](#applications)。\n\n### 高性能计算（HPC）\n\n\u003Cb>[1]&#160;\u003C\u002Fb>[https:\u002F\u002Fcp2k.org\u002F](https:\u002F\u002Fcp2k.org\u002F)：开源分子动力学软件CP2K及其配套的[DBCSR库](https:\u002F\u002Fgithub.com\u002Fcp2k\u002Fdbcsr)，该库用于处理大量小矩阵乘法的批处理运算。这些批处理数据来源于一个分布式块稀疏矩阵，其中包含针对特定问题的小型矩阵。自CP2K 3.0版本起，LIBXSMM可替代CP2K中的`libsmm`库。\n\n\u003Cb>[2]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002FSeisSol\u002FSeisSol\u002F](https:\u002F\u002Fgithub.com\u002FSeisSol\u002FSeisSol\u002F)：SeisSol是用于地震场景模拟、特别是动态破裂过程模拟的领先代码之一。LIBXSMM提供了高度优化的汇编内核，构成了SeisSol的核心计算基础（详见[https:\u002F\u002Fgithub.com\u002FTUM-I5\u002Fseissol_kernels\u002F](https:\u002F\u002Fgithub.com\u002FTUM-I5\u002Fseissol_kernels\u002F)）。\n\n\u003Cb>[3]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002FNekBox\u002FNekBox](https:\u002F\u002Fgithub.com\u002FNekBox\u002FNekBox)：NekBox是一款高度可扩展且跨平台的谱元法代码，其灵感源自[Nek5000](https:\u002F\u002Fnek5000.mcs.anl.gov\u002F)。NekBox专为盒状几何问题设计，旨在快速原型化新方法，并充分利用超越FORTRAN 77标准的FORTRAN特性。LIBXSMM可用于替代[Nek5000 
NekBox中的MXM_STD代码](https:\u002F\u002Fgithub.com\u002FNek5000\u002FNekBox\u002Fblob\u002Fbox\u002Fmxm_std.F90)。此外，请参阅LIBXSMM提供的[NekBox重现示例](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Ftree\u002Fmain\u002Fsamples\u002Fnek#nek-sample-collection)。\n\n\u003Cb>[4]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002FNek5000\u002FNek5000](https:\u002F\u002Fgithub.com\u002FNek5000\u002FNek5000)：Nek5000是由[https:\u002F\u002Fnek5000.mcs.anl.gov\u002F](https:\u002F\u002Fnek5000.mcs.anl.gov\u002F)开发的开源、高度可扩展且始终可移植的谱元法代码。Nek5000的开发分支已[集成](https:\u002F\u002Fgithub.com\u002FNek5000\u002FNek5000\u002Fblob\u002Fmaster\u002Fcore\u002Fmxm_wrapper.f)LIBXSMM。\n\n\u003Cb>[5]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.pyfr.org\u002F](https:\u002F\u002Fwww.pyfr.org\u002F)：PyFR是一个基于Python的开源框架，采用通量重构方法在流式架构上求解对流-扩散类问题。PyFR将LIBXSMM作为OpenMP后端的矩阵乘法提供者。\n\n\u003Cb>[6]&#160;\u003C\u002Fb>[http:\u002F\u002Fdial3343.org\u002Fabout\u002F](http:\u002F\u002Fdial3343.org\u002Fabout\u002F)：极端规模间断伽辽金环境（EDGE）是一种用于求解双曲型偏微分方程的求解器，尤其适用于地震模拟。EDGE的[源代码](https:\u002F\u002Fgithub.com\u002F3343\u002Fedge)可以选择性地依赖LIBXSMM，但为了获得高性能，强烈建议使用LIBXSMM的内核。\n\n\u003Cb>[7]&#160;\u003C\u002Fb>[https:\u002F\u002Fsxs-collaboration.github.io\u002Fspectre\u002F](https:\u002F\u002Fsxs-collaboration.github.io\u002Fspectre\u002F)：SpECTRE是一款开源代码，用于天体物理和引力物理中的多尺度、多物理场问题，可在Petascale级别运行，并专为Exascale计算机设计。未来，SpECTRE可能被应用于流体力学、地球科学、等离子体物理、核物理以及工程学等跨学科领域的问题。\n\n\u003Cb>[8]&#160;\u003C\u002Fb>[https:\u002F\u002Fceed.exascaleproject.org\u002Fceed-code\u002F](https:\u002F\u002Fceed.exascaleproject.org\u002Fceed-code\u002F)：高效Exascale离散化中心（CEED）依托Nek5000、MFEM、MAGMA、OCCA和PETSc等项目的成果，开发应用程序接口（API），涵盖高层和底层两个层面，以使应用能够充分利用高阶方法。CEED的底层API——[libCEED](https:\u002F\u002Fceed.exascaleproject.org\u002Flibceed\u002F)——将LIBXSMM用作[后端](https:\u002F\u002Fgithub.com\u002FCEED\u002FlibCEED#backends)，从而在CPU上实现高性能。\n\n\u003Cb>[9]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Fromeric\u002FFastor](https:\u002F\u002Fgithub.com\u002Fromeric\u002FFastor)：Fastor是一个轻量级的高性能张量代数框架，适用于现代C++，并可选择将LIBXSMM作为[JIT后端](https:\u002F\u002Fgithub.com\u002Fromeric\u002FFastor\u002Fwiki\u002F9.-Using-the-LIBXSMM-MKL-JIT-backend)。\n\n### 机器学习与深度学习（AI）\n\n\u003Cb>[10]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Fplaidml\u002Fplaidml](https:\u002F\u002Fgithub.com\u002Fplaidml\u002Fplaidml)：PlaidML是一个开源张量编译器，旨在实现跨多种CPU、GPU及其他加速器的高性能可移植性。结合Intel的nGraph编译器，PlaidML主要面向PyTorch、Keras（TensorFlow）和OpenVino等流行的深度学习框架。[PlaidML\u002Fv1](https:\u002F\u002Fgithub.com\u002Fplaidml\u002Fplaidml\u002Ftree\u002Fplaidml-v1)（开发分支）采用了正在行业范围内广泛普及的可扩展编译基础设施——MLIR。PlaidML\u002Fv1开始使用LIBXSMM作为面向CPU的后端。\n\n\u003Cb>[11]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Fintel\u002Fintel-extension-for-pytorch](https:\u002F\u002Fgithub.com\u002Fintel\u002Fintel-extension-for-pytorch)：英特尔PyTorch扩展旨在通过优异的性能，为用户在CPU上使用PyTorch提供流畅的体验。该扩展包开始依赖[LIBXSMM以实现CPU上的高性能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.04680)。\n\n\u003Cb>[12]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Ftpp-pytorch-extension](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Ftpp-pytorch-extension)：英特尔(R) 
PyTorch张量处理原语扩展是一个开源软件库，将张量处理原语（TPP）集成到PyTorch中。它旨在通过优异的性能，为用户在CPU上使用PyTorch提供流畅的体验。英特尔提交的MLPerf训练基准代码正是利用了该项目（详见[https:\u002F\u002Fgithub.com\u002Fmlcommons\u002Ftraining_results.1\u002Ftree\u002Fmain\u002FIntel\u002Fbenchmarks\u002Fbert\u002Fimplementations\u002Fpytorch-cpu](https:\u002F\u002Fgithub.com\u002Fmlcommons\u002Ftraining_results.1\u002Ftree\u002Fmain\u002FIntel\u002Fbenchmarks\u002Fbert\u002Fimplementations\u002Fpytorch-cpu)）。\n\n\u003Cb>[13]&#160;\u003C\u002Fb>[https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm-dnn](https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm-dnn)：LIBXSMM-DNN是一个开源软件库，展示了如何利用张量处理原语（TPP）来实现各种深度学习原语，如卷积、线性层，甚至池化和归一化操作。由于采用了TPP技术，无需编写任何平台相关的代码。\n\n### 自动驾驶（AD）\n\n\u003Cb>[15]&#160;\u003C\u002Fb>[https:\u002F\u002Fsoftware.seek.intel.com\u002Faccelerating-eigen-math-library](https:\u002F\u002Fsoftware.seek.intel.com\u002Faccelerating-eigen-math-library)：加速Eigen数学库以应对自动驾驶工作负载：卡尔曼滤波对速度的需求。发表于《Parallel Universe》杂志第31期（[pdf](https:\u002F\u002Fsoftware.intel.com\u002Fcontent\u002Fdam\u002Fdevelop\u002Fpublic\u002Fus\u002Fen\u002Fdocuments\u002Fparallel-universe-issue-31.pdf)）的文章。\n\n## 参考文献\n\n\u003Cb>[1]&#160;\u003C\u002Fb>[https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Ftech_poster_pages\u002Frpost244.html](https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Ftech_poster_pages\u002Frpost244.html)：通过单一构建模块实现高性能深度学习（[海报](https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Fposter_files\u002Frpost244s2-file2.pdf)和[摘要](https:\u002F\u002Fsc19.supercomputing.org\u002Fproceedings\u002Ftech_poster\u002Fposter_files\u002Frpost244s2-file3.pdf)），SC’19：国际高性能计算、网络、存储与分析大会，丹佛（科罗拉多州）。\n\n\u003Cb>[2]&#160;\u003C\u002Fb>[https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1109\u002FSC.2018.00069](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1109\u002FSC.2018.00069)：SIMD架构上高性能深度学习卷积的剖析（[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1808.05567.pdf)）。SC'18：国际高性能计算、网络、存储与分析大会，达拉斯（德克萨斯州）。\n\n\u003Cb>[3]&#160;\u003C\u002Fb>[https:\u002F\u002Fpasc17.pasc-conference.org\u002Ffileadmin\u002Fuser_upload\u002Fpasc17\u002Fprogram\u002Fpost116s2.pdf](https:\u002F\u002Fpasc17.pasc-conference.org\u002Ffileadmin\u002Fuser_upload\u002Fpasc17\u002Fprogram\u002Fpost116s2.pdf)：DBCSR：用于电子结构代码的稀疏矩阵乘法库（海报），PASC’17：PASC17大会，卢加诺（瑞士）。\n\n\u003Cb>[4]&#160;\u003C\u002Fb>[https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Ftech_poster_pages\u002Fpost190.html](https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Ftech_poster_pages\u002Fpost190.html)：理解英特尔架构上CNN小型卷积运算的性能（[海报](https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Fposter_files\u002Fpost190s2-file2.pdf)和[摘要](https:\u002F\u002Fsc17.supercomputing.org\u002FSC17%20Archive\u002Ftech_poster\u002Fposter_files\u002Fpost190s2-file3.pdf)），SC’17：国际高性能计算、网络、存储与分析大会，丹佛（科罗拉多州）。\n\n\u003Cb>[5]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fproceedings-article\u002Fsc\u002F2016\u002F8815a981\u002F12OmNCeaQ1D](https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fproceedings-article\u002Fsc\u002F2016\u002F8815a981\u002F12OmNCeaQ1D)：LIBXSMM：通过运行时代码生成加速小型矩阵乘法。SC'16：国际高性能计算、网络、存储与分析大会，盐湖城（犹他州）。\n\n\u003Cb>[6]&#160;\u003C\u002Fb>[http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u002Ftech_poster\u002Ftech_poster_pages\u002Fpost137.html](http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u0
02Ftech_poster\u002Ftech_poster_pages\u002Fpost137.html)：LIBXSMM：一个用于小型矩阵乘法的高性能库（[海报](http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u002Ftech_poster\u002Fposter_files\u002Fpost137s2-file2.pdf)和[摘要](http:\u002F\u002Fsc15.supercomputing.org\u002Fsites\u002Fall\u002Fthemes\u002FSC15images\u002Ftech_poster\u002Fposter_files\u002Fpost137s2-file3.pdf)）。SC'15：国际高性能计算、网络、存储与分析大会，奥斯汀（德克萨斯州）。\n\n\u003Cb>[7]&#160;\u003C\u002Fb>《张量处理原语：面向深度学习与HPC工作loads的高效性和可移植性编程抽象》（[arXiv预印本](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.05755)）。SC'21：国际高性能计算、网络、存储与分析大会，圣路易斯。\n\n## 文章\n\n\u003Cb>[1]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F10\u002F09\u002Fcloudy-supercomputers-join-the-hpc-petascale-club\u002F](https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F10\u002F09\u002Fcloudy-supercomputers-join-the-hpc-petascale-club\u002F)：云端超级计算机加入HPC拍字节级俱乐部。罗布·法伯撰写的文章，2019年。文章在单独一节中介绍了LIBXSMM。\n\n\u003Cb>[2]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F](https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F)：计算HPC应用扩展的成本。蒂莫西·普里克特·摩根撰写的文章，2019年。本文主要讨论CP2K开源分子动力学软件，而非LIBXSMM。然而，LIBXSMM对应用性能起到了关键作用。\n\n\u003Cb>[3]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F](https:\u002F\u002Fwww.nextplatform.com\u002F2019\u002F06\u002F26\u002Fcounting-the-cost-of-scaling-hpc-applications\u002F)：Azure HC系列在两万个核心上进行HPC基准测试。约翰·拉塞尔撰写的文章，2019年。本文同样聚焦于CP2K开源分子动力学软件，而非LIBXSMM。不过，LIBXSMM对应用性能至关重要。\n\n\u003Cb>[4]&#160;\u003C\u002Fb>[https:\u002F\u002Fsoftware.intel.com\u002Fsites\u002Fdefault\u002Ffiles\u002Fparallel-universe-issue-34.pdf](https:\u002F\u002Fsoftware.intel.com\u002Fcontent\u002Fwww\u002Fus\u002Fen\u002Fdevelop\u002Fdownload\u002Fparallel-universe-magazine-issue-34-october-2018.html)：LIBXSMM：英特尔硬件与软件开发的开源灵感来源（[PDF文件](https:\u002F\u002Fsoftware.intel.com\u002Fcontent\u002Fdam\u002Fdevelop\u002Fpublic\u002Fus\u002Fen\u002Fdocuments\u002Fparallel-universe-issue-34.pdf)）。汉斯·帕布斯特、格雷格·亨利和亚历山大·海内克合著的文章，2018年。\n\n\u003Cb>[5]&#160;\u003C\u002Fb>[https:\u002F\u002Fmedium.com\u002F@rmfarber\u002Flibxsmm-brings-deep-learning-lessons-learned-to-many-hpc-applications-9143c6c93125](https:\u002F\u002Fmedium.com\u002F@rmfarber\u002Flibxsmm-brings-deep-learning-lessons-learned-to-many-hpc-applications-9143c6c93125)：LIBXSMM将深度学习的经验教训应用于众多HPC应用。罗布·法伯撰写的文章，2018年。\n\n\u003Cb>[6]&#160;\u003C\u002Fb>[https:\u002F\u002Fwww.rdworldonline.com\u002Flargest-supercomputer-simulation-of-sumatra-andaman-earthquake\u002F](https:\u002F\u002Fwww.rdworldonline.com\u002Flargest-supercomputer-simulation-of-sumatra-andaman-earthquake\u002F)：苏门答腊—安达曼地震的最大规模超级计算机模拟。琳达·巴尼撰写的文章，2018年。","# LIBXSMM 快速上手指南\n\nLIBXSMM 是一个专为小型稠密和稀疏矩阵运算以及深度学习原语（如小卷积）优化的高性能库。它主要针对 Intel 架构，支持 SSE、AVX、AVX2、AVX-512（含 VNNI 和 Bfloat16）以及 AMX 指令集，通过即时编译（JIT）技术生成专用代码以实现最佳性能。\n\n## 环境准备\n\n### 系统要求\n*   **操作系统**: Linux (推荐), macOS, FreeBSD, Windows (需特定环境)。\n*   **处理器**: Intel 架构处理器（支持 SSE\u002FAVX\u002FAVX2\u002FAVX-512\u002FAMX 指令集）。虽然也可在 AMD 或其他架构上运行，但针对 Intel 指令集进行了深度优化。\n*   **编译器**:\n    *   C\u002FC++: GCC, Clang, 或 Intel oneAPI DPC++\u002FC++ Compiler。\n    *   Fortran (可选): Gfortran, Intel Fortran Compiler (若不需要 Fortran 接口可省略)。\n*   **构建工具**: GNU Make (`make` 或 `gmake`)。\n\n### 前置依赖\n*   通常无需额外第三方库即可构建核心功能。\n*   若需运行部分示例或进行完整测试，可能需要 BLAS 库（如 OpenBLAS, 
MKL）。\n\n## 安装步骤\n\nLIBXSMM 提供多种安装方式，推荐使用源码编译以获取针对当前硬件的最佳优化。\n\n### 方法一：源码编译（推荐）\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm.git\n    cd libxsmm\n    ```\n    *(注：国内用户若访问 GitHub 较慢，可配置 Git 加速或使用国内镜像源克隆)*\n\n2.  **编译库**\n    使用默认的 `make` 命令进行编译。该过程会自动检测当前 CPU 支持的指令集并生成相应的 JIT 代码。\n    ```bash\n    make\n    ```\n    \n    *可选配置*:\n    *   如果不需 Fortran 支持，可跳过 Fortran 编译器检查：\n        ```bash\n        make FC=\n        ```\n    *   指定编译器（例如显式使用 g++ 和 gcc）：\n        ```bash\n        make CXX=g++ CC=gcc\n        ```\n\n3.  **验证编译**\n    编译完成后，库文件位于 `lib\u002F` 目录，头文件位于 `include\u002F` 目录。\n\n### 方法二：包管理器安装\n\n对于主流 Linux 发行版，可直接通过包管理器安装预编译版本（可能非最新版本）：\n\n*   **Fedora\u002FRHEL\u002FCentOS**:\n    ```bash\n    sudo dnf install libxsmm-devel\n    ```\n*   **Debian\u002FUbuntu**:\n    ```bash\n    sudo apt-get install libxsmm-dev\n    ```\n*   **Spack (HPC 环境推荐)**:\n    ```bash\n    spack install libxsmm\n    ```\n\n## 基本使用\n\n以下是一个最简单的 C++ 示例，演示如何初始化矩阵并执行批量矩阵乘法（GEMM）。\n\n### 1. 编写代码\n\n将以下代码保存为 `hello.cpp`：\n\n```cpp\n#include \u003Clibxsmm.h>\n#include \u003Cvector>\n#include \u003Ccassert>\n#include \u003Ciostream>\n\nint main(int argc, char* argv[]) {\n  typedef double T;\n  \u002F\u002F 定义矩阵维度：Batchsize=1000, M=13, N=5, K=7\n  int batchsize = 1000, m = 13, n = 5, k = 7;\n  \n  \u002F\u002F 分配内存\n  std::vector\u003CT> a(batchsize * m * k);\n  std::vector\u003CT> b(batchsize * k * n);\n  std::vector\u003CT> c(m * n, 0);\n\n  \u002F\u002F 定义内核类型 (C++ 函子风格)\n  typedef libxsmm_mmfunction\u003CT> kernel_type;\n  \n  \u002F\u002F 生成并分发矩阵乘法内核: C += alpha * A * B + beta * C\n  \u002F\u002F 此处 alpha=1.0, beta=1.0\n  kernel_type kernel(LIBXSMM_GEMM_FLAG_NONE, m, n, k, 1.0 \u002F*alpha*\u002F, 1.0 \u002F*beta*\u002F);\n  \n  assert(kernel); \u002F\u002F 确保内核生成成功\n\n  \u002F\u002F 初始化输入数据\n  for (int i = 0; i \u003C batchsize; ++i) {\n    for (int ki = 0; ki \u003C k; ++ki) {\n      for (int j = 0; j \u003C m; ++j) \n        a[i * m * k + j * k + ki] = static_cast\u003CT>(1) \u002F ((i + j + ki) % 25 + 1);\n      for (int j = 0; j \u003C n; ++j) \n        b[i * k * n + ki * n + j] = static_cast\u003CT>(7) \u002F ((i + j + ki) % 75 + 1);\n    }\n  }\n\n  \u002F\u002F 执行批量矩阵乘法: C += Ai * Bi\n  for (int i = 0; i \u003C batchsize; ++i) {\n    kernel(&a[i * m * k], &b[i * k * n], &c[0]);\n  }\n\n  std::cout \u003C\u003C \"Computation finished. Result C[0] = \" \u003C\u003C c[0] \u003C\u003C std::endl;\n  return 0;\n}\n```\n*(注意：修正了原 README 示例中数组索引计算的逻辑错误，以确保正确的行\u002F列主序访问)*\n\n### 2. 编译与链接\n\n假设 LIBXSMM 源码位于 `\u002Fpath\u002Fto\u002Flibxsmm`，使用以下命令编译：\n\n```bash\ng++ -I\u002Fpath\u002Fto\u002Flibxsmm\u002Finclude hello.cpp -L\u002Fpath\u002Fto\u002Flibxsmm\u002Flib -lxsmm -lblas -o hello\n```\n*   `-I`: 指向 `libxsmm.h` 所在目录。\n*   `-L`: 指向编译生成的 `.so` 或 `.a` 库文件所在目录。\n*   `-lxsmm`: 链接 LIBXSMM 库。\n*   `-lblas`: 链接 BLAS 库（某些系统可能需要，若未安装可尝试移除或替换为具体 BLAS 实现如 `-lopenblas`）。\n\n### 3. 
运行程序\n\n设置动态库路径并运行。开启 `LIBXSMM_VERBOSE=2` 可查看 JIT 代码生成详情：\n\n```bash\nexport LD_LIBRARY_PATH=\u002Fpath\u002Fto\u002Flibxsmm\u002Flib:$LD_LIBRARY_PATH\nLIBXSMM_VERBOSE=2 .\u002Fhello\n```\n\n### 关键提示\n*   **适用场景**: LIBXSMM 最适合小型矩阵乘法，经验法则为 $(M \\times N \\times K)^{1\u002F3} \\le 64$。对于超大矩阵，传统 BLAS 库可能更合适。\n*   **无需重编译**: 编译一次生成的库可在不同代际的 Intel CPU 上运行，库会在运行时根据实际 CPU 特性自动生成最优代码（Build once, deploy everywhere）。\n*   **初始化**: 在高性能敏感场景中，可手动调用 `libxsmm_init()` 避免首次调用时的延迟，程序退出时会自动清理资源，也可手动调用 `libxsmm_finalize()`。","某高性能计算团队正在开发一套基于 CPU 的实时金融风险预测系统，核心算法依赖海量的小矩阵乘法运算。\n\n### 没有 libxsmm 时\n- **性能瓶颈严重**：通用数学库（如标准 BLAS）针对大矩阵优化，处理系统中大量 $13 \\times 5$ 这类微小矩阵时，函数调用开销远超计算本身，导致 CPU 利用率极低。\n- **硬件潜力浪费**：代码无法自动适配不同代际的 Intel 处理器，难以利用 AVX-512、VNNI 或 AMX 等新指令集加速，升级硬件后性能无明显提升。\n- **维护成本高昂**：为了榨取性能，开发人员被迫手动编写复杂的内联汇编或针对特定编译器调整标志，导致代码移植性差，“一次构建”无法“到处部署”。\n\n### 使用 libxsmm 后\n- **运算速度飞跃**：libxsmm 通过即时编译（JIT）技术生成专用内核，消除了小矩阵计算的调用开销，使风险预测模型的推理延迟降低了 80% 以上。\n- **自动硬件适配**：程序运行时自动检测当前 CPU 架构并生成最优指令代码，无需修改源码即可在旧款服务器或最新 Sapphire Rapids 芯片上满血运行。\n- **开发效率提升**：团队只需调用简洁的 C++ 接口即可实现高精度矩阵运算，不再受困于底层指令集差异，显著缩短了从算法验证到生产部署的周期。\n\nlibxsmm 通过动态生成专属机器码，将原本被通用库拖累的小矩阵计算转化为极致性能，让算法真正跑满硬件极限。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Flibxsmm_libxsmm_d4a9a6bc.png","LIBXSMM","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Flibxsmm_e1f1b188.png","Library targeting Intel Architecture for specialized dense and sparse matrix operations, and deep learning primitives.",null,"https:\u002F\u002Fgithub.com\u002Flibxsmm\u002F","https:\u002F\u002Fgithub.com\u002Flibxsmm",[82,86,90,94,98,102,106,110,113,116],{"name":83,"color":84,"percentage":85},"C","#555555",94.5,{"name":87,"color":88,"percentage":89},"Makefile","#427819",1.4,{"name":91,"color":92,"percentage":93},"Shell","#89e051",1.2,{"name":95,"color":96,"percentage":97},"C++","#f34b7d",1.1,{"name":99,"color":100,"percentage":101},"Python","#3572A5",1,{"name":103,"color":104,"percentage":105},"Fortran","#4d41b1",0.6,{"name":107,"color":108,"percentage":109},"Batchfile","#C1F12E",0,{"name":111,"color":112,"percentage":109},"CMake","#DA3434",{"name":114,"color":115,"percentage":109},"JavaScript","#f1e05a",{"name":117,"color":118,"percentage":109},"Starlark","#76d275",948,202,"2026-04-02T15:45:08","BSD-3-Clause",4,"Linux, macOS, FreeBSD","不需要 GPU。该库专为 Intel CPU 架构优化，支持 SSE, AVX, AVX2, AVX-512 (含 VNNI\u002FBfloat16) 及 AMX 指令集。","未说明（取决于具体矩阵运算规模，库本身针对小矩阵乘法优化）",{"notes":128,"python":129,"dependencies":130},"1. 该库主要针对 Intel 架构处理器进行优化，不支持非 Intel 平台的 SIMD 加速。\n2. 核心功能是即时编译 (JIT) 生成代码以执行小型密集\u002F稀疏矩阵乘法及深度学习原语。\n3. 适用于‘一次构建，到处部署’，无需针对特定目标架构设置特殊编译标志即可利用可用性能。\n4. 可通过 Fedora\u002FRedHat\u002FCentOS, Debian\u002FUbuntu, FreeBSD 的包管理器安装，或使用 Spack\u002FEasyBuild 安装。\n5. 
默认需要 C\u002FC++ 和 Fortran 编译器，若不需要 Fortran 接口可通过配置禁用。","不需要 Python（主要提供 C, C++, Fortran 接口）",[131,132,133,134],"GNU Make","C\u002FC++ 编译器 (如 GCC, Clang, Intel Compiler)","Fortran 编译器 (可选，如 gfortran)","BLAS 库 (用于示例链接，如 -lblas)",[51,13],[137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155],"jit","simd","avx512","machine-learning","sparse","blas","matrix-multiplication","transpose","bfloat16","avx2","avx","sse","vector","intel","matrix","tensor","convolution","amx","fortran","2026-03-27T02:49:30.150509","2026-04-06T09:46:58.852277",[159,164,169,174,179,184],{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},12409,"在 fsspmdm 中，矩阵 B 的填充（padding）和维度有什么硬性要求？","是的，矩阵 B 必须进行填充，使其列数（或者至少是前导维度 leading dimension）成为向量长度的倍数。鉴于 B 的列数与元素数量成正比，这通常不是一个特别严格的要求。","https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fissues\u002F581",{"id":165,"question_zh":166,"answer_zh":167,"source_url":168},12410,"如何在 JIT 生成的内核中分配和管理内存，以避免内存泄漏？","可以通过增强 `libxsmm_x86_instruction_open_stream` 来实现，向其传递一个本地数组指针及其大小。该数组的内容可以被复制到代码流中（理想情况下位于函数上方，以避免跳转）。此外，可以向 `libxsmm_gp_reg_mapping` 添加一个额外的寄存器映射参数，该参数将在函数序言中初始化为指向此数据的地址。另一种常见做法是利用 JIT 函数的第一个参数来传递指向这些表的指针（当它们无法放入寄存器时）。","https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fissues\u002F546",{"id":170,"question_zh":171,"answer_zh":172,"source_url":173},12411,"为什么在 AARCH64 (如 AWS t4g.small) 上，libxsmm 的密集矩阵乘法性能远低于 MKL 或 OpenBLAS？","这种差异在特定硬件上是预期的。例如，A64FX 架构的 FMA 指令延迟约为 11 个周期。由于时间和硬件限制，目前可能无法完全支持针对此类架构的深度优化。虽然针对 V2 和 N2 核心做了大量修复并持续改进，但在某些小尺寸密集矩阵场景下，OpenBLAS 等库仍可能比 JIT 生成的 libxsmm 内核快约 3.5 倍。","https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fissues\u002F537",{"id":175,"question_zh":176,"answer_zh":177,"source_url":178},12412,"使用 dfsspmdm 时遇到 'double free or corruption' 或内存分配错误怎么办？","这通常与生成 'k_sparse2' 内核时的内存损坏有关，而非单纯的内存分配问题。解决方案是禁用生成 k_sparse2 内核（即在 `libxsmm_?fsspmdm_create` 中跳过该步骤），这通常能解决崩溃问题且对性能没有负面影响。如果问题依然存在，可能需要检查生成的内核中是否有指令越界，特别是在处理常数打包和置换从内存加载的情况下。","https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fissues\u002F501",{"id":180,"question_zh":181,"answer_zh":182,"source_url":183},12413,"如何编译并生成使用 AMX 指令的代码，而不是默认的 AVX512？","仅仅设置 `LIBXSMM_TARGET=spr` 可能不够，还需要确保在调用内核时启用了 'batch-reduce' (BR) 模式。如果在命令行中看到 `nobr` 参数（代表 \"no batch-reduce\"），请将其移除或替换为相应的 batch-reduce 选项。AMX 的支持通常依赖于批处理归约（Batch-Reduce）操作，可以通过指定矩阵指针数组或索引来实现。集成时通常需要编写与张量首选格式相关的循环嵌套，并在最内层使用 BR 内核。","https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fissues\u002F403",{"id":185,"question_zh":186,"answer_zh":187,"source_url":188},12414,"当稀疏矩阵 A 的唯一常数数量过多无法放入寄存器时，libxsmm 的回退机制是什么？为什么不使用 `libxsmm_generator_spgemm_csr_asparse`？","当前实现中，如果唯一常数不适合寄存器内核，回退方案是使用稠密例程（dense routine）。`libxsmm_generator_spgemm_csr_asparse` 或其相关内核目前在库中未被广泛使用或处于未激活状态。维护者建议将相关的内容迁移或复用到稀疏领域的正式文档（如 `documentation\u002Flibxsmm_sp.md`）中，以明确当前的支持范围和替代方案。","https:\u002F\u002Fgithub.com\u002Flibxsmm\u002Flibxsmm\u002Fissues\u002F378",[190,195,200,205,210,215,220,225,230,235,240,245,250,255,260,265,270,275,280,285],{"id":191,"version":192,"summary_zh":193,"released_at":194},62750,"1.17","本次发布将 master\u002Fmain 分支的构建系统回迁至 v1.16 版本，同时尽可能减少了必要的代码改动。然而，由于仍需进行一些非 trivial 的代码调整，因此版本号定为 v1.17。之所以需要此次发布，主要是因为 1.16 版本的代码已较为陈旧，且在此期间出现了新的编译器。例如，在使用 GNU GCC 10.x 或 11.x 等编译器时，会出现类似 #562 这样的问题。\n\n**注意**：v1.17 版本与 v1.16x 版本基于相同的代码库。所有新功能、修复及开发进展目前仍未对外发布。根据 LIBXSMM 保持 master\u002Fmain 分支稳定性的政策，用户可直接使用 master\u002Fmain 分支以获得最新的功能、修复和开发进展。\n\n**新增内容**\n* 经由原 v1.16 之后发布的编译器（GNU GCC 10.x、11.x 以及多个 Clang 版本）验证通过。\n\n**改进与变更**\n* 针对使用特定 ISA 扩展的静态代码路径，默认配置进行了优化，无需再手动调整 INTRINSICS 设置。\n\n---\n构建系统控制着多项选项，而自 
v1.16 以来，这些选项的集合已经发生了演变，这也是导致代码变动的主要原因。更多变更的一个积极影响是能够进行全面的（重新）验证。本次发布已适配 LIBXSMM 不断演进的测试环境（v1.16.x 已无法再次验证）。v1.17 的代码验证水平再次达到原始 v1.16 的水准，并进一步涵盖了此后出现的新编译器。","2021-12-03T09:00:21",{"id":196,"version":197,"summary_zh":198,"released_at":199},62751,"1.16.3","本次更新在 LIBXSMM 的 master\u002Fmain 分支中引入了修复，并解决了两个 [CVE](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fissues\u002F513) 漏洞。版本 1.16.3 继续沿用与 1.16.2 和 1.16.1 相同的代码库。所有新功能、修复及开发进展目前仍未发布。根据 LIBXSMM 保持 master\u002Fmain 分支稳定性的政策，用户可以依赖该分支来使用新的功能、修复及开发进展。\n\n**改进 \u002F 变更 \u002F 修复**\n* CVE-2021-39535\n* CVE-2021-39536","2021-10-14T07:25:38",{"id":201,"version":202,"summary_zh":203,"released_at":204},62752,"1.16.2","此小幅更新可解决一个问题：在旧式系统上进行操作系统安装时，系统不会通知保存使用 SSE 等指令集扩展的上下文状态。该问题其实很早就在 LIBXSMM 的主开发分支中得到修复。这一问题曾在某些虚拟机环境中以及部分操作系统安装中被发现（例如 [此处](https:\u002F\u002Fgithub.com\u002Fcp2k\u002Fcp2k\u002Fpull\u002F1629)）。\n\n**新增内容**\n* 新功能与特性将继续保留在 LIBXSMM 的主开发分支中。\n\n**改进\u002F变更\u002F修复**\n* 即使操作系统未能正确报告对某项 ISA 扩展的支持，也采用相应的代码路径。\n\n**注**：版本 1.16.2 与版本 1.16.1 使用相同的代码库（仅有一行代码应用了上述修复）。所有新功能、修复及开发进展目前仍未发布。根据 LIBXSMM 保持主分支稳定的政策，用户可使用主分支以获得最新的功能、修复和开发进展。","2021-08-31T10:41:41",{"id":206,"version":207,"summary_zh":208,"released_at":209},62753,"1.16.1","此（次要）版本修复了下方列出的问题，并增强了对各平台的支持。\n\n> 感谢苏黎世大学化学系慷慨资助，提供了克雷系统的使用权限。\n\n**改进\u002F变更**\n* 抑制由独立 OpenMP 运行时（基于 Clang 的工具链）引起的编译器警告。\n* 示例代码：在包含 `77blas.h` 时，防止出现 OpenBLAS 未定义类型的问题（[问题](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fissues\u002F390)）。\n\n**修复**\n* 修复了基于 Clang 的克雷编译器以及克雷经典编译器的编译和运行时问题。\n* 修订了 `libxsmm_xdiff` 的 Fortran 实现，并移除了对 `_Bool` 类型的依赖（[问题](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fcommit\u002F32b687d6e0bd8fc56b7451b789d9b6ca9b3a5ebb#commitcomment-40121284)）。","2020-06-26T15:39:29",{"id":211,"version":212,"summary_zh":213,"released_at":214},62754,"1.16","这是一个维护版本，旨在将项目的持续开发成果整合进一个稳定的发布版本中。经过验证的发布版本使我们的用户能够充分利用多项改进和修复（见下文），尤其是在即将推出的新功能背景下。\n\n> 感谢您的贡献——您的贡献至关重要！本项目收到了来自各方的诸多贡献，包括拉取请求、问题报告、功能建议以及非正式咨询等。我们衷心感谢您为开源软件所付出的努力与时间！\n\n**新增功能**\n* 所有平台实现零配置，对于仅包含头文件的用法完全无需任何配置。这简化了 Visual Studio 的使用，不再需要预先配置或内置自定义步骤。同时，也简化了第三方构建系统集成 LIBXSMM 的过程，无论是仅头文件方式还是经典 ABI 方式。\n* 更新了 [Hello LIBXSMM](libxsmm.readthedocs.io\u002F#hello-libxsmm)，并添加了 C\u002FC++ 和 Fortran 的代码示例（位于 [GitHub 仓库](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Ftree\u002Fmaster\u002Fsamples\u002Fhello)）。此外，还提供了对 Bazel 的最低限度支持（按需求提供）。这一改动并非要改变我们基于 Makefile 的构建流程，而是为了帮助偏好使用 Bazel 的开发者更快上手。\n* 提供了用于 [用户数据分发](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_aux\u002F#user-data-dispatch) 的 Fortran 接口，并附带了一个使用该接口一次性分发多个内核的 [Fortran 示例代码](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Ftree\u002Fmaster\u002Fsamples\u002Futilities\u002Fdispatch#user-data-dispatch)。C 语言接口已于此前版本（v1.15）中引入。\n* 实验性功能：支持矩阵元素逐点操作的内核（meltw），例如缩放、归约、类型转换等。\n\n**改进与变更**\n* 扩展了使用 LIBXSMM 的[应用列表](https:\u002F\u002Flibxsmm.readthedocs.io\u002F#applications and projects)。我们的[文档](https:\u002F\u002Flibxsmm.readthedocs.io\u002F#references)也在左侧菜单底部列出了各热门类别下的相关应用。\n* 修复了 matcopy 例程中的性能缺陷，并新增了 [微基准测试](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Ftree\u002Fmaster\u002Fsamples\u002Fmatcopy)。\n* 改进了详细输出信息（水印、额外警告等）。\n* 在编译时禁用内存包装器（仅可通过 [opt-in](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_tune\u002F#intercepted-allocations) 方式启用）。\n* 完全迁移到 Python3 shebang（回退至 Python2）。\n* 改进了 Fortran 接口（重载等功能）。\n* 进一步增强了对 GNU GCC 10 的支持。\n* 扩展了稀疏运算功能。\n\n**修复**\n* 避免直接操作 GNU 的特性标志位（优化仅头文件库的使用）。\n* 修复了 Intel VTune 2020 的检测问题（当使用源码级分析器且 SYM=1 时）。\n* 确保始终以未对齐方式发出 LD\u002FST 
指令（基于内联汇编的代码）。","2020-06-20T17:47:09",{"id":216,"version":217,"summary_zh":218,"released_at":219},62755,"1.15","版本 2.0 是我们预期的下一个发布版本。而 v1.15 的目标则是将 LIBXSMM 无缝合入各操作系统发行版，这些发行版很快就会开始使用 GNU GCC 10 构建软件包（更多[详情](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fissues\u002F369#issuecomment-586705304)）。\n\n除了新增对编译器的支持外，LIBXSMM 即便在核心功能——包括批量归约在内的 SMM 内核——上，也实现了细微但持续的性能提升。其中，深度学习领域的开发进展最为显著，并以滚动发布的方式持续交付新特性。DNN 后端进一步扩展了对低精度\u002F混合精度内核以及内核融合（如卷积神经网络中使用的批量归约加 X）的支持。\n\n**新增功能**\n* 小矩阵乘法和批量归约内核现已支持以下输入类型：`FP64`、`FP32`、`bfloat16`、`int16` 和 `int8`。借助 AVX-512 扩展（VNNI 和 Bfloat16），低精度支持覆盖了多种输入与累加类型组合。\n* 新增 C API（Fortran 版本随后推出）：(1) 内核自检接口，接收内核函数指针并填充包含 FLOPS 计数、代码大小等信息的数据结构；无需搜索开销；(2) [注册用户自定义数据](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_mm\u002F#user-data-dispatch)，利用 LIBXSMM 的高效键值数据库\u002F查询机制，例如降低单个任务中多个内核调度的开销。\n* Fortran API：增加了部分通用过程的不同变体；通过精确匹配（过程重载），可潜在避免临时变量的生成。\n* 使用 LIBXSMM 处理稀疏权重矩阵的有限元方法（DGFEM）向量化示例。\n* 展示稀疏权重矩阵乘法（深度学习）的示例。\n* 下一代 CP2K\u002Fcollocate 实现的复现工具。\n* 构建过程中生成的模块文件（`module av`）。\n\n**改进与变更**\n* 允许在 Windows 系统下跳过完整的配置步骤；提升了对 Visual Studio 的构建支持。需要注意的是，Windows 调用约定仍待实现，但已在推进中。目前状态尚未保持调用约定，因此可能无法正常工作（作为 workaround，可以为 LIBXSMM 内核使用包装函数）。\n* 停止为卷积生成专用代码，卷积现基于 [批量归约内核](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.06440) 实现，并修订了批量归约 API，以支持 (1) 与先前版本一致的绝对地址，(2) 相对偏移\u002F索引，以及 (3) 常量\u002F相同的偏移量\u002F步长。\n* LIBXSMM\u002FEXT：在 macOS 上支持 OpenMP（使用 Apple 的 LLVM 编译器）。\n* LIBXSMM 整个代码库均采用 SPDX 许可标识符（BSD-3-Clause）。\n* 针对虚拟化平台的计时器精度详细提示信息。\n* 总体提升了输出信息的详尽程度（洞察力、细节性和准确性）。\n* 后端新增指令支持。\n* 调度开销略有降低。\n* NUMA 感知的 [GxM 框架](https:\u002F\u002Flibxsmm.readthedocs.io\u002Fgxm\u002F)。\n\n**修复**\n* 修复了问题 #334、#347、#371、#368 和 #369。\n* 截至 Synopsys Coverity 测试，未发现任何缺陷。\n* 修复了构建系统中的重新构建问题。","2020-03-13T16:01:56",{"id":221,"version":222,"summary_zh":223,"released_at":224},62756,"1.14","本次发布在合并我们重构的深度学习后端之前，带来了一些重要的修复和改进（见下文）。这个版本很可能是我们1.x系列的最后一个版本。对于LIBXSMM即将发布的重大版本，除了深度学习领域之外，核心功能的API将保持兼容。即使在深度学习领域，也仅涉及API调整，而非重大变更（通常是直接或轻微的改动）。\n\n> 感谢您的贡献：[jewillco](https:\u002F\u002Fgithub.com\u002Fjewillco)、[yurivict](https:\u002F\u002Fgithub.com\u002Fyurivict)、[antoscha](https:\u002F\u002Fgithub.com\u002Fantoscha)、[breuera](https:\u002F\u002Fgithub.com\u002Fbreuera)、[jeremylt](https:\u002F\u002Fgithub.com\u002Fjeremylt)、[HiSPEET](https:\u002F\u002Fgithub.com\u002FHiSPEET)以及[legrosbuffle](https:\u002F\u002Fgithub.com\u002Flegrosbuffle)。我们还要感谢所有直接贡献者，以及那些为这款开源软件付出心血与时间的非正式参与者！\n\n**新增功能**\n* 为通用3\u002F6参数函数（元数）引入原生PROCEDURE类型（Fortran接口）。\n* 引入[拦截式内存分配](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_tune\u002F#intercepted-allocations)，适用于基于LIBXSMM临时内存的应用程序。\n* LIBXSMM自多个版本以来一直保证对有效请求返回非空内核。现在，空形状请求也被视为有效（SMM、MCOPY和TCOPY）。\n* 在文档中新增“入门”章节（“你好，LIBXSMM”）。\n\n**改进与变更**\n* 终止统计现可区分普通SMM和退化SMM（GEMV）。\n* 支持Immintrin-debug（https:\u002F\u002Fgithub.com\u002Fintel\u002FImmintrin-debug）。\n* 若编译器仅支持低分辨率计时器，则发出警告。\n* 支持基于GNU GCC设置的PGI编译器；但仍存在一些[问题](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fwiki\u002FCompatibility#pgi)。\n* 通常情况下，即使操作系统不允许，也会启用ISA扩展（XSAFE）。\n* 在OSX i\u002FMac Pro上强制启用AVX-512（OSX：XSAFE\u002FZMM被禁用）。\n* 当VTUNE=0时，会禁用性能分析器支持（即使已检测到且SYM=1）。\n* 增加对外部指针（非由库分配）的内存信息处理能力。\n* 优化临时内存分配逻辑，避免不必要的冗余警告（verbose模式）。\n* 改进临时内存分配统计信息（如水位线等）。\n* 实现了针对使用STOP语句的Fortran程序的退出处理程序。\n* 避免此前通过编译器标志抑制的警告。\n* Makefile现仅允许静态库和共享库版本匹配构建。\n* 适配Windows下的Clang编译器。\n* 提升极短序列的随机数生成器性能。\n* 更新了Visual Studio项目及安装配置（VS2019）。\n* 更新并修订了文档。\n* 更新了[文章](https:\u002F\u002Flibxsmm.readthedocs.io\u002F#articles)和应用示例。\n* 合并了第355号贡献。\n* 降低了调度开销。\n\n**修复**\n* 修复了2019年2月24日发生的编译器生成代码调度问题（影响SpMDM和DL）。\n* 
修正了将字面量-1强制转换为无符号整数的问题，该问题源于本应使用64位整数的情况。\n* 解决了与结构对齐\u002F填充\u002F复制相关的问题（CCE）。\n* 潜在无效的内核缓存与co","2019-10-25T14:51:51",{"id":226,"version":227,"summary_zh":228,"released_at":229},62757,"1.13","本次发布对构建系统和内部结构进行了改进。主要目的是为最新的操作系统环境持续提供流畅的构建和运行体验。\n\n> 感谢您的贡献——您的贡献至关重要！无论是提交问题报告、提出功能建议，还是偶然接触到本项目并参与其中，每一位参与者都为本项目做出了直接或间接的贡献。我们衷心感谢大家为开源软件所付出的努力和时间！\n\n**改进\u002F变更**\n* Fortran：已启用 libxsmm_ptr* 最终返回 C_NULL_PTR。\n* 避免将 Spack 环境视为维护者构建（应用 SSE4 标志）。\n* 将结构体数组（SOA）密集型例程重命名为“packed”。\n* 为即将推出的功能进行内部准备（内存分配）。\n* 改进了构建系统（针对最新操作系统环境）。\n\n**修复**\n* 解决缺少 _Float128 定义的 workaround 的前提条件（#339）。\n* 从概念上避免访问零大小数组（Fortran 接口）。\n* 修正了临时内存池的数量（LIBXSMM_VERBOSE）。","2019-07-15T07:06:35",{"id":231,"version":232,"summary_zh":233,"released_at":234},62758,"1.12.1","本次发布修复了 pkg-config 文件中前缀目录相关的问题，这些问题曾影响维护者构建（Linux 和 FreeBSD 软件包）、软件包管理器 Spack，以及使用 pkg-config 来确定编译\u002F链接标志的用户。此外，还对 FreeBSD 下的维护者构建进行了一些预设调整，以使其更加顺畅。\n\n**改进 \u002F 变更**\n* 构建示例：检测 Intel MKL（当通过软件包管理器安装时）。\n* 改进了 FreeBSD 下的构建系统（检测 BLAS 库等）。\n\n**修复**\n* 问题 #331、问题 #333、问题 #334，以及 spack\u002Fspack#11413。\n* 其中一个代码示例中的 OpenMP 构建问题（GCC 9.1）。","2019-05-23T08:55:03",{"id":236,"version":237,"summary_zh":238,"released_at":239},62759,"1.12","本次发布旨在提升易用性，并修复若干非关键性缺陷。此外，还新增了 BLAS（类似）的 [批处理 GEMM](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_mm#blas-batch-interface) 实现（`?GEMM_BATCH`）。该接口目前仅支持 C\u002FC++ 语言，但可以通过隐式调用方式（类似于 Fortran 77）或通过 [拦截现有调用](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_mm\u002F#call-wrapper) 的方式使用（静态和动态链接）。\n\nLIBXSMM 自数个版本起便提供了批处理 GEMM 接口，既支持指针，也支持索引数组及字节级步长，用于从结构体数组（AoS）中提取数据。而新的 BLAS 接口仅支持指向操作数矩阵的连续指针数组，但允许多组同质的批处理任务并行执行。所有批处理接口均提供顺序（ST）和多线程（MT）两种实现形式，并在多线程模式下支持 [同步机制](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_mm\u002F#batch-sync)。\n\n**新增功能**\n* 批处理 GEMM 接口及其实现（GEMM_BATCH）。\n* LSTM 操作的 TensorFlow 封装代码。\n* GEMM_BATCH 和 GEMV 的拦截器。\n\n**改进与变更**\n* LSTM：为 Bfloat16 数据类型启用了更多张量格式。\n* 已通过 GNU GCC 9.1 版本的验证。\n\n**修复**\n* 问题 #331、问题 #333、问题 #334，以及 https:\u002F\u002Fgithub.com\u002Fspack\u002Fspack\u002Fissues\u002F11413。\n* 若干其他较小的修复。","2019-05-10T13:05:48",{"id":241,"version":242,"summary_zh":243,"released_at":244},62760,"1.11","This release accumulated more than 1200 changes since the last release and is a major preparation for our future v2 of the library. Beside stability improvements, refining existing functionality and bug-fixes, there were several introductions of new functionality: packed\u002Fcompact data layout functions for solving linear equations, new flavors of SMM-kernels along with relaxed limitations (transb), and overall support for low-precision based on the Bfloat16 FP-format.\r\n\r\nThe Deep Learning (DL) domain is still under active research and development including co-design. The API however is rather stable (DLv2 since v1.8) with an implementation that continues to receive major development. Towards LIBXSMMv2, the DL domain will undergo major code reduction (implementation) while providing the same or more functionality (first sign is the removal of the Winograd code in this release).\r\n\r\n> THANK YOU FOR YOUR CONTRIBUTION - we had **again** several direct (and indirect) contributions, reports, and involvement from people who came across the project. We would like to thank you all for the effort and time you spent working on Open Source! 
\r\n\r\n**INTRODUCED**\r\n* Packed function domain (compact data format) with GEMM, GETRF, TRMM, and TRSM functions.\r\n* Relaxed limitation of SMM kernels: `TransB=T` is now allowed (in addition to `TransB=N`).\r\n* Batch-reduce GEMM-kernel which is optimized for in-cache accumulation (Beta=1).\r\n* Included build setup in library (environment variable `LIBXSMM_DUMP_BUILD=1`).\r\n* CPU feature detection is updated for Cascadelake and Cooperlake (CLX and CPX).\r\n* Bfloat16 instruction support for Cooperlake (CPX).\r\n* Bfloat16 support for DL and SMM domain.\r\n* Fast RNGs for single-precision FP data.\r\n\r\n**IMPROVEMENTS**\r\n* Cray Compiler (legacy and current versions) is supported with LIBXSMM's use of intrinsics, inline assembly, and CPUID detection, and therefore received major performance improvements. Previously, even JIT code was limited to AVX due to unsupported CPUID flow.\r\n* Updated support for tensorflow::cpu_allocator for API change in TensorFlow API (v1.12.0 and beyond).\r\n* Guarantee JIT'ted function (non-NULL); see CHANGE about libxsmm_[get|set]_dispatch_trylock.\r\n* Call wrapper\u002Finterceptor (static\u002Fshared library) now always works i.e., no special build required.\r\n* SpMDM\u002FBfloat16 interface to enable TensorFlow which gained type-support for Bfloat16.\r\n* GxM framework updated for fused DL ops, Bfloat16, and a variety of new DL operators.\r\n* DL domain with LSTM and GRU cells, fully connected layer, and batch norm support.\r\n* Reduced unrolling and code size of transpose kernels (to fit i$).\r\n* Extended Fortran interface (matdiff, diff, hash, shuffle).\r\n* Purified some more routines (Fortran interface).\r\n* More statistical values (libxsmm_matdiff\u002Finfo).\r\n\r\n**CHANGES**\r\n* KNC support has been removed (maps to generic code). Offload infrastructure has been kept.\r\n* Winograd code has been removed from DL domain (see also introduction to this release).\r\n* Removed libxsmm_[get|set]_dispatch_trylock (demoted to compile-time option).\r\n* Threshold criterion of libxsmm_gemm (optionally based on arithmetic intensity).\r\n\r\n**FIXES**\r\n* Fixed corner case which eventually led to leaking memory (scratch).\r\n* Exhausted file handles (in ulimit'ed or restricted environments).\r\n* Fixed libxsmm_timer in case of lazy library initialization.\r\n* Flawed detection of restricted environments (SELinux).\r\n* Fixed buffer handling in case of incorrect input.\r\n* Fixed setup of AVX2 code path in SpMDM.\r\n* Ensure correct prefix in pkg-config files.\r\n* Guarantee JIT'ted function (non-NULL).\r\n\r\n**Note about platform support**: an explicit compile-error (error message) is generated on platforms beside of Intel (or compatible processors) since upstreamed code was reported to produce \"compilation failure\". Beside of the introduced artificial error, any platform is supported with generic code (tested with ARM cross-compiler). Of course, any Open Source contribution to add JIT support is welcome.\r\n\r\n**Note about binary compatibility**: LIBXSMM's API for Small Matrix Multiplications (SMMs) is stable, and all major known applications (e.g., CP2K, EDGE, NEK5K, and SeisSol) either rely on SMMs or are able (and want) to benefit from an improved API of the other domains (e.g., DL). 
Until at least v2.0, binary compatibility is not maintained (SONAME version goes with the semantic version).","2019-04-29T10:43:48",{"id":246,"version":247,"summary_zh":248,"released_at":249},62761,"1.10","Development accumulated many changes since the last release (v1.9) as this version (v1.10) kept slipping because of validation was not able to keep up and started over several times. On the positive side this may allow to call it the \"Supercomputing 2018 Edition\" which is complemented by an updated [list of references](https:\u002F\u002Flibxsmm.readthedocs.io\u002Fen\u002Flatest\u002F#references) including the [SC'18 paper](https:\u002F\u002Fsc18.supercomputing.org\u002Fpresentation\u002F?id=pap322&sess=sess190) \"Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures\". Among several external [articles](https:\u002F\u002Flibxsmm.readthedocs.io\u002F#articles), the Parallel Universe Magazine [published](https:\u002F\u002Fsoftware.intel.com\u002Fen-us\u002Fdownload\u002Fparallel-universe-magazine-issue-34-october-2018) \"LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel\".\r\n\r\nThe intense development of LIBXSMM brought many improvements and detailed features across domains as well as end-to-end support for Bfloat16 in LIBXSMM's Deep Learning domain (DL). The latter can be already exercised with the [GxM framework](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Ftree\u002Fmaster\u002Fsamples\u002Fdeeplearning\u002Fgxm) which was added to the collection of sample codes. Testing and validation were updated for latest compilers and upcoming Linux distributions. FreeBSD is now formally supported (previously it was only tested occasionally). RPM-, Debian- and FreeBSD package updates will benefit from the smoothed default build-targets and compiler flags.\r\n\r\nLIBXSMM supports \"one build for all\" while exploiting the existing instructions set extensions (CPUID based code-dispatch). Developers may enjoy support for pkg-config (`.pc` files in the `lib` folder) for easier linkage when using the [Classic ABI](https:\u002F\u002Flibxsmm.readthedocs.io\u002F#classic-library-abi) (e.g., `PKG_CONFIG_PATH=\u002Fpath\u002Fto\u002Flibxsmm\u002Flib pkg-config libxsmm --libs`).\r\n\r\n> **THANK YOU FOR YOUR CONTRIBUTION** - we had several direct (and indirect) contributions, reports, and involvement from people who came across the project. 
We would like to thank you all for the effort and time you spent working on Open Source!\r\n\r\n**INTRODUCED**\r\n* Removed need to build LIBXSMM's static library in a special way for GEMM call-interception.\r\n* Moved some previously internal but generally useful code to the public interface (math etc.).\r\n* Initial support handle-based \"big\" GEMM (revamped libxsmm_?gemm_omp).\r\n* Support transposed cases in libxsmm_?gemm_omp; not perf.-competitive yet.\r\n* [Code samples](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_samples\u002F#magazine) accompanying article in the [Parallel Universe](https:\u002F\u002Flibxsmm.readthedocs.io\u002F#articles) magazine.\r\n* Fortran interface for some previously only C-exposed functions.\r\n* Support Intel C\u002F++ Compiler together with GNU Fortran.\r\n* Packed\u002FSOA domain: expanded functionality (EDGE solver).\r\n* Deep Learning framework [GxM](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Ftree\u002Fmaster\u002Fsamples\u002Fdeeplearning\u002Fgxm) (added as code sample).\r\n* RNNs, and LSTM\u002FGRU-cell (driver code experimental).\r\n* End-to-end support for Bfloat16 (DL domain).\r\n* Fused batch-norm, and fully-connected layer.\r\n* Compact\u002Fpacked TRSM kernels and interface.\r\n* Experimental TRMM code (no interface yet).\r\n* Support for pkg-config.\r\n\r\n**IMPROVEMENTS \u002F CHANGES**\r\n* Zero-mask unused register parts to avoid false positives with enabled FPEs (MM kernels).\r\n* Added libxsmm_ptrx helper to Fortran interface (works around C_LOC portability issue).\r\n* Mapped TF low-precision to appropriate types, map unknowns to DATATYPE_UNSUPPORTED.\r\n* Build banner with platform name, info about Intel VTune (available but JIT-profiling disabled).\r\n* Smoothed code base for most recent compilers (incl. improved target attribution).\r\n* Official packages for Debian, and FreeBSD (incl. OpenMP in libxsmm\u002Fext for BSD).\r\n* LIBXSMM_DUMP environment var. writes [MHD-files](https:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_aux\u002F#meta-image-file-io) if libxsmm_matdiff is called.\r\n* Warn when libxsmm_release_kernel is called for registered kernel.\r\n* Consolidated [Deep Learning sample codes](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Ftree\u002Fmaster\u002Fsamples\u002Fdeeplearning) into one folder.\r\n* Revised default for AVX=3 (MIC=0 is now implicitly set).\r\n* LIBXSMM_TARGET: more keys count for AVX512\u002FCore.\r\n* Updated TF integration\u002Fdocumentation.\r\n* Included workarounds for flang (LLVM).\r\n* Attempt to enable OpenMP with Clang.\r\n* Install header-only form (`make install`).\r\n* SpMDM code dispatch for AVX2.\r\n* Improved CI\u002Ftest infrastructure.\r\n* Show [hint](https:\u002F\u002Flibxsmm.readthedocs.io\u002F#outdated-binutils) if compilation fails.\r\n\r\n**FIXES**\r\n* Properly dispatch CRC32 instruction (support older CPUs).\r\n* Fixed fallback of statically generated MM kernels (rare).\r\n* Remove temporary files that were previously dangling.\r\n* Fixed termination message\u002Fstatistic (code registry).\r\n* Fixed finalizing the library (corner case).\r\n* Fixed code portability of DNN domain.","2018-11-12T10:52:22",{"id":251,"version":252,"summary_zh":253,"released_at":254},62762,"1.9","This release enables JIT-code generation of small matrix multiplications for SSE3 targets. Previously, only AVX and beyond has been supported using JIT code. SSE JIT-code generation is only supported for the MM domain (matrix multiplication). 
The [compatibility](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fwiki\u002FCompatibility) of the library has been further refined and fine-tuned. The application binary interface (ABI) narrowed from more than 500 exported functions down to roughly half due to adjusted symbol visibility. This revision prepares for a smooth transition to v2.0 and internalizes low-level details (descriptor handling, etc.), and two deprecated functions have been removed. More prominently, prefetch enumerators have been renamed, e.g., LIBXSMM_PREFETCH_AL2 to LIBXSMM_GEMM_PREFETCH_AL2.\r\n\r\n**INTRODUCED**\r\n* ABI specification improved: exported functions are decorated for visibility\u002Finternal use (issue #205).\r\n* Math functions to eventually avoid LIBM dep., or to control specific requirements (libxsmm_math.h).\r\n* MM: enabled JIT-generation of SSE code for small matrix multiplications (BE and FE support).\r\n* MM: extended FE to handle multiple flavors of low-precision GEMMs (C and C++).\r\n* Detect maintainer build and avoid target flags (GCC toolchain, STATIC=0).\r\n* SMM: I16I32 and I16F32 WGEMM for SKX and future processors.\r\n* Hardening all builds by default (Linux package requirements).\r\n\r\n**IMPROVEMENTS \u002F CHANGES**\r\n* MM domain: renamed prefetch enumerators; kept \"generic\" names SIGONLY, NONE, and AUTO (FE).\r\n* Build system presents final summary (similar to initial summary); also mentions VTune (if enabled).\r\n* Adjusted TF scratch allocator to adopt global rather than context's allocator (limited memory).\r\n* Combined JIT-kernel samples with respective higher level samples (xgemm, transpose).\r\n* Enabled extra (even more pedantic) warnings, and adjusted the code base accordingly.\r\n* Adjusted Fortran samples for PGI compiler (failed to deduce generic procedures).\r\n* Removed deprecated libxsmm_[create\u002Frelease]_dgemm_descriptor functions.\r\n* Included [validation](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fwiki\u002FValidation) and [compatibility](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fwiki\u002FCompatibility) information into [PDF](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fraw\u002Fmaster\u002Fdocumentation\u002Flibxsmm.pdf) (Appendix).\r\n* MinGW: automatically apply certain compiler flags (workaround).\r\n* Internalized low-level descriptor setup (opaque type definitions).\r\n* Moved LIBXSMM_DNN_INTERNAL_API into internal API.\r\n* Fixed dynamic linkage with CCE (CRAY compiler).\r\n\r\n**FIXES**\r\n* Take prefetch requests in libxsmm_xmmdispatch (similar to libxsmm_[s|d|w]mmdispatch).\r\n* SpMM: prevent generating (unsupported) SP-kernels (incorrect condition).\r\n* Fixed code-gen. bug in GEMM\u002FKNM, corrected K-check in WGEMM\u002FKNM.\r\n* MinGW: correctly parse path of library requirements (\"drive letter\").\r\n* Fixed VC projects to build DLLs if requested.","2018-03-15T13:12:40",{"id":256,"version":257,"summary_zh":258,"released_at":259},62763,"1.8.3","**Overview**: while v1.9 is in the works, this release fixes two issues, and pushes for improved (OSX w\u002F Intel Compiler) and wider OS\u002FCompiler coverage (MinGW, BSD, see [Compatibility](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fwiki\u002FCompatibility)). Among minor or exotic issues resolved in this release, the stand-alone JIT-generated matrix transposes (out-of-place) are now limited to matrix shapes such that only reasonable amounts of code are generated. 
There was also a rare synchronization issue reproduced with CP2K\u002Fsmp in LIBXSMM v1.8.1 (and likely earlier), which has been resolved since the previous release (v1.8.2).\r\n\r\n**JIT code generation\u002Fdispatch performance**: JIT-generating code (non-transposed GEMMs) is [known](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm#references) to be blazingly fast, which this release (re-)confirms with the extended [dispatch microbenchmark](http:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_samples\u002F#dispatch-microbenchmark): single-threaded code generation (uncontended) of matrix kernels with *M,N,K* := *4...64* (equally distributed random numbers) takes less than 25 µs on typical systems; non-cached code dispatch takes less than 50x longer than calling a function that does nothing, whereas cached code dispatch takes less than 15x longer than an empty function (code dispatch is roughly three orders of magnitude faster than code generation, i.e., nanoseconds vs. microseconds).\r\n\r\n**INTRODUCED**\r\n* Support for mixing C and C++ code when using header-only based LIBXSMM.\r\n* [Issue 202](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fissues\u002F202): reintroduced copy-update with LIBXSMM's install target (make).\r\n* Experimental: sketched Python support built into LIBXSMM (PYMOD=1).\r\n\r\n**IMPROVEMENTS \u002F CHANGES**\r\n* Completed revision of synchronization layer (started in v1.8.2); initial [documentation](http:\u002F\u002Flibxsmm.readthedocs.io\u002Flibxsmm_aux\u002F#thread-synchronization).\r\n* Reduced TRACE output due to self-watching (internal) initialization\u002Ftermination.\r\n* Wider OS validation incl. more exotic sets (MinGW in addition to Cygwin, BSD).\r\n* Prevent production code (non-debug) on 32-bit platforms (compilation error).\r\n* Increased test variety while staying within the same turnaround time limit.\r\n* Continued to close implementation gaps (synchronization primitives).\r\n* Sparse SOA domain received fixes\u002Fimprovements driven by [EDGE](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm#high-performance-computing-hpc).\r\n* More readable code snippets in documentation (reduced width).\r\n* Initial preparation for JIT-generating SSE code (disabled).\r\n* Improved detection of OpenBLAS library (Makefile.inc).\r\n* Updated (outdated) support for Intel Compiler (OSX).\r\n* Compliant soname under Linux and OSX.\r\n\r\n**FIXES**\r\n* Fixed selection of statically generated code targeting Skylake server (SKX).\r\n* Sparse SOA domain: resolved issues pointed out by static analysis.\r\n* Fixed support for JIT-generated matrix transpose (code size).\r\n* Fixed selecting an incorrect prefetch strategy (BGEMM).","2018-02-02T17:58:06",{"id":261,"version":262,"summary_zh":263,"released_at":264},62764,"1.8.2","This last release of the 1.8.x line (before 1.9) accumulated a large number of changes to tweak interfaces, and to generally improve usability. The documentation has been vastly improved and extended, is more structured, and is also available per [ReadtheDocs](http:\u002F\u002Flibxsmm.readthedocs.io\u002F) (with online full-text search). In preparation for a fully revised implementation of the DNN API (rewrite), the interface of the DNN domain (Tensor API) changed in an incompatible way (our [policy](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fwiki\u002FQ&A#is-libxsmm-compatible-from-version-to-version-or-what-is-the-abi-commitment) should have delayed this to v1.9). 
However, the current main user of the DNN API has been updated (integration with [TensorFlow](libxsmm.readthedocs.io\u002Ftensorflow\u002F)). Also notably, v1.8.2 introduces JIT-code generation with Windows call-convention (support limited to 4-argument kernels, i.e., no prefetch signature for the MM domain, and no support for DNN\u002Fconvolution kernels).\r\n\r\n**INTRODUCED**\r\n* Introduced kernel introspection\u002Fquery API for registered code: full GEMM descriptor, and code size.\r\n* Introduced [explicit batch interface](libxsmm.readthedocs.io\u002Flibxsmm_mm\u002F#batched-multiplication) (and an experimental auto-batch option); parallelized\u002Fsequential.\r\n* Introduced BGEMM interface for handle-based GEMM using optimized format (copy-in\u002Fout).\r\n* More comprehensive sparse support (EDGE: Extreme Scale Fused Seismic Simulations).\r\n* More comprehensive collection of DNN test cases (DeepBench, ResNet50, etc.).\r\n* Implemented CI for DNN domain, and infrastructure for validation (libxsmm_matdiff).\r\n* Support for scheduling CI\u002Ftests into a Slurm-based cluster environment (.travis.sh).\r\n* Introduced \"make INTRINSICS=0\" to allow building with outdated Binutils.\r\n* Generate preprocessor symbols for statically generated code (presence check).\r\n* Allow FORTRAN to access (static-)configuration values using preprocessor.\r\n* FORTRAN 77 support for a much wider set of functionality (MM domain).\r\n* Introduced [MHD file I\u002FO](libxsmm.readthedocs.io\u002Flibxsmm_aux\u002F#meta-image-file-io) to, e.g., aid visual inspection and validation.\r\n* Cleaned up type-definitions and FE-macros (lower precision GEMM).\r\n* More comprehensive set of prefetch strategies (SMM domain).\r\n* Extended LIBXSMM_VERBOSE=2 to show library version, etc.\r\n* Wider use of QFMA across domains (MM, SpMM, DNN).\r\n* Updated application recipe for [CP2K](libxsmm.readthedocs.io\u002Fcp2k\u002F) and [TensorFlow](libxsmm.readthedocs.io\u002Ftensorflow\u002F).\r\n* Initial Eigen related code sample (batched SMMs).\r\n* CPUID for CPUs codenamed \"Icelake\".\r\n\r\n**CHANGES**\r\n* Revised\u002Funified API attribute decoration, and cleaned up header-only header.\r\n* Removed script for regenerating documentation bits (README.sh); now only per make.\r\n* Changed matcopy kernels to have column-major semantics (similar to transpose).\r\n* Support const\u002Fnon-const GEMM prototypes that interfere with LIBXSMM's header-only form.\r\n* Slightly revised all F2K3 interfaces and based them on lower-level F77 (implicit) routines.\r\n* Incorporated\u002Fenabled new\u002Fadditional instructions in the code generator (BE).\r\n* Reshuffled properties\u002Fsizes in GEMM descriptor for future extensions.\r\n* Portable build-locks for improved turnaround time in parallel CI builds.\r\n* Comprehensive validation of DNN domain (all major benchmarks).\r\n* Consistent use of libxsmm_blasint (libxsmm_dmmdispatch).\r\n* Revised error\u002Fwarning messages (LIBXSMM_VERBOSE=1).\r\n* Initial support for some fused operations (DNN domain).\r\n* Removed support for small GEMM descriptors (BIG=0).\r\n* Removed libxsmm_timer_xtick (libxsmm_timer.h).\r\n* Improved turnaround time in Travis CI testing.\r\n* Thread-safe scratch memory allocation.\r\n* Support VS 2017 (startup script, etc.).\r\n\r\n**FIXES**\r\n* Fixed potential issue with GEMM flags being incorrectly created (GEMM wrapper).\r\n* Several fixes for improved FORTRAN interface compatibility (optional arguments, etc.).\r\n* Disabled AVX-512 code generation with Intel Compiler 
2013 (SP1 brings the req. bits).\r\n* Fixed code gen. issue with SOA sparse kernels; corrected precision of SOA sample code.\r\n* Fixed index calculation in tiled libxsmm_matcopy; updated test case accordingly.\r\n* Fixed a number of issues in several DNN code paths unveiled by better testing.\r\n* Several fixes in sparse SOA domain (unveiled by LIBXSMM's integration into PyFR).\r\n* Improved support for (legacy) Clang wrt AVX-512 code generation (intrinsics).\r\n* Ported bit-scan intrinsics abstraction to yield same result with all compilers.\r\n* Allow static code generation to target SKX and KNM (Makefile).\r\n* Fixed several code generation issues for SMMs on KNM.","2017-12-24T11:21:54",{"id":266,"version":267,"summary_zh":268,"released_at":269},62765,"1.8.1","This release brings some new features (matcopy\u002F2d-copy and tcopy based on JIT-generated code) as well as a number of bug fixes (TGEMM), improvements (KNM), and refinements (LIBXSMM_GEMM_WRAP control, etc.). Given the completed copy\u002Ftranspose support, this release prepares for complete stand-alone GEMM routines.\r\n\r\n**INTRODUCED**\r\n* Choice between tiled\u002Fsmall GEMM during call-interception (LIBXSMM_GEMM_WRAP=1|2); see the sketch of an intercepted BLAS call further below.\r\n* Introduced JIT'ted transpose kernels including tiling for larger matrices.\r\n* Transpose routines now auto-dispatch JIT-kernels incl. auto-tuned tiles.\r\n* Introduced matcopy routines similar to the transpose routines (C\u002FC++\u002FF).\r\n* LIBXSMM_DNN_CONV_OPTION_OVERWRITE for faster initial forward convolution.\r\n* Implemented\u002Fdocumented named JIT routines in TF when using VTune.\r\n* Additional statistics about MCOPY\u002FTCOPY (LIBXSMM_VERBOSE=2).\r\n* Lowered overhead of tiled\u002Fparallelized GEMM\u002FMCOPY\u002FTCOPY.\r\n* Made libxsmm_hash function available (MEM\u002FAUX module).\r\n* Initial support for lower precision (backward conv.).\r\n\r\n**CHANGES**\r\n* AVX-512 based CPUID-dispatched input\u002Foutput of Winograd transformation (forward conv.).\r\n* Adjusted build system to pick up RPM_OPT_FLAGS (RPM based Linux distributions).\r\n* Moved extensive Q&A to Wiki page and cleaned up the reference documentation.\r\n* Improved\u002Fextended Getting Started Guide for TensorFlow with LIBXSMM.\r\n* Improved general backend error propagation, and avoid duplicated messages.\r\n* Iterative subdivision of large matrix transposes (tcopy) and matcopy (mcopy).\r\n* Non-task based and (optional) task based parallelization of tcopy and mcopy.\r\n* Mentioned KNM target key (\"knm\") in reference documentation.\r\n* Improved prefetches in KNM code path of weight update.\r\n* Adjusted initialization sequence during startup.\r\n* Improved parallelization grammar.\r\n\r\n**FIXES**\r\n* Fixed pruned tile sizes and division-by-zero error in tiled GEMM.\r\n* Propagate backend errors in case of an insufficient JIT buffer.\r\n* CRC32 SW implementation issues unveiled by the CRAY Compiler.\r\n* Call parallelized transpose (C++ interface) when requested.\r\n* Fixed VTune support (named JIT code); broken in v1.8.\r\n* Fixed incorrect prefetch locations in KNM code path.\r\n* Fixed alignment condition in tcopy\u002Fmcopy code.\r\n* Fixed TF allocator integration with GCC 7.1.0.\r\n* Fixed some more warnings in sample codes.","2017-05-12T15:53:05",{"id":271,"version":272,"summary_zh":273,"released_at":274},62766,"1.8","This set of changes brings the Padding API to life and implements the necessary mechanisms to cover a wider range of cases. 
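Referring back to the v1.8.1 item above that introduces LIBXSMM_GEMM_WRAP=1|2, the sketch below shows what GEMM call-interception looks like from the application side: the source keeps a standard BLAS call, and the wrapper (enabled at link time per the library's documentation) decides whether LIBXSMM serves it. The build/link step itself is not shown, and the tiled-vs-small selection via LIBXSMM_GEMM_WRAP is taken from the note above; everything else is plain, portable C.

```c
/* Sketch of call-interception from the application's point of view: the
 * program keeps calling standard BLAS (dgemm_), and - when LIBXSMM's GEMM
 * wrapper is enabled at link time (see the library documentation) - the call
 * is served by LIBXSMM instead of the linked BLAS. Per the note above, the
 * environment variable LIBXSMM_GEMM_WRAP=1|2 then selects tiled vs. small-GEMM
 * handling. No LIBXSMM header is needed in the application for this to work. */
#include <stdio.h>

/* Fortran-style BLAS prototype (column-major). */
void dgemm_(const char* transa, const char* transb,
            const int* m, const int* n, const int* k,
            const double* alpha, const double* a, const int* lda,
            const double* b, const int* ldb,
            const double* beta, double* c, const int* ldc);

int main(void)
{
  const int m = 32, n = 32, k = 32;
  const double alpha = 1.0, beta = 0.0;
  static double a[32 * 32], b[32 * 32], c[32 * 32];
  int i;

  for (i = 0; i < (m * k); ++i) a[i] = 1.0;
  for (i = 0; i < (k * n); ++i) b[i] = 1.0;

  /* An unchanged BLAS call; whether it runs through LIBXSMM is decided
   * by how the executable was linked, not by this source code. */
  dgemm_("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);

  printf("c[0] = %f\n", c[0]);
  return 0;
}
```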
The padding support may allow running a larger variety of TensorFlow workloads using LIBXSMM. The implementation also brings Winograd-based convolutions (chosen automatically when using LIBXSMM_DNN_CONV_ALGO_AUTO). Moreover, support for the Intel Xeon Phi processor code-named \"Knights Mill\" (\"KNM\") has been added (QFMA and VNNI instructions can be executed using the [Intel SDE](https:\u002F\u002Fsoftware.intel.com\u002Fen-us\u002Farticles\u002Fintel-software-development-emulator)).\r\n\r\n**INTRODUCED**\r\n- A summary of code samples has been added ([pdf](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fraw\u002Fmaster\u002Fdocumentation\u002Fsamples.pdf)), and also a [guide](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fblob\u002Fmaster\u002Fdocumentation\u002Ftensorflow.md) (mainly for contributors) to \"Getting Started using TensorFlow with LIBXSMM\" [[PDF](https:\u002F\u002Fgithub.com\u002Fhfp\u002Flibxsmm\u002Fraw\u002Fmaster\u002Fdocumentation\u002Ftensorflow.pdf)]\r\n- Additional sparse matrix primitives (fsspmdm domain); see \"pyfr\" and \"edge\" sample code\r\n- Support for OpenMP SIMD directive on GCC (-fopenmp-simd) used in some translation units\r\n- Improved code path selection for legacy compiler versions (functions with multiple compilation targets)\r\n- DNN: Winograd based convolutions incl. threshold to automatically select (LIBXSMM_DNN_CONV_ALGO_AUTO) between LIBXSMM_DNN_CONV_ALGO_DIRECT, and LIBXSMM_DNN_CONV_ALGO_WINOGRAD\r\n- DNN: logically padded data incl. support for Winograd based implementation\r\n- DNN: support for Intel Knights Mill (KNM) instruction set extension (AVX-512)\r\n- DNN: support another custom format that blocks the minibatch dimension\r\n- SMM: support of FORTRAN 77 for manual JIT-dispatch (libxsmm_xmmdispatch, libxsmm_xmmcall)\r\n- SPMDM: narrowed scope of \"sum\" array to improve optimization on LLVM\r\n- SMM\u002FEXT\u002FOMP: introduced table of blocksizes depending on problem size; already yields improved performance for big(ger), i.e., tiled matrix multiplications (xgemm sample now includes a hyperparameter tuning script)\r\n- SMM\u002FDNN: JIT'ted matrix copy functions (already used in CNN domain); both matcopy and (upcoming) JIT'ted transpose will fully unlock performance of big(ger) GEMMs\r\n- AUX\u002FMEM: scope-oriented multi-pool scratch memory allocator with heuristic for buffers of different lifetime\r\n\r\n**CHANGES**\r\n- Removed LIBXSMM_MT and LIBXSMM_TASKS environment variables, and updated documentation\r\n- COMPATIBLE=1 setting is now automatically applied (e.g., useful with Cray Compiler)\r\n- LIBXSMM_TRYLOCK=1 now uses a single lock, and thereby reduces code duplication for the contended case; the trylock property is for user-code that can handle a NULL-pointer as a result of the code dispatch, i.e., implementing a fallback code path (BLAS)\r\n- AUX\u002FMEM: superseded libxsmm_malloc_size function with libxsmm_get_malloc_info\r\n- Revised termination message wrt scratch memory allocation (LIBXSMM_VERBOSE)\r\n- Other: updated \"spack\" (HPC package manager) to use more reasonable build options\r\n- SPMDM: improved load balance\r\n\r\n**FIXES**\r\n- Implemented FORTRAN dispatch interface (F2K) differently to get it working with CCE (Cray Compiler)\r\n- Worked around problems\u002Fcrashes due to an outdated TCMALLOC replacement of malloc\u002Ffree (CCE)\r\n- TMM: tiled MM fallback code path in multi-threaded tiled GEMM exposed an issue with LIBXSMM_TRYLOCK=1\r\n- TMM: fixed incorrect OpenMP in task-based implementation; now always 
selected when in external par. region\r\n- SPMDM: bug fix for handling the last block of k correctly and avoiding out-of-bounds accesses\r\n- Minor: fixed all flake8 complaints of our Python scripts, fixed code issues pointed out by static analysis\r\n- Fixed transpose FORTRAN sample code","2017-03-30T11:35:53",{"id":276,"version":277,"summary_zh":278,"released_at":279},62767,"1.7.1","This release finishes the memory allocation interface and documents the two memory allocation domains (default and scratch). Otherwise this release focuses on code quality (sample code) with no fixes or breaking changes when compared to version 1.7.\n\n**INTRODUCED**\n- MEM: libxsmm_release_scratch has been introduced (unimplemented)\n- MEM: libxsmm_release_scratch now called during finalization\n- MEM: documented memory allocation domains\n- DNN: updated API documentation\n\n**CHANGES**\n- More error\u002Fwarning messages promoted to LIBXSMM_VERBOSE\n\n**FIXES**\n- None\n","2017-01-27T17:15:34",{"id":281,"version":282,"summary_zh":283,"released_at":284},62768,"1.7","This version releases a revised DNN API to better suit an upcoming TensorFlow integration. There is also some foundation laid out to distinguish scratch memory from regular\u002Fdefault memory buffers.\n\n**INTRODUCED**\n- MEM: ability to change the allocation functions; two different domains: default and scratch\n- MEM: C++ scoped allocator (\"syntactical sugar\"); incl. TensorFlow-specific adapter\n- MEM: optional TBB scalable malloc in both default and scratch allocator domain\n- DNN: more general buffer and filter link\u002Fbind functionality\n- LIBXSMM_VERBOSE messages rather than debug build\n- Improved dispatch for legacy compilers\n\n**CHANGES**\n- DNN: revised API (breaking changes)\n\n**FIXES**\n- SPMDM: fixed disagreement between static\u002Fdynamic code path (on top of v1.6.6)\n- MEM: avoid CRC memory checks for header-only library (different code versions)\n","2017-01-26T18:10:35",{"id":286,"version":287,"summary_zh":288,"released_at":289},62769,"1.6.6","This is a bug-fix release with a focus on the SPMDM domain. There are also a number of code quality improvements. This is potentially the last 1.6.x release; a number of API changes are scheduled for the DNN domain (v1.7).\n\n**INTRODUCED**\n- SPMDM: promoted error messages from debug-only builds to LIBXSMM_VERBOSE mode\n- README now documents how to inspect the raw binary dumps\n\n**CHANGES**\n- Improved code quality according to a code quality checker (potential issues)\n\n**FIXES**\n- SPMDM: fixed setup of handle to correspond with CPUID-dispatched\u002Favailable code path\n- SPMDM: fixed calculating the size of the scratch buffer (single-threaded case)\n","2017-01-19T23:23:54"]
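The memory-allocation domains introduced with v1.7/v1.7.1 above (default vs. scratch) roughly translate into the usage sketch below. libxsmm_release_scratch is the function named in those notes; the allocation entry point libxsmm_aligned_scratch and the matching libxsmm_free are assumptions about the library's malloc interface of that era and may differ per version, so treat this as an illustration only.

```c
/* Minimal sketch of the two allocation domains described above: the scratch
 * domain is meant for short-lived buffers and can be drained as a whole.
 * libxsmm_release_scratch is named in the release notes; the allocation
 * entry point libxsmm_aligned_scratch (and libxsmm_free) is assumed here
 * based on the library's malloc interface and may differ per version. */
#include <libxsmm.h>
#include <string.h>

int main(void)
{
  libxsmm_init();

  /* Short-lived workspace from the scratch domain (alignment 0 = automatic). */
  void* workspace = libxsmm_aligned_scratch(1024 * 1024, 0 /*auto*/);

  if (NULL != workspace) {
    memset(workspace, 0, 1024 * 1024); /* ... use the buffer ... */
    libxsmm_free(workspace);           /* returns the buffer to its pool */
  }

  /* Drain the scratch pools entirely, e.g., between phases of an application;
   * per the v1.7.1 notes this also happens during finalization. */
  libxsmm_release_scratch();
  libxsmm_finalize();
  return 0;
}
```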