[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA-Merlin--HugeCTR":3,"tool-NVIDIA-Merlin--HugeCTR":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",158594,2,"2026-04-16T23:34:05",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":75,"owner_url":76,"languages":77,"stars":115,"forks":116,"last_commit_at":117,"license":118,"difficulty_score":119,"env_os":120,"env_gpu":121,"env_ram":122,"env_deps":123,"category_tags":131,"github_topics":132,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":138,"updated_at":139,"faqs":140,"releases":169},8200,"NVIDIA-Merlin\u002FHugeCTR","HugeCTR","HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training","HugeCTR 是 NVIDIA 推出的一款高性能 GPU 加速框架，专为点击率（CTR）预估模型的训练与推理而设计。在推荐系统中，处理海量稀疏特征和超大嵌入表往往面临计算效率低、显存占用高等挑战，HugeCTR 通过深度优化的 GPU 工作流和模型并行技术，有效解决了这些痛点，显著提升了大规模深度学习模型的训练速度与部署效率。\n\n这款工具非常适合数据科学家、机器学习工程师以及从事推荐系统研发的专业人员使用。无论您是需要快速验证算法的研究者，还是致力于生产环境落地的开发者，HugeCTR 都提供了友好的高层 Python 接口、丰富的示例代码及详尽的文档，帮助您轻松上手。\n\n其核心技术亮点包括支持多节点分布式训练、混合精度训练以节省显存，以及独特的稀疏操作套件（Sparse Operation 
Kit），能够高效管理超大规模嵌入参数。此外，HugeCTR 还支持将模型转换为通用的 ONNX 格式，便于在不同平台间部署。作为 MLPerf 等权威基准测试中的佼佼者，HugeCTR 以“快速、易用、专业”为设计理念，助力用户构建更精准的推荐系统。","# [HugeCTR](README.md)\n\n[![Version](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FNVIDIA-Merlin\u002FHugeCTR?color=orange)](release_notes.md\u002F)\n[![LICENSE](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FNVIDIA-Merlin\u002FHugeCTR)](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fmain\u002FLICENSE)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocumentation-blue.svg)](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html)\n[![SOK Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSOK%20Documentation-blue?logoColor=blue)](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fsparse_operation_kit\u002Fmaster\u002Findex.html)\n\nHugeCTR is a GPU-accelerated recommender framework designed for training and inference of large deep learning models. 
\n\nDesign Goals:\n* **Fast**: HugeCTR performs strongly in recommendation [benchmarks](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fperformance.html), including MLPerf.\n* **Easy**: Whether you are a data scientist or a machine learning practitioner, plenty of [documents](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html), [notebooks](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Fnotebooks), and [samples](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Fsamples) make HugeCTR easy to use.\n* **Domain Specific**: HugeCTR provides the [essentials](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR#core-features), so that you can efficiently deploy recommender models with very large embeddings.\n\n**NOTE**: If you have any questions about using HugeCTR, please file an issue or join our [Slack channel](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fhugectr\u002Fshared_invite\u002Fzt-2ji0b305s-SIVB~_XZYtz38JCkT8VFSg) for more interactive discussion. 
\n\n## Table of Contents\n* [Core Features](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html)\n* [Getting Started](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html#installing-and-building-hugectr)\n* [HugeCTR SDK](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html#tools)\n* [Support and Feedback](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_contributor_guide.html)\n* [Contributing to HugeCTR](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_contributor_guide.html)\n* [Additional Resources](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fadditional_resources.html)\n\n## Core Features ##\nHugeCTR supports a variety of features, including the following:\n\n* [High-Level abstracted Python interface](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fapi\u002Fpython_interface.html)\n* [Model parallel training](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#model-parallel-training)\n* [Optimized GPU workflow](performance.md)\n* [Multi-node training](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#multi-node-training)\n* [Mixed precision training](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#mixed-precision-training)\n* [HugeCTR to ONNX Converter](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#hugectr-to-onnx-converter)\n* [Sparse Operation Kit](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Fsparse_operation_kit)\n\n\nTo learn about our latest enhancements, refer to our [release notes](release_notes.md).\n\n## Getting Started ##\nIf you'd like to quickly train a model using the Python interface, do the 
following:\n\n1. Build the HugeCTR Docker image:\n   Starting with version 25.03, HugeCTR provides only the Dockerfile source, so you need to build the image yourself. To build the hugectr image, use the Dockerfile located at `tools\u002Fdockerfiles\u002FDockerfile.base` with the following command:\n   ```sh\n   docker build --build-arg RELEASE=true -t hugectr:release -f tools\u002Fdockerfiles\u002FDockerfile.base .\n   ```\n\n2. Start the container with your local host directory (`\u002Fyour\u002Fhost\u002Fdir`) mounted by running the following command:\n   ```\n   docker run --gpus=all --rm -it --cap-add SYS_NICE -v \u002Fyour\u002Fhost\u002Fdir:\u002Fyour\u002Fcontainer\u002Fdir -w \u002Fyour\u002Fcontainer\u002Fdir -u $(id -u):$(id -g) hugectr:release\n   ```\n\n   **NOTE**: The **\u002Fyour\u002Fhost\u002Fdir** directory is mounted into the container and is visible there as **\u002Fyour\u002Fcontainer\u002Fdir**. The **\u002Fyour\u002Fcontainer\u002Fdir** directory is also your starting directory.\n\n   **NOTE**: HugeCTR uses NCCL to share data between ranks, and NCCL may require shared memory for IPC and pinned (page-locked) system memory resources. It is recommended that you increase these resources by adding the following options to the `docker run` command.\n   ```text\n   --shm-size=1g --ulimit memlock=-1\n   ```\n\n3. 
Write a simple Python script to generate a synthetic dataset:\n   ```\n   # dcn_parquet_generate.py\n   import hugectr\n   from hugectr.tools import DataGeneratorParams, DataGenerator\n   data_generator_params = DataGeneratorParams(\n     format = hugectr.DataReaderType_t.Parquet,\n     label_dim = 1,\n     dense_dim = 13,\n     num_slot = 26,\n     i64_input_key = False,\n     source = \".\u002Fdcn_parquet\u002Ffile_list.txt\",\n     eval_source = \".\u002Fdcn_parquet\u002Ffile_list_test.txt\",\n     slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, \n                        20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120,\n                        1543 ],\n     dist_type = hugectr.Distribution_t.PowerLaw,\n     power_law_type = hugectr.PowerLaw_t.Short)\n   data_generator = DataGenerator(data_generator_params)\n   data_generator.generate()\n   ```\n\n3. Generate the Parquet dataset for your DCN model by running the following command:\n   ```\n   python dcn_parquet_generate.py\n   ```\n   **NOTE**: The generated dataset will reside in the folder `.\u002Fdcn_parquet`, which contains training and evaluation data.\n\n4. 
Write a simple Python script for training:\n   ```\n   # dcn_parquet_train.py\n   import hugectr\n   from mpi4py import MPI\n   solver = hugectr.CreateSolver(max_eval_batches = 1280,\n                                 batchsize_eval = 1024,\n                                 batchsize = 1024,\n                                 lr = 0.001,\n                                 vvgpu = [[0]],\n                                 repeat_dataset = True)\n   reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,\n                                    source = [\".\u002Fdcn_parquet\u002Ffile_list.txt\"],\n                                    eval_source = \".\u002Fdcn_parquet\u002Ffile_list_test.txt\",\n                                    slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, \n                                                      20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120, 1543 ])\n   optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam,\n                                       update_type = hugectr.Update_t.Global)\n   model = hugectr.Model(solver, reader, optimizer)\n   model.add(hugectr.Input(label_dim = 1, label_name = \"label\",\n                           dense_dim = 13, dense_name = \"dense\",\n                           data_reader_sparse_param_array =\n                           [hugectr.DataReaderSparseParam(\"data1\", 1, True, 26)]))\n   model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,\n                              workspace_size_per_gpu_in_mb = 75,\n                              embedding_vec_size = 16,\n                              combiner = \"sum\",\n                              sparse_embedding_name = \"sparse_embedding1\",\n                              bottom_name = \"data1\",\n                              optimizer = optimizer))\n   
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,\n                              bottom_names = [\"sparse_embedding1\"],\n                              top_names = [\"reshape1\"],\n                              leading_dim=416))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,\n                              bottom_names = [\"reshape1\", \"dense\"], top_names = [\"concat1\"]))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCross,\n                              bottom_names = [\"concat1\"],\n                              top_names = [\"multicross1\"],\n                              num_layers=6))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,\n                              bottom_names = [\"concat1\"],\n                              top_names = [\"fc1\"],\n                              num_output=1024))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,\n                              bottom_names = [\"fc1\"],\n                              top_names = [\"relu1\"]))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,\n                              bottom_names = [\"relu1\"],\n                              top_names = [\"dropout1\"],\n                              dropout_rate=0.5))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,\n                              bottom_names = [\"dropout1\", \"multicross1\"],\n                              top_names = [\"concat2\"]))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,\n                              bottom_names = [\"concat2\"],\n                              top_names = [\"fc2\"],\n                              num_output=1))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,\n                              bottom_names = [\"fc2\", \"label\"],\n                              top_names = [\"loss\"]))\n   model.compile()\n   
model.summary()\n   model.graph_to_json(graph_config_file = \"dcn.json\")\n   model.fit(max_iter = 5120, display = 200, eval_interval = 1000, snapshot = 5000, snapshot_prefix = \"dcn\")\n   ```\n   **NOTE**: Ensure that the paths to the synthetic datasets are correct with respect to this Python script. `data_reader_type`, `check_type`, `label_dim`, `dense_dim`, and\n   `data_reader_sparse_param_array` should be consistent with the generated dataset.\n\n5. Train the model by running the following command:\n   ```\n   python dcn_parquet_train.py\n   ```\n   **NOTE**: Because the dataset is randomly generated, the evaluation AUC value is not meaningful. When training finishes, files that contain the\n   dumped graph JSON, saved model weights, and optimizer states are generated.\n\nFor more information, refer to the [HugeCTR User Guide](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html).\n\n## HugeCTR SDK ##\nWe support external developers who can't use HugeCTR directly by exporting important HugeCTR components:\n* Sparse Operation Kit [directory](sparse_operation_kit) | [documentation](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fsparse_operation_kit\u002Fmaster\u002F): a Python package that wraps GPU-accelerated operations dedicated to sparse training\u002Finference cases.\n\n## Support and Feedback ##\nIf you encounter any issues or have questions, go to [https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FHugeCTR\u002Fissues](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FHugeCTR\u002Fissues) and submit an issue so that we can provide you with the necessary resolutions and answers. 
To further advance the HugeCTR Roadmap, we encourage you to share all the details regarding your recommender system pipeline using this [survey](https:\u002F\u002Fdeveloper.nvidia.com\u002Fmerlin-devzone-survey).\n\n## Contributing to HugeCTR ##\nWith HugeCTR being an open source project, we welcome contributions from the general public. With your contributions, we can continue to improve HugeCTR's quality and performance. To learn how to contribute, refer to our [HugeCTR Contributor Guide](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_contributor_guide.html).\n\n## Additional Resources ##\n|Webpages|\n|--------|\n|[NVIDIA Merlin](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnvidia-merlin)|\n|[NVIDIA HugeCTR](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnvidia-merlin\u002Fhugectr)|\n\n### Publications  ###\n\n*Shijie Liu, Nan Zheng, Hui Kang, Xavier Simmons, Junjie Zhang, Matthias Langer, Wenjing Zhu, Minseok Lee, and Zehuan Wang*. \"[Embedding Optimization for Training Large-scale Deep Learning Recommendation Systems with EMBark](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3640457.3688111).\" In Proceedings of the 18th ACM Conference on Recommender Systems, pp. 622-632. 2024.\n\n*Yingcan Wei, Matthias Langer, Fan Yu, Minseok Lee, Jie Liu, Ji Shi and Zehuan Wang*, \"[A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3523227.3546765),\" Proceedings of the 16th ACM Conference on Recommender Systems, pp. 408-419, 2022.\n\n*Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G. Abel, Xu Guo, Jianbing Dong, Ji Shi and Kunlun Li*, \"[Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3523227.3547405),\" Proceedings of the 16th ACM Conference on Recommender Systems, pp.  
534-537, 2022.\n\n### Talks ###\n|Conference \u002F Website|Title|Date|Speaker|Language|\n|--------------------|-----|----|-------|--------|\n|ACM RecSys 2022|[A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models](https:\u002F\u002Fvimeo.com\u002F752339625\u002F6ecec7fa70)|September 2022|Matthias Langer|English|\n|Short Videos Episode 1|[Merlin HugeCTR：GPU 加速的推荐系统框架](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1jT411E7VJ\u002F)|May 2022|Joey Wang|中文|\n|Short Videos Episode 2|[HugeCTR 分级参数服务器如何加速推理](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1PW4y127UA\u002F)|May 2022|Joey Wang|中文|\n|Short Videos Episode 3|[使用 HugeCTR SOK 加速 TensorFlow 训练](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1mG411n7XH\u002F)|May 2022|Gems Guo|中文|\n|GTC Spring 2022|[Merlin HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring22-s41126\u002F)|March 2022|Matthias Langer, Yingcan Wei, Yu Fan|English|\n|APSARA 2021|[GPU 推荐系统 Merlin](https:\u002F\u002Fyunqi.aliyun.com\u002F2021\u002Fagenda\u002Fsession205?spm=5176.23948577a2c4e.J_6988780170.27.5ae7379893BcVp)|Oct 2021|Joey Wang|中文|\n|GTC Spring 2021|[Learn how Tencent Deployed an Advertising System on the Merlin GPU Recommender Framework](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31820\u002F)|April 2021|Xiangting Kong, Joey Wang|English|\n|GTC Spring 2021|[Merlin HugeCTR: Deep Dive Into Performance Optimization](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31269\u002F)|April 2021|Minseok Lee|English|\n|GTC Spring 2021|[Integrate HugeCTR Embedding with TensorFlow](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31425\u002F)|April 2021|Jianbing Dong|English|\n|GTC China 2020|[MERLIN HUGECTR 
：深入研究性能优化](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20516\u002F)|Oct 2020|Minseok Lee|English|\n|GTC China 2020|[性能提升 7 倍 + 的高性能 GPU 广告推荐加速系统的落地实现](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20483\u002F)|Oct 2020|Xiangting Kong|中文|\n|GTC China 2020|[使用 GPU EMBEDDING CACHE 加速 CTR 推理过程](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20626\u002F)|Oct 2020|Fan Yu|中文|\n|GTC China 2020|[将 HUGECTR EMBEDDING 集成于 TENSORFLOW](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20377\u002F)|Oct 2020|Jianbing Dong|中文|\n|GTC Spring 2020|[HugeCTR: High-Performance Click-Through Rate Estimation Training](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcsj20-s21455\u002F)|March 2020|Minseok Lee, Joey Wang|English|\n|GTC China 2019|[HUGECTR: GPU 加速的推荐系统训练](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtcchina2019-cn9794\u002F)|Oct 2019|Joey Wang|中文|\n\n### Blogs ###\n|Conference \u002F Website|Title|Date|Authors|Language|\n|--------------------|-----|----|-------|--------|\n|NVIDIA Devblog|[Boost Large-Scale Recommendation System Training Embedding Using EMBark](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fboost-large-scale-recommendation-system-training-embedding-using-embark\u002F)|Nov. 2024|Shijie Liu|English|\n|Wechat Blog|[RecSys'24：使用 EMBark 进行大规模推荐系统训练 Embedding 加速](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FqpIoVSnePgYZd2X1BSoVyA)|Nov. 2024|Shijie Liu|中文|\n|Wechat Blog|[利用 NVIDIA Merlin HierarchicalKV 实现唯品会在搜推广场景中的 GPU 推理实践](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F02032v2bORzcKsNCPEVwrA)|Apr. 2024|Haidong Rong, Zehuan Wang|中文|\n|Wechat Blog|[NVIDIA Merlin 助力陌陌推荐业务实现高性能训练优化](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F6bTOIiG9FI0XjvuIuTT5mw)|Nov. 
2023|Hui Kang|中文|\n|Wechat Blog|[Merlin HugeCTR 分级参数服务器系列之三：集成到TensorFlow](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FsFmJXZ53Qj4J7iGkzGvQbw)|Nov. 2022|Kingsley Liu|中文|\n|NVIDIA Devblog|[Scaling Recommendation System Inference with Merlin Hierarchical Parameter Server\u002F使用 Merlin 分层参数服务器扩展推荐系统推理](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Fscaling-recommendation-system-inference-with-merlin-hierarchical-parameter-server\u002F)|August 2022|Shashank Verma, Wenwen Gao, Yingcan Wei, Matthias Langer, Jerry Shi, Fan Yu, Kingsley Liu, Minseok Lee|English\u002F中文|\n|NVIDIA Devblog|[Merlin HugeCTR Sparse Operation Kit 系列之二](https:\u002F\u002Fdeveloper.nvidia.cn\u002Fzh-cn\u002Fblog\u002Fmerlin-hugectr-sparse-operation-kit-series-2\u002F)|June 2022|Kunlun Li|中文|\n|NVIDIA Devblog|[Merlin HugeCTR Sparse Operation Kit 系列之一](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Fmerlin-hugectr-sparse-operation-kit-part-1\u002F)|March 2022|Gems Guo, Jianbing Dong|中文|\n|Wechat Blog|[Merlin HugeCTR 分级参数服务器系列之二](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002Fz-K3UNg6-ysrfKe3C6McZg)|March 2022|Yingcan Wei, Matthias Langer, Jerry Shi|中文|\n|Wechat Blog|[Merlin HugeCTR 分级参数服务器系列之一](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F5_AKe6f_nJjddCLZU28P2A)|Jan. 
2022|Yingcan Wei, Jerry Shi|中文|\n|NVIDIA Devblog|[Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Faccelerating-embedding-with-the-hugectr-tensorflow-embedding-plugin\u002F)|Sept 2021|Vinh Nguyen, Ann Spencer, Joey Wang and Jianbing Dong|English|\n|medium.com|[Optimizing Meituan’s Machine Learning Platform: An Interview with Jun Huang](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Foptimizing-meituans-machine-learning-platform-an-interview-with-jun-huang-7e046143131f)|Sept 2021|Sheng Luo and Benedikt Schifferer|English|\n|medium.com|[Leading Design and Development of the Advertising Recommender System at Tencent: An Interview with Xiangting Kong](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Fleading-design-and-development-of-the-advertising-recommender-system-at-tencent-an-interview-with-37f1eed898a7)|Sept 2021|Xiangting Kong, Ann Spencer|English|\n|NVIDIA Devblog|[扩展和加速大型深度学习推荐系统 – HugeCTR 系列第 1 部分](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Fscaling-and-accelerating-large-deep-learning-recommender-systems-hugectr-series-part-1\u002F)|June 2021|Minseok Lee|中文|\n|NVIDIA Devblog|[使用 Merlin HugeCTR 的 Python API 训练大型深度学习推荐模型 – HugeCTR 系列第 2 部分](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Ftraining-large-deep-learning-recommender-models-with-merlin-hugectrs-python-apis-hugectr-series-part2\u002F)|June 2021|Vinh Nguyen|中文|\n|medium.com|[Training large Deep Learning Recommender Models with Merlin HugeCTR’s Python APIs — HugeCTR Series Part 2](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Ftraining-large-deep-learning-recommender-models-with-merlin-hugectrs-python-apis-hugectr-series-69a666e0bdb7)|May 2021|Minseok Lee, Joey Wang, Vinh Nguyen and Ashish Sardana|English|\n|medium.com|[Scaling and Accelerating large Deep Learning Recommender Systems — HugeCTR Series Part 
1](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Fscaling-and-accelerating-large-deep-learning-recommender-systems-hugectr-series-part-1-c19577acfe9d)|May 2021|Minseok Lee|English|\n|IRS 2020|[Merlin: A GPU Accelerated Recommendation Framework](https:\u002F\u002Firsworkshop.github.io\u002F2020\u002Fpublications\u002Fpaper_21_Oldridge_Merlin.pdf)|Aug 2020|Even Oldridge et al.|English|\n|NVIDIA Devblog|[Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fintroducing-merlin-hugectr-training-framework-dedicated-to-recommender-systems\u002F)|July 2020|Minseok Lee and Joey Wang|English|\n\n## Deprecation Note\n\n- HugeCTR Hierarchical Parameter Server (HPS) \n- Embedding Cache\n\nThe components above have been deprecated since v25.03. Refer to a prior version if you need these features.\n","# [HugeCTR](README.md)\n\n[![版本](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FNVIDIA-Merlin\u002FHugeCTR?color=orange)](release_notes.md\u002F)\n[![许可证](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FNVIDIA-Merlin\u002FHugeCTR)](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fmain\u002FLICENSE)\n[![文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocumentation-blue.svg)](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html)\n[![SOK 文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSOK%20Documentation-blue?logoColor=blue)](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fsparse_operation_kit\u002Fmaster\u002Findex.html)\n\nHugeCTR 是一个 GPU 加速的推荐系统框架，专为大规模深度学习模型的训练和推理而设计。\n\n设计目标：\n* **快速**：HugeCTR 在包括 MLPerf 在内的推荐基准测试中表现出色。\n* 
**易用**：无论您是数据科学家还是机器学习从业者，我们都通过丰富的[文档](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html)、[笔记本](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Fnotebooks)和[示例](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Fsamples)使 HugeCTR 易于使用。\n* **领域专用**：HugeCTR 提供了[核心功能](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR#core-features)，使您能够高效地部署具有超大规模嵌入的推荐模型。\n\n**注意**：如果您在使用 HugeCTR 时有任何问题，请提交一个问题或加入我们的[Slack 频道](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fhugectr\u002Fshared_invite\u002Fzt-2ji0b305s-SIVB~_XZYtz38JCkT8VFSg)，以便进行更深入的交流。\n\n## 目录\n* [核心功能](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html)\n* [入门指南](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html#installing-and-building-hugectr)\n* [HugeCTR SDK](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html#tools)\n* [支持与反馈](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_contributor_guide.html)\n* [贡献 HugeCTR](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_contributor_guide.html)\n* [其他资源](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fadditional_resources.html)\n\n## 核心功能 ##\nHugeCTR 支持多种功能，包括以下内容：\n\n* [高级抽象的 Python 接口](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fapi\u002Fpython_interface.html)\n* [模型并行训练](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#model-parallel-training)\n* [优化的 GPU 工作流](performance.md)\n* [多节点训练](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#multi-node-training)\n* [混合精度训练](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#mixed-precision-training)\n* [HugeCTR 到 ONNX 
转换器](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_core_features.html#hugectr-to-onnx-converter)\n* [稀疏运算工具包](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Fsparse_operation_kit)\n\n\n要了解我们最新的改进，请参阅我们的[发布说明](release_notes.md)。\n\n## 入门指南 ##\n如果您想使用 Python 接口快速训练一个模型，请按照以下步骤操作：\n\n1. 构建 HugeCTR Docker 镜像：\n   从 25.03 版本开始，HugeCTR 仅提供 Dockerfile 源文件，用户需要自行构建镜像。要构建 hugectr 镜像，请使用位于 `tools\u002Fdockerfiles\u002FDockerfile.base` 的 Dockerfile，并运行以下命令：\n   ```sh\n   docker build --build-arg RELEASE=true -t hugectr:release -f tools\u002Fdockerfiles\u002FDockerfile.base .\n   ```\n\n2. 使用本地主机目录（挂载 `\u002Fyour\u002Fhost\u002Fdir`）启动容器，运行以下命令：\n   ```\n   docker run --gpus=all --rm -it --cap-add SYS_NICE -v \u002Fyour\u002Fhost\u002Fdir:\u002Fyour\u002Fcontainer\u002Fdir -w \u002Fyour\u002Fcontainer\u002Fdir -it -u $(id -u):$(id -g) hugectr:release\n   ```\n\n   **注意**：**\u002Fyour\u002Fhost\u002Fdir** 目录与 **\u002Fyour\u002Fcontainer\u002Fdir** 目录同样可见。**\u002Fyour\u002Fhost\u002Fdir** 目录也是您的起始目录。\n\n   **注意**：HugeCTR 使用 NCCL 在不同进程间共享数据，NCCL 可能需要共享内存用于 IPC 和固定（页面锁定）的系统内存资源。建议您在 `docker run` 命令中添加以下选项以增加这些资源：\n   ```text\n   -shm-size=1g -ulimit memlock=-1\n   ```\n\n3. 
编写一个简单的 Python 脚本来生成合成数据集：\n   ```\n   # dcn_parquet_generate.py\n   import hugectr\n   from hugectr.tools import DataGeneratorParams, DataGenerator\n   data_generator_params = DataGeneratorParams(\n     format = hugectr.DataReaderType_t.Parquet,\n     label_dim = 1,\n     dense_dim = 13,\n     num_slot = 26,\n     i64_input_key = False,\n     source = \".\u002Fdcn_parquet\u002Ffile_list.txt\",\n     eval_source = \".\u002Fdcn_parquet\u002Ffile_list_test.txt\",\n     slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, \n                        20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120,\n                        1543 ],\n     dist_type = hugectr.Distribution_t.PowerLaw,\n     power_law_type = hugectr.PowerLaw_t.Short)\n   data_generator = DataGenerator(data_generator_params)\n   data_generator.generate()\n   ```\n\n3. 运行以下命令为您的 DCN 模型生成 Parquet 数据集：\n   ```\n   python dcn_parquet_generate.py\n   ```\n   **注意**：生成的数据集将存放在 `.\u002Fdcn_parquet` 文件夹中，其中包含训练和评估数据。\n\n4. 
编写一个用于训练的简单 Python 脚本：\n   ```\n   # dcn_parquet_train.py\n   import hugectr\n   from mpi4py import MPI\n   solver = hugectr.CreateSolver(max_eval_batches = 1280,\n                                 batchsize_eval = 1024,\n                                 batchsize = 1024,\n                                 lr = 0.001,\n                                 vvgpu = [[0]],\n                                 repeat_dataset = True)\n   reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,\n                                    source = [\".\u002Fdcn_parquet\u002Ffile_list.txt\"],\n                                    eval_source = \".\u002Fdcn_parquet\u002Ffile_list_test.txt\",\n                                    slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, \n                                                      20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120, 1543 ])\n   optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam,\n                                       update_type = hugectr.Update_t.Global)\n   model = hugectr.Model(solver, reader, optimizer)\n   model.add(hugectr.Input(label_dim = 1, label_name = \"label\",\n                           dense_dim = 13, dense_name = \"dense\",\n                           data_reader_sparse_param_array =\n                           [hugectr.DataReaderSparseParam(\"data1\", 1, True, 26)]))\n   model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,\n                              workspace_size_per_gpu_in_mb = 75,\n                              embedding_vec_size = 16,\n                              combiner = \"sum\",\n                              sparse_embedding_name = \"sparse_embedding1\",\n                              bottom_name = \"data1\",\n                              optimizer = optimizer))\n   model.add(hugectr.DenseLayer(layer_type = 
hugectr.Layer_t.Reshape,\n                              bottom_names = [\"sparse_embedding1\"],\n                              top_names = [\"reshape1\"],\n                              leading_dim=416))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,\n                              bottom_names = [\"reshape1\", \"dense\"], top_names = [\"concat1\"]))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCross,\n                              bottom_names = [\"concat1\"],\n                              top_names = [\"multicross1\"],\n                              num_layers=6))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,\n                              bottom_names = [\"concat1\"],\n                              top_names = [\"fc1\"],\n                              num_output=1024))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,\n                              bottom_names = [\"fc1\"],\n                              top_names = [\"relu1\"]))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,\n                              bottom_names = [\"relu1\"],\n                              top_names = [\"dropout1\"],\n                              dropout_rate=0.5))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,\n                              bottom_names = [\"dropout1\", \"multicross1\"],\n                              top_names = [\"concat2\"]))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,\n                              bottom_names = [\"concat2\"],\n                              top_names = [\"fc2\"],\n                              num_output=1))\n   model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,\n                              bottom_names = [\"fc2\", \"label\"],\n                              top_names = [\"loss\"]))\n   model.compile()\n   model.summary()\n   
model.graph_to_json(graph_config_file = \"dcn.json\")\n   model.fit(max_iter = 5120, display = 200, eval_interval = 1000, snapshot = 5000, snapshot_prefix = \"dcn\")\n   ```\n   **注意**：请确保与该 Python 脚本相对应的合成数据集路径正确。`data_reader_type`、`check_type`、`label_dim`、`dense_dim` 和 `data_reader_sparse_param_array` 应与生成的数据集保持一致。\n\n5. 通过运行以下命令来训练模型：\n   ```\n   python dcn_parquet_train.py\n   ```\n   **注意**：由于使用的是随机生成的数据集，因此评估 AUC 值可能不准确。训练完成后，将生成导出的图 JSON 文件、模型权重文件以及优化器状态文件。\n\n如需更多信息，请参阅 [HugeCTR 用户指南](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_user_guide.html)。\n\n\n\n## HugeCTR SDK ##\n对于无法直接使用 HugeCTR 的外部开发者，我们将重要的 HugeCTR 组件导出为可独立使用的库：\n* 稀疏运算工具包 [目录](sparse_operation_kit) | [文档](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fsparse_operation_kit\u002Fmaster\u002F)：一个封装了 GPU 加速运算的 Python 包，专为稀疏训练\u002F推理场景设计。\n\n## 支持与反馈 ##\n如果您遇到任何问题或有疑问，请访问 [https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FHugeCTR\u002Fissues](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FHugeCTR\u002Fissues) 提交问题，我们将为您提供必要的解决方案和解答。为了进一步推进 HugeCTR 发展路线图，我们鼓励您通过此[调查问卷](https:\u002F\u002Fdeveloper.nvidia.com\u002Fmerlin-devzone-survey)分享您的推荐系统流水线的详细信息。\n\n## 参与 HugeCTR 开发 ##\n作为开源项目，HugeCTR 欢迎公众贡献。您的参与将有助于我们持续提升 HugeCTR 的质量和性能。有关如何贡献的信息，请参阅我们的 [HugeCTR 贡献者指南](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_contributor_guide.html)。\n\n## 其他资源 ##\n|网页|\n|--------|\n|[NVIDIA Merlin](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnvidia-merlin)|\n|[NVIDIA HugeCTR](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnvidia-merlin\u002Fhugectr)|\n\n### 出版物 ###\n\n*刘世杰、郑楠、康辉、泽维尔·西蒙斯、张俊杰、马蒂亚斯·兰格、朱文静、李敏锡和王哲寰*。\"[EMBark：用于训练大规模深度学习推荐系统的嵌入优化](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3640457.3688111)。\" 
载于第18届ACM推荐系统会议论文集，第622–632页。2024年。\n\n*魏英灿、马蒂亚斯·兰格、于帆、李敏锡、刘杰、史济和王哲寰*，\"[面向大规模深度推荐模型的GPU专用推理参数服务器](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3523227.3546765)\"，载于第16届ACM推荐系统会议论文集，第408–419页，2022年。\n\n*王哲寰、魏英灿、李敏锡、马蒂亚斯·兰格、于帆、刘杰、刘世杰、丹尼尔·G·阿贝尔、郭旭、董建兵、史济和李昆仑*，\"[Merlin HugeCTR：GPU加速的推荐系统训练与推理](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3523227.3547405)\"，载于第16届ACM推荐系统会议论文集，第534–537页，2022年。\n\n### 报告 ###\n|会议 \u002F 网站|标题|日期|演讲者|语言|\n|--------------------|-----|----|-------|--------|\n|ACM RecSys 2022|[面向大规模深度推荐模型的GPU专用推理参数服务器](https:\u002F\u002Fvimeo.com\u002F752339625\u002F6ecec7fa70)|2022年9月|马蒂亚斯·兰格|英语|\n|短视频第一集|[Merlin HugeCTR：GPU加速的推荐系统框架](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1jT411E7VJ\u002F)|2022年5月|Joey Wang|中文|\n|短视频第二集|[HugeCTR分级参数服务器如何加速推理](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1PW4y127UA\u002F)|2022年5月|Joey Wang|中文|\n|短视频第三集|[使用HugeCTR SOK加速TensorFlow训练](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1mG411n7XH\u002F)|2022年5月|Gems Guo|中文|\n|GTC Spring 2022|[Merlin HugeCTR：基于GPU嵌入缓存的分布式分层推理参数服务器](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring22-s41126\u002F)|2022年3月|马蒂亚斯·兰格、魏英灿、于帆|英语|\n|APSARA 2021|[GPU推荐系统Merlin](https:\u002F\u002Fyunqi.aliyun.com\u002F2021\u002Fagenda\u002Fsession205?spm=5176.23948577a2c4e.J_6988780170.27.5ae7379893BcVp)|2021年10月|Joey Wang|中文|\n|GTC Spring 2021|[了解腾讯如何在Merlin GPU推荐框架上部署广告系统](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31820\u002F)|2021年4月|孔祥婷、Joey Wang|英语|\n|GTC Spring 2021|[Merlin HugeCTR：深入性能优化](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31269\u002F)|2021年4月|李敏锡|英语|\n|GTC Spring 2021|[将HugeCTR嵌入集成到TensorFlow中](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring21-s31425\u002F)|2021年4月|董建兵|英语|\n|GTC China 2020|[MERLIN 
HUGECTR：深入研究性能优化](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20516\u002F)|2020年10月|李敏锡|英语|\n|GTC China 2020|[高性能GPU广告推荐加速系统的落地实现：性能提升7倍以上](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20483\u002F)|2020年10月|孔祥婷|中文|\n|GTC China 2020|[利用GPU嵌入缓存加速CTR推理过程](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20626\u002F)|2020年10月|于帆|中文|\n|GTC China 2020|[将HUGECTR嵌入集成到TensorFlow中](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtccn2020-cns20377\u002F)|2020年10月|董建兵|中文|\n|GTC Spring 2020|[HugeCTR：高性能点击率预估训练](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcsj20-s21455\u002F)|2020年3月|李敏锡、Joey Wang|英语|\n|GTC China 2019|[HUGECTR：GPU加速的推荐系统训练](https:\u002F\u002Fwww.nvidia.cn\u002Fon-demand\u002Fsession\u002Fgtcchina2019-cn9794\u002F)|2019年10月|Joey Wang|中文|\n\n### 博客 ###\n|会议 \u002F 网站|标题|日期|作者|语言|\n|--------------------|-----|----|-------|--------|\n|NVIDIA 开发者博客|[使用 EMBark 加速大规模推荐系统训练中的 Embedding](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fboost-large-scale-recommendation-system-training-embedding-using-embark\u002F)|2024年11月|刘世杰|英语|\n|微信公众号|[RecSys'24：使用 EMBark 进行大规模推荐系统训练 Embedding 加速](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FqpIoVSnePgYZd2X1BSoVyA)|2024年11月|刘世杰|中文|\n|微信公众号|[利用 NVIDIA Merlin HierarchicalKV 实现唯品会在搜推广场景中的 GPU 推理实践](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F02032v2bORzcKsNCPEVwrA)|2024年4月|荣海东、王泽寰|中文|\n|微信公众号|[NVIDIA Merlin 助力陌陌推荐业务实现高性能训练优化](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F6bTOIiG9FI0XjvuIuTT5mw)|2023年11月|康辉|中文|\n|微信公众号|[Merlin HugeCTR 分级参数服务器系列之三：集成到TensorFlow](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FsFmJXZ53Qj4J7iGkzGvQbw)|2022年11月|刘金斯利|中文|\n|NVIDIA 开发者博客|[使用 Merlin 分层参数服务器扩展推荐系统推理\u002FScaling Recommendation System Inference with Merlin Hierarchical Parameter 
Server](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Fscaling-recommendation-system-inference-with-merlin-hierarchical-parameter-server\u002F)|2022年8月|Shashank Verma、高文文、魏英灿、Matthias Langer、Jerry Shi、Yu Fan、刘金斯利、Minseok Lee|英语\u002F中文|\n|NVIDIA 开发者博客|[Merlin HugeCTR Sparse Operation Kit 系列之二](https:\u002F\u002Fdeveloper.nvidia.cn\u002Fzh-cn\u002Fblog\u002Fmerlin-hugectr-sparse-operation-kit-series-2\u002F)|2022年6月|李昆仑|中文|\n|NVIDIA 开发者博客|[Merlin HugeCTR Sparse Operation Kit 系列之一](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Fmerlin-hugectr-sparse-operation-kit-part-1\u002F)|2022年3月|郭宝石、董建兵|中文|\n|微信公众号|[Merlin HugeCTR 分级参数服务器系列之二](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002Fz-K3UNg6-ysrfKe3C6McZg)|2022年3月|魏英灿、Matthias Langer、Jerry Shi|中文|\n|微信公众号|[Merlin HugeCTR 分级参数服务器系列之一](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002F5_AKe6f_nJjddCLZU28P2A)|2022年1月|魏英灿、Jerry Shi|中文|\n|NVIDIA 开发者博客|[使用 HugeCTR TensorFlow Embedding Plugin 加速 Embedding](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Faccelerating-embedding-with-the-hugectr-tensorflow-embedding-plugin\u002F)|2021年9月|Vinh Nguyen、Ann Spencer、Joey Wang 和董建兵|英语|\n|medium.com|[优化美团的机器学习平台：专访黄俊](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Foptimizing-meituans-machine-learning-platform-an-interview-with-jun-huang-7e046143131f)|2021年9月|罗晟、Benedikt Schifferer|英语|\n|medium.com|[引领腾讯广告推荐系统的设计与开发：专访孔祥亭](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Fleading-design-and-development-of-the-advertising-recommender-system-at-tencent-an-interview-with-37f1eed898a7)|2021年9月|孔祥亭、Ann Spencer|英语|\n|NVIDIA 开发者博客|[扩展和加速大型深度学习推荐系统 – HugeCTR 系列第 1 部分](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Fscaling-and-accelerating-large-deep-learning-recommender-systems-hugectr-series-part-1\u002F)|2021年6月|Minseok Lee|中文|\n|NVIDIA 开发者博客|[使用 Merlin HugeCTR 的 Python API 训练大型深度学习推荐模型 – HugeCTR 系列第 2 
部分](https:\u002F\u002Fdeveloper.nvidia.com\u002Fzh-cn\u002Fblog\u002Ftraining-large-deep-learning-recommender-models-with-merlin-hugectrs-python-apis-hugectr-series-part2\u002F)|2021年6月|Vinh Nguyen|中文|\n|medium.com|[使用 Merlin HugeCTR 的 Python APIs 训练大型深度学习推荐模型 — HugeCTR 系列第 2 部分](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Ftraining-large-deep-learning-recommender-models-with-merlin-hugectrs-python-apis-hugectr-series-69a666e0bdb7)|2021年5月|Minseok Lee、Joey Wang、Vinh Nguyen 和 Ashish Sardana|英语|\n|medium.com|[扩展和加速大型深度学习推荐系统 — HugeCTR 系列第 1 部分](https:\u002F\u002Fmedium.com\u002Fnvidia-merlin\u002Fscaling-and-accelerating-large-deep-learning-recommender-systems-hugectr-series-part-1-c19577acfe9d)|2021年5月|Minseok Lee|英语|\n|IRS 2020|[Merlin：一个 GPU 加速的推荐框架](https:\u002F\u002Firsworkshop.github.io\u002F2020\u002Fpublications\u002Fpaper_21_Oldridge_Merlin.pdf)|2020年8月|Even Oldridge 等|英语|\n|NVIDIA 开发者博客|[推出 NVIDIA Merlin HugeCTR：专为推荐系统设计的训练框架](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fintroducing-merlin-hugectr-training-framework-dedicated-to-recommender-systems\u002F)|2020年7月|Minseok Lee 和 Joey Wang|英语|\n\n## 废弃说明\n\n- HugeCTR 分级参数服务器 (HPS)\n- Embedding Cache\n\n以上组件自 v25.03 版本起已被废弃。如需使用这些功能，请参考旧版本。","# HugeCTR 快速上手指南\n\nHugeCTR 是 NVIDIA 推出的基于 GPU 加速的推荐系统框架，专为训练和推理大规模深度学习模型而设计。它具备高性能、易用性和领域专用特性，支持模型并行、多节点训练及混合精度训练。\n\n## 环境准备\n\n### 系统要求\n*   **操作系统**: Linux (推荐 Ubuntu 18.04\u002F20.04\u002F22.04)\n*   **GPU**: 支持 CUDA 的 NVIDIA GPU (建议使用 Volta 架构或更高版本)\n*   **Docker**: 已安装 Docker Engine 和 NVIDIA Container Toolkit (`nvidia-docker`)\n\n### 前置依赖\n确保您的系统已正确配置 NVIDIA 容器运行时，以便 Docker 容器能够访问 GPU 资源。\n```bash\n# 验证 nvidia-docker 是否正常工作\ndocker run --rm --gpus all nvidia\u002Fcuda:11.0-base nvidia-smi\n```\n\n## 安装步骤\n\n从版本 25.03 开始，HugeCTR 仅提供 Dockerfile 源码，用户需自行构建镜像。\n\n1.  **克隆仓库** (如果尚未克隆):\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR.git\n    cd HugeCTR\n    ```\n\n2.  
**构建 Docker 镜像**:\n    使用提供的 `Dockerfile.base` 构建发布版镜像：\n    ```sh\n    docker build --build-arg RELEASE=true -t hugectr:release -f tools\u002Fdockerfiles\u002FDockerfile.base .\n    ```\n    > **提示**: 国内用户若遇到拉取基础镜像缓慢的问题，建议配置 Docker 镜像加速器（如阿里云、腾讯云等）后再执行构建。\n\n3.  **启动容器**:\n    运行容器并挂载本地目录，同时增大共享内存并解除锁定内存限制，以满足 NCCL 通信的需求：\n    ```bash\n    docker run --gpus=all --rm -it --cap-add SYS_NICE \\\n      --shm-size=1g --ulimit memlock=-1 \\\n      -v \u002Fyour\u002Fhost\u002Fdir:\u002Fyour\u002Fcontainer\u002Fdir \\\n      -w \u002Fyour\u002Fcontainer\u002Fdir \\\n      -u $(id -u):$(id -g) \\\n      hugectr:release\n    ```\n    *请将 `\u002Fyour\u002Fhost\u002Fdir` 替换为您本地的实际工作目录路径。*\n\n## 基本使用\n\n以下示例演示如何使用 Python 接口生成合成数据并训练一个简单的 DCN (Deep & Cross Network) 模型。\n\n### 1. 生成合成数据集\n\n创建文件 `dcn_parquet_generate.py`：\n\n```python\n# dcn_parquet_generate.py\nimport hugectr\nfrom hugectr.tools import DataGeneratorParams, DataGenerator\n\ndata_generator_params = DataGeneratorParams(\n  format = hugectr.DataReaderType_t.Parquet,\n  label_dim = 1,\n  dense_dim = 13,\n  num_slot = 26,\n  i64_input_key = False,\n  source = \".\u002Fdcn_parquet\u002Ffile_list.txt\",\n  eval_source = \".\u002Fdcn_parquet\u002Ffile_list_test.txt\",\n  slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, \n                     20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120,\n                     1543 ],\n  dist_type = hugectr.Distribution_t.PowerLaw,\n  power_law_type = hugectr.PowerLaw_t.Short)\n\ndata_generator = DataGenerator(data_generator_params)\ndata_generator.generate()\n```\n\n执行脚本生成数据：\n```bash\npython dcn_parquet_generate.py\n```\n*生成的数据将保存在 `.\u002Fdcn_parquet` 目录下。*\n\n### 2. 
定义并训练模型\n\n创建文件 `dcn_parquet_train.py`：\n\n```python\n# dcn_parquet_train.py\nimport hugectr\nfrom mpi4py import MPI\n\n# 初始化求解器\nsolver = hugectr.CreateSolver(max_eval_batches = 1280,\n                              batchsize_eval = 1024,\n                              batchsize = 1024,\n                              lr = 0.001,\n                              vvgpu = [[0]],\n                              repeat_dataset = True)\n\n# 配置数据读取器\nreader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,\n                                 source = [\".\u002Fdcn_parquet\u002Ffile_list.txt\"],\n                                 eval_source = \".\u002Fdcn_parquet\u002Ffile_list_test.txt\",\n                                 slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, \n                                                   20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120, 1543 ])\n\n# 配置优化器\noptimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam,\n                                    update_type = hugectr.Update_t.Global)\n\n# 构建模型\nmodel = hugectr.Model(solver, reader, optimizer)\n\n# 添加输入层\nmodel.add(hugectr.Input(label_dim = 1, label_name = \"label\",\n                        dense_dim = 13, dense_name = \"dense\",\n                        data_reader_sparse_param_array =\n                        [hugectr.DataReaderSparseParam(\"data1\", 1, True, 26)]))\n\n# 添加稀疏嵌入层\nmodel.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,\n                           workspace_size_per_gpu_in_mb = 75,\n                           embedding_vec_size = 16,\n                           combiner = \"sum\",\n                           sparse_embedding_name = \"sparse_embedding1\",\n                           bottom_name = \"data1\",\n                           optimizer = optimizer))\n\n# 添加密集连接层 (DCN 
结构)\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,\n                           bottom_names = [\"sparse_embedding1\"],\n                           top_names = [\"reshape1\"],\n                           leading_dim=416))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,\n                           bottom_names = [\"reshape1\", \"dense\"], top_names = [\"concat1\"]))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCross,\n                           bottom_names = [\"concat1\"],\n                           top_names = [\"multicross1\"],\n                           num_layers=6))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,\n                           bottom_names = [\"concat1\"],\n                           top_names = [\"fc1\"],\n                           num_output=1024))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,\n                           bottom_names = [\"fc1\"],\n                           top_names = [\"relu1\"]))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,\n                           bottom_names = [\"relu1\"],\n                           top_names = [\"dropout1\"],\n                           dropout_rate=0.5))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,\n                           bottom_names = [\"dropout1\", \"multicross1\"],\n                           top_names = [\"concat2\"]))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,\n                           bottom_names = [\"concat2\"],\n                           top_names = [\"fc2\"],\n                           num_output=1))\nmodel.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,\n                           bottom_names = [\"fc2\", \"label\"],\n                           top_names = [\"loss\"]))\n\n# 编译并训练\nmodel.compile()\nmodel.summary()\nmodel.graph_to_json(graph_config_file = 
\"dcn.json\")\nmodel.fit(max_iter = 5120, display = 200, eval_interval = 1000, snapshot = 5000, snapshot_prefix = \"dcn\")\n```\n\n执行训练脚本：\n```bash\npython dcn_parquet_train.py\n```\n\n> **注意**: 由于使用的是随机生成的合成数据，评估得到的 AUC 值可能不准确。训练完成后，当前目录将生成模型图配置 (`dcn.json`)、模型权重快照及优化器状态文件。","某大型电商平台的算法团队正面临每日亿级用户行为数据的挑战，急需训练一个包含数十亿参数嵌入层的高精度点击率（CTR）预估模型，以实时优化商品推荐列表。\n\n### 没有 HugeCTR 时\n- **训练速度极慢**：传统 CPU 框架或通用 GPU 方案在处理大规模稀疏嵌入表时效率低下，单次全量模型训练往往需要数天甚至更久，严重拖慢迭代节奏。\n- **显存瓶颈频发**：面对 TB 级别的嵌入参数，单卡或多卡简单并行极易导致显存溢出（OOM），迫使团队不得不进行复杂的模型裁剪或降维，牺牲了模型精度。\n- **工程落地复杂**：缺乏针对推荐场景的原生多节点训练支持，数据科学家需花费大量时间编写底层通信代码和手动管理分布式资源，难以专注于算法优化。\n- **推理部署困难**：训练好的模型格式与生产环境不兼容，转换过程繁琐且容易出错，导致从实验到上线的周期长达数周。\n\n### 使用 HugeCTR 后\n- **训练效率飞跃**：利用 HugeCTR 专为 GPU 优化的稀疏操作内核和混合精度训练，相同规模模型的训练时间从数天缩短至数小时，显著提升了 MLPerf 基准测试表现。\n- **突破显存限制**：通过内置的模型并行训练机制，HugeCTR 能智能将巨大的嵌入表分片存储在多张 GPU 甚至多节点间，轻松支撑千亿级参数模型而无需牺牲特征维度。\n- **开发门槛降低**：提供高层抽象的 Python 接口和丰富的预置样本，团队可直接调用原生多节点训练功能，无需关注底层细节，让算法工程师回归模型设计本身。\n- **无缝生产部署**：借助 HugeCTR 到 ONNX 的转换器，模型可一键导出为标准格式，平滑对接现有推理引擎，将新模型上线周期压缩至天级别。\n\nHugeCTR 通过极致的 GPU 加速与领域专用的架构设计，让超大规模推荐模型的训练与部署变得既快速又简单。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-Merlin_HugeCTR_f36221d7.png","NVIDIA-Merlin","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA-Merlin_46f912e0.jpg","Merlin is a framework providing end-to-end GPU-accelerated recommender systems, from feature engineering to deep learning training and deploying to production",null,"https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin",[78,82,86,90,94,98,102,106,109,112],{"name":79,"color":80,"percentage":81},"C++","#f34b7d",38.9,{"name":83,"color":84,"percentage":85},"Cuda","#3A4E3A",32.2,{"name":87,"color":88,"percentage":89},"Python","#3572A5",16.5,{"name":91,"color":92,"percentage":93},"Jupyter 
Notebook","#DA5B0B",10.5,{"name":95,"color":96,"percentage":97},"CMake","#DA3434",1.1,{"name":99,"color":100,"percentage":101},"Shell","#89e051",0.6,{"name":103,"color":104,"percentage":105},"C","#555555",0,{"name":107,"color":108,"percentage":105},"Makefile","#427819",{"name":110,"color":111,"percentage":105},"HTML","#e34c26",{"name":113,"color":114,"percentage":105},"Batchfile","#C1F12E",1058,204,"2026-04-09T07:01:49","Apache-2.0",4,"Linux","必需 NVIDIA GPU，支持多卡及多节点训练（通过 NCCL），需启用 --gpus=all，建议增加共享内存 (-shm-size=1g) 和锁定内存限制 (-ulimit memlock=-1)","未说明（但建议配置充足的系统内存以支持 pinned memory 和大型嵌入表）",{"notes":124,"python":125,"dependencies":126},"从 v25.03 版本起，官方不再提供预构建镜像，用户需基于提供的 Dockerfile.base 自行构建镜像。运行容器时必须挂载宿主目录，并强烈建议配置共享内存和锁定内存资源以满足 NCCL 通信需求。该框架专为大规模推荐系统设计，支持模型并行和混合精度训练。","未说明（需配合 Docker 镜像使用）",[127,128,129,130],"Docker","NCCL","mpi4py","hugectr (内置)",[14],[133,134,135,136,137],"cpp","deep-learning","gpu-acceleration","recommendation-system","recommender-system","2026-03-27T02:49:30.150509","2026-04-17T09:54:30.448264",[141,146,151,156,161,165],{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},36680,"SparseOperationKit (SOK) 在初始化时挂起（hangs）怎么办？","该问题已在 v22.06 版本中修复。如果您遇到 SOK 初始化挂起的问题，请升级到 v22.06 或更高版本。如果升级后问题仍然存在，请重新打开 Issue 反馈。","https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F261",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},36681,"开启混合精度训练 (use_mixed_precision=True) 后模型不收敛或报错怎么办？","HugeCTR 的 `use_mixed_precision` 标志具有全局影响，不支持仅对特定层使用混合精度。\n1. 如果开启该标志，所有输入到层的张量必须是 `fp16` 数据类型。\n2. 如果关闭该标志，所有输入张量必须是 `fp32`。\n3. 
不要手动转换数据类型并馈送到不同数据类型的层。在 `fp32` 模式下，`cast` 层会将 `fp32` 转为 `fp16`；在 `fp16` 模式下，`cast` 层会将 `fp16` 转为 `fp32`。\n建议检查模型配置，确保数据类型与全局设置一致，或者参考官方脚本（如 DLRM）中已正确配置的混合精度参数。","https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F393",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},36682,"运行 MLCommons 基准测试时遇到 'CUDNN_STATUS_MAPPING_ERROR with cudnnSetStream' 错误如何解决？","该错误通常与环境配置或启动参数有关。解决方案如下：\n1. 在使用 `srun` 启动任务时，添加 `--propagate=STACK` 参数。\n2. 重新安装运行环境。\n执行上述步骤后通常可解决该运行时错误。","https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F433",{"id":157,"question_zh":158,"answer_zh":159,"source_url":160},36683,"通过 pip 安装 sparse_operation_kit 失败（构建 merlin-sok wheel 出错）怎么办？","从提供的日志看，直接通过 `pip install sparse_operation_kit` 安装时，构建 `merlin-sok` 依赖包失败。这通常是因为缺少编译环境或依赖项。\n建议方案：\n1. 不要直接使用 pip 安装源码包，而是使用官方提供的预构建 Docker 镜像（如 `gcr.io\u002Fdeeplearning-platform-release\u002Ftf2-gpu...`）。\n2. 如果必须自行安装，请确保系统已安装完整的构建工具链（如 gcc, g++, make, cmake）以及与 TensorFlow 版本匹配的 CUDA\u002FcuDNN 开发库。\n3. 参考官方文档中的 `install.sh` 脚本流程进行手动编译安装，而不是依赖 pip 自动构建。","https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F346",{"id":162,"question_zh":163,"answer_zh":164,"source_url":155},36684,"在多节点集群上运行 HugeCTR 时，如何正确配置 mpirun 以避免性能问题或错误？","在多节点或多 GPU 环境下，需注意以下配置细节：\n1. **NUMA 绑定**：使用 `mpirun` 时，务必正确使用 `--bind-to numa` 参数，并确保主机列表格式正确（例如 `--host A100-01:8` 表示该节点使用 8 个 GPU）。\n2. **进程数量**：`-np` 参数应等于总 GPU 数量（例如 8 个节点 x 8 GPU = 64），而不是节点数。\n3. **参数冲突**：如果在 `all_reduce_perf` 等测试中指定了 `-g` 参数，可能与 `mpirun` 的绑定策略冲突，建议移除 `-g` 参数让 mpirun 统一管理。\n错误的配置（如将 np 设为节点数而非 GPU 总数，或 NUMA 绑定不当）会导致通信性能下降或运行时错误。",{"id":166,"question_zh":167,"answer_zh":168,"source_url":145},36685,"如何在自定义 Docker 镜像中正确安装和配置 SparseOperationKit (SOK)？","基于官方基础镜像（如 `tf2-gpu`）构建时，推荐步骤如下：\n1. 复制 HugeCTR\u002FSOK 源码到容器内。\n2. 运行安装脚本时，即使 `install.sh` 中的 pythonpath 设置失败也不要中断构建，可以使用 `RUN .\u002Finstall.sh --SM=\"70;75;80\" --USE_NVTX=OFF; exit 0` 忽略错误继续。\n3. 
手动设置环境变量：`ENV PYTHONPATH \"\u002Fusr\u002Flocal\u002Flib\u002F:${PYTHONPATH}\"` 以确保 Python 能找到安装的库。\n4. 确保指定正确的 SM 架构版本（如 70, 75, 80）以匹配您的 GPU 型号。",[170,175,180,185,189,194,199,204,208,212,217,222,227,232,237,241,246,251,256,261],{"id":171,"version":172,"summary_zh":173,"released_at":174},292015,"v26.03.00","## 变更内容\n* @EmmaQiaoCh 在 https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fpull\u002F474 中升级了上传操作的版本\n* 支持 GB300\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fcompare\u002Fv25.03.00...v26.03.00","2026-03-12T05:34:23",{"id":176,"version":177,"summary_zh":178,"released_at":179},292016,"v25.03.00","## 变更内容\n* 低频滤波器，由 @ccccjunkang 在 https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fpull\u002F455 中实现\n* 从 GitLab 同步，由 @EmmaQiaoCh 在 https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fpull\u002F458 中完成\n* 25.03 版本预览，由 @EmmaQiaoCh 在 https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fpull\u002F473 中提供\n\n## 新贡献者\n* @ccccjunkang 在 https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fpull\u002F455 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fcompare\u002Fv24.06.00...v25.03.00","2025-03-14T09:04:46",{"id":181,"version":182,"summary_zh":183,"released_at":184},292017,"v24.06.00","## 24.06 版本更新内容\n\n\n+ **稀疏运算工具包 (SOK) 更新：**\n    + 新增了一个 API `sok.incremental_dump`，允许用户通过指定时间阈值，将新添加的键和值转储到 NumPy 数组中。目前该 API 仅支持以 HKV 为后端的 `sok.DynamicVariable`。\n    + 修复了 wgrad 占用过多 GPU 内存的问题。\n    + 修复了反向传播过程中 CUDA 内核中的非法内存访问问题。\n    + SOK（稀疏运算工具包）的文档和示例已更新。更多详情请参阅 [SOK 文档](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fsparse_operation_kit\u002Fmaster\u002Findex.html)。","2024-06-14T07:14:49",{"id":186,"version":187,"summary_zh":75,"released_at":188},292018,"v24.04.00","2024-04-18T12:10:41",{"id":190,"version":191,"summary_zh":192,"released_at":193},292019,"v23.12.00","## 23.12 版本更新内容\n\n\n+ **HPS 中的无锁推理缓存**\n  + 我们为分层参数服务器新增了一个无锁 
GPU 嵌入缓存，可进一步提升推理阶段嵌入表查找的性能。即使在并发模型更新或缺失键插入的情况下，也不会导致数据不一致。这是因为我们通过异步流同步机制确保了缓存的一致性。要启用无锁 GPU 嵌入缓存，用户只需将 [\"embedding_cache_type\"](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhierarchical_parameter_server\u002Fhps_database_backend.html#configuration) 设置为 `dynamic`，并将 `\"use_hctr_cache_implementation\"` 设置为 `false`。\n+ **SOK 正式发布**\n  + SOK 现已不再是 `experiment` 包，而是由 HugeCTR 官方正式支持。请使用 `import sparse_operation_kit as sok`，而非 `from sparse_operation_kit import experiment as sok`。\n  + `sok.DynamicVariable` 支持 Merlin-HKV 作为其后端。\n  + 新增了并行转储和加载功能。\n+ **代码清理与弃用**\n  + 已弃用 `Model::export_predictions` 函数，请改用 [Model::check_out_tensor](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fapi\u002Fpython_interface.html#check-out-tensor-method) 函数。\n  + 我们已弃用 `Norm` 和旧版 `Raw` 数据读取器。请改用 `hugectr.DataReaderType_t.RawAsync` 或 `hugectr.DataReaderType_t.Parquet` 作为替代。\n+ **已修复的问题**：\n  + 通过 SOK 提升了 [HKV](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHierarchicalKV) 查找的性能。\n  + 修复了 SOK 反向传播中出现的一个非法内存访问问题，该问题仅在特定情况下发生。\n  + 解决了当池化因子为零时，均值组合器返回零值，从而导致 SOK 查找结果为 NaN 的问题。\n  + 修复了一些与依赖项相关的构建问题。\n  + 优化了 SOK 中动态嵌入表（DET）的性能。\n  + 修复了用户通过 SOK 使用 DET 时指定负键导致的程序崩溃问题。\n  + 解决了在处理数千个嵌入表时，SOK 反向传播阶段偶尔出现的正确性问题。\n  + 移除了 TensorFlow >= 2.13 中发生的运行时错误。\n+ **已知问题**：\n  + 如果将 `max_eval_batches` 和 `batchsize_eval` 分别设置为较大的值，例如 5000 和 12000，则训练过程可能会导致非法内存访问错误。该问题[源自 CUB](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl\u002Fissues\u002F293)，并在最新版本中已修复。然而，该修复仅包含在 CUDA 12.3 中，而我们的 NGC 容器尚未采用此版本。因此，在我们将 NGC 容器更新至使用 CUDA 12.3 之前，作为临时解决方案，请使用最新的 CUB 重新编译 HugeCTR。否则，请尽量避免设置如此大的 `max_eval_batches` 和 `batchsize_eval`。\n  + 如果客户端代码调用 RMM 的 `rmm::mr::set_current_devi","2024-01-11T14:00:38",{"id":195,"version":196,"summary_zh":197,"released_at":198},292020,"v23.09.00","## 23.09 版本更新内容\n\n+ **代码清理与弃用**\n  + 离线推理功能已从我们的文档、Notebook 示例和代码中弃用。请查看适用于 
[TensorFlow](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhierarchical_parameter_server\u002Fhps_tf_user_guide.html) 和 [TensorRT](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhierarchical_parameter_server\u002Fhps_trt_user_guide.html) 的 HPS 插件。在 [此 HPS TensorRT Notebook](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fmain\u002Fhps_trt\u002Fnotebooks\u002Fdemo_for_tf_trained_model.ipynb) 中并未展示多 GPU 推理。\n  + 我们正在逐步弃用 [Embedding Training Cache (ETC)](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhugectr_embedding_training_cache.html) 功能。如果您仍在使用该功能，它目前仍可正常工作，但会显示弃用警告信息。在不久的将来，该功能将从 API 和代码层面彻底移除。建议您改用 NVIDIA 的 [HierarchicalKV](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHierarchicalKV) 作为替代方案。\n  + 在本版本中，我们还对 C++ 代码和 CMakeLists.txt 进行了清理，以提高代码的可维护性并修复一些潜在的小问题。未来几个版本中还将继续进行更多的代码清理工作。\n+ **常规更新**：\n  + 启用了静态 CUDA 运行时的支持。现在，您可以通过在使用 CMake 配置代码时指定 `-DUSE_CUDART_STATIC=ON` 来实验性地启用该功能；不过，默认情况下仍然使用动态 CUDA 运行时。\n  + 将 HPS 作为 TorchScript 的自定义扩展添加进来。用户可以在脚本化 Torch 模块的推理过程中利用 HPS 的嵌入查找功能。\n+ **问题修复**：\n  + 解决了在 SOK 与 HKV 联合使用时，由于 unique 操作和统一内存相关导致的性能退化问题。\n  - 减少了在加载和转储 SOK 嵌入时中间缓冲区的不必要内存消耗。\n  + 修复了 Interaction Layer，使其能够支持较大的 `num_slots` 参数。\n  + 解决了在使用多张 H800 GPU 时偶尔出现的运行时错误。\n+ **已知问题**：\n  + 如果我们将 `max_eval_batches` 和 `batchsize_eval` 分别设置为较大的值，例如 5000 和 12000，则训练过程可能会导致非法内存访问错误。该问题源自 CUB 库中的一个缺陷，已在最新版本中修复。然而，该修复仅包含在 CUDA 12.3 中，而我们的 NGC 容器尚未采用该版本的 CUDA。因此，在我们将 NGC 容器更新至依赖 CUDA 12.3 之前，您可以尝试使用最新版 CUB 重新构建 HugeCTR 作为临时解决方案。否则，请尽量避免将 `max_eval_batches` 和 `batchsize_eval` 设置得过大。\n  + 如果客户端代码调用 RMM 的 `rmm::mr::set_current_device_resource()` 或 `rmm::mr::set_current_device_resource()`，HugeCTR 可能会导致运行时错误。这是因为 HugeCTR 的 Parquet 数据读取器也会调用 `rmm::mr::set_current_device_resource()`，从而使得该调用对同一进程中的其他库可见。有关详细信息，请参阅 [此问题](https:\u002F\u002F","2023-09-27T03:18:14",{"id":200,"version":201,"summary_zh":202,"released_at":203},292021,"v23.08.00","## 23.08 版本更新内容\n\n+ **分层参数服务器**：\n  + 
支持静态 EC fp8 量化  \n    我们已在静态缓存中支持 fp8 量化。通过启用 `fp8_quant` 配置，HPS 会在读取嵌入表时对嵌入向量进行 fp8 量化，并在查询的嵌入键对应的静态嵌入缓存中对该向量执行 fp32 反量化，从而确保密集部分预测的准确性。\n  + 基于 HPS TensorRT 插件的大模型部署演示  \n    该演示展示了如何使用 HPS TRT 插件，基于 1TB 的 Criteo 数据集构建完整的 TRT 引擎，以部署一个 147GB 的嵌入表。我们还提供了静态嵌入实现，可将嵌入表完全卸载到主机的页锁定内存中，用于在 x86 和 Grace Hopper 超级芯片上的基准测试。\n  + 问题修复  \n    + 解决 Kafka 更新摄取错误。此前存在一个问题，导致来自 Kafka 消息队列的在线参数更新无法传递至 Redis 数据库后端。\n    + 修复了 HPS Triton 后端因在获取对应设备上的嵌入缓存时出现未定义空值而导致嵌入缓存重新初始化的问题。\n\n+ **HugeCTR 训练与 SOK**：\n  + 嵌入集合中的密集嵌入支持  \n    我们在嵌入集合中新增了密集嵌入功能。用户只需将组合器指定为 `_concat_` 即可使用密集嵌入。更多信息请参阅 [dense_embedding.py](test\u002Fembedding_collection_test\u002Fdgx_a100_one_hot.py)。\n  + 对序列掩码层和注意力 softmax 层进行了优化，以支持交叉注意力。\n  + 我们引入了一个更为通用的重塑层，允许用户在不限制维度的情况下将源张量重塑为目标张量。更多详细信息请参阅 [重塑层 API](docs\u002Fsource\u002Fapi\u002Fhugectr_layer_book.md#reshape-layer)。\n  + 问题修复  \n    + 修复了在稀疏运算工具包中使用 Localized Variable 时出现的错误。\n    + 修复了稀疏运算工具包反向传播计算中的 bug。\n    + 通过将对 `DeviceSegmentedSort` 的调用替换为 `DeviceSegmentedRadixSort`，修复了一些 SOK 性能相关的 bug。\n    + 修复了 SOK Python API 方面的一个 bug，该 bug 导致模型前向传播函数被重复调用，从而降低了性能。\n    + 减少 CPU 启动开销  \n      + 移除了 DataDistributor 中的动态向量分配。\n      + 从 DataReader 中移除了 checkout 值张量的使用。此前，DataReader 会实时生成一个嵌套的 std::vector 并将其返回给嵌入集合，这会产生大量主机端开销。现将其改为类成员变量，从而消除了这部分开销。\n    + 与最新的 Parquet 更新保持一致。  \n      我们修复了由于 cudf 23.06 中 `parquet_reader_options::set_num_rows()` 更新所引发的 bug：[PR](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcudf\u002Fpull\u002F13063)。\n    + 修复了 debug 模式下的 core23 断言错误。  \n      我们已修复了新核心库中存在的断言错误。","2023-08-28T10:46:43",{"id":205,"version":206,"summary_zh":75,"released_at":207},292022,"v23.06.01","2023-07-17T07:14:13",{"id":209,"version":210,"summary_zh":75,"released_at":211},292023,"v23.06.00","2023-06-14T09:30:29",{"id":213,"version":214,"summary_zh":215,"released_at":216},292024,"v23.05.01","## 23.05 版本更新内容\n在本次发布中，我们修复了一些问题并优化了代码。\n\n+ **3G Embedding 更新**:\n  + 重构了与 `DataDistributor` 相关的代码\n  + 新增了 SOK 的 `load()` 和 `dump()` API，现已支持 TensorFlow 2。使用该 API 时，除了指定 
`path` 外，还需提供 `sok_vars`。\n  + `sok_vars` 是一个包含 `sok.variable` 和\u002F或 `sok.dynamic_variable` 的列表。\n  + 如果需要保存如 `Adam` 优化器中的 `m` 和 `v` 状态，则还需同时指定 `optimizer`。\n  + `optimizer` 必须是 `tf.keras.optimizers.Optimizer` 或 `sok.OptimizerWrapper`，且其底层类型必须为 `SGD`、`Adamax`、`Adadelta`、`Adagrad` 或 `Ftrl`。\n\n  ```python\n  import sparse_operation_kit as sok\n  \n  sok.load(path, sok_vars, optimizer=None)\n  \n  sok.dump(path, sok_vars, optimizer=None)\n  ```\n\n  这些 API 与使用的 GPU 数量及分片策略无关。例如，使用 8 块 GPU 训练并导出的分布式嵌入表，可以加载到 4 块 GPU 的机器上继续训练。\n\n+ **已修复的问题**:\n  + 修复了在使用 HPS UVM 实现并启用嵌入表融合时出现的段错误和初始化错误。\n  - 在调试模式下构建 HugeCTR 时移除了 `cudaDeviceSynchronize()`，因此即使在调试模式下也可以启用 CUDA Graph。\n  + 修改了一些 Notebook，使其使用最新版本的 NGC 容器。\n  + 修复了 `EmbeddingTableCollection` 的单元测试，使其能够在多 GPU 环境下正确运行。\n\n+ **已知问题**:\n  + 如果客户端代码调用 RMM 的 `rmm::mr::set_current_device_resource()` 或 `rmm::mr::set_current_device_resource()`，HugeCTR 可能会导致运行时错误。这是因为 HugeCTR 的 Parquet 数据读取器也会调用 `rmm::mr::set_current_device_resource()`，从而对同一进程中的其他库可见。请参考 [此问题](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F356)。作为 workaround，如果知道 `rmm::mr::set_current_device_resource()` 是在 HugeCTR 外部调用的，可以将环境变量 `HCTR_RMM_SETTABLE` 设置为 0，以禁用 HugeCTR 自定义设置 RMM 设备资源的功能。但需谨慎，因为这可能会影响 Parquet 文件的读取性能。\n  + HugeCTR 使用 NCCL 在不同进程间共享数据，而 NCCL 需要共享系统内存用于 IPC，以及固定（页锁定）的系统内存资源。\n    如果在容器内使用 NCCL，请在启动容器时通过以下参数增加这些资源：\n\n    ```shell\n      -shm-size=1g -ulimit memlock=-1\n    ```\n\n    更多信息请参阅 [NCCL 已知问题](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Fnccl\u002Fuser-guide\u002Fdocs\u002Ftroubleshooting.html#sharing-data) 和 [GitHub 问题](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F243)。\n  + 即使目标 Kafka 代理无响应，`KafkaProducers` 仍能成功启动。\n    为避免因从 Kafka 流式更新模型而导致的数据丢失，请确保有足够的 Kafka 代理正在运行并正常工作。","2023-05-18T09:27:59",{"id":218,"version":219,"summary_zh":220,"released_at":221},292025,"v23.05.00","## 23.05 版本更新内容\n在本次发布中，我们修复了一些问题并优化了代码。\n\n+ **3G Embedding 更新**：\n  + 重构了与 
`DataDistributor` 相关的代码\n  + 新增了 SOK 的 `load()` 和 `dump()` API，现已支持 TensorFlow 2。使用该 API 时，除了指定 `path` 外，还需提供 `sok_vars`。\n  + `sok_vars` 是一个包含 `sok.variable` 和\u002F或 `sok.dynamic_variable` 的列表。\n  + 如果需要保存优化器状态（如 Adam 优化器中的 `m` 和 `v`），则还必须指定 `optimizer`。\n  + `optimizer` 必须是 `tf.keras.optimizers.Optimizer` 或 `sok.OptimizerWrapper`，且其底层类型必须为 `SGD`、`Adamax`、`Adadelta`、`Adagrad` 或 `Ftrl`。\n\n  ```python\n  import sparse_operation_kit as sok\n  \n  sok.load(path, sok_vars, optimizer=None)\n  \n  sok.dump(path, sok_vars, optimizer=None)\n  ```\n\n  这些 API 与使用的 GPU 数量及分片策略无关。例如，使用 8 块 GPU 训练并导出的分布式嵌入表，可以加载到 4 块 GPU 的机器上继续训练。\n\n+ **已修复的问题**：\n  + 修复了在使用 HPS UVM 实现并启用嵌入表融合时出现的段错误和初始化错误。\n  - 在调试模式下构建 HugeCTR 时移除了 `cudaDeviceSynchronize()`，因此即使在调试模式下也可以启用 CUDA Graph。\n  + 修改了一些 Notebook，使其使用最新版本的 NGC 容器。\n  + 修复了 `EmbeddingTableCollection` 的单元测试，使其能够在多 GPU 环境下正确运行。\n\n+ **已知问题**：\n  + 如果客户端代码调用 RMM 的 `rmm::mr::set_current_device_resource()` 或 `rmm::mr::set_current_device_resource()`，HugeCTR 可能会导致运行时错误。这是因为 HugeCTR 的 Parquet 数据读取器也会调用 `rmm::mr::set_current_device_resource()`，从而对同一进程中的其他库可见。请参考 [此问题](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F356)。作为 workaround，如果知道 `rmm::mr::set_current_device_resource()` 是在 HugeCTR 外部调用的，可以将环境变量 `HCTR_RMM_SETTABLE` 设置为 0，以禁用 HugeCTR 设置自定义 RMM 设备资源的功能。但请注意，这可能会影响 Parquet 文件的读取性能。\n  + HugeCTR 使用 NCCL 在不同进程间共享数据，而 NCCL 需要共享系统内存用于 IPC，以及固定（页锁定）系统内存资源。\n    如果在容器内使用 NCCL，请在启动容器时通过以下参数增加这些资源：\n\n    ```shell\n      -shm-size=1g -ulimit memlock=-1\n    ```\n\n    更多信息请参阅 [NCCL 已知问题](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Fnccl\u002Fuser-guide\u002Fdocs\u002Ftroubleshooting.html#sharing-data) 和 [GitHub 问题](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F243)。\n  + 即使目标 Kafka 代理无响应，`KafkaProducers` 仍能成功启动。\n    为避免因从 Kafka 流式更新模型而导致的数据丢失，请确保有足够的 Kafka 
代理正在运行并正常工作。","2023-05-18T04:35:35",{"id":223,"version":224,"summary_zh":225,"released_at":226},292026,"v23.04.00","## 23.04 版本新特性\n\n+ **分层参数服务器增强**：\n  + HPS 表融合：从本版本开始，您可以在 HPS 中融合嵌入向量大小相同的表。我们已在 TensorFlow 的 HPS 插件以及 HPS 的 Triton 后端中支持此功能。要启用表融合，请在 HPS JSON 文件中将 `fuse_embedding_table` 设置为 `true`。此功能要求不同表中的键值不重叠，且模型图中各嵌入查找层之间不存在依赖关系。更多信息请参阅 [HPS 配置](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhierarchical_parameter_server\u002Fhps_database_backend.html#configuration) 和 [HPS 表融合演示笔记本](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhps_tf\u002Fnotebooks\u002Fhps_table_fusion_demo.html)。当存在多个表并使用 GPU 嵌入缓存时，此功能可显著降低嵌入查找延迟。与未融合的情况相比，在笔记本中演示的融合案例在 V100 上实现了约 3 倍的加速。\n\n  + UVM 支持：我们已升级静态嵌入解决方案。对于大小超过设备内存的嵌入表，我们将高频嵌入保存在 HBM 中作为嵌入缓存，并将剩余嵌入卸载到 UVM。与将剩余嵌入卸载到 [Volatile DB](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhierarchical_parameter_server\u002Fhps_database_backend.html#volatile-database-configuration) 的动态缓存方案相比，UVM 方案具有更高的 CPU 查找吞吐量。我们将在未来的版本中支持 UVM 方案的在线更新。用户可以通过 [embedding_cache_type](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmain\u002Fhierarchical_parameter_server\u002Fhps_database_backend.html#inference-parameters) 配置参数在不同的嵌入缓存方案之间切换。\n  + Triton 性能分析器请求生成器：我们新增了一个 [推理请求生成器](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Ftools\u002Finference_test_scripts\u002Frequest_generator)，用于生成 Triton 性能分析器所需的 JSON 请求格式。结合 [模型生成器](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fmain\u002Ftools\u002Finference_test_scripts\u002Fmodel_generator)，您可以使用 Triton 性能分析器对 HPS 性能进行剖析和压力测试。有关 API 文档和示例用法，请参阅 [README](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fmain\u002Ftools\u002Finference_test_scripts\u002FREADME.md)。\n\n+ **通用更新**：\n  + DenseLayerComputeConfig：MLP 和 CrossLayer 在训练时支持在数据梯度反向传播的同时异步计算权重梯度。我们为 `hugectr.DenseLayer` 添加了一个新的成员 `hugectr 
DenseLayerComputeConfig`，用于配置计算行为。启用异步权重梯度计算的开关已从 `hugectr.CreateSolver` 中移出。","2023-04-21T05:40:26",{"id":228,"version":229,"summary_zh":230,"released_at":231},292027,"v23.02.00","## 23.02 版本更新内容\n\n+ **HPS 增强功能**：\n  + 启用了 [HPS TensorFlow 插件](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmaster\u002Fhierarchical_parameter_server\u002Fhps_tf_user_guide.html)。\n  + 为 HPS TensorFlow 插件启用了 max_norm 裁剪功能。\n  + 优化了 HPS HashMap 检索的性能。\n  + 启用了 [HPS 性能分析器](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fmain\u002FHugeCTR\u002Fsrc\u002Finference_benchmark\u002Fhps_profiler.md)。\n\n+ **Google Cloud Storage (GCS) 支持**：\n  + 为训练和推理增加了对 Google Cloud Storage (GCS) 的支持。更多详情，请参阅 [使用远程文件系统进行训练](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fmain\u002Fnotebooks\u002Ftraining_and_inference_with_remote_filesystem.ipynb) 笔记本中的 GCS 部分。\n\n+ **已修复的问题**：\n  + 修复了 HPS 静态表中的一个 bug，该 bug 会导致当批次大小大于 256 时产生错误结果。\n  + 修复了 `wdl_prediction` 笔记本中的预处理问题。\n  • 纠正了 HPS 和 InferenceModel 中设备的设置与管理方式。\n  + 修复了调试构建错误。\n  + 修复了与 CUDA 12.0 相关的构建错误。\n  + 修复了笔记本中关于多进程 HashMap 的报告问题，以及一些其他小问题。\n\n+ **已知问题**：\n  + HugeCTR 使用 NCCL 在不同进程间共享数据，而 NCCL 可能需要用于 IPC 的共享系统内存以及固定的（页锁定的）系统内存资源。\n    如果您在容器内使用 NCCL，请在启动容器时通过指定以下参数来增加这些资源：\n\n    ```shell\n      -shm-size=1g -ulimit memlock=-1\n    ```\n\n    请参阅 NCCL 的 [已知问题](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Fnccl\u002Fuser-guide\u002Fdocs\u002Ftroubleshooting.html#sharing-data) 和 GitHub 上的 [问题](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F243)。\n  + 即使目标 Kafka 代理无响应，`KafkaProducers` 仍能成功启动。\n    为了避免因从 Kafka 流式传输模型更新而导致的数据丢失，您必须确保有足够数量的 Kafka 代理正在运行、正常工作，并且可以从运行 HugeCTR 的节点访问到它们。\n  + 文件列表中的数据文件数量应大于或等于数据读取工作者的数量。\n    否则，不同的工作者会被映射到同一文件，导致数据加载无法按预期进行。\n  + 不支持带有正则化项的联合损失训练。\n  + 不支持将 Adam 优化器的状态转储到 AWS S3。","2023-03-13T15:29:32",{"id":233,"version":234,"summary_zh":235,"released_at":236},292028,"v4.3.1","## What's New in Version 4.3\r\n\r\n  
```{important}\r\n  In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.3`.\r\n  Afterward, the library will use calendar versioning only, such as `v23.01`.\r\n  ```\r\n\r\n+ **Support for BERT and Variants**:\r\nThis release includes support for BERT in HugeCTR.\r\nThe documentation includes updates to the [MultiHeadAttention](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fapi\u002Fhugectr_layer_book.html#multiheadattention-layer) layer and adds documentation for the [SequenceMask](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fapi\u002Fhugectr_layer_book.html#sequencemask-layer) layer.\r\nFor more information, refer to the [samples\u002Fbst](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.3\u002Fsamples\u002Fbst) directory of the repository in GitHub.\r\n\r\n+ **HPS Plugin for TensorFlow integration with TensorFlow-TensorRT (TF-TRT)**:\r\nThis release includes plugin support for integration with TensorFlow-TensorRT.\r\nFor sample code, refer to the [Deploy SavedModel using HPS with Triton TensorFlow Backend](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fhps_tf\u002Fnotebooks\u002Fhps_tensorflow_triton_deployment_demo.html) notebook.\r\n\r\n+ **Deep & Cross Network Layer version 2 Support**:\r\nThis release includes support for Deep & Cross Network version 2.\r\nFor conceptual information, refer to \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F2008.13535>.\r\nThe documentation for the [MultiCross Layer](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fapi\u002Fhugectr_layer_book.html#multicross-layer) is updated.\r\n\r\n+ **Enhancements to Hierarchical Parameter Server**:\r\n  + RedisClusterBackend now supports TLS\u002FSSL communication.\r\n    For sample code, refer to the [Hierarchical Parameter Server Demo](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fnotebooks\u002Fhps_demo.html) 
notebook.\r\n    The notebook is updated with step-by-step instructions to show you how to set up HPS to use Redis with (and without) encryption.\r\n    The [Volatile Database Parameters](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fhugectr_parameter_server.html#volatile-database-parameters) documentation for HPS is updated with the `enable_tls`, `tls_ca_certificate`, `tls_client_certificate`, `tls_client_key`, and `tls_server_name_identification` parameters.\r\n  + MultiProcessHashMapBackend includes a fix for a bug that prevented configuring the shared memory size when using JSON file-based configuration.\r\n  + On-device input keys are supported now so that an extra host-to-device copy is removed to improve performance.\r\n  + A dependency on the XX-Hash library is removed.\r\n    The library is no longer used by HugeCTR.\r\n  + Added the static table support to the embedding cache.\r\n    The static table is suitable when the embedding table can be placed entirely in GPU memory.\r\n    In this case, the static table is more than three times faster than the embedding cache lookup.\r\n    The static table does not support embedding updates.\r\n\r\n+ **Support for New Optimizers**:\r\n  + Added support for SGD, Momentum SGD, Nesterov Momentum, AdaGrad, RMS-Prop, Adam and FTRL optimizers for dynamic embedding table (DET).\r\n    For sample code, refer to the `test_embedding_table_optimizer.cpp` file in the [test\u002Futest\u002Fembedding_collection\u002F](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.3\u002Ftest\u002Futest\u002Fembedding_collection) directory of the repository on GitHub.\r\n  + Added support for the FTRL optimizer for dense networks.\r\n\r\n+ **Data Reading from S3 for Offline Inference**:\r\nIn addition to reading during training, HugeCTR now supports reading data from remote file systems such as HDFS and S3 during offline inference by using the DataSourceParams API.\r\nThe [HugeCTR Training and 
Inference with Remote File System Example](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fnotebooks\u002Ftraining_and_inference_with_remote_filesystem.html) is updated to demonstrate the new functionality.\r\n\r\n+ **Documentation Enhancements**:\r\n  + The setup [instructions for running the example notebooks](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fnotebooks\u002Findex.html) are revised for clarity.\r\n  + The example notebooks are also updated to show how to use a data preprocessing script that simplifies the user experience.\r\n  + Documentation for the [MLP Layer](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fapi\u002Fhugectr_layer_book.html#mlp-layer) is new.\r\n  + Several 2022 talks and blogs are added to the [HugeCTR Talks and Blogs](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.3\u002Fhugectr_talks_blogs.html) page.\r\n\r\n+ **Issues Fixed**:\r\n  + The original CUDA device with NUMA bind before a call to some HugeCTR APIs is recovered correctly now.\r\n    This issue sometimes led to a problem when you mixed calls to HugeCTR and other CUDA-enabled libraries.\r\n  + Fixed the occasional CUDA kernel launch failure of embedding when HugeCTR is installed with the DEBUG macro.\r\n  + Fixed an SOK build error that was related to TensorFlow v2.1.0 and higher.\r\n    The issue was that the C++ API and C++ standard were updated to use C++17.\r\n  + Fixed a CUDA 12 related compilation error.\r\n\r\n+ *","2023-02-09T01:39:57",{"id":238,"version":239,"summary_zh":235,"released_at":240},292029,"v4.3","2023-01-05T03:04:07",{"id":242,"version":243,"summary_zh":244,"released_at":245},292030,"v4.2","## What's New in Version 4.2\r\n\r\n  ```{important}\r\n  In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.2`.\r\n  Afterward, the library will use calendar versioning only, such as `v23.01`.\r\n  ```\r\n\r\n+ **Change to HPS with Redis or Kafka**:\r\nThis 
release includes a change to Hierarchical Parameter Server and affects deployments that use `RedisClusterBackend` or model parameter streaming with Kafka.\r\nA third-party library that was used for HPS partition selection algorithm is replaced to improve performance.\r\nThe new algorithm can produce different partition assignments for volatile databases.\r\nAs a result, volatile database backends that retain data between application startup, such as the `RedisClusterBackend`, must be reinitialized.\r\nModel streaming with Kafka is equally affected.\r\nTo avoid issues with updates, reset all respective queue offsets to the `end_offset` before you reinitialize the `RedisClusterBackend`.\r\n\r\n+ **Enhancements to the Sparse Operation Kit in DeepRec**:\r\nThis release includes updates to the Sparse Operation Kit to improve the performance of the embedding variable lookup operation in DeepRec.\r\nThe API for the `lookup_sparse()` function is changed to remove the `hotness` argument.\r\nThe `lookup_sparse()` function is enhanced to calculate the number of non-zero elements dynamically.\r\nFor more information, refer to the [sparse_operation_kit directory](https:\u002F\u002Fgithub.com\u002Falibaba\u002FDeepRec\u002Ftree\u002Fmain\u002Faddons\u002Fsparse_operation_kit) of the DeepRec repository in GitHub.\r\n\r\n+ **Enhancements to 3G Embedding**:\r\nThis release includes the following enhancements to 3G embedding:\r\n  + The API is changed.\r\n    The `EmbeddingPlanner` class is replaced with the `EmbeddingCollectionConfig` class.\r\n    For examples of the API, see the tests in the [test\u002Fembedding_collection_test](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.2\u002Ftest\u002Fembedding_collection_test) directory of the repository in GitHub.\r\n  + The API is enhanced to support dumping and loading weights during the training process.\r\n    The methods are `Model.embedding_dump(path: str, table_names: list[str])` and 
`Model.embedding_load(path: str, list[str])`.\r\n    The `path` argument is a directory in the file system that you can dump weights to or load weights from.\r\n    The `table_names` argument is a list of embedding table names as strings.\r\n\r\n+ **New Volatile Database Type for HPS**:\r\nThis release adds a `db_type` value of `multi_process_hash_map` to the Hierarchical Parameter Server.\r\nThis database type supports sharing embeddings across process boundaries by using shared memory and the `\u002Fdev\u002Fshm` device file.\r\nMultiple processes running HPS can read from and write to the same hash map.\r\nFor an example, refer to the [Hierarchical Parameter Server Demo](.\u002Fnotebooks\u002Fhps_demo.ipynb) notebook.\r\n\r\n+ **Enhancements to the HPS Redis Backend**:\r\nIn this release, the Hierarchical Parameter Server can open multiple connections in parallel to each Redis node.\r\nThis enhancement enables HPS to take advantage of overlapped processing optimizations in the I\u002FO module of Redis servers.\r\nIn addition, HPS can now take advantage of Redis hash tags to co-locate embedding values and metadata.\r\nThis enhancement can reduce the number of accesses to Redis nodes and the number of per-node round trip communications that are needed to complete transactions.\r\nAs a result, the enhancement increases the insertion performance.\r\n\r\n+ **MLPLayer is New**:\r\nThis release adds an MLP layer with the `hugectr.Layer_t.MLP` class.\r\nThis layer is very flexible and makes it easier to use a group of fused fully-connected layers and enable the related optimizations.\r\nFor each fused fully-connected layer in `MLPLayer`, the output dimension, bias, and activation function are all adjustable.\r\nMLPLayer supports FP32, FP16 and TF32 data types.\r\nFor an example, refer to the [dgx_a100_mlp.py](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fv4.2\u002Fsamples\u002Fdlrm\u002Fdgx_a100_mlp.py) in the `samples\u002Fdlrm` directory of the 
GitHub repository to learn how to use the layer.\r\n\r\n+ **Sparse Operation Kit installable from PyPi**:\r\nVersion `1.1.4` of the Sparse Operation Kit is installable from PyPi in the [merlin-sok](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmerlin-sok\u002F) package.\r\n\r\n+ **Multi-task Model Support added to the ONNX Model Converter**:\r\nThis release adds support for multi-task models to the ONNX converter.\r\nThis release also includes an enhancement to the [preprocess_census.py](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fblob\u002Fv4.2\u002Fsamples\u002Fmmoe\u002Fpreprocess_census.py) script in `samples\u002Fmmoe` directory of the GitHub repository.\r\n\r\n+ **Issues Fixed**:\r\n  + Using the HPS Plugin for TensorFlow with `MirroredStrategy` and running the [Hierarchical Parameter Server Demo](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.2\u002Fhierarchical_parameter_server\u002Fnotebooks\u002Fhierarchical_parameter_server_demo.html) notebook triggered an issue with [ReplicaContext](https:\u002F\u002Fwww.tensorflow.org\u002Fapi_docs\u002Fpython\u002Ftf\u002Fdistribute\u002FReplicaCon","2022-11-15T00:08:37",{"id":247,"version":248,"summary_zh":249,"released_at":250},292031,"v4.1.1","## What's New in Version 4.1.1\r\n\r\n+ **Simplified Interface for 3G Embedding Table Placement Strategy**:\r\n3G embedding now provides an easier way for you to configure an embedding table placement strategy.\r\nInstead of using JSON, you can configure the embedding table placement strategy by using function arguments.\r\nYou only need to provide the `shard_matrix`, `table_group_strategy`, and `table_placement_strategy` arguments.\r\nWith these arguments, 3G embedding can group different tables together and place them according to the `shard_matrix` argument.\r\nFor an example, refer to 
[dlrm_train.py](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.1.1\u002Ftest\u002Fembedding_collection_test\u002Fdlrm_train.py) file in the `test\u002Fembedding_collection_test` directory of the repository on GitHub.\r\nFor comparison, refer to the [same file](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.0\u002Ftest\u002Fembedding_collection_test\u002Fdlrm_train.py) from the v4.0 branch of the repository.\r\n\r\n+ **New MMoE and Shared-Bottom Samples**:\r\nThis release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation.\r\nFor more information, refer to the `README.md`, `mmoe_parquet.py`, and other files in the [`samples\u002Fmmoe`](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.1.1\u002Fsamples\u002Fmmoe) directory of the repository on GitHub.\r\nThis release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.\r\n\r\n+ **Support for AWS S3 File System**:\r\nThe Parquet DataReader can now read datasets from the Amazon Web Services S3 file system.\r\nYou can also load and dump models from and to S3 during training.\r\nThe documentation for the [`DataSourceParams`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1.1\u002Fapi\u002Fpython_interface.html#datasourceparams-class) class is updated.\r\nTo view sample code, refer to the [HugeCTR Training with Remote File System Example](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1.1\u002Fnotebooks\u002Ftraining_with_remote_filesystem.html) notebook.\r\n\r\n+ **Simplification for File System Usage**:\r\nYou no longer need to pass `DataSourceParams` for model loading and dumping.\r\nThe `FileSystem` class automatically infers the correct file system type, local, HDFS, or S3, based on the path URI that you specified when you built the model.\r\nFor example, the path 
`hdfs:\u002F\u002Flocalhost:9000\u002F` is inferred as an HDFS file system and the path `https:\u002F\u002Fmybucket.s3.us-east-1.amazonaws.com\u002F` is inferred as an S3 file system.\r\n\r\n+ **Support for Loading Models from Remote File Systems to HPS**:\r\nThis release enables you to load models from HDFS and S3 remote file systems to HPS during inference.\r\nTo use the new feature, specify an HDFS or S3 path URI in `InferenceParams`.\r\n\r\n+ **Support for Exporting Intermediate Tensor Values into a NumPy Array**:\r\nThis release adds the `check_out_tensor` function to `Model` and `InferenceModel`.\r\nYou can use this function to check out the intermediate tensor values using the Python interface.\r\nThis function is especially helpful for debugging.\r\nFor more information, refer to [`Model.check_out_tensor`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1.1\u002Fapi\u002Fpython_interface.html#check-out-tensor-method) and [`InferenceModel.check_out_tensor`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmaster\u002Fapi\u002Fpython_interface.html#id3).\r\n\r\n+ **On-Device Input Keys for HPS Lookup**:\r\nThe HPS lookup supports input embedding keys that are in GPU memory during inference.\r\nThis enhancement removes a host-to-device copy by using the DLPack `lookup_fromdlpack()` interface.\r\nBy using the interface, the input DLPack capsule of embedding keys can be a GPU tensor.\r\n\r\n+ **Documentation Enhancements**:\r\n  + The graphic for the [Hierarchical Parameter Server](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1.1\u002Fhierarchical_parameter_server\u002Findex.html) library that shows the relationship to other software packages is enhanced.\r\n  + The sample notebook for [Deploy SavedModel using HPS with Triton TensorFlow Backend](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1.1\u002Fhierarchical_parameter_server\u002Fnotebooks\u002Fhps_tensorflow_triton_deployment_demo.html) is added to the 
documentation.\r\n  + Style updates to the [Hierarchical Parameter Server API](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1.1\u002Fhierarchical_parameter_server\u002Fapi\u002Findex.html) documentation.\r\n\r\n+ **Issues Fixed**:\r\n  + The `InteractionLayer` class is fixed so that it works correctly with `num_feas > 30`.\r\n  + The cuBLASLt configuration is corrected by increasing the workspace size and adding the epilogue mask.\r\n  + The NVTabular based preprocessing script for our samples that demonstrate feature crossing is fixed.\r\n  + The async data reader is fixed. Previously, it would hang and cause a corruption issue due to an improper I\u002FO block size and I\u002FO alignment problem.\r\n    The `AsyncParam` class is changed to implement the fix.\r\n    The `io_block_size` argument is replaced by the `max_nr_request` argument and the actual I\u002FO block size that the async reader uses is comp","2022-11-02T08:05:46",{"id":252,"version":253,"summary_zh":254,"released_at":255},292032,"v4.1","## What's New in Version 4.1\r\n\r\n+ **Simplified Interface for 3G Embedding Table Placement Strategy**:\r\n3G embedding now provides an easier way for you to configure an embedding table placement strategy.\r\nInstead of using JSON, you can configure the embedding table placement strategy by using function arguments.\r\nYou only need to provide the `shard_matrix`, `table_group_strategy`, and `table_placement_strategy` arguments.\r\nWith these arguments, 3G embedding can group different tables together and place them according to the `shard_matrix` argument.\r\nFor an example, refer to [dlrm_train.py](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.1\u002Ftest\u002Fembedding_collection_test\u002Fdlrm_train.py) file in the `test\u002Fembedding_collection_test` directory of the repository on GitHub.\r\nFor comparison, refer to the [same 
file](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.0\u002Ftest\u002Fembedding_collection_test\u002Fdlrm_train.py) from the v4.0 branch of the repository.\r\n\r\n+ **New MMoE and Shared-Bottom Samples**:\r\nThis release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation.\r\nFor more information, refer to the `README.md`, `mmoe_parquet.py`, and other files in the [`samples\u002Fmmoe`](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.1\u002Fsamples\u002Fmmoe) directory of the repository on GitHub.\r\nThis release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.\r\n\r\n+ **Support for AWS S3 File System**:\r\nThe Parquet DataReader can now read datasets from the Amazon Web Services S3 file system.\r\nYou can also load and dump models from and to S3 during training.\r\nThe documentation for the [`DataSourceParams`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1\u002Fapi\u002Fpython_interface.html#datasourceparams-class) class is updated.\r\nTo view sample code, refer to the [HugeCTR Training with Remote File System Example](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1\u002Fnotebooks\u002Ftraining_with_remote_filesystem.html) notebook.\r\n\r\n+ **Simplification for File System Usage**:\r\nYou no longer need to pass `DataSourceParams` for model loading and dumping.\r\nThe `FileSystem` class automatically infers the correct file system type, local, HDFS, or S3, based on the path URI that you specified when you built the model.\r\nFor example, the path `hdfs:\u002F\u002Flocalhost:9000\u002F` is inferred as an HDFS file system and the path `https:\u002F\u002Fmybucket.s3.us-east-1.amazonaws.com\u002F` is inferred as an S3 file system.\r\n\r\n+ **Support for Loading Models from Remote File Systems to HPS**:\r\nThis release enables you to load models from HDFS and 
S3 remote file systems to HPS during inference.\r\nTo use the new feature, specify an HDFS or S3 path URI in `InferenceParams`.\r\n\r\n+ **Support for Exporting Intermediate Tensor Values into a NumPy Array**:\r\nThis release adds the `check_out_tensor` function to `Model` and `InferenceModel`.\r\nYou can use this function to check out the intermediate tensor values using the Python interface.\r\nThis function is especially helpful for debugging.\r\nFor more information, refer to [`Model.check_out_tensor`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1\u002Fapi\u002Fpython_interface.html#check-out-tensor-method) and [`InferenceModel.check_out_tensor`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmaster\u002Fapi\u002Fpython_interface.html#id3).\r\n\r\n+ **On-Device Input Keys for HPS Lookup**:\r\nThe HPS lookup supports input embedding keys that are in GPU memory during inference.\r\nThis enhancement removes a host-to-device copy by using the DLPack `lookup_fromdlpack()` interface.\r\nBy using the interface, the input DLPack capsule of embedding keys can be a GPU tensor.\r\n\r\n+ **Documentation Enhancements**:\r\n  + The graphic for the [Hierarchical Parameter Server](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1\u002Fhierarchical_parameter_server\u002Findex.html) library that shows the relationship to other software packages is enhanced.\r\n  + The sample notebook for [Deploy SavedModel using HPS with Triton TensorFlow Backend](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1\u002Fhierarchical_parameter_server\u002Fnotebooks\u002Fhps_tensorflow_triton_deployment_demo.html) is added to the documentation.\r\n  + Style updates to the [Hierarchical Parameter Server API](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.1\u002Fhierarchical_parameter_server\u002Fapi\u002Findex.html) documentation.\r\n\r\n+ **Issues Fixed**:\r\n  + The `InteractionLayer` class is fixed so that it works correctly 
with `num_feas > 30`.\r\n  + The cuBLASLt configuration is corrected by increasing the workspace size and adding the epilogue mask.\r\n  + The NVTabular-based preprocessing script for our samples that demonstrate feature crossing is fixed.\r\n  + The async data reader is fixed. Previously, it would hang and cause a corruption issue due to an improper I\u002FO block size and I\u002FO alignment problem.\r\n    The `AsyncParam` class is changed to implement the fix.\r\n    The `io_block_size` argument is replaced by the `max_nr_request` argument and the actual I\u002FO block size that the async reader uses is computed accordingly.\r","2022-10-17T07:32:30",{"id":257,"version":258,"summary_zh":259,"released_at":260},292033,"v4.0","## What's New in Version 4.0\r\n\r\n+ **3G Embedding Stabilization**:\r\nSince the introduction of the next generation of HugeCTR embedding in v3.7, several updates and enhancements were made, including code refactoring to improve usability.\r\nThe enhancements for this release are as follows:\r\n  + Optimized the performance for sparse lookup in terms of inter-warp load imbalance.\r\n    Sparse Operation Kit (SOK) takes advantage of the enhancement to improve performance.\r\n  + This release includes a fix for determining the maximum embedding vector size in the `GlobalEmbeddingData` and `LocalEmbeddingData` classes.\r\n  + Version 1.1.4 of Sparse Operation Kit can be installed with pip and includes the enhancements mentioned in the preceding bullets.\r\n\r\n+ **Embedding Cache Initialization with Configurable Ratio**:\r\nIn previous releases, the default value for the `cache_refresh_percentage_per_iteration` parameter of the [InferenceParams](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.0\u002Fhugectr_parameter_server.html#inference-parameters-and-embedding-cache-configuration) was `0.1`.\r\n\r\n  In this release, the default value is `0.0` and the parameter serves an additional purpose.\r\n  If you set the parameter to a 
value greater than `0.0` and also set `use_gpu_embedding_cache` to `True` for a model, when Hierarchical Parameter Server (HPS) starts, HPS initializes the embedding cache for the model on the GPU by loading a subset of the embedding vectors from the sparse files for the model.\r\n  When embedding cache initialization is used, HPS creates log records when it starts at the INFO level.\r\n  The logging records are similar to `EC initialization for model: \"\u003Cmodel-name>\", num_tables: \u003Cint>` and `EC initialization on device: \u003Cint>`.\r\n  This enhancement reduces the duration of the warm up phase.\r\n\r\n+ **Lazy Initialization of HPS Plugin for TensorFlow**:\r\nIn this release, when you deploy a `SavedModel` of TensorFlow with Triton Inference Server, HPS is implicitly initialized when the loaded model is executed for the first time.\r\nIn previous releases, you needed to run `hps.Init(ps_config_file, global_batch_size)` explicitly.\r\nFor more information, see the API documentation for [`hierarchical_parameter_server.Init`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.0\u002Fhierarchical_parameter_server\u002Fapi\u002Finitialize.html#hierarchical_parameter_server.Init).\r\n\r\n+ **Enhancements to the HDFS Backend**:\r\n  + The HDFS Backend is now called IO::HadoopFileSystem.\r\n  + This release includes fixes for memory leaks.\r\n  + This release includes refactoring to generalize the interface for HDFS and S3 as remote filesystems.\r\n  + For more information, see `hadoop_filesystem.hpp` in the [`include\u002Fio`](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Ftree\u002Fv4.0\u002FHugeCTR\u002Finclude\u002Fio) directory of the repository on GitHub.\r\n\r\n+ **Dependency Clarification for Protobuf and Hadoop**:\r\nHadoop and Protobuf are true `third_party` modules now.\r\nDevelopers can now avoid unnecessary and frequent cloning and deletion.\r\n\r\n+ **Finer granularity control for overlap behavior**:\r\nWe 
deprecated the old `overlapped_pipeline` knob and introduced four new knobs, `train_intra_iteration_overlap`\u002F`train_inter_iteration_overlap`\u002F`eval_intra_iteration_overlap`\u002F`eval_inter_iteration_overlap`, to help users better control the overlap behavior. For more information, see the API documentation for [`Solver.CreateSolver`](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fmaster\u002Fapi\u002Fpython_interface.html#createsolver-method).\r\n\r\n+ **Documentation Improvements**:\r\n  + Removed two deprecated tutorials `triton_tf_deploy` and `dump_to_tf`.\r\n  + Previously, the graphics in the [Performance](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.0\u002Fperformance.html) page did not appear.\r\n    This issue is fixed in this release.\r\n  + Previously, the [API documentation](https:\u002F\u002Fnvidia-merlin.github.io\u002FHugeCTR\u002Fv4.0\u002Fhierarchical_parameter_server\u002Fapi\u002Findex.html) for the HPS Plugin for TensorFlow did not show the class information. 
This issue is fixed in this release.\r\n\r\n\r\n+ **Issues Fixed**:\r\n  + Fixed a build error that was triggered in debug mode.\r\n    The error was caused by the newly introduced 3G embedding unit tests.\r\n  + When using the Parquet DataReader, if a parquet dataset file specified in `metadata.json` does not exist, HugeCTR no longer crashes.\r\n    The new behavior is to skip the missing file and display a warning message.\r\n    This change relates to GitHub issue [321](https:\u002F\u002Fgithub.com\u002FNVIDIA-Merlin\u002FHugeCTR\u002Fissues\u002F321).\r\n\r\n+ **Known Issues**:\r\n  + HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources.\r\n    If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:\r\n\r\n    ```shell\r\n      --shm-size=1g --ulimit memlock=-1\r\n    ```\r\n\r\n    See also the NCCL [known issue](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Fnccl\u002Fuser-guide\u002Fdocs\u002Ftroubleshooting.html#sharing-data) and the GitHu","2022-09-14T09:21:19",{"id":262,"version":263,"summary_zh":264,"released_at":265},292034,"v3.9.1","* fix compatibility issue of cudf 22.06\r\n* some document refactors","2022-09-08T02:30:18"]