[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-baidu-research--DeepBench":3,"similar-baidu-research--DeepBench":101},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":15,"owner_company":17,"owner_location":17,"owner_email":17,"owner_twitter":17,"owner_website":18,"owner_url":19,"languages":20,"stars":41,"forks":42,"last_commit_at":43,"license":44,"difficulty_score":45,"env_os":46,"env_gpu":47,"env_ram":48,"env_deps":49,"category_tags":58,"github_topics":17,"view_count":60,"oss_zip_url":17,"oss_zip_packed_at":17,"status":61,"created_at":62,"updated_at":63,"faqs":64,"releases":100},5292,"baidu-research\u002FDeepBench","DeepBench","Benchmarking Deep Learning operations on different hardware","DeepBench 是由百度研究院推出的开源基准测试工具，专注于评估不同硬件平台上深度学习核心算子的性能表现。尽管深度学习的基础计算原理已十分成熟，但在实际应用中，矩阵乘法、卷积等关键操作会因参数规模、内核实现及硬件特性的不同，呈现出计算受限、带宽受限或占用率受限等多种瓶颈。DeepBench 正是为了解决这一“优化空间巨大且定义模糊”的难题而生，它通过底层标准化的测试用例，回答“哪种硬件在深度学习基础操作上表现最佳”这一核心问题。\n\n该工具主要面向硬件厂商、芯片架构师以及系统软件开发者。与直接测试完整模型的上层框架不同，DeepBench 不依赖具体的深度学习应用模型，而是直接调用 NVIDIA cuDNN、Intel MKL 等底层神经网络库，对训练和推理过程中的密集矩阵运算、卷积及通信操作进行独立评测。其独特亮点在于提供了详尽的训练与推理测试尺寸集合，并坚持使用厂商官方提供的库进行测试，从而最真实地反映广大用户的实际体验。通过量化这些基础算子的性能，DeepBench 帮助开发者精准定位硬件与软件协同优化中的瓶颈，为新一代深度学习处理器的设计与调优提供可靠的数据支撑。","![Baidu Logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_1f594f75ad6e.png)\n\n- [DeepBench](#deepbench)\n- [Types of Operations](#types-of-operations)\n- [Training Benchmark](#training-benchmark)\n- [Inference Benchmark](#inference-benchmark)\n- [Supported Ops & Precision](#supported-ops-and-precision)\n- [Results](#results)\n- [Get Involved](#get-involved)\n- [Getting the Code](#getting-the-code)\n\n\n# DeepBench\n\nThe primary purpose of DeepBench is to benchmark operations that are\nimportant to deep learning on different hardware platforms. 
Although\nthe fundamental computations behind deep learning are well understood,\nthe way they are used in practice can be surprisingly diverse. For\nexample, a matrix multiplication may be compute-bound,\nbandwidth-bound, or occupancy-bound, based on the size of the matrices\nbeing multiplied and the kernel implementation. Because every deep\nlearning model uses these operations with different parameters, the\noptimization space for hardware and software targeting deep learning\nis large and underspecified.\n\nDeepBench attempts to answer the question, \"Which hardware provides\nthe best performance on the basic operations used for deep\nneural networks?\".  We specify these operations at a low level,\nsuitable for use in hardware simulators for groups building new\nprocessors targeted at deep learning. DeepBench includes operations\nand workloads that are important to both training and inference.\n\n## Where does DeepBench fit in? \n\nThe Deep Learning ecosystem consists of several different pieces. \nWe wanted to highlight where DeepBench fits into this ecosystem. \nThe diagram below describes the software and hardware components involved with deep learning.\nAt the very top sit deep learning frameworks like Baidu's [PaddlePaddle](https:\u002F\u002Fgithub.com\u002Fbaidu\u002FPaddle), Theano, \nTensorFlow, Torch, etc. All these frameworks allow deep learning researchers to build models. They include basic building \nblocks like layers which can be connected in different ways to create a model. In order to train the deep learning models, \nthe frameworks work with underlying neural network libraries such as NVIDIA's cuDNN and Intel's MKL. \nThese libraries implement operations such as matrix multiply that are important to deep learning models. 
\nFinally, the models are trained on hardware like NVIDIA GPUs or Intel's Xeon Phi processor.\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_55c73499f733.png\" height=300>\n\nDeepBench uses the neural network libraries to benchmark the performance of basic operations on different hardware.\nIt does not work with deep learning frameworks or deep learning models built for applications. \nWe cannot measure the time required to train an entire model using DeepBench.\nThe performance characteristics of models built for different applications are very different from each other. \nTherefore, we are benchmarking the underlying operations involved in a deep learning model. \nBenchmarking these operations will help raise awareness amongst hardware vendors and software developers \nabout the bottlenecks in deep learning training and inference.\n\n## Methodology\n\nDeepBench consists of a set of basic operations (dense matrix\nmultiplies, convolutions and communication) as well as some recurrent\nlayer types.  There are Excel spreadsheets (`DeepBenchKernels_train.xlsx` & \n`DeepBenchKernels_inference.xlsx`) in this repository that describe all \nof the sizes for training and inference respectively.\n\nFor training, both forward and backward operations are tested. The precision\nrequirements for training and inference are discussed in the sections below.\n\nWe will use vendor supplied libraries even if faster independent\nlibraries exist or faster results have been published. Most users will\ndefault to the vendor supplied libraries and as such the vendor\nsupplied libraries are most representative of users' experience.\n\n## Entry\n\nDeepBench includes training results for seven hardware platforms: NVIDIA's\nTitanX, M40, TitanX Pascal, TitanXp, 1080 Ti, P100 and Intel's Knights\nLanding. Inference results are included for three server platforms, NVIDIA's\nTitanX Pascal, TitanXp and 1080 Ti. 
Inference results are also included \nfor three mobile devices: iPhone 6 & 7 and Raspberry Pi 3. We provide an overview of the\nresults and all results are available in the `results` folder. We will\ngladly accept pull requests for new hardware platforms.\n\n\n# Types of Operations\n\n## Dense Matrix Multiplies\n\nDense matrix multiplies exist in almost all deep neural networks\ntoday.  They are used to implement fully connected layers and vanilla\nRNNs and are building blocks for other types of recurrent layers.\nSometimes they are also used as a quick way to implement novel layer\ntypes for which custom code doesn't exist.\n\nWhen performing the GEMM operation `A * B = C`, either or both of `A`\nand `B` can be optionally transposed. Common terminology to describe a matrix problem \nis the triple (M, N, K), which describes the sizes of the matrices involved, \nand the “op” which tells us which matrices (if any) are transposed. The figure below\ndescribes how the triple (M, N, K) corresponds to the sizes of the matrices being multiplied.\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_1372fe24d1a7.png\" width=\"550\" \u002F>\n\nThe variant where both matrices\nare transposed is not used in neural networks.  The other three\nvariants *are* used, but they need not be implemented as a call to\n`SGEMM` with those transpose descriptors.  Sometimes it can be faster\nto perform an in-place transpose followed by the appropriate\nmultiplication and a transpose back.  
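

To make the bookkeeping concrete, the sketch below is a NumPy stand-in (not the actual DeepBench harness, which calls the vendor BLAS libraries directly) showing how a single (M, N, K) entry can be timed and converted to TeraFLOPS by counting 2·M·N·K floating point operations:

```python
import time
import numpy as np

def benchmark_gemm(M, N, K, a_trans=False, b_trans=False, iters=10):
    """Time C = op(A) * op(B) for one (M, N, K) entry; returns (seconds, TFLOPS)."""
    # Allocate A and B so that op(A) is M x K and op(B) is K x N.
    A = np.random.rand(*((K, M) if a_trans else (M, K))).astype(np.float32)
    B = np.random.rand(*((N, K) if b_trans else (K, N))).astype(np.float32)
    op_a = A.T if a_trans else A
    op_b = B.T if b_trans else B
    op_a @ op_b                              # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        op_a @ op_b
    seconds = (time.perf_counter() - start) / iters
    flops = 2.0 * M * N * K                  # one multiply and one add per (m, n, k)
    return seconds, flops / seconds / 1e12

# One of the speech recognition training sizes listed in the kernel spreadsheet:
secs, tflops = benchmark_gemm(1760, 128, 1760)
```
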
Such optimizations should be\ndetailed in the spreadsheet.\n\nThe constant coefficients alpha and beta should both be 1.0 so that no\nwork is elided.\n\n## Convolutions\n\nConvolutions make up the vast majority of flops in networks that\noperate on images and videos and form important parts of networks used\nfor speech and natural language modeling, thus making them perhaps the\nsingle most important layer from a performance perspective.\n\nConvolutions have 4 or 5 dimensional inputs and outputs giving rise to\na large number of possible orderings for these dimensions.  For the\nfirst version of the benchmark we are only concerned with performance\nin NCHW format, i.e. data is presented as images, feature maps, rows\nand columns.\n\nThere are many techniques for computing convolutions that are optimal\nfor different sizes of the filter and image, including: direct, matrix multiply\nbased, FFT based, and Winograd based approaches.  In the first version\nof this benchmark, we are not concerned about the accuracy of the\ndifferent approaches since the general consensus is that 32-bit\nfloating point is accurate *enough* for each of them. We have noted\nthe approach used for each size in the spreadsheet.\n\n## Recurrent Layers\n\nRecurrent layers are usually made up of some combination of the above\noperations and also simpler operations such as unary or binary\noperations which aren't very compute intensive and generally constitute a\nsmall percentage of overall runtime.  However, the GEMM and\nconvolution operations are relatively small in recurrent layers, \nso the cost of these smaller operations can become significant.  This is especially true if there\nis a high fixed overhead associated with starting a computation.  It\nis also possible to use alternate storage formats for the recurrent\nmatrices because the cost of converting to a new storage format can be\namortized over the many steps of the recurrent computation.  
If this\nis done, the time to convert to and from the custom format should be\nincluded in the overall time.\n\nThese factors lead to many optimization possibilities both within a\ntime step and across a sequence of time steps such that measuring the\nraw performance of the operations is not necessarily\nrepresentative of the performance of an entire recurrent layer.  In\nthis benchmark we focus on only one recurrent layer, even though there\nare even more optimization opportunities if one considers stacks of\nthem.\n\nThe calculation of the inputs should not be included in the time for\nthe recurrent layer calculation since it can be calculated as one\nlarge multiply and then consumed by the actual recurrent calculation.\nSo in h_t = g(Wx_t + Uh_{t-1}), the time for the calculation of Wx_t for\nall t should not be included in the time for the recurrent layer.\n\nThe backward calculation should calculate the updates with respect to\nthe weights but not the inputs.  All the recurrent work is done to\ncalculate the weight updates, so calculating the updates with respect\nto the inputs as well just obscures what we are trying to measure.\n\nDeepBench includes support for three types of recurrent cells: \nvanilla RNNs, LSTMs and GRUs. The non-linearity for vanilla RNNs \nshould be a ReLU.  The internal non-linearities of the LSTM should\nbe the standard operations: sigmoid for the gates and tanh for \nthe activations.  The LSTM should not have peephole connections.\nThe internal non-linearities of the GRU should be sigmoids for the reset and update\ngates. The output gate non-linearity should be a ReLU.\n\n\n## All-Reduce\n\nNeural networks today are often trained across multiple GPUs or even\nmultiple systems, each with multiple GPUs.  There are two main categories of techniques for\ndoing this: synchronous and asynchronous. 
Synchronous techniques rely\non keeping the parameters on all instances of the model synchronized, usually by making\nsure all instances of the model have the same copy of the gradients before taking an\noptimization step.  The\n[Message Passing Interface (MPI)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMessage_Passing_Interface)\nprimitive usually used to perform this\noperation is called All-Reduce. There are many ways to implement\nAll-Reduce based on the number of ranks, the size of the data, and the\ntopology of the network.  This benchmark places no constraints on the\nimplementation other than that it should be\ndeterministic. Asynchronous methods are quite varied and in this\nversion of the benchmark we will not be attempting to test these\nmethods.\n\nIn order to evaluate All-Reduce, we use the following libraries and benchmarks:\n* [NVIDIA's NCCL](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnccl)\n* [Ohio State University (OSU) Benchmarks](http:\u002F\u002Fmvapich.cse.ohio-state.edu\u002Fbenchmarks\u002F)\n* [Baidu's Allreduce](https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002Fbaidu-allreduce\u002F)\n* [Intel's MLSL](https:\u002F\u002Fgithub.com\u002Fintel\u002FMLSL)\n\nThe NCCL library can be built without MPI (for single node) and with MPI (for multinode) as shown in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnccl-tests. \nWe therefore have two versions of NCCL for the single node in the experiments. For multinode experiments,\nwe use only NCCL with MPI, the benchmark from OSU, and Baidu's Allreduce implementation. 
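

For reference, the schedule these libraries commonly implement can be sketched as a ring all-reduce (a reduce-scatter phase followed by an all-gather phase). The simulation below is illustrative only: it runs the ring steps sequentially in NumPy rather than over a real interconnect, and is not part of DeepBench.

```python
import numpy as np

def ring_allreduce(arrays):
    """Simulated ring all-reduce: each entry of `arrays` plays the role of one
    rank's buffer; afterwards every rank holds the elementwise sum."""
    n = len(arrays)
    bufs = [np.array_split(a.astype(np.float64), n) for a in arrays]
    # Reduce-scatter: at step s, rank r adds segment (r - s - 1) mod n
    # received from its left neighbour, rank (r - 1) mod n.
    for s in range(n - 1):
        incoming = [bufs[(r - 1) % n][(r - s - 1) % n].copy() for r in range(n)]
        for r in range(n):
            bufs[r][(r - s - 1) % n] += incoming[r]
    # All-gather: circulate the fully reduced segments around the ring.
    for s in range(n - 1):
        incoming = [bufs[(r - 1) % n][(r - s) % n].copy() for r in range(n)]
        for r in range(n):
            bufs[r][(r - s) % n] = incoming[r]
    return [np.concatenate(b) for b in bufs]

# Four simulated ranks, each contributing a different vector; every rank
# ends up holding the sum 0 + 1 + 2 + 3 = 6 in every position.
reduced = ring_allreduce([np.full(10, float(r)) for r in range(4)])
```
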
\nWe report the shortest latency achieved from all implementations for each configuration.\n\nIntel(R) Machine Learning Scaling Library (Intel(R) MLSL) is a library\nproviding an efficient implementation of communication patterns used in deep learning.\nIn order to evaluate All-Reduce performance, we use the All-Reduce benchmark from OSU.\n\n#### Topology for NVIDIA 8 GPU System\nEach node has two CPU sockets (dual root topology), and each socket has a PCIe root complex.  For each socket there are two PLX switches that are each connected to the CPU socket via 16 lanes of PCIe v3.  There are two GPUs on each PLX switch. All pairs of GPUs communicate simultaneously over 16 lanes of PCIe v3. The two CPU sockets are connected via Intel QPI. The interconnect across nodes is InfiniBand FDR. The figure below shows a schematic diagram of one of our nodes, where all devices connected by the same PCI\nroot complex are encapsulated in a dotted box. In our experiments, P100, TitanX Maxwell and M40 were such systems.\n\n![Topology of NVIDIA GPU system with 8 GPUs](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_296361718c70.png)\n\n#### Topology for NVIDIA 10 GPU System\nEach node has one CPU socket (single root topology) with two PLX switches, each connected to 5 GPUs. The communication among GPUs on the same PLX switch traverses that PLX switch only, whereas \nthe communication to any GPU connected to the other PLX switch requires traversing both PLX switches along with the connecting PCIe bridge. 
In our experiments, TitanX Pascal and 1080Ti were such systems.\n\n#### Topology for Intel Xeon Phi and Omni-Path System\nThe blocking All-Reduce latency is measured on the Intel Xeon Phi processor 7250 on Intel’s internal Endeavor cluster\nwith Intel® Omni-Path Architecture (Intel® OPA) series 100 fabric with fat-tree topology, using Intel MPI 2017 Update 3 and Intel MLSL 2017 Update 2 Preview.\n\n# Training Benchmark\n\nThe training benchmark includes support for all the operations discussed\nabove. The `DeepBenchKernels_train.xlsx` file contains the entire list of\nkernels for the training benchmark.\n\n## Training Precision\n\n\nWhile training deep learning models, researchers typically use \nsingle precision floating point numbers for all compute kernels. \nAcademic research has demonstrated that reduced precision training works \nfor several different models trained on limited datasets. In our experience, \nwe’ve found that 16 bit half precision floating point numbers are \nsufficient to train large deep learning models on large datasets reliably. \nTraining with half precision numbers allows hardware vendors to better \nutilize the available computing power. In addition, the parameters require \nhalf the total storage for the entire model.\n\n\nDeepBench specifies the minimum precision requirements for training. We are specifying \nthe precision for multiply and add for all the operations. **The minimum precision \nfor multiplication and addition is set to 16 and 32 bits respectively.** \nNone of the currently available hardware supports 16 bit multiply and 32 bit accumulate. \nWe will accept results on any hardware platform that satisfies this minimum precision \nrequirement. All results will include the precision that is used for the benchmark.\n\n# Inference Benchmark\n\nBenchmarking inference is a very challenging problem. 
There are many applications \nthat have been enabled by deep learning and each of them has its own unique \nperformance characteristics and requirements. We selected applications for benchmarking\nthat receive high user traffic. We are also including kernels from deep learning models\nthat are used across several different applications.\n\nFor the inference kernels, we cover the same set of operations as the training benchmark, i.e. \nmatrix multiply, convolution and recurrent operations. The kernels have some differences \nfrom their training counterparts. In the next few sections, we discuss the changes needed \nto benchmark inference workloads. The `DeepBenchKernels_inference.xlsx` file contains\nthe complete list of kernels for the inference benchmark.\n\n\n## Deployment Platform\n\nLarge scale real world applications such as image search, language translation and \nspeech recognition are typically deployed on servers located in data centers. The client \nsends the request over the internet which is processed on the remote server hosting the \ndeep learning model. The remote server is typically a powerful machine consisting of many \nprocessors. The memory and compute capabilities are large enough to host very large deep \nlearning models. The downside of deploying the model on the server is that the latency depends \non the network bandwidth between the client and the server. It also requires the user to \nbe connected to the internet. In order to address these issues, several models are being \ndeployed on end devices.  On-device deployment enables deep learning models to have lower \nlatency and to be available regardless of internet connectivity. However, these models \nneed to be smaller in order to fit within the power and memory constraints of mobile and \nembedded devices.\n\nIn DeepBench, we measure the performance of inference kernels on both server and mobile \nplatforms. 
Hardware vendors or users can \nrun the appropriate benchmarks and add their results to the repository. We provide an overview \nof the results below and detailed results are available in the `results\u002Finference` folder. \nWe will gladly accept pull requests for new hardware platforms.\n\n## Inference Batch Size\n\nIn order to meet latency requirements of user requests, most internet applications process \nrequests individually as they arrive at the data center. This makes for a straightforward \napplication where each request is handled by a single thread. However, this is inefficient \nfor two reasons. First, processing requests individually makes the operation bandwidth bound as \nthe processor needs to load the weights of the network. This makes it harder for the processor to \neffectively utilize the on-chip caches. Secondly, the amount of parallelism that can be \nexploited to classify one request is limited, making it difficult to exploit SIMD or multicore \nparallelism. RNNs are especially challenging to deploy because evaluating RNNs sample by sample \nrelies on matrix-vector multiplications, which are bandwidth bound and difficult to parallelize.\n\nTo overcome these issues, we built a batching scheduler called Batch Dispatch which assembles \nstreams of data from user requests into batches before performing forward propagation on these \nbatches. In this case, there is a tradeoff between increased batch size, and consequently \nimproved efficiency, and increased latency. The more we buffer user requests to assemble a large batch, \nthe longer users must wait for their results. This places constraints on the amount of batching we can perform.\n\nIn practice, we’ve seen that batching up to 4 or 5 requests seems to work well for efficiency and \nlatency for data center deployment. 
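

The tradeoff can be illustrated with a toy cost model (the constants below are assumptions chosen for illustration, not measured values): each forward pass pays a fixed cost to load the weights plus a per-sample compute cost, while requests buffered to fill a batch wait longer before being processed.

```python
def batch_latency_throughput(batch_size, arrival_interval_ms=1.0,
                             weight_load_ms=4.0, per_sample_ms=0.5):
    """Toy cost model: a forward pass pays a fixed weight-load cost plus a
    per-sample compute cost; requests arriving every arrival_interval_ms are
    buffered until the batch is full.  Returns (worst-case latency in ms,
    throughput in requests per ms of compute)."""
    buffering = (batch_size - 1) * arrival_interval_ms  # wait seen by the first request
    compute = weight_load_ms + batch_size * per_sample_ms
    return buffering + compute, batch_size / compute

for b in (1, 4, 16):
    latency_ms, throughput = batch_latency_throughput(b)
    # Throughput rises with batch size while the first request waits longer.
```

With these assumed constants, moving from batch size 1 to 4 triples throughput while doubling worst-case latency, mirroring the efficiency/latency tradeoff described above.
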
In the case of deployment on devices, the batch size is limited to 1.\n\n## Inference Precision\n\nDeep neural networks are trained using single precision or half precision floating point numbers. \nThe precision requirements for inference are significantly lower than for training. Several different \nmodels can be deployed with 8 bit representations for inference with little or no loss in accuracy \ncompared to their floating point counterparts. **Therefore, for inference kernels, we’re specifying the \nminimum precision for multiplication and accumulation as 8 and 32 bits respectively.** Since \nnot all hardware platforms support this precision requirement, we will accept results for any \nplatform that satisfies this minimum requirement. All results will include the precision used for the benchmark.\n\nTo benchmark matrix multiplication with 8 bit inputs for ARM processors, \nwe use the Gemmlowp library. Convolution kernels from the ARM Compute Library are used for the convolution benchmark. \nThe ARM Compute Library only supports single precision convolutions. Low precision convolution \nsupport should be available shortly. The ARM Compute Library doesn’t have any support for RNNs. \nTherefore, DeepBench does not include RNN results for ARM devices. We welcome contributions from other \nlibraries that support RNN operations for ARM devices.\n\nFor server deployment, we use the cuDNN and cuBLAS libraries for Nvidia GPUs. For Nvidia GPUs, \nRNN kernels only support single precision and results are reported at that precision. More details regarding \nwhich ops are supported on different processors can be found in later sections.\n\n## Sparse Operations\n\nA sparse neural network is one where most of the weights of the neural network are zero. \nThese zero weights don’t contribute to the predictions of the neural network. Sparse neural \nnetworks reduce the memory and computation footprint, which enables deep learning models to be deployed on \nmobile devices. 
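

To see where the win comes from, the sketch below encodes a roughly 95%-sparse weight matrix in a minimal CSR layout (nonzero values, their column indices, row pointers) and checks that the sparse matrix-vector product matches the dense one while touching far less weight data. It is a NumPy illustration only, not the Eigen or cuSPARSE code paths DeepBench actually benchmarks.

```python
import numpy as np

def to_csr(dense):
    # Minimal CSR encoding: nonzero values, their column indices, and row pointers.
    rows, cols = np.nonzero(dense)
    counts = np.bincount(rows, minlength=dense.shape[0])
    indptr = np.concatenate(([0], np.cumsum(counts)))
    return dense[rows, cols], cols.astype(np.int32), indptr

def csr_matvec(data, indices, indptr, x):
    y = np.zeros(len(indptr) - 1, dtype=data.dtype)
    for r in range(len(y)):
        lo, hi = indptr[r], indptr[r + 1]
        y[r] = data[lo:hi] @ x[indices[lo:hi]]
    return y

rng = np.random.default_rng(0)
n = 512
W = rng.standard_normal((n, n)).astype(np.float32)
W[rng.random((n, n)) < 0.95] = 0.0            # ~95% of the weights are zero
x = rng.standard_normal(n).astype(np.float32)

data, indices, indptr = to_csr(W)
assert np.allclose(W @ x, csr_matvec(data, indices, indptr, x), atol=1e-3)

dense_bytes = W.nbytes
sparse_bytes = data.nbytes + indices.nbytes + indptr.nbytes
ratio = dense_bytes / sparse_bytes            # roughly 10x less weight data here
```
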
Inference performance of RNNs is dominated by the memory bandwidth of the hardware, \nsince most of the work is simply reading in the parameters at every time step. Moving from a dense \ncalculation to a sparse one comes with a penalty, but if the sparsity factor is large enough, then \nthe smaller amount of data required by the sparse routines becomes a win.\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_cfb82c290332.png\" width=\"550\" \u002F>\n\nThe more powerful server class processors used in data centers can generally perform inference quickly \nenough to serve one user, but in the data center performance per dollar is very important. Techniques \nsuch as sparsity that allow models to be evaluated faster enable more users to be served per GPU, \nincreasing the effective performance per dollar.\n\nThere has been a lot of progress in developing sparse neural networks in the past couple of years. DeepBench \nincludes sparse matrix vector and sparse matrix multiply kernels. Based on our research, we’ve learnt \nthat neural networks with 90 to 95% sparsity can achieve relatively good performance compared to their \ndense baselines. However, current implementations of sparse matrix multiply are optimized for much higher \nsparsity (around 99% or higher). By including sparse kernels, we’re hoping to incentivize hardware vendors \nand software developers to build libraries that provide good performance for sparsity in the range of 90-95%.\n\nWe use the Eigen library to benchmark sparse operations on ARM devices. For GPU benchmarks, we use the cuSPARSE \nlibrary from Nvidia.\n\n## Measuring Latency\n\nMany inference applications have real time latency requirements. For example, speech interfaces \nrequire speech recognition models to return a result without a delay that is noticeable to a user. \nDeepBench kernels can be used as a starting point to measure the best case latency of individual \noperations. 
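

A hedged sketch of such a starting point (the helper and constants are illustrative, not part of DeepBench): run warm-up iterations first, then report the minimum over many timed runs as the best-case latency of the operation.

```python
import time

def best_case_latency(op, warmup=10, iters=200):
    """Best-case latency of a single operation, in seconds.

    Warm-up runs absorb one-time costs (allocations, cold caches, power
    state transitions) so they do not inflate the measurement; taking the
    minimum over many runs filters out interference from other processes."""
    for _ in range(warmup):
        op()
    best = float("inf")
    for _ in range(iters):
        start = time.perf_counter()
        op()
        best = min(best, time.perf_counter() - start)
    return best

# Example: best-case latency of a small pure-Python workload.
latency = best_case_latency(lambda: sum(range(10_000)))
```
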
However, measuring full system latency is outside the scope of this release of DeepBench, \ngiven the focus on basic operations rather than complete applications. For example, a complete \napplication running on a mobile device might need to modify the power state of the system when \nstarting up. In another example, a complete server application might have a significant latency \ncomponent that is determined by a user’s network connection to the server. We may consider \naddressing operation latency in a future version of DeepBench.\n\n# Supported Ops and Precision\nIn this section, we document the support for the various operations across precisions for different processors.\nAs far as possible, we pick the precision that closely matches the minimum required precision. The precision \nrequirements are stated below again. However, there are cases where we need to benchmark higher precision operations. \nThe tables below highlight which operations are benchmarked for each processor.\n\n**Minimum Precision for training**: 16 bit multiply, 32 bit accumulate\n\n**Minimum Precision for inference**: 8 bit multiply, 32 bit accumulate\n\n## Training\n\nSingle precision results are available for seven Nvidia GPUs and Intel's Xeon Phi processor. None of the available\nprocessors support 16 bit multiplication and 32 bit addition. Instead, we benchmark Nvidia's Pseudo FP16 mode\nwhere inputs\u002Foutputs are 16 bit but the compute is still in single precision. 
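

The accumulator precision matters numerically. The hedged NumPy sketch below (illustrative only, not a DeepBench kernel) contrasts accumulating a dot product of 16-bit inputs in a 32-bit accumulator, as the minimum precision spec requires, against naively accumulating in 16 bits:

```python
import numpy as np

def dot_accumulate(a, b, acc_dtype):
    # Explicit multiply-accumulate loop so the accumulator precision is visible.
    acc = acc_dtype(0)
    for x, y in zip(a, b):
        acc = acc_dtype(acc + acc_dtype(x) * acc_dtype(y))
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)    # 16-bit inputs
b = rng.standard_normal(4096).astype(np.float16)

exact = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
fp32_acc = float(dot_accumulate(a, b, np.float32))  # 16-bit inputs, 32-bit accumulate
fp16_acc = float(dot_accumulate(a, b, np.float16))  # everything in 16 bits

# The 32-bit accumulator stays much closer to the exact result; with a
# 16-bit accumulator, rounding error compounds across the 4096 additions.
err32 = abs(fp32_acc - exact)
err16 = abs(fp16_acc - exact)
```
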
Support for mixed precision training\nis available in upcoming hardware processors.\n\n| Processor               | Single precision   | FP16 inputs\u002FFP32 math   | FP16 inputs \u002F Mixed Precision Math |\n| ----------------------- | ------------------ | ----------------------- | ---------------------------------- |\n| Nvidia TitanX Maxwell   | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia Tesla M40        | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia 1080Ti           | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia TitanX Pascal    | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia TitanXp          | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia Tesla P100       | GEMM, Conv, RNN    | GEMM, Conv, RNN         |                                    |\n| Nvidia Tesla V100       | GEMM, Conv, RNN    |                         | GEMM, Conv, RNN                    |\n| Intel Xeon Phi 7250     | GEMM, Conv         |                         |                                    |\n\n\n## Server Deployment\n\nThe GEMM and convolution benchmarks are run with 8 bit multiplication and 32 bit accumulate on \nNVIDIA processors. However, NVIDIA GPUs don't support all input sizes for this precision mode.\nInput sizes have to be a multiple of 4 to run in this precision mode. We have padded input dimensions \nto be multiples of 4 for all kernels. The cost of padding and discarding extra outputs is small\ncompared to the cost of the operation. The results spreadsheet indicates which of the kernels required\npadding. 
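

The padding trick can be sketched as follows (a hypothetical NumPy helper with integer math standing in for the GPU's int8 path): zero-pad each dimension up to the next multiple of 4, multiply with a 32-bit accumulator, then discard the extra outputs. Zero padding leaves the valid portion of the product unchanged.

```python
import numpy as np

def pad_to_multiple(x, multiple=4):
    """Zero-pad both dimensions of a matrix up to the next multiple
    (hypothetical helper mirroring the padding described above)."""
    pad = [(0, (-d) % multiple) for d in x.shape]
    return np.pad(x, pad)

A = np.ones((1760, 7), dtype=np.int8)           # K = 7 is not a multiple of 4
B = np.ones((7, 66), dtype=np.int8)             # N = 66 is not a multiple of 4
Ap, Bp = pad_to_multiple(A), pad_to_multiple(B)
C = Ap.astype(np.int32) @ Bp.astype(np.int32)   # int8 inputs, 32-bit accumulate
C = C[:A.shape[0], :B.shape[1]]                 # discard the extra padded outputs
```
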
Sparse operation and recurrent kernel results are reported in single precision since \nthe relevant libraries don't support low precision.\n\n| Processor                  | Single Precision             | Int8 multiply\u002F32 bit accumulate | \n|-----------------------|------------------|-----------------------|\n| Nvidia 1080Ti              | RNN, Sparse GEMM | GEMM, Conv                                 |\n| Nvidia TitanX Pascal       | RNN, Sparse GEMM | GEMM, Conv                                 |\n| Nvidia TitanXp             | RNN, Sparse GEMM | GEMM, Conv                                 |\n\n## Device Deployment\n\nThe table below describes the inference device kernel results available for different processors, ops and \nprecisions. We don't have any results for RNNs since no ARM libraries support RNNs. The ARM Compute Library\nis not yet supported on the iPhone.\n\n| Processor                  | Single Precision             | Int8 inputs\u002F32 bit math | \n|-----------------------|------------------|-----------------------|\n| Raspberry Pi 3        | Conv                         | GEMM, Sparse GEMM               |\n| iPhone 6                   |                              | GEMM, Sparse GEMM               |\n| iPhone 7                   |                              | GEMM, Sparse GEMM               |\n\n\n# Results\nIn this section, we document the performance of a few operations. \nThese are picked at random and are only meant to demonstrate the performance for a few applications.\n__The results below only include the time and TeraFLOPS for the fastest processor for the particular operation and parameters. The full results can be found in the `results` folder__. \n\nThe precision used for benchmarking the training and inference processors is listed at the top of the results file. 
\n\nTraining results can be found in the `results\u002Ftraining` folder which contains the following files:\n\n* `DeepBench_IA_KNL7250.xlsx`: Training results on Intel's Xeon Phi Processor\n* `DeepBench_NV_TitanX.xlsx`: Training results on NVIDIA's TitanX GPU\n* `DeepBench_NV_M40.xlsx`: Training results on NVIDIA's M40 GPU\n* `DeepBench_NV_TitanX_Pascal.xlsx`: Training results on NVIDIA's TitanX Pascal GPU\n* `DeepBench_NV_TitanXp.xlsx`: Training results on NVIDIA's TitanXp GPU\n* `DeepBench_NV_1080Ti.xlsx`: Training results on NVIDIA's 1080 Ti GPU\n* `DeepBench_NV_P100.xlsx`: Training results on NVIDIA's P100 GPU\n* `DeepBench_NV_V100.xlsx`: Training results on NVIDIA's V100 GPU\n\nDetailed inference results can be found in the `results\u002Finference` folder which contains the following files:\n* `server\u002FDeepBench_NV_TitanXp.xlsx`: Inference results on NVIDIA's TitanXp GPU\n* `server\u002FDeepBench_NV_TitanX_Pascal.xlsx`: Inference results on NVIDIA's TitanX Pascal GPU\n* `server\u002FDeepBench_NV_1080Ti.xlsx`: Inference results on NVIDIA's 1080 Ti GPU\n* `device\u002FDeepBench_iPhone_7.xlsx`: Inference results on iPhone 7\n* `device\u002FDeepBench_iPhone_6.xlsx`: Inference results on iPhone 6\n* `device\u002FDeepBench_Raspberry_Pi_3.xlsx`: Inference results on Raspberry Pi 3\n\nThe software libraries (e.g. cuDNN, OpenMPI) used to benchmark performance are mentioned in each of the Excel workbooks in the `Specs` sheet.\nPlease feel free to ask us any clarifying questions.\n\nResults on more hardware platforms will be added once they are available. 
We welcome contributions from all hardware vendors.\n\n## Training Results\n\n### GEMM Results\n\n| Kernel                 | A Transpose | B Transpose | Application        | Time (ms) | TeraFLOPS | Processor     |\n|------------------------|-------------|-------------|--------------------|--------------|-----------|---------------|\n| M=1760, N=128, K=1760  | N           | N           | Speech Recognition | 0.07         | 10.72      | Tesla V100 Mixed Precision |\n| M=7860, N=64, K=2560   | N           | N           | Speech Recognition | 0.10         | 25.94      | Tesla V100 Mixed Precision |\n| M=2560, N=64, K=2560   | N           | N           | Speech Recognition | 0.08         | 10.11      | Tesla V100 Mixed Precision |\n| M=5124, N=9124, K=2560 | T           | N           | Speech Recognition | 8.73         | 27.43      | Tesla V100 Mixed Precision |\n| M=3072, N=128, K=1024  | T           | N           | Speech Recognition | 0.04         | 18.73      | Tesla V100 Mixed Precision |\n\n### Convolution Results\n\n| Input Size                        | Filter Size     | # of Filters   | Padding (h, w)   | Stride (h, w)   | Application          | Total Time (ms)   | Fwd TeraFLOPS   | Processor       |\n| --------------------------------- | --------------- | -------------- | ---------------- | --------------- | -------------------- | ----------------- | --------------- | --------------- |\n| W = 700, H = 161, C = 1, N = 32   | R = 5, S = 20   | 32             | 0, 0             | 2, 2            | Speech Recognition   | 1.53              | 7.75            | Tesla V100 FP32 |\n| W = 54, H = 54, C = 64, N = 8     | R = 3, S = 3    | 64             | 1, 1             | 1, 1            | Face Recognition     | 0.55              | 10.12           | Tesla V100 FP32 |\n| W = 224, H = 224, C = 3, N = 16   | R = 3, S = 3    | 64             | 1, 1             | 1, 1            | Computer Vision      | 2.40              | 1.40            | Tesla V100 FP32 |\n| W = 7, H = 7, 
 C = 512, N = 16    | R = 3, S = 3    | 512            | 1, 1             | 1, 1            | Computer Vision      | 0.70              | 14.56           | Tesla V100 Mixed Precision |\n| W = 28, H = 28, C = 192, N = 16   | R = 5, S = 5    | 32             | 2, 2             | 1, 1            | Computer Vision      | 0.93              | 16.90           | Tesla V100 FP32  |\n\n### Recurrent Ops Results\n\nThe recurrent op kernels are only run on NVIDIA hardware.\n\n| Hidden Units   | Batch Size   | TimeSteps   | Recurrent Type   | Application           | Total Time (ms) | Fwd TeraFLOPS   | Processor       |\n| -------------- | ------------ | ----------- | ---------------- | --------------------- | ------------    | --------------- | --------------- |\n| 1760           | 16           | 50          | Vanilla          | Speech Recognition    | 8.21            | 1.19            | Tesla V100 Mixed Precision |\n| 2560           | 32           | 50          | Vanilla          | Speech Recognition    | 10.50           | 4.08            | Tesla V100 Mixed Precision |\n| 1024           | 128          | 25          | LSTM             | Machine Translation   | 5.56            | 10.91           | Tesla V100 Mixed Precision |\n| 2816           | 32           | 1500        | GRU              | Speech Recognition    | 380.04          | 11.85           | Tesla V100 Mixed Precision |\n\n### All-Reduce Results\n\n| Size (# of floats) | Number of Processors | Application        | Time (ms)   | Bandwidth (GB\u002Fs) | Processor                           |\n|--------------------|----------------------|--------------------|-------------|------------------|-------------------------------------|\n| 16777216           | 8                    | Speech Recognition | 8.66        | 61.99            | Xeon Phi 7250 with Intel® Omni-Path |\n| 16777216           | 16                   | Speech Recognition | 14.72       | 72.94            | Xeon Phi 7250 with Intel® Omni-Path |\n| 16777216           | 
32                   | Speech Recognition | 19          | 113.03           | Xeon Phi 7250 with Intel® Omni-Path |\n| 64500000           | 32                   | Speech Recognition | 76.68       | 107.67           | Xeon Phi 7250 with Intel® Omni-Path |\n\n## Inference Server Results\n\nThe next few sections provide a few results for GEMM, Convolution and Recurrent operations for inference kernels on\nserver platforms. Results on Intel platforms should be available shortly.\n\n### GEMM Results\n\n| Kernel                 | Application        | Results (ms) | TeraFLOPS | Processor |\n|------------------------|--------------------|--------------|-----------|-----------|\n| M=5124, N=700, K=2048  | Speech Recognition | 0.46         | 31.94     | 1080 Ti   |\n| M=35, N=700, K=2048    | Speech Recognition | 0.05         | 2.09      | 1080 Ti   |\n| M=3072, N=3000, K=1024 | Speech Recognition | 0.49         | 38.36     | Titan Xp  |\n| M=512, N=6000, K=2816  | Speech Recognition | 0.43         | 40.71     | Titan Xp  |\n\n### Sparse GEMM Results\n\n| Kernel                 | Sparsity | Application        | Results (ms) | Speedup wrt dense | TeraFLOPS | Processor |\n|------------------------|----------|--------------------|--------------|-------------------|-----------|-----------|\n| M=7680, N=1, K=2560    | 0.95     |Speech Recognition  | 0.03         | 6.56              | 1.10      | 1080 Ti   |\n| M=7680, N=2, K=2560    | 0.95     |Speech Recognition  | 0.04         | 5.93              | 1.74      | 1080 Ti   |\n| M=7680, N=1500, K=2560 | 0.95     |Speech Recognition  | 29.81        | 0.16              | 1.88      | TitanXp   |\n| M=10752, N=1, K=3584   | 0.9      | Speech Recognition | 0.1          | 4                 | 0.72      | TitanXp   |\n\n### Convolution Results\n\n| Input Size                     | Filter Size   | # of Filters | Padding (h, w) | Stride (h, w) | Application        | Time (ms) | TeraFLOPS | Processor     
|\n|--------------------------------|---------------|--------------|----------------|---------------|--------------------|-----------|-----------|---------------|\n| W = 341, H = 79, C = 32, N = 4 | R = 5, S = 10 | 32           | 0,0            | 2,2           | Speech Recognition | 0.29      | 9.03      | TitanXp       |\n| W = 224, H = 224, C = 3, N = 1 | R = 7, S = 7  | 64           | 3, 3           | 2, 2          | Computer Vision    | 0.14      | 1.64      | TitanXp       |\n| W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1  | 128          | 0, 0           | 2, 2          | Computer Vision    | 0.015     | 3.43      | TitanX Pascal |\n| W = 7, H = 7,  C = 512, N = 2  | R = 1, S = 1  | 2048         | 0, 0           | 1, 1          | Computer Vision    | 0.018     | 11.42     | 1080 Ti       |\n\n### RNN Results\n\n| Hidden Units | Batch Size | TimeSteps | Recurrent Type | Application                  | Results (ms) | Fwd TeraFLOPS | Processor |\n|--------------|------------|-----------|----------------|------------------------------|------------|---------------|-----------|\n| 1536         | 4          | 50        | LSTM           | Language Modelling           |   6.93      |  0.55             |  TitanXp         |\n| 256          | 4          | 150       | LSTM           | Character Language Modelling |   1.63         |  0.19             |   1080 Ti        |\n| 2816         | 1          | 1500      | GRU            | Speech Recognition           |   350.62         |  0.20             | TitanXp          |\n| 2560    |   2         |   375        |  GRU              | Speech Recognition       |  75.02          |     0.39 | TitanXp          |\n\n## Inference Device Results\n\n### GEMM Results\n\n| Kernel                 |  Application        | Results (ms) | GigaFLOPS | Processor     |\n|------------------------|--------------------|--------------|-----------|---------------|\n| M=5124, N=700, K=2048  | Speech Recognition | 212.84         | 69.03      | iPhone 7 
|\n| M=35, N=700, K=2048   | Speech Recognition | 1.94         | 51.69      | iPhone 7 |\n| M=3072, N=1500, K=1024 | Speech Recognition | 136.63         | 69.07      | iPhone 7 |\n\n### Sparse GEMM Results\n\n| Kernel                 | Sparsity | Application        | Results (ms) | Speedup wrt dense | GigaFLOPS | Processor |\n|------------------------|----------|--------------------|--------------|-------------------|-----------|-----------|\n| M=7680, N=1, K=2560    | 0.95     | Speech Recognition | 1.01         | 15.55             | 18.55     | iPhone 7  |\n| M=7680, N=1500, K=2560 | 0.95     | Speech Recognition | 1677.36      | 5.46              | 16.70     | iPhone 7  |\n| M=7680, N=1, K=2560    | 0.9      | Speech Recognition | 2.1          | 8.02              | 8.41      | iPhone 7  |\n\n### Convolution Results\n\n| Input Size                      | Filter Size  | # of Filters | Padding (h, w) | Stride (h, w) | Application     | Time (ms) | GigaFLOPS | Processor      |\n|---------------------------------|--------------|--------------|----------------|---------------|-----------------|-----------|-----------|----------------|\n| W = 112, H = 112, C = 64, N = 1 | R = 1, S = 1 | 64           | 0, 0           | 1, 1          | Computer Vision | 670.75    | 0.15      | Raspberry Pi 3 |\n| W = 56, H = 56, C = 256, N = 1  | R = 1, S = 1 | 128          | 0, 0           | 2, 2          | Computer Vision | 185.87    | 0.28      | Raspberry Pi 3 |\n| W = 7, H = 7,  C = 512, N = 1   | R = 1, S = 1 | 2048         | 0, 0           | 1, 1          | Computer Vision | 735.28    | 0.14      | Raspberry Pi 3 |\n\n# Get Involved\n\nWe welcome contributions from the community to DeepBench. You can contribute in two ways:\n\n1. Deep Learning Researchers\u002FEngineers: If you are a deep learning researcher or engineer working on a new deep learning application, you may have different operations and\u002For workloads involved in training your model. 
We are interested in learning more about the underlying operations that are adversely impacting the performance (speed) of your model. Please contribute these operations and workloads!\n2. Hardware Vendors: We would gladly accept contributions from other hardware vendors. We're open to accepting benchmark results from large companies or smaller startups building hardware for training deep learning models. Please contribute benchmark results for your hardware!\n\n# Getting the Code\nTo get the code, simply clone the GitHub repo:\n\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\n```\n\n# NVIDIA Benchmarks\n## Compiling\n\nIn order to build the benchmarks, you will need to specify the following paths:\n```\nMPI_PATH: Path to MPI library. The benchmarks have been tested with OpenMPI version 1.10.2.\nCUDA_PATH: Path to CUDA library. The benchmarks have been tested with version 7.5.18.\nCUDNN_PATH: Path to CUDNN library. The benchmarks have been tested with version 5.0.\nNCCL_PATH: Path to NCCL library. The NCCL library is available at https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnccl. The benchmarks have been tested with commit b3a9e1333d9e2e1b8553b5843ba1ba4f7c79739d.\n```\n\nTo build all the benchmarks, please use the following command:\n```\ncd code\u002F\nmake CUDA_PATH=\u003Ccuda_path> CUDNN_PATH=\u003Ccudnn_path> MPI_PATH=\u003Cmpi_path> NCCL_PATH=\u003Cnccl_path>\n```\n\nFor distributions that split their MPI headers and libraries (e.g. RHEL, Fedora, CentOS) into separate directories, you should also specify the path to the include files:\n\n```\nMPI_INCLUDE_PATH=\u003Cmpi_include_path>\n```\n\nYou need to build the code for the appropriate architecture. By default, the architecture version is set to 5.2. This works for the TitanX and Tesla M40 GPUs. 
In order to build the benchmark for another architecture (such as Pascal with version 6.1), please append the following variable to the `make` command:\n\n```\nARCH=sm_61 ## Just an example for Pascal architecture\n```\n\nIn some cases, it may be useful to generate benchmarking executables for multiple architectures. For example, some systems may have multiple graphics processors with different architectures installed. The NVIDIA compiler (nvcc) supports the generation of \"fat binaries\" that contain intermediate and compiled code for multiple target architectures. To compile for multiple architectures, add a comma-separated list of architectures to the `make` command line.\n\n```\nARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62,sm_70     # Everything since Kepler!\n```\nNote that compilation for multiple architectures will take longer than compilation for a single architecture. Also, not all CUDA versions support all architectures. For example, support for sm_60 (and later) requires CUDA 8 or later.\n\nFor inference problems with `int8` precision, the convolution and GEMM kernels need to be padded to be multiples of 4. By default, the kernels are padded and results are reported with padding. To disable padding, please use the following build option. When padding is disabled, the benchmark numbers aren't reported for the kernels that aren't supported. \n\n```\nmake gemm PAD_KERNELS=0\nmake conv PAD_KERNELS=0\n```\n\nIn order to use Tensor Cores on NVIDIA's V100 processor, you need to use CUDA 9.0 and cuDNN 7.0 or higher. With the correct libraries installed, add the following option to the `make` command:\n\n```\nmake USE_TENSOR_CORES=1 ARCH=sm_70\n```\nConvolution operations running on Tensor Cores need input and output channels to be a multiple of 8. 
The benchmarks currently pad the input channels to be a multiple of 8 and report padded numbers.\n\n## Running the Benchmarks\n\nOnce compilation completes successfully, the executables will be\ngenerated in the `bin` directory. Before executing the benchmarks, it\nis important to set your `LD_LIBRARY_PATH` correctly. For bash shells,\nplease use:\n\n```\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003Ccuda_path>:\u003Ccudnn_path>:\u003Cmpi_path>:\u003Cnccl_path>\n```\n\n\nThe GEMM, convolution, recurrent op and sparse GEMM benchmarks can be run by calling\nthe respective executables. Here is some of the output from the GEMM \nbenchmark:\n\n```\n~\u002FDeepBench\u002Fcode$ bin\u002Fgemm_bench\n                  Running training benchmark \n                         Times\n----------------------------------------------------------------------------------------\n    m       n      k      a_t     b_t      precision  time (usec) \n   1760     16   1760      0      0        float          180\n   1760     32   1760      0      0        float          182\n   1760     64   1760      0      0        float          247\n   1760    128   1760      0      0        float          318\n```\n\nBy default, the benchmarks are run with training problems. The default \nprecision for benchmarking is determined based on the CUDA and cudnn \nlibrary versions. The mode (inference or training) and precision can be specified on the command line \nusing: \n\n```\nbin\u002Fgemm_bench \u003Cinference|train> \u003Cint8|float|half>\n```\n\nEach of the benchmark files includes a note indicating which precision is\nsupported for different GPUs. \n\nTo execute the NCCL single All-Reduce benchmark, you need to specify\nthe number of GPUs as an argument. 
Please note that the number of GPUs\nmust not be greater than the number of GPUs visible in your system.\n\n```\nbin\u002Fnccl_single_all_reduce \u003Cnum_gpus>\n```\n\nThe NCCL MPI All-Reduce benchmark can be run using `mpirun` as shown below:\n\n```\nmpirun -np \u003Cnum_ranks> bin\u002Fnccl_mpi_all_reduce\n```\n`num_ranks` cannot be greater than the number of GPUs in the system.\n\nThe `osu_allreduce` benchmark can be executed using `mpirun` as follows:\n```\nmpirun -np \u003Cnum_processes> bin\u002Fosu_allreduce\n```\n\nThe `osu_allreduce` benchmark can be run with more processes than\nGPUs. However, all our experiments were conducted with each process\nrunning on a single GPU.\n\n# Baidu Benchmarks\n## Compiling\n\nIn order to build the benchmarks, you will need to specify the following paths:\n```\nMPI_PATH: Path to MPI library. The benchmarks have been tested with OpenMPI version 2.0.1.\nCUDA_PATH: Path to CUDA library. The benchmarks have been tested with version 8.0.61.\nBAIDU_ALLREDUCE_PATH: Path to Baidu's allreduce implementation, which is available at https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002Fbaidu-allreduce\u002F.\n```\n\nTo build all the benchmarks, please use the following command:\n```\ncd code\u002F\nmake CUDA_PATH=\u003Ccuda_path> MPI_PATH=\u003Cmpi_path> BAIDU_ALLREDUCE_PATH=\u003Cbaidu_allreduce_path>\n```\n\nFor distributions that split their MPI headers and libraries (e.g. RHEL, Fedora, CentOS) into separate directories, you should also specify the path to the include files:\n\n```\nMPI_INCLUDE_PATH=\u003Cmpi_include_path>\n```\n\nPlease set the ARCH parameter for the appropriate architecture as discussed above in the NVIDIA Benchmarks section.\n\n## Running the Benchmarks\n\nOnce compilation completes successfully, the executables will be\ngenerated in the `bin` directory. Before executing the benchmarks, it\nis important to set your `LD_LIBRARY_PATH` correctly. 
For bash shells,\nplease use:\n\n```\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003Ccuda_path>:\u003Cmpi_path>:\u003Cbaidu_allreduce_path>\n```\n\nThe Baidu All-Reduce benchmark can be run using `mpirun` as shown below:\n\n```\nmpirun -np \u003Cnum_ranks> bin\u002Fring_all_reduce\n```\n`num_ranks` is used as the total number of GPUs in the system.\n\n# Intel Benchmarks\n## Compiling and Running the Benchmarks\n\nSource all the Intel tools (icc, mkl, mpi) into the path:\n\n```\nsource \u003Cicc_installdir>\u002Fbin\u002Fcompilervars.sh intel64\nsource \u003Cmkl_installdir>\u002Fbin\u002Fmklvars.sh intel64\nsource \u003Cimpi_installdir>\u002Fbin\u002Fmpivars.sh intel64\nsource \u003Cmlsl_installdir>\u002Fintel64\u002Fbin\u002Fmlslvars.sh\n```\n\nTo run the Intel GEMM benchmark (MKL 2017):\n\n```\ncode\u002Fintel\u002Fsgemm\u002Frun_mkl_sgemm_ia.sh\n```\n\nTo run the Intel convolution benchmark (MKL 2017 and libxsmm, an open-source KNL-optimized convolution implementation):\n\n```\ncode\u002Fintel\u002Fconvolution\u002Frun_conv_ia.sh\n```\n\nThe Intel All-Reduce benchmarks use the standard OSU benchmark compiled and run with Intel MPI or with Intel MLSL.\n\nIn order to build the Intel All-Reduce benchmarks, you will need to specify the following paths:\n```\nMPI_PATH: Path to Intel MPI library ($I_MPI_ROOT by default). The benchmarks have been tested with Intel MPI 2017 Update 3.\nMLSL_PATH: Path to Intel MLSL library ($MLSL_ROOT by default). 
The benchmarks have been tested with Intel MLSL 2017 Update 2 Preview.\n```\nand use the \"Makefile_ia\" makefile.\n\nFor example (building with default paths):\n```\nmake -f Makefile_ia all\n```\n\nRunning the Intel All-Reduce benchmarks:\n```\ncode\u002Fosu_allreduce\u002Frun_allreduce_ia.sh \u003Chostfile> \u003Callreduce_binary>\n```\n\nThere are two possible values for \u003Callreduce_binary>:\n* osu_allreduce - benchmark for blocking All-Reduce over MPI\n* mlsl_osu_allreduce - benchmark for blocking All-Reduce over MLSL\n\nThe performance of blocking All-Reduce over MLSL is reported in the DeepBench result files.\n\nFor example, to run the All-Reduce benchmark over MLSL, create a hostfile with one hostname per line\nand run the script as follows:\n```\ncode\u002Fosu_allreduce\u002Frun_allreduce_ia.sh \u003Chostfile> mlsl_osu_allreduce\n```\nThe script runs the benchmark at different scales (2, 4, 8, 16, and 32 nodes) and with DeepBench-specific message sizes.\nThe benchmark reports the average latency.\n\nFor example, here is benchmark output on 32 KNL\u002FOPA nodes:\n```\n# Size         Avg Latency(ms)\n100000                    0.31\n3097600                   3.59\n4194304                   4.67\n6553600                   7.17\n16777217                 16.80\n38360000                 56.65\n64500000                 75.77\n```\n\n# ARM Benchmarks\n\nThe ARM benchmarks in DeepBench are compiled and run on 64-bit ARM v8 processors. \nThe `Makefile` in the `code\u002Farm` folder only supports this processor. In order to benchmark\nother processors, you will have to modify the `Makefile` to support them. \n\n## GEMM Benchmark\n\nThe ARM GEMM benchmark uses the [Gemmlowp](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fgemmlowp) library\nfor `int8` kernels. This library is included as a submodule in the DeepBench repository. 
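The GigaFLOPS and TeraFLOPS columns in the results tables above follow directly from the kernel dimensions and the measured time, so you can sanity-check your own runs against the published numbers. Here is a minimal Python sketch; the `gemm_gigaflops` helper is illustrative (not part of DeepBench) and assumes the standard 2·M·N·K operation count for a dense GEMM, which matches the tables:

```python
# Achieved throughput for a dense GEMM C = A * B, where A is MxK and
# B is KxN: each of the M*N output elements needs K multiplies and
# K adds, giving 2*M*N*K operations in total.
def gemm_gigaflops(m, n, k, time_ms):
    ops = 2.0 * m * n * k
    return ops / (time_ms * 1e-3) / 1e9  # operations per second, in GigaFLOPS

# iPhone 7 GEMM entry from the table above:
# M=5124, N=700, K=2048 in 212.84 ms -> ~69 GigaFLOPS
print(round(gemm_gigaflops(5124, 700, 2048, 212.84), 2))  # 69.03
```

The same count reproduces the server tables; for example, the 1080 Ti entry (M=5124, N=700, K=2048 in 0.46 ms) works out to roughly 31,938 GigaFLOPS, i.e. the 31.94 TeraFLOPS reported above.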
\nTo build and run the benchmark, simply run:\n```\n.\u002Frun_gemm_bench.sh\n```\n\n## Convolution Benchmark\nThe ARM Convolution benchmark uses the [ARM Compute Library](https:\u002F\u002Fgithub.com\u002FARM-software\u002FComputeLibrary).\nTo build the benchmark, you need to specify the include and lib paths for ARM compute library:\n```\nARM_COMPUTE_INCLUDE_PATH: Path to ARM Compute Library \nARM_COMPUTE_LIB_PATH: Path to ARM Compute library binary\n```\nTo build and run the benchmark, please use:\n```\nmake conv ARM_COMPUTE_INCLUDE_PATH=\u003Cpath> ARM_COMPUTE_LIB_PATH=\u003Cpath>\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003Cpath to arm compute library binary>\nbin\u002Fconv_bench\n```\n\n## Sparse GEMM Benchmark\nThe Sparse GEMM Benchmark uses the [Eigen](http:\u002F\u002Feigen.tuxfamily.org\u002Findex.php?title=Main_Page) library. \nTo build the benchmark, you need to download the eigen library and specify the path:\n```\nEIGEN_PATH: path to Eigen library\n```\n\nTo compile and run the benchmark, please use the following command:\n```\nmake sparse EIGEN_PATH=\u003Cpath>\nbin\u002Fsparse_bench\n```\n\n# AMD Benchmarks\n\n## Prerequisites\n* A ROCm enabled platform, more info [here](https:\u002F\u002Frocm.github.io\u002Finstall.html).\n* [MIOpen](https:\u002F\u002Fgithub.com\u002FROCmSoftwarePlatform\u002FMIOpen) - HIP backend of MIOpen is required.\n* [rocBLAS](https:\u002F\u002Fgithub.com\u002FROCmSoftwarePlatform\u002FrocBLAS)\n\nAt present only `fp32 train` benchmarks are enabled.\n\n## Compiling\n\nThe `Makefile` in `code\u002Famd` is for an AMD `gfx900` GPU. 
To benchmark other generations, please modify the `Makefile` accordingly.\n\nSet your environment variables before compiling and running:\n\n```\nexport PATH=PATH_TO_ROCM\u002Fbin:$PATH\nexport CPATH=PATH_TO_MIOPEN\u002Finclude:$CPATH\nexport LIBRARY_PATH=PATH_TO_MIOPEN\u002Flib:$LIBRARY_PATH\nexport LD_LIBRARY_PATH=PATH_TO_MIOPEN\u002Flib:PATH_TO_MIOPENGEMM\u002Flib:$LD_LIBRARY_PATH\n```\n\nTo compile the convolution, RNN, and GEMM benchmarks, run:\n\n```\nmake conv rnn gemm\n```\n\n## Running the Benchmarks\nAfter successful compilation, the executables will be generated in the `bin` directory.\n\nTo benchmark convolutions:\n```\nbin\u002Fconv_bench\n```\n\nTo benchmark RNN:\n```\nbin\u002Frnn_bench\n```\n\nTo benchmark GEMM:\n```\nbin\u002Fgemm_bench\n```\n","![百度Logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_1f594f75ad6e.png)\n\n- [DeepBench](#deepbench)\n- [操作类型](#types-of-operations)\n- [训练基准测试](#training-benchmark)\n- [推理基准测试](#inference-benchmark)\n- [支持的操作与精度](#supported-ops-and-precision)\n- [结果](#results)\n- [参与方式](#get-involved)\n- [获取代码](#getting-the-code)\n\n\n# DeepBench\n\nDeepBench 的主要目的是在不同硬件平台上对深度学习中重要的操作进行基准测试。尽管深度学习背后的基本计算原理已被充分理解，但在实际应用中，这些计算的使用方式却可能出人意料地多样化。例如，矩阵乘法可能是计算受限、带宽受限或占用率受限，具体取决于所乘矩阵的大小以及内核实现方式。由于每个深度学习模型都会以不同的参数使用这些操作，因此面向深度学习的硬件和软件优化空间非常大且尚未完全明确。\n\nDeepBench 试图回答这样一个问题："哪种硬件在用于深度神经网络的基本操作上能够提供最佳性能？"我们以适合用于构建针对深度学习的新处理器的硬件模拟器使用的低级别形式来定义这些操作。DeepBench 包含对训练和推理都至关重要的操作和工作负载。\n\n## DeepBench 在整个生态体系中的定位是？\n\n深度学习生态系统由多个不同的组成部分构成。我们希望阐明 DeepBench 在这一生态体系中的位置。下图描述了深度学习涉及的软硬件组件。最顶层是深度学习框架，如百度的 PaddlePaddle、Theano、TensorFlow、Torch 等。这些框架允许深度学习研究人员构建模型，它们包含诸如层等基本构建模块，可以通过不同的连接方式组合成一个完整的模型。为了训练深度学习模型，这些框架会与底层的神经网络库协同工作，例如 NVIDIA 的 cuDNN 和 Intel 的 MKL。这些库实现了诸如矩阵乘法等对深度学习模型至关重要的操作。最后，模型会在 NVIDIA GPU 或 Intel Xeon Phi 处理器等硬件上进行训练。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_55c73499f733.png\" height=300>\n\nDeepBench 
利用这些神经网络库来评估不同硬件上基本操作的性能。它不直接与深度学习框架或为特定应用构建的深度学习模型交互。我们无法通过 DeepBench 测量训练整个模型所需的时间。针对不同应用构建的模型其性能特征差异很大，因此我们专注于对深度学习模型中涉及的底层操作进行基准测试。对这些操作的基准测试将有助于提高硬件供应商和软件开发者对深度学习训练和推理过程中瓶颈的认识。\n\n## 方法论\n\nDeepBench 包含一组基本操作（密集矩阵乘法、卷积和通信）以及一些循环层类型。本仓库中提供了两个 Excel 表格（`DeepBenchKernels_train.xlsx` 和 `DeepBenchKernels_inference.xlsx`），分别详细列出了用于训练和推理的所有尺寸配置。\n\n对于训练，正向和反向操作都会被测试。训练和推理所需的精度要求将在下文讨论。\n\n我们将使用厂商提供的库，即使存在更快的独立库或已发表过更快的结果。大多数用户默认会使用厂商提供的库，因此这些库更能代表用户的实际体验。\n\n## 入场\n\nDeepBench 包含七个硬件平台的训练结果：NVIDIA 的 TitanX、M40、TitanX Pascal、TitanXp、1080 Ti、P100 以及 Intel 的 Knights Landing。推理结果则涵盖三个服务器平台：NVIDIA 的 TitanX Pascal、TitanXp 和 1080 Ti。此外，还包含了三个移动设备的推理结果：iPhone 6 和 7 以及 Raspberry Pi 3。我们提供了结果概览，所有结果均可在 `results` 文件夹中找到。我们也欢迎针对新硬件平台的拉取请求。\n\n\n# 操作类型\n\n## 密集矩阵乘法\n\n如今，几乎所有的深度神经网络中都存在密集矩阵乘法。它们用于实现全连接层和普通的 RNN，并且是其他类型循环层的基础构建模块。有时，它们也被用作一种快速实现新型层的方式，尤其是在没有相应自定义代码的情况下。\n\n在执行 GEMM 操作 `A * B = C` 时，`A` 和 `B` 中的任意一个或两个都可以选择转置。描述矩阵问题的常用术语是三元组 (M, N, K)，它表示参与运算的矩阵大小，而“op”则告诉我们哪些矩阵（如果有的话）需要转置。下图展示了三元组 (M, N, K) 如何对应于待乘矩阵的大小。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_1372fe24d1a7.png\" width=\"550\" \u002F>\n\n其中两个矩阵都转置的情况在神经网络中并不常见。另外三种情况则会被使用，但并不一定需要通过带有相应转置描述符的 `SGEMM` 调用来实现。有时，先就地转置一次，再进行适当的乘法运算，最后再转置回来，反而会更高效。这类优化应在电子表格中详细说明。\n\n常数系数 alpha 和 beta 均应设为 1.0，以确保不会省略任何计算步骤。\n\n## 卷积\n\n卷积占据了处理图像和视频的网络中绝大多数的浮点运算量，同时也是语音和自然语言建模等网络的重要组成部分，因此从性能角度来看，它们或许是最重要的层。\n\n卷积的输入和输出具有 4 或 5 维，这导致这些维度存在大量的排列组合可能性。在基准测试的第一个版本中，我们仅关注 NCHW 格式下的性能，即数据以图像、特征图、行和列的形式呈现。\n\n有许多计算卷积的技术，它们针对不同大小的滤波器和图像各有优势，包括直接计算、基于矩阵乘法、基于 FFT 以及基于 Winograd 的方法。在本次基准测试的第一个版本中，我们并不关心不同方法的精确性，因为普遍认为 32 位浮点数对于这些方法来说已经足够准确了。我们在电子表格中记录了每种尺寸所采用的方法。\n\n## 
循环层\n\n循环层通常由上述操作的某种组合以及一些更简单的操作（如一元或二元操作）构成。这些简单操作的计算量不大，通常只占总运行时间的一小部分。然而，在循环层中，GEMM和卷积操作所占的比例相对较小，因此这些小型操作的成本可能会变得显著。特别是在启动计算时存在较高固定开销的情况下，这种影响尤为明显。此外，还可以为循环矩阵使用替代存储格式，因为转换到新存储格式的开销可以在循环计算的多个步骤中分摊。如果采用这种方法，则自定义格式与常规格式之间的转换时间也应计入总时间。\n\n这些因素导致了在单个时间步长内以及整个序列中的多种优化可能性，因此单纯测量操作的原始性能并不一定能反映整个循环层的实际性能。在本基准测试中，我们仅关注一个循环层，尽管如果考虑多个循环层的堆叠，优化机会会更多。\n\n输入的计算不应计入循环层的计算时间，因为它可以作为一个大型乘法运算完成，随后被实际的循环计算所消耗。例如，在公式 h_t = g(Wx_t + Uh_t-1) 中，所有 t 对应的 Wx_t 的计算时间都不应计入循环层的时间。\n\n反向传播计算应仅计算相对于权重的更新，而不应计算相对于输入的更新。循环层的所有工作都是为了计算权重更新，因此同时计算输入的更新只会掩盖我们想要测量的内容。\n\nDeepBench 支持三种类型的循环单元：普通 RNN、LSTM 和 GRU。普通 RNN 的非线性激活函数应为 ReLU。LSTM 的内部非线性操作应遵循标准设置——门控使用 sigmoid 函数，而状态更新使用 tanh 函数。LSTM 不应包含窥视连接。GRU 的内部结构应使用 sigmoid 函数作为重置门和更新门的激活函数，而输出门的非线性激活函数则应为 ReLU。\n\n\n## All-Reduce\n\n如今，神经网络训练常常跨多个 GPU 甚至多个系统进行，每个系统可能配备多个 GPU。实现这一目标的主要技术分为同步和异步两类。同步方法依赖于保持模型所有实例的参数一致，通常通过确保所有实例拥有相同的梯度副本后再执行优化步骤来实现。用于执行此操作的 [消息传递接口 (MPI)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMessage_Passing_Interface) 原语称为 All-Reduce。All-Reduce 的具体实现方式因进程数、数据规模及网络拓扑的不同而有多种选择。本基准测试对实现方式没有限制，只要求其具有确定性即可。异步方法则更加多样化，在本版本的基准测试中，我们暂不对其性能进行评估。\n\n为了评估 All-Reduce 性能，我们使用以下库和基准测试：\n* NVIDIA 的 NCCL\n* 俄亥俄州立大学 (OSU) 基准测试\n* 百度的 Allreduce\n* 英特尔的 MLSL\n\nNCCL 库既可以不使用 MPI（适用于单节点场景），也可以使用 MPI（适用于多节点场景），具体可参考 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnccl-tests。因此，在单节点实验中，我们使用两种不同配置的 NCCL；而在多节点实验中，则仅使用带有 MPI 的 NCCL、OSU 基准测试以及百度的 Allreduce 实现。对于每种配置，我们报告所有实现中取得的最短延迟。\n\n英特尔(R) 机器学习扩展库（Intel(R) MLSL）是一个提供深度学习中常用通信模式高效实现的库。为了评估 All-Reduce 性能，我们使用 OSU 提供的 All-Reduce 基准测试。\n\n#### NVIDIA 8 GPU 系统的拓扑结构\n每个节点配备两个 CPU 插槽（双根拓扑），每个插槽都连接着一个 PCIe 根复合体。每个插槽还配有两台 PLX 交换机，每台交换机通过 16 条 PCIe v3 链路与 CPU 插槽相连。每台 PLX 交换机上安装有两个 GPU，任意一对 GPU 都可通过 16 条 PCIe v3 链路同时进行通信。两个 CPU 插槽之间通过 Intel QPI 连接，节点间的互连则采用 InfiniBand FDR。下图展示了一个节点的示意图，其中所有由同一 PCI 根复合体连接的设备都被框在一个虚线框内。在我们的实验中，P100、TitanX Maxwell 和 M40 就属于此类系统。\n\n![NVIDIA 8 GPU 系统的拓扑结构](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_296361718c70.png)\n\n#### NVIDIA 10 
GPU 系统的拓扑结构\n每个节点配备一个 CPU 插槽（单根拓扑），并连接着两台 PLX 交换机，每台交换机又连接着 5 个 GPU。位于同一 PLX 交换机上的 GPU 之间可以直接通信，而与其他 PLX 交换机上的 GPU 通信则需要依次经过两台 PLX 交换机以及连接它们的 PCIe 桥。在我们的实验中，TitanX Pascal 和 1080Ti 属于此类系统。\n\n#### 英特尔 Xeon Phi 和 Omni-Path 系统的拓扑结构\n阻塞式 All-Reduce 延迟是在英特尔内部 Endeavor 集群上，使用英特尔® Omni-Path 架构（Intel® OPA）系列 100 脂肪树拓扑结构，搭载英特尔 Xeon Phi 处理器 7250，并采用 Intel MPI 2017 Update 3 和 Intel MLSL 2017 Update 2 Preview 版本进行测量的。\n\n# 训练基准测试\n\n训练基准测试支持上述讨论的所有操作。`DeepBenchKernels_train.xlsx` 文件包含了训练基准测试中的全部内核列表。\n\n## 训练精度\n\n\n在训练深度学习模型时，大多数研究者通常对所有计算内核使用单精度浮点数。学术研究表明，在有限数据集上训练的多种模型可以采用低精度训练。根据我们的经验，16位半精度浮点数足以可靠地训练大型深度学习模型和大规模数据集。使用半精度数值进行训练可以使硬件供应商更好地利用现有计算资源。此外，参数所需的存储空间仅为全模型的一半。\n\n\nDeepBench 规定了训练的最低精度要求。我们为所有操作指定了乘法和加法的精度。**乘法和加法的最低精度分别设定为16位和32位。** 目前没有任何可用的硬件支持16位乘法和32位累加。我们将接受任何满足此最低精度要求的硬件平台上的结果。所有结果都将包含用于基准测试的精度设置。\n\n# 推理基准测试\n\n推理性能的基准测试是一项极具挑战性的任务。深度学习已经催生了众多应用，而每种应用都有其独特的性能特征和需求。我们选择的基准测试应用均具有较高的用户流量。此外，我们还纳入了在多个不同应用中广泛使用的深度学习模型中的内核。\n\n\n对于推理内核，我们涵盖了与训练部分相同的运算类型，即矩阵乘法、卷积和循环神经网络运算。这些内核与训练用内核存在一些差异。在接下来的几节中，我们将讨论针对推理工作负载进行基准测试所需的变化。`DeepBenchKernels_inference.xlsx` 文件包含了训练基准测试的完整内核列表。\n\n\n## 部署平台\n\n\n像图像搜索、语言翻译和语音识别这样的大规模实际应用，通常部署在数据中心的服务器上。客户端通过互联网发送请求，由托管深度学习模型的远程服务器进行处理。远程服务器通常是配备多处理器的强大机器，其内存和计算能力足以运行非常庞大的深度学习模型。然而，将模型部署在服务器上的缺点是延迟取决于客户端与服务器之间的网络带宽，并且需要用户保持联网状态。为了解决这些问题，许多模型开始被部署到终端设备上。在设备端部署可以降低延迟，并且无论是否连接互联网都能随时使用。不过，为了适应移动和嵌入式设备的功耗及内存限制，这些模型必须更加精简小巧。\n\n\n在 DeepBench 中，我们同时测量推理内核在服务器和移动平台上的性能。硬件厂商或用户可以运行相应的基准测试，并将结果提交到代码库中。以下是对结果的概述，详细结果则可在 `results\u002Finference` 文件夹中找到。我们欢迎针对新硬件平台的拉取请求。\n\n\n## 推理批大小\n\n\n为了满足用户请求的延迟要求，大多数互联网应用会在请求到达数据中心后逐一处理。这种方式实现起来较为简单，每个请求由一个线程单独处理。然而，这种做法存在两个弊端：首先，逐个处理请求会导致带宽受限，因为处理器需要频繁加载网络权重，从而难以有效利用片上缓存；其次，用于处理单个请求的并行度有限，难以充分利用 SIMD 或多核并行性。尤其是 RNN 模型，由于其逐样本评估依赖于矩阵向量乘法，而这类运算同样受带宽限制且难以并行化，因此部署起来尤为困难。\n\n\n为了解决这些问题，我们开发了一个名为 Batch Dispatch 
的批处理调度器，它会将来自用户请求的数据流汇集为批次，然后再对这些批次执行前向传播。在这种情况下，批大小的增加虽然能提升效率，却也会导致延迟上升。我们缓冲的用户请求越多、批次越大，用户等待结果的时间就越长，这就限制了我们可以进行的批处理规模。\n\n\n实践中，我们发现对于数据中心部署而言，将请求批量控制在4至5个左右，能够在效率和延迟之间取得较好的平衡。而在设备端部署时，批大小则通常限制为1。\n\n\n## 推理精度\n\n\n深度神经网络通常使用单精度或半精度浮点数进行训练。而推理阶段的精度要求则显著低于训练阶段。许多不同的模型在推理时采用8位表示即可，且与浮点模型相比几乎不会损失精度。**因此，对于推理内核，我们规定了乘法和累加的最低精度分别为8位和32位。** 由于并非所有硬件平台都支持这一精度要求，我们将接受任何满足该最低要求的平台上的结果。所有结果都将注明用于基准测试的具体精度。\n\n\n为了在 ARM 处理器上对8位输入的矩阵乘法进行基准测试，我们使用 Gemmlowp 库。卷积基准测试则采用 ARM Compute Library 中的卷积内核。需要注意的是，ARM Compute Library 目前仅支持单精度卷积，低精度卷积的支持预计很快就会推出。此外，ARM Compute Library 尚未提供 RNN 支持，因此 DeepBench 不包含 ARM 设备上的 RNN 结果。我们欢迎其他能够支持 ARM 设备 RNN 运算的库贡献相关内容。\n\n\n对于服务器部署，我们使用 NVIDIA GPU 上的 cuDNN 和 cuBLAS 库。在 NVIDIA GPU 上，RNN 内核仅支持单精度运算，相关结果也以单精度形式报告。关于不同处理器支持的具体运算类型，更多细节将在后续章节中介绍。\n\n## 稀疏运算\n\n稀疏神经网络是指神经网络中大部分权重为零的网络。这些零权重在决定神经网络的预测结果时不起作用。稀疏神经网络可以减少内存和计算开销，从而使深度学习模型能够在移动设备上部署。RNN 的推理性能主要受硬件内存带宽的限制，因为大多数工作只是在每个时间步读取参数。从密集计算转向稀疏计算会带来一定的开销，但如果稀疏因子足够大，则稀疏算法所需的数据量较少，反而会带来优势。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_readme_cfb82c290332.png\" width=\"550\" \u002F>\n\n数据中心使用的更强大的服务器级处理器通常能够以足够快的速度完成推理，从而服务一位用户，但在数据中心环境中，每美元的性能非常重要。诸如稀疏化等技术能够加快模型的评估速度，从而使得每块 GPU 可以服务更多用户，提高单位成本的性能。\n\n过去几年里，稀疏神经网络的发展取得了很大进展。DeepBench 包含稀疏矩阵向量乘和稀疏矩阵乘核。根据我们的研究，我们了解到，稀疏度达到 90% 至 95% 的神经网络，与对应的密集基线相比，仍能获得相对较好的性能。然而，目前稀疏矩阵乘的实现往往针对更高的稀疏度（约 99% 或更高）进行了优化。通过包含稀疏核，我们希望激励硬件厂商和软件开发者构建能够在 90% 至 95% 稀疏度范围内提供良好性能的库。\n\n我们在 ARM 设备上使用 Eigen 库来基准测试稀疏运算。对于 GPU 基准测试，我们则使用 NVIDIA 的 cuSparse 库。\n\n## 延迟的测量\n\n许多推理应用都有实时延迟要求。例如，语音交互界面要求语音识别模型在用户可察觉的时间内返回结果。DeepBench 核可以作为起点，用于测量单个操作的最佳情况延迟。然而，鉴于 DeepBench 此次发布侧重于基础操作而非完整应用，测量整个系统的延迟超出了本次范围。例如，在移动设备上运行的完整应用可能需要在启动时调整系统的功耗状态。又如，一个完整的服务器应用可能会有显著的延迟成分，这部分延迟由用户与服务器之间的网络连接决定。我们可能会在 DeepBench 的未来版本中考虑处理操作延迟的问题。\n\n# 支持的运算与精度\n在本节中，我们记录了不同处理器对各种运算在不同精度下的支持情况。在可能的情况下，我们会选择最接近最低要求精度的精度。以下再次列出精度要求。不过，在某些情况下，我们也需要对更高精度的运算进行基准测试。下表展示了各处理器所基准测试的具体运算。\n\n**训练的最低精度**：16 位乘法，32 位累加\n\n**推理的最低精度**：8 位乘法，32 位累加\n\n## 训练\n\n单精度结果适用于 5 款 NVIDIA 
GPU 和 Intel 的 Xeon Phi 处理器。目前没有可用的处理器支持 16 位乘法和 32 位加法。因此，我们对 NVIDIA 的伪 FP16 模式进行了基准测试，该模式下输入\u002F输出为 16 位，但计算仍以单精度进行。混合精度训练的支持将在未来的硬件处理器中实现。\n\n| 处理器               | 单精度   | FP16 输入\u002FFP32 计算   | FP16 输入 \u002F 混合精度计算 |\n| ----------------------- | ------------------ | ----------------------- | ---------------------------------- |\n| NVIDIA TitanX Maxwell   | GEMM、卷积、RNN    |                         |                                    |\n| NVIDIA Tesla M40        | GEMM、卷积、RNN    |                         |                                    |\n| NVIDIA 1080Ti           | GEMM、卷积、RNN    |                         |                                    |\n| NVIDIA TitanX Pascal    | GEMM、卷积、RNN    |                         |                                    |\n| NVIDIA TitanXp          | GEMM、卷积、RNN    |                         |                                    |\n| NVIDIA Tesla P100       | GEMM、卷积、RNN    | GEMM、卷积、RNN         |                                    |\n| NVIDIA Tesla V100       | GEMM、卷积、RNN    |                         | GEMM、卷积、RNN                    |\n| Intel Xeon Phi 7250     | GEMM、卷积         |                         |                                    |\n\n\n## 服务器部署\n\n在 NVIDIA 处理器上，GEMM 和卷积基准测试采用 8 位乘法和 32 位累加的方式进行。然而，NVIDIA GPU 并不支持该精度模式下的所有输入尺寸。在这种精度模式下，输入尺寸必须是 4 的倍数。为此，我们已将所有核的输入维度填充为 4 的倍数。与运算本身的开销相比，填充和丢弃多余输出的成本微不足道。结果表格中标明了哪些核需要填充。稀疏运算和循环核的结果以单精度报告，因为相关库尚不支持低精度。\n\n| 处理器                  | 单精度             | Int8 乘法\u002F32 位累加 | \n|-----------------------|------------------|-----------------------|\n| NVIDIA 1080Ti              | RNN、稀疏 GEMM | GEMM、卷积                                 |\n| NVIDIA TitanX Pascal       | RNN、稀疏 GEMM | GEMM、卷积                                 |\n| NVIDIA TitanXp             | RNN、稀疏 GEMM | GEMM、卷积                                 |\n\n## 设备部署\n\n下表描述了不同处理器、运算和精度下可用的推理设备核结果。我们没有 RNN 的结果，因为没有任何 ARM 库支持 RNN。ARM Compute 库目前尚未在 iPhone 上得到支持。\n\n| 处理器                  | 单精度             | Int8 输入\u002F32 位计算 
| \n|-----------------------|------------------|-----------------------|\n| Raspberry Pi 3        | 卷积                         | GEMM、稀疏 GEMM               |\n| iPhone 6                   |                              | GEMM、稀疏 GEMM               |\n| iPhone 7                   |                              | GEMM、稀疏 GEMM               |\n\n# 结果\n在这一部分，我们记录了几种操作的性能表现。这些结果是随机选取的，仅用于展示几种应用的性能。\n\n__以下结果仅包含特定操作和参数下最快处理器的时间和 TeraFLOPS 值。完整结果可在 `results` 文件夹中找到__。\n\n用于基准测试训练和推理处理器的精度列于结果文件的顶部。\n\n训练结果位于 `results\u002Ftraining` 文件夹中，包含以下文件：\n\n* `DeepBench_IA_KNL7250.xlsx`：英特尔至强融核处理器上的训练结果\n* `DeepBench_NV_TitanX.xlsx`：英伟达 TitanX 显卡上的训练结果\n* `DeepBench_NV_M40.xlsx`：英伟达 M40 显卡上的训练结果\n* `DeepBench_NV_TitanX_Pascal.xlsx`：英伟达 TitanX Pascal 显卡上的训练结果\n* `DeepBench_NV_TitanXp.xlsx`：英伟达 TitanXp Pascal 显卡上的训练结果\n* `DeepBench_NV_1080Ti.xlxs`：英伟达 1080 Ti 显卡上的训练结果\n* `DeepBench_NV_P100.xlsx`：英伟达 P100 显卡上的训练结果\n* `DeepBench_NV_V100.xlsx`：英伟达 V100 显卡上的训练结果\n\n详细的推理结果位于 `results\u002Finference` 文件夹中，包含以下文件：\n* `server\u002FDeepBench_NV_TitanXp.xlsx`：英伟达 TitanXp 显卡上的推理结果\n* `server\u002FDeepBench_NV_TitanXp.xlsx`：英伟达 TitanXp Pascal 显卡上的推理结果\n* `server\u002FDeepBench_NV_1080Ti.xlxs`：英伟达 1080 Ti 显卡上的推理结果\n* `device\u002FDeepBench_iPhone_7.xlsx`：iPhone 7 上的推理结果\n* `device\u002FDeepBench_iPhone_6.xlsx`：iPhone 6 上的推理结果\n* `device\u002FDeepBench_Raspberry_Pi_3.xlsx`：树莓派 3 上的推理结果\n\n用于基准测试性能的软件库（如 cuDNN、OpenMPI）在每个 Excel 工作簿的 `Specs` 表中均有提及。\n\n如有任何疑问，请随时向我们咨询。\n\n更多硬件平台的结果将在可用时陆续添加。我们欢迎所有硬件厂商的贡献。\n\n## 训练结果\n\n### GEMM 结果\n\n| 内核                 | A 转置 | B 转置 | 应用        | 时间 (ms) | TeraFLOPS | 处理器     |\n|------------------------|---------|-----------|-------------|------------|-----------|------------|\n| M=1760, N=128, K=1760  | 否      | 否        | 语音识别    | 0.07       | 10.72     | Tesla V100 混合精度 |\n| M=7860, N=64, K=2560   | 否      | 否        | 语音识别    | 0.10       | 25.94     | Tesla V100 混合精度 |\n| M=2560, N=64, K=2560   | 否      | 否        | 语音识别    | 0.08       | 10.11     | Tesla 
V100 混合精度 |\n| M=5124, N=9124, K=2560 | 是      | 否        | 语音识别    | 8.73       | 27.43     | Tesla V100 混合精度 |\n| M=3072, N=128, K=1024  | 是      | 否        | 语音识别    | 0.04       | 18.73     | Tesla V100 混合精度 |\n\n### 卷积结果\n\n| 输入大小                        | 滤波器大小     | 滤波器数量   | 填充 (h, w)   | 步幅 (h, w)   | 应用          | 总时间 (ms)   | Fwd TeraFLOPS   | 处理器       |\n| --------------------------------- | --------------- | -------------- | ---------------- | --------------- | -------------------- | ----------------- | --------------- | --------------- |\n| W = 700, H = 161, C = 1, N = 32   | R = 5, S = 20   | 32             | 0, 0             | 2, 2            | 语音识别   | 1.53              | 7.75            | Tesla V100 FP32 |\n| W = 54, H = 54, C = 64, N = 8     | R = 3, S = 3    | 64             | 1, 1             | 1, 1            | 人脸识别     | 0.55              | 10.12           | Tesla V100 FP32 |\n| W = 224, H = 224, C = 3, N = 16   | R = 3, S = 3    | 64             | 1, 1             | 1, 1            | 计算机视觉      | 2.40              | 1.40            | Tesla V100 FP32 |\n| W = 7, H = 7,  C = 512, N = 16    | R = 3, S = 3    | 512            | 1, 1             | 1, 1            | 计算机视觉      | 0.70              | 14.56           | Tesla V100 混合精度 |\n| W = 28, H = 28, C = 192, N = 16   | R = 5, S = 5    | 32             | 2, 2             | 1, 1            | 计算机视觉      | 0.93              | 16.90           | Tesla V100 FP32  |\n\n### 循环运算结果\n\n循环运算内核仅在英伟达硬件上运行。\n\n| 隐藏单元数   | 批量大小   | 时间步数   | 循环类型   | 应用           | 总时间 (ms) | Fwd TeraFLOPS   | 处理器       |\n| -------------- | ------------ | ----------- | ------------ | --------------------- | ------------    | --------------- | --------------- |\n| 1760           | 16           | 50          | Vanilla      | 语音识别    | 8.21            | 1.19            | Tesla V100 混合精度 |\n| 2560           | 32           | 50          | Vanilla      | 语音识别    | 10.50           | 4.08            | Tesla V100 混合精度 |\n| 1024         
  | 128          | 25          | LSTM         | 机器翻译      | 5.56            | 10.91           | Tesla V100 混合精度 |\n| 2816           | 32           | 1500        | GRU          | 语音识别      | 380.04          | 11.85           | Tesla V100 混合精度 |\n\n### 全归约结果\n\n| 大小 (# of floats) | 处理器数量 | 应用        | 时间 (ms)   | 带宽 (GB\u002Fs) | 处理器                           |\n|--------------------|----------------------|--------------------|-------------|------------------|-------------------------------------|\n| 16777216           | 8                    | 语音识别 | 8.66        | 61.99            | 至强融核 7250 配备 Intel® Omni-Path |\n| 16777216           | 16                   | 语音识别 | 14.72       | 72.94            | 至强融核 7250 配备 Intel® Omni-Path |\n| 16777216           | 32                   | 语音识别 | 19          | 113.03           | 至强融核 7250 配备 Intel® Omni-Path |\n| 64500000           | 32                   | 语音识别 | 76.68       | 107.67           | 至强融核 7250 配备 Intel® Omni-Path |\n\n## 推理服务器结果\n\n接下来的部分提供了服务器平台上 GEMM、卷积和循环运算的推理内核的一些结果。英特尔平台的结果将很快公布。\n\n### GEMM 结果\n\n| 内核                 | 应用        | 结果 (ms) | TeraFLOPS | 处理器 |\n|------------------------|-------------|-----------|-----------|---------|\n| M=5124, N=700, K=2048  | 语音识别 | 0.46         | 31.94     | 1080 Ti   |\n| M=35, N=700, K=2048    | 语音识别 | 0.05         | 2.09      | 1080 Ti   |\n| M=3072, N=3000, K=1024 | 语音识别 | 0.49         | 38.36     | Titan Xp  |\n| M=512, N=6000, K=2816  | 语音识别 | 0.43         | 40.71     | Titan Xp  |\n\n### 稀疏 GEMM 结果\n\n| 内核                 | 稀疏度 | 应用        | 结果 (ms) | 相对于密集的加速比 | TeraFLOPS | 处理器     |\n|------------------------|----------|-------------|-----------|--------------------|-----------|------------|\n| M=7680, N=1, K=2560    | 0.95     |语音识别   | 0.03      | 6.56               | 1.10      | 1080 Ti    |\n| M=7680, N=2, K=2560    | 0.95     |语音识别   | 0.04      | 5.93               | 1.74      | 1080 Ti    |\n| M=7680, N=1500, K=2560 | 0.95     |语音识别   | 29.81     | 0.16         
      | 1.88      | TitanXp    |\n| M=10752, N=1, K=3584   | 0.9      |语音识别   | 0.1       | 4                  | 0.72      | TitanXp    |\n\n### 卷积结果\n\n| 输入大小                     | 滤波器大小   | 滤波器数量 | 填充 (h, w) | 步幅 (h, w) | 应用        | 时间 (ms) | TeraFLOPS | 处理器     |\n|--------------------------------|---------------|--------------|--------------|-------------|-------------|-----------|-----------|------------|\n| W = 341, H = 79, C = 32, N = 4 | R = 5, S = 10 | 32           | 0,0          | 2,2         |语音识别   | 0.29      | 9.03      | TitanXp    |\n| W = 224, H = 224, C = 3, N = 1 | R = 7, S = 7  | 64           | 3, 3       | 2, 2      |计算机视觉 | 0.14      | 1.64      | TitanXp    |\n| W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1  | 128          | 0, 0       | 2, 2      |计算机视觉 | 0.015     | 3.43      | TitanX Pascal |\n| W = 7, H = 7,  C = 512, N = 2  | R = 1, S = 1  | 2048         | 0, 0       | 1, 1      |计算机视觉 | 0.018     | 11.42     | 1080 Ti    |\n\n### RNN 结果\n\n| 隐藏单元数 | 批量大小 | 时间步长 | 循环类型 | 应用                  | 结果 (ms) | Fwd TeraFLOPS | 处理器     |\n|------------|------------|-----------|----------|-----------------------|-----------|---------------|------------|\n| 1536       | 4          | 50        | LSTM     |语言建模             |   6.93      |  0.55             |  TitanXp         |\n| 256        | 4          | 150       | LSTM     |字符语言建模         |   1.63         |  0.19             |   1080 Ti        |\n| 2816       | 1          | 1500      | GRU      |语音识别             |   350.62         |  0.20             | TitanXp          |\n| 2560    |   2         |   375        |  GRU              |语音识别       |  75.02          |     0.39 | TitanXp          |\n\n## 推理设备结果\n\n### GEMM 结果\n\n| 内核                 |  应用        | 结果 (ms) | GigaFLOPS | 处理器     |\n|------------------------|--------------|-----------|-----------|------------|\n| M=5124, N=700, K=2048  |语音识别   | 212.84         | 69.03      | iPhone 7 |\n| M=35, N=700, K=2048   |语音识别   | 1.94         | 51.69  
    | iPhone 7 |\n| M=3072, N=1500, K=1024 |语音识别   | 136.63         | 69.07      | iPhone 7 |\n\n### 稀疏 GEMM 结果\n\n| 内核                 | 稀疏度 | 应用        | 结果 (ms) | 相对于密集的加速比 | GigaFLOPS | 处理器     |\n|------------------------|----------|-------------|-----------|--------------------|-----------|------------|\n| M=7680, N=1, K=2560    | 0.95     |语音识别   | 1.01      | 15.55             | 18.55     | iPhone 7  |\n| M=7680, N=1500, K=2560 | 0.95     |语音识别   | 1677.36   | 5.46              | 16.70     | iPhone 7  |\n| M=7680, N=1, K=2560    | 0.9      |语音识别   | 2.1       | 8.02              | 8.41      | iPhone 7  |\n\n\n### 卷积结果\n\n| 输入大小                      | 滤波器大小  | 滤波器数量 | 填充 (h, w) | 步幅 (h, w) | 应用     | 时间 (ms) | GigaFLOPS | 处理器      |\n|---------------------------------|--------------|--------------|--------------|-------------|----------|-----------|-----------|-------------|\n| W = 112, H = 112, C = 64, N = 1 | R = 1, S = 1 | 64           | 0, 0       | 1, 1      |计算机视觉 | 670.75    | 0.15      | Raspberry Pi 3 |\n| W = 56, H = 56, C = 256, N = 1  | R = 1, S = 1 | 128          | 0, 0       | 2, 2      |计算机视觉 | 185.87    | 0.28      | Raspberry Pi 3 |\n| W = 7, H = 7,  C = 512, N = 1   | R = 1, S = 1 | 2048         | 0, 0       | 1, 1      |计算机视觉 | 735.28    | 0.14      | Raspberry Pi 3 |\n\n# 参与进来\n\n我们欢迎社区为 DeepBench 做出贡献。您可以通过两种方式参与：\n\n1. 深度学习研究人员\u002F工程师：如果您是从事新型深度学习应用的研究人员或工程师，您的模型训练可能涉及不同的操作和\u002F或工作负载。我们希望了解那些对模型性能（速度）产生负面影响的基本操作。请贡献这些操作和工作负载！\n2. 
硬件厂商：我们也非常乐意接受其他硬件厂商的贡献。无论是大型公司还是致力于开发深度学习训练硬件的小型初创企业，我们都欢迎提交基准测试结果。请为您的硬件贡献基准测试结果！\n\n# 获取代码\n要获取代码，只需克隆 GitHub 仓库：\n\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\n```\n\n# NVIDIA 基准测试\n\n## 编译\n\n为了构建基准测试，您需要指定以下路径：\n```\nMPI_PATH：MPI 库的路径。这些基准测试已在 OpenMPI 1.10.2 版本上进行过测试。\nCUDA_PATH：CUDA 库的路径。这些基准测试已在 7.5.18 版本上进行过测试。\nCUDNN_PATH：cuDNN 库的路径。这些基准测试已在 5.0 版本上进行过测试。\nNCCL_PATH：NCCL 库的路径。NCCL 库可在 https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnccl 上获取。这些基准测试已在 commit b3a9e1333d9e2e1b8553b5843ba1ba4f7c79739d 上进行过测试。\n```\n\n要构建所有基准测试，请使用以下命令：\n```\ncd code\u002F\nmake CUDA_PATH=\u003Ccuda_path> CUDNN_PATH=\u003Ccudnn_path> MPI_PATH=\u003Cmpi_path> NCCL_PATH=\u003Cnccl_path>\n```\n\n对于将 MPI 头文件和库文件分开存储的发行版（例如 RHEL、Fedora、CentOS），您还应指定包含文件的路径：\n\n```\nMPI_INCLUDE_PATH=\u003Cmpi_include_path>\n```\n\n您需要为适当的架构编译代码。默认情况下，架构版本设置为 5.2，这适用于 TitanX 和 Tesla M40 GPU。若要为其他架构（如 Pascal 架构，版本为 6.1）编译基准测试，请在 `make` 命令中添加以下变量：\n\n```\nARCH=sm_61 ## Pascal 架构的示例\n```\n\n在某些情况下，为多个架构生成基准测试可执行文件可能会很有用。例如，某些系统可能安装了具有不同架构的多个图形处理器。NVIDIA 编译器 (nvcc) 支持生成“胖二进制文件”，其中包含针对多个目标架构的中间代码和已编译代码。要为多个架构编译，请在 `make` 命令行中添加一个以逗号分隔的架构列表。\n\n```\nARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62,sm_70     # Kepler 以来的所有架构！\n```\n\n请注意，为多个架构编译所需的时间会比为单个架构编译更长。此外，并非所有 CUDA 版本都支持所有架构。例如，对 sm_60（及更高版本）的支持需要 CUDA 8 或更高版本。\n\n\n对于采用 `int8` 精度的推理问题，卷积和 GEMM 内核需要填充到 4 的倍数。默认情况下，内核会被填充，并且结果也会以填充后的形式报告。若要禁用填充，请使用以下编译选项。当填充被禁用时，对于不支持的内核，基准测试结果将不会被报告。\n\n```\nmake gemm PAD_KERNELS=0\nmake conv PAD_KERNELS=0\n```\n\n要在 NVIDIA V100 处理器上使用 Tensor Cores，您需要使用 CUDA 9.0 和 cuDNN 7.0 或更高版本。在使用正确的库的情况下，将以下选项添加到 make 命令中：\n\n```\nmake USE_TENSOR_CORES=1 ARCH=sm_70\n```\n\n运行 Tensor Cores 的卷积操作要求输入和输出通道数必须是 8 的倍数。目前，基准测试会将输入通道数填充到 8 的倍数，并报告填充后的数值。\n\n## 运行基准测试\n\n编译成功完成后，可执行文件将在 `bin` 目录中生成。在运行基准测试之前，正确设置 `LD_LIBRARY_PATH` 非常重要。对于 bash shell，请使用：\n\n```\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003Ccuda_path>:\u003Ccudnn_path>:\u003Cmpi_path>:\u003Cnccl_path>\n```\n\n\nGEMM、卷积、循环运算和稀疏 GEMM 
基准测试可以通过调用相应的可执行文件来运行。以下是 GEMM 基准测试的部分输出：\n\n```\n~\u002FDeepBench\u002Fcode$ bin\u002Fgemm_bench\n                  Running training benchmark \n                         Times\n----------------------------------------------------------------------------------------\n    m       n      k      a_t     b_t      precision  time (usec) \n   1760     16   1760      0      0        float          180\n   1760     32   1760      0      0        float          182\n   1760     64   1760      0      0        float          247\n   1760    128   1760      0      0        float          318\n```\n\n默认情况下，基准测试是针对训练问题运行的。基准测试的默认精度取决于 CUDA 和 cuDNN 库的版本。模式（推理或训练）和精度可以在命令行中通过以下方式指定：\n\n```\nbin\u002Fgemm_bench \u003Cinference|train> \u003Cint8|float|half>\n```\n\n每个基准测试文件都包含注释，说明不同 GPU 支持哪些精度。\n\n要运行 NCCL 单节点 All-Reduce 基准测试，您需要将 GPU 数量作为参数指定。请注意，GPU 数量不得超过系统中可见的 GPU 数量。\n\n```\nbin\u002Fnccl_single_all_reduce \u003Cnum_gpus>\n```\n\nNCCL MPI All-Reduce 基准测试可以使用 `mpirun` 运行，如下所示：\n\n```\nmpirun -np \u003Cnum_ranks> bin\u002Fnccl_mpi_all_reduce\n```\n\n`num_ranks` 不能超过系统中的 GPU 数量。\n\n`osu_allreduce` 基准测试也可以使用 mpirun 运行，如下所示：\n\n```\nmpirun -np \u003Cnum_processes> bin\u002Fosu_allreduce\n```\n\n`osu_allreduce` 基准测试可以使用多于 GPU 数量的进程运行。然而，我们所有的实验都是在每个进程运行在一个单独 GPU 上的情况下进行的。\n\n# 百度基准测试\n## 编译\n\n为了构建基准测试，您需要指定以下路径：\n```\nMPI_PATH：MPI 库的路径。这些基准测试已在 OpenMPI 2.0.1 版本上进行过测试。\nCUDA_PATH：CUDA 库的路径。这些基准测试已在 8.0.61 版本上进行过测试。\nBAIDU_ALLREDUCE_PATH：百度 All-Reduce 实现的路径，可在 https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002Fbaidu-allreduce\u002F 获取。\n```\n\n要构建所有基准测试，请使用以下命令：\n\n```\ncd code\u002F\nmake CUDA_PATH=\u003Ccuda_path> MPI_PATH=\u003Cmpi_path> BAIDU_ALLREDUCE_PATH=\u003Cbaidu_allreduce_path>\n```\n\n对于将 MPI 头文件和库文件分开存储的发行版（例如 RHEL、Fedora、CentOS），您还应指定包含文件的路径：\n\n```\nMPI_INCLUDE_PATH=\u003Cmpi_include_path>\n```\n\n请按照上述 NVIDIA 基准测试部分所述，为适当的架构设置 ARCH 参数。\n\n## 运行基准测试\n\n编译成功完成后，可执行文件将在 `bin` 目录中生成。在运行基准测试之前，正确设置 `LD_LIBRARY_PATH` 非常重要。对于 bash shell，请使用：\n\n```\nexport 
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003Ccuda_path>:\u003Cmpi_path>:\u003Cbaidu_allreduce_path>\n```\n\n百度 All-Reduce 基准测试可以使用 `mpirun` 运行，如下所示：\n\n```\nmpirun -np \u003Cnum_ranks> bin\u002Fring_all_reduce\n```\n\n`num_ranks` 用于表示系统中 GPU 的总数。\n\n# 英特尔基准测试\n\n## 编译和运行基准测试\n\n将所有 Intel 工具（icc、mkl、mpi）的路径添加到环境变量中：\n\n```\nsource \u003Cicc_installdir>\u002Fbin\u002Fcompilervars.sh intel64\nsource \u003Cmkl_installdir>\u002Fbin\u002Fmklvars.sh intel64\nsource \u003Cimpi_installdir>\u002Fbin\u002Fmpivars.sh intel64\nsource \u003Cmlsl_installdir>\u002Fintel64\u002Fbin\u002Fmlslvars.sh\n```\n\n运行 Intel GEMM 基准测试（MKL 2017）：\n\n```\ncode\u002Fintel\u002Fsgemm\u002Frun_mkl_sgemm_ia.sh\n```\n\n运行 Intel 卷积基准测试（MKL 2017 和 libxsmm，一个开源的 KNL 优化卷积实现）：\n\n```\ncode\u002Fintel\u002Fconvolution\u002Frun_conv_ia.sh\n```\n\nIntel All-Reduce 基准测试使用标准的 OSU 基准测试程序，该程序通过 Intel MPI 或 Intel MLSL 进行编译和运行。\n\n为了构建 Intel All-Reduce 基准测试，您需要指定以下路径：\n```\nMPI_PATH：Intel MPI 库的路径（默认为 $I_MPI_ROOT）。这些基准测试已在 Intel MPI 2017 Update 3 上进行过测试。\nMLSL_PATH：Intel MLSL 库的路径（默认为 $MLSL_ROOT）。这些基准测试已在 Intel MLSL 2017 Update 2 Preview 上进行过测试。\n```\n并使用 `Makefile_ia` 构建文件。\n\n例如（使用默认路径进行构建）：\n```\nmake -f Makefile_ia all\n```\n\n运行 Intel All-Reduce 基准测试：\n```\ncode\u002Fosu_allreduce\u002Frun_allreduce_ia.sh \u003Chostfile> \u003Callreduce_binary>\n```\n\n\u003Callreduce_binary> 可以有两个值：\n* osu_allreduce：基于 MPI 的阻塞式 All-Reduce 基准测试\n* mlsl_osu_allreduce：基于 MLSL 的阻塞式 All-Reduce 基准测试\n\n基于 MLSL 的阻塞式 All-Reduce 性能会记录在 DeepBench 结果文件中。\n\n例如，要运行基于 MLSL 的 All-Reduce 基准测试，请创建一个每行包含一个主机名的 hostfile，并按如下方式运行脚本：\n```\ncode\u002Fosu_allreduce\u002Frun_allreduce_ia.sh \u003Chostfile> mlsl_osu_allreduce\n```\n脚本将在不同规模（2、4、8、16、32 个节点）以及 DeepBench 特定的消息大小下运行基准测试。基准测试将报告平均延迟指标。\n\n例如，在 32 个 KNL\u002FOPA 节点上的基准测试输出：\n```\n# 大小         平均延迟(ms)\n100000                    0.31\n3097600                   3.59\n4194304                   4.67\n6553600                   7.17\n16777217                 16.80\n38360000                 
56.65\n64500000                 75.77\n```\n\n# ARM 基准测试\n\nDeepBench 中的 ARM 基准测试是在 64 位 ARM v8 处理器上编译和运行的。`code\u002Farm` 文件夹中的 `Makefile` 仅支持此处理器。如果要对其他处理器进行基准测试，您需要修改 `Makefile` 以支持它们。\n\n## GEMM 基准测试\n\nARM GEMM 基准测试使用 [Gemmlowp](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fgemmlowp) 库来实现 `int8` 内核。该库作为子模块包含在 DeepBench 代码库中。要构建并运行该基准测试，只需执行：\n```\n.\u002Frun_gemm_bench.sh\n```\n\n## 卷积基准测试\n\nARM 卷积基准测试使用 [ARM Compute Library](https:\u002F\u002Fgithub.com\u002FARM-software\u002FComputeLibrary)。要构建该基准测试，您需要指定 ARM 计算库的头文件和库文件路径：\n```\nARM_COMPUTE_INCLUDE_PATH：ARM 计算库的头文件路径\nARM_COMPUTE_LIB_PATH：ARM 计算库的二进制文件路径\n```\n构建并运行基准测试时，请使用以下命令：\n```\nmake conv ARM_COMPUTE_INCLUDE_PATH=\u003Cpath> ARM_COMPUTE_LIB_PATH=\u003Cpath>\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003CARM 计算库二进制文件路径>\nbin\u002Fconv_bench\n```\n\n## 稀疏 GEMM 基准测试\n\n稀疏 GEMM 基准测试使用 [Eigen](http:\u002F\u002Feigen.tuxfamily.org\u002Findex.php?title=Main_Page) 库。要构建该基准测试，您需要下载 Eigen 库并指定其路径：\n```\nEIGEN_PATH：Eigen 库的路径\n```\n\n编译并运行基准测试时，请使用以下命令：\n```\nmake sparse EIGEN_PATH=\u003Cpath>\nbin\u002Fsparse_bench\n```\n\n# AMD 基准测试\n\n## 先决条件\n* 支持 ROCm 的平台，更多信息请参见 [这里](https:\u002F\u002Frocm.github.io\u002Finstall.html)。\n* [MIOpen](https:\u002F\u002Fgithub.com\u002FROCmSoftwarePlatform\u002FMIOpen)——需要 MIOpen 的 HIP 后端。\n* [rocBLAS](https:\u002F\u002Fgithub.com\u002FROCmSoftwarePlatform\u002FrocBLAS)\n\n目前仅启用了 `fp32 train` 基准测试。\n\n## 编译\n\n`code\u002Famd` 中的 `Makefile` 适用于 AMD `gfx900` GPU。如果要对其他代的 GPU 进行基准测试，请相应地修改 `Makefile`。\n\n在编译或运行之前设置环境变量：\n\n```\nexport PATH=PATH_TO_ROCM\u002Fbin:$PATH\nexport CPATH=PATH_TO_MIOPEN\u002Finclude:$CPATH\nexport LIBRARY_PATH=PATH_TO_MIOPEN\u002Flib:$LIBRARY_PATH\nexport LD_LIBRARY_PATH=PATH_TO_MIOPEN\u002Flib:PATH_TO_MIOPENGEMM\u002Flib:$LD_LIBRARY_PATH\n```\n\n要编译卷积、RNN 和 GEMM 基准测试，请运行：\n\n```\nmake conv rnn gemm\n```\n\n## 运行基准测试\n\n成功编译后，可执行文件将生成在 `bin` 目录中。\n\n要运行卷积基准测试：\n```\nbin\u002Fconv_bench\n```\n\n要运行 RNN 基准测试：\n```\nbin\u002Frnn_bench\n```\n\n要运行 GEMM 
基准测试：\n```\nbin\u002Fgemm_bench\n```","# DeepBench 快速上手指南\n\nDeepBench 是百度研究院开源的深度学习基准测试工具，旨在评估不同硬件平台在深度学习核心算子（如矩阵乘法、卷积、循环层及通信操作）上的性能表现。它不测试完整的模型训练时间，而是专注于底层算子的执行效率，帮助开发者和硬件厂商识别性能瓶颈。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu 16.04 或更高版本)\n- **硬件支持**:\n  - **GPU**: NVIDIA GPUs (如 TitanX, P100, 1080 Ti 等)，需安装对应的 CUDA 驱动。\n  - **CPU**: Intel Xeon Phi (Knights Landing) 或其他支持 AVX-512 的处理器。\n  - **多卡\u002F多机**: 若测试 All-Reduce 通信性能，需配置多 GPU 或多节点环境（支持 InfiniBand 或高速以太网）。\n- **移动端设备** (可选): iPhone 6\u002F7, Raspberry Pi 3 (需单独交叉编译)。\n\n### 前置依赖\n请确保安装以下基础依赖库：\n- **CUDA Toolkit**: 版本需与显卡驱动匹配 (建议 8.0 或以上)。\n- **cuDNN**: NVIDIA 深度学习加速库。\n- **MPI (Message Passing Interface)**: 用于多节点通信测试 (如 OpenMPI 或 MPICH)。\n- **编译工具**: `gcc`, `g++`, `make`。\n- **Python**: 用于部分脚本处理 (可选，视具体测试结果分析需求而定)。\n- **第三方库**:\n  - [NCCL](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnccl) (NVIDIA 多卡通信库)\n  - [Intel MLSL](https:\u002F\u002Fgithub.com\u002Fintel\u002FMLSL) (Intel 机器学习缩放库，仅 Intel 平台需要)\n\n> **注意**: DeepBench 默认调用厂商提供的优化库（如 cuDNN, MKL），请确保这些库已正确安装并配置在系统路径中。\n\n## 安装步骤\n\n1. **克隆代码仓库**\n   从 GitHub 获取源代码：\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench.git\n   cd DeepBench\n   ```\n   *(注：若国内访问 GitHub 较慢，可尝试使用 Gitee 镜像或配置代理)*\n\n2. **配置编译环境**\n   DeepBench 的 `Makefile` 不读取 `CUDA_HOME` 之类的环境变量，而是通过 `make` 命令行变量指定各依赖库的路径，常用变量包括 `CUDA_PATH`、`CUDNN_PATH`、`MPI_PATH`、`NCCL_PATH`。默认编译架构为 5.2（适用于 TitanX\u002FM40），其他架构请通过 `ARCH` 变量指定（如 Pascal 对应 `ARCH=sm_61`），详见下文各平台的编译说明。\n\n3. **编译项目**\n   DeepBench 包含训练（Training）和推理（Inference）两套基准。以 NVIDIA 平台为例，进入 `code\u002F` 目录执行编译命令：\n   ```bash\n   cd code\u002F\n   make CUDA_PATH=\u003Ccuda_path> CUDNN_PATH=\u003Ccudnn_path> MPI_PATH=\u003Cmpi_path> NCCL_PATH=\u003Cnccl_path>\n   ```\n   若只需编译特定模块（如仅测试 GEMM 或卷积），可指定 target：\n   ```bash\n   make gemm\n   make conv\n   make rnn\n   ```\n\n4. 
**验证安装**\n   编译成功后，将在 `bin\u002F` 目录下生成对应的可执行文件。按 README 设置好 `LD_LIBRARY_PATH` 后，运行任意一个测试以确保环境正常：\n   ```bash\n   .\u002Fbin\u002Fgemm_bench\n   ```\n\n## 基本使用\n\nDeepBench 通过命令行运行不同的基准测试。测试内核参数（如矩阵大小 M, N, K）内置于各基准测试中，完整列表见配套的 Excel 文件 (`DeepBenchKernels_train.xlsx`, `DeepBenchKernels_inference.xlsx`)。\n\n### 1. 运行矩阵乘法 (GEMM) 测试\n测试深度学习中最基础的密集矩阵乘法性能。模式与精度通过位置参数指定：\n```bash\n.\u002Fbin\u002Fgemm_bench train float\n```\n- 第一个参数: 测试模式 (`train` 或 `inference`)，默认针对训练问题运行。\n- 第二个参数: 精度 (`float`, `half`, `int8`，可用精度取决于硬件及 CUDA\u002FcuDNN 版本)。\n\n### 2. 运行卷积 (Convolution) 测试\n测试图像和视频处理中核心的卷积操作性能：\n```bash\n.\u002Fbin\u002Fconv_bench train float\n```\n\n### 3. 运行循环层 (RNN\u002FLSTM\u002FGRU) 测试\n测试序列模型中的循环单元性能：\n```bash\n.\u002Fbin\u002Frnn_bench train float\n```\n循环类型（vanilla RNN、LSTM、GRU）及隐藏单元数、时间步数等内核参数由基准测试内置的测试集定义，无需在命令行指定。\n\n### 4. 运行多卡通信 (All-Reduce) 测试\n测试多 GPU 或多节点间的梯度同步性能。单节点多卡使用 NCCL 基准，参数为 GPU 数量（不得超过系统中可见的 GPU 数量）：\n```bash\n.\u002Fbin\u002Fnccl_single_all_reduce \u003Cnum_gpus>\n```\n多节点场景使用 MPI 版本运行（需配置 MPI 环境，`num_ranks` 不能超过系统中的 GPU 数量）：\n```bash\nmpirun -np \u003Cnum_ranks> .\u002Fbin\u002Fnccl_mpi_all_reduce\n```\n\n### 查看结果\n测试完成后，性能数据（时间、TeraFLOPS 等）会直接输出到终端；`results\u002F` 文件夹中存放的是各硬件平台已提交的官方结果（Excel 工作簿），可用于对比不同硬件或不同库实现的性能差异，以指导后续的模型优化或硬件选型。\n\n> **提示**: 具体的测试用例尺寸请参考项目目录下的 `DeepBenchKernels_train.xlsx` 和 `DeepBenchKernels_inference.xlsx` 文件，这些文件详细列出了官方推荐的测试参数组合。","某芯片初创公司的架构团队正在研发一款专为深度学习训练定制的 AI 加速器，急需验证其核心计算单元在不同矩阵规模下的实际效能。\n\n### 没有 DeepBench 时\n- **评估依据模糊**：团队只能依赖完整的深度学习框架（如 TensorFlow）进行端到端测试，无法剥离出矩阵乘法或卷积等底层算子的独立性能，难以定位是硬件缺陷还是软件调度问题。\n- **优化方向迷失**：面对计算受限、带宽受限或占用率受限等不同瓶颈，缺乏标准化的基准数据来判断当前硬件在特定算子尺寸下究竟卡在哪里。\n- **竞品对比困难**：由于缺乏统一的低层级操作标准，很难将自研芯片与 NVIDIA TitanX 或 Intel Xeon Phi 等成熟平台在同等细粒度下进行公平的性能对标。\n\n### 使用 DeepBench 后\n- **精准定位瓶颈**：DeepBench 直接对稠密矩阵乘法、卷积及通信等基础操作进行微基准测试，帮助团队迅速识别出在特定矩阵尺寸下硬件是受限于算力还是显存带宽。\n- **指导架构迭代**：基于训练和推理场景下不同精度的详细测试数据，架构师能针对性地调整计算单元比例或内存层级设计，避免盲目优化。\n- **建立客观标尺**：利用 DeepBench 预定义的标准化算子集，团队成功将自研芯片与主流 GPU 在同一维度上进行量化对比，用确凿的数据证明了新架构在特定负载下的优势。\n\nDeepBench 
通过剥离框架干扰、聚焦底层算子，为硬件研发者提供了一把衡量深度学习基础性能的精确标尺。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbaidu-research_DeepBench_f4bb6371.png","baidu-research","Baidu Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fbaidu-research_faaa443a.png",null,"http:\u002F\u002Fusa.baidu.com\u002F","https:\u002F\u002Fgithub.com\u002Fbaidu-research",[21,25,29,33,37],{"name":22,"color":23,"percentage":24},"C++","#f34b7d",51.2,{"name":26,"color":27,"percentage":28},"Cuda","#3A4E3A",27.8,{"name":30,"color":31,"percentage":32},"C","#555555",10.2,{"name":34,"color":35,"percentage":36},"Shell","#89e051",6.7,{"name":38,"color":39,"percentage":40},"Makefile","#427819",4.1,1103,241,"2026-03-09T12:56:05","Apache-2.0",4,"Linux","必需。支持 NVIDIA GPU (TitanX, M40, TitanX Pascal, TitanXp, 1080 Ti, P100) 和 Intel Xeon Phi (Knights Landing)。需安装厂商提供的底层库（如 NVIDIA cuDNN, NCCL）。未明确具体显存大小和 CUDA 版本，但需匹配对应硬件的驱动和库版本。","未说明",{"notes":50,"python":48,"dependencies":51},"DeepBench 是一个底层算子基准测试工具，不直接依赖深度学习框架（如 TensorFlow, PyTorch），而是通过调用厂商提供的底层神经网络库（如 cuDNN, MKL）来测试硬件性能。主要测试密集矩阵乘法、卷积、循环层（RNN\u002FLSTM\u002FGRU）及多卡通信（All-Reduce）。多节点测试需要 InfiniBand FDR 互联环境。代码库中包含 Excel 文件定义了具体的测试算子尺寸。",[52,53,54,55,56,57],"NVIDIA cuDNN","Intel MKL","NVIDIA NCCL","MPI (Message Passing Interface)","Baidu Allreduce","Intel MLSL",[59],"其他",2,"ready","2026-03-27T02:49:30.150509","2026-04-08T10:01:13.475147",[65,70,75,80,85,90,95],{"id":66,"question_zh":67,"answer_zh":68,"source_url":69},23994,"DeepBench 的基准测试是如何进行的？Excel 表格中的卷积结果参数 P 和 Q 代表什么？","Excel 表格列出了不同内核的所有可用结果，没有额外的合并结果。对于 Intel XeonPhi 7250 (KNL) 上的所有卷积内核结果，可以查看文件 `DeepBench\u002Fresults\u002FDeepBench_IA_KNL7250.xlsx`。关于具体的 P 和 Q 参数含义，需参考该文件中的具体数据定义，官方未提供额外的解释文档。","https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\u002Fissues\u002F9",{"id":71,"question_zh":72,"answer_zh":73,"source_url":74},23995,"为什么在 P100 GPU 上运行 RNN 基准测试的结果比 M40 差？","这通常是因为深度学习框架需要时间来支持最新版本的 cuDNN。建议报告使用最新驱动程序和 cuDNN 版本（如 V6）的 DeepBench 数据。用户遇到此问题时，应检查是否使用了较旧的 cuDNN 版本（如 
5.1），尝试升级到受框架支持的最新版 cuDNN 可能会改善 P100 的性能表现。","https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\u002Fissues\u002F27",{"id":76,"question_zh":77,"answer_zh":78,"source_url":79},23996,"如何在 RHEL 系统上解决因库文件和头文件路径分离导致的 DeepBench 编译错误？","RHEL 系统将库文件和头文件安装在不同目录。解决方法是在编译时添加 `MPI_INCLUDE_PATH` 变量。具体步骤：\n1. 修改 Makefile，添加 `MPI_INCLUDE_PATH?=\u002Fusr\u002Flocal\u002Fopenmpi`（或实际头文件路径，如 `\u002Fusr\u002Finclude\u002Fopenmpi-x86_64`）。\n2. 将编译命令中的 `-I $(MPI_PATH)\u002Finclude` 替换为 `-I $(MPI_INCLUDE_PATH)`。\n3. 编译时传入参数：`make MPI_INCLUDE_PATH=\u002Fusr\u002Finclude\u002Fopenmpi-x86_64`。\n这样可以在不改变其他系统构建方式的前提下解决 RHEL 的编译问题。","https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\u002Fissues\u002F81",{"id":81,"question_zh":82,"answer_zh":83,"source_url":84},23997,"如何从 Excel 结果文件中推导 LSTM 和 GRU 的 SGEMM 矩阵乘法参数？","LSTM 矩阵乘法的典型参数如下：\n- m = 4 * 隐藏节点数 (hidden nodes)\n- k = 输入特征大小 (input feature size)\n- n = 批次大小 (batch size)\n在深度神经网络中，LSTM 层通常是堆叠的，因此可以假设输入特征大小等于隐藏节点数。该操作会对所有时间步长重复执行。虽然 Excel 文件只提到了隐藏单元和时间步长，但可以通过上述公式计算出实际的 SGEMM 调用参数。","https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\u002Fissues\u002F44",{"id":86,"question_zh":87,"answer_zh":88,"source_url":89},23998,"运行 `run_mkl_conv_ia.sh` 脚本时提示找不到 `run_mkl_conv_ia_SKX.sh` 等子脚本怎么办？","这是一个已知问题，相关脚本可能在仓库中缺失或路径配置有误。维护者已更新相关逻辑以根据 CPU 标志（如 avx512dq, avx512f）自动调用正确的脚本。如果遇到此问题：\n1. 确保拉取了最新的代码版本。\n2. 检查 `intel\u002Fconvolution\u002Fmkl_conv\u002F` 目录下是否存在对应的脚本文件。\n3. 
如果仍然缺失，可能需要手动创建或从补丁中恢复这些脚本，同时注意修复代码中的拼写错误（如 `precicion` 应为 `precision`）。","https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\u002Fissues\u002F43",{"id":91,"question_zh":92,"answer_zh":93,"source_url":94},23999,"在 Xeon 或 KNL 机器上运行 DeepBench 脚本需要多长时间？运行前需要配置哪些 BIOS、软件或硬件设置？","运行时间取决于具体配置，之前有用户反映运行了数小时，部分原因是 `libxsmm` 的问题（现已修复）。关于软件配置，官方测试结果使用的版本如下：\n- ICC 版本：16.0.3 (20160415)\n- Intel MKL 版本：MKL 2017 Gold u1, MKLML (2017.0.1)\n具体的规格和配置详情可以在生成的结果文件中找到。确保安装了上述兼容版本的编译器和分析库，并解决了已知的 `libxsmm` 问题，通常即可正常运行。","https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\u002Fissues\u002F22",{"id":96,"question_zh":97,"answer_zh":98,"source_url":99},24000,"GEMM 基准测试中 TN 情况的维度打印值与传递给 cublasSgemm 的值不一致是怎么回事？","这是一个已确认的代码缺陷。在 `gemm_bench.cu` 文件中，对于 m 不等于 k 的 TN 情况，实际计算的 GEMM 配置与 Excel 文件中报告的配置不一致，导致报告的 GPU 性能数据错误。维护者已经合并了修复该问题的 PR。如果您遇到此问题，请确保更新到包含该修复的最新代码版本，以获取准确的基准测试结果。","https:\u002F\u002Fgithub.com\u002Fbaidu-research\u002FDeepBench\u002Fissues\u002F13",[],[102,118,127,135,144,152],{"id":103,"name":104,"github_repo":105,"description_zh":106,"stars":107,"difficulty_score":60,"last_commit_at":108,"category_tags":109,"status":61},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85013,"2026-04-06T11:09:19",[110,111,112,113,114,59,115,116,117],"图像","数据工具","视频","插件","Agent","语言模型","开发框架","音频",{"id":119,"name":120,"github_repo":121,"description_zh":122,"stars":123,"difficulty_score":124,"last_commit_at":125,"category_tags":126,"status":61},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 
技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[114,110,116,115,59],{"id":128,"name":129,"github_repo":130,"description_zh":131,"stars":132,"difficulty_score":124,"last_commit_at":133,"category_tags":134,"status":61},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75097,"2026-04-07T22:51:14",[115,110,116,59],{"id":136,"name":137,"github_repo":138,"description_zh":139,"stars":140,"difficulty_score":141,"last_commit_at":142,"category_tags":143,"status":61},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 
以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[116,59],{"id":145,"name":146,"github_repo":147,"description_zh":148,"stars":149,"difficulty_score":141,"last_commit_at":150,"category_tags":151,"status":61},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65697,"2026-04-07T23:34:58",[116,59,111],{"id":153,"name":154,"github_repo":155,"description_zh":156,"stars":157,"difficulty_score":60,"last_commit_at":158,"category_tags":159,"status":61},3364,"keras","keras-team\u002Fkeras","Keras 是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[116,111,59]]