Sample Computer Science Assignment: A Survey on Parallel and Distributed Deep Learning

Published: 2022-01-20 15:30:32 | Editor: wangda1203

This post is a sample assignment for international students majoring in computer science, titled "A Survey on Parallel and Distributed Deep Learning"; its abstract and full text follow below.

Abstract

Deep learning is considered one of the most remarkable machine learning techniques in recent years. It has achieved great success in many applications, such as image analysis, speech recognition, and text understanding. It uses supervised and unsupervised methods for classification and pattern recognition tasks, learning multi-level representations and features in hierarchical architectures. Recent developments in parallel and distributed technologies have enabled the processing of big data using deep learning. Although big data offers great opportunities for a wide range of areas, including e-commerce, industrial control, and smart medicine, it poses many challenging issues in data mining and information processing due to its large volume, wide variety, and high velocity. Deep learning has played a major role in big data applications over the past few years.

In this paper, we review emerging research on deep learning models that use parallel and distributed tools, resources, algorithms, and techniques. Furthermore, we point out the remaining challenges of parallel and distributed deep learning and discuss future topics.

Index Terms - Deep learning, Parallel, Distributed, High-performance computing, GPU


1. INTRODUCTION

Machine learning and, in particular, deep learning are rapidly becoming part of many dimensions of our daily lives. Inspired by the integrated nature of the human brain, the Deep Neural Network (DNN) is at the core of deep learning. When properly trained, the expressiveness of DNNs offers precise solutions, obtained solely by analyzing large amounts of data, to problems previously thought unsolvable. Deep learning has been successfully applied in a multitude of fields, ranging from image classification, speech recognition, and medical diagnosis to autonomous driving and defeating human players in complex games. As data sets and the size and complexity of DNNs increase, the computational and storage requirements of deep learning increase proportionately. A high-performance computing cluster is required to train a DNN to today's competitive accuracy. To exploit these systems, different aspects of DNN training and inference (evaluation) have been modified to increase concurrency.



Parallel processors such as GPUs have played an important role in the practical implementation of deep neural networks. The computations that arise when training and using deep neural networks lend themselves naturally to efficient parallel implementation. The performance of these parallel implementations enables researchers to experiment with networks that are significantly larger in size and to train them on larger datasets. This has helped to greatly improve the accuracy of tasks such as speech recognition and object identification, among others. A recognized drawback of the GPU approach is that the training speedup is limited when the model does not fit in GPU memory. To use a GPU efficiently, researchers therefore often reduce the size of the data or parameters so that CPU-to-GPU transfers are not a major bottleneck. Although data and parameter reduction work well for small problems (e.g., acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g., high-resolution images).

Nonetheless, there are two major challenges in designing a distributed deep learning model. First, deep learning models have a large number of parameters, and synchronizing them across nodes whenever they are modified incurs substantial communication overhead. Consequently, scalability, in terms of the training time needed to achieve a given accuracy, is a concern. Second, it is non-trivial for programmers to build and train models with deep and complex system structures. In addition, distributed training increases the burden on programmers, who must, for example, partition data and models and manage network communication. This issue is compounded by the fact that data scientists with little deep learning experience are expected to work with deep learning models.

In this study, we discuss a variety of topics, ranging from vectorization to the efficient use of supercomputers, in the context of parallel and distributed deep learning. We present parallel strategies, evaluate multiple research prototypes and existing tools, and cover extensions to training algorithms and systems that support distributed environments.


2. RELATED WORKS

Some surveys in this field focus on deep learning applications [1], neural networks and their history [2], [3], [4], [5], the scaling of deep learning [6], and hardware architectures for DNNs [7], [8], [9]. Specifically, three surveys [2], [4], [5] describe DNNs and the origins of deep learning methodologies from a historical perspective, and discuss the potential capabilities of DNNs with respect to training and representational power. [4], [5] also describe optimization methods and regularization methods in detail. Bengio [6] discusses the scaling of deep learning from different perspectives, focusing on models, optimization algorithms, and datasets.


The paper also looks at some aspects of distributed computing, including sparse and asynchronous communication. Hardware design surveys focus primarily on the computational side of learning rather than on optimization. These include a recent survey [9] that analyzes computation techniques for DNN operators (layer types) and the mapping of those computations to hardware, leveraging their inherent parallelism.

That survey also discusses compressing data representations (e.g., through quantization) to reduce overall hardware memory bandwidth. Other surveys focus on accelerators for traditional neural networks [7] and on the use of FPGAs in deep learning [8].

3. TERMINOLOGY

This section sets out the theoretical and naming conventions for the material presented in this survey. We first discuss the class of supervised learning problems, followed by the relevant foundations of parallel programming.


A. Deep Neural Network

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN finds the correct mathematical manipulation, whether a linear or a non-linear relationship, to turn the input into an output. The network moves through the layers, calculating the probability of each output. For example, a DNN trained to recognize dog breeds will go over a given image and determine the probability that the dog in the picture belongs to a certain breed. The user can review the results, select which probabilities the network should display (e.g., those above a certain threshold), and obtain the proposed label. Each such mathematical manipulation is considered a layer, and a complex DNN has many layers, hence the name "deep" network.
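As a minimal illustration of this layer-by-layer computation, the sketch below passes a batch of inputs through a small fully connected DNN and produces per-class probabilities with a softmax output. It uses NumPy only; the layer sizes and the `forward` helper are hypothetical and chosen purely for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, layers):
    """Pass a batch x (shape [batch, features]) through a list of (W, b) layers."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        # hidden layers use a non-linearity; the last layer outputs class scores
        if i < len(layers) - 1:
            x = relu(x)
    return softmax(x)  # per-class probabilities, e.g. one per dog breed

# toy example: two hidden layers, five output classes
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((64, 32)) * 0.1, np.zeros(32)),
          (rng.standard_normal((32, 16)) * 0.1, np.zeros(16)),
          (rng.standard_normal((16, 5)) * 0.1, np.zeros(5))]
probs = forward(rng.standard_normal((8, 64)), layers)
print(probs.shape, probs.sum(axis=1))  # (8, 5), each row sums to 1
```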

B. Parallel Computer Architecture

1) Multi-core Computing: A multi-core processor is a processor that includes multiple processing units (known as "cores") on the same chip. A multi-core processor can issue multiple instructions per clock cycle from multiple instruction streams. Each core in a multi-core processor can in principle also be superscalar, i.e., issue multiple instructions from a single thread on every clock cycle.

2) Symmetric Multiprocessing: A symmetric multiprocessor (SMP) is a multi-processor computer system in which the processors share memory and are connected through a bus. Bus contention prevents bus architectures from scaling; as a result, SMPs generally contain no more than 32 processors. Thanks to the small size of the processors and the significant reduction in bus bandwidth requirements achieved by large caches, such symmetric multiprocessors are extremely cost-effective, as long as there is enough memory bandwidth.

3) Distributed Computing: A distributed computer (also known as a distributed-memory multiprocessor) is a computer system in which the processing elements are connected by a network. Distributed computers are highly scalable. There is considerable overlap between the terms "concurrent computing," "parallel computing," and "distributed computing," and there is no clear distinction between them; the same system can be characterized as both "parallel" and "distributed," and the processors in a typical distributed system run concurrently.

C. Parallel Programming

The programming techniques used to implement parallel learning algorithms on parallel computers depend on the target architecture. They range from simple, single-machine threaded implementations to OpenMP. Accelerators are usually programmed with specialized languages such as NVIDIA's CUDA or OpenCL, or with hardware design languages in the case of FPGAs. However, the details are often hidden behind library calls (e.g., cuDNN or MKL-DNN) that implement the time-consuming primitives. On multiple machines with distributed memory, it is possible to use simple communication mechanisms such as TCP/IP or Remote Direct Memory Access (RDMA), or more convenient libraries such as the Message Passing Interface (MPI) or Apache Spark. MPI is a low-level library that aims to deliver portable performance, while Spark is a higher-level framework that focuses more on programmer productivity.
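As an illustration of the MPI route, the hedged sketch below uses mpi4py's `Allreduce` to average a gradient array across ranks, which is the core communication step of distributed data-parallel training. The script name and the random "gradient" are hypothetical, and the sketch is not tied to any particular deep learning framework.

```python
# Run with e.g.: mpiexec -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# each worker computes a local "gradient" (here just random numbers)
local_grad = np.random.default_rng(rank).standard_normal(1000)

# sum the gradients of all workers, then divide to get the average
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print("averaged gradient norm:", np.linalg.norm(global_grad))
```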

4. OPERATOR CONCURRENCY

There are several ways to parallelize the execution of a neural network layer. In many cases (for example, pooling operators), the computations can be parallelized directly. For other types of operators, however, the computations need to be reshaped in order to expose parallelism. Below, we discuss performance modeling and the concurrency analysis of three popular operators.


A. Performance Modeling

It is difficult to estimate the runtime of a single DNN operator, let alone a whole network, even with work and depth models. Using performance modeling, however, other works still manage to approximate the runtime of a given DNN. Using measured operator runtimes as a lookup table, it was possible to predict computation and backpropagation time with a 5-19 percent error across many different batch sizes, even on clusters of GPUs with asynchronous communication [10]. The same was done for CPUs in a distributed system [11], using a similar approach. Paleo [12] derives a performance model from operation counts alone (with a prediction error of 10-30 percent), and Pervasive CNNs [13] uses performance modeling to pick networks with reduced precision to meet users' real-time needs.
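A rough sketch of the lookup-table idea follows: per-operator runtimes are measured once, stored in a table, and a whole-network iteration time is estimated by summing them, with a crude factor for the backward pass. All numbers, operator names, and the `backprop_factor` are hypothetical placeholders, not measurements from [10]-[13].

```python
# Hypothetical per-operator runtimes (milliseconds), measured once per
# (operator type, input size) and stored as a lookup table.
measured_ms = {
    ("conv3x3", 224): 4.1,
    ("conv3x3", 112): 1.6,
    ("fc", 4096): 0.9,
    ("fc", 1000): 0.2,
}

network = [("conv3x3", 224), ("conv3x3", 112), ("fc", 4096), ("fc", 1000)]

def estimate_runtime(net, table, backprop_factor=2.0):
    """Estimate forward + backward time as the sum of table entries.
    backprop_factor roughly assumes the backward pass costs ~2x the forward pass."""
    forward = sum(table[op] for op in net)
    return forward * (1.0 + backprop_factor)

print("estimated iteration time: %.1f ms" % estimate_runtime(network, measured_ms))
```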

B. Fully Connected Layers

A fully connected layer can be expressed and modeled as a matrix multiplication of the weights and the neuron values (one column per batch sample). To that end, efficient linear algebra libraries such as CUBLAS [14] and MKL [15] can be used. [16] presents a variety of methods for further optimizing the CPU implementation of fully connected layers. In particular, the paper shows efficient loop construction, vectorization, blocking, unrolling, and batching. The paper also demonstrates how weights can be quantized to use fixed-point math instead of floating-point.
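The sketch below illustrates both points: the forward pass of a fully connected layer as a single GEMM over the batch, and a simplified symmetric 8-bit fixed-point quantization of the weights. It is a toy NumPy approximation of the ideas in [16], not the paper's actual implementation; all sizes and the scaling scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, in_features, out_features = 32, 512, 256

X = rng.standard_normal((batch, in_features)).astype(np.float32)
W = rng.standard_normal((in_features, out_features)).astype(np.float32) * 0.05
b = np.zeros(out_features, dtype=np.float32)

# forward pass of a fully connected layer: one GEMM over the whole batch
Y = X @ W + b

# simplified symmetric 8-bit fixed-point quantization of the weights
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)          # stored quantized weights
Y_q = (X @ (W_q.astype(np.float32) * scale)) + b   # dequantize for the GEMM

print("max abs error from quantization:", np.abs(Y - Y_q).max())
```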

C. Convolution

The bulk of the computations involved in DNN training and inference are convolutions. As such, significant efforts have been made by the research community and industry to optimize their computation on all platforms. While a convolution operator can be computed directly from its definition, doing so will not make full use of the capabilities of vector processors and multi-core architectures, which are oriented toward many parallel multiply-accumulate operations. Utilization can be increased by ordering operations to maximize data reuse [17], by adding data redundancy, or through a basis transformation. The first occurrence of unrolling convolutions in CNNs [18] used both CPUs and GPUs for training. The approach was subsequently popularized by [19], reshaping the images in the array from 3D tensors to 2D matrices. Each 1D row of the matrix contains an unrolled 2D patch that would normally be convolved (possibly overlapping with other patches), producing redundant data. The convolution kernels are then stored as a 2D matrix, where each column represents an unrolled kernel (one convolution filter). Multiplying these two matrices results in a matrix containing the convolved tensor in 2D format, which can be reshaped to 3D for subsequent operations. Note that this operation can be generalized to 4D tensors (a whole batch), converting it into a single matrix multiplication. Basic DNN libraries, such as CUDNN [20], provide a variety of convolution methods and data formats. To assist users in choosing an algorithm, these libraries include functions that select the best-performing algorithm given the tensor sizes and memory constraints; internally, the libraries may run all methods and pick the fastest one.
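A minimal NumPy sketch of the unrolling (im2col) approach is shown below: patches are unrolled into matrix rows, kernels into matrix columns, and the convolution becomes one GEMM. It assumes stride 1 and "valid" padding and favors clarity over speed; the function names are hypothetical, and the sketch is far from what optimized libraries such as CUDNN actually do.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll all kh x kw patches of x (shape [C, H, W]) into rows of a 2D matrix.
    Valid convolution, stride 1: output has (H-kh+1)*(W-kw+1) rows, C*kh*kw columns."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, C * kh * kw), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[idx] = x[:, i:i + kh, j:j + kw].ravel()  # redundant (overlapping) data
            idx += 1
    return cols

def conv2d_gemm(x, kernels):
    """kernels: shape [num_filters, C, kh, kw]. Returns [num_filters, out_h, out_w]."""
    F, C, kh, kw = kernels.shape
    cols = im2col(x, kh, kw)            # [out_h*out_w, C*kh*kw]
    K = kernels.reshape(F, -1).T        # [C*kh*kw, F], one column per unrolled filter
    out = cols @ K                      # the convolution as a single GEMM
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.T.reshape(F, out_h, out_w)

x = np.random.default_rng(2).standard_normal((3, 8, 8))
k = np.random.default_rng(3).standard_normal((4, 3, 3, 3))
print(conv2d_gemm(x, k).shape)  # (4, 6, 6)
```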

D. Recurrent Units

The complex gate systems within RNN units contain multiple operations, each involving either a small matrix multiplication or an element-wise operation. For this reason, these layers have traditionally been implemented as a series of high-level operations, such as GEMMs. Nevertheless, these layers can be accelerated further. Moreover, since RNN units are usually chained together (forming consecutive recurrent layers), two types of concurrency can be considered: within the same layer and between consecutive layers. [21] describes several GPU optimizations. The first optimization fuses all computations (GEMMs and otherwise) into one function (kernel), keeping intermediate results in scratch-pad memory. This both decreases kernel scheduling overhead and saves round trips to global memory by using the massively parallel GPU's multi-level memory hierarchy. Other enhancements involve pre-transposing matrices and enabling the GPU to execute independent recurrent units concurrently on different multi-processors. Concurrency between layers is achieved through a parallel pipeline, with which [21] implements stacked RNN unit computations, starting to propagate through the next layer as soon as its data dependencies are met. Overall, these optimizations result in an 11x performance increase over the high-level implementation. From the memory consumption perspective, dynamic programming [22] was proposed for RNNs to balance between caching intermediate results and recomputing forward inference for backpropagation.
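The fusion idea can be illustrated at the algorithmic level: the sketch below computes all four LSTM gates with one fused GEMM by concatenating their weight matrices, instead of four separate multiplications. It is a plain NumPy approximation, not the GPU kernel fusion of [21]; the dimensions and function name are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_fused(x, h, c, W, b):
    """One LSTM step. W has shape [input+hidden, 4*hidden]: the four gate weight
    matrices (input, forget, cell, output) are concatenated so that a single GEMM
    computes all gate pre-activations at once."""
    z = np.concatenate([x, h], axis=1) @ W + b      # one fused GEMM
    i, f, g, o = np.split(z, 4, axis=1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

batch, in_dim, hidden = 16, 32, 64
rng = np.random.default_rng(4)
W = rng.standard_normal((in_dim + hidden, 4 * hidden)) * 0.05
b = np.zeros(4 * hidden)
h = c = np.zeros((batch, hidden))
x = rng.standard_normal((batch, in_dim))
h, c = lstm_step_fused(x, h, c, W, b)
print(h.shape, c.shape)  # (16, 64) (16, 64)
```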

5. NETWORK CONCURRENCY

The high average parallelism of neural networks can be exploited not only to compute individual operators efficiently, but also to evaluate the entire network concurrently along different dimensions. Below, we analyze three influential partitioning techniques, together with a hybrid variant.


A. Data Parallelism

A simple parallel approach is to split the work on a batch of samples among multiple computational resources (cores or devices). This approach, initially called pattern parallelism because input samples were called patterns, dates back to the first practical implementations of artificial neural networks [23]. Multiple vector accelerator microprocessors (Spert-II) were used by [24] to parallelize error backpropagation for neural network training. To support data parallelism, the paper presents a version of delayed gradient updates called "bunch mode", in which the gradient is accumulated several times before the weights are updated. [25] performed one of the earliest mappings of DNN computations to data-parallel architectures (e.g., GPUs). When training Restricted Boltzmann Machines, the paper shows a speedup of up to 72.6x over the CPU. Today, the vast majority of deep learning frameworks support data parallelism, using a single GPU, multiple GPUs, or a cluster of multi-GPU nodes. Additional methods for data parallelism have been suggested in the literature. In ParallelSGD [26], SGD is run (possibly with mini-batches) k times in parallel, splitting the dataset among the processors. After all SGD instances have converged, the resulting weights are aggregated and averaged. ParallelSGD was developed using the MapReduce [27] programming model. With MapReduce, it is easy to schedule parallel tasks on multiple processors as well as in distributed environments. While the MapReduce model initially succeeded in deep learning, its generality hindered DNN-specific optimizations. Current implementations therefore use high-performance communication interfaces (e.g., MPI) to implement fine-grained parallelism features.
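The ParallelSGD [26] aggregation step can be sketched with a toy least-squares problem: k "workers" run SGD independently on disjoint shards, and their final weights are averaged. The workers are simulated serially here for clarity; the learning rate, epoch count, and problem setup are hypothetical.

```python
import numpy as np

def sgd_on_shard(X, y, w0, lr=0.01, epochs=5):
    """Plain SGD for least squares on one data shard."""
    w = w0.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi
            w -= lr * grad
    return w

rng = np.random.default_rng(5)
w_true = rng.standard_normal(10)
X = rng.standard_normal((4000, 10))
y = X @ w_true + 0.01 * rng.standard_normal(4000)

k = 4                                          # number of simulated workers
shards = np.array_split(np.arange(len(X)), k)  # disjoint data shards
w0 = np.zeros(10)

# each "worker" trains independently on its own shard (run serially for clarity)
local_weights = [sgd_on_shard(X[s], y[s], w0) for s in shards]

# ParallelSGD-style aggregation: average the converged weights
w_avg = np.mean(local_weights, axis=0)
print("error of averaged model:", np.linalg.norm(w_avg - w_true))
```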

B. Model Parallelism

The second partitioning technique for DNN training is model parallelism, also known as network parallelism. This technique partitions the work according to the neurons in each layer. In this case, the sample batch is copied to all processors and different parts of the DNN are computed on different processors, which can save memory (because the entire network is not stored in one place) but requires additional communication after each layer. [28] proposed incorporating redundant computations into neural networks in order to reduce communication costs in fully connected layers. In particular, the proposed method partitions an NN in such a way that each processor is responsible for twice as many neurons (with overlap), thus requiring more computation but less communication. Another approach proposed to reduce communication in fully connected layers is to use Cannon's matrix multiplication algorithm, modified for DNNs [29]. The paper reports that Cannon's algorithm produces better efficiency and speedups than simple partitioning on small-scale multilayer fully connected networks.
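A minimal sketch of neuron-wise partitioning follows: each of two hypothetical workers holds only its own column slice of a fully connected layer's weight matrix, computes its share of the output neurons, and the partial results are concatenated, which is the step that requires communication in a real system.

```python
import numpy as np

rng = np.random.default_rng(6)
batch, in_features, out_features = 8, 128, 64

X = rng.standard_normal((batch, in_features))
W = rng.standard_normal((in_features, out_features)) * 0.05

# model parallelism: each worker stores only its own column slice of W
W_parts = np.split(W, 2, axis=1)               # slices for worker 0 and worker 1
partial_outputs = [X @ Wp for Wp in W_parts]   # computed on different devices

# communication step: gather the partial outputs and concatenate the neurons
Y_parallel = np.concatenate(partial_outputs, axis=1)

assert np.allclose(Y_parallel, X @ W)
print("model-parallel output matches the single-device result")
```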

C. Pipelining

In deep learning, pipelining can refer either to overlapping computations between one layer and the next (as data becomes ready), or to partitioning the DNN by depth, assigning layers to specific processors. Pipelining can be viewed as a form of data parallelism, since elements (samples) are processed through the network in parallel, but also as model parallelism, since the length of the pipeline is determined by the DNN structure. The first type of pipelining can be used to overlap forward evaluation, backpropagation, and weight updates. This scheme is commonly used in training [30], [31], [32], and improves utilization by reducing processor idle time.
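The depth-wise variant can be illustrated with a toy schedule: with S stages (groups of layers) and M micro-batches, stage s works on micro-batch t - s at time step t, so consecutive layers overlap their work on different samples. The stage and micro-batch counts below are arbitrary.

```python
# Toy pipeline schedule: S stages (layer groups), M micro-batches.
# At time step t, stage s works on micro-batch (t - s) if it is in range,
# so consecutive layers overlap their work on different samples.
S, M = 3, 5
for t in range(S + M - 1):
    row = []
    for s in range(S):
        mb = t - s
        row.append(f"stage{s}:mb{mb}" if 0 <= mb < M else f"stage{s}:idle")
    print(" | ".join(row))
```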

D. Hybrid Parallelism

Combining multiple parallelism schemes can overcome the drawbacks of each. Below we look at successful examples of such hybrids. One approach applies data parallelism to the convolutional layers and model parallelism to the fully connected portion. Using this hybrid approach, it is possible to achieve a speedup of up to 6.25x on 8 GPUs over a single GPU, with less than 1 percent loss of accuracy (due to an increase in batch size). AMPNet [33] is an asynchronous implementation of DNN training on CPUs that uses an intermediate representation to implement fine-grained model parallelism. In particular, internal parallel tasks within and between layers are defined and scheduled asynchronously. In addition, asynchronous execution of dynamic control flow enables the tasks of forward evaluation, backpropagation, and weight updates to be pipelined. Finally, DistBelief [34] is a distributed deep learning system that combines all three parallelism strategies. In its implementation, training is performed simultaneously on multiple model replicas, where each replica is trained on different samples (data parallelism). Within each replica, the DNN is partitioned both by neurons within the same layer (model parallelism) and by different layers (pipelining).

6. TRAINING CONCURRENCY

So far we have addressed training algorithms in which there is only one copy of the parameters and all processors can see its up-to-date values directly. In distributed environments, there may be multiple instances of training agents operating independently, so the overall algorithm has to be adapted.


A. Model Consistency

Directly splitting computations between nodes produces a distributed form of data parallelism in which all nodes have to communicate their updates to the others before a new batch is retrieved. This imposes a large overhead on the overall system, hampering the scaling of training. The HOGWILD shared-memory algorithm [35] is a well-known instance of relaxed consistency, which enables training agents to read parameters and update gradients at will, overwriting existing progress. Stale-Synchronous Parallelism (SSP) [36] proposes a compromise between consistent and inconsistent models in order to provide accuracy guarantees despite asynchrony.
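The SSP condition can be sketched in a few lines: a worker may proceed without waiting only if it is at most a bounded number of iterations ahead of the slowest worker. The clock values and staleness bound below are hypothetical; a real implementation would track these clocks on a parameter server.

```python
# Stale-Synchronous Parallelism (SSP) admission check: a worker at iteration
# `my_clock` may proceed without waiting only if it is at most `staleness`
# iterations ahead of the slowest worker.
def ssp_can_proceed(my_clock, worker_clocks, staleness=3):
    return my_clock - min(worker_clocks) <= staleness

clocks = [12, 10, 11, 9]                          # current iteration of each worker
print(ssp_can_proceed(12, clocks, staleness=3))   # True: 12 - 9 <= 3
print(ssp_can_proceed(14, clocks, staleness=3))   # False: must wait for stragglers
```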

B. Parameter Distribution

For distributed deep learning, there are two general ways to conserve communication bandwidth: compressing parameters with efficient data representations, and avoiding sending unnecessary information entirely, which results in sparse data structures being communicated. While the methods in the former category are orthogonal to the network infrastructure, the methods in the latter category differ when applied to parameter-server (PS) and decentralized topologies. A common data representation for gradient (or parameter) compression is quantization, i.e., mapping continuous values into buckets that represent sets of values (usually ranges). The distributions of parameter and gradient values have been shown [37] to be narrowly concentrated, so these approaches are successful at representing the working range with fewer bits per parameter. Sparsification is another popular method for parameter distribution. DNNs (and especially CNNs) exhibit sparse gradients during parameter updates. The first application of gradient sparsification [38] prunes gradient values using a fixed threshold, below which updates are not sent.
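The sketch below illustrates both bandwidth-saving ideas on a random gradient vector: threshold-based sparsification in the spirit of [38] (only values whose magnitude exceeds a threshold are sent, along with their indices), followed by a simple 8-bit bucket quantization of the surviving values. The threshold, bucket count, and gradient statistics are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
grad = rng.standard_normal(10000) * 0.01     # a dense gradient vector

# 1) threshold sparsification: only send entries whose magnitude exceeds tau
tau = 0.02
idx = np.nonzero(np.abs(grad) > tau)[0]
values = grad[idx]
print(f"sparsified: sending {len(idx)} of {len(grad)} values "
      f"({100.0 * len(idx) / len(grad):.1f}%)")

# 2) quantization: map each surviving value into one of 256 buckets (1 byte each)
lo, hi = values.min(), values.max()
scale = (hi - lo) / 255.0
q = np.round((values - lo) / scale).astype(np.uint8)
dequantized = q.astype(np.float64) * scale + lo
print("max quantization error:", np.abs(values - dequantized).max())
```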

7. CONCLUSION

The world of deep learning is full of concurrency. Nearly every aspect of training is inherently parallel, from convolution computation to the meta-optimization of DNN architectures. Even when an aspect is sequential, its consistency requirements can be relaxed, thanks to the robustness of nonlinear optimization, to increase concurrency while still achieving reasonable, and sometimes better, accuracy. In this paper, we give an overview of many of these aspects, the approaches documented in the literature, and an analysis of their concurrency. It is hard to predict what the future holds for this highly active research field (many have tried over the years), but we can assume that there is still much progress to be made in parallel and distributed deep learning. As research advances, DNN architectures are becoming deeper and more interconnected, both between consecutive and non-consecutive layers. Apart from accuracy, considerable effort is devoted to reducing the memory footprint and the number of operations, so that inference can be executed successfully on mobile devices. This also means that post-training compression of DNNs is likely to be studied further, and that learning compressible networks is desirable. Because mobile hardware is limited in memory capacity and must be energy-efficient, it requires specialized DNN computational hardware.



