AI accelerator

An AI accelerator is (as of 2016) an emerging class of microprocessor designed to accelerate artificial neural networks, machine vision and other machine learning algorithms for robotics, the Internet of things and other data-intensive or sensor-driven tasks.^[1] Such devices are frequently manycore designs (mirroring the massively parallel nature of biological neural networks) and are targeted at practical narrow AI applications rather than artificial general intelligence research. Many vendor-specific terms exist for devices in this space.

They are distinct from GPUs (which are commonly used for the same role) in that they lack any fixed function units for graphics, and generally focus on low-precision arithmetic.

History

Computer systems have frequently complemented the CPU with special purpose accelerators for intensive tasks, most notably graphics, but also sound, video, etc. Over time various accelerators have appeared that have been applicable to AI workloads.

Early attempts

Early on, DSPs (such as the AT&T DSP32C) were used as neural network accelerators, e.g. to accelerate OCR software,^[2] and there were attempts to create parallel high-throughput systems for workstations (e.g. TetraSpert in the 1990s, a parallel fixed-point vector processor^[3]) aimed at various applications including neural network simulations.^[4] ANNA was a neural net CMOS accelerator developed by Yann LeCun.^[5] Another attempt to build a neural net workstation was Synapse-1^[6] (not to be confused with the current IBM SyNAPSE project).

Heterogeneous computing

Architectures such as the Cell microprocessor (itself inspired by the PS2 vector units, one of which was tied more closely to the CPU for general purpose work) have exhibited features that significantly overlap with AI accelerators: support for packed low-precision arithmetic, a dataflow architecture, and prioritising throughput over latency and "branchy-int" code. This was a move toward heterogeneous computing, with a number of throughput-oriented accelerators intended to assist the CPU with a range of intensive tasks: physics simulation, AI, video encoding/decoding, and certain graphics tasks beyond the reach of its contemporary GPUs.

The physics processing unit was yet another attempt to fill the gap between CPU and GPU in PC hardware; however, physics tends to require 32-bit precision and up, whilst much lower precision can be a better tradeoff for AI.^[7]

CPUs themselves have gained increasingly wide SIMD units (driven by video and gaming workloads) and more cores, both in a bid to eliminate the need for another accelerator and to accelerate application code. These SIMD units tend to support packed low-precision data types.^[8]

Use of GPGPU

Innovative software appeared that used vertex and pixel shaders for general-purpose computation through rendering APIs, storing non-graphical data in vertex buffers and texture maps (including implementations of convolutional neural networks for OCR^[9]).^[10] Vendors of graphics processing units subsequently saw the opportunity and generalised their shader pipelines with specific support for GPGPU, mostly motivated by the demands of video game physics but also targeting scientific computing.^[11]

This killed off the market for a dedicated physics accelerator and superseded Cell in video game consoles,^[12] and eventually led to the use of GPUs for running convolutional neural networks such as AlexNet (which exhibited leading performance in the ImageNet Large Scale Visual Recognition Challenge).^[13]

As such, as of 2016 GPUs are popular for AI work, and they continue to evolve in a direction that facilitates deep learning, both for training^[14] and for inference in devices such as self-driving cars.^[15] They are also gaining additional connective capability for the kind of dataflow workloads AI benefits from (e.g. NVidia NVLink).^[16]

Use of FPGA

Deep learning frameworks are still evolving, making it hard to design custom hardware. Reconfigurable devices such as field-programmable gate arrays (FPGAs) make it easier to evolve hardware, frameworks and software alongside each other.^[17]

Microsoft has used FPGA chips to accelerate inference.^[18]^[19] This has motivated Intel to purchase Altera with the aim of integrating FPGAs in server CPUs, which would be capable of accelerating AI as well as other tasks.

Use of ASIC

Whilst GPUs perform far better than CPUs for these tasks, a factor of 10 in efficiency^[20]^[21] can still be gained with a more specific design, via an application-specific integrated circuit (ASIC).

Memory access pattern

The memory access pattern of AI calculations differs from graphics: a more predictable but deeper dataflow, benefiting more from the ability to keep more temporary variables on-chip (e.g. in scratchpad memory rather than caches); GPUs by contrast devote silicon to efficiently dealing with highly non-linear gather-scatter addressing between texture maps and frame-buffers, and texture filtering, as is needed for their primary role in 3D rendering.
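The benefit of keeping temporaries on-chip can be illustrated with a small Python sketch. It is an illustration only, not a model of any particular chip: the "scratchpad" is just a pair of local lists, the function names are hypothetical, and the only thing counted is how many operand values are fetched from the (notional) off-chip matrices. Writes to the output matrix are not counted.

```python
def matmul_naive(a, b, n):
    """Multiply two n x n matrices, counting every operand read as off-chip."""
    reads = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                reads += 2  # one element of a, one element of b
            c[i][j] = acc
    return c, reads


def matmul_tiled(a, b, n, tile):
    """Same product, with operands first copied into small scratchpad tiles.

    Only the copies into the scratchpad count as off-chip reads; the inner
    loops then reuse each loaded value `tile` times.
    """
    assert n % tile == 0, "sketch assumes the tile size divides n"
    reads = 0
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Explicitly managed on-chip buffers (hypothetical scratchpad).
                sa = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                sb = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                reads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = c[i0 + i][j0 + j]
                        for k in range(tile):
                            acc += sa[i][k] * sb[k][j]
                        c[i0 + i][j0 + j] = acc
    return c, reads


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]
c_naive, reads_naive = matmul_naive(a, b, n)
c_tiled, reads_tiled = matmul_tiled(a, b, n, tile=4)
print(c_naive == c_tiled, reads_naive, reads_tiled)
```

Under these assumptions the naive loop performs 2n³ operand reads while the tiled version performs 2n³/tile, so a larger scratchpad directly reduces off-chip traffic for this predictable access pattern.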

Precision

AI researchers often find minimal accuracy losses when dropping to 16 or even 8 bits,^[7] suggesting that a larger volume of low-precision arithmetic is a better use of the same bandwidth. Some researchers have even tried using 1-bit precision (i.e. putting the emphasis entirely on spatial information in vision tasks).^[22] IBM's design is more radical, dispensing with scalar values altogether and accumulating timed pulses to represent activations stochastically, requiring conversion of traditional representations.^[23]
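As a rough illustration of why reduced precision can cost little accuracy, the Python sketch below computes the same dot product three ways: in floating point, with 8-bit integers, and with 1-bit signs. The helper names and the single shared scale per vector are assumptions made for this example; real schemes (including the per-filter scaling factors of the cited binary network) are more elaborate.

```python
def quantize(values, bits=8):
    """Map floats to signed integers of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    return [round(v / scale) for v in values], scale


def quantized_dot(x, w, bits=8):
    """Dot product computed on integers, rescaled to a float at the end."""
    qx, sx = quantize(x, bits)
    qw, sw = quantize(w, bits)
    acc = sum(p * q for p, q in zip(qx, qw))  # integer multiply-accumulate
    return acc * sx * sw


def binary_dot(x, w):
    """1-bit variant: keep only the signs, then use XNOR and a bit count.

    Each vector is packed into an integer bit mask (bit set = non-negative).
    Matching bits contribute +1 and differing bits contribute -1.
    """
    n = len(x)
    bx = sum(1 << i for i, v in enumerate(x) if v >= 0)
    bw = sum(1 << i for i, v in enumerate(w) if v >= 0)
    matches = bin(~(bx ^ bw) & ((1 << n) - 1)).count("1")
    return 2 * matches - n


x = [0.6, -1.0, 0.2, 0.8]
w = [1.0, 0.4, -0.7, 0.3]
exact = sum(p * q for p, q in zip(x, w))
print(exact, quantized_dot(x, w), binary_dot(x, w))
```

For these inputs the 8-bit result stays close to the floating-point value, while the 1-bit result keeps only the agreement of signs, which is why binary networks typically need additional scaling to remain useful.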

Nomenclature

As of 2016, the field is still in flux and vendors are pushing their own marketing terms for what amounts to an "AI accelerator", in the hope that their designs and APIs will dominate. There is no consensus on the boundary between these devices, nor on the exact form they will take; however, several examples clearly aim to fill this new space, with a fair amount of overlap in capabilities.

In the past, when consumer graphics accelerators emerged, the industry eventually adopted NVidia's self-assigned term, "the GPU",^[24] as the collective noun for "graphics accelerators", which had taken many forms before settling on an overall pipeline implementing a model presented by Direct3D.

Slowing of Moore's law

As of 2016, the slowing (and possible imminent end) of Moore's law^[25] drives some to suggest refocussing industry efforts on application-led silicon design,^[26] whereas in the past, increasingly powerful general purpose chips have been applied to varying applications via software. In this scenario, a diversification of dedicated AI accelerators makes more sense than continuing to stretch GPUs and CPUs.

Future

It remains to be seen, however, whether the eventual shape of an AI accelerator is a radically new device like TrueNorth, or a more general purpose processor that just happens to be optimised for the right mix of precision and dataflow.^[4] There are also some even more exotic approaches on the horizon, e.g. attempts to use individual memristors as synapses.

Potential Applications

Autonomous cars; NVidia have targeted their Drive PX-series boards at this space.^[27]
Agricultural robots, for example chemical-free weed control.^[28]
Voice control, e.g. in mobile phones, a target for Qualcomm Zeroth.^[29]
Machine translation
Unmanned aerial vehicles, e.g. navigation systems; the Movidius Myriad 2 has been demonstrated successfully guiding autonomous drones.^[30]
Industrial robots, increasing the range of tasks that can be automated, by adding adaptability to variable situations.
Healthcare, assisting with diagnoses.
Search engines, increasing the energy efficiency of data centres and ability to use increasingly advanced queries.
Natural language processing

Examples

Vision processing units, e.g. Movidius Myriad 2, which is a many-core VLIW AI accelerator at its heart, complemented with video fixed-function units.
Tensor processing unit - presented as an accelerator for Google's TensorFlow framework, which is extensively used for convolutional neural networks. It focusses on a high volume of 8-bit precision arithmetic.
SpiNNaker, a many-core design combining traditional ARM architecture cores with an enhanced network fabric design specialised for simulating a large neural network.
TrueNorth, the most unconventional example: a manycore design based on spiking neurons rather than traditional arithmetic, in which the frequency of pulses represents signal intensity. As of 2016 there is no consensus amongst AI researchers on whether this is the right way to go,^[31] but some results are promising, with large energy savings demonstrated for vision tasks.^[32]
Zeroth NPU, a design by Qualcomm aimed squarely at bringing speech and image recognition capabilities to mobile devices.
Nervana Engine, by Nervana Systems.
Eyeriss, a design aimed explicitly at convolutional neural networks, using a scratchpad and on chip network architecture.
Adapteva Epiphany, targeted as a coprocessor and featuring a network-on-chip scratchpad memory model suited to a dataflow programming model, which should be suitable for many machine learning tasks.
Kalray have demonstrated an MPPA^[33] and report efficiency gains over GPUs for convolutional neural nets.
IIT Madras are designing a spiking neuron accelerator for new RISC-V systems, aimed at big-data analytics in servers.^[34]
Nvidia DGX-1 is based on GPU technology; however, the use of multiple chips forming a fabric via NVLink specialises its memory architecture in a way that is particularly suitable for deep learning.
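
The TrueNorth entry above describes signal intensity being carried by the frequency of pulses. A minimal Python sketch of that general idea (often called rate coding) follows. It is a generic illustration and not TrueNorth's actual neuron model; the function names, the fixed number of time steps and the seeded random generator are all assumptions made for the example.

```python
import random


def encode_rate(intensity, steps, rng):
    """Emit a 0/1 spike train that fires with probability `intensity` per step."""
    return [1 if rng.random() < intensity else 0 for _ in range(steps)]


def decode_rate(spikes):
    """Estimate the intensity as the fraction of time steps that fired."""
    return sum(spikes) / len(spikes)


rng = random.Random(0)  # seeded so that repeated runs give the same train
spikes = encode_rate(0.3, 10_000, rng)
print(decode_rate(spikes))
```

The decoded value only approximates the original intensity, and the approximation improves with the number of time steps, which is one reason conventional activations need conversion before they can run on pulse-based hardware.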

References

[1] "google developing AI processors". Google using its own AI accelerators.
[2] "convolutional neural network demo from 1993 featuring DSP32 accelerator".
[3] "design of a connectionist network supercomputer".
[4] "The end of general purpose computers (not)". This presentation covers a past attempt at neural net accelerators, notes the similarity to the modern SLI GPGPU processor setup, and argues that general purpose vector accelerators are the way forward (in relation to the RISC-V hwacha project). Argues that NNs are just dense and sparse matrices, one of several recurring algorithms.
[5] "Application of the ANNA Neural Network Chip to High-Speed Character Recognition".
[6] "SYNAPSE-1: a high-speed general purpose parallel neurocomputer system".
[7] "Deep Learning with Limited Numerical Precision" (PDF).
[8] "Improving the performance of video with AVX".
[9] "microsoft research/pixel shaders/MNIST".
[10] "how the gpu came to be used for general computation".
[11] "nvidia tesla microarchitecture" (PDF).
[12] "End of the line for IBM's Cell".
[13] "imagenet classification with deep convolutional neural networks" (PDF).
[14] "nvidia driving the development of deep learning".
[15] "nvidia introduces supercomputer for self driving cars".
[16] "how nvlink will enable faster easier multi GPU computing".
[17] "FPGA Based Deep Learning Accelerators Take on ASICs". The Next Platform. 2016-08-23. Retrieved 2016-09-07.
[18] "microsoft extends fpga reach from bing to deep learning".
[19] "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware" (PDF).
[20] "Google boosts machine learning with its Tensor Processing Unit". 2016-05-19. Retrieved 2016-09-13.
[21] "Chip could bring deep learning to mobile devices". www.sciencedaily.com. 2016-02-03. Retrieved 2016-09-13.
[22] Rastegari, Mohammad; Ordonez, Vicente; Redmon, Joseph; Farhadi, Ali (2016). "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks". arXiv:1603.05279 [cs.CV].
[23] Diehl, Peter U.; Zarrella, Guido; Cassidy, Andrew; Pedroni, Bruno U.; Neftci, Emre (2016). "Conversion of Artificial Recurrent Neural Networks to Spiking Neural Networks for Low-power Neuromorphic Hardware". arXiv:1601.04187 [cs.NE].
[24] "NVIDIA launches the Worlds First Graphics Processing Unit, the GeForce 256".
[25] "intels former chief architect - moore's law will be dead within a decade".
[26] "more than moore" (PDF).
[27] "drive px".
[28] "design of a machine vision system for weed control" (PDF).
[29] "qualcomm research brings server class machine learning to everyday devices".
[30] "movidius powers worlds most intelligent drone".
[31] "yann lecun on IBM truenorth". Argues that spiking neurons have never produced leading quality results, and that 8-16 bit precision is optimal; pushes the competing 'neuflow' design.
[32] "IBM cracks open new era of neuromorphic computing". "TrueNorth is incredibly efficient: The chip consumes just 72 milliwatts at max load, which equates to around 400 billion synaptic operations per second per watt, or about 176,000 times more efficient than a modern CPU running the same brain-like workload, or 769 times more efficient than other state-of-the-art neuromorphic approaches."
[33] "kalray MPPA" (PDF).
[34] "India preps RISC-V Processors - Shakti targets servers, IoT, analytics". "The Shakti project now includes plans for at least six microprocessor designs as well as associated fabrics and an accelerator chip."

External links

http://www.nextplatform.com/2016/04/05/nvidia-puts-accelerator-metal-pascal/

This article is issued from Wikipedia - version of the 11/30/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.