For this month's paper challenge, Saif Haq and Jan Xu have chosen to present the paper Searching for Winograd-aware Quantized Networks from researchers at the University of Oxford and Arm ML Research Lab. They start the paper review by taking a look at the abstract as a whole and breaking it down into parts:
Let's divide this into separate parts and elaborate on each of them.
Lightweight architectural designs of Convolutional Neural Networks (CNNs) together with quantization have paved the way for the deployment of demanding computer vision applications on mobile devices. Parallel to this, alternative formulations to the convolution operation such as FFT, Strassen and Winograd, have been adapted for use in CNNs offering further speedups.
Running CNNs on mobile and IoT devices with constrained hardware has necessitated the push for architectural designs with low memory, compute and energy budgets that can yield fast inference times without overly compromising model performance. To achieve this, we can:
(a) adopt lightweight CNN architectural elements, such as smaller kernels and depthwise or separable convolutions;
(b) quantise our model to lower bit widths;
(c) use alternative convolution algorithms, such as FFT, Strassen and Winograd convolution, instead of direct convolution.
Winograd convolutions are the fastest known algorithm for spatially small convolutions, but exploiting their full potential comes with the burden of numerical error, rendering them unusable in quantized contexts.
In the case of (c), the Winograd convolution is fastest for convolutions with small kernels, often 3x3. In broad terms, it consists of transforming the input and kernel into the Winograd domain, applying the convolution as an elementwise product, and transforming the output back to the original domain. However, certain Winograd algorithms suffer inherently from numerical errors that get exacerbated in quantised models, implying that (b) and (c) are, to a certain extent, irreconcilable.
In this work we propose a Winograd-aware formulation of convolution layers which exposes the numerical inaccuracies introduced by the Winograd transformations to the learning of the model parameters, enabling the design of competitive quantized models without impacting model size.
This paper proposes a way to resolve this incompatibility by introducing Winograd-aware convolutions. These allow the model parameters to account for the numerical errors of Winograd, even in lower-bit networks.
We also address the source of the numerical error and propose a relaxation on the form of the transformation matrices, resulting in up to 10% higher classification accuracy on CIFAR10.
The paper also proposes a relaxation technique on the Winograd transforms, which further closes the accuracy gap (by up to 10%) between the slow but precise direct convolution and the fast but imprecise Winograd convolution.
Finally, we propose wiNAS, a neural architecture search (NAS) framework that jointly optimizes a given macro-architecture for accuracy and latency leveraging Winograd-aware layers. A Winograd-aware ResNet18 optimized with wiNAS for CIFAR10 results in 2.66× speedup compared to im2row, one of the most widely used optimized convolution implementations, with no loss in accuracy.
Lastly, this paper offers a NAS framework that can trade off between accuracy and latency by searching for the best convolution method and bit width in each layer. As we shall soon see, they perform their experiments on actual CPU hardware and notice an impressive speedup compared to direct convolution on classification tasks, with near zero loss in accuracy.
In many ML applications, convolutional operations have become a staple. Importantly for us, they are an integral part of our compression pipeline, so even a small increase in efficiency would bring large savings in time and compute. As we are trying to port our models to mobile devices, runtime is a major concern at inference.
First, we give a general introduction to runtime concepts before delving deeper into standard Winograd convolutions. These concepts will set the foundations for the paper in question.
So what does it mean to improve runtime? Figure 1 below shows a hierarchy of what can be changed to improve runtime.
Figure 1: A hierarchy of constituents that have an impact on runtime. This blog article focuses on bit width and the convolution algorithm to improve arithmetic intensity, and in turn runtime.
From a high-level overview, the runtime is governed by arithmetic intensity and hardware. The arithmetic (or operational) intensity $$I$$ of an algorithm is the ratio of FLOPS (or performance) to the data movement required by the algorithm, measured in FLOPS/byte. The roofline model defines the attainable performance of our algorithm as $$P = \min(\pi, \beta \times I)$$, where $$\pi$$ is the peak performance and $$\beta$$ is the peak memory bandwidth of the hardware. In other words, if $$I < \pi / \beta$$, the algorithm is memory-bound; if $$I > \pi / \beta$$, it is compute-bound. The ratio $$\pi / \beta$$ is sometimes also referred to as the compute-to-memory ratio (CMR).
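The roofline rule above can be sketched in a few lines of code. The hardware numbers below are illustrative assumptions, not measurements from the paper:

```python
# Roofline model sketch: attainable performance P = min(pi, beta * I).
# Hardware figures here are illustrative assumptions, not measured values.

def attainable_performance(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOPS for a kernel with arithmetic intensity `intensity`
    (FLOPS/byte) on hardware with peak compute `peak_flops` (FLOPS) and
    peak memory bandwidth `peak_bandwidth` (bytes/s)."""
    return min(peak_flops, peak_bandwidth * intensity)

def regime(intensity, peak_flops, peak_bandwidth):
    """A kernel is memory-bound if I < pi/beta (the CMR), else compute-bound."""
    cmr = peak_flops / peak_bandwidth
    return "memory-bound" if intensity < cmr else "compute-bound"

# Illustrative mobile-class hardware: 10 GFLOPS peak, 5 GB/s bandwidth -> CMR = 2
pi, beta = 10e9, 5e9
assert regime(0.5, pi, beta) == "memory-bound"   # low intensity: limited by data movement
assert regime(8.0, pi, beta) == "compute-bound"  # high intensity: limited by the ALUs
assert attainable_performance(0.5, pi, beta) == 2.5e9
```

Note that for a memory-bound kernel, raising the peak compute changes nothing; only raising intensity or bandwidth moves the attainable performance.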
Figure 2: The roofline model showing performance estimates of a given compute kernel by showing limitations in either its FLOPS or its data movement.
At Deep Render, our decoding algorithm will most likely be memory-bound, due to the large memory transfer of the feature maps and kernels between the cache memory / RAM and the ALU. Therefore, reducing the number of FLOPs alone would not improve our inference speed. This can be seen in the roofline diagram (Figure 2) at $$Y$$, where an excess of FLOPs is being limited by the data movement.
The paper under review focuses on increasing arithmetic intensity by lowering the required data movement. The authors achieve this by varying the convolution algorithm and the bit width rather than the model architecture, as shown in Figure 1. The formula below gives some intuition on how data movement can be improved:
$$DM = \text{bit width} \times \underbrace{C \cdot H \cdot W}_{\text{Model architecture}} \times \underbrace{\frac{(m + r - 1)^2}{m^2}}_{\text{Convolution algorithm}} \times \text{caching factor}$$
$$r$$ is the kernel size and $$m$$ is the output tile size of the Winograd convolution (more on this in the next section). The caching factor can be ignored for now. So, to improve data movement, and therefore runtime, we can either quantise our model to lower bit widths, or use a faster convolution algorithm that performs more operations with the same memory transfer.
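To make the intuition concrete, here is a small sketch of the convolution-algorithm factor from the formula above (the combined figure is only illustrative of how the two levers multiply):

```python
# Sketch of the data-movement intuition above.
def winograd_dm_factor(m, r=3):
    """Per-tile data-movement factor (m + r - 1)^2 / m^2 from the DM formula."""
    return (m + r - 1) ** 2 / m ** 2

assert winograd_dm_factor(2) == 4.0   # F2
assert winograd_dm_factor(4) == 2.25  # F4

# Combining both levers: F4 at 8-bit vs F2 at 32-bit moves ~7.1x less data.
reduction = (32 * winograd_dm_factor(2)) / (8 * winograd_dm_factor(4))
assert abs(reduction - 7.111) < 0.01
```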
Using lower-precision (quantised) networks results in smaller model sizes, faster inference, lower energy consumption, and smaller chip area, all of which have a direct impact on data movement [1]. For these reasons, 8-bit arithmetic has been widely adopted on compute-constrained devices.
There are a few ways of performing convolutions. Winograd-based convolutions have quickly gained traction as a preferred approach to implementing convolutional neural networks (ConvNets) with small kernels on various hardware platforms, because they require fewer floating-point operations and less data movement than standard/direct convolutions (GEMM, normally using the im2row and im2col operations).
Although Winograd-based convolutions sound great, they do suffer from numerical errors which grow rapidly with the output tile size parameter $$m$$. The reason this happens is beyond the scope of this article, but if you're interested, see [9]. Although neural networks are known to be fairly robust to weight quantisation, this is not the case when Winograd is used, as the numerical inaccuracies get exacerbated at lower bit widths (see Table 2 two sections below).
The Winograd-aware Quantized Network, discussed later in this blog post, poses a solution that allows the quantised network to compensate for the numerical errors through backpropagation, letting learnable parameters absorb some of this inaccuracy.
Before we start presenting the paper, this section provides a quick overview of how the Winograd algorithm works. Principally, Winograd works on the basis of algorithmic strength reduction: swapping out expensive operations, such as multiplications, for cheaper operations, such as additions.
Standard GEMM-based convolutions have a lot of overlap in pixels when a filter is convolved over the input. Figure 3 below shows a standard 3x3 convolution being applied to the top left of an image (the 4x4 input). The centre four pixels are used in all four of the matrix multiplications, and this redundancy is where Winograd improves.
Figure 3: A visualisation of the matrix multiplications required to achieve a convolution reducing a 4x4 input to a 2x2 output with a 3x3 filter, using the GEMM (im2col) algorithm.
Instead of the 36 multiplication operations (4 times 3x3) in the example above, a Winograd convolution reduces this to 16 (1 times 4x4): a 2.25x increase in algorithmic efficiency for convolutions with 3x3 filters. The Winograd algorithm achieves this by breaking up the input image into overlapping tiles of size $$m + r - 1$$ and performing the convolution in the 'Winograd domain', where the convolution is simply a Hadamard product, or elementwise multiplication. The Winograd algorithm for input tile $$X$$, weight kernel $$W$$ and output $$Y$$ can be described with the following formula:
$$Y = \underbrace{A^T \big[ \underbrace{G W G^T}_{\text{Kernel transform}} \underbrace{\odot}_{\text{Hadamard product}} \underbrace{B^T X B}_{\text{Input transform}} \big] A}_{\text{Output transform}}$$
Here, we can see that three transformations need to occur: (i) the input transform, using the matrix $$B^T$$, which transforms the input tile $$X$$ into Winograd space; (ii) the kernel transform, using the matrix $$G$$, which transforms the weight kernel $$W$$ into Winograd space (this can be precomputed, so it does not affect runtime); (iii) the output transform, using the matrix $$A^T$$, which transforms the output back into the original space. $$B^T$$, $$G$$ and $$A^T$$ are collectively known as the Winograd transformation matrices. A visualisation of this procedure is shown in Figure 4.
Figure 4: $$F(2 \times 2, 3 \times 3)$$ or F2 Winograd pipeline, where the input tile and weight kernel have the $$B^T$$ and $$G$$ matrices applied respectively, transforming them to the Winograd domain. Then they are multiplied elementwise (Hadamard product), and the output is finally transformed back to the image domain using the $$A^T$$ transformation matrix.
The Winograd algorithm can be defined for multiple output tile sizes $$m$$ and kernel sizes $$r$$. The algorithm itself, for a 2D convolution, can be denoted as $$F(m \times m, r \times r)$$. In this paper, the authors focus mainly on 3x3 kernels and refer to $$F(2 \times 2, 3 \times 3)$$ as F2, $$F(4 \times 4, 3 \times 3)$$ as F4, etc.
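The F2 pipeline described above is short enough to write out in full. The sketch below uses the standard F(2x2, 3x3) transformation matrices and checks the result against a direct convolution; helper names are our own, and real implementations additionally tile the input and batch over channels:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transformation matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2(X, W):
    """Convolve a 4x4 input tile X with a 3x3 kernel W into a 2x2 output
    via Y = A^T [(G W G^T) . (B^T X B)] A, using only 16 multiplies."""
    U = G @ W @ G.T          # kernel transform (precomputable)
    V = B_T @ X @ B_T.T      # input transform
    return A_T @ (U * V) @ A_T.T  # Hadamard product + output transform

def direct_conv(X, W):
    """Reference: direct (cross-correlation) convolution, 36 multiplies."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(X[i:i + 3, j:j + 3] * W)
    return out

X = np.arange(16, dtype=np.float64).reshape(4, 4)
W = np.ones((3, 3))
assert np.allclose(winograd_f2(X, W), direct_conv(X, W))
```

In float64 the two paths agree to rounding error; the numerical trouble discussed below appears when the intermediates are quantised.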
Each size configuration of Winograd has differently initialised transformation matrices. Most commonly, they are initialised with the Chinese Remainder Theorem or the Cook-Toom algorithm, which requires choosing a set of so-called polynomial points. The choice of these points is not trivial, but for small Winograd kernels and output tile sizes, such as F2 and F4, there is a common consensus on the "best" points [4].
The efficiency of Winograd-based algorithms depends directly on the output tile size. Table 1 shows how many input-and-kernel multiplications are required to achieve a convolution at different output tile sizes, assuming a kernel size of 3x3. It is very clear that larger output tile sizes of Winograd enjoy a greater arithmetic complexity reduction factor compared to GEMM convolutions.
| Output tile size $$m$$ | # GEMM (im2row) multiplications | # Winograd multiplications | Arithmetic complexity reduction factor |
| --- | --- | --- | --- |
| 2 x 2 | 36 | 16 | 2.25x |
| 4 x 4 | 144 | 36 | 4x |
| 6 x 6 | 324 | 64 | 5.0625x |
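The table's numbers follow directly from the tile algebra: per $$m \times m$$ output tile, GEMM needs $$m^2 r^2$$ multiplications while Winograd needs $$(m + r - 1)^2$$. A quick check:

```python
# Multiplication counts per output tile for a 3x3 kernel.
def gemm_mults(m, r=3):
    return m * m * r * r        # one r x r dot product per output pixel

def winograd_mults(m, r=3):
    return (m + r - 1) ** 2     # one Hadamard product per input tile

# Reproduce the reduction factors in Table 1.
for m, factor in [(2, 2.25), (4, 4.0), (6, 5.0625)]:
    assert gemm_mults(m) / winograd_mults(m) == factor
```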
However, the numerical imprecision is also closely associated with the output tile size, as well as the bit width of the operation. The implications of reducing bit width and/or increasing output tile size can be seen in Table 2 below. The next section explains further the numerical errors encountered when performing model quantisation with larger-tile Winograds.
As the previous sections explain, the Winograd convolution inherently suffers from numerical inaccuracies induced by the complexity of the convolution method (F2 vs F4 vs F6) and the bit width of the operation (32-bit vs 16-bit vs 8-bit). This incongruity is shown in Table 2.
Table 2: CIFAR10 accuracies for a ResNet18 architecture pretrained in full precision (float32) with different Winograd convolution methods and bit widths switched on.
Upon model quantisation, the CIFAR10 accuracy drops by around 75% with F4 convolutions and by over 80% with F6! Whilst F2 still works well in quantised networks, it's clear from this table that quantising a pretrained model renders Winograd F4 and F6 virtually unworkable. In other words, we can't reap the runtime benefits of both reduced-precision arithmetic and the more efficient F4 or F6 Winograds simultaneously; in their current form, we have to choose one or the other.
Luckily, this paper attempts to resolve this incompatibility and offers a solution for closing this accuracy gap, whilst retaining the speedup from the Winograd algorithm. They achieve this by formulating so-called Winograd-aware networks.
The core idea is extremely simple, yet effective: during training, instead of direct convolution, every convolution $$z = f(d, g)$$ (slightly different notation from above, but $$z = Y$$ is the output tile, $$d = X$$ the input tile and $$g = W$$ the weight kernel) is explicitly implemented as a Winograd convolution formulated by the equation:
$$z = A^T [G g G^T \odot B^T d B] A$$
This exposes the Winograd transformation matrices to the model training, since the Winograd convolution consists only of backpropagation-friendly operations (matrix-matrix multiplications and elementwise multiplications). This means that:
- the model weights are trained to compensate for the numerical errors introduced by the Winograd transforms;
- the transformation matrices $$B^T$$, $$G$$ and $$A^T$$ can optionally be treated as learnable parameters themselves, updated by the same gradients.
In the following sections of the paper, the models that include just the first bullet point are termed Winograd-aware, and are referred to in the experimental results below simply as F2, F4 and so on (with static Winograd transforms). The authors use the suffix 'flex' to denote that a Winograd-aware model also includes the second bullet point, i.e. that the transformation matrices $$B^T$$, $$G$$ and $$A^T$$ are learnable.
Figure 5: The forward pass of a Winograd-aware layer, in which the transformation matrices $$B^T$$, $$G$$ and $$A^T$$ are initialised via the Cook-Toom algorithm [4]. If the transformation matrices are part of the learnable set of model parameters, the gradients carried from the loss function update them (given by the coloured arrows). $$Q_x$$ denotes the intermediate quantisation operation, to the desired bit width, that takes place at each step of the algorithm.
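The quantised forward pass of Figure 5 can be sketched as follows. Here `fake_quant` is a hypothetical stand-in for the $$Q_x$$ blocks, using simple symmetric uniform quantise-dequantise; the paper's exact quantisation scheme may differ:

```python
import numpy as np

# Standard F(2x2, 3x3) transformation matrices (Cook-Toom initialisation).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def fake_quant(x, bits):
    """Quantise x to `bits` (symmetric, per-tensor) and dequantise back.
    Assumed scheme; stands in for the Qx blocks of Figure 5."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return x if scale == 0 else np.round(x / scale) * scale

def winograd_aware_f2(X, W, bits=8):
    """F2 forward pass with an intermediate quantisation at every stage,
    so the quantisation error is visible to training."""
    U = fake_quant(G @ fake_quant(W, bits) @ G.T, bits)      # kernel transform
    V = fake_quant(B_T @ fake_quant(X, bits) @ B_T.T, bits)  # input transform
    M = fake_quant(U * V, bits)                              # Hadamard product
    return fake_quant(A_T @ M @ A_T.T, bits)                 # output transform
```

In a training framework these operations would all be differentiable (with a straight-through estimator for the rounding), so gradients can reach the weights and, in the 'flex' variants, the transformation matrices themselves.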
Using Winograd-aware training, how much of the accuracy lost to model quantisation can we recover when using F4 or F6? The authors studied this by training ResNet18 architectures [6] on a CIFAR10 classification task from scratch, comparing configurations for different convolution algorithms (F2, F4 and F6 plus their 'flex' counterparts) and bit widths (32-bit, 16-bit, 10-bit and 8-bit). They also varied the model size with a uniform multiplier for the channel width of the layers (with a width multiplier of 1.0 corresponding to the full ResNet18). The results can be seen in Figure 6 below.
Figure 6: Performances of Winograd-aware ResNet18 architectures on CIFAR10 for various Winograd algorithms (versus direct convolution or im2row) and various bit widths. In quantised networks, the 'flex' configurations (i.e. those that learn the Winograd transforms) are essential for maintaining high accuracy of the model.
Firstly, let's compare the Winograd-aware models (static transforms only for now) against the aforementioned accuracy gaps, focusing on width multiplier = 1.0 only. Using F4 convolutions, Winograd-aware training practically closes the gap entirely in 16-bit(!), whilst in 8-bit around 65% of the accuracy is recovered. Using F6 convolutions, almost 80% and 72% of the accuracy is recovered in 16-bit and 8-bit, respectively. In summary, we can narrow the gap significantly by training with Winograd-aware layers, but it is still evident that larger-tile Winograds (F6) struggle more to close the gap compared to F2 and F4.
Secondly, it's apparent that learning the Winograd transforms is crucial for retaining performance in quantised networks, even more so than just exposing the model weights to the Winograd inaccuracies. In the 8-bit experiment, the 'flex' configurations result in 10% and 5% higher accuracies for F4 and F6, respectively, than their non-flex counterparts. This means that for F4, the accuracy gap is pretty much closed in 8-bit. Unsurprisingly, there seems to be no discernible difference between F2 and F2-flex at any bit width.
Thirdly, the authors also performed a similar experiment on an 8-bit version of the LeNet-5 architecture [7], which uses 5x5 kernels, on classifying MNIST digits. We won't include the figure here (Figure 5 in the paper), but it basically shows that the 'flex' configuration is again vital in order to use F4 and F6 at such low bit widths.
Lastly, the authors demonstrate that it is possible to take a fully trained ResNet18 model with direct convolutions (no numerical errors) at a given bit width, and retrain it with F4 convolutions. If the Winograd transformation matrices are allowed to evolve during training, it can fully recover the accuracy of the original model within 20 epochs. Figure 7 demonstrates exactly this for an 8-bit model. Note that the pretraining and retraining have to occur at the same bit width; for instance, retraining a 32-bit model in 8-bit would not work.
Figure 7: Retraining (or adapting) a model pretrained with im2row convolution (solid black) can be done successfully and quickly with an F4-flex configuration (solid blue), even more so than training the F4-flex model from scratch (solid green). With static Winograd transforms, neither retraining nor training from scratch in F4 yielded good performance.
One caveat with Winograd-aware training is that it incurs very high memory usage, which consequently might slow down training. This is due to the exposure of the Winograd transforms, each of which produces intermediate outputs from the matrix-matrix multiplications that are required for backpropagation. The authors relied on gradient checkpointing [8] to train these models, which, however, may be cumbersome to implement in production code.
The last observation is therefore important if we were to transfer this to the architectures and problem types at Deep Render. We would be able to take pretrained, quantised models (often trained with direct convolution) and simply finetune them with, for example, an F4-flex configuration. This means we can do the bulk of the training efficiently with direct convolutions, and then apply a short post-training Winograd-awareness scheme before deploying the models on mobile chips with Winograd algorithms, whilst expecting only marginal (if any) loss in performance.
Heuristically, larger tile sizes of the Winograd convolution result in lower latencies but lower model performance (or accuracy) due to numerical errors. Having seen how the accuracy of quantised models can be partially or even fully recovered by employing Winograd-aware layers, let's talk about how to find a good trade-off between maximising accuracy and minimising latency.
The authors thus present a second contribution in this paper, titled wiNAS, which is a NAS-based framework that "automatically transforms an existing architecture into a Winograd-aware version" that jointly optimises for network accuracy and latency. The NAS operates on a micro-architecture level, with candidate operations (or the operation set) such as different convolution algorithms (im2row, F2, F4 etc.) and bit precisions (float32, float16, int8) per layer.
The authors use a variation of ProxylessNAS [5], which samples the path stochastically during the search using the architecture parameters $$\{a_0, a_1, \ldots, a_n\}$$. These are softmaxed, yielding probabilities $$\{p_0, p_1, \ldots, p_n\}$$, from which the path is sampled. (In actuality, two paths are sampled in order for gradients to pass through multiple operations.) Similar to ProxylessNAS, wiNAS formulates the search as a two-stage update process:
1. Update of the model parameters (convolutions, activations etc.) through the loss $$L_{model}$$.
2. Update of the architecture parameters $$\{a_0, a_1, \ldots, a_n\}$$, where the loss introduces the latency metrics:
$$L_{arch} = L_{model} + \lambda_1 \|a\|_2^2 + \lambda_2 E[\text{latency}]$$
The $$\|a\|_2^2$$ term is a weight decay on the architecture parameters, and the expected latency of a layer $$l$$ is the weighted sum of the individual operation latency estimates $$F(o_j^{[l]})$$ for each operation $$o_j^{[l]}$$:
$$E[\text{latency}^{[l]}] = \sum_j p_j F(o_j^{[l]}) = \sum_j \frac{\exp(a_j)}{\sum_k \exp(a_k)} F(o_j^{[l]})$$
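The expected-latency term is just a softmax-weighted average of the per-operation latency estimates. A minimal sketch (the latency values below are illustrative, not the paper's measurements):

```python
import math

def expected_latency(arch_params, op_latencies):
    """E[latency] = sum_j softmax(a)_j * F(o_j) for one layer."""
    exps = [math.exp(a) for a in arch_params]
    total = sum(exps)
    return sum(e / total * f for e, f in zip(exps, op_latencies))

# Three candidate ops for one layer, e.g. (im2row, F2, F4), latencies in ms:
lat = [10.0, 6.0, 4.0]
a = [0.0, 0.0, 0.0]   # uniform architecture params -> plain average
assert abs(expected_latency(a, lat) - sum(lat) / 3) < 1e-9
a = [0.0, 0.0, 10.0]  # strongly prefers the fastest op (F4)
assert abs(expected_latency(a, lat) - 4.0) < 0.01
```

Because this expectation is differentiable in the architecture parameters, the $$\lambda_2$$-weighted latency term can steer the search towards faster operations by gradient descent.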
Figure 8: The micro-architecture NAS problem formulated by wiNAS for each convolution layer. Bit width is not shown here, but may also be a second axis to search over.
The $$F(o_j^{[l]})$$ term is precomputed, either analytically or empirically, as a function of the feature size of $$o_j^{[l]}$$ and the quantisation level. The authors chose to compute it empirically on two different Arm Cortex-A CPU cores: the A73 and A53 (expanded on in the next section). For each convolution method (im2row, F2, F4 and F6), the operation latency was measured for various feature resolutions and channel sizes, and stored to be used during the NAS training. As such, this joint optimisation framework takes into account both the performance losses of the model and the latency of the operations, whose trade-off is controlled by the $$\lambda_2$$ factor in the loss function.
The authors differentiate between two variations of the wiNAS framework:
- $$wiNAS_{WA}$$, which searches over the convolution algorithms (im2row, F2, F4 etc.) at a fixed bit width;
- $$wiNAS_{WAQ}$$, which jointly searches over both the convolution algorithms and the bit widths.
The search for the optimal configuration was performed for 100 epochs on CIFAR10 and CIFAR100 with a ResNet18 backbone. Once completed, the resulting architectures were trained end-to-end like the other Winograd-aware networks. The resulting architectures can be seen in Figure 9.
Left: $$wiNAS_{WA}$$ search, fixed 8-bit width, optimised over CIFAR100.
Centre: $$wiNAS_{WAQ}$$ search, optimised over CIFAR10.
Right: $$wiNAS_{WAQ}$$ search, optimised over CIFAR100.
A discussion of their quantitative performance, which is frankly the more important takeaway from the wiNAS framework, is deferred to the next section. For now, given this figure, we can make a quick qualitative interpretation:
It is evident that model accuracy stays mostly consistent across the various experimental frameworks and hardware types. In contrast, runtime (latency) is extremely hardware-dependent. It is therefore very important to consider the application and the computational resources available to the neural network in question.
Table 3: Key hardware specifications for the two mobile chips tested in this paper.
In this paper, the authors target the usage of neural networks on chips commonly found in today's off-the-shelf mobile hardware. To this end, they opted for one Arm Cortex-A73 high-performance core, which is relatively old (2016), and one Arm Cortex-A53 high-efficiency core, which is very old (2012). These were tested on a Huawei HiKey 960 development board with the big.LITTLE CPU architecture, at 32-bit and 8-bit precision. As discussed in the introduction, what ultimately sets the upper limit on the speedup achievable by Winograd for memory-bound applications is data movement, which in turn depends on the memory bandwidth and caching subsystem of the hardware. Whilst the A73 unsurprisingly outperforms the A53, as we shall soon see, it's encouraging that hardware is becoming better suited to deep learning applications over time, with high promise for the mobile chips available today and in the years to come.
The authors show an accuracy versus latency table (Table 4) across the various convolution algorithms, at the two bit widths, for the two CPU cores, on CIFAR10 and CIFAR100 with the ResNet18 architectures.
im2row / im2col — direct convolution
$$W_{F2}$$ / $$W_{F4}$$ — Winograd convolution (not Winograd-aware)
$$WA_{F2}$$* — F2 for Winograd-aware with static transforms (since F2-flex has no significant gain over F2)
$$WA_{F4}$$ — F4 for Winograd-aware with learned transforms
$$wiNAS_{WA}$$ / $$wiNAS_{WAQ}$$ — wiNAS configurations (as described in previous section)
There's a lot to unpack here, so let's break this table down bit by bit. First, we'll focus only on the latency values of the A73; ignore the A53 for now. The speedups are always measured w.r.t. im2row in 32-bit. Let's start with the 32-bit experiments:
Now, let's shift our attention towards the 8bit experiments:
Let's also quickly discuss the wiNAS runs (some of which have two latency values; left is for CIFAR10, right for CIFAR100):
Now, let's look at the experiments with the A53 CPU core.
However, although we've seen benefits from Winograd-aware layers and wiNAS, we cannot stress enough that the biggest difference in latency comes from the hardware. Comparing the latency values across the A73 and A53, we measure speedups of up to 2.5x. We must therefore take into account what hardware our application will run on when performing accuracy versus latency tests. This point may be trivial, but it's also extremely important.
As described in the wiNAS section, the authors also benchmarked the two CPU cores on various input resolutions and channel sizes across the different convolution algorithms, in order to characterise the latency behaviour. The results table is not included in this blog because it's fairly cluttered (refer to Figure 8 in the paper instead), and frankly most takeaways were fairly trivial (latency increasing with larger input dimensions and channel sizes, etc.). However, there were a couple of interesting observations, both of which are evident from Figure 10:
The solid colour bar regions represent the elementwise multiplication stage of the Winograds, whereas the portion below it represents the input transform and the portion above it the output transform.
The default static Winograd transforms $$B^T$$, $$G$$ and $$A^T$$ are relatively sparse; an example for F4 can be seen above. Several implementations of matrix-matrix multiplication, such as Arm's Compute Library, can exploit this sparsity, which often results in lower latencies.
When Winograd-aware layers learn flexible transforms, it is very unlikely that the resulting matrices remain sparse. However, since runtime is mostly determined by data movement, the incurred penalty is less severe for memory-bound applications, where additional computation can be tolerated without necessarily increasing execution time. This makes it less of an issue for Deep Render's applications.
The images used in this study are relatively small; it is uncertain how much of this knowledge is transferable to our (memory-bound) problem domain with high-resolution images, especially for the actual hardware tests. Another interesting aspect is whether the speedups involved also apply to GPUs and NPUs, which have slightly different execution structures than CPUs.
Whilst the wiNAS framework seems exciting, we are uncertain how successful these experiments actually were. The authors claim that the framework can be used to jointly optimise for accuracy and latency, but they show no such trade-off curve, which would have been very interesting to look at.
Lastly, there still seem to be issues associated with F6, since it is omitted from the resulting wiNAS architectures as well as from the final accuracy vs runtime table. We suspect that there are still accuracy issues with F6 in low-bit models which Winograd-aware training cannot remedy well enough; further innovation in this space would be needed to enable the usage of F6 Winograds.
[1] Sze, V., Chen, Y. H., Yang, T. J. & Emer, J. (2017) Efficient Processing of Deep Neural Networks: A Tutorial and Survey. https://arxiv.org/pdf/1703.09039.pdf
[2] Zlateski, A., Jia, Z., Li, K. & Durand, F. (2018) FFT Convolutions are Faster than Winograd on Modern CPUs, Here is Why. https://arxiv.org/pdf/1809.07851.pdf
[3] Maji, P., Mundy, A., Dasika, G., Beu, J., Mattina, M. & Mullins, R. (2019) Efficient Winograd or CookToom Convolution Kernel Implementation on Widely Used Mobile CPUs. https://arxiv.org/pdf/1903.01521.pdf
[4] Toom, A. L. (1963) The complexity of a scheme of functional elements realizing the multiplication of integers. http://toomandre.com/myarticles/engmat/MULTE.PDF
[5] Cai, H., Zhu, L. & Cai, S. H. (2019) ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. https://arxiv.org/pdf/1812.00332.pdf
[6] He, K., Zhang, X., Ren, S. & Sun, J. (2015) Deep Residual Learning for Image Recognition. https://arxiv.org/pdf/1512.03385.pdf
[7] LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998) GradientBased Learning Applied to Document Recognition. http://yann.lecun.com/exdb/publis/pdf/lecun98.pdf
[8] Chen, T., Xu, B., Zhang, C. & Guestrin, C. (2016) Training Deep Nets with Sublinear Memory Cost. https://arxiv.org/pdf/1604.06174v2.pdf
[9] Barabasz, B., Anderson, A., Soodhalter, K. M. & Gregg, D. (2018) Error Analysis and Improving the Accuracy of Winograd Convolution for Deep Neural Networks. https://arxiv.org/pdf/1803.10986.pdf