Paper: RegNet, the latest work from Kaiming He's team (March 30, 2020) — a translation and walkthrough of "Designing Network Design Spaces" from Facebook AI Research

Overview

Whoa, whoa, whoa! Still the familiar team, still the familiar byline:
the familiar Ross, the familiar Kaiming He, the great He himself.
This time they bring us RegNet, so let me dig into it.

Contents

Designing Network Design Spaces

Abstract

1. Introduction

2. Related Work

3. Design Space Design

3.1. Tools for Design Space Design

3.2. The AnyNet Design Space

3.3. The RegNet Design Space

3.4. Design Space Generalization

4. Analyzing the RegNetX Design Space

5. Comparison to Existing Networks

5.1. State-of-the-Art Comparison: Mobile Regime

5.2. Standard Baselines Comparison: ResNe(X)t

5.3. State-of-the-Art Comparison: Full Regime

6. Conclusion


Paper: https://arxiv.org/pdf/2003.13678.pdf

Designing Network Design Spaces

Abstract

In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5× faster on GPUs.

1. Introduction

Deep convolutional neural networks are the engine of visual recognition. Over the past several years better architectures have resulted in considerable progress in a wide range of visual recognition tasks. Examples include LeNet [15], AlexNet [13], VGG [26], and ResNet [8]. This body of work advanced both the effectiveness of neural networks as well as our understanding of network design. In particular, the above sequence of works demonstrated the importance of convolution, network and data size, depth, and residuals, respectively. The outcome of these works is not just particular network instantiations, but also design principles that can be generalized and applied to numerous settings.


Figure 1. Design space design. We propose to design network design spaces, where a design space is a parametrized set of possible model architectures. Design space design is akin to manual network design, but elevated to the population level. In each step of our process the input is an initial design space and the output is a refined design space of simpler or better models. Following [21], we characterize the quality of a design space by sampling models and inspecting their error distribution. For example, in the figure above we start with an initial design space A and apply two refinement steps to yield design spaces B then C. In this case C ⊆ B ⊆ A (left), and the error distributions are strictly improving from A to B to C (right). The hope is that design principles that apply to model populations are more likely to be robust and generalize.

While manual network design has led to large advances, finding well-optimized networks manually can be challenging, especially as the number of design choices increases. A popular approach to address this limitation is neural architecture search (NAS). Given a fixed search space of possible networks, NAS automatically finds a good model within the search space. Recently, NAS has received a lot of attention and shown excellent results [34, 18, 29].

Despite the effectiveness of NAS, the paradigm has limitations. The outcome of the search is a single network instance tuned to a specific setting (e.g., hardware platform). This is sufficient in some cases; however, it does not enable discovery of network design principles that deepen our understanding and allow us to generalize to new settings. In particular, our aim is to find simple models that are easy to understand, build upon, and generalize.


In this work, we present a new network design paradigm that combines the advantages of manual design and NAS. Instead of focusing on designing individual network instances, we design design spaces that parametrize populations of networks. Like in manual design, we aim for interpretability and to discover general design principles that describe networks that are simple, work well, and generalize across settings. Like in NAS, we aim to take advantage of semi-automated procedures to help achieve these goals.

The general strategy we adopt is to progressively design simplified versions of an initial, relatively unconstrained, design space while maintaining or improving its quality (Figure 1). The overall process is analogous to manual design, elevated to the population level and guided via distribution estimates of network design spaces [21].

As a testbed for this paradigm, our focus is on exploring network structure (e.g., width, depth, groups, etc.) assuming standard model families including VGG [26], ResNet [8], and ResNeXt [31]. We start with a relatively unconstrained design space we call AnyNet (e.g., widths and depths vary freely across stages) and apply our human-in-the-loop methodology to arrive at a low-dimensional design space consisting of simple "regular" networks, that we call RegNet. The core of the RegNet design space is simple: stage widths and depths are determined by a quantized linear function. Compared to AnyNet, the RegNet design space has simpler models, is easier to interpret, and has a higher concentration of good models.


We design the RegNet design space in a low-compute, low-epoch regime using a single network block type on ImageNet [3]. We then show that the RegNet design space generalizes to larger compute regimes, schedule lengths, and network block types. Furthermore, an important property of the design space design is that it is more interpretable and can lead to insights that we can learn from. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. For example, we find that the depth of the best models is stable across compute regimes (∼20 blocks) and that the best models do not use either a bottleneck or inverted bottleneck.

We compare top REGNET models to existing networks in various settings. First, REGNET models are surprisingly effective in the mobile regime. We hope that these simple models can serve as strong baselines for future work. Next, REGNET models lead to considerable improvements over standard RESNE(X)T [8, 31] models in all metrics. We highlight the improvements for fixed activations, which is of high practical interest as the number of activations can strongly influence the runtime on accelerators such as GPUs. Next, we compare to the state-of-the-art EFFICIENTNET [29] models across compute regimes. Under comparable training settings and flops, REGNET models outperform EFFICIENTNET models while being up to 5× faster on GPUs. We further test generalization on ImageNetV2 [24].

We note that network structure is arguably the simplest form of a design space design one can consider. Focusing on designing richer design spaces (e.g., including operators) may lead to better networks. Nevertheless, the structure will likely remain a core component of such design spaces.

In order to facilitate future research we will release all code and pretrained models introduced in this work.


2. Related Work

Manual network design. The introduction of AlexNet [13] catapulted network design into a thriving research area. In the following years, improved network designs were proposed; examples include VGG [26], Inception [27, 28], ResNet [8], ResNeXt [31], DenseNet [11], and MobileNet [9, 25]. The design process behind these networks was largely manual and focused on discovering new design choices that improve accuracy, e.g., the use of deeper models or residuals. We likewise share the goal of discovering new design principles. In fact, our methodology is analogous to manual design but performed at the design space level.

Automated network design. Recently, the network design process has shifted from a manual exploration to more automated network design, popularized by NAS. NAS has proven to be an effective tool for finding good models, e.g., [35, 23, 17, 20, 18, 29]. The majority of work in NAS focuses on the search algorithm, i.e., efficiently finding the best network instances within a fixed, manually designed search space (which we call a design space). Instead, our focus is on a paradigm for designing novel design spaces. The two are complementary: better design spaces can improve the efficiency of NAS search algorithms and also lead to existence of better models by enriching the design space.

 

Network scaling. Both manual and semi-automated network design typically focus on finding best-performing network instances for a specific regime (e.g., number of flops comparable to ResNet-50). Since the result of this procedure is a single network instance, it is not clear how to adapt the instance to a different regime (e.g., fewer flops). A common practice is to apply network scaling rules, such as varying network depth [8], width [32], resolution [9], or all three jointly [29]. Instead, our goal is to discover general design principles that hold across regimes and allow for efficient tuning for the optimal network in any target regime.

Comparing networks. Given the vast number of possible network design spaces, it is essential to use a reliable comparison metric to guide our design process. Recently, the authors of [21] proposed a methodology for comparing and analyzing populations of networks sampled from a design space. This distribution-level view is fully-aligned with our goal of finding general design principles. Thus, we adopt this methodology and demonstrate that it can serve as a useful tool for the design space design process.

 
Parameterization. Our final quantized linear parameterization shares similarity with previous work, e.g., how stage widths are set [26, 7, 32, 11, 9]. However, there are two key differences. First, we provide an empirical study justifying the design choices we make. Second, we give insights into structural design choices that were not previously understood (e.g., how to set the number of blocks in each stage).

3. Design Space Design

Our goal is to design better networks for visual recognition. Rather than designing or searching for a single best model under specific settings, we study the behavior of populations of models. We aim to discover general design principles that can apply to and improve an entire model population. Such design principles can provide insights into network design and are more likely to generalize to new settings (unlike a single model tuned for a specific scenario).

We rely on the concept of network design spaces introduced by Radosavovic et al. [21]. A design space is a large, possibly infinite, population of model architectures. The core insight from [21] is that we can sample models from a design space, giving rise to a model distribution, and turn to tools from classical statistics to analyze the design space. We note that this differs from architecture search, where the goal is to find the single best model from the space.

 

In this work, we propose to design progressively simplified versions of an initial, unconstrained design space. We refer to this process as design space design. Design space design is akin to sequential manual network design, but elevated to the population level. Specifically, in each step of our design process the input is an initial design space and the output is a refined design space, where the aim of each design step is to discover design principles that yield populations of simpler or better performing models.

We begin by describing the basic tools we use for design space design in §3.1. Next, in §3.2 we apply our methodology to a design space, called AnyNet, that allows unconstrained network structures. In §3.3, after a sequence of design steps, we obtain a simplified design space consisting of only regular network structures that we name RegNet. Finally, as our goal is not to design a design space for a single setting, but rather to discover general principles of network design that generalize to new settings, in §3.4 we test the generalization of the RegNet design space to new settings.

Relative to the AnyNet design space, the RegNet design space is: (1) simplified both in terms of its dimension and type of network configurations it permits, (2) contains a higher concentration of top-performing models, and (3) is more amenable to analysis and interpretation.

 

3.1. Tools for Design Space Design

We begin with an overview of tools for design space design. To evaluate and compare design spaces, we use the tools introduced by Radosavovic et al. [21], who propose to quantify the quality of a design space by sampling a set of models from that design space and characterizing the resulting model error distribution. The key intuition behind this approach is that comparing distributions is more robust and informative than using search (manual or automated) and comparing the best found models from two design spaces.  

Figure 2. Statistics of the AnyNetX design space computed with n = 500 sampled models. Left: The error empirical distribution function (EDF) serves as our foundational tool for visualizing the quality of the design space. In the legend we report the min error and mean error (which corresponds to the area under the curve). Middle: Distribution of network depth d (number of blocks) versus error. Right: Distribution of block widths in the fourth stage (w4) versus error. The blue shaded regions are ranges containing the best models with 95% confidence (obtained using an empirical bootstrap), and the black vertical line shows the most likely best value.

 

To obtain a distribution of models, we sample and train n models from a design space. For efficiency, we primarily do so in a low-compute, low-epoch training regime. In particular, in this section we use the 400 million flop (400MF) regime and train each sampled model for 10 epochs on the ImageNet dataset [3]. We note that while we train many models, each training run is fast: training 100 models at 400MF for 10 epochs is roughly equivalent in flops to training a single ResNet-50 [8] model at 4GF for 100 epochs.

As in [21], our primary tool for analyzing design space quality is the error empirical distribution function (EDF). The error EDF of n models with errors e_i is given by:

$$F(e) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[e_i < e] \tag{1}$$

F(e) gives the fraction of models with error less than e. We show the error EDF for n = 500 sampled models from the AnyNetX design space (described in §3.2) in Figure 2 (left).
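As an illustration, here is a minimal NumPy sketch of computing an error EDF from sampled-model errors (the error values below are synthetic stand-ins, not measured results):

```python
import numpy as np

def error_edf(errors, grid):
    """Error empirical distribution function (Eqn. 1):
    F(e) = fraction of models with error less than e."""
    errors = np.asarray(errors)
    return np.array([(errors < e).mean() for e in grid])

# Synthetic stand-in for the top-1 errors of n = 500 sampled models.
errors = np.random.normal(40.0, 2.0, size=500)
grid = np.linspace(errors.min(), errors.max(), 100)
F = error_edf(errors, grid)  # plotting grid vs. F gives a curve like Figure 2 (left)
```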

 

Given a population of trained models, we can plot and analyze various network properties versus network error, see Figure 2 (middle) and (right) for two examples taken from the AnyNetX design space. Such visualizations show 1D projections of a complex, high-dimensional space, and can help obtain insights into the design space. For these plots, we employ an empirical bootstrap [5] to estimate the likely range in which the best models fall.
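For concreteness, here is one plausible implementation of such an empirical bootstrap; the subsample fraction and repetition count are our assumptions, not values taken from the paper:

```python
import numpy as np

def best_param_range(params, errors, frac=0.25, n_boot=10_000, alpha=0.05, seed=0):
    """Estimate the likely range of a parameter among the best models: repeatedly
    subsample models with replacement, record the parameter value of the
    lowest-error model in each subsample, and report a (1 - alpha) range."""
    rng = np.random.default_rng(seed)
    params, errors = np.asarray(params), np.asarray(errors)
    k = max(1, int(frac * len(params)))
    idx = rng.integers(0, len(params), size=(n_boot, k))  # bootstrap subsamples
    best = params[idx[np.arange(n_boot), np.argmin(errors[idx], axis=1)]]
    lo, hi = np.quantile(best, [alpha / 2, 1 - alpha / 2])
    return lo, hi, np.median(best)  # 95% range and the likely best value
```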

To summarize: (1) we generate distributions of models obtained by sampling and training n models from a design space, (2) we compute and plot error EDFs to summarize design space quality, (3) we visualize various properties of a design space and use an empirical bootstrap to gain insight, and (4) we use these insights to refine the design space.

 

Figure 3. General network structure for models in our design spaces. (a) Each network consists of a stem (stride-two 3×3 conv with w0 = 32 output channels), followed by the network body that performs the bulk of the computation, and then a head (average pooling followed by a fully connected layer) that predicts n output classes. (b) The network body is composed of a sequence of stages that operate at progressively reduced resolution ri. (c) Each stage consists of a sequence of identical blocks, except the first block which uses stride-two conv. While the general structure is simple, the total number of possible network configurations is vast.

 

3.2. The AnyNet Design Space

We next introduce our initial AnyNet design space. Our focus is on exploring the structure of neural networks assuming standard, fixed network blocks (e.g., residual bottleneck blocks). In our terminology the structure of the network includes elements such as the number of blocks (i.e. network depth), block widths (i.e. number of channels), and other block parameters such as bottleneck ratios or group widths. The structure of the network determines the distribution of compute, parameters, and memory throughout the computational graph of the network and is key in determining its accuracy and efficiency.

The basic design of networks in our AnyNet design space is straightforward. Given an input image, a network consists of a simple stem, followed by the network body that performs the bulk of the computation, and a final network head that predicts the output classes, see Figure 3a. We keep the stem and head fixed and as simple as possible, and instead focus on the structure of the network body that is central in determining network compute and accuracy.
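To make this concrete, below is a minimal PyTorch sketch of the stem/body/head decomposition (a simplified illustration, not the authors' released code; block_fn stands in for whichever block type is used):

```python
import torch.nn as nn

class AnyNet(nn.Module):
    """General network structure (Figure 3): fixed stem and head, with all
    structural degrees of freedom concentrated in the body."""
    def __init__(self, block_fn, ds, ws, stem_w=32, num_classes=1000):
        super().__init__()
        # Stem: stride-two 3x3 conv with w0 = 32 output channels.
        self.stem = nn.Sequential(
            nn.Conv2d(3, stem_w, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(stem_w), nn.ReLU(inplace=True))
        # Body: stage i has d_i identical blocks of width w_i; the first block
        # of each stage operates at stride two to reduce resolution.
        stages, w_in = [], stem_w
        for d, w in zip(ds, ws):
            blocks = [block_fn(w_in if j == 0 else w, w, stride=2 if j == 0 else 1)
                      for j in range(d)]
            stages.append(nn.Sequential(*blocks))
            w_in = w
        self.body = nn.Sequential(*stages)
        # Head: average pooling followed by a fully connected layer.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(w_in, num_classes))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))
```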

 

Figure 4. The X block is based on the standard residual bottleneck block with group convolution [31]. (a) Each X block consists of a 1×1 conv, a 3×3 group conv, and a final 1×1 conv, where the 1×1 convs alter the channel width. BatchNorm [12] and ReLU follow each conv. The block has 3 parameters: the width wi, bottleneck ratio bi, and group width gi. (b) The stride-two (s = 2) version.

 

The network body consists of 4 stages operating at progressively reduced resolution, see Figure 3b (we explore varying the number of stages in §3.4). Each stage consists of a sequence of identical blocks, see Figure 3c. In total, for each stage i the degrees of freedom include the number of blocks di, block width wi, and any other block parameters. While the general structure is simple, the total number of possible networks in the AnyNet design space is vast.

Most of our experiments use the standard residual bottleneck block with group convolution [31], shown in Figure 4. We refer to this as the X block, and the AnyNet design space built on it as AnyNetX (we explore other blocks in §3.4). While the X block is quite rudimentary, we show it can be surprisingly effective when network structure is optimized.
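A PyTorch sketch of the X block follows; the projection shortcut on shape changes and the placement of the final ReLU after the residual add follow standard ResNet practice and are assumptions here, as is the requirement that g divide the bottleneck width:

```python
import torch.nn as nn

def conv_bn_relu(w_in, w_out, k, stride=1, groups=1):
    """conv -> BatchNorm -> ReLU, the unit repeated inside the X block."""
    return nn.Sequential(
        nn.Conv2d(w_in, w_out, k, stride, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(w_out), nn.ReLU(inplace=True))

class XBlock(nn.Module):
    """Residual bottleneck block with group conv (Figure 4), parameterized by
    output width w_out, bottleneck ratio b, and group width g."""
    def __init__(self, w_in, w_out, stride=1, b=1, g=1):
        super().__init__()
        w_b = w_out // b  # bottleneck width
        self.f = nn.Sequential(
            conv_bn_relu(w_in, w_b, 1),
            conv_bn_relu(w_b, w_b, 3, stride=stride, groups=w_b // g),  # g channels/group
            nn.Conv2d(w_b, w_out, 1, bias=False), nn.BatchNorm2d(w_out))
        self.proj = None
        if stride != 1 or w_in != w_out:  # match shapes for the residual add
            self.proj = nn.Sequential(nn.Conv2d(w_in, w_out, 1, stride, bias=False),
                                      nn.BatchNorm2d(w_out))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        shortcut = self.proj(x) if self.proj is not None else x
        return self.relu(self.f(x) + shortcut)
```

With this block, AnyNet(XBlock, ds, ws) from the earlier sketch yields an AnyNetX-style model (up to the per-stage b and g plumbing, which we omit).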

The AnyNetX design space has 16 degrees of freedom as each network consists of 4 stages and each stage i has 4 parameters: the number of blocks di, block width wi, bottleneck ratio bi, and group width gi. We fix the input resolution r = 224 unless otherwise noted. To obtain valid models, we perform log-uniform sampling of di ≤ 16, wi ≤ 1024 and divisible by 8, bi ∈ {1, 2, 4}, and gi ∈ {1, 2, …, 32} (we test these ranges later). We repeat the sampling until we obtain n = 500 models in our target complexity regime (360MF to 400MF), and train each model for 10 epochs. Basic statistics for AnyNetX were shown in Figure 2.
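A rough sketch of this rejection-sampling loop is shown below; flops_fn is a caller-supplied complexity estimator (we do not reproduce the paper's flop accounting here), and validity checks such as compatibility between g_i and the bottleneck width are omitted:

```python
import math
import random

def log_uniform(lo, hi):
    """Draw a sample whose logarithm is uniform on [log lo, log hi]."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_anynetx(flops_fn, flop_min=360e6, flop_max=400e6):
    """Rejection-sample AnyNetX configurations (16 degrees of freedom over
    4 stages) until one lands in the target complexity regime."""
    while True:
        cfg = [dict(d=round(log_uniform(1, 16)),       # blocks per stage, d_i <= 16
                    w=8 * round(log_uniform(1, 128)),  # w_i <= 1024, divisible by 8
                    b=random.choice([1, 2, 4]),        # bottleneck ratio b_i
                    g=random.choice(range(1, 33)))     # group width g_i
               for _ in range(4)]
        if flop_min <= flops_fn(cfg) <= flop_max:
            return cfg
```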

 

There are (16·128·3·6)⁴ ≈ 10¹⁸ possible model configurations in the AnyNetX design space. Rather than searching for the single best model out of these ∼10¹⁸ configurations, we explore whether there are general design principles that can help us understand and refine this design space. To do so, we apply our approach of designing design spaces. In each step of this approach, our aims are:

1. to simplify the structure of the design space,
2. to improve the interpretability of the design space,
3. to improve or maintain the design space quality,
4. to maintain model diversity in the design space.
 

Figure 5. AnyNetXB (left) and AnyNetXC (middle) introduce a shared bottleneck ratio bi = b and shared group width gi = g, respectively. This simplifies the design spaces while resulting in virtually no change in the error EDFs. Moreover, AnyNetXB and AnyNetXC are more amenable to analysis. Applying an empirical bootstrap to b and g we see trends emerge, e.g., with 95% confidence b ≤ 2 is best in this regime (right). No such trends are evident in the individual bi and gi in AnyNetXA (not shown).

Figure 6. Example good and bad AnyNetXC networks, shown in the top and bottom rows, respectively. For each network, we plot the width wj of every block j up to the network depth d. These per-block widths wj are computed from the per-stage block depths di and block widths wi (listed in the legends for reference).

Figure 7. AnyNetXD (left) and AnyNetXE (right). We show various constraints on the per stage widths wi and depths di. In both cases, having increasing wi and di is beneficial, while using constant or decreasing values is much worse. Note that AnyNetXD = AnyNetXC + wi+1 ≥ wi, and AnyNetXE = AnyNetXD + di+1 ≥ di. We explore stronger constraints on wi and di shortly.

Figure 8. Linear fits. Top networks from the AnyNetX design space can be well modeled by a quantized linear parameterization, and conversely, networks for which this parameterization has a higher fitting error efit tend to perform poorly. See text for details.

 

We now apply this approach to the AnyNetX design space.

AnyNetXA. For clarity, going forward we refer to the initial, unconstrained AnyNetX design space as AnyNetXA.

AnyNetXB. We first test a shared bottleneck ratio bi = b for all stages i for the AnyNetXA design space, and refer to the resulting design space as AnyNetXB. As before, we sample and train 500 models from AnyNetXB in the same settings. The EDFs of AnyNetXA and AnyNetXB, shown in Figure 5 (left), are virtually identical both in the average and best case. This indicates no loss in accuracy when coupling the bi. In addition to being simpler, AnyNetXB is more amenable to analysis, see for example Figure 5 (right).

AnyNetXC. Our second refinement step closely follows the first. Starting with AnyNetXB, we additionally use a shared group width gi = g for all stages to obtain AnyNetXC. As before, the EDFs are nearly unchanged, see Figure 5 (middle). Overall, AnyNetXC has 6 fewer degrees of freedom than AnyNetXA, and reduces the design space size nearly four orders of magnitude. Interestingly, we find g > 1 is best (not shown); we analyze this in more detail in §4.

AnyNetXD. Next, we examine typical network structures of both good and bad networks from AnyNetXC in Figure 6. A pattern emerges: good networks have increasing widths. We test the design principle of wi+1 ≥ wi, and refer to the design space with this constraint as AnyNetXD. In Figure 7 (left) we see this improves the EDF substantially. We return to examining other options for controlling width shortly.

AnyNetXE. Upon further inspection of many models (not shown), we observed another interesting trend. In addition to stage widths wi increasing with i, the stage depths di likewise tend to increase for the best models, although not necessarily in the last stage. Nevertheless, we test a design space variant AnyNetXE with di+1 ≥ di in Figure 7 (right), and see it also improves results. Finally, we note that the constraints on wi and di each reduce the design space by 4! (= 24), with a cumulative reduction of O(10⁷) from AnyNetXA.

 

3.3. The RegNet Design Space

To gain further insight into the model structure, we show the best 20 models from AnyNetXE in a single plot, see Figure 8 (top-left). For each model, we plot the per-block width wj of every block j up to the network depth d (we use i and j to index over stages and blocks, respectively). See Figure 6 for reference of our model visualization.

While there is significant variance in the individual models (gray curves), in the aggregate a pattern emerges. In particular, in the same plot we show the line wj = 48·(j + 1) for 0 ≤ j ≤ 20 (solid black curve, please note that the y-axis is logarithmic). Remarkably, this trivial linear fit seems to explain the population trend of the growth of network widths for top models. Note, however, that this linear fit assigns a different width wj to each block, whereas individual models have quantized widths (piecewise constant functions).

To see if a similar pattern applies to individual models, we need a strategy to quantize a line to a piecewise constant function. Inspired by our observations from AnyNetXD and AnyNetXE, we propose the following approach. First, we introduce a linear parameterization for block widths:

$$u_j = w_0 + w_a \cdot j \quad \text{for} \quad 0 \le j < d \tag{2}$$

This parameterization has three parameters: depth d, initial width w0 > 0, and slope wa > 0, and generates a different block width uj for each block j < d. To quantize uj, we introduce an additional parameter wm > 0 that controls quantization. Given uj from Eqn. (2), we compute sj for each block j such that the following holds:

$$u_j = w_0 \cdot w_m^{s_j} \tag{3}$$

Then, to quantize uj, we simply round sj (denoted ⌊sj⌉) and compute the quantized per-block widths wj via:

$$w_j = w_0 \cdot w_m^{\lfloor s_j \rceil} \tag{4}$$

We can convert the per-block wj to our per-stage format by simply counting the number of blocks with constant width: each stage i has block width $w_i = w_0 \cdot w_m^i$ and number of blocks $d_i = \sum_j \mathbf{1}[\lfloor s_j \rceil = i]$. When only considering four stage networks, we ignore the parameter combinations that give rise to a different number of stages.
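A minimal NumPy sketch of Eqns. (2)-(4) follows; rounding widths to multiples of 8 mirrors the earlier sampling constraint and is our addition, and the parameter values in the example are illustrative only:

```python
import numpy as np

def regnet_widths(d, w0, wa, wm, q=8):
    """Generate quantized per-block widths and the derived per-stage format."""
    j = np.arange(d)
    u = w0 + wa * j                            # Eqn. (2): continuous widths
    s = np.round(np.log(u / w0) / np.log(wm))  # Eqn. (3): solve u_j = w0 * wm**s_j, round
    w = w0 * np.power(wm, s)                   # Eqn. (4): quantized per-block widths
    w = (np.round(w / q) * q).astype(int)      # snap to multiples of q channels
    ws, ds = np.unique(w, return_counts=True)  # stages = runs of constant width
    return w, ws, ds

w, ws, ds = regnet_widths(d=16, w0=32, wa=24, wm=2.5)
print(ws, ds)  # per-stage widths w_i and depths d_i
```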

 

Figure 9. RegNetX design space. See text for details.

Figure 10. RegNetX generalization. We compare RegNetX to AnyNetX at higher flops (top-left), higher epochs (top-middle), with 5-stage networks (top-right), and with various block types (bottom). In all cases the ordering of the design spaces is consistent and we see no signs of design space overfitting.

Table 1. Design space summary. See text for details.

 

We test this parameterization by fitting to models from AnyNetX. In particular, given a model, we compute the fit by setting d to the network depth and performing a grid search over w0, wa and wm to minimize the mean log-ratio (denoted by efit) of predicted to observed per-block widths. Results for two top networks from AnyNetXE are shown in Figure 8 (top-right). The quantized linear fits (dashed curves) are good fits of these best models (solid curves).
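As a sketch, the fitting error for a candidate (w0, wa, wm) might be computed as follows, reusing regnet_widths from the earlier sketch; we use the mean absolute log-ratio, as the paper does not spell out the exact form of efit (an assumption):

```python
import numpy as np

def e_fit(observed_w, w0, wa, wm):
    """Fitting error: mean log-ratio of predicted to observed per-block widths
    (taking absolute values is our assumption)."""
    predicted, _, _ = regnet_widths(len(observed_w), w0, wa, wm)
    return float(np.mean(np.abs(np.log(predicted / np.asarray(observed_w)))))
```

A grid search over w0, wa, and wm that minimizes e_fit then yields the best quantized linear fit for a given model.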

Next, we plot the fitting error efit versus network error for every network in AnyNetXC through AnyNetXE in Figure 8 (bottom). First, we note that the best models in each design space all have good linear fits. Indeed, an empirical bootstrap gives a narrow band of efit near 0 that likely contains the best models in each design space. Second, we note that on average, efit improves going from AnyNetXC to AnyNetXE, showing that the linear parametrization naturally enforces related constraints to wi and di increasing.

To further test the linear parameterization, we design a design space that only contains models with such linear structure. In particular, we specify a network structure via 6 parameters: d, w0, wa, wm (and also b, g). Given these, we generate block widths and depths via Eqn. (2)-(4). We refer to the resulting design space as RegNet, as it contains only simple, regular models. We sample d < 64, w0, wa < 256, 1.5 ≤ wm ≤ 3 and b and g as before (ranges set based on efit on AnyNetXE).

 

The error EDF of RegNetX is shown in Figure 9 (left). Models in RegNetX have better average error than AnyNetX while maintaining the best models. In Figure 9 (middle) we test two further simplifications. First, using wm = 2 (doubling width between stages) slightly improves the EDF, but we note that using wm ≥ 2 performs better (shown later). Second, we test setting w0 = wa, further simplifying the linear parameterization to uj = wa·(j + 1). Interestingly, this performs even better. However, to maintain the diversity of models, we do not impose either restriction. Finally, in Figure 9 (right) we show that random search efficiency is much higher for RegNetX; searching over just ∼32 random models is likely to yield good models.

Table 1 shows a summary of the design space sizes (for RegNet we estimate the size by quantizing its continuous parameters). In designing RegNetX, we reduced the dimension of the original AnyNetX design space from 16 to 6 dimensions, and the size nearly 10 orders of magnitude. We note, however, that RegNet still contains a good diversity of models that can be tuned for a variety of settings.

 

3.4. Design Space Generalization

We designed the RegNet design space in a low-compute, low-epoch training regime with only a single block type. However, our goal is not to design a design space for a single setting, but rather to discover general principles of network design that can generalize to new settings.

In Figure 10, we compare the RegNetX design space to AnyNetXA and AnyNetXE at higher flops, higher epochs, with 5-stage networks, and with various block types (described in the appendix). In all cases the ordering of the design spaces is consistent, with RegNetX > AnyNetXE > AnyNetXA. In other words, we see no signs of overfitting. These results are promising because they show RegNet can generalize to new settings. The 5-stage results show the regular structure of RegNet can generalize to more stages, where AnyNetXA has even more degrees of freedom.

 

Figure 11. RegNetX parameter trends. For each parameter and each flop regime we apply an empirical bootstrap to obtain the range that contains best models with 95% confidence (shown with blue shading) and the likely best model (black line), see also Figure 2. We observe that for best models the depths d are remarkably stable across flops regimes, and b = 1 and wm ≈ 2.5 are best. Block and group widths (wa, w0, g) tend to increase with flops.

Figure 12. Complexity metrics. Top: Activations can have a stronger correlation to runtime on hardware accelerators than flops (we measure inference time for 64 images on an NVIDIA V100 GPU). Bottom: Trend analysis of complexity vs. flops and best fit curves (shown in blue) of the trends for best models (black curves).

Figure 13. We refine RegNetX using various constraints (see text). The constrained variant (C) is best across all flop regimes while being more efficient in terms of parameters and activations.

Figure 14. We evaluate RegNetX with alternate design choices. Left: Inverted bottleneck (1/8 ≤ b ≤ 1) degrades results and depthwise conv (g = 1) is even worse. Middle: Varying resolution r harms results. Right: RegNetY (Y = X + SE) improves the EDF.

 

4. Analyzing the RegNetX Design Space

We next further analyze the RegNetX design space and revisit common deep network design choices. Our analysis yields surprising insights that don’t match popular practice, which allows us to achieve good results with simple models.

As the RegNetX design space has a high concentration of good models, for the following results we switch to sampling fewer models (100) but training them for longer (25 epochs) with a learning rate of 0.1 (see appendix). We do so to observe more fine-grained trends in network behavior.

RegNet trends. We show trends in the RegNetX parameters across flop regimes in Figure 11. Remarkably, the depth of best models is stable across regimes (top-left), with an optimal depth of ∼20 blocks (60 layers). This is in contrast to the common practice of using deeper models for higher flop regimes. We also observe that the best models use a bottleneck ratio b of 1.0 (top-middle), which effectively removes the bottleneck (commonly used in practice). Next, we observe that the width multiplier wm of good models is ∼2.5 (top-right), similar but not identical to the popular recipe of doubling widths across stages. The remaining parameters (g, wa, w0) increase with complexity (bottom).

 

Complexity analysis. In addition to flops and parameters, we analyze network activations, which we define as the size of the output tensors of all conv layers (we list complexity measures of common conv operators in Figure 12, top-left). While not a common measure of network complexity, activations can heavily affect runtime on memory-bound hardware accelerators (e.g., GPUs, TPUs), for example, see Figure 12 (top). In Figure 12 (bottom), we observe that for the best models in the population, activations increase with the square-root of flops, parameters increase linearly, and runtime is best modeled using both a linear and a square-root term due to its dependence on both flops and activations.
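In the conventions of Figure 12 (top-left), the per-layer accounting can be sketched as follows (a simplified illustration; flops are counted as multiply-adds):

```python
def conv_complexity(w_in, w_out, k, r_out, groups=1):
    """Flops (multiply-adds), parameters, and activations of one conv layer
    with k x k kernels producing an r_out x r_out output."""
    flops = k * k * (w_in // groups) * w_out * r_out * r_out
    params = k * k * (w_in // groups) * w_out
    acts = w_out * r_out * r_out  # size of the output tensor
    return flops, params, acts
```

Summing these quantities over all conv layers of a model gives the complexity measures analyzed in Figure 12 (bottom).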

RegNetX constrained. Using these findings, we refine the RegNetX design space. First, based on Figure 11 (top), we set b = 1, d ≤ 40, and wm ≥ 2. Second, we limit parameters and activations, following Figure 12 (bottom). This yields fast, low-parameter, low-memory models without affecting accuracy. In Figure 13, we test RegNetX with these constraints and observe that the constrained version is superior across all flop regimes. We use this version in §5, and further limit depth to 12 ≤ d ≤ 28 (see also Appendix D).

 

Alternate design choices. Modern mobile networks often employ the inverted bottleneck (b < 1) proposed in [25] along with depthwise conv [1] (g = 1). In Figure 14 (left), we observe that the inverted bottleneck degrades the EDF slightly and depthwise conv performs even worse relative to b = 1 and g ≥ 1 (see appendix for further analysis). Next, motivated by [29] who found that scaling the input image resolution can be helpful, we test varying resolution in Figure 14 (middle). Contrary to [29], we find that for RegNetX a fixed resolution of 224×224 is best, even at higher flops.

SE. Finally, we evaluate RegNetX with the popular Squeeze-and-Excitation (SE) op [10] (we abbreviate X+SE as Y and refer to the resulting design space as RegNetY). In Figure 14 (right), we see that RegNetY yields good gains.

 

Figure 15. Top REGNETX models. We measure inference time for 64 images on an NVIDIA V100 GPU; train time is for 100 epochs on 8 GPUs with the batch size listed. Network diagram legends contain all information required to implement the models.

Figure 16. Top REGNETY models (Y=X+SE). The benchmarking setup and the figure format is the same as in Figure 15.

 

5. Comparison to Existing Networks

We now compare top models from the RegNetX and RegNetY design spaces at various complexities to the state-of-the-art on ImageNet [3]. We denote individual models using small caps, e.g. REGNETX. We also suffix the models with the flop regime, e.g. 400MF. For each flop regime, we pick the best model from 25 random settings of the RegNet parameters (d, g, wm, wa, w0), and re-train the top model 5 times at 100 epochs to obtain robust error estimates.
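A sketch of this selection protocol, where sample_regnet and train_and_eval are hypothetical stand-ins for the sampling and training steps (in the paper the selection and final runs use different schedules):

```python
def pick_top_model(sample_regnet, train_and_eval, n_trials=25, n_retrain=5):
    """Pick the best of n_trials random RegNet settings, then re-train the
    winner n_retrain times to obtain a robust error estimate."""
    candidates = [sample_regnet() for _ in range(n_trials)]
    best = min(candidates, key=train_and_eval)  # lowest error wins
    errors = [train_and_eval(best) for _ in range(n_retrain)]
    return best, sum(errors) / len(errors)
```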

Resulting top REGNETX and REGNETY models for each flop regime are shown in Figures 15 and 16, respectively. In addition to the simple linear structure and the trends we analyzed in §4, we observe an interesting pattern. Namely, the higher flop models have a large number of blocks in the third stage and a small number of blocks in the last stage. This is similar to the design of standard RESNET models. Moreover, we observe that the group width g increases with complexity, but depth d saturates for large models.

 
Our goal is to perform fair comparisons and provide simple and easy-to-reproduce baselines. We note that along with better architectures, much of the recently reported gains in network performance are based on enhancements to the training setup and regularization scheme (see Table 7). As our focus is on evaluating network architectures, we perform carefully controlled experiments under the same training setup. In particular, to provide fair comparisons to classic work, we do not use any training-time enhancements.  

5.1. State-of-the-Art Comparison: Mobile Regime

Much of the recent work on network design has focused on the mobile regime (∼600MF). In Table 2, we compare REGNET models at 600MF to existing mobile networks. We observe that REGNETS are surprisingly effective in this regime considering the substantial body of work on finding better mobile networks via both manual design [9, 25, 19] and NAS [35, 23, 17, 18].

We emphasize that REGNET models use our basic 100 epoch schedule with no regularization except weight decay, while most mobile networks use longer schedules with various enhancements, such as deep supervision [16], Cutout [4], DropPath [14], AutoAugment [2], and so on. As such, we hope our strong results obtained with a short training schedule without enhancements can serve as a simple baseline for future work.

 

Table 2. Mobile regime. We compare existing models using originally reported errors to RegNet models trained in a basic setup. Our simple RegNet models achieve surprisingly good results given the effort focused on this regime in the past few years.

Table 3. RESNE(X)T comparisons. (a) Grouped by activations, REGNETX show considerable gains (note that for each group GPU inference and training times are similar). (b) REGNETX models outperform RESNE(X)T models under fixed flops as well.

Table 4. EFFICIENTNET comparisons using our standard training schedule. Under comparable training settings, REGNETY outperforms EFFICIENTNET for most flop regimes. Moreover, REGNET models are considerably faster, e.g., REGNETX-8000 is about 5× faster than EFFICIENTNET-B5. Note that originally reported errors for EFFICIENTNET (shown grayed out) are much lower but use longer and enhanced training schedules, see Table 7.

Figure 17. ResNe(X)t comparisons. REGNETX models versus RESNE(X)T-(50,101,152) under various complexity metrics. As all models use identical components and training settings, all observed gains are from the design of the RegNetX design space.

Figure 18. EFFICIENTNET comparisons. REGNETs outperform the state of the art, especially when considering activations.

 

5.2. Standard Baselines Comparison: ResNe(X)t

Next, we compare REGNETX to standard RESNET [8] and RESNEXT [31] models. All of the models in this experiment come from the exact same design space, the former being manually designed, the latter being obtained through design space design. For fair comparisons, we compare REGNET and RESNE(X)T models under the same training setup (our standard REGNET training setup). We note that this results in improved RESNE(X)T baselines and highlights the importance of carefully controlling the training setup.

Comparisons are shown in Figure 17 and Table 3. Overall, we see that REGNETX models, by optimizing the network structure alone, provide considerable improvements under all complexity metrics. We emphasize that good REGNET models are available across a wide range of compute regimes, including in low-compute regimes where good RESNE(X)T models are not available.

 

Table 3a shows comparisons grouped by activations (which can strongly influence runtime on accelerators such as GPUs). This setting is of particular interest to the research community where model training time is a bottleneck and will likely have more real-world use cases in the future, especially as accelerators gain more use at inference time (e.g., in self-driving cars). REGNETX models are quite effective given a fixed inference or training time budget.

 

5.3. State-of-the-Art Comparison: Full Regime

We focus our comparison on EFFICIENTNET [29], which is representative of the state of the art and has reported impressive gains using a combination of NAS and an interesting model scaling rule across complexity regimes. To enable direct comparisons, and to isolate gains due solely to improvements of the network architecture, we opt to reproduce the exact EFFICIENTNET models but using our standard training setup, with a 100 epoch schedule and no regularization except weight decay (the effect of a longer schedule and stronger regularization is shown in Table 7). We optimize only lr and wd, see Figure 22 in appendix. This is the same setup as REGNET and enables fair comparisons. Results are shown in Figure 18 and Table 4. At low flops, EFFICIENTNET outperforms REGNETY. At intermediate flops, REGNETY outperforms EFFICIENTNET, and at higher flops both REGNETX and REGNETY perform better. We also observe that for EFFICIENTNET, activations scale linearly with flops (due to the scaling of both resolution and depth), compared to activations scaling with the square-root of flops for REGNETs. This leads to slow GPU training and inference times for EFFICIENTNET. E.g., REGNETX-8000 is 5× faster than EFFICIENTNET-B5, while having lower error.

6. Conclusion

In this work, we present a new network design paradigm. Our results suggest that designing network design spaces is a promising avenue for future research.  

Table 5. RESNE(X)T comparisons on ImageNetV2.

Table 6. EFFICIENTNET comparisons on ImageNetV2.

 

Figure 19. Additional ablations. Left: Fixed depth networks (d = 20) are effective across flop regimes. Middle: Three stage networks perform poorly at high flops. Right: Inverted bottleneck (b < 1) is also ineffective at high flops. See text for more context.

Figure 20. Swish vs. ReLU. Left: RegNetY performs better with Swish than ReLU at 400MF but worse at 6.4GF. Middle: Results across wider flop regimes show similar trends. Right: If, however, g is restricted to be 1 (depthwise conv), Swish is much better.

 


Reposted from https://blog.csdn.net/qq_41185868/article/details/105278487