Computer Science
Power-law scaling to assist with key challenges in artificial intelligence
Y. Meir, S. Sardi, et al.
The study investigates whether and how test error in supervised deep learning decreases with dataset size according to a power law, analogous to scaling behavior in critical physical systems. Motivated by the central role of power-law scaling in phase transitions and critical phenomena, the authors train deep neural networks on MNIST to determine whether the test error ε scales as ε ≈ c0 (dataset size/label)^{-p}. The goals are to (a) quantify the scaling exponent p and constant c0 across training regimes (one epoch vs. many epochs) and architectures (number of hidden layers, inclusion of input crosses), (b) assess whether optimized one-epoch training can approach the accuracy achieved by many epochs, enabling rapid decision-making with limited examples, and (c) establish power-law scaling as a practical tool for estimating the dataset size required for a target test accuracy and as a benchmark for comparing task and algorithmic complexity.
The paper situates its work within the framework of power-law phenomena pervasive in physics and complex systems (e.g., earthquakes, network topology, turbulence, finance, linguistics, and neural signals). In machine learning, deep learning has been applied broadly in physics (astrophysics, high-energy physics, imaging, and phase transition analysis). Prior work has examined learning performance and scaling behaviors; the authors highlight the MNIST benchmark and backpropagation methods, including momentum, and reference empirical scaling in deep learning from prior studies. This context motivates probing whether deep learning error exhibits power-law convergence with dataset size and how training strategies and architectures influence the scaling parameters.
Dataset and task: MNIST handwritten digits, 60,000 training and 10,000 test images (no data augmentation). Performance is measured by test error on the held-out test set.
Network architectures: The primary architecture is a feedforward network with 784 inputs (28x28 pixels), two hidden layers of 100 units each, and 10 output units. Some experiments add 'input crosses' (hyper-weights), whereby each hidden unit in the first layer receives the same number of crosses, each cross comprising two input pixels. Additional experiments evaluate architectures without input crosses and with 1, 2, or 3 hidden layers (100 units each).
Preprocessing and training protocol: Datasets are balanced, with equal numbers of random examples per label. Each example undergoes mean subtraction and normalization of its pixel standard deviation to unity (see the preprocessing sketch below). Labels appear in a fixed order within each epoch, with examples ordered randomly given that label order. Forward propagation uses sigmoid activations; during the forward pass, the cumulative average field is dynamically subtracted from the induced field on hidden-layer nodes. Training is evaluated for one epoch (each example seen once) and for many epochs (until the test error saturates; ≥150 epochs for momentum, 30–50 for the accelerated method). Dataset sizes are varied via examples per label (e.g., 9, 15, 30, 60, 120 for many-epoch studies; 30, 60, 120, 240 for some one-epoch layer-depth studies), and performance is extrapolated to 6,000 examples per label (the full training set).
Optimization strategies:
- Momentum backpropagation: W^{t+1} = (1−α)W^t + V^{t+1}, with V^{t+1} = μV^t − η∇_W C; biases updated analogously. The cross-entropy loss includes L2 regularization on weights. Five global hyperparameters are optimized per dataset size: μ, α, η, Amp1, Amp2 (amplitudes for hidden layers in forward propagation). Test error is minimized over these parameters for each dataset size.
- Accelerated (brain-inspired) backpropagation: A time-dependent, per-weight step size η^t is updated as η^{t+1} = η^t + A tanh(B ⋅ ∂C/∂W), which accelerates convergence when gradients are sign-coherent. Weights are updated as W^{t+1} = (1−α)W^t + V^{t+1}, with V^{t+1} = μV^t − η^{t+1} ∂C/∂W; biases receive analogous time-dependent η updates. Eleven parameters are optimized: A1, A2, A3, β1, β2, β3, μ, α, τ, Amp1, Amp2.
Ensembling: A soft committee decision combines an ensemble of N replicas (typically N = 50), each trained on the same data with different initializations. The predicted label is the argmax over the sum of output activations across replicas.
Evaluation: For each dataset size, the test error ε is averaged over many random samples (e.g., 50–200) and fitted to a power law ε ≈ c0 (dataset size/label)^{-p}. Extrapolations to 6,000 examples per label are reported. Comparisons are made across training epochs (one vs. many), architectures (with/without input crosses; number of hidden layers), and strategies (momentum vs. accelerated).
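As a concrete reading of the preprocessing protocol, the following is a minimal NumPy sketch of balanced per-label sampling and per-example standardization; the function names, array shapes, and use of NumPy's random Generator are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def balanced_subset(images, labels, per_label, rng):
    """Draw the same number of random examples for each of the 10 digit labels."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == digit), size=per_label, replace=False)
        for digit in range(10)
    ])
    return images[idx], labels[idx]

def standardize(images):
    """Per-example mean subtraction and normalization of the pixel standard
    deviation to unity (images: shape (n_examples, 784))."""
    centered = images - images.mean(axis=1, keepdims=True)
    return centered / centered.std(axis=1, keepdims=True)

# Hypothetical usage:
# rng = np.random.default_rng(0)
# x, y = balanced_subset(train_images, train_labels, per_label=60, rng=rng)
# x = standardize(x)
```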
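The two weight-update rules can be transcribed almost literally into NumPy. The sketch below follows the equations as stated above; it omits the per-layer amplitudes (Amp1, Amp2), the τ parameter, and the bias updates, and the function signatures are illustrative.

```python
import numpy as np

def momentum_step(W, V, grad, mu, alpha, eta):
    """W_{t+1} = (1 - alpha) W_t + V_{t+1},  V_{t+1} = mu V_t - eta grad.
    grad is dC/dW, where C is the cross-entropy loss with L2 regularization."""
    V = mu * V - eta * grad
    W = (1.0 - alpha) * W + V
    return W, V

def accelerated_step(W, V, eta, grad, mu, alpha, A, B):
    """Per-weight step size eta_{t+1} = eta_t + A tanh(B * grad), followed by
    the same momentum-style update using the updated eta (eta has W's shape)."""
    eta = eta + A * np.tanh(B * grad)
    V = mu * V - eta * grad
    W = (1.0 - alpha) * W + V
    return W, V, eta
```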
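The soft-committee decision itself reduces to a sum and an argmax; in the sketch below the layout of the replica outputs is an assumption.

```python
import numpy as np

def committee_predict(replica_outputs):
    """Soft committee decision: sum the output activations over all replicas
    and take the argmax over the 10 classes.

    replica_outputs: array of shape (n_replicas, n_examples, 10)."""
    return replica_outputs.sum(axis=0).argmax(axis=1)
```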
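The power-law fit and extrapolation described under Evaluation can be reproduced with an ordinary log-log regression; this is one plausible implementation, not necessarily the authors' exact fitting procedure.

```python
import numpy as np

def fit_power_law(examples_per_label, test_error):
    """Fit eps ≈ c0 * n^(-p) as a straight line in log-log space; returns (c0, p)."""
    slope, intercept = np.polyfit(np.log(examples_per_label), np.log(test_error), 1)
    return float(np.exp(intercept)), float(-slope)

# Hypothetical usage with measured errors at the dataset sizes used in the study:
# c0, p = fit_power_law([9, 15, 30, 60, 120], measured_errors)
# eps_full = c0 * 6000.0 ** (-p)   # extrapolation to 6,000 examples per label
```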
- Power-law scaling: Across architectures and training regimes, the minimized test error ε scales with dataset size per label as ε ≈ c0 (dataset size/label)^{-p}.
- Many epochs, with input crosses, momentum strategy: c0 ≈ 0.65, p ≈ 0.50; extrapolated ε(6000) ≈ 0.008. Saturation requires ≥150 epochs.
- Many epochs, accelerated strategy: Achieves the same saturated test error as momentum but in 30–50 epochs.
- Many epochs, soft committee (N=50, momentum): ε ≈ 0.007 with p ≈ 0.52, close to state-of-the-art deep learning results on MNIST.
- One epoch, with input crosses: • Momentum: p ≈ 0.49; ε(6000) ≈ 0.0207. • Soft committee (N=50, momentum): p ≈ 0.477; ε(6000) ≈ 0.0151. • Accelerated: p ≈ 0.49; ε(6000) ≈ 0.0174. • Soft committee (N=50, accelerated): p ≈ 0.492; ε(6000) ≈ 0.0131. Optimized one-epoch training thus comes close to the many-epoch best (≈0.013 vs. ≈0.007).
- Without input crosses, one epoch, varying depth (100 units/layer): • 1 hidden layer: p ≈ 0.302; ε(6000) ≈ 0.053. • 2 hidden layers: p ≈ 0.339; ε(6000) ≈ 0.049. • 3 hidden layers: p ≈ 0.385; ε(6000) ≈ 0.048. Increasing depth raises the exponent p, indicating faster asymptotic error decay with dataset size, but the constants c0 favor shallower networks for small datasets. Crossing points: the error curves for 1 and 2 hidden layers cross at ~480 examples/label, and those for 2 and 3 hidden layers at ~4,100 examples/label (see the crossing-point sketch after this list).
- Soft committee, two layers without input crosses, one epoch: p ≈ 0.306 (expected to slightly increase with better statistics); ε(6000) ≈ 0.0377.
- Many epochs without input crosses (1–3 hidden layers): The saturated ε at the maximal dataset size is ~0.017 with p ≈ 0.4, largely independent of depth; the three-layer ε is slightly higher but within error bars. With input crosses, ε ≈ 0.008 and p ≈ 0.5, indicating a substantial benefit from input crosses (~0.028 improvement compared with the two-layer network without crosses).
- Practical implication: Power-law scaling enables a priori estimation of the dataset size needed for a target accuracy and provides a benchmark for comparing learning-task difficulty (via the exponent p) and algorithmic efficiency (via c0 and p).
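To make this concrete, the fitted pair (c0, p) can be inverted to estimate the dataset size needed for a target error, and two fitted curves can be compared via their crossing point (as in the depth comparison above). The sketch below is illustrative; the 1% target error in the comment is an example value derived from the rounded fit reported above, not a result from the paper.

```python
def required_examples_per_label(c0, p, target_error):
    """Invert eps = c0 * n^(-p):  n = (c0 / eps)^(1/p)."""
    return (c0 / target_error) ** (1.0 / p)

def crossing_point(c0_a, p_a, c0_b, p_b):
    """Dataset size per label at which two fitted power laws intersect:
    c0_a * n^(-p_a) = c0_b * n^(-p_b)  =>  n = (c0_b / c0_a)^(1 / (p_b - p_a))."""
    return (c0_b / c0_a) ** (1.0 / (p_b - p_a))

# With the many-epoch, input-cross fit reported above (c0 ≈ 0.65, p ≈ 0.50):
# required_examples_per_label(0.65, 0.50, 0.01) ≈ 4,225 examples per label.
```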
The findings demonstrate that test error in supervised deep learning on MNIST follows a robust power-law decrease with dataset size across training strategies and architectures. The exponent p serves as a quantitative indicator of task and algorithmic difficulty: smaller p implies a harder problem requiring more data to reach a given error if c0 is similar. The closeness of one-epoch results to many-epoch performance suggests that rapid decision-making is feasible in scenarios where examples cannot be revisited, relevant to real-time applications in robotics and network control. Input crosses substantially improve performance, and depth increases the power-law exponent for one-epoch training, enhancing asymptotic scaling with larger datasets. For many-epoch training without input crosses, similar saturated errors across depths call into question the advantage of deeper architectures in this setting, although additional epochs and broader parameter-space exploration could narrow the gaps. Overall, the power-law framework provides a principled means to estimate required dataset sizes, compare architectures and training algorithms, and organize tasks into a quantitative hierarchy by scaling behavior.
This work establishes that deep learning test error exhibits power-law scaling with dataset size, enabling predictive, a priori estimation of required data for target accuracies and offering a benchmark for comparing tasks and learning strategies. One-epoch training, especially with soft committees and accelerated updates, approaches state-of-the-art many-epoch performance, supporting real-time applications under data and time constraints. Depth increases the scaling exponent in one-epoch regimes, while input crosses markedly improve accuracy. Future research should extend and confirm these scaling laws across diverse datasets and tasks to build a broader hierarchy of learning complexities, investigate architectural and optimization choices that modulate the scaling parameters c0 and p, and probe the generality of depth-related advantages and the role of input cross structures.
The study focuses on a single dataset (MNIST) and specific fully connected architectures, limiting generalizability. Several results rely on extrapolations to the maximal dataset size (6,000 examples per label). Optimization of the accelerated method involves an 11-parameter search that is computationally heavy. Some conclusions (e.g., depth’s limited advantage under many epochs without input crosses) may be preliminary and potentially exceptional, as noted by the authors, and could depend on increased epochs or improved parameter tuning. Reported exponents may vary slightly with better statistics. One-epoch constraints can hinder optimization in deeper networks due to larger parameter counts.