Computer Science
Machine learning in spectral domain
L. Giambagli, L. Buffoni, et al.
The paper addresses how to train neural networks by operating in the spectral (reciprocal) domain rather than the conventional node/weight (direct) space. Motivated by the lack of understanding of how massive numbers of weights interact in deep networks and by the desire to reduce parameter counts while preserving performance, the authors propose training eigenvalues and eigenvectors of layer transfer operators. The central hypothesis is that learning targeted at spectral components—especially eigenvalues associated with engineered, non-orthogonal eigenvectors that propagate information forward—can provide a global, efficient, and effective alternative to local weight updates. The study demonstrates this idea using MNIST digit recognition, comparing spectral training to standard backprop-based training under matched parameter budgets and architectures, and exploring both linear and non-linear settings with “telescopic” linear layers that can be retracted after training.
The introduction situates the work within classic and modern machine learning, recalling the perceptron and multilayer networks trained by backpropagation, and the need for large labeled datasets. Prior ideas of training only a subset of parameters or using alternative compressed parameterizations are referenced (e.g., training only BatchNorm parameters; low-rank/SVD-inspired treatments of layer mappings). The authors contrast their spectral approach—framing learning in an enlarged space enabling square operators and invoking spectral properties—with works using SVD of rectangular weight matrices to analyze information-theoretic aspects. They also connect to broader themes such as pretraining, representation understanding, and computational efficiency. Specific references include foundational ML texts and works on efficient training and information measures in deep nets.
Problem and data: The method is demonstrated on MNIST (60k train, 10k test, 28×28 grayscale images). Each image is vectorized into η₁ of length N, with the first N₁=784 entries set to pixel intensities and the remaining entries initially zero. The output layer has N_e=10 nodes.
Core spectral construction: Between successive layers k and k+1 with sizes N_k and N_{k+1}, define an N×N (or compact (N_k+N_{k+1})×(N_k+N_{k+1})) operator A_k via its spectral decomposition using:
- Φ_k: a basis matrix equal to identity with a sub-diagonal block of size N_{k+1}×N_k (entries initialized as uniform random numbers, e.g., in [-0.5, 0.5]). This creates a nested, non-orthogonal set of eigenvectors that “indent” forward to propagate information from layer k to k+1.
- Λ_k: a diagonal matrix of eigenvalues. Typically, the first N_k entries are 1; the subsequent N_{k+1} entries are non-trivial trainable parameters (initially uniform in a small interval, e.g., [-0.01, 0.01]).
- A_k is then assembled as A_k = Φ_k Λ_k (2I − Φ_k) (or, in the compact algorithmic variant, A_k = Φ_k Λ_k (2I + Φ_k)^{-1} Φ_k as per the implementation description), yielding the linear transformation η_{k+1} = A_k η_k. Here 2I − Φ_k equals Φ_k^{-1}, because the sub-diagonal block of Φ_k squares to zero. The full network output after l−1 such steps is η_l = (∏_{k=1}^{l−1} Φ_k Λ_k (2I − Φ_k)) η_1. Due to the structure of Φ_k, the components up to and including layer k are preserved while the next block (layer k+1) is populated by a compressed representation determined by the indentation and eigenvalues.
Compactification (linear case): For linear architectures, the cascade can be compacted into a single equivalent perceptron A_c of size (N₁+N_e)×(N₁+N_e) that maps z (pixels plus zeroed outputs) directly to outputs. Intermediate linear layers can be introduced during training to expand parameter capacity (telescopic behavior) and later retracted to obtain a compact operational model.
Non-linear extension: Insert non-linearities after linear spectral transforms. The authors use ReLU f(x)=max(0,x) (alternatively tanh with tunable β_k). The pipeline is η_{k+1} = f(A_k η_k). Linear layers can be interleaved between non-linear ones to increase trainable eigenvalue capacity and later be retracted. The equivalent direct-space architecture has self-loops and feedforward connections determined by the learned spectral operators.
Loss and optimization: For classification, the loss is categorical cross-entropy applied to the final N_e outputs (after softmax). In derivations, exemplar forms of the loss for linear and non-linear cascades are provided. Training can target:
- Only eigenvalues (Λ blocks) in selected layers.
- Both eigenvalues and sub-diagonal entries of Φ (eigenvector indentation), optionally together with the non-linearity parameters.
Implementation: Implemented in TensorFlow/Keras with custom spectral layers. Two fully trainable matrices per spectral layer are created and masked so that only intended Λ diagonal entries and Φ sub-diagonal blocks are trainable. Optimizer: AdaMax (Adam variant) with learning rate 0.03 (linear) and 0.01 (non-linear), trained for ~20 epochs, mini-batches of size 300–800. Hardware: Google Colab VM/GPU. Baseline conventional NNs (direct-space) are trained on the same platform with identical optimizers and hyperparameters. Hyperparameters were chosen for stability/efficiency; no exhaustive tuning was performed.
Architectures evaluated:
- Linear two-layer perceptron: N₁=784 → N₂=10.
- Wide linear: insert one linear intermediate layer of size N₂ between input (784) and output (10); vary N₂.
- Deep linear: multiple successive linear intermediate layers; examples include l=3 with intermediate sizes N₂∈{20,80,100,500,800}, l=4 with N₂=800 and N₃∈{100,200,400,600}, and l=5 with N₂=800, N₃=600, N₄=500.
- Deep non-linear: four-layer network with sizes N₁=784, N₂ (linear, varied), N₃=120 (ReLU applied here), N₄=10; the linear N₂ layer is retracted post-training for an effective three-layer non-linear model.
Comparative training regimes:
- Spectral constrained: train only a specified set of eigenvalues with eigenvectors fixed.
- Spectral extended: train both eigenvalues and the Φ sub-diagonal (eigenvectors indentation).
- Direct-space constrained NN: train only a randomly selected subset of weights equal in number to the eigenvalues trained in the spectral setting.
- Direct-space unconstrained NN: train all weights (and biases) in the conventional architecture.
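The core spectral construction described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' TensorFlow/Keras implementation: the function name `spectral_layer` is hypothetical, the compact (N_k+N_{k+1})-dimensional form is used, and the input-block eigenvalues are fixed to 1 as in the typical setting described for Λ_k. Under those assumptions it confirms that 2I − Φ_k inverts Φ_k, that the input block passes through unchanged, and that the output block equals a direct-space weight matrix W_ij = (1 − λ_i) φ_ij applied to the input.

```python
import numpy as np

def spectral_layer(n_in, n_out, rng):
    """Build Phi, the eigenvalues, and A = Phi @ Lambda @ (2I - Phi) for one layer pair."""
    n = n_in + n_out
    phi = np.eye(n)
    # Sub-diagonal block (n_out x n_in): the eigenvector "indentation".
    phi[n_in:, :n_in] = rng.uniform(-0.5, 0.5, size=(n_out, n_in))
    # Assumption: eigenvalues are 1 on the input block, small random values on the output block.
    lam = np.concatenate([np.ones(n_in), rng.uniform(-0.01, 0.01, size=n_out)])
    a = phi @ np.diag(lam) @ (2 * np.eye(n) - phi)
    return phi, lam, a

rng = np.random.default_rng(0)
n_in, n_out = 784, 10
n = n_in + n_out
phi, lam, a = spectral_layer(n_in, n_out, rng)

# 2I - Phi is the exact inverse of Phi because the sub-diagonal block squares to zero.
inverse_ok = np.allclose(phi @ (2 * np.eye(n) - phi), np.eye(n))

# Input vector: pixel block filled, output block zeroed.
eta = np.concatenate([rng.uniform(0, 1, size=n_in), np.zeros(n_out)])
eta_next = a @ eta

# The input block is carried through unchanged ...
input_preserved = np.allclose(eta_next[:n_in], eta[:n_in])

# ... and the output block equals W @ x with W_ij = (1 - lam_i) * phi_ij.
w = (1.0 - lam[n_in:, None]) * phi[n_in:, :n_in]
output_matches = np.allclose(eta_next[n_in:], w @ eta[:n_in])
```

In this sketch each output eigenvalue rescales an entire row of the effective weight matrix, which is one way to read the claim that eigenvalues act as global rather than local training targets.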
Results:
- Linear two-layer perceptron (N₁=784, N₂=10):
  • Spectral training of eigenvalues only (N=N₁+N₂=794 parameters) achieved accuracy 82 ± 2% (mean over 5 runs).
  • Direct-space constrained NN with the same number of trainable parameters achieved 79 ± 3%.
  • Full direct-space training (all N₁×N₂ weights plus biases) achieved 92.7 ± 0.1%.
  • Spectral training of both eigenvalues and eigenvectors (total N₁+N₂+N₁×N₂ parameters) achieved 92.5 ± 0.2%, matching full direct-space performance.
  • Weight distributions: Training only eigenvalues in spectral domain yields a weight distribution closely matching full direct-space training, despite using about 10% of the parameters. Direct-space training constrained to N parameters produces markedly different distributions.
- Wide linear configuration (one intermediate linear layer of size N₂ between input and output):
  • Spectral training on N₂ + N₃ eigenvalues outperforms direct-space constrained training on the same number of parameters across N₂ values.
  • Spectral performance comparable to full direct-space training is reached around N₂≈1000 while using roughly 13% of the parameters of a standard two-layer perceptron with N₁×N₂ parameters. In a highlighted case, spectral training used only ~0.13% of parameters compared to standard full training yet matched the weight distribution of the fully trained direct model.
- Deep linear configurations (multiple linear intermediate layers):
  • Across l=3 to l=5 variants (e.g., N₂ in {20,80,100,500,800}; then N₂=800 with N₃ in {100,200,400,600}; then N₂=800, N₃=600, N₄=500), spectral eigenvalue training consistently outperforms direct-space constrained training with the same parameter counts.
  • The equivalent compact perceptron’s weight distribution from spectral training matches that from full direct-space training that uses many more parameters.
- Deep non-linear configuration (N₁=784, N₂ variable linear, N₃=120 with ReLU, N₄=10):
  • Spectral eigenvalues-only training outperforms direct-space constrained training for matched parameter budgets (N₂+N₃+N₄ trained parameters).
  • Full direct-space training (all N₁N₂ + N₂N₃ + N₃N₄ weights) achieves accuracy comparable to spectral training when both eigenvalues and eigenvectors are trained; spectral appears more robust to overfitting as N₂ increases.
- Aggregate best mean accuracies over explored ranges (Table 1):
  • Constrained NN (direct-space subset): Linear wide 87.8%, Linear deep 90.8%, Non-linear 88.9%.
  • Train Λ (spectral eigenvalues only): Linear wide 90.9%, Linear deep 91.5%, Non-linear 93.5%.
  • Unconstrained NN (full direct-space): Linear wide 91.0%, Linear deep 91.2%, Non-linear 96.4%.
  • Train Λ, Φ (spectral eigenvalues+eigenvectors): Linear wide 92.2%, Linear deep 91.8%, Non-linear 97.3%.
Overall, eigenvalues are highly informative/global training targets; spectral approaches can match or exceed constrained direct-space training with far fewer parameters and can approximate the behavior of fully trained direct models.
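The parameter fractions quoted in the results ("about 10%", "roughly 13%", "~0.13%") can be reproduced with simple arithmetic. The sketch below is a plausibility check rather than the authors' own accounting: which denominators the paper used is an assumption here, namely 784×10 weights plus 10 biases for the two-layer case, 784×10 weights for the reference perceptron, and 784×1000 + 1000×10 weights for the full wide network at N₂=1000.

```python
n1, n_out = 784, 10

# Two-layer perceptron: trained eigenvalues vs. all weights plus biases.
eig_two_layer = n1 + n_out                      # 794 eigenvalues
full_two_layer = n1 * n_out + n_out             # 7850 weights and biases
ratio_two_layer = eig_two_layer / full_two_layer

# Wide linear case with an intermediate layer of size N2 = 1000.
n2 = 1000
eig_wide = n2 + n_out                           # 1010 eigenvalues
# Assumed reference 1: a plain 784 -> 10 perceptron (weights only).
ratio_vs_perceptron = eig_wide / (n1 * n_out)
# Assumed reference 2: full training of the 784 -> 1000 -> 10 network (weights only).
ratio_vs_full_wide = eig_wide / (n1 * n2 + n2 * n_out)

print(f"two-layer:     {ratio_two_layer:.1%}")
print(f"vs perceptron: {ratio_vs_perceptron:.1%}")
print(f"vs full wide:  {ratio_vs_full_wide:.2%}")
```

Under these assumed denominators the three ratios land at about 10%, 13%, and 0.13%, so the two wide-linear figures are compatible with each other: they appear to refer to different reference models.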
The findings support the hypothesis that global training in the spectral domain, focusing on eigenvalues associated with engineered non-orthogonal eigenvectors, is an efficient and effective alternative to local weight updates. Eigenvalues emerge as key levers: even when eigenvectors are fixed, tuning eigenvalues surpasses direct-space training constrained to the same parameter count, across linear and non-linear architectures. The observed alignment between the weight distributions produced by eigenvalues-only spectral training and full direct-space training suggests that spectral pretraining can guide networks toward solutions similar to fully optimized models while using orders of magnitude fewer trainable parameters. Telescopic linear layers inserted during training expand the spectral parameter space and can be retracted post-training without loss of performance, offering a practical route to compact yet well-performing networks. In non-linear settings, spectral training remains advantageous and appears less prone to overfitting as capacity grows. These results indicate that spectral perspectives can provide both practical benefits (efficiency, pretraining) and conceptual insights into collective modes of information flow in deep networks.
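The retraction of telescopic linear layers rests on the fact that a cascade of linear maps equals a single matrix. A minimal NumPy sketch follows, under stated assumptions: the helper `effective_weights` is hypothetical, and it uses the direct-space form W_ij = (1 − λ_i) φ_ij, which follows from A_k = Φ_k Λ_k (2I − Φ_k) only when the input-block eigenvalues are fixed to 1. The intermediate size of 100 is one of the values listed among the evaluated architectures.

```python
import numpy as np

def effective_weights(n_in, n_out, rng):
    """Direct-space weights W_ij = (1 - lam_i) * phi_ij of one linear spectral layer.

    Assumes unit eigenvalues on the input block (hypothetical helper).
    """
    phi_block = rng.uniform(-0.5, 0.5, size=(n_out, n_in))
    lam_out = rng.uniform(-0.01, 0.01, size=n_out)
    return (1.0 - lam_out[:, None]) * phi_block

rng = np.random.default_rng(1)
w1 = effective_weights(784, 100, rng)   # input -> intermediate linear layer
w2 = effective_weights(100, 10, rng)    # intermediate -> output

# Retraction: the two linear maps collapse into a single 10 x 784 matrix.
w_compact = w2 @ w1

x = rng.uniform(0, 1, size=784)
same_output = np.allclose(w2 @ (w1 @ x), w_compact @ x)
```

During training the intermediate layer contributes 100 extra eigenvalues to optimize, while the deployed model only needs `w_compact`. This equivalence holds for purely linear segments; once a non-linearity sits between two layers they can no longer be merged this way.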
The paper introduces a spectral-domain training framework for neural networks that optimizes eigenvalues and, optionally, eigenvectors of layer transfer operators with nested, non-orthogonal bases engineered to enable feedforward propagation. Across MNIST experiments in linear, deep linear, and deep non-linear architectures, spectral training of eigenvalues alone consistently outperforms direct-space training constrained to the same number of parameters, and training both eigenvalues and eigenvectors yields accuracy on par with fully trained direct-space models. Spectral training produces weight distributions similar to fully trained direct models, suggesting its use as a computationally efficient pretraining strategy. Telescopic linear layers enrich the parameter space during learning and can be retracted to produce compact operational networks. Future directions include: characterizing spectral solutions and optimization landscapes; combining spectral training with other efficiency techniques (e.g., low-rank factorization, selective parameter training); exploring alternative non-orthogonal bases to realize different topologies (e.g., reservoir computing); applying to more challenging datasets and tasks; and investigating improved initialization strategies for eigenvalues/eigenvectors.
Limitations:
- The primary evaluation is on MNIST (and Fashion-MNIST in supplementary material), which are relatively simple benchmarks; broader validation on diverse, complex datasets is needed to assess generality.
- Hyperparameters (learning rates, epochs, batch sizes) were chosen heuristically for stability/efficiency; no systematic optimization was performed.
- Implementation uses masking of full matrices; while effective for the problem size, more efficient implementations may be needed for large-scale applications.
- The approach currently relies on a designed nested indentation of eigenvectors to recover feedforward architectures; the impact of alternative basis designs and topologies remains to be systematically explored.
- Initialization ranges for eigenvalues/eigenvectors were not optimized; sensitivity to initialization is an open question.
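The masking approach mentioned in the limitations (full matrices with only selected entries trainable) can be illustrated in a few lines. This is a hedged NumPy sketch, not the authors' Keras layer: `masked_step` is a hypothetical helper, a plain gradient step stands in for AdaMax, and the gradient is random noise used only to show which entries move.

```python
import numpy as np

n_in, n_out = 784, 10
n = n_in + n_out
rng = np.random.default_rng(2)

# Boolean masks over full n x n matrices.
lam_mask = np.eye(n, dtype=bool)              # only the diagonal entries of Lambda
phi_mask = np.zeros((n, n), dtype=bool)
phi_mask[n_in:, :n_in] = True                 # only the sub-diagonal block of Phi

# Phi starts as the identity with a random sub-diagonal block.
phi = np.eye(n)
phi[phi_mask] = rng.uniform(-0.5, 0.5, size=int(phi_mask.sum()))

def masked_step(param, grad, mask, lr):
    """Gradient step that moves only the entries selected by mask."""
    return param - lr * grad * mask

grad_phi = rng.normal(size=(n, n))            # stand-in for a real gradient
phi_new = masked_step(phi, grad_phi, phi_mask, lr=0.03)

# Trainable entries: 794 eigenvalues plus 784 * 10 indentation entries.
n_trainable = int(lam_mask.sum() + phi_mask.sum())
# Identity diagonal and zero entries of Phi are untouched by the update.
structure_kept = np.array_equal(phi_new[~phi_mask], phi[~phi_mask])
```

The sketch also makes the cost visible: each layer stores and multiplies full n×n matrices although only a small fraction of entries is trainable, which is why a sparser representation may matter at larger scale.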