
End-to-end programmable computing systems
Y. Xiao, G. Ma, et al.
PGL, developed by Yao Xiao and colleagues, is a framework for managing algorithm complexity in autonomous computing systems. Using program representation learning over dynamic execution graphs, it achieves substantial speedups by automatically assigning each code segment to the hardware device best suited to execute it.
Introduction
The study addresses the challenge of efficiently mapping increasingly complex software onto heterogeneous computing systems (e.g., CPUs and GPUs) to achieve performance and energy gains. Traditional compiler heuristics and monolithic programming models struggle with modern hardware heterogeneity, dynamic workloads, and the need for load balancing and cache locality. While prior ML-based compiler approaches improve cost models, they often target single device types or static representations and require repeated searches per program. The research question is how to learn an accurate, automated mapping of software kernels to the most suitable device within a heterogeneous system for a single application, accounting for dynamic control, data, and memory dependencies. The purpose is to build an end-to-end framework that (1) represents programs as dynamic execution graphs, (2) extracts structural features capturing higher-order dependencies, and (3) learns to partition and map code segments to devices to maximize performance and minimize communication overhead. The significance lies in enabling autonomous parallelization and device mapping, reducing programmer burden, and unlocking heterogeneous system performance.
Literature Review
The paper reviews compiler optimization methods and the limitations of heuristic cost models amid growing hardware and software complexity. It summarizes ML and deep learning applications in compilation, including DRL-based graph optimizations, automatic vectorization, and tensor/DNN graph optimizations (e.g., TASO, SOAP). It highlights graph-based program representations (ASTs, CDFGs, ProGraML, XFG) that better capture program structure than token sequences (e.g., code2vec, inst2vec), and prior device mapping efforts. The authors note gaps: static representations miss memory dependencies and loop iteration counts; existing ML methods may not directly address per-application heterogeneous device mapping with dynamic behaviors. Scheduling literature on synchronous and dynamic dataflow, heterogeneous multiprocessor scheduling, and hybrid mapping for real-time systems is discussed. The need for dynamic, fine-grained graph representations and GNN-based learning tailored to compilation tasks motivates the proposed approach.
Methodology
Framework: PGL is an end-to-end programmable graph learning system with two main components: (1) a graph autoencoder (GAE) that partitions dynamic execution graphs into kernels in an unsupervised manner; (2) graph neural networks (GNNs) that perform supervised per-kernel device mapping (CPU vs. GPU).
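To make the two-stage flow concrete, here is a minimal runnable sketch in Python. The partitioner and classifier below are trivial stand-ins (a topological split and a size heuristic), not the actual GAE or GNN, which are detailed in the following paragraphs.

```python
# Minimal sketch of PGL's two-stage flow, with trivial stand-ins for each
# component (the real GAE partitioner and GNN classifier are described below).
import networkx as nx

def gae_partition(G, k=2):
    """Stand-in for GAE + spectral clustering: split nodes into k kernels."""
    nodes = list(nx.topological_sort(G))
    size = max(1, len(nodes) // k)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def gnn_map(kernel):
    """Stand-in for the GNN classifier: a trivial size heuristic."""
    return "GPU" if len(kernel) > 2 else "CPU"

# Toy dynamic execution graph (a DAG of instruction IDs).
G = nx.DiGraph([(0, 1), (1, 2), (2, 3), (1, 3), (3, 4), (4, 5)])
for kernel in gae_partition(G, k=2):
    print(kernel, "->", gnn_map(kernel))
```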
Dynamic execution graph: Programs are represented as weighted directed acyclic graphs G=(V,E,W) at LLVM IR granularity. Nodes are LLVM IR instructions with type attributes; edges encode control, data, and memory dependencies with attributes; edge weights represent data communication and execution time, enabling quantification of communication overhead across cache/memory hierarchies. Graphs are constructed from partially executed traces by detecting: (i) data dependencies (register defs/uses), (ii) control dependencies (branches, calls), and (iii) memory dependencies via LLVM alias analysis (MustAlias, MayAlias treated as dependencies). This finer-grained representation captures dynamic memory dependencies and loop iteration counts better than static AST/XFG/CDFG and remains coarse enough to be tractable.
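As an illustration of this representation (not the authors' tooling), the following sketch builds a tiny dynamic execution graph with networkx; the instructions, dependency kinds, and byte weights are invented for the example.

```python
# Illustrative sketch: a small dynamic execution graph as a weighted DAG,
# one node per executed LLVM IR instruction. Edge kinds mirror the paper's
# dependency types: data, control, and memory.
import networkx as nx

G = nx.DiGraph()
trace = [
    (0, "load"),   # %1 = load i32, i32* %p
    (1, "add"),    # %2 = add i32 %1, 1
    (2, "store"),  # store i32 %2, i32* %q
    (3, "br"),     # br label %loop
]
for idx, opcode in trace:
    G.add_node(idx, opcode=opcode)

# Edges carry the dependency type plus a weight for communication cost
# (here, bytes moved between the two instructions).
G.add_edge(0, 1, dep="data", weight=4)     # %1 feeds the add
G.add_edge(1, 2, dep="data", weight=4)     # %2 feeds the store
G.add_edge(2, 3, dep="control", weight=0)  # branch follows the store
G.add_edge(0, 2, dep="memory", weight=4)   # MayAlias(%p, %q) kept as a dependency

assert nx.is_directed_acyclic_graph(G)
print(G.edges(data=True))
```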
Feature extraction: Node features are derived using a novel algorithm combining random walks and multifractal analysis to capture local and scale-dependent topology. Random walks explore local density; for walk endpoints, subgraphs are extracted and analyzed via box-covering to compute multifractal measures: mass exponent τ(q), generalized fractal dimensions D(q)=τ(q)/(q−1), multifractal spectrum f(α), spectrum width and height, and dominant α0. Complexity is bounded by O(EV log V) for all-pairs shortest paths. These features encode higher-order structural motifs recurrent in code (loops mesh-like, recursion tree-like, array accesses star-like) and feed both partitioning and classification models.
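A simplified, hedged sketch of box-covering multifractal analysis is shown below: BFS balls stand in for boxes, and D(q) is recovered from the log-log slope of the partition sum. The toy graph and radii are assumptions, and the paper's actual algorithm also incorporates random walks and the full f(α) spectrum.

```python
# Hedged sketch of box-covering multifractal analysis on a graph. Boxes are
# BFS balls of radius l; the box "mass" mu_i is its node fraction. tau(q) is
# the log-log slope of the partition sum Z(q, l) = sum_i mu_i**q, and
# D(q) = tau(q) / (q - 1).
import networkx as nx
import numpy as np

def greedy_box_cover(G, radius):
    """Cover G with BFS balls of the given radius; return node counts per box."""
    uncovered = set(G.nodes)
    masses = []
    while uncovered:
        center = next(iter(uncovered))
        ball = set(nx.single_source_shortest_path_length(G, center, cutoff=radius))
        box = ball & uncovered          # center is always in its own box
        masses.append(len(box))
        uncovered -= box
    return np.array(masses, dtype=float)

def generalized_dimension(G, q, radii=(1, 2, 3)):
    n = G.number_of_nodes()
    log_l, log_Z = [], []
    for l in radii:
        mu = greedy_box_cover(G, l) / n
        log_l.append(np.log(l))
        log_Z.append(np.log(np.sum(mu ** q)))
    tau = np.polyfit(log_l, log_Z, 1)[0]   # slope = mass exponent tau(q)
    return tau / (q - 1)                   # D(q) = tau(q)/(q-1), for q != 1

G = nx.barabasi_albert_graph(200, 2, seed=0)  # toy stand-in for a code subgraph
print("D(2) ~", round(generalized_dimension(G, q=2), 3))
```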
Partitioning: A GAE learns node embeddings preserving intrinsic structural information. Spectral clustering on the learned embeddings partitions the dynamic execution graph into kernels while an optimization-based refinement minimizes inter-cluster communication subject to acyclicity/topological execution constraints, reducing inter-kernel data movement and preventing deadlocks/livelocks. Topological sorting ensures correct execution order among clusters.
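The following sketch illustrates the partition-then-order step under simplifying assumptions: random vectors stand in for GAE embeddings, k-means stands in for spectral clustering in embedding space, and acyclicity of the condensed cluster graph is only checked, not repaired as PGL's refinement step does.

```python
# Hedged sketch: cluster node embeddings into kernels, then derive a
# deadlock-free execution order by topologically sorting the condensed
# cluster graph (embeddings here are random stand-ins for GAE outputs).
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy instruction DAG: edges only go from lower to higher node IDs.
n = 30
G = nx.DiGraph()
G.add_nodes_from(range(n))
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < 0.1:
            G.add_edge(i, j)

emb = rng.normal(size=(n, 16))               # stand-in for GAE embeddings
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)

# Condense the instruction DAG into a cluster-level graph.
C = nx.DiGraph()
C.add_nodes_from(range(4))
for u, v in G.edges:
    if labels[u] != labels[v]:
        C.add_edge(labels[u], labels[v])

# Acyclicity at the cluster level is required for a valid execution order.
if nx.is_directed_acyclic_graph(C):
    print("kernel execution order:", list(nx.topological_sort(C)))
else:
    print("cluster graph has cycles; refinement would move nodes to break them")
```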
Device mapping: For each kernel, a GNN predicts CPU vs. GPU placement. The framework supports multiple GNN variants; experiments evaluate GCN, GAT, and GGNN. The default GCN uses two hidden layers with 32 neurons each; GGNN achieves the best accuracy.
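Below is a hedged sketch of such a classifier with the stated two 32-unit hidden layers, written with PyTorch Geometric (an assumption; the paper does not specify a library). The input feature size and the mean-pooling readout are illustrative choices.

```python
# Hedged sketch of the per-kernel CPU/GPU classifier: a two-layer GCN with
# 32 hidden units (the paper's default GCN configuration).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class DeviceMapper(torch.nn.Module):
    def __init__(self, in_dim=16, hidden=32, n_classes=2):  # CPU vs. GPU
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)        # one vector per kernel graph
        return self.head(x)                   # logits: [CPU, GPU]

# Toy forward pass: a 4-node kernel graph with 16-dim node features
# (standing in for the multifractal features described above).
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])  # edges 0->1, 1->2, 2->3
batch = torch.zeros(4, dtype=torch.long)           # all nodes in one graph
logits = DeviceMapper()(x, edge_index, batch)
print("predicted device:", ["CPU", "GPU"][logits.argmax(dim=1).item()])
```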
Setup and datasets: Training uses 256 OpenCL kernels labeled CPU/GPU (five-fold cross-validation). Kernels are also converted to C for graph extraction in different representations. Additional standard benchmarks validate end-to-end performance: Dijkstra (100 nodes), FFT (4096-size), K-means (256 2D tuples), Mandelbrot (4092 points), Molecular dynamics (1024 particles), Neural network (5 FC layers), Neurons (1024 ReLU neurons), and a simple CNN (conv + max-pool + FC).
Hardware configuration for evaluation: Simulated heterogeneous system with 32 CPUs and 32 GPUs connected by a mesh NoC. Each CPU: 4-way 64KB private L1 cache, 256KB shared L2 cache, 4GB memory, 2.4 GHz clock. Each GPU: 768MB memory, 86.4 GB/s memory bandwidth, 575 MHz clock.
Baselines: For representation and mapping accuracy, compare to DeepTune, DeepLLVM, NCC, and ProGraML (with GGNN). For partitioning and end-to-end performance, compare KM+GCN, HDC+GCN, MOD+GCN, METIS+GCN, NN+GCN; and framework-level baselines: PAR (threads), community detection + heuristic mapping (CommDet), NN+RL, and Aladdin.
Evaluation protocol: 5-fold CV; accuracy, precision, recall, F1 reported with mean±std over 100 runs. Convergence assessed by reducing training steps. Communication/computation breakdowns and communication overhead computed in simulation. Universal relationships between multifractal properties and system-level metrics (parallelization degree, communication overhead) are quantified via power-law fits across 132 programs spanning 17 applications.
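To illustrate the power-law fitting, the sketch below fits y = a·x^b by linear regression in log-log space; the generated points and exponents are synthetic stand-ins for the paper's (spectrum width, communication overhead) measurements, not its data.

```python
# Hedged sketch of the power-law fits used in the universality analysis:
# fit y = a * x**b by linear regression in log-log space on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
width = np.linspace(0.5, 3.0, 40)                          # spectrum width
overhead = 2.0 * width ** 1.8 * rng.lognormal(0, 0.1, 40)  # synthetic power law

b, log_a = np.polyfit(np.log(width), np.log(overhead), 1)
a = np.exp(log_a)
print(f"fit: overhead ~ {a:.2f} * width^{b:.2f}")

# Using the fit as the paper suggests: estimate a system-level metric
# directly from a program's graph properties.
print("estimated overhead at width = 2:", a * 2.0 ** b)
```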
Key Findings
- Mapping accuracy: On NVIDIA kernels, PGL-GGNN achieves 91.52% ± 3.14% accuracy (precision/recall/F1 = 0.94), outperforming DeepLLVM (88.64% ± 4.61%) and ProGraML-GGNN (80.36% ± 4.19%). On AMD kernels, PGL-GGNN achieves 93.87% ± 2.27%, surpassing DeepLLVM (90.9% ± 2.14%) and ProGraML-GGNN (86.6% ± 3.28%). GGNN outperforms GCN and GAT by up to 1.04x in accuracy. PGL-GGNN converges to near-optimal accuracy at ~60% of training steps vs ~90% for DeepLLVM.
- End-to-end speedups: PGL delivers up to 6.42x speedup over thread-based parallel programming (PAR) and up to 2.02x over the best state-of-the-art framework; on average, it is 1.89x faster than traditional and ML-based partitioners (Table 2) and 4.73x faster than prior frameworks (Table 3) across 17 applications.
- Communication overhead reduction: Example reductions include Dijkstra from 820.52 × 10^7 cycles (PAR) to 109.38 × 10^7 (PGL); FFT from 63.18 × 10^7 to 31.10 × 10^7. PGL consistently shows the smallest communication overhead due to optimization-based partitioning minimizing inter-cluster communication.
- Partitioning effectiveness: Learning-based partitioners (GAE, NN) yield greater speedups than non-ML methods (KM, HDC, MOD, METIS), with GAE outperforming NN by up to 32%, showing benefit of structure-aware embeddings.
- Universality via multifractals: Across 132 programs in 17 applications, multifractal graph properties (e.g., generalized fractal dimension, spectrum width/height, dominant α0, and combined complexity) exhibit power-law relationships with parallelization degree and communication overhead, indicating universal behavior. Larger spectrum width correlates with higher parallelization and greater communication overhead. These relationships can estimate system-level metrics from graph properties (e.g., a spectrum width ≈ 2 implies a parallelization degree of ~10–24 and a communication overhead of ~8–12 × 10^9 cycles).
- Broad applicability: PGL’s dynamic execution graphs reveal common motifs (mesh for iterations, tree for recursion, star for array indexing) and leverage them for effective partitioning and mapping across domains (e.g., sequence alignment, signal processing, CNNs).
Discussion
The findings demonstrate that representing program execution as dynamic LLVM IR graphs enables learning higher-order structural dependencies crucial for heterogeneous device mapping. PGL’s multifractal feature extraction captures self-similarity and topological heterogeneity, which, coupled with GAE-based partitioning and GNN-based classification, effectively reduces communication overhead while exploiting parallelism. This directly addresses the research goal of accurate per-kernel device mapping in heterogeneous systems, overcoming limitations of static representations that miss memory dependencies and workload characteristics like loop iteration counts. The observed universal power-law relations between graph multifractal properties and system-level metrics offer interpretable guidance for optimizing parallelization and anticipating communication costs, providing a principled path for automatic mapping decisions. Compared to prior token-based or static graph-based approaches, PGL’s dynamic, fine-grained graphs and learning-based partitioning yield superior accuracy, faster convergence, and significant end-to-end speedups across diverse applications, highlighting its relevance for future AI and HPC workloads.
Conclusion
The paper introduces PGL, an end-to-end framework that: (1) constructs dynamic execution graphs at LLVM IR granularity; (2) extracts multifractal, random-walk-based features capturing local and scale-dependent topology; (3) partitions graphs into kernels using GAE with optimization-based refinement; and (4) maps kernels to CPU or GPU using GNNs. Experiments show state-of-the-art mapping accuracy, reduced communication overhead, and up to 6.42x speedups over thread baselines and 2.02x over leading frameworks. The discovered universal power-law relationships between multifractal properties and system metrics can inform designers about parallelization degrees and anticipated overheads. Future work includes extending PGL to additional accelerators (e.g., FPGAs), improving memory alias analysis to reduce false dependencies, broadening language/runtime support beyond C/C++, and integrating on-the-fly profiling and graph construction for lower overhead.
Limitations
- Overhead on small kernels: For short code segments with few instructions, the profiling and partitioning overhead may outweigh benefits, making PGL less effective.
- Random memory access patterns: Programs with significant pointer manipulation can trigger many MayAlias dependencies in LLVM alias analysis, leading to false memory dependencies across clusters and increased communication overhead.
- Runtime requirements: Building dynamic execution graphs requires successful compilation and execution to collect traces, which can be time and space consuming.
- Language support: Current runtime profiling primarily supports C/C++; expanding to other languages would improve applicability.