
End-to-end programmable computing systems
Y. Xiao, G. Ma, et al.
PGL, developed by Yao Xiao and colleagues, is a framework for managing algorithm complexity in autonomous computing systems. Using program representation learning over dynamic execution graphs, it achieves substantial speedups by automatically assigning each code segment to the hardware device best suited to execute it.
Introduction
The study addresses the challenge of efficiently mapping increasingly complex software onto heterogeneous computing systems (e.g., CPUs and GPUs) to achieve performance and energy gains. Traditional compiler heuristics and monolithic programming models struggle with modern hardware heterogeneity, dynamic workloads, and the need for load balancing and cache locality. While prior ML-based compiler approaches improve cost models, they often target single device types or static representations and require repeated searches per program. The research question is how to learn an accurate, automated mapping of software kernels to the most suitable device within a heterogeneous system for a single application, accounting for dynamic control, data, and memory dependencies. The purpose is to build an end-to-end framework that (1) represents programs as dynamic execution graphs, (2) extracts structural features capturing higher-order dependencies, and (3) learns to partition and map code segments to devices to maximize performance and minimize communication overhead. The significance lies in enabling autonomous parallelization and device mapping, reducing programmer burden, and unlocking heterogeneous system performance.
Literature Review
The paper reviews compiler optimization methods and the limitations of heuristic cost models amid growing hardware and software complexity. It summarizes ML and deep learning applications in compilation, including DRL-based graph optimizations, automatic vectorization, and tensor/DNN graph optimizations (e.g., TASO, SOAP). It highlights graph-based program representations (ASTs, CDFGs, ProGraML, XFG) that better capture program structure than token sequences (e.g., code2vec, inst2vec), and prior device mapping efforts. The authors note gaps: static representations miss memory dependencies and loop iteration counts; existing ML methods may not directly address per-application heterogeneous device mapping with dynamic behaviors. Scheduling literature on synchronous and dynamic dataflow, heterogeneous multiprocessor scheduling, and hybrid mapping for real-time systems is discussed. The need for dynamic, fine-grained graph representations and GNN-based learning tailored to compilation tasks motivates the proposed approach.
Methodology
Framework: PGL is an end-to-end programmable graph learning system with two main components: (1) a graph autoencoder (GAE) that partitions dynamic execution graphs into kernels in an unsupervised manner; (2) graph neural networks (GNNs) that perform supervised per-kernel device mapping (CPU vs. GPU).
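To make the two-stage flow concrete, here is a minimal runnable sketch in Python. The partitioner and classifier below are trivial stand-ins (a topological split and a size heuristic), not the actual GAE or GNN, which are detailed in the following paragraphs.

```python
# Minimal sketch of PGL's two-stage flow, with trivial stand-ins for each
# component (the real GAE partitioner and GNN classifier are described below).
import networkx as nx

def gae_partition(G, k=2):
    """Stand-in for GAE + spectral clustering: split nodes into k kernels."""
    nodes = list(nx.topological_sort(G))
    size = max(1, len(nodes) // k)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def gnn_map(kernel):
    """Stand-in for the GNN classifier: a trivial size heuristic."""
    return "GPU" if len(kernel) > 2 else "CPU"

# Toy dynamic execution graph (a DAG of instruction IDs).
G = nx.DiGraph([(0, 1), (1, 2), (2, 3), (1, 3), (3, 4), (4, 5)])
for kernel in gae_partition(G, k=2):
    print(kernel, "->", gnn_map(kernel))
```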
Dynamic execution graph: Programs are represented as weighted directed acyclic graphs G=(V,E,W) at LLVM IR granularity. Nodes are LLVM IR instructions with type attributes; edges encode control, data, and memory dependencies with attributes; edge weights represent data communication and execution time, enabling quantification of communication overhead across cache/memory hierarchies. Graphs are constructed from partially executed traces by detecting: (i) data dependencies (register defs/uses), (ii) control dependencies (branches, calls), and (iii) memory dependencies via LLVM alias analysis (MustAlias, MayAlias treated as dependencies). This finer-grained representation captures dynamic memory dependencies and loop iteration counts better than static AST/XFG/CDFG and remains coarse enough to be tractable.
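As an illustration of this representation (not the authors' tooling), the following sketch builds a tiny dynamic execution graph with networkx; the instructions, dependency kinds, and byte weights are invented for the example.

```python
# Illustrative sketch: a small dynamic execution graph as a weighted DAG,
# one node per executed LLVM IR instruction. Edge kinds mirror the paper's
# dependency types: data, control, and memory.
import networkx as nx

G = nx.DiGraph()
trace = [
    (0, "load"),   # %1 = load i32, i32* %p
    (1, "add"),    # %2 = add i32 %1, 1
    (2, "store"),  # store i32 %2, i32* %q
    (3, "br"),     # br label %loop
]
for idx, opcode in trace:
    G.add_node(idx, opcode=opcode)

# Edges carry the dependency type plus a weight for communication cost
# (here, bytes moved between the two instructions).
G.add_edge(0, 1, dep="data", weight=4)     # %1 feeds the add
G.add_edge(1, 2, dep="data", weight=4)     # %2 feeds the store
G.add_edge(2, 3, dep="control", weight=0)  # branch follows the store
G.add_edge(0, 2, dep="memory", weight=4)   # MayAlias(%p, %q) kept as a dependency

assert nx.is_directed_acyclic_graph(G)
print(G.edges(data=True))
```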
Feature extraction: Node features are derived using a novel algorithm combining random walks and multifractal analysis to capture local and scale-dependent topology. Random walks explore local density; for walk endpoints, subgraphs are extracted and analyzed via box-covering to compute multifractal measures: mass exponent τ(q), generalized fractal dimensions D(q)=τ(q)/(q−1), multifractal spectrum f(α), spectrum width and height, and dominant α0. Complexity is bounded by O(EV log V) for all-pairs shortest paths. These features encode higher-order structural motifs recurrent in code (loops mesh-like, recursion tree-like, array accesses star-like) and feed both partitioning and classification models.
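A simplified, hedged sketch of box-covering multifractal analysis is shown below: BFS balls stand in for boxes, and D(q) is recovered from the log-log slope of the partition sum. The toy graph and radii are assumptions, and the paper's actual algorithm also incorporates random walks and the full f(α) spectrum.

```python
# Hedged sketch of box-covering multifractal analysis on a graph. Boxes are
# BFS balls of radius l; the box "mass" mu_i is its node fraction. tau(q) is
# the log-log slope of the partition sum Z(q, l) = sum_i mu_i**q, and
# D(q) = tau(q) / (q - 1).
import networkx as nx
import numpy as np

def greedy_box_cover(G, radius):
    """Cover G with BFS balls of the given radius; return node counts per box."""
    uncovered = set(G.nodes)
    masses = []
    while uncovered:
        center = next(iter(uncovered))
        ball = set(nx.single_source_shortest_path_length(G, center, cutoff=radius))
        box = ball & uncovered          # center is always in its own box
        masses.append(len(box))
        uncovered -= box
    return np.array(masses, dtype=float)

def generalized_dimension(G, q, radii=(1, 2, 3)):
    n = G.number_of_nodes()
    log_l, log_Z = [], []
    for l in radii:
        mu = greedy_box_cover(G, l) / n
        log_l.append(np.log(l))
        log_Z.append(np.log(np.sum(mu ** q)))
    tau = np.polyfit(log_l, log_Z, 1)[0]   # slope = mass exponent tau(q)
    return tau / (q - 1)                   # D(q) = tau(q)/(q-1), for q != 1

G = nx.barabasi_albert_graph(200, 2, seed=0)  # toy stand-in for a code subgraph
print("D(2) ~", round(generalized_dimension(G, q=2), 3))
```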
Partitioning: A GAE learns node embeddings preserving intrinsic structural information. Spectral clustering on the learned embeddings partitions the dynamic execution graph into kernels while an optimization-based refinement minimizes inter-cluster communication subject to acyclicity/topological execution constraints, reducing inter-kernel data movement and preventing deadlocks/livelocks. Topological sorting ensures correct execution order among clusters.
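The following sketch illustrates the partition-then-order step under simplifying assumptions: random vectors stand in for GAE embeddings, k-means stands in for spectral clustering in embedding space, and acyclicity of the condensed cluster graph is only checked, not repaired as PGL's refinement step does.

```python
# Hedged sketch: cluster node embeddings into kernels, then derive a
# deadlock-free execution order by topologically sorting the condensed
# cluster graph (embeddings here are random stand-ins for GAE outputs).
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy instruction DAG: edges only go from lower to higher node IDs.
n = 30
G = nx.DiGraph()
G.add_nodes_from(range(n))
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < 0.1:
            G.add_edge(i, j)

emb = rng.normal(size=(n, 16))               # stand-in for GAE embeddings
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)

# Condense the instruction DAG into a cluster-level graph.
C = nx.DiGraph()
C.add_nodes_from(range(4))
for u, v in G.edges:
    if labels[u] != labels[v]:
        C.add_edge(labels[u], labels[v])

# Acyclicity at the cluster level is required for a valid execution order.
if nx.is_directed_acyclic_graph(C):
    print("kernel execution order:", list(nx.topological_sort(C)))
else:
    print("cluster graph has cycles; refinement would move nodes to break them")
```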
Device mapping: For each kernel, a GNN predicts CPU vs. GPU placement. The framework supports multiple GNN variants; experiments evaluate GCN, GAT, and GGNN. The default GCN uses two hidden layers with 32 neurons each; GGNN achieves the best accuracy.
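Below is a hedged sketch of such a classifier with the stated two 32-unit hidden layers, written with PyTorch Geometric (an assumption; the paper does not specify a library). The input feature size and the mean-pooling readout are illustrative choices.

```python
# Hedged sketch of the per-kernel CPU/GPU classifier: a two-layer GCN with
# 32 hidden units (the paper's default GCN configuration).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class DeviceMapper(torch.nn.Module):
    def __init__(self, in_dim=16, hidden=32, n_classes=2):  # CPU vs. GPU
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)        # one vector per kernel graph
        return self.head(x)                   # logits: [CPU, GPU]

# Toy forward pass: a 4-node kernel graph with 16-dim node features
# (standing in for the multifractal features described above).
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])  # edges 0->1, 1->2, 2->3
batch = torch.zeros(4, dtype=torch.long)           # all nodes in one graph
logits = DeviceMapper()(x, edge_index, batch)
print("predicted device:", ["CPU", "GPU"][logits.argmax(dim=1).item()])
```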
Setup and datasets: Training uses 256 OpenCL kernels labeled CPU/GPU (five-fold cross-validation). Kernels are also converted to C for graph extraction in different representations. Additional standard benchmarks validate end-to-end performance: Dijkstra (100 nodes), FFT (4096-size), K-means (256 2D tuples), Mandelbrot (4092 points), Molecular dynamics (1024 particles), Neural network (5 FC layers), Neurons (1024 ReLU neurons), and a simple CNN (conv + max-pool + FC).
Hardware configuration for evaluation: Simulated heterogeneous system with 32 CPUs and 32 GPUs connected by a mesh NoC. Each CPU: 4-way 64KB private L1 cache, 256KB shared L2 cache, 4GB memory, 2.4 GHz clock. Each GPU: 768MB memory, 86.4 GB/s memory bandwidth, 575 MHz clock.
Baselines: For representation and mapping accuracy, compare to DeepTune, DeepLLVM, NCC, and ProGraML (with GGNN). For partitioning and end-to-end performance, compare KM+GCN, HDC+GCN, MOD+GCN, METIS+GCN, NN+GCN; and framework-level baselines: PAR (threads), community detection + heuristic mapping (CommDet), NN+RL, and Aladdin.
Evaluation protocol: 5-fold CV; accuracy, precision, recall, F1 reported with mean±std over 100 runs. Convergence assessed by reducing training steps. Communication/computation breakdowns and communication overhead computed in simulation. Universal relationships between multifractal properties and system-level metrics (parallelization degree, communication overhead) are quantified via power-law fits across 132 programs spanning 17 applications.
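To illustrate the power-law fitting, the sketch below fits y = a·x^b by linear regression in log-log space; the generated points and exponents are synthetic stand-ins for the paper's (spectrum width, communication overhead) measurements, not its data.

```python
# Hedged sketch of the power-law fits used in the universality analysis:
# fit y = a * x**b by linear regression in log-log space on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
width = np.linspace(0.5, 3.0, 40)                          # spectrum width
overhead = 2.0 * width ** 1.8 * rng.lognormal(0, 0.1, 40)  # synthetic power law

b, log_a = np.polyfit(np.log(width), np.log(overhead), 1)
a = np.exp(log_a)
print(f"fit: overhead ~ {a:.2f} * width^{b:.2f}")

# Using the fit as the paper suggests: estimate a system-level metric
# directly from a program's graph properties.
print("estimated overhead at width = 2:", a * 2.0 ** b)
```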
Key Findings
- Mapping accuracy: On NVIDIA kernels, PGL-GGNN achieves 91.52% ± 3.14% accuracy (precision/recall/F1 = 0.94), outperforming DeepLLVM (88.64% ± 4.61%) and ProGraML-GGNN (80.36% ± 4.19%). On AMD kernels, PGL-GGNN achieves 93.87% ± 2.27%, surpassing DeepLLVM (90.9% ± 2.14%) and ProGraML-GGNN (86.6% ± 3.28%). GGNN outperforms GCN and GAT by up to 1.04x in accuracy. PGL-GGNN converges to near-optimal accuracy at ~60% of training steps vs ~90% for DeepLLVM.
- End-to-end speedups: PGL delivers up to 6.42x speedup over thread-based parallel programming (PAR) and up to 2.02x over the best state-of-the-art framework; on average, it is 1.89x faster than traditional and ML-based partitioners (Table 2) and 4.73x faster than prior frameworks (Table 3) across 17 applications.
- Communication overhead reduction: Example reductions include Dijkstra from 820.52 × 10^7 cycles (PAR) to 109.38 × 10^7 (PGL); FFT from 63.18 × 10^7 to 31.10 × 10^7. PGL consistently shows the smallest communication overhead due to optimization-based partitioning minimizing inter-cluster communication.
- Partitioning effectiveness: Learning-based partitioners (GAE, NN) yield greater speedups than non-ML methods (KM, HDC, MOD, METIS), with GAE outperforming NN by up to 32%, showing benefit of structure-aware embeddings.
- Universality via multifractals: Across 132 programs in 17 applications, multifractal graph properties (e.g., generalized fractal dimension, spectrum width/height, dominant α0, and combined complexity) exhibit power-law relationships with parallelization degree and communication overhead, indicating universal behavior. Larger spectrum width correlates with higher parallelization and greater communication overhead. These relationships can estimate system-level metrics from graph properties (e.g., a spectrum width ≈ 2 implies a parallelization degree of ~10–24 and a communication overhead of ~8–12 × 10^9 cycles).
- Broad applicability: PGL’s dynamic execution graphs reveal common motifs (mesh for iterations, tree for recursion, star for array indexing) and leverage them for effective partitioning and mapping across domains (e.g., sequence alignment, signal processing, CNNs).
Discussion
The findings demonstrate that representing program execution as dynamic LLVM IR graphs enables learning higher-order structural dependencies crucial for heterogeneous device mapping. PGL’s multifractal feature extraction captures self-similarity and topological heterogeneity, which, coupled with GAE-based partitioning and GNN-based classification, effectively reduces communication overhead while exploiting parallelism. This directly addresses the research goal of accurate per-kernel device mapping in heterogeneous systems, overcoming limitations of static representations that miss memory dependencies and workload characteristics like loop iteration counts. The observed universal power-law relations between graph multifractal properties and system-level metrics offer interpretable guidance for optimizing parallelization and anticipating communication costs, providing a principled path for automatic mapping decisions. Compared to prior token-based or static graph-based approaches, PGL’s dynamic, fine-grained graphs and learning-based partitioning yield superior accuracy, faster convergence, and significant end-to-end speedups across diverse applications, highlighting its relevance for future AI and HPC workloads.
Conclusion
The paper introduces PGL, an end-to-end framework that: (1) constructs dynamic execution graphs at LLVM IR granularity; (2) extracts multifractal, random-walk-based features capturing local and scale-dependent topology; (3) partitions graphs into kernels using GAE with optimization-based refinement; and (4) maps kernels to CPU or GPU using GNNs. Experiments show state-of-the-art mapping accuracy, reduced communication overhead, and up to 6.42x speedups over thread baselines and 2.02x over leading frameworks. The discovered universal power-law relationships between multifractal properties and system metrics can inform designers about parallelization degrees and anticipated overheads. Future work includes extending PGL to additional accelerators (e.g., FPGAs), improving memory alias analysis to reduce false dependencies, broadening language/runtime support beyond C/C++, and integrating on-the-fly profiling and graph construction for lower overhead.
Limitations
- Overhead on small kernels: For short code segments with few instructions, the profiling and partitioning overhead may outweigh benefits, making PGL less effective.
- Random memory access patterns: Programs with significant pointer manipulation can trigger many MayAlias dependencies in LLVM alias analysis, leading to false memory dependencies across clusters and increased communication overhead.
- Runtime requirements: Building dynamic execution graphs requires successful compilation and execution to collect traces, which can be time and space consuming.
- Language support: Current runtime profiling primarily supports C/C++; expanding to other languages would improve applicability.