Enhancing Indoor Temperature Forecasting through Synthetic Data in Low-Data Environments

Z. Thiry, M. Ruocco, et al.

In this study, Zachari Thiry, Massimiliano Ruocco, Alessandro Nocente, and Michail Spitieris explore whether synthetic data generated by models such as GANs and VAEs can improve indoor temperature forecasting. The research reports that augmenting real data with synthetic samples can significantly boost forecasting accuracy and reduce training variance, addressing a critical challenge in HVAC system control.

Introduction
Accurate indoor temperature forecasting is essential for optimizing HVAC systems and reducing energy consumption in buildings, which account for 40% of energy consumption and 36% of CO2 emissions in the EU. Machine learning models, particularly Recurrent Neural Networks (RNNs) such as LSTMs, outperform traditional physics-based methods. However, these models require substantial, diverse training data, including scenarios rarely captured in standard building operation, such as extreme temperature changes. Acquiring such data is expensive and energy-intensive. This research addresses the low-data challenge by exploring synthetic data augmentation, a rapidly growing field that uses techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). The goal is to improve forecasting accuracy and robustness in low-data environments by combining real and synthetic data, thereby reducing reliance on extensive and costly data acquisition.
Literature Review
Existing literature on time series data augmentation groups methods into traditional techniques (such as simple scaling and rotation), GAN-based methods, and autoencoder-based methods. Traditional techniques are simple but often disrupt temporal relationships. Generative models such as GANs are preferred for preserving temporal dynamics but face convergence issues, especially with limited data. The paper reviews three state-of-the-art models for time series generation, TimeGAN, DoppelGANger, and TimeVQVAE, and highlights their strengths and weaknesses. TimeGAN combines an adversarial loss with a supervised loss and applies dimension reduction. DoppelGANger addresses fidelity and mode collapse issues using an auxiliary discriminator and batched generation. TimeVQVAE uses vector quantization and a modified MaskGIT prior learning process for improved sample quality and diversity.
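To make the vector quantization step concrete, the following is a minimal sketch of a nearest-codebook lookup, the core operation in VQ-based models. It is not the TimeVQVAE implementation: the function name `quantize`, the toy codebook, and the use of squared Euclidean distance on plain Python lists are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each latent vector with its nearest codebook vector.

    Returns the chosen codebook indices (the discrete tokens) and the
    corresponding quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook entry.
        dists = [
            sum((zi - ci) ** 2 for zi, ci in zip(z, code))
            for code in codebook
        ]
        # Index of the closest entry.
        k = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized


# Toy example (hypothetical values): three 2-D codes, three latent vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
indices, quantized = quantize(latents, codebook)
print(indices)  # [0, 1, 2]
```

A prior model of the MaskGIT kind would then be trained over the resulting index sequences rather than over the continuous latents.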
Methodology
The study uses data from a dedicated test facility (Test-cell) collected in four phases (RICO1-RICO4) over several months. The dataset is a tabular time series (N=59,040, D=81) with a 1-minute sampling rate. Preprocessing involves standard scaling, reshaping, and manual exclusion of inconsistent or incomplete series. Each series is labeled as one of three classes based on its temperature trend (monotonic positive, monotonic negative, or non-monotonic) using a moving average and a derivative calculation. Three synthesizers (TimeGAN, DoppelGANger, TimeVQVAE) are trained on the preprocessed data and evaluated using t-SNE, PCA, and visual inspection. The utility of the synthesized data is assessed in two forecasting experiments using a one-layer LSTM model: (1) general data augmentation, which compares a baseline trained on real data only against training on synthetic data only and on a combination of real and synthetic data; and (2) class imbalance handling, in which one class is undersampled and the synthesizer generates samples to restore balance. Forecasting performance is evaluated using MSE, MAE, MAPE, and MASE.
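The trend labeling described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the window size, the tolerance `tol`, the label strings, and the choice to treat a flat series as monotonic positive are all hypothetical.

```python
from typing import List, Sequence


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Simple moving average over full windows only (no padding)."""
    if window < 1 or window >= len(series):
        raise ValueError("window must satisfy 1 <= window < len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def label_trend(series: Sequence[float], window: int = 3, tol: float = 0.0) -> str:
    """Label a series by the sign of the first difference of its moving average."""
    smooth = moving_average(series, window)
    # Discrete derivative of the smoothed series.
    deriv = [b - a for a, b in zip(smooth, smooth[1:])]
    if all(d >= -tol for d in deriv):
        return "monotonic_positive"
    if all(d <= tol for d in deriv):
        return "monotonic_negative"
    return "non_monotonic"


# Hypothetical temperature readings in degrees Celsius.
rising = [20.0, 20.3, 20.2, 20.6, 20.9, 21.1]   # small dip is smoothed out
falling = [22.0, 21.8, 21.9, 21.5, 21.2, 21.0]
mixed = [20.0, 21.0, 22.0, 21.0, 20.0, 19.0]

print(label_trend(rising))   # monotonic_positive
print(label_trend(falling))  # monotonic_negative
print(label_trend(mixed))    # non_monotonic
```

Smoothing before differentiating keeps sensor noise from pushing an otherwise rising or falling series into the non-monotonic class.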
Key Findings
Visual inspection and PCA analysis showed that TimeGAN failed to converge, while DoppelGANger generated samples with high-frequency noise. TimeVQVAE produced the most diverse and realistic samples. In the general data augmentation experiment, training on synthetic data and testing on real data (TSTR) significantly improved forecasting accuracy compared to the baseline of training and testing on real data (TRTR). Training on the combination of real and synthetic data (TRSTR) further enhanced accuracy, though training variance increased. The class imbalance experiment showed no significant change in performance when synthetic data was used to balance class distributions, but variance fluctuated more. These fluctuations depend on the chosen metric and ablation ratio, ranging from a minimal decrease to a significant increase (up to 55%).
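For reference, the four error metrics named in the Methodology can be computed as sketched below. The function names are hypothetical, and the MASE scaling shown here (the in-sample error of a one-step naive forecast) is a common convention assumed for illustration; the summary does not state which scaling the authors used.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Undefined if any true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mase(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    y_train: Sequence[float],
) -> float:
    """MAE scaled by the in-sample MAE of a one-step naive forecast (assumed convention)."""
    naive_errors = [abs(b - a) for a, b in zip(y_train, y_train[1:])]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(y_true, y_pred) / scale


# Hypothetical values in degrees Celsius.
y_train = [18.0, 18.5, 19.5, 20.0, 20.0]
y_true = [20.0, 21.0, 22.0, 23.0]
y_pred = [20.5, 21.0, 21.0, 23.5]

print(mse(y_true, y_pred))            # 0.375
print(mae(y_true, y_pred))            # 0.5
print(mase(y_true, y_pred, y_train))  # 1.0
```

Because MAPE divides by the true value, it behaves differently on scaled data centered near zero than on raw temperatures, which is one reason results can vary by metric.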
Discussion
The findings demonstrate the potential of TimeVQVAE for generating high-quality synthetic time series data for augmenting real data in low-data environments for indoor temperature forecasting. Synthetic data augmentation significantly improves forecast accuracy and provides a more robust model. The increased variance with combined datasets suggests a trade-off between accuracy and stability; further analysis is needed to understand and mitigate this. The lack of improvement in the class imbalance experiment might stem from imbalance in the testing data, necessitating further investigation. Overall, the study's success highlights the practicality of synthetic data in tackling data scarcity issues common in building energy management.
Conclusion
The paper shows that TimeVQVAE outperforms TimeGAN and DoppelGANger at synthesizing univariate time series data. Synthetic data augmentation significantly improves forecasting accuracy, especially in low-data scenarios. While the increased variance requires further investigation, the results support the use of synthetic data in building energy applications. Future research should explore different synthesizers, more complex forecasting models, and the application of these techniques to diverse datasets for wider generalizability.
Limitations
The study focuses on a specific dataset from a controlled environment, so the generalizability of the findings needs validation with datasets from real-world buildings. The simple LSTM model used might limit overall performance. The observed increase in variance with synthetic data needs further analysis to identify the underlying causes and potential mitigation strategies. The class imbalance experiment is limited by potential imbalances within the testing sets, which should be investigated further.