
Time Series Forecasting with XGBoost and LightGBM: Predicting Energy Consumption

George Kamtziridis


In the simplest terms, time series forecasting is the process of predicting future values based on historical data. One of the hottest fields where time series forecasting is currently utilized is the cryptocurrency market, where one wants to predict how the prices of popular cryptocurrencies, like Bitcoin or Ethereum, will fluctuate over the next few days or even longer periods of time. Another real-world case is energy consumption prediction. Especially today, when energy is one of the primary points of discussion, being able to accurately predict energy demand is a crucial tool for any electric power company. In this article, we will take a quick but practical look at how this is done by incorporating ensemble models such as extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM).

Problem

We will focus on the energy consumption problem: given a sufficiently large dataset of the daily energy consumption of different households in a city, we are tasked with predicting the future energy demand as accurately as possible. For the purposes of this tutorial, I’ve chosen the London Energy Dataset, which contains the energy consumption of 5,567 randomly selected households in the city of London, UK, for the period of November 2011 to February 2014. Later on, in an attempt to improve our predictions, we will combine this set with the London Weather Dataset in order to add weather-related features to the process.

This is the first part of this mini series. After reading through this article, make sure to check the next part which improves the results significantly by incorporating lag features.

Preprocessing

The very first thing we have to do in every project is to get a good understanding of the data and preprocess them if needed. To view the data with pandas we can do:
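A minimal sketch of loading and inspecting the data, assuming the dataset has been downloaded locally as `london_energy.csv` (the file name is an assumption):

```python
import pandas as pd

# Load the London Energy dataset (adjust the file name to your local copy)
energy_df = pd.read_csv("london_energy.csv")

# Check for missing values and inspect the first rows
print(energy_df.isna().sum())
energy_df.head()
```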

The `LCLid` is a unique string that identifies each household, the `Date` is self-explanatory and the `KWH` is the total number of kilowatt-hours consumed on that date. There are no missing values at all. Since we want to predict the consumption in a general fashion and not per household, we need to group the results by date and average the kilowatt-hours.
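One way to perform this grouping, following the column names above (the lowercase `date` rename is an assumption made to keep the naming consistent later on):

```python
# Average the consumption of all households per day
avg_energy_df = energy_df.groupby("Date", as_index=False)["KWH"].mean()

# Rename the column to lowercase to keep the naming consistent from here on
avg_energy_df = avg_energy_df.rename(columns={"Date": "date"})
avg_energy_df.head()
```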

At this point, it would be great if we could have a look at the way consumption changes through the years. A line plot can expose this:
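A sketch of such a plot, assuming matplotlib and seaborn are available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Convert the date column to datetime so it plots on a proper time axis
avg_energy_df["date"] = pd.to_datetime(avg_energy_df["date"])

fig, ax = plt.subplots(figsize=(12, 5))
sns.lineplot(data=avg_energy_df, x="date", y="KWH", ax=ax)
ax.set_title("Average Energy Consumption per Day")
plt.show()
```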

Energy Consumption Plot on the Entire Dataset

The seasonal pattern is pretty obvious. During the winter months we observe high energy demand, while throughout the summer the consumption is at its lowest levels. This behavior repeats itself every year in the dataset, with different high and low values. To visualize the fluctuation over the span of a year we can do:
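One way to overlay the years on a common day-of-year axis; this is a sketch, not necessarily the exact plot behind the figure below:

```python
# Temporary helper columns to compare years against each other
avg_energy_df["year"] = avg_energy_df["date"].dt.year
avg_energy_df["day_of_year"] = avg_energy_df["date"].dt.dayofyear

fig, ax = plt.subplots(figsize=(12, 5))
sns.lineplot(data=avg_energy_df, x="day_of_year", y="KWH", hue="year", ax=ax)
ax.set_title("Energy Consumption per Year")
plt.show()
```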

Yearly Energy Consumption

To train models like XGBoost and LightGBM we need to create the features ourselves. Currently, we have only one feature: the full date. We can extract different features from the full date, such as the day of the week, the day of the year, the month and others. To achieve this we can do:
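A possible implementation of the feature extraction; the exact set of calendar features is an assumption:

```python
def extract_date_features(df):
    # Derive calendar features from the full date
    df = df.copy()
    df["day_of_week"] = df["date"].dt.dayofweek
    df["day_of_month"] = df["date"].dt.day
    df["day_of_year"] = df["date"].dt.dayofyear
    df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
    df["month"] = df["date"].dt.month
    df["quarter"] = df["date"].dt.quarter
    df["year"] = df["date"].dt.year
    return df

avg_energy_df = extract_date_features(avg_energy_df)
```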

So, the `date` feature is now redundant. Before dropping it, though, we will use it to split our dataset into training and testing sets. Contrary to conventional training, in time series we can’t split the set randomly, because the order of the data is extremely important and we are only allowed to train on past data. Otherwise, we might end up predicting a value while taking future values into consideration too! The dataset contains almost 2.5 years of data, so for the testing set we will use only the last 6 months. If the training set were bigger, we would have used the entire last year as the testing set.
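A simple time-based split along these lines, using the last 6 months as the testing period:

```python
# Keep the chronological order and use the last 6 months as the test set
avg_energy_df = avg_energy_df.sort_values("date")
split_date = avg_energy_df["date"].max() - pd.DateOffset(months=6)

training_data = avg_energy_df[avg_energy_df["date"] <= split_date]
testing_data = avg_energy_df[avg_energy_df["date"] > split_date]
```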

To visualize the split and distinguish between the training and testing sets, we can plot:
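A sketch of such a plot:

```python
fig, ax = plt.subplots(figsize=(12, 5))
sns.lineplot(data=training_data, x="date", y="KWH", label="training", ax=ax)
sns.lineplot(data=testing_data, x="date", y="KWH", label="testing", ax=ax)
ax.set_title("Training-Testing Split")
ax.legend()
plt.show()
```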

Visualizing Training-Testing Split

Now we can drop the `date` feature and create the training and testing sets:
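For example, keeping every column except `date` and the target `KWH` (a sketch following the variables defined above):

```python
# Every column except the raw date and the target becomes a feature
feature_columns = [c for c in avg_energy_df.columns if c not in ("date", "KWH")]

X_train, y_train = training_data[feature_columns], training_data["KWH"]
X_test, y_test = testing_data[feature_columns], testing_data["KWH"]
```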

Training the Models

The hyperparameter optimization will be done with grid search. Grid search takes a set of parameters and candidate values as configuration and tries out every possible combination. The parameter configuration that achieves the best result forms the best estimator. Grid search utilizes cross-validation too, so it is crucial to provide an appropriate splitting mechanism. Again, due to the nature of the problem, we can’t just use plain k-fold cross-validation. Scikit-learn provides the TimeSeriesSplit method, which splits the data incrementally while respecting its temporal continuity.
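A sketch of the grid search setup with `GridSearchCV` and `TimeSeriesSplit`; the parameter grid shown is illustrative and not the exact grid behind the reported results:

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

# Illustrative parameter grid (an assumption, tune as needed)
xgb_parameters = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}

# Cross-validation splits that respect the temporal order of the data
cv_split = TimeSeriesSplit(n_splits=4)

xgb_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid=xgb_parameters,
    cv=cv_split,
    scoring="neg_mean_absolute_error",
)
xgb_search.fit(X_train, y_train)
xgb_model = xgb_search.best_estimator_
```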

For the LightGBM model we can do the same by providing different parameters:
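A corresponding sketch for LightGBM, again with an illustrative parameter grid:

```python
from lightgbm import LGBMRegressor

# Illustrative parameter grid (an assumption, tune as needed)
lgb_parameters = {
    "n_estimators": [100, 500, 1000],
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
}

lgb_search = GridSearchCV(
    estimator=LGBMRegressor(),
    param_grid=lgb_parameters,
    cv=cv_split,
    scoring="neg_mean_absolute_error",
)
lgb_search.fit(X_train, y_train)
lgb_model = lgb_search.best_estimator_
```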

Evaluation

To evaluate the best estimator on the test set we will calculate some metrics: the Mean Absolute Error (MAE), the Mean Squared Error (MSE) and the Mean Absolute Percentage Error (MAPE). Each of these provides a different perspective on the actual performance of the trained model. Additionally, we will plot a line diagram to better visualize the performance of the model.

Lastly, to evaluate any of the aforementioned models we have to run the following:
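A possible evaluation helper along these lines, using scikit-learn's metric functions:

```python
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
)

def evaluate_model(model, X_test, y_test, dates):
    predictions = model.predict(X_test)

    print(f"MAE:  {mean_absolute_error(y_test, predictions):.4f}")
    print(f"MSE:  {mean_squared_error(y_test, predictions):.4f}")
    print(f"MAPE: {mean_absolute_percentage_error(y_test, predictions):.4f}")

    # Plot actual vs. predicted consumption over the test period
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(dates, y_test, label="actual")
    ax.plot(dates, predictions, label="predicted")
    ax.legend()
    plt.show()

evaluate_model(xgb_model, X_test, y_test, testing_data["date"])
evaluate_model(lgb_model, X_test, y_test, testing_data["date"])
```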

XGBoost Results
LightGBM Results

Even though XGBoost predicts the energy consumption during the winter months more accurately, to strictly quantify and compare the performances we need to calculate the error metrics. Comparing them, it is clear that XGBoost outperforms LightGBM in all cases.

Preprocessing Weather Data

The model performs relatively well, but is there a way to improve it even further? The answer is yes. There are many different tips and tricks available that can be employed in order to achieve better results. One of them is to use auxiliary features that are correlated directly or indirectly to energy consumption. For example, the weather data can play a decisive role when it comes to predicting energy demands. That’s why we choose to enhance our dataset with weather data from the London Weather Dataset.

First let’s take a look at the structure of the data:
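Assuming the weather data has been downloaded locally as `london_weather.csv` (the file name is an assumption):

```python
# Load the London Weather dataset and inspect its structure
weather_df = pd.read_csv("london_weather.csv")
weather_df.info()
```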

There are various missing values that need to be filled in. Filling missing data is not trivial and depends on each case. Since we have weather data, where each day depends on the previous and next days, we will fill those values by interpolation. We will also convert the `date` column to `datetime` and, then, merge the two dataframes in order to get one enhanced dataframe.
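A sketch of these steps; the `%Y%m%d` date format is an assumption about how the dates are stored in the weather file:

```python
# Fill missing weather values by linear interpolation
weather_df = weather_df.interpolate(method="linear")

# Convert integer-like dates (e.g. 20120101) to datetime
weather_df["date"] = pd.to_datetime(weather_df["date"].astype(str), format="%Y%m%d")

# Merge the energy and weather data on the common date column
enhanced_df = avg_energy_df.merge(weather_df, on="date", how="inner")
```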

Keep in mind that after generating the enhanced set, we have to re-run the splitting process and get the new `training_data` and `testing_data`. Do not forget to include the new features too.
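A sketch of the updated split, reusing the same cut-off date as before:

```python
# Repeat the time-based split on the enhanced dataframe
training_data = enhanced_df[enhanced_df["date"] <= split_date]
testing_data = enhanced_df[enhanced_df["date"] > split_date]

# The feature list now also includes the weather columns
feature_columns = [c for c in enhanced_df.columns if c not in ("date", "KWH")]
X_train, y_train = training_data[feature_columns], training_data["KWH"]
X_test, y_test = testing_data[feature_columns], testing_data["KWH"]
```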

There is no need to update the training steps. After training the models on the new dataset, we get the following results:

XGBoost Enhanced with Weather Results
LightGBM Enhanced with Weather Results

The weather data improve the performance of both models by a significant margin. In particular, in the XGBoost scenario the MAE is reduced by almost 44%, while the MAPE dropped from 19% to 16%. For LightGBM, the MAE has dropped by 42% and the MAPE declined from 19.8% to 16.7%.

Conclusion & Future Steps

Ensemble models are very powerful machine learning tools that can be utilized for time series forecasting problems. In this article, we’ve seen how this is done in the case of energy consumption. At first, we trained our models using solely date-derived features. Later on, we incorporated additional data correlated with the task at hand into the training process, which boosted the results notably.

The performance can be improved even more by incorporating the so-called lag features or trying different hyperparameter optimization techniques such as randomized search or Bayesian optimization. If you want to see how, check the latest article of this series. I encourage you to try these out yourselves and share the outcomes in the comments below.

