Article

Introduction to Energy Modelling - An ML Approach

Author:

Jason Ramchandani
Lead Developer Advocate

Overview

This article will introduce a simple approach to begin modelling energy markets. In our case we will be looking to see if we can ascertain any impact on German energy prices when the wind blows. Energy markets are quite complex to model; nevertheless, we will have a go at this using some of the tools available in the Python scientific computing ecosystem. I hope to demonstrate that it is possible with a little effort.

We will be using a mix of data from different sources combined with data from both Eikon and the Refinitiv Data Platform as well as our Point Connect service. We will be using some APIs and also some file-based sources. I will include any data files in the code repository so you can replicate.

Sections

Imports and Connections

API Calls

Visualization Workflow

Open Source Wind Data

Locating Weather Monitoring Stations

Calculate nearest weather station for each windfarm

Visualizations using GeoPandas, Shapefiles and Matplotlib

Getting Prices Data (our Labels)

Creating z-scores

Implement Machine Learning Model

Split our dataset into train, evaluation and test sets

Create feature and label sets

Define our model to test and hand it our train, eval and test sets

Generate Visual Outputs

Initial Observations

Summary

Further Resources

Pre-requisites:

Refinitiv Eikon / Workspace with access to Eikon Data APIs (Free Trial Available)

Python 2.x/3.x

Required Python Packages: eikon, pandas, numpy, matplotlib, geopandas, shapely, scipy, xgboost, scikit-learn
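If you are starting from a fresh environment, these can typically be installed with pip (the package names below are the usual PyPI names, but check them against your own setup - in particular the Refinitiv Data Platform library name may differ by version):

pip install eikon refinitiv-dataplatform pandas numpy matplotlib geopandas shapely scipy xgboost scikit-learn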

 

Imports and connections

So let's start by importing some libraries and setting our credentials.

    	
            

import refinitiv.dataplatform.eikon as ek

ek.set_app_key('Your APP KEY Here')

import refinitiv.dataplatform as rdp

rdp.open_desktop_session('Your APP KEY Here')

rdp.__version__

    	
'1.0.0a6'
    	
            

import numpy as np

import pandas as pd

import geopandas as gpd

from geopandas import GeoDataFrame

from shapely.geometry import Point, Polygon

import matplotlib as mp

import matplotlib.pyplot as plt

%matplotlib inline

from scipy.spatial.distance import cdist

import time

import scipy.stats as sci

import xgboost as xgb

from xgboost import plot_importance, plot_tree

from sklearn.metrics import mean_squared_error, mean_absolute_error

from collections import OrderedDict

import warnings

warnings.filterwarnings("ignore")

API Calls

The code below uses the new RDP Search service, which is an excellent tool for finding content - in our case Physical Assets such as windfarms - but there are many other search views to help you out, such as instruments, people, entities and many more (reference guide). We provide a filter which contains the query syntax we need to narrow down the Physical Asset universe to the ones we want - wind farms located in Germany.

    	
            

dfRic = rdp.search(

    view = rdp.SearchViews.PhysicalAssets,

    filter = "RCSAssetTypeLeaf eq 'wind farm' and RCSRegionLeaf eq 'Germany'",

    top=5000

)

display(dfRic)

         PI PermID BusinessEntity RIC DocumentTitle
0 77309744787 77309744787 PHYSICALASSETxPLANT C}GW7309744787 Sornzig-Ablass, Power; Wind Farm, Germany
1 77309744788 77309744788 PHYSICALASSETxPLANT C}GW7309744788 Esperstedt-Obhausen, Power; Wind Farm, Germany
2 77309761318 77309761318 PHYSICALASSETxPLANT C}GW7309761318 Arzberg , Power; Wind Farm, Germany
3 77309761319 77309761319 PHYSICALASSETxPLANT C}GW7309761319 Ascheberg , Power; Wind Farm, Germany
4 77309761320 77309761320 PHYSICALASSETxPLANT C}GW7309761320 Aschersleben, Power; Wind Farm, Germany
... ... ... ... ... ...
3514 77309825970 77309825970 PHYSICALASSETxPLANT C}GW7309825970 Gestorf, Power; Wind Farm, Germany
3515 77309831038 77309831038 PHYSICALASSETxPLANT C}GW7309831038 Windpark Havelland, Power; Wind Farm, Germany
3516 77309891128 77309891128 PHYSICALASSETxPLANT C}GW7309891128 Nordsee One, Power; Wind Farm, Germany
3517 77309914787 77309914787 PHYSICALASSETxPLANT C}GW7309914787 Wikinger, Power; Wind Farm, Germany
3518 77309921538 77309921538 PHYSICALASSETxPLANT C}GW7309921538 Druxberge, Power; Wind Farm, Germany

     

3519 rows × 5 columns

Next we provide the list of physical asset RICs from the previous search to our get_data API call, along with a few fields we are interested in.

    	
            

df, err = ek.get_data(instruments = dfRic['RIC'].astype(str).values.tolist(),fields = 

                      ["TR.AssetName","TR.AssetLongitude","TR.AssetLatitude","TR.AssetCapacity",

                       "TR.AssetCapacityDateFrom"])

 

df

       Instrument Name Longitude Latitude Capacity Capacity Date From
0 C}GW7309744787 Sornzig-Ablass 12.990382 51.196692 5 1999-01-01
1 C}GW7309744788 Esperstedt-Obhausen 11.705583 51.401455 13  
2 C}GW7309761318 Arzberg 12.174439 50.071657 1 1999-11-01
3 C}GW7309761319 Ascheberg 7.572423 51.784933 1 1999-11-01
4 C}GW7309761320 Aschersleben 11.480171 51.795133 23 2004-01-01
... ... ... ... ... ... ...
3514 C}GW7309825970 Gestorf 8.787088 53.082689 28  
3515 C}GW7309831038 Windpark Havelland 12.742300 52.918800 147 2009-01-01
3516 C}GW7309891128 Nordsee One 6.764129 53.985858 332 2017-01-01
3517 C}GW7309914787 Wikinger 14.069000 54.833548 350 2018-07-04
3518 C}GW7309921538 Druxberge 11.283300 52.158000 117 2018-05-14

3519 rows × 6 columns

 

Visualization Workflow

We will be creating some visualisations using geopandas and shape files - to do this we need to create a geometry point type (a coordinate of longitude and latitude) so we can work with polygonal maps. Here we zip our longitude and latitude together to form this new Point datatype. We store this in a new geometry column.

    	
            

geometry = [Point(xy) for xy in zip(df.Longitude,df.Latitude)]

df['geometry'] = geometry

df['closestWMO'] = np.nan

df

 

          Instrument Name Longitude Latitude Capacity Capacity Date From geometry closestWMO
0 C}GW7309744787 Sornzig-Ablass 12.990382 51.196692 5 1999-01-01 POINT (12.990382 51.196692) NaN
1 C}GW7309744788 Esperstedt-Obhausen 11.705583 51.401455 13   POINT (11.705583 51.401455) NaN
2 C}GW7309761318 Arzberg 12.174439 50.071657 1 1999-11-01 POINT (12.174439 50.071657) NaN
3 C}GW7309761319 Ascheberg 7.572423 51.784933 1 1999-11-01 POINT (7.572423 51.784933) NaN
4 C}GW7309761320 Aschersleben 11.480171 51.795133 23 2004-01-01 POINT (11.480171 51.795133) NaN
... ... ... ... ... ... ... ... ...
3514 C}GW7309825970 Gestorf 8.787088 53.082689 28   POINT (8.787088000000001 53.082689) NaN
3515 C}GW7309831038 Windpark Havelland 12.742300 52.918800 147 2009-01-01 POINT (12.7423 52.9188) NaN
3516 C}GW7309891128 Nordsee One 6.764129 53.985858 332 2017-01-01 POINT (6.764129 53.98585799999999) NaN
3517 C}GW7309914787 Wikinger 14.069000 54.833548 350 2018-07-04 POINT (14.069 54.833548) NaN
3518 C}GW7309921538 Druxberge 11.283300 52.158000 117 2018-05-14 POINT (11.2833 52.158) NaN

3519 rows × 8 columns

 

Open source wind data

Next we wish to download some wind data. There is plenty of open source data available - the data used here is sourced from Deutscher Wetterdienst (see the Further Resources section for more detail) and comes from various weather monitoring stations in Germany, which record wind speed at a height of 10 meters. Maybe you have a store of such or similar information in your own company. Here we are just reading it into a dataframe, wind. Data basis: Deutscher Wetterdienst, aggregated.

    	
            

wind = pd.read_csv('/Users/jasonram/Downloads/DEWindHourly.csv')

wind.head()

         wmo 10020 10046 10055 10091 10113 10147 10162 10200 10224 ... 10815 10852 10870 10895 10908 10929 10946 10948 10961 10962
0 14/01/2016 00:00 5.1 3.1 4.1 4.1 5.1 4.1 4.1 4.1 5.1 ... 2.1 4.1 5.1 6.2 NaN 3.1 2.1 2.1 13.9 7.2
1 14/01/2016 01:00 4.1 3.1 3.1 5.1 4.1 5.1 4.1 5.1 5.1 ... 2.1 4.1 3.1 6.2 NaN 2.1 3.1 2.1 12.9 8.2
2 14/01/2016 02:00 5.1 3.1 3.1 4.1 4.1 5.1 4.1 3.1 4.1 ... 3.1 4.1 4.1 7.2 NaN 2.1 3.1 3.1 12.9 8.2
3 14/01/2016 03:00 5.1 3.1 3.1 5.1 4.1 5.1 3.1 5.1 4.1 ... 3.1 4.1 4.1 5.1 NaN 1.0 3.1 2.1 14.9 6.2
4 14/01/2016 04:00 1.0 3.1 5.1 3.1 4.1 4.1 4.1 4.1 3.1 ... 3.1 2.1 4.1 4.1 NaN 1.0 3.1 1.0 10.8 4.1

5 rows × 52 columns

We also want to get total wind output data for Germany - this I got from our Point Connect data service.

    	
            

output = pd.read_csv('/Users/jasonram/Documents/DEUwindprod.csv')

output.set_index('Date', inplace=True)

output.index = pd.to_datetime(output.index)

output.head()

              Output MwH      
Date  
2016-01-01 12:00:00 1170.62800
2016-01-01 13:00:00 847.90200
2016-01-01 14:00:00 820.25950
2016-01-01 15:00:00 1093.58175
2016-01-01 16:00:00 1787.73625



Now we add two additional features: Total Wind Speed (summed across the weather stations) and Mean Wind Speed (the average). Data basis: Deutscher Wetterdienst, average and total added.

 

    	
            

wind.set_index('wmo', inplace=True)

wind.index = pd.to_datetime(wind.index)

wind.index.names=['Date']

wind['TotWin'] = wind.iloc[:,1:].sum(axis=1, skipna=True)   # row-wise sum over the sliced station columns

wind['MeanTot'] = wind.iloc[:,1:].mean(axis=1, skipna=True) # row-wise mean over the same slice (note TotWin has just been appended, so it is included here too)

wind.head()

          10020 10046 10055 10091 10113 10147 10162 10200 10224 10264 ... 10870 10895 10908 10929 10946 10948 10961 10962 TotWin MeanTot
Date                                          
2016-01-14 00:00:00 5.1 3.1 4.1 4.1 5.1 4.1 4.1 4.1 5.1 3.1 ... 5.1 6.2 NaN 3.1 2.1 2.1 13.9 7.2 194.9 7.955102
2016-01-14 01:00:00 4.1 3.1 3.1 5.1 4.1 5.1 4.1 5.1 5.1 3.1 ... 3.1 6.2 NaN 2.1 3.1 2.1 12.9 8.2 189.0 7.560000
2016-01-14 02:00:00 5.1 3.1 3.1 4.1 4.1 5.1 4.1 3.1 4.1 3.1 ... 4.1 7.2 NaN 2.1 3.1 3.1 12.9 8.2 184.8 7.542857
2016-01-14 03:00:00 5.1 3.1 3.1 5.1 4.1 5.1 3.1 5.1 4.1 3.1 ... 4.1 5.1 NaN 1.0 3.1 2.1 14.9 6.2 171.5 7.000000
2016-01-14 04:00:00 1.0 3.1 5.1 3.1 4.1 4.1 4.1 4.1 3.1 3.1 ... 4.1 4.1 NaN 1.0 3.1 1.0 10.8 4.1 156.0 6.500000

5 rows × 53 columns

We now join our wind output dataframe to our main wind features dataframe using an inner join on Date.

    	
            

wind = wind.join(output, how='inner', on='Date')

wind.describe()

          10020 10046 10055 10091 10113 10147 10162 10200 10224 10264 ... 10895 10908 10929 10946 10948 10961 10962 TotWin MeanTot Output MwH
count 30625.000000 30016.000000 31469.000000 30559.000000 30130.000000 30408.000000 30281.000000 31351.000000 30655.000000 30232.000000 ... 30799.000000 29963.000000 30314.000000 29201.000000 28651.000000 30534.000000 30686.000000 31912.00000 31912.000000 31091.000000
mean 7.179798 3.887883 6.443344 6.884208 5.956927 4.239549 3.903798 4.144633 4.322143 3.318173 ... 3.153528 7.712863 2.417061 2.230297 2.217553 6.837768 4.376598 181.39446 7.447921 11212.407872
std 3.140366 2.043747 3.247412 3.504434 2.828279 2.199805 1.990499 2.099516 2.307688 1.698319 ... 1.965949 4.587546 1.556825 1.146934 1.243888 3.889654 2.572592 73.87456 2.751302 8616.785901
min 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 ... 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.00000 0.000000 137.500000
25% 5.100000 2.100000 4.100000 4.100000 4.100000 2.600000 2.100000 2.100000 3.100000 2.100000 ... 2.100000 4.100000 1.000000 1.000000 1.000000 4.100000 3.100000 131.70000 5.485714 4558.045375
50% 7.200000 3.100000 6.200000 6.200000 5.100000 4.100000 3.100000 4.100000 4.100000 3.100000 ... 2.600000 7.200000 2.100000 2.100000 2.100000 6.200000 4.100000 167.30000 6.791578 8916.729000
75% 8.700000 5.100000 8.200000 8.700000 7.200000 5.100000 5.100000 5.100000 6.200000 4.100000 ... 4.100000 9.800000 3.100000 3.100000 3.100000 8.700000 5.100000 218.50000 8.796000 15448.654250
max 28.800000 15.900000 22.100000 23.100000 25.700000 19.000000 15.900000 18.000000 20.100000 12.900000 ... 18.000000 30.900000 12.900000 11.800000 11.800000 32.900000 24.200000 588.50000 23.078431 44697.997000

8 rows × 54 columns

 

Locating Weather Monitoring Stations

For more detailed modelling I have included this section of code, where we can download the geographic locations of each of our wind monitoring sites. This will allow us to calculate the closest wind monitoring station to each of the windfarms and perhaps get a more accurate indication of the impact of local weather conditions on output. First we load the locations of our wind monitoring sites and then create a new geometry type for them. Data basis: Deutscher Wetterdienst, gridded data reproduced graphically.

    	
            

location = pd.read_csv('/Users/jasonram/Downloads/deWMOlocation.csv')

location.head()

        wmo lat long
0 10020 55.0111 8.4125
1 10046 54.3761 10.1433
2 10055 54.5297 11.0619
3 10091 54.6817 13.4367
4 10113 53.7139 7.1525
    	
            

geometry2 = [Point(xy) for xy in zip(location.long,location.lat)]

location['geometry2'] = geometry2

location.head()

        wmo lat long geometry2
0 10020 55.0111 8.4125 POINT (8.4125 55.0111)
1 10046 54.3761 10.1433 POINT (10.1433 54.3761)
2 10055 54.5297 11.0619 POINT (11.0619 54.5297)
3 10091 54.6817 13.4367 POINT (13.4367 54.6817)
4 10113 53.7139 7.1525 POINT (7.1525 53.7139)

Calculate nearest weather station for each windfarm

Here we will calculate the closest weather monitoring station for each of our windfarms using the scipy.spatial.distance cdist routine, which calculates the distance from one point to a range of coordinates - then we just select the minimum (closest) one and store it in our dataframe.

    	
            

def closestWMO(point, plist):

    # return the coordinate pair in plist that is closest (Euclidean distance) to point
    return plist[cdist([point], plist).argmin()]

    	
            

pointList = location[['lat','long']].values

for i, row in df.iterrows():

    pointA = (df.Latitude[i],df.Longitude[i])

    sol = closestWMO(pointA, pointList)

    # look up the WMO id of the matched coordinates and store it against this windfarm
    df.loc[i, 'closestWMO'] = location['wmo'].loc[(location['lat']==sol[0]) & (location['long']==sol[1])].values[0]

 

df.head()   

 

        Instrument Name Longitude Latitude Capacity Capacity Date From geometry closestWMO
0 C}GW7309744787 Sornzig-Ablass 12.990382 51.196692 5 1999-01-01 POINT (12.990382 51.196692) 10488.0
1 C}GW7309744788 Esperstedt-Obhausen 11.705583 51.401455 13   POINT (11.705583 51.401455) 10469.0
2 C}GW7309761318 Arzberg 12.174439 50.071657 1 1999-11-01 POINT (12.174439 50.071657) 10685.0
3 C}GW7309761319 Ascheberg 7.572423 51.784933 1 1999-11-01 POINT (7.572423 51.784933) 10315.0
4 C}GW7309761320 Aschersleben 11.480171 51.795133 23 2004-01-01 POINT (11.480171 51.795133) 10469.0
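Note that cdist defaults to Euclidean distance on raw latitude/longitude degrees, which is only an approximation (a degree of longitude covers less ground the further you move from the equator). If you want more precise matching you could swap in a great-circle (haversine) distance instead; here is a minimal sketch of that idea, reusing the df and location dataframes from above - an optional refinement rather than part of the original workflow:

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in kilometres between points given in decimal degrees
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# distance from the first windfarm to every monitoring station, then pick the closest
dists = haversine_km(df.Latitude[0], df.Longitude[0], location['lat'].values, location['long'].values)
nearest_wmo = location['wmo'].iloc[dists.argmin()]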

Visualizations using GeoPandas, Shapefiles and Matplotlib

We will now do some visualisations working with GeoPandas and shapefiles. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Shapefiles are geospatial vector data formats that allow storing of geographic features such as lines, points and polygons (areas). There are many shapefiles available on the internet. I am using one I found for Germany and have included it in the GitHub repo containing this notebook so you can replicate. Here we just read in the file (I have also printed it so you can take a look at it) and then plot it using matplotlib.

    	
            

gdf = gpd.read_file('/Users/jasonram/Downloads/gadm36_DEU_shp/gadm36_DEU_4.shp')

print(gdf)

GID_0 NAME_0 GID_1 NAME_1 GID_2 \
0 DEU Germany DEU.1_1 Baden-Württemberg DEU.1.1_1
1 DEU Germany DEU.1_1 Baden-Württemberg DEU.1.1_1
2 DEU Germany DEU.1_1 Baden-Württemberg DEU.1.1_1
3 DEU Germany DEU.1_1 Baden-Württemberg DEU.1.1_1
4 DEU Germany DEU.1_1 Baden-Württemberg DEU.1.1_1
... ... ... ... ... ...
11297 DEU Germany DEU.16_1 Thüringen DEU.16.22_1
11298 DEU Germany DEU.16_1 Thüringen DEU.16.22_1
11299 DEU Germany DEU.16_1 Thüringen DEU.16.22_1
11300 DEU Germany DEU.16_1 Thüringen DEU.16.22_1
11301 DEU Germany DEU.16_1 Thüringen DEU.16.22_1

NAME_2 GID_3 NAME_3 GID_4 \
0 Alb-Donau-Kreis DEU.1.1.1_1 Allmendingen DEU.1.1.1.1_1
1 Alb-Donau-Kreis DEU.1.1.1_1 Allmendingen DEU.1.1.1.2_1
2 Alb-Donau-Kreis DEU.1.1.2_1 Blaubeuren DEU.1.1.2.1_1
3 Alb-Donau-Kreis DEU.1.1.2_1 Blaubeuren DEU.1.1.2.2_1
4 Alb-Donau-Kreis DEU.1.1.3_1 Blaustein DEU.1.1.3.1_1
... ... ... ... ...
11297 Weimarer Land DEU.16.22.9_1 Nordkreis Weimar DEU.16.22.9.12_1
11298 Weimarer Land DEU.16.22.9_1 Nordkreis Weimar DEU.16.22.9.13_1
11299 Weimarer Land DEU.16.22.9_1 Nordkreis Weimar DEU.16.22.9.14_1
11300 Weimarer Land DEU.16.22.9_1 Nordkreis Weimar DEU.16.22.9.15_1
11301 Weimarer Land DEU.16.22.9_1 Nordkreis Weimar DEU.16.22.9.16_1

NAME_4 VARNAME_4 TYPE_4 ENGTYPE_4 CC_4 \
0 Allmendingen None Gemeinde Town 084255001002
1 Altheim None Gemeinde Town 084255001004
2 Berghülen None Gemeinde Town 084255002017
3 Blaubeuren None Stadt Town 084255002020
4 Blaustein None Stadt Town 084250141141
... ... ... ... ... ...
11297 Rohrbach None Gemeinde Town 160715013081
11298 Sachsenhausen None Gemeinde Town 160715013082
11299 Schwerstedt None Gemeinde Town 160715013085
11300 Vippachedelhausen None Gemeinde Town 160715013092
11301 Wohlsborn None Gemeinde Town 160715013097

geometry
0 (POLYGON ((9.747226715087947 48.32448959350592...
1 (POLYGON ((9.76099872589117 48.32106399536138,...
2 POLYGON ((9.82752799987793 48.45702743530279, ...
3 POLYGON ((9.775817871093693 48.36040115356457,...
4 POLYGON ((9.82752799987793 48.45702743530279, ...
... ...
11297 POLYGON ((11.37983036041265 51.07011032104498,...
11298 POLYGON ((11.39456176757824 51.03350448608404,...
11299 POLYGON ((11.30850887298595 51.05867004394531,...
11300 POLYGON ((11.19609355926508 51.10009002685547,...
11301 POLYGON ((11.37327384948736 51.02006912231445,...

[11302 rows x 15 columns]

    	
            

fig,ax = plt.subplots(figsize =(15,15))

gdf.plot(ax=ax)

Next we just convert our df dataframe (containing all our windfarm geometry points) and our location dataframe (containing the locations of our weather monitoring stations) to GeoDataFrames, then plot the map and overlay our windfarms as green dots and our weather monitoring stations as blue dots. Data basis: Deutscher Wetterdienst, gridded data reproduced graphically.

    	
            

gdf2 = GeoDataFrame(df,geometry=geometry)

gdf2 = gdf2[(gdf2['Longitude'] >=5) & (gdf2['Longitude'] <20)]

gdf3 = GeoDataFrame(location,geometry=geometry2)

    	
            

fig, ax = plt.subplots(figsize= (15,15))

gdf.plot(ax=ax,alpha = 0.4,color="grey")

gdf2.plot(ax=ax, color = "green", markersize = 5)

gdf3.plot(ax=ax, color = "blue", markersize = 30)

Get temperature data

Here we are using hourly air temperature. This could be a demand-side feature, as colder temperatures might lead to more heating use and warmer temperatures might lead to more air conditioning, etc. Note we are only taking the data from one station in the middle of Germany, but for our exploratory purposes this will give us some indication of temperature conditions. Data basis: Deutscher Wetterdienst. More information is available in the Further Resources section.

    	
            

temp = pd.read_csv('/Users/jasonram/Downloads/DEairtemp.csv')

temp.set_index('Date', inplace=True)

temp.index = pd.to_datetime(temp.index)

temp.head()

             AirTemp    
Date  
2016-01-14 00:00:00 2.9
2016-01-14 01:00:00 2.8
2016-01-14 02:00:00 2.6
2016-01-14 03:00:00 2.1
2016-01-14 04:00:00 1.6
    	
            

wind = wind.join(temp, how='inner', on='Date')

wind.head()

        10020 10046 10055 10091 10113 10147 10162 10200 10224 10264 ... 10908 10929 10946 10948 10961 10962 TotWin MeanTot Output MwH AirTemp
Date                                          
2016-01-14 00:00:00 5.1 3.1 4.1 4.1 5.1 4.1 4.1 4.1 5.1 3.1 ... NaN 3.1 2.1 2.1 13.9 7.2 194.9 7.955102 9561.25 2.9
2016-01-14 01:00:00 4.1 3.1 3.1 5.1 4.1 5.1 4.1 5.1 5.1 3.1 ... NaN 2.1 3.1 2.1 12.9 8.2 189.0 7.560000 9201.25 2.8
2016-01-14 02:00:00 5.1 3.1 3.1 4.1 4.1 5.1 4.1 3.1 4.1 3.1 ... NaN 2.1 3.1 3.1 12.9 8.2 184.8 7.542857 8975.25 2.6
2016-01-14 03:00:00 5.1 3.1 3.1 5.1 4.1 5.1 3.1 5.1 4.1 3.1 ... NaN 1.0 3.1 2.1 14.9 6.2 171.5 7.000000 8285.75 2.1
2016-01-14 04:00:00 1.0 3.1 5.1 3.1 4.1 4.1 4.1 4.1 3.1 3.1 ... NaN 1.0 3.1 1.0 10.8 4.1 156.0 6.500000 8373.25 1.6

5 rows × 55 columns

Data on Solar energy

Here we are using the hourly sum of incoming solar radiation. At night the sun does not shine, so we should expect zero readings; from sunrise it should start to increase until sunset. Note, this is for only one area in the centre of Germany, but for our exploratory purposes it will at least give us some indication of sunny conditions. Data basis: Deutscher Wetterdienst. More information is available in the Further Resources section.

    	
            

solar = pd.read_csv('/Users/jasonram/Downloads/DEsolar.csv')

solar.set_index('Date', inplace=True)

solar.index = pd.to_datetime(solar.index)

solar

                  FG_LBERG     
Date  
2016-01-14 00:00:00 0
2016-01-14 01:00:00 0
2016-01-14 02:00:00 0
2016-01-14 03:00:00 0
2016-01-14 04:00:00 0
... ...
2020-09-30 20:00:00 0
2020-09-30 21:00:00 0
2020-09-30 22:00:00 0
2020-09-30 23:00:00 0
2020-10-01 00:00:00 0
    	
            

wind = wind.join(solar, how='inner', on='Date')

wind.head()

        10020 10046 10055 10091 10113 10147 10162 10200 10224 10264 ... 10929 10946 10948 10961 10962 TotWin MeanTot Output MwH AirTemp FG_LBERG
Date                                          
2016-01-14 00:00:00 5.1 3.1 4.1 4.1 5.1 4.1 4.1 4.1 5.1 3.1 ... 3.1 2.1 2.1 13.9 7.2 194.9 7.955102 9561.25 2.9 0
2016-01-14 01:00:00 4.1 3.1 3.1 5.1 4.1 5.1 4.1 5.1 5.1 3.1 ... 2.1 3.1 2.1 12.9 8.2 189.0 7.560000 9201.25 2.8 0
2016-01-14 02:00:00 5.1 3.1 3.1 4.1 4.1 5.1 4.1 3.1 4.1 3.1 ... 2.1 3.1 3.1 12.9 8.2 184.8 7.542857 8975.25 2.6 0
2016-01-14 03:00:00 5.1 3.1 3.1 5.1 4.1 5.1 3.1 5.1 4.1 3.1 ... 1.0 3.1 2.1 14.9 6.2 171.5 7.000000 8285.75 2.1 0
2016-01-14 04:00:00 1.0 3.1 5.1 3.1 4.1 4.1 4.1 4.1 3.1 3.1 ... 1.0 3.1 1.0 10.8 4.1 156.0 6.500000 8373.25 1.6 0

5 rows × 56 columns

 

Getting Prices Data (our Labels)

Thus far we have been building our feature matrix (our wind, temperature and solar features); now we wish to build our label dataset - in our case we wish to look at hourly EPEX Spot energy prices. All price data on Refinitiv platforms is in GMT, whereas all the other data is in local time (CET), therefore we need to shift our price data forward by 1 hour to align it correctly. The source is our Point Connect repository - again I have included this file in the github repo so you can replicate.

    	
            

pri = pd.read_csv('/Users/jasonram/Downloads/DEUprices1.csv')

pri.set_index('Date', inplace=True)

pri.index = pd.to_datetime(pri.index)

pri = pri.shift(+1, freq='H')

pri.head()

 

 

                 Val 
Date  
2015-01-01 01:00:00 18.29
2015-01-01 02:00:00 16.04
2015-01-01 03:00:00 14.60
2015-01-01 04:00:00 14.95
2015-01-01 05:00:00 14.50

Sometimes ML routines can work better with smoothed data, so I have included a moving average. I don't use it here but include it for reference.

    	
            

pri['val_mav'] = pri['Val'].rolling(window=5).mean()

pri.head()

 

         Val val_mav
Date    
2015-01-01 01:00:00 18.29 NaN
2015-01-01 02:00:00 16.04 NaN
2015-01-01 03:00:00 14.60 NaN
2015-01-01 04:00:00 14.95 NaN
2015-01-01 05:00:00 14.50 15.676

Creating z-scores

We also wish to create some z-scores to normalise variation across the dataset.

    	
            

b = list(pri.columns[0:2].values)

for col in b:

    col_zscore = col + '_zscore'

    pri[col_zscore] = (pri[col] - pri[col].mean())/pri[col].std(ddof=0)

 

pri

         Val val_mav Val_zscore val_mav_zscore
Date        
2015-01-01 01:00:00 18.29 NaN -1.038734 NaN
2015-01-01 02:00:00 16.04 NaN -1.175974 NaN
2015-01-01 03:00:00 14.60 NaN -1.263808 NaN
2015-01-01 04:00:00 14.95 NaN -1.242459 NaN
2015-01-01 05:00:00 14.50 15.676 -1.269908 -1.270077
... ... ... ... ...
2019-09-30 21:00:00 44.58 53.148 0.564845 1.152580
2019-09-30 22:00:00 34.60 50.032 -0.043892 0.951123
2019-09-30 23:00:00 33.09 43.776 -0.135996 0.546657
2019-10-01 00:00:00 31.58 38.624 -0.228100 0.213568
2019-01-10 01:00:00 30.04 34.778 -0.322033 -0.035086

41617 rows × 4 columns

We also need to normalize our feature matrix before handing it to our ML routine.

 

    	
            

wind_zs = wind.copy() # work on a copy so the original feature dataframe is left untouched

b = list(wind_zs.columns.values)

for col in b:

    col_zscore = col + '_zscore'

    wind_zs[col_zscore] = (wind_zs[col] - wind_zs[col].mean())/wind_zs[col].std(ddof=0) #pandas mean/std skip NaN values

    

wind_zs

           10020 10046 10055 10091 10113 10147 10162 10200 10224 10264 ... 10929_zscore 10946_zscore 10948_zscore 10961_zscore 10962_zscore TotWin_zscore MeanTot_zscore Output MwH_zscore AirTemp_zscore FG_LBERG_zscore
Date                                          
2016-02-02 00:00:00 6.99 14.496 -1.727986 -1.346367 13.9 7.2 15.9 13.9 9.8 8.2 ... 1.091376 1.648526 -0.093817 1.818084 0.718598 3.204296 3.374194 2.282638 -0.188402 -0.251216
2016-03-02 00:00:00 13.78 19.312 -1.313825 -1.035001 17.0 9.8 19.0 18.0 11.8 10.8 ... 1.091376 1.648526 0.710983 2.332151 2.516744 2.925399 3.074240 1.936871 -0.939298 -0.251216
2016-04-02 00:00:00 25.03 28.906 -0.627622 -0.414725 10.8 5.1 12.9 11.8 14.9 5.1 ... 0.445495 1.648526 NaN 0.764248 1.500401 1.385346 1.417906 0.989915 -0.530722 -0.251216
2016-05-02 00:00:00 26.62 30.968 -0.530639 -0.281412 7.2 3.1 7.2 7.2 4.1 2.1 ... 3.093606 1.648526 0.710983 1.021281 1.500401 0.348666 0.364528 -0.381246 -0.574892 -0.251216
2016-06-02 00:00:00 26.41 32.474 -0.543448 -0.184045 9.8 4.1 5.1 8.7 NaN 6.2 ... -0.200385 -0.111577 -0.093817 0.481511 -0.493195 0.178608 0.120055 0.277080 0.772304 -0.251216
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-09-30 20:00:00 49.27 52.510 0.850916 1.111332 8.2 4.1 10.8 12.9 6.2 3.1 ... -0.200385 -1.079634 -0.979097 -0.161072 -0.102294 0.287445 0.179243 0.689498 0.032450 -0.251216
2019-09-30 21:00:00 44.58 53.148 0.564845 1.152580 7.2 4.1 10.8 12.9 3.1 5.1 ... -0.200385 -0.111577 -0.979097 0.352995 -0.884097 0.179968 0.121518 0.518119 0.087663 -0.251216
2019-09-30 22:00:00 34.60 50.032 -0.043892 0.951123 7.2 3.1 9.8 12.9 3.1 4.1 ... -0.910854 -1.079634 -0.979097 0.764248 -0.884097 0.034398 -0.087575 0.468959 0.142876 -0.251216
2019-09-30 23:00:00 33.09 43.776 -0.135996 0.546657 6.2 3.1 8.7 10.8 6.2 4.1 ... -0.200385 -1.079634 -0.979097 1.818084 -1.314088 0.141875 0.025751 0.379867 0.164961 -0.251216
2019-10-01 00:00:00 31.58 38.624 -0.228100 0.213568 7.2 4.1 6.2 7.2 8.7 3.1 ... -0.200385 -1.079634 NaN -0.700842 -0.102294 0.345945 0.361541 0.635634 0.054536 -0.251216

30977 rows × 116 columns

Temporal Considerations

We want to compare the same hourly data for each day - so we need to create a daily series for each hour of the day, as it is these series we want to model. The reason is that demand/supply profiles are different for each hour of the day, so it is not correct to compare hours in a continuous series, i.e. hour 01 to hour 02 etc. The correct set-up is day 1 hour 1, day 2 hour 1, day 3 hour 1 etc. In this way we build up 24 discrete hourly profiles - as it is these we wish to model. Below we just extract the hour from our timestamp.
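Note that intersected_df - the dataframe holding both our normalised features and our price labels - is not created by any of the code shown above. Here is a minimal sketch of the assumed step, an inner join of the price dataframe (pri) onto the normalised feature dataframe (wind_zs) on their shared Date index (the join direction and exact column selection are assumptions based on the column counts reported below):

# Assumed step (not shown in the article): combine labels and features.
# Both pri and wind_zs carry hourly 'Date' indexes, so an inner join keeps
# only the hours present in both datasets.
intersected_df = pri.join(wind_zs, how='inner')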

    	
intersected_df['hour'] = intersected_df.index.strftime('%H')

Implement Machine Learning Model

For our modelling we will use an xgboost model, which is a gradient-boosted decision tree class of algorithm that is quite popular in the literature. We will apply the simplest vanilla implementation here to see if we can get any usable results. Obviously the ML area is quite technical and requires some mastery, but you can use this codebase to extend and experiment as appropriate.

 

Split our dataset into train, evaluation and test sets

We need to split our dataset into a training set, an evaluation set and a test set. We will hand the ML routine only the training and evaluation datasets to generate its models. We will then test predictions made by those models against the test dataset to see if we can find anything useful. We split the data roughly at the halfway mark, using 10% of our data as the evaluation set.

    	
            

hour_ls = intersected_df.hour.unique()

hour_ls

array(['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
'11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21',
'22', '23'], dtype=object)

    	
            

split_date = '2017-10-01'

eval_date = '2017-12-01'   

pri_train = {} 

pri_eval = {}

pri_test = {}

 

for h in hour_ls:

    pri_train[h] = pd.DataFrame((intersected_df.loc[(intersected_df.index <= split_date) 

                                                    & (intersected_df['hour']== h)].copy()),

                                                    columns = intersected_df.columns)

    pri_eval[h] = pd.DataFrame((intersected_df.loc[(intersected_df.index > split_date) 

                                                   & (intersected_df.index <= eval_date) 

                                                   & (intersected_df['hour']== h)].copy()),

                                                   columns = intersected_df.columns)

    pri_test[h] = pd.DataFrame((intersected_df.loc[(intersected_df.index > eval_date) 

                                                   & (intersected_df['hour']== h)].copy()),

                                                   columns = intersected_df.columns)

    	
            

for h in hour_ls:

    print(h,pri_train[h].shape, pri_eval[h].shape, pri_test[h].shape)

00 (606, 117) (60, 117) (609, 117)
01 (611, 117) (61, 117) (634, 117)
02 (610, 117) (61, 117) (637, 117)
03 (602, 117) (61, 117) (624, 117)
04 (611, 117) (61, 117) (626, 117)
05 (610, 117) (61, 117) (628, 117)
06 (594, 117) (58, 117) (572, 117)
07 (611, 117) (60, 117) (632, 117)
08 (610, 117) (61, 117) (633, 117)
09 (598, 117) (61, 117) (589, 117)
10 (612, 117) (61, 117) (635, 117)
11 (613, 117) (61, 117) (631, 117)
12 (591, 117) (60, 117) (612, 117)
13 (613, 117) (61, 117) (634, 117)
14 (613, 117) (61, 117) (637, 117)
15 (598, 117) (60, 117) (614, 117)
16 (609, 117) (61, 117) (629, 117)
17 (608, 117) (61, 117) (630, 117)
18 (588, 117) (59, 117) (610, 117)
19 (607, 117) (61, 117) (631, 117)
20 (611, 117) (61, 117) (637, 117)
21 (599, 117) (61, 117) (619, 117)
22 (612, 117) (61, 117) (637, 117)
23 (612, 117) (61, 117) (633, 117)

Create feature and label sets

We now need to separate our training (pri_train), evaluation (pri_eval) and test (pri_test) dataframes into their feature and label set components. Here we create a small function to do this for us. The training, evaluation and test datasets contain both the raw and the normalised (_zscore) data. We only wish to pass the normalised feature data to the ML routine, so we simply select the relevant zscore features. The function also takes a target label parameter, which we use to select the column we want to model.

    	
            

def create_features(df, label=None):

    collist = df.columns.tolist()

    relcols = collist[60:len(collist)-1] # keep only the normalised (_zscore) feature columns and drop the trailing 'hour' column

    X = df[relcols]

    if label:

        y = df[label].to_frame()

        return X, y

    return X
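One thing to be aware of: the positional slice collist[60:len(collist)-1] depends on the exact column order of intersected_df. A name-based selection is less fragile; here is a sketch of an alternative, assuming the normalised feature columns all end in '_zscore' and that the price-derived columns (Val_zscore, val_mav_zscore) belong to the label side:

def create_features_by_name(df, label=None):
    # pick normalised feature columns by suffix rather than by position,
    # excluding the price-based z-scores which are labels, not features
    feature_cols = [c for c in df.columns
                    if c.endswith('_zscore') and not c.startswith(('Val', 'val_mav'))]
    X = df[feature_cols]
    if label:
        return X, df[label].to_frame()
    return X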

Define our XGBoost model - hand it our train, eval and test sets

So we are now ready to hand each of our 3 datasets for each hour from 0 to 23 to the XGBoost routine to generate a model for each hour. We define our XGB model using the Regressor class and a squared-error objective function. We then pass the training feature and label sets to the fit function, along with our evaluation set. Once the model is generated, we store the feature importance scores, generate a prediction for each hour and calculate the Mean Squared Error (MSE) of each prediction versus the observed label.

    	
            

ax_ls ={}

res ={}

reg = xgb.XGBRegressor(objective ='reg:squarederror',n_estimators=1000)

 

for h in hour_ls:

    #create our features, train, eval & test

    X_train, y_train = create_features(pri_train[h], label='Val_zscore')

    X_eval, y_eval = create_features(pri_eval[h], label='Val_zscore')

    X_test, y_test = create_features(pri_test[h], label='Val_zscore')

    #fit our xgboost model to the training set, using the evaluation set for early stopping

    reg.fit(X_train, y_train,

        eval_set=[(X_eval, y_eval)], 

        early_stopping_rounds=50,

       verbose=False)

    #store the feature importance (F-score) dictionary for this hour's model

    ax_ls[h]= reg.get_booster().get_fscore()

    #create prediction from our xgboost model

    pri_test[h]['ML_Prediction'] = reg.predict(X_test)

    #calculate prediction accuracy using MSE

    res[h] = mean_squared_error(pri_test[h]['Val_zscore'].values, pri_test[h]['ML_Prediction'].values)

Generate Visual Output

We have just created 24 hourly xgboost models; we now want to plot the feature importance (the higher the F-score, the more important the feature) for each hourly series, as well as the accuracy of our ML-generated predictions - using, in our case, the mean squared error (MSE) calculation we created above. We can easily do this in 2 steps.

    	
            

fig = plt.figure(figsize=(20,60))

for i,col in enumerate(hour_ls):

    bx=plt.subplot(15,2,i+1)

    data1 = OrderedDict(sorted(ax_ls[col].items(), key=lambda t: t[1]))

    plt.barh(range(len(data1)), list(data1.values()), align='center')

    plt.yticks(range(len(data1)),list(data1.keys()))

    bx.set_title("Hour " + col)

 

plt.show()

    	
            

fig = plt.figure(figsize=(20,60))

for i,col in enumerate(hour_ls):

    ax=plt.subplot(15,2,i+1) 

    pri_test[col][['Val_zscore','ML_Prediction']].plot(ax=ax,legend=False,title = ("Hour " + col) )

    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)

    ax.set_xlabel('xlabel', fontsize=6)

    ax.set_ylabel('ylabel', fontsize=3)

    ax.text(0.75, .15, "MSE: " + str(round(res[col],2)), transform=ax.transAxes, fontsize=13,

        verticalalignment='bottom',bbox=props)

    plt.xticks(rotation=90)

plt.show()

Initial Observations

From the line charts above, we can see that the generated models are able to predict some of the downward spikes in spot energy prices. The MSE varies from 0.73 (hour 01) to 1.56 (hour 06). There does seem to be quite some variation in accuracy, even with this very simple and limited model. However, none of these are particularly accurate: an MSE of 0 would be perfect, while an MSE of 1 corresponds to an average error of about one standard deviation - roughly what a naive forecast of the mean would achieve.
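Because the labels are z-scores, always predicting the overall mean (zero) would give an MSE of roughly 1, which makes a handy reference point for the numbers above. A quick sanity check of that naive baseline on the same test sets, reusing the pri_test dictionary and hour_ls built earlier:

# MSE of a naive forecast that always predicts the mean of the z-scored prices (zero)
baseline = {h: mean_squared_error(pri_test[h]['Val_zscore'].values,
                                  np.zeros(len(pri_test[h]))) for h in hour_ls}
print(baseline)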

We can see, for example, that on some days prices went much lower than our prediction - this could be due to a day which was both windy and sunny (remember these energy sources are not stored but dumped to the grid, leading to negative prices - and negative prices discourage use of thermal sources). We did have one feature included for solar energy, but it was for only one location in the centre of the country, as opposed to wind, for which we had much greater coverage. Conversely, there are other days where the prediction was much lower than the observed price, so perhaps it wasn't that sunny at all locations, or thermal capacity had been dialled back, etc. So we could expand our model to include new, more complete features such as more solar locations, thermal features and the like.

Our model does not really tell us anything about upward spikes, as we have only included one demand-side type of feature - air temperature - and only for one station in the centre of Germany. Again, we could expand our feature matrix with features that might help us with this demand-side modelling.

Another next step might be to use only a subset of the highest F-score features and re-run the models to see if that can improve things, or indeed look at a suite of different classes of algorithm - we have only used xgboost, but there are many others that we could implement within the main iterative routine given above.
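As a concrete example of that feature-selection idea, one could keep only the highest F-score features from a given hourly model and refit - a minimal sketch for a single hour, reusing the ax_ls, pri_train, pri_eval and create_features objects defined above (keeping the top 15 features is an arbitrary choice):

# refit hour 12 using only its 15 most important features from the first pass
h = '12'
top_feats = sorted(ax_ls[h], key=ax_ls[h].get, reverse=True)[:15]
X_train, y_train = create_features(pri_train[h], label='Val_zscore')
X_eval, y_eval = create_features(pri_eval[h], label='Val_zscore')
reg_small = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000)
reg_small.fit(X_train[top_feats], y_train,
              eval_set=[(X_eval[top_feats], y_eval)],
              early_stopping_rounds=50, verbose=False)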

Further, we have been looking only at observed values. Remember, one of the features we are using is actual wind output - we wouldn't know this ahead of time, so clearly we need forecast data. We know that energy markets are based around forecast data, so another next step would be to incorporate all of that data, but that is beyond the scope of this introductory article.

Summary

A lot of ground was covered in this article, so let's summarize. We retrieved a set of physical asset (windfarms in our case) RICs from our excellent new RDP Search API. We used these to request some reference-type data for these windfarms using our Eikon Data API. We then created a new geometry 'Point' datatype where we stored locational co-ordinates for our windfarms. We followed this by loading some open source wind data and wind output data from our Point Connect repository and joined these together, creating 2 additional features along the way. We also loaded some open source solar energy and air temperature data.

We then retrieved the locations of our weather monitoring stations and, as before, created a new geometry datatype for those. We were then able to calculate the closest weather monitoring station for each of our windfarms using the scipy.spatial cdist routine - though we didn't use it in this introductory model other than for visualisation, I included it as it might prove useful were you to develop more complex models. We then downloaded a shapefile for Germany, converted our pandas dataframes to GeoPandas dataframes and plotted overlays for both windfarms and wind monitoring stations.

Next we read in some price data (EPEX Spot) to use for our label data, shifted it forward one hour to align with our local time (CET) features and then started to normalise our datasets using z-scores. After joining our feature and label dataframes, we added a feature called hour which tells us which hour each timestamp refers to. We then split our data into training, evaluation and test sets and defined our feature and label sets for each of the 24 hours in the day. Next we defined 24 XGBoost ML models (one for each hour of the day), handing each the appropriate training and evaluation sets, and plotted our feature importance F-scores. Finally, we used our models to generate label predictions based on our hold-out (test) feature matrix and plotted these against the observed label values.

Whilst a simple and far from complete example, I hope this article can provide you with an idea of how to approach this area and a practical codebase to experiment and explore further.

Further Resources

Further Resources for Eikon Data API

For Content Navigation in Eikon - please use the Data Item Browser Application: Type 'DIB' into Eikon Search Bar.

Further Resources for Refinitiv Data Platform (RDP) APIs

For Content Navigation in RDP - please go here

For more information on the Point Connect service please go here

Further Resources for Open Source German Weather Data kindly made public by DWD



Hourly Wind Data

Hourly Air Temperature - I used Station 01266, and field TT_TU

Hourly Solar Readings - I used Station 00183, and field FG_LBERG

Source of geospatial base data:

Surveying authorities of the Länder and Federal Agency for Cartography and Geodesy (https://www.bkg.bund.de).

Source of satellite data: EUMETSAT (https://www.eumetsat.int) NOAA (https://www.noaa.gov)

Further Resources for Shapefiles

German SHP Map