Article

Using Economic Indicators with Eikon Data API - A Machine Learning Example

Jason Ramchandani
Lead Developer Advocate

Overview

This is a quick guide for those interested in using the large and varied set of economic timeseries offered within Eikon via the Data API. In this article I will show what economic data is available in Eikon, how to navigate and download it, and finally put it to use in a simple machine learning example using an XGBoost model.

Sections

Eikon Search Workflow

Real-time Economic Indicators

Non Real-time Economic Indicators

Putting it all together - a basic machine learning sample

Adding Target Variables or Labels

Define and fit XGBoost Model

Conclusion

Pre-requisites:

Refinitiv Eikon / Workspace with access to Eikon Data APIs (Free Trial Available)

Python 2.x/3.x

Required Python Packages: eikon, pandas, numpy, xgboost, scikit-learn

First of all, Eikon carries around 1,800 real-time economic indicators from around the globe. These are economic indicators whose releases we track in real time over our real-time network, alongside additional value-added information such as Reuters Polls and the significance of each indicator. Additionally, Eikon carries over 400,000 non-real-time economic indicators globally. In the near future, the number of non-real-time indicators will more than double to 1 million timeseries. This is a great deal of economic timeseries for you to potentially work with to generate insight and models. This guide is intended to help developers and quant analysts navigate this datascape.

A popular way to navigate economic data used to be the speed guide in Eikon. However, this has been removed in favour of an Eikon Search based workflow. Further, support for the real-time economic chain RICs such as <0#ECONALLUS>, <0#ECONALLJP> etc has been removed - so whilst these chain RICs still work, they are no longer being maintained.

Eikon Search Workflow

I mentioned above that the Eikon Search workflow is now the main way to discover what economic indicator data is available. So let's see how this works. Please try typing 'UK Economic Indicators' into the Eikon search bar and hitting 'more search results' at the bottom. Please see the result below.

 

From this, if we look at the Economic Indicators section, we can see that over 4,000 items are returned. So let's click on that to discover more.

We are now at the main filter page, where we can filter on the 10 columns available. The most useful will probably be Source, Periodicity and, perhaps most importantly, Indicator Type. Please see the types listed below:

Multiple Economic Indicator Data Sources

Types of Economic Indicator
Business Surveys Industrial Production & Utilisation Population
Imports & Exports Energy & Environment Consumer Prices / Inflation
GDP/GVA by Industry Workforce & Unemployment Domestic Finance
Automobiles & Transport Macro Forecasts Balance of Payments
GDP by Expenditure Sales, Orders & Inventories Stocks, Bonds, Futures
Banking Agriculture Consumer Surveys
Money Supply Exchange Rates and Operations Other Commodities
Housing & Real Estate Prices Housing & Construction Consumption
International Reserves Producer Prices Employment & Hours
External Debt Investment / Capital Formation Government Accounts
Retail Sales Financial & Flow of Funds Accounts Other Industries
Interest Rates Foreign Transactions Incomes & Savings
Import & Export Price Indices Wages & Earnings Government Debt & Borrowing
GDP Deflators Other National Accounts International Investment Position
Cyclical & Activity Indices Commodity Fundamentals Productivity & Labour Costs
Consumer Finance Central Banking Corporate Accounts & Actions
Sector Accounts Other Prices Personal Expenditures
Other Surveys Commodity Prices & Indices Industry Balance Sheets
Regional Accounts Credit Risk & Ratings

Once we narrow down what we are targeting, we can go to the page view and download a list of appropriate RICs to Excel. From there we can copy and paste these as an instrument list for use in our API call.


import eikon as ek

import pandas as pd

import numpy as np

import xgboost as xgb

from xgboost import plot_importance, plot_tree

from sklearn.metrics import mean_squared_error, mean_absolute_error

ek.set_app_key('YOUR APP KEY HERE')

Real-time Economic Indicators

As mentioned earlier, Eikon carries about 1,800 real-time economic indicators. Let's see what we are talking about by looking at the Economic Monitor (type EM into Eikon Search) for, say, the US (filter on US). Notice here we can see the recent and upcoming economic indicators for the next week or so, with a date and time, the importance of the release, and polls showing expectations, along with details about those polls such as min, max and median.

For real-time indicators, let's make a simple call to the Data API get_data service. Here we use the no-longer-maintained economic chain RIC '0#ECONALLUS'. These still work for backward compatibility but are no longer maintained from a content perspective. This gives us a programmatic version of the Economic Monitor app above. We are using the real-time quote model for economic indicators here; you can see this by typing a real-time RIC - say 'USADP=ECI' (US ADP payrolls) - and hitting quote. If you hover the mouse over a particular number you will get the field name.


data, err = ek.get_data(['0#ECONALLUS'],['DSPLY_NMLL','RELEVANCE','ACT_REL_ST','ECI_ACT_DT','ACT_VAL_NS','UNIT_PREFX','FCAST_PRD','RPT_UNITS','ECON_ACT','ECON_PRIOR','RTR_POLL', 'FCAST_SEST','FCAST_ACRY','NDOR_1'])

data = data[data['ECI_ACT_DT']>='2020-06-09'] #from today forward (note DateTimes are in GMT!)

data.sort_values(['ECI_ACT_DT','ACT_VAL_NS'],ascending=True,inplace=True) # Sort on release date & Time

data



Instrument DSPLY_NMLL RELEVANCE ACT_REL_ST ECI_ACT_DT ACT_VAL_NS UNIT_PREFX FCAST_PRD RPT_UNITS ECON_ACT ECON_PRIOR RTR_POLL FCAST_SEST FCAST_ACRY NDOR_1
58 USFFP1=ECI FFR Projection-1st Yr* 0.0 CONF 2020-06-10 18:00:00   Q2 2020 Percent 0.100 1.600 NaN NaN NaN  
59 USFFP2=ECI FFR Projection-2nd Yr* 0.0 CONF 2020-06-10 18:00:00   Q2 2020 Percent 0.100 1.900 NaN NaN NaN  
61 USFFPL=ECI FFR Projection-Longer* 0.0 CONF 2020-06-10 18:00:00   Q2 2020 Percent 2.500 2.500 NaN NaN NaN  
62 USFFPQ=ECI FFR Projection-Current* 0.0 CONF 2020-06-10 18:00:00   Q2 2020 Percent 0.100 1.600 NaN NaN NaN  
63 USFOMC=ECI Fed Funds Target Rate 3.0 CONF 2020-06-10 18:00:00 None 11 Jun Percent 0.125 0.125 NaN NaN NaN  
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  
186 USUM5P=ECI U Mich 5-Yr Inf Prelim* 1.0 CONF 2020-07-17 14:00:00   Jul 2020 Percent NaN 2.500 NaN NaN NaN  
188 USUMCP=ECI U Mich Conditions Prelim 1.0 CONF 2020-07-17 14:00:00   Jul 2020 Index NaN 87.100 NaN NaN NaN  
189 USUMEP=ECI U Mich Expectations Prelim 1.0 CONF 2020-07-17 14:00:00   Jul 2020 Index NaN 72.300 NaN NaN NaN  
192 USUMSP=ECI U Mich Sentiment Prelim 3.0 CONF 2020-07-17 14:00:00   Jul 2020 Index NaN 78.100 NaN NaN NaN  
14 USCFNA=ECI National Activity Index 1.0 CONF 2020-07-21 12:30:00 None Jun 2020 Index NaN 2.610 NaN NaN NaN  
                               

Real-time indicator retrieval from the timeseries service differs from other RIC instruments in that we DO NOT request field types (e.g. Open, High, Low, Close) and we set the interval to monthly. Let's see how it works using the ADP payroll number for the last 10 years of monthly releases. Note that I have also added 3 other RICs that are derived from the same underlying RIC. These are the poll values associated with the monthly release (hence the lowercase p preceding the RIC), giving the median, low and high values (the =M, =L and =H suffixes respectively).


df=ek.get_timeseries(['USADP=ECI','pUSADP=M','pUSADP=L','pUSADP=H'],start_date='2010-03-01',end_date='2020-04-26',interval='monthly',)

df

VALUE USADP=ECI pUSADP=M pUSADP=L pUSADP=H
Date        
2010-01-31 -30000.0 -30000.0 -110000.0 50000.0
2010-02-28 -38000.0 -20000.0 -100000.0 25000.0
2010-03-31 -11000.0 40000.0 -20000.0 180000.0
2010-04-30 132000.0 30000.0 0.0 80000.0
2010-05-31 211000.0 60000.0 32000.0 180000.0
... ... ... ... ...
2019-11-30 161000.0 140000.0 120000.0 188000.0
2019-12-31 167000.0 160000.0 77000.0 225000.0
2020-01-31 205000.0 156000.0 140000.0 215000.0
2020-02-29 147000.0 170000.0 140000.0 250000.0
2020-03-31 -302000.0 -150000.0 -1500000.0 19000.0

123 rows × 4 columns
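Since the actual release and the poll values arrive as aligned columns, one natural derived quantity is the "surprise" - the actual print minus the polled median. The sketch below works on a small synthetic frame with the same column names as the ADP result above (the values are copied from the first few rows purely for illustration; in practice you would run this on the frame returned by get_timeseries):

```python
import pandas as pd

# Synthetic stand-in for the ADP frame returned by ek.get_timeseries above
adp = pd.DataFrame(
    {
        'USADP=ECI': [-30000.0, -38000.0, -11000.0],   # actual release
        'pUSADP=M': [-30000.0, -20000.0, 40000.0],     # poll median
        'pUSADP=L': [-110000.0, -100000.0, -20000.0],  # poll low
        'pUSADP=H': [50000.0, 25000.0, 180000.0],      # poll high
    },
    index=pd.to_datetime(['2010-01-31', '2010-02-28', '2010-03-31']),
)

# Surprise: actual release minus the polled median
adp['surprise'] = adp['USADP=ECI'] - adp['pUSADP=M']

# Did the actual land outside the entire polled range?
adp['outside_range'] = (adp['USADP=ECI'] > adp['pUSADP=H']) | \
                       (adp['USADP=ECI'] < adp['pUSADP=L'])

print(adp[['surprise', 'outside_range']])
```

A persistently large surprise column is exactly the kind of derived feature you might later feed into a model alongside the raw series.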

Non Real-time Economic Indicators

As mentioned earlier, Eikon carries just shy of 400,000 non-real-time economic indicators covering the whole host of indicator types listed above. Once you have found the list that you wish to download, just copy and paste those RICs into a historical timeseries call - this time for monthly German new car production and registration:


df=ek.get_timeseries(['aDEPROCARP','aDEACECARP','aDECOMPLIDX/A'],start_date='2010-03-01',end_date='2020-04-26',interval='monthly')

df

VALUE aDEPROCARP aDEACECARP aDECOMPLIDX/A
Date      
2010-01-31 376937.0 181189.0 100.40760
2010-02-28 450355.0 194846.0 100.70590
2010-03-31 560055.0 294374.0 100.99160
2010-04-30 464591.0 259414.0 101.25310
2010-05-31 470130.0 249708.0 101.47920
... ... ... ...
2019-11-30 423400.0 299127.0 99.12325
2019-12-31 276400.0 283380.0 99.23273
2020-01-31 341600.0 246300.0 99.34171
2020-02-29 388900.0 239943.0 99.42706
2020-03-31 287900.0 215119.0 96.90305

123 rows × 3 columns

Putting it all together - a basic machine learning sample

So, given our new-found economic indicator discovery workflow, let's try to put this data to work and see if we can come up with something of interest. In this case, let's try to download all available auto industry indicators for all countries in the world and see if they can help explain movement in some global automakers. To do this we will use an XGBoost model, which is a gradient-boosted decision-tree class of algorithm.

From the Economic Search app I have identified a list of global economic indicators from the automobile industry and have downloaded these RICs to an Excel file - which I can now read into pandas and use in my get_timeseries call.


riclist = pd.read_excel('/Users/jasonram/Downloads/autoric1.xlsx') #or the location of your saved Excel file

allRics = riclist['RIClist'].astype(str).values.tolist()

allRics

['IDMBKY=ECI',
'IDCARY=ECI',
'USCARS=ECI',
'ESCARM=ECI',
'ESCARY=ECI',
'ITCARM=ECI',

...

'aJPPNISS',
'aCATRKSBSP',
'aUSAUIMMXP',
'aGBACECARP',
'aLUNCAR',
'aISNVEHREG/A',
'aROACECARP']

 


start='2010-03-01'

end='2020-04-26'

ts = pd.DataFrame()

df = pd.DataFrame()

 

for r in allRics:

    try:

        ts = ek.get_timeseries(r,start_date=start,end_date=end,interval='monthly')

        ts.rename(columns = {'VALUE': r}, inplace = True)

        if len(ts):

            df = pd.concat([df, ts], axis=1)

    except:

        pass  # skip RICs that error out or return an empty result

    

df



IDMBKY=ECI IDCARY=ECI USCARS=ECI ESCARM=ECI ESCARY=ECI ITCARM=ECI DECARM=ECI DECARY=ECI GBCARY=ECI ZAVEHY=ECI ... aLUNHTREG aLTNHTREG aSGMANPOP aJPPNISS aCATRKSBSP aUSAUIMMXP aGBACECARP aLUNCAR aISNVEHREG/A aROACECARP
Date                                          
2010-03-31 39.6 92.1 5630000.0 36.7 63.1 28.5 51.1 -26.6 26.6 24.7 ... 77.0 43.0 933225.0 99903.0 77800.0 93816.0 397383.0 3231.0 224.010250  
2010-04-30 70.0 88.5 5480000.0 -24.9 39.3 -37.9 -11.9 -31.7 11.5 38.8 ... 54.0 86.0 935024.0 86180.0 81675.0 75715.0 148793.0 3307.0 126.812220  
2010-05-31 39.9 69.0 5720000.0 9.9 44.6 2.4 -3.7 -35.1 13.5 40.5 ... 42.0 85.0 936311.0 79430.0 82402.0 73661.0 153095.0 2947.0 238.749611  
2010-06-30 34.7 77.9 5640000.0 18.1 25.6 4.3 15.8 -32.3 10.8 19.7 ... 61.0 74.0 937860.0 100471.0 84009.0 79809.0 195226.0 2792.0 525.231951  
2010-07-31 28.4 72.0 5740000.0 -32.4 -24.1 -10.4 -17.9 -30.2 -13.2 23.4 ... 76.0 139.0 938922.0 96956.0 83218.0 76143.0 136446.0 2312.0 472.853610  
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  
2019-11-30 -8.3 -9.4 4380000.0 -0.9 2.3 -4.0 5.1 9.7 -1.3 -5.8 ... NaN NaN 972224.0 65092.0 114988.0 75933.0 156621.0 1511.0 1604.147896  
2019-12-31 -6.8 -0.2 4490000.0 13.6 6.6 -7.0 -5.3 19.5 3.4 4.2 ... NaN NaN 973101.0 61201.0 91368.0 96974.0 148997.0 1248.0 1386.820577  
2020-01-31 -18.7 -2.1 4370000.0 -18.3 -7.6 11.0 -13.1 -7.3 -7.3 -8.1 ... NaN NaN 973534.0 65812.0 90008.0 56975.0 149279.0 1640.0 NaN  
2020-02-29 2.6 -2.6 4250000.0 9.5 -6.0 4.7 -2.6 -10.8 -2.9 -0.7 ... NaN NaN 974205.0 51673.0 100464.0 94819.0 79594.0 1766.0 NaN  
2020-03-31 -3.2 -15.0 2880000.0 -60.2 -69.3 -82.6 -10.3 -37.7 -44.4 -29.7 ... NaN NaN 973761.0 55233.0 75791.0 114056.0 254684.0 1122.0 NaN  
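The per-RIC loop above is simple, but it issues one request per instrument. As the ADP example earlier showed, ek.get_timeseries also accepts a list of RICs, so - assuming your request stays within the API's per-call limits - you could batch the list into chunks instead. A minimal chunking helper is sketched below; the commented-out call shows hypothetical usage with the allRics, start and end variables defined above (not executed here):

```python
def chunks(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Hypothetical batched retrieval (assumes ek, pd, allRics, start, end from above):
# df = pd.concat(
#     [ek.get_timeseries(batch, start_date=start, end_date=end, interval='monthly')
#      for batch in chunks(allRics, 50)],
#     axis=1)

print(list(chunks(['A', 'B', 'C', 'D', 'E'], 2)))
```

The chunk size of 50 is an assumption for illustration; tune it to whatever your entitlement and the API limits allow.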

df.describe()

 



IDMBKY=ECI IDCARY=ECI USCARS=ECI ESCARM=ECI ESCARY=ECI ITCARM=ECI DECARM=ECI DECARY=ECI GBCARY=ECI ZAVEHY=ECI ... aLUNHTREG aLTNHTREG aSGMANPOP aJPPNISS aCATRKSBSP aUSAUIMMXP aGBACECARP aLUNCAR aISNVEHREG/A aROACECARP
count 121.000000 121.000000 1.210000e+02 121.000000 121.000000 121.000000 121.000000 121.000000 121.000000 121.000000 ... 70.000000 70.000000 121.000000 121.000000 121.000000 121.000000 121.000000 121.000000 118.000000  
mean 1.989256 8.971901 6.429669e+06 1.719835 3.452893 4.186777 1.361157 0.227273 1.167769 2.982645 ... 77.742857 219.414286 960183.859504 81079.975207 99080.884298 100680.198347 194813.760331 1760.975207 1297.845786  
std 18.529849 23.429430 1.102691e+06 20.667017 20.016614 35.981592 16.818400 11.737554 10.016605 11.891570 ... 20.483776 92.862606 9862.880360 16003.024972 23965.353378 17986.371146 112509.688567 451.704466 723.767425  
min -36.800000 -39.100000 2.880000e+06 -60.200000 -69.300000 -82.600000 -36.700000 -37.700000 -44.400000 -29.700000 ... 36.000000 43.000000 933225.000000 44193.000000 54137.000000 56975.000000 55305.000000 933.000000 126.812220  
25% -10.300000 -5.300000 5.660000e+06 -10.600000 -8.100000 -12.700000 -9.600000 -4.700000 -4.100000 -4.500000 ... 62.500000 148.750000 955156.000000 70698.000000 81675.000000 89349.000000 137746.000000 1418.000000 734.942203  
50% 1.800000 4.700000 6.550000e+06 5.400000 5.200000 -2.300000 -2.300000 2.600000 2.400000 0.300000 ... 76.500000 215.500000 959314.000000 79267.000000 93633.000000 100294.000000 163357.000000 1756.000000 1128.116730  
75% 10.500000 17.300000 7.480000e+06 15.000000 15.200000 13.500000 7.900000 7.400000 8.000000 8.200000 ... 93.000000 265.500000 969728.000000 93607.000000 116544.000000 115554.000000 195226.000000 2038.000000 1969.269372  
max 76.400000 92.100000 8.170000e+06 36.700000 63.100000 124.100000 51.200000 24.700000 26.600000 40.500000 ... 127.000000 574.000000 974495.000000 120217.000000 151810.000000 135565.000000 562337.000000 3307.000000 2695.983608  

Importantly, we now want to create some lagged versions of our data to see if there is any lagged effect.


dfs = df.copy()

lags = 6

for r in dfs.columns.values:

    for lag in range(1, lags + 1):

        dfs[r +'_lag_{}'.format(lag)] = dfs[r].shift(lag)

 

dfs



IDMBKY=ECI IDCARY=ECI USCARS=ECI ESCARM=ECI ESCARY=ECI ITCARM=ECI DECARM=ECI DECARY=ECI GBCARY=ECI ZAVEHY=ECI ... aISNVEHREG/A_lag_3 aISNVEHREG/A_lag_4 aISNVEHREG/A_lag_5 aISNVEHREG/A_lag_6 aROACECARP_lag_1 aROACECARP_lag_2 aROACECARP_lag_3 aROACECARP_lag_4 aROACECARP_lag_5 aROACECARP_lag_6
Date                                          
2010-03-31 39.6 92.1 5630000.0 36.7 63.1 28.5 51.1 -26.6 26.6 24.7 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN  
2010-04-30 70.0 88.5 5480000.0 -24.9 39.3 -37.9 -11.9 -31.7 11.5 38.8 ... NaN NaN NaN NaN 9923.0 NaN NaN NaN NaN  
2010-05-31 39.9 69.0 5720000.0 9.9 44.6 2.4 -3.7 -35.1 13.5 40.5 ... NaN NaN NaN NaN 10376.0 9923.0 NaN NaN NaN  
2010-06-30 34.7 77.9 5640000.0 18.1 25.6 4.3 15.8 -32.3 10.8 19.7 ... 224.010250 NaN NaN NaN 10758.0 10376.0 9923.0 NaN NaN  
2010-07-31 28.4 72.0 5740000.0 -32.4 -24.1 -10.4 -17.9 -30.2 -13.2 23.4 ... 126.812220 224.010250 NaN NaN 13892.0 10758.0 10376.0 9923.0 NaN  
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  
2019-11-30 -8.3 -9.4 4380000.0 -0.9 2.3 -4.0 5.1 9.7 -1.3 -5.8 ... 1426.873182 1214.162249 1058.299246 1219.266184 10649.0 6046.0 23177.0 23206.0 14082.0  
2019-12-31 -6.8 -0.2 4490000.0 13.6 6.6 -7.0 -5.3 19.5 3.4 4.2 ... 1444.425137 1426.873182 1214.162249 1058.299246 13091.0 10649.0 6046.0 23177.0 23206.0  
2020-01-31 -18.7 -2.1 4370000.0 -18.3 -7.6 11.0 -13.1 -7.3 -7.3 -8.1 ... 1497.699721 1444.425137 1426.873182 1214.162249 13773.0 13091.0 10649.0 6046.0 23177.0  
2020-02-29 2.6 -2.6 4250000.0 9.5 -6.0 4.7 -2.6 -10.8 -2.9 -0.7 ... 1604.147896 1497.699721 1444.425137 1426.873182 12489.0 13773.0 13091.0 10649.0 6046.0  
2020-03-31 -3.2 -15.0 2880000.0 -60.2 -69.3 -82.6 -10.3 -37.7 -44.4 -29.7 ... 1386.820577 1604.147896 1497.699721 1444.425137 8836.0 12489.0 13773.0 13091.0 10649.0  

Adding Target Variables or Labels

So we now have our feature matrix with all our raw and lagged data. Now let's add some target variables, or labels. We will add two types of label that I think encompass many of the use cases: firstly a timeseries of monthly closing prices, and secondly a company fundamental monthly timeseries - in our case the IBES consensus revenue estimate for FY1. You will see it is very straightforward to download and then concatenate these types of label data to your feature matrix. Here we use BMW as an example, as it is a global automaker.


df1 = ek.get_timeseries(['BMWG.DE'],['CLOSE'],start_date='2010-03-01',end_date='2020-04-26',interval='monthly')

df1

BMWG.DE CLOSE
Date  
2010-03-31 34.180
2010-04-30 37.130
2010-05-31 37.965
2010-06-30 40.065
2010-07-31 41.310
... ...
2019-11-30 73.300
2019-12-31 73.140
2020-01-31 64.340
2020-02-29 58.930
2020-03-31 47.115

df2, err = ek.get_data('BMWG.DE',['TR.RevenueMean(SDate=-3,EDate=-123,Period=FY1,Frq=CM).calcdate','TR.RevenueMean(SDate=-3,EDate=-123,Period=FY1,Frq=M)'])

df2['Calc Date'] = pd.to_datetime(df2['Calc Date'].str[:10])

df2.set_index('Calc Date', inplace=True)

df2.sort_index(ascending=True, inplace=True)

df2



Instrument Revenue - Mean
Calc Date    
2010-03-31 BMWG.DE 52395861610
2010-04-30 BMWG.DE 52566096720
2010-05-31 BMWG.DE 53593869130
2010-06-30 BMWG.DE 53757373350
2010-07-31 BMWG.DE 55526963730
... ... ...
2019-11-30 BMWG.DE 100616380300
2019-12-31 BMWG.DE 100905726390
2020-01-31 BMWG.DE 101207227710
2020-02-29 BMWG.DE 101460894380
2020-03-31 BMWG.DE 92875810320

121 rows × 2 columns


dfs = pd.concat([dfs, df1], axis=1)

dfs = pd.concat([dfs, df2['Revenue - Mean']], axis=1)

dfs

  IDMBKY=ECI IDCARY=ECI USCARS=ECI ESCARM=ECI ESCARY=ECI ITCARM=ECI DECARM=ECI DECARY=ECI GBCARY=ECI ZAVEHY=ECI ... aISNVEHREG/A_lag_5 aISNVEHREG/A_lag_6 aROACECARP_lag_1 aROACECARP_lag_2 aROACECARP_lag_3 aROACECARP_lag_4 aROACECARP_lag_5 aROACECARP_lag_6 CLOSE Revenue - Mean
2010-03-31 39.6 92.1 5630000.0 36.7 63.1 28.5 51.1 -26.6 26.6 24.7 ... NaN NaN NaN NaN NaN NaN NaN NaN 34.180  
2010-04-30 70.0 88.5 5480000.0 -24.9 39.3 -37.9 -11.9 -31.7 11.5 38.8 ... NaN NaN 9923.0 NaN NaN NaN NaN NaN 37.130  
2010-05-31 39.9 69.0 5720000.0 9.9 44.6 2.4 -3.7 -35.1 13.5 40.5 ... NaN NaN 10376.0 9923.0 NaN NaN NaN NaN 37.965  
2010-06-30 34.7 77.9 5640000.0 18.1 25.6 4.3 15.8 -32.3 10.8 19.7 ... NaN NaN 10758.0 10376.0 9923.0 NaN NaN NaN 40.065  
2010-07-31 28.4 72.0 5740000.0 -32.4 -24.1 -10.4 -17.9 -30.2 -13.2 23.4 ... NaN NaN 13892.0 10758.0 10376.0 9923.0 NaN NaN 41.310  
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  
2019-11-30 -8.3 -9.4 4380000.0 -0.9 2.3 -4.0 5.1 9.7 -1.3 -5.8 ... 1058.299246 1219.266184 10649.0 6046.0 23177.0 23206.0 14082.0 13011.0 73.300  
2019-12-31 -6.8 -0.2 4490000.0 13.6 6.6 -7.0 -5.3 19.5 3.4 4.2 ... 1214.162249 1058.299246 13091.0 10649.0 6046.0 23177.0 23206.0 14082.0 73.140  
2020-01-31 -18.7 -2.1 4370000.0 -18.3 -7.6 11.0 -13.1 -7.3 -7.3 -8.1 ... 1426.873182 1214.162249 13773.0 13091.0 10649.0 6046.0 23177.0 23206.0 64.340  
2020-02-29 2.6 -2.6 4250000.0 9.5 -6.0 4.7 -2.6 -10.8 -2.9 -0.7 ... 1444.425137 1426.873182 12489.0 13773.0 13091.0 10649.0 6046.0 23177.0 58.930  
2020-03-31 -3.2 -15.0 2880000.0 -60.2 -69.3 -82.6 -10.3 -37.7 -44.4 -29.7 ... 1497.699721 1444.425137 8836.0 12489.0 13773.0 13091.0 10649.0 6046.0 47.115  
                                           

Next we need to standardise our data before we feed it to the XGBoost routine - both the feature matrix and the labels. In our case we will use a simple-to-calculate z-score. As you will see, this applies to the whole concatenated frame of features and labels.


b = list(dfs.columns[0:].values)

for col in b:

    col_zscore = col + '_zscore'

    dfs[col_zscore] = (dfs[col] - dfs[col].mean())/dfs[col].std(ddof=0)

dfs

  IDMBKY=ECI IDCARY=ECI USCARS=ECI ESCARM=ECI ESCARY=ECI ITCARM=ECI DECARM=ECI DECARY=ECI GBCARY=ECI ZAVEHY=ECI ... aISNVEHREG/A_lag_5_zscore aISNVEHREG/A_lag_6_zscore aROACECARP_lag_1_zscore aROACECARP_lag_2_zscore aROACECARP_lag_3_zscore aROACECARP_lag_4_zscore aROACECARP_lag_5_zscore aROACECARP_lag_6_zscore CLOSE_zscore Revenue - Mean_zscore
2010-03-31 39.6 92.1 5630000.0 36.7 63.1 28.5 51.1 -26.6 26.6 24.7 ... NaN NaN NaN NaN NaN NaN NaN NaN -2.546434  
2010-04-30 70.0 88.5 5480000.0 -24.9 39.3 -37.9 -11.9 -31.7 11.5 38.8 ... NaN NaN 0.440206 NaN NaN NaN NaN NaN -2.359657  
2010-05-31 39.9 69.0 5720000.0 9.9 44.6 2.4 -3.7 -35.1 13.5 40.5 ... NaN NaN 0.558189 0.439728 NaN NaN NaN NaN -2.306790  
2010-06-30 34.7 77.9 5640000.0 18.1 25.6 4.3 15.8 -32.3 10.8 19.7 ... NaN NaN 0.657681 0.557231 0.449537 NaN NaN NaN -2.173830  
2010-07-31 28.4 72.0 5740000.0 -32.4 -24.1 -10.4 -17.9 -30.2 -13.2 23.4 ... NaN NaN 1.473930 0.656318 0.567156 0.464146 NaN NaN -2.095004  
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  
2019-11-30 -8.3 -9.4 4380000.0 -0.9 2.3 -4.0 5.1 9.7 -1.3 -5.8 ... -0.325140 -0.100668 0.629292 -0.565924 3.890867 3.929603 1.564926 1.287398 -0.069580  
2019-12-31 -6.8 -0.2 4490000.0 13.6 6.6 -7.0 -5.3 19.5 3.4 4.2 ... -0.110532 -0.321422 1.265310 0.628044 -0.557105 3.922037 3.952333 1.566968 -0.079711  
2020-01-31 -18.7 -2.1 4370000.0 -18.3 -7.6 11.0 -13.1 -7.3 -7.3 -8.1 ... 0.182350 -0.107668 1.442937 1.261473 0.638039 -0.547340 3.944745 3.948669 -0.636876  
2020-02-29 2.6 -2.6 4250000.0 9.5 -6.0 4.7 -2.6 -10.8 -2.9 -0.7 ... 0.206517 0.184048 1.108520 1.438376 1.272091 0.653555 -0.537793 3.941099 -0.979406  
2020-03-31 -3.2 -15.0 2880000.0 -60.2 -69.3 -82.6 -10.3 -37.7 -44.4 -29.7 ... 0.279871 0.208119 0.157097 1.105320 1.449169 1.290659 0.666639 -0.530724 -1.727464  

Next we need to make sure we identify the right features to pass to the XGBoost model - we only wish to pass the z-score features.


mycols = dfs.columns

mycols[1815:-2]

Index(['IDMBKY=ECI_zscore', 'IDCARY=ECI_zscore', 'USCARS=ECI_zscore',
'ESCARM=ECI_zscore', 'ESCARY=ECI_zscore', 'ITCARM=ECI_zscore',
'DECARM=ECI_zscore', 'DECARY=ECI_zscore', 'GBCARY=ECI_zscore',
'ZAVEHY=ECI_zscore',
...
'aISNVEHREG/A_lag_3_zscore', 'aISNVEHREG/A_lag_4_zscore',
'aISNVEHREG/A_lag_5_zscore', 'aISNVEHREG/A_lag_6_zscore',
'aROACECARP_lag_1_zscore', 'aROACECARP_lag_2_zscore',
'aROACECARP_lag_3_zscore', 'aROACECARP_lag_4_zscore',
'aROACECARP_lag_5_zscore', 'aROACECARP_lag_6_zscore'],
dtype='object', length=1813)
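As an aside, the positional slice mycols[1815:-2] depends on the exact number of columns downloaded; if your RIC list differs, the offsets will too. A name-based selection is more robust. This sketch uses a small synthetic frame to show the idea (the label column names match the ones created in this article):

```python
import pandas as pd

# Synthetic stand-in for dfs: two raw columns plus their z-scored versions
frame = pd.DataFrame({
    'USCARS=ECI': [1.0, 2.0],
    'CLOSE': [34.18, 37.13],
    'USCARS=ECI_zscore': [-1.0, 1.0],
    'CLOSE_zscore': [-1.0, 1.0],
})

# Keep every z-score column except the two label columns
label_cols = {'CLOSE_zscore', 'Revenue - Mean_zscore'}
feature_cols = [c for c in frame.columns
                if c.endswith('_zscore') and c not in label_cols]

print(feature_cols)
```

Selecting by suffix this way means the feature list stays correct even if the number of downloaded indicators changes between runs.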

 

We are now getting into the machine learning section, so we need to split our data into train and test sets.


split_date = '01-01-2015'

pri_train = dfs.loc[dfs.index <= split_date].copy()

pri_test = dfs.loc[dfs.index > split_date].copy()

We also need a function to create our feature and label sets; then we go ahead and create our X_train (training feature set), y_train (training label set), X_test (test feature set) and y_test (test label set).


def create_features(df, label=None):

    collist = df.columns.tolist()

    relcols = collist[1815:-2]  # z-score feature columns only, excluding the two label columns

    X = df[relcols]

    if label:

        y = df[label].to_frame()

        return X, y

    return X


X_train, y_train = create_features(pri_train, label='CLOSE_zscore')

X_test, y_test = create_features(pri_test, label='CLOSE_zscore')

Finally, I usually just do a date check to see that all dates are properly aligned.


print(X_train.index.min(),y_train.index.min(),X_train.index.max(),y_train.index.max())

 

2010-03-31 00:00:00 2010-03-31 00:00:00 2014-12-31 00:00:00 2014-12-31 00:00:00


print(X_test.index.min(),y_test.index.min(),X_test.index.max(),y_test.index.max())

 

2015-01-31 00:00:00 2015-01-31 00:00:00 2020-03-31 00:00:00 2020-03-31 00:00:00

Define and fit XGBoost Model

Finally we start defining our XGBoost model (in our case we use a simple vanilla setup) - then we pass our training set and test set to fit our model. Once we have done that we can visualise our feature importance as a plot.


reg = xgb.XGBRegressor(objective ='reg:squarederror',n_estimators=1000)

reg.fit(X_train, y_train,

        eval_set=[(X_train, y_train), (X_test, y_test)],

        early_stopping_rounds=50,

       verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
importance_type='gain', learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
n_jobs=1, nthread=None, objective='reg:squarederror',
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=None, silent=None, subsample=1, verbosity=1)


_ = plot_importance(reg, height=0.9)

 

Finally - we can use our model to create an ML_Prediction for our label set given our X_test set and then compare that to actual observations of the label set.


pri_test1 = pri_test.copy()

pri_test1['ML_Prediction'] = reg.predict(X_test)

pri_all = pd.concat([pri_test1, pri_train], sort=False)


_ = pri_all[['CLOSE_zscore','ML_Prediction']].plot(figsize=(15, 5))
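We imported mean_squared_error and mean_absolute_error from scikit-learn at the top but have not used them yet; they give a quick numeric check to go with the chart. In the notebook you would pass y_test and pri_test1['ML_Prediction']; the sketch below uses small illustrative arrays instead, so the numbers printed are not real model results:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Illustrative stand-ins for y_test and the model's predictions
y_true = np.array([-0.07, -0.08, -0.64, -0.98, -1.73])
y_pred = np.array([-0.10, -0.20, -0.50, -0.90, -1.20])

mse = mean_squared_error(y_true, y_pred)   # average squared error
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
rmse = np.sqrt(mse)                        # error back in label (z-score) units

print(f'MSE: {mse:.4f}  MAE: {mae:.4f}  RMSE: {rmse:.4f}')
```

Since our labels are z-scores, an RMSE can be read roughly as "how many standard deviations off" the predictions are on average.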

 

Conclusion

In this article we have discussed the economic content available in Eikon. We talked about real-time and non-realtime economic indicators and how you can discover and use this data in your workflows using our search discovery tools. We also showed an example of real-time indicator monitoring - similar to the Economic Monitor (EM) app in Eikon.

Then we moved on to an example which puts a lot of this into practice using a simple machine learning workflow. We found a group of automobile industry indicators using our search discovery tools and saved those down as a spreadsheet. Then we used those RICs to download the timeseries for these economic indicators. We created some new (derived) features by creating 6 months' worth of lagged data. Once we had completed our raw feature matrix, we showed how to get two types of typical label data: a price series (in our case monthly BMW closing prices) and some fundamental data for BMW (in our case monthly IBES consensus revenue estimates). We then standardised our data using z-scores, a very common method, before splitting our data into training and test sets and creating our feature and label sets for both. We then defined a simple XGBoost model and handed our data to it to fit. Once that was complete, we created a model prediction for our label set (in our case BMW monthly closing prices) and visualised this in a chart.

The model was able to generate some kind of prediction - which doesn't seem that compelling - but is not entirely without merit. The model is broadly predicting lower prices for the stock based on our economic indicator factors and this has broadly been evidenced in the test set.

Whilst a simple example, I hope this article has demystified this type of workflow and provided you with a practical codebase to explore further.

Further Resources for Eikon Data API

For Content Navigation in Eikon - please use the Data Item Browser Application: Type 'DIB' into Eikon Search Bar.

Download Notebook from Github