ACADEMIC ARTICLE SERIES:

Predicting M&A Targets Using Machine Learning Techniques

Author:

Dr. Haykaz Aramyan
Developer Advocate Developer Advocate

The purpose of this article is to build a predictive model for Mergers and Acquisitions (M&A) target identification and discover if that will produce an abnormal return for investors by utilizing Refinitiv Data APIs. Extensive literature review is conducted to identify main Machine Learning models as well as variables used in empirical studies which can be provided by individual request. The rest of the article is structured in the following way. Section 1 briefly provides theoretical background to M&A, discusses the main motivations and drivers, and suggests main stakeholders who can benefit from target predictive modeling. Section 2 and Section 3 present the methodology of predictive modeling and describe the data respectively. Section 4 discusses the empirical results from logistic regression models and identifies significant variables. We test out-of sample predictive power of the models in Section 5Section 6 provides portfolio return estimations based on the prediction outputs.

 

 

Below are three code cells containing packages required in this report. If these packages are not already installed on your computer, please run these cells.

    	
            !pip install eikon 
        
        
    
    	
            !pip install sklearn
        
        
    
    	
            !pip install seaborn
        
        
    
    	
            !pip install plotnine
        
        
    
    	
            !pip install openpyxl
        
        
    
    	
            

import numpy as np

import pandas as pd

import os

import seaborn as sns

import plotnine as pn

import datetime

from sklearn.linear_model import LogisticRegression

from sklearn.linear_model import LinearRegression

from sklearn import metrics

from sklearn import preprocessing

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

from sklearn.cluster import KMeans

from sklearn.datasets import make_classification

from sklearn.metrics import roc_curve

from sklearn.metrics import roc_auc_score

from numpy import mean

from numpy import std

from sklearn.model_selection import KFold

from sklearn.model_selection import StratifiedKFold

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import RepeatedKFold

import statsmodels.api as sm

from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy.stats import ttest_ind

from plotnine import *

 

# import eikon package and read app key from a text file

import eikon as ek

app_key = open("app_key.txt","r")

 

ek.set_app_key(app_key.read())

app_key.close()

Section 1: Theoretical Background

 

1.1 Definitions to M&A

M&A are corporate actions involving restructuring and change of control within companies, which play an essential role in external corporate growth. The literature uses the terms of mergers, acquisitions, and takeovers synonymously; however, there are subtle differences in their economic implications. Piesse, Lee, Lin, and Kuo (2013) interpret acquisitions and takeovers as activities when the acquirer gets control over 50% equity of the target company and mergers when two firms join to form a new entity.

Overall, according to Piesse et al. (2013), the negotiating process is often friendly in M&A, assuming synergies for both firms and hostile in case of takeovers. In this sense, terms “merger” and “acquisition” are used synonymously to refer to friendly corporate actions and “takeovers” to hostile corporate actions. The current article concentrates on friendly M&A only, which assumes a substantial premium for the target’s stock price.

 

1.2 Background to M&A

M&A activity has increased throughout recent years both in terms of the number and value of the deals. The number of M&A deals reached its peak in 2017 when 50,600 M&A deals were announced, totaling USD 3.5 trillion. The activity was more than twenty times higher than the number of deals in 1985 and around ten times higher than the deal value for the same year. M&A activity for 1985-2021 is summarized in Figure below (Source):

    	
            

def ab_return(RIC, sdate, edate, announce_date, period):

    '''

    Calculate abnormal return of a given security during an observation period based on Event study Methodology(MacKinlay,1997)

    

    Dependencies

    ------------

    Python library 'eikon' version 1.1.12

    Python library 'numpy' version 1.20.1

    Python library 'pandas' version 1.2.4

    Python library 'Sklearn' version 0.24.1

    

    Parameters

    -----------

        Input:

            RIC (str): Refinitiv Identification Number (RIC) of a stock

            sdate (str): Starting date of the estimation period - in yyyy-mm-dd

            edate (str): End date of the estimation period, which is also starting date of the observation period - in yyyy-mm-dd

            announce_date (str): End date of the observation period which is assumed to be the M&A announcment date or any other specified date

            period (int): Number of trading days in during the observation period. For each date in this period abnormal return is calculated

        Output:

            CAR (int): Cumulative Abnormal Return (CAR) for a given stock

            abnormal_returns (DataFrame): Dataframe containing abnormal returns during the observation period

    '''

    

    #create an empty dataframe to store abnormal returns

    abnormal_returns = pd.DataFrame({'#': np.arange(start = 1, stop = period)})

    

    ## estimate linear regression model parameters based on estimation period

    # get timeseries for the specified RIC and market proxy (S&P 500 in our case) for the both estimation and observation period

    df_all_per = ek.get_timeseries([RIC, '.SPX'], 

                      start_date = sdate, 

                      end_date = announce_date,

                      interval='daily',

                      fields = 'CLOSE')

    

    # slice the estimation period

    df_all_per.reset_index(inplace = True)

    df_est_per = df_all_per.loc[(df_all_per['Date'] <= edate)]

    

    # calculate means of percentage change of returns for the stock and market proxy

    df_est_per.insert(loc = len(df_est_per.columns), column = "Return_stock", value = df_est_per[RIC].pct_change()*100)

    df_est_per.insert(loc = len(df_est_per.columns), column = "Return_market", value = df_est_per[".SPX"].pct_change()*100)

    mean_stock = df_est_per["Return_stock"].mean()

    mean_index = df_est_per["Return_market"].mean()

    df_est_per.dropna(inplace = True)

    

    # reshape the dataframe and estimate parameters of linear regression

    y = df_est_per["Return_stock"].to_numpy().reshape(-1,1)

    X = df_est_per["Return_market"].to_numpy().reshape(-1,1)

    model = LinearRegression().fit(X,y)

    Beta = model.coef_[0][0]

    intercept = model.intercept_[0]

    

    # slice the estimation period

    df_obs_per = df_all_per.loc[(df_all_per['Date'] >= edate)]

    

    # calculate percentage change of returns for the stock and market proxy

    df_obs_per.insert(loc = len(df_obs_per.columns), column = "Return_stock", value = df_obs_per[RIC].pct_change()*100)

    df_obs_per.insert(loc = len(df_obs_per.columns), column = "Return_market", value = df_obs_per[".SPX"].pct_change()*100)

    

    df_obs_per.dropna(inplace=True)

    df_obs_per.reset_index(inplace=True)

    

    # calculate and return cumulative abnormal return (CAR) for the observation period

    abnormal_returns.insert(loc = len(abnormal_returns.columns), column = str(RIC)+'_Date', value = df_obs_per["Date"])

    abnormal_returns.dropna(inplace=True)

    abnormal_returns.insert(loc = len(abnormal_returns.columns), column = str(RIC)+'_return', value = df_obs_per["Return_stock"] - (intercept + Beta * df_obs_per["Return_market"]))

    CAR =  abnormal_returns.iloc[:,2].sum()

    return CAR, abnormal_returns

The following gives an example of calculaing abnormal return for Slack Technologies (WORK.N^G21) during the observation period which includes the aquisition announcement date by Salesforce.

    	
            

CAR, abnormal_returns = ab_return("WORK.N^G21", '2020-03-26', "2020-10-02", "2020-12-01", 60)

CAR

50.620286227582014

The function mentioned above has been used to calculate abnormal returns for 3 different cases:

1. Calculate Run-up return vriable for target and non-target companies - This variable is based on the findings in the previous literature (Keown & Pinkerton, 1981, Barnes, 1998), suggesting that target companies generate significant run-up returns during one to two months before the announcement of the deal. We calculate this return for both target and non-target companies.The period 250 to 60 days before the deal announcement was used as an estimation window. The observation period was two months before the announcement.

2. Calculate post announcement abnormal returns - In order to support the assumption that shareholders of target companies receive abnormal returns after company acquisition post-announcement abnormal returns for target companies are calculated.

3. Calculate portfolio abnormal returns - abnormal returns for the portfolios, constructed based on the models outputs are calculated to test whether a target prediction model can capture some of the examined announcement abnormal returns.

The estimation window for both the announcement and portfolio returns is 250 to 60 days before the deal announcement. The observation period for portfolio abnormal return calculation is 60 days before and 3 days after the announcement. As for the announcement returns, multiple observation periods, such as [-40, +40], [-20, +20], [-10, +10], [-5, +5], are considered to observe both run-up (two months preceding the announcement) and mark-up returns (two months following the deal announcement). The Figure below illustrates observation and estimation periods of both announcement and portfolio abnormal returns.

    	
            

#read target group dataset from excel file: update URL based on your file location

target_group = pd.read_excel (r'Target_Sample.xlsx')

 

#select only the columns needed for the subsequect sections 

target_group = target_group.iloc[:,3:31]

target_group.head()

  AD-30 Announcement Date Company RIC Market Cap Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
0 29/05/2021 28/06/2021 QADA.O 1.28E+09 -9.77544 10.77455 4.201914 59.18308 0.077847 4.544522 ... 4.807378 4.129297 11.42875 0.097879 0.996802 1.002554 -6.27018 -0.9134 1 1
1 22/05/2021 21/06/2021 RAVN.O 1.16E+09 -7.70184 7.449319 6.531071 33.81626 0.0565 7.288653 ... 4.424246 4.337533 4.644855 0.008242 0.180739 0.098591 -0.84217 -0.09042 1 1
2 22/05/2021 21/06/2021 LDL 5.33E+08 -11.8431 -4.71313 -1.55563 18.96168 -0.1396 2.823379 ... 0.813417 1.031316 2.430913 1.049446 32.00883 0.193466 9.432784 0.3186 1 1
3 19/05/2021 18/06/2021 SYKE.OQ^H21 1.49E+09 -11.1465 7.669254 3.841362 32.09241 0.058989 7.49143 ... 0.931876 0.895016 1.780169 0.070497 3.052414 0.107747 -1.44018 -0.05943 0 1
4 09/05/2021 08/06/2021 MCF 3.98E+08 -9.64752 -251.266 -146.448 8.89214 -5.3374 -20.0788 ... 7.253047 7.377276 45.87027 0.989979 1.849967 0.044645 0.080742 0.45284 0 1
    	
            target_group.describe()
        
        
    
  Market Cap ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales ROC EV/EBIDTA Sales growth, 3y Free cash Flow/Sales ... Working Capital to Total Assests Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch
count 6.56E+02 656 656 656 655 656 656 656 656 656 ... 655 6.56E+02 656 656 656 656 655 656 656 656
mean 4.05E+09 1.227193 0.004292 42.592205 0.011263 8.386456 7.23945 10.709083 16.277893 -0.005295 ... 0.192846 2.57E+00 3.179511 4.055075 1.278185 26.890114 0.1747 9.272178 0.160276 0.317073
std 8.61E+09 51.640629 33.771662 21.314522 0.264163 20.842951 13.043522 59.78252 55.207696 0.453567 ... 0.210179 2.74E+00 3.113478 6.783922 3.441966 22.710054 0.223336 15.630188 0.47266 0.465691
min 1.14E+07 -764.000937 -256.188448 0.56503 -5.337401 -245.91966 -72.692063 -864.301773 -25.870643 -6.586436 ... -0.67105 8.00E-09 0.095067 -1.557987 -35.141609 0 0.000072 -23.83771 -2.62278 0
25% 5.95E+08 -1.868845 -1.386343 26.123918 -0.008542 2.897525 3.257199 7.013958 0.924596 -0.010343 ... 0.02784 8.14E-01 1.163568 1.604212 0.154095 6.051583 0.029994 -0.964812 -0.122802 0
50% 1.42E+09 6.999636 3.864721 41.146985 0.039185 8.911569 7.353025 10.612786 7.095638 0.051043 ... 0.15772 1.74E+00 2.235609 2.447625 0.738462 24.968875 0.099197 5.778312 0.29549 0
75% 3.94E+09 14.453616 9.865902 57.78912 0.088947 17.107852 12.727209 16.922002 17.110155 0.115698 ... 0.30882 3.33E+00 3.874402 4.047843 1.325697 42.010874 0.232926 15.229821 0.51408 1
max 1.05E+11 246.345515 205.474778 99.23579 0.536342 59.360426 59.537746 440.269908 976.508129 0.661562 ... 0.86214 2.70E+01 19.02814 103.5282 41.333135 131.325927 2.622785 166.992527 1.17803 1
    	
            

def peers(RIC, date):

    '''

    Get peer group for an individual RIC along with required variables for the models

    Dependencies

    ------------

    Python library 'eikon' version 1.1.12

    Python library 'pandas' version 1.2.4    

    Parameters

    -----------

        Input:

            RIC (str): Refinitiv Identification Number (RIC) of a stock

            date (str): Date as of which peer group and variables are requested - in yyyy-mm-dd

        Output:

            peer_group (DataFrame): Dataframe of 50 peer companies along with requested variables

    

    '''

    # specify variables for the request

    fields=["TR.TRBCIndustry", "TR.TRBCBusinessSector", "TR.TRBCActivity", "TR.F.MktCap", "TR.F.ReturnAvgTotEqPctTTM",

           "TR.F.IncAftTaxMargPctTTM","TR.F.GrossProfMarg","TR.F.NetIncAfterMinIntr","TR.F.TotCap","TR.F.OpMargPctTTM",

           "TR.F.ReturnCapEmployedPctTTM","TR.F.NetCashFlowOp", "TR.F.LeveredFOCF", "TR.F.TotRevenue", "TR.F.RevGoodsSrvc3YrCAGR",

           "TR.F.NetPPEPctofTotAssets", "TR.F.TotAssets","TR.F.SGA","TR.F.CurrRatio","TR.F.WkgCaptoTotAssets",

           "TR.PriceToBVPerShare","TR.PriceToSalesPerShare","TR.EVToEBITDA","TR.EVToSales","TR.F.TotShHoldEq","TR.F.TotDebtPctofTotAssets",

           "TR.F.DebtTot","TR.F.NetDebtPctofNetBookValue","TR.F.NetDebttoTotCap","TR.TotalDebtToEV","TR.F.NetDebtPerShr","TR.TotalDebtToEBITDA"]

    

    #search for peers

    instruments = 'SCREEN(U(IN(Peers("{}"))))'.format(RIC)

    

    #request variable data for each peer

    peer_group, error = ek.get_data(instruments = instruments, fields = fields, parameters = {'SDate': date})

    

#     df.to_excel(str(RIC[i]) + '.xlsx') - can be enabled if required to store peer data in excel

 

    return peer_group

The following gives an example of retrieving peers and specified variables for Slack Technologies. The next cell provides the list of target company RICs along with the date for peer identification and variable retrieval.

    	
            

#request peer data for Slack and show first 5 peers 

peers('WORK.N^G21', '2020-11-01').head()

  Instrument TRBC Industry Name TRBC Business Sector Name TRBC Activity Name Market Capitalization Return on Average Total Equity - %, TTM Income after Tax Margin - %, TTM Gross Profit Margin - % Net Income after Minority Interest Total Capital ... Enterprise Value To EBITDA (Daily Time Series Ratio) Enterprise Value To Sales (Daily Time Series Ratio) Total Shareholders' Equity incl Minority Intr & Hybrid Debt Total Debt Percentage of Total Assets Debt - Total Net Debt Percentage of Net Book Value Net Debt to Total Capital Total Debt To Enterprise Value (Daily Time Series Ratio) Net Debt per Share Total Debt To EBITDA (Daily Time Series Ratio)
0 CRM.N IT Services & Consulting Software & IT Services Cloud Computing Services 1.61709E+11 8.53505 12.244582 75.23102 126000000 36947000000 ... 75.667295 10.56531 33885000000 5.55455 3062000000 -16.84483 -0.13222 1.305458 -5.470325 0.987805
1 TEAM.OQ Software Software & IT Services Software (NEC) 44700285864 -63.864982 -25.815988 83.34708 -350654000 1729057000 ... 348.861618 27.320632 575306000 29.62839 1153751000 <NA> -0.57558 2.479368 -4.021923 8.649564
2 SPLK.OQ Software Software & IT Services Software (NEC) 24215811839 -41.714537 -27.622625 81.78035 -336668000 3714059000 ... <NA> 13.637515 1999429000 31.522 1714630000 -2.06907 -0.01091 7.035879 -0.256871 <NA>
3 NOW.N Software Software & IT Services Enterprise Software 53245552000 34.148025 16.597748 76.97849 626698000 2822922000 ... 179.616716 22.703498 2127941000 11.53988 694981000 -88.00939 -0.35287 1.779752 -5.25762 3.196732
4 MSFT.OQ Software Software & IT Services Software (NEC) 1.54331E+12 41.399328 32.285167 67.781 44281000000 1.91127E+11 ... 21.371859 9.968566 1.18304E+11 24.16872 72823000000 -116.67399 -0.33331 5.026813 -8.414212 1.074323
    	
            

# retrieve RICs and dates for target group companies

RICs = target_group['Company RIC'].to_list()

date = target_group['AD-30'].dt.strftime('%Y-%m-%d').to_list()

Running "peer" function over all RICs and dates above will create dataframes (or excel files) for each target company, which will include all peers along with specified variables. After screening the peer data based on their similarity to the target group and data availability, each target was matched by year with the closest non-target company.The final dataset for non-target companies is included in the github folder of the current article.

    	
            

non_target_group = pd.read_excel (r'Non-target_sample.xlsx')

non_target_group = non_target_group.iloc[:,4:31]

 

non_target_group.head()

  Announcement Date Company RIC Market Cap Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales Return on Capital ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
0 28/06/2021 OPRA.OQ 1.04E+09 -2.831319 6.083762 65.461049 83.82711 0.169373 15.307199 1.39486 ... 7.262251 6.548395 1.296834 0.008555 0.735752 0.126564 -1.101874 -0.11889 0 0
1 21/06/2021 BDGI.TO 1.33E+09 -7.522057 1.43515 0.894579 15.33098 0.052053 1.9545 1.837 ... 2.575443 2.793027 4.468704 0.447268 8.701365 0.036376 3.719562 0.27267 0 0
2 21/06/2021 LXFR.N 4.54E+08 -4.183324 12.733008 6.93408 24.90764 0.090703 11.847015 12.007564 ... 1.998172 2.126281 3.635942 0.319569 10.675449 0.006803 1.789655 0.23537 1 0
3 18/06/2021 CSGS.OQ 1.48E+09 -1.351786 13.953294 5.693989 43.61389 0.075892 10.941099 13.94227 ... 1.472389 1.61714 3.443384 0.831489 21.661849 0.243919 3.390701 0.14338 1 0
4 08/06/2021 NOG.A 4.02E+08 22.048509 -329.504482 -388.580768 4.88476 -1.255706 15.533732 4.011939 ... 1.826714 3.535152 -3.433331 -4.231196 48.400209 0.001979 20.549773 1.3075 0 0
    	
            non_target_group.describe()
        
        
    
  Market Cap Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales Return on Capital EV to EBIDTA Sales growth, 3y ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
count 6.56E+02 656 656 656 656 656 656 656 656 656 ... 6.56E+02 656 656 656 656 656 656 656 656 656
mean 4.33E+09 -0.03835 5.461099 3.032561 42.11838 0.045924 9.690908 9.461485 13.03909 18.25903 ... 3.06E+00 3.688036 3.90387 1.040082 23.00153 0.168167 9.464655 0.11195 0.335366 0
std 9.40E+09 17.67915 37.75587 29.40039 22.05762 0.161203 22.55952 14.18395 71.4611 54.84517 ... 3.98E+00 4.877811 9.202817 2.782451 21.69162 0.19202 32.38858 0.43166 0.472478 0
min 9.15E+06 -126.719 -456.588 -388.581 -13.2379 -1.25571 -263.836 -71.0716 -801.847 -27.3662 ... 6.40E-08 0.097714 -57.4275 -4.2312 0 0 -155.537 -1.58058 0 0
25% 6.39E+08 -8.06732 1.434235 1.063125 24.7642 0.007688 4.223528 4.037114 7.515283 1.190327 ... 7.73E-01 1.062764 1.450024 0.108651 4.598596 0.031875 -1.47364 -0.18771 0 0
50% 1.59E+09 0.176899 8.971992 5.342759 40.22623 0.053812 9.181608 8.680717 11.16064 7.887975 ... 1.66E+00 2.087322 2.408813 0.538702 18.37021 0.10329 2.629143 0.188925 0 0
75% 4.09E+09 8.888801 16.61211 10.67219 57.44022 0.101982 17.32304 15.17464 18.00727 18.29859 ... 3.79E+00 4.29108 4.139631 1.054455 35.62091 0.241191 12.27956 0.438397 1 0
max 9.36E+10 135.3732 157.6795 100.8545 100 0.969145 85.14702 115.929 689.3331 879.8665 ... 4.18E+01 65.16187 183.2269 47.82671 127.4428 1.601634 435.7095 1.3075 1 0

Hereof, data from 2010-2019 are used as a training sample for the prediction model, and data from 2020-2021 as a hold-out testing sample to measure the prediction outputs. Finally, other non-target companies from peer data (all peers were included based on data availability) were added in the hold-out sample to have a similar to natural world distribution of target and non-target companies. The all target sample consisting of 1705 observations is stored in github folder of this article.

    	
            

non_target_all = pd.read_excel (r'non-target_all.xlsx')

non_target_all = non_target_all.iloc[:,4:31]

 

non_target_all.head()

  Announcement Date Company RIC Market Cap Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales Return on Capital ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
0 28/06/2021 ORCL.N 1.70E+11 -2.326304 104.657802 32.324708 77.64155 0.122175 39.064271 17.34117 ... 5.720429 6.58138 25.612618 5.628686 26.528822 0.441759 9.299967 0.33836 0 0
1 28/06/2021 APPF.OQ 6.18E+09 5.841547 76.413277 49.575963 52.97011 0.554012 0.40571 0.434931 ... 14.705474 14.237565 16.26181 0 0 0.490567 -4.900518 -0.58939 0 0
2 28/06/2021 APPS.O 3.76E+08 -3.597121 45.866903 15.068955 38.693 0.141264 17.909776 47.794993 ... 22.945682 22.850398 52.729386 0.270393 0.322305 0.218848 -0.006771 -0.00601 0 0
3 28/06/2021 SPSC.OQ 3.83E+09 1.688433 11.630102 14.084168 66.29434 0.108352 15.610047 12.147986 ... 10.245945 9.611462 7.640845 0 0 0.355799 -5.282973 -0.44561 1 0
4 28/06/2021 OPRA.OQ 1.04E+09 -1.893117 6.083762 65.461049 83.82711 0.169373 15.307199 1.39486 ... 7.262251 6.548395 1.296834 0.008555 0.735752 0.126564 -1.101874 -0.11889 0 0
    	
            

data = pd.concat([target_group, non_target_group], ignore_index=True)

 

#select only observations earlier than January 1, 2020 for the trainig dataset

data = data.loc[data["Announcement Date"] < "2020-01-01"]

data.head()

  AD-30 Announcement Date Company RIC Market Cap Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
75 19/10/2019 18/11/2019 SDI^G20 2.59E+08 18.90802 13.07202 3.033754 47.29568 0.007028 17.38898 ... 0.524636 1.178682 3.958383 2.576265 48.9193 0.062581 11.274 0.5609 1 1
85 24/11/2019 24/12/2019 AXE^F20 1.80E+09 10.58003 12.19726 2.32083 19.29359 0.055382 4.238786 ... 0.335002 0.448346 1.684529 0.797122 27.73048 0.028701 34.57491 0.41485 1 1
86 23/11/2019 23/12/2019 WAAS.K^C20 5.05E+08 4.173605 -5.32525 -10.2765 53.22647 -0.03158 3.50351 ... 3.49148 4.571051 1.66549 0.949824 36.04978 0.086268 9.82416 0.40087 0 1
87 20/11/2019 20/12/2019 CRCM.K^B20 6.25E+08 22.87365 -12.0777 -9.43021 76.91928 0.245749 -3.10547 ... 1.940667 1.582142 2.401903 0 0 0.429477 -3.97826 -0.59256 1 1
88 19/11/2019 19/12/2019 TIVO.O^F20 1.17E+09 2.869971 -30.7963 -71.75 48.63371 -0.14058 3.716264 ... 1.376742 2.33317 0.704931 0.664552 58.92102 0.065171 5.414204 0.2701 0 1
    	
            

#create holdout testing sample by joining target group observations dating after January 1, 2020 and all non-target observationa

data_hold = pd.concat([target_group.loc[target_group["Announcement Date"] > "2020-01-01"], non_target_all], ignore_index=True)

data_hold.head()

  AD-30 Announcement Date Company RIC Market Cap Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
0 29/05/2021 28/06/2021 QADA.O 1.28E+09 -9.775437 10.77455 4.201914 59.18308 0.077847 4.544522 ... 4.807378 4.129297 11.428748 0.097879 0.996802 1.002554 -6.270184 -0.9134 1 1
1 22/05/2021 21/06/2021 RAVN.O 1.16E+09 -7.701835 7.449319 6.531071 33.81626 0.0565 7.288653 ... 4.424246 4.337533 4.644855 0.008242 0.180739 0.098591 -0.842171 -0.09042 1 1
2 22/05/2021 21/06/2021 LDL 5.33E+08 -11.843083 -4.71313 -1.555628 18.96168 -0.139595 2.823379 ... 0.813417 1.031316 2.430913 1.049446 32.008828 0.193466 9.432784 0.3186 1 1
3 19/05/2021 18/06/2021 SYKE.OQ^H21 1.49E+09 -11.146474 7.669254 3.841362 32.09241 0.058989 7.49143 ... 0.931876 0.895016 1.780169 0.070497 3.052414 0.107747 -1.440183 -0.05943 0 1
4 09/05/2021 08/06/2021 MCF 3.98E+08 -9.647523 -251.26627 -146.447928 8.89214 -5.337401 -20.078817 ... 7.253047 7.377276 45.870267 0.989979 1.849967 0.044645 0.080742 0.45284 0 1
    	
            

# report number of observations and structure of the datasets

print(f"Number of target companies in training dataset is: {data.loc[data['Label']==1].shape[0]}")

print(f"Number of non-target companies in training dataset is: {data.loc[data['Label']==0].shape[0]}")

print(f"\nNumber of target companies in hold-out testing dataset is: {data_hold.loc[data_hold['Label']==1].shape[0]}")

print(f"Number of non-target companies in hold-out testing dataset is: {data_hold.loc[data_hold['Label']==0].shape[0]}")

Number of target companies in training dataset is: 572
Number of non-target companies in training dataset is: 572

Number of target companies in hold-out testing dataset is: 84
Number of non-target companies in hold-out testing dataset is: 1704

    	
            

#select numeric fields only

data = data.iloc[:,4:29]

 

# run t-test

t_test = ttest_ind(data.loc[data['Label']==1], data.loc[data['Label']==0])

 

#store results in a dataframe

ttest_data = pd.DataFrame()

ttest_data["feature"] = data.columns

ttest_data["t_test"] = t_test[0]

ttest_data["p-value"] = t_test[1]

 

ttest_data = ttest_data.T

ttest_data.rename(columns = ttest_data.iloc[0], inplace = True)

ttest_data = ttest_data.iloc[1:]

ttest_data

  Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales Return on Capital EV to EBIDTA Sales growth, 3y Free cash Flow/Sales ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
t_test 3.527428 -1.475911 -1.889791 0.444909 -2.945529 -0.931712 -2.840692 -1.111184 -0.657753 0.860847 ... -1.973785 -1.756332 -0.395142 1.511998 3.23238 0.074497 -0.596586 1.742697 -0.941765 inf
p-value 0.000436 0.140243 0.059039 0.656469 0.003289 0.351683 0.004582 0.266723 0.51083 0.389503 ... 0.048647 0.0793 0.692812 0.130811 0.001263 0.940628 0.550902 0.081656 0.346512 0
    	
            

# create dataframe to store VIF scores

vif_data = pd.DataFrame()

vif_data["feature"] = data.columns

vif_data

 

# calculate VIF for each feature

vif_data["VIF"] = [variance_inflation_factor(data.values, i)

                          for i in range(len(data.columns))] 

 

vif_data = vif_data.T

vif_data.rename(columns = vif_data.iloc[0], inplace = True)

vif_data = vif_data.iloc[1:]

vif_data

  Abnormal return 60 day ROE Profit Margin Gross Profit Margin Profit to Capital Return on Sales Return on Capital EV to EBIDTA Sales growth, 3y Free cash Flow/Sales ... Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
VIF 1.054777 1.881072 2.665773 6.042009 3.554314 5.010072 5.76416 1.157921 1.311202 2.419423 ... 20.179892 23.487542 2.079249 2.184374 6.102144 4.116519 1.609153 5.965589 1.933686 2.065807
    	
            

#provide correlation matrix of variables

data.corr(method ='pearson')

  Profit to Capital Return on Sales Return on Capital EV to EBIDTA Sales growth, 3y Free cash Flow/Sales Operating cash flow to Total Assets Asset Turnover Current Ratio Working Capital to Total Assests Price to Sales EV to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Net debt to Total Capital Growth-Resource Mismatch Label
Profit to Capital 1 0.469819 0.770463 0.057058 -0.099643 0.159629 0.571263 0.133556 -0.026975 -0.023155 -0.092609 -0.095129 0.000116 -0.061493 -0.194562 -0.062241 -0.058041 -0.01065 0.093123 -0.086833
Return on Sales 0.469819 1 0.571258 0.072925 -0.125538 0.304241 0.475287 -0.160287 -0.208076 -0.256022 -0.025165 0.084847 -0.02229 0.02147 0.063414 -0.236161 0.113287 0.220759 -0.040704 -0.02756
Return on Capital 0.770463 0.571258 1 0.072687 -0.103096 0.184553 0.640436 0.250245 -0.108058 -0.080304 -0.165991 -0.146214 0.02568 0.001541 -0.121035 -0.126193 -0.00914 0.103537 0.069217 -0.083765
EV to EBIDTA 0.057058 0.072925 0.072687 1 -0.031775 -0.216339 0.050431 -0.019283 -0.020741 -0.055012 0.142681 0.187297 0.021269 0.003807 0.003469 -0.075433 0.01399 0.082962 -0.001593 -0.032864
Sales growth, 3y -0.099643 -0.125538 -0.103096 -0.031775 1 -0.230891 -0.080419 -0.08697 0.135151 0.095959 0.198016 0.196634 0.04208 -0.035913 -0.04864 0.082146 -0.011732 -0.111481 -0.091704 -0.01946
Free cash Flow/Sales 0.159629 0.304241 0.184553 -0.216339 -0.230891 1 0.287205 0.130224 0.024008 0.08932 -0.393878 -0.542854 0.023437 -0.027909 -0.123641 0.084361 -0.068922 -0.105306 0.056599 0.025465
Operating cash flow to Total Assets 0.571263 0.475287 0.640436 0.050431 -0.080419 0.287205 1 0.237074 -0.072218 -0.030645 -0.129507 -0.1691 0.050883 -0.046929 -0.275411 0.047895 -0.111884 -0.121877 0.064638 -0.023641
Asset Turnover 0.133556 -0.160287 0.250245 -0.019283 -0.08697 0.130224 0.237074 1 -0.021046 0.198435 -0.367556 -0.443685 0.016387 -0.049528 -0.194819 0.084507 -0.158973 -0.118254 0.141162 -0.00063
Current Ratio -0.026975 -0.208076 -0.108058 -0.020741 0.135151 0.024008 -0.072218 -0.021046 1 0.805576 0.180575 -0.001214 0.003018 -0.139586 -0.333796 0.331163 -0.257875 -0.475517 0.320242 0.012113
Working Capital to Total Assests -0.023155 -0.256022 -0.080304 -0.055012 0.095959 0.08932 -0.030645 0.198435 0.805576 1 0.09664 -0.122366 0.040904 -0.170173 -0.460793 0.515196 -0.329673 -0.661775 0.413097 0.011728
Price to Sales -0.092609 -0.025165 -0.165991 0.142681 0.198016 -0.393878 -0.129507 -0.367556 0.180575 0.09664 1 0.905159 0.21278 -0.088396 -0.262439 0.190755 -0.080365 -0.226254 -0.096225 -0.058308
EV to Sales -0.095129 0.084847 -0.146214 0.187297 0.196634 -0.542854 -0.1691 -0.443685 -0.001214 -0.122366 0.905159 1 0.149185 -0.016209 -0.031145 -0.007317 0.072198 0.020936 -0.147149 -0.051902
Market to Book 0.000116 -0.02229 0.02568 0.021269 0.04208 0.023437 0.050883 0.016387 0.003018 0.040904 0.21278 0.149185 1 0.48869 -0.077159 0.139582 -0.079884 -0.014592 -0.027932 -0.011692
Total debt to Equity -0.061493 0.02147 0.001541 0.003807 -0.035913 -0.027909 -0.046929 -0.049528 -0.139586 -0.170173 -0.088396 -0.016209 0.48869 1 0.394115 -0.12604 0.251266 0.381659 -0.10333 0.044698
Debt to EV -0.194562 0.063414 -0.121035 0.003469 -0.04864 -0.123641 -0.275411 -0.194819 -0.333796 -0.460793 -0.262439 -0.031145 -0.077159 0.394115 1 -0.409302 0.469554 0.725694 -0.238567 0.095216
Cash to Capital -0.062241 -0.236161 -0.126193 -0.075433 0.082146 0.084361 0.047895 0.084507 0.331163 0.515196 0.190755 -0.007317 0.139582 -0.12604 -0.409302 1 -0.309905 -0.719857 0.167972 0.002204
Net debt per share -0.058041 0.113287 -0.00914 0.01399 -0.011732 -0.068922 -0.111884 -0.158973 -0.257875 -0.329673 -0.080365 0.072198 -0.079884 0.251266 0.469554 -0.309905 1 0.4629 -0.094465 -0.017651
Net debt to Total Capital -0.01065 0.220759 0.103537 0.082962 -0.111481 -0.105306 -0.121877 -0.118254 -0.475517 -0.661775 -0.226254 0.020936 -0.014592 0.381659 0.725694 -0.719857 0.4629 1 -0.264913 0.051501
Growth-Resource Mismatch 0.093123 -0.040704 0.069217 -0.001593 -0.091704 0.056599 0.064638 0.141162 0.320242 0.413097 -0.096225 -0.147149 -0.027932 -0.10333 -0.238567 0.167972 -0.094465 -0.264913 1 -0.027857
Label -0.086833 -0.02756 -0.083765 -0.032864 -0.01946 0.025465 -0.023641 -0.00063 0.012113 0.011728 -0.058308 -0.051902 -0.011692 0.044698 0.095216 0.002204 -0.017651 0.051501 -0.027857 1
    	
            

# remove specified variables and prepare final dataset for the logistic regression model

drop = ['Return on Capital',"EV to Sales","Working Capital to Total Assests",

                   'Operating cash flow to Total Assets','Asset Turnover',

                 'Net debt to Total Capital','Profit Margin', 'ROE']

 

data.drop(columns=drop, inplace=True)

data.head()

  Abnormal return 60 day Gross Profit Margin Profit to Capital Return on Sales EV to EBIDTA Sales growth, 3y Free cash Flow/Sales Current Ratio Price to Sales Market to Book Total debt to Equity Debt to EV Cash to Capital Net debt per share Growth-Resource Mismatch Label
75 18.908016 47.29568 0.007028 17.388983 6.624964 23.607172 -0.006725 1.54999 0.524636 3.958383 2.576265 48.919298 0.062581 11.274002 1 1
85 10.580031 19.29359 0.055382 4.238786 9.964724 11.032365 0.011345 1.93994 0.335002 1.684529 0.797122 27.730476 0.028701 34.574912 1 1
86 4.173605 53.22647 -0.031583 3.50351 31.02985 13.336855 0.049832 2.47767 3.49148 1.66549 0.949824 36.049779 0.086268 9.82416 0 1
87 22.873645 76.91928 0.245749 -3.105473 30.648062 11.566996 0.16575 3.1974 1.940667 2.401903 0 0 0.429477 -3.978257 1 1
88 2.869971 48.63371 -0.140578 3.716264 9.784327 11.620528 0.1911 0.97639 1.376742 0.704931 0.664552 58.921021 0.065171 5.414204 0 1
    	
            

X = data.drop(['Label','Total debt to Equity'],axis =1)

y = data['Label']

    	
            

# fit logistic regression model on the unclustered dataset

lr = LogisticRegression(solver = 'liblinear',penalty = 'l2', random_state=0)

lr.fit(X, y)

 

# fit logistic regression model by different package to show the summary output

log_reg = sm.Logit(y, X).fit()

print(log_reg.summary())

Optimization terminated successfully.            
  Current function value: 0.676776        
  Iterations 5            
  Logit Regression Results          
                 
Dep. Variable: Label No. Observations: 1144          
Model: Logit Df Residuals: 1130          
Method: MLE Df Model: 13          
Date: Tue, 14 Sep 2021 Pseudo R-squ.: 0.02362          
Time: 22:01:50 Log-Likelihood: -774.23          
converged: TRUE LL-Null: -792.96          
Covariance Type: nonrobust LLR p-value: 0.0003512          
                 
      coef std err z P>|z| [0.025 0.975]
                 
Abnormal return 60 day 0.0118 0.004 3.012 0.003 0.004 0.019
Gross Profit Margin 0.0031 0.003 1.03 0.303 -0.003 0.009
Profit to Capital -1.2249 0.526 -2.329 0.02 -2.256 -0.194
Return on Sales -0.0007 0.004 -0.164 0.87 -0.009 0.008
EV to EBIDTA -0.0009 0.001 -0.801 0.423 -0.003 0.001
Sales growth, 3y -0.0012 0.002 -0.724 0.469 -0.005 0.002
Free cash Flow/Sales 0.0524 0.189 0.278 0.781 -0.317 0.422
Current Ratio   0.0186 0.038 0.49 0.624 -0.056 0.093
Price to Sales -0.0512 0.028 -1.848 0.065 -0.106 0.003
Market to Book -0.0003 0.008 -0.04 0.968 -0.017 0.016
Debt to EV 0.007 0.003 2.402 0.016 0.001 0.013
Cash to Capital -0.0586 0.375 -0.156 0.876 -0.794 0.677
Net debt per share -0.0053 0.003 -1.707 0.088 -0.011 0.001
Growth-Resource Mismatch   -0.1516 0.139 -1.094 0.274 -0.423 0.12
    	
            

# keep only leverage and liquidity ratios

cl_drop = ['Label','Free cash Flow/Sales','Growth-Resource Mismatch','EV to EBIDTA','Abnormal return 60 day', 'Return on Sales',

               'Gross Profit Margin','Price to Sales','Market to Book','Total debt to Equity','Sales growth, 3y','Profit to Capital']

dat_cl = data.drop(cl_drop, axis = 1)

 

# run kmeans algorithm and partition data into two clusters

km = KMeans(n_clusters = 2).fit(dat_cl)

 

# store cluster values in a dataframe

cluster_map = pd.DataFrame()

cluster_map['data_index'] = dat_cl.index.values

cluster_map['cluster'] = km.labels_

 

# create list of indexes for each cluster

cl_0_idx = []

cl_1_idx = []

for i in range(len(dat_cl)):

    if cluster_map.iloc[i,1] == 0:

        cl_0_idx.append(i)

    if cluster_map.iloc[i,1] == 1:

        cl_1_idx.append(i)

        

print("Number of elements in Cluster 0 is" , len(cl_0_idx))

print("Number of elements in Cluster 1 is" , len(cl_1_idx))

Number of elements in Cluster 0 is 702
Number of elements in Cluster 1 is 442

    	
            

# plot distribution of target and non-target companies in each cluster

cluster_map["Label"] = y.values

sns.set(rc = {'figure.figsize':(10,7)})

ax = sns.countplot(x = "cluster", hue = "Label", data = cluster_map)

 

# add numeric values on the bars

for p in ax.patches:

    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x() + 0.15, p.get_height() + 5))

    	
            

# get cluster centroids

cl_centroid = km.cluster_centers_

 

# store variable centroids in a dataframe

centroids = pd.DataFrame()

centroids["Variable"] = dat_cl.columns

centroids["Mean_cluster 0"] = cl_centroid[0]

centroids["Mean_cluster 1"] = cl_centroid[1]

centroids

  Variable Mean_cluster 0 Mean_cluster 1
0 Current Ratio 2.805736 1.554494
1 Debt to EV 10.621829 47.207315
2 Cash to Capital 0.23208 0.065443
3 Net debt per share 0.236114 24.964022

Further we run logistic regression models on each cluster datapoints and discuss the results

    	
            

lr_0 = LogisticRegression(solver = 'liblinear',random_state=0, penalty = 'l2')

lr_0.fit(X.iloc[cl_0_idx], y.iloc[cl_0_idx])

 

log_reg = sm.Logit(y.iloc[cl_0_idx], X.iloc[cl_0_idx]).fit()

print(log_reg.summary())

Optimization terminated successfully.            
  Current function value: 0.670424        
  Iterations 6            
                 
  Logit Regression Results          
                 
Dep. Variable: Label No. Observations: 702          
Model: Logit Df Residuals: 688          
Method: MLE Df Model: 13          
Date: Tue, 14 Sep 2021 Pseudo R-squ.: 0.02694          
Time: 22:02:05 Log-Likelihood: -470.64          
converged: TRUE LL-Null: -483.67          
Covariance Type: nonrobust LLR p-value: 0.01668          
                 
      coef std err z P>|z| [0.025 0.975]
                 
Abnormal return 60 day 0.014 0.005 2.648 0.008 0.004 0.024
Gross Profit Margin 0.0036 0.004 0.941 0.347 -0.004 0.011
Profit to Capital -1.4259 0.663 -2.152 0.031 -2.725 -0.127
Return on Sales 0.0059 0.007 0.862 0.388 -0.008 0.019
EV to EBIDTA -0.0012 0.001 -0.975 0.33 -0.003 0.001
Sales growth, 3y -0.0051 0.003 -1.68 0.093 -0.011 0.001
Free cash Flow/Sales -0.0313 0.375 -0.083 0.934 -0.766 0.703
Current Ratio   0.0321 0.046 0.692 0.489 -0.059 0.123
Price to Sales -0.0672 0.038 -1.779 0.075 -0.141 0.007
Market to Book -0.0026 0.023 -0.112 0.911 -0.049 0.043
Debt to EV -0.0177 0.008 -2.144 0.032 -0.034 -0.002
Cash to Capital 0.6774 0.457 1.482 0.138 -0.218 1.573
Net debt per share 0.0324 0.017 1.949 0.051 0 0.065
Growth-Resource Mismatch   -0.1303 0.175 -0.744 0.457 -0.473 0.213

Logistic regression outputs from Cluster 0, which includes companies in better financial health, find similar results to Model based on the entire dataset. Particularly, Abnormal returns are significant in 1%, Profit to capital, Debt to EV ratios in 5%, and Price to Sales at a 10% significance level. In addition, Net debt per share is also significant at a 5% level. The directions of the impact of significant variables are mainly similar to the first model; however, the coefficient values are higher for the Cluster 1 model for all variables. The exception is leverage ratios, where the direction of coefficients is the opposite, suggesting that companies with lower Debt to EV and higher Net debt per share contribute to the acquisition.

    	
            

lr_1 = LogisticRegression(solver = 'liblinear',random_state=0, penalty = 'l2')

lr_1.fit(X.iloc[cl_1_idx], y.iloc[cl_1_idx])

 

log_reg = sm.Logit(y.iloc[cl_1_idx], X.iloc[cl_1_idx]).fit()

print(log_reg.summary())

Optimization
terminated successfully.              
  Current function value: 0.648109          
  Iterations 6              
                   
  Logit Regression Results            
                   
Dep. Variable: Label No. Observations: 442            
Model: Logit Df Residuals: 428            
Method: MLE Df Model: 13            
Date: Tue, 14 Sep 2021 Pseudo R-squ.: 0.05057            
Time: 22:02:10 Log-Likelihood: -286.46            
converged: TRUE LL-Null: -301.72            
Covariance Type: nonrobust LLR p-value: 0.003969            
                   
        coef std err z P>|z| [0.025 0.975]
                   
Abnormal return 60 day   0.0078 0.006 1.292 0.196 -0.004 0.02
Gross Profit Margin   0.01 0.006 1.646 0.1 -0.002 0.022
Profit to Capital   -1.3091 1.296 -1.01 0.313 -3.85 1.232
Return on Sales   -0.0185 0.009 -2.015 0.044 -0.036 -0.001
EV to EBIDTA   -0.0004 0.005 -0.075 0.94 -0.01 0.009
Sales growth, 3y   0.0055 0.005 1.17 0.242 -0.004 0.015
Free cash Flow/Sales   0.364 0.254 1.434 0.152 -0.133 0.861
Current Ratio     0.2083 0.101 2.067 0.039 0.011 0.406
Price to Sales   0.0203 0.06 0.339 0.735 -0.097 0.137
Market to Book   -0.0014 0.009 -0.152 0.879 -0.019 0.016
Debt to EV   0.0046 0.005 0.892 0.372 -0.006 0.015
Cash to Capital   -1.3762 1.377 -0.999 0.318 -4.076 1.323
Net debt per share   -0.0111 0.005 -2.401 0.016 -0.02 -0.002
Growth-Resource Mismatch     -0.3546 0.283 -1.252 0.211 -0.91 0.201
    	
            

print('\033[1m' + "Accuracy metrics for the model based on the entire dataset" + '\033[0m')

print('Classification accuracy: {:.3f}'.format(lr.score(X, y)))

print('ROC_AUC score: {:.3f}'.format(roc_auc_score(y, lr.predict_proba(X)[:, 1])))

 

print('\033[1m' + "\nAccuracy metrics for the model based on the Cluster 0 data" + '\033[0m')

print('Classification accuracy: {:.3f}'.format(lr_0.score(X.iloc[cl_0_idx], y.iloc[cl_0_idx])))

print('ROC_AUC score: {:.3f}'.format(roc_auc_score(y.iloc[cl_0_idx], lr_0.predict_proba(X.iloc[cl_0_idx])[:, 1])))

 

print('\033[1m' + "\nAccuracy metrics for the model based on the Cluster 1 data" + '\033[0m')

print('Classification accuracy on test set: {:.3f}'.format(lr_1.score(X.iloc[cl_1_idx], y.iloc[cl_1_idx])))

print('ROC_AUC score: {:.3f}'.format(roc_auc_score(y.iloc[cl_1_idx], lr_1.predict_proba(X.iloc[cl_1_idx])[:, 1])))

Accuracy metrics for the model based on the entire dataset
Classification accuracy: 0.575
ROC_AUC score: 0.602

Accuracy metrics for the model based on the Cluster 0 data
Classification accuracy: 0.595
ROC_AUC score: 0.618

Accuracy metrics for the model based on the Cluster 1 data
Classification accuracy on test set: 0.606
ROC_AUC score: 0.644

The comparison of the three models shows that Clustered models produce relatively better results according to the Accuracy, AUC measure, and Pseudo R squire. We believe that the AUC measure is a better estimate, considering that Clustered models are imbalanced, and equal to 0.6, 0.618, and 0.644 for Model 1, 2, and 3, respectively. Additionally, models on clustered data have better explanatory power as clustered models produced a more comprehensive view of significant variables. However, it is also worth noting that the difference in model accuracy is not radical and can be associated with the sample size, which is the smallest for Model 3. The difference becomes even smaller after the cross-validation. Results from stratified cross-validation with ten splits for accuracy and AUC scores are summarized in the table below.

    	
            

cv = StratifiedKFold(n_splits=10)

 

# implement 10-k cross validation for each model

scores_acc = cross_val_score(lr, X, y, scoring = 'accuracy', cv = cv, n_jobs = -1)

scores_roc = cross_val_score(lr, X, y, scoring = 'roc_auc', cv = cv, n_jobs = -1)

print('\033[1m' + "Accuracy metrics for the model based on the entire dataset after 10-fold cross-validation" + '\033[0m')

print('Accuracy: %.3f (%.3f)' % (mean(scores_acc), std(scores_acc)))

print('ROC_AUC score: %.3f (%.3f)' % (mean(scores_roc), std(scores_roc)))

 

scores_cl_0_acc = cross_val_score(lr_0,X.iloc[cl_0_idx], y.iloc[cl_0_idx], scoring = 'accuracy', cv = cv, n_jobs = -1)

scores_cl_0_roc = cross_val_score(lr_0,X.iloc[cl_0_idx], y.iloc[cl_0_idx], scoring = 'roc_auc', cv = cv, n_jobs = -1)

print('\033[1m' + "\nAccuracy metrics for the model based on the Cluster 0 data after 10-fold cross-validation" + '\033[0m')

print('Accuracy: %.3f (%.3f)' % (mean(scores_cl_0_acc), std(scores_cl_0_acc)))

print('ROC_AUC score: %.3f (%.3f)' % (mean(scores_cl_0_roc), std(scores_cl_0_roc)))

 

scores_cl_1_acc = cross_val_score(lr_1,X.iloc[cl_1_idx], y.iloc[cl_1_idx], scoring = 'accuracy', cv = cv, n_jobs = -1)

scores_cl_1_roc = cross_val_score(lr_1,X.iloc[cl_1_idx], y.iloc[cl_1_idx], scoring = 'roc_auc', cv = cv, n_jobs = -1)

print('\033[1m' + "\nAccuracy metrics for the model based on the Cluster 1 data after 10-fold cross-validation" + '\033[0m')

print('Accuracy: %.3f (%.3f)' % (mean(scores_cl_1_acc), std(scores_cl_1_acc)))

print('ROC_AUC score: %.3f (%.3f)' % (mean(scores_cl_1_roc), std(scores_cl_1_roc)))

Accuracy metrics for the model based on the entire dataset after 10-fold cross-validation
Accuracy: 0.565 (0.033)
ROC_AUC score: 0.578 (0.040)

Accuracy metrics for the model based on the Cluster 0 data after 10-fold cross-validation
Accuracy: 0.570 (0.051)
ROC_AUC score: 0.572 (0.048)

Accuracy metrics for the model based on the Cluster 1 data after 10-fold cross-validation
Accuracy: 0.577 (0.070)
ROC_AUC score: 0.593 (0.073)

    	
            

def find_optimal_cutoff(model, X, y_true):

    '''

    Dependencies

    ------------

    Python library 'Plotnine' version 0.8.0    

    

    Parameters

    -----------

    Identify optimal threshold for the model based on Youden’s J index and plot ROC curve

        Input:

            model: name of the model for which optimal threshold is being calculates

            X (DataFrame): DataFrame of independent variables

            y_true (Series): Series of True Labels

        Output:

            threshold_opt (Series): Series of True Labels

            ROC curve

    '''    

    # get target probabilities

    lr_probs = model.predict_proba(X)

    lr_probs = lr_probs[:, 1]

    

    # create dataframe of TPR and FPR per threshold

    fpr, tpr, thresholds = roc_curve(y_true, lr_probs)

    df_fpr_tpr = pd.DataFrame({'FPR':fpr, 'TPR':tpr, 'Threshold':thresholds})

    

    # calculate optimal threshold based on Gmean

    gmean = np.sqrt(tpr  * (1 - fpr))

    index = np.argmax(gmean)

    threshold_opt = round(thresholds[index], ndigits = 4)

    gmean_opt = round(gmean[index], ndigits = 4)

    fpr_opt = round(fpr[index], ndigits = 4)

    tpr_opt = round(tpr[index], ndigits = 4)

    

    print('Best Threshold: {} with G-Mean: {}'.format(threshold_opt, gmean_opt))

    print('FPR: {}, TPR: {}'.format(fpr_opt, tpr_opt))

    

    # plot the ROC curve and the optimal point:

    # source of the visualization: https://towardsdatascience.com/optimal-threshold-for-imbalanced-classification-5884e870c293

    pn.options.figure_size = (6,4)

    return threshold_opt, (

        ggplot(data = df_fpr_tpr)+

        geom_point(aes(x = 'FPR',

                       y = 'TPR'),

                   size = 0.4)+

        geom_point(aes(x = fpr_opt,

                       y = tpr_opt),

                   color = '#981220',

                   size = 4)+

        geom_line(aes(x = 'FPR',

                      y = 'TPR'))+

        geom_text(aes(x = fpr_opt,

                      y = tpr_opt),

                  label = 'Optimal threshold: {}'.format(threshold_opt),

                  nudge_x = 0.14,

                  nudge_y = -0.10,

                  size = 10,

                  fontstyle = 'italic')+

        labs(title = 'ROC Curve')+

        xlab('False Positive Rate (FPR)')+

        ylab('True Positive Rate (TPR)')+

        theme_minimal()

    )

First we identify the optimal cut-off for the model based on the entire dataset using the function above.

    	
            

# get threshold for Model 1 and plot the ROC curve

threshold_opt, plot = find_optimal_cutoff(lr, X, y)

plot

Best Threshold: 0.4961 with G-Mean: 0.5865
FPR: 0.4056, TPR: 0.5787

Then we identify the optimal cut-off for the model based on the cluster 0 data.

    	
            

threshold_opt_cl_0, plot_cl_0 = find_optimal_cutoff(lr_0, X.iloc[cl_0_idx], y.iloc[cl_0_idx])

plot_cl_0

Best Threshold: 0.4606 with G-Mean: 0.5826
FPR: 0.3916, TPR: 0.558

    	
            

threshold_opt_cl_1, plot_cl_1 = find_optimal_cutoff(lr_1, X.iloc[cl_1_idx], y.iloc[cl_1_idx])

plot_cl_1

Best Threshold: 0.5902 with G-Mean: 0.6128
FPR: 0.291, TPR: 0.5296

    	
            

# drop to be removeded variables from hold-out sample

data_hold.drop(columns = drop, inplace = True)

 

# Separate independent and dependent variables

X_test = data_hold.drop(['AD-30', 'Announcement Date', 'Label',"Company RIC",'Market Cap','Total debt to Equity'], axis = 1)

y_test =  data_hold['Label']

X_test.head()

  Abnormal return 60 day Gross Profit Margin Profit to Capital Return on Sales EV to EBIDTA Sales growth, 3y Free cash Flow/Sales Current Ratio Price to Sales Market to Book Debt to EV Cash to Capital Net debt per share Growth-Resource Mismatch
0 -9.775437 59.18308 0.077847 4.544522 65.10298 0.521386 0.096377 1.38192 4.807378 11.428748 0.996802 1.002554 -6.270184 1
1 -7.701835 33.81626 0.0565 7.288653 40.759031 -2.363193 0.112886 2.69199 4.424246 4.644855 0.180739 0.098591 -0.842171 1
2 -11.843083 18.96168 -0.139595 2.823379 11.659163 3.438428 0.053293 1.86214 0.813417 2.430913 32.008828 0.193466 9.432784 1
3 -11.146474 32.09241 0.058989 7.49143 7.752199 2.581302 0.071953 1.89468 0.931876 1.780169 3.052414 0.107747 -1.440183 0
4 -9.647523 8.89214 -5.337401 -20.078817 113.200436 14.994174 -0.007138 0.52276 7.253047 45.870267 1.849967 0.044645 0.080742 0
    	
            

# get target probabilities

prob = lr.predict_proba(X_test)

 

# create dataframe and store model results

pred_res_lr = data_hold[['Announcement Date','Company RIC','Label']]

pred_res_lr.insert(loc = len(pred_res_lr.columns), column = "Probability_target", value = prob[:,1])

pred_res_lr.insert(loc = len(pred_res_lr.columns), column = "Class", value = np.where(pred_res_lr['Probability_target'] > threshold_opt, 1, 0))

 

# assign TP/FP/TN/FN labels based on the specified cut-off probability

pred_res_lr.insert(loc = len(pred_res_lr.columns), column = "Outcome", value = 

                np.where((pred_res_lr['Class'] == 1) & (pred_res_lr['Label'] == 1), "TP", 

                np.where((pred_res_lr['Class'] == 1) & (pred_res_lr['Label'] == 0),"FP",

                np.where((pred_res_lr['Class'] == 0) & (pred_res_lr['Label'] == 1),"FN", "TN"))))

 

pred_res_lr.head()

  Announcement Date Company RIC Label Probability_target Class Outcome
0 28/06/2021 QADA.O 1 0.403778 0 FN
1 21/06/2021 RAVN.O 1 0.367537 0 FN
2 21/06/2021 LDL 1 0.487374 0 FN
3 18/06/2021 SYKE.OQ^H21 1 0.416753 0 FN
4 08/06/2021 MCF 1 0.971662 1 TP
    	
            

print('\033[1m' + "Observations" + '\033[0m')

print(f'Total Number of companies: {pred_res_lr.shape[0]}')

print('Number of target companies: ' +  str(pred_res_lr.loc[pred_res_lr['Label'] == 1].shape[0]))

print('Number of non-companies: ' +  str(pred_res_lr.loc[pred_res_lr['Label'] == 0].shape[0]))

 

# calculate confusion matrix data based on specified cut-off probability

TP_lr = pred_res_lr.loc[pred_res_lr['Outcome'] == "TP"].shape[0]

TN_lr = pred_res_lr.loc[pred_res_lr['Outcome'] == "TN"].shape[0]

FP_lr = pred_res_lr.loc[pred_res_lr['Outcome'] == "FP"].shape[0]

FN_lr = pred_res_lr.loc[pred_res_lr['Outcome'] == "FN"].shape[0]

 

print('\033[1m' + "\nAbsolute Measures" + '\033[0m')

print(f'TP: {TP_lr}')

print(f'TN: {TN_lr}')

print(f'FP(Type II error): {FP_lr}')

print(f'FN(Type I error): {FN_lr}')

 

print('\033[1m' + "\nRelative Measures" + '\033[0m')

print('ROC score:' + str (round(roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]),2)))

print(f'Accuracy: {round((TP_lr + TN_lr) / (TP_lr + TN_lr + FP_lr + FN_lr),2)}')

print(f'Precision: {round(TP_lr / (TP_lr + FP_lr),2)}')

print(f'Recall(TPR): {round(TP_lr / (TP_lr + FN_lr),2)}')

print(f'FPR: {round(FP_lr / (FP_lr + TN_lr),2)}')

print(f'F1 score: {round(2*(TP_lr / (TP_lr + FP_lr)) * (TP_lr / (TP_lr + FN_lr) / ((TP_lr / (TP_lr + FP_lr)) + (TP_lr / (TP_lr + FN_lr)))),2)}')

Observations
Total Number of companies: 1788
Number of target companies: 84
Number of non-companies: 1704

Absolute Measures
TP: 48
TN: 982
FP(Type II error): 722
FN(Type I error): 36

Relative Measures
ROC score:0.6
Accuracy: 0.58
Precision: 0.06
Recall(TPR): 0.57
FPR: 0.42
F1 score: 0.11

    	
            

# set the hold out sample for clustering

datahold_cl =  data_hold.drop(cl_drop, axis = 1)

datahold_cl = datahold_cl.drop(['AD-30','Announcement Date','Company RIC','Market Cap'], axis=1)

 

# cluster the hold out sample

cl_preds = km.predict(datahold_cl)

 

cl_0_idx_t = []

cl_1_idx_t = []

 

for i in range(len(cl_preds)):

    if cl_preds[i] == 0:

        cl_0_idx_t.append(i)

    if cl_preds[i] == 1:

        cl_1_idx_t.append(i)

        

print("Number of elements in Cluster 0 is" , len(cl_0_idx_t))

print("Number of elements in Cluster 1 is" , len(cl_1_idx_t))

Number of elements in Cluster 0 is 1143
Number of elements in Cluster 1 is 645

We calculate and report logistic regression outputs of the models based on Cluster 0 and Cluster 1 separately. Below are the results from the Cluster 0 model.

    	
            

y_pred_lr_0 = lr_0.predict(X_test.iloc[cl_0_idx_t])

prob_0 = lr_0.predict_proba(X_test.iloc[cl_0_idx_t])

 

df_c = data_hold[['Announcement Date','Company RIC','Label']]

 

pred_res_lr_0 = df_c.iloc[cl_0_idx_t]

pred_res_lr_0.insert(loc = len(pred_res_lr_0.columns), column = 'Cluster', value = [0] * len(cl_0_idx_t))

pred_res_lr_0.insert(loc = len(pred_res_lr_0.columns), column = 'Probability_target', value = prob_0[:,1])

pred_res_lr_0.insert(loc = len(pred_res_lr_0.columns), column = 'Class', value = np.where(pred_res_lr_0['Probability_target'] > threshold_opt_cl_0, 1, 0))

 

pred_res_lr_0.insert(loc = len(pred_res_lr_0.columns), column = "Outcome", value = 

                np.where((pred_res_lr_0['Class'] == 1) & (pred_res_lr_0['Label'] == 1), "TP", 

                np.where((pred_res_lr_0['Class'] == 1) & (pred_res_lr_0['Label'] == 0),"FP",

                np.where((pred_res_lr_0['Class'] == 0) & (pred_res_lr_0['Label'] == 1),"FN", "TN"))))

 

 

pred_res_lr_0.head()

  Announcement Date Company RIC Label Cluster Probability_target Class Outcome
0 28/06/2021 QADA.O 1 0 0.489038 1 TP
1 21/06/2021 RAVN.O 1 0 0.363049 0 FN
3 18/06/2021 SYKE.OQ^H21 1 0 0.389464 0 FN
4 08/06/2021 MCF 1 0 0.975591 1 TP
7 06/06/2021 ALU.AX 1 0 0.418407 0 FN
    	
            

print('\033[1m' + "Observations" + '\033[0m')

print(f'Total Number of companies: {pred_res_lr_0.shape[0]}')

print('Total Number of companies: ' +  str(pred_res_lr_0.loc[pred_res_lr_0['Label'] == 1].shape[0]))

print('Total Number of companies: ' +  str(pred_res_lr_0.loc[pred_res_lr_0['Label'] == 0].shape[0]))

 

TP_lr_0 = pred_res_lr_0.loc[pred_res_lr_0['Outcome'] == "TP"].shape[0]

TN_lr_0 = pred_res_lr_0.loc[pred_res_lr_0['Outcome'] == "TN"].shape[0]

FP_lr_0 = pred_res_lr_0.loc[pred_res_lr_0['Outcome'] == "FP"].shape[0]

FN_lr_0 = pred_res_lr_0.loc[pred_res_lr_0['Outcome'] == "FN"].shape[0]

 

print('\033[1m' + "\nAbsolute Measures" + '\033[0m')

print(f'TP: {TP_lr_0}')

print(f'TN: {TN_lr_0}')

print(f'FP(Type II error): {FP_lr_0}')

print(f'FN(Type I error): {FN_lr_0}')

 

print('\033[1m' + "\nRelative Measures" + '\033[0m')

print('ROC score:' + str (round(roc_auc_score(y_test.iloc[cl_0_idx_t], lr_0.predict_proba(X_test.iloc[cl_0_idx_t])[:, 1]),2)))

print(f'Accuracy: {round((TP_lr_0 + TN_lr_0) / (TP_lr_0 + TN_lr_0 + FP_lr_0 + FN_lr_0),2)}')

print(f'Precision: {round(TP_lr_0 / (TP_lr_0 + FP_lr_0),2)}')

print(f'Recall(TPR): {round(TP_lr_0 / (TP_lr_0 + FN_lr_0),2)}')

print(f'FPR: {round(FP_lr_0 / (FP_lr_0 + TN_lr_0),2)}')

print(f'F1 score: {round(2*(TP_lr_0 / (TP_lr_0 + FP_lr_0)) * (TP_lr_0 / (TP_lr_0 + FN_lr_0) / ((TP_lr_0 / (TP_lr_0 + FP_lr_0)) + (TP_lr_0 / (TP_lr_0 + FN_lr_0)))),2)}')

Observations
Total Number of companies: 1143
Total Number of companies: 45
Total Number of companies: 1098

Absolute Measures
TP: 27
TN: 697
FP(Type II error): 401
FN(Type I error): 18

Relative Measures
ROC score:0.65
Accuracy: 0.63
Precision: 0.06
Recall(TPR): 0.6
FPR: 0.37
F1 score: 0.11

    	
            

y_pred_lr_1 = lr_1.predict(X_test.iloc[cl_1_idx_t])

prob_1 = lr_1.predict_proba(X_test.iloc[cl_1_idx_t])

 

pred_res_lr_1 = df_c.iloc[cl_1_idx_t]

pred_res_lr_1.insert(loc = len(pred_res_lr_1.columns), column = 'Cluster', value = [0] * len(cl_1_idx_t))

pred_res_lr_1.insert(loc = len(pred_res_lr_1.columns), column = 'Probability_target', value = prob_1[:,1])

pred_res_lr_1.insert(loc = len(pred_res_lr_1.columns), column = 'Class', value = np.where(pred_res_lr_1['Probability_target'] > threshold_opt_cl_1, 1, 0))

 

pred_res_lr_1.insert(loc = len(pred_res_lr_1.columns), column = "Outcome", value = 

                np.where((pred_res_lr_1['Class'] == 1) & (pred_res_lr_1['Label'] == 1), "TP", 

                np.where((pred_res_lr_1['Class'] == 1) & (pred_res_lr_1['Label'] == 0),"FP",

                np.where((pred_res_lr_1['Class'] == 0) & (pred_res_lr_1['Label'] == 1),"FN", "TN"))))

 

pred_res_lr_1.head()

  Announcement Date Company RIC Label Cluster Probability_target Class Outcome
2 21/06/2021 LDL 1 0 0.535761 0 FN
5 07/06/2021 QTS.N^I21 1 0 0.52956 0 FN
6 07/06/2021 USCR.OQ^H21 1 0 0.461561 0 FN
9 28/05/2021 WBT 1 0 0.612057 1 TP
12 04/05/2021 UFS 1 0 0.55886 0 FN
    	
            

print('\033[1m' + "Observations" + '\033[0m')

print(f'Total Number of companies: {pred_res_lr_1.shape[0]}')

print('Total Number of companies: ' +  str(pred_res_lr_1.loc[pred_res_lr_1['Label'] == 1].shape[0]))

print('Total Number of companies: ' +  str(pred_res_lr_1.loc[pred_res_lr_1['Label'] == 0].shape[0]))

 

TP_lr_1 = pred_res_lr_1.loc[pred_res_lr_1['Outcome'] == "TP"].shape[0]

TN_lr_1 = pred_res_lr_1.loc[pred_res_lr_1['Outcome'] == "TN"].shape[0]

FP_lr_1 = pred_res_lr_1.loc[pred_res_lr_1['Outcome'] == "FP"].shape[0]

FN_lr_1 = pred_res_lr_1.loc[pred_res_lr_1['Outcome'] == "FN"].shape[0]

 

print('\033[1m' + "\nAbsolute Measures" + '\033[0m')

print(f'TP: {TP_lr_1}')

print(f'TN: {TN_lr_1}')

print(f'FP(Type II error): {FP_lr_1}')

print(f'FN(Type I error): {FN_lr_1}')

 

print('\033[1m' + "\nRelative Measures" + '\033[0m')

print('ROC score:' + str (round(roc_auc_score(y_test.iloc[cl_1_idx_t], lr_0.predict_proba(X_test.iloc[cl_1_idx_t])[:, 1]),2)))

print(f'Accuracy: {round((TP_lr_1 + TN_lr_1) / (TP_lr_1 + TN_lr_1 + FP_lr_1 + FN_lr_1),2)}')

print(f'Precision: {round(TP_lr_1 / (TP_lr_1 + FP_lr_1),2)}')

print(f'Recall(TPR): {round(TP_lr_1 / (TP_lr_1 + FN_lr_1),2)}')

print(f'FPR: {round(FP_lr_1 / (FP_lr_1 + TN_lr_1),2)}')

print(f'F1 score: {round(2*(TP_lr_1 / (TP_lr_1 + FP_lr_1)) * (TP_lr_1 / (TP_lr_1 + FN_lr_1) / ((TP_lr_1 / (TP_lr_1 + FP_lr_1)) + (TP_lr_1 / (TP_lr_1 + FN_lr_1)))),2)}')

Observations
Total Number of companies: 645
Total Number of companies: 39
Total Number of companies: 606

Absolute Measures
TP: 28
TN: 319
FP(Type II error): 287
FN(Type I error): 11

Relative Measures
ROC score:0.45
Accuracy: 0.54
Precision: 0.09
Recall(TPR): 0.72
FPR: 0.47
F1 score: 0.16

Moving to the Cluster 1 model, which involves low-liquid and high-levered companies, has the highest accurate identification of target companies. The model correctly identifies 28 targets out of 39, 72% in relative measures. In contrast, the model has the poorest ability to predict actual non-targets, particularly 52% or 319 out of 606 non-targets. Considering the highly toward non-targets unbalanced (95:5) hold-out sample, the predictions mentioned above result in the lowest accuracy of 0.54 compared to the other models. However, due to the relatively accurate prediction of targets and smaller Type I error, the model produces the highest recall and precision scores, 0.72 and 0.09, respectively. The F1 score is also the highest and is equal to 0.16.

As can be noticed Clustered models characterize with diverged results. That can be attributed to an assumption that models overall predict well for the companies which are in bad financial health. This assumption is supported on both clustered models, as Cluster 0, which involved companies in good financial health, better identifies non-targets (potentially financially distressed), and Cluster 1, which involved companies with lower liquidity and higher leverage, better identifies the targets (again potentially financially distressed).

Finally we calculate and report also combined results of clustering to be able to better measure the impact of clustering on logistic regression prediction accuracy.

    	
            

pred_res_combined = pd.concat([pred_res_lr_0,pred_res_lr_1], ignore_index=True)

pred_res_combined

  Announcement Date Company RIC Label Cluster Probability_target Class Outcome
0 28/06/2021 QADA.O 1 0 0.489038 1 TP
1 21/06/2021 RAVN.O 1 0 0.363049 0 FN
2 18/06/2021 SYKE.OQ^H21 1 0 0.389464 0 FN
3 08/06/2021 MCF 1 0 0.975591 1 TP
4 06/06/2021 ALU.AX 1 0 0.418407 0 FN
... ... ... ... ... ... ... ...
1783 06/01/2020 AEG.TO 0 0 0.718261 1 FP
1784 06/01/2020 LUB.N 0 0 0.718315 1 FP
1785 06/01/2020 BH.N 0 0 0.463615 0 TN
1786 06/01/2020 TACO.OQ 0 0 0.696414 1 FP
1787 06/01/2020 WEN.OQ 0 0 0.636937 1 FP
    	
            

print('\033[1m' + "Observations" + '\033[0m')

print(f'Total Number of companies: {pred_res_combined.shape[0]}')

print('Total Number of companies: ' +  str(pred_res_combined.loc[pred_res_combined['Label'] == 1].shape[0]))

print('Total Number of companies: ' +  str(pred_res_combined.loc[pred_res_combined['Label'] == 0].shape[0]))

 

TP_lr_comb = pred_res_combined.loc[pred_res_combined['Outcome'] == "TP"].shape[0]

TN_lr_comb = pred_res_combined.loc[pred_res_combined['Outcome'] == "TN"].shape[0]

FP_lr_comb = pred_res_combined.loc[pred_res_combined['Outcome'] == "FP"].shape[0]

FN_lr_comb = pred_res_combined.loc[pred_res_combined['Outcome'] == "FN"].shape[0]

 

print('\033[1m' + "\nAbsolute Measures" + '\033[0m')

print(f'TP: {TP_lr_comb}')

print(f'TN: {TN_lr_comb}')

print(f'FP(Type II error): {FP_lr_comb}')

print(f'FN(Type I error): {FN_lr_comb}')

 

print('\033[1m' + "\nRelative Measures" + '\033[0m')

print(f'Accuracy: {round((TP_lr_comb + TN_lr_comb) / (TP_lr_comb + TN_lr_comb + FP_lr_comb + FN_lr_comb),2)}')

print(f'Precision: {round(TP_lr_comb / (TP_lr_comb + FP_lr_comb),2)}')

print(f'Recall(TPR): {round(TP_lr_comb / (TP_lr_comb + FN_lr_comb),2)}')

print(f'FPR: {round(FP_lr_comb / (FP_lr_comb + TN_lr_comb),2)}')

print(f'F1 score: {round(2*(TP_lr_comb / (TP_lr_comb + FP_lr_comb)) * (TP_lr_comb / (TP_lr_comb + FN_lr_comb) / ((TP_lr_comb / (TP_lr_comb + FP_lr_comb)) + (TP_lr_comb / (TP_lr_comb + FN_lr_comb)))),2)}')

Observations
Total Number of companies: 1788
Total Number of companies: 84
Total Number of companies: 1704

Absolute Measures
TP: 55
TN: 1016
FP(Type II error): 688
FN(Type I error): 29

Relative Measures
Accuracy: 0.6
Precision: 0.07
Recall(TPR): 0.65
FPR: 0.4
F1 score: 0.13

    	
            

# create a dataframe for abnormal returns calculation with necessary dates

companies = data_hold[['Company RIC', 'Announcement Date', 'Label']]

companies.insert(loc = len(companies.columns), column = 'ad-250', value = companies['Announcement Date'] - datetime.timedelta(250))

companies.insert(loc = len(companies.columns), column = 'ad-60', value = companies['Announcement Date'] - datetime.timedelta(60))

companies.insert(loc = len(companies.columns), column = 'ad+3', value = companies['Announcement Date'] + datetime.timedelta(3))

companies.insert(loc = len(companies.columns), column = 'ad+20', value = companies['Announcement Date'] + datetime.timedelta(20))

companies.head()

  Company RIC Announcement Date Label ad-250 ad-60 ad+3 ad+20
0 QADA.O 28/06/2021 1 21/10/2020 29/04/2021 01/07/2021 18/07/2021
1 RAVN.O 21/06/2021 1 14/10/2020 22/04/2021 24/06/2021 11/07/2021
2 LDL 21/06/2021 1 14/10/2020 22/04/2021 24/06/2021 11/07/2021
3 SYKE.OQ^H21 18/06/2021 1 11/10/2020 19/04/2021 21/06/2021 08/07/2021
4 MCF 08/06/2021 1 01/10/2020 09/04/2021 11/06/2021 28/06/2021

After the dataframe is created, abnormal returns for each target company in the holdout sample is estimated using the function described in the begining of this article.

    	
            

# store target companies in a separate dataframe

targets = companies.loc[companies['Label'] == 1]

 

# create lists for ab_return function parameters

RIC_t = targets['Company RIC']

sdate_t = targets['ad-250'].dt.strftime('%Y-%m-%d').to_list()

edate_t = targets['ad-60'].dt.strftime('%Y-%m-%d').to_list()

ann_date_t = targets['ad+20'].dt.strftime('%Y-%m-%d').to_list()

 

# create an empty dataframe to store retrieved abnormal returns

abreturn_targets = pd.DataFrame({'#': np.arange(start = 1, stop = 80)})

 

# get abnormal returns for specified companies during the specified period using the function

for i in range(len(RIC_t)):

    CAR, abnormal_returns = ab_return(RIC_t[i], sdate_t[i], edate_t[i], ann_date_t[i], 80)

    abreturn_targets.insert(loc = len(abreturn_targets.columns), column = RIC_t[i] + "/" + sdate_t[i], value = abnormal_returns.iloc[:,2])

abreturn_targets.head()

  # QADA.O/2020-10-21 RAVN.O/2020-10-14 LDL/2020-10-14 SYKE.OQ^H21/2020-10-11 MCF/2020-10-01 QTS.N^I21/2020-09-30 USCR.OQ^H21/2020-09-30 ALU.AX/2020-09-29 CLDR.K/2020-09-24 ... GLIBA.O^L20/2019-10-24 MINI.O^G20/2019-06-26 XPER.O/2019-06-18 IOTS.O^F20/2019-06-15 FSCT.O^H20/2019-06-01 POPE.O^E20/2019-05-10 PRMW.O^C20/2019-05-08 HXL/2019-05-07 DERM.O^B20/2019-05-05 HABT.O^C20/2019-05-01
0 1 -1.951212 1.789674 0.647811 -1.340482 -3.464857 -0.938509 -0.390116 0.233426 0.987546 ... 0.730205 0.404776 -1.69255 4.279967 1.407284 -0.500029 0.667141 1.481534 0.494524 0.51578
1 2 0.169606 -1.020053 -1.074035 0.038024 0.63128 0.575662 -0.215277 0.411751 -2.603475 ... -0.759163 -0.486181 1.112555 3.068245 -0.639972 -0.269045 -3.550817 -0.16982 0.063426 1.777091
2 3 -0.922917 0.445024 2.533315 1.74993 1.591846 1.21047 -6.187087 0.531499 1.902075 ... -1.080253 0.301547 0.255182 3.554056 0.769056 -0.22176 -0.847796 -0.655486 -6.877575 0.135701
3 4 -0.321588 0.618946 5.371113 -0.73537 -5.662016 -0.567464 3.221014 2.527593 -0.665821 ... 1.020085 -0.149288 -1.004653 5.529008 1.005436 1.596167 2.552246 -0.340613 0.20865 -0.946638
4 5 -3.229711 -1.168822 1.329981 -0.410937 -0.830568 2.084242 -2.789562 3.016393 -0.52657 ... -0.688504 -0.848708 -1.03037 0.261553 0.267546 1.372257 1.975911 0.715192 1.144629 -0.696808

After retrieving abnormal returns for all of the target companies, we calculate cumulative sum of the returns and plot through a lineplot.

    	
            

# remove the first row and calculate total return from all of the stocks per observation date

abreturn_targets.drop(columns = ['#'], inplace = True)

abreturn_targets.insert(loc = len(abreturn_targets.columns), column = 'Total return', value = abreturn_targets.sum(axis =1)/84)

 

# calculate cumulative sum of total returns and plot in a graph

abreturn_targets = abreturn_targets.cumsum(axis = 0)

sns.lineplot(data = abreturn_targets, y = "Total return", x = np.arange(-39, 40))

    	
            

RIC = companies['Company RIC']

sdate = companies['ad-250'].dt.strftime('%Y-%m-%d').to_list()

edate = companies['ad-60'].dt.strftime('%Y-%m-%d').to_list()

ann_date = companies['ad+3'].dt.strftime('%Y-%m-%d').to_list()

 

 

# it takes relatively long to execute this code. The values can be read from excel for this particular example

return_list = []

for i in range(len(RIC)):

    CAR, abnormal_returns = ab_return(RIC[i], sdate[i], edate[i], ann_date[i], 60)

    return_list.append(CAR)

 

 

# return_list =  pd.read_excel (r'ab_ret.xlsx')

# return_list = return_list.iloc[:,1]

First we calculate and report portfolio abnormal returns from the portfolio created by the model based on the entire dataset.

    	
            

# add returns to the prediction result dataframe

pred_res_lr.insert(loc = len(pred_res_lr.columns), column = "Abnormal Return", value = return_list)

pred_res_lr.head()

  Announcement Date Company RIC Label Probability_target Class Outcome Abnormal Return
0 28/06/2021 QADA.O 1 0.403778 0 FN 7.043903
1 21/06/2021 RAVN.O 1 0.367537 0 FN 37.23871
2 21/06/2021 LDL 1 0.487374 0 FN 67.65457
3 18/06/2021 SYKE.OQ^H21 1 0.416753 0 FN 19.19718
4 08/06/2021 MCF 1 0.971662 1 TP -12.0689
    	
            

print('\033[1m' + "Portfolio composition" + '\033[0m')

print('Predicted targtes: ' +  str(pred_res_lr.loc[pred_res_lr['Class'] == 1].shape[0]))

print('Among which Actual Targets: ' +  str(pred_res_lr.loc[(pred_res_lr['Label'] == 1) & (pred_res_lr['Class'] == 1)].shape[0]))

print('Among which Actual Non-Targets: ' +  str(pred_res_lr.loc[(pred_res_lr['Label'] == 0) & (pred_res_lr['Class'] == 1)].shape[0]))

 

print('\033[1m' + "\nAccuracy Metrics" + '\033[0m')

print(f'Recall(TPR): {round(TP_lr / (TP_lr + FN_lr),2)}')

print(f'FPR: {round(FP_lr / (FP_lr + TN_lr),2)}')

print(f'F1 score: {round(2*(TP_lr / (TP_lr + FP_lr)) * (TP_lr / (TP_lr + FN_lr) / ((TP_lr / (TP_lr + FP_lr)) + (TP_lr / (TP_lr + FN_lr)))),2)}')

 

print('\033[1m' + "\nPortfolio Abnormal Return" + '\033[0m')

print("From actual targets: " + str(round(pred_res_lr['Abnormal Return'].loc[(pred_res_lr['Class'] == 1) & (pred_res_lr['Label'] == 1)].mean(),2)))

print("From actual non-targets: " + str(round(pred_res_lr['Abnormal Return'].loc[(pred_res_lr['Class'] == 1) & (pred_res_lr['Label'] == 0)].mean(),2)))

print("Weighted Total: " + str(round(pred_res_lr['Abnormal Return'].loc[pred_res_lr['Class'] == 1].mean(),2)))

Portfolio composition
Predicted targtes: 770
Among which Actual Targets: 48
Among which Actual Non-Targets: 722

Accuracy Metrics
Recall(TPR): 0.57
FPR: 0.42
F1 score: 0.11

Portfolio Abnormal Return
From actual targets: 27.11
From actual non-targets: 6.3
Weighted Total: 7.6

The portfolio constructed based on the entire dataset comprises of 770 companies. Among them, only 48 are actual targets, which converts into 0.58 and 0.42 TPR and FPR, respectively. Despite large Type II errors, the model is able to generate 7.6 % of weighted total abnormal return. Additionally, we observe that actual targets generate considerably more returns, 27.11 %, which is diluted by the returns from non-target companies. Nevertheless, predicted as target non-targets still generate a positive 7.6 % abnormal return.

Then we calculate and report portfolio abnormal returns from the portfolio created by the model trained on Cluster 0 companies.

    	
            

idx_lr_0 = pred_res_lr_0.index.values.tolist()

return_list_lr_0 = [return_list[index] for index in idx_lr_0]

 

pred_res_lr_0.insert(loc = len(pred_res_lr_0.columns), column = "Abnormal Return", value = return_list_lr_0)

pred_res_lr_0.head()

  Announcement Date Company RIC Label Cluster Probability_target Class Outcome Abnormal Return
0 28/06/2021 QADA.O 1 0 0.489038 1 TP 7.043903
1 21/06/2021 RAVN.O 1 0 0.363049 0 FN 37.23871
3 18/06/2021 SYKE.OQ^H21 1 0 0.389464 0 FN 19.19718
4 08/06/2021 MCF 1 0 0.975591 1 TP -12.0689
7 06/06/2021 ALU.AX 1 0 0.418407 0 FN 37.30235
    	
            

print('\033[1m' + "Portfolio composition" + '\033[0m')

print('Predicted targtes: ' +  str(pred_res_lr_0.loc[pred_res_lr_0['Class'] == 1].shape[0]))

print('Among which Actual Targets: ' +  str(pred_res_lr_0.loc[(pred_res_lr_0['Label'] == 1) & (pred_res_lr_0['Class'] == 1)].shape[0]))

print('Among which Actual Non-Targets: ' +  str(pred_res_lr_0.loc[(pred_res_lr_0['Label'] == 0) & (pred_res_lr_0['Class'] == 1)].shape[0]))

 

print('\033[1m' + "\nAccuracy Metrics" + '\033[0m')

print(f'Recall(TPR): {round(TP_lr_0 / (TP_lr_0 + FN_lr_0),2)}')

print(f'FPR: {round(FP_lr_0 / (FP_lr_0 + TN_lr_0),2)}')

print(f'F1 score: {round(2*(TP_lr_0 / (TP_lr_0 + FP_lr_0)) * (TP_lr_0 / (TP_lr_0 + FN_lr_0) / ((TP_lr_0 / (TP_lr_0 + FP_lr_0)) + (TP_lr_0 / (TP_lr_0 + FN_lr_0)))),2)}')

 

print('\033[1m' + "\nPortfolio Abnormal Return" + '\033[0m')

print("From actual targets: " + str(round(pred_res_lr_0['Abnormal Return'].loc[(pred_res_lr_0['Class'] == 1) & (pred_res_lr_0['Label'] == 1)].mean(),2)))

print("From actual non-targets: " + str(round(pred_res_lr_0['Abnormal Return'].loc[(pred_res_lr_0['Class'] == 1) & (pred_res_lr_0['Label'] == 0)].mean(),2)))

print("Weighted Total: " + str(round(pred_res_lr_0['Abnormal Return'].loc[pred_res_lr_0['Class'] == 1].mean(),2)))

Portfolio composition
Predicted targtes: 428
Among which Actual Targets: 27
Among which Actual Non-Targets: 401

Accuracy Metrics
Recall(TPR): 0.6
FPR: 0.37
F1 score: 0.11

Portfolio Abnormal Return
From actual targets: 29.91
From actual non-targets: 6.17
Weighted Total: 7.67

Finally, we calculate and report portfolio abnormal returns from the portfolio created by the model trained on the Cluster 1 companies.

    	
            

idx_lr_1 = pred_res_lr_1.index.values.tolist()

return_list_lr_1 = [return_list[index] for index in idx_lr_1]

 

pred_res_lr_1.insert(loc = len(pred_res_lr_1.columns), column = "Abnormal Return", value = return_list_lr_1)

pred_res_lr_1.head()

  Announcement Date Company RIC Label Cluster Probability_target Class Outcome Abnormal Return
2 21/06/2021 LDL 1 0 0.535761 0 FN 67.65457
5 07/06/2021 QTS.N^I21 1 0 0.52956 0 FN 23.58232
6 07/06/2021 USCR.OQ^H21 1 0 0.461561 0 FN -6.41504
9 28/05/2021 WBT 1 0 0.612057 1 TP 21.79022
12 04/05/2021 UFS 1 0 0.55886 0 FN 10.25247
    	
            

print('\033[1m' + "Portfolio composition" + '\033[0m')

print('Predicted targtes: ' +  str(pred_res_lr_1.loc[pred_res_lr_1['Class'] == 1].shape[0]))

print('Among which Actual Targets: ' +  str(pred_res_lr_1.loc[(pred_res_lr_1['Label'] == 1) & (pred_res_lr_1['Class'] == 1)].shape[0]))

print('Among which Actual Non-Targets: ' +  str(pred_res_lr_1.loc[(pred_res_lr_1['Label'] == 0) & (pred_res_lr_1['Class'] == 1)].shape[0]))

 

print('\033[1m' + "\nAccuracy Metrics" + '\033[0m')

print(f'Recall(TPR): {round(TP_lr_1 / (TP_lr_1 + FN_lr_1),2)}')

print(f'FPR: {round(FP_lr_1 / (FP_lr_1 + TN_lr_1),2)}')

print(f'F1 score: {round(2*(TP_lr_1 / (TP_lr_1 + FP_lr_1)) * (TP_lr_1 / (TP_lr_1 + FN_lr_1) / ((TP_lr_1 / (TP_lr_1 + FP_lr_1)) + (TP_lr_1 / (TP_lr_1 + FN_lr_1)))),2)}')

 

print('\033[1m' + "\nPortfolio Abnormal Return" + '\033[0m')

print("From actual targets: " + str(round(pred_res_lr_1['Abnormal Return'].loc[(pred_res_lr_1['Class'] == 1) & (pred_res_lr_1['Label'] == 1)].mean(),2)))

print("From actual non-targets: " + str(round(pred_res_lr_1['Abnormal Return'].loc[(pred_res_lr_1['Class'] == 1) & (pred_res_lr_1['Label'] == 0)].mean(),2)))

print("Weighted Total: " + str(round(pred_res_lr_1['Abnormal Return'].loc[pred_res_lr_1['Class'] == 1].mean(),2)))

Portfolio composition
Predicted targtes: 315
Among which Actual Targets: 28
Among which Actual Non-Targets: 287

Accuracy Metrics
Recall(TPR): 0.72
FPR: 0.47
F1 score: 0.16

Portfolio Abnormal Return
From actual targets: 26.16
From actual non-targets: 4.6
Weighted Total: 6.52

Moving to returns generated by the portfolios constructed based on clustered models, we observe that the Cluster 0 portfolio generated the highest weighted total abnormal return of 7.66 % despite having the smallest precision and recall scores. This is merely due to the lowest number of inclusions of non-target companies (FPR is 0.36). Additionally, despite low TPR compared to the Cluster 0 portfolio, the 27 targets included in the Cluster 1 portfolio could generate around 4 % more abnormal return than those from Cluster 1. This can be because of the difference in characteristics of Cluster 0 and Cluster 1 companies. Companies in Cluster 0 are comparably in better financial health; hence possible synergetic targets for acquires, usually involving higher price premiums.

    	
            pred_res_combined = pd.concat([pred_res_lr_0,pred_res_lr_1], ignore_index=True)
        
        
    
    	
            

print('\033[1m' + "Portfolio composition" + '\033[0m')

print('Predicted targtes: ' +  str(pred_res_combined.loc[pred_res_combined['Class'] == 1].shape[0]))

print('Among which Actual Targets: ' +  str(pred_res_combined.loc[(pred_res_combined['Label'] == 1) & (pred_res_combined['Class'] == 1)].shape[0]))

print('Among which Actual Non-Targets: ' +  str(pred_res_lr_1.loc[(pred_res_combined['Label'] == 0) & (pred_res_combined['Class'] == 1)].shape[0]))

 

print('\033[1m' + "\nAccuracy Metrics" + '\033[0m')

print(f'Recall(TPR): {round(TP_lr_comb / (TP_lr_comb + FN_lr_comb),2)}')

print(f'FPR: {round(FP_lr_comb / (FP_lr_comb + TN_lr_comb),2)}')

print(f'F1 score: {round(2*(TP_lr_comb / (TP_lr_comb + FP_lr_comb)) * (TP_lr_comb / (TP_lr_comb + FN_lr_comb) / ((TP_lr_comb / (TP_lr_comb + FP_lr_comb)) + (TP_lr_comb / (TP_lr_comb + FN_lr_comb)))),2)}')

 

print('\033[1m' + "\nPortfolio Abnormal Return" + '\033[0m')

print("From actual targets: " + str(round(pred_res_combined['Abnormal Return'].loc[(pred_res_combined['Class'] == 1) & (pred_res_combined['Label'] == 1)].mean(),2)))

print("From actual non-targets: " + str(round(pred_res_combined['Abnormal Return'].loc[(pred_res_combined['Class'] == 1) & (pred_res_combined['Label'] == 0)].mean(),2)))

print("Weighted Total: " + str(round(pred_res_combined['Abnormal Return'].loc[pred_res_combined['Class'] == 1].mean(),2)))

Portfolio composition
Predicted targtes: 743
Among which Actual Targets: 55
Among which Actual Non-Targets: 251

Accuracy Metrics
Recall(TPR): 0.65
FPR: 0.4
F1 score: 0.13

Portfolio Abnormal Return
From actual targets: 28.0
From actual non-targets: 5.52
Weighted Total: 7.18

    	
            

# divide the dataset into 2020 and 2021 groups and create list of RICs

companies_2020 = companies.loc[companies['Announcement Date'] < '2021-01-01']

companies_2021 = companies.loc[companies['Announcement Date'] >= '2021-01-01']

 

companies_2020_RIC = companies_2020['Company RIC'].tolist()

companies_2021_RIC = companies_2021['Company RIC'].tolist()

 

 

# get price data for the start and end date of the analyis

price_sdate_2020, error = ek.get_data(instruments = companies_2020_RIC, fields = 'TR.PriceClose', parameters = {'SDate': '2020-01-01'})

price_edate_2020, error = ek.get_data(instruments = companies_2020_RIC, fields = 'TR.PriceClose', parameters = {'SDate': '2020-12-31'})

price_sdate_2021, error = ek.get_data(instruments = companies_2021_RIC, fields = 'TR.PriceClose', parameters = {'SDate': '2021-01-01'})

price_edate_2021, error = ek.get_data(instruments = companies_2021_RIC, fields = 'TR.PriceClose', parameters = {'SDate': '2021-07-01'})

 

 

# insert the retrieved date into the respective dataframes 

companies_2020.insert(loc = len(companies_2020.columns), column = 'price_sdate', value = price_sdate_2020.iloc[:, 1].tolist())

companies_2020.insert(loc = len(companies_2020.columns), column = 'price_edate', value = price_edate_2020.iloc[:, 1].tolist())

companies_2021.insert(loc = len(companies_2021.columns), column = 'price_sdate', value = price_sdate_2021.iloc[:, 1].tolist())

companies_2021.insert(loc = len(companies_2021.columns), column = 'price_edate', value = price_edate_2021.iloc[:, 1].tolist())

 

# rejoin the datasets

companies = pd.concat([companies_2020,companies_2021], sort=False).sort_index()

companies.head()

  Company RIC Announcement Date Label ad-250 ad-60 ad+3 ad+20 price_sdate price_edate
0 QADA.O 28/06/2021 1 21/10/2020 29/04/2021 01/07/2021 18/07/2021 63.18 86.6
1 RAVN.O 21/06/2021 1 14/10/2020 22/04/2021 24/06/2021 11/07/2021 33.09 57.66
2 LDL 21/06/2021 1 14/10/2020 22/04/2021 24/06/2021 11/07/2021 30.03 60.48
3 SYKE.OQ^H21 18/06/2021 1 11/10/2020 19/04/2021 21/06/2021 08/07/2021 37.67 53.45
4 MCF 08/06/2021 1 01/10/2020 09/04/2021 11/06/2021 28/06/2021 2.29 4.46
    	
            

price_adate = []

 

# get stock price data for target companies by the 3 day after the announcement

for i in range(len(RIC)):

    # request the data for actual targets

    if companies.iloc[i, 2] == 1:

        df, err =ek.get_data(RIC[i], fields = ['TR.PriceClose'] , parameters={'SDate': ann_date[i]})

        price_adate.append(df.iloc[0,1])

    # return NaN if the company is not a target    

    else:

        price_adate.append(np.nan)

 

# insert the retrieved data into the original dataframe

companies.insert(loc = len(companies.columns), column = 'price_adate', value = price_adate)

After retrieving the necessary data we calculate and report profit per stock as described in the beginning of current section.

    	
            

# calculate profit from the investment strategy and insert into the dataframe

companies.insert(loc = len(companies.columns), column = "Profit", value = 

                np.where(companies['price_adate'].isnull, (companies['price_edate'] - companies['price_sdate'])/companies['price_sdate']*100, 

                        (companies['price_adate'] - companies['price_sdate'])/companies['price_sdate']*100))

companies.head()

  Company RIC Announcement Date Label ad-250 ad-60 ad+3 ad+20 price_sdate price_edate price_adate Profit
0 QADA.O 28/06/2021 1 21/10/2020 29/04/2021 01/07/2021 18/07/2021 63.18 86.6 86.6 37.06869
1 RAVN.O 21/06/2021 1 14/10/2020 22/04/2021 24/06/2021 11/07/2021 33.09 57.66 57.77 74.25204
2 LDL 21/06/2021 1 14/10/2020 22/04/2021 24/06/2021 11/07/2021 30.03 60.48 60.67 101.3986
3 SYKE.OQ^H21 18/06/2021 1 11/10/2020 19/04/2021 21/06/2021 08/07/2021 37.67 53.45 53.82 41.8901
4 MCF 08/06/2021 1 01/10/2020 09/04/2021 11/06/2021 28/06/2021 2.29 4.46 4.53 94.75983

Next we retrieve market performance data, which is in our case S&P 500 index, and calculate and report market return for 2020 and 2021.

    	
            

# retrieve data for S&P 500 for the specified timestamp

market_data_2020, err =ek.get_data('.SPX', fields = ['TR.PriceClose.Date','TR.PriceClose'] , parameters={'SDate': '2020-01-01', 'Edate': '2020-12-31'})

market_data_2021, err =ek.get_data('.SPX', fields = ['TR.PriceClose.Date','TR.PriceClose'] , parameters={'SDate': '2021-01-01', 'Edate': '2021-07-01'})

 

 

# calculate and report market return

market_ret_2020 = round((market_data_2020.iloc[-1,2] - market_data_2020.iloc[0,2])/market_data_2020.iloc[-1,2]*100,2)

market_ret_2021 = round((market_data_2021.iloc[-1,2] - market_data_2021.iloc[0,2])/market_data_2021.iloc[-1,2]*100,2)

 

print(f'Market Return for 2020 is: {market_ret_2020}%')

print(f'Market Return for 2020 is: {market_ret_2021}%')

Market Return for 2020 is: 13.99%
Market Return for 2020 is: 13.05%

As in previous sections, we first calculate and report portfolio investment returns from the portfolio created by the model based on the entire dataset

    	
            

pred_res_lr.insert(loc = len(pred_res_lr.columns), column = "Profit", value = companies['Profit'])

pred_res_lr.head()

  Announcement Date Company RIC Label Probability_target Class Outcome Abnormal Return Profit
0 28/06/2021 QADA.O 1 0.403778 0 FN 7.043903 37.06869
1 21/06/2021 RAVN.O 1 0.367537 0 FN 37.238713 74.25204
2 21/06/2021 LDL 1 0.487374 0 FN 67.654569 101.3986
3 18/06/2021 SYKE.OQ^H21 1 0.416753 0 FN 19.197178 41.8901
4 08/06/2021 MCF 1 0.971662 1 TP -12.068867 94.75983
    	
            

port_return_2020 = round(pred_res_lr['Profit'].loc[(pred_res_lr['Class'] == 1) & (pred_res_lr['Announcement Date'] < '2021-01-01')].mean(),2)

port_return_2021 = round(pred_res_lr['Profit'].loc[(pred_res_lr['Class'] == 1) & (pred_res_lr['Announcement Date'] >= '2021-01-01')].mean(),2)

 

print(f'Portfolio Return for 2020 is: {port_return_2020}%')

print(f'Portfolio Return for 2020 is: {port_return_2021}%')

 

print(f'\nMarket adjusted Portfolio return for 2020 is: {port_return_2020 - market_ret_2020}%')

print(f'Market adjusted Portfolio return for 2021 is: {port_return_2021 - market_ret_2021}%')

print(f'Average Market adjusted Portfolio return is: {((port_return_2020 - market_ret_2020) + (port_return_2021 - market_ret_2021))/2}%')

Portfolio Return for 2020 is: 23.88%
Portfolio Return for 2020 is: 34.81%

Market adjusted Portfolio return for 2020 is: 9.889999999999999%
Market adjusted Portfolio return for 2021 is: 21.76%
Average Market adjusted Portfolio return is: 15.825%

Then we calculate and report portfolio investment returns from the portfolio created by the model trained on Cluster 0 companies.

    	
            

profit_list = companies['Profit'].tolist()

 

profit_list_lr_0 = [profit_list[index] for index in idx_lr_0]

pred_res_lr_0.insert(loc = len(pred_res_lr_0.columns), column = "Profit", value = profit_list_lr_0)

pred_res_lr_0.head()

  Announcement Date Company RIC Label Cluster Probability_target Class Outcome Abnormal Return Profit
0 28/06/2021 QADA.O 1 0 0.489038 1 TP 7.043903 37.06869
1 21/06/2021 RAVN.O 1 0 0.363049 0 FN 37.23871 74.25204
3 18/06/2021 SYKE.OQ^H21 1 0 0.389464 0 FN 19.19718 41.8901
4 08/06/2021 MCF 1 0 0.975591 1 TP -12.0689 94.75983
7 06/06/2021 ALU.AX 1 0 0.418407 0 FN 37.30235 6.649014
    	
            

port_return_2020_lr_0 = round(pred_res_lr_0['Profit'].loc[(pred_res_lr_0['Class'] == 1) & (pred_res_lr_0['Announcement Date'] < '2021-01-01')].mean(),2)

port_return_2021_lr_0 = round(pred_res_lr_0['Profit'].loc[(pred_res_lr_0['Class'] == 1) & (pred_res_lr_0['Announcement Date'] >= '2021-01-01')].mean(),2)

 

print(f'Portfolio Return for 2020 is: {port_return_2020_lr_0}%')

print(f'Portfolio Return for 2020 is: {port_return_2021_lr_0}%')

 

print(f'\nMarket adjusted Portfolio return for 2020 is: {port_return_2020_lr_0 - market_ret_2020}%')

print(f'Market adjusted Portfolio return for 2021 is: {port_return_2021_lr_0 - market_ret_2021}%')

print(f'Average Market adjusted Portfolio return is: {((port_return_2020_lr_0 - market_ret_2020) + (port_return_2021_lr_0 - market_ret_2021))/2}%')

Portfolio Return for 2020 is: 39.78%
Portfolio Return for 2020 is: 25.62%

Market adjusted Portfolio return for 2020 is: 25.79%
Market adjusted Portfolio return for 2021 is: 12.57%
Average Market adjusted Portfolio return is: 19.18%

Finally, we calculate and report portfolio investment returns from the portfolio created by the model trained on the Cluster 1 companies.

    	
            

profit_list_lr_1 = [profit_list[index] for index in idx_lr_1]

pred_res_lr_1.insert(loc = len(pred_res_lr_1.columns), column = "Profit", value = profit_list_lr_1)

pred_res_lr_1.head()

  Announcement Date Company RIC Label Cluster Probability_target Class Outcome Abnormal Return Profit
2 21/06/2021 LDL 1 0 0.535761 0 FN 67.65457 101.3986
5 07/06/2021 QTS.N^I21 1 0 0.52956 0 FN 23.58232 24.80608
6 07/06/2021 USCR.OQ^H21 1 0 0.461561 0 FN -6.41504 84.63848
9 28/05/2021 WBT 1 0 0.612057 1 TP 21.79022 77.27273
12 04/05/2021 UFS 1 0 0.55886 0 FN 10.25247 73.30174
    	
            

port_return_2020_lr_1 = round(pred_res_lr_1['Profit'].loc[(pred_res_lr_1['Class'] == 1) & (pred_res_lr_1['Announcement Date'] < '2021-01-01')].mean(),2)

port_return_2021_lr_1 = round(pred_res_lr_1['Profit'].loc[(pred_res_lr_1['Class'] == 1) & (pred_res_lr_1['Announcement Date'] >= '2021-01-01')].mean(),2)

 

print(f'Portfolio Return for 2020 is: {port_return_2020_lr_1}%')

print(f'Portfolio Return for 2020 is: {port_return_2021_lr_1}%')

 

print(f'\nMarket adjusted Portfolio return for 2020 is: {port_return_2020_lr_1 - market_ret_2020}%')

print(f'Market adjusted Portfolio return for 2021 is: {port_return_2021_lr_1 - market_ret_2021}%')

print(f'Average Market adjusted Portfolio return is: {((port_return_2020_lr_1 - market_ret_2020) + (port_return_2021_lr_1 - market_ret_2021))/2}%')

Portfolio Return for 2020 is: 10.71%
Portfolio Return for 2020 is: 40.78%

Market adjusted Portfolio return for 2020 is: -3.2799999999999994%
Market adjusted Portfolio return for 2021 is: 27.73%
Average Market adjusted Portfolio return is: 12.225000000000001%

    	
            

pred_res_combined = pd.concat([pred_res_lr_0,pred_res_lr_1], ignore_index=True)

 

 

port_return_2020_combined = round(pred_res_combined['Profit'].loc[(pred_res_combined['Class'] == 1) & (pred_res_combined['Announcement Date'] < '2021-01-01')].mean(),2)

port_return_2021_combined = round(pred_res_combined['Profit'].loc[(pred_res_combined['Class'] == 1) & (pred_res_combined['Announcement Date'] >= '2021-01-01')].mean(),2)

print(f'Portfolio Return for 2020 is: {port_return_2020_combined}%')

print(f'Portfolio Return for 2020 is: {port_return_2021_combined}%')

 

print(f'\nMarket adjusted Portfolio return for 2020 is: {port_return_2020_combined - market_ret_2020}%')

print(f'Market adjusted Portfolio return for 2021 is: {port_return_2021_combined - market_ret_2021}%')

print(f'Average Market adjusted Portfolio return is: {((port_return_2020_combined - market_ret_2020) + (port_return_2021_combined - market_ret_2021))/2}%')

Portfolio Return for 2020 is: 27.04%
Portfolio Return for 2020 is: 31.82%

Market adjusted Portfolio return for 2020 is: 13.049999999999999%
Market adjusted Portfolio return for 2021 is: 18.77%
Average Market adjusted Portfolio return is: 15.91%