
Predicting M&A Targets Using ML: Unlocking the potential of NLP based variables

Authors:
Haykaz Aramyan
Developer Advocate Intern

Overview

In this article, I extend the Mergers and Acquisitions (M&A) predictive model published earlier by incorporating a news sentiment variable derived with Natural Language Processing (NLP). The original model used only financial variables, applied logistic regression to identify M&A targets, and showed whether that produces an abnormal return for investors. The purpose of this article, instead, is to test whether NLP-derived news sentiment makes a significant contribution to M&A predictive modeling. To do that, the significance of news sentiment (measured by comparing evaluation metrics) is tested on different Machine Learning (ML) models, including logistic regression, random forest, and XGBoost. As for the NLP models, FinBERT and BERT-RNA are used to calculate sentiment on the news preceding an M&A announcement.

The motivation behind using the news sentiment variable comes from a literature finding suggesting that target companies generate significant run-up returns during the month before the deal announcement. The problem is that abnormal returns may arise not only from a potential future merger announcement but also from other positive news impacting the share price. Thus, overall news sentiment should be evaluated alongside the abnormal return. Our hypothesis is that an abnormal return amid a no- or low-positive news sentiment environment is an indication of an upcoming M&A announcement.
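As a toy illustration of this hypothesis (made-up numbers and thresholds, not the model built later in the article): a company is flagged as a potential target only when its run-up abnormal return is high while its preceding news sentiment is not.

```python
import pandas as pd

# hypothetical run-up abnormal returns (%) and average news sentiment scores in [-1, 1]
df = pd.DataFrame({
    'company': ['A', 'B', 'C'],
    'abnormal_return': [8.0, 9.0, 1.0],
    'news_sentiment': [0.1, 0.8, 0.0],
})
# flag: abnormal run-up amid no/low positive sentiment (thresholds are illustrative)
df['potential_target'] = (df['abnormal_return'] > 5.0) & (df['news_sentiment'] < 0.3)
print(df['potential_target'].to_list())  # [True, False, False]
```

Company B, despite its large run-up, is not flagged because the positive news environment already explains the price move.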

The article has the following structure. In Section 1, datasets for target and non-target companies are constructed. For the target dataset, the RDP Search function is used to get the list of target companies for the specified period. For the non-targets, the PEER screen function is used to request peer companies of the target list. Financial variables for both target and non-target companies are requested via the get_data function. The article utilizes the Refinitiv Data Platform (RDP) API to access the required data. In Section 2, news sentiment prior to the M&A announcement is calculated via NLP techniques. Finally, in Section 3, the performance of different ML models with and without the news sentiment variable, calculated by both FinBERT and BERT-RNA, is evaluated.

Install and import packages

To start, the necessary packages are installed and imported. I use the Refinitiv Data Platform API to get the data; to do that, authorization with an app key is required. The code is built using Python 3.9. The other prerequisite packages are installed below.

    	
!pip install refinitiv.dataplatform
!pip install scikit-learn
!pip install xgboost
!pip install transformers
!pip install openpyxl
!pip install plotly

import configparser
import datetime
import os
import warnings

import pandas as pd
import numpy as np
from numpy import mean, std

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import train_test_split
# LinearRegression is used later in the abnormal return (market model) estimation
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from transformers import BertTokenizer, BertForSequenceClassification
import torch

from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import cross_val_score, cross_validate, RepeatedStratifiedKFold
from scipy.stats import norm

warnings.filterwarnings("ignore")

# import the Refinitiv Data Platform package and read the app key from a text file
import refinitiv.dataplatform as rdp
app_key = open("app_key.txt", "r").read()

rdp.open_desktop_session(app_key)

In addition to the desktop session, I also open an RDP platform session to be able to access the news data on RDP.

    	
            

cfg = configparser.ConfigParser()
cfg.read('rdp.cfg', encoding='utf-8')

APP_KEY = cfg['RDP']['app_key']
RDP_LOGIN = cfg['RDP']['rdp_login']
RDP_PASSWORD = cfg['RDP']['rdp_password']

session = rdp.open_platform_session(
    APP_KEY,
    rdp.GrantPassword(
        username = RDP_LOGIN,
        password = RDP_PASSWORD
    )
)
session.open()

I do not open only the platform session because the eikon legacy functions (such as rdp.legacy.get_data) raise the attribute error below when called on a platform session. This is expected to be resolved in upcoming API versions.

AttributeError: 'PlatformSession' object has no attribute '_get_udf_url'

Section 1: Construct dataset for predictive modeling

In order to train and evaluate any classification model, a dataset with at least two classes is required. Thus, two separate datasets for target and non-target companies are constructed. First, I access M&A data using the Search function of the RDP API. Then, I get the list of target companies and request financial variables using the get_data function. Next, I use the PEER screen function to get peer companies and construct the non-target dataset for the models. Finally, these datasets are merged into a single labeled dataset used to train and evaluate the models.

1.1 Construct dataset for the target group of companies

The list of target companies is requested using DealsMergersAndAcquisitions Search view. The following criteria are used to filter the data and access the ones needed for the current model:

Form of the transaction is equal to Merger or Acquisition - acquisition of majority or partial interest is not included in the model.

Transaction Status is equal to Completed, Pending or Withdrawn - I have included pending and withdrawn deals as well, as those are also claimed to provide abnormal returns for investors.

Transaction Value is greater than USD 100 mln - this is set to exclude very small deals.

Target company is equal to public - we are interested in only public companies as we want to buy the stock of those companies classified as a target by the model.

Acquirer Company Name is not equal to Creditors or Shareholders - this filter is used to include only acquisitions by an actual company.

Transaction Announcement Date is less than 2021-11-15 and greater than 2020-09-15 - the upper limit is set to fix the number of companies; otherwise, every run of the code would add new entries and affect the reproducibility of the model. As for the lower limit, it is set to meet the current restriction of the RDP API, which can return news data for the last 15 months only.

Target Country is equal to US or UK - in the initial model I included only US companies; however, to increase the sample size, deals from the UK are included here as well. The UK is the closest to the US in terms of M&A activity and market reaction to deal announcements.

The code below requests M&A data using the filters specified above and orders the results by announcement date in descending order. More on how you can use Search, including guidance, examples, and tips for determining the possible approaches, from simple discovery through experimentation to more advanced techniques, is presented in this article.

    	
            

#build search query with the specified filters
MnA = rdp.search(
    view = rdp.SearchViews.DealsMergersAndAcquisitions,

    #specify filtering properties
    filter = "((AcquirerCompanyName ne 'Creditors' and AcquirerCompanyName ne 'Shareholder') and (TargetCountry eq 'US' or TargetCountry eq 'UK')"
    + " and TransactionValueIncludingNetDebtOfTarget ge 100 and TargetPublicStatus eq 'Public')"
    + " and (TransactionStatus eq 'Completed' or TransactionStatus eq 'Pending' or TransactionStatus eq 'Withdrawn')"
    + " and (FormOfTransactionName xeq 'Merger' or FormOfTransactionName xeq 'Acquisition') and (TransactionAnnouncementDate le 2021-11-15 and TransactionAnnouncementDate ge 2020-09-15)",

    #select only the required fields and order them by announcement date
    #then set the number of items to 10000; the default value is 100
    select = 'TransactionAnnouncementDate, TargetCompanyName, TargetRIC',
    order_by = 'TransactionAnnouncementDate desc',
    top = 10000)

#remove companies which don't have a RIC
MnA = MnA.dropna(subset = ['TargetRIC']).reset_index(drop = True)

print(f'Number of M&A deals for the specified period is {len(MnA)}')
MnA.head()

Number of M&A deals for the specified period is 324

  TargetRIC TransactionAnnouncementDate TargetCompanyName
0 [CONE.O] 2021-11-15T00:00:00.000Z CyrusOne Inc
1 [COR^L21] 2021-11-15T00:00:00.000Z CoreSite Realty Corp
2 [LAACZ.PK^L21] 2021-11-15T00:00:00.000Z LAACO Ltd
3 [CSPR.K] 2021-11-15T00:00:00.000Z Casper Sleep Inc
4 [MCFE.O] 2021-11-08T00:00:00.000Z McAfee Corp
... ... ... ...
324 [LKSDQ.PK^C21] 2020-09-15T00:00:00.000Z LSC Communications Inc

325 rows × 3 columns

One very valid question that may pop up from the code above, especially regarding the filter properties and values, is how to identify the exact names and possible values of the filter properties. For example, how do we know that the property name for the country where the target company is based is "TargetCountry", and that the possible value for the United Kingdom is "UK" rather than "United Kingdom"? While Search provides a significant amount of content, power, and flexibility, there are challenges in navigating the hundreds of available properties when deciding how to extract data. In this article, Nick Zincone outlines a convenient tool that significantly simplifies the discovery of searchable properties when programmatically building Search queries.
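As a toy illustration (not part of the RDP API), a small helper like the hypothetical or_clause below can assemble the OR-ed parts of such a filter string from a property name and a list of values; the property names and operators must still match the real Search metadata:

```python
# hypothetical helper: build an OR-ed clause of an RDP-style search filter
def or_clause(prop, values, op="eq"):
    """Join `prop op 'value'` pairs with ' or ' and wrap them in parentheses."""
    return "(" + " or ".join(f"{prop} {op} '{v}'" for v in values) + ")"

country = or_clause("TargetCountry", ["US", "UK"])
status = or_clause("TransactionStatus", ["Completed", "Pending", "Withdrawn"])
print(country)  # (TargetCountry eq 'US' or TargetCountry eq 'UK')
```

Composing the filter from such pieces makes long queries easier to read and modify than one hand-written string.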

I built the search query following the referenced article, and the resulting output from the code above is 324 M&A deals from the US and UK announced between September 15, 2020, and November 15, 2021. Next, I create a list of RICs and a list of announcement dates, including one for 30 days prior to the announcement. These lists are then used to get financial data for the target companies.

    	
            

#create list of RICs
rics = MnA['TargetRIC'].to_list()
rics = [rics[i][0] for i in range(len(rics))]

#create list of announcement dates, including one for 30 days prior to the announcement
dates = pd.DataFrame(MnA['TransactionAnnouncementDate'])
dates.insert(loc = len(dates.columns), column = 'AD-30', value = pd.to_datetime(dates['TransactionAnnouncementDate']) - datetime.timedelta(30))
dates.insert(loc = len(dates.columns), column = 'rics', value = rics)

dates_30 = dates['AD-30'].dt.strftime('%Y-%m-%d').to_list()
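The 30-day offset and date formatting above can be checked on a single date (toy data, pandas only):

```python
import datetime
import pandas as pd

# hypothetical announcement date, mirroring the AD-30 offset logic above
ad = pd.to_datetime("2021-11-15T00:00:00.000Z")
ad_30 = ad - datetime.timedelta(30)
print(ad_30.strftime('%Y-%m-%d'))  # 2021-10-16
```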

Below, I use the get_data function to request the specified financial variables for the 324 target companies. It should be noted that the initial fields come from my first article, where I outline the motivation behind choosing them and run a correlation analysis to remove variables that may introduce multicollinearity. I also use try/except statements to handle possible request errors (such as runtime, connection, or bad request errors) and rerun the failed requests.

    	
            

#specify variables
fields = ["TR.TRBCIndustry", "TR.F.MktCap", "TR.F.ReturnAvgTotEqPctTTM",
          "TR.F.IncAftTaxMargPctTTM", "TR.F.GrossProfMarg", "TR.F.NetIncAfterMinIntr", "TR.F.TotCap", "TR.F.OpMargPctTTM",
          "TR.F.ReturnCapEmployedPctTTM", "TR.F.NetCashFlowOp", "TR.F.LeveredFOCF", "TR.F.TotRevenue",
          "TR.F.TotRevenue(SDate = -1Y)", "TR.F.TotRevenue(SDate = -2Y)", "TR.F.TotRevenue(SDate = -3Y)", "TR.F.TotAssets", "TR.F.CurrRatio", "TR.F.WkgCaptoTotAssets",
          "TR.PriceToBVPerShare", "TR.PriceToSalesPerShare", "TR.F.EBITDA", "TR.EV", "TR.EVToSales", "TR.F.TotShHoldEq",
          "TR.F.DebtTot", "TR.F.NetDebttoTotCap", "TR.TotalDebtToEV", "TR.F.NetDebtPerShr", "TR.F.CashCashEquivTot"]

#create empty lists and a dataframe to store the requested values
target_data = pd.DataFrame()
error_target = []
error_target_dates = []

for i in range(len(rics)):
    try:
        #get data for the fields as of the specified date
        df, err = rdp.legacy.get_data(rics[i], fields = fields, parameters = {'SDate': dates_30[i]})
        #add announcement date to the resulting dataframe
        df.insert(loc = 1, column = 'AD', value = pd.to_datetime(dates['TransactionAnnouncementDate'][i]))
        #append company data to the main dataframe
        target_data = pd.concat([target_data, df], ignore_index = True, axis = 0)
    #if an error is returned, store the RIC and request date in a separate list
    except:
        error_target.append(rics[i])
        error_target_dates.append(dates_30[i])
        continue

#rerun the data request code above for the companies in the error list
for i in range(len(error_target)):
    df, err = rdp.legacy.get_data(error_target[i], fields = fields, parameters = {'SDate': error_target_dates[i]})
    target_data = pd.concat([target_data, df], ignore_index = True, axis = 0)

Further, I drop some of the variables that were eliminated from the model; again, please see the previous article for more details.

    	
            

#convert announcement date into date format
target_data['AD'] = target_data['AD'].apply(lambda a: pd.to_datetime(a).date())

#drop some of the fields as specified in my previous article
target_data = target_data.drop(columns = ['TRBC Industry Name', 'Market Capitalization', 'Income after Tax Margin - %, TTM', 'Return on Capital Employed - %, TTM',
                                          'Net Cash Flow from Operating Activities', 'Working Capital to Total Assets', 'Enterprise Value To Sales (Daily Time Series Ratio)',
                                          'Net Debt to Total Capital', 'Total Debt To Enterprise Value (Daily Time Series Ratio)', 'Return on Average Total Equity - %, TTM', 'Total Assets'])

#remove NAs and insert a column for the date 30 days prior to the M&A announcement
target_data.dropna(inplace = True)
target_data.insert(loc = 1, column = 'AD-30', value = target_data['AD'] - datetime.timedelta(30))
target_data.reset_index(drop = True, inplace = True)
target_data.head()

  Instrument AD-30 AD Gross Profit Margin - % Net Income after Minority Interest Total Capital Operating Margin - %, TTM Free Cash Flow Revenue from Business Activities - Total Revenue from Business Activities - Total.1 ... Revenue from Business Activities - Total.3 Current Ratio Price To Book Value Per Share (Daily Time Series Ratio) Price To Sales Per Share (Daily Time Series Ratio) Earnings before Interest Taxes Depreciation & Amortization Enterprise Value (Daily Time Series) Total Shareholders' Equity incl Minority Intr & Hybrid Debt Debt - Total Net Debt per Share Cash & Cash Equivalents - Total
0 COR 16/10/2021 15/11/2021 40.21891 79309000 1788389000 22.855325 3427000 606824000 572727000 ... 4.82E+08 0.39338 320.578875 9.957096 155874000 8.01E+09 72478000 1715911000 39.991626 5543000
1 CSPR.K 16/10/2021 15/11/2021 51.0835 -89555000 91013000 -10.707998 -63503000 497000000 439258000 ... 2.51E+08 1.39422 -7.884344 0.31469 -60493000 1.91E+08 25467000 65546000 -0.57663 88922000
2 DVD 10/10/2021 09/11/2021 26.66113 7482000 69027000 25.492173 1195000 38543000 45963000 ... 4.67E+07 3.47706 1.068976 1.019694 5646000 7.14E+07 69027000 0 -0.345537 12568000
3 MCFE.O 09/10/2021 08/11/2021 62.31934 -118000000 2187000000 10.883691 718000000 2906000000 2635000000 ... 2.08E+09 0.38323 -0.979807 4.451983 710000000 2.05E+10 -1800000000 3987000000 8.768888 231000000
4 CPLG.K 09/10/2021 08/11/2021 -10.7056 -178000000 1682000000 -30.140187 -62000000 411000000 812000000 ... 8.36E+08 4.58824 1.060099 2.12139 -12000000 1.40E+09 857000000 825000000 11.758621 143000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
208 LKSDQ.PK^C21 16/08/2020 15/09/2020 8.77931 -295000000 838000000 -2.776801 -75000000 3326000000 3326000000 ... 3.60E+09 0.88242 -0.008859 0.000588 85000000 -9.43E+07 -72000000 910000000 24.047799 105000000

209 rows × 21 columns

After removing target companies with missing values, we end up with 209 companies in our target dataset.

Since data retrieval takes a relatively long time, I store the data in an Excel file once the code is fully executed and subsequently read it from there. The dataset is available in the GitHub folder.

    	
            

target_data.to_excel("mergerdata/target_data.xlsx")

#read target data from the stored Excel file
target_data = pd.read_excel("mergerdata/target_data.xlsx").drop(columns = ['Unnamed: 0'])

1.2 Construct dataset for the non-target group of companies

The non-target sample is constructed from companies similar to the target ones. The best way to identify similar companies is to look at their peers, for which I use the PEER screen function. The peer group for each company, together with the variables to be used in the prediction model, is requested using the function below. The function takes a RIC and a date as input and returns a dataframe containing peer companies along with the specified financial variables.

    	
            

def peers(RIC, date):
    '''
    Get the peer group for an individual RIC along with the required variables for the models

    Dependencies
    ------------
    Python library 'refinitiv.dataplatform' version 1.0.0a8.post1
    Python library 'pandas' version 1.3.3

    Parameters
    -----------
        Input:
            RIC (str): Refinitiv Identification Number (RIC) of a stock
            date (str): Date as of which the peer group and variables are requested - in yyyy-mm-dd
        Output:
            peer_group (DataFrame): Dataframe of 50 peer companies along with the requested variables
    '''
    # specify variables for the request
    fields = ["TR.F.GrossProfMarg", "TR.F.NetIncAfterMinIntr", "TR.F.TotCap", "TR.F.OpMargPctTTM", "TR.F.LeveredFOCF", "TR.F.TotRevenue",
              "TR.F.TotRevenue(SDate = -1Y)", "TR.F.TotRevenue(SDate = -2Y)", "TR.F.TotRevenue(SDate = -3Y)", "TR.F.CurrRatio",
              "TR.PriceToBVPerShare", "TR.PriceToSalesPerShare", "TR.F.EBITDA", "TR.EV", "TR.F.TotShHoldEq",
              "TR.F.DebtTot", "TR.F.NetDebtPerShr", "TR.F.CashCashEquivTot"]

    #search for peers
    instruments = 'SCREEN(U(IN(Peers("{}"))))'.format(RIC)

    #request variable data for each peer
    peer_group, error = rdp.legacy.get_data(instruments = instruments, fields = fields, parameters = {'SDate': date})

    return peer_group

Below, I store the RICs and dates in separate lists and call the function above for each RIC in the list. Then, I drop peers with missing values and merge the resulting dataframe into the main dataframe of peer companies. The code uses a try/except statement to catch API request errors so that the failed requests can be rerun.

    	
            

#store rics and dates in separate lists
target_rics = target_data['Instrument'].to_list()
target_dates = target_data['AD-30'].dt.strftime('%Y-%m-%d').to_list()

#create empty lists for error data and a dataframe to store the selected peers
no_peers = []
no_dates = []
peer_data = pd.DataFrame()

for i in range(len(target_rics)):
    try:
        #call the peers function for each target company in the list
        vals = peers(target_rics[i], target_dates[i])
        #drop peers with missing values
        vals.dropna(inplace = True)
        #add a column for 30 days prior to the M&A announcement
        vals.insert(loc = 1, column = 'AD-30', value = target_dates[i])
        #append the target company's peer data to the main dataframe of all peers
        peer_data = pd.concat([peer_data, vals], ignore_index = True, axis = 0)

    #if an error is returned, store the RIC and request date in a separate list
    except:
        no_peers.append(target_rics[i])
        no_dates.append(target_dates[i])
        continue

    	
            

#rerun the data request code above for the companies in the error list
for i in range(len(no_peers)):
    try:
        vals = peers(no_peers[i], no_dates[i])
        vals.dropna(inplace = True)
        peer_data = pd.concat([peer_data, vals], ignore_index = True, axis = 0)
    except:
        continue

Since data retrieval takes a relatively long time, I store the data in an Excel file once the code is fully executed and subsequently read it from there. The dataset is available in the GitHub folder.

    	
            

peer_data.to_excel('mergerdata/peer_data.xlsx')

#read peer data from the Excel file
peer_data = pd.read_excel('mergerdata/peer_data.xlsx').drop(columns = ['Unnamed: 0'])
peer_data.insert(loc = 2, column = 'AD', value = pd.to_datetime(peer_data['AD-30']) + datetime.timedelta(30))
peer_data.head()

  Instrument AD-30 AD Gross Profit Margin - % Net Income after Minority Interest Total Capital Operating Margin - %, TTM Free Cash Flow Revenue from Business Activities - Total Revenue from Business Activities - Total.1 ... Revenue from Business Activities - Total.3 Current Ratio Price To Book Value Per Share (Daily Time Series Ratio) Price To Sales Per Share (Daily Time Series Ratio) Earnings before Interest Taxes Depreciation & Amortization Enterprise Value (Daily Time Series) Total Shareholders' Equity incl Minority Intr & Hybrid Debt Debt - Total Net Debt per Share Cash & Cash Equivalents - Total
0 CONE.OQ 16/10/2021 15/11/2021 16.69086 4.14E+07 6.00E+09 4.585427 -4.54E+08 1.03E+09 9.81E+08 ... 6.72E+08 1.74655 3.626125 8.846605 1.10E+08 1.32E+10 2.56E+09 3.44E+09 26.29221 2.71E+08
1 SWCH.N 16/10/2021 15/11/2021 45.3667 1.55E+07 1.66E+09 19.471734 -1.13E+08 5.12E+08 4.62E+08 ... 3.78E+08 1.18579 20.444834 11.715779 2.40E+08 7.98E+09 6.11E+08 1.05E+09 3.980943 9.07E+07
2 EQIX.OQ 16/10/2021 15/11/2021 48.74857 3.70E+08 2.31E+10 18.495061 2.73E+07 6.00E+09 5.56E+09 ... 4.37E+09 1.28863 6.587149 11.065227 2.54E+09 8.22E+10 1.06E+10 1.25E+10 121.638963 1.60E+09
3 T.N 16/10/2021 15/11/2021 36.86772 -5.18E+09 3.36E+11 15.741272 2.75E+10 1.72E+11 1.81E+11 ... 1.61E+11 0.81982 1.130615 1.042513 5.45E+10 3.69E+11 1.79E+11 1.57E+11 20.699777 9.74E+09
4 PEB.N 16/10/2021 15/11/2021 -36.37443 -3.92E+08 5.59E+09 -72.648225 -3.27E+08 4.43E+08 1.61E+09 ... 7.69E+08 0.71356 0.917671 7.480777 -9.60E+07 4.99E+09 3.26E+09 2.33E+09 16.855754 1.24E+08
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7393 SP.OQ 16/08/2020 15/09/2020 21.25361 4.88E+07 7.43E+08 4.999359 6.32E+07 9.35E+08 9.35E+08 ... 9.11E+08 0.57085 1.674494 0.308523 1.20E+08 7.95E+08 3.74E+08 3.69E+08 15.028087 2.41E+07

7394 rows × 21 columns

The resulting dataset consists of 7394 peer companies of the 209 targets. However, there are many duplicates in this list, as many target companies share the same peers, and some of the peers were targets themselves. The data is further filtered and processed to eliminate the duplicates, as presented next in this article.

1.3 Merge the two datasets, add remaining variables and labels

Here, I merge the target and non-target datasets, remove duplicates, and then calculate and add the remaining variables, which are not directly accessible through API calls. First, labels are added to the datasets, and the two datasets are merged into one.

    	
            

#add labels
target_data['Label'] = 1
peer_data['Label'] = 0

#merge the target and non-target datasets
all_data = pd.concat([target_data, peer_data], ignore_index = True, axis = 0).reset_index(drop = True)

Further, I remove the duplicate peers and those already in the target list. Here, I extract the first part of each RIC (before the ".") into a separate column and drop duplicates on that column. This allows matching companies whose RIC has changed due to a corporate event.

    	
            

#add a column to store the first part of the RIC
all_data.insert(loc = 1, column = 'RIC_beg', value = [all_data['Instrument'][i].split(".")[0] for i in range(len(all_data))])

#drop duplicates on the newly added column and remove it afterwards
all_data = all_data.drop_duplicates(subset = ['RIC_beg'], keep = 'first')
all_data.drop(columns = ['RIC_beg'], inplace = True)
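As a toy check of this prefix-based de-duplication (hypothetical RICs): the same company listed under an old and a new code collapses to one row.

```python
import pandas as pd

# hypothetical RICs: 'ABC.N' and 'ABC.OQ^L21' share the prefix 'ABC'
df = pd.DataFrame({'Instrument': ['ABC.N', 'ABC.OQ^L21', 'XYZ.O']})
df['RIC_beg'] = df['Instrument'].str.split('.').str[0]
deduped = df.drop_duplicates(subset=['RIC_beg'], keep='first').drop(columns=['RIC_beg'])
print(deduped['Instrument'].to_list())  # ['ABC.N', 'XYZ.O']
```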

After the dataset with no duplicates is ready, I add the rest of the variables to be used in the ML models that are not directly accessible via the API.

    	
            

#add two dates, 60 and 250 days before the M&A announcement, which specify the
#estimation and observation periods for the abnormal return calculation
all_data.insert(loc = 1, column = 'AD-60', value = all_data['AD'] - datetime.timedelta(60))
all_data.insert(loc = 1, column = 'AD-250', value = all_data['AD'] - datetime.timedelta(250))

#calculate/add several variables that are used in the ML models
all_data.insert(loc = len(all_data.columns), column = 'Profit to Capital', value = all_data['Net Income after Minority Interest']/all_data['Total Capital'])
all_data.insert(loc = len(all_data.columns), column = 'Free Cash Flow to Sales', value = all_data['Free Cash Flow']/all_data['Revenue from Business Activities - Total'])
all_data.insert(loc = len(all_data.columns), column = 'Cash to Capital', value = all_data['Cash & Cash Equivalents - Total']/all_data['Total Capital'])
all_data.insert(loc = len(all_data.columns), column = 'EV to EBITDA', value = all_data['Enterprise Value (Daily Time Series)']/all_data['Earnings before Interest Taxes Depreciation & Amortization'])
all_data.insert(loc = len(all_data.columns), column = 'Debt to EV', value = all_data['Debt - Total']/all_data['Enterprise Value (Daily Time Series)'])
all_data.insert(loc = len(all_data.columns), column = 'Sales_growth', value =
                ((all_data['Revenue from Business Activities - Total'] - all_data['Revenue from Business Activities - Total.1'])/all_data['Revenue from Business Activities - Total.1']*100 +
                 (all_data['Revenue from Business Activities - Total.1'] - all_data['Revenue from Business Activities - Total.2'])/all_data['Revenue from Business Activities - Total.2']*100 +
                 (all_data['Revenue from Business Activities - Total.2'] - all_data['Revenue from Business Activities - Total.3'])/all_data['Revenue from Business Activities - Total.3']*100)/3)

#if there is no sales value for some year, the sales growth variable is returned as "inf", so we need to remove those instances
all_data.drop(all_data.loc[all_data['Sales_growth'] == np.inf].index, inplace = True)
#drop NA values and reset the index on the final dataset
all_data.dropna(inplace = True)
all_data = all_data.reset_index(drop = True)
all_data.head()

  Instrument AD-250 AD-60 AD-30 AD Gross Profit Margin - % Net Income after Minority Interest Total Capital Operating Margin - %, TTM Free Cash Flow ... Debt - Total Net Debt per Share Cash & Cash Equivalents - Total Label Profit to Capital Free Cash Flow to Sales Cash to Capital EV to EBITDA Debt to EV Sales_growth
0 COR 10/03/2021 16/09/2021 16/10/2021 15/11/2021 40.21891 79309000 1.79E+09 22.855325 3427000 ... 1.72E+09 39.991626 5543000 1 0.044347 0.005647 0.003099 51.388454 0.214218 8.048231
1 CSPR.K 10/03/2021 16/09/2021 16/10/2021 15/11/2021 51.0835 -89555000 9.10E+07 -10.707998 -63503000 ... 6.55E+07 -0.57663 88922000 1 -0.98398 -0.127773 0.977025 -3.154113 0.343529 29.404002
2 DVD 04/03/2021 10/09/2021 10/10/2021 09/11/2021 26.66113 7482000 6.90E+07 25.492173 1195000 ... 0.00E+00 -0.345537 12568000 1 0.108392 0.031004 0.182074 12.643395 0 -5.932295
3 MCFE.O 03/03/2021 09/09/2021 09/10/2021 08/11/2021 62.31934 -118000000 2.19E+09 10.883691 718000000 ... 3.99E+09 8.768888 231000000 1 -0.053955 0.247075 0.105624 28.844474 0.194682 11.902193
4 CPLG.K 03/03/2021 09/09/2021 09/10/2021 08/11/2021 -10.7056 -178000000 1.68E+09 -30.140187 -62000000 ... 8.25E+08 11.758621 143000000 1 -0.105826 -0.150852 0.085018 -116.99623 0.587626 -17.358218
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3848 KNL.N^G21 09/01/2020 17/07/2020 16/08/2020 15/09/2020 38.44269 67500000 8.74E+08 7.734148 88300000 ... 4.46E+08 8.789553 8500000 0 0.077266 0.06183 0.00973 6.547313 0.385728 8.204436

3849 rows × 30 columns
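The Sales_growth averaging can be checked on toy revenue figures (made-up numbers): three year-over-year percentage changes are averaged, and a zero revenue in any year produces inf.

```python
import numpy as np
import pandas as pd

# toy revenues for the last four fiscal years (most recent first); made-up numbers
rev = pd.DataFrame({'y0': [110.0, 50.0], 'y1': [100.0, 0.0],
                    'y2': [80.0, 10.0], 'y3': [64.0, 10.0]})
# average of the three year-over-year growth rates, as in the Sales_growth column
growth = ((rev['y0'] - rev['y1']) / rev['y1'] * 100 +
          (rev['y1'] - rev['y2']) / rev['y2'] * 100 +
          (rev['y2'] - rev['y3']) / rev['y3'] * 100) / 3
print(round(growth.iloc[0], 6), np.isinf(growth.iloc[1]))  # 20.0 True
```

The second toy company has a zero revenue year, so its growth is inf and would be dropped by the filter above.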

The last variable that needs to be calculated separately is the abnormal return, for which I use the function from the previous article.

    	
            

def ab_return(RIC, sdate, edate, announce_date, period):
    '''
    Calculate the abnormal return of a given security during an observation period based on the Event Study Methodology (MacKinlay, 1997)

    Dependencies
    ------------
    Python library 'refinitiv.dataplatform' version 1.0.0a8.post1
    Python library 'numpy' version 1.20.1
    Python library 'pandas' version 1.3.3
    Python library 'Sklearn' version 0.24.1

    Parameters
    -----------
        Input:
            RIC (str): Refinitiv Identification Number (RIC) of a stock
            sdate (str): Starting date of the estimation period - in yyyy-mm-dd
            edate (str): End date of the estimation period, which is also the starting date of the observation period - in yyyy-mm-dd
            announce_date (str): End date of the observation period, which is assumed to be the M&A announcement date or any other specified date
            period (int): Number of trading days in the observation period. The abnormal return is calculated for each date in this period
        Output:
            CAR (float): Cumulative Abnormal Return (CAR) for the given stock
            abnormal_returns (DataFrame): Dataframe containing abnormal returns during the observation period
    '''

    #create an empty dataframe to store abnormal returns
    abnormal_returns = pd.DataFrame({'#': np.arange(start = 1, stop = period)})

    ## estimate the linear regression model parameters based on the estimation period
    # get timeseries for the specified RIC and the market proxy (S&P 500 in our case) for both the estimation and observation periods
    df_all_per = rdp.legacy.get_timeseries([RIC, '.SPX'],
                                           start_date = sdate,
                                           end_date = announce_date,
                                           interval = 'daily',
                                           fields = 'CLOSE')

    # slice the estimation period
    df_all_per.reset_index(inplace = True)
    df_est_per = df_all_per.loc[(df_all_per['Date'] <= edate)]

    # calculate percentage returns and their means for the stock and the market proxy
    df_est_per.insert(loc = len(df_est_per.columns), column = "Return_stock", value = df_est_per[RIC].pct_change()*100)
    df_est_per.insert(loc = len(df_est_per.columns), column = "Return_market", value = df_est_per[".SPX"].pct_change()*100)
    mean_stock = df_est_per["Return_stock"].mean()
    mean_index = df_est_per["Return_market"].mean()
    df_est_per.dropna(inplace = True)

    # reshape the dataframe and estimate the parameters of the linear regression (market model)
    y = df_est_per["Return_stock"].to_numpy().reshape(-1, 1)
    X = df_est_per["Return_market"].to_numpy().reshape(-1, 1)
    model = LinearRegression().fit(X, y)
    Beta = model.coef_[0][0]
    intercept = model.intercept_[0]

    # slice the observation period
    df_obs_per = df_all_per.loc[(df_all_per['Date'] >= edate)]

    # calculate percentage returns for the stock and the market proxy
    df_obs_per.insert(loc = len(df_obs_per.columns), column = "Return_stock", value = df_obs_per[RIC].pct_change()*100)
    df_obs_per.insert(loc = len(df_obs_per.columns), column = "Return_market", value = df_obs_per[".SPX"].pct_change()*100)

    df_obs_per.dropna(inplace = True)
    df_obs_per.reset_index(inplace = True)

    # calculate and return the cumulative abnormal return (CAR) for the observation period
    abnormal_returns.insert(loc = len(abnormal_returns.columns), column = str(RIC)+'_Date', value = df_obs_per["Date"])
    abnormal_returns.dropna(inplace = True)
    abnormal_returns.insert(loc = len(abnormal_returns.columns), column = str(RIC)+'_return', value = df_obs_per["Return_stock"] - (intercept + Beta * df_obs_per["Return_market"]))
    CAR = abnormal_returns.iloc[:, 2].sum()
    return CAR, abnormal_returns

After the function is defined, I store the RICs and dates in separate lists so I can call the function on them from a loop.

RIC = all_data['Instrument'].to_list()

sdate = all_data['AD-250'].dt.strftime('%Y-%m-%d').to_list()

edate = all_data['AD-60'].dt.strftime('%Y-%m-%d').to_list()

announce_date = all_data['AD'].dt.strftime('%Y-%m-%d').to_list()

Then I run the function above for each company and append the cumulative abnormal return to a list, which is further inserted into the main dataframe.

return_list = []

for i in range(len(RIC)):

    try:

        #run abnormal return function for all of the companies

        CAR, abnormal_returns = ab_return(RIC[i], sdate[i], edate[i], announce_date[i], 60)

        #calculate the cumulative abnormal return for the observation period and append the value to the return list

        return_list.append(abnormal_returns[str(RIC[i]) + '_return'].iloc[:len(abnormal_returns)-4].sum())

    except Exception:

        #in case of error append the return list with NAN value

        return_list.append(np.nan)

        continue

 

#insert the return list to our original dataset and remove NA values

all_data.insert(loc = len(all_data.columns), column = 'AR', value = return_list)

all_data.dropna(inplace = True)

As before, since data retrieval takes a relatively long time, I store the data in an Excel file once the code is fully executed and further read from there. The dataset is available in the GitHub folder.

all_data.to_excel('mergerdata/all_data.xlsx')

 

#read data from a stored excel file

all_data = pd.read_excel('mergerdata/all_data.xlsx').drop(columns = ['Unnamed: 0'])

targets = len(all_data.loc[all_data['Label'] == 1])

non_targets = len(all_data.loc[all_data['Label'] == 0])

print(f'Number of target companies in the dataset: {targets}')

print(f'Number of non-target companies in the dataset: {non_targets}')

Number of target companies in the dataset: 182
Number of non-target companies in the dataset: 3648

The dataset consists of 182 target and 3648 non-target companies, totaling 3830 companies. Although it is important to test the model on an imbalanced dataset, since non-target companies are far more common in the real world than targets, a ratio as large as 20:1 is not necessary. The challenge with so many companies is that we would end up with hundreds of thousands of textual instances to run NLP on. This would require a lot of time and computational power, which I believe is unnecessary for the purposes of this article. Thus, I take up to the seven closest peers per target company, which still ensures an imbalanced distribution similar to the real world. One can easily select more peers, or even all of them, and still run all of the processes that follow in this article.

#extract announcement dates of target companies 

dates = all_data['AD'].loc[all_data['Label'] == 1]

peer_idx = []

 

for date in dates:

    #extract peers as of the announcement day

    peers = all_data.loc[(all_data['Label'] == 0) & (all_data['AD'] == date)]

    #check if the target company has more than 7 peers

    if len(peers) > 7:

        #if yes, keep only the first 7

        peers = peers.iloc[:7]

    #get indexes of selected peers 

    peer_idx.append(peers.index)

#flatten the list of lists into a single list

peers = [item for sublist in peer_idx for item in sublist]

 

#update dataset to make sure only the selected peers are included

all_data = pd.concat([all_data.loc[all_data['Label'] == 1], all_data.loc[peers]], ignore_index = True, axis = 0).reset_index(drop = True)

Finally, I remove the raw variables that were used only to calculate the derived variables included in the ML models. The final dataset structure is reported next.

all_data = all_data.drop(columns = ['AD-250', 'AD-60', 'AD-30', 'Net Income after Minority Interest', 'Total Capital', 'Free Cash Flow', 'Cash & Cash Equivalents - Total',

                                    'Revenue from Business Activities - Total', 'Revenue from Business Activities - Total.1',

                                    'Revenue from Business Activities - Total.2', 'Revenue from Business Activities - Total.3',

                                    "Total Shareholders'" + ' Equity incl Minority Intr & Hybrid Debt', 'Debt - Total',

                                    'Earnings before Interest Taxes Depreciation & Amortization', 'Enterprise Value (Daily Time Series)'])

all_data.head()

  Instrument AD Gross Profit Margin - % Operating Margin - %, TTM Current Ratio Price To Book Value Per Share (Daily Time Series Ratio) Price To Sales Per Share (Daily Time Series Ratio) Net Debt per Share Label Profit to Capital Free Cash Flow to Sales Cash to Capital EV to EBITDA Debt to EV AR Sales_growth
0 COR 15/11/2021 40.21891 22.855325 0.39338 320.578875 9.957096 39.991626 1 0.044347 0.005647 0.003099 51.388454 0.214218 -0.858267 8.048231
1 DVD 09/11/2021 26.66113 25.492173 3.47706 1.068976 1.019694 -0.345537 1 0.108392 0.031004 0.182074 12.643395 0 -11.662717 -5.932295
2 MCFE.O 08/11/2021 62.31934 10.883691 0.38323 -0.979807 4.451983 8.768888 1 -0.053955 0.247075 0.105624 28.844474 0.194682 -12.960495 11.902193
3 CPLG.K 08/11/2021 -10.7056 -30.140187 4.58824 1.060099 2.12139 11.758621 1 -0.105826 -0.150852 0.085018 -116.99623 0.587626 12.219687 -17.358218
4 MNR 05/11/2021 81.96786 49.471649 11.5161 1.641761 10.420573 8.678789 1 -0.01158 -0.491178 0.012299 39.203266 0.262022 -0.604319 13.026745
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1338 KNL.N^G21 15/09/2020 38.44269 7.734148 1.17448 1.691853 0.510063 8.789553 0 0.077266 0.06183 0.00973 6.547313 0.385728 0.338345 8.204436

1339 rows × 16 columns

targets = len(all_data.loc[all_data['Label'] == 1])

non_targets = len(all_data.loc[all_data['Label'] == 0])

print(f'Number of target companies in the dataset: {targets}')

print(f'Number of non-target companies in the dataset: {non_targets}')

Number of target companies in the dataset: 182
Number of non-target companies in the dataset: 1157

Section 2: Evaluate news sentiment prior to the M&A

This section walks through the retrieval of news headlines from the 30 days preceding the M&A announcement and utilizes the FinBert and BERT-RNA models to evaluate the sentiment of each headline. While FinBert has a pre-trained classification engine to which the headlines can be passed directly, BERT-RNA returns embeddings only. To get sentiment classifications from the latter, a classification model needs to be trained on the embeddings first, and only then can the headlines be passed to the classification engine.

2.1 Get news headlines

Before running sentiment analysis, news headlines for both target and non-target companies are requested using the get_news_headlines function of the RDP API. Here, I use only news headlines instead of the actual news stories, considering their high number, which, as we will see further in this article, is around 122,000 instances. The challenge is that to get the individual stories, we would have to make 122,000 API calls. That alone would take more than 12 days given the daily API call limit of 10,000 requests. Thus, I have decided to avoid the API call overhead and focus on the headline text. I simply iterate over the news headlines and pass each headline text to the sentiment engines, which return the sentiment score.

Since we don't want the news related to the actual M&A announcement to appear in our sentiment classification dataset, I specify the news request period from 30 days before to 5 days before the announcement.

#create column dates for 30 day and 5 day before the M&A announcement

all_data.insert(loc = 1, column = 'AD-30', value =  all_data['AD'] - datetime.timedelta(30))

all_data.insert(loc = 1, column = 'AD-5', value =  all_data['AD'] - datetime.timedelta(5))

Then, the code below loops over all instruments and requests news headlines for the specified period by storing the data into a separate dataframe.

#create empty dataframe to store headlines and separate lists to store error values

all_headlines = pd.DataFrame()

err_ric = []

err_sdate = []

err_edate = []

 

for i in range(len(all_data)):

    try:

        ric_headlines = pd.DataFrame()

        #request news headlines for each instrument for the specified period

        df = rdp.get_news_headlines(query = all_data['Instrument'][i] + ' and Language:LEN', date_from = all_data['AD-30'][i].strftime('%Y-%m-%d'), 

                                           date_to = all_data['AD-5'][i].strftime('%Y-%m-%d'), count = 5000)

        #add the headlines along with the instrument and request dates to the ric_headlines dataframe

        if len(df) > 0:

            ric_headlines.insert(loc = 0, column = 'RIC', value = [all_data['Instrument'][i]]*len(df))

            ric_headlines.insert(loc = len(ric_headlines.columns), column = 'sdate', value = [all_data['AD-30'][i]]*len(df))

            ric_headlines.insert(loc = len(ric_headlines.columns), column = 'edate', value = [all_data['AD-5'][i]]*len(df))

            ric_headlines.insert(loc = len(ric_headlines.columns), column = 'Headlines', value = df['text'].values)

        #in case there is no headline, add the instrument and dates by indicating no news for the specified period

        else:

            ric_headlines.insert(loc = 0, column = 'RIC', value = [all_data['Instrument'][i]]*1)

            ric_headlines.insert(loc = len(ric_headlines.columns), column = 'sdate', value = all_data['AD-30'][i])

            ric_headlines.insert(loc = len(ric_headlines.columns), column = 'edate', value = all_data['AD-5'][i])

            ric_headlines.insert(loc = len(ric_headlines.columns), column = 'Headlines', value = 'no news')       

        all_headlines = pd.concat([all_headlines, ric_headlines], ignore_index = True, axis = 0)

    #store rics and dates resulting error in separate lists

    except Exception:

        err_ric.append(all_data['Instrument'][i])

        err_sdate.append(all_data['AD-30'][i])

        err_edate.append(all_data['AD-5'][i])

As before, since data retrieval takes a relatively long time, I store the data in an Excel file once the code is fully executed and further read from there. The dataset is available in the GitHub folder.

all_headlines.to_excel('mergerdata/headlinesAll.xlsx')

 

#read headlines dataset from excel and remove duplicated headlines.

headlines = pd.read_excel('mergerdata/headlinesAll.xlsx').drop(columns = ['Unnamed: 0'])

headlines = headlines.drop_duplicates(subset = ['Headlines'], keep = 'first').reset_index(drop = True)

headlines

  RIC sdate edate Headlines
0 COR 16/10/2021 10/11/2021 CIF/FOB Gulf Grain-Corn barge basis steady to ...
1 COR 16/10/2021 10/11/2021 Export Summary-Philippine importer buys feed w...
2 COR 16/10/2021 10/11/2021 PLATTS: 698--Platts Latin America Corn Daily C...
3 COR 16/10/2021 10/11/2021 DJ Thailand Corn Weather - Nov 9
4 COR 16/10/2021 10/11/2021 DJ Northeast China Corn Weather - Nov 9
... ... ... ... ...
121966 KNL.N^G21 16/08/2020 10/09/2020 COVID-19 Impact & Recovery Analysis | Office F...

121967 rows × 4 columns

After getting the news headlines and removing duplicated values, we end up with 121,967 instances in our dataset.

2.2 Load pretrained FinBert sentiment classification model

About the key terminology and processes behind the NLP

Before loading the FinBert model, it is worth giving a basic understanding of the key terminology and processes behind the NLP:

Tokenization - Tokenization is the first process of NLP, when a text is split into words or subwords, which are then converted to ids through a look-up table. Although this seems pretty straightforward, there are multiple ways of splitting sentences into words or subwords, and each way has its own advantages and disadvantages. Hugging Face provides a great introductory guide on tokenization, which can be found here.

Word Embedding - Word embeddings are vector representations of words, where words or phrases from the vocabulary are mapped to vectors of real numbers. The vector encodes the meaning of the word such that words closer in the vector space are expected to be similar in meaning. As for the technical creation of the embeddings, they are produced by a neural network with an input layer, a hidden layer, and an output layer. An illustrative and explanatory example is provided in this blog post.

Transformers - The Hugging Face transformers package is a Python library that provides numerous pre-trained models used for a variety of NLP tasks. One such pre-trained model is FinBert, which is introduced in greater detail in this section.

This article, which walks through the NLP text sentiment classification processes with illustrative examples, is a great source to learn more and have a hands-on experience with tokenization, word embeddings, and transformers.
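To make the first two concepts concrete, here is a toy, self-contained sketch. The vocabulary, the greedy WordPiece-style splitter, and the 3-dimensional "embedding" vectors below are all made up purely for illustration (BERT's vocabulary has roughly 30,000 entries and its vectors have 768 dimensions); this is not the real BERT tokenizer.

```python
import numpy as np

# Toy WordPiece-style vocabulary: subword pieces mapped to ids (illustrative only)
vocab = {'[UNK]': 0, 'merg': 1, '##er': 2, 'announce': 3, '##ment': 4, 'deal': 5}

def toy_tokenize(word):
    '''Greedy longest-match-first subword split, WordPiece style.'''
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # continuation pieces are prefixed with '##', as in BERT
            candidate = word[start:end] if start == 0 else '##' + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ['[UNK]']
        tokens.append(piece)
        start = end
    return tokens

tokens = toy_tokenize('merger')   # ['merg', '##er']
ids = [vocab[t] for t in tokens]  # [1, 2] -- the look-up table step

# Toy 3-dimensional 'embeddings'; the numbers are invented for illustration
emb = {'merger': np.array([0.9, 0.8, 0.1]),
       'deal':   np.array([0.8, 0.7, 0.2]),
       'corn':   np.array([0.1, 0.2, 0.9])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words sit closer together in the vector space
print(cosine(emb['merger'], emb['deal']), cosine(emb['merger'], emb['corn']))
```

In this sketch "merger" splits into the known pieces 'merg' and '##er', and the cosine similarity between the "merger" and "deal" vectors is far higher than between "merger" and "corn", which is exactly the property real embeddings are trained to have.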

About FinBert model

As for the FinBert model itself, it is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model in the finance domain, using the Reuters TRC2 financial corpus, thereby adapting it for financial sentiment classification. After the model is adapted to the domain-specific language, it is trained with labeled data for the sentiment classification task.

The Financial PhraseBank dataset by Malo et al. (2014) has been used to train the classification task. The dataset, consisting of 4845 instances, is carefully labeled by 16 experts and master's students with finance backgrounds who, along with the labels, reported inter-annotator agreement levels for each sentence.

According to the FinBert GitHub repository, in order to use the pre-trained FinBert model, one should:

  • Create a directory for the model.
  • Download the model (pytorch_model.bin) and put it into the created directory.
  • Put a copy of config.json in that same directory.
  • Call the model with .from_pretrained(model directory name)

I have already created a folder and stored the required files in a directory called finbert. To load the model, we just need to run the code below. Additionally, I load the BERT tokenizer after loading the model.

#load FinBert model

model = BertForSequenceClassification.from_pretrained('finbert/pytorch_model.bin',config='finbert/config.json', num_labels=3)

 

#load the tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

2.3 Train sentiment classification model on Labs BERT-RNA

BERT-RNA is a financial language model created by LSEG Labs. BERT-RNA extends BERT-BASE and creates a finance-domain-specific model leveraging LSEG’s depth and breadth of unstructured financial data. The model is pre-trained using Reuters News Archive, which consists of all Reuters articles published between 1996 and 2019.

LSEG labs BERT-RNA model returns a vector of word embeddings, the process of which is illustrated in the image below:

Unlike FinBert, BERT-RNA doesn't have a sentiment classification engine and returns a vector of word embeddings that needs to be further trained on a labeled dataset to perform a classification task. In order to train the classification engine, we need to have a labeled dataset with sentiment scores. For that purpose, I use the same Financial Phrasebook dataset which was used to train the FinBert model.

To do that, I downloaded the dataset from here and stored it in the local directory. Below, I read the dataset and replace the sentiment words with labels from 0 to 2.

# read financial phrasebook dataset into a csv

input_df = pd.read_csv('data/sentiment_data/Sentences_50Agree.txt', sep=".@", header=None, encoding="ISO-8859-1", names=['text','label'])

 

#update the labels

input_df['label'].replace('positive', 0, inplace =True)

input_df['label'].replace('negative', 1, inplace =True)

input_df['label'].replace('neutral', 2, inplace =True)

input_df

  text label
0 According to Gran , the company has no plans t... 2
1 Technopolis plans to develop in stages an area... 2
2 The international electronic industry company ... 1
3 With the new production plant the company woul... 0
4 According to the company 's updated strategy f... 0
... ... ...
4845 Sales in Finland decreased by 10.5 % in Januar... 1

4846 rows × 2 columns

The next step is to format our training data into the CSV structure that BERT RNA accepts. BERT RNA expects a CSV structure with a single column for the text. The header for this column can be named anything.

input_df[['text']].to_csv('BERT_RNA_Input.csv', index=False)

Then I upload the CSV file into the Labs environment by creating a new Job.

After the job is fully executed, an OUT file is created, which can be downloaded from the Labs environment.

Finally, we can read the OUT file back into our environment using the pandas read_json function.

# read embeddings from .out file exported from the Labs UI 

X = pd.read_json('BERT-RNA/data_phrasebook.out')

y = input_df['label']

X

  0 1 2 3 4 5 6 7 8 9 ... 758 759 760 761 762 763 764 765 766 767
0 0.205987 -0.972051 -0.381347 0.501484 0.447768 -0.110406 -0.234994 0.848129 0.360031 -0.292546 ... 0.913734 -1.026575 -0.905293 -0.347511 0.355931 -0.3144 -0.214688 1.27441 0.453175 -0.341905
1 -0.113231 -0.3017 0.002256 0.194796 -0.718617 -0.341368 -1.007442 0.307653 0.209102 0.985369 ... 0.483089 -0.847685 -1.057852 -1.417513 0.23843 0.241687 0.104607 1.550795 1.130612 -0.348374
2 -0.684828 -0.622877 -0.286472 0.148611 0.217178 -0.284424 -0.645603 0.768584 0.321413 0.908583 ... 0.541181 -0.685226 -1.008933 -0.682811 -0.057627 0.381497 0.292033 1.189711 1.192648 -0.445462
3 -0.904132 0.030845 0.622767 0.262248 0.077298 -0.140442 -0.579286 0.101153 0.377318 0.261287 ... 0.56816 -0.969471 -0.71147 -0.391669 0.507979 -0.210143 0.211704 1.644572 0.787941 -0.044877
4 -0.670881 -0.279356 0.123999 0.624517 -0.294081 0.547578 -0.694566 -0.116304 0.309266 0.085417 ... -0.285442 -2.091228 -1.278206 -0.919836 0.67997 -0.128099 -0.331344 2.467715 1.42978 -0.408402
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4845 -0.612367 0.508037 -0.481639 -0.288383 0.829181 1.217645 -1.065149 0.49603 -0.330197 -0.392747 ... 0.755171 -1.537933 -0.021399 -0.989588 0.174585 0.040041 0.143034 2.354939 1.13924 -0.199524

4846 rows × 768 columns

The output is a vector for each input token (sentence), and each vector is made up of 768 numbers which correspond to the number of hidden units in the NLP model.

BERT RNA is trained with max sentence length set as 512 tokens. It is advised in BERT-RNA documentation that if the average length of text input is much shorter than 512, the feature space can be very sparse. Thus, applying the dimension reduction technique on the embedding output is recommended.

I will use the Principal component analysis (PCA) technique for dimensionality reduction, but before that, I check the average size of the text inputs.

# get average length of sentences for finding the optimal number of PCA components

length = []

for i in range(len(input_df)):

    length.append(len(input_df['text'][i]))

print(f'Average length of our text input is {round(np.mean(length),0)}')

Average length of our text input is 127.0

Now let's apply the PCA dimensionality reduction with 127 principal components.

# run pca

pca = PCA(n_components = 127)

output_pca = pca.fit_transform(X.to_numpy())

Finally, I split the PCA output into train and test datasets and run logistic regression to train the sentiment classification model.

#split data into training and testing set

X_train_pca,X_test_pca,y_train_pca,y_test_pca = train_test_split(pd.DataFrame(output_pca), y, test_size = 0.20,random_state=0)

 

#fit logistic regression model on the training set

logistic_regression_pca = LogisticRegression().fit(X_train_pca,y_train_pca)

 

#predict labels on testing set and report classification outputs

y_pred_pca = logistic_regression_pca.predict(X_test_pca)

print(classification_report(y_test_pca, y_pred_pca))

The classification results suggest an overall accuracy of 0.79; moreover, the model produced the highest F1 score on the neutral class (0.86) and the lowest on the negative class (0.67). The varying accuracy measures may also be caused by the number of instances per label, which is highest for the neutral class (575 instances) and lowest for the negative class (128 instances).

Overall, the results are satisfactory to proceed with classifying the news sentiment of the target and non-target companies.

2.4 Evaluate news sentiments based on FinBert and BERT-RNA models

Now, as we have both of our NLP models ready for the news sentiment classification task, let's proceed with it.

Get and Label news headline embeddings from BERT-RNA

First, let's label the news using the BERT-RNA model. For that, I split the news headlines file into 4 different CSVs to make sure the embeddings job is completed properly and download/store the OUT files in a local directory. The code below reads and merges the news headlines into a single dataframe.

# read embeddings from Labs UI exported .out file

sentiment1 = pd.read_json('BERT-RNA/sentiment1.out')

sentiment2 = pd.read_json('BERT-RNA/sentiment2.out')

sentiment3 = pd.read_json('BERT-RNA/sentiment3.out')

sentiment4 = pd.read_json('BERT-RNA/sentiment4.out')

 

#merge sentiment headlines into one dataframe

sentiment = pd.concat([sentiment1, sentiment2, sentiment3, sentiment4], ignore_index = True, axis = 0)

sentiment

  0 1 2 3 4 5 6 7 8 9 ... 758 759 760 761 762 763 764 765 766 767
0 0.302118 -0.395428 -0.499062 -0.756795 0.039865 -0.187485 -0.690534 -0.056547 -0.551031 -1.315916 ... -1.288661 -1.796014 -0.872499 -0.292467 0.528821 0.036952 -0.718162 1.347818 0.085477 0.36723
1 0.52297 0.320371 0.283631 -0.227879 -0.791052 -1.029441 -0.312268 0.141227 -0.785787 -0.181109 ... -0.16645 -0.972932 0.544504 -0.546426 0.284581 0.714084 -0.798771 1.70327 0.501686 0.033989
2 0.5063 -0.34506 -0.06983 0.214841 0.390197 -0.425927 0.368861 0.048239 0.280302 -0.359503 ... 0.866907 -1.398539 -1.72619 -0.046738 1.193358 0.838718 -0.339353 0.984092 0.768999 -0.236333
3 0.029535 0.541696 0.354668 -0.242215 -0.471775 -0.437619 0.180205 0.6168 0.558821 -0.245121 ... 1.109218 -1.398365 -0.619977 -0.273647 0.905635 0.935657 -1.365727 0.304455 0.349801 -0.232744
4 -0.218143 0.552087 0.342258 -0.521361 -0.300859 -0.641693 -0.078891 0.898205 0.515348 0.15542 ... 0.670583 -1.259355 -0.590247 0.000672 0.545476 1.139542 -1.199727 0.049802 0.59293 -0.545367
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
121966 -0.06228 -0.536419 0.247584 -0.161385 -0.395961 0.055131 -0.544398 0.073212 0.061124 0.042159 ... 0.382183 -1.415708 -1.149672 -0.275521 0.105553 0.545435 0.136251 0.67413 0.689401 -0.430182

121967 rows × 768 columns

Next, I apply the PCA dimensionality reduction to our headlines and label them using the pre-trained logistic regression model.

#apply the already fitted PCA projection (transform, not fit_transform, so the headlines are mapped into the same feature space the classifier was trained on)

sentiment_pca = pca.transform(sentiment.to_numpy())

 

#predict labels for news headlines

sentL = logistic_regression_pca.predict(sentiment_pca)

 

#add label list to the headlines dataframe

headlines['sentimentLabs'] = sentL

headlines

  RIC sdate edate Headlines sentimentLabs
0 COR 16/10/2021 10/11/2021 CIF/FOB Gulf Grain-Corn barge basis steady to ... 2
1 COR 16/10/2021 10/11/2021 Export Summary-Philippine importer buys feed w... 2
2 COR 16/10/2021 10/11/2021 PLATTS: 698--Platts Latin America Corn Daily C... 2
3 COR 16/10/2021 10/11/2021 DJ Thailand Corn Weather - Nov 9 2
4 COR 16/10/2021 10/11/2021 DJ Northeast China Corn Weather - Nov 9 2
... ... ... ... ... ...
121966 KNL.N^G21 16/08/2020 10/09/2020 COVID-19 Impact & Recovery Analysis | Office F... 1

121967 rows × 5 columns

Label headlines via FinBert model and evaluate pre M&A overall sentiment

Now let's label the news headlines via the FinBert model and evaluate the pre-M&A overall sentiment with both models. To calculate overall sentiment, I first initialize the SentOverallFBert and SentOverallLabs variables at 0; each time a news headline is labeled positive for a particular company, I increase the corresponding variable by 1, and decrease it by 1 if the headline is labeled negative. Neutral labels don't impact the overall sentiment.

Before starting the actual labeling and overall sentiment calculation processes, I first group the news headlines by RIC and then loop over the headlines belonging to each RIC. The code below does the grouping.

dfs = dict(tuple(headlines.groupby('RIC', sort = False)))

Now I loop over each headline of each RIC, label it, and calculate the overall sentiment for each company based on the two NLP models.

#create empty dictionary to store the sentiment values

sentiments = {'RIC':[],'SentOverallFBert':[],'SentOverallLabs':[]}

 

#loop over each RIC

for ric in dfs:

    #append the RIC to the dictionary

    sentiments['RIC'].append(ric)

    

    #initiate overall Sentiment for FinBert

    SentOverallFBert = 0

    #loop over each news headline belonging to the RIC

    for text in dfs[ric]['Headlines']:

        #tokenize the headlines

        inputs = tokenizer(text, return_tensors="pt")

        #get prediction outputs

        outputs = model(**inputs)

        #get the maximum probability class

        sentF = torch.argmax(outputs[0])

        #update SentOverallFBert based on the classification output

        if sentF == 0:

            SentOverallFBert += 1

        elif sentF == 1:

            SentOverallFBert -= 1

    #append FinBert calculated overall sentiment of a company to the dictionary            

    sentiments['SentOverallFBert'].append(SentOverallFBert)

    

    #initiate overall Sentiment for BERT-RNA

    SentOverallLabs = 0

    #update SentOverallLabs based on the classification output

    for sentL in dfs[ric]['sentimentLabs']:

        if sentL == 0:

            SentOverallLabs += 1

        elif sentL == 1:

            SentOverallLabs -= 1 

    #append BERT-RNA calculated overall sentiment of a company to the dictionary            

    sentiments['SentOverallLabs'].append(SentOverallLabs)

    

#convert dictionary to a dataframe

sentiments = pd.DataFrame(sentiments)

Here again, I store the data in an Excel file once the code is fully executed and further read from there. The dataset is available in the GitHub folder.

sentiments.to_excel('mergerdata/sentiment_finbert_Labs.xlsx')

#read sentiment dataset from excel

sentiments = pd.read_excel('mergerdata/sentiment_finbert_Labs.xlsx')

Below we plot the overall sentiments calculated by both models to explore the outputs visually and compare the sentiments between the two models.

#plot values using Plotly scatter plot

fig = px.scatter(sentiments, y = ["SentOverallFBert", "SentOverallLabs"])

#update plot layout

fig.update_layout(height=700, width=1100)

fig.update_yaxes(range=[-150, 550], tick0=0)

 

#move legend to the top left

fig.update_layout(legend=dict(

    yanchor="top",

    y=0.99,

    xanchor="left",

    x=0.01

))

fig.show()

According to the graph, most companies have close-to-neutral overall sentiment; moreover, most of the lower negative outputs come from BERT-RNA, while FinBert produces higher positive classifications.

Section 3: Evaluation of M&A predictive modeling

This section evaluates the M&A predictive modeling. Before evaluating the ML models, it should be noted that the sample size, especially for the target companies, is too small for robust predictive results. Thus, the primary purpose of this article is to showcase a workflow of M&A predictive modeling using Refinitiv data/APIs and to discover whether news sentiment has any significant explanatory impact on the predictive power of the model. Anyone who wants to build a robust M&A predictive model can use this workflow and these variables to train the models on much larger datasets.

Logistic regression, random forests, and XGBoost ML techniques are used for the predictive modeling. The reason for using multiple ML techniques is to evaluate the explanatory power of the news sentiment variables from multiple perspectives and to make a robust conclusion regarding the importance of that variable in M&A predictive modeling. In particular, logistic regression allows us to look at the p-values and coefficients of the variables, while random forest and XGBoost evaluate feature importances.

As these models are not used for actual prediction but rather for showing the impact of the sentiment variables on evaluation metrics, instead of train/test split, I employ Repeated Stratified Cross-Validation with 10 splits and 5 repeats. Considering the imbalanced nature of the dataset, ROC_AUC score is used as the main accuracy metric.
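The evaluation setup described above can be sketched on synthetic data as follows. This is a minimal illustration only: make_classification and the X_demo/y_demo names are stand-ins, not the article's actual features or labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced dataset standing in for the article's features/labels
X_demo, y_demo = make_classification(n_samples=600, n_features=12,
                                     weights=[0.85, 0.15], random_state=0)

# Repeated Stratified Cross-Validation: 10 splits, 5 repeats, scored by ROC AUC
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, scoring='roc_auc', cv=cv)

# 10 splits x 5 repeats yields 50 fold scores; report their mean
print(f'{len(scores)} folds, mean ROC_AUC = {scores.mean():.3f}')
```

Stratification keeps the target/non-target ratio the same in every fold, which matters with an imbalanced label, and repeating the splits reduces the variance of the reported metric.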

3.1 Preparing dataset for predictive modeling

Before training and evaluating the models mentioned above, let's first construct our final dataset of independent and dependent variables. For that, I join the NLP-based sentiment variables to our initial dataset consisting of the financial variables.

#set index columns to implement dataframe joining

all_data.set_index('Instrument', inplace = True)

sentiments.rename(columns = {"RIC": "Instrument"}, inplace=True)

sentiments.set_index('Instrument', inplace = True)

 

#join NLP based and financial variable datasets

all_data = all_data.join(sentiments, on = 'Instrument', how = 'inner').reset_index().drop_duplicates(subset = ['Instrument'], keep = 'first')

all_data.head()

  Instrument AD Gross Profit Margin - % Operating Margin - %, TTM Current Ratio Price To Book Value Per Share (Daily Time Series Ratio) Price To Sales Per Share (Daily Time Series Ratio) Net Debt per Share Label Profit to Capital Free Cash Flow to Sales Cash to Capital EV to EBITDA Debt to EV AR Sales_growth SentOverallFBert SentOverallLabs
0 COR 15/11/2021 40.21891 22.855325 0.39338 320.578875 9.957096 39.991626 1 0.044347 0.005647 0.003099 51.388454 0.214218 -0.858267 8.048231 -61 27
1 DVD 09/11/2021 26.66113 25.492173 3.47706 1.068976 1.019694 -0.345537 1 0.108392 0.031004 0.182074 12.643395 0 -11.662717 -5.932295 -3 11
2 MCFE.O 08/11/2021 62.31934 10.883691 0.38323 -0.979807 4.451983 8.768888 1 -0.053955 0.247075 0.105624 28.844474 0.194682 -12.960495 11.902193 1 15
3 CPLG.K 08/11/2021 -10.7056 -30.140187 4.58824 1.060099 2.12139 11.758621 1 -0.105826 -0.150852 0.085018 -116.99623 0.587626 12.219687 -17.358218 -2 3
4 MNR 05/11/2021 81.96786 49.471649 11.5161 1.641761 10.420573 8.678789 1 -0.01158 -0.491178 0.012299 39.203266 0.262022 -0.604319 13.026745 -2 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1309 KNL.N^G21 15/09/2020 38.44269 7.734148 1.17448 1.691853 0.510063 8.789553 0 0.077266 0.06183 0.00973 6.547313 0.385728 0.338345 8.204436 3 -2

1031 rows × 18 columns

Then I run a correlation analysis to eliminate highly correlated variables and avoid multicollinearity. The code below unstacks and sorts the correlation matrix, revealing the highly correlated pairs.

all_data.corr().unstack().sort_values().drop_duplicates()

It can be observed from the correlation output that Abnormal return (AR) is correlated with Price To Sales Per Share and EV to EBITDA, while the latter two are also highly correlated with each other. Thus, I removed both of them and kept AR for the final model. Another correlated pair is Free Cash Flow to Sales and Operating Margin; of those, I eliminated Operating Margin, since another variable, Gross Profit Margin, already captures the management efficiency component of the company.
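The elimination above was done by inspecting the correlation output manually. As an illustrative alternative (not part of the original workflow), a small helper could flag one variable from every pair whose absolute correlation exceeds a chosen threshold; the function name and the 0.7 cutoff below are my own assumptions:

```python
import numpy as np
import pandas as pd

def correlated_columns(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return columns to drop so that no remaining pair of columns
    exceeds the absolute-correlation threshold (greedy, order-dependent)."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# toy example with two deliberately correlated columns
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({'a': a,
                     'b': a * 2 + rng.normal(scale=0.01, size=100),
                     'c': rng.normal(size=100)})
print(correlated_columns(demo))  # 'b' is flagged: near-perfect correlation with 'a'
```

On the article's dataset this would reproduce the manual choice only with a suitable threshold; the point is merely to make the elimination rule explicit.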

I drop the variables Operating Margin, EV to EBITDA, and Price to Sales per Share in the cell below. Additionally, I create three different sets of independent variables: the first (X_NoSent) consists of only the financial variables, the second (X_BertRna) adds the overall sentiment variable derived by the BERT-RNA model, and the third (X_Finbert) adds the overall sentiment based on the FinBert model. Throughout the analysis, evaluation metrics on all three datasets are reported to showcase the effect of the news sentiment variable, including in relation to AR.

#drop correlated variables
X = all_data.drop(columns = ['Instrument', 'Label', 'AD', 'Operating Margin - %, TTM', 'EV to EBITDA',
                             'Price To Sales Per Share (Daily Time Series Ratio)'])
y = all_data['Label']

#create separate datasets for no sentiment and sentiment based data
X_NoSent = X.drop(columns = ['SentOverallLabs', 'SentOverallFBert'])
X_BertRna = X.drop(columns = ['SentOverallFBert'])
X_Finbert = X.drop(columns = ['SentOverallLabs'])

3.2 Evaluation of Logistic regression model outputs

First, I train and evaluate the logistic regression with the 'liblinear' solver and the 'l2' penalty. This allows us to look at the p-values and coefficients of the independent variables, which helps to draw conclusions about the impact of the sentiment variables. The ROC_AUC score is used as the main accuracy metric.

#initiate logistic regression model
lr = LogisticRegression(solver = 'liblinear', penalty = 'l2', random_state = 0)

#define cross validation parameters: 20 folds repeated 5 times
cv = RepeatedStratifiedKFold(n_splits = 20, n_repeats = 5, random_state = 0)

#run cross validation on logistic regression model with defined CV parameters
scoresNoSent = cross_validate(lr, X_NoSent, y, scoring=['roc_auc'], cv = cv, n_jobs = -1)
scoresBertRna = cross_validate(lr, X_BertRna, y, scoring=['roc_auc'], cv = cv, n_jobs = -1)
scoresFinbert = cross_validate(lr, X_Finbert, y, scoring=['roc_auc'], cv = cv, n_jobs = -1)

#report cross validation outputs on 3 datasets
print('\033[1m' + "Model with no sentiment variable" + '\033[0m')
print("AUC:" + str(round(scoresNoSent['test_roc_auc'].mean(),2)))

print('\033[1m' + "With Sentiment from BERT-RNA" + '\033[0m')
print("AUC:" + str(round(scoresBertRna['test_roc_auc'].mean(),2)))

print('\033[1m' + "With Sentiment from FinBert" + '\033[0m')
print("AUC:" + str(round(scoresFinbert['test_roc_auc'].mean(),2)))

Model with no sentiment variable
AUC:0.5
With Sentiment from BERT-RNA
AUC:0.54
With Sentiment from FinBert
AUC:0.57

From the reported ROC_AUC scores, we can see that the logistic regression models with an NLP-based sentiment variable clearly outperform the no-sentiment model. Moreover, the model with sentiment derived by FinBert achieves the highest score of 0.57. Next, I calculate and report the coefficients and p-values of the variables from the FinBert sentiment-based model, which shows the significance and the direction of each variable's impact. For that, I first normalize the data by subtracting the mean from the actual values and dividing by the standard deviation, which puts all variables on a comparable scale and makes the coefficients directly interpretable.

#normalize the data
X_norm = (X_Finbert - np.mean(X_Finbert)) / np.std(X_Finbert, 0)

#fit logistic regression model with normalized data
model = lr.fit(X_norm, y)

#create a dataframe consisting of the list of coefficients
coefs = pd.DataFrame(index = X_Finbert.columns, data = model.coef_[0], columns = ['Coefficients'])

Before showing the resulting dataframe of coefficients, I add p-values to it for a more comprehensive picture. sklearn doesn't have a built-in p-value calculation, so I calculate it myself by adapting an example from a Stack Overflow thread.

def logit_pvalue(model, x):
    '''
    Calculate p-values for scikit-learn LogisticRegression coefficients.
    This function uses the asymptotics of maximum likelihood estimates.

    Parameters:
    ------------
        Input
            model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
            x:     matrix on which the model was fit
        Output:
            p:     array of p-values
    '''
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.matrix(np.insert(np.array(x), 0, 1, axis = 1))
    ans = np.zeros((m, m))

    for i in range(n):
        ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i,1] * p[i, 0]
    vcov = np.linalg.inv(np.matrix(ans))
    se = np.sqrt(np.diag(vcov))
    t = coefs/se
    p = (1 - norm.cdf(abs(t))) * 2

    return p


#get p-values on the normalized data the model was fit on
zScores = logit_pvalue(model, X_norm)
#convert from scientific notation and drop the intercept's p-value
zScores = ['{0:f}'.format(num) for num in zScores][1:]
#append p-values to the coefficients dataframe
coefs.insert(loc = len(coefs.columns), column = 'P-Value', value = zScores)
coefs.T

| Variable | Coefficients | P-Value |
| --- | --- | --- |
| Gross Profit Margin - % | -0.038679 | 0.000125 |
| Current Ratio | -0.001491 | 0.991442 |
| Price To Book Value Per Share (Daily Time Series Ratio) | 0.096708 | 0.000069 |
| Net Debt per Share | -0.44698 | 0 |
| Profit to Capital | 0.019854 | 0.966635 |
| Free Cash Flow to Sales | 0.075284 | 0.899132 |
| Cash to Capital | 0.013539 | 0.987717 |
| Debt to EV | -0.046934 | 0.961843 |
| AR | 0.531696 | 0 |
| Sales_growth | 0.093291 | 0 |
| SentOverallFBert | -0.45572 | 0 |

From the results above, we can observe a statistically significant impact of the Gross Profit Margin, Price To Book Value Per Share, Net Debt per Share, AR, Sales_growth, and SentOverallFBert variables. Moreover, the coefficient of SentOverallFBert is among the largest in magnitude, at -0.46, while the coefficient of AR is the largest, at 0.53. The negative coefficient of SentOverallFBert and the positive coefficient of AR (both statistically significant) suggest that a higher abnormal return combined with lower positive sentiment indicates a higher probability of M&A. This is in line with my initial assumption and supports the hypothesis that abnormal returns amid no or low positive news are an indication of an M&A announcement.
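To make these magnitudes more tangible, logistic regression coefficients can be exponentiated into odds ratios. A quick back-of-the-envelope calculation using the reported coefficients (no re-fitting involved):

```python
import numpy as np

# coefficients reported above (model fit on normalized inputs)
coef_ar = 0.531696     # Abnormal Return (AR)
coef_sent = -0.45572   # SentOverallFBert

# odds multiplier for a one-standard-deviation increase in each variable
print(round(np.exp(coef_ar), 2))    # prints 1.7
print(round(np.exp(coef_sent), 2))  # prints 0.63
```

In other words, a one-standard-deviation increase in AR multiplies the odds of being a target by roughly 1.7, while a one-standard-deviation increase in positive sentiment reduces them by roughly a third.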

Further, I train and evaluate random forest and XGBoost models on the same datasets and look at the differences in evaluation metrics across models and datasets. Most importantly, I look at the feature importance to confirm the significance of the sentiment variable for M&A predictive modeling.

3.3 Evaluation of Random forest model outputs

Next, I train and evaluate the random forest model with 500 estimators and the entropy criterion. I use the 'balanced_subsample' class weight given the imbalanced nature of the dataset, and set max_depth to 3 to avoid overfitting. The ROC_AUC score is again used as the main accuracy metric, and I additionally report precision, recall, and F1. Finally, I look at the feature importance of the independent variables to evaluate the significance of the sentiment variables according to the random forest model.

#initiate random forest model with specified parameters
rf = RandomForestClassifier(n_estimators=500, class_weight='balanced_subsample', criterion='entropy', max_depth = 3, n_jobs=-1, random_state = 5)

#run cross validation on random forest model with defined CV parameters
scoresNoSentRf = cross_validate(rf, X_NoSent, y, scoring = ['roc_auc', 'accuracy', 'f1', 'precision', 'recall'], cv = cv, n_jobs = -1)
scoresBertRnaRf = cross_validate(rf, X_BertRna, y, scoring = ['roc_auc', 'accuracy', 'f1', 'precision', 'recall'], cv = cv, n_jobs = -1)
scoresFinbertRf = cross_validate(rf, X_Finbert, y, scoring = ['roc_auc', 'accuracy', 'f1', 'precision', 'recall'], cv = cv, n_jobs = -1)

#report cross validation outputs on 3 datasets
print('\033[1m' + "Model with no sentiment variable" + '\033[0m')
print("AUC:" + str(round(scoresNoSentRf['test_roc_auc'].mean(),2)))
print("Precision:" + str(round(scoresNoSentRf['test_precision'].mean(),2)))
print("Recall:" + str(round(scoresNoSentRf['test_recall'].mean(),2)))
print("F1:" + str(round(scoresNoSentRf['test_f1'].mean(),2)))

print('\033[1m' + "With Sentiment from BERT-RNA" + '\033[0m')
print("AUC:" + str(round(scoresBertRnaRf['test_roc_auc'].mean(),2)))
print("Precision:" + str(round(scoresBertRnaRf['test_precision'].mean(),2)))
print("Recall:" + str(round(scoresBertRnaRf['test_recall'].mean(),2)))
print("F1:" + str(round(scoresBertRnaRf['test_f1'].mean(),2)))

print('\033[1m' + "With Sentiment from FinBert" + '\033[0m')
print("AUC:" + str(round(scoresFinbertRf['test_roc_auc'].mean(),2)))
print("Precision:" + str(round(scoresFinbertRf['test_precision'].mean(),2)))
print("Recall:" + str(round(scoresFinbertRf['test_recall'].mean(),2)))
print("F1:" + str(round(scoresFinbertRf['test_f1'].mean(),2)))

Model with no sentiment variable
AUC: 0.62
Precision: 0.24
Recall: 0.49
F1: 0.32

With Sentiment from BERT-RNA
AUC: 0.64
Precision: 0.25
Recall: 0.49
F1: 0.33

With Sentiment from FinBert
AUC: 0.64
Precision: 0.26
Recall: 0.51
F1: 0.34

The reported evaluation metrics are in line with those from the logistic regression model: models with an NLP-based sentiment variable outperform the no-sentiment model. Moreover, we observe a much higher ROC_AUC score for the random forest models: 0.12 higher for the no-sentiment model, and 0.10 and 0.07 higher for the models with BERT-RNA and FinBert sentiment, respectively. It is also worth highlighting that the FinBert and BERT-RNA based models have the same ROC_AUC score of 0.64; however, FinBert still slightly outperforms BERT-RNA on the precision, recall, and F1 measures.
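For easier side-by-side comparison, the per-fold results from cross_validate can be averaged into one summary dataframe. A minimal sketch, assuming score dictionaries shaped like those returned by cross_validate (the helper name and the toy values below are placeholders, not the article's results):

```python
import numpy as np
import pandas as pd

def summarize(scores: dict, name: str) -> pd.Series:
    """Average the per-fold test metrics from a cross_validate result dict."""
    metrics = {k.replace('test_', ''): round(np.mean(v), 2)
               for k, v in scores.items() if k.startswith('test_')}
    return pd.Series(metrics, name=name)

# placeholder score dictionaries standing in for the real cross_validate outputs
fake_no_sent = {'test_roc_auc': np.array([0.61, 0.63]), 'test_f1': np.array([0.31, 0.33])}
fake_finbert = {'test_roc_auc': np.array([0.63, 0.65]), 'test_f1': np.array([0.33, 0.35])}

summary = pd.concat([summarize(fake_no_sent, 'NoSent'),
                     summarize(fake_finbert, 'FinBert')], axis=1)
print(summary)
```

Applied to the real score dictionaries (scoresNoSentRf, scoresBertRnaRf, scoresFinbertRf), this would replace the repeated print blocks with one table.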

Next, I calculate and plot the feature importance for both models with sentiment variables.

#calculate feature importance
importanceFinbertRf = [round(imp,2) for imp in rf.fit(X_Finbert, y).feature_importances_]
importanceBertRnaRf = [round(imp,2) for imp in rf.fit(X_BertRna, y).feature_importances_]

#create subplots for two models
fig = make_subplots(
    rows=2, cols=1, subplot_titles=[
        'Feature importance for model with FinBert Sentiment',
        'Feature importance for model with BERT-RNA Sentiment'])

#add barplot for feature importance from FinBert model
fig.add_trace(go.Bar(y = X_Finbert.columns, x = importanceFinbertRf,
                     orientation='h', text = importanceFinbertRf, showlegend = False), row=1, col=1)

#add barplot for feature importance from BERT-RNA model
fig.add_trace(go.Bar(y = X_BertRna.columns, x = importanceBertRnaRf,
                     orientation='h', text = importanceBertRnaRf, showlegend = False), row=2, col=1)

#update layout and sort values inside the plot
fig.update_layout(height = 800, width = 1100)
fig.update_yaxes(categoryorder='total ascending')

Both graphs above suggest the high importance of the news sentiment variable. In particular, SentOverallFBert has the second-highest feature importance after Profit to Capital. The latter also has the highest importance in the BERT-RNA-based model, where Net Debt per Share has slightly higher importance than SentOverallLabs. Nevertheless, the feature importance values are in line with the results from the logistic regression model: they show the high importance of the NLP-based news sentiment variables along with the AR variable.
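One caveat worth noting: impurity-based feature importance from random forests can be biased toward features with many distinct values. As an optional robustness check that is not part of the original analysis, scikit-learn's permutation importance could be computed on the fitted model; the sketch below uses synthetic stand-in data rather than the article's X_Finbert:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# synthetic stand-in data; in the article this would be X_Finbert and y
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=100, max_depth=3,
                                 random_state=0).fit(X_demo, y_demo)

# importance = mean drop in ROC_AUC when a feature's values are shuffled
result = permutation_importance(rf_demo, X_demo, y_demo, scoring='roc_auc',
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

Features whose shuffling barely moves the score contribute little, regardless of how often the trees split on them.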

3.4 Evaluation of XGBoost model outputs

Finally, I train and evaluate the XGBoost model with 500 estimators, a learning rate of 0.1, and an alpha of 30. I set the scale_pos_weight parameter to the ratio of non-targets to targets given the imbalanced nature of the dataset, and set max_depth to 3 and subsample to 0.5 to avoid overfitting. Here again, the ROC_AUC score is used as the main accuracy metric, and I additionally report precision, recall, and F1. Finally, I look at the feature importance of the independent variables to evaluate the significance of the sentiment variables according to the XGBoost model.

#calculate number of target and non-target companies for assigning weights
target = all_data[all_data['Label'] == 1].shape[0]
nontarget = all_data[all_data['Label'] == 0].shape[0]

#initiate XGBoost model with specified parameters
xgBoost = XGBClassifier(scale_pos_weight = nontarget/target, max_depth=3, subsample = 0.5, n_estimators = 500, learning_rate = 0.1, alpha = 30)

#run cross validation on XGBoost model with defined CV parameters
scoresNoSentXgb = cross_validate(xgBoost, X_NoSent, y, scoring = ['roc_auc', 'accuracy', 'f1', 'precision', 'recall'], cv = cv, n_jobs = -1)
scoresBertRnaXgb = cross_validate(xgBoost, X_BertRna, y, scoring = ['roc_auc', 'accuracy', 'f1', 'precision', 'recall'], cv = cv, n_jobs = -1)
scoresFinbertXgb = cross_validate(xgBoost, X_Finbert, y, scoring = ['roc_auc', 'accuracy', 'f1', 'precision', 'recall'], cv = cv, n_jobs = -1)

#report cross validation outputs on 3 datasets
print('\033[1m' + "Model with no sentiment variable" + '\033[0m')
print("AUC:" + str(round(scoresNoSentXgb['test_roc_auc'].mean(),2)))
print("Precision:" + str(round(scoresNoSentXgb['test_precision'].mean(),2)))
print("Recall:" + str(round(scoresNoSentXgb['test_recall'].mean(),2)))
print("F1:" + str(round(scoresNoSentXgb['test_f1'].mean(),2)))

print('\033[1m' + "With Sentiment from BERT-RNA" + '\033[0m')
print("AUC:" + str(round(scoresBertRnaXgb['test_roc_auc'].mean(),2)))
print("Precision:" + str(round(scoresBertRnaXgb['test_precision'].mean(),2)))
print("Recall:" + str(round(scoresBertRnaXgb['test_recall'].mean(),2)))
print("F1:" + str(round(scoresBertRnaXgb['test_f1'].mean(),2)))

print('\033[1m' + "With Sentiment from FinBert" + '\033[0m')
print("AUC:" + str(round(scoresFinbertXgb['test_roc_auc'].mean(),2)))
print("Precision:" + str(round(scoresFinbertXgb['test_precision'].mean(),2)))
print("Recall:" + str(round(scoresFinbertXgb['test_recall'].mean(),2)))
print("F1:" + str(round(scoresFinbertXgb['test_f1'].mean(),2)))

Model with no sentiment variable
AUC: 0.62
Precision: 0.24
Recall: 0.48
F1: 0.32

With Sentiment from BERT-RNA
AUC: 0.63
Precision: 0.25
Recall: 0.49
F1: 0.33

With Sentiment from FinBert
AUC: 0.64
Precision: 0.26
Recall: 0.53
F1: 0.35

The reported evaluation metrics are in line with those from the random forest model. Here again, we observe a much higher ROC_AUC score compared to the logistic regression outputs. It is also worth highlighting that here the FinBert-based model slightly outperforms the BERT-RNA-based one not only on the precision, recall, and F1 measures but also on the ROC_AUC score.

Here again, I calculate and plot the feature importance for both models with sentiment variables.

#calculate feature importance
importanceFinbertXgb = [round(imp,2) for imp in xgBoost.fit(X_Finbert, y).feature_importances_]
importanceBertRnaXgb = [round(imp,2) for imp in xgBoost.fit(X_BertRna, y).feature_importances_]

#create subplots for two models
fig = make_subplots(
    rows=2, cols=1, subplot_titles=[
        'Feature importance for model with FinBert Sentiment',
        'Feature importance for model with BERT-RNA Sentiment'])

#add barplot for feature importance from FinBert model
fig.add_trace(go.Bar(y = X_Finbert.columns, x = importanceFinbertXgb,
                     orientation='h', text = importanceFinbertXgb, showlegend = False), row=1, col=1)

#add barplot for feature importance from BERT-RNA model
fig.add_trace(go.Bar(y = X_BertRna.columns, x = importanceBertRnaXgb,
                     orientation='h', text = importanceBertRnaXgb, showlegend = False), row=2, col=1)

#update layout and sort values inside the plot
fig.update_layout(height=800, width=1100)
fig.update_yaxes(categoryorder='total ascending')

As in the case of the random forest model, both graphs suggest that the news sentiment variable is among the most important features. The main distinction of the XGBoost feature importance is that it reveals Debt to EV as another important feature, whose value exceeds both SentOverallFBert and SentOverallLabs. Nevertheless, the feature importance values are in line with the results from the logistic regression and random forest models, suggesting the high importance of the NLP-based news sentiment variables.

Summary

This article was an extension of my first article on predicting M&A targets using Machine Learning techniques. While the first article used only financial variables to predict M&A, here an NLP-based news sentiment variable was added to increase the predictive power of the model. The main hypothesis behind the news sentiment variable was that abnormal returns amid a no or low positive news sentiment environment could indicate an upcoming M&A announcement. For the news sentiment analysis, two BERT-based models, FinBert and BERT-RNA, were used, and the significance of the variables derived from both models was compared across logistic regression, random forest, and XGBoost ML techniques.

Although the dataset wasn't large enough to claim robustness of the model's predictive power, the evaluation results across the different ML techniques allow us to claim the importance of the NLP-based news sentiment variable, along with abnormal return, for M&A predictive analysis, which was the main purpose of this article. And although the model's predictive power isn't high enough for actual prediction and trading, I hope that this prediction workflow, along with the selected variables, can be useful for training more robust models on much larger datasets and achieving much higher accuracies.

Apart from this, another important aspect of this article was the usage of the Search function for M&A data retrieval and of BERT-RNA embeddings for sentiment classification, which I believe can be useful for the developer community using Refinitiv products.