FinTech & Covid-19: Did Covid-19 have a serious effect on companies' stock prices, especially those of FinTech firms? How did it occur?
Does Stock Return Reflect the Impact of Covid-19 Pandemic? A Text-based Analysis on Firms’ 10-K Filings
Individual Assignment Report for the Financial Data Analytics Course April 2022
Abstract
I use textual analysis techniques to extract Covid coverage information and a sentiment score from firms' 10-K filings. This novel information is combined with conventional financial data to predict stock performance in machine learning models. The empirical results show no significant relationship between Covid coverage in 10-K filings and stock performance, while the sentiment score contributes to the stock performance classification.
Acknowledgments: I would like to thank Professor Mancy Luo for her excellent guidance throughout this course.
Keywords: 10-K, Textual Analysis, Sentiment Analysis, Logistic Regression, Random Forest
I. Introduction
The Covid-19 pandemic has triggered a major crisis worldwide, and its impact on the stock market has been dramatic. With an increasing number of countries gradually removing Covid-related restrictions, it is meaningful to take a retrospective view and examine the impact of the Covid shock on firms' stock performance since its outbreak.
This report uses basic textual analysis techniques and machine learning models to explore the following research question:
Is stock performance associated with a firm's Covid coverage in its 10-K filing?
New technologies have enabled researchers to extract valuable information from the vast quantities of digital text and incorporate it in their studies (Gentzkow et al., 2019). In the field of finance, the periodic reports (e.g., 10-K) disclosed by firms have been widely analyzed as these filings provide ample business and financial information about the firms. Conventional research focuses more on the accounting data as it is readily available and easy to analyze. With the help of novel text-mining techniques, researchers now can use text as an input to their research. In this report, I use textual analysis techniques to build up a dummy variable to proxy for the Covid coverage in the firm’s 10-K filing.
In addition, the sentiment of text has been widely analyzed in finance (Loughran and McDonald, 2016; Azimi and Agrawal, 2021), and previous studies have shown that positive and negative sentiment can predict abnormal stock returns (Jegadeesh and Wu, 2013). Hence, I also construct a sentiment score and add it to the analysis.
The remainder of this report is organized as follows: Section II provides the data sources and cleaning procedures used in this report, and Section III presents the models and algorithms. The empirical findings and discussion are presented in Section IV. Section V concludes the report. Appendix A shows the tailored stop word list and Covid-related word list, Appendix B gives a detailed description of the variables used in the empirical study, and Appendix C presents the R code used in this report.
II. Data
A. Data Source
The data used in this analysis is gathered from texts of 10-K filings disclosed by the firms, and conventional financial databases like COMPUSTAT and CRSP.
According to the Securities Exchange Act of 1934 and its subsequent amendments, issuers with securities registered under or subject to certain sections are required to file periodic reports (e.g., 10-K and 10-Q). These filings are stored on the Securities and Exchange Commission's (SEC) EDGAR website and can be accessed by the general public. However, the raw filings are encoded as browser-friendly files that contain a large amount of markup irrelevant to this analysis, and it is beyond the scope of this report to tackle the redundant information. Therefore, I use the preprocessed 10-K filings data labeled "Stage One Parse", published by the Software Repository for Accounting and Finance at the University of Notre Dame1, in this analysis.
Apart from the novel text data, I also build features using the COMPUSTAT and CRSP databases, both of which are easily accessed through Wharton Research Data Services.2 The sample contains all firms which disclose their 10-K filings in the fourth quarter of 2021 and are covered by COMPUSTAT and CRSP from March 10, 2020 to September 30, 2021.3 A detailed description of the usage of these databases can be found in Appendix B.
B. Text Preprocessing
In general, there are five main steps in the text preprocessing.
First, I create a corpus from the sample texts and narrow the scope of the textual analysis to the Item 7 and Item 7A sections of the 10-K filing. Item 7, "Management's Discussion and Analysis of Financial Condition and Results of Operations" (MD&A for short), allows firm management to give the firm's perspective on the business results of the past financial year, and Item 7A, "Quantitative and Qualitative Disclosures about Market Risk", discusses how the firm manages its market risk exposures. Both sections provide the most relevant information on a
1A detailed documentation for this data can be found in this link: https://sraf.nd.edu/sec-edgar-data/cleaned-10x-files/10x-stage-one-parsing-documentation/
2See: https://wrds-www.wharton.upenn.edu/
3On March 11, 2020, the World Health Organisation characterized Covid-19 as a pandemic. I select the day before this date as the beginning of the sample period. Since the purpose of this analysis is to evaluate the stock performance between the start of pandemic and the release of 10-K, I choose the last day of Q3 in 2021 to be the end point as all the 10-K filings are disclosed in Q4 of 2021 in the sample.
firm's operational performance and are of main interest to researchers. Therefore, I extract these two sections from the original 10-K filings to depict the impact of Covid-19 on firms.
Second, I create a stopword list and a Covid-related word list to clean the text. In addition to the default stopword list provided in the R package tm, I add a few more words (e.g., company, result) to fit the context of this analysis; none of these words provides useful information here, so they are all removed from the text. Besides, there exist various descriptions of the Covid-19 pandemic (coronavirus, pandemic, etc.), so to simplify the evaluation of the Covid-19 impact in the text, I explicitly define a Covid-related word list and replace every word in this list with the single word "covid". The stopword list and the Covid-related word list can be found in Appendix A.
Third, I implement some conventional text cleaning procedures, which include converting the whole text to lowercase, removing numbers, punctuation, single-letter words, and newline signs, and stripping extra white space.
Fourth, in order to exhibit the most frequent words in the texts, I use the cleaned corpus to create two Document Term Matrices, one based on absolute word frequency and one on TF-IDF. I then use the R package wordcloud to generate word cloud graphs of the most frequent words in the sample.
Lastly, I tokenize the text into words and generate a Covid-related dummy feature and a sentiment score for further analysis. Specifically, for each firm's tokenized text, I set the Covid dummy to 1 if the word "covid" appears at least once and to 0 otherwise. To create the sentiment score for each firm, I merge the loughran sentiment word list (Loughran and McDonald, 2011), provided in the R package tidytext, with the tokenized text and compute the score from the numbers of positive and negative words.
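The construction of these two text-based features can be sketched in a few lines. The snippet below is a simplified illustration on hypothetical toy data (the docs data frame and its contents are assumptions made purely for exposition); the full procedure applied to the actual corpus is given in Appendix C.

library(dplyr)
library(tidytext)

# hypothetical toy input: one row per firm, `cik` as identifier,
# `text` holding the cleaned Item 7/7A text
docs <- tibble(cik  = c("0000001", "0000002"),
               text = c("the coronavirus pandemic reduced demand sharply",
                        "strong growth in digital payments continued"))

covid_words <- c("covid", "corona", "coronavirus", "pandemic", "virus", "vaccine",
                 "lockdown", "quarantine", "disease", "infection", "cdc", "crisis", "crises")

# map every Covid-related variant to the single token "covid"
docs <- docs %>%
  mutate(text = gsub(paste0("\\b(", paste(covid_words, collapse = "|"), ")\\b"),
                     "covid", text, ignore.case = TRUE))

tokens <- docs %>% unnest_tokens(word, text)

# Covid dummy: 1 if "covid" appears at least once in the firm's filing
d_covid <- tokens %>% group_by(cik) %>%
  summarize(covid = as.integer("covid" %in% word))

# sentiment score: (positive - negative) / (positive + negative),
# based on the Loughran-McDonald word list (NaN if a filing has neither type)
sent_score <- tokens %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  group_by(cik) %>%
  summarize(positive = sum(sentiment == "positive"),
            negative = sum(sentiment == "negative")) %>%
  mutate(score = (positive - negative) / (positive + negative))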
C. Financial Data Preprocessing
The financial data preprocessing includes three parts.
First, I create a linking table to match different firm identifiers. Since I use different data sources to create firm features, it is essential to merge them into a single dataset for further analysis. However, the codes used to identify firms vary across these data sources. Specifically, the 10-K text data uses the CIK code4 to identify firms, COMPUSTAT uses the GVKEY code5, and CRSP
4The Central Index Key (CIK) is used on the SEC’s computer systems to identify corporations and individual people who have filed disclosure with the SEC.
5The Global Company Key or GVKEY is a unique six-digit number key assigned to each company (issue, currency, index) in the Capital IQ COMPUSTAT database.
uses the PERMNO code6. I extract the CIK code from the 10-K filings and use the "CRSP/Compustat Merged Database – Linking Table" service to find the matching PERMNO and GVKEY codes.
Second, I construct features based on the COMPUSTAT and CRSP datasets. The target variable D Return is created in two steps: I first calculate the accumulated return between March 10, 2020 and September 30, 2021, and then set D Return to 1 if the accumulated return exceeds the cross-sectional median value and to 0 otherwise. The control variables are generated according to Azimi and Agrawal (2021)'s definitions.
Lastly, I winsorize all non-dummy variables at the 5% level in both tails of the distribution and fill missing values with the cross-sectional median of each variable. A condensed sketch of these steps is given below.
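As a compact illustration of the median split, the winsorization, and the median imputation, consider the hypothetical ret and feat data frames below (their names and values are made up for exposition); the full implementation on the CRSP and COMPUSTAT data is in Appendix C.

library(dplyr)
library(DescTools)

# hypothetical inputs: `ret` holds one accumulated return per firm over
# 2020-03-10 to 2021-09-30, `feat` holds the non-dummy firm features
ret  <- data.frame(permno = 1:4, acc_return = c(0.35, -0.10, 0.80, 0.05))
feat <- data.frame(permno = 1:4,
                   cash     = c(0.10, NA, 0.45, 0.02),
                   leverage = c(0.20, 0.90, NA, 0.15))

# D Return: 1 if the accumulated return is at or above the cross-sectional median
ret <- ret %>% mutate(d_return = as.integer(acc_return >= median(acc_return)))

# winsorize each non-dummy variable at the 5% level in both tails
# (same DescTools call as in Appendix C), then fill remaining missing
# values with the cross-sectional median of the variable
feat <- feat %>%
  mutate(across(-permno, ~ Winsorize(.x, probs = c(0.05, 0.95), na.rm = TRUE))) %>%
  mutate(across(-permno, ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))

sample_df <- merge(ret, feat, by = "permno")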
D. Sample Description
Figure 1. Word Frequency in Sample Texts (left panel: words with term frequency over 5,000; right panel: top 30 words by TF-IDF statistic)
Note: These two graphs are generated using the R package wordcloud. The Covid-related words are among the most frequent words under both weighting schemes. The words listed in these graphs are selected from the preprocessed sample texts, which include "Item 7 Management's Discussion and Analysis of Financial Condition and Results of Operations" and "Item 7A Quantitative and Qualitative Disclosures about Market Risk" in the 10-K filing. The sample firms are those which disclose their 10-K filings in the fourth quarter of 2021 and are covered by the COMPUSTAT and CRSP databases from March 10, 2020 to September 30, 2021.
The textual analysis clearly shows that the Covid-19 pandemic is among the most frequently discussed topics in the sample firms' 10-K filings, and its importance is supported by the tf-idf statistic at the same time.
6 PERMNO is a unique permanent security identification number assigned by CRSP to each security. Strictly speaking, PERMNO is not a firm identifier, as one firm may issue several securities. For simplicity, I use it to distinguish firms in this analysis.
As illustrated in Figure 1, on the one hand, the absolute term frequency of Covid-related words exceeds 5,000, which is expected given the great shock the Covid-19 pandemic has caused to the world economy. On the other hand, Covid ranks among the top 30 words by the tf-idf metric, which shows that not all firms mention Covid-related words in their 10-K filings. Therefore, it is meaningful to explore whether mentioning Covid-related words in the 10-K filing makes a difference to stock performance.
Table I
Descriptive statistics
Variable No. Mean SD Median Min Max
D Return 248 0.49 0.50 0.00 0.00 1.00
D Covid 248 0.90 0.31 1.00 0.00 1.00
Sentiment 248 -0.27 0.20 -0.31 -0.42 1.00
Cash 248 0.19 0.20 0.13 0.00 1.00
Leverage 248 0.27 0.22 0.25 0.00 1.28
Ln(Sale) 248 6.30 2.57 6.96 -0.42 9.57
Ln(Market Cap) 248 7.10 1.84 7.21 1.59 9.57
Tangibility 248 0.20 0.21 0.14 0.00 0.90
Tobin's q 248 2.14 1.87 1.37 0.10 9.57
Ln(Total Assets) 248 7.01 1.99 7.24 0.34 9.57
Note: This table presents the descriptive statistics for the variables used in this analysis. The sample includes 248 firms which disclose their 10-K in the fourth quarter of 2021 and are covered by the COMPUSTAT and CRSP databases from March 10, 2020 to September 30, 2021. All variables other than the two dummies (D Return and D Covid) are winsorized at the 5% level in both tails of the distribution. The definitions of these variables can be found in Appendix B.
Table I presents the summary statistics of the variables used in this analysis. The target variable D Return is a dummy variable which depicts the stock performance of a firm relative to the whole sample. D Covid and Sentiment are constructed from the 10-K filing and offer quantitative measures of a firm's reaction to Covid-19 and its sentiment during the past financial year. The remaining variables are controls that capture firm characteristics reported in the 10-K filing.
III. Models and Algorithms
Since the target variable in this analysis is a dummy variable, I choose three binary classification models to explore the relationship between stock performance and Covid-19 coverage in 10-K filings.
A. Models
The first model is the logit model, which expresses the log-odds of the outcome as a linear combination of the independent variables. The model specification is as follows:
logit(Pr(D Return = 1)) = α + β1 · D Covid + β2 · Sentiment + γ · Ω ,
where D Return is the dummy variable equal to 1 if the accumulated return is above the median and 0 otherwise, D Covid indicates whether a firm mentions Covid-related words in its 10-K filing, Sentiment is the sentiment score of the firm's 10-K filing, and Ω represents the control variables, namely Cash, Leverage, Ln(Sale), Ln(Market Cap), Tangibility, Tobin's q, and Ln(Total Assets). A detailed description of the variable construction can be found in Appendix B.
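Appendix C estimates this specification through the caret interface; purely as a minimal stand-alone sketch of the same logit regression, using the column names of the sampleset object built in Appendix C:

# minimal sketch of the logit specification; `sampleset` is the merged sample
# from Appendix C, with `return` (D Return) and `covid` (D Covid) as dummies
logit_fit <- glm(return ~ covid + sentiment + cash + lev + logsale +
                   mkt_cap + tang + tobin_q + logat,
                 data = sampleset, family = binomial(link = "logit"))
summary(logit_fit)   # the coefficient on `covid` gauges the Covid-coverage effect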
Apart from the conventional linear model, I also select two machine learning models for this analysis: Classification and Regression Trees (CART) and Random Forest. Neither of these models pre-specifies a model structure, which makes them suitable for exploring non-linear relationships between the target and the predictors.
B. Model Implementation in R
The implementation of these models is straightforward in R. Throughout this analysis, I use the R package caret7 to run the models; the code can be found in Appendix C. In particular, I use repeated K-fold cross-validation to assess the models' performance on the validation folds, which is specified through the number and repeats arguments of the trainControl function.
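For reference, the cross-validation setup mirrors the trainControl call in Appendix C (4 folds, repeated 5 times):

library(caret)

# repeated K-fold cross-validation: `number` sets K (folds), `repeats` sets N (repetitions)
control <- trainControl(method = "repeatedcv", number = 4, repeats = 5,
                        classProbs = TRUE, summaryFunction = twoClassSummary,
                        savePredictions = TRUE)
# e.g., logit <- train(return ~ ., data = train_sample, method = "glm",
#                      family = binomial(link = "logit"), metric = "ROC",
#                      trControl = control)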
In addition, to evaluate models’ prediction performance on a test set, I compare the receiver operating characteristic (ROC) curves for these models.
7See: https://topepo.github.io/caret/
IV. Empirical Results
This section presents the model performance from two perspectives.
A. Model Evaluation
The evaluation of the models is based on the ROC curve, which plots the true positive rate against the false positive rate (one minus the true negative rate) at various classification thresholds. I select the area under the ROC curve (AUC) as the metric to compare the models' ability to distinguish between the two stock performance classes.
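Concretely, the AUC of each model is computed from its class probabilities on the held-out test set; the sketch below is a condensed version of the pROC code in Appendix C, where model stands for any of the three fitted caret objects:

library(pROC)

# class probabilities on the test set (caret labels the classes X0/X1 after make.names)
probs   <- predict(model, newdata = test_sample, type = "prob")
roc_obj <- roc(response = test_sample$return, predictor = probs[["X0"]])
auc(roc_obj)                      # area under the ROC curve
plot(roc_obj, print.auc = TRUE)   # ROC curve with the AUC printed on it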
Figure 2. Receiver Operating Characteristic
Figure 2 illustrates the performance of the three models on the test set. The logit model (red line) achieves the highest AUC, at roughly 60%. The two machine learning models perform similarly, and both only slightly outperform random guessing (an AUC of 50%).
B. Does Covid Coverage Matter?
To examine the relationship between stock performance and Covid coverage, I calculate an importance score for all the predictors using a built-in function of the R package caret.8 As shown in Figure 3, the Covid coverage predictor D Covid ranks near the bottom of the predictors in all three models, which indicates that it contributes little to the stock performance classification. Previous studies have shown that sentiment can predict abnormal returns (Jegadeesh and Wu, 2013); in this analysis, Sentiment demonstrates some importance in distinguishing the sample firms' stock performance, though its ranking varies across models.
Figure 3. Variable Importance
V. Conclusion
In this project, I use textual analysis techniques to extract firms' Covid coverage information from their 10-K filings and construct sentiment scores from the same texts. I then combine this novel information with conventional financial information and implement several machine learning models to study firms' stock performance.
8The calculation of variable importance score is beyond the scope of this project.
The empirical results show that there exists no significant relationship between Covid coverage in 10-K filings and stock performance, while the sentiment score contributes to the stock performance classification.
Due to time constraints, this analysis is limited to a small sample, which greatly limits the models' performance. With a longer time frame, this research could be expanded to a much larger sample and combined with more advanced textual analysis techniques.
VI. References
Azimi, M. and Agrawal, A. (2021). Is positive sentiment in corporate annual reports informative? Evidence from deep learning. Review of Asset Pricing Studies, 11(4):762–805.
Gentzkow, M., Kelly, B., and Taddy, M. (2019). Text as data. Journal of Economic Literature, 57(3):535–74.
Jegadeesh, N. and Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110(3):712–729.
Loughran, T. and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance, 66(1):35–65.
Loughran, T. and McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54(4):1187–1230.
Appendix A. Stop Words and Covid-related Words
Table II
List of stop words and Covid-related words
Category Content
Stop Word i, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, would, should, could, ought, i'm, you're, he's, she's, it's, we're, they're, i've, you've, we've, they've, i'd, you'd, he'd, she'd, we'd, they'd, i'll, you'll, he'll, she'll, we'll, they'll, isn't, aren't, wasn't, weren't, hasn't, haven't, hadn't, doesn't, don't, didn't, won't, wouldn't, shan't, shouldn't, can't, cannot, couldn't, mustn't, let's, that's, who's, what's, here's, there's, when's, where's, why's, how's, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, company, companies, year, years, million, millions, billion, billions, trillion, trillions, statement, statements, report, reports, reporting, note, notes, will, shall, include, including, includes, included, date, dates, result, results, table, tables, period, periods, amount, amounts, management, also, following, quarter, quarters, annual, annually, quarterly, financial, finance, asset, assets, cash, cashes, income, sales, revenue, tax, taxes, share, shares, stock, stocks, cogs, rate, rates, expenses, accounting, price, per, equity, equities, item, items, total, net, gross, consolidated, time, securities, upon, relate, related, relates, due, ii, iii, iv, v, vi, vii, viii, ix, x
Covid-related Word covid, corona, coronavirus, pandemic, virus, vaccine, lockdown, quarantine, disease, infection, cdc, crisis, crises
Note: Before conducting text cleaning procedures, I transform all the letters into lowercase, hence all the words in this table are case insensitive.
Appendix B. Variable Definition
Table III
Variable definitions
Variable (Source): Definition
Cash (COMPUSTAT): Cash and cash equivalents divided by total assets, che/at.
D Covid (EDGAR 10-K filings): A dummy variable which is assigned to 1 if the filing contains at least one of the pre-specified Covid-related words, otherwise it is set to 0.
D Return (CRSP): A dummy variable which is assigned to 1 if the accumulated return of a firm is above the cross-sectional median value, otherwise it is set to 0.
Leverage (COMPUSTAT): Leverage ratio, measured as interest-bearing liabilities divided by total assets, (dltt + dlc)/at.
Ln(Market Cap) (COMPUSTAT): Natural log of the market value of common shares, log(prcc_f * csho).
Ln(Sale) (COMPUSTAT): Natural log of total sales, log(sale).
Ln(Total Assets) (COMPUSTAT): Natural log of total assets, log(at).
Sentiment (EDGAR 10-K filings): A sentiment score calculated as the difference between the number of positive and negative words divided by the sum of these two types of words in one filing, (positive - negative)/(positive + negative).
Tangibility (COMPUSTAT): Property, plant, and equipment divided by total assets, ppent/at.
Tobin's q (COMPUSTAT): The market value of a firm divided by the replacement value of the firm's assets, (prcc_f * csho + pstk + dltt + dlc)/at.
Note: I mainly follow Azimi and Agrawal (2021)'s definitions to build the COMPUSTAT variables used in the analyses. The remaining variables are calculated based on my own knowledge.
Appendix C. R Code
A. Initialization
rm(list = ls())
graphics.off()
# set working directory to the location of this script
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# load packages
library(tidyverse)
library(ggplot2)
library(tm)
library(wordcloud)
library(tidytext)
library(readr)
library(caret)
library(stringr)
library(edgar)
library(haven)
library(lubridate)
library(DescTools)
library(xtable)
library(psych)
library(reshape)
library(pROC)
B. Text Preprocessing
### load 10-K files published in quarter 4 of 2021
temp <- list.files(path = "10-X/QTR4", pattern = "*10-K_edgar*", full.names = TRUE)
# extract cik code from the file name
p1 <- "^10-X/QTR4/[0-9]+_10-K_[a-z]+_[a-z]+_*"
p2 <- "*_[0-9]+-[0-9]+-[0-9]+.txt$"
temp1 <- gsub(p1, "", temp)
temp2 <- gsub(p2, "", temp1)
# rename files with the cik code and store them in the designated data folder
file.rename(temp, paste0("data/", temp2, ".txt"))
rm(list = ls(pattern = "^[temp*|p*]"))
# generate corpus
mycorpus <- Corpus(DirSource("data"))
### Text Cleaning
# define a new stopword list
stopword_list <- c(stopwords("en"), "company", "companies", "year", "years", "million",
                   "millions", "billion", "billions", "trillion", "trillions", "statement",
                   "statements", "report", "reports", "reporting", "note", "notes", "will",
                   "shall", "include", "including", "includes", "included", "date", "dates",
                   "result", "results", "table", "tables", "period", "periods", "amount",
                   "amounts", "management", "also", "following", "quarter", "quarters",
                   "annual", "annually", "quarterly", "financial", "finance", "asset",
                   "assets", "cash", "cashes", "income", "sales", "revenue", "tax", "taxes",
                   "share", "shares", "stock", "stocks", "cogs", "rate", "rates", "expenses",
                   "accounting", "price", "per", "equity", "equities", "item", "items",
                   "total", "net", "gross", "consolidated", "time", "securities", "upon",
                   "relate", "related", "relates", "due", "ii", "iii", "iv", "v", "vi",
                   "vii", "viii", "ix", "x")
# define a covid-related word list
covid_word <- c("covid", "corona", "coronavirus", "pandemic", "virus", "vaccine",
                "lockdown", "quarantine", "disease", "infection", "cdc", "crisis", "crises")
# define a content transformer that replaces a pattern with a space
addspace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# define a content transformer that unifies the covid-related words into "covid"
covidtransformer <- content_transformer(function(x, pattern) gsub(pattern, " covid ", x))
# define the text_cleaning function
text_cleaning <- function(corpus) {
  library(tm)
  # change the whole text to lowercase
  temp_corpus <- tm_map(corpus, tolower)
  # keep only Item 7 (MD&A) and Item 7A (Quantitative and Qualitative Disclosures About Market Risk)
  temp_corpus <- tm_map(temp_corpus, addspace, "^.*item\\s7\\.")
  temp_corpus <- tm_map(temp_corpus, addspace, "item\\s8\\..*$")
  # combine covid-related words into the single token "covid"
  temp_corpus <- tm_map(temp_corpus, covidtransformer, paste0(covid_word, collapse = "|"))
  # remove numbers
  temp_corpus <- tm_map(temp_corpus, removeNumbers)
  # remove stopwords
  temp_corpus <- tm_map(temp_corpus, removeWords, stopword_list)
  # remove month names
  temp_corpus <- tm_map(temp_corpus, removeWords, tolower(c(month.name, month.abb)))
  # remove punctuation, keeping intra-word contractions and intra-word dashes
  temp_corpus <- tm_map(temp_corpus, removePunctuation,
                        preserve_intra_word_contractions = TRUE,
                        preserve_intra_word_dashes = TRUE)
  # remove single-letter words
  temp_corpus <- tm_map(temp_corpus, addspace, "*\\b[[:alpha:]]{1}\\b*")
  # remove newline signs
  temp_corpus <- tm_map(temp_corpus, addspace, "\n")
  # strip extra white space
  temp_corpus <- tm_map(temp_corpus, stripWhitespace)
  return(temp_corpus)
}
# clean the corpus
mycorpus1 <- text_cleaning(mycorpus)
### Wordcloud
# generate word cloud based on term frequency
dtm <- DocumentTermMatrix(mycorpus1)
freq_words <- colSums(as.matrix(dtm))
freq_words <- sort(freq_words, decreasing = TRUE)
wf <- data.frame(word = names(freq_words), frequency = freq_words, row.names = NULL)
wordcloud(wf$word, wf$frequency, min.freq = 5000,   # keep words with frequency over 5,000
          scale = c(4, 1), rot.per = 0,             # do not rotate words
          random.order = FALSE,                     # plot words in decreasing frequency
          colors = brewer.pal(1, "Paired"))
# generate word cloud based on TF-IDF
dtm_tfidf <- DocumentTermMatrix(mycorpus1, control = list(weighting = weightTfIdf))
data <- as.matrix(dtm_tfidf)
freq_tfidf <- colMeans(data)
freq_tfidf <- sort(freq_tfidf, decreasing = TRUE)
wf_tfidf <- data.frame(word = names(freq_tfidf), frequency = freq_tfidf, row.names = NULL)
wordcloud(wf_tfidf$word, wf_tfidf$frequency, max.words = 30,   # show at most 30 words
          scale = c(4, 1), rot.per = 0,                        # do not rotate words
          random.order = FALSE,                                # plot words in decreasing frequency
          colors = brewer.pal(1, "Paired"))
### Tokenization
# convert the corpus to a tidytext-friendly data frame
mytext <- data.frame(cik = gsub(".txt", "", names(mycorpus1)),
                     text = get("content", mycorpus1),
                     stringsAsFactors = FALSE,
                     row.names = NULL)
# tokenize into words
tokentext <- mytext %>% unnest_tokens(word, text)
# generate covid dummy:
# if "covid" exists in a firm's filing, assign 1, otherwise assign 0
d_covid <- tokentext %>% group_by(cik) %>%
  mutate(covid = ifelse("covid" %in% word, 1, 0)) %>%
  select(cik, covid) %>%
  unique()
### Sentiment Analysis
sent <- tokentext %>% inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(cik) %>% count(word, sentiment, sort = TRUE) %>%
  spread(sentiment, n, fill = 0) %>%
  summarize(positive = sum(positive), negative = sum(negative)) %>%
  mutate(sentiment = (positive - negative) / (positive + negative))
# extract cik code list as firm identifier
cik_list <- unique(as.numeric(sent$cik))
# write.table(cik_list, "cik_list.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE)
# merge sentiment data and covid data
sent_covid <- merge(sent, d_covid, by = "cik")
C. Stock Features Data Preprocessing
### Create firm identifier (permno-gvkey-cik) link table
# load link table downloaded from WRDS
link <- read.csv("link.csv", header = TRUE, stringsAsFactors = FALSE)
link <- link %>% select(c(gvkey, LPERMNO, cik)) %>% filter(!is.na(cik)) %>% unique()
colnames(link)[2] <- "permno"
# extract permno codes to be used in CRSP
permno_list <- unique(link$permno)
# write.table(permno_list, file = "permno_list.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE)
# extract gvkey codes to be used in COMPUSTAT
gvkey_list <- unique(link$gvkey)
# write.table(gvkey_list, file = "gvkey_list.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE)
### Create Return Variable
# Source: CRSP
crsp <- read.csv("crsp.csv", header = TRUE, stringsAsFactors = FALSE)
crsp$date <- ymd(crsp$date)
# generate return dummy:
# if the accumulated return of a firm >= median, assign 1, otherwise 0
crsp <- crsp %>% group_by(PERMNO) %>%
  filter(date >= ymd("20200310")) %>%   # WHO characterized COVID-19 as a pandemic on Mar 11, 2020
  slice(c(1, n())) %>%
  mutate(return = PRC / lag(PRC) - 1) %>%
  ungroup() %>%
  filter(!is.na(return)) %>%
  mutate(d_return = ifelse(return >= median(return), 1, 0)) %>%
  select(PERMNO, d_return)
colnames(crsp)[1] <- "permno"
### Create Control Variables
# Source: COMPUSTAT
comp <- read.csv("compustat.csv", header = TRUE, stringsAsFactors = FALSE)
comp$date <- ymd(comp$datadate)
# generate control variables
comp <- comp %>% filter(year == 2021) %>%
  mutate(cash = che / at,                                       # cash and cash equivalents / total assets
         lev = (dltt + dlc) / at,                               # leverage ratio
         logsale = log(sale),                                   # ln(sale)
         mkt_cap = log(prcc_f * csho),                          # ln(market cap)
         tang = ppent / at,                                     # tangibility
         tobin_q = (prcc_f * csho + pstk + dltt + dlc) / at,    # Tobin's q
         logat = log(at)) %>%                                   # ln(total assets)
  select(c(gvkey, date, cash, lev, logsale, mkt_cap, tang, tobin_q, logat))
D. Sample Dataset Creation
# merge text data and stock features
sampleset <- merge(link, sent_covid, by = "cik")
sampleset <- merge(sampleset, crsp, by = "permno")
sampleset <- merge(sampleset, comp, by = "gvkey")
sampleset <- sampleset %>% select(-c(cik, permno, gvkey, date, positive, negative))
# winsorize all non-dummy variables at the 5% level in both tails
sampleset_win <- sampleset %>% select(-c(d_return, covid)) %>%
  Winsorize(probs = c(0.05, 0.95), na.rm = TRUE)
sampleset <- cbind(sampleset$d_return, sampleset$covid, sampleset_win)
rm(sampleset_win)
# replace missing values with the median of each variable
sampleset <- sampleset %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)) %>%
  dplyr::rename(return = `sampleset$d_return`, covid = `sampleset$covid`)
# summary statistics of all variables
sample_sum <- sampleset %>% describe() %>%
  select(n, mean, sd, median, min, max)
E. Models
sampleset$return <- as.factor(sampleset$return)
sampleset$covid <- as.factor(sampleset$covid)
### split the data for cross-validation
set.seed(123)
index <- createDataPartition(sampleset$return,   # the target variable
                             p = 0.7,            # share of data going into the training set
                             list = FALSE)       # return a matrix rather than a list
# generate train and test datasets
train_sample <- sampleset[index, ]
test_sample <- sampleset[-index, ]
### model setup
control <- trainControl(method = "repeatedcv",              # repeated cross-validation
                        number = 4,                         # 4-fold cross-validation
                        repeats = 5,                        # repeated 5 times
                        classProbs = TRUE,                  # return class probabilities
                        summaryFunction = twoClassSummary,
                        savePredictions = TRUE)             # keep predictions for plotting ROC
levels(train_sample$return) <- make.names(levels(factor(train_sample$return)))
### Logistic regression
logit <- train(return ~ ., data = train_sample,
               method = "glm", family = binomial(link = "logit"),
               metric = "ROC", trControl = control)
### CART
cart <- train(return ~ ., data = train_sample, method = "rpart",
              metric = "ROC", trControl = control)
### Random forest
rf <- train(return ~ ., data = train_sample, method = "rf",
            metric = "ROC", trControl = control)
### collect and summarize the resampling results
results <- resamples(list(logit = logit, cart = cart, random_forest = rf))
bwplot(results)
### variable importance
varimp_list <- list(varImp(logit)$importance, varImp(cart)$importance, varImp(rf)$importance)
varimp_list <- lapply(varimp_list, function(x) data.frame(x, rn = row.names(x)))
varimp <- data.frame(rn = row.names(varImp(logit)$importance))
for (i in varimp_list) {
  varimp <- varimp %>% merge(i, by = "rn", all = TRUE)
}
rm(i)
names(varimp) <- make.names(names(varimp), unique = TRUE)
varimp <- varimp %>%
  dplyr::rename(Variable = rn, logit = Overall.x, cart = Overall.y, rf = Overall) %>%
  arrange(-logit, -cart, -rf)
# change the data frame from wide to long format
varimp <- melt(varimp)
colnames(varimp) <- c("Feature", "Model", "Value")
# draw heatmap of variable importance
varimp %>% ggplot(aes(x = Model, y = Feature, fill = Value)) + geom_tile()
### ROC
pred_logit <- predict(logit, test_sample, type = "prob")
pred_cart <- predict(cart, test_sample, type = "prob")
pred_rf <- predict(rf, test_sample, type = "prob")
roc(test_sample$return, pred_logit$X0, plot = TRUE, percent = TRUE,
    xlab = "True Negative Percentage", ylab = "True Positive Percentage",
    col = "red", print.auc = TRUE)
plot.roc(test_sample$return, pred_cart$X0, percent = TRUE, print.auc = TRUE,
         col = "blue", add = TRUE, print.auc.y = 45)
plot.roc(test_sample$return, pred_rf$X0, percent = TRUE, print.auc = TRUE,
         col = "green", add = TRUE, print.auc.y = 40)
legend("bottomright", legend = c("Logit Model", "CART", "Random Forest"),
       col = c("red", "blue", "green"), lwd = 2)