
Does Stock Return Reflect the Impact of Covid-19 Pandemic? A Text-based Analysis on Firms’ 10-K Filings 

Individual Assignment Report for the Financial Data Analytics Course April 2022 

Abstract 

I use textual analysis techniques to extract Covid coverage information and a sentiment score from firms’ 10-K filings. This novel information is combined with conventional financial data to predict stock performance in machine learning models. The empirical results show no significant relationship between Covid coverage in 10-K filings and stock performance, while sentiment contributes to the stock performance classification.

Acknowledgments: I would like to thank Professor Mancy Luo for her excellent guidance throughout this course. 

Keywords: 10-K, Textual Analysis, Sentiment Analysis, Logistic Regression, Random Forest

I. Introduction 

The Covid-19 pandemic has triggered a major crisis worldwide, and its impact on the stock market has been dramatic. With an increasing number of countries gradually removing Covid-related restrictions, it is meaningful to take a retrospective view and examine the impact of the Covid shock on firms’ stock performance since the outbreak.

This report uses basic textual analysis techniques and machine learning models to explore the following research question:

Is stock performance associated with the firm’s Covid coverage in 10-K filing? 

New technologies have enabled researchers to extract valuable information from the vast quantities of digital text and incorporate it in their studies (Gentzkow et al., 2019). In the field of finance, the periodic reports (e.g., 10-K) disclosed by firms have been widely analyzed as these filings provide ample business and financial information about the firms. Conventional research focuses more on the accounting data as it is readily available and easy to analyze. With the help of novel text-mining techniques, researchers now can use text as an input to their research. In this report, I use textual analysis techniques to build up a dummy variable to proxy for the Covid coverage in the firm’s 10-K filing. 

In addition, the sentiment in the text has also been widely analyzed in finance (Loughran and McDonald, 2016; Azimi and Agrawal, 2021) and previous studies have shown that positive and negative sentiment can predict abnormal stock returns (Jegadeesh and Wu, 2013). Hence, I also construct a sentiment score and add it into the analysis. 

The remainder of this report is organized as follows: Section II provides the data sources and cleaning procedures used in this report, and Section III presents the models and algorithms. The empirical findings and discussion are presented in Section IV. Section V concludes the report. Appendix A shows the tailored stop word list and Covid-related word list, Appendix B gives a detailed description of the variables used in the empirical study, and Appendix C presents the R code used in this report.


II. Data 

A. Data Source 

The data used in this analysis are gathered from the texts of 10-K filings disclosed by firms and from conventional financial databases, namely COMPUSTAT and CRSP.

According to the Securities Exchange Act of 1934 and its subsequent amendments, issuers with securities registered under, or subject to, certain sections of the Act are required to submit periodic filings (e.g., 10-K and 10-Q). These filings are stored on the Securities and Exchange Commission’s (SEC) EDGAR website and can be accessed by the general public. However, the raw filings are encoded as browser-friendly files that contain a large amount of code irrelevant to this analysis, and it is beyond the scope of this report to tackle this redundant information. Therefore, in this analysis I use the preprocessed 10-K filing data labeled as “Stage One Parse”1, published by the Software Repository for Accounting and Finance at the University of Notre Dame.

Apart from the novel text data, I also build features using the COMPUSTAT and CRSP databases, both of which are easily accessed through Wharton Research Data Services2. The sample contains all firms that disclosed their 10-K filings in the fourth quarter of 2021 and are covered by COMPUSTAT and CRSP from March 10, 2020 to September 30, 20213. A detailed description of the usage of the databases can be found in Appendix B.

B. Text Preprocessing 

In general, there are five main steps in the text preprocessing. 

First, I create a corpus from the sample texts and narrow the scope of my textual analysis to the Item 7 and Item 7A sections of the 10-K filing. Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of Operations” (MD&A for short), allows firm management to give the firm’s perspective on the business results of the past financial year, and Item 7A, “Quantitative and Qualitative Disclosures about Market Risk”, discusses how the firm manages its market risk exposures. Both sections provide the most relevant information on the firm’s operational performance and are of main interest to researchers.

1A detailed documentation for this data can be found at this link: https://sraf.nd.edu/sec-edgar-data/cleaned-10x-files/10x-stage-one-parsing-documentation/

2See: https://wrds-www.wharton.upenn.edu/ 

3On March 11, 2020, the World Health Organisation characterized Covid-19 as a pandemic. I select the day before this date as the beginning of the sample period. Since the purpose of this analysis is to evaluate the stock performance between the start of pandemic and the release of 10-K, I choose the last day of Q3 in 2021 to be the end point as all the 10-K filings are disclosed in Q4 of 2021 in the sample. 


Therefore, I extract these two sections from the original 10-K filings to depict the impact of Covid-19 on firms.

Second, I create a stopword list and a Covid-related word list to clean the text. In addition to the default stopword list provided in the R package tm, I add a few more words (e.g., company, result) to fit the context of this analysis; none of these words carries useful information here, so they are removed from the text. Besides, since there exist various ways of referring to the Covid-19 pandemic (coronavirus, pandemic, etc.), I explicitly define a Covid-related word list and replace every word on this list with the single token “covid” to simplify the evaluation of the Covid-19 impact in the text. The stopword list and the Covid-related word list can be found in Appendix A.
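The following is a minimal sketch of this substitution step on a toy document, assuming the tm package; the word lists shown here are abbreviated (the full lists are in Appendix A, and the complete cleaning function is in Appendix C):

library(tm)

# abbreviated word lists; the full lists are given in Appendix A
covid_words <- c("covid-19", "coronavirus", "pandemic", "lockdown")
extra_stopwords <- c(stopwords("en"), "company", "result", "million")

# map every Covid-related expression onto the single token "covid"
to_covid <- content_transformer(function(x, pattern) gsub(pattern, " covid ", x))

toy <- VCorpus(VectorSource("The coronavirus pandemic reduced company sales."))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, to_covid, paste(covid_words, collapse = "|"))
toy <- tm_map(toy, removeWords, extra_stopwords)
toy <- tm_map(toy, stripWhitespace)
content(toy[[1]])   # inspect the cleaned toy document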

Third, I implement some conventional text cleaning procedures, which include changing the whole text to lowercase, removing numbers, punctuation, single-letter words, and newline characters, and stripping white space.

Fourth, in order to exhibit the most frequent words in the texts, I use the cleaned corpus to create two Document Term Matrices, one based on absolute word frequency and one on the TF-IDF statistic. I then use the R package wordcloud to generate word cloud graphs that display the most frequent words in the sample.

Lastly, I tokenize the text into words and generate a Covid-related dummy feature and a sentiment score for further analysis. Specifically, for each firm’s tokenized text, if the word “covid” appears, I assign the Covid dummy the value 1; otherwise it is set to 0. To create the sentiment score for each firm, I merge the tokenized text with the loughran sentiment word list (Loughran and McDonald, 2011) provided in the R package tidytext and compute the sentiment score from the counts of positive and negative words.
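A condensed sketch of this last step, assuming a hypothetical data frame tokens with one row per (cik, word) pair as produced by the tokenization above (the full pipeline appears in Appendix C):

library(dplyr)
library(tidyr)
library(tidytext)

# Covid dummy: 1 if the token "covid" appears anywhere in a firm's filing
d_covid <- tokens %>%
  group_by(cik) %>%
  summarize(covid = as.integer("covid" %in% word))

# sentiment score: (positive - negative) / (positive + negative) word counts
sentiment_score <- tokens %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(cik, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = (positive - negative) / (positive + negative))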

C. Financial Data Preprocessing 

The financial data preprocessing includes three parts. 

First, I create a linking table to match different firm identifiers. Since I use different data sources to create firm features, it is essential to merge them into a single dataset for further analysis. However, the codes used to identify firms vary across these data sources. Specifically, the 10-K text data uses the CIK code4 to identify firms, COMPUSTAT uses the GVKEY code5, and CRSP

4The Central Index Key (CIK) is used on the SEC’s computer systems to identify corporations and individual people who have filed disclosure with the SEC. 

5The Global Company Key or GVKEY is a unique six-digit number key assigned to each company (issue, currency, index) in the Capital IQ COMPUSTAT database.

uses the PERMNO code6. I extract the CIK code from the 10-K filings and use the “CRSP/Compustat Merged Database – Linking Table” service to find the matching PERMNO and GVKEY codes.

Second, I construct features based on the COMPUSTAT and CRSP datasets. The target variable D Return is created in two steps: I first calculate the accumulated return between March 10, 2020 and September 30, 2021, and then assign D Return the value 1 if the accumulated return is above the cross-sectional median, and 0 otherwise. The control variables are generated according to Azimi and Agrawal (2021)’s definitions.
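A minimal sketch of the return dummy construction, condensed from the Appendix C code and assuming a data frame crsp of daily CRSP prices with columns PERMNO, date, and PRC:

library(dplyr)
library(lubridate)

d_return <- crsp %>%
  filter(date >= ymd("2020-03-10"), date <= ymd("2021-09-30")) %>%
  group_by(PERMNO) %>%
  arrange(date, .by_group = TRUE) %>%
  # accumulated return between the first and last trading day in the window
  summarize(cum_return = last(PRC) / first(PRC) - 1) %>%
  # 1 if at or above the cross-sectional median (as in the Appendix C code), else 0
  mutate(d_return = as.integer(cum_return >= median(cum_return, na.rm = TRUE)))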

Lastly, I winsorize all non-dummy variables at the 5% level in both tails of the distribution and fill missing values with the cross-sectional median of each variable.
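A condensed sketch of this cleaning step, assuming the merged firm-level data frame sampleset from Appendix C with identifiers already dropped and the dummy columns d_return and covid retained (newer versions of DescTools replace the probs and na.rm arguments of Winsorize with a single val argument):

library(dplyr)
library(DescTools)

numeric_vars <- setdiff(names(sampleset), c("d_return", "covid"))

sampleset_clean <- sampleset %>%
  # winsorize each non-dummy variable at the 5th and 95th percentiles
  mutate(across(all_of(numeric_vars),
                ~ Winsorize(.x, probs = c(0.05, 0.95), na.rm = TRUE))) %>%
  # replace remaining missing values with the cross-sectional median
  mutate(across(all_of(numeric_vars),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))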

D. Sample Description 

[Figure 1 consists of two word-cloud panels: “Words with Term Frequency over 5,000” and “Top 30 Words in TF-IDF Statistic”.]

Figure 1. Word Frequency in Sample Texts 

Note: These two graphs are generated using the R package wordcloud. The Covid-related words are among the most frequent words in both standards. The words listed in these graphs are selected from the preprocessed sample texts, which include “Item 7 Management’s Discussion and Analysis of Financial Condition and Results of Operations” and “Item 7A Quantitative and Qualitative Disclosures about Market Risk” in the 10-K filing. The sample firms are chosen from those which disclose their 10-K filings in the fourth quarter of 2021 and are covered in COMPUSTAT and CRSP databases from March 10, 2020 to September 30, 2021. 

The textual analysis clearly shows that the Covid-19 pandemic is among the most discussed topics in the sample firms’ 10-K filings, and its importance is also supported by the tf-idf statistic.


6 PERMNO is a unique permanent security identification number assigned by CRSP to each security. Strictly speaking, PERMNO is not a firm identifier as one firm may issue several securities. For simplicity, I use it to distinguish firms in this analysis. 


As illustrated in Figure 1, on the one hand, the absolute term frequency of Covid-related words exceeds 5,000, which is expected given the great shock the Covid-19 pandemic has caused to the world economy. On the other hand, “covid” also ranks among the top 30 words by the tf-idf metric, which shows that not all firms mention Covid-related words in their 10-K filings. Therefore, it is meaningful to explore whether mentioning Covid-related words in the 10-K filing makes a difference to stock performance.
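For reference, the weighting behind the TF-IDF panel of Figure 1 takes the standard form (the weightTfIdf function in the tm package uses a variant with document-length-normalized term frequencies and a base-2 logarithm):

tf-idf(t, d) = tf(t, d) × log2( N / df(t) )

where tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. A term that appears in every filing has df(t) = N and therefore receives zero weight, so a high tf-idf rank for “covid” indicates that its coverage varies across firms.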

Table I 

Descriptive statistics 

Variable No. Mean SD Median Min Max 

D Return 248 0.49 0.50 0.00 0.00 1.00 

D Covid 248 0.90 0.31 1.00 0.00 1.00 

Sentiment 248 -0.27 0.20 -0.31 -0.42 1.00 

Cash 248 0.19 0.20 0.13 0.00 1.00 

Leverage 248 0.27 0.22 0.25 0.00 1.28 

Ln(Sale) 248 6.30 2.57 6.96 -0.42 9.57 

Ln(Market Cap) 248 7.10 1.84 7.21 1.59 9.57 

Tangibility 248 0.20 0.21 0.14 0.00 0.90 

Tobin's q 248 2.14 1.87 1.37 0.10 9.57

Ln(Total Assets) 248 7.01 1.99 7.24 0.34 9.57 

Note: This table presents the descriptive statistics for the variables used in this analysis. The sample includes 248 firms which disclose their 10-K in the fourth quarter of 2021 and are covered by the COMPUSTAT and CRSP databases from March 10, 2020 to September 30, 2021. All variables other than the two dummies (D Return and D Covid) are winsorized at the 5% level in both tails of the distribution. The definition of these variables can be found in Appendix B.

Table I presents the summary statistics of the variables used in this analysis. The target variable D Return is a dummy variable that depicts a firm’s stock performance relative to the whole sample. D Covid and Sentiment are constructed from the 10-K filing and offer quantitative measures of firms’ reactions to Covid-19 and their sentiment during the past financial year. The remaining variables are controls that capture firm characteristics disclosed in the 10-K filing.


III. Models and Algorithms 

Since the target variable in this analysis is a dummy variable, I choose three binary classification models to explore the relationship between stock performance and Covid-19 coverage in 10-K filings.

A. Models 

The first model is the logit model, which relates the outcome to a linear combination of one or more independent variables. The model formula is as follows:

D Return = α + β1 · D Covid + β2 · Sentiment + γ · Ω

where D Return is the dummy variable equal to 1 if the accumulated return is above the median and 0 otherwise, D Covid indicates whether a firm mentions Covid-related words in its 10-K filing, Sentiment is the sentiment score measured from the firm’s 10-K filing, and Ω represents the control variables, namely Cash, Leverage Ratio, Sales, Market Cap, Tangibility, Tobin’s q, and Total Assets. A detailed description of the variable construction can be found in Appendix B.
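Since D Return is binary and the model is estimated as a logit (Appendix C fits it via glm with a binomial logit link), the formula above is shorthand for modeling the probability

P(D Return = 1) = 1 / (1 + exp[−(α + β1 · D Covid + β2 · Sentiment + γ · Ω)]),

so the linear index enters through the logistic link rather than determining D Return directly.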

Apart from the conventional linear model, I also select two machine learning models for this analysis: Classification and Regression Trees (CART) and Random Forest. Neither of these models pre-specifies a functional form, and both are suitable for exploring non-linear relationships between the target and the predictors.

B. Model Implementation in R 

The implementation of these models is straightforward in R. Throughout this analysis, I use the R package caret7 to run the models, and the code can be found in Appendix C. In particular, I use repeated K-fold cross-validation to assess the models’ performance on the validation data; this is configured through the number and repeats arguments of the trainControl function.
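A condensed sketch of the resampling setup, mirroring the Appendix C code (4 folds repeated 5 times, with class probabilities retained so that ROC-based metrics can be computed):

library(caret)

control <- trainControl(method = "repeatedcv",        # repeated cross-validation
                        number = 4,                   # number of folds
                        repeats = 5,                  # number of repetitions
                        classProbs = TRUE,            # keep class probabilities
                        summaryFunction = twoClassSummary,
                        savePredictions = TRUE)       # needed later for the ROC curves

# the logit model, for example, is then trained as
# logit <- train(return ~ ., data = train_sample, method = "glm",
#                family = binomial(link = "logit"), metric = "ROC", trControl = control)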

In addition, to evaluate models’ prediction performance on a test set, I compare the receiver operating characteristic (ROC) curves for these models. 

7See: https://topepo.github.io/caret/ 


IV. Empirical Results 

This section presents the empirical results from two perspectives: overall model evaluation and the importance of the Covid coverage predictor.

A. Model Evaluation 

The evaluation of the models is based on the ROC curve, which plots the true positive rate (sensitivity) against the true negative rate (specificity) at various classification thresholds; this is the pROC package’s presentation of the conventional plot of sensitivity against the false positive rate. I select the area under the ROC curve (AUC) as the metric to compare the models’ ability to distinguish between the two stock performance classes.
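As a minimal, self-contained illustration of the AUC metric with the pROC package, the outcome and probability vectors below are hypothetical and stand in for the test-set labels and the predicted class probabilities produced by the fitted models:

library(pROC)

truth <- c(1, 0, 1, 1, 0, 0, 1, 0)                           # observed binary outcomes
prob  <- c(0.81, 0.35, 0.62, 0.55, 0.47, 0.20, 0.73, 0.58)   # predicted P(class = 1)

roc_obj <- roc(response = truth, predictor = prob)
auc(roc_obj)                       # area under the ROC curve; 0.5 corresponds to random guessing
plot(roc_obj, print.auc = TRUE)    # draw the curve with the AUC printed on it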

Figure 2. Receiver Operating Characteristic 

Figure 2 illustrates the performance of the three models. The logit model (red line) outperforms the other models in terms of the AUC metric, correctly classifying about 60% of the firms’ stock performance in the test set. The two machine learning models perform similarly, and both only slightly outperform the random-guess benchmark (50%).


B. Does Covid Coverage Matter? 

To examine the relationship between stock performance and Covid coverage, I calculate importance scores for all predictors using a built-in function of the R package caret8. As shown in Figure 3, the Covid coverage predictor D Covid ranks at the bottom of the predictors in all three models, which indicates that it contributes little to the stock performance classification. Previous studies have shown that sentiment can predict abnormal returns (Jegadeesh and Wu, 2013); in this analysis, Sentiment shows some ability to distinguish the sample firms’ stock performance, though its importance varies across models.
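A condensed sketch of the importance calculation, assuming the fitted caret models logit, cart, and rf from Appendix C are available in the workspace; caret scales each model’s scores to the range 0-100, so rankings rather than raw levels should be compared across models:

library(caret)

varImp(logit)$importance   # importance scores from the logistic regression
varImp(cart)$importance    # importance scores from the CART model
varImp(rf)$importance      # importance scores from the random forest

# quick visual check for a single model
plot(varImp(rf), top = 9)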

Figure 3. Variable Importance 

V. Conclusion 

In this project, I use textual analysis techniques to extract firms’ Covid coverage information in their 10-K filings and construct their sentiment scores based on texts at the same time. Then I incorporate this novel information with conventional financial information and implement several machine learning models to study firms’ stock performance. 

8The calculation of variable importance score is beyond the scope of this project. 


The empirical results of this analysis show that there exists no significant relationship between Covid coverage in 10-K filings and stock performance, while the sentiment score can contribute to the stock performance classification.

Due to time constraints, this analysis is limited to a small sample, which limits the models’ performance. With a longer time frame, this research could be expanded to a much larger sample and implemented with more advanced textual analysis techniques.

VI. References 

Azimi, M. and Agrawal, A. (2021). Is positive sentiment in corporate annual reports informative? Evidence from deep learning. Review of Asset Pricing Studies, 11(4):762–805.

Gentzkow, M., Kelly, B., and Taddy, M. (2019). Text as data. Journal of Economic Literature, 57(3):535–74. 

Jegadeesh, N. and Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110(3):712–729. 

Loughran, T. and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance, 66(1):35–65.

Loughran, T. and McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54(4):1187–1230. 


Appendix A. Stop Words and Covid-related Words 

Table II 

List of stop words and Covid-related words 

Category Content 

Stop Word i, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, would, should, could, ought, i'm, you're, he's, she's, it's, we're, they're, i've, you've, we've, they've, i'd, you'd, he'd, she'd, we'd, they'd, i'll, you'll, he'll, she'll, we'll, they'll, isn't, aren't, wasn't, weren't, hasn't, haven't, hadn't, doesn't, don't, didn't, won't, wouldn't, shan't, shouldn't, can't, cannot, couldn't, mustn't, let's, that's, who's, what's, here's, there's, when's, where's, why's, how's, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, company, companies, year, years, million, millions, billion, billions, trillion, trillions, statement, statements, report, reports, reporting, note, notes, will, shall, include, including, includes, included, date, dates, result, results, table, tables, period, periods, amount, amounts, management, also, following, quarter, quarters, annual, annually, quarterly, financial, finance, asset, assets, cash, cashes, income, sales, revenue, tax, taxes, share, shares, stock, stocks, cogs, rate, rates, expenses, accounting, price, per, equity, equities, item, items, total, net, gross, consolidated, time, securities, upon, relate, related, relates, due, ii, iii, iv, v, vi, vii, viii, ix, x

Covid-related Word covid, corona, coronavirus, pandemic, virus, vaccine, lockdown, quarantine, disease, infection, cdc, crisis, crises

Note: Before conducting text cleaning procedures, I transform all the letters into lowercase, hence all the words in this table are case insensitive. 


Appendix B. Variable Definition 

Table III 

Variable definitions 

Cash: Cash and cash equivalents divided by total assets, che/at. Source: COMPUSTAT.

D Covid: A dummy variable which is assigned to 1 if the filing contains at least one of the pre-specified Covid-related words, otherwise it is set to 0. Source: EDGAR 10-K filings.

D Return: A dummy variable which is assigned to 1 if the accumulated return of a firm is above the cross-sectional median value, otherwise it is set to 0. Source: CRSP.

Leverage: Leverage ratio, measured as interest-bearing liabilities divided by total assets, (dltt + dlc)/at. Source: COMPUSTAT.

Ln(Market Cap): Natural log of the market value of common shares, log(prcc_f * csho). Source: COMPUSTAT.

Ln(Sale): Natural log of total sales, log(sale). Source: COMPUSTAT.

Ln(Total Assets): Natural log of total assets, log(at). Source: COMPUSTAT.

Sentiment: A sentiment score calculated as the difference between the number of positive words and negative words divided by the sum of these two types of words in one filing, (positive − negative)/(positive + negative). Source: EDGAR 10-K filings.

Tangibility: Property, plant, and equipment divided by total assets, ppent/at. Source: COMPUSTAT.

Tobin's q: The market value of a firm divided by the replacement value of the firm's assets, (prcc_f * csho + pstk + dltt + dlc)/at. Source: COMPUSTAT.

Note: I mainly follow Azimi and Agrawal (2021)'s definitions to build the COMPUSTAT variables used in the analyses. The rest of the variables are calculated based on my own knowledge.


Appendix C. R Code 

A. Initialization 

rm(list = ls())
graphics.off()

# set working directory to the location of this script
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# load packages
library(tidyverse)
library(ggplot2)
library(tm)
library(wordcloud)
library(tidytext)
library(readr)
library(caret)
library(stringr)
library(edgar)
library(haven)
library(lubridate)
library(DescTools)
library(xtable)
library(psych)
library(reshape)
library(pROC)

B. Text Preprocessing 

### load 10-K files published in quarter 4 of 2021
temp <- list.files(path = "10-X/QTR4", pattern = "*10-K_edgar*", full.names = TRUE)

# extract cik code from file name
p1 <- "^10-X/QTR4/[0-9]+_10-K_[a-z]+_[a-z]+_*"
p2 <- "*_[0-9]+-[0-9]+-[0-9]+.txt$"
temp1 <- gsub(p1, "", temp)
temp2 <- gsub(p2, "", temp1)

# rename files with their cik code and store them in the designated data folder
file.rename(temp, paste0("data/", temp2, ".txt"))
rm(list = ls(pattern = "^[temp*|p*]"))

# generate corpus
mycorpus <- Corpus(DirSource("data"))

### Text Cleaning
# define a new stopword list
stopword_list <- c(stopwords("en"), "company", "companies", "year", "years",
                   "million", "millions", "billion", "billions", "trillion", "trillions",
                   "statement", "statements", "report", "reports", "reporting",
                   "note", "notes", "will", "shall", "include", "including", "includes",
                   "included", "date", "dates", "result", "results", "table", "tables",
                   "period", "periods", "amount", "amounts", "management", "also",
                   "following", "quarter", "quarters", "annual", "annually", "quarterly",
                   "financial", "finance", "asset", "assets", "cash", "cashes", "income",
                   "sales", "revenue", "tax", "taxes", "share", "shares", "stock", "stocks",
                   "cogs", "rate", "rates", "expenses", "accounting", "price", "per",
                   "equity", "equities", "item", "items", "total", "net", "gross",
                   "consolidated", "time", "securities", "upon", "relate", "related",
                   "relates", "due", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x")

# define a covid-related word list
covid_word <- c("covid", "corona", "coronavirus", "pandemic", "virus", "vaccine",
                "lockdown", "quarantine", "disease", "infection", "cdc", "crisis", "crises")

# define a function to remove a certain pattern
addspace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# unify the covid-related words by mapping them onto the token "covid"
covidtransformer <- content_transformer(function(x, pattern) gsub(pattern, " covid ", x))

# define the text_cleaning function
text_cleaning <- function(corpus) {
  library(tm)
  # change the whole text to lowercase
  temp_corpus <- tm_map(corpus, tolower)
  # keep Item 7 (MD&A) and Item 7A (Quantitative and Qualitative Disclosures About Market Risk)
  temp_corpus <- tm_map(temp_corpus, addspace, "^.*item\\s7\\.")
  temp_corpus <- tm_map(temp_corpus, addspace, "item\\s8\\..*$")
  # combine Covid words together
  temp_corpus <- tm_map(temp_corpus, covidtransformer, paste0(covid_word, collapse = "|"))
  # remove numbers
  temp_corpus <- tm_map(temp_corpus, removeNumbers)
  # remove stopwords
  temp_corpus <- tm_map(temp_corpus, removeWords, stopword_list)
  # remove month names
  temp_corpus <- tm_map(temp_corpus, removeWords, tolower(c(month.name, month.abb)))
  # remove punctuation, keeping intra-word contractions and intra-word dashes
  temp_corpus <- tm_map(temp_corpus, removePunctuation,
                        preserve_intra_word_contractions = TRUE,
                        preserve_intra_word_dashes = TRUE)
  # remove single-letter words
  temp_corpus <- tm_map(temp_corpus, addspace, "*\\b[[:alpha:]]{1}\\b*")
  # remove newline signs
  temp_corpus <- tm_map(temp_corpus, addspace, "\n")
  # strip white space
  temp_corpus <- tm_map(temp_corpus, stripWhitespace)
  return(temp_corpus)
}


# clean the corpus
mycorpus1 <- text_cleaning(mycorpus)

### Wordcloud
# generate word cloud based on term frequency
dtm <- DocumentTermMatrix(mycorpus1)
freq_words <- colSums(as.matrix(dtm))
freq_words <- sort(freq_words, decreasing = TRUE)
wf <- data.frame(word = names(freq_words), frequency = freq_words, row.names = NULL)
wordcloud(wf$word, wf$frequency,
          min.freq = 5000,                  # keep words with frequency over 5,000
          scale = c(4, 1), rot.per = 0,     # do not rotate words
          random.order = FALSE,             # plot words in decreasing frequency
          colors = brewer.pal(1, "Paired"))

# generate word cloud based on TF-IDF
dtm_tfidf <- DocumentTermMatrix(mycorpus1, control = list(weighting = weightTfIdf))
data <- as.matrix(dtm_tfidf)
freq_tfidf <- colMeans(data)
freq_tfidf <- sort(freq_tfidf, decreasing = TRUE)
wf_tfidf <- data.frame(word = names(freq_tfidf), frequency = freq_tfidf, row.names = NULL)
wordcloud(wf_tfidf$word, wf_tfidf$frequency,
          max.words = 30,                   # show at most 30 words
          scale = c(4, 1), rot.per = 0,     # do not rotate words
          random.order = FALSE,             # plot words in decreasing frequency
          colors = brewer.pal(1, "Paired"))

### Tokenization
# convert the corpus to a tidytext-friendly data frame
mytext <- data.frame(cik = gsub(".txt", "", names(mycorpus1)),
                     text = get("content", mycorpus1),
                     stringsAsFactors = FALSE,
                     row.names = NULL)

# tokenize into words
tokentext <- mytext %>% unnest_tokens(word, text)

# generate covid dummy
# if "covid" exists in a firm's filing, assign 1, otherwise assign 0
d_covid <- tokentext %>%
  group_by(cik) %>%
  mutate(covid = ifelse("covid" %in% word, 1, 0)) %>%
  select(cik, covid) %>%
  unique()


### Sentiment Analysis
sent <- tokentext %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(cik) %>%
  count(word, sentiment, sort = TRUE) %>%
  spread(sentiment, n, fill = 0) %>%
  summarize(positive = sum(positive), negative = sum(negative)) %>%
  mutate(sentiment = (positive - negative) / (positive + negative))

# extract cik code list as firm identifier
cik_list <- unique(as.numeric(sent$cik))
# write.table(cik_list, "cik_list.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE)

# merge sentiment data and covid data
sent_covid <- merge(sent, d_covid, by = "cik")

C. Stock Features Data Preprocessing 

### Create firm identifier (permno-gvkey-cik) link table
# load link table downloaded from WRDS
link <- read.csv("link.csv", header = TRUE, stringsAsFactors = FALSE)
link <- link %>% select(c(gvkey, LPERMNO, cik)) %>% filter(!is.na(cik)) %>% unique()
colnames(link)[2] <- "permno"

# extract permno codes to be used in CRSP
permno_list <- unique(link$permno)
# write.table(permno_list, file = "permno_list.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE)

# extract gvkey codes to be used in COMPUSTAT
gvkey_list <- unique(link$gvkey)
# write.table(gvkey_list, file = "gvkey_list.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE)

### Create Return Variable
# Source: CRSP
crsp <- read.csv("crsp.csv", header = TRUE, stringsAsFactors = FALSE)
crsp$date <- ymd(crsp$date)

# generate return dummy
# if the accumulated return of a firm >= median, assign 1, otherwise 0
crsp <- crsp %>% group_by(PERMNO) %>%
  filter(date >= ymd("20200310")) %>%   # on Mar 11, 2020, WHO characterized COVID-19 as a pandemic
  slice(c(1, n())) %>%
  mutate(return = PRC / lag(PRC) - 1) %>%
  ungroup() %>%
  filter(!is.na(return)) %>%
  mutate(d_return = ifelse(return >= median(return), 1, 0)) %>%
  select(PERMNO, d_return)
colnames(crsp)[1] <- "permno"

### Create Control Variables
# Source: COMPUSTAT
comp <- read.csv("compustat.csv", header = TRUE, stringsAsFactors = FALSE)
comp$date <- ymd(comp$datadate)

# generate control variables
comp <- comp %>% filter(year == 2021) %>%
  mutate(cash = che / at,                                     # cash and cash equivalents / total assets
         lev = (dltt + dlc) / at,                             # leverage ratio
         logsale = log(sale),                                 # ln(sale)
         mkt_cap = log(prcc_f * csho),                        # ln(market cap)
         tang = ppent / at,                                   # tangibility
         tobin_q = (prcc_f * csho + pstk + dltt + dlc) / at,  # Tobin's q
         logat = log(at)) %>%                                 # ln(total assets)
  select(c(gvkey, date, cash, lev, logsale, mkt_cap, tang, tobin_q, logat))

D. Sample Dataset Creation 

# merge text data and stock features
sampleset <- merge(link, sent_covid, by = "cik")
sampleset <- merge(sampleset, crsp, by = "permno")
sampleset <- merge(sampleset, comp, by = "gvkey")
sampleset <- sampleset %>% select(-c(cik, permno, gvkey, date, positive, negative))

# winsorize all non-dummy variables
sampleset_win <- sampleset %>% select(-c(d_return, covid)) %>%
  Winsorize(probs = c(0.05, 0.95), na.rm = TRUE)
sampleset <- cbind(sampleset$d_return, sampleset$covid, sampleset_win)
rm(sampleset_win)

# replace missing values with the median of each variable
sampleset <- sampleset %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)) %>%
  dplyr::rename(return = `sampleset$d_return`, covid = `sampleset$covid`)

# summary statistics of all variables
sample_sum <- sampleset %>% describe() %>%
  select(n, mean, sd, median, min, max)

E. Models 


sampleset$return <- as.factor(sampleset$return)
sampleset$covid <- as.factor(sampleset$covid)

### split the data for cross-validation
set.seed(123)
index <- createDataPartition(sampleset$return,  # the target variable
                             p = 0.7,           # share of data in the training set
                             list = FALSE)      # return a matrix rather than a list

# generate train and test datasets
train_sample <- sampleset[index, ]
test_sample <- sampleset[-index, ]

### model setup
control <- trainControl(method = "repeatedcv",              # repeated cross-validation
                        number = 4,                         # 4-fold cross-validation
                        repeats = 5,                        # repeated 5 times
                        classProbs = TRUE,                  # return class probabilities
                        summaryFunction = twoClassSummary,
                        savePredictions = TRUE)             # needed for plotting ROC

levels(train_sample$return) <- make.names(levels(factor(train_sample$return)))

### logistic regression
logit <- train(return ~ ., data = train_sample,
               method = "glm", family = binomial(link = "logit"),
               metric = "ROC", trControl = control)

### CART
cart <- train(return ~ ., data = train_sample, method = "rpart",
              metric = "ROC", trControl = control)

### random forest
rf <- train(return ~ ., data = train_sample, method = "rf",
            metric = "ROC", trControl = control)

### collect and summarize the resampling results
results <- resamples(list(logit = logit, cart = cart, random_forest = rf))
bwplot(results)

### variable importance
varimp_list <- list(varImp(logit)$importance, varImp(cart)$importance, varImp(rf)$importance)
varimp_list <- lapply(varimp_list, function(x) data.frame(x, rn = row.names(x)))

varimp <- data.frame(rn = row.names(varImp(logit)$importance))
for (i in varimp_list) {
  varimp <- varimp %>% merge(i, by = "rn", all = TRUE)
  rm(i)
}
names(varimp) <- make.names(names(varimp), unique = TRUE)
varimp <- varimp %>%
  dplyr::rename(Variable = rn, logit = Overall.x, cart = Overall.y, rf = Overall) %>%
  arrange(-logit, -cart, -rf)

# change data frame from wide to long format
varimp <- melt(varimp)
colnames(varimp) <- c("Feature", "Model", "Value")

# draw heatmap
varimp %>% ggplot(aes(x = Model, y = Feature, fill = Value)) + geom_tile()

### ROC
pred_logit <- predict(logit, test_sample, type = "prob")
pred_cart <- predict(cart, test_sample, type = "prob")
pred_rf <- predict(rf, test_sample, type = "prob")

roc(test_sample$return, pred_logit$X0, plot = TRUE, percent = TRUE,
    xlab = "True Negative Percentage", ylab = "True Positive Percentage",
    col = "red", print.auc = TRUE)
plot.roc(test_sample$return, pred_cart$X0, percent = TRUE, print.auc = TRUE,
         col = "blue", add = TRUE, print.auc.y = 45)
plot.roc(test_sample$return, pred_rf$X0, percent = TRUE, print.auc = TRUE,
         col = "green", add = TRUE, print.auc.y = 40)
legend("bottomright", legend = c("Logit Model", "CART", "Random Forest"),
       col = c("red", "blue", "green"), lwd = 2)

