An empirical analysis of the determinants of firms’ EDGAR search volume: understanding where the investment crowd moves capital.

Group Assignment Report for the Financial Data Analytics Course 

Abstract 

I use linear regression and machine learning models to explore the determinants of firms’ EDGAR search volume (ESV). My empirical results show that machine learning models can capture the nonlinear interactions between ESV and the predictors, thus generating better model performance. In addition, firms’ observable characteristics appear to matter more than newly released information in determining ESV.

Acknowledgments: I would like to thank Professor Mancy Luo for her excellent guidance throughout this course. 

Keywords: EDGAR, Linear Regression, KNN, CART, Random Forest 

I. Introduction 

It has been widely assumed in the field of financial economics that asset prices reflect information to some extent (Fama, 1970) and plenty of empirical research has focused on examining how efficient the market can be. However, it is of equal importance to understand how well the market participants receive the information released by the firms and the “actual use of reported data by investors” (Lev, 1989). 

Inspired by the work of Drake et al. (2015) on this topic, this report uses the latest available EDGAR data together with machine learning models to explore the following research question:

Which features determine the EDGAR search volume (ESV) at the firm level?

The process of acquiring and understanding information can be costly; hence, investors will not engage in it unless the expected return outweighs the cost (Grossman and Stiglitz, 1980). Based on this theory, I develop two hypotheses to tackle my research question.

First, the release of new information which has an impact on asset prices offers investors incentives to acquire that information, as they can use it to adjust their investments accordingly (Kim and Verrecchia, 1997). The types of new information released include, but are not limited to, earnings announcements and analyst coverage. Therefore, my first hypothesis is stated as follows:

Hypothesis 1: The ESV is positively correlated with the release of new information by the firm. 

Second, firms are often assigned different labels according to their observable characteristics (e.g., value/growth, small-cap/large-cap), and plenty of trading strategies in the market are formulated based on these categories. Investors may make their investment decisions based on these classifications. Under such circumstances, they have an incentive to gather relevant information on the particular firm categories that fit their investment mandate. Among the numerous observable characteristics, there are some well-studied candidates such as the book-to-market ratio and market capitalization (Fama and French, 1993). My second hypothesis is stated as follows:

Hypothesis 2: The ESV is correlated with a firm’s observable characteristics.

My empirical analysis is developed based on these two hypotheses. The remainder of this report is organized as follows: Section II provides the data sources and cleaning procedures used in this report, and Section III presents the models and algorithms. The empirical findings and discussion are presented in Section IV. Section V concludes the report. Appendix A gives a detailed description of the variables used in the empirical study, Appendix B demonstrates additional empirical results of the models, and Appendix C presents the R code used in this report.

II. Data 

A. Data Source

I use data from four different databases to construct my sample dataset, namely the EDGAR Log File Data (EDGAR hereinafter), CRSP, COMPUSTAT, and I/B/E/S. The original EDGAR data is published on the SEC website1, which “assembles information on internet search traffic for EDGAR filings through SEC.gov generally covering the period February 14, 2003 through June 30, 2017”. In my case, I use another version of the EDGAR data (Loughran and McDonald, 2017) from the Software Repository for Accounting and Finance2. Apart from the EDGAR data, I also use the CRSP, COMPUSTAT, and I/B/E/S databases3 to extract data to build up features for firms’ information releases and characteristics. A detailed description of the data sources can be found in Appendix A.

B. Data Preprocessing 

In general, my data preprocessing consists of sample selection, dataset combination, missing-value imputation, and variable winsorization and standardization. First, due to computational constraints, I only select the EDGAR data for S&P 500 firms in 2015, which accounts for almost 35% of all search volume in that year. The number of firms (identified by the CIK code4) and the search volume statistics in my sample can be found in Table I.

1See: https://www.sec.gov/dera/data/edgar-log-file-data-set.html 

2See: https://sraf.nd.edu/data/edgar-server-log/ 

3All these databases are accessed through the Wharton Research Data Services (WRDS).

4The Central Index Key (CIK) is used on the SEC’s computer systems to identify corporations and individual people who have filed disclosures with the SEC.


Table I 

Sample Coverage 

                                      No. of CIKs   Total user requests   Mean requests per CIK
Total EDGAR CIKs                          264,942            32,371,327                     122
EDGAR CIKs with coverage in S&P 500           465            11,229,423                  24,149

Note: This table presents the sample coverage used in my analyses. The CIKs used in my sample are selected based on two criteria: first, the firm is included in the S&P 500 Index before Jan 1, 2015, and second, it is not removed from the S&P 500 Index before Jan 1, 2016. The sample period ranges from January 1, 2015 to December 31, 2015, excluding weekends and holidays.

Second, since the variables used in my analysis involve four different databases, it is essential to merge them into a single dataset for further analysis. However, the codes used to identify firms vary among these datasets. Specifically, EDGAR uses the CIK code, CRSP uses the PERMNO code5, COMPUSTAT uses the GVKEY code6, and I/B/E/S uses firm tickers. I create a linking table containing all these different types of identifiers to match them together, and use it to create my full sample dataset and construct all the variables used in the empirical analysis.

Third, I check the missing values in each variable. There are two types of missing values in my sample data: variables created from CRSP (Abnormal Return, Log(MVE), and Turnover) form one type, and those from COMPUSTAT (BTM and Leverage) the other. The missing values in the first type arise because I implement lag computation to obtain these variables, so the first observation for each firm is missing. I fill these missing values with the firm-wise median plus a random noise7. The missing values in the second type are due to a lack of information in the COMPUSTAT database, and I replace them with the time-wise median plus a random noise.
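As a minimal sketch of this imputation step (the data frame and column names `df`, `permno`, and `x` are illustrative, not the actual sample object; I read footnote 7’s N(0, 0.1) as a standard deviation of 0.1), the firm-wise median-plus-noise fill can be written as:

```r
library(dplyr)

set.seed(1)  # make the random noise reproducible

# Toy data: firm 2's first observation of x is missing
df <- data.frame(permno = c(1, 1, 2, 2),
                 x      = c(0.10, 0.30, NA, 0.50))

# Replace each missing value with the firm-wise median plus random noise
df <- df %>%
  group_by(permno) %>%
  mutate(x = ifelse(is.na(x),
                    median(x, na.rm = TRUE) + rnorm(1, mean = 0, sd = 0.1),
                    x)) %>%
  ungroup()
```

The time-wise variant used for the COMPUSTAT variables is analogous, grouping by date instead of by firm.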

Lastly, I winsorize all the variables other than the dummies (D Analyst and D Earning) at the 5% level at both tails of the distributions to curb the impact of outliers. Then I standardize the non-dummy variables to remove the varying magnitudes of these variables.
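The report uses DescTools::Winsorize for this step (see Appendix C); an equivalent base-R sketch of the 5% two-tailed winsorization followed by standardization, on an illustrative vector `x`, is:

```r
# Clamp a numeric vector at its 5th and 95th percentiles
winsorize5 <- function(x) {
  q <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

x     <- c(-100, 1, 2, 3, 4, 5, 6, 7, 8, 1000)  # two extreme outliers
x_w   <- winsorize5(x)            # outliers pulled in to the percentile bounds
x_std <- as.numeric(scale(x_w))   # standardize: zero mean, unit variance
```

The dummies (D Analyst and D Earning) are left untouched, as described in the text.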

5 PERMNO is a unique permanent security identification number assigned by CRSP to each security. Strictly speaking, PERMNO is not a firm identifier as one firm may issue several securities. For simplicity, I use it to distinguish firms in my analysis. 

6The Global Company Key or GVKEY is a unique six-digit number key assigned to each company (issue, currency, index) in the Capital IQ COMPUSTAT database. 

7The random noise is picked from the N (0, 0.1) distribution. 


C. Sample Description 

I follow Drake et al. (2015) in assigning all the disclosure types to nine groups and calculate their summary statistics. As shown in Table II, disclosures concerning firms’ accounting information (10-K/Q) are among the most searched types, accounting for more than half of all the requests made by users. This finding is in line with Drake et al. (2015) and highlights the importance of periodic reports to investors.

Table II 

Daily EDGAR requests summary 

Form     Mean    Median  SD     Min    Max     Total       % of total
10-K     18,472  18,244  4,554  7,210  29,200  4,655,060   41
10-Q     6,652   6,586   1,779  2,709  12,613  1,676,265   15
8-K      7,137   6,756   1,631  3,170  16,522  1,798,421   16
DEF      3,099   2,906   751    1,201  5,336   780,833     7
424      2,064   1,855   2,785  966    45,738  520,125     5
4        1,613   1,512   640    776    8,010   406,581     4
S        1,109   1,102   147    525    1,493   279,554     2
SC       841     722     384    312    2,839   212,035     2
Other    3,574   3,468   705    1,722  9,470   900,549     8
Total                                          11,229,423  100

Note: This table presents the summary statistics on daily EDGAR requests for the different filing form groups for S&P 500 firms. The filing forms are aggregated into 9 groups: Form 10-K, Form 10-Q, Form 8-K, Form DEF, Form 424, Form 4, Form S, Form SC, and Other. The sample period ranges from January 1, 2015 to December 31, 2015, excluding weekends and holidays.

Table III presents the summary statistics of the variables used in my analyses before standardization. I construct three target variables to depict the EDGAR search volume. Specifically, the variable ESV stands for the total search volume per firm and day. I treat 10-K and 10-Q filings as accounting information sources and build the ESV Acct variable to denote the search volume for accounting information. The remaining disclosure types are classified as “other”, and their volume is captured by the variable ESV Other. After this further regrouping process, the total number of observations in my sample dataset shrinks to 117,141. Detailed variable definitions can be found in Appendix A.


Table III 

Descriptive statistics 

Variable          No.      Mean   SD     Median  Min    Max
ESV               117,141  60.99  33.46  55.00   1.00   109.00
ESV Acct          117,141  40.74  32.67  30.00   0.00   109.00
ESV Other         117,141  31.54  28.33  22.00   0.00   109.00
D Analyst         117,141  0.11   0.31   0.00    0.00   1.00
D Earning         117,141  0.05   0.22   0.00    0.00   1.00
Abnormal Return   117,141  0.00   0.01   -0.00   -0.00  0.73
Turnover          117,141  0.01   0.01   0.01    0.00   0.46
Log(MVE)          117,141  16.88  0.98   16.72   14.04  20.47
BTM               117,141  0.40   0.31   0.32    -0.00  2.92
Leverage          117,141  0.37   0.14   0.36    0.01   1.06

Note: This table presents the descriptive statistics for the variables used in my analyses. The data comprise all EDGAR search volume (ESV) requests made on the SEC’s EDGAR servers by all individual users and for all SEC filings of S&P 500 firms during the sample period from January 1, 2015 to December 31, 2015, excluding weekends and holidays. All variables other than the two indicator features (D Analyst and D Earning) are winsorized at the 5% level at both tails of the distribution.

III. Empirical Models and Algorithms 

In my empirical research, I choose the linear regression model as my benchmark, as it provides a straightforward way to explore the relationships between the target and the predictors.

However, linear regression has been heavily criticized for its oversimplified assumptions. In particular, its linearity assumption is often unrealistic, and its failure to capture more complex nonlinear interactions among variables encourages us to seek more advanced machine learning models. Therefore, I select several machine learning models covered in class to overcome the aforementioned shortcomings of the linear regression model.

A. Benchmark Model 

My benchmark model follows Drake et al. (2015), which explores the linear relationship between firms’ EDGAR search volume and both the release of new information and the characteristics of firms. The full model is specified as follows:

ESV = α + β1 D Analyst + β2 D Earning + β3 Abnormal Return + β4 Turnover + β5 Log(MVE) + β6 BTM + β7 Leverage + ε ,

where the target variable ESV in the above equation includes three measures, i.e., the total EDGAR search volume (ESV ), the EDGAR search volume of accounting information (ESV Acct), and the EDGAR search volume of non-accounting information (ESV Other). The definition of the predictors can be found in Appendix A. 
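Under these definitions, the benchmark can be estimated with a single lm() call. The sketch below assumes the preprocessed data frame `edgar_sample` and the column names built in Appendix C (`ESV_acct`, `D_analyst`, `abret_l1`, etc.); it is illustrative rather than output reported here:

```r
# Benchmark OLS regression, here with ESV_acct as the target
fit_lm <- lm(ESV_acct ~ D_analyst + D_earning + abret_l1 + turnover_l1 +
               log_mve_l1 + btm_l1 + lev_l1,
             data = edgar_sample)
summary(fit_lm)  # coefficient estimates and in-sample R-squared
```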

B. Machine Learning Models 

I select three machine learning models for my analyses: K-Nearest Neighbors (KNN), classification and regression trees (CART), and random forest. None of these models pre-specifies the underlying data distribution or the model structure, which enables the algorithms to detect more complicated interactions in the sample data.

C. Model Implementation in R 

The implementation of these models is straightforward in R. Throughout my analysis, I use the R package caret8 to implement my models.

In particular, I utilize the (K-fold, N-time) repeated cross-validation method to assess my models’ performance on the validation dataset. This is configured through the number and repeats arguments of the trainControl function.

The primary function to implement models is the train function in caret, and a key attribute of this function is method, which defines the algorithm used in the training process. For linear regression, KNN, CART, and random forest, I assign lm, kknn, rpart, and parRF9 to the method attribute, respectively.
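Putting these pieces together, the caret workflow can be sketched as follows. The fold and repeat counts and the `train_data` object are illustrative assumptions; the report does not state the exact K and N used:

```r
library(caret)

# K-fold, N-time repeated cross-validation (5 folds, 2 repeats as an example)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# One train() call per algorithm; only the method attribute changes
fit_lm   <- train(ESV_acct ~ ., data = train_data, method = "lm",    trControl = ctrl)
fit_knn  <- train(ESV_acct ~ ., data = train_data, method = "kknn",  trControl = ctrl)
fit_cart <- train(ESV_acct ~ ., data = train_data, method = "rpart", trControl = ctrl)
fit_rf   <- train(ESV_acct ~ ., data = train_data, method = "parRF", trControl = ctrl)

# Compare resampled performance (RMSE, R-squared, MAE) across models
summary(resamples(list(LM = fit_lm, KNN = fit_knn, CART = fit_cart, RF = fit_rf)))
```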

IV. Model Comparison and Evaluation

The empirical models used in my analyses show consistent results in terms of performance metrics and variable importance rankings. In this section, I demonstrate the empirical results for

8See: https://topepo.github.io/caret/ 

9I choose parRF instead of the commonly used rf method to implement the parallel computation of random forest algorithm so as to speed up my training process. 


models using ESV Acct as target variable, and the results for the other two target variables are presented in Appendix B. 

A. Model Performance 

My evaluation of model performance is based on three metrics: the R2, the mean absolute error (MAE), and the root mean square error (RMSE). R2 represents the proportion of the variance of a target variable that is explained by the predictors in a model, while MAE and RMSE measure the differences between the values predicted by a model and the values observed. In general, a higher R2 value, together with a lower MAE or RMSE, is a positive indicator of a model’s performance.
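All three metrics can be computed directly from a vector of observed values and a vector of predictions; a self-contained sketch follows (note that caret reports R2 as a squared correlation by default, which can differ slightly from the variance-ratio definition used here):

```r
# Performance metrics for observed values y and predictions yhat
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
mae  <- function(y, yhat) mean(abs(y - yhat))
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

y    <- c(1, 2, 3, 4)
yhat <- c(1.1, 1.9, 3.2, 3.8)
r2(y, yhat)    # 0.98
mae(y, yhat)   # 0.15
rmse(y, yhat)  # ~0.158
```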

Figure 1. The Model Performance Comparison for Target ESV Acct 

Figure 1 illustrates the performance metrics for the models with ESV Acct as the target. In terms of R2, the simpler models, linear regression and CART, generate less impressive values, neither exceeding 40%. The KNN model performs better with an R2 of around 50%, while random forest evidently outperforms its competitors with a value of more than 60%. The comparisons of the MAE and RMSE metrics show a similar pattern: linear regression performs worst and random forest ranks first.

There are two possible explanations for the performance comparison results. First, the underperformance of the linear regression model indicates that its linearity assumption is too simple to capture the true interactions among the variables, hence it cannot generate a satisfactory output. Second, the dominance of random forest over the other models results from its ability to exploit more complex model structures through machine learning techniques such as bagging of decision trees.

B. Which Features Matter?

To further assess the validity of the two hypotheses, I calculate the importance ranking for all the predictors based on the built-in function10 in the R package caret.
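With caret, this ranking comes from varImp() applied to each fitted model; a short sketch (the fitted object `fit_rf` is assumed to come from a prior train() call):

```r
# Scaled variable importance (0-100) for a fitted caret model
imp <- varImp(fit_rf, scale = TRUE)
print(imp)  # importance table
plot(imp)   # dotplot of the importance ranking
```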

Figure 2. The Variable Importance Ranking 

As shown in Figure 2, across all the models, the natural logarithm of market capitalization (log mve) is the most important feature in predicting EDGAR search volume. This finding is in line with Drake et al. (2015), who also find a positive association between ESV and firm size. This result is expected, as investors tend to focus more on the large-cap firms in the market, hence the demand for disclosures from these firms is high.

Apart from firm size, the other variables’ importance scores vary across models. For example, a firm’s turnover (turnover), leverage ratio (lev), and book-to-market ratio (btm) show some importance in the random forest, but these variables do not perform equally well in the other models.

10The calculation of the variable importance score is beyond the scope of this project.

However, one common pattern exists across all the models: the firm’s observable characteristics clearly outperform the variables that proxy for new information releases in terms of importance score, which goes against my first hypothesis and the findings in Drake et al. (2015). One possible explanation in my case is that my sample only contains S&P 500 firms, which are arguably the most closely followed firms in the market. For these firms, investors keep a close eye on disclosures regardless of type.

V. Conclusion 

In this project, I use a sample of S&P 500 firms in 2015 and implement linear regression and multiple machine learning models to explore the determinants of EDGAR search volume (ESV) at the firm level.

My contributions are twofold. First, the empirical results show that machine learning models clearly outperform the linear regression model, because they can capture the nonlinear structures among the variables and generate better predictions. Second, the firm’s observable characteristics achieve higher importance scores than the new-information-release variables in all the models used in this empirical analysis. This might relate to the fact that investors always maintain a high interest in S&P 500 firms.

Due to computational constraints, my empirical analysis is limited to a small sample size and shallow model structures. With greater computational power in the future, this research could be expanded to truly big data and more complex model structures, with more variables and state-of-the-art models.

VI. References 

Drake, M. S., Roulstone, D. T., and Thornock, J. R. (2015). The determinants and consequences of information acquisition via EDGAR. Contemporary Accounting Research, 32(3):1128–1161.

Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. Journal of Finance, 25(2):383–417. 


Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3–56. 

Grossman, S. J. and Stiglitz, J. E. (1980). On the impossibility of informationally efficient markets. American Economic Review, 70(3):393–408. 

Kim, O. and Verrecchia, R. E. (1997). Pre-announcement and event-period private information. Journal of Accounting and Economics, 24(3):395–419. 

Lev, B. (1989). On the usefulness of earnings and earnings research: Lessons and directions from two decades of empirical research. Journal of Accounting Research, 27:153–192. 

Loughran, T. and McDonald, B. (2017). The use of EDGAR filings by investors. Journal of Behavioral Finance, 18(2):231–248.


Appendix A. Variable Definition 

Table IV

Variable definitions

Abnormal Return i,t−1: The buy-and-hold return for firm i on day t−1 less the buy-and-hold S&P 500 portfolio return on day t−1. (Frequency: Daily. Source: CRSP.)

BTM i,q−1: The ratio of book value of common equity to market capitalization (CEQQ/[PRCCQ × CSHOQ]) for firm i, measured as of the fiscal quarter end q−1. (Frequency: Quarterly. Source: COMPUSTAT.)

D Analyst i,t: An indicator feature set equal to 1 from date t to t+5 when there is analyst coverage of firm i, and to 0 otherwise. (Frequency: Daily. Source: I/B/E/S.)

D Earning i,t: An indicator feature set equal to 1 from firm i’s quarterly earnings announcement date t to t+5, and to 0 otherwise. (Frequency: Daily. Source: I/B/E/S.)

ESV i,t: The number of EDGAR requests for all filing types for firm i on day t. (Frequency: Daily. Source: SEC.)

ESV Acct i,t: The number of EDGAR requests for periodic accounting reports (10-K and 10-Q) for firm i on day t. (Frequency: Daily. Source: SEC.)

ESV Other i,t: The number of EDGAR requests for all filings other than periodic accounting reports (10-K and 10-Q) for firm i on day t. (Frequency: Daily. Source: SEC.)

Leverage i,q−1: The ratio of long-term debt to total assets (LLTQ/ATQ), measured for firm i as of the fiscal quarter end q−1. (Frequency: Quarterly. Source: COMPUSTAT.)

Log(MVE) i,t−1: The natural log of firm i’s market value of equity, measured as the share price times shares outstanding (Log(PRC × SHROUT)) on day t−1. (Frequency: Daily. Source: CRSP.)

Turnover i,t−1: The ratio of firm i’s trading volume on day t−1 to its total shares outstanding (VOL/SHROUT). (Frequency: Daily. Source: CRSP.)

Note: I mainly follow Drake et al. (2015)’s definitions to build up all the variables used in the analyses.

Appendix B. Additional Empirical Results 

Figure 3. The Model Performance Comparison for Target ESV 

Figure 4. The Model Performance Comparison for Target ESV Other 


Figure 5. The Variable Importance Ranking for Target ESV 

Figure 6. The Variable Importance Ranking for Target ESV Other 


Appendix C. R Code 

A. Initialization 

rm(list = ls())
graphics.off()

# Set working directory
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# Load libraries
library(tidyverse)
library(ggplot2)
library(lubridate)
library(stringr)
library(timeDate)
library(haven)
library(xtable)
library(formattable)
library(caret)
library(psych)
library(DescTools)
library(reshape)

B. Load Raw Data 

# Load and append EDGAR log dataset in 2015
files <- list.files(path = "EDGAR/2015", pattern = "*.csv", full.names = TRUE)
edgar_full <- read.csv(files[1], header = TRUE, stringsAsFactors = FALSE,
                       colClasses = c(rep("integer", 2), "NULL", "integer",
                                      rep("NULL", 4), "character", "integer"))
for (f in files[-1]) {
  df <- read.csv(f, header = TRUE, stringsAsFactors = FALSE,
                 colClasses = c(rep("integer", 2), "NULL", "integer",
                                rep("NULL", 4), "character", "integer"))
  edgar_full <- rbind(edgar_full, df)
}
rm(df, files, f)

# Transform date and filing_date variables to ymd format
edgar_full$date <- ymd(edgar_full$date)
edgar_full$filing_date <- ymd(edgar_full$filing_date)

# Change cik code to 10-digit standard format
edgar_full$cik <- str_pad(edgar_full$cik, 10, pad = "0")

# Save cik codes in order to match with CRSP/Compustat datasets
cik_list <- unique(edgar_full$cik)

# Load linking table retrieved from WRDS
# Source: WRDS Compustat CRSP Link Table
link <- read.csv("link.csv", header = TRUE, stringsAsFactors = FALSE)
link <- subset(link, select = c("cik", "tic", "gvkey", "LPERMNO"))
link$cik <- str_pad(link$cik, 10, pad = "0")
link$gvkey <- str_pad(link$gvkey, 6, pad = "0")
link <- unique(link)

C. Sample Selection 

# Source: WRDS SAS Platform
sp500 <- read_sas("dsp500list.sas7bdat")

# Select firms which are S&P 500 constituents throughout 2015
sp500 <- subset(sp500, (ending >= base::as.Date("2015-12-31")) &
                       (start <= base::as.Date("2015-1-1")))
sp500_list <- unique(sp500$PERMNO)
link <- subset(link, LPERMNO %in% sp500_list)

# Save permno list to be used in CRSP database
permno_list <- unique(link$LPERMNO)

# Save tic list to be used in I/B/E/S database
tic_list <- unique(link$tic)

# Save gvkey list to be used in COMPUSTAT database
gvkey_list <- unique(link$gvkey)
rm(sp500, sp500_list)

# Filter for firms with coverage in CRSP and COMPUSTAT
edgar_sample <- edgar_full %>% filter(cik %in% link$cik)
rm(edgar_full)

# Remove non-trading days from sample dataset
edgar_sample <- edgar_sample[!isWeekend(edgar_sample$date) &
                             !edgar_sample$date %in% base::as.Date(holidayNYSE(2015)), ]

D. Regroup Information Type 

### Combine form types into 9 groups

# Form 10-K
temp_10k <- c("10-K", "10-K/A", "10-K405", "10-K405/A", "10-KT", "10-KT/A",
              "10KSB", "10KSB/A", "10KSB40", "10KSB40/A", "10KT405", "10KT405/A")
edgar_sample$form[edgar_sample$form %in% temp_10k] <- "Form 10-K"

# Form 10-Q
temp_10q <- c("10-Q", "10-Q/A", "10-QT", "10-QT/A", "10QSB", "10QSB/A")
edgar_sample$form[edgar_sample$form %in% temp_10q] <- "Form 10-Q"

# Form 8-K
temp_8k <- c("8-K", "8-K/A", "8-K12B", "8-K12B/A", "8-K12G3", "8-K12G3/A", "8-K15D5")
edgar_sample$form[edgar_sample$form %in% temp_8k] <- "Form 8-K"

# Form 424
temp_424 <- c("424A", "424B1", "424B2", "424B3", "424B4", "424B5", "424B7", "424B8")
edgar_sample$form[edgar_sample$form %in% temp_424] <- "Form 424"

# Form S
temp_s <- c("S-1", "S-1/A", "S-11", "S-11/A", "S-11MEF", "S-1MEF", "S-2", "S-2/A",
            "S-2MEF", "S-3", "S-3/A", "S-3ASR", "S-3D", "S-3D/A", "S-3DPOS",
            "S-3MEF", "S-4", "S-4 POS", "S-4/A", "S-4EF", "S-4EF/A", "S-4MEF",
            "S-8", "S-8 POS", "S-8/A")
edgar_sample$form[edgar_sample$form %in% temp_s] <- "Form S"

# Form SC
temp_sc <- c("SC 13D", "SC 13D/A", "SC 13E1", "SC 13E1/A", "SC 13E3", "SC 13E3/A",
             "SC 13E4", "SC 13E4/A", "SC 13G", "SC 13G/A", "SC 14D1", "SC 14D1/A",
             "SC 14D9", "SC 14D9/A", "SC 14F1", "SC 14F1/A", "SC TO-C", "SC TO-I",
             "SC TO-I/A", "SC TO-T", "SC TO-T/A", "SC13E4F", "SC13E4F/A",
             "SC14D1F", "SC14D1F/A", "SC14D9C", "SC14D9F", "SC14D9F/A")
edgar_sample$form[edgar_sample$form %in% temp_sc] <- "Form SC"

# Form 4
temp_4 <- c("4", "4/A")
edgar_sample$form[edgar_sample$form %in% temp_4] <- "Form 4"

# Form DEF
temp_def <- c("DEF 14A", "DEF 14C", "DEF13E3", "DEF13E3/A", "DEFA14A", "DEFA14C",
              "DEFC14A", "DEFC14C", "DEFM14A", "DEFM14C", "DEFN14A", "DEFR14A",
              "DEFR14C", "DEFS14A", "DEFS14C")
edgar_sample$form[edgar_sample$form %in% temp_def] <- "Form DEF"

# Other
temp_other <- c("Form 10-K", "Form 10-Q", "Form 8-K", "Form 424", "Form S",
                "Form SC", "Form 4", "Form DEF")
edgar_sample$form[!edgar_sample$form %in% temp_other] <- "Other"
rm(list = ls(pattern = "^temp_"))

# Daily EDGAR requests by form group
edgar_form_sum <- edgar_sample %>% select(date, form, nr_total) %>%
  group_by(date, form) %>%
  summarise(form_sum = sum(nr_total))

form_summary <- edgar_form_sum %>% group_by(form) %>%
  summarise(mean = mean(form_sum),
            median = median(form_sum),
            sd = sd(form_sum),
            min = min(form_sum),
            max = max(form_sum),
            total = sum(form_sum)) %>%
  mutate_if(is.numeric, round, 0) %>%
  arrange(-total)

form_summary <- form_summary %>% mutate(pct = total / sum(total) * 100)

# Create LaTeX code for requests summary table
xtable(form_summary, digits = 0)
rm(form_summary, edgar_form_sum)

E. Feature Construction 

### Create EDGAR search volume (ESV) variables
edgar_sample$acct <- ifelse(edgar_sample$form %in% c("Form 10-K", "Form 10-Q"), 1, 0)

# ESV
edgar_sample <- edgar_sample %>% group_by(date, cik) %>% mutate(ESV = sum(nr_total))

# ESV_acct
esv_acct <- edgar_sample %>% filter(acct == 1) %>%
  group_by(date, cik) %>%
  summarise(ESV_acct = sum(nr_total))

edgar_sample <- edgar_sample %>%
  select(-one_of("nr_total", "form", "filing_date", "acct")) %>%
  unique()

edgar_sample <- merge(edgar_sample, esv_acct, by = c("date", "cik"), all.x = TRUE)
rm(esv_acct)
edgar_sample$ESV_acct <- ifelse(is.na(edgar_sample$ESV_acct), 0, edgar_sample$ESV_acct)

# ESV_other
edgar_sample$ESV_other <- edgar_sample$ESV - edgar_sample$ESV_acct

# Merge with link table
edgar_sample <- merge(edgar_sample, link, by = "cik", all.x = TRUE)
edgar_sample <- edgar_sample[, c(2, 6, 1, 7, 8, 3, 4, 5)] %>% arrange(date)
edgar_sample <- edgar_sample %>% rename(permno = LPERMNO)

### CRSP features: lagged abnormal return, lagged log(market cap), and lagged turnover (Source: WRDS)
crsp <- read.csv("crsp.csv", stringsAsFactors = FALSE, header = TRUE)
crsp <- crsp %>% rename(permno = PERMNO, sic = SICCD, prc = PRC,
                        vol = VOL, shrout = SHROUT) %>%
  mutate(log_mve_l1 = log(lag(prc) * lag(shrout)),  # log market cap
         vol = vol / 1000)                          # in thousands
crsp$date <- ymd(crsp$date)

# Stock daily return
crsp <- crsp %>% group_by(permno) %>% mutate(ret = (prc - lag(prc)) / lag(prc))

# Daily abnormal return & turnover
crsp <- crsp %>% mutate(abret_l1 = ret - sprtrn,
                        turnover_l1 = lag(vol) / lag(shrout)) %>%
  select(permno, date, abret_l1, turnover_l1, log_mve_l1) %>%
  filter(date <= base::as.Date("2015-12-31"))

# Merge with main sample dataset
edgar_sample <- merge(edgar_sample, crsp, by = c("permno", "date"), all.x = TRUE)
rm(crsp)

### Analyst coverage dummy (Source: I/B/E/S Recommendations)
ibes <- read.csv("analyst.csv", stringsAsFactors = FALSE, header = TRUE)
ibes$ANNDATS <- ymd(ibes$ANNDATS)

# Generate analyst coverage dummy variable
ibes <- ibes %>% rename(tic = TICKER, date = ANNDATS) %>%
  mutate(D_analyst = 1) %>%
  select(tic, date, D_analyst) %>%
  unique()

# Merge with main sample dataset
edgar_sample <- merge(edgar_sample, ibes, by = c("tic", "date"), all.x = TRUE)

idx <- which(edgar_sample$D_analyst == 1)
for (i in idx) {
  edgar_sample[i + 1, "D_analyst"] <- 1
  edgar_sample[i + 2, "D_analyst"] <- 1
  edgar_sample[i + 3, "D_analyst"] <- 1
  edgar_sample[i + 4, "D_analyst"] <- 1
  edgar_sample[i + 5, "D_analyst"] <- 1
}
rm(ibes, idx, i)

# Fill missing values
# If D_analyst == NA, then no analyst covers this firm
edgar_sample$D_analyst <- ifelse(is.na(edgar_sample$D_analyst), 0,
                                 edgar_sample$D_analyst)

### Compustat features: lagged quarterly book-to-market ratio and lagged quarterly leverage (Source: WRDS)
compustat <- read.csv("compustat.csv", header = TRUE, stringsAsFactors = FALSE)
compustat$datadate <- ymd(compustat$datadate)
compustat <- compustat %>% select(gvkey, datadate, atq, ceqq, cshoq, lltq, prccq) %>%
  mutate(btm_l1 = lag(ceqq) / (lag(prccq) * lag(cshoq)),
         lev_l1 = lag(lltq) / lag(atq)) %>%
  select(gvkey, datadate, btm_l1, lev_l1) %>%
  filter(datadate > "2014-12-31") %>%
  rename(date = datadate)
compustat$gvkey <- str_pad(compustat$gvkey, 6, pad = "0")

# Merge with main sample dataset
edgar_sample <- merge(edgar_sample, compustat, by = c("gvkey", "date"), all = TRUE)

# Backward fill missing values
edgar_sample <- edgar_sample %>% group_by(gvkey) %>%
  fill(c(btm_l1, lev_l1), .direction = "up")
rm(compustat)

### Earning announcement date dummy variable (Source: I/B/E/S Summary statistics)
earning <- read.csv("earning_announcement.csv", header = TRUE,
                    stringsAsFactors = FALSE)
earning$ANNDATS_ACT <- ymd(earning$ANNDATS_ACT)
earning <- earning %>% filter(ANNDATS_ACT > "2014-12-31") %>%
  select(TICKER, ANNDATS_ACT) %>%
  rename(tic = TICKER, date = ANNDATS_ACT) %>%
  unique() %>% arrange(date) %>%
  mutate(D_earning = 1)

# Merge with main sample dataset
edgar_sample <- merge(edgar_sample, earning, by = c("tic", "date"), all.x = TRUE)

# If the date is within the 5-day frame after the announcement date, fill
# missing values with 1, otherwise fill missing values with 0
idx <- which(edgar_sample$D_earning == 1)
for (i in idx) {
  edgar_sample[i + 1, "D_earning"] <- 1
  edgar_sample[i + 2, "D_earning"] <- 1
  edgar_sample[i + 3, "D_earning"] <- 1
  edgar_sample[i + 4, "D_earning"] <- 1
  edgar_sample[i + 5, "D_earning"] <- 1
}
edgar_sample$D_earning <- ifelse(is.na(edgar_sample$D_earning), 0,
                                 edgar_sample$D_earning)
rm(earning, idx, i)

F. Feature Cleaning 

### Note: Financial reporting period end dates in the Compustat dataset may not be trading days,
# so these non-trading days must be removed from the main dataset
edgar_sample <- edgar_sample[!isWeekend(edgar_sample$date) &
                             !edgar_sample$date %in% base::as.Date(holidayNYSE(2015)), ]

### Check observations with missing cik code
edgar_sample[which(is.na(edgar_sample$cik)), ]
# There are only 7 missing cik observations, which also miss multiple other feature values.
# Remove these observations
edgar_sample <- edgar_sample[!is.na(edgar_sample$cik), ]

### Winsorization of numeric features at the 5% level in both tails of the distribution
edgar_sample_fea <- edgar_sample %>% select(ESV:log_mve_l1, btm_l1:lev_l1) %>%
  Winsorize(probs = c(0.05, 0.95), na.rm = TRUE)
edgar_sample <- cbind(edgar_sample$cik, edgar_sample$date, edgar_sample$D_analyst,
                      edgar_sample$D_earning, edgar_sample_fea)
edgar_sample <- edgar_sample %>% rename(cik = `edgar_sample$cik`,
                                        date = `edgar_sample$date`,
                                        D_analyst = `edgar_sample$D_analyst`,
                                        D_earning = `edgar_sample$D_earning`)
rm(edgar_sample_fea)

### Missing values
# For abret_l1, log_mve_l1 and turnover_l1, the values on the first trading day are missing
# due to the lag() computation; fill them with per-firm median values plus a random noise
set.seed(123)
impute_median <- function(x) {
  ind_na <- is.na(x)
  x[ind_na] <- median(x[!ind_na]) + rnorm(sum(ind_na), mean = 0, sd = 0.1)
  as.numeric(x)
}
edgar_sample <- edgar_sample %>% group_by(cik) %>%
  mutate_at(vars(abret_l1, log_mve_l1, turnover_l1), impute_median) %>% ungroup()

# For btm_l1 and lev_l1, the values are missing due to missing data in the Compustat database;
# fill them with per-day median values plus a random noise
edgar_sample <- edgar_sample %>% group_by(date) %>%
  mutate_at(vars(btm_l1, lev_l1), impute_median) %>%
  ungroup()

# Summary statistics of features
fea_sum <- edgar_sample %>% select(-c(cik, date)) %>% describe() %>%
  select(n, mean, sd, median, min, max)
fea_sum

# create LaTeX code for the feature description table
<<results=tex>>=
xtable(fea_sum, digits = 2)
@
rm(fea_sum, link, cik_list, gvkey_list, permno_list, tic_list)

### Data split
# remove cik and date variables
edgar_sample <- edgar_sample %>% select(-c(cik, date))

# convert dummies to factor variables
edgar_sample$D_analyst <- as.factor(edgar_sample$D_analyst)
edgar_sample$D_earning <- as.factor(edgar_sample$D_earning)

# generate sample datasets for the different target variables (ESV / ESV_acct / ESV_other)
edgar_esv <- edgar_sample %>% select(-c(ESV_acct, ESV_other))
edgar_esv <- edgar_esv[, c(3, 1, 2, 4:8)]
edgar_esv_acct <- edgar_sample %>% select(-c(ESV, ESV_other))
edgar_esv_acct <- edgar_esv_acct[, c(3, 1, 2, 4:8)]
edgar_esv_other <- edgar_sample %>% select(-c(ESV_acct, ESV))
edgar_esv_other <- edgar_esv_other[, c(3, 1, 2, 4:8)]
rm(edgar_sample)

# center and scale the datasets
standardize <- function(data) {
  library(caret)
  preprocessvalue <- preProcess(data, method = c("center", "scale"))
  data_imp <- predict(preprocessvalue, data)
  return(data_imp)
}
edgar_esv_imp <- standardize(edgar_esv)
edgar_esv_acct_imp <- standardize(edgar_esv_acct)
edgar_esv_other_imp <- standardize(edgar_esv_other)
rm(edgar_esv, edgar_esv_acct, edgar_esv_other)

G. Models 

# split data
set.seed(123)
index <- createDataPartition(edgar_esv_imp$ESV,  # the target
                             p = 0.7,            # percentage of data in the training set
                             list = FALSE)       # data format
train_esv_imp <- edgar_esv_imp[index, ]
test_esv_imp <- edgar_esv_imp[-index, ]
train_esv_acct_imp <- edgar_esv_acct_imp[index, ]
test_esv_acct_imp <- edgar_esv_acct_imp[-index, ]
train_esv_other_imp <- edgar_esv_other_imp[index, ]
test_esv_other_imp <- edgar_esv_other_imp[-index, ]
rm(edgar_esv_imp, edgar_esv_acct_imp, edgar_esv_other_imp)

# set parameters
control <- trainControl(method = "repeatedcv",  # repeated cross-validation
                        number = 3,             # 3-fold cross-validation
                        repeats = 2)            # repeat 2 times

### Benchmark model: Linear Regression
lm_esv <- train(ESV ~ ., data = train_esv_imp,
                method = "lm", trControl = control)
lm_esv_acct <- train(ESV_acct ~ ., data = train_esv_acct_imp,
                     method = "lm", trControl = control)
lm_esv_other <- train(ESV_other ~ ., data = train_esv_other_imp,
                      method = "lm", trControl = control)

### KNN model
knn_esv <- train(ESV ~ ., data = train_esv_imp,
                 method = "kknn", k = 3, trControl = control)
knn_esv_acct <- train(ESV_acct ~ ., data = train_esv_acct_imp,
                      method = "kknn", k = 3, trControl = control)
knn_esv_other <- train(ESV_other ~ ., data = train_esv_other_imp,
                       method = "kknn", k = 3, trControl = control)

### CART
cart_esv <- train(ESV ~ ., data = train_esv_imp,
                  method = "rpart", trControl = control)
cart_esv_acct <- train(ESV_acct ~ ., data = train_esv_acct_imp,
                       method = "rpart", trControl = control)
cart_esv_other <- train(ESV_other ~ ., data = train_esv_other_imp,
                        method = "rpart", trControl = control)

### Random Forest
rf_esv <- train(ESV ~ ., data = train_esv_imp,
                method = "parRF",  # parallel random forest to speed up training
                importance = TRUE,
                trControl = control, ntree = 500)
rf_esv_acct <- train(ESV_acct ~ ., data = train_esv_acct_imp,
                     method = "parRF",  # parallel random forest to speed up training
                     importance = TRUE,
                     trControl = control, ntree = 500)
rf_esv_other <- train(ESV_other ~ ., data = train_esv_other_imp,
                      method = "parRF",  # parallel random forest to speed up training
                      importance = TRUE,
                      trControl = control, ntree = 500)

H. Model Evaluation 

# ESV
model_esv_list <- list(lm = lm_esv, knn = knn_esv, cart = cart_esv, rf = rf_esv)
res_esv <- resamples(model_esv_list)
bwplot(res_esv, layout = c(3, 1))

# ESV_acct
model_esv_acct_list <- list(lm = lm_esv_acct, knn = knn_esv_acct,
                            cart = cart_esv_acct, rf = rf_esv_acct)
res_esv_acct <- resamples(model_esv_acct_list)
bwplot(res_esv_acct, layout = c(3, 1))

# ESV_other
model_esv_other_list <- list(lm = lm_esv_other, knn = knn_esv_other,
                             cart = cart_esv_other, rf = rf_esv_other)
res_esv_other <- resamples(model_esv_other_list)
bwplot(res_esv_other, layout = c(3, 1))
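The resamples comparison above uses only the cross-validation folds; the 30% hold-out sets created in Section G could additionally be scored, e.g. with caret's postResample. A sketch, assuming the fitted models and test sets are still in scope:

```r
# out-of-sample metrics on the hold-out set (sketch, not part of the reported results)
pred_esv <- predict(rf_esv, newdata = test_esv_imp)
postResample(pred = pred_esv, obs = test_esv_imp$ESV)  # returns RMSE, Rsquared, MAE
```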

I. Variable Importance 

## ESV
# extract variable importance scores from the models
esv_varimp_list <- list(varImp(lm_esv)$importance,
                        varImp(knn_esv)$importance,
                        varImp(cart_esv)$importance,
                        varImp(rf_esv)$importance)
esv_varimp_list <- lapply(esv_varimp_list,
                          function(x) data.frame(x, rn = row.names(x)))
esv_varimp <- data.frame(rn = row.names(varImp(lm_esv)$importance))
for (i in esv_varimp_list) {
  esv_varimp <- esv_varimp %>% merge(i, by = "rn", all = TRUE)
}
rm(i)
names(esv_varimp) <- make.names(names(esv_varimp), unique = TRUE)
esv_varimp <- esv_varimp %>% dplyr::rename(Variable = rn, lm = Overall.x,
                                           knn = Overall.y,
                                           cart = Overall.x.1,
                                           rf = Overall.y.1) %>%
  arrange(-lm, -knn, -cart, -rf)
esv_varimp[is.na(esv_varimp)] <- 0

# combine the factor-level dummy scores with their base rows, then drop the base rows
esv_varimp[esv_varimp$Variable == "D_analyst1", "knn"] <-
  esv_varimp[esv_varimp$Variable == "D_analyst", "knn"] +
  esv_varimp[esv_varimp$Variable == "D_analyst1", "knn"]
esv_varimp[esv_varimp$Variable == "D_earning1", "knn"] <-
  esv_varimp[esv_varimp$Variable == "D_earning1", "knn"] +
  esv_varimp[esv_varimp$Variable == "D_earning", "knn"]
esv_varimp <- esv_varimp[!esv_varimp$Variable %in% c("D_analyst", "D_earning"), ]

# change data frame from wide to long format
esv_varimp <- melt(esv_varimp)
colnames(esv_varimp) <- c("Feature", "Model", "Value")

# draw heatmap
esv_varimp %>% ggplot(aes(x = Model, y = Feature,
                          fill = Value)) +
  geom_tile() +
  scale_fill_gradient2(low = "green", high = "darkgreen", guide = "colorbar")

## ESV_acct

# extract variable importance scores from the models
esv_acct_varimp_list <- list(varImp(lm_esv_acct)$importance,
                             varImp(knn_esv_acct)$importance,
                             varImp(cart_esv_acct)$importance,
                             varImp(rf_esv_acct)$importance)
esv_acct_varimp_list <- lapply(esv_acct_varimp_list,
                               function(x) data.frame(x, rn = row.names(x)))
esv_acct_varimp <- data.frame(rn = row.names(varImp(lm_esv_acct)$importance))
for (i in esv_acct_varimp_list) {
  esv_acct_varimp <- esv_acct_varimp %>% merge(i, by = "rn", all = TRUE)
}
rm(i)
names(esv_acct_varimp) <- make.names(names(esv_acct_varimp), unique = TRUE)
esv_acct_varimp <- esv_acct_varimp %>% dplyr::rename(Variable = rn, lm = Overall.x,
                                                     knn = Overall.y,
                                                     cart = Overall.x.1,
                                                     rf = Overall.y.1) %>%
  arrange(-lm, -knn, -cart, -rf)
esv_acct_varimp[is.na(esv_acct_varimp)] <- 0

# combine the factor-level dummy scores with their base rows, then drop the base rows
esv_acct_varimp[esv_acct_varimp$Variable == "D_analyst1", "knn"] <-
  esv_acct_varimp[esv_acct_varimp$Variable == "D_analyst", "knn"] +
  esv_acct_varimp[esv_acct_varimp$Variable == "D_analyst1", "knn"]
esv_acct_varimp[esv_acct_varimp$Variable == "D_earning1", "knn"] <-
  esv_acct_varimp[esv_acct_varimp$Variable == "D_earning1", "knn"] +
  esv_acct_varimp[esv_acct_varimp$Variable == "D_earning", "knn"]
esv_acct_varimp <- esv_acct_varimp[!esv_acct_varimp$Variable %in% c("D_analyst", "D_earning"), ]

# change data frame from wide to long format
esv_acct_varimp <- melt(esv_acct_varimp)
colnames(esv_acct_varimp) <- c("Feature", "Model", "Value")

# draw heatmap
esv_acct_varimp %>% ggplot(aes(x = Model, y = Feature,
                               fill = Value)) +
  geom_tile() +
  scale_fill_gradient2(low = "green", high = "darkgreen", guide = "colorbar")

## ESV_ other 

# extract variable importance scores from the models
esv_other_varimp_list <- list(varImp(lm_esv_other)$importance,
                              varImp(knn_esv_other)$importance,
                              varImp(cart_esv_other)$importance,
                              varImp(rf_esv_other)$importance)
esv_other_varimp_list <- lapply(esv_other_varimp_list,
                                function(x) data.frame(x, rn = row.names(x)))
esv_other_varimp <- data.frame(rn = row.names(varImp(lm_esv_other)$importance))
for (i in esv_other_varimp_list) {
  esv_other_varimp <- esv_other_varimp %>% merge(i, by = "rn", all = TRUE)
}
rm(i)
names(esv_other_varimp) <- make.names(names(esv_other_varimp), unique = TRUE)
esv_other_varimp <- esv_other_varimp %>% dplyr::rename(Variable = rn, lm = Overall.x,
                                                       knn = Overall.y,
                                                       cart = Overall.x.1,
                                                       rf = Overall.y.1) %>%
  arrange(-lm, -knn, -cart, -rf)
esv_other_varimp[is.na(esv_other_varimp)] <- 0

# combine the factor-level dummy scores with their base rows, then drop the base rows
esv_other_varimp[esv_other_varimp$Variable == "D_analyst1", "knn"] <-
  esv_other_varimp[esv_other_varimp$Variable == "D_analyst", "knn"] +
  esv_other_varimp[esv_other_varimp$Variable == "D_analyst1", "knn"]
esv_other_varimp[esv_other_varimp$Variable == "D_earning1", "knn"] <-
  esv_other_varimp[esv_other_varimp$Variable == "D_earning1", "knn"] +
  esv_other_varimp[esv_other_varimp$Variable == "D_earning", "knn"]
esv_other_varimp <- esv_other_varimp[!esv_other_varimp$Variable %in% c("D_analyst", "D_earning"), ]

# change data frame from wide to long format
esv_other_varimp <- melt(esv_other_varimp)
colnames(esv_other_varimp) <- c("Feature", "Model", "Value")

# draw heatmap
esv_other_varimp %>% ggplot(aes(x = Model, y = Feature,
                                fill = Value)) +
  geom_tile() +
  scale_fill_gradient2(low = "green", high = "darkgreen", guide = "colorbar")
