Accessing Financial Reports and Corporate Events Data using GetDFPData

Marcelo S. Perlin

IAG/PUC-RJ, Rio de Janeiro, May 25, 2018

Who am I?

Teacher and researcher (PPGA/UFRGS, Porto Alegre-RS)

Loves data analysis, programming, and tennis

Homepage and blog: https://msperlin.github.io/

Author of Processing and Analyzing Financial Data with R - Online (free) version here

Introduction

GetDFPData provides an open R interface to all financial statements and corporate events distributed by B3 and CVM

Motivation

Papers by Journal
Journal Surveyed Total Percent
Contabilidade Vista & Revista 19 80 0.24
Revista Brasileira de Finanças 9 85 0.11
Revista Contabilidade & Finanças 34 118 0.29
Revista Contabilidade e Organizações 22 82 0.27
Revista Contemporânea de Contabilidade 21 112 0.19

Using GetDFPData

The Web Interface

A web version was developed and published in the internet as a shiny application.

http://www.msperlin.com/shiny/GetDFPData/

The CL (command line) R interface

Installation of GetDFPData

  1. Install R
  2. Install RStudio
  3. Install package:
# Release version in CRAN
install.packages('GetDFPData')

Using GetDFPData

The starting point of GetDFPData is looking for information about current companies.

library(GetDFPData)
library(dplyr)

df.info <- gdfpd.get.info.companies(type.data = 'companies')
## Found cache file. Loading data..
glimpse(df.info)
## Observations: 694
## Variables: 16
## $ name.company         <chr> "521 PARTICIPAÇOES S.A. - EM LIQUIDAÇÃO E...
## $ id.company           <int> 16330, 16284, 108, 20940, 21725, 19313, 1...
## $ cnpj                 <dbl> 1.547749e+12, 1.851771e+12, 6.066481e+13,...
## $ date.registration    <date> 1997-07-11, 1997-05-30, 1969-05-16, 2007...
## $ date.constitution    <date> 1996-07-30, 1997-04-02, 1944-08-18, 1976...
## $ city                 <chr> "RIO DE JANEIRO", "RIO DE JANEIRO", "PORT...
## $ estate               <chr> "RJ", "RJ", "RS", "SP", "SP", "SP", "SP",...
## $ situation            <chr> "ATIVO", "ATIVO", "CANCELADA", "CANCELADA...
## $ situation.operations <chr> "LIQUIDAÇÃO EXTRAJUDICIAL", "FASE OPERACI...
## $ listing.segment      <chr> NA, "Tradicional", "Tradicional", "Tradic...
## $ main.sector          <chr> NA, "Financeiro e Outros", "Materiais Bás...
## $ sub.sector           <chr> NA, "Outros", "Siderurgia e Metalurgia", ...
## $ segment              <chr> NA, "Outros", "Siderurgia", "Açucar e Alc...
## $ tickers              <chr> NA, "QVQP3B", NA, NA, "ADHM3", "AELP3", "...
## $ first.date           <date> 1998-12-31, 2001-12-31, 2009-12-31, 2009...
## $ last.date            <date> 2017-12-31, 2017-12-31, 2009-12-31, 2009...

Finding names of companies

Every company in the database can be identified by its official name. Function gdfpd.search.company allow the user to search for the official name of a company.

Let’s find the official name of Ambev, one of the largest companies in Brazil:

my.names <- gdfpd.search.company('ambev')
## Found cache file. Loading data..
## 
## Found 2 companies:
## COMPANHIA DE BEBIDAS DAS AMÉRICAS-AMBEV | situation = CANCELADA | first date = 2010-12-31 | last date - 2012-12-31
## AMBEV S.A.                              | situation = ATIVO | first date = 2012-12-31 | last date - 2017-12-31

Its official name in Bovespa records is AMBEV S.A..

Downloading financial information for one company

name.companies <- 'AMBEV S.A.'
first.date <- '2012-12-31'
last.date  <- '2016-12-31'

df.reports <- gdfpd.GetDFPData(name.companies = name.companies, 
                               first.date = first.date,
                               last.date = last.date)
## Found cache file. Loading data..
## Fetching new files from Bovespa.
## 
## Downloading data for 1 companies
## First Date: 2012-12-31
## Laste Date: 2016-12-31
## Inflation index: dollar
## 
## Downloading inflation data
##  Caching inflation RDATA into tempdir()  Done
## 
## Inputs looking good! Starting download of files:
## 
## AMBEV S.A.
##  Available periods: 2016-12-31   2015-12-31  2014-12-31  2013-12-31  2012-12-31
## 
## 
## Processing 23264 - AMBEV S.A.
##  Finding info from Bovespa
##      Found BOV cache file
##  Processing 23264 - AMBEV S.A. | date 2016-12-31
##      Acessing DFP data | Found DFP cache file
##      Acessing FRE data | Found FRE cache file
##      Acessing FCA data | Found FCA cache file
##  Processing 23264 - AMBEV S.A. | date 2015-12-31
##      Acessing DFP data | Found DFP cache file
##      Acessing FRE data | Found FRE cache file
##      Acessing FCA data | Found FCA cache file
##  Processing 23264 - AMBEV S.A. | date 2014-12-31
##      Acessing DFP data | Found DFP cache file
##      Acessing FRE data | Found FRE cache file
##      Acessing FCA data | Found FCA cache file
##  Processing 23264 - AMBEV S.A. | date 2013-12-31
##      Acessing DFP data | Found DFP cache file
##      Acessing FRE data | Found FRE cache file
##      Acessing FCA data | Found FCA cache file
##  Processing 23264 - AMBEV S.A. | date 2012-12-31
##      Acessing DFP data | Found DFP cache file
##      Acessing FRE data | Found FRE cache file
##      Acessing FCA data | Found FCA cache file

The Result

The resulting object from gdfpd.GetDFPData is a tibble, a data.frame type of object that allows for list columns (Müller and Wickham 2017):

glimpse(df.reports)
## Observations: 1
## Variables: 40
## $ company.name                     <chr> "AMBEV S.A."
## $ company.code                     <int> 23264
## $ cnpj                             <chr> "07526557000100"
## $ date.company.constitution        <date> 2005-07-08
## $ date.cvm.registration            <date> 2013-10-30
## $ company.tickers                  <chr> "ABEV3"
## $ min.date                         <date> 2012-12-31
## $ max.date                         <date> 2016-12-31
## $ n.periods                        <int> 5
## $ company.segment                  <chr> "Tradicional"
## $ current.stockholders             <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ current.stock.composition        <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.assets                        <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.liabilities                   <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.income                        <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.cashflow                      <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.assets.consolidated           <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.liabilities.consolidated      <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.income.consolidated           <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.cashflow.consolidated         <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ fr.auditing.report               <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.dividends                <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.stockholders             <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.capital.issues           <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.mkt.value                <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.capital.increases        <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.capital.reductions       <list> [<>]
## $ history.stock.repurchases        <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.other.stock.events       <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.compensation             <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.compensation.summary     <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.transactions.related     <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.debt.composition         <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.governance.listings      <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.board.composition        <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.committee.composition    <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.family.relations         <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.family.related.companies <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.auditing                 <list> [<c("AMBEV S.A.", "AMBEV S.A...
## $ history.responsible.docs         <list> [<c("AMBEV S.A.", "AMBEV S.A...

All financial statements for the different quarters are available within df.reports. For example, the income statements for all desired quarters of AMBEV are:

df.income <- df.reports$fr.income[[1]]

glimpse(df.income)
## Observations: 130
## Variables: 6
## $ name.company       <chr> "AMBEV S.A.", "AMBEV S.A.", "AMBEV S.A.", "...
## $ ref.date           <date> 2016-12-31, 2016-12-31, 2016-12-31, 2016-1...
## $ acc.number         <chr> "3.01", "3.02", "3.03", "3.04", "3.04.01", ...
## $ acc.desc           <chr> "Receita de Venda de Bens e/ou Serviços", "...
## $ acc.value          <dbl> 20634193, -11263620, 9370573, 4006806, -399...
## $ acc.value.infl.adj <dbl> 6332420.75, -3456688.66, 2875732.09, 122964...

Downloading financial information for several companies

GetDFPData is specially designed for handling large scale download of data. Let’s build a case with three randomly selected active companies:

my.companies <- c("COMPANHIA CELG DE PARTICIPAÇÕES", "DOHLER S.A.", "KLABIN S.A.")

first.date <- '2012-01-31'
last.date  <- '2017-01-31'

df.reports <- gdfpd.GetDFPData(name.companies = my.companies, 
                               first.date = first.date,
                               last.date = last.date)

And now we can check the resulting tibble:

glimpse(df.reports )
## Observations: 3
## Variables: 40
## $ company.name                     <chr> "COMPANHIA CELG DE PARTICIPAÇ...
## $ company.code                     <int> 21393, 5207, 12653
## $ cnpj                             <chr> "08560444000193", "8468340800...
## $ date.company.constitution        <date> 2006-12-04, 1755-12-09, 1978...
## $ date.cvm.registration            <date> 2008-03-14, 1973-12-26, 1997...
## $ company.tickers                  <chr> "GPAR3", "DOHL3;DOHL4", "KLBN...
## $ min.date                         <date> 2012-12-31, 2012-12-31, 2012...
## $ max.date                         <date> 2016-12-31, 2016-12-31, 2016...
## $ n.periods                        <int> 5, 5, 5
## $ company.segment                  <chr> "Tradicional", "Tradicional",...
## $ current.stockholders             <list> [<c("COMPANHIA CELG DE PARTI...
## $ current.stock.composition        <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.assets                        <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.liabilities                   <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.income                        <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.cashflow                      <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.assets.consolidated           <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.liabilities.consolidated      <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.income.consolidated           <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.cashflow.consolidated         <list> [<c("COMPANHIA CELG DE PARTI...
## $ fr.auditing.report               <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.dividends                <list> [<>, <c("DOHLER S.A.", "DOHL...
## $ history.stockholders             <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.capital.issues           <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.mkt.value                <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.capital.increases        <list> [<>, <c("DOHLER S.A.", "DOHL...
## $ history.capital.reductions       <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.stock.repurchases        <list> [<>, <>, <c("KLABIN S.A.", "...
## $ history.other.stock.events       <list> [<>, <c("DOHLER S.A.", "DOHL...
## $ history.compensation             <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.compensation.summary     <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.transactions.related     <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.debt.composition         <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.governance.listings      <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.board.composition        <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.committee.composition    <list> [<COMPANHIA CELG DE PARTICIP...
## $ history.family.relations         <list> [<>, <c("DOHLER S.A.", "DOHL...
## $ history.family.related.companies <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.auditing                 <list> [<c("COMPANHIA CELG DE PARTI...
## $ history.responsible.docs         <list> [<c("COMPANHIA CELG DE PARTI...

Integration with tidyverse

library(tidyverse)

# set assets tbl
tab.assets <- bind_rows(df.reports$fr.assets) %>%
  filter(acc.desc == 'Ativo Total') %>%
  group_by(name.company, ref.date) %>%
  summarise(Total.Assets = acc.value)

# plot it
p <- ggplot(tab.assets, aes(x = ref.date, y=Total.Assets)) +
  geom_col() + facet_wrap(~name.company, scales = 'free')
print(p)

Exporting financial data

The package also includes function gdfpd.export.DFP.data for exporting the financial data in the Excel or csv format.

Next we present an example of saving the output as an Excel file.

my.basename <- 'MyExcelData'
my.type.export <- 'xlsx' # only supported so far

gdfpd.export.DFP.data(df.reports = df.reports, 
                      base.file.name = my.basename,
                      type.export = my.type.export)
## File  MyExcelData.xlsx  already exists. Deleting it..
## Start Copying Dataframes to MyExcelData.xlsx
##  Copying table metadata - 3 rows, 5 columns
##  Copying table metadata - 3 rows, 5 columns
##  Copying table current.stockholders - 28 rows, 6 columns
##  Copying table current.stock.composition - 9 rows, 4 columns
##  Copying table fr.assets - 240 rows, 6 columns
##  Copying table fr.liabilities - 375 rows, 6 columns
##  Copying table fr.income - 400 rows, 6 columns
##  Copying table fr.cashflow - 316 rows, 6 columns
##  Copying table fr.assets.consolidated - 225 rows, 6 columns
##  Copying table fr.liabilities.consolidated - 390 rows, 6 columns
##  Copying table fr.income.consolidated - 504 rows, 6 columns
##  Copying table fr.cashflow.consolidated - 350 rows, 6 columns
##  Copying table fr.auditing.report - 15 rows, 3 columns
##  Copying table history.dividends - 149 rows, 12 columns
##  Copying table history.stockholders - 131 rows, 14 columns
##  Copying table history.capital.issues - 30 rows, 5 columns
##  Copying table history.mkt.value - 15 rows, 5 columns
##  Copying table history.capital.increases - 72 rows, 8 columns
##  Copying table history.capital.reductions - 6 rows, 12 columns
##  Copying table history.stock.repurchases - 16 rows, 12 columns
##  Copying table history.other.stock.events - 17 rows, 8 columns
##  Copying table history.compensation - 45 rows, 18 columns
##  Copying table history.compensation.summary - 45 rows, 9 columns
##  Copying table history.transactions.related - 48 rows, 13 columns
##  Copying table history.debt.composition - 27 rows, 9 columns
##  Copying table history.governance.listings - 15 rows, 5 columns
##  Copying table history.board.composition - 365 rows, 18 columns
##  Copying table history.committee.composition - 10 rows, 18 columns
##  Copying table history.family.relations - 83 rows, 10 columns
##  Copying table history.family.related.companies - 419 rows, 11 columns
##  Copying table history.auditing - 21 rows, 10 columns
##  Copying table history.responsible.docs - 25 rows, 5 columns
## 
## Export sucessful. Data available at file MyExcelData.xlsx

Example of usage: Family companies and financial performance

To illustrate the use of GetDFPData we select a topic that is relatively simple and current: the relationship between family companies and financial performance.

Methodology

In order to test if a founding-family company has a better/worst performance, we analyze a large sample of Boards of Directors in Brazil. To do this, we estimate the following (fixed effects panel data) model:

\[ Performance_{i,t} = \phi_1 \times Family_{i,t} + \phi_2 \times Family_{i,t} ^2 + \beta \times Controls + \alpha_i + \mu_t + \epsilon_{i,t} \]

Where the main variables of interest are:

\(Performance_{i,t}\) - Return on assets (ROA) and return on equity (ROE) for company i, year t.

\(Family _{i,t}\) - percentage of participation of family members in boards for company i, year t.

And the controls variables are:

\(size _{i,t}\) - log of total assets, company i, year t.

\(age _{i,t}\) - log of companies age, company i, year t.

\(risk _{i,t}\) - standard deviation of daily returns of most liquid stock, calculated by year and by company.

\(leverage _{i,t}\) - ratio of total debt to total assets.

Example of usage: The Code

Our sample selection procedure follows the aforementioned studies. First, we’ll filter the data for only companies that have tickers, that is, are tradeable in the equity market, and those not belonging in the financial and utility sectors. This is a common filter in corporate finance studies as these sectors have specific dynamics related to their performance measured by ROE and ROA.

library(GetDFPData)
library(dplyr)
library(stringr)

# get info table
df.info <- gdfpd.get.info.companies(type.data = 'companies')
## Found cache file. Loading data..
# filter companies
df.info <- df.info %>%
  filter(!is.na(tickers),
         !(main.sector %in% c('Financeiro e Outros', 'Utilidade Pública')))

We find 242 companies with this simple filter. Now, let’s download data for the selected companies. Be aware that this is a lengthy download and may take a significant amount of time.

# set options
first.date <- '2012-01-31'
last.date  <- '2018-01-01'
my.companies <- df.info$name.company

# get data
df.reports <- gdfpd.GetDFPData(name.companies = my.companies,
                               first.date = first.date,
                               last.date = last.date)

The next step is the download of stock data with package BatchGetSymbols and the calculation of our volatility measure, \(risk _{i,t}\). Before downloading the data, we need to find all tickers of all companies. In the case of more than one stock for a company, we keep the asset with most financial volume.

library(BatchGetSymbols)
library(lubridate)

get.ticker.df <- function(df.in) {
  # Gets ticker string and organizes it in another data_frame

  temp.split <- str_split(df.in$tickers, ';')[[1]]

  temp.df <- data_frame(name.company = df.in$name.company,
                        ticker = temp.split)
}

# get dataframe with tickers
my.l <- by(data = df.info,
           INDICES = df.info$name.company,
           FUN = get.ticker.df)

# bind into one dataframe
df.tickers <- do.call(bind_rows, my.l)

# add .SA for yahoo finance
my.tickers <- paste0(df.tickers$ticker, '.SA')

# get data
df.stocks <- BatchGetSymbols(tickers = my.tickers,
                             first.date = '2010-01-01',
                             last.date = '2017-01-01',
                             bench.ticker = '^BVSP',
                             thresh.bad.data = 0.5)[[2]]

# replace .SA (keep same notation)
df.stocks$ticker <- str_replace_all(df.stocks$ticker,
                                    fixed('.SA'), '')

# join with ticker df
df.stocks <- inner_join(df.tickers, df.stocks)

# find selected tickers by volume
tab.volume <- df.stocks %>%
  group_by(name.company, ticker) %>%
  summarise(mean.volume = mean(volume, na.rm = TRUE)) %>%
  group_by(name.company) %>%
  summarise(selected.ticker = ticker[which.max(mean.volume)])

df.stocks <- df.stocks[df.stocks$ticker %in% tab.volume$selected.ticker, ]

# calculate volatility measure
df.volat <- df.stocks %>%
  mutate(ref.date = as.Date(paste0(year(ref.date),'-12-31'))) %>%
  group_by(name.company, ref.date) %>%
  summarise(risk = sd(ret.adjusted.prices, na.rm = TRUE))

The last step in the data acquisition stage is to bind all financial reports in a single dataframe.

get.col <- function(df.in, my.col) {
  # function to bind all rows (dataframes) of a tibble
  return(do.call(rbind, df.in[[my.col]]))
}

# financial reports
df.fr <- bind_rows(get.col(df.reports, 'fr.income'),
                   get.col(df.reports, 'fr.assets'),
                   get.col(df.reports, 'fr.cashflow'),
                   get.col(df.reports, 'fr.liabilities'))

# board composition and family participation
df.board <- get.col(df.reports, 'history.board.composition')
df.family <- get.col(df.reports, 'history.family.relations')

Now that all raw data is available, we can proceed to calculate our variables of interest. The next code will process df.family and df.board, which contains information about board composition and family members of companies, and output a reference table that shows, for each company and year, the percentage of board seats held by family members.

ref.tab <- df.family %>%
  group_by(name.company, ref.date) %>%
  summarise(persons = str_c(unique(person.name),
                            collapse = ' ; ')) %>%
  merge(df.board) %>%
  mutate(dummy = person.name %in% str_split(persons,
                                            fixed(' ; '),
                                            simplify = TRUE)) %>%
  group_by(name.company, ref.date) %>%
  summarise(family.idx = mean(dummy)) %>%
  glimpse()
## Observations: 943
## Variables: 3
## $ name.company <chr> "ADVANCED DIGITAL HEALTH MEDICINA PREVENTIVA S.A....
## $ ref.date     <date> 2014-12-31, 2009-12-31, 2010-12-31, 2011-12-31, ...
## $ family.idx   <dbl> 0.20000000, 0.16666667, 0.20000000, 0.19230769, 0...

Finally, we build our final dataframe containing all variables necessary for our panel and OLS model:

get.acc <- function(acc.value, search.vec, my.str) {
  # get account value from fr.dataframe

  out.value <- acc.value[search.vec == my.str]

  # test if it exists
  if (length(out.value) == 0) out.value <- NA

  return(out.value)

}

# build final table
model.table <- df.fr %>%
  group_by(name.company, ref.date) %>%
  summarise(profit.loss = get.acc(acc.value, acc.number, '3.09'),
            total.assets = get.acc(acc.value, acc.number, '1'),
            capital = get.acc(acc.value, acc.number,  '2.03'),
            total.sales = get.acc(acc.value, acc.number, '3.01'),
            EBIT = get.acc(acc.value, acc.number, '3.05'),
            deprec.amort = get.acc(acc.value, acc.number, '6.01.01.02'),
            EBITDA = EBIT + deprec.amort,
            short.term.debt = get.acc(acc.value, acc.number, '2.01.04'),
            long.term.debt = get.acc(acc.value, acc.number, '2.02.01'),
            total.debt = short.term.debt + long.term.debt,
            size = log(total.assets),
            leverage = total.debt/total.assets,
            ROA = EBITDA/total.assets,
            ROE = profit.loss/capital) %>%
  inner_join(df.volat) %>%
  inner_join(ref.tab) %>%
  inner_join(df.info) %>%
  filter(total.assets > 0,
         total.sales > 0,
         capital > 0)  %>%
  inner_join(select(df.reports, company.name,
                    date.company.constitution,
                    date.cvm.registration),
             by = c('name.company' = 'company.name' )) %>%
  mutate(age = log(as.numeric(ref.date - date.company.constitution)/365)) %>%
  select(name.company, ref.date, ROA,
         ROE, size, age, risk, leverage, family.idx, main.sector) %>%
  glimpse()
## Observations: 506
## Variables: 10
## $ name.company <chr> "ALPARGATAS SA", "ALPARGATAS SA", "ALPARGATAS SA"...
## $ ref.date     <date> 2010-12-31, 2011-12-31, 2012-12-31, 2013-12-31, ...
## $ ROA          <dbl> 0.17146869, 0.15552614, 0.13210314, 0.11059302, 0...
## $ ROE          <dbl> 0.23129018, 0.20804121, 0.16911026, 0.16899388, 0...
## $ size         <dbl> 14.47189, 14.52877, 14.66921, 14.89599, 14.92580,...
## $ age          <dbl> 4.642624, 4.652211, 4.661732, 4.671137, 4.680456,...
## $ risk         <dbl> 0.01515056, 0.01973376, 0.01816129, 0.01482123, 0...
## $ leverage     <dbl> 0.114016614, 0.059653695, 0.057139840, 0.13141140...
## $ family.idx   <dbl> 0.20000000, 0.19230769, 0.19230769, 0.19230769, 0...
## $ main.sector  <chr> "Consumo Cíclico", "Consumo Cíclico", "Consumo Cí...

As a final step, we remove outliers from the data using a simple winsorization procedure with a 2.5% cut-off in both sides of the distribution:

winsorize <- function(x) {
  lim <- quantile(x, probs = c(0.025,0.975), na.rm = T)
  x[x<lim[1]] <- lim[1]
  x[x>lim[2]] <- lim[2]
  return(x)
}

model.table[,c('ROA','ROE','risk')] <- sapply(model.table[,c('ROA','ROE','risk')],
                                              winsorize)

A summary of the resulting dataset is given next:

summary(model.table)
##  name.company          ref.date               ROA          
##  Length:506         Min.   :2010-12-31   Min.   :-0.05480  
##  Class :character   1st Qu.:2011-12-31   1st Qu.: 0.04494  
##  Mode  :character   Median :2013-12-31   Median : 0.08376  
##                     Mean   :2013-11-23   Mean   : 0.08911  
##                     3rd Qu.:2015-12-31   3rd Qu.: 0.13820  
##                     Max.   :2016-12-31   Max.   : 0.23153  
##                                          NA's   :23        
##       ROE                size             age              risk        
##  Min.   :-0.74808   Min.   : 9.639   Min.   :0.3384   Min.   :0.01086  
##  1st Qu.: 0.01808   1st Qu.:13.491   1st Qu.:3.4971   1st Qu.:0.01985  
##  Median : 0.08990   Median :14.472   Median :3.9772   Median :0.02442  
##  Mean   : 0.06248   Mean   :14.535   Mean   :3.9126   Mean   :0.03129  
##  3rd Qu.: 0.16937   3rd Qu.:15.359   3rd Qu.:4.2611   3rd Qu.:0.03234  
##  Max.   : 0.43936   Max.   :20.276   Max.   :7.6095   Max.   :0.12451  
##                                                       NA's   :2        
##     leverage        family.idx      main.sector       
##  Min.   :0.0000   Min.   :0.00000   Length:506        
##  1st Qu.:0.1140   1st Qu.:0.07743   Class :character  
##  Median :0.2615   Median :0.12500   Mode  :character  
##  Mean   :0.2566   Mean   :0.15648                     
##  3rd Qu.:0.3736   3rd Qu.:0.18182                     
##  Max.   :0.7893   Max.   :0.71429                     
## 

Model estimation and results

With the required dataset cleaned and structured, we proceed with the estimation of the models. First we set the model specification as a formula object:

control.vars <- '+ size + age + risk + leverage'
my.index <- c('name.company', 'ref.date')

my.formula.ROA.plm <- formula(paste0('ROA ~ family.idx + I(family.idx^2) ',
                                 control.vars,
                                 ' + factor(year(ref.date))'))

my.formula.ROE.plm <- formula(paste0('ROE ~ family.idx + I(family.idx^2) ',
                                 control.vars,
                                 ' + factor(year(ref.date))'))

my.formula.ROE.lm <- formula(paste0('ROE ~ family.idx + I(family.idx^2) ',
                                 control.vars,
                                 ' + factor(year(ref.date))',
                                 ' + factor(main.sector)') )

my.formula.ROA.lm <- formula(paste0('ROA ~ family.idx + I(family.idx^2) ',
                                 control.vars,
                                 ' + factor(year(ref.date))',
                                 ' + factor(main.sector)') )

We have four models, two panel data and two ordinary least squares. We can now estimate them using plm and lm:

library(plm)

my.plm.ROA <- plm(data = model.table,
                  formula = my.formula.ROA.plm,
                  index = my.index,
                  model = 'within')

my.lm.ROA <- lm(data = model.table,
                formula = my.formula.ROA.lm)

my.plm.ROE <- plm(data = model.table,
                  formula = my.formula.ROE.plm,
                  index = my.index,
                  model = 'within')

my.lm.ROE <- lm(data = model.table,
                formula = my.formula.ROE.lm)

The Results

library(stargazer)

stargazer(my.plm.ROA, my.lm.ROA, my.plm.ROE, my.lm.ROE,
          type = 'html',
          title = 'Estimation Results',
          model.names = F,
          column.labels = c('FE', 'OLS', 'FE', 'OLS'),
          style = 'aer',
          omit = 'factor*',
          omit.stat = c('f', 'ser'),
          add.lines = list(c('Year Dummies', 'Yes', 'Yes',
                             'Yes', 'Yes'),
                           c('Sector Dummies', 'No', 'Yes',
                             'No', 'Yes')))
Estimation Results
ROA ROE
FE OLS FE OLS
(1) (2) (3) (4)
family.idx -0.010 -0.027 -0.356 0.374*
(0.106) (0.079) (0.346) (0.195)
I(family.idx2) 0.151 0.063 0.116 -0.867***
(0.219) (0.145) (0.650) (0.327)
size 0.062*** -0.002 0.003 0.006
(0.012) (0.002) (0.017) (0.005)
age 0.040 0.000 0.162* -0.003
(0.026) (0.003) (0.090) (0.008)
risk -0.242** -0.907*** -0.957** -3.121***
(0.122) (0.146) (0.437) (0.399)
leverage -0.079** -0.004 -0.089 -0.187***
(0.031) (0.018) (0.101) (0.049)
Constant 0.192*** 0.239***
(0.032) (0.088)
Year Dummies Yes Yes Yes Yes
Sector Dummies No Yes No Yes
Observations 481 481 504 504
R2 0.290 0.183 0.178 0.266
Adjusted R2 0.108 0.151 -0.029 0.238
Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.

Conclusions

Thank you!! Questions?

References

Müller, Kirill, and Hadley Wickham. 2017. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.