The book Analyzing Financial and Economic Data with R is available at Amazon, as an ebook and print. A restricted online version is available at https://www.msperlin.com/afedr
The R language was developed based on what other programming language?
Straight from the book, section What is R: “R is a modern version of S, a programming language originally created in Bell Laboratories (formerly AT&T, now Lucent Technologies).”
What are the names of the two authors of R?
Straight from the book: “… The base code of R was developed by two academics, Ross Ihaka and Robert Gentleman, resulting in the programming platform we have today.”.
Why is R special when compared to other programming languages, such as Python, C++, JavaScript and others?
Undoubtedly, the main advantage of the R language is the ease with which data can be analyzed on the platform. Although other languages also allow data analysis, it is in R that this process is supported by a wide range of specialized and efficient packages.
What was the reason the programming language was named R?
The letter R was chosen because it is the first letter of the first names of the platform's two authors.
Consider the following alternatives about R and RStudio:
I - R was developed in 2018 and is an innovative and unstable project;
II - RStudio is an alternative programming language to R;
III - R is not compatible with different programming languages;
Which alternatives are correct?
None of the alternatives are correct: R was created in the early 1990s and is a mature, stable project; RStudio is an integrated development environment (IDE) for R, not a programming language; and R integrates well with many other languages, such as C++ and Python. See section “Why Choose R” in the “Introduction” chapter.
Once you have R and RStudio installed, head over to the CRAN package website and look for technologies you use in your work. For example, if you use Google Sheets extensively in your work, you will soon discover that there is a package on CRAN called googlesheets4 that interacts with spreadsheets in the cloud.
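As an illustration, here is a minimal sketch of reading a public spreadsheet with googlesheets4. The URL below is a placeholder; replace it with the address of your own sheet.
library(googlesheets4)
# read public sheets without a Google login
gs4_deauth()
# hypothetical URL - replace with your own spreadsheet address
my_sheet_url <- 'https://docs.google.com/spreadsheets/d/SOME-SHEET-ID/'
df_sheet <- read_sheet(my_sheet_url)
head(df_sheet)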
On the CRAN site you can also install the Rtools application. What is it for?
Rtools is a Windows-specific toolchain for R. It is used to compile packages from source code and is a requirement for those who develop packages. For the average user, however, it is also recommended to install Rtools, as some packages require compilation at install time.
For Linux/Unix or macOS users, Rtools is not necessary since compilers are generally already available within the operating system itself.
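As a quick check (assuming the pkgbuild package is installed), you can verify from within R whether compilation tools are available:
# check whether build tools (Rtools on Windows) are available
pkgbuild::has_build_tools(debug = TRUE)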
Use Google to search for R groups in your region. Check if the meetings are frequent and, if you don’t have a major impediment, go to one of these meetings and make new friends.
It is not uncommon for programmers to have a tendency toward introversion. This was certainly my case at the beginning of my career. But know that shyness is a transitory state. In the same way that you improve in any sport at the rate at which you practice it, the more communicative you are, the less shy and the more assertive you tend to become.
The sad (or not) reality for the timid is that communication is a fundamental part of adult life and a way to maintain your professional network. The more people who know your work and your personality, the better. Perhaps a person you met in one of these groups can refer you to a job vacancy or a future project. Summing up, what do you really have to lose by going to one of these meetings?
Go to the R-Bloggers website and look for a topic of interest to you, such as football (soccer) or investments. Read at least three of the blog posts you find.
I am particularly passionate about the sport of tennis. On the R-Bloggers website I found the following articles mixing R and tennis:
Using R to study the evolution of Tennis
Visualizing Tennis Grand Slam Winners Performances
Tennis Grand Slam Tournaments Champions Basic Analysis
If you work in an institution with data infrastructure, talk to the person in charge of the IT department and verify what technologies are used. Check whether, through R, it is possible to access all tables in the databases. There is no need to write code yet; just check whether this possibility exists.
At the university we have access to several paid repositories of financial data. Unfortunately, none of them offers any type of API for communication with R. In fact, this was one of the motivations for writing R packages that give free access to financial data.
In RStudio, create a new script and save it in a personal folder. Now, write R commands in the script that define two objects: one holding a sequence between 1 and 100 and the other with the text of your name (ex. ‘Richard’). Execute the whole script with the keyboard shortcuts.
x <- 1:100
y <- 'Richard'
# press control+shift+enter to run this chunk of code in RStudio
In the previously created script, use function message to display the following phrase in R's prompt: "My name is ....".
x <- 36
y <- 'Richard'
message(paste0('My name is ', y))
# press control+shift+enter to run this chunk of code in RStudio
Within the same script, show the current working directory (see function getwd, as in print(getwd())). Now, change your working directory to the Desktop folder and show the following message on the prompt: 'My desktop address is ....'. Tip: use RStudio's autocomplete tool to quickly find the desktop folder.
current_dir <- getwd()
print(current_dir)
new_dir <- '~/Desktop/' # this is probably C:/Users/USERNAME/Desktop for Windows
setwd(new_dir)
cat(paste0('My desktop address is ', getwd()))
Use R to download the compressed zip file with the book material, available at this link. Save it as a file in the temporary session folder (see function fs::file_temp()).
local_file <- fs::file_temp(ext = '.zip')
my_url <- afedR3::links_get()$book_blog_zip
download.file(url = my_url,
destfile = local_file)
# check if exists
fs::file_exists(local_file)
Use the unzip function to extract the downloaded file from the previous question into a directory called 'afedR-files' inside the “Desktop” folder. How many files are available in the resulting folder? Tip: use the recurse = TRUE argument of fs::dir_ls to also search all available subdirectories.
my_folder <- '~/Desktop/afedR-files' # this is probably C:/Users/USERNAME/Desktop/afedR-files on Windows
unzip(local_file, exdir = my_folder) # local_file comes from previous exercise
files <- fs::dir_ls(my_folder,
recurse = TRUE)
n_files <- length(files)
message(paste0('There are ', n_files, ' files available at folder "', my_folder, '".'))
Every time the user installs an R package, all package files are stored locally in a specific directory of the hard disk. Using command Sys.getenv('R_LIBS_USER') and function fs::dir_ls, list all the directories in this folder. How many packages are available in this folder on your computer?
r_pkg_folder <- Sys.getenv('R_LIBS_USER')
available_dirs <- fs::dir_ls(
r_pkg_folder,
recurse = FALSE,
type = 'directory')
n_dirs <- length(available_dirs)
cat(paste0('There are ', n_dirs, ' folders available at "', r_pkg_folder, '".'))
In the same topic as the previous exercise, list all files in all subfolders of the directory containing the files of the different packages (see command Sys.getenv('R_LIBS_USER')). On average, how many files are needed for each package?
r_pkg_folder <- Sys.getenv('R_LIBS_USER')
pkg_files <- fs::dir_ls(r_pkg_folder, recurse = TRUE)
my_dirs <- fs::dir_ls(r_pkg_folder,
                      recurse = FALSE,
                      type = 'directory')
n_files <- length(pkg_files)
n_dirs <- length(my_dirs)
my_msg <- paste0('We have ', length(pkg_files), ' ',
'files for ', length(my_dirs), ' packages. \n',
'On average, there are ', n_files/n_dirs, ' files per directory.')
message(my_msg)
Use the install.packages function to install the yfR package on your computer. After installation, use function yf_get() to download price data for the IBM stock in the last 15 days. Tip: use function Sys.Date() to find out the current date and Sys.Date() - 15 to calculate the date located 15 days in the past.
if (!require(yfR)) install.packages('yfR')
df_prices <- yfR::yf_get(tickers = 'IBM',
first_date = Sys.Date() - 15,
last_date = Sys.Date())
str(df_prices)
The cranlogs package allows access to download statistics of CRAN packages. After installing cranlogs on your computer, use the cranlogs::cran_top_downloads function to check which are the 10 most installed packages by the global community in the last month. Which package comes first? Tip: set the cran_top_downloads function input to when = 'last-month'. Also, be aware that your answer may differ from the one reported here because it depends on the day the R code was executed.
#if (!require(cranlogs)) install.packages('cranlogs')
pkgs <- cranlogs::cran_top_downloads(when = 'last-month')
my_sol <- pkgs$package[1]
my_sol
Using the devtools package, install the development version of the ggplot2 package, available in the Hadley Wickham repository. Load the package using library and create a simple figure with the code qplot(y = rnorm(10), x = 1:10).
if (!require(devtools)) install.packages("devtools")
devtools::install_github('hadley/ggplot2')
library(ggplot2)
qplot(y = rnorm(10), x = 1:10)
Using your programming ability, check which folder in your “Documents” directory (shortcut = ~) has the largest number of files. Display the five folders with the largest number of files on R's prompt.
doc_folder <- '~' # 'C:/Users/USERNAME/Documents' in Windows
# '/home/USERNAME/ in Linux
fct_count_files <- function(dir_in) {
n_files <- fs::dir_ls(dir_in, recurse = FALSE)
return(length(n_files))
}
# be aware this might take lots of time...
all_folders <- fs::dir_ls(path = doc_folder,
type = 'directory',
recurse = TRUE)
counter_files <- sapply(all_folders, fct_count_files)
sorted <- sort(counter_files, decreasing = TRUE)
message('\nThe five folders with highest number of files are:\n\n')
message(paste0(names(sorted[1:5]), collapse = '\n'))
Imagine a survey regarding your household budget over time. Financial data is available in electronic spreadsheets separated by month, for 10 years. The objective of the research is to understand whether it is possible to purchase a real estate property in the next five years. Within this setup, describe in text the elements of each stage of the study, from importing the data to the construction of the report.
The possible stages of the study are:
Importing 01: Import all data related to income and family budget, rate of return on investments - fixed income or stocks - and historical property prices in the desired locations;
Cleaning 01: Clean the data for outliers and missing data (NA);
Manipulation 01: Use personal income data to find the need for monthly savings for each year.
Manipulation 02: Based on the historical returns of investments and the inflation of property prices, check how many years it takes to save the amount of money needed to buy the property (a numerical sketch of this step follows below).
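To make stage Manipulation 02 concrete, here is a minimal sketch with purely hypothetical numbers of how the number of years needed to reach the savings target could be computed:
# hypothetical inputs: monthly savings, yearly return and property price
monthly_saving <- 1500
annual_return <- 0.08
target_value <- 300000
value <- 0
n_years <- 0
while (value < target_value) {
  # compound last year's savings and add this year's contributions
  value <- value*(1 + annual_return) + 12*monthly_saving
  n_years <- n_years + 1
}
n_years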
Based on the study proposed earlier, create a directory structure on your computer to accommodate the study. Create mock files for each subdirectory (see the directory structure in section @ref(directories)). Be aware that you can create mock files and directories all within R (see functions cat and fs::dir_create).
library(fs)
# set temp dir for solving exercise
temp_dir <- path(tempdir(), 'Chapter 3 exercise')
# create folder
dir_create(temp_dir)
# create files
file_create(path(temp_dir, '01-import_data.R'))
file_create(path(temp_dir, '02-clean_data.R'))
file_create(path(temp_dir, '03-build_budget_table.R'))
file_create(path(temp_dir, '04-build_investment_realstate_table.R'))
file_create(path(temp_dir, '05-build_report.R'))
# create dirs
dir_create(path(temp_dir, 'data'))
dir_create(path(temp_dir, 'tabs'))
dir_create(path(temp_dir, 'figs'))
# fill with files
for (i_year in 2009:2019) {
file_create(path(temp_dir, 'data',
paste0('annual_budget_', i_year, '.xlsx')) )
}
file_create(path(temp_dir, 'data', 'fixed_income_index.csv'))
file_create(path(temp_dir, 'data', 'real_state_data.rds'))
file_create(path(temp_dir, 'figs', 'Income_and_HousePrices.png'))
file_create(path(temp_dir, 'tabs', 'Table_with_Results.xlsx'))
dir_tree(temp_dir)
Create a dataframe with the following code:
library(dplyr)
my_N <- 10000
my_df <- tibble(x = 1:my_N,
y = runif(my_N))
Export the resulting dataframe to each of the four formats: csv, rds, xlsx and fst. Which of the formats took up the most disk space? Tip: file.size calculates the size of files within R.
library(dplyr)
library(readr)
do_tests <- function(my_N) {
my_df <- tibble(x = 1:my_N,
y = runif(my_N))
# csv
my_f <- tempfile(pattern = 'temp', fileext = '.csv')
time.csv <- system.time({
write_csv(my_df, my_f)
})['elapsed']
size.csv <- file.size(my_f)/1000000
# rds
my_f <- tempfile(pattern = 'temp', fileext = '.rds')
time.rds <- system.time({
write_rds(my_df, my_f)
})['elapsed']
size.rds <- file.size(my_f)/1000000
# xlsx
my_f <- tempfile(pattern = 'temp', fileext = '.xlsx')
library(writexl)
time.xlsx <- system.time({
write_xlsx(my_df, my_f)
})['elapsed']
size.xlsx <- file.size(my_f)/1000000
# fst
library(fst)
my_f <- tempfile(pattern = 'temp', fileext = '.fst')
time.fst <- system.time({
write_fst(my_df, my_f)
})['elapsed']
size.fst <- file.size(my_f)/1000000
print(c(size.csv, size.rds, size.xlsx, size.fst))
print(c(time.csv, time.rds, time.xlsx, time.fst))
tab <- tibble(Result = c('csv', 'rds', 'xlsx', 'fst'),
Size = c(size.csv, size.rds, size.xlsx, size.fst),
Time = c(time.csv, time.rds, time.xlsx, time.fst))
return(tab)
}
my_N <- 10000
tab <- do_tests(my_N)
print(tab)
my_msg <- paste0('The format with largest disk space for N = ', my_N, ' is ',
tab$Result[which.max(tab$Size)], '.')
message(my_msg)
Improve the previous code by measuring the execution time for saving the data in the different formats. Which file format resulted in the fastest execution for exporting the data? Tip: use the system.time function or the tictoc package to calculate the execution times.
# do notice that this chunk requires the execution of previous solution
my_msg <- paste0('The format with least execution time for N = ', my_N, ' is ',
tab$Result[which.min(tab$Time)], '.')
message(my_msg)
For the previous code, reset the value of my_N to 1000000. Does it change the answers to the last two questions?
# do notice that this chunk requires the execution of previous solution
my_N <- 1000000
tab <- do_tests(my_N)
print(tab)
my_msg <- paste0('The format with largest disk space for N = ', my_N, ' is ',
tab$Result[which.max(tab$Size)], '.')
message(my_msg)
my_msg <- paste0('The format with least execution time for N = ', my_N, ' is ',
tab$Result[which.min(tab$Time)], '.')
message(my_msg)
Use the afedR3::data_path function to access the CH04_SP500.csv file in the book's data repository. Import the contents of the file into R with the function readr::read_csv. How many lines are there in the resulting dataframe?
my_f <- afedR3::data_path('CH04_SP500.csv')
df_SP500 <- readr::read_csv(my_f,
col_types = readr::cols())
my_msg <- paste0('There are ', nrow(df_SP500), ' rows and ',
ncol(df_SP500), ' columns in file ', basename(my_f))
message(my_msg)
my_sol <- nrow(df_SP500)
At link https://eeecon.uibk.ac.at/~zeileis/grunfeld/Grunfeld.csv you'll find a .csv file with the Grunfeld data. This is a particularly famous table due to its use as reference data in econometric models. Using the readr::read_csv function, read this file using the direct link as input. How many columns do you find in the resulting dataframe?
my_url <- 'https://eeecon.uibk.ac.at/~zeileis/grunfeld/Grunfeld.csv'
df_grunfeld <- readr::read_csv(my_url, col_types = readr::cols())
my_sol <- ncol(df_grunfeld)
Use the afedR3::data_path function to access the CH04_example-tsv.tsv file in the book's data repository. Note that the columns of the data are separated by the tab symbol ('\t'), and not the usual comma. After reading the readr::read_delim manual, import the information from this file into your R session. How many rows does the resulting dataframe contain?
my_f <- afedR3::data_path('CH04_example-tsv.tsv')
df_tsv <- readr::read_delim(my_f, delim = '\t', col_types = readr::cols())
my_sol <- nrow(df_tsv)
#check_answers(my_answers)
In the book package you'll find a data file called CH04_another-funky-csv-file.csv, with a particularly bizarre format. Open it in a text editor and try to understand how the columns are separated and what the decimal symbol is. After that, study the inputs of function utils::read.table and import the table into your R session. If we add the number of rows to the number of columns of the imported table, what is the result?
my_f <- afedR3::data_path("CH04_another-funky-csv-file.csv")
df_funky <- read.table(file = my_f,
dec = '?',
skip = 7,
sep = '|',
header = TRUE)
my_sol <- nrow(df_funky) + ncol(df_funky)
Using the yfR package, download daily data of the Facebook stock (META) from Yahoo Finance for the period between 2019 and 2023. What is the lowest unadjusted closing price (column price_close) in the analyzed period?
ticker <- 'META'
first_date <- '2019-01-01'
last_date <- '2023-01-01'
df_prices <- yfR::yf_get(
tickers = ticker,
first_date = first_date,
last_date = last_date)
my_sol <- min(df_prices$price_close,
na.rm = TRUE)
If you have not already done so, create a profile on the Quandl website and download the arabica coffee price data from the CEPEA database (Center for Advanced Studies in Applied Economics) between 2010-01-01 and 2020-12-31. What is the value of the most recent price?
library(GetQuandlData)
id_series <- c(COFFEE = 'CEPEA/COFFEE_A')
# set api key (make sure you got your own!)
#my_api_key <- readLines('~/Dropbox/98-pass_and_bash/.quandl_api.txt')
first_date <- '2010-01-01'
last_date <- '2020-12-31'
#df_coffee <- get_Quandl_series(id_in = id_series,
# first_date = first_date,
# last_date = last_date,
# api_key = my_api_key,
# do_cache = TRUE)
# find most recent
#idx <- which.max(df_coffee$ref_date)
#my_sol <- as.numeric(df_coffee$`Cash Price US$`[idx])
Use function simfinapi::sfa_get_entities() to import data about all available companies in SimFin. How many companies do you find? (see function dplyr::n_distinct()).
library(simfinapi)
library(dplyr)
library(readr)
# set api key - make sure you got your own at https://simfin.com/data/access/api
#my_api_key <- read_lines('~/GDrive/98-pass-and-bash/.api_key_simfin.txt')
my_api_key <- ''
cache_dir <- fs::path_temp("cache-simfin")
fs::dir_create(cache_dir)
# get info
simfinapi::sfa_set_api_key(my_api_key)
simfinapi::sfa_set_cache_dir(cache_dir)
# get info
df_info_companies <- simfinapi::sfa_get_entities()
# check it
glimpse(df_info_companies)
n_companies <- n_distinct(df_info_companies$ticker)
my_sol <- n_companies
With package simfinapi, download the PL (profit/loss) statement for FY (fiscal year) data for TESLA (ticker = “TSLA”) for the year 2022. What is the latest Profit/Loss of the company for that particular year?
library(simfinapi)
library(dplyr)
library(readr)
# set api key - make sure you got your own at https://simfin.com/data/access/api
#my_api_key <- read_lines('~/GDrive/98-pass-and-bash/.api_key_simfin.txt')
my_api_key <- '' # place your own SimFin API key here
cache_dir <- fs::path_temp("cache-simfin")
fs::dir_create(cache_dir)
# get info
simfinapi::sfa_set_api_key(my_api_key)
simfinapi::sfa_set_cache_dir(cache_dir)
ticker <- 'TSLA'
type_statement <- 'pl' # profit/loss
period <- 'fy' # fiscal year
year <- 2022
PL <- sfa_get_statement(
ticker = ticker,
statement = type_statement,
period = period,
fyear = year)
glimpse(PL)
desired_acc <- 'net_income'
latest_acc_value <- PL$net_income[1]
my_sol <- latest_acc_value
Using function tidyquant::tq_index, download the current composition of the DOW index. Which company has the highest percentage weight in the composition of the index? Be aware that the answer is time-dependent and the reported result might be different from what you get in your R session.
library(tidyquant)
# print available indices
available_index <- tq_index_options()
print(available_index)
# get components of the DOW index
comp_idx <- tq_index('DOW')
highest_weight <- comp_idx$company[which.max(comp_idx$weight)]
my_sol <- highest_weight
Using again the yfR package, download financial data between 2019-01-01 and 2020-01-01 for the following tickers: AAPL, BAC, GE, TSLA and SNAP. Using the adjusted closing price column, which company provided the highest return to the stockholder during the analyzed period?
Tip: this is an advanced exercise that will require some coding. To solve it, check out function split to split the dataframe of price data and lapply to map a function to each dataframe.
library(dplyr)
first_date <- '2019-01-01'
last_date <- '2020-01-01'
my_tickers <- c('AAPL', 'BAC',
'GE', 'TSLA',
'SNAP')
df_prices <- yfR::yf_get(
tickers = my_tickers,
first_date = first_date,
last_date = last_date)
split_l <- split(df_prices, df_prices$ticker)
my_fct <- function(df_in) {
price_vec <- df_in$price_adjusted
ticker_in <- df_in$ticker[1]
total_ret <- last(price_vec)/first(price_vec) - 1
return(tibble(ticker = ticker_in,
total_ret = total_ret))
}
df_results <- bind_rows(
lapply(split_l, my_fct)
)
winner <- df_results$ticker[which.max(df_results$total_ret)]
my_sol <- winner
Using function dplyr::tibble, create a dataframe called my_df with a column called x containing a sequence from -100 to 100 and another column called y equal to column x plus 5. How many values in column x are greater than 10 and lower than 25?
my_df <- dplyr::tibble(x = -100:100,
y = x + 5)
# solution
my_sol <- sum((my_df$x > 10)&(my_df$x < 25))
Create a new column in object my_df called cumsum_x, containing the cumulative sum of x (see the cumsum function). In this new column, how many values are greater than -3500?
my_df <- dplyr::tibble(x = -100:100,
y = x + 5)
# solution
my_df$cumsum_x <- cumsum(my_df$x)
# solution
my_sol <- sum(my_df$cumsum_x > -3500)
Use the dplyr::filter function and the pipeline operator to filter my_df, keeping only the rows where the value of the y column is greater than 0. What is the number of rows in the resulting table?
my_df <- dplyr::tibble(x = -100:100,
y = x + 5)
# solution
my_df2 <- my_df |>
dplyr::filter(y > 0)
# solution
my_sol <- nrow(my_df2)
If you have not already done so, repeat exercises 1, 2 and 3 using the functions of the tidyverse and the pipeline operator.
library(dplyr)
library(readr)
library(stringr)
my_df <- tibble(x = -100:100,
y = x + 5) |>
mutate(cumsum_x = cumsum(x))
# solution 01
sol_01 <- my_df |>
filter(x > 10,
x < 25) |>
nrow()
# solution 02
sol_02 <- my_df |>
mutate(cumsum_x = cumsum(x)) |>
filter(cumsum_x > -3500) |>
nrow()
# solution 03
sol_03 <- my_df |>
filter(y > 0) |>
nrow()
message(str_glue(
'Solution 01 = {sol_01} \n',
'Solution 02 = {sol_02} \n',
'Solution 03 = {sol_03}'
)
)
Use the yfR package to download Google (GOOG) stock data, from 2015-01-01 to 2023-01-01. If an investor had bought 1000 USD in Google shares on the first day of the data and kept the investment until the last day of the data, what would be the value of the portfolio?
library(dplyr)
first_date <- '2015-01-01'
last_date <- '2023-01-01'
df_prices <- yfR::yf_get('GOOG',
first_date,
last_date)
value_purchase <- 1000 # purchase value stated in the question
# solution
my_sol <- last(df_prices$price_adjusted)/first(df_prices$price_adjusted)*value_purchase
Use functions afedR3::data_path and readr::read_csv to import data from the CH11_grunfeld.csv file. Now, use function dplyr::glimpse to find out the number of lines in the imported data. What is the number of rows in the imported table?
library(dplyr)
library(readr)
my_file <- afedR3::data_path('CH11_grunfeld.csv')
df_grunfeld <- read_csv(my_file,
col_types = cols())
# solution
glimpse(df_grunfeld)
my_sol <- nrow(df_grunfeld)
Create a list object with three dataframes in its contents, df1, df2 and df3. The content and size of the dataframes are your personal choice. Now, use the sapply function and a custom function to find out the number of rows and columns in each dataframe.
df1 <- dplyr::tibble(x = 1:10)
df2 <- dplyr::tibble(y = runif(100))
df3 <- dplyr::tibble(z = rnorm(150),
m = rnorm(150))
my_l <- list(df1, df2, df3)
my_fct <- function(df_in) {
out <- c('nrows' = nrow(df_in),
'ncols' = ncol(df_in))
return(out)
}
tab <- sapply(my_l, my_fct)
tab
Within an R session, create an identity matrix (value 1 in the diagonal, zero everywhere else) of size 5x5. Tip: use the diag function to define the diagonal of the matrix.
my_size <- 5
M_identity <- matrix(0,
nrow = my_size,
ncol = my_size)
# solution
diag(M_identity) <- 1
print(M_identity)
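As a side note, the diag function can also build the identity matrix directly when given a single number:
# same result with a single call to diag()
M_identity_2 <- diag(my_size)
identical(M_identity, M_identity_2)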
Consider the following vectors x and y:
set.seed(7)
x <- sample(1:3, size = 5, replace = TRUE)
y <- sample(1:3, size = 5, replace = TRUE)
What is the sum of the elements of a new vector resulting from the multiplication of the elements of x and y?
set.seed(7)
x <- sample(1:3, size = 5, replace = TRUE)
y <- sample(1:3, size = 5, replace = TRUE)
# solution
my_sol <- sum(x*y)
If we computed the cumulative sum of a sequence between 1 and 100, at which element would this sum exceed the value of 50?
my_sum <- cumsum(1:100)
# solution
my_sol <- (which(my_sum > 50)[1])
Using R, create a sequence called seq_1 between -15 and 10, where the interval between values is always equal to 2. What is the sum of the elements of seq_1?
# solution
seq_1 <- seq(from = -15, to = 10, by = 2)
# solution
my_sol <- sum(seq_1)
Define another object called seq_2 containing a sequence of size 1000, with values between 0 and 100. What is the standard deviation (function sd()) of this sequence?
seq_2 <- seq(from = 0,
to = 100,
length.out = 1000)
# solution
my_sol <- sd(seq_2)
Calculate the sum of the seq_1 and seq_2 vectors (see previous exercises). Did this operation work despite the different sizes of the vectors? Explain your answer. If it works, what is the highest value of the resulting vector?
Yes, it worked, but with a warning: “longer object length is not a multiple of shorter object length”. The explanation is that whenever R encounters operations between vectors of different sizes, it uses the recycling rule: the shorter vector is repeated as many times as necessary to match the size of the longer vector. In the book, see the section on numeric vectors for more details.
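As a minimal illustration of the recycling rule (unrelated to the exercise objects):
# the shorter vector c(1, 2) is recycled three times to match length 6
1:6 + c(1, 2)
# when the lengths are not multiples, R still recycles, but issues a warning
1:5 + c(1, 2)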
seq_1 <- seq(from = -15, to = 10, by = 2)
seq_2 <- seq(from = 0,
to = 100,
length.out = 1000)
# solution
my_sol <- max(seq_1+seq_2)
Let's assume that, on a certain date, you bought 100 shares in a company, paying $15 per share. After some time, you sold 41 shares for $14 each and the remaining 59 shares were sold for $17 each on a later day. Using a script in R, structure this financial problem by creating numeric objects. What is the total gross profit/loss from this sequence of transactions on the stock market?
total_shares <- 100
price_purchase <- 15
total_purchase_value <- total_shares*price_purchase
qtd_sell_1 <- 41
price_sell_1 <- 14
total_sell_1 <- qtd_sell_1*price_sell_1
qtd_sell_2 <- total_shares - qtd_sell_1
price_sell_2 <- 17
total_sell_2 <- qtd_sell_2*price_sell_2
total_sell_value <- total_sell_1 + total_sell_2
# solution
my_sol <- total_sell_value - total_purchase_value
Create a vector x according to the following formula, where $x_i = \frac{(-1)^{i+1}}{2i - 1}$ and $i = 1, 2, \ldots, 100$. What is the sum of the elements of x?
i <- 1:100
x <- ( (-1)^(i+1) )/(2*i - 1)
# solution
my_sol <- sum(x)
Create a vector z according to the following formula, where $x_i = i$ and $y_i = 51 - i$ for $i = 1, 2, \ldots, 50$, and $z_i = \frac{y_i - x_{i-1}}{y_{i-2}}$. What is the sum of the elements of z? Tip: check out how the dplyr::lag() function works.
x <- 1:50
y <- 50:1
# solution using `base`
z <- (y - c(NA, x[1:(length(x)-1)]))/c(NA, NA, y[1:(length(y)-2)])
# solution with tidyverse (much prettier, huh?)
z <- (y - dplyr::lag(x, n = 1))/dplyr::lag(y, n = 2)
# solution (be aware of the NA values)
my_sol <- sum(z, na.rm = TRUE)
Using set.seed(10), create an object called x with 1000 random values drawn from the Normal distribution with a mean of 10 and a standard deviation of 10. Using the cut function, create another object that defines two groups based on the values of x: lower than 15 and greater than 15. How many observations do you find in the first group?
set.seed(10)
x <- rnorm(n = 1000, mean = 10, sd = 10)
my_group <- cut(x,
breaks = c(-Inf, 15, Inf))
# solution
my_sol <- table(my_group)[1]
Create the following object with the following code:
set.seed(15)
my_char <- paste(sample(letters, 5000, replace = T),
collapse = '')
How many times is the letter 'x' found in the resulting text object?
set.seed(15)
my_char <- paste(sample(letters, 5000, replace = T),
collapse = '')
# solution
my_sol <- stringr::str_count(my_char, 'x')
Based on the my_char object created earlier, if we divided it into several smaller pieces using the letter "b", what is the number of characters in the largest piece found?
set.seed(15)
my_char <- paste(sample(letters, 5000, replace = T),
collapse = '')
my_split <- stringr::str_split(my_char, pattern = 'b')[[1]]
# find number of characters in each
n_chars <- sapply(my_split, nchar)
# solution
my_sol <- n_chars[which.max(n_chars)]
At web address https://www.gutenberg.org/cache/epub/69694/pg69694.txt you'll find the full text of the book The Devil of East Lupton, Vermont, by William Fitzgerald. Use functions download.file and readr::read_lines to import the entire book as a vector of characters called my_book in R. How many lines does the resulting object have?
my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
my_book <- readr::read_lines(my_link)
# solution
my_sol <- length(my_book)
Bind the vector of characters in my_book into a single object called full_text using command paste0(my_book, collapse = '\n'). Using the stringr package, how many times is the word 'Vermont' repeated throughout the text?
my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
my_book <- readr::read_lines(my_link)
# solution
full_text <- paste0(my_book, collapse = '\n')
my_sol <- stringr::str_count(full_text, stringr::fixed('Vermont'))
For the full_text object created earlier, use the stringr::str_split() function to split the entire text using blank spaces. Based on the resulting vector, create a frequency table. What is the most used word in the text? Tip: remove all cases of empty characters.
my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
my_book <- readr::read_lines(my_link)
# solution
full_text <- paste0(my_book, collapse = '\n')
my_split <- stringr::str_split(full_text,
pattern = stringr::fixed(' '))[[1]]
# remove empty
my_split <- my_split[my_split != '']
my_tab <- sort(table(my_split), decreasing = TRUE)
# solution
my_sol <- names(my_tab[1])
Assuming that a person born on 2000-05-12 will live for 100 years, how many of that person's birthdays will fall on a weekend (Saturday or Sunday)? Tip: use operator %in% to check for multiple conditions in a vector.
b_day <- as.Date('2000-05-12')
n_years <- 100
b_day_vec <- seq(b_day, by = '1 year', length.out = n_years)
# note: weekdays() returns names in the local language (English assumed here)
w_days <- weekdays(b_day_vec)
n_weekend <- sum(w_days %in% c('Saturday', 'Sunday'))
# solution
my_sol <- n_weekend
What date and time is found $10^4$ (10,000) seconds after 2021-02-02 11:50:02?
time_1 <- as.POSIXct('2021-02-02 11:50:02')
my_sec <- 10000
my_sol <- time_1 + my_sec
Create a function called say_my_name that takes a person's name as input and shows the text "Your name is ..." on the screen. Within the scope of the function, use comments to describe the purpose of the function, its inputs and outputs.
say_my_name <- function(name_in) {
# Print a single name in the prompt
#
# ARGS: name_in - Name to be printed
#
# RETURNS: TRUE, if successful
my_msg <- paste0('Your name is ', name_in)
message(my_msg)
# invisible will omit output if function is called without definition of output
return(invisible(TRUE))
}
# testing
say_my_name('Marcelo')
Implement a test for the input objects, so that when the input name is not of the character class, an error is returned to the user. Test your new function and verify that it works as expected.
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '.')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
# testing Ok
say_my_name('Marcelo')
# testing a vector input (throws an error, as expected)
say_my_name(c('Richard', 'Michael'))
# testing a wrong class (also throws an error)
say_my_name(1)
Create a vector with any five names, called my_names. Using a loop, apply function say_my_name to each element of my_names.
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '\n')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
my_names <- c('Marcelo', 'Ricardo', 'Tarcizio', 'Ana', 'Paula')
for (i_names in my_names) {
say_my_name(i_names)
}
In the database of Brasil.IO you will find a table with names and genders. Import the data from the file and, using a loop, apply function say_my_name to 15 random names in the database. Tip: you can read the data directly from the link using readr::read_csv(LINK).
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '.')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
library(dplyr)
library(readr)
library(stringr)
# direct url for the 'genero-nomes' dataset at Brasil.IO
my_url <- 'https://data.brasil.io/dataset/genero-nomes/grupos.csv.gz'
df_names <- read_csv(my_url, col_types = cols())
my_names <- sample(df_names$name, 15)
for (i_name in my_names) {
say_my_name(i_name)
}
Redo the previous exercise using function sapply or purrr::walk.
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '\n')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
library(dplyr)
library(readr)
library(purrr)
# direct url for the 'genero-nomes' dataset at Brasil.IO
my_url <- 'https://data.brasil.io/dataset/genero-nomes/grupos.csv.gz'
df_names <- read_csv(my_url, col_types = cols())
# using sapply
vec_out <- sapply(sample(df_names$name, 15),
say_my_name)
glimpse(vec_out)
# using purrr
walk(sample(df_names$name, 15),
say_my_name)
Use the {yfR} package to download data for the SP500 ('^GSPC'), Ibovespa ('^BVSP'), FTSE ('^FTSE') and Nikkei 225 ('^N225') indices from '2010-01-01' to the current date. With the imported data, use a loop to calculate the average, maximum and minimum return for each index during the analyzed period. Save all results in a single table and show it in R's prompt.
library(dplyr)
indexes <- c('^BVSP', '^GSPC', '^FTSE', '^N225')
df_indices <- yfR::yf_get(indexes,
'2010-01-01',
Sys.Date())
tab <- tibble()
for (index in indexes) {
temp_df <- df_indices |>
filter(ticker == index)
avg_ret <- mean(temp_df$ret_adjusted_prices,
na.rm = TRUE)
max_ret <- max(temp_df$ret_adjusted_prices,
na.rm = TRUE)
min_ret <- min(temp_df$ret_adjusted_prices,
na.rm = TRUE)
# save result
tab <- bind_rows(tab, tibble(index = index,
mean_ret = avg_ret,
max_ret = max_ret,
min_ret = min_ret))
}
print(tab)
Redo the previous exercise using functions group_by and summarise, both from package dplyr.
library(dplyr)
indexes <- c('^BVSP', '^GSPC', '^FTSE', '^N225')
df_indices <- yfR::yf_get(indexes,
'2010-01-01',
Sys.Date())
tab_tidy <- df_indices |>
group_by(ticker) |>
summarise(mean_ret = mean(ret_adjusted_prices, na.rm = TRUE),
max_ret = max(ret_adjusted_prices, na.rm = TRUE),
min_ret = min(ret_adjusted_prices, na.rm = TRUE))
print(tab_tidy)
On the RStudio CRAN logs website, you'll find data on download statistics for the base distribution of R in the “Daily R downloads” section. Using your programming skills, import all available data between 2022-01-01 and 2022-01-05 and aggregate it into a single table. Which country has the highest download count for R?
# set function that will download the files
read_cranlogs_files <- function(date_in) {
# Reads log files from http://cran-logs.rstudio.com/
#
# ARGS: date_in - date of log data
require(lubridate)
url_dl <- paste0('http://cran-logs.rstudio.com/', year(date_in), '/',
date_in, '-r.csv.gz')
cat('\nReading ', url_dl)
df <- readr::read_csv(url_dl, col_types = readr::cols())
return(df)
}
# find out the availabe dates in url
library(rvest)
library(lubridate)
library(stringr)
library(dplyr)
available_links <- read_html('http://cran-logs.rstudio.com/') |>
html_nodes(css = 'a') |>
html_attr('href')
# only keep links for R download (those with -r.csv.gz pattern)
idx <- str_detect(available_links, '-r.csv.gz')
r_links <- available_links[idx]
# find out dates
dates_dls <- ymd(basename(r_links))
max_date <- max(dates_dls)
first_date <- as.Date('2022-01-01')
last_date <- as.Date('2022-01-05')
my_dates <- seq(first_date,
last_date,
by = '1 day')
library(purrr)
library(dplyr)
l_out <- map(my_dates,
safely(read_cranlogs_files,
otherwise = tibble())) # return empty tibble in case of error
df_cranlogs <- bind_rows(map(l_out, 'result'))
# solution
my_sol <- names(sort(table(df_cranlogs$country),
decreasing = TRUE)[1])
For daily FTSE data ('^FTSE', from 2010-01-01 to 2023-01-01), check the dates and prices of the 20 biggest price drops. If, for each of these cases, an investor had bought the index at the price of the drop and kept the position for 30 days, what would be the average nominal return per transaction?
library(dplyr)
library(readr)
ticker <- '^FTSE'
first_date <- '2010-01-01'
last_date <- '2023-01-01'
df_FTSE_daily <- yfR::yf_get(ticker,
first_date,
last_date)
# buy at t, sell at t+30
trade_window <- 30
# find largest drops
largest_drops <- df_FTSE_daily |>
arrange(ret_adjusted_prices) |>
slice(1:20)
# There are many ways to solve this exercise.
# Here we use a loop, which is the simplest way of looking at the problem.
# You could also solve it with the functional approach of package purrr,
# that is, by writing a function (see the sketch after the loop solution).
tab <- tibble()
for (i_date in seq_along(largest_drops$ref_date)) {
my_date <- largest_drops$ref_date[i_date]
# filter data to keep only datapoints in each horizon
temp_df <- df_FTSE_daily |>
filter(ref_date >= my_date,
ref_date <= my_date + trade_window)
buy_price <- first(temp_df$price_adjusted)
sell_price <- last(temp_df$price_adjusted)
ret <- sell_price/buy_price - 1
tab <- bind_rows(tab,
                 tibble(date = my_date,
                        buy_price = buy_price,
                        sell_price = sell_price,
                        return = ret))
}
print(tab)
# solution
my_sol <- mean(tab$return)
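For reference, below is a sketch of the functional (purrr) alternative mentioned in the comments above. It assumes the objects df_FTSE_daily, largest_drops and trade_window from the loop solution are still in the workspace; calc_trade_ret is a helper name introduced here for illustration only.
library(dplyr)
library(purrr)
calc_trade_ret <- function(my_date, df_prices, trade_window = 30) {
  # return of buying at the close of my_date and selling trade_window days later
  temp_df <- df_prices |>
    filter(ref_date >= my_date,
           ref_date <= my_date + trade_window)
  tibble(date = my_date,
         ret = last(temp_df$price_adjusted)/first(temp_df$price_adjusted) - 1)
}
tab_purrr <- map_dfr(largest_drops$ref_date,
                     calc_trade_ret,
                     df_prices = df_FTSE_daily,
                     trade_window = trade_window)
mean(tab_purrr$ret)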
Consider a dataframe created with the following R code:
library(dplyr)
my_N <- 100
df <- bind_rows(tibble(ticker = rep('STOCK 1', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))),
tibble(ticker = rep('STOCK 2', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))) )
print(df)
Is the dataframe in the long or wide format? Explain your answer.
The format is long: we have data stacked for two different stocks. Note that, with the addition of new tickers, the table grows with new lines. New variables can be easily added with new columns.
Change the format of the previous dataframe, from long to wide or vice versa.
library(dplyr)
library(readr)
my_N <- 100
df <- bind_rows(tibble(ticker = rep('STOCK 1', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))),
tibble(ticker = rep('STOCK 2', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))) )
print(df)
# convert from long to wide
df_wide <- tidyr::spread(data = df,
key = 'ticker',
value = 'price')
# print result
print(df_wide)
Consider the following list:
library(dplyr)
my_l <- list(df1 = tibble(x = 1:100,
y = runif(100)),
df2 = tibble(x = 1:100,
y = runif(100),
v = runif(100)),
df3 = tibble(x = 1:100,
y = runif(100),
z = runif(100))
)
Add all dataframes in my_l to a single object using the do.call or dplyr::bind_rows functions. What happened to the df1 data, where the v and z columns do not exist?
When bind_rows does not find the same column at the junction of different tables, the missing data is defined as NA. See below:
library(dplyr)
library(readr)
my_l <- list(df1 = tibble(x = 1:100,
y = runif(100)),
df2 = tibble(x = 1:100,
y = runif(100),
v = runif(100)),
df3 = tibble(x = 1:100,
y = runif(100),
z = runif(100)) )
# solution with bind_rows
bind_df1 <- bind_rows(my_l)
# solution with do.call
bind_df2 <- do.call(bind_rows, my_l)
# check solutions match
identical(bind_df1, bind_df2)
print(bind_df1)
# the missing data points were set as NA values
Use the yfR package to download the SP500 index data ('^GSPC') from 1950-01-01 to 2021-01-01. What is the sum of the 5 highest positive returns of the index?
ticker <- '^GSPC'
first_date <- '1950-01-01'
last_date <- '2021-01-01'
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
select_n <- 5
tab <- dplyr::tibble(position = 1:select_n,
top5_positive = sort(df_sp500$ret_adjusted_prices,
decreasing = TRUE)[1:select_n],
top5_negative = sort(df_sp500$ret_adjusted_prices,
decreasing = FALSE)[1:select_n])
print(tab)
my_sol <- sum(tab$top5_positive)
Use the replace_outliers function from section @ref(outliers) to remove outliers from all numeric columns of the SP500 data previously imported, using my_prob = 0.025. How many lines were lost in this cleaning process?
library(purrr)
library(dplyr)
ticker <- '^GSPC'
first_date <- '1950-01-01'
last_date <- '2021-01-01'
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
replace_outliers <- function(col_in, my_prob = 0.05) {
# Replaces outliers from a vector
#
# INPUTS: col_in The vector
# my_prob Probability of quantiles (p and 1-p)
#
# OUTPUT: A vector
# return if class is other than numeric
if (!(class(col_in) %in% c('numeric', 'integer'))) return(col_in)
my_outliers <- stats::quantile(x = col_in,
probs = c(my_prob, 1-my_prob),
na.rm = TRUE)
idx <- (col_in <= my_outliers[1])|(col_in >= my_outliers[2])
col_in[idx] <- NA
return(col_in)
}
# remove outliers from all columns
l_out <- map(df_sp500, replace_outliers, my_prob = 0.025)
df_sp500_nooutlier <- na.omit(as_tibble(l_out))
nrow_1 <- nrow(df_sp500)
nrow_2 <- nrow(df_sp500_nooutlier)
my_sol <- nrow_1 - nrow_2
Use the yfR::yf_get function to import prices of the FTSE index ('^FTSE') from 2010-01-01 to 2021-01-01. Then, reconstruct the data at the annual frequency, defining each year's value as the last observation of the period. Tip: see the dplyr::summarise_all function for a functional way to aggregate all the columns of a dataframe.
library(dplyr)
ticker <- '^FTSE'
first_date <- '2010-01-01'
last_date <- '2021-01-01'
df_FTSE_daily <- yfR::yf_get(ticker,
first_date,
last_date)
# change from daily to annual
df_FTSE_yearly <- df_FTSE_daily |>
mutate(ref_year = lubridate::year(ref_date)) |>
group_by(ref_year) |>
summarise_all(.funs = last)
print(df_FTSE_yearly)
Use the same daily FTSE data and reconstruct it at the monthly frequency, this time using the first observation of each period.
library(dplyr)
ticker <- '^FTSE'
first_date <- '2010-01-01'
last_date <- '2023-01-01'
df_FTSE_daily <- yfR::yf_get(ticker,
first_date,
last_date)
# change from daily to monthly
df_FTSE_monthly <- df_FTSE_daily |>
mutate(ref_month = format(ref_date, '%Y-%m-01')) |>
group_by(ref_month) |>
summarise_all(first)
print(df_FTSE_monthly)
Download TESLA (TSLA) stock data with the yfR package for the last 500 days. Display the line chart of adjusted prices over time using the ggplot() function, making sure the plot has informative labels (title, subtitle and caption, as in the solution code below).
library(yfR)
library(ggplot2)
tickers <- 'TSLA'
df_tlsa<- yf_get(tickers = tickers,
first_date = Sys.Date() - 500,
last_date = Sys.Date())
p <- ggplot(df_tlsa, aes(x = ref_date, y = price_adjusted)) +
geom_line() +
labs(title = paste0('Prices for ', tickers) ,
subtitle = paste0('Data from ', min(df_tlsa$ref_date), ' to ',
max(df_tlsa$ref_date)),
caption = "Solution for exercise 01, chapter 10 - afedR")
print(p)
Using the yfR package, download stock data for Apple (AAPL), Microsoft (MSFT), Amazon (AMZN) and Google (GOOG) for the last 1500 days. Display the stock prices with different line colors on the same graph. Adapt all other aspects of the graph from the previous exercise, such as title and axis labels.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yf_get(
tickers = tickers,
first_date = Sys.Date() - 500,
last_date = Sys.Date()
)
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 02, chapter 10 - afedR")
print(p)
For the previous plot, add points to the lines.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(tickers = tickers,
                         first_date = Sys.Date() - 500,
                         last_date = Sys.Date())
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 03, chapter 10 - afedR")
# add points
p <- p + geom_point()
print(p)
For the same plot, separate the stock prices into different panels with the ggplot2::facet_wrap function. Use the scales = 'free' argument to release the x and y axes of each panel.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(
  tickers = tickers,
  first_date = Sys.Date() - 500,
  last_date = Sys.Date())
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 04, chapter 10 - afedR")
# add points and facet wrap
p <- p +
geom_point() +
facet_wrap(~ticker, scales = 'free')
print(p)
Change the theme of the previous graph to a black and white scale, both for the graph area and for the colors of the lines.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(tickers = tickers,
                         first_date = Sys.Date() - 500,
                         last_date = Sys.Date())
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 05, chapter 10 - afedR")
# add points and facet_wrap
p <- p + geom_point() +
facet_wrap(~ticker, scales = 'free') +
theme_bw() +
scale_color_grey()
#
print(p)
For the previous data, present the histograms of the returns of the different stocks in different panels and save the result in a file called 'histograms.png'.
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(tickers,
Sys.Date() - 500,
Sys.Date())
p <- ggplot(df_stocks, aes(x = ret_adjusted_prices)) +
geom_histogram() +
facet_wrap(~ticker)
print(p)
# save in temp folder
my_file <- file.path(tempdir(), 'histograms.png')
ggsave(filename = my_file, plot = p)
Use the yfR::yf_collection_get() function to download the annual return data for all stocks in the DOW index, from 2015-01-01 to the current day. After that, create the mean/variance map by plotting the average annual return on the y axis and the standard deviation on the x axis. Tip: you'll find many outliers in the raw data. Make sure the graph is readable by limiting the x and y axes (see the ggplot2::xlim and ggplot2::ylim functions).
library(yfR)
library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
df_prices <- yfR::yf_collection_get(
'DOW',
first_date = '2015-01-01',
freq_data = 'yearly'
)
tab <- df_prices |>
na.omit() |>
group_by(ticker) |>
summarise(mean = mean(ret_adjusted_prices),
sd = sd(ret_adjusted_prices))
p <- ggplot(tab, aes(x = sd, y = mean)) +
geom_point() +
xlim(0, 0.5) + ylim(-0.5, 0.5)
print(p)
Head over to the Kaggle data website and choose a particular dataset for your analysis. It need not be related to economics or finance; feel free to make a decision based on your own interests.
After downloading the data, create a couple of hypotheses that you might have about the data. Create a visual analysis of the data that helps answer those hypotheses.
Head over to https://www.kaggle.com/datasets:
Simulate the following linear process in R:
set.seed(5)
# number of obs
n_row <- 100
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
Now, estimate a linear model where x is the explanatory variable and y is the explained variable. Use the summary function on the estimation return object to get more details about the model. What is the estimated beta value for the simulated data?
set.seed(5)
# number of obs
n_row <- 100
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
summary(my_lm)
my_sol <- coef(my_lm)[2]
Using the {car} package, test the joint hypothesis that the value of alpha is equal to 1.5 and the value of beta is equal to 0.5. What is the value of the resulting F test?
set.seed(5)
# number of obs
n_row <- 100
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
summary(my_lm)
library(car)
# rhs: tested values for alpha and beta
test_matrix <- matrix(c(my_alpha, # alpha test value
                        my_beta)) # beta test value
# hypothesis matrix (identity: each coefficient is tested directly)
hyp_mat <- matrix(c(1, 0,
                    0, 1),
                  nrow = 2)
# do test
my_waldtest <- linearHypothesis(my_lm,
hypothesis.matrix = hyp_mat,
rhs = test_matrix)
# print result
my_sol <- my_waldtest$F[2]
Use the gvlma package to test the OLS assumptions for the previously estimated model. Does the model pass all tests? If not, increase the value of n_row to 1000 and try again. Did the increase in the number of observations impact the assumption tests? In what way?
The estimated model has not passed all the tests. In fact, not even the increase in the number of observations in the simulation resulted in the approval of the model in all aspects.
set.seed(5)
# number of obs
n_row <- 1000
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
summary(my_lm)
library(gvlma)
# global validation of model
gvmodel <- gvlma(my_lm)
# print result
summary(gvmodel)
Use command yfR::yf_index_composition("DOW") to find the tickers of the components of the index and yfR::yf_get to download all available price data. Using the SP500 – ticker '^GSPC' – as the market index, calculate the beta for all stocks. Display the histogram of the estimated betas. Note that the SP500 returns are not available in the original database and must be downloaded and added to it.
library(ggplot2)
library(dplyr)
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
idx <- match(df_stocks$ref_date, df_sp500$ref_date)
df_stocks$ret_mkt <- df_sp500$ret_closing_prices[idx]
# calculate beta for each stock
estimate_beta <- function(df) {
# Function to estimate beta from dataframe of stocks returns
#
# Args:
# df - Dataframe with columns ret and ret.sp500
#
# Returns:
# The value of beta
my_model <- lm(data = df,
formula = ret_adjusted_prices ~ ret_mkt)
return(coef(my_model)[2])
}
my_betas <- by(data = df_stocks,
INDICES = df_stocks$ticker,
FUN = estimate_beta)
glimpse(my_betas)
# solution
p <- ggplot(tibble::tibble(betas = unlist(my_betas)), aes(x = betas)) +
geom_histogram()
print(p)
For previously imported data, estimate a panel data version for the market model (beta). In this version, each stock has a different intercept, but they share the same beta. Is the estimated beta significant at 5%?
library(ggplot2)
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
idx <- match(df_stocks$ref_date, df_sp500$ref_date)
df_stocks$ret_mkt <- df_sp500$ret_closing_prices[idx]
# calculate PLM beta
library(plm)
# estimate panel data model with fixed effects
my_pdm <- plm(data = df_stocks,
formula = ret_adjusted_prices ~ ret_mkt,
model = 'within',
index = c('ticker'))
# print result
print(summary(my_pdm))
Using the tidyverse functions dplyr::group_by and dplyr::do, estimate an ARIMA model for the returns of each stock, available from the import process of the previous exercise. In the same output dataframe, create a new column with the forecast at t + 1 for each model. Which stock has the highest expected return for t + 1?
library(dplyr)
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
my_tab <- df_stocks |>
group_by(ticker) |>
do(my_arima = arima(x = .$ret_adjusted_prices,
order = c(1,0,0))) |>
mutate(arima_forecast = predict(my_arima, n.ahead = 1 )$pred[1])
glimpse(my_tab)
# solution
idx <- which.max(my_tab$arima_forecast )
print(my_tab$ticker[idx])
In the same code used for the previous question, add a new list-column with the estimation of an ARMA(1, 0)-GARCH(1, 1) model for the returns of each stock. Add another column with the volatility forecast (standard deviation) at t + 1.
By dividing the expected return calculated in the previous item by the expected risk, we have a market direction index, where the stocks with the highest index value have the highest ratio of expected return to risk. Which stock is more attractive, with the highest value of this index?
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
library(dplyr)
library(fGarch)
tab_models <- df_stocks |>
na.omit() |>
group_by(ticker) |>
do(my_garch = garchFit(formula = ~ arma(1,0) + garch(1,1),
data = .$ret_adjusted_prices,
trace = FALSE) )
tab_models <- tab_models |>
mutate(forecast_mean = predict(my_garch,
n.ahead = 1)$meanForecast[1],
forecast_sd = predict(my_garch,
n.ahead = 1)$standardDeviation[1],
sharpe_index = forecast_mean/forecast_sd)
glimpse(tab_models)
# solution
idx <- which.max(tab_models$sharpe_index)
print(tab_models$ticker[idx])
Using package {afedR3}, import the data from file CH11_grunfeld.csv and create a descriptive table of the variables. This table should provide enough information for the reader to understand the data. Use the {xtable} package to report it in LaTeX or Word/Writer format.
library(dplyr)
my_f <- afedR3::data_path('CH11_grunfeld.csv')
df_grunfeld <- readr::read_csv(my_f, col_types = readr::cols())
my_tab <- tibble(nrows = nrow(df_grunfeld),
ncol = ncol(df_grunfeld),
n_companies = length(unique(df_grunfeld$firm)),
n_years = length(unique(df_grunfeld$year)))
my_tab
library(xtable)
# save to LaTeX
# save to temp file
my_tex_file <- tempfile(pattern = 'table', fileext = '.tex')
print(xtable(my_tab), file = my_tex_file)
readr::read_lines(my_tex_file)
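For the Word/Writer route, one possibility (an assumption, not the only way) is to export the same xtable object as HTML, which Word and Writer can open directly:
# export the same table as HTML, which can be opened in Word or Writer
my_html_file <- tempfile(pattern = 'table', fileext = '.html')
print(xtable(my_tab), type = 'html', file = my_html_file)
readr::read_lines(my_html_file)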
Using the yfR::yf_index_composition function, select 4 random stocks from the SP500 index with set.seed(5) and download the adjusted price data for the last three years. Estimate an ARIMA(1, 0, 1) model for each stock and report the result on the R screen with the texreg::screenreg function.
set.seed(5)
library(dplyr)
tickers <- sample(yfR::yf_index_composition("SP500")$ticker, 4)
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
my_tab <- df_stocks |>
na.omit() |>
group_by(ticker) |>
do(my_arima = arima(x = .$ret_adjusted_prices,
order = c(1, 0, 1)))
glimpse(my_tab)
# solution
library(texreg)
screenreg(my_tab$my_arima)
Create a new Rmarkdown report from previous two exercises. Compile the report in html and open it in your browser.
See section @ref(creating-rmarkdown) – Creating Reports with RMarkdown – from the same chapter.
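A minimal sketch of the compile step, assuming the report was saved as my_report.Rmd (the file name here is arbitrary):
library(rmarkdown)
# compile the report to html and open it in the browser
render('my_report.Rmd', output_format = 'html_document')
browseURL('my_report.html')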
Consider the following code:
library(forecast)
ticker <- '^GSPC'
df_prices <- yfR::yf_get(ticker,
'2010-01-01')
my_arima <- forecast::auto.arima(df_prices$ret_adjusted_prices)
summary(my_arima)
Use functions Rprof and profvis to identify the bottleneck of the code. Which line number takes the longest execution time?
library(forecast)
library(profvis)
# set temporary file for results
profiling_file <- tempfile(pattern = 'profiling_exercise',
fileext = '.out')
# initialize profiling
Rprof(filename = profiling_file)
# run code
profiling <- profvis(expr = {
ticker <- '^GSPC'
df_prices <- yfR::yf_get(ticker,
'2010-01-01')
my_arima <- forecast::auto.arima(df_prices$ret_adjusted_prices)
summary(my_arima)
})
# stop Rprof profiling
Rprof(NULL)
# create visualization
temp_html <- tempfile(pattern = 'profile',
fileext = '.html')
htmlwidgets::saveWidget(profiling, temp_html)
# open in browser from R
browseURL(temp_html)
Use the Rcpp package to write a C++ function that adds the elements of two numeric vectors. The output must be another vector of the same size, with elements equivalent to the x + y operation. Use the identical function to test that all elements of both vectors are equal.
library(Rcpp)
cppFunction('Rcpp::NumericVector sum_vectors_C(NumericVector x, NumericVector y) {
int n = x.size();
Rcpp::NumericVector total(x.size());
for(int i = 0; i < n; ++i) {
total[i] = x[i] + y[i];
}
return total;
}')
x <- runif(100)
y <- runif(100)
sol_C <- sum_vectors_C(x, y)
sol_R <- x + y
identical(sol_C, sol_R)
Use the {tictoc} package to compare the performance of the previous function with the native operator + and with a loop-based version that pre-allocates the result vector. Which alternative has the shortest execution time, and why? Does the Rcpp version beat the loop version?
The best way to add vectors is with the native operator +, which is already optimized for fast execution.
library(Rcpp)
cppFunction('Rcpp::NumericVector sum_vectors_C(NumericVector x, NumericVector y) {
int n = x.size();
Rcpp::NumericVector total(x.size());
for(int i = 0; i < n; ++i) {
total[i] = x[i] + y[i];
}
return total;
}')
x <- runif(100)
y <- runif(100)
sol_C <- sum_vectors_C(x, y)
sol_R <- x + y
identical(sol_C, sol_R)
library(tictoc)
tic('Using Rcpp')
sol_C <- sum_vectors_C(x, y)
toc()
tic('Using base R')
sol_R <- x + y
toc()
tic('Using a loop and prealocation')
sol_loop <- numeric(length = length(x))
for (i in 1:length(x)) {
sol_loop[i] <- x[i] + y[i]
}
toc()
Use the {memoise} package to create a memoised version of function Quandl::Quandl. Use the new function to import data about the United States Consumer Price Index (Quandl code 'FRED/DDOE01USA086NWDB'). How much percentage speed gain do you get from the second call to the memoised version?
library(Quandl)
library(memoise)
library(tictoc)
mem_quandl <- memoise(f = Quandl, cache = cache_memory())
# note: Quandl may require an API key (see function Quandl::Quandl.api_key)
id <- 'FRED/DDOE01USA086NWDB'
tic('Using original Quandl')
df <- Quandl(code = id)
toc()
tic('Using memoise version (first call)')
df <- mem_quandl(code = id)
toc()
tic('Using memoise version (second call)')
df <- mem_quandl(code = id)
toc()
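Since the question asks for the percentage gain, a possible sketch is to capture the timings explicitly (using the same mem_quandl and id objects defined above):
# clear the cache so the first timed call is a fresh download
forget(mem_quandl)
time_first <- system.time(df <- mem_quandl(code = id))['elapsed']
time_second <- system.time(df <- mem_quandl(code = id))['elapsed']
gain_perc <- 100*(time_first - time_second)/time_first
message('Percentage speed gain from memoisation: ', round(gain_perc, 2), '%')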