The book Analyzing Financial and Economic Data with R is available at Amazon, as an ebook and print. A restricted online version is available at https://www.msperlin.com/afedr
The R language was developed based on what other programming language?
Straight from the book, section What is R: “R is a modern version of S, a programming language originally created in Bell Laboratories (formerly AT&T, now Lucent Technologies).”
What are the names of the two authors of R?
Straight from the book: “… The base code of R was developed by two academics, Ross Ihaka and Robert Gentleman, resulting in the programming platform we have today.”.
Why is R special when compared to other programming languages, such as Python, C++, JavaScript and others?
Undoubtedly, the main advantage of the R language is the ease with which data can be analyzed on the platform. Although other languages also allow data analysis, it is in R that this process is supported by a wide range of specialized and efficient packages.
What was the reason the programming language was named R?
The letter R was chosen because it is the first letter of the first names of the platform's two authors.
Consider the following alternatives about R and RStudio:
I - R was developed in 2018 and is an innovative and unstable project;
II - RStudio is an alternative programming language to R;
III - R is not compatible with different programming languages;
Which alternatives are correct?
None of the alternatives are correct: R was created in the early 1990s and is a mature, stable project; RStudio is an integrated development environment (IDE) for R, not a programming language; and R integrates well with many other languages, such as C++ and Python. See section “Why Choose R” in the “Introduction” chapter.
Once you have R and RStudio installed, head over to the CRAN package website and look for technologies you use in your work. For example, if you use Google Sheets extensively in your work, you will soon discover that there is a package on CRAN called googlesheets4 that interacts with spreadsheets in the cloud.
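As an illustration, here is a minimal sketch of reading a public spreadsheet with googlesheets4. The URL below is a placeholder; replace it with the address of your own sheet.
library(googlesheets4)
# read public sheets without a Google login
gs4_deauth()
# hypothetical URL - replace with your own spreadsheet address
my_sheet_url <- 'https://docs.google.com/spreadsheets/d/SOME-SHEET-ID/'
df_sheet <- read_sheet(my_sheet_url)
head(df_sheet)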
On the CRAN site you can also install the Rtools application. What is it for?
Rtools is a Windows-specific toolchain for R. It is used to compile packages from source code and is a requirement for those who develop packages. For the average user, however, it is also recommended to install Rtools, as some packages require compilation at install time.
For Linux/Unix or macOS users, Rtools is not necessary since compilers are generally already available within the operating system itself.
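As a quick check (assuming the pkgbuild package is installed), you can verify from within R whether compilation tools are available:
# check whether build tools (Rtools on Windows) are available
pkgbuild::has_build_tools(debug = TRUE)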
Use Google to search for R groups in your region. Check if the meetings are frequent and, if you don’t have a major impediment, go to one of these meetings and make new friends.
It is not uncommon for programmers to have a tendency toward introversion. This was certainly my case at the beginning of my career. But know that shyness is a transitory state. In the same way that you improve in any sport at the rate at which you practice it, the more communicative you are, the less shy and the more assertive you tend to become.
The sad (or not) reality for the timid is that communication is a fundamental part of adult life and a way to maintain your professional network. The more people who know your work and your personality, the better. Perhaps a person you met in one of these groups can refer you to a job vacancy or a future project. Summing up, what do you really have to lose by going to one of these meetings?
Go to the R-Bloggers website and look for a topic of interest to you, such as football (soccer) or investments. Read at least three of the blog posts you find.
I am particularly passionate about the sport of tennis. On the R-Bloggers website I found the following articles mixing R and tennis:
Using R to study the evolution of Tennis
Visualizing Tennis Grand Slam Winners Performances
Tennis Grand Slam Tournaments Champions Basic Analysis
If you work in an institution with data infrastructure, talk to the person in charge of the IT department and verify what technologies are used. Check whether, through R, it is possible to access all tables in the databases. There is no need to write code yet; just check whether this possibility exists.
At the university we have access to several paid repositories of financial data. Unfortunately, none of them offers any type of API for communication with R. In fact, this was one of the motivations for writing R packages that give free access to financial data.
In RStudio, create a new script and save it in a personal folder. Now, write R commands in the script that define two objects: one holding a sequence between 1 and 100 and the other with the text of your name (ex. ‘Richard’). Execute the whole script with the keyboard shortcuts.
x <- 1:100
y <- 'Richard'
# press control+shift+enter to run this chunk of code in RStudio
In the previously created script, use function message to display the following phrase in R's prompt: "My name is ....".
x <- 36
y <- 'Richard'
message(paste0('My name is ', y))
# press control+shift+enter to run this chunk of code in RStudio
Within the same script, show the current working directory (see function getwd, as in print(getwd())). Now, change your working directory to the Desktop folder and show the following message on the prompt: 'My desktop address is ....'. Tip: use RStudio's autocomplete tool to quickly find the desktop folder.
current_dir <- getwd()
print(current_dir)
new_dir <- '~/Desktop/' # this is probably C:/Users/USERNAME/Desktop for Windows
setwd(new_dir)
cat(paste0('My desktop address is ', getwd()))
Use R to download the compressed zip file with the book material, available at this link. Save it as a file in the temporary session folder (see function fs::file_temp()).
local_file <- fs::file_temp(ext = '.zip')
my_url <- afedR3::links_get()$book_blog_zip
download.file(url = my_url,
destfile = local_file)
# check if exists
fs::file_exists(local_file)
Use the unzip function to extract the downloaded file from the previous question into a directory called 'afedR-files' inside the “Desktop” folder. How many files are available in the resulting folder? Tip: use the recurse = TRUE argument of fs::dir_ls to also search all available subdirectories.
my_folder <- '~/Desktop/afedR-files' # this is probably C:/Users/USERNAME/Desktop/afedR-files on Windows
unzip(local_file, exdir = my_folder) # local_file comes from previous exercise
files <- fs::dir_ls(my_folder,
recurse = TRUE)
n_files <- length(files)
message(paste0('There are ', n_files, ' files available at folder "', my_folder, '".'))
Every time the user installs an R package, all package files are stored locally in a specific directory of the hard disk. Using command Sys.getenv('R_LIBS_USER') and function fs::dir_ls, list all the directories in this folder. How many packages are available in this folder on your computer?
r_pkg_folder <- Sys.getenv('R_LIBS_USER')
available_dirs <- fs::dir_ls(
r_pkg_folder,
recurse = FALSE,
type = 'directory')
n_dirs <- length(available_dirs)
cat(paste0('There are ', n_dirs, ' folders available at "', r_pkg_folder, '".'))
In the same topic as the previous exercise, list all files in all subfolders of the directory containing the files of the different packages (see command Sys.getenv('R_LIBS_USER')). On average, how many files are needed for each package?
r_pkg_folder <- Sys.getenv('R_LIBS_USER')
pkg_files <- fs::dir_ls(r_pkg_folder, recurse = TRUE)
my_dirs <- fs::dir_ls(r_pkg_folder,
                      recurse = FALSE,
                      type = 'directory')
n_files <- length(pkg_files)
n_dirs <- length(my_dirs)
my_msg <- paste0('We have ', length(pkg_files), ' ',
'files for ', length(my_dirs), ' packages. \n',
'On average, there are ', n_files/n_dirs, ' files per directory.')
message(my_msg)
Use the install.packages function to install the yfR package on your computer. After installation, use function yf_get() to download price data for the IBM stock in the last 15 days. Tip: use function Sys.Date() to find out the current date and Sys.Date() - 15 to calculate the date located 15 days in the past.
if (!require(yfR)) install.packages('yfR')
df_prices <- yfR::yf_get(tickers = 'IBM',
first_date = Sys.Date() - 15,
last_date = Sys.Date())
str(df_prices)
The cranlogs package allows access to download statistics of CRAN packages. After installing cranlogs on your computer, use the cranlogs::cran_top_downloads function to check which are the 10 most installed packages by the global community in the last month. Which package comes first? Tip: set the cran_top_downloads function input to when = 'last-month'. Also, be aware that your answer may differ from the one reported here because it depends on the day the R code was executed.
#if (!require(cranlogs)) install.packages('cranlogs')
pkgs <- cranlogs::cran_top_downloads(when = 'last-month')
my_sol <- pkgs$package[1]
my_sol
Using the devtools package, install the development version of the ggplot2 package, available in the Hadley Wickham repository. Load the package using library and create a simple figure with the code qplot(y = rnorm(10), x = 1:10).
if (!require(devtools)) install.packages("devtools")
devtools::install_github('hadley/ggplot2')
library(ggplot2)
qplot(y = rnorm(10), x = 1:10)
Using your programming ability, check which folder in your “Documents” directory (shortcut = ~) has the largest number of files. Display the five folders with the largest number of files on R's prompt.
doc_folder <- '~' # 'C:/Users/USERNAME/Documents' in Windows
# '/home/USERNAME/ in Linux
fct_count_files <- function(dir_in) {
n_files <- fs::dir_ls(dir_in, recurse = FALSE)
return(length(n_files))
}
# be aware this might take lots of time...
all_folders <- fs::dir_ls(path = doc_folder,
type = 'directory',
recurse = TRUE)
counter_files <- sapply(all_folders, fct_count_files)
sorted <- sort(counter_files, decreasing = TRUE)
message('\nThe five folders with highest number of files are:\n\n')
message(paste0(names(sorted[1:5]), collapse = '\n'))
Imagine a survey regarding your household budget over time. Financial data is available in electronic spreadsheets separated by month, for 10 years. The objective of the research is to understand whether it is possible to purchase a real estate property in the next five years. Within this setup, describe in text the elements of each stage of the study, from importing the data to the construction of the report.
The possible stages of the study are:
Importing 01: Import all data related to income and family budget, rate of return on investments - fixed income or stocks - and historical property prices in the desired locations;
Cleaning 01: Clean the data for outliers and missing data (NA);
Manipulation 01: Use personal income data to find the need for monthly savings for each year.
Manipulation 02: Based on the historical returns of investments and the inflation of property prices, check how many years it takes to save the amount of money needed to buy the property (a numerical sketch of this step follows below).
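To make stage Manipulation 02 concrete, here is a minimal sketch with purely hypothetical numbers of how the number of years needed to reach the savings target could be computed:
# hypothetical inputs: monthly savings, yearly return and property price
monthly_saving <- 1500
annual_return <- 0.08
target_value <- 300000
value <- 0
n_years <- 0
while (value < target_value) {
  # compound last year's savings and add this year's contributions
  value <- value*(1 + annual_return) + 12*monthly_saving
  n_years <- n_years + 1
}
n_years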
Based on the study proposed earlier, create a directory structure on your computer to accommodate the study. Create mock files for each subdirectory (see the directory structure in section @ref(directories)). Be aware that you can create mock files and directories all within R (see functions cat and fs::dir_create).
library(fs)
# set temp dir for solving exercise
temp_dir <- path(tempdir(), 'Chapter 3 exercise')
# create folder
dir_create(temp_dir)
# create files
file_create(path(temp_dir, '01-import_data.R'))
file_create(path(temp_dir, '02-clean_data.R'))
file_create(path(temp_dir, '03-build_budget_table.R'))
file_create(path(temp_dir, '04-build_investment_realstate_table.R'))
file_create(path(temp_dir, '05-build_report.R'))
# create dirs
dir_create(path(temp_dir, 'data'))
dir_create(path(temp_dir, 'tabs'))
dir_create(path(temp_dir, 'figs'))
# fill with files
for (i_year in 2009:2019) {
file_create(path(temp_dir, 'data',
paste0('annual_budget_', i_year, '.xlsx')) )
}
file_create(path(temp_dir, 'data', 'fixed_income_index.csv'))
file_create(path(temp_dir, 'data', 'real_state_data.rds'))
file_create(path(temp_dir, 'figs', 'Income_and_HousePrices.png'))
file_create(path(temp_dir, 'tabs', 'Table_with_Results.xlsx'))
dir_tree(temp_dir)
Create a dataframe with the following code:
library(dplyr)
my_N <- 10000
my_df <- tibble(x = 1:my_N,
y = runif(my_N))
Export the resulting dataframe to each of the four formats: csv, rds, xlsx and fst. Which of the formats took up the most disk space? Tip: file.size calculates the size of files within R.
library(dplyr)
library(readr)
do_tests <- function(my_N) {
my_df <- tibble(x = 1:my_N,
y = runif(my_N))
# csv
my_f <- tempfile(pattern = 'temp', fileext = '.csv')
time.csv <- system.time({
write_csv(my_df, my_f)
})['elapsed']
size.csv <- file.size(my_f)/1000000
# rds
my_f <- tempfile(pattern = 'temp', fileext = '.rds')
time.rds <- system.time({
write_rds(my_df, my_f)
})['elapsed']
size.rds <- file.size(my_f)/1000000
# xlsx
my_f <- tempfile(pattern = 'temp', fileext = '.xlsx')
library(writexl)
time.xlsx <- system.time({
write_xlsx(my_df, my_f)
})['elapsed']
size.xlsx <- file.size(my_f)/1000000
# fst
library(fst)
my_f <- tempfile(pattern = 'temp', fileext = '.fst')
time.fst <- system.time({
write_fst(my_df, my_f)
})['elapsed']
size.fst <- file.size(my_f)/1000000
print(c(size.csv, size.rds, size.xlsx, size.fst))
print(c(time.csv, time.rds, time.xlsx, time.fst))
tab <- tibble(Result = c('csv', 'rds', 'xlsx', 'fst'),
Size = c(size.csv, size.rds, size.xlsx, size.fst),
Time = c(time.csv, time.rds, time.xlsx, time.fst))
return(tab)
}
my_N <- 10000
tab <- do_tests(my_N)
print(tab)
my_msg <- paste0('The format with largest disk space for N = ', my_N, ' is ',
tab$Result[which.max(tab$Size)], '.')
message(my_msg)
Improve the previous code by measuring the execution time for saving the data in the different formats. Which file format resulted in the fastest execution for exporting the data? Tip: use the system.time function or the tictoc package to calculate the execution times.
# do notice that this chunk requires the execution of previous solution
my_msg <- paste0('The format with least execution time for N = ', my_N, ' is ',
tab$Result[which.min(tab$Time)], '.')
message(my_msg)
For the previous code, reset the value of my_N to 1000000. Does it change the answers to the last two questions?
# do notice that this chunk requires the execution of previous solution
my_N <- 1000000
tab <- do_tests(my_N)
print(tab)
my_msg <- paste0('The format with largest disk space for N = ', my_N, ' is ',
tab$Result[which.max(tab$Size)], '.')
message(my_msg)
my_msg <- paste0('The format with least execution time for N = ', my_N, ' is ',
tab$Result[which.min(tab$Time)], '.')
message(my_msg)
Use the afedR3::data_path function to access the CH04_SP500.csv file in the book's data repository. Import the contents of the file into R with the function readr::read_csv. How many lines are there in the resulting dataframe?
my_f <- afedR3::data_path('CH04_SP500.csv')
df_SP500 <- readr::read_csv(my_f,
col_types = readr::cols())
my_msg <- paste0('There are ', nrow(df_SP500), ' rows and ',
ncol(df_SP500), ' columns in file ', basename(my_f))
message(my_msg)
my_sol <- nrow(df_SP500)
At link https://eeecon.uibk.ac.at/~zeileis/grunfeld/Grunfeld.csv you'll find a .csv file with the Grunfeld data. This is a particularly famous table due to its use as reference data in econometric models. Using the readr::read_csv function, read this file using the direct link as input. How many columns do you find in the resulting dataframe?
my_url <- 'https://eeecon.uibk.ac.at/~zeileis/grunfeld/Grunfeld.csv'
df_grunfeld <- readr::read_csv(my_url, col_types = readr::cols())
my_sol <- ncol(df_grunfeld)
Use the afedR3::data_path function to access the CH04_example-tsv.tsv file in the book's data repository. Note that the columns of the data are separated by the tab symbol ('\t'), and not the usual comma. After reading the readr::read_delim manual, import the information from this file into your R session. How many rows does the resulting dataframe contain?
my_f <- afedR3::data_path('CH04_example-tsv.tsv')
df_tsv <- readr::read_delim(my_f, delim = '\t', col_types = readr::cols())
my_sol <- nrow(df_tsv)
#check_answers(my_answers)
In the book package you'll find a data file called CH04_another-funky-csv-file.csv, with a particularly bizarre format. Open it in a text editor and try to understand how the columns are separated and what the decimal symbol is. After that, study the inputs of function utils::read.table and import the table into your R session. If we add the number of rows to the number of columns of the imported table, what is the result?
my_f <- afedR3::data_path("CH04_another-funky-csv-file.csv")
df_funky <- read.table(file = my_f,
dec = '?',
skip = 7,
sep = '|',
header = TRUE)
my_sol <- nrow(df_funky) + ncol(df_funky)
Using the yfR package, download daily data of the Facebook stock (META) from Yahoo Finance for the period between 2019 and 2023. What is the lowest unadjusted closing price (column price_close) in the analyzed period?
ticker <- 'META'
first_date <- '2019-01-01'
last_date <- '2023-01-01'
df_prices <- yfR::yf_get(
tickers = ticker,
first_date = first_date,
last_date = last_date)
my_sol <- min(df_prices$price_close,
na.rm = TRUE)
If you have not already done so, create a profile on the Quandl website and download the arabica coffee price data from the CEPEA database (Center for Advanced Studies in Applied Economics) between 2010-01-01 and 2020-12-31. What is the value of the most recent price?
library(GetQuandlData)
id_series <- c(COFFEE = 'CEPEA/COFFEE_A')
# set api key (make sure you got your own!)
#my_api_key <- readLines('~/Dropbox/98-pass_and_bash/.quandl_api.txt')
first_date <- '2010-01-01'
last_date <- '2020-12-31'
#df_coffee <- get_Quandl_series(id_in = id_series,
# first_date = first_date,
# last_date = last_date,
# api_key = my_api_key,
# do_cache = TRUE)
# find most recent
#idx <- which.max(df_coffee$ref_date)
#my_sol <- as.numeric(df_coffee$`Cash Price US$`[idx])
Use function simfinapi::sfa_get_entities() to import data about all available companies in SimFin. How many companies do you find? (see function dplyr::n_distinct()).
library(simfinapi)
library(dplyr)
library(readr)
# set api key - make sure you got your own at https://simfin.com/data/access/api
#my_api_key <- read_lines('~/GDrive/98-pass-and-bash/.api_key_simfin.txt')
my_api_key <- ''
cache_dir <- fs::path_temp("cache-simfin")
fs::dir_create(cache_dir)
# get info
simfinapi::sfa_set_api_key(my_api_key)
simfinapi::sfa_set_cache_dir(cache_dir)
# get info
df_info_companies <- simfinapi::sfa_get_entities()
# check it
glimpse(df_info_companies)
n_companies <- n_distinct(df_info_companies$ticker)
my_sol <- n_companies
With package simfinapi, download the PL (profit/loss) statement for FY (fiscal year) data for TESLA (ticker = “TSLA”) for the year 2022. What is the latest Profit/Loss of the company for that particular year?
library(simfinapi)
library(dplyr)
library(readr)
# set api key - make sure you got your own at https://simfin.com/data/access/api
#my_api_key <- read_lines('~/GDrive/98-pass-and-bash/.api_key_simfin.txt')
my_api_key <- '' # place your own SimFin API key here
cache_dir <- fs::path_temp("cache-simfin")
fs::dir_create(cache_dir)
# get info
simfinapi::sfa_set_api_key(my_api_key)
simfinapi::sfa_set_cache_dir(cache_dir)
ticker <- 'TSLA'
type_statement <- 'pl' # profit/loss
period <- 'fy' # fiscal year
year <- 2022
PL <- sfa_get_statement(
ticker = ticker,
statement = type_statement,
period = period,
fyear = year)
glimpse(PL)
desired_acc <- 'net_income'
latest_acc_value <- PL$net_income[1]
my_sol <- latest_acc_value
Using function tidyquant::tq_index, download the current composition of the DOW index. Which company has the highest percentage weight in the composition of the index? Be aware that the answer is time-dependent and the reported result might be different from what you get in your R session.
library(tidyquant)
# print available indices
available_index <- tq_index_options()
print(available_index)
# get components of the DOW index
comp_idx <- tq_index('DOW')
highest_weight <- comp_idx$company[which.max(comp_idx$weight)]
my_sol <- highest_weight
Using again the yfR package, download financial data between 2019-01-01 and 2020-01-01 for the following tickers: AAPL, BAC, GE, TSLA and SNAP. Using the adjusted closing price column, which company provided the highest return to the stockholder during the analyzed period?
Tip: this is an advanced exercise that will require some coding. To solve it, check out function split to split the dataframe of price data and lapply to map a function to each dataframe.
library(dplyr)
first_date <- '2019-01-01'
last_date <- '2020-01-01'
my_tickers <- c('AAPL', 'BAC',
'GE', 'TSLA',
'SNAP')
df_prices <- yfR::yf_get(
tickers = my_tickers,
first_date = first_date,
last_date = last_date)
split_l <- split(df_prices, df_prices$ticker)
my_fct <- function(df_in) {
price_vec <- df_in$price_adjusted
ticker_in <- df_in$ticker[1]
total_ret <- last(price_vec)/first(price_vec) - 1
return(tibble(ticker = ticker_in,
total_ret = total_ret))
}
df_results <- bind_rows(
lapply(split_l, my_fct)
)
winner <- df_results$ticker[which.max(df_results$total_ret)]
my_sol <- winner
Using function dplyr::tibble, create a dataframe called my_df with a column called x containing a sequence from -100 to 100 and another column called y equal to column x plus 5. How many values in column x are greater than 10 and lower than 25?
my_df <- dplyr::tibble(x = -100:100,
y = x + 5)
# solution
my_sol <- sum((my_df$x > 10)&(my_df$x < 25))
Create a new column in object my_df called cumsum_x, containing the cumulative sum of x (see the cumsum function). In this new column, how many values are greater than -3500?
my_df <- dplyr::tibble(x = -100:100,
y = x + 5)
# solution
my_df$cumsum_x <- cumsum(my_df$x)
# solution
my_sol <- sum(my_df$cumsum_x > -3500)
Use the dplyr::filter function and the pipeline operator to filter my_df, keeping only the rows where the value of the y column is greater than 0. What is the number of rows in the resulting table?
my_df <- dplyr::tibble(x = -100:100,
y = x + 5)
# solution
my_df2 <- my_df |>
dplyr::filter(y > 0)
# solution
my_sol <- nrow(my_df2)
If you have not already done so, repeat exercises 1, 2 and 3 using the functions of the tidyverse and the pipeline operator.
library(dplyr)
library(readr)
library(stringr)
my_df <- tibble(x = -100:100,
y = x + 5) |>
mutate(cumsum_x = cumsum(x))
# solution 01
sol_01 <- my_df |>
filter(x > 10,
x < 25) |>
nrow()
# solution 02
sol_02 <- my_df |>
mutate(cumsum_x = cumsum(x)) |>
filter(cumsum_x > -3500) |>
nrow()
# solution 03
sol_03 <- my_df |>
filter(y > 0) |>
nrow()
message(str_glue(
'Solution 01 = {sol_01} \n',
'Solution 02 = {sol_02} \n',
'Solution 03 = {sol_03}'
)
)
Use the yfR package to download Google (GOOG) stock data, from 2015-01-01 to 2023-01-01. If an investor had bought 1000 USD in Google shares on the first day of the data and kept the investment until the last day of the data, what would be the value of the portfolio?
library(dplyr)
first_date <- '2015-01-01'
last_date <- '2023-01-01'
df_prices <- yfR::yf_get('GOOG',
first_date,
last_date)
value_purchase <- 1000 # purchase value stated in the question
# solution
my_sol <- last(df_prices$price_adjusted)/first(df_prices$price_adjusted)*value_purchase
Use functions afedR3::data_path and readr::read_csv to import data from the CH11_grunfeld.csv file. Now, use function dplyr::glimpse to find out the number of lines in the imported data. What is the number of rows in the imported table?
library(dplyr)
library(readr)
my_file <- afedR3::data_path('CH11_grunfeld.csv')
df_grunfeld <- read_csv(my_file,
col_types = cols())
# solution
glimpse(df_grunfeld)
my_sol <- nrow(df_grunfeld)
Create a list object with three dataframes in its contents, df1, df2 and df3. The content and size of the dataframes are your personal choice. Now, use the sapply function and a custom function to find out the number of rows and columns in each dataframe.
df1 <- dplyr::tibble(x = 1:10)
df2 <- dplyr::tibble(y = runif(100))
df3 <- dplyr::tibble(z = rnorm(150),
m = rnorm(150))
my_l <- list(df1, df2, df3)
my_fct <- function(df_in) {
out <- c('nrows' = nrow(df_in),
'ncols' = ncol(df_in))
return(out)
}
tab <- sapply(my_l, my_fct)
tab
Within an R session, create an identity matrix (value 1 in the diagonal, zero everywhere else) of size 5x5. Tip: use the diag function to define the diagonal of the matrix.
my_size <- 5
M_identity <- matrix(0,
nrow = my_size,
ncol = my_size)
# solution
diag(M_identity) <- 1
print(M_identity)
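As a side note, the diag function can also build the identity matrix directly when given a single number:
# same result with a single call to diag()
M_identity_2 <- diag(my_size)
identical(M_identity, M_identity_2)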
Consider the following vectors x and y:
set.seed(7)
x <- sample(1:3, size = 5, replace = TRUE)
y <- sample(1:3, size = 5, replace = TRUE)
What is the sum of the elements of a new vector resulting from the multiplication of the elements of x and y?
set.seed(7)
x <- sample(1:3, size = 5, replace = TRUE)
y <- sample(1:3, size = 5, replace = TRUE)
# solution
my_sol <- sum(x*y)
If we computed the cumulative sum of a sequence between 1 and 100, at which element would this sum exceed the value of 50?
my_sum <- cumsum(1:100)
# solution
my_sol <- (which(my_sum > 50)[1])
Using R, create a sequence called seq_1 between -15 and 10, where the interval between values is always equal to 2. What is the sum of the elements of seq_1?
# solution
seq_1 <- seq(from = -15, to = 10, by = 2)
# solution
my_sol <- sum(seq_1)
Define another object called seq_2 containing a sequence of size 1000, with values between 0 and 100. What is the standard deviation (function sd()) of this sequence?
seq_2 <- seq(from = 0,
to = 100,
length.out = 1000)
# solution
my_sol <- sd(seq_2)
Calculate the sum of the seq_1 and seq_2 vectors (see previous exercises). Did this operation work despite the different sizes of the vectors? Explain your answer. If it works, what is the highest value of the resulting vector?
Yes, it worked, but with a warning: “longer object length is not a multiple of shorter object length”. The explanation is that whenever R encounters operations between vectors of different sizes, it uses the recycling rule: the shorter vector is repeated as many times as necessary to match the size of the longer vector. In the book, see the section on numeric vectors for more details.
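As a minimal illustration of the recycling rule (unrelated to the exercise objects):
# the shorter vector c(1, 2) is recycled three times to match length 6
1:6 + c(1, 2)
# when the lengths are not multiples, R still recycles, but issues a warning
1:5 + c(1, 2)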
seq_1 <- seq(from = -15, to = 10, by = 2)
seq_2 <- seq(from = 0,
to = 100,
length.out = 1000)
# solution
my_sol <- max(seq_1+seq_2)
Let's assume that, on a certain date, you bought 100 shares in a company, paying $15 per share. After some time, you sold 41 shares for $14 each and the remaining 59 shares were sold for $17 each on a later day. Using a script in R, structure this financial problem by creating numeric objects. What is the total gross profit/loss from this sequence of transactions on the stock market?
total_shares <- 100
price_purchase <- 15
total_purchase_value <- total_shares*price_purchase
qtd_sell_1 <- 41
price_sell_1 <- 14
total_sell_1 <- qtd_sell_1*price_sell_1
qtd_sell_2 <- total_shares - qtd_sell_1
price_sell_2 <- 17
total_sell_2 <- qtd_sell_2*price_sell_2
total_sell_value <- total_sell_1 + total_sell_2
# solution
my_sol <- total_sell_value - total_purchase_value
Create a vector x according to the following formula, where $x_i = \frac{(-1)^{i+1}}{2i - 1}$ and $i = 1, 2, \ldots, 100$. What is the sum of the elements of x?
i <- 1:100
x <- ( (-1)^(i+1) )/(2*i - 1)
# solution
my_sol <- sum(x)
Create a vector z according to the following formula, where $x_i = i$ and $y_i = 51 - i$ for $i = 1, 2, \ldots, 50$, and $z_i = \frac{y_i - x_{i-1}}{y_{i-2}}$. What is the sum of the elements of z? Tip: check out how the dplyr::lag() function works.
x <- 1:50
y <- 50:1
# solution using `base`
z <- (y - c(NA, x[1:(length(x)-1)]))/c(NA, NA, y[1:(length(y)-2)])
# solution with tidyverse (much prettier, huh?)
z <- (y - dplyr::lag(x, n = 1))/dplyr::lag(y, n = 2)
# solution (be aware of the NA values)
my_sol <- sum(z, na.rm = TRUE)
Using set.seed(10), create an object called x with 1000 random values drawn from the Normal distribution with a mean of 10 and a standard deviation of 10. Using the cut function, create another object that defines two groups based on the values of x: lower than 15 and greater than 15. How many observations do you find in the first group?
set.seed(10)
x <- rnorm(n = 1000, mean = 10, sd = 10)
my_group <- cut(x,
breaks = c(-Inf, 15, Inf))
# solution
my_sol <- table(my_group)[1]
Create the following object with the following code:
set.seed(15)
my_char <- paste(sample(letters, 5000, replace = T),
collapse = '')
How many times is the letter 'x' found in the resulting text object?
set.seed(15)
my_char <- paste(sample(letters, 5000, replace = T),
collapse = '')
# solution
my_sol <- stringr::str_count(my_char, 'x')
Based on the my_char object created earlier, if we divided it into several smaller pieces using the letter "b", what is the number of characters in the largest piece found?
set.seed(15)
my_char <- paste(sample(letters, 5000, replace = T),
collapse = '')
my_split <- stringr::str_split(my_char, pattern = 'b')[[1]]
# find number of characters in each
n_chars <- sapply(my_split, nchar)
# solution
my_sol <- n_chars[which.max(n_chars)]
At web address https://www.gutenberg.org/cache/epub/69694/pg69694.txt you'll find the full text of the book The Devil of East Lupton, Vermont, by William Fitzgerald. Use functions download.file and readr::read_lines to import the entire book as a vector of characters called my_book in R. How many lines does the resulting object have?
my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
my_book <- readr::read_lines(my_link)
# solution
my_sol <- length(my_book)
Bind the vector of characters in my_book into a single object called full_text using command paste0(my_book, collapse = '\n'). Using the stringr package, how many times is the word 'Vermont' repeated throughout the text?
my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
my_book <- readr::read_lines(my_link)
# solution
full_text <- paste0(my_book, collapse = '\n')
my_sol <- stringr::str_count(full_text, stringr::fixed('Vermont'))
For the full_text object created earlier, use the stringr::str_split() function to split the entire text using blank spaces. Based on the resulting vector, create a frequency table. What is the most used word in the text? Tip: remove all cases of empty characters.
my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
my_book <- readr::read_lines(my_link)
# solution
full_text <- paste0(my_book, collapse = '\n')
my_split <- stringr::str_split(full_text,
pattern = stringr::fixed(' '))[[1]]
# remove empty
my_split <- my_split[my_split != '']
my_tab <- sort(table(my_split), decreasing = TRUE)
# solution
my_sol <- names(my_tab[1])
Assuming that a person born on 2000-05-12 will live for 100 years, how many of that person's birthdays will fall on a weekend (Saturday or Sunday)? Tip: use operator %in% to check for multiple conditions in a vector.
b_day <- as.Date('2000-05-12')
n_years <- 100
b_day_vec <- seq(b_day, by = '1 year', length.out = n_years)
# note: weekdays() returns names in the local language (English assumed here)
w_days <- weekdays(b_day_vec)
n_weekend <- sum(w_days %in% c('Saturday', 'Sunday'))
# solution
my_sol <- n_weekend
What date and time is found $10^4$ (10,000) seconds after 2021-02-02 11:50:02?
time_1 <- as.POSIXct('2021-02-02 11:50:02')
my_sec <- 10000
my_sol <- time_1 + my_sec
Create a function called say_my_name that takes a person's name as input and shows the text "Your name is ..." on the screen. Within the scope of the function, use comments to describe the purpose of the function, its inputs and outputs.
say_my_name <- function(name_in) {
# Print a single name in the prompt
#
# ARGS: name_in - Name to be printed
#
# RETURNS: TRUE, if successful
my_msg <- paste0('Your name is ', name_in)
message(my_msg)
# invisible will omit output if function is called without definition of output
return(invisible(TRUE))
}
# testing
say_my_name('Marcelo')
Implement a test for the input objects, so that when the input name is not of the character class, an error is returned to the user. Test your new function and verify that it works as expected.
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '.')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
# testing Ok
say_my_name('Marcelo')
# testing a vector input (throws an error, as expected)
say_my_name(c('Richard', 'Michael'))
# testing a wrong class (also throws an error)
say_my_name(1)
Create a vector with any five names, called my_names. Using a loop, apply function say_my_name to each element of my_names.
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '\n')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
my_names <- c('Marcelo', 'Ricardo', 'Tarcizio', 'Ana', 'Paula')
for (i_names in my_names) {
say_my_name(i_names)
}
In the database of Brasil.IO you will find a table with names and genders. Import the data from the file and, using a loop, apply function say_my_name to 15 random names in the database. Tip: you can read the data directly from the link using readr::read_csv(LINK).
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '.')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
library(dplyr)
library(readr)
library(stringr)
# direct url for the 'genero-nomes' dataset at Brasil.IO
my_url <- 'https://data.brasil.io/dataset/genero-nomes/grupos.csv.gz'
df_names <- read_csv(my_url, col_types = cols())
my_names <- sample(df_names$name, 15)
for (i_name in my_names) {
say_my_name(i_name)
}
Redo the previous exercise using function sapply or purrr::walk.
say_my_name <- function(name_in) {
# Prints a single name in the prompt
#
# ARGS: name_in - Name to be printed
# RETURNS: TRUE, if successful
# check inputs
if (class(name_in) != 'character') {
stop('Class of input name_in is ', class(name_in),
' and not character!')
}
if (length(name_in) > 1) {
stop('Input name_in has length ', length(name_in),
' and not 1 (this function only works for one name)!')
}
my_msg <- paste0('Your name is ', name_in, '\n')
message(my_msg)
# invisible makes sure the function doesn't return anything if no output is assigned
return(invisible(TRUE))
}
library(dplyr)
library(readr)
library(purrr)
# direct url for the 'genero-nomes' dataset at Brasil.IO
my_url <- 'https://data.brasil.io/dataset/genero-nomes/grupos.csv.gz'
df_names <- read_csv(my_url, col_types = cols())
# using sapply
vec_out <- sapply(sample(df_names$name, 15),
say_my_name)
glimpse(vec_out)
# using purrr
walk(sample(df_names$name, 15),
say_my_name)
Use the {yfR} package to download data for the SP500 ('^GSPC'), Ibovespa ('^BVSP'), FTSE ('^FTSE') and Nikkei 225 ('^N225') indices from '2010-01-01' to the current date. With the imported data, use a loop to calculate the average, maximum and minimum return for each index during the analyzed period. Save all results in a single table and show it in R's prompt.
library(dplyr)
indexes <- c('^BVSP', '^GSPC', '^FTSE', '^N225')
df_indices <- yfR::yf_get(indexes,
'2010-01-01',
Sys.Date())
tab <- tibble()
for (index in indexes) {
temp_df <- df_indices |>
filter(ticker == index)
avg_ret <- mean(temp_df$ret_adjusted_prices,
na.rm = TRUE)
max_ret <- max(temp_df$ret_adjusted_prices,
na.rm = TRUE)
min_ret <- min(temp_df$ret_adjusted_prices,
na.rm = TRUE)
# save result
tab <- bind_rows(tab, tibble(index = index,
mean_ret = avg_ret,
max_ret = max_ret,
min_ret = min_ret))
}
print(tab)
Redo the previous exercise using functions group_by and summarise, both from package dplyr.
library(dplyr)
indexes <- c('^BVSP', '^GSPC', '^FTSE', '^N225')
df_indices <- yfR::yf_get(indexes,
'2010-01-01',
Sys.Date())
tab_tidy <- df_indices |>
group_by(ticker) |>
summarise(mean_ret = mean(ret_adjusted_prices, na.rm = TRUE),
max_ret = max(ret_adjusted_prices, na.rm = TRUE),
min_ret = min(ret_adjusted_prices, na.rm = TRUE))
print(tab_tidy)
On the RStudio CRAN logs website, you'll find data on download statistics for the base distribution of R in the “Daily R downloads” section. Using your programming skills, import all available data between 2022-01-01 and 2022-01-05 and aggregate it into a single table. Which country has the highest download count for R?
# set function that will download the files
read_cranlogs_files <- function(date_in) {
# Reads log files from http://cran-logs.rstudio.com/
#
# ARGS: date_in - date of log data
require(lubridate)
url_dl <- paste0('http://cran-logs.rstudio.com/', year(date_in), '/',
date_in, '-r.csv.gz')
cat('\nReading ', url_dl)
df <- readr::read_csv(url_dl, col_types = readr::cols())
return(df)
}
# find out the availabe dates in url
library(rvest)
library(lubridate)
library(stringr)
library(dplyr)
available_links <- read_html('http://cran-logs.rstudio.com/') |>
html_nodes(css = 'a') |>
html_attr('href')
# only keep links for R download (those with -r.csv.gz pattern)
idx <- str_detect(available_links, '-r.csv.gz')
r_links <- available_links[idx]
# find out dates
dates_dls <- ymd(basename(r_links))
max_date <- max(dates_dls)
first_date <- as.Date('2022-01-01')
last_date <- as.Date('2022-01-05')
my_dates <- seq(first_date,
last_date,
by = '1 day')
library(purrr)
library(dplyr)
l_out <- map(my_dates,
safely(read_cranlogs_files,
otherwise = tibble())) # return empty tibble in case of error
df_cranlogs <- bind_rows(map(l_out, 'result'))
# solution
my_sol <- names(sort(table(df_cranlogs$country),
decreasing = TRUE)[1])
For daily FTSE data ('^FTSE', from 2010-01-01 to 2023-01-01), check the dates and prices of the 20 biggest price drops. If, for each of these cases, an investor had bought the index at the price of the drop and kept the position for 30 days, what would be the average nominal return per transaction?
library(dplyr)
library(readr)
ticker <- '^FTSE'
first_date <- '2010-01-01'
last_date <- '2023-01-01'
df_FTSE_daily <- yfR::yf_get(ticker,
first_date,
last_date)
# buy at t, sell at t+30
trade_window <- 30
# find largest drops
largest_drops <- df_FTSE_daily |>
arrange(ret_adjusted_prices) |>
slice(1:20)
# There are many ways to solve this exercise.
# Here we use a loop, which is the simplest way of looking at the problem.
# You could also solve it with the functional approach of package purrr,
# that is, by writing a function (see the sketch after the loop solution).
tab <- tibble()
for (i_date in seq_along(largest_drops$ref_date)) {
my_date <- largest_drops$ref_date[i_date]
# filter data to keep only datapoints in each horizon
temp_df <- df_FTSE_daily |>
filter(ref_date >= my_date,
ref_date <= my_date + trade_window)
buy_price <- first(temp_df$price_adjusted)
sell_price <- last(temp_df$price_adjusted)
ret <- sell_price/buy_price - 1
tab <- bind_rows(tab,
                 tibble(date = my_date,
                        buy_price = buy_price,
                        sell_price = sell_price,
                        return = ret))
}
print(tab)
# solution
my_sol <- mean(tab$return)
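For reference, below is a sketch of the functional (purrr) alternative mentioned in the comments above. It assumes the objects df_FTSE_daily, largest_drops and trade_window from the loop solution are still in the workspace; calc_trade_ret is a helper name introduced here for illustration only.
library(dplyr)
library(purrr)
calc_trade_ret <- function(my_date, df_prices, trade_window = 30) {
  # return of buying at the close of my_date and selling trade_window days later
  temp_df <- df_prices |>
    filter(ref_date >= my_date,
           ref_date <= my_date + trade_window)
  tibble(date = my_date,
         ret = last(temp_df$price_adjusted)/first(temp_df$price_adjusted) - 1)
}
tab_purrr <- map_dfr(largest_drops$ref_date,
                     calc_trade_ret,
                     df_prices = df_FTSE_daily,
                     trade_window = trade_window)
mean(tab_purrr$ret)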
Consider a dataframe created with the following R code:
library(dplyr)
my_N <- 100
df <- bind_rows(tibble(ticker = rep('STOCK 1', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))),
tibble(ticker = rep('STOCK 2', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))) )
print(df)
Is the dataframe in the long or wide format? Explain your answer.
The format is long: we have data stacked for two different stocks. Note that, with the addition of new tickers, the table grows with new lines. New variables can be easily added with new columns.
Change the format of the previous dataframe, from long to wide or vice versa.
library(dplyr)
library(readr)
my_N <- 100
df <- bind_rows(tibble(ticker = rep('STOCK 1', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))),
tibble(ticker = rep('STOCK 2', my_N),
ref_date = Sys.Date() + 1:my_N,
price = 100 + cumsum(rnorm(my_N))) )
print(df)
# convert from long to wide
df_wide <- tidyr::spread(data = df,
key = 'ticker',
value = 'price')
# print result
print(df_wide)
Consider the following list:
library(dplyr)
my_l <- list(df1 = tibble(x = 1:100,
y = runif(100)),
df2 = tibble(x = 1:100,
y = runif(100),
v = runif(100)),
df3 = tibble(x = 1:100,
y = runif(100),
z = runif(100))
)
Add all dataframes in my_l to a single object using the do.call or dplyr::bind_rows functions. What happened to the df1 data, where the v and z columns do not exist?
When bind_rows does not find the same column at the junction of different tables, the missing data is defined as NA. See below:
library(dplyr)
library(readr)
my_l <- list(df1 = tibble(x = 1:100,
y = runif(100)),
df2 = tibble(x = 1:100,
y = runif(100),
v = runif(100)),
df3 = tibble(x = 1:100,
y = runif(100),
z = runif(100)) )
# solution with bind_rows
bind_df1 <- bind_rows(my_l)
# solution with do.call
bind_df2 <- do.call(bind_rows, my_l)
# check solutions match
identical(bind_df1, bind_df2)
print(bind_df1)
# the missing data points were set as NA values
Use the yfR package to download the SP500 index data ('^GSPC') from 1950-01-01 to 2021-01-01. What is the sum of the 5 highest positive returns of the index?
ticker <- '^GSPC'
first_date <- '1950-01-01'
last_date <- '2021-01-01'
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
select_n <- 5
tab <- dplyr::tibble(position = 1:select_n,
top5_positive = sort(df_sp500$ret_adjusted_prices,
decreasing = TRUE)[1:select_n],
top5_negative = sort(df_sp500$ret_adjusted_prices,
decreasing = FALSE)[1:select_n])
print(tab)
my_sol <- sum(tab$top5_positive)
Use the replace_outliers function from section @ref(outliers) to remove outliers from all numeric columns of the SP500 data previously imported, using my_prob = 0.025. How many lines were lost in this cleaning process?
library(purrr)
library(dplyr)
ticker <- '^GSPC'
first_date <- '1950-01-01'
last_date <- '2021-01-01'
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
replace_outliers <- function(col_in, my_prob = 0.05) {
# Replaces outliers from a vector
#
# INPUTS: col_in The vector
# my_prob Probability of quantiles (p and 1-p)
#
# OUTPUT: A vector
# return if class is other than numeric
if (!(class(col_in) %in% c('numeric', 'integer'))) return(col_in)
my_outliers <- stats::quantile(x = col_in,
probs = c(my_prob, 1-my_prob),
na.rm = TRUE)
idx <- (col_in <= my_outliers[1])|(col_in >= my_outliers[2])
col_in[idx] <- NA
return(col_in)
}
# remove outliers from all columns
l_out <- map(df_sp500, replace_outliers, my_prob = 0.025)
df_sp500_nooutlier <- na.omit(as_tibble(l_out))
nrow_1 <- nrow(df_sp500)
nrow_2 <- nrow(df_sp500_nooutlier)
my_sol <- nrow_1 - nrow_2
Use the yfR::yf_get function to import prices of the FTSE index ('^FTSE') from 2010-01-01 to 2021-01-01. Then, reconstruct the data at the annual frequency, defining each year's value as the last observation of the period. Tip: see the dplyr::summarise_all function for a functional way to aggregate all the columns of a dataframe.
library(dplyr)
ticker <- '^FTSE'
first_date <- '2010-01-01'
last_date <- '2021-01-01'
df_FTSE_daily <- yfR::yf_get(ticker,
first_date,
last_date)
# change from daily to annual
df_FTSE_yearly <- df_FTSE_daily |>
mutate(ref_year = lubridate::year(ref_date)) |>
group_by(ref_year) |>
summarise_all(.funs = last)
print(df_FTSE_yearly)
Use the same daily FTSE data and reconstruct it at the monthly frequency, this time using the first observation of each period.
library(dplyr)
ticker <- '^FTSE'
first_date <- '2010-01-01'
last_date <- '2023-01-01'
df_FTSE_daily <- yfR::yf_get(ticker,
first_date,
last_date)
# change from daily to monthly
df_FTSE_monthly <- df_FTSE_daily |>
mutate(ref_month = format(ref_date, '%Y-%m-01')) |>
group_by(ref_month) |>
summarise_all(first)
print(df_FTSE_monthly)
Download TESLA (TSLA) stock data with the yfR package for the last 500 days. Display the line chart of adjusted prices over time using the ggplot() function, making sure the plot has informative labels (title, subtitle and caption, as in the solution code below).
library(yfR)
library(ggplot2)
tickers <- 'TSLA'
df_tlsa<- yf_get(tickers = tickers,
first_date = Sys.Date() - 500,
last_date = Sys.Date())
p <- ggplot(df_tlsa, aes(x = ref_date, y = price_adjusted)) +
geom_line() +
labs(title = paste0('Prices for ', tickers) ,
subtitle = paste0('Data from ', min(df_tlsa$ref_date), ' to ',
max(df_tlsa$ref_date)),
caption = "Solution for exercise 01, chapter 10 - afedR")
print(p)
Using the yfR package, download stock data for Apple (AAPL), Microsoft (MSFT), Amazon (AMZN) and Google (GOOG) for the last 1500 days. Display the stock prices with different line colors on the same graph. Adapt all other aspects of the graph from the previous exercise, such as title and axis labels.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yf_get(
tickers = tickers,
first_date = Sys.Date() - 500,
last_date = Sys.Date()
)
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 02, chapter 10 - afedR")
print(p)
For the previous plot, add points to the lines.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(tickers = tickers,
                         first_date = Sys.Date() - 500,
                         last_date = Sys.Date())
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 03, chapter 10 - afedR")
# add points
p <- p + geom_point()
print(p)
For the same plot, separate the stock prices into different panels with the ggplot2::facet_wrap function. Use the scales = 'free' argument to release the x and y axes of each panel.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(
  tickers = tickers,
  first_date = Sys.Date() - 500,
  last_date = Sys.Date())
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 04, chapter 10 - afedR")
# add points and facet wrap
p <- p +
geom_point() +
facet_wrap(~ticker, scales = 'free')
print(p)
Change the theme of the previous graph to a black and white scale, both for the graph area and for the colors of the lines.
library(yfR)
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(tickers = tickers,
                         first_date = Sys.Date() - 500,
                         last_date = Sys.Date())
p <- ggplot(df_stocks, aes(x = ref_date,
y = price_adjusted,
color = ticker)) +
geom_line() +
labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ',
max(df_stocks$ref_date)),
caption = "Solution for exercise 05, chapter 10 - afedR")
# add points and facet_wrap
p <- p + geom_point() +
facet_wrap(~ticker, scales = 'free') +
theme_bw() +
scale_color_grey()
#
print(p)
For the previous data, present the histograms of the returns of the different stocks in different panels and save the result in a file called 'histograms.png'.
library(ggplot2)
tickers <- c('AAPL',
'MSFT',
'AMZN',
'GOOG')
df_stocks <- yfR::yf_get(tickers,
Sys.Date() - 500,
Sys.Date())
p <- ggplot(df_stocks, aes(x = ret_adjusted_prices)) +
geom_histogram() +
facet_wrap(~ticker)
print(p)
# save in temp folder
my_file <- file.path(tempdir(), 'histograms.png')
ggsave(filename = my_file, plot = p)
Use the yfR::yf_collection_get() function to download the annual return data for all stocks in the DOW index, from 2015-01-01 to the current day. After that, create the mean/variance map by plotting the average annual return on the y axis and the standard deviation on the x axis. Tip: you'll find many outliers in the raw data. Make sure the graph is readable by limiting the x and y axes (see the ggplot2::xlim and ggplot2::ylim functions).
library(yfR)
library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
df_prices <- yfR::yf_collection_get(
'DOW',
first_date = '2015-01-01',
freq_data = 'yearly'
)
tab <- df_prices |>
na.omit() |>
group_by(ticker) |>
summarise(mean = mean(ret_adjusted_prices),
sd = sd(ret_adjusted_prices))
p <- ggplot(tab, aes(x = sd, y = mean)) +
geom_point() +
xlim(0, 0.5) + ylim(-0.5, 0.5)
print(p)
Head over to the Kaggle data website and choose a particular dataset for your analysis. It need not be related to economics or finance; feel free to make a decision based on your own interests.
After downloading the data, create a couple of hypotheses that you might have about the data. Create a visual analysis of the data that helps answer those hypotheses.
Head over to https://www.kaggle.com/datasets:
Simulate the following linear process in R:
set.seed(5)
# number of obs
n_row <- 100
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
Now, estimate a linear model where x is the explanatory variable and y is the explained variable. Use the summary function on the estimation return object to get more details about the model. What is the estimated beta value for the simulated data?
set.seed(5)
# number of obs
n_row <- 100
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
summary(my_lm)
my_sol <- coef(my_lm)[2]
Using the {car} package, test the joint hypothesis that the value of alpha is equal to 1.5 and the value of beta is equal to 0.5. What is the value of the resulting F test?
set.seed(5)
# number of obs
n_row <- 100
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
summary(my_lm)
library(car)
# rhs: tested values for alpha and beta
test_matrix <- matrix(c(my_alpha, # alpha test value
                        my_beta)) # beta test value
# hypothesis matrix (identity: each coefficient is tested directly)
hyp_mat <- matrix(c(1, 0,
                    0, 1),
                  nrow = 2)
# do test
my_waldtest <- linearHypothesis(my_lm,
hypothesis.matrix = hyp_mat,
rhs = test_matrix)
# print result
my_sol <- my_waldtest$F[2]
Use the gvlma package to test the OLS assumptions for the previously estimated model. Does the model pass all tests? If not, increase the value of n_row to 1000 and try again. Did the increase in the number of observations impact the assumption tests? In what way?
The estimated model has not passed all the tests. In fact, not even the increase in the number of observations in the simulation resulted in the approval of the model in all aspects.
set.seed(5)
# number of obs
n_row <- 1000
# set x as Normal (0, 1)
x <- rnorm(n_row)
# set coefficients
my_alpha <- 1.5
my_beta <- 0.5
# build y
y <- my_alpha + my_beta*x + rnorm(n_row)
my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
summary(my_lm)
library(gvlma)
# global validation of model
gvmodel <- gvlma(my_lm)
# print result
summary(gvmodel)
Use command yfR::yf_index_composition("DOW") to find the tickers of the components of the index and yfR::yf_get to download all available price data. Using the SP500 – ticker '^GSPC' – as the market index, calculate the beta for all stocks. Display the histogram of the estimated betas. Note that the SP500 returns are not available in the original database and must be downloaded and added to it.
library(ggplot2)
library(dplyr)
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
idx <- match(df_stocks$ref_date, df_sp500$ref_date)
df_stocks$ret_mkt <- df_sp500$ret_closing_prices[idx]
# calculate beta for each stock
estimate_beta <- function(df) {
# Function to estimate beta from dataframe of stocks returns
#
# Args:
# df - Dataframe with columns ret and ret.sp500
#
# Returns:
# The value of beta
my_model <- lm(data = df,
formula = ret_adjusted_prices ~ ret_mkt)
return(coef(my_model)[2])
}
my_betas <- by(data = df_stocks,
INDICES = df_stocks$ticker,
FUN = estimate_beta)
glimpse(my_betas)
# solution
p <- ggplot(tibble::tibble(betas = unlist(my_betas)), aes(x = betas)) +
geom_histogram()
print(p)
For previously imported data, estimate a panel data version for the market model (beta). In this version, each stock has a different intercept, but they share the same beta. Is the estimated beta significant at 5%?
library(ggplot2)
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
idx <- match(df_stocks$ref_date, df_sp500$ref_date)
df_stocks$ret_mkt <- df_sp500$ret_closing_prices[idx]
# calculate PLM beta
library(plm)
# estimate panel data model with fixed effects
my_pdm <- plm(data = df_stocks,
formula = ret_adjusted_prices ~ ret_mkt,
model = 'within',
index = c('ticker'))
# print result
print(summary(my_pdm))
Using the tidyverse functions dplyr::group_by and dplyr::do, estimate an ARIMA model for the returns of each stock, available from the import process of the previous exercise. In the same output dataframe, create a new column with the forecast at t + 1 for each model. Which stock has the highest expected return for t + 1?
library(dplyr)
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
my_tab <- df_stocks |>
group_by(ticker) |>
do(my_arima = arima(x = .$ret_adjusted_prices,
order = c(1,0,0))) |>
mutate(arima_forecast = predict(my_arima, n.ahead = 1 )$pred[1])
glimpse(my_tab)
# solution
idx <- which.max(my_tab$arima_forecast )
print(my_tab$ticker[idx])
In the same code used for the previous question, add a new list-column with the estimation of an ARMA(1, 0)-GARCH(1, 1) model for the returns of each stock. Add another column with the volatility forecast (standard deviation) at t + 1.
By dividing the expected return calculated in the previous item by the expected risk, we have a market direction index, where the stocks with the highest index value have the highest ratio of expected return to risk. Which stock is more attractive, with the highest value of this index?
tickers <- yfR::yf_index_composition("DOW")$ticker
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
df_sp500 <- yfR::yf_get('^GSPC',
first_date,
last_date)
library(dplyr)
library(fGarch)
tab_models <- df_stocks |>
na.omit() |>
group_by(ticker) |>
do(my_garch = garchFit(formula = ~ arma(1,0) + garch(1,1),
data = .$ret_adjusted_prices,
trace = FALSE) )
tab_models <- tab_models |>
mutate(forecast_mean = predict(my_garch,
n.ahead = 1)$meanForecast[1],
forecast_sd = predict(my_garch,
n.ahead = 1)$standardDeviation[1],
sharpe_index = forecast_mean/forecast_sd)
glimpse(tab_models)
# solution
idx <- which.max(tab_models$sharpe_index)
print(tab_models$ticker[idx])
Using package {afedR3}, import the data from file CH11_grunfeld.csv and create a descriptive table of the variables. This table should provide enough information for the reader to understand the data. Use the {xtable} package to report it in LaTeX or Word/Writer format.
library(dplyr)
my_f <- afedR3::data_path('CH11_grunfeld.csv')
df_grunfeld <- readr::read_csv(my_f, col_types = readr::cols())
my_tab <- tibble(nrows = nrow(df_grunfeld),
ncol = ncol(df_grunfeld),
n_companies = length(unique(df_grunfeld$firm)),
n_years = length(unique(df_grunfeld$year)))
my_tab
library(xtable)
# save to LaTeX
# save to temp file
my_tex_file <- tempfile(pattern = 'table', fileext = '.tex')
print(xtable(my_tab), file = my_tex_file)
readr::read_lines(my_tex_file)
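For the Word/Writer route, one possibility (an assumption, not the only way) is to export the same xtable object as HTML, which Word and Writer can open directly:
# export the same table as HTML, which can be opened in Word or Writer
my_html_file <- tempfile(pattern = 'table', fileext = '.html')
print(xtable(my_tab), type = 'html', file = my_html_file)
readr::read_lines(my_html_file)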
Using the yfR::yf_index_composition function, select 4 random stocks from the SP500 index with set.seed(5) and download the adjusted price data for the last three years. Estimate an ARIMA(1, 0, 1) model for each stock and report the result on the R screen with the texreg::screenreg function.
set.seed(5)
library(dplyr)
tickers <- sample(yfR::yf_index_composition("SP500")$ticker, 4)
first_date <- Sys.Date() - 3*365
last_date <- Sys.Date()
df_stocks <- yfR::yf_get(tickers,
first_date,
last_date)
my_tab <- df_stocks |>
na.omit() |>
group_by(ticker) |>
do(my_arima = arima(x = .$ret_adjusted_prices,
order = c(1, 0, 1)))
glimpse(my_tab)
# solution
library(texreg)
screenreg(my_tab$my_arima)
Create a new Rmarkdown report from previous two exercises. Compile the report in html and open it in your browser.
See section @ref(creating-rmarkdown) – Creating Reports with RMarkdown – from the same chapter.
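A minimal sketch of the compile step, assuming the report was saved as my_report.Rmd (the file name here is arbitrary):
library(rmarkdown)
# compile the report to html and open it in the browser
render('my_report.Rmd', output_format = 'html_document')
browseURL('my_report.html')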
Consider the following code:
library(forecast)
ticker <- '^GSPC'
df_prices <- yfR::yf_get(ticker,
'2010-01-01')
my_arima <- forecast::auto.arima(df_prices$ret_adjusted_prices)
summary(my_arima)
Use functions Rprof and profvis to identify the bottleneck of the code. Which line number takes the longest execution time?
library(forecast)
library(profvis)
# set temporary file for results
profiling_file <- tempfile(pattern = 'profiling_exercise',
fileext = '.out')
# initialize profiling
Rprof(filename = profiling_file)
# run code
profiling <- profvis(expr = {
ticker <- '^GSPC'
df_prices <- yfR::yf_get(ticker,
'2010-01-01')
my_arima <- forecast::auto.arima(df_prices$ret_adjusted_prices)
summary(my_arima)
})
# stop Rprof profiling
Rprof(NULL)
# create visualization
temp_html <- tempfile(pattern = 'profile',
fileext = '.html')
htmlwidgets::saveWidget(profiling, temp_html)
# open in browser from R
browseURL(temp_html)
Use the Rcpp package to write a C++ function that adds the elements of two numeric vectors. The output must be another vector of the same size, with elements equivalent to the x + y operation. Use the identical function to test that all elements of both vectors are equal.
library(Rcpp)
cppFunction('Rcpp::NumericVector sum_vectors_C(NumericVector x, NumericVector y) {
int n = x.size();
Rcpp::NumericVector total(x.size());
for(int i = 0; i < n; ++i) {
total[i] = x[i] + y[i];
}
return total;
}')
x <- runif(100)
y <- runif(100)
sol_C <- sum_vectors_C(x, y)
sol_R <- x + y
identical(sol_C, sol_R)
Use the {tictoc} package to compare the performance of the previous function with the native operator + and with a loop-based version that pre-allocates the result vector. Which alternative has the shortest execution time, and why? Does the Rcpp version beat the loop version?
The best way to add vectors is with the native operator +, which is already optimized for fast execution.
library(Rcpp)
cppFunction('Rcpp::NumericVector sum_vectors_C(NumericVector x, NumericVector y) {
int n = x.size();
Rcpp::NumericVector total(x.size());
for(int i = 0; i < n; ++i) {
total[i] = x[i] + y[i];
}
return total;
}')
x <- runif(100)
y <- runif(100)
sol_C <- sum_vectors_C(x, y)
sol_R <- x + y
identical(sol_C, sol_R)
library(tictoc)
tic('Using Rcpp')
sol_C <- sum_vectors_C(x, y)
toc()
tic('Using base R')
sol_R <- x + y
toc()
tic('Using a loop and prealocation')
sol_loop <- numeric(length = length(x))
for (i in 1:length(x)) {
sol_loop[i] <- x[i] + y[i]
}
toc()
Use the {memoise} package to create a memoised version of function Quandl::Quandl. Use the new function to import data about the United States Consumer Price Index (Quandl code 'FRED/DDOE01USA086NWDB'). How much percentage speed gain do you get from the second call to the memoised version?
library(Quandl)
library(memoise)
library(tictoc)
mem_quandl <- memoise(f = Quandl, cache = cache_memory())
# note: Quandl may require an API key (see function Quandl::Quandl.api_key)
id <- 'FRED/DDOE01USA086NWDB'
tic('Using original Quandl')
df <- Quandl(code = id)
toc()
tic('Using memoise version (first call)')
df <- mem_quandl(code = id)
toc()
tic('Using memoise version (second call)')
df <- mem_quandl(code = id)
toc()
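Since the question asks for the percentage gain, a possible sketch is to capture the timings explicitly (using the same mem_quandl and id objects defined above):
# clear the cache so the first timed call is a fresh download
forget(mem_quandl)
time_first <- system.time(df <- mem_quandl(code = id))['elapsed']
time_second <- system.time(df <- mem_quandl(code = id))['elapsed']
gain_perc <- 100*(time_first - time_second)/time_first
message('Percentage speed gain from memoisation: ', round(gain_perc, 2), '%')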