Solutions to book exercises

The book Analyzing Financial and Economic Data with R is available on Amazon, as an ebook and in print. A restricted online version is available at https://www.msperlin.com/afedr

Questions

  1. Question

    The R language was developed based on what other programming language?


    1. Julia
    2. Python
    3. Javascript
    4. S
    5. C++

    Solution

    Straight from the book, section What is R: “R is a modern version of S, a programming language originally created in Bell Laboratories (formerly AT&T, now Lucent Technologies).”


  2. Question

    What are the names of the two authors of R?


    1. Guido van Rossum and Bjarne Stroustrup
    2. Ross Ihaka and Robert Gentleman
    3. Roger Federer and Rafael Nadal
    4. Linus Torvalds and Richard Stallman
    5. John Chambers and Robert Engle

    Solution

    Straight from the book: “… The base code of R was developed by two academics, Ross Ihaka and Robert Gentleman, resulting in the programming platform we have today.”.


  3. Question

    Why is R special when compared to other programming languages, such as Python, C++, JavaScript and others?


    1. Quick code execution
    2. Easy to use
    3. Makes it easy to write mobile apps
    4. It was designed for analyzing data and producing statistical output
    5. Works on any platform, such as Windows, Unix and macOS

    Solution

    Undoubtedly, the main distinguishing feature of the R language is the ease with which data can be analyzed on the platform. Although other languages also allow data analysis, it is in R that this process is supported by a wide range of specialized and efficient packages.


  4. Question

    What was the reason the programming language was named R?


    1. R = Reusable code.
    2. It was the only available letter, not yet used as a programming language.
    3. Letter R is shared in the first names of its authors.
    4. Letter R is cool!
    5. The mother of one of the authors is called Renata.

    Solution

    The letter R was chosen because it is the first letter of the first names of both authors of the platform.


  5. Question

    Consider the following alternatives about R and RStudio:

    I - R was developed in 2018 and is an innovative and unstable project;

    II - RStudio is an alternative programming language to R;

    III - R is not compatible with different programming languages;

    Which alternatives are correct?


    1. FALSE, FALSE, FALSE
    2. TRUE, TRUE, TRUE
    3. TRUE, FALSE, TRUE
    4. FALSE, TRUE, TRUE
    5. FALSE, FALSE, TRUE

    Solution

    See section “Why Choose R” in the “Introduction” chapter.


  6. Question

    Once you have R and RStudio installed, head over to the CRAN package website and look for technologies you use in your work. For example, if you use Google Sheets extensively in your work, you will soon discover that there is a package on CRAN called googlesheets4 that interacts with spreadsheets in the cloud.


    Solution

    1. Browse CRAN package website
    2. Search for technologies you use in your work (Excel, Word, Google Docs, …)

  7. Question

    On the CRAN site you can also install the Rtools application. What is it for?


    1. Compile graphics.
    2. Compile R packages locally
    3. Build web pages.
    4. Compile technical reports.
    5. Make coffee (?).

    Solution

    Rtools is a Windows-specific extension for R. It is used to compile packages from source code and is a requirement for those who develop packages. For the average user, however, installing Rtools is also recommended, as some packages require compilation at installation time.

    For Linux/Unix or macOS users, Rtools is not necessary since, generally, compilers are already available within the operating system itself.
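
    If you are unsure whether a working toolchain is present, a quick check can be run from R itself. This is a minimal sketch assuming the pkgbuild package is installed (it is not part of the original solution):

    # returns TRUE if build tools (e.g. Rtools on Windows) are found;
    # debug = TRUE prints details about what was located
    pkgbuild::has_build_tools(debug = TRUE)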


  8. Question

    Use Google to search for R groups in your region. Check if the meetings are frequent and, if you don’t have a major impediment, go to one of these meetings and make new friends.


    Solution

    It is not uncommon for programmers to have a tendency towards introversion. This was certainly my case at the beginning of my career. But know that shyness is a transitory state. In the same way that you improve at any sport at the rate at which you practice it, the more communicative you are, the less shy and more assertive you tend to become.

    The sad (or not) reality for the timid is that communication is a fundamental part of adult life and is a way to maintain your professional network. The more people who know your work and your personality, the better. Perhaps a person you met in one of these groups can refer you to a job vacancy or a future project. Summing up, what do you really have to lose by going to one of these meetings?


  9. Question

    Go to the RBloggers website and look for a topic of interest to you, such as football (soccer) or investments. Read at least three of the blog posts you find.


    Solution

    I am particularly passionate about the sport of tennis. On the RBloggers website I’ve found the following articles mixing R and tennis:

    Using R to study the evolution of Tennis

    Visualizing Tennis Grand Slam Winners Performances

    Tennis Grand Slam Tournaments Champions Basic Analysis


  10. Question

    If you work in an institution with data infrastructure, talk to the person in charge of the IT department and verify what technologies are used. Check if, through R, it is possible to access all tables in the databases. There is no need to write code yet. Just check if this possibility exists.


    Solution

    At the university we have access to different paid repositories for financial data. Unfortunately, none of them offers any type of API for communicating with R. In fact, this was one of the motivators for writing R packages for free access to financial data.


  11. Question

    In RStudio, create a new script and save it in a personal folder. Now, write R commands in the script that define two objects: one holding a sequence between 1 and 100 and the other with the text of your name (ex. ‘Richard’). Execute the whole script with the keyboard shortcuts.


    Solution

    x <- 1:100
    y <- 'Richard'
    
    # press control+shift+enter to run this chunk of code in RStudio

  12. Question

    In the previously created script, use function message to display the following phrase in R’s prompt: "My name is ....".


    Solution

    x <- 1:100
    y <- 'Richard'
    
    message(paste0('My name is ', y))
    
    # press control+shift+enter to run this chunk of code in RStudio

  13. Question

    Within the same script, show the current working directory (see function getwd, as in print(getwd())). Now, change your working directory to your Desktop folder and show the following message on the prompt screen: 'My desktop address is ....'. Tip: use and abuse RStudio’s autocomplete tool to quickly find the desktop folder.


    Solution

    current_dir <- getwd()
    print(current_dir)
    
    new_dir <- '~/Desktop/' # this is probably C:/Users/USERNAME/Desktop for Windows
    setwd(new_dir)
    
    cat(paste0('My desktop address is ', getwd()))

  14. Question

    Use R to download the compressed zip file with the book material, available at this link. Save it as a file in the temporary session folder (see function fs::file_temp()).


    Solution

    local_file <- fs::file_temp(ext = '.zip')
    my_url <- afedR3::links_get()$book_blog_zip
    
    download.file(url = my_url, 
                  destfile = local_file)
    
    # check if exists
    fs::file_exists(local_file)

  15. Question

    Use the unzip function to unzip the file downloaded in the previous question to a directory called 'afedR-files' inside the “Desktop” folder. How many files are available in the resulting folder? Tip: use the recurse = TRUE argument of fs::dir_ls to also search all available subdirectories.


    Solution

    my_folder <- '~/Desktop/afedR-files' # this is probably C:/Users/USERNAME/Desktop for Windows
    unzip(local_file, exdir = my_folder) # local_file comes from previous exercise
    
    files <- fs::dir_ls(my_folder, 
                        recurse = TRUE)
    
    n_files <- length(files)
    
    message(paste0('There are ', n_files, ' files available at folder "', my_folder, '".'))

  16. Question

    Every time the user installs an R package, all package files are stored locally in a specific directory of the hard disk. Using command Sys.getenv('R_LIBS_USER') and fs::dir_ls, list all the directories in this folder. How many packages are available in this folder on your computer?


    Solution

    r_pkg_folder <- Sys.getenv('R_LIBS_USER')
    
    available_dirs <- fs::dir_ls(
        r_pkg_folder, 
        recurse = FALSE,
        type = 'directory')
        
    n_dirs <- length(available_dirs)
    
    cat(paste0('There are ', n_dirs, ' folders available at "', r_pkg_folder, '".'))

  17. Question

    In the same topic as previous exercise, list all files in all subfolders in the directory containing the files for the different packages (see command Sys.getenv('R_LIBS_USER')). On average, how many files are needed for each package?


    Solution

    r_pkg_folder <- Sys.getenv('R_LIBS_USER')
    
    pkg_files <- fs::dir_ls(r_pkg_folder, recurse = TRUE)
    my_dirs <- fs::dir_ls(
        r_pkg_folder, 
        recurse = FALSE,
        type = 'directory'
        )
    
    n_files <- length(pkg_files)
    n_dirs <- length(my_dirs)
    
    my_msg <- paste0('We have ', n_files, ' ',
                     'files for ', n_dirs, ' packages. \n',
                     'On average, there are ', n_files/n_dirs, ' files per directory.')
    
    message(my_msg)

  18. Question

    Use the install.packages function to install the yfR package on your computer. After installation, use function yf_get() to download price data for the IBM stock in the last 15 days. Tip: use function Sys.Date() to find out the current date and Sys.Date()- 15 to calculate the date located 15 days in the past.


    Solution

    if (!require(yfR)) install.packages('yfR')
    
    df_prices <- yfR::yf_get(tickers = 'IBM',
                         first_date = Sys.Date() - 15,
                         last_date = Sys.Date())
    
     str(df_prices)

  19. Question

    The cranlogs package allows access to downloads statistics of CRAN packages. After installing cranlogs on your computer, use the cranlogs::cran_top_downloads function to check which are the 10 most installed packages by the global community in the last month. Which package comes first? Tip: Set the cran_top_downloads function input to when = 'last-month'. Also, be aware that the answer here may not be the same as you got because it depends on the day the R code was executed.


    Solution

    #if (!require(cranlogs)) install.packages('cranlogs')
    pkgs <- cranlogs::cran_top_downloads(when = 'last-month')
    
    my_sol <- pkgs$package[1]
    my_sol

  20. Question

    Using the devtools package, install the development version of the ggplot2 package, available in Hadley Wickham’s repository. Load the package using library and create a simple figure with the code qplot(y = rnorm(10), x = 1:10).


    Solution

    if (!require(devtools)) install.packages("devtools")
    
    devtools::install_github('hadley/ggplot2')
    
    library(ggplot2)
    qplot(y = rnorm(10), x = 1:10)

  21. Question

    Using your programming ability check on your computer which folder, from the “Documents” directory (shortcut = ~), has the largest number of files. Display the five folders with the largest number of files on R’s prompt.


    Solution

    doc_folder <- '~' # 'C:/Users/USERNAME/Documents' in Windows
                      # '/home/USERNAME/' in Linux
    
    fct_count_files <- function(dir_in) {
      n_files <- fs::dir_ls(dir_in, recurse = FALSE)
      return(length(n_files))
    }
    
    # be aware this might take lots of time...
    all_folders <- fs::dir_ls(path = doc_folder, 
                              type = 'directory', 
                              recurse = TRUE)
    
    counter_files <- sapply(all_folders, fct_count_files)
    sorted <- sort(counter_files, decreasing = TRUE)
    
    message('\nThe five folders with highest number of files are:\n\n')
    message(paste0(names(sorted[1:5]), collapse = '\n'))

  22. Question

    Imagine a survey regarding your household budget over time. Financial data is available in electronic spreadsheets separated by month, for 10 years. The objective of the research is to understand if it is possible to purchase a real estate property in the next five years. Within this setup, detail in text the elements in each stage of the study, from importing the data to the construction of the report.


    Solution

    The possible stages of the study are:

    1. Importing 01: Import all data related to income and family budget, rate of return on investments - fixed income or stocks - and historical property prices in the desired locations;

    2. Cleaning 01: Clean the data for outliers and missing data (NA);

    3. Manipulation 01: Use personal income data to find the need for monthly savings for each year.

    4. Manipulation 02: Based on the historical returns of investments and property price inflation, check how many years it would take to save the amount of money needed to buy the property.


  23. Question

    Based on the study proposed earlier, create a directory structure on your computer to accommodate the study. Create mock files for each subdirectory (see the directory structure in section @ref(directories)). Be aware that you can create mock files and directories all in R (see functions cat and fs::dir_create).


    Solution

    library(fs)
    
    # set temp dir for solving exercise
    temp_dir <- path(tempdir(), 'Chapter 3 exercise')
    
    # create folder
    dir_create(temp_dir)
    
    # create files
    file_create(path(temp_dir, '01-import_data.R'))
    file_create(path(temp_dir, '02-clean_data.R'))
    file_create(path(temp_dir, '03-build_budget_table.R'))
    file_create(path(temp_dir, '04-build_investment_realstate_table.R'))
    file_create(path(temp_dir, '05-build_report.R'))
    
    # create dirs
    dir_create(path(temp_dir, 'data'))
    dir_create(path(temp_dir, 'tabs'))
    dir_create(path(temp_dir, 'figs'))
    
    # fill with files
    for (i_year in 2009:2019) {
      file_create(path(temp_dir, 'data', 
                       paste0('annual_budget_', i_year, '.xlsx')) )
    }
    
    file_create(path(temp_dir, 'data', 'fixed_income_index.csv'))
    file_create(path(temp_dir, 'data', 'real_estate_data.rds'))
    file_create(path(temp_dir, 'figs', 'Income_and_HousePrices.png'))
    file_create(path(temp_dir, 'tabs',  'Table_with_Results.xlsx'))
    
    dir_tree(temp_dir)

  24. Question

    Create a dataframe with the following code:

    library(dplyr)
    
    my_N <- 10000
    my_df <- tibble(x = 1:my_N,
                    y = runif(my_N))

    Export the resulting dataframe to each of the four formats: csv, rds, xlsx and fst. Which of the formats took up the most space on the computer’s disk? Tip: file.size calculates the size of files within R.


    Solution

    library(dplyr)
    library(readr)
    library(writexl)
    library(fst)
    
    do_tests <- function(my_N) {
      
      my_df <- tibble(x = 1:my_N,
                      y = runif(my_N))
      
      # csv
      my_f <- tempfile(pattern = 'temp', fileext = '.csv')
      time.csv <- system.time({
        write_csv(my_df, my_f)
      })['elapsed']
      size.csv <- file.size(my_f)/1000000
      
      # rds
      my_f <- tempfile(pattern = 'temp', fileext = '.rds')
      time.rds <- system.time({
        write_rds(my_df, my_f)
      })['elapsed']
      
      size.rds <- file.size(my_f)/1000000
      
      # xlsx
      my_f <- tempfile(pattern = 'temp', fileext = '.xlsx')
      time.xlsx <- system.time({
        write_xlsx(my_df, my_f)
      })['elapsed']
      
      size.xlsx <- file.size(my_f)/1000000
      
      # fst
      my_f <- tempfile(pattern = 'temp', fileext = '.fst')
      time.fst <- system.time({
        write_fst(my_df, my_f)
      })['elapsed']
      
      size.fst <- file.size(my_f)/1000000
      
      print(c(size.csv, size.rds, size.xlsx, size.fst))
      
      print(c(time.csv, time.rds, time.xlsx, time.fst))
      
      tab <- tibble(Result = c('csv', 'rds', 'xlsx', 'fst'), 
                    Size =  c(size.csv, size.rds, size.xlsx, size.fst),
                    Time = c(time.csv, time.rds, time.xlsx, time.fst))
      
      return(tab)
    }
    
    my_N <- 10000
    
    tab <- do_tests(my_N)
    print(tab)
    
    my_msg <- paste0('The format with largest disk space for N = ', my_N, ' is ', 
                     tab$Result[which.max(tab$Size)], '.')
    message(my_msg)

  25. Question

    Improve the previous code by measuring the execution time for saving the data in the different formats. Which file format resulted in the fastest execution for exporting data? Tip: use the system.time function or the tictoc package to calculate execution times.


    Solution

    # do notice that this chunk requires the execution of previous solution
    my_msg <- paste0('The format with least execution time for N = ', my_N, ' is ', 
                     tab$Result[which.min(tab$Time)], '.')
    message(my_msg)
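
    As an alternative to system.time, the tip in the question also mentions the tictoc package. Below is a minimal sketch of the same timing pattern (assuming the tictoc, readr and dplyr packages are installed):

    library(tictoc)
    library(readr)
    library(dplyr)
    
    tic()
    write_rds(tibble(x = 1:10000), tempfile(fileext = '.rds'))
    t <- toc(quiet = TRUE)
    
    # elapsed time in seconds
    t$toc - t$tic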

  26. Question

    For the previous code, reset the value of my_N to 1000000. Does it change the answers to the last two questions?


    Solution

    # do notice that this chunk requires the execution of previous solution
    my_N <- 1000000
    
    tab <- do_tests(my_N)
    print(tab)
    
    my_msg <- paste0('The format with largest disk space for N = ', my_N, ' is ', 
                     tab$Result[which.max(tab$Size)], '.')
    
    message(my_msg)
    
    my_msg <- paste0('The format with least execution time for N = ', my_N, ' is ', 
                     tab$Result[which.min(tab$Time)], '.')
    message(my_msg)

  27. Question

    Use the afedR3::data_path function to access the CH04_SP500.csv file in the book’s data repository. Import the contents of the file into R with the function readr::read_csv. How many rows are there in the resulting dataframe?


    Solution

    my_f <- afedR3::data_path('CH04_SP500.csv')
    
    df_SP500 <- readr::read_csv(my_f, 
                         col_types = readr::cols())
    
    my_msg <- paste0('There are ', nrow(df_SP500), ' rows and ', 
                     ncol(df_SP500), ' columns in file ', basename(my_f))
    message(my_msg)
    
    my_sol <- nrow(df_SP500)

  28. Question

    At link https://eeecon.uibk.ac.at/~zeileis/grunfeld/Grunfeld.csv you’ll find a .csv file with the Grunfeld data. This is a particularly famous table due to its use as reference data in econometric models. Using the readr::read_csv function, read this file using the direct link as input. How many columns do you find in the resulting dataframe?


    Solution

    my_url <- 'https://eeecon.uibk.ac.at/~zeileis/grunfeld/Grunfeld.csv'
    
    df_grunfeld <- readr::read_csv(my_url, col_types = readr::cols())
    
    my_sol <- ncol(df_grunfeld)

  29. Question

    Use the afedR3::data_path function to access the CH04_example-tsv.tsv file in the book’s data repository. Note that the columns of the data are separated by the tab symbol ('\t'), and not the usual comma. After reading the readr::read_delim manual, import the information from this file into your R session. How many rows does the resulting dataframe contain?


    Solution

    my_f <- afedR3::data_path('CH04_example-tsv.tsv')
    
    df_tsv <- readr::read_delim(my_f, delim = '\t', col_types = readr::cols())
    
    my_sol <- nrow(df_tsv)
    

  30. Question

    In the book package you’ll find a data file called CH04_another-funky-csv-file.csv, with a particularly bizarre format. Open it in a text editor and try to understand how the columns are separated and what the decimal symbol is. After that, study the inputs of function utils::read.table and import the table into your R session. If we add the number of rows to the number of columns of the imported table, what is the result?


    Solution

    my_f <- afedR3::data_path("CH04_another-funky-csv-file.csv")
    
    df_funky <- read.table(file = my_f, 
               dec = '?', 
               skip = 7, 
               sep = '|', 
               header = TRUE)
    
    my_sol <- nrow(df_funky) + ncol(df_funky)

  31. Question

    Using the yfR package, download daily data of the Facebook stock (META) from Yahoo Finance for the period between 2019 and 2023. What is the lowest unadjusted closing price (column price_close) in the analyzed period?


    Solution

    ticker <- 'META'
    first_date <- '2019-01-01'
    last_date  <- '2023-01-01'
    
    df_prices <- yfR::yf_get(
      tickers = ticker, 
      first_date = first_date, 
      last_date = last_date)
    
    my_sol <- min(df_prices$price_close, 
                  na.rm = TRUE)

  32. Question

    If you have not already done so, create a profile on the Quandl website and download the arabica coffee price data from the CEPEA database (Center for Advanced Studies in Applied Economics) between 2010-01-01 and 2020-12-31. What is the value of the most recent price?


    Solution

    library(GetQuandlData)
    
    id_series <- c(COFFEE = 'CEPEA/COFFEE_A')
    
    # set api key (make sure you got your own!)
    #my_api_key <- readLines('~/Dropbox/98-pass_and_bash/.quandl_api.txt')
    
    first_date <- '2010-01-01'
    last_date <- '2020-12-31'
    
    # code commented out as it requires a personal API key
    # df_coffee <- get_Quandl_series(id_in = id_series, 
    #                                first_date = first_date,
    #                                last_date = last_date,
    #                                api_key = my_api_key,
    #                                do_cache = TRUE)
    
    # find most recent
    # idx <- which.max(df_coffee$ref_date)
    # my_sol <- as.numeric(df_coffee$`Cash Price US$`[idx])

  33. Question

    Use function simfinapi::sfa_get_entities() to import data about all available companies in Simfin. How many companies do you find? (see function dplyr::n_distinct()).


    Solution

    library(simfinapi)
    library(dplyr)
    library(readr)
    
    # set api key - make sure you got your own at https://simfin.com/data/access/api
    #my_api_key <- read_lines('~/GDrive/98-pass-and-bash/.api_key_simfin.txt')
    my_api_key <- ''
    
    cache_dir <- fs::path_temp("cache-simfin")
    fs::dir_create(cache_dir)
    
    # get info
    simfinapi::sfa_set_api_key(my_api_key)
    simfinapi::sfa_set_cache_dir(cache_dir)
    
    # get info
    df_info_companies <- simfinapi::sfa_get_entities()
    
    # check it
    glimpse(df_info_companies)
    
    n_companies <- n_distinct(df_info_companies$ticker)
    
    my_sol <- n_companies

  34. Question

    With package simfinapi, download the PL (profit/loss) statement for FY (fiscal year) data for TESLA (ticker “TSLA”) for year 2022. What is the latest Profit/Loss of the company for that particular year?


    Solution

    library(simfinapi)
    library(dplyr)
    library(readr)
    
    # set api key - make sure you got your own at https://simfin.com/data/access/api
    #my_api_key <- read_lines('~/GDrive/98-pass-and-bash/.api_key_simfin.txt')
    my_api_key <- ''
    
    cache_dir <- fs::path_temp("cache-simfin")
    fs::dir_create(cache_dir)
    
    # get info
    simfinapi::sfa_set_api_key(my_api_key)
    simfinapi::sfa_set_cache_dir(cache_dir)
    
    ticker <- 'TSLA'
    type_statement <- 'pl' # profit/loss
    period <- 'fy' # fiscal year
    year <- 2022
    
    PL <- sfa_get_statement(
      ticker = ticker, 
      statement = type_statement,
      period = period,
      fyear = year)
    
    glimpse(PL)
    
    desired_acc <- 'net_income'
    
    latest_acc_value <- PL[[desired_acc]][1]
    
    my_sol <- latest_acc_value 

  35. Question

    Using function tidyquant::tq_index, download the current composition of index DOW. What is the company with the highest percentage in the composition of the index?

    Be aware that the answer is time-dependent and the reported result might be different from what you actually got in your R session.


    1. NIKE INC CL B
    2. INTEL CORP
    3. COCA COLA CO/THE
    4. UNITEDHEALTH GROUP INC
    5. SALESFORCE INC

    Solution

    library(tidyquant)
    
    # print available indices
    available_index <- tq_index_options()
    
    # the question asks for index "DOW"
    my_index <- 'DOW'
    
    # get components of DOW
    comp_idx <- tq_index(my_index)
    
    highest_weight <- comp_idx$company[which.max(comp_idx$weight)]
    
    my_sol <- highest_weight 

  36. Question

    Using again the yfR package, download financial data between 2019-01-01 and 2020-01-01 for the following tickers: AAPL, BAC, GE, TSLA and SNAP.

    Using the adjusted closing price column, which company provided the highest return to the stock holder during the analyzed period?

    Tip: this is an advanced exercise that will require some coding. To solve it, check out function split to split the dataframe of price data and lapply to map a function to each dataframe.


    1. BAC
    2. GE
    3. AAPL
    4. TSLA
    5. SNAP

    Solution

    library(dplyr)
    
    first_date <- '2019-01-01'
    last_date  <- '2020-01-01'
    
    my_tickers <- c('AAPL', 'BAC', 
                    'GE', 'TSLA',
                    'SNAP')
        
    df_prices <- yfR::yf_get(
      tickers = my_tickers, 
      first_date = first_date, 
      last_date = last_date)
    
    split_l <- split(df_prices, df_prices$ticker)
    
    my_fct <- function(df_in) {
      price_vec <- df_in$price_adjusted
      ticker_in <- df_in$ticker[1]
      
      total_ret <- last(price_vec)/first(price_vec) - 1
      
      return(tibble(ticker = ticker_in,
                    total_ret = total_ret))
    }
    
    df_results <- bind_rows(
      lapply(split_l, my_fct)
    )
    
    winner <- df_results$ticker[which.max(df_results$total_ret)]
    
    my_sol <- winner

  37. Question

    Using function dplyr::tibble, create a dataframe called my_df with a column called x containing a sequence from -100 to 100, and another column called y with the values of column x plus 5. How many values in column x are greater than 10 and lower than 25?


    1. 14
    2. 2
    3. 5
    4. 7
    5. 9

    Solution

    my_df <- dplyr::tibble(x = -100:100, 
                           y = x + 5)
    # solution
    my_sol <- sum((my_df$x > 10)&(my_df$x < 25))

  38. Question

    Create a new column in object my_df called cumsum_x, containing the cumulative sum of x (cumsum function). In this new column, how many values are greater than -3500?


    Solution

    my_df <- dplyr::tibble(x = -100:100, 
                           y = x + 5)
    
    # solution
    my_df$cumsum_x <- cumsum(my_df$x)
    
    # solution
    my_sol <- sum(my_df$cumsum_x > -3500)

  39. Question

    Use the dplyr::filter function and the pipeline operator to filter my_df, keeping only the rows where the value of the y column is greater than 0. What is the number of rows in the resulting table?


    Solution

    my_df <- dplyr::tibble(x = -100:100, 
                           y = x + 5)
    
    # solution
    my_df2 <- my_df |>
      dplyr::filter(y > 0)
    
    # solution
    my_sol <- nrow(my_df2)

  40. Question

    If you have not already done so, repeat exercises 1, 2 and 3 using the functions of the tidyverse universe and the pipeline operator.


    Solution

    library(dplyr)
    library(readr)
    library(stringr)
    
    my_df <- tibble(x = -100:100,
                    y = x + 5) |>
      mutate(cumsum_x = cumsum(x))
    
    # solution 01
    sol_01 <- my_df |>
      filter(x > 10, 
             x < 25) |>
      nrow()
    
    # solution 02
    sol_02 <- my_df |>
      mutate(cumsum_x = cumsum(x)) |>
      filter(cumsum_x > -3500) |>
      nrow()
    
    # solution 03
    sol_03 <- my_df |>
      filter(y > 0) |>
      nrow()
    
    message(str_glue(
      'Solution 01 = {sol_01} \n',
      'Solution 02 = {sol_02} \n',
      'Solution 03 = {sol_03}'
      )
    )

  41. Question

    Use the yfR package to download Google (GOOG) stock data, from 2015-01-01 to 2023-01-01. If the investor had bought 1000 USD in Google shares on the first day of the data and kept the investment until today, what would be the value of his portfolio?


    1. $30,685.82
    2. $11,997.49
    3. $5,976.27
    4. $24,039.93
    5. $18,018.71

    Solution

    library(dplyr)
    
    first_date <- '2015-01-01'
    last_date <- '2023-01-01'
    
    df_prices <- yfR::yf_get('GOOG', 
                             first_date,
                             last_date)
    
    value_purchase <- 1000 # purchase value stated in the question
    
    # solution
    my_sol <- last(df_prices$price_adjusted)/first(df_prices$price_adjusted)*value_purchase

  42. Question

    Use functions afedR3::data_path and readr::read_csv to import data from the CH11_grunfeld.csv file. Now, use function dplyr::glimpse to find out the number of rows in the imported data. What is the number of rows in the imported table?


    1. 200
    2. 234
    3. 271
    4. 308
    5. 345

    Solution

    library(dplyr)
    library(readr)
    
    my_file <- afedR3::data_path('CH11_grunfeld.csv')
    
    df_grunfeld <- read_csv(my_file, 
                            col_types = cols())
    
    # solution
    glimpse(df_grunfeld)
    
    my_sol <- nrow(df_grunfeld)

  43. Question

    Create a list type object with three dataframes in its contents, df1, df2 and df3. The content and size of the dataframes are your personal choice. Now, use the sapply function and a custom function to find out the number of rows and columns in each dataframe.


    Solution

    df1 <- dplyr::tibble(x = 1:10)
    df2 <- dplyr::tibble(y = runif(100))
    df3 <- dplyr::tibble(z = rnorm(150),
                         m = rnorm(150))
    
    my_l <- list(df1, df2, df3)
    
    my_fct <- function(df_in) {
      out <- c('nrows' = nrow(df_in), 
               'ncols' = ncol(df_in))
      return(out)
    }
    
    tab <- sapply(my_l, my_fct)
    tab

  44. Question

    Within an R session, create an identity matrix (value 1 on the diagonal, zero elsewhere) of size 5x5. Tip: use the diag function to define the diagonal of the matrix.


    Solution

    my_size <- 5
    M_identity <- matrix(0, 
                         nrow = my_size, 
                         ncol = my_size)
    
    # solution
    diag(M_identity) <- 1
    
    print(M_identity)
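
    As a side note, base R builds the same matrix in a single call, since diag(n) returns an n x n identity matrix:

    # one-line equivalent
    identical(M_identity, diag(my_size))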

  45. Question

    Consider the following vectors x and y:

    set.seed(7)
    x <- sample(1:3, size = 5, replace = T)
    y <- sample(1:3, size = 5, replace = T)

    What is the sum of the elements of a new vector resulting from the multiplication between the elements of x and y?


    1. 31
    2. 34
    3. 41
    4. 48
    5. 55

    Solution

    set.seed(7)
    
    x <- sample(1:3, size = 5, replace = T)
    y <- sample(1:3, size = 5, replace = T)
    
    # solution
    my_sol <- sum(x*y)
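
    The same quantity is the dot product of the two vectors, since x %*% y computes exactly sum(x*y):

    # same result via the dot product (a 1x1 matrix, hence as.numeric)
    as.numeric(x %*% y)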

  46. Question

    If we performed a cumulative sum of a sequence between 1 and 100, in what element would this sum go beyond the value of 50?


    1. 10
    2. 11
    3. 13
    4. 15
    5. 17

    Solution

    my_sum <- cumsum(1:100)
    
    # solution
    my_sol <- (which(my_sum > 50)[1])

  47. Question

    Using R, create a sequence called seq_1 between -15 and 10, where the interval between values is always equal to 2. What is the sum of the elements of seq_1?


    1. -39
    2. -20
    3. -27
    4. -69
    5. -108

    Solution

    # solution
    seq_1 <- seq(from = -15, to = 10, by = 2)
    
    # solution
    my_sol <- sum(seq_1)

  48. Question

    Define another object called seq_2 containing a sequence of size 1000, with values between 0 and 100. What is the standard deviation (function sd()) of this sequence?


    1. 45.26061
    2. 28.91085
    3. 22.19465
    4. 12.94184
    5. 74.17146

    Solution

    seq_2 <- seq(from = 0, 
                 to = 100, 
                 length.out = 1000)
    
    # solution
    my_sol <- sd(seq_2)

  49. Question

    Calculate the sum of the seq_1 and seq_2 vectors (see previous exercises). Did this operation work despite the different sizes of the vectors? Explain your answer. If it works, what is the highest value of the resulting vector?


    1. 191.5447
    2. 150.804
    3. 108.999
    4. 171.1743
    5. 130.4336

    Solution

    Yes, it worked, but with a warning: “longer object length is not a multiple of shorter object length”. The explanation is that whenever R encounters an operation between vectors of different sizes, it applies the recycling rule: the shorter vector is repeated as many times as necessary to match the size of the longer one. In the book, see the section on numeric vectors for more details.

    seq_1 <- seq(from = -15, to = 10, by = 2)
    seq_2 <- seq(from = 0, 
                 to = 100, 
                 length.out = 1000)
    
    # solution
    my_sol <- max(seq_1+seq_2)
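
    The recycling rule is easy to see with a short example, where the shorter vector c(10, 20) is repeated to match the longer one:

    c(1, 2, 3, 4) + c(10, 20)
    # [1] 11 22 13 24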

  50. Question

    Let’s assume that, on a certain date, you bought 100 shares in a company, paying $15 per share. After some time, you sold 41 shares for $14 each, and the remaining 59 shares were sold for $17 each on a later day. Using a script in R, structure this financial problem by creating numeric objects. What is the total gross profit/loss from this sequence of transactions in the stock market?


    1. $27
    2. $50
    3. $77
    4. $206
    5. $129

    Solution

    total_shares <- 100
    price_purchase <- 15
    
    total_purchase_value <- total_shares*price_purchase
    
    qtd_sell_1 <- 41
    price_sell_1 <- 14
    total_sell_1 <- qtd_sell_1*price_sell_1
    
    qtd_sell_2 <- total_shares - qtd_sell_1
    price_sell_2 <- 17
    total_sell_2 <- qtd_sell_2*price_sell_2
    
    total_sell_value <- total_sell_1 + total_sell_2
    
    # solution
    my_sol <- total_sell_value - total_purchase_value

  51. Question

    Create a vector x according to the following formula, where $i = 1 \ldots 100$. What is the sum of the elements of x?

    $$x_i = \frac{(-1)^{i+1}}{2i-1}$$


    1. 0.7828982
    2. 0.9299688
    3. 1.069942
    4. 1.209914
    5. 1.349887

    Solution

    i <- 1:100
    
    x <- ( (-1)^(i+1) )/(2*i - 1)
    
    # solution
    my_sol <- sum(x)

  52. Question

    Create a vector $z_i$ according to the following formula, where $x_i = 1 \ldots 50$ and $y_i = 50 \ldots 1$. What is the sum of the elements of $z_i$? Tip: check out how the dplyr::lag() function works.

    $$z_i = \frac{y_i - x_{i-1}}{y_{i-2}}$$


    1. -65.95709
    2. -113.3288
    3. -78.56528
    4. -90.15313
    5. -101.741

    Solution

    x <- 1:50
    y <- 50:1
    
    # solution using `base`
    z <- (y - c(NA, x[1:(length(x)-1)]))/c(NA, NA, y[1:(length(y)-2)])
    
    # solution with tidyverse (much prettier huh!)
    z <- (y - dplyr::lag(x, n = 1))/dplyr::lag(y, n = 2)
    
    # solution (be aware of the NA values)
    my_sol <- sum(z, na.rm = TRUE)

  53. Question

    Using set.seed(10), create an object called x with 1000 random values drawn from the Normal distribution with mean 10 and standard deviation 10. Using the cut function, create another object that splits x into two groups: values lower than 15 and values greater than 15. How many observations do you find in the first group?


    1. 668
    2. 775
    3. 915
    4. 1055
    5. 1195

    Solution

    set.seed(10)
    x <- rnorm(n = 1000, mean = 10, sd = 10)
    
    my_group <- cut(x, 
                    breaks = c(-Inf, 15, Inf))
    
    # solution
    my_sol <- table(my_group)[1]

  54. Question

    Create the following object with the following code:

    set.seed(15)
    my_char <- paste(sample(letters, 5000, replace = T), 
                     collapse = '')

    How many times is the letter 'x' found in the resulting text object?


    1. 174
    2. 198
    3. 231
    4. 264
    5. 297

    Solution

    set.seed(15)
    my_char <- paste(sample(letters, 5000, replace = T), 
                     collapse = '')
    
    # solution
    my_sol <- stringr::str_count(my_char, 'x')

  55. Question

    Based on the my_char object created earlier, if we divided it into several smaller pieces using the letter "b", what is the number of characters in the largest piece found?


    1. 110
    2. 125
    3. 146
    4. 167
    5. 188

    Solution

    set.seed(15)
    my_char <- paste(sample(letters, 5000, replace = T), 
                     collapse = '')
    
    my_split <- stringr::str_split(my_char, pattern = 'b')[[1]]
    
    # find number of characters in each
    n_chars <- sapply(my_split, nchar)
    
    # solution
    my_sol <- n_chars[which.max(n_chars)]

  56. Question

    At web address https://www.gutenberg.org/cache/epub/69694/pg69694.txt you’ll find the full text of the book The Devil of East Lupton, Vermont, by William Fitzgerald. Use functions download.file and readr::read_lines to import the entire book as a vector of characters called my_book in R. How many lines does the resulting object have?


    1. 1047
    2. 127
    3. 350
    4. 573
    5. 796

    Solution

    my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
    
    my_book <- readr::read_lines(my_link)
    
    # solution
    my_sol <- length(my_book)

  57. Question

    Bind the vector of characters in my_book into a single object called full_text using command paste0(my_book, collapse = '\n'). Using the stringr package, how many times is the word 'Vermont' repeated throughout the text?


    1. 10
    2. 4
    3. 6
    4. 13
    5. 23

    Solution

    my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
    
    my_book <- readr::read_lines(my_link)
    
    # solution
    full_text <- paste0(my_book, collapse = '\n')
    my_sol <- stringr::str_count(full_text, stringr::fixed('Vermont'))

  58. Question

    For the full_text object created earlier, use the stringr::str_split() function to split the entire text using blanks. Based on the resulting vector, create a frequency table. What is the most used word in the text? Tip: Remove all cases of empty characters.


    1. the
    2. a
    3. He
    4. to
    5. in

    Solution

    my_link <- 'https://www.gutenberg.org/cache/epub/69694/pg69694.txt'
    
    my_book <- readr::read_lines(my_link)
    
    # solution
    full_text <- paste0(my_book, collapse = '\n')
    my_split <- stringr::str_split(full_text, 
                                   pattern = stringr::fixed(' '))[[1]]
    
    # remove empty
    my_split <- my_split[my_split != '']
    
    my_tab <- sort(table(my_split), decreasing = TRUE)
    
    # solution
    my_sol <- names(my_tab[1])

  59. Question

    Assuming that a person born on 2000-05-12 will live for 100 years, what is the number of birthdays that will fall on a weekend (Saturday or Sunday)? Tip: use operator %in% to check for multiple conditions in a vector.


    1. 29
    2. 8
    3. 22
    4. 51
    5. 80

    Solution

    b_day <- as.Date('2000-05-12')
    n_years <- 100
    
    b_day_vec <- seq(b_day, b_day + n_years*365, by = '1 year')
    
    w_days <- weekdays(b_day_vec)
    
    n_weekend <- sum(w_days %in% c('Saturday', 'Sunday'))
    
    # solution
    my_sol <- n_weekend
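
    One caveat: weekdays() returns day names in the language of the current locale, so the comparison with 'Saturday' and 'Sunday' only works in an English locale. A locale-independent sketch using lubridate (an assumption, not part of the original solution):

    # lubridate::wday() returns numbers: by default, 1 = Sunday and 7 = Saturday
    n_weekend_2 <- sum(lubridate::wday(b_day_vec) %in% c(1, 7))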

  60. Question

    What date and time is found $10^4$ seconds after 2021-02-02 11:50:02?


    1. 2021-02-02 09:39:55
    2. 2021-02-02 14:36:42
    3. 2021-02-02 12:39:23
    4. 2021-02-02 14:22:58
    5. 2021-02-02 13:23:34

    Solution

    time_1 <- as.POSIXct('2021-02-02 11:50:02')
    my_sec <- 10000
    
    my_sol <- time_1 + my_sec
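
    For reference, the same arithmetic can be written with lubridate's duration helpers (an optional variation, assuming the lubridate package is installed):

    # adding 10^4 seconds as an explicit duration
    time_1 + lubridate::dseconds(10^4)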

  61. Question

    Create a function called say_my_name that takes a person’s name as input and shows the text Your name is … on the screen. Within the scope of the function, use comments to describe the purpose of the function, its inputs and outputs.


    Solution

    say_my_name <- function(name_in) {
      # Print a single name in the prompt
      #
      # ARGS: name_in - Name to be printed
      #
      # RETURNS: TRUE, if successful
      
      my_msg <- paste0('Your name is ', name_in)
      
      message(my_msg)
      
      # invisible() omits the output when the function is called without assigning its result
      return(invisible(TRUE))
    }
    
    # testing
    say_my_name('Marcelo')

  62. Question

    Implement a test for the input objects, so that when the input name is not of the character class, an error is returned to the user. Test your new function and verify if it is working as expected.


    Solution

    say_my_name <- function(name_in) {
      # Prints a single name in the prompt
      #
      # ARGS: name_in - Name to be printed
      # RETURNS: TRUE, if successful
      
      # check inputs
      if (class(name_in) != 'character') {
        stop('Class of input name_in is ', class(name_in), 
             ' and not character!')
      }
      
      if (length(name_in) > 1) {
        stop('Input name_in has length ', length(name_in), 
             ' and not 1 (this function only works for one name)!')
      }
      
      
      
      my_msg <- paste0('Your name is ', name_in, '.')
      
      message(my_msg)
      
      # invisible() makes sure the function returns nothing visible if the output is not assigned
      return(invisible(TRUE))
    }
    
    # testing Ok
    say_my_name('Marcelo')
    
    # testing vector
    say_my_name(c('Richard', 'Michael'))
    
    # testing class
    say_my_name(1)
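
    Base R's stopifnot() offers a more compact way to write these input checks. A sketch, not the book's version:

    say_my_name_2 <- function(name_in) {
      # both conditions must hold, otherwise stopifnot() throws an error
      stopifnot(is.character(name_in),
                length(name_in) == 1)
      
      message(paste0('Your name is ', name_in, '.'))
      return(invisible(TRUE))
    }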

  63. Question

    Create a vector with any five names, called my_names. Using a loop, apply function say_my_name to each element of my_names.


    Solution

    say_my_name <- function(name_in) {
      # Prints a single name in the prompt
      #
      # ARGS: name_in - Name to be printed
      # RETURNS: TRUE, if successful
      
      # check inputs
      if (class(name_in) != 'character') {
        stop('Class of input name_in is ', class(name_in), 
             ' and not character!')
      }
      
      if (length(name_in) > 1) {
        stop('Input name_in has length ', length(name_in), 
             ' and not 1 (this function only works for one name)!')
      }
      
      
      
      my_msg <- paste0('Your name is ', name_in, '\n')
      
      message(my_msg)
      
      # invisible() makes sure the function returns nothing visible if the output is not assigned
      return(invisible(TRUE))
    }
    
    my_names <- c('Marcelo', 'Ricardo', 'Tarcizio', 'Ana', 'Paula')
    
    for (i_names in my_names) {
      say_my_name(i_names)
    }

  64. Question

    In the database of Brasil.IO you will find a table with names and genders. Import the data from the file and, using a loop, apply function say_my_name to 15 random names in the database. Tip: you can read the data directly from the link using readr::read_csv(LINK).


    Solution

    say_my_name <- function(name_in) {
      # Prints a single name in the prompt
      #
      # ARGS: name_in - Name to be printed
      # RETURNS: TRUE, if successful
      
      # check inputs
      if (class(name_in) != 'character') {
        stop('Class of input name_in is ', class(name_in), 
             ' and not character!')
      }
      
      if (length(name_in) > 1) {
        stop('Input name_in has length ', length(name_in), 
             ' and not 1 (this function only works for one name)!')
      }
      
      
      
      my_msg <- paste0('Your name is ', name_in, '.')
      
      message(my_msg)
      
      # invisible() makes sure the function returns nothing visible if the output is not assigned
      return(invisible(TRUE))
    }
    
    library(dplyr)
    library(readr)
    library(stringr)
    
    # url for the "genero-nomes" dataset at Brasil.IO
    my_url <- 'https://data.brasil.io/dataset/genero-nomes/grupos.csv.gz'
    
    df_names <- read_csv(my_url, col_types = cols())
    
    my_names <- sample(df_names$name, 15)
    
    for (i_name in my_names) {
      
      say_my_name(i_name)
    }

  65. Question

    Redo previous exercises using function sapply or purrr::walk.


    Solution

    say_my_name <- function(name_in) {
      # Prints a single name in the prompt
      #
      # ARGS: name_in - Name to be printed
      # RETURNS: TRUE, if successful
      
      # check inputs
      if (class(name_in) != 'character') {
        stop('Class of input name_in is ', class(name_in), 
             ' and not character!')
      }
      
      if (length(name_in) > 1) {
        stop('Input name_in has length ', length(name_in), 
             ' and not 1 (this function only works for one name)!')
      }
        
      my_msg <- paste0('Your name is ', name_in, '\n')
      
      message(my_msg)
      
      # invisible() makes sure the function returns nothing visible if the output is not assigned
      return(invisible(TRUE))
    }
    
    library(dplyr)
    library(readr)
    library(purrr)
    
    # url for the "genero-nomes" dataset at Brasil.IO
    my_url <- 'https://data.brasil.io/dataset/genero-nomes/grupos.csv.gz'
    
    df_names <- read_csv(my_url, col_types = cols())
    
    # using sapply
    vec_out <- sapply(sample(df_names$name, 15),
                      say_my_name)
    glimpse(vec_out)
    
    # using purrr
    walk(sample(df_names$name, 15),
        say_my_name)

  66. Question

    Use the {yfR} package to download data for the SP500 ('^GSPC'), Ibovespa ('^BVSP'), FTSE ('^FTSE') and Nikkei 225 ('^N225') indices, from '2010-01-01' to the current date. With the imported data, use a loop to calculate the average, maximum and minimum return for each index during the analyzed period. Save all results in a single table and show it in the prompt of R.


    Solution

    library(dplyr)
    
    indexes <- c('^BVSP', '^GSPC', '^FTSE', '^N225')
    
    df_indices <- yfR::yf_get(indexes, 
                              '2010-01-01',
                              Sys.Date())
    
    tab <- tibble()
    for (index in indexes) {
      
      temp_df <- df_indices |>
        filter(ticker == index)
      
      avg_ret <- mean(temp_df$ret_adjusted_prices, 
                      na.rm = TRUE)
      max_ret <- max(temp_df$ret_adjusted_prices, 
                     na.rm = TRUE)
      min_ret <- min(temp_df$ret_adjusted_prices, 
                     na.rm = TRUE)
      
      # save result
      tab <- bind_rows(tab, tibble(index = index,
                                   mean_ret = avg_ret, 
                                   max_ret = max_ret, 
                                   min_ret = min_ret))
      
    }
    
    print(tab)

  67. Question

    Redo previous exercise using functions group_by and summarise, both from package dplyr.


    Solution

    library(dplyr)
    
    indexes <- c('^BVSP', '^GSPC', '^FTSE', '^N225')
    
    df_indices <- yfR::yf_get(indexes, 
                              '2010-01-01',
                              Sys.Date())
    
    tab_tidy <- df_indices |>
      group_by(ticker) |>
      summarise(mean_ret = mean(ret_adjusted_prices, na.rm = TRUE),
                max_ret = max(ret_adjusted_prices, na.rm = TRUE),
                min_ret = min(ret_adjusted_prices, na.rm = TRUE))
    
    print(tab_tidy)

  68. Question

    On the RStudio CRAN logs website, you’ll find data on download statistics for the base distribution of R in the Daily R downloads section. Using your programming skills, import all available data between 2022-01-01 and 2022-01-05 and aggregate it into a single table. Which country has the highest download count for R?


    1. IE
    2. AL
    3. MG
    4. US
    5. IN

    Solution

    # set function that will download the files
    read_cranlogs_files <- function(date_in) {
      # Reads log files from http://cran-logs.rstudio.com/
      #
      # ARGS: date_in - date of log data
      require(lubridate)
      
      url_dl <- paste0('http://cran-logs.rstudio.com/', year(date_in), '/',
                       date_in, '-r.csv.gz')
      
      cat('\nReading ', url_dl)
      
      df <- readr::read_csv(url_dl, col_types = readr::cols())
      
      return(df)
    }
    
    # find out the available dates in the url
    library(rvest)
    library(lubridate)
    library(stringr)
    library(dplyr)
    
    available_links <- read_html('http://cran-logs.rstudio.com/') |>
      html_nodes(css = 'a') |>
      html_attr('href')
    
    # only keep links for R download (those with -r.csv.gz pattern)
    idx <- str_detect(available_links, '-r.csv.gz')
    r_links <- available_links[idx]
    
    # find out dates 
    dates_dls <- ymd(basename(r_links))
    max_date <- max(dates_dls)
    
    first_date <- as.Date('2022-01-01')
    last_date <- as.Date('2022-01-05')
    
    my_dates <- seq(first_date, 
                    last_date, 
                    by = '1 day')
    
    library(purrr) 
    library(dplyr)
    
    l_out <- map(my_dates, 
                 safely(read_cranlogs_files, 
                        otherwise = tibble())) # return empty tibble in case of error
    
    df_cranlogs <- bind_rows(map(l_out, 'result'))
    
    # solution 
    my_sol <- names(sort(table(df_cranlogs$country), 
                   decreasing = TRUE)[1])

  69. Question

    For daily FTSE data ('^FTSE', from 2010-01-01 to 2023-01-01), check the dates and prices of the 20 biggest price drops. If, for each of these cases, an investor bought the index at the price of the biggest drops and kept it for 30 days, what would be his average nominal return per transaction?


    1. 2.95%
    2. 4.51%
    3. 5.07%
    4. 3.38%
    5. 3.94%

    Solution

    library(dplyr)
    library(readr)
    
    ticker <- '^FTSE'
    first_date <- '2010-01-01'
    last_date <- '2023-01-01'
    
    df_FTSE_daily <- yfR::yf_get(ticker, 
                                 first_date,
                                 last_date)
    
    # buy at t, sell at t+30
    trade_window <- 30 
    
    # find largest drops
    largest_drops <- df_FTSE_daily |>
      arrange(ret_adjusted_prices) |>
      slice(1:20)
    
    # There are many ways to solve this exercise. 
    # Here we use a loop, which is the simplest way of looking at the problem.
    # You could also solve it with the functional approach of package purrr
    # (see the sketch after this solution).
    
    tab <- tibble()
    for (i_date in seq_along(largest_drops$ref_date)) {
      
      my_date <- largest_drops$ref_date[i_date]
      # filter data to keep only datapoints in each horizon
      temp_df <- df_FTSE_daily |>
        filter(ref_date >= my_date,
               ref_date <= my_date + trade_window)
      
      
      buy_price <- first(temp_df$price_adjusted)
      sell_price <- last(temp_df$price_adjusted)
      ret <- sell_price/buy_price - 1 # avoid naming a variable after base::return
      
      tab <- bind_rows(tab, 
                       tibble(date = my_date, 
                              buy_price = buy_price, 
                              sell_price = sell_price, 
                              return = ret))
    }
    
    print(tab)
    
    # solution
    my_sol <- mean(tab$return)
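
    As mentioned in the comments above, the same computation can be written in the functional style of purrr. The sketch below reuses the objects df_FTSE_daily, largest_drops and trade_window from the loop solution:

    library(purrr)
    
    calc_trade <- function(my_date) {
      # keep only the trading window after the drop
      temp_df <- df_FTSE_daily |>
        filter(ref_date >= my_date,
               ref_date <= my_date + trade_window)
      
      # nominal return of buying at the drop and selling 30 days later
      tibble(date = my_date,
             return = last(temp_df$price_adjusted)/first(temp_df$price_adjusted) - 1)
    }
    
    tab_purrr <- map_dfr(largest_drops$ref_date, calc_trade)
    mean(tab_purrr$return)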

  70. Question

    Consider a dataframe created with the following R code:

    library(dplyr)
    
    my_N <- 100
    
    df <- bind_rows(tibble(ticker = rep('STOCK 1', my_N),
                           ref_date = Sys.Date() + 1:my_N,
                           price = 100 + cumsum(rnorm(my_N))),
                    tibble(ticker = rep('STOCK 2', my_N),
                           ref_date = Sys.Date() + 1:my_N,
                           price = 100 + cumsum(rnorm(my_N))) )
    
    print(df)

    Is the dataframe in the long or wide format? Explain your answer.


    Solution

    The format is long: we have data stacked for two different stocks. Note that, with the addition of new tickers, the table grows with new rows. New variables can be easily added as new columns.


  71. Question

    Change the format of the previous dataframe, from long to wide or vice versa.

    library(dplyr)
    
    my_N <- 100
    
    df <- bind_rows(tibble(ticker = rep('STOCK 1', my_N),
                           ref_date = Sys.Date() + 1:my_N,
                           price = 100 + cumsum(rnorm(my_N))),
                    tibble(ticker = rep('STOCK 2', my_N),
                           ref_date = Sys.Date() + 1:my_N,
                           price = 100 + cumsum(rnorm(my_N))) )
    
    print(df)

    Solution

    library(dplyr)
    library(readr)
    
    my_N <- 100
    
    df <- bind_rows(tibble(ticker = rep('STOCK 1', my_N),
                           ref_date = Sys.Date() + 1:my_N,
                           price = 100 + cumsum(rnorm(my_N))),
                    tibble(ticker = rep('STOCK 2', my_N),
                           ref_date = Sys.Date() + 1:my_N,
                           price = 100 + cumsum(rnorm(my_N))) )
    
    print(df)
    
    # convert from long to wide
    df_wide <- tidyr::spread(data = df, 
                      key = 'ticker',
                      value = 'price')
    
    # print result
    print(df_wide)
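
    In recent versions of tidyr, spread() is superseded; the same long-to-wide reshape can be written with tidyr::pivot_wider():

    # equivalent conversion with the newer tidyr API
    df_wide2 <- tidyr::pivot_wider(data = df,
                                   names_from = 'ticker',
                                   values_from = 'price')
    
    print(df_wide2)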

  72. Question

    Consider the following list:

    library(dplyr)
    
    my_l <- list(df1 = tibble(x = 1:100, 
                              y = runif(100)),
                 df2 = tibble(x = 1:100, 
                              y = runif(100), 
                              v = runif(100)),
                 df3 = tibble(x = 1:100, 
                              y = runif(100), 
                              z = runif(100)) 
                )

    Add all dataframes in my_l to a single object using do.call or dplyr::bind_rows functions. What happened to the df1 data where v and z columns do not exist?


    Solution

    When bind_rows does not find the same columns across the tables being joined, the missing data points are set to NA. See below:

    library(dplyr)
    library(readr)
    
    my_l <- list(df1 = tibble(x = 1:100, 
                              y = runif(100)),
                 df2 = tibble(x = 1:100, 
                              y = runif(100), 
                              v = runif(100)),
                 df3 = tibble(x = 1:100, 
                              y = runif(100), 
                              z = runif(100)) )
    
    # solution with bind_rows
    bind_df1 <- bind_rows(my_l)
    
    # solution with do.call
    bind_df2 <- do.call(bind_rows, my_l)
    
    # check solutions match
    identical(bind_df1, bind_df2)
    
    print(bind_df1)
    # the missing data points were set as NA values

  73. Question

    Use the yfR package to download the SP500 index data ('^GSPC') from 1950-01-01 to 2021-01-01. What is the sum of the 5 highest positive returns on the index?


    1. 0.0909854
    2. 0.3952476
    3. 0.501383
    4. 0.1924061
    5. 0.2938269

    Solution

    ticker <- '^GSPC'
    first_date <- '1950-01-01'
    last_date <- '2021-01-01'
    
    df_sp500 <- yfR::yf_get('^GSPC', 
                                first_date,
                                last_date)
    
    select_n <- 5
    tab <- dplyr::tibble(position = 1:select_n,
                  top5_positive = sort(df_sp500$ret_adjusted_prices, 
                                       decreasing = TRUE)[1:select_n],
                  top5_negative = sort(df_sp500$ret_adjusted_prices, 
                                       decreasing = FALSE)[1:select_n])
    
    print(tab)
    
    my_sol <- sum(tab$top5_positive)

  74. Question

    Use the replace_outliers function from section @ref(outliers) to remove outliers from all numeric columns of the SP500 data previously imported, with my_prob = 0.025. How many rows were lost in this cleaning process?


    1. 12734
    2. 11291
    3. 7281
    4. 9847
    5. 8403

    Solution

    library(purrr)
    library(dplyr)
    
    ticker <- '^GSPC'
    first_date <- '1950-01-01'
    last_date <- '2021-01-01'
    
    df_sp500 <- yfR::yf_get('^GSPC', 
                                first_date,
                                last_date)
    
    replace_outliers <- function(col_in, my_prob = 0.05) {
      # Replaces outliers from a vector
      #
      # INPUTS: col_in The vector
      #         my_prob Probability of quantiles (p and 1-p)
      #
      # OUTPUT: A vector
      
      # return if class is other than numeric
      if (!(class(col_in) %in% c('numeric', 'integer'))) return(col_in)
      
      my_outliers <- stats::quantile(x = col_in,
                                     probs = c(my_prob, 1-my_prob),
                                     na.rm = TRUE)
      
      idx <- (col_in <= my_outliers[1])|(col_in >= my_outliers[2])
      col_in[idx] <- NA
      
      return(col_in)
      
    }
    
    # replace outliers in each column with NA
    l_out <- map(df_sp500, replace_outliers, my_prob = 0.025)
    
    df_sp500_nooutlier <- na.omit(as_tibble(l_out))
    
    nrow_1 <- nrow(df_sp500)
    nrow_2 <- nrow(df_sp500_nooutlier)
    
    my_sol <- nrow_1 - nrow_2

  75. Question

    Use the yfR::yf_get function to import prices of the FTSE index ('^FTSE') from 2010-01-01 to 2021-01-01. Then, reconstruct the data at the annual frequency, defining each year’s value as the last observation of the period. Tip: see the dplyr::summarise_all function for a functional way to aggregate all the columns of a dataframe.


    Solution

    library(dplyr)
    
    ticker <- '^FTSE'
    
    first_date <- '2010-01-01'
    last_date <- '2021-01-01'
    
    df_FTSE_daily <- yfR::yf_get(ticker,
                                 first_date,
                                 last_date)
    
    # change from daily to annual
    df_FTSE_yearly <- df_FTSE_daily |>
      mutate(ref_year = lubridate::year(ref_date)) |>
      group_by(ref_year) |>
      summarise_all(.funs = last)
    
    print(df_FTSE_yearly)
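    As a side note, summarise_all is superseded in recent versions of dplyr; an equivalent formulation uses across():

    # same aggregation with the across() interface
    df_FTSE_yearly <- df_FTSE_daily |>
      mutate(ref_year = lubridate::year(ref_date)) |>
      group_by(ref_year) |>
      summarise(across(everything(), last))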

  76. Question

    Use the same daily data for the FTSE and reconstruct the data at the monthly frequency, this time using the first observation of each period.


    Solution

    library(dplyr)
    
    ticker <- '^FTSE'
    first_date <- '2010-01-01'
    last_date <- '2021-01-01'
    
    df_FTSE_daily <- yfR::yf_get(ticker,
                                 first_date,
                                 last_date)
    
    # change from daily to monthly
    df_FTSE_monthly <- df_FTSE_daily |>
      mutate(ref_month = format(ref_date, '%Y-%m-01')) |>
      group_by(ref_month) |>
      summarise_all(first)
    
    print(df_FTSE_monthly)

  77. Question

    Download TESLA (TSLA) stock data with the yfR package for the last 500 days. Display a line chart of the adjusted prices over time using the ggplot() function. Make sure that:


    Solution

    library(yfR)
    library(ggplot2)
    
    tickers <- 'TSLA'
    
    df_tsla <- yf_get(tickers = tickers, 
                      first_date = Sys.Date() - 500,
                      last_date = Sys.Date())
    
    p <- ggplot(df_tsla, aes(x = ref_date, y = price_adjusted)) + 
      geom_line() + 
      labs(title = paste0('Prices for ', tickers),
           subtitle = paste0('Data from ', min(df_tsla$ref_date), ' to ', 
                             max(df_tsla$ref_date)),
           caption = "Solution for exercise 01, chapter 10 - afedR")
    
    print(p)

  78. Question

    Using the yfR package, download stock data for AAPL, MSFT, AMZN and GOOG in the last 1500 days. Display stock prices with different line colors on the same graph. Adapt all other aspects of the graph from previous exercises, such as title and axis labels.


    Solution

    library(yfR)
    library(ggplot2)
    
    tickers <- c('AAPL', 
                 'MSFT',
                 'AMZN',
                 'GOOG') 
    
    df_stocks <- yf_get(
         tickers = tickers, 
         first_date = Sys.Date() - 1500,
         last_date = Sys.Date()
         )
    
    p <- ggplot(df_stocks, aes(x = ref_date, 
                               y = price_adjusted,
                               color = ticker)) + 
      geom_line() + 
      labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
           subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ', 
                             max(df_stocks$ref_date)),
           caption = "Solution for exercise 02, chapter 10 - afedR")
    
    print(p)

  79. Question

    For the previous plot, add dots in the lines.


    Solution

    library(yfR)
    library(ggplot2)
    
    tickers <- c('AAPL', 
                 'MSFT',
                 'AMZN',
                 'GOOG') 
    
    df_stocks <- yf_get(tickers = tickers, 
                        first_date = Sys.Date() - 1500,
                        last_date = Sys.Date())
    
    p <- ggplot(df_stocks, aes(x = ref_date, 
                               y = price_adjusted,
                               color = ticker)) + 
      geom_line() + 
      labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
           subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ', 
                             max(df_stocks$ref_date)),
           caption = "Solution for exercise 03, chapter 10 - afedR")
    
    
    # add points
    p <- p + geom_point()
    
    print(p)

  80. Question

    For the same plot, separate the stock prices on different panels with the ggplot::facet_wrap function. Use the scales = 'free' argument to release the x and y axis of each panel.


    Solution

    library(yfR)
    library(ggplot2)
    
    tickers <- c('AAPL', 
                 'MSFT',
                 'AMZN',
                 'GOOG') 
    
    df_stocks <- yf_get(
      tickers = tickers, 
      first_date = Sys.Date() - 1500,
      last_date = Sys.Date()) 
    
    p <- ggplot(df_stocks, aes(x = ref_date, 
                               y = price_adjusted,
                               color = ticker)) + 
      geom_line() + 
      labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
           subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ', 
                             max(df_stocks$ref_date)),
           caption = "Solution for exercise 04, chapter 10 - afedR")
    
    # add points and facet wrap
    p <- p + 
      geom_point() + 
      facet_wrap(~ticker, scales = 'free')
    
    print(p)

  81. Question

    Change the theme of the previous graph to a black and white scale, both for the graph area and for the colors of the lines.


    Solution

    library(yfR)
    library(ggplot2)
    
    tickers <- c('AAPL', 
                 'MSFT',
                 'AMZN',
                 'GOOG') 
    
    df_stocks <- yf_get(tickers = tickers, 
                        first_date = Sys.Date() - 1500,
                        last_date = Sys.Date())
    
    p <- ggplot(df_stocks, aes(x = ref_date, 
                               y = price_adjusted,
                               color = ticker)) + 
      geom_line() + 
      labs(title = paste0('Prices for ', paste(tickers, collapse = ', ')),
           subtitle = paste0('Data from ', min(df_stocks$ref_date), ' to ', 
                             max(df_stocks$ref_date)),
           caption = "Solution for exercise 05, chapter 10 - afedR")
    
    # add points and facet_wrap
    p <- p + geom_point() + 
      facet_wrap(~ticker, scales = 'free') + 
      theme_bw() + 
      scale_color_grey()
    
    print(p)

  82. Question

    For the previous data, present the histogram of the returns of the different stocks in different panels and save the result in a file called 'histograms.png'.


    Solution

    library(ggplot2)
    
    tickers <- c('AAPL', 
                 'MSFT',
                 'AMZN',
                 'GOOG') 
    
    df_stocks <- yfR::yf_get(tickers, 
                             Sys.Date() - 1500,
                             Sys.Date())
    
    p <- ggplot(df_stocks, aes(x = ret_adjusted_prices)) + 
      geom_histogram() + 
      facet_wrap(~ticker)
    
    print(p)
    
    # save in temp folder
    my_file <- file.path(tempdir(), 'histograms.png')
    ggsave(filename = my_file, plot = p)

  83. Question

    Use the yfR::yf_collection_get() function to download the annual return data for all stocks in the DOW index, from 2015-01-01 to the current day. After that, create the mean/variance map by plotting the average annual return on the y axis and the standard deviation on the x axis. Tip: you’ll find many outliers in the raw data. Make sure the graph remains readable by limiting the x and y axes (see the ggplot2::xlim and ggplot2::ylim functions).


    Solution

    library(yfR)
    library(dplyr)
    library(ggplot2)
    
    df_prices <- yfR::yf_collection_get(
      'DOW',
      first_date = '2015-01-01',
      freq_data = 'yearly'
      )
    
    tab <- df_prices |>
      na.omit() |>
      group_by(ticker) |>
      summarise(mean = mean(ret_adjusted_prices),
                sd = sd(ret_adjusted_prices))
    
    p <- ggplot(tab, aes(x = sd, y = mean)) + 
      geom_point() + 
      xlim(0, 0.5) + ylim(-0.5, 0.5)
    
    print(p)

  84. Question

    Head over to the Kaggle data website1 and choose a particular dataset for your analysis. It need not be related to economics or finance. Feel free to make a decision based on your own interests.

    After downloading the data, create a couple of hypotheses that you might have about the data. Create a visual analysis of the data that helps answer those hypotheses.


    Solution

    Head over to https://www.kaggle.com/datasets:

    1. Choose a dataset
    2. Build a visualization plot of the chosen data
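    As an illustration only, a minimal sketch of such a visual analysis follows. The file name kaggle_data.csv and the columns category and value are hypothetical placeholders for whatever dataset and hypotheses you choose:

    library(ggplot2)
    library(readr)
    
    # hypothetical file -- replace with your chosen Kaggle dataset
    df_kaggle <- read_csv('kaggle_data.csv')
    
    # example hypothesis: the distribution of value differs across category
    p <- ggplot(df_kaggle, aes(x = category, y = value)) + 
      geom_boxplot()
    
    print(p)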

  85. Question

    Simulate the following linear process in R:

    set.seed(5)
    
    # number of obs
    n_row <- 100
    
    # set x as Normal (0, 1)
    x <- rnorm(n_row)
    
    # set coefficients
    my_alpha <- 1.5
    my_beta <- 0.5
    
    # build y
    y <- my_alpha + my_beta*x + rnorm(n_row)

    Now, estimate a linear model where x is the explanatory variable and y is the explained variable. Use the summary function on the estimation return object to get more details about the model. What is the estimated beta value of the simulated data?


    1. 0.4002792
    2. 1.503832
    3. 0.8707205
    4. 0.2910268
    5. 0.633111

    Solution

    set.seed(5)
    
    # number of obs
    n_row <- 100
    
    # set x as Normal (0, 1)
    x <- rnorm(n_row)
    
    # set coefficients
    my_alpha <- 1.5
    my_beta <- 0.5
    
    # build y
    y <- my_alpha + my_beta*x + rnorm(n_row)
    
    my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
    
    summary(my_lm)
    
    my_sol <- coef(my_lm)[2]

  86. Question

    Using the {car} package, test the joint hypothesis that the value of alpha is equal to 1.5 and the value of beta is equal to 0.5. What is the value of the resulting F test?


    1. 10.72713
    2. 16.96681
    3. 7.799261
    4. 40.30135
    5. 23.33454

    Solution

    set.seed(5)
    
    # number of obs
    n_row <- 100
    
    # set x as Normal (0, 1)
    x <- rnorm(n_row)
    
    # set coefficients
    my_alpha <- 1.5
    my_beta <- 0.5
    
    # build y
    y <- my_alpha + my_beta*x + rnorm(n_row)
    
    my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
    
    summary(my_lm)
    
    library(car)
    
    # rhs: the tested values of alpha and beta
    test_matrix <- matrix(c(my_alpha,  # alpha test value
                            my_beta))  # beta test value
    
    # hypothesis matrix: identity, so the joint test is
    # alpha = 1.5 and beta = 0.5
    hyp_mat <- diag(2)
    
    # do test
    my_waldtest <- linearHypothesis(my_lm, 
                                    hypothesis.matrix = hyp_mat, 
                                    rhs = test_matrix)
    
    # print result
    my_sol <- my_waldtest$F[2]
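    As a side note, the same joint test can be written more compactly with the character interface of car::linearHypothesis, which avoids building the matrices by hand:

    # equivalent formulation of the joint test
    linearHypothesis(my_lm, c('(Intercept) = 1.5', 'x = 0.5'))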

  87. Question

    Use the gvlma package to test the OLS assumptions for the model previously estimated. Does the model pass all tests? If not, increase the value of n_row to 1000 and try again. Did the increase in the number of observations of the model impact the assumptions test? In what way?


    Solution

    The estimated model does not pass all the tests. In fact, even increasing the number of observations in the simulation does not make the model pass every assumption check.

    set.seed(5)
    
    # number of obs
    n_row <- 1000
    
    # set x as Normal (0, 1)
    x <- rnorm(n_row)
    
    # set coefficients
    my_alpha <- 1.5
    my_beta <- 0.5
    
    # build y
    y <- my_alpha + my_beta*x + rnorm(n_row)
    
    my_lm <- lm(formula = y ~ x, data = tibble::tibble(x, y))
    
    summary(my_lm)
    
    library(gvlma)
    
    # global validation of model
    gvmodel <- gvlma(my_lm) 
    
    # print result
    summary(gvmodel)

  88. Question

    Use the command yfR::yf_index_composition("DOW") to find the tickers of the index components and yfR::yf_get to download all available price data. Using the SP500 itself – ticker '^GSPC' – as the market index, calculate the beta for all stocks. Display the histogram of the estimated betas. Note that the SP500 returns are not available in the original dataset and must be downloaded and added to it.


    Solution

    library(ggplot2)
    library(dplyr)
    
    tickers <- yfR::yf_index_composition("DOW")$ticker
    first_date <- Sys.Date() - 3*365
    last_date  <- Sys.Date()
    
    df_stocks <- yfR::yf_get(tickers, 
                             first_date, 
                             last_date)
    
    df_sp500 <- yfR::yf_get('^GSPC', 
                            first_date, 
                            last_date)
    
    idx <- match(df_stocks$ref_date, df_sp500$ref_date)
    df_stocks$ret_mkt <- df_sp500$ret_closing_prices[idx]
    
    # calculate beta for each stock
    estimate_beta <- function(df) {
      # Function to estimate beta from dataframe of stocks returns
      #
      # Args:
      #   df - Dataframe with columns ret_adjusted_prices and ret_mkt
      #
      # Returns:
      #   The value of beta
      
      my_model <- lm(data = df, 
                     formula = ret_adjusted_prices ~ ret_mkt)
      
      return(coef(my_model)[2])
    }
    
    my_betas <- by(data = df_stocks, 
                   INDICES = df_stocks$ticker, 
                   FUN = estimate_beta)
    
    glimpse(my_betas)
    
    # solution
    p <- ggplot(tibble::tibble(betas = as.numeric(my_betas)), aes(x = betas)) + 
      geom_histogram()
    
    print(p)

  89. Question

    For previously imported data, estimate a panel data version for the market model (beta). In this version, each stock has a different intercept, but they share the same beta. Is the estimated beta significant at 5%?


    Solution

    tickers <- yfR::yf_index_composition("DOW")$ticker
    first_date <- Sys.Date() - 3*365
    last_date  <- Sys.Date()
    
    df_stocks <- yfR::yf_get(tickers, 
                             first_date, 
                             last_date)
    
    df_sp500 <- yfR::yf_get('^GSPC', 
                            first_date, 
                            last_date)
    
    idx <- match(df_stocks$ref_date, df_sp500$ref_date)
    df_stocks$ret_mkt <- df_sp500$ret_closing_prices[idx]
    
    # calculate PLM beta
    library(plm)
    
    # estimate panel data model with fixed effects
    my_pdm <- plm(data = df_stocks, 
                  formula = ret_adjusted_prices ~ ret_mkt, 
                  model = 'within',
                  index = c('ticker'))
    
    # print result
    print(summary(my_pdm))
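    To answer the significance question directly, inspect the coefficient table of the summary: the shared beta is significant at 5% whenever its p-value is below 0.05.

    # coefficient table: estimate, std. error, t-stat and p-value
    print(summary(my_pdm)$coefficients)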

  90. Question

    Using the tidyverse functions dplyr::group_by and dplyr::do, estimate an ARIMA model for the returns of each stock, available from the import process of the previous exercise. In the same output dataframe, create a new column with the forecast at t + 1 for each model. Which stock has the highest expected return for t + 1?


    Solution

    library(dplyr)
    
    tickers <- yfR::yf_index_composition("DOW")$ticker
    first_date <- Sys.Date() - 3*365
    last_date  <- Sys.Date()
    
    df_stocks <- yfR::yf_get(tickers, 
                             first_date, 
                             last_date)
    
    my_tab <- df_stocks |>
      group_by(ticker) |>
      do(my_arima = arima(x = .$ret_adjusted_prices, 
                          order = c(1,0,0))) |>
      mutate(arima_forecast = predict(my_arima, n.ahead = 1 )$pred[1])
    
    glimpse(my_tab)
    
    # solution
    idx <- which.max(my_tab$arima_forecast )
    print(my_tab$ticker[idx])

  91. Question

    In the same code used for the previous question, add a new list-column with the estimate of an ARMA(1,0)-GARCH(1,1) model for the returns of each stock. Add another column with the volatility forecast (standard deviation) at t + 1.

    By dividing the expected return calculated in the previous item by the expected risk, we have a market direction index, where the stocks with the highest index value have the highest ratio of expected return to risk. Which stock is the most attractive, i.e., has the highest value of this index?


    Solution

    tickers <- yfR::yf_index_composition("DOW")$ticker
    first_date <- Sys.Date() - 3*365
    last_date  <- Sys.Date()
    
    df_stocks <- yfR::yf_get(tickers, 
                             first_date, 
                             last_date)
    
    library(dplyr)
    library(fGarch)
    
    tab_models <- df_stocks |>
      na.omit() |>
      group_by(ticker) |>
      do(my_garch = garchFit(formula = ~ arma(1,0) + garch(1,1), 
                             data = .$ret_adjusted_prices, 
                             trace = FALSE) ) 
    
    tab_models <- tab_models |>
      mutate(forecast_mean = predict(my_garch, 
                                     n.ahead = 1)$meanForecast[1],
             forecast_sd = predict(my_garch, 
                                   n.ahead = 1)$standardDeviation[1],
             sharpe_index = forecast_mean/forecast_sd)
    
    glimpse(tab_models)
    
    # solution
    idx <- which.max(tab_models$sharpe_index)
    print(tab_models$ticker[idx])

  92. Question

    Using the {afedR3} package, import the data from file CH11_grunfeld.csv and create a descriptive table of the variables. This table should provide enough information for the reader to understand the data. Use the {xtable} package to report it in LaTeX or Word/Writer format.


    Solution

    library(dplyr)
    
    my_f <-  afedR3::data_path('CH11_grunfeld.csv')
    
    df_grunfeld <- readr::read_csv(my_f, col_types = readr::cols())
    
    my_tab <- tibble(nrows = nrow(df_grunfeld),
                     ncol = ncol(df_grunfeld),
                     n_companies = length(unique(df_grunfeld$firm)),
                     n_years = length(unique(df_grunfeld$year)))
    
    my_tab
    
    library(xtable)
    
    # save LaTeX table to a temp file
    my_tex_file <- tempfile(pattern = 'table', fileext = '.tex')
    print(xtable(my_tab), file = my_tex_file)
    
    readr::read_lines(my_tex_file)
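    The exercise also mentions a Word/Writer format. One option, using the html driver of {xtable}, is to export the table as an html file, which Word or Writer can open directly:

    # save as html, which Word/Writer can open
    my_html_file <- tempfile(pattern = 'table', fileext = '.html')
    print(xtable(my_tab), type = 'html', file = my_html_file)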

  93. Question

    Using the yfR::yf_index_composition function, select 4 random stocks from the SP500 index with set.seed(5) and download the adjusted price data for the last three years. Estimate an ARIMA(1, 0, 1) model for each stock and report the result on the R screen with the texreg::screenreg function.


    Solution

    set.seed(5)
    
    library(dplyr)
    
    tickers <- sample(yfR::yf_index_composition("SP500")$ticker, 4)
    first_date <- Sys.Date() - 3*365
    last_date  <- Sys.Date()
    
    df_stocks <- yfR::yf_get(tickers, 
                             first_date, 
                             last_date)
    
    my_tab <- df_stocks |>
      na.omit() |>
      group_by(ticker) |>
      do(my_arima = arima(x = .$ret_adjusted_prices, 
                          order = c(1, 0, 1))) 
    
    glimpse(my_tab)
    
    # solution
    library(texreg)
    
    screenreg(my_tab$my_arima)

  94. Question

    Create a new Rmarkdown report from previous two exercises. Compile the report in html and open it in your browser.


    Solution

    See section @ref(creating-rmarkdown) – Creating Reports with RMarkdown – from the same chapter.
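    For reference, a minimal sketch of the compile step, assuming the report was saved as my_report.Rmd (a placeholder name):

    library(rmarkdown)
    
    # compile the Rmarkdown file to html
    out_html <- render(input = 'my_report.Rmd', 
                       output_format = 'html_document')
    
    # open the result in the browser
    browseURL(out_html)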


  95. Question

    Consider the following code:

    library(forecast)
    
    ticker <- '^GSPC'
    df_prices <- yfR::yf_get(ticker, 
                             '2010-01-01')
    
    my_arima <- forecast::auto.arima(df_prices$ret_adjusted_prices)
    summary(my_arima)

    Use the Rprof and profvis functions to identify the bottleneck of the code. Which line number takes the longest execution time?


    Solution

    library(forecast)
    library(profvis)
    
    # set temporary file for Rprof results
    profiling_file <-  tempfile(pattern = 'profiling_exercise', 
                                fileext = '.out')
    
    # profile the code with Rprof
    Rprof(filename = profiling_file)
    
    ticker <- '^GSPC'
    df_prices <- yfR::yf_get(ticker, 
                             '2010-01-01')
    
    my_arima <- forecast::auto.arima(df_prices$ret_adjusted_prices)
    summary(my_arima)
    
    # stop Rprof and summarise where time was spent
    Rprof(NULL)
    print(summaryRprof(profiling_file))
    
    # profile the same code with profvis
    # (profvis uses Rprof internally, so it must run separately)
    profiling <- profvis(expr = {
      
      ticker <- '^GSPC'
      df_prices <- yfR::yf_get(ticker, 
                             '2010-01-01')
    
      my_arima <- forecast::auto.arima(df_prices$ret_adjusted_prices)
      summary(my_arima)
    })
    
    # create visualization
    temp_html <- tempfile(pattern = 'profile',
                          fileext = '.html')
    
    htmlwidgets::saveWidget(profiling, temp_html)
    
    # open in browser from R
    browseURL(temp_html)

  96. Question

    Use the Rcpp package to write a C++ function that adds the elements of two numeric vectors. The output must be another vector of the same size, with elements equivalent to the x + y operation. Use the identical function to test that all elements of both vectors are equal.


    Solution

    library(Rcpp)
    
    cppFunction('Rcpp::NumericVector sum_vectors_C(NumericVector x, NumericVector y) {
      int n = x.size();
      
      Rcpp::NumericVector total(x.size());
      
      for(int i = 0; i < n; ++i) {
        total[i] = x[i] + y[i];
      }
      return total;
    }')
    
    x <- runif(100)
    y <- runif(100)
    
    sol_C <- sum_vectors_C(x, y)
    sol_R <- x + y
    
    identical(sol_C, sol_R)

  97. Question

    Use the {tictoc} package to compare the performance of the previous function with the native operator +, and a loop-based version with the pre-allocation of the result vector. Which alternative has the shortest execution time and why? Does the Rcpp version beat the loop version?


    Solution

    The fastest way to add vectors is the native operator +, which is vectorized and runs optimized compiled code. The Rcpp version usually beats the explicit R loop, but neither should outperform the native operator.

    library(Rcpp)
    
    cppFunction('Rcpp::NumericVector sum_vectors_C(NumericVector x, NumericVector y) {
      int n = x.size();
      
      Rcpp::NumericVector total(x.size());
      
      for(int i = 0; i < n; ++i) {
        total[i] = x[i] + y[i];
      }
      return total;
    }')
    
    x <- runif(100)
    y <- runif(100)
    
    sol_C <- sum_vectors_C(x, y)
    sol_R <- x + y
    
    identical(sol_C, sol_R)
    
    library(tictoc)
    
    tic('Using Rcpp')
    sol_C <- sum_vectors_C(x, y)
    toc()
    
    tic('Using base R')
    sol_R <- x + y
    toc()
    
    tic('Using a loop and preallocation')
    sol_loop <- numeric(length = length(x))
    for (i in seq_along(x)) {
      sol_loop[i] <- x[i] + y[i]
    }
    toc()

  98. Question

    Use the {memoise} package to create a memoised version of the Quandl::Quandl function. Use the new function to import data about the United States Consumer Price Index (quandl code 'FRED/DDOE01USA086NWDB'). What is the percentage speed gain from the second call to the memoised version?


    Solution

    library(Quandl)
    library(memoise)
    library(tictoc)
    
    mem_quandl <- memoise(f = Quandl, cache = cache_memory())
    
    id <- 'FRED/DDOE01USA086NWDB'
    
    tic('Using original Quandl')
    df <- Quandl(code = id)
    toc()
    
    tic('Using memoise version (first call)')
    df <- mem_quandl(code = id)
    toc()
    
    tic('Using memoise version (second call)')
    df <- mem_quandl(code = id)
    toc()
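    The timings above are printed but the percentage gain is not computed explicitly. A minimal sketch using system.time, after clearing the cache so that the first call is not memoised:

    # clear the cache so the first call hits the Quandl API again
    memoise::forget(mem_quandl)
    
    time_first  <- system.time(mem_quandl(code = id))['elapsed']
    time_second <- system.time(mem_quandl(code = id))['elapsed']
    
    # percentage speed gain of the second (cached) call
    gain_pct <- 100*(time_first - time_second)/time_first
    print(gain_pct)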