4 Importing Data from Local Files
In this chapter we’ll learn to import and export data available as local files on the computer. Although the task is not particularly difficult, a data analyst should understand the different characteristics of file formats and how to take advantage of them. While some data formats are best suited for sharing and collaboration, others can offer a significant boost in reading and writing speed.
Here we will cover a comprehensive list of data file formats in R, including:
- text data with comma-separated values (csv files);
- spreadsheet data from Excel (xlsx files);
- R native data files (RData and rds files);
- Lightning Fast Serialization of data frames (FST) format (fst files);
- SQLite;
- unstructured text data.
The packages and functions presented in this chapter are sufficient for getting most of the work done. Nevertheless, it is worth mentioning that R can also import data from other software such as SPSS, Stata and Matlab, among many others. If that is your case, I suggest a thorough study of package {foreign} (R Core Team 2023a).
4.1 The path of local files
The first lesson in importing data from local files is that the location of the file must be explicitly stated in the code. The path of the file is then passed to a function that will read it. The best way to work with paths is to use the autocomplete feature of RStudio (see section 2.19). An example of a full path is:
my_file <- 'C:/My Research/data/price-data.csv'
Note the use of forward slashes (/) to designate the file directory. Relative references also work, as in:
my_file <- 'data/price-data.csv'
Here, it is assumed that, in the current working folder, there is a directory called data and, inside of it, a file called price-data.csv. If the file path is simply its name, as in my_file <- 'price-data.csv', it is implicitly assumed that the file is located in the root of the working directory. From the previous chapter, recall that you can use setwd() to change the working folder to where the work is being done, or simply create and use the “Projects” feature of RStudio, which automatically sets the working directory to the location of the .Rproj file.
I again reinforce the use of tab and the autocomplete tool of RStudio. It is much easier and more practical to find files on your computer’s hard disk using tab navigation than to copy and paste the address from your file explorer. To use it, open single or double quotes in RStudio, place the text cursor between the quotes, and press tab.
Going further, another very important point is that the data from the file will be imported and exported as an object of type dataframe. That is, a table contained in an Excel or .csv file will become a dataframe object in R. When we export data, the most common format is this same type of object. Conveniently, dataframes are nothing more than tables, with rows and columns.
Each column in the dataframe will have its own class, the most common being numeric (numeric), text (character), factor (factor) and date (Date). When importing the data, it is imperative that each column is represented in the correct class. A vast amount of errors can be avoided by simply checking the column classes in the dataframe resulting from the import process. For now, we only need to understand this basic property of dataframes. We will study the details of this object in chapter 6.
4.2 csv files
Consider a data file called CH04_SP500.csv, available from the book package. It contains daily closing prices of the SP500 index from 2010-01-04 until 2022-12-27. We will now use package {afedR3} (M. S. Perlin 2023a) for finding the file and copying it to your local folder. If you followed the instructions in the book preface chapter, you should have package {afedR3} (M. S. Perlin 2023a) already installed.
Once you install package {afedR3} (M. S. Perlin 2023a), file CH04_SP500.csv and all other data files used in the book are downloaded from Github. The package also includes functions for facilitating the reproduction of code examples. Command afedR3::data_path() will return the local path of a data file by looking up its name.
Let’s copy CH04_SP500.csv to your “My Documents” folder with the following code, using the tilde (~) shortcut:
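A minimal sketch of that copy step (the original code chunk is omitted here; the fs::file_copy() call and the destination path are assumptions for illustration):
# copy the file from the book package to "My Documents"
fs::file_copy(path = afedR3::data_path('CH04_SP500.csv'),
              new_path = '~/CH04_SP500.csv',
              overwrite = TRUE)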
Now, if it is your first time working with .csv files, use a file browser (“File Explorer” in Windows) and open CH04_SP500.csv in the “My Documents” folder with any text editor software such as Notepad. The first line of CH04_SP500.csv, also called the header line, shows the column names. Following international standards, rows are set using line breaks, and all columns are separated by commas (,). Next we show the textual content of the first ten lines of CH04_SP500.csv:
R> ref_date,price_close
R> 2010-01-04,1132.98999
R> 2010-01-05,1136.52002
R> 2010-01-06,1137.140015
R> 2010-01-07,1141.689941
R> 2010-01-08,1144.97998
R> 2010-01-11,1146.97998
R> 2010-01-12,1136.219971
R> 2010-01-13,1145.680054
R> 2010-01-14,1148.459961
The data in CH04_SP500.csv is organized in a standard way, and we should have no problem importing it in R. However, you should be aware this is not always the case. So, if you want to avoid the most common issues when importing data from csv files, I suggest you follow these steps:
- Check the existence of text before the actual data. A standard .csv file will only have the contents of a table but, sometimes, you will find a header text with metadata (extra information about the dataset). If necessary, you can control how many lines to skip in the csv reading function;
- Verify the existence of names for all columns, and whether those names are readable;
- Check the symbol for column separation. Normally it is a comma, but you never know for sure;
- For numerical data, verify the decimal symbol. R will expect it to be a dot. If necessary, you can adjust this information in the code, making sure R knows how to correctly parse numerical data;
- Check the encoding of the text file. Normally it is one of UTF-8, Latin1 (ISO-8859) or Windows 1252. These are broad encoding formats and should suffice for most cases. Whenever you find strange symbols in the text columns of the resulting dataframe, the problem is probably due to a mismatch between the encoding of the file and the one R expects. Windows users can check the encoding of a text file by opening it in Notepad++. The encoding format is shown in the bottom right corner of the Notepad++ editor. Linux and Mac users can find the same information in any advanced text editor such as Kate.
Whenever you find an unexpected text structure in a .csv file, use the arguments of the csv reading function to import the information correctly. As a rule of thumb, never modify raw data manually. It is far more efficient to use R code to deal with the different structures of .csv files. It takes a bit of work, but such a policy will save you a lot of time in the future since, in a couple of months, you are unlikely to remember how you manually cleaned that csv file to import it more easily into your R session.
4.2.1 Importing Data
The {base} (R Core Team 2023b) package includes a native function called read.csv() for importing data from .csv files. However, we will prefer the {tidyverse} (Wickham 2023) alternative, readr::read_csv(), as it is more efficient and easier to work with. In short, the benefit of using readr::read_csv() is that it reads the data very quickly, with clever internal rules for defining the classes of imported columns.
This is the first package from the {tidyverse} (Wickham 2023) that we will use. Before doing so, it is necessary to install it in your R session. A simple way of installing all {tidyverse} (Wickham 2023) packages as a bundle is as follows:
install.packages('tidyverse')
After running the previous code, all {tidyverse} (Wickham 2023) packages will be installed on your computer. Once it finishes, let’s load it.
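The loading step is a single call (a sketch of the omitted chunk):
# load the tidyverse bundle
library(tidyverse)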
Back to importing data from .csv files, to load the contents of file CH04_SP500.csv in R, use the readr::read_csv() function.
# set file to read
my_f <- afedR3::data_path('CH04_SP500.csv')
# read file
my_df_sp500 <- readr::read_csv(my_f)
# print it
print(head(my_df_sp500))
R> # A tibble: 6 × 2
R> ref_date price_close
R> <date> <dbl>
R> 1 2010-01-04 1133.
R> 2 2010-01-05 1137.
R> 3 2010-01-06 1137.
R> 4 2010-01-07 1142.
R> 5 2010-01-08 1145.
R> 6 2010-01-11 1147.
As previously mentioned, the contents of the imported file become a dataframe object in R, and each column of a dataframe has a class. We can check the classes of object my_df_sp500 using function dplyr::glimpse():
# Check the content of dataframe
dplyr::glimpse(my_df_sp500)
R> Rows: 3,269
R> Columns: 2
R> $ ref_date <date> 2010-01-04, 2010-01-05, 2010-01-06, 2…
R> $ price_close <dbl> 1132.99, 1136.52, 1137.14, 1141.69, 11…
Note that the column of dates – ref_date – was imported as a Date vector and the closing prices – price_close – as numeric (dbl, double precision). This is exactly what we expected. Internally, function readr::read_csv() identifies column classes according to their content.
When calling readr::read_csv() without extra arguments, the function will print the classes of each column in the prompt. Internally, the function sets the attributes of the columns by reading the first 1000 lines of the file. Column ref_date was imported with the Date class and column price_close was imported as double (dbl). We can use this information in our own code by copying the text and assigning it to a variable. Have a look:
# set cols from import message
my_cols <- readr::cols(
ref_date = readr::col_date(),
price_close = readr::col_double()
)
# read file with readr::read_csv
my_df_sp500 <- readr::read_csv(my_f, col_types = my_cols)
This time, no message was sent to the prompt. As an exercise, let’s import the same data, but using a character class for both columns:
# set cols from import message
my_cols <- readr::cols(
ref_date = readr::col_character(),
price_close = readr::col_character()
)
# read file with readr::read_csv
my_df_sp500 <- readr::read_csv(my_f, col_types = my_cols)
# glimpse the dataframe
dplyr::glimpse(my_df_sp500)
R> Rows: 3,269
R> Columns: 2
R> $ ref_date <chr> "2010-01-04", "2010-01-05", "2010-01-0…
R> $ price_close <chr> "1132.98999", "1136.52002", "1137.1400…
As expected, both columns are now of class character.
Going further, readr::read_csv() has several other input options, such as:
- changing the format of the imported data, including symbols for decimal places and encoding (locale option);
- changing column names (argument col_names);
- skipping n lines before importation (skip option);
- custom definition of NA values (na option).
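To illustrate a few of these arguments, here is a hedged sketch (the new column names and NA codes are made up for illustration) that re-imports the SP500 file while renaming its columns and defining custom NA values:
# re-import with custom (hypothetical) column names
my_df_custom <- readr::read_csv(
  my_f,
  col_names = c('date', 'close'), # new column names
  skip = 1,                       # skip the original header line
  na = c('', 'NA', '-'),          # strings to treat as missing
  col_types = readr::cols()
)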
As an example, let’s study the case of a messy .csv file. In the book package we have a file called funky_csv_file.csv where:
- the header has textual information;
- columns are separated by semicolons;
- the comma is used as the decimal separator;
- the text contains Latin characters.
The first 10 lines of the file contain the following content:
R> Example of funky file:
R> - columns separated by ";"
R> - decimal points as ","
R>
R> Data build in 2022-12-28
R> Origin: www.funkysite.com.br
R>
R> ID;Race;Age;Sex;Hour;IQ;Height;Died
R> 001;White;80;Male;00:00:00;92;68;FALSE
R> 002;Hispanic;25;Female;00:00:00;99;68;TRUE
We have a header text up to line 7, and the columns are separated by a semicolon (“;”). When importing the data with standard (and wrong) options, the table will not be parsed correctly.
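A sketch of that naive call (the path-setting line is an assumption, as the original chunk is omitted):
# set path of the funky file
my_f <- afedR3::data_path('funky_csv_file.csv')
# naive read with default options
df_funky <- readr::read_csv(my_f, col_types = readr::cols())
# check the result
dplyr::glimpse(df_funky)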
The resulting glimpse output would show that only one column of class character was read from the file. This happens because function readr::read_csv() expects a comma as the column separator and actual data (not a text header) in the first lines of the file. To solve it, we need several input arguments that handle the particularities of the file:
df_not_funky <- readr::read_delim(
file = my_f, # path of file
skip = 7, # how many lines to skip
delim = ';', # symbol for column separator
col_types = readr::cols(), # column types
locale = readr::locale(decimal_mark = ',') # locale
)
dplyr::glimpse(df_not_funky)
R> Rows: 100
R> Columns: 8
R> $ ID <chr> "001", "002", "003", "004", "005", "006", "…
R> $ Race <chr> "White", "Hispanic", "Asian", "White", "Whi…
R> $ Age <dbl> 80, 25, 25, 64, 76, 89, 33, 61, 23, 59, 80,…
R> $ Sex <chr> "Male", "Female", "Male", "Male", "Female",…
R> $ Hour <time> 00:00:00, 00:00:00, 00:00:00, 00:00:00, 00…
R> $ IQ <dbl> 92, 99, 98, 105, 109, 84, 109, 109, 99, 126…
R> $ Height <dbl> 68, 68, 69, 69, 67, 73, 65, 72, 70, 66, 63,…
R> $ Died <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE…
Note that the data has now been correctly imported, with the correct column classes. For that, we use the alternative function readr::read_delim() with custom inputs. Package {readr} (Wickham, Hester, and Bryan 2023) also provides several other functions for specific import situations.
4.2.2 Exporting Data
To write a .csv file, use the readr::write_csv() function. First, we create a new dataframe with some random data:
# set the number of rows
N <- 100
# set dataframe
my_df <- data.frame(y = runif(N),
z = rep('a', N))
# print it
print(head(my_df))
R> y z
R> 1 0.2633830 a
R> 2 0.3216680 a
R> 3 0.0993957 a
R> 4 0.4368764 a
R> 5 0.8653924 a
R> 6 0.8474320 a
And now we use readr::write_csv() to save it in a new (and temporary) .csv file:
# set file out
f_out <- fs::file_temp(ext = 'csv')
# write to files
readr::write_csv(x = my_df,
file = f_out)
In the previous example, we saved the object my_df into a temporary file with path file4f562c7c56ad.csv. We can read it back and check its contents using readr::read_csv() once again:
# read it
my_df_imported <- readr::read_csv(f_out)
# print first five rows
print(head(my_df_imported))
R> # A tibble: 6 × 2
R> y z
R> <dbl> <chr>
R> 1 0.263 a
R> 2 0.322 a
R> 3 0.0994 a
R> 4 0.437 a
R> 5 0.865 a
R> 6 0.847 a
As we can see, the data imported from the file is identical to the one created in the other code chunk.
4.3 Excel Files (xlsx)
Although it is not an efficient or portable data storage format, Microsoft Excel is popular software in Finance and Economics due to its spreadsheet capabilities. It is not uncommon for data to be stored and distributed in this format.
The downside of using Excel files for storing data is their low portability and the longer reading and writing times. This may not be a problem for small tables, but when handling a large volume of data, using Excel files can be very frustrating. If you can, my advice is to avoid Excel files in your work cycle.
4.3.1 Importing Data
R does not have a native function for importing Excel files. Therefore, we must install and use packages to perform this operation. There are several good options including {XLConnect} (Mirai Solutions GmbH 2023), {xlsx} (Dragulescu and Arendt 2020), {readxl} (Wickham and Bryan 2023) and {tidyxl} (Garmonsway 2023).
Despite their similar goals, each package has its peculiarities. If reading Excel files is important to your work, I strongly advise the study of each package. For example, package {tidyxl} (Garmonsway 2023) was specially designed to read unstructured Excel files, where the desired information is not contained in a tabular format. Alternatively, package {XLConnect} (Mirai Solutions GmbH 2023) allows the user to open a live connection and, from R, control an Excel file, making it possible to export and send data, format cells, and so on.
In this section, we will give priority to package {readxl} (Wickham and Bryan 2023), one of the most straightforward packages to interact with Excel files. Conveniently, it does not require the installation of external software.
Let’s start with an example. Consider a file called CH04_SP500-Excel.xlsx that contains the same SP500 data from the previous section. We can import the information from the file using function readxl::read_excel():
# set excel file
my_f <- afedR3::data_path('CH04_SP500-Excel.xlsx')
# read excel file
my_df <- readxl::read_excel(my_f, sheet = 'Sheet1')
# print with head (first five rows)
dplyr::glimpse(my_df)
R> Rows: 3,269
R> Columns: 2
R> $ ref_date <dttm> 2010-01-04, 2010-01-05, 2010-01-06, 2…
R> $ price_close <dbl> 1132.99, 1136.52, 1137.14, 1141.69, 11…
One of the benefits of using Excel files is that the columns’ classes are directly inherited from the file. If the column classes are correct in the Excel file, then they will automatically be correct in R. In our case, the date column of file CH04_SP500-Excel.xlsx was correctly set as a dttm object, a special type of DateTime class. Likewise, even if the Excel file used commas for decimals, the import process would still succeed, as the conversion is handled internally.
4.3.2 Exporting Data
Exporting a dataframe to an Excel file is also easy. Again, no native function in R performs this procedure. We can, however, use package {writexl} (Ooms 2023):
# set number of rows
N <- 25
# set xlsx output file
f_out <- fs::file_temp(ext = 'xlsx')
# create df
my_df_A <- data.frame(y = seq(1, N),
                      z = rep('a', N))
# write to xlsx
writexl::write_xlsx(
  x = my_df_A,
  path = f_out
)
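As a quick sanity check, using only functions already presented in this chapter, we can read the file back:
# read the xlsx back to verify the export
my_df_check <- readxl::read_excel(f_out)
dplyr::glimpse(my_df_check)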
4.4 RData and rds Files
R offers two native formats to write objects to a local file: RData and rds. The benefit of using both is that the saved files are compact and very fast to access. The downside is low portability, i.e., it is difficult to use the files in other software.
The difference between RData and rds is that the former can save many R objects in a single file, while the latter saves only one. This, however, is not a hard restriction for the rds format, as we can incorporate several objects into a single one using lists. In practice, a rds file can store as many objects as needed. We will learn more about lists in chapter 6.
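As a minimal sketch of that list trick (the example objects below are arbitrary), we can bundle several objects into one list and write a single rds file:
# bundle several objects into a single list
my_list <- list(x = 1:100,
                df = data.frame(y = runif(10)),
                name = 'example')
# save them all in one rds file
readr::write_rds(my_list, fs::file_temp(ext = 'rds'))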
4.4.1 Importing Data
To create a new .RData file, use the save() function. See the following example, where we create an object and save it to disk in both the .RData and .rds formats:
# set a object
my_x <- 1:100
# set name of RData file
my_file_rdata <- fs::file_temp(ext = 'RData')
my_file_rds <- fs::file_temp(ext = 'rds')
# save it in RData
save(list = c('my_x'), file = my_file_rdata)
# save it in rds
readr::write_rds(my_x, my_file_rds)
We can verify the existence of the file with the fs::file_exists() function:
# check if file exists
fs::file_exists(my_file_rdata)
R> /tmp/RtmpCpmYn0/file4f566f904d6a.RData
R> TRUE
As expected, the file is available.
Now, to import data from .rds files, we use function readr::read_rds():
# load content into workspace
my_x_2 <- readr::read_rds(file = my_file_rds)
Comparing the code for using .RData and .rds files, note that the .rds format allows the explicit definition of the output object: the contents of my_file_rds are saved in my_x_2. When we use the load() function for RData files, we cannot name the output directly. This is particularly inconvenient when you need to modify the name of the imported object.
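If you must stick with RData files, a workaround (shown here as a minimal sketch using only base R) is to load the file into a separate environment and then pick the objects from it under any name you like:
# load the RData file into its own, isolated environment
my_env <- new.env()
load(my_file_rdata, envir = my_env)
# assign the imported object to whatever name we want
my_x_3 <- my_env$my_x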
As a suggestion, use the .rds format to write and read R data. Its use is more practical, resulting in cleaner code. The difference in speed between one and the other is minimal. The benefit of saving multiple objects into the same RData file becomes irrelevant when using list objects, which can incorporate other objects into their content.
4.4.2 Exporting Data
We can create a new RData file with command save():
# set vars
my_x <- 1:100
my_y <- 1:100
# write to RData
my_file <- fs::file_temp(ext = 'RData')
save(list = c('my_x', 'my_y'),
file = my_file)
We can again check if the file exists:
fs::file_exists(my_file)
R> /tmp/RtmpCpmYn0/file4f563e8814aa.RData
R> TRUE
The result is TRUE, as expected.
As for .rds files, we save them with function readr::write_rds():
# set data and file
my_x <- 1:100
my_file <- fs::file_temp(ext = 'rds')
# save as .rds
readr::write_rds(x = my_x,
file = my_file)
# read it
my_x2 <- readr::read_rds(file = my_file)
# test equality
print(identical(my_x, my_x2))
R> [1] TRUE
Command identical() tests whether both objects are equal. Again, as expected, we find the result to be TRUE.
4.5 fst files
The fst format, implemented in R package {fst} (Klik 2022), is specially designed to enable quick writing and reading of tabular data with minimal disk space. Using this format is particularly beneficial when working with large databases on powerful computers. The trick here is the use of all of the computer’s cores to import and export data, while all other formats use only one. If you have a computer with several cores, the gain in speed is impressive, as we will soon learn.
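As a side note, the number of threads used by {fst} can be queried and set with fst::threads_fst(); the value 4 below is an arbitrary example:
# check how many threads {fst} is currently using
fst::threads_fst()
# optionally restrict the number of threads (arbitrary value)
fst::threads_fst(nr_of_threads = 4)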
4.5.1 Importing Data
Using the fst file format is similar to the previous cases. We use function fst::read_fst() to read files:
# set file location
my_file <- afedR3::data_path('CH04_example-fst.fst')
# read fst file
my_df <- fst::read_fst(my_file)
R> fstcore package v0.9.18
R> (OpenMP detected, using 16 threads)
# check contents
dplyr::glimpse(my_df)
R> Rows: 100
R> Columns: 8
R> $ ID <chr> "001", "002", "003", "004", "005", "006", "…
R> $ Race <fct> Black, White, Hispanic, Black, White, White…
R> $ Age <int> 33, 35, 23, 87, 65, 51, 58, 67, 22, 52, 52,…
R> $ Sex <fct> Male, Female, Male, Female, Male, Male, Fem…
R> $ Hour <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000…
R> $ IQ <dbl> 108, 108, 85, 106, 92, 92, 88, 100, 86, 80,…
R> $ Height <dbl> 72, 63, 77, 72, 71, 74, 64, 69, 63, 72, 70,…
R> $ Died <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE…
As with the other cases, the data from file CH04_example-fst.fst is available in the workspace.
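A related benefit, shown here as a sketch reusing the same file, is that fst::read_fst() can retrieve a subset of columns and rows without loading the whole file into memory:
# read only two columns and the first 50 rows
my_df_subset <- fst::read_fst(my_file,
                              columns = c('ID', 'Age'),
                              from = 1,
                              to = 50)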
4.5.2 Exporting Data
We use function fst::write_fst() to save dataframes in the fst format:
# create dataframe
N <- 1000
my_file <- fs::file_temp(ext = 'fst')
my_df <- data.frame(x = runif(N))
# write to fst
fst::write_fst(x = my_df, path = my_file)
4.5.3 Timing the fst format
As a test of the potential of the fst format, we will now time the read and write operations of fst and rds for a large table: 5,000,000 rows and 2 columns. We will also report the size of the resulting files.
# set number of rows
N <- 5000000
# create random dfs
my_df <- data.frame(y = seq(1,N),
z = rep('a',N))
# set files
my_file_1 <- fs::file_temp(ext = 'rds')
my_file_2 <- fs::file_temp(ext = 'fst')
# test write
time_write_rds <- system.time(readr::write_rds(my_df, my_file_1 ))
time_write_fst <- system.time(fst::write_fst(my_df, my_file_2 ))
# test read
time_read_rds <- system.time(readr::read_rds(my_file_1))
time_read_fst <- system.time(fst::read_fst(my_file_2))
# test file size (MB)
file_size_rds <- file.size(my_file_1)/1000000
file_size_fst <- file.size(my_file_2)/1000000
And now we check the results:
# results
my_formats <- c('.rds', '.fst')
results_read <- c(time_read_rds[3], time_read_fst[3])
results_write<- c(time_write_rds[3], time_write_fst[3])
results_file_size <- c(file_size_rds , file_size_fst)
# print text
my_text <- paste0('\nTime to WRITE dataframe with ',
my_formats, ': ',
results_write, ' seconds', collapse = '')
message(my_text)
R>
R> Time to WRITE dataframe with .rds: 0.559999999999999 seconds
R> Time to WRITE dataframe with .fst: 0.0910000000000011 seconds
my_text <- paste0('\nTime to READ dataframe with ',
my_formats, ': ',
results_read, ' seconds', collapse = '')
message(my_text)
R>
R> Time to READ dataframe with .rds: 0.632000000000001 seconds
R> Time to READ dataframe with .fst: 0.118000000000002 seconds
my_text <- paste0('\nResulting FILE SIZE for ',
my_formats, ': ',
results_file_size, ' MBs', collapse = '')
message(my_text)
R>
R> Resulting FILE SIZE for .rds: 65.000177 MBs
R> Resulting FILE SIZE for .fst: 14.791938 MBs
The difference in performance is impressive! The fst format not only reads and writes faster but also results in smaller file sizes. Be aware, however, that this result was obtained on the sixteen-core computer on which the book was compiled. You may not be able to replicate it on a more modest machine.
Due to the use of all the computer’s cores, the fst format is highly recommended when working with large data on a powerful computer. Not only will the resulting files be smaller, but the writing and reading process will be considerably faster.
4.6 SQLite Files
The use of .csv or .rds files for storing objects has its limits as the volume of data increases. If you are waiting a long time to read a dataframe from a file, or if you are only interested in a small portion of a large table, you should look for alternatives. Likewise, if you have access to the computer network in your institution, including databases, it is important to learn how to import the corporate data directly into your R session.
This brings us to the topic of database software. These programs usually work with a query language called SQL (Structured Query Language), which allows the user to read portions of the data and even manipulate it efficiently. Many database options integrate nicely with R, including MySQL, SQLite and MariaDB. Here, we will provide a quick tutorial on this topic using SQLite, which is the easiest one to work with. Unlike other database software, SQLite stores data and configurations in a single local file, without the need for a formal server.
4.6.1 Importing Data
Before moving to the examples, we need to understand how to use database software. First, R connects to the database and returns a connection object. Based on this connection, we send queries for manipulating data using the SQL language. The main advantage is that we can have a large database of, let’s say, 10 GB, and load only a small portion of it in R. This operation is also very quick, allowing efficient access to the available tables.
Assuming the existence of an SQLite file on the computer, we can import its tables with package {RSQLite} (Müller et al. 2023):
# set name of SQLITE file
f_sqlite <- afedR3::data_path('CH04_example-sqlite.SQLite')
# open connection
my_con <- RSQLite::dbConnect(
drv = RSQLite::SQLite(),
f_sqlite)
# read table
my_df <- RSQLite::dbReadTable(conn = my_con,
name = 'MyTable1') # name of table in sqlite
# print with glimpse
dplyr::glimpse(my_df)
R> Rows: 1,000
R> Columns: 2
R> $ x <dbl> 0.24645265, 0.47972926, 0.36782192, 0.09451063, …
R> $ G <chr> "B", "B", "A", "A", "A", "A", "A", "B", "B", "A"…
It worked. The dataframe from table MyTable1 is exactly as expected.
Another example of using SQLite involves actual SQL statements. Notice that, in the previous code, we used function RSQLite::dbReadTable() to get the contents of all rows in table MyTable1. Now, let’s use an SQL command to get from MyTable2 only the rows where column G is equal to 'A'.
# set sql statement
my_SQL_statement <- "select * from myTable2 where G='A'"
# get query
my_df_A <- RSQLite::dbGetQuery(conn = my_con,
statement = my_SQL_statement)
# disconnect from db
RSQLite::dbDisconnect(my_con)
# print with glimpse
dplyr::glimpse(my_df_A)
R> Rows: 522
R> Columns: 2
R> $ x <dbl> 0.9263521, 0.1304614, 0.1758632, 0.2323157, 0.84…
R> $ G <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"…
It also worked, as expected.
In this simple example, we can see how easy it is to create a connection to a database, retrieve tables, and disconnect. If you have to work with large tables (which, in my opinion, means any database that occupies more than 4 GB of your computer’s memory), it is worth moving the data to proper database software. You’ll be able to retrieve data faster, without the need to load the whole database into the computer’s memory. If you have a server available in your workplace, I strongly advise learning how to connect to it and use the SQL language to your advantage.
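When exploring an unfamiliar database, it also helps to list the available tables before reading anything. A minimal sketch, reusing the connection objects from this section:
# open connection and list available tables
my_con <- RSQLite::dbConnect(RSQLite::SQLite(), f_sqlite)
RSQLite::dbListTables(my_con)
# always disconnect when done
RSQLite::dbDisconnect(my_con)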
4.6.2 Exporting Data
As an example of exporting data to an SQLite file, let’s first create an SQLite database. For that, we will set two large dataframes with random data and save both in an SQLite file.
# set number of rows in df
N <- 10^6
# create simulated dataframes
my_large_df_1 <- data.frame(x = runif(N),
                            G = sample(c('A', 'B'),
                                       size = N,
                                       replace = TRUE))
my_large_df_2 <- data.frame(x = runif(N),
                            G = sample(c('A', 'B'),
                                       size = N,
                                       replace = TRUE))
# set path of SQLITE file
f_sqlite <- fs::file_temp(ext = 'SQLITE')
# open connection
my_con <- RSQLite::dbConnect(drv = RSQLite::SQLite(), f_sqlite)
# write df to sqlite
RSQLite::dbWriteTable(conn = my_con, name = 'MyTable1',
value = my_large_df_1)
RSQLite::dbWriteTable(conn = my_con, name = 'MyTable2',
value = my_large_df_2)
# disconnect
RSQLite::dbDisconnect(my_con)
The TRUE output of RSQLite::dbWriteTable() indicates that everything went well. A connection was opened using function RSQLite::dbConnect(), and both dataframes were written to an SQLite file, called file4f561fa64fbb.SQLITE.
4.7 Unstructured Data and Other Formats
Another example of data importation is the case of reading and processing unstructured text files. If none of the previous packages can read the data, then it must be parsed line by line. Let’s explore this problem.
4.7.1 Importing Data
You can read the contents of a text file with function readr::read_lines() :
# set file to read
my_f <- afedR3::data_path('CH04_price-and-prejudice.txt')
# read file line by line
my_txt <- readr::read_lines(my_f)
# print 50 characters of first fifteen lines
print(stringr::str_sub(string = my_txt[1:15],
start = 1,
end = 50))
R> [1] " [Illustration:"
R> [2] ""
R> [3] " GEORGE ALLEN"
R> [4] " PUBLISHER"
R> [5] ""
R> [6] " 156 CHARING CROSS ROAD"
R> [7] " LONDON"
R> [8] ""
R> [9] " RUSKIN HOUSE"
R> [10] " ]"
R> [11] ""
R> [12] " [Illustration:"
R> [13] ""
R> [14] " _Reading Jane’s Letters._ _Cha"
R> [15] " ]"
In this example, file CH04_price-and-prejudice.txt contains the whole content of the book Pride and Prejudice by Jane Austen, freely available in the Gutenberg project and downloaded with package {gutenbergr} (Johnston and Robinson 2023). We imported the entire content of the file as a character vector named my_txt. Each element of my_txt is a line from the raw text file. Based on it, we can check the number of lines in the book, and also the number of times that the name 'Bennet', one of the protagonists, appears in the text:
# count number of lines
n_lines <- length(my_txt)
# set target text
name_to_search <- 'Bennet'
# set function for counting occurrences
fct_count_bennet <- function(str_in, target_text) {
  # str_locate_all() returns a two-column matrix (start, end) of matches,
  # so the number of matches is the number of rows
  n_words <- nrow(
    stringr::str_locate_all(string = str_in,
                            pattern = target_text)[[1]])
  return(n_words)
}
# use fct for all lines of Pride and Prejudice
n_times <- sum(sapply(X = my_txt,
FUN = fct_count_bennet,
target_text = name_to_search))
# print results
my_msg <- paste0('The number of lines found in the file is ',
n_lines, '.\n',
'The word "', name_to_search, '" appears ',
n_times, ' times in the book.')
message(my_msg)
R> The number of lines found in the file is 14529.
R> The word "Bennet" appears 348 times in the book.
In the example, we used function sapply(), which allowed us to apply a function to each element of my_txt. We then searched for and counted the occurrences of the word “Bennet”.
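For completeness, a simpler and vectorized alternative (a sketch using stringr::str_count(), which counts pattern matches per string) achieves the same result:
# count occurrences of the target word in each line, then sum
n_times_alt <- sum(stringr::str_count(string = my_txt,
                                      pattern = name_to_search))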
4.7.2 Exporting Data
A typical case of exporting unstructured text is saving the log record of a procedure. This is quite simple: using function readr::write_lines(), set input file to the path of the local file and x to the actual textual content.
# set file
my_f <- 'data/temp.txt'
# set some string
my_text <- paste0('Today is ', Sys.Date(), '\n',
'Tomorrow is ', Sys.Date()+1)
# save string to file
readr::write_lines(x = my_text, file = my_f, append = FALSE)
In the previous example, we created a simple text object and saved it in data/temp.txt. We can check the result with the readr::read_lines() function:
print(readr::read_lines(my_f))
R> [1] "Today is 2023-12-13" "Tomorrow is 2023-12-14"
As we can see, it worked as expected.
4.8 How to Select a Data File Format
The choice of file format is an important topic and can actually be a time-saver in your work. My first and most important suggestion is to avoid the Excel format at all costs. As for the alternatives, we must consider three points in the decision:
- speed of reading and writing operations;
- size of the resulting file;
- compatibility with other software and operating systems.
Usually, the use of csv files easily satisfies these requirements. A csv file is nothing more than a text file that can be opened, viewed, and imported into any other statistical software, which makes it easy to share with other people. Also, the size of csv files is usually not restrictive and, if needed, the file can be compressed using the zip() function. For these reasons, the use of csv files for importing and exporting data is preferable in the vast majority of situations.
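As a side note, and as a hedged sketch, {readr} also compresses on the fly when the output path ends with a compression extension such as .gz:
# writing to a .csv.gz path compresses the file automatically
f_out_gz <- fs::file_temp(ext = 'csv.gz')
readr::write_csv(my_df, f_out_gz)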
However, there are cases where the speed of import and export operations matter. If you don’t mind giving up portability, the rds format is a great choice for most projects. If you have good hardware and execution speed with rds is still not great, then the best alternative is the fst format, which uses all cores to import and export data.
4.9 Exercises
Q.1
Create a dataframe with the following code:
library(dplyr)
my_N <- 10000
my_df <- tibble(x = 1:my_N,
y = runif(my_N))
Export the resulting dataframe to each of the four formats: csv, rds, xlsx, fst. Which of the formats took up the most space on the computer’s disk? Tip: file.size() calculates the size of files within R.
Q.2
Improve the previous code by measuring the execution time for saving the data in the different formats. Which file format resulted in the fastest execution for exporting data? Tip: use the system.time() function or the {tictoc} package to calculate the execution times.
Q.3
For the previous code, reset the value of my_N to 1000000. Does it change the answers to the last two questions?
Q.4
Use the afedR3::data_path() function to access the CH04_SP500.csv file in the book’s data repository. Import the contents of the file into R with function readr::read_csv(). How many lines are there in the resulting dataframe?
Q.5
At link https://eeecon.uibk.ac.at/~zeileis/grunfeld/Grunfeld.csv/ you’ll find a .csv file for the Grunfeld data. This is a particularly famous table due to its use as reference data in econometric models. Read this file by passing the direct link as input to function readr::read_csv(). How many columns do you find in the resulting dataframe?
Q.6
Use the afedR3::data_path() function to access the CH04_example-tsv.tsv file in the book’s data repository. Note that the columns of the data are separated by the tab symbol ('\t'), and not the usual comma. After reading the readr::read_delim() manual, import the information from this file into your R session. How many rows does the resulting dataframe contain?
Q.7
In the book package you’ll find a data file called CH04_another-funky-csv-file.csv, with a particularly bizarre format. Open it in a text editor and try to understand how the columns are separated and what the symbol for decimals is. After that, study the inputs of function utils::read.table() and import the table into your R session. If we add the number of rows to the number of columns in the imported table, what is the result?