So far we have learned how to use R for basic tasks such as interacting with the computer, creating simple vectors and downloading files from the internet. At this point, it is important to discuss the structure of a research script and, more specifically, how to organize our work in an efficient manner. As the R code base becomes larger and more complex, organization becomes a necessity. In this chapter, I will suggest a way to organize files and folders. I recommend that you follow these guidelines, or at least your own version of them, in every project you work on.
Unlike other kinds of software, a data analysis script follows a clear sequence of consecutive steps to achieve its goal:
1. Importation of data: Raw (original) data is imported from a local file or the internet. At this stage, no manual data manipulation should happen. The raw data must be imported “as is”.
2. Cleaning and structuring the data: The raw data imported in the previous step is cleaned and structured according to the needs of the analysis. Abnormal records and errors in observations can be removed or treated. The structure of the data can also be manipulated, binding (merging) different datasets and calculating variables of interest. Preferably, at the end of this stage, there should be a final collection of clean data.
3. Visual analysis and hypothesis testing: After cleaning and structuring the data, the work continues with the visual analysis of the data and hypothesis testing. Here, you can create graphical representations of the data for your audience and use statistical tools, such as econometric models, to test a particular hypothesis. This is the heart of the research and the stage likely to take the most development time.
4. Reporting the results: The final stage of a research script is reporting the results, that is, exporting selected tables and figures from R to word processing software such as LaTeX, Writer (LibreOffice) or Word (Microsoft).
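As a rough illustration, the stages above could be laid out in a single script as in the sketch below. It uses only base R. The file names, the toy data and the linear model are hypothetical placeholders chosen so the example can run on its own; they are not part of any real project.

```r
# 1) Import: read the raw file exactly as it is.
# The CSV is written here only so the sketch is self-contained.
raw_file <- file.path(tempdir(), 'raw-data.csv')
write.csv(data.frame(x = c(1, 2, 3, 4, NA, 6),
                     y = c(2.1, 3.9, 6.2, 8.1, 9.9, 12.2)),
          raw_file, row.names = FALSE)
df_raw <- read.csv(raw_file)

# 2) Clean and structure: remove incomplete records
df_clean <- df_raw[complete.cases(df_raw), ]

# 3) Visual analysis and hypothesis testing
fig_file <- file.path(tempdir(), 'fig-scatter.pdf')
pdf(fig_file)
plot(df_clean$x, df_clean$y, xlab = 'x', ylab = 'y')
dev.off()

my_lm <- lm(y ~ x, data = df_clean)

# 4) Report: export the table of coefficients
tab_file <- file.path(tempdir(), 'tab-coefficients.csv')
write.csv(coef(summary(my_lm)), tab_file)
```

In a larger project, each numbered block would typically grow into its own script.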
Each of the previous steps can be structured in a single .R script or in several separate files. Using multiple files is preferable when the first steps of the research demand significant processing time. For example, when importing and organizing a large database, it is worth the trouble to separate the code into different files. It will be easier to find bugs and maintain the code. Each script will do one job, and do it well.
A practical example would be the analysis of a large dataset of financial transactions. Importing and cleaning the data takes plenty of computer time. A smart organization is to place these primary data procedures in one .R file and save the final objects of this stage in an external file. This local file serves as a bridge to the next step, hypothesis testing, where the previously saved clean data is imported. This way, a change to the hypothesis testing script does not require rebuilding the whole dataset, which saves a lot of time. The underlying logic is simple: isolate the parts of the script that demand more computational time (and less development), and connect them to the rest of the code using external data files.
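A minimal sketch of this bridge, assuming the clean data is stored with the base R functions saveRDS() and readRDS(), is shown next. The file name and the toy data frame are hypothetical; in practice the first half would live in the import script and the second half in the hypothesis testing script.

```r
# Stage 1 (slow): import and clean once, then save the result to disk.
# 'clean_file' and the toy data frame are hypothetical placeholders.
clean_file <- file.path(tempdir(), 'clean-data.rds')

raw_data <- data.frame(id = 1:5,
                       value = c(10.5, NA, 7.2, 3.3, 8.1))

# drop incomplete records (a stand-in for a real cleaning routine)
clean_data <- raw_data[complete.cases(raw_data), ]
saveRDS(clean_data, file = clean_file)

# Stage 2 (fast): the analysis script only reads the saved file
df_analysis <- readRDS(clean_file)
mean_value <- mean(df_analysis$value)
```

Other file formats would serve the same purpose; the important point is that the second script never has to repeat the slow steps.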
If you are working with multiple files, one suggestion is to adopt a naming structure that conveys the steps of the research in an intuitive way. An example would be to name the data importing code 01-import-and-clean-data.R, the modeling code 02-estimate-and-report-models.R, and so on. Starting each filename with a number makes the order of execution clear. We can also create a master script called 0-run-it-all.R that runs (with source) all other scripts. So, every time we make an update to the original data, we can simply run 0-run-it-all.R and get the new results, without the need to execute each script individually.
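One possible shape for such a master script is sketched below. To keep the example runnable anywhere, it first writes two tiny stand-in scripts to a temporary folder; the folder name and the contents of the scripts are assumptions made for illustration only. In a real project the numbered scripts would already exist in the root folder.

```r
# Hypothetical project location, used only for this sketch
project_dir <- file.path(tempdir(), 'my-project')
dir.create(project_dir, showWarnings = FALSE)

# stand-ins for the real research scripts
writeLines('x <- 1:10',
           file.path(project_dir, '01-import-and-clean-data.R'))
writeLines('result <- sum(x)',
           file.path(project_dir, '02-estimate-and-report-models.R'))

# find every script that starts with two digits and a dash;
# sort() makes the order of execution explicit
my_scripts <- sort(list.files(project_dir,
                              pattern = '^[0-9]{2}-.*\\.R$',
                              full.names = TRUE))

# run the scripts one by one, in order
for (i_script in my_scripts) {
  message('Running ', basename(i_script))
  source(i_script)
}
```

Because the second script uses an object created by the first, the numeric prefixes are what guarantee a working execution order.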
A proper folder structure also benefits the reproducibility and organization of research. In simple scripts, with a small database and a low number of procedures, it is unnecessary to spend much time thinking about the organization of files. This is certainly the case for most of the code in this book. In more complex programs, with several stages of data cleaning, hypothesis testing, and several sources of data, organizing the file structure is essential.
A suggestion for an effective folder structure is to create a single directory and, within it, create subdirectories for each input and output element. For example, you can create a subdirectory called data, where all the original data will be stored, and a directory tables, where figures and tables with final results will be exported. If you are using many custom-written functions in the scripts, you can also create a directory called r-fcts and save all files with function definitions at this location. As for the root of the directory, you should only find the main research scripts there. An example of a file structure that summarizes this idea is:
/Capital Markets and Inflation/
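The subdirectories mentioned in the text can also be created from R itself. The sketch below builds them inside a temporary folder, which is an assumption made only so the code can run without touching your own files.

```r
# The root folder name follows the example above; placing it inside
# tempdir() is a choice made only for this sketch.
root_dir <- file.path(tempdir(), 'Capital Markets and Inflation')
sub_dirs <- c('data', 'tables', 'r-fcts')

for (i_dir in file.path(root_dir, sub_dirs)) {
  dir.create(i_dir, recursive = TRUE, showWarnings = FALSE)
}

# list the resulting structure
print(list.dirs(root_dir, full.names = FALSE, recursive = FALSE))
```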
The research code should also be self-contained, with all files available within a sub-folder of the root directory. If you are using many different R packages, it is advisable to add a comment in the first lines of 0-run-it-all.R that indicates which packages are necessary to run the code. The friendliest way to do so is by adding a commented line that installs all required packages, as in #install.packages(c('pkg1', 'pkg2', ...)). So, when someone receives the code for the first time, all they need to do is uncomment the line and execute it. External dependencies and the steps for their installation should also be documented.
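As a complement to the commented line, the top of the master script could also report which packages are missing. The sketch below is one way of doing that with base R. The package names 'stats' and 'utils' are placeholders that ship with R; replace them with the packages your project really uses.

```r
# Packages required by the project ('stats' and 'utils' are placeholders)
required_pkgs <- c('stats', 'utils')

# requireNamespace() returns TRUE when a package can be loaded
is_installed <- vapply(required_pkgs, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message('Missing packages: ', paste(missing_pkgs, collapse = ', '))
  # uncomment the next line to install them:
  # install.packages(missing_pkgs)
} else {
  message('All required packages are available.')
}
```

Keeping the installation line commented leaves the decision to install software with the person running the code.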
The organization of the code directory facilitates collaboration and reproducibility. If you need to share the code with other researchers, simply compress the directory to a single file and send it to the recipient. After decompressing the file, the structure of the folder immediately tells the user where to change the original data, the order of execution of the scripts in the root folder, and where the outputs are saved. The same benefit applies when you reuse your code in the future, say three years from now. By working smarter, you will be more productive, spending less time on repetitive and unnecessary steps for “figuring out” how the code works.
In this section I’ll be making some suggestions for how you can conduct data analysis with R. To be clear, these are personal positions based on my experience as a researcher and teacher. Many points raised here are specific to the academic environment but can be easily extended to the practice of data research in industry. In short, these are suggestions I wish I had known when I first started my career.
Firstly, know your data! I can’t stress enough how important this is. The first instinct of every passionate data analyst when encountering a new set of rich information is to immediately import it into R and perform an analysis. However, a certain level of caution is needed. Every time you get your hands on a new set of data, ask yourself how much you really know:
- How was the data collected? To what purpose?
- What information does each column of the table represent? What are the details often missed?
- How do the available data compare with data used in other studies?
- Is there any possibility of bias within the data collection?
Furthermore, you need to remember that the ultimate goal of any research is communication. Thus, it is very likely that you will report your results to people who have an informed opinion about the subject, including the sources and particularities of different datasets. The worst case scenario is when a research effort of three to six months of coding and writing is nullified by a simple lapse in data checking. Unfortunately, this is not uncommon.
As an example, consider the case of analyzing the long term performance of companies in the retail business. For that, you gather a recent list of available companies and download financial records about their revenue, profit and adjusted stock price for the past twenty years. The problem here is in the selection of the companies. By selecting those that are available today, you missed all companies that went bankrupt during the twenty year period. That is, by looking only at companies that stayed active during the whole period, you indirectly selected those that were profitable and presented good performance. This is a well-known effect called survivorship bias (or survival bias). The right way of doing this research is to gather a list of companies in the retail business twenty years ago and keep track of those that went bankrupt and those that stayed alive.
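The effect can be seen in a toy calculation. The numbers below are entirely made up for illustration: six hypothetical retailers that existed twenty years ago, their average yearly return, and whether they are still active today.

```r
# Hypothetical toy data, invented only to illustrate the selection effect
df_firms <- data.frame(
  firm       = c('A', 'B', 'C', 'D', 'E', 'F'),
  avg_return = c(0.08, 0.12, 0.05, -0.20, -0.35, -0.10),
  active_now = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# sample built from today's list of companies (survivors only)
mean_survivors <- mean(df_firms$avg_return[df_firms$active_now])

# sample built from the list of companies twenty years ago
mean_all <- mean(df_firms$avg_return)

cat('Survivors only:', mean_survivors, '\n')
cat('All companies: ', mean_all, '\n')
```

In this made-up sample, the average computed from survivors alone is positive, while the average over all companies is negative. The conclusion about the sector changes with the way the list of companies was built.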
The message is clear. Be very cautious about the data you are using. Your raw tables stand at the base of the research. A small detail that goes unnoticed can invalidate your whole work. If you are lucky and the database is accompanied by a written manual, study it down to the last detail. If the information is not clear, do not be shy about sending questions to the responsible party. Likewise, if there is an unavoidable operational bias in your dataset, be open and transparent about it.
The second point here is the code. After you finish reading this book, you will have the knowledge to conduct research with R. The computer will be a powerful ally in making your research ideas come true, no matter how gigantic they may be. However, with great power comes great responsibility. That said, you need to be aware that a single misplaced line of code can easily bias and invalidate your analysis.
Remember that analyzing data is your profession and your reputation is your most valuable asset. If you have low confidence in the code you produced, do not publish or communicate your results. The code and its results are entirely your responsibility. Check them as many times as necessary. Always be skeptical about your own work:
- Do the descriptive statistics of the variables faithfully represent the database?
- Is there any relationship between the variables that can be verified in the descriptive table?
- Do the main findings of the research make sense in light of the current literature on the subject? If not, how can the difference be explained?
- Is it possible that a bug in the code has produced the results?
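Some of these questions can be turned into automatic checks inside the script. The sketch below uses base R's summary() and stopifnot() on a small, hypothetical data frame; the column names and the rules being checked are assumptions that must be adapted to what you know about your own data.

```r
# Hypothetical toy data standing in for the cleaned research dataset
df <- data.frame(price  = c(10.2, 11.5, 9.8, 10.9),
                 volume = c(1000, 1500, 800, 1200))

# descriptive table: compare it with what you know about the raw data
print(summary(df))

# explicit checks that stop the script when an assumption is violated
stopifnot(
  !anyNA(df),            # no missing values after cleaning
  all(df$price > 0),     # prices must be positive
  all(df$volume >= 0),   # volumes cannot be negative
  nrow(df) == 4          # expected number of records
)

# a derived variable should be consistent with its inputs
df$value <- df$price * df$volume
stopifnot(isTRUE(all.equal(df$value / df$volume, df$price)))
```

If any check fails, the script stops with an error at that line, which is far better than discovering the problem after the results were reported.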
I’m constantly surprised by how many studies submitted to respected journals can be denied publication based on a simple analysis of the descriptive table. Basic errors in variable calculations can be easily spotted by a trained eye. The process of continuous evaluation of your research will not only make you stronger as a researcher but will also serve as practice for peer evaluation, widely used in academic research. If you do not have enough confidence to report results, test your code extensively. If you have already done so and are still not confident, identify the lines of code you have doubts about and seek help from a colleague or your advisor, if you have one. The latter is a strong ally who can help you deal with problems they have already faced.
All research work is, to some extent, based on existing work. Today it is extremely difficult to carry out ground-breaking research. Knowledge is built in the form of blocks, one over the other. There is always a collection of literature that needs to be consulted. Therefore, you should always compare your results with the results already available on the subject. If the main results are not similar to those found in the literature, ask yourself: could a code error have created this result?
To be clear, it is possible for the results of your research to differ from those in the literature, but agreement is more likely. Being aware of this demands care with your code. Bugs and code problems are quite common and can go unnoticed, especially in early versions of scripts. As a data analyst, it is important to recognize this risk and learn to manage it.
Imagine a study of your household budget over time. Financial data is available in electronic spreadsheets separated by month, for 10 years. The objective of the research is to understand whether it is possible to purchase a real estate property in the next five years. Within this setup, describe in text the elements of each stage of the study, from importing the data to building the report.