Tao | 44-49




Positive Approaches Journal - Volume 2 Title

Volume 9 ► Issue 3 ► 2020


Getting Started With R

Sha Tao

Introduction

Data analysis is the process of evaluating data using analytical or statistical tools to discover useful information. Nearly every business and field today relies on data analysis, and data-driven choices are one of the best ways to make business decisions with confidence. In addition to traditional statistical knowledge, data analysis today largely depends on the mastery of tools for computation, exploratory analysis, visualization, dissemination, and reproducibility. Many statistical software options are available, including SAS, SPSS, Stata, R, Python, and MATLAB, to name a few. Each has pros and cons and can be useful in different settings.

This article briefly introduces R, a free programming language and software environment that is particularly well suited to statistics and widely used among statisticians.

Why do people love R?

1. Free and supported on Windows, Mac, and Linux

It is hard for a beginner to spend hundreds, if not thousands, of dollars per year on software. R is free and available on all major platforms.

2. Clean interface

After installing R, most users also install RStudio¹, an integrated set of tools designed to help you be more productive with R. It includes a console, a syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging, and managing your workspace.

Moreover, R Markdown² lets you save your code and results together with notes in a report-style document, or even build websites to share your results.

3. Open source

Users can extend the functionality of R through add-ons called packages. As a result, most common (and often uncommon) statistical functions and models are available for free, either from the official package repository, CRAN³, or from communities such as GitHub⁴.

4. Group work

In both academic and professional settings, you may need to collaborate with classmates or colleagues on a project. You can create a GitHub repository⁵ and link your RStudio work to it, so that everyone on the team can edit and contribute to the project.

5. Good community

Good software needs a good community to support it. The R community produces new discussions and packages every day, answers questions, and helps troubleshoot code. Two good places for statisticians to study, share, and contribute are Stack Overflow⁶ and GitHub⁴.

6. Online study

Because R is so widely used, many online classes and tutorials are available, often for free. They range from topics as broad as “how to do survival analysis in R” to ones as specific as “how to change the font of a plot title.”

7. Extensive data visualization

There are great packages and flexible tools for creating custom graphs and tables. In addition, with Shiny (an R package), you can build interactive plots and dashboards for others to explore.

Here are a few tips to help beginners make the most of R:

1. R is evolving, and we should keep up.

Because R is open source, new packages come out every day, and it pays to stay open to new possibilities. When you first learn R from an online source, check when that resource was published. Many self-taught R users end up relying only on base R functions for data wrangling and on plot() for visualization, which is not wrong but can cost efficiency and limit customization. I suggest starting with the tidyverse, a collection of packages that includes both dplyr (for data wrangling) and ggplot2 (for visualization).
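To make the comparison concrete, here is a minimal sketch that computes the same group summary in base R and in dplyr, and then draws a ggplot2 scatter plot. It assumes the dplyr and ggplot2 packages are already installed (for example via install.packages("tidyverse")) and uses the built-in mtcars dataset purely for illustration; the object names are hypothetical.

```r
# Assumption: dplyr and ggplot2 are installed, e.g. install.packages("tidyverse").
library(dplyr)
library(ggplot2)

# Base R: mean miles per gallon for each number of cylinders.
base_summary <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr: the same summary written as a sequence of named steps.
tidy_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

print(base_summary)
print(tidy_summary)

# ggplot2: a scatter plot built up in layers (data, geometry, labels).
mpg_plot <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(mpg_plot)
```

Both summaries contain the same numbers; the dplyr version is arguably easier to extend with further steps, and the ggplot2 plot can be customized by adding more layers.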

2. Coding style is important.

A good coding style⁷ helps reduce errors and makes your code easier for others to understand. In addition, always comment your code blocks to remind yourself what the code does. People understand what they write in the moment, but often not two months later.
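As a small illustration of these habits, the sketch below uses descriptive names, consistent spacing, and comments that explain intent. The function name and the temperature values are made up for this example and do not come from any particular style guide.

```r
# Convert temperatures from degrees Fahrenheit to degrees Celsius.
# (Hypothetical helper, written only to illustrate naming and commenting.)
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Daily high temperatures in Fahrenheit (made-up illustrative values).
daily_high_f <- c(68, 77, 86)

# Convert once and store the result under a name that states the unit.
daily_high_c <- fahrenheit_to_celsius(daily_high_f)
print(daily_high_c)
```

Compare this with a version that names everything x and f and has no comments: the computation is identical, but the purpose is much harder to recover later.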

3. Understand your code before using it.

A common beginner mistake is to use a function found online without understanding what it does. This may do little harm when you are only wrangling data, but it can invalidate a statistical analysis. Have you read the package documentation? Have you checked the model assumptions? Have you addressed the limitations of the model? Have you tested the model fit? These are some of the questions to ask yourself before using a function.
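The questions above can be sketched for a simple linear regression as follows. This is one possible set of checks using only base R and the built-in mtcars dataset, not an exhaustive checklist; the choice of model (mpg explained by wt) is an assumption made for illustration.

```r
# Read the documentation before using a function: ?lm opens the help page.

# Fit a simple linear regression of fuel economy on vehicle weight.
fit <- lm(mpg ~ wt, data = mtcars)

# Model fit: coefficients, standard errors, and R-squared.
print(summary(fit))

# Model assumptions: the standard diagnostic plots show residuals vs fitted
# values, a normal Q-Q plot, scale-location, and leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# A formal test of residual normality (Shapiro-Wilk) to complement the plots.
normality_check <- shapiro.test(residuals(fit))
print(normality_check)
```

If the diagnostics show clear patterns, such as curvature in the residuals, that is a sign to reconsider the model rather than report its output as is.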

4. Use the pipe operator (%>%).

Data manipulation and cleaning usually involve many commands applied in sequence. You can define intermediate datasets or nest function calls, but neither is ideal: the first clutters your workspace and gets confusing, and the second has to be read from the inside out.

Piping solves this problem. It turns the nested approach into a sequential chain by passing the result of one function call as the first argument of the next function call.
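The three approaches can be compared side by side in a short sketch. It assumes the dplyr package is installed (dplyr makes the %>% operator available) and uses the built-in mtcars dataset; the object names are hypothetical.

```r
# Assumption: dplyr is installed; loading it also makes %>% available.
library(dplyr)

# 1. Intermediate datasets: every step leaves an extra object in the workspace.
cars_4cyl <- filter(mtcars, cyl == 4)
cars_sorted <- arrange(cars_4cyl, desc(mpg))
result_steps <- head(cars_sorted, 3)

# 2. Nested calls: no extra objects, but the code must be read inside out.
result_nested <- head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# 3. Piped: each result becomes the first argument of the next call,
#    so the code reads top to bottom in the order the steps happen.
result_piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

print(result_piped)
```

All three versions return the same three rows (the four-cylinder cars with the highest miles per gallon); only the readability differs.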

5. A great resource that covers most of the basics of R is Kevin Donovan's Data Analysis and Processing with R Based on IBIS Data⁸.



References

1. RStudio | Open source & professional software for data science teams. RStudio website. https://rstudio.com/. Accessed November 3, 2020.

2. R Markdown. R Markdown website. https://rmarkdown.rstudio.com/. Accessed November 3, 2020.

3. The Comprehensive R Archive Network. CRAN website. https://cran.r-project.org/. Accessed November 3, 2020.

4. GitHub: Where the world builds software. GitHub website. https://github.com. Accessed November 3, 2020.

5. GitHub and RStudio. GitHub Resources website. https://resources.github.com/whitepapers/github-and-rstudio/. Accessed November 3, 2020.

6. Stack Overflow - Where developers learn, share, & build careers. Stack Overflow website. https://stackoverflow.com/. Accessed November 3, 2020.

7. Google's R style guide. GitHub website. https://google.github.io/styleguide/Rguide.html. Accessed November 3, 2020.

8. Donovan K. Data analysis and processing with R based on IBIS data. Bookdown website. https://bookdown.org/kdonovan125/ibis_data_analysis_r4/#preface. Accessed November 3, 2020.

Biography

Sha Tao is a Data Analyst at the A.J. Drexel Autism Institute Policy and Analytics Center (PAC). Mr. Tao mainly recommends, develops, and tests statistical approaches, data visualizations, and efficient code for use with both smaller-scale survey data and large-scale national databases. Mr. Tao completed his undergraduate studies in Biochemistry and Molecular Biology at Michigan State University. He holds a Master of Public Health degree in biostatistics and a certificate in advanced epidemiology from Columbia University.

Contact Information

Sha Tao

Data Analyst

A.J. Drexel Autism Institute Policy and Analytics Center

st3237@drexel.edu