Data Analysis With R

xuyeluo edited this page Oct 10, 2024 · 22 revisions

This document is designed to provide essential resources and tutorials to help you become proficient in using R for data analysis. Whether you're just starting your journey or looking to enhance your skills, this guide offers a curated list of resources that are both practical and insightful, tailored to the needs of data scientists working with Hack For LA.

Introduction to R

Overview of R

R is a powerful, open-source programming language and software environment specifically designed for statistical computing and graphics. It is widely used among statisticians, data analysts, and data scientists for developing statistical software and performing data analysis. One of the key strengths of R is its extensive library of packages, which provide a wide range of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more. R's syntax is user-friendly and highly expressive, making it an excellent tool for both beginners and experienced users. Additionally, R's active community continually contributes to its development, ensuring that it remains at the cutting edge of data science and statistical analysis.

Benefits of using R

R offers significant coding convenience: vectorized operations apply a calculation to every element of a vector without an explicit loop, R can read and write data in many file formats, and it can call other command-line programs. It also supports parallel computing: the built-in parallel package runs code across multiple cores, and add-on packages extend this to GPUs and MPI clusters. R is flexible as well, since it is customizable and supports object-oriented programming.
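
As a minimal sketch of what vectorized operations look like in practice (the vector name and values below are made up for illustration), arithmetic and comparisons apply to every element at once:

```r
# Hypothetical data: daily temperatures in degrees Celsius
temps_c <- c(18.5, 21.0, 25.5, 30.0)

# Arithmetic is applied element by element, without a loop
temps_f <- temps_c * 9 / 5 + 32
print(temps_f)

# Comparisons are vectorized too and return a logical vector
is_warm <- temps_c > 24
print(is_warm)

# A logical vector can be used to subset values or to count matches
print(temps_c[is_warm])
print(sum(is_warm))
```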

Installing R and RStudio

Installing R and RStudio is a crucial first step for any new data scientist looking to leverage the power of R for data analysis. R, an open-source statistical programming language, provides a robust environment for data manipulation, statistical computing, and graphical representation. To maximize its potential, RStudio is recommended as an integrated development environment (IDE) that enhances the user experience with features like syntax highlighting, direct code execution, and a comprehensive workspace management system. By installing RStudio alongside R, users benefit from an organized and efficient setup that streamlines coding, debugging, and visualization tasks, making it easier to focus on data-driven insights and project outcomes.

Getting Started with R

Installing R

1. Download R:

Go to the Comprehensive R Archive Network (CRAN) website (https://cran.r-project.org).

Download the R version that matches your operating system (Windows, macOS, or Linux).

2. Install R on Windows:

Click on the "base" link and download the installer. Run the downloaded installer and follow the on-screen instructions to complete the installation.

3. Install R on macOS:

Download the .pkg file for the latest R version.

Open the downloaded file and follow the installation instructions.

Installing RStudio

1. Download RStudio:

Go to the RStudio Download website. (https://posit.co/download/rstudio-desktop)

Click the "Download RStudio Desktop" button under "Install RStudio", or download the RStudio installer for your operating system.

2. Install RStudio on Windows:

Download the installer and run it.

Follow the on-screen instructions to complete the installation.

3. Install RStudio on macOS:

Download the .dmg file and open it.

Drag the RStudio icon to the Applications folder.

Here is a simple video about installing R and RStudio on Windows:

https://youtu.be/YrEe2TLr3MI?si=LRXDA0G6FquejNdC
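
One way to check that the installation worked, offered here as a quick sketch rather than an official verification step, is to open RStudio and run a few base R commands in the Console:

```r
# Print the version of R that is currently running
print(R.version.string)

# Show the operating system that R detected
print(Sys.info()[["sysname"]])

# List the library folders where installed packages are stored
print(.libPaths())
```

If each command prints a result instead of an error, R is installed and RStudio has found it.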

Running your first R script

Here's a step-by-step guide to help you get started:

Writing the Script:

1. Open RStudio:

Launch RStudio from your applications or start menu.

2. Create a New Script:

Go to File > New File > R Script, or use the shortcut Ctrl+Shift+N (Windows/Linux) or Cmd+Shift+N (macOS).

3. Write Your Script:

Enter the following basic R code into the script editor:

# My First R Script

# Print a message to the console
print("Hello, world!")

# Create a numeric variable
x <- 10

# Perform a simple arithmetic operation
y <- x * 2

# Print the result
print(y)

# Create a vector
numbers <- c(1, 2, 3, 4, 5)

# Calculate the mean of the vector
mean_value <- mean(numbers)

# Print the mean value
print(mean_value)

4. Save Your Script:

Save your script by going to File > Save or using the shortcut Ctrl+S (Windows/Linux) or Cmd+S (macOS).

Choose a location on your computer and name your script (e.g., first_script.R).

Running the Script

1. Run the Entire Script:

To run the entire script, you can either click the Source button in the top-right corner of the script editor or use the shortcut Ctrl+Shift+Enter (Windows/Linux) or Cmd+Shift+Enter (macOS).

The script will execute, and you will see the output in the console at the bottom of RStudio.

2. Run Selected Lines:

To run specific lines of code, highlight the lines you want to execute and press Ctrl+Enter (Windows/Linux) or Cmd+Enter (macOS).

The selected lines will execute, and the output will appear in the console.

3. Viewing the Output

The output of your script will be displayed in the console. You should see the printed messages and results from your script, such as:

[1] "Hello, world!"

[1] 20

[1] 3

By following these steps, you can write, save, and run your first R script in RStudio. This process allows you to automate repetitive tasks, analyze data, and generate reports efficiently. As you become more familiar with R, you'll be able to write more complex scripts to tackle various data analysis challenges.
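
A saved script can also be run from the Console with source(). The sketch below writes a small script to a temporary file so that it runs anywhere; the temporary file is only a stand-in for a script such as first_script.R saved in your own folder.

```r
# Write a tiny script to a temporary file (a stand-in for first_script.R)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "numbers <- c(1, 2, 3, 4, 5)",
  "mean_value <- mean(numbers)",
  "print(mean_value)"
), script_path)

# Run the whole script; by default it is evaluated in the global environment
source(script_path)

# Objects created by the script remain available afterwards
print(mean_value)
```

From a terminal, the same kind of script can be run without opening RStudio by calling Rscript followed by the file name.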

Here is a simple video showing how to use RStudio:

https://youtu.be/FIrsOBy5k58?si=R7O3i1gI07X-0zWx

Tidyverse:

Tidyverse is a collection of R packages designed for data science, sharing an underlying design philosophy, grammar, and data structures that make working with data easier. Volunteers can analyze Hack for LA data using Tidyverse, which offers a powerful and cohesive set of tools for efficient data cleaning, transformation, and visualization. Tidyverse's consistent syntax and integrated workflows streamline the entire data analysis process, from importing data with readr to creating insightful visualizations with ggplot2. Productivity is further enhanced by the functional programming capabilities of purrr and the string and factor management provided by stringr and forcats. These features enable volunteers to effectively analyze and present data on critical issues such as homelessness, expungement, and food insecurity, ultimately supporting informed decision-making and impactful community interventions.

Here's an overview of the core packages in the Tidyverse and their primary functions:

  1. ggplot2: Used for data visualization, it implements the grammar of graphics, providing a powerful and flexible system for creating a wide range of visualizations.

  2. dplyr: Provides a set of functions for data manipulation, including filtering rows, selecting columns, rearranging rows, and summarizing data.

  3. tidyr: Helps tidy data, ensuring that data sets are consistent and easy to work with by transforming them into a tidy format where each variable is a column, each observation is a row, and each type of observational unit is a table.

  4. readr: Facilitates the reading of rectangular data, such as CSV files, into R. It is designed to be fast and to handle a wide range of data formats.

  5. purrr: Enhances R's functional programming tools, making it easier to apply functions to data and work with lists.

  6. tibble: Provides a modern take on data frames, offering a data structure that is simpler and more user-friendly than base R data frames.

  7. stringr: Simplifies string manipulation by providing a consistent set of functions designed to make working with strings easier and more intuitive.

  8. forcats: Aims to make working with categorical data (factors) easier, providing a suite of tools for creating, modifying, and analyzing factors.

To get started with Tidyverse, you can install it in R using the following command:

install.packages("tidyverse") 

Once installed, you can load the Tidyverse packages with:

library(tidyverse)

This command attaches the core packages listed above (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats). In tidyverse 2.0.0 and later it also attaches lubridate, which handles dates and times.
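
The following is a small sketch of a typical dplyr workflow, assuming dplyr is installed (it is attached by library(tidyverse)). It uses mtcars, a data set bundled with R, purely as stand-in data; the result name mpg_by_cyl is made up for this example.

```r
library(dplyr)  # also attached by library(tidyverse)

# mtcars ships with R; it is used here only as example data
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%          # keep cars with more than 100 horsepower
  group_by(cyl) %>%             # one group per cylinder count
  summarise(
    n_cars   = n(),             # number of cars in each group
    mean_mpg = mean(mpg)        # average miles per gallon in each group
  ) %>%
  arrange(desc(mean_mpg))       # sort groups from highest to lowest mileage

print(mpg_by_cyl)
```

Each step takes a data frame and returns a data frame, which is what lets the steps be chained with the pipe.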

CRAN

CRAN, which stands for the Comprehensive R Archive Network, is a repository for R, a programming language and environment for statistical computing and graphics. It is one of the main resources for R users to find and download R packages, which are collections of R functions, data, and compiled code that extend the functionality of the base R environment.

Key Features of CRAN:

  1. Package Repository: CRAN hosts thousands of R packages covering a wide range of topics, from data manipulation and visualization to machine learning and bioinformatics. These packages are contributed by the R community and are regularly updated.

  2. Mirrors: CRAN is mirrored across the globe, meaning there are multiple servers around the world that host copies of CRAN to ensure fast access and reliability for users worldwide.

  3. Documentation: Each package on CRAN comes with a reference manual that documents its functions, and many packages also include vignettes and examples that show how to use the package effectively.

  4. CRAN Task Views: These are curated lists of packages grouped by topic, providing an easy way to find relevant packages for specific tasks like Bayesian analysis, econometrics, or machine learning.

  5. Version Control: CRAN keeps an archive of earlier releases of each package, so users can install a specific older version if needed.

  6. Quality Control: Packages on CRAN are subjected to rigorous checks and must pass several automated tests before they are accepted. This ensures that packages are reliable and compatible with the R environment.

How to Use CRAN:

  1. Installing Packages:

R users can install packages from CRAN using the install.packages() function in R. For example, install.packages("ggplot2") will install the ggplot2 package from CRAN.

  2. Browsing Packages:

Users can browse available packages on the CRAN website, where they can search for packages by name or topic.

  3. Updating Packages:

Installed packages can be updated to their latest versions using the update.packages() function.
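
A common pattern in shared scripts is to install a package only when it is missing. The sketch below shows one way to do that; ensure_installed is a hypothetical helper written for this example, not a function provided by R or CRAN.

```r
# Hypothetical helper: install a package from CRAN only if it is missing
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  # TRUE if the package can be loaded after the check above
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call does not trigger a download
print(ensure_installed("stats"))

# Check which version of an installed package you have
print(packageVersion("stats"))
```

For a package that is not yet installed, such as ggplot2 on a fresh setup, the same call would download it from CRAN first.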

CRAN plays a central role in the R ecosystem, providing a robust and reliable platform for the distribution of R packages, which is crucial for the development of statistical methods and data analysis workflows.

Guide to R packages installation: https://www.datacamp.com/tutorial/r-packages-guide

Read/Write files in R

  • One of the most common ways to store data is to save it in files.
  • Files come in many formats, such as plain text, CSV, Excel spreadsheets, and RData.
  • Files make it easy to transfer data from one computer system to another.
  • Data stored in files can be loaded into the R environment for analysis.

Reading data from files

setwd("your/path/to/Rtutorial")

  • In the Console, type the following code:
cars <- read.table(file = "cars.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)

file: filename to read, e.g. cars.txt

header: logical (TRUE/FALSE). TRUE means the first line contains the names of the variables, e.g. 'mpg', 'cyl', 'disp'

sep: data separator, "\t" means data is separated by tab. It can be whitespace, comma, newline or carriage returns.

To see the other parameters, type ?read.table in the Console.
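
To try these parameters without the cars.txt file, the sketch below first creates a small tab-separated file in a temporary location and then reads it back. The file contents and the name cars_demo are invented for this example.

```r
# Create a small tab-separated file (a hypothetical stand-in for cars.txt)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "model\tmpg\tcyl",
  "Car A\t21.0\t6",
  "Car B\t22.8\t4",
  "Car C\t18.7\t8"
), path)

# Read it back: fields are separated by tabs, the first line holds the names
cars_demo <- read.table(file = path, sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE)

# Show the column names and the type read.table chose for each column
str(cars_demo)
```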

Data frames

cars is a data frame, and R provides many functions that work on data frames.

cars <- read.table("cars.txt", header = TRUE, sep = "\t")
nrow(cars) # get the number of rows
ncol(cars) # get the number of columns
head(cars) # preview the first six rows

read.table() documentation: https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table

Extracting the rows or columns.

cars[1, ] # extract the first row
cars[1:3, ] # extract the first 3 rows
cars[, "Transmission"] # extract the column named Transmission
cars[ ,1] # extract the first column
cars[2,3] # extract the element at row 2 and column 3
cars[2:6, 2:4] # extract the elements at that range
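
Beyond positions and names, rows can also be selected by a condition. The sketch below uses a small made-up data frame, cars_demo, as a stand-in for cars; its columns and values are assumptions for illustration only.

```r
# Hypothetical stand-in for the cars data frame
cars_demo <- data.frame(
  Model        = c("Car A", "Car B", "Car C", "Car D"),
  Cylinders    = c(6, 4, 8, 4),
  Transmission = c("Manual", "Automatic", "Automatic", "Manual"),
  stringsAsFactors = FALSE
)

# Extract one column as a vector with $
transmissions <- cars_demo$Transmission
print(transmissions)

# Keep only the rows that satisfy a condition
four_cyl <- cars_demo[cars_demo$Cylinders == 4, ]
print(four_cyl)

# Select several columns by name
model_cyl <- cars_demo[, c("Model", "Cylinders")]
print(model_cyl)

# A negative index drops rows: everything except the first row
without_first <- cars_demo[-1, ]
print(without_first)
```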

Writing data to a file

Setup steps

write.table(x, file, row.names, col.names, quote, sep )

x: data you want to write on a file

file: name of the file on which you want to write

row.names / col.names: either logical (TRUE/FALSE) to include or omit the names, or a character vector of names to write

quote: logical (TRUE/FALSE); if TRUE, character and factor values are surrounded by double quotes

sep: the field separator, e.g. " " (space), "," (comma), or "\t" (tab)

In the Console, type the following code:

write.table(x = cars, file = "modified_cars.txt", row.names = FALSE, sep = "\t")
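
To see what these parameters do to the file itself, the sketch below writes a tiny made-up data frame to a temporary file and prints the raw lines. The names df and out are placeholders for this example.

```r
# A tiny made-up data frame to write out
df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))

# Write to a temporary file: tab-separated, no row names, no quotes
out <- tempfile(fileext = ".txt")
write.table(x = df, file = out, row.names = FALSE, quote = FALSE, sep = "\t")

# Print the raw contents of the file, one line per row
written <- readLines(out)
cat(written, sep = "\n")
```

With quote = TRUE (the default), the column names and the values in the name column would be wrapped in double quotes.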

Reading a csv file

A csv file can be read in the following two ways:

  1. data <- read.table("filename.csv", header = TRUE, sep = ",")

Note: only the file extension changes from .txt to .csv, and sep changes to ",".

  2. data <- read.csv("filename.csv", header = TRUE)

All the read.table() parameters are also applicable for read.csv().

Writing a csv file

Data can be saved as a csv file in the following two ways:

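  1. write.table(x = data, file = "filename.csv", sep = ",", row.names = FALSE)
  2. write.csv(x = data, file = "filename.csv")

Note: write.table() writes row names by default but gives them no column name, so the header line ends up one field shorter than the data lines; row.names = FALSE avoids this. write.csv() writes an empty column name for the row names instead.

Most write.table() parameters also apply to write.csv(). The exceptions are append, col.names, sep, dec, and qmethod: write.csv() fixes these, and attempts to change them are ignored with a warning.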
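
A quick way to check that a CSV file was written as intended is to read it back and compare. The sketch below does this with a made-up data frame named scores and a temporary file.

```r
# A made-up data frame to save
scores <- data.frame(
  id    = 1:3,
  group = c("a", "b", "a"),
  score = c(90.5, 82, 77.25),
  stringsAsFactors = FALSE
)

# Write it as CSV without the row-name column
csv_path <- tempfile(fileext = ".csv")
write.csv(x = scores, file = csv_path, row.names = FALSE)

# Read it back; header = TRUE is the default for read.csv
scores_back <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compare the original with the copy read from disk
print(all.equal(scores, scores_back))
```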

R also provides its own formats for saving and loading data or other R objects.

The two formats are RData and RDS.

Both formats compress the data by default, so the files take up less disk space.

Save data as RData:

You can save one or more R objects or variables as RData.

# save(data you want to save, file = "name of the file.RData")
head(cars)
x <- 1:10
x
save(list = c("cars", "x"), file = "cars.RData")
# or
save(cars, x, file = "cars.RData")

Save data as RDS:

You can save only one R object or variable in RDS format.

# saveRDS(data you want to save, file = "name of the file.RDS")
saveRDS(cars, file = "cars.RDS")

Note: You can pass compress = FALSE to save() or saveRDS() to write the file without compression.

Load RData in R environment

# load(file = "filename to load.RData")
load(file = "cars.RData")

Read RDS file in R environment

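# readRDS(file = "filename to read.RDS")
cars_from_rds <- readRDS(file = "cars.RDS")

readRDS() returns the object instead of restoring it under its original name, so you can assign the RDS data to any name you like.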
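
The difference between the two formats is easiest to see in a round trip. The sketch below uses a made-up data frame and temporary files; the names demo_df, rdata_path, and rds_path are placeholders for this example.

```r
# Made-up objects to save
demo_df <- data.frame(id = 1:3, label = c("a", "b", "c"),
                      stringsAsFactors = FALSE)
x <- 1:10

rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".RDS")

# RData stores several objects together with their names
save(demo_df, x, file = rdata_path)

# RDS stores a single object without its name
saveRDS(demo_df, file = rds_path)

# Remove the originals to show that loading really restores them
rm(demo_df, x)

# load() recreates the objects under their original names
# and returns those names as a character vector
restored <- load(rdata_path)
print(restored)

# readRDS() returns the object, so it can be given any name
demo_copy <- readRDS(rds_path)
print(all.equal(demo_copy, demo_df))
```

RData suits saving several related objects at once, while RDS suits passing a single object around, because the reader chooses the name.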

To be continued:

Other Chapters:

Control flow

Implicit loop

Plotting in R

String manipulation

Issues used in the creation of this page

(Some of these issues may be closed or open/in progress.)

Contributors

Xuye Luo

References

https://cran.r-project.org/

https://posit.co/download/rstudio-desktop

https://www.w3schools.com/R/

https://github.com/imtiaz-emu/Data-Science-with-R

Updated on 10.10.2024
