Introduction

R is a powerful language used widely for data analysis and statistical computing. It was developed in the early 1990s. Since then, continuous efforts have been made to improve R’s user interface. The journey of the R language from a rudimentary text editor to the interactive R Studio and, more recently, Jupyter Notebooks has engaged many data science communities across the world.

This was possible only because of generous contributions by R users globally. The steady addition of packages has made R more and more powerful with time. Packages such as dplyr, tidyr, readr, data.table, SparkR and ggplot2 have made data manipulation, visualization and computation much faster.

Install

R

  • Go to https://cran.r-project.org/
  • In the ‘Download and Install R’ section, click the link for your operating system and download the R installer.
  • Once download is complete, open the downloaded file.
  • Click Next..Next..Finish.

R Studio

  • Go to https://www.rstudio.com/products/rstudio/download/
  • In ‘Installers for Supported Platforms’ section, choose and click the R Studio installer based on your operating system.
  • Once download is complete, open the downloaded file.
  • Click Next..Next..Finish.

R Studio Interface

Once you have installed R and R Studio, start R Studio by clicking its desktop icon or by using ‘search windows’ to find the program. It looks like this: [Figure: R Studio Interface]

  • R Console: This area shows the output of the code you run. You can also write code directly in the console, but code entered there is not saved in a file, so it is hard to trace later. This is where the R script comes into use.
  • R Script: As the name suggests, this is where you write code. To run it, select the line(s) of code and press Ctrl + Enter. Alternatively, click the small ‘Run’ button located at the top right corner of the R Script pane.
  • R Environment: This space displays the objects you have created or loaded, including data sets, variables, vectors, functions etc. To check whether data has been loaded properly in R, always look at this area.
  • Graphical Output: This space displays the graphs created during exploratory data analysis. Besides graphs, you can also select packages here and get help from R’s embedded official documentation.

R Packages

The sheer power of R lies in its incredible packages. In R, most data handling tasks can be performed in 2 ways:

  • Using R packages
  • Using R base functions

To install a package, simply type: install.packages("package name"). You can type this either directly in the console and press ‘Enter’, or in an R script and click ‘Run’. As a first-time user, you might see a pop-up asking you to select a CRAN mirror (country server); choose accordingly and press OK. I’d advise you to use the default mirror if you are not sure which one to choose.
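Installing a package does not load it into your session; that is what library() is for. Here is a minimal sketch of the install-then-load pattern. The helper name load_or_install is hypothetical (it is not part of R), and I use the stats package only because it ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # may ask for a CRAN mirror on first use
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

load_or_install("stats")    # already installed, so it is simply loaded
```

You would replace "stats" with the package you actually need, for example "ggplot2".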

R Basics

Let’s begin with basics. To get familiar with R coding environment, start with some basic calculations. R console can be used as an interactive calculator too. Type the following in your console:

> 2 + 3
[1] 5

> 6 / 3
[1] 2

> (3*8)/(2*3)
[1] 4

> log(12)
[1] 2.484907

> sqrt(121)
[1] 11
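Keep in mind that log() returns the natural logarithm unless you tell it otherwise. A short sketch comparing the common variants (the variable names are mine, chosen for illustration):

```r
# log() is the natural logarithm (base e) unless a base is supplied
natural <- log(12)            # about 2.4849
base10  <- log10(12)          # about 1.0792
base2   <- log(8, base = 2)   # 3, since 2^3 = 8

round(c(natural, base10, base2), 4)
```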

You can experiment with various combinations of calculations and see the results. To recall a previous calculation, press the ‘Up / Down Arrow’ keys on your keyboard.

If you have done many calculations, it becomes painful to scroll through every command to find the one you need. In such situations, storing results in a variable is helpful.

In R, you can create a variable using the <- or = sign. Let’s say I want to create a variable x to hold the sum of 7 and 8. I’ll write it as:

> x <- 8 + 7
> x
[1] 15

Once you assign a value to a variable, R no longer prints the output directly (like a calculator) unless you call the variable on the next line. Remember, variable names can be alphabetic or alphanumeric, but not purely numeric: a name cannot start with a digit, so x1 is valid while 1x or 15 is not.
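A small sketch of the assignment rules, with made-up variable names:

```r
x <- 8 + 7        # the usual assignment operator
y = x * 2         # "=" also works for assignment at the top level
(z <- y - x)      # wrapping in parentheses assigns and prints the value
rate_2024 <- 0.5  # letters first, digits afterwards: a valid name
```

Wrapping an assignment in parentheses is handy when you want to store a result and see it in one step.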

R Programming

Understand and practice this section thoroughly. Everything you see or create in R is an object. A vector, matrix, data frame, even a variable is an object, and R treats it that way. R has 5 basic classes of objects:

  1. Character
  2. Numeric (Real Numbers)
  3. Integer (Whole Numbers)
  4. Complex
  5. Logical (True / False)

The most basic object in R is the vector. You can create an empty vector using vector(). Remember, a vector contains objects of the same class. Let’s create vectors of different classes. We can also create a vector using the c() (concatenate) command.

> a <- c(4.8, 9.5)   #numeric
> b <- c(3 + 2i, 6 - 4i) #complex
> d <- c(29L, 34L)   #integer (the L suffix stores whole numbers as integers)
> e <- vector("logical", length = 9)
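To confirm which class each vector ended up with, you can ask R directly. A quick sketch; the character vector f is my own addition for completeness:

```r
a <- c(4.8, 9.5)
b <- c(3 + 2i, 6 - 4i)   # R writes the imaginary unit as i
d <- c(29L, 34L)         # the L suffix stores the values as integers
e <- vector("logical", length = 9)
f <- c("red", "blue")    # hypothetical character vector

sapply(list(a = a, b = b, d = d, e = e, f = f), class)
class(c(29, 34))         # "numeric": without L, whole numbers are doubles
e                        # nine FALSE values, the default for logical
```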

Data Types in R

R has various “data types”, which include vectors (numeric, integer etc.), matrices, data frames and lists. Let’s understand them one by one.

Vector

As I mentioned above, a vector contains objects of the same class. You can still mix objects of different classes in c(), but when you do, coercion occurs: R converts all the objects into a single class. For example:

> jt <- c("Time", 24, "October", TRUE, 3.33)  #character
> nb <- c(TRUE, 24) #numeric
> qd <- c(2.5, "May") #character
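Coercion always moves towards the more general class: logical, then integer, then numeric, then complex, then character. A short sketch that walks up that ladder:

```r
class(c(TRUE, 2L))        # "integer"
class(c(2L, 3.5))         # "numeric"
class(c(3.5, 1 + 2i))     # "complex"
class(c(1 + 2i, "May"))   # "character"

c(TRUE, 24)               # TRUE becomes 1
c(2.5, "May")             # 2.5 becomes "2.5"
```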

To check the class of any object, use the class(vector_name) function.

> class(jt)
[1] "character"

To convert the class of a vector, you can use the as.*() family of commands. These return a converted copy, so assign the result back if you want to change the original.

> bar <- 0:5
> class(bar)
[1] "integer"
> bar <- as.numeric(bar)
> class(bar)
[1] "numeric"
> bar <- as.character(bar)
> class(bar)
[1] "character"

Similarly, you can change the class of any vector. But you should pay attention here: if you convert a “character” vector to “numeric”, every element that does not look like a number becomes NA (with a warning). Hence, use this command carefully.
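Here is a small sketch of that pitfall, using a made-up vector of scores where one entry is text:

```r
raw_scores <- c("67", "56.5", "absent")   # hypothetical input

# suppressWarnings() hides the "NAs introduced by coercion" warning
scores <- suppressWarnings(as.numeric(raw_scores))
scores            # 67.0 56.5 NA
is.na(scores)     # FALSE FALSE TRUE
```

Checking is.na() right after a conversion is a cheap way to spot values that did not convert.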

List

A list is a special type of vector which can contain elements of different data types. For example:

> my_list <- list(12, "cb", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 12

[[2]]
[1] "cb"

[[3]]
[1] TRUE

[[4]]
[1] 1+2i

As you can see, the output of a list is different from that of a vector. This is because the objects are of different types. The double bracket [[1]] shows the index of the first element, and so on. Hence, you can easily extract an element of a list by its index. Like this:

> my_list[[3]]
[1] TRUE

You can use the single bracket [] too. But that returns a list of length one containing the element, instead of the element itself. Like this:

> my_list[3]
[[1]]
[1] TRUE
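The difference between the two brackets is easiest to see by checking the class of what comes back. A minimal sketch; the named list person is a hypothetical example of mine:

```r
my_list <- list(12, "cb", TRUE, 1 + 2i)

class(my_list[[3]])   # "logical": the element itself
class(my_list[3])     # "list": a one-element sub-list
my_list[c(1, 2)]      # single brackets can pick several elements at once

# Hypothetical named list: elements can also be reached with $
person <- list(name = "ash", score = 67)
person$score
```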

Matrices

When a vector is given a dimension attribute, i.e. rows and columns, it becomes a matrix. A matrix is a 2-dimensional data structure represented by a set of rows and columns. It consists of elements of the same class. Let’s create a matrix of 3 rows and 2 columns:

> my_matrix <- matrix(1:6, nrow=3, ncol=2)
> my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

> dim(my_matrix)
[1] 3 2

> attributes(my_matrix)
$dim
[1] 3 2

As you can see, the dimensions of a matrix can be obtained using either the dim() or the attributes() command.
To extract particular elements from a matrix, simply use the row and column indices shown above. For example (try this at your end):

> my_matrix[,2]   #extracts second column
> my_matrix[,1]   #extracts first column

> my_matrix[2,]   #extracts second row
> my_matrix[1,]   #extracts first row
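Beyond whole rows and columns, you can pick single cells or filter by a condition. A short sketch on the same matrix; by_row is a name I made up to show the fill order:

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

my_matrix[2, 2]              # row 2, column 2 -> 5
my_matrix[c(1, 3), ]         # rows 1 and 3, all columns
my_matrix[my_matrix > 4]     # logical indexing -> 5 6

# matrix() fills column by column unless byrow = TRUE
by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
by_row[2, ]                  # 3 4
```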

Interestingly, you can also create a matrix from a vector. All you need to do is assign its dimensions with dim() afterwards. Like this:

> age <- c(23, 44, 15, 12, 31, 16) # This is a vector
> age
[1] 23 44 15 12 31 16

> dim(age) <- c(2,3) # Converts the vector into a 2 rows by 3 columns Matrix
> age
     [,1] [,2] [,3]
[1,]   23   15   31
[2,]   44   12   16

#In R 4.0 and later; older versions print only "matrix"
> class(age)
[1] "matrix" "array"

You can also join two vectors using the cbind() and rbind() functions. But make sure that both vectors have the same number of elements. If not, R recycles the shorter vector to fill the gap, and issues a warning when one length is not a multiple of the other.

> x <- c(1, 2, 3, 4, 5, 6) # Vector 1
> y <- c(20, 30, 40, 50, 60, 70) # Vector 2
> cbind(x, y) # Join vector 1 as first column and vector 2 as second column
     x  y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 70

#In R 4.0 and later; older versions print only "matrix"
> class(cbind(x, y))
[1] "matrix" "array"
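For comparison, here is a sketch of rbind() on the same vectors, plus what happens when the lengths differ. The names by_col, by_row, short_y and recycled are mine:

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(20, 30, 40, 50, 60, 70)

by_col <- cbind(x, y)   # 6 rows, 2 columns
by_row <- rbind(x, y)   # 2 rows, 6 columns
dim(by_col)
dim(by_row)

# A shorter vector is recycled: 20 is reused for the sixth row,
# and R warns because 6 is not a multiple of 5
short_y <- c(20, 30, 40, 50, 60)
recycled <- suppressWarnings(cbind(x, short_y))
recycled[6, ]
```

I suppress the warning here only to keep the output clean; in real work it is a useful signal that the inputs do not line up.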

Data Frame

This is the most commonly used member of the data types family. It is used to store tabular data. It is different from a matrix: in a matrix, every element must have the same class, but in a data frame each column can hold a different class. Internally, a data frame is a list of equal-length vectors, one per column. Most functions that read data into R return a data frame. Hence, it is important to understand the commonly used commands on data frames:

#stringsAsFactors = TRUE stores name as a factor (this was the default before R 4.0)
> df <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,56,87,91), stringsAsFactors = TRUE)
> df
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91

> dim(df)
[1] 4 2

> str(df)
'data.frame': 4 obs. of 2 variables:
 $ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
 $ score: num 67 56 87 91

> nrow(df)
[1] 4

> ncol(df)
[1] 2

Let’s understand the code above.

  • df is the name of the data frame.

  • dim() returns the dimensions of the data frame, here 4 rows and 2 columns.

  • str() returns the structure of a data frame, i.e. the list of variables stored in it along with their classes.

  • nrow() and ncol() return the number of rows and number of columns in a data set respectively.
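Once you know the shape of a data frame, the next step is usually pulling out columns and rows. A minimal sketch on the same data, using only base R:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(67, 56, 87, 91),
                 stringsAsFactors = TRUE)

df$score                    # one column as a vector
df[df$score > 60, ]         # rows where score exceeds 60
df[order(-df$score), ]      # rows sorted by score, highest first
names(df)                   # column names
```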

Here you see “name” is a factor variable and “score” is numeric. In data science, a variable can be categorized into two types: Continuous and Categorical.

  • Continuous variables are those which can take any numeric value, such as 1, 2, 3.5, 4.66 etc.
  • Categorical variables are those which take only a limited set of discrete values, such as the labels “ash” and “jane”, or codes like 2, 5, 11, 15.

In R, categorical values are represented by factors. In df, name is a factor variable having 4 unique levels. Factor (categorical) variables are treated specially in a data set.
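To get a feel for how factors store categories, here is a small sketch with a made-up grade variable:

```r
grade <- factor(c("low", "high", "medium", "low", "high"))  # hypothetical data

levels(grade)        # "high" "low" "medium": sorted alphabetically
table(grade)         # counts per level
as.integer(grade)    # underlying integer codes: 2 1 3 2 1

# An explicit level order can be supplied when the default is unsuitable
ordered_grade <- factor(grade, levels = c("low", "medium", "high"))
levels(ordered_grade)
```

Each value is stored as an integer code pointing into the levels, which is why the level order matters for tables and plots.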

Missing values - Missing values in R are represented by NA; NaN (“not a number”, e.g. the result of 0/0) is also treated as missing. Now we’ll check if a data set has missing values.

#Inject NA into the 1st and 2nd rows of the 2nd column of df
> df[1:2,2] <- NA
> df
  name score
1  ash    NA
2 jane    NA
3 paul    87
4 mark    91
#Check the entire data set for NAs and return logical output
> is.na(df)
      name score
[1,] FALSE  TRUE
[2,] FALSE  TRUE
[3,] FALSE FALSE
[4,] FALSE FALSE
> table(is.na(df)) #returns a table of logical output
FALSE  TRUE
    6     2
#Return the rows having missing values
> df[!complete.cases(df),]
  name score
1  ash    NA
2 jane    NA

Missing values hinder normal calculations in a data set. For example, let’s say we want to compute the mean of score. Since there are two missing values, it can’t be done directly. Let’s see:

> mean(df$score)
[1] NA
> mean(df$score, na.rm = TRUE)
[1] 89

The na.rm = TRUE parameter tells R to ignore the NAs and compute the mean of the remaining values in the selected column (score). To remove rows with NA values from a data frame, you can use na.omit():

> new_df <- na.omit(df)
> new_df
  name score
3 paul    87
4 mark    91
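Before dropping rows, it often helps to see where the missing values sit. A minimal sketch that rebuilds the same data frame with its two NAs and counts them per column:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(NA, NA, 87, 91))

colSums(is.na(df))              # NAs per column: name 0, score 2
anyNA(df$score)                 # TRUE if any score is missing
sum(df$score, na.rm = TRUE)     # 178
nrow(na.omit(df))               # 2 complete rows remain
```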

Useful R Packages

There are thousands of packages on CRAN (~7800 at the time of writing). Since I’ve already explained the method of installing packages, you can go ahead and install them now. Sooner or later you’ll need them.

  • Importing Data: R offers a wide range of packages for importing data in many formats, such as .txt, .csv, .json, .sql etc. To import large data files quickly, it is advisable to install and use data.table, readr, RMySQL, sqldf, jsonlite.

  • Data Visualization: R has built-in plotting commands as well. They are good for creating simple graphs, but become cumbersome when it comes to advanced graphics. Hence, you should install ggplot2.

  • Data Manipulation: R has a fantastic collection of packages for data manipulation. These packages allow you to do basic & advanced computations quickly. They are dplyr, plyr, tidyr, lubridate, stringr.

  • Modeling / Machine Learning: For modeling, the caret package in R is powerful enough to cater to most needs when creating a machine learning model. However, you can also install algorithm-specific packages such as randomForest, rpart, gbm etc.
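Before you reach for the import packages above, note that small files can be read with base R alone. A minimal sketch, assuming a tiny CSV that I write to a temporary file just for illustration (the file and its contents are made up):

```r
# Write a small CSV to a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ash,67", "jane,56"), csv_path)

scores <- read.csv(csv_path, stringsAsFactors = FALSE)
str(scores)          # a data frame with 2 rows and 2 columns
mean(scores$score)   # 61.5

unlink(csv_path)     # remove the temporary file
```

For large files, the packages listed above are usually the faster choice.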

By now, you have become familiar with the basic working style of R and its associated components.

I want you to practice what you’ve learnt so far.

  • Practice: As a part of this section, install the “swirl” package. Then type library(swirl) to load the package, followed by swirl() to start, and complete the interactive R tutorial. If you have followed this article thoroughly, this session should be an easy task for you!

Watch out for Part II.