4 R basics

This section covers basic operations in R. You should be comfortable with these basics before proceeding to any kind of data analysis in R.

The RStudio interface is simple. You type R code into the bottom line of the RStudio console pane (see figure 3.3) and then press Enter to run it. The code you type is called a command, because it commands your computer to do something for you. The line you type it into is called the command line.

The hashtag ‘#’ in R
R treats the hashtag character, ‘#’, in a special way: R will not run anything that follows a hashtag on a line. This makes hashtags very useful for adding comments and annotations to your code. You will be able to read the comments, but your computer will not process them.

4.1 Basic arithmetic operations

# Try these commands in R: type each one in the console and press Enter
# The result of each command is shown on the lines starting with '##'

# + (Addition)   
7 + 4  
## [1] 11
# - (Subtraction)  
7 - 4  
## [1] 3
# * (Multiplication)
7 * 2  
## [1] 14
# / (Division)
7 / 2  
## [1] 3.5
# ^ (Exponentiation)
7 ^ 2
## [1] 49

The colon operator (:) returns every integer from the first number to the second, both included. It is an easy way to create a sequence of numbers.

100:130
##  [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
## [20] 119 120 121 122 123 124 125 126 127 128 129 130
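The colon operator also counts downwards when the first number is the larger one, and the base function seq() covers step sizes other than 1. A small sketch (the variable names and numbers are arbitrary illustrations, not taken from the text above):

```r
# Counting down: the first number is larger than the second
countdown <- 5:1
print(countdown)
## [1] 5 4 3 2 1

# seq() generalises the colon operator with a step size
odd_numbers <- seq(1, 10, by = 2)
print(odd_numbers)
## [1] 1 3 5 7 9
```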

4.2 Basic arithmetic functions

# Logarithm to the base e  
log(4)  
## [1] 1.386294
# Logarithm to the base 10  
log10(4)  
## [1] 0.60206
# Logarithm to the base 2  
log2(4)  
## [1] 2
# absolute value
abs(-4)  
## [1] 4
# square root
sqrt(4) 
## [1] 2
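For bases other than e, 10 and 2, log() accepts an optional base argument, and exp() undoes the natural logarithm. A short sketch with arbitrary example values:

```r
# log() with an explicit base
log(8, base = 2)
## [1] 3
log(100, base = 10)
## [1] 2

# exp() is the inverse of the natural logarithm; exp(1) is the constant e
exp(1)
## [1] 2.718282
```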

4.3 Assigning values to variables

x <- 2
# The value 2 is assigned to the variable x

y <- 5  

# The value 5 is assigned to the variable y

# Now you can use any operator between x and y, for example x + y, as shown below  

x + y  
## [1] 7
x * y  
## [1] 10
x / y  
## [1] 0.4
x + 2*y  
## [1] 12

Note that R is case sensitive, i.e. x and X are not equal.
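A quick way to see this is to create two variables whose names differ only in case. The names total and Total below are hypothetical and chosen only for illustration:

```r
# Two different variables whose names differ only in case
total <- 2
Total <- 10

# They hold different values ...
total == Total
## [1] FALSE

# ... and can be used side by side
total + Total
## [1] 12
```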

4.4 Basic data types

  • numeric
  • character
  • logical

my_age <- 32   
# The numeric value 32 is assigned to my_age

# When naming variables, it is preferred to use '_' instead of a space  

my_name <- "Dr Pratheesh" # Character variable  

#  Are you a data scientist?: (yes/no) <=> (TRUE/FALSE)  

is_datascientist <- TRUE # logical variable
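If you are unsure which type a variable holds, the base function class() reports it. A minimal sketch that repeats the three assignments so it runs on its own:

```r
my_age <- 32
my_name <- "Dr Pratheesh"
is_datascientist <- TRUE

# class() returns the data type as a character string
class(my_age)
## [1] "numeric"
class(my_name)
## [1] "character"
class(is_datascientist)
## [1] "logical"
```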

4.4.1 Vectors

A vector is a combination of multiple values of the same type (numeric, character or logical).

  • Create a vector: c() (for concatenate)
  • Missing values: NA (not available) and NaN (not a number)
  • Get a subset of a vector: my_vector[i] returns the ith element

# Create a numeric vector
student_ages <- c(27, 25, 29, 26, 20, 21, 23, 25)  

# Create a character vector
student_name <- c("asha", "adhi", "aravind", 
                  "mary", "peter", "daisy", 
                  "papu", "ramu")  

# subset of a vector  

# Obtain the 3rd element of student_name  

student_name[3]
## [1] "aravind"
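Missing values are mentioned above but not shown in action. The sketch below uses a hypothetical vector called marks to illustrate how NA and NaN behave, and how subsetting works with several positions or with negative positions:

```r
# A vector with one missing value (hypothetical marks, for illustration only)
marks <- c(12, NA, 7)

# is.na() flags the missing entries
is.na(marks)
## [1] FALSE  TRUE FALSE

# Most summary functions return NA unless na.rm = TRUE is given
mean(marks)
## [1] NA
mean(marks, na.rm = TRUE)
## [1] 9.5

# NaN arises from undefined arithmetic such as 0 / 0
0 / 0
## [1] NaN
is.nan(0 / 0)
## [1] TRUE

# Subsetting accepts several positions, or negative positions to drop elements
marks[c(1, 3)]
## [1] 12  7
marks[-2]
## [1] 12  7
```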


Calculations with vectors

Commonly used functions on a numeric vector x: max(x), min(x), range(x), length(x), sum(x), mean(x), prod(x) (product of the elements in x), sd(x) (standard deviation), var(x) (variance) and sort(x).

# Create a numeric vector
student_ages <- c(27, 25, 29, 26, 20, 21, 23, 25)  

# Maximum value of the vector
max(student_ages)
## [1] 29
# Minimum value of the vector
min(student_ages)
## [1] 20
# Range of the vector
range(student_ages)  
## [1] 20 29
# Length of the vector
length(student_ages)  
## [1] 8
# Total of the values in the vector
sum(student_ages)
## [1] 196
# Mean of the vector
mean(student_ages)  
## [1] 24.5
# Product of the elements in the vector
prod(student_ages)
## [1] 122911425000
# Standard deviation  
sd(student_ages)
## [1] 3.023716
# Variance of the vector
var(student_ages)
## [1] 9.142857
# Sort the values of the vector

# Ascending order
sort(student_ages,decreasing = FALSE)
## [1] 20 21 23 25 25 26 27 29
# Descending order
sort(student_ages,decreasing = TRUE)
## [1] 29 27 26 25 25 23 21 20
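Beyond these summary functions, arithmetic and comparisons on a vector are applied element by element, and the result can be used to index a second vector of the same length. A sketch that reuses the two student vectors from this section (which.max() is a base R function not introduced above):

```r
student_ages <- c(27, 25, 29, 26, 20, 21, 23, 25)
student_name <- c("asha", "adhi", "aravind",
                  "mary", "peter", "daisy",
                  "papu", "ramu")

# Arithmetic is applied to every element
student_ages + 1
## [1] 28 26 30 27 21 22 24 26

# which.max() returns the position of the largest value ...
which.max(student_ages)
## [1] 3
# ... which can be used to look up the matching name
student_name[which.max(student_ages)]
## [1] "aravind"

# A logical condition selects the matching elements
student_name[student_ages > 25]
## [1] "asha"    "aravind" "mary"
```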

4.4.2 Matrices

A matrix is a homogeneous collection of data arranged in two dimensions: an m × n array whose elements all have the same data type. It can be created from a vector input and has a fixed number of rows and columns. You can perform many arithmetic operations on R matrices, such as addition, subtraction, multiplication and division.

  • Create and name a matrix: matrix(), cbind(), rbind(), rownames(), colnames()
  • Transpose a matrix: t()
  • Dimensions of a matrix: ncol(), nrow(), dim()
  • Get a subset of a matrix: my_data[row, col]
  • Calculations with numeric matrices: rowSums(), colSums(), rowMeans(), colMeans()

# make three vectors c1, c2 and c3
c1<-c(3,4,5)
c2<-c(7,8,9)
c3<-c(11,12,13)

# Creating matrix by binding column wise  
A <- cbind(c1,c2,c3)
print(A)
##      c1 c2 c3
## [1,]  3  7 11
## [2,]  4  8 12
## [3,]  5  9 13
# Creating matrix by binding row wise  
B <- rbind(c1,c2,c3)
print(B)  
##    [,1] [,2] [,3]
## c1    3    4    5
## c2    7    8    9
## c3   11   12   13
# Adding two matrices
A+B
##      c1 c2 c3
## [1,]  6 11 16
## [2,] 11 16 21
## [3,] 16 21 26
# Element-by-element multiplication of two matrices

A*B
##      c1  c2  c3
## [1,]  9  28  55
## [2,] 28  64 108
## [3,] 55 108 169
# Matrix multiplication  

A%*%B
##      [,1] [,2] [,3]
## [1,]  179  200  221
## [2,]  200  224  248
## [3,]  221  248  275
# Using matrix() function
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
##      [,1] [,2] [,3]
## [1,]    3    4    5
## [2,]    6    7    8
## [3,]    9   10   11
## [4,]   12   13   14
# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
##      [,1] [,2] [,3]
## [1,]    3    7   11
## [2,]    4    8   12
## [3,]    5    9   13
## [4,]    6   10   14
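# Define the row and column names.
# (Stored as row_names and col_names so the base functions rownames() and colnames() are not masked.)
row_names <- c("row1", "row2", "row3", "row4")
col_names <- c("col1", "col2", "col3")

P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(row_names, col_names))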
print(P)  
##      col1 col2 col3
## row1    3    4    5
## row2    6    7    8
## row3    9   10   11
## row4   12   13   14
# Access the element at 3rd column and 1st row.
P[1,3]
## [1] 5
# Access the element at 2nd column and 4th row.
P[4,2]
## [1] 13
# Access only the  2nd row.
P[2,]
## col1 col2 col3 
##    6    7    8
# Access only the 3rd column.  
P[,3]  
## row1 row2 row3 row4 
##    5    8   11   14
# Assign element at 1st row and 3rd column of matrix P to the variable x  

x<-P[1,3]  
print(x)
## [1] 5
# create a vector from the second column of matrix P  

y<-P[,2]
y<-as.vector(y)

#Transpose a matrix
Q<-t(P)
print(Q)
##      row1 row2 row3 row4
## col1    3    6    9   12
## col2    4    7   10   13
## col3    5    8   11   14
# Knowing dimensions of a matrix

#Number of columns in A
ncol(A)  
## [1] 3
#Number of rows in A
nrow(A)
## [1] 3
# Number of rows and columns
dim(A)  
## [1] 3 3
# Calculations with numeric matrices  

# Compute row sums of matrix A
rowSums(A)  
## [1] 21 24 27
# Compute column sums of matrix A
colSums(A)  
## c1 c2 c3 
## 12 24 36
# Compute row means of matrix A
rowMeans(A)  
## [1] 7 8 9
# Compute column means of matrix A
colMeans(A)
## c1 c2 c3 
##  4  8 12
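The functions rownames() and colnames() listed at the start of this section can also name a matrix after it has been created. A minimal sketch with an arbitrary 2 × 3 matrix (the object name small_mat and the labels are hypothetical):

```r
# A small 2 x 3 matrix, filled column by column
small_mat <- matrix(1:6, nrow = 2)

# Name the rows and columns after the matrix has been created
rownames(small_mat) <- c("r1", "r2")
colnames(small_mat) <- c("a", "b", "c")
print(small_mat)
##    a b c
## r1 1 3 5
## r2 2 4 6

# Elements can then be accessed by name as well as by position
small_mat["r2", "b"]
## [1] 4
```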

4.4.3 Data frames

A data frame is the data type we will use most frequently. A data frame is like a matrix but can have columns of different types (numeric, character, logical). Rows are observations (individuals) and columns are variables. The function data.frame() is used to create a data frame:

friends_data <- data.frame(
  name = c("asha", "adhi", "aravind", 
                  "mary"), 
  age = c(20,23,22,21),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE)  
)  

# Print
friends_data
##      name age height married
## 1    asha  20    180    TRUE
## 2    adhi  23    170   FALSE
## 3 aravind  22    185    TRUE
## 4    mary  21    169    TRUE
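To confirm that each column keeps its own type, you can ask for the class of individual columns. The sketch below rebuilds friends_data so that it runs on its own; stringsAsFactors = FALSE is spelled out as an assumption so the name column stays character on R versions before 4.0 as well:

```r
friends_data <- data.frame(
  name = c("asha", "adhi", "aravind", "mary"),
  age = c(20, 23, 22, 21),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Each column has its own data type
class(friends_data$name)
## [1] "character"
class(friends_data$age)
## [1] "numeric"
class(friends_data$married)
## [1] "logical"

# Number of observations (rows) and variables (columns)
nrow(friends_data)
## [1] 4
ncol(friends_data)
## [1] 4
```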

To check whether an object is a data frame, use the is.data.frame() function. It returns TRUE if the object is a data frame:

is.data.frame(friends_data)  
## [1] TRUE

To check whether an object is a matrix, a data frame or any other class, use the function class():

# What is the class of A? --> matrix
class(A)  
## [1] "matrix" "array"
# What is the class of friends_data? --> data.frame
class(friends_data)
## [1] "data.frame"

To convert an object of another class to a data frame, use the function as.data.frame():

# Convert the matrix A to a data frame
A2 <- as.data.frame(A)
# Now, the class is data.frame
class(A2)
## [1] "data.frame"

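As described in the matrix section, you can use the function t() to transpose a data frame. The result is a matrix, and because friends_data mixes column types, every value is converted to character: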

t(friends_data)
##         [,1]   [,2]    [,3]      [,4]  
## name    "asha" "adhi"  "aravind" "mary"
## age     "20"   "23"    "22"      "21"  
## height  "180"  "170"   "185"     "169" 
## married "TRUE" "FALSE" "TRUE"    "TRUE"

4.4.3.1 Operations on data frame

Positive indexing by name and by location

# Access the data in 'name' column in friends_data
# dollar sign is used
friends_data$name  
## [1] "asha"    "adhi"    "aravind" "mary"
# or use this
friends_data[, 'name']  
## [1] "asha"    "adhi"    "aravind" "mary"
# Subset columns 1 and 3
friends_data[ , c(1, 3)]  
##      name height
## 1    asha    180
## 2    adhi    170
## 3 aravind    185
## 4    mary    169
# Subset columns 1 to 3  
friends_data[ , c(1:3)]    
##      name age height
## 1    asha  20    180
## 2    adhi  23    170
## 3 aravind  22    185
## 4    mary  21    169

Negative Indexing

# Exclude column 1  

friends_data[, -1]  
##   age height married
## 1  20    180    TRUE
## 2  23    170   FALSE
## 3  22    185    TRUE
## 4  21    169    TRUE

Index by characteristics

We want to select all friends with age greater than or equal to 22.

# Identify rows that meet the condition
friends_data$age >= 22   
## [1] FALSE  TRUE  TRUE FALSE
# TRUE specifies that the row contains a value of age >= 22.
# Select the rows that meet the condition
friends_data[friends_data$age >= 22, ]  
##      name age height married
## 2    adhi  23    170   FALSE
## 3 aravind  22    185    TRUE
# The R code above tells R to get all rows from friends_data where age >= 22

If you don’t want to see all the column data for the selected rows but are just interested in displaying, for example, friend names and age for friends with age >= 22, you could use the following R code:

# Use column locations
friends_data[friends_data$age >= 22,  c(1, 2)]  
##      name age
## 2    adhi  23
## 3 aravind  22
# Or use column names
friends_data[friends_data$age >= 22, c("name", "age")]
##      name age
## 2    adhi  23
## 3 aravind  22

If you’re finding that your selection statement is starting to be inconvenient, you can put your row and column selections into variables first, such as:

age22 <- friends_data$age >= 22
cols <- c("name", "age")
# Then you can select the rows and columns with those variables:

friends_data[age22, cols]
##      name age
## 2    adhi  23
## 3 aravind  22

It’s also possible to use the function subset() as follows:

# Select friends data with age >= 22
subset(friends_data, age >= 22)  
##      name age height married
## 2    adhi  23    170   FALSE
## 3 aravind  22    185    TRUE

Another option is to use the functions attach() and detach(). The function attach() takes a data frame and makes its columns accessible by simply giving their names. Call detach() when you are done, so that the column names do not linger on the search path.

# Attach a data frame
attach(friends_data)
# === Data manipulation ====
friends_data[age>=22, ]
##      name age height married
## 2    adhi  23    170   FALSE
## 3 aravind  22    185    TRUE
# === End of data manipulation ====
# Detach the data frame
detach(friends_data)
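If you would rather not manage attach() and detach() by hand, base R's with() offers a similar convenience: it evaluates a single expression with the columns visible and leaves nothing attached afterwards. A minimal sketch, rebuilding friends_data so the block runs on its own (the object name older is hypothetical):

```r
friends_data <- data.frame(
  name = c("asha", "adhi", "aravind", "mary"),
  age = c(20, 23, 22, 21),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Columns are visible only inside the with() call
with(friends_data, mean(age))
## [1] 21.5

# Row selection written with with()
older <- with(friends_data, friends_data[age >= 22, ])
older$name
## [1] "adhi"    "aravind"
```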

Extend a data frame
Add a new column to a data frame

# Add a degree column to friends_data
degree <- c("BSc", "MSc", "Phd", "Btech")
friends_data$degree <- degree
friends_data
##      name age height married degree
## 1    asha  20    180    TRUE    BSc
## 2    adhi  23    170   FALSE    MSc
## 3 aravind  22    185    TRUE    Phd
## 4    mary  21    169    TRUE  Btech

It’s also possible to use the functions cbind() and rbind() to extend a data frame.

roll_no<-c(234,235,236,238)
cbind(friends_data, roll_no)
##      name age height married degree roll_no
## 1    asha  20    180    TRUE    BSc     234
## 2    adhi  23    170   FALSE    MSc     235
## 3 aravind  22    185    TRUE    Phd     236
## 4    mary  21    169    TRUE  Btech     238
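The example above adds a column with cbind(); rbind() works the same way for rows, provided the new row has the same column names as the existing data frame. A sketch using a small hypothetical data frame (friends_small, new_friend and their values are made up for illustration):

```r
# A small data frame with two columns
friends_small <- data.frame(
  name = c("asha", "adhi"),
  age = c(20, 23),
  stringsAsFactors = FALSE
)

# The new row is itself a one-row data frame with matching column names
new_friend <- data.frame(name = "peter", age = 24, stringsAsFactors = FALSE)

# Bind the new row underneath the existing rows
friends_more <- rbind(friends_small, new_friend)
friends_more
##    name age
## 1  asha  20
## 2  adhi  23
## 3 peter  24
```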

Calculations with data frame

With a numeric data frame, you can use the functions rowSums(), colSums(), colMeans(), rowMeans() and apply().

# Following can be used when it is a numeric data frame

# creating a numeric data frame
c1<-c(3,4,5)
c2<-c(7,8,9)
c3<-c(11,12,13)

# Creating matrix by binding column wise  
ex <- cbind(c1,c2,c3)

# converting matrix to data frame
ex<-as.data.frame(ex)
ex
##   c1 c2 c3
## 1  3  7 11
## 2  4  8 12
## 3  5  9 13
# Row sums of the data frame
rowSums(ex)
## [1] 21 24 27
# Column sums of the data frame
colSums(ex)
## c1 c2 c3 
## 12 24 36
# Calculations on selected numeric columns of a non-numeric data frame
attach(friends_data)
sum(age)
## [1] 86
mean(age)
## [1] 21.5
# Detach the data frame again when finished
detach(friends_data)

apply() function

apply(X, MARGIN, FUN)

Here:

  • X: an array or matrix (a data frame is coerced to a matrix)
  • MARGIN: a value of 1, 2 or c(1, 2) that defines where to apply the function:
      • MARGIN = 1: the manipulation is performed on rows
      • MARGIN = 2: the manipulation is performed on columns
      • MARGIN = c(1, 2): the manipulation is performed on rows and columns
  • FUN: the function to apply. Built-in functions like mean, median, sum, min and max, and even user-defined functions, can be applied.

apply(ex,2, mean)
## c1 c2 c3 
##  4  8 12
apply(ex,1, mean)
## [1] 7 8 9
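The FUN argument can also be a function you write yourself. The sketch below uses the same numbers as ex, kept as a matrix, with a hypothetical helper called spread that returns the difference between the largest and smallest value:

```r
# The same numbers as ex, stored as a matrix
num_mat <- cbind(c1 = c(3, 4, 5), c2 = c(7, 8, 9), c3 = c(11, 12, 13))

# User-defined function: largest value minus smallest value
spread <- function(v) max(v) - min(v)

# Spread of each row (MARGIN = 1)
apply(num_mat, 1, spread)
## [1] 8 8 8

# Spread of each column (MARGIN = 2)
apply(num_mat, 2, spread)
## c1 c2 c3 
##  2  2  2
```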

Practical 1: Matrices and vectors

  1. Create a vector of numbers from 23 to 33
  2. Assign the vector to the name “numbers”
  3. Create a matrix named “mat_num” with elements from 1 to 20 in 4 rows and 5 columns
  4. Name the columns and rows of matrix mat_num with any names you like
  5. Compute column means and row means of matrix mat_num
  6. Access the element in second row and third column of matrix mat_num

Practical 2: Data frames

  1. Create a data frame of 10 students with name, age (between 20 and 25), height and degree.
  2. Select all the rows with age above 23
  3. Find the mean and total of age