3 R basics
This section covers basic operations in R. You should be comfortable with these basics before attempting any kind of data analysis in R.
The RStudio interface is simple. You type R code into the bottom line of the RStudio console pane (see figure 2.3) and then press Enter to run it. The code you type is called a command, because it commands your computer to do something for you. The line you type it into is called the command line.
‘#’ hashtag in R
R treats the hashtag character, ‘#’, in a special way: it will not run anything that follows a hashtag on a line. This makes hashtags very useful for adding comments and annotations to your code. You will be able to read the comments, but your computer will not process them.
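As a small illustration of this behaviour, the sketch below mixes a full-line comment, an end-of-line comment and a commented-out command. The variable name `result` is only an example and is not used elsewhere in this section.

```r
# A full-line comment: R skips this line entirely
7 + 4   # an end-of-line comment: R evaluates 7 + 4 and ignores the rest

# 100 / 0   this command is commented out, so it is never run

result <- 7 + 4  # store the value so that it can be reused
print(result)    # prints [1] 11
```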
3.1 Basic arithmetic operations
# Try these commands in R: type each one in the console and press Enter
# The result of each command is shown on the line that starts with ##
# + (Addition)
7 + 4
## [1] 11
# - (Subtraction)
7 - 4
## [1] 3
# * (Multiplication)
7 * 2
## [1] 14
# / (division)
7 / 2
## [1] 3.5
# ^ (exponentiation)
7 ^ 2
## [1] 49
The colon operator (:) returns every integer between two integers, including the two endpoints. It is an easy way to create a sequence of numbers.
100:130
## [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
## [20] 119 120 121 122 123 124 125 126 127 128 129 130
3.2 Basic arithmetic functions
# Logarithm to the base e
log(4)
## [1] 1.386294
# Logarithm to the base 10
log10(4)
## [1] 0.60206
# Logarithm to the base 2
log2(4)
## [1] 2
# absolute value
abs(-4)
## [1] 4
# square root
sqrt(4)
## [1] 2
3.3 Assigning values to variables
x <- 2
# the value 2 is assigned to the variable x
y <- 5
# the value 5 is assigned to the variable y
# Now x and y can be combined with any operator, as shown below
x + y
## [1] 7
x * y
## [1] 10
x / y
## [1] 0.4
x + 2*y
## [1] 12
Note that R is case sensitive, i.e. x and X are two different variables.
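A minimal sketch of what case sensitivity means in practice. The variable `X` and its value are introduced here only for illustration.

```r
x <- 2
X <- 10      # a separate variable; assigning to X leaves x unchanged

x            # still 2
X            # 10
x == X       # FALSE: the two names refer to different values
x + X        # 12
```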
3.4 Basic data types
- numeric
- character
- logical
my_age <- 32
# The numeric value 32 is assigned to my_age
# In variable names it is preferred to use '_' instead of a space
my_name <- "Dr Pratheesh" # Character variable
# Are you a data scientist?: (yes/no) <=> (TRUE/FALSE)
is_datascientist <- TRUE # logical variable
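To confirm which of the three basic types a variable holds, you can ask R directly. The sketch below repeats the assignments above and inspects them with class() and is.numeric().

```r
my_age <- 32
my_name <- "Dr Pratheesh"
is_datascientist <- TRUE

class(my_age)            # "numeric"
class(my_name)           # "character"
class(is_datascientist)  # "logical"

is.numeric(my_age)       # TRUE
is.numeric(my_name)      # FALSE
```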
3.4.1 Vectors
A vector is a combination of multiple values of the same type (numeric, character or logical).
- Create a vector: c() (for concatenate)
- Missing values: NA (not available) and NaN (not a number)
- Get a subset of a vector: my_vector[i] returns the ith element
# Create a numeric vector
student_ages <- c(27, 25, 29, 26, 20, 21, 23, 25)
# Create a character vector
student_name <- c("asha", "adhi", "aravind",
"mary", "peter", "daisy",
"papu", "ramu")
# subset of a vector
# obtain the 3rd element of student_name
student_name[3]
## [1] "aravind"
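The missing-value markers NA and NaN mentioned above are not demonstrated in the examples, so here is a minimal sketch. The vector `ages_with_gap` is hypothetical: it assumes the age of one student was not recorded.

```r
# Hypothetical vector: the third age was not recorded
ages_with_gap <- c(27, 25, NA, 26)

is.na(ages_with_gap)               # FALSE FALSE  TRUE FALSE
mean(ages_with_gap)                # NA, because one value is missing
mean(ages_with_gap, na.rm = TRUE)  # 26, computed from the three known values

0 / 0                              # NaN: the result is not a number
is.nan(0 / 0)                      # TRUE
```

Many summary functions, including sum(), max() and sd(), accept the same na.rm = TRUE argument.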
Calculations with vectors
Useful functions: max(x), min(x), range(x), length(x), sum(x), mean(x), prod(x) (product of the elements in x), sd(x) (standard deviation), var(x) (variance), sort(x)
# Create a numeric vector
student_ages <- c(27, 25, 29, 26, 20, 21, 23, 25)
# Maximum value of the vector
max(student_ages)
## [1] 29
# Minimum value of the vector
min(student_ages)
## [1] 20
# Range of the vector
range(student_ages)
## [1] 20 29
# Length of the vector
length(student_ages)
## [1] 8
# Total of the values in the vector
sum(student_ages)
## [1] 196
# Mean of the vector
mean(student_ages)
## [1] 24.5
# Product of the elements in the vector
prod(student_ages)
## [1] 122911425000
# Standard deviation
sd(student_ages)
## [1] 3.023716
# Variance of the vector
var(student_ages)
## [1] 9.142857
# Sort the values of the vector
# Ascending order
sort(student_ages,decreasing = FALSE)
## [1] 20 21 23 25 25 26 27 29
# Descending order
sort(student_ages,decreasing = TRUE)
## [1] 29 27 26 25 25 23 21 20
3.4.2 Matrices
A matrix is a homogeneous collection of values arranged in two dimensions: an m × n array whose elements all share one data type. It can be created from a vector input and has a fixed number of rows and columns. You can perform many arithmetic operations on an R matrix, such as addition, subtraction, multiplication and division.
- Create and name a matrix: matrix(), cbind(), rbind(), rownames(), colnames()
- Transpose a matrix: t()
- Dimensions of a matrix: ncol(), nrow(), dim()
- Get a subset of a matrix: my_data[row, col]
- Calculations with numeric matrices: rowSums(), colSums(), rowMeans(), colMeans()
# make three vectors c1, c2 and c3
c1<-c(3,4,5)
c2<-c(7,8,9)
c3<-c(11,12,13)
# Creating matrix by binding column wise
A <- cbind(c1,c2,c3)
print(A)
## c1 c2 c3
## [1,] 3 7 11
## [2,] 4 8 12
## [3,] 5 9 13
# Creating matrix by binding row wise
B <- rbind(c1,c2,c3)
print(B)
## [,1] [,2] [,3]
## c1 3 4 5
## c2 7 8 9
## c3 11 12 13
# Adding two matrices
A+B
## c1 c2 c3
## [1,] 6 11 16
## [2,] 11 16 21
## [3,] 16 21 26
# Element-by-element multiplication of two matrices
A*B
## c1 c2 c3
## [1,] 9 28 55
## [2,] 28 64 108
## [3,] 55 108 169
# Matrix multiplication
A%*%B
## [,1] [,2] [,3]
## [1,] 179 200 221
## [2,] 200 224 248
## [3,] 221 248 275
# Using matrix() function
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
## [,1] [,2] [,3]
## [1,] 3 4 5
## [2,] 6 7 8
## [3,] 9 10 11
## [4,] 12 13 14
# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
## [,1] [,2] [,3]
## [1,] 3 7 11
## [2,] 4 8 12
## [3,] 5 9 13
## [4,] 6 10 14
# Define the column and row names.
row_names <- c("row1", "row2", "row3", "row4")
col_names <- c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(row_names, col_names))
print(P)
## col1 col2 col3
## row1 3 4 5
## row2 6 7 8
## row3 9 10 11
## row4 12 13 14
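The functions rownames() and colnames() listed at the start of this section can also set the names after a matrix has been created. A minimal sketch, using a separate matrix `P2` (a name chosen here only for illustration) so that P is left untouched:

```r
# Build the matrix first, without any names
P2 <- matrix(3:14, nrow = 4, byrow = TRUE)

# Assign the row and column names afterwards
rownames(P2) <- c("row1", "row2", "row3", "row4")
colnames(P2) <- c("col1", "col2", "col3")
print(P2)

# Names can then be used in place of positions when subsetting
P2["row2", "col3"]   # 8
```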
# Access the element at 3rd column and 1st row.
P[1,3]
## [1] 5
# Access the element at 2nd column and 4th row.
P[4,2]
## [1] 13
# Access only the 2nd row.
P[2,]
## col1 col2 col3
## 6 7 8
# Access only the 3rd column.
P[,3]
## row1 row2 row3 row4
## 5 8 11 14
# Assign element at 1st row and 3rd column of matrix P to the variable x
x<-P[1,3]
print(x)
## [1] 5
# create a vector from the second column of matrix P
y<-P[,2]
y<-as.vector(y)
# Transpose a matrix
Q<-t(P)
print(Q)
## row1 row2 row3 row4
## col1 3 6 9 12
## col2 4 7 10 13
## col3 5 8 11 14
# Dimensions of a matrix
# Number of columns in A
ncol(A)
## [1] 3
# Number of rows in A
nrow(A)
## [1] 3
# Number of rows and columns
dim(A)
## [1] 3 3
# Calculations with numeric matrices
# Compute row sums of matrix A
rowSums(A)
## [1] 21 24 27
# Compute column sums of matrix A
colSums(A)
## c1 c2 c3
## 12 24 36
# Compute row means of matrix A
rowMeans(A)
## [1] 7 8 9
# Compute column means of matrix A
colMeans(A)
## c1 c2 c3
## 4 8 12
3.4.3 Data frames
A data frame is the data type we will use most frequently. A data frame is like a matrix, but its columns can have different types (numeric, character, logical). Rows are observations (individuals) and columns are variables. The function data.frame() is used to create a data frame.
friends_data <- data.frame(
name = c("asha", "adhi", "aravind",
"mary"),
age = c(20,23,22,21),
height = c(180, 170, 185, 169),
married = c(TRUE, FALSE, TRUE, TRUE)
)
# Print
friends_data
## name age height married
## 1 asha 20 180 TRUE
## 2 adhi 23 170 FALSE
## 3 aravind 22 185 TRUE
## 4 mary 21 169 TRUE
To check whether an object is a data frame, use the is.data.frame() function. It returns TRUE if the object is a data frame:
is.data.frame(friends_data)
## [1] TRUE
To check whether an object is a matrix, a data frame or any other class, use the function class():
# What is the class of A? --> matrix
class(A)
## [1] "matrix" "array"
# What is the class of friends_data? --> data.frame
class(friends_data)
## [1] "data.frame"
To convert an object of another class to a data frame, use the function as.data.frame():
# Convert the matrix A to a data frame
A2 <- as.data.frame(A)
# Now, the class is data.frame
class(A2)
## [1] "data.frame"
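As described in the matrix section, you can use the function t() to transpose a data frame. The result is a matrix, and because the columns have mixed types, every value is converted to character: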
t(friends_data)
## [,1] [,2] [,3] [,4]
## name "asha" "adhi" "aravind" "mary"
## age "20" "23" "22" "21"
## height "180" "170" "185" "169"
## married "TRUE" "FALSE" "TRUE" "TRUE"
3.4.3.1 Operations on data frame
Positive indexing by name and by location
# Access the data in the 'name' column of friends_data
# using the dollar sign ($)
friends_data$name
## [1] "asha" "adhi" "aravind" "mary"
# or use this
friends_data[, 'name']
## [1] "asha" "adhi" "aravind" "mary"
# Subset columns 1 and 3
friends_data[ , c(1, 3)]
## name height
## 1 asha 180
## 2 adhi 170
## 3 aravind 185
## 4 mary 169
# Subset columns 1 to 3
friends_data[ , c(1:3)]
## name age height
## 1 asha 20 180
## 2 adhi 23 170
## 3 aravind 22 185
## 4 mary 21 169
Negative Indexing
# Exclude column 1
friends_data[, -1]
## age height married
## 1 20 180 TRUE
## 2 23 170 FALSE
## 3 22 185 TRUE
## 4 21 169 TRUE
Index by characteristics
We want to select all friends with age greater than or equal to 22.
# Identify rows that meet the condition
friends_data$age >= 22
## [1] FALSE TRUE TRUE FALSE
# TRUE specifies that the row contains a value of age >= 22.
# Select the rows that meet the condition
friends_data[friends_data$age >= 22, ]
## name age height married
## 2 adhi 23 170 FALSE
## 3 aravind 22 185 TRUE
# The R code above tells R to get all rows from friends_data where age >= 22
If you don’t want to see all the column data for the selected rows but are just interested in displaying, for example, friend names and age for friends with age >= 22, you could use the following R code:
# Use column locations
friends_data[friends_data$age >= 22, c(1, 2)]
## name age
## 2 adhi 23
## 3 aravind 22
# Or use column names
friends_data[friends_data$age >= 22, c("name", "age")]
## name age
## 2 adhi 23
## 3 aravind 22
If you’re finding that your selection statement is starting to be inconvenient, you can put your row and column selections into variables first, such as:
age22 <- friends_data$age >= 22
cols <- c("name", "age")
# Then you can select the rows and columns with those variables:
friends_data[age22, cols]
## name age
## 2 adhi 23
## 3 aravind 22
It’s also possible to use the function subset(), as follows.
# Select friends data with age >= 22
subset(friends_data, age >= 22)
## name age height married
## 2 adhi 23 170 FALSE
## 3 aravind 22 185 TRUE
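subset() can also restrict the columns through its select argument, which combines the row and column selection shown earlier into a single call. A minimal sketch; the data frame is rebuilt here only so that the block runs on its own, and the name `older` is chosen for illustration.

```r
friends_data <- data.frame(
  name = c("asha", "adhi", "aravind", "mary"),
  age = c(20, 23, 22, 21),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE)
)

# Keep rows with age >= 22 and only the name and age columns
older <- subset(friends_data, age >= 22, select = c(name, age))
print(older)
```

The result holds the same two rows (adhi and aravind) as friends_data[friends_data$age >= 22, c("name", "age")].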
Another option is to use the functions attach() and detach(). The function attach() takes a data frame and makes its columns accessible by simply giving their names.
# Attach a data frame
attach(friends_data)
# === Data manipulation ====
friends_data[age>=22, ]
## name age height married
## 2 adhi 23 170 FALSE
## 3 aravind 22 185 TRUE
# === End of data manipulation ====
# Detach the data frame
detach(friends_data)
Extend a data frame
Add new column in a data frame
# Add a degree column to friends_data
data<-c("BSc", "MSc", "Phd", "Btech")
friends_data$degree <- data
friends_data
## name age height married degree
## 1 asha 20 180 TRUE BSc
## 2 adhi 23 170 FALSE MSc
## 3 aravind 22 185 TRUE Phd
## 4 mary 21 169 TRUE Btech
It’s also possible to use the functions cbind() and rbind() to extend a data frame.
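The call that produced the table below is not shown in the text. The following is a minimal sketch that would give the same result, assuming the roll numbers are supplied directly as a new column named roll_no (the values are copied from that output). The data frame is rebuilt here only so that the block runs on its own. rbind() can be used in the same way to append rows, as long as the new rows have the same columns.

```r
friends_data <- data.frame(
  name = c("asha", "adhi", "aravind", "mary"),
  age = c(20, 23, 22, 21),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  degree = c("BSc", "MSc", "Phd", "Btech")
)

# Assumed reconstruction: bind a new roll_no column onto the data frame
friends_data <- cbind(friends_data, roll_no = c(234, 235, 236, 238))
friends_data
```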
## name age height married degree roll_no
## 1 asha 20 180 TRUE BSc 234
## 2 adhi 23 170 FALSE MSc 235
## 3 aravind 22 185 TRUE Phd 236
## 4 mary 21 169 TRUE Btech 238
Calculations with data frame
With a numeric data frame, you can use the functions rowSums(), colSums(), colMeans(), rowMeans() and apply().
# The following functions can be used on a numeric data frame
# Create a numeric data frame
c1<-c(3,4,5)
c2<-c(7,8,9)
c3<-c(11,12,13)
# Creating matrix by binding column wise
ex <- cbind(c1,c2,c3)
# converting matrix to data frame
ex<-as.data.frame(ex)
ex
## c1 c2 c3
## 1 3 7 11
## 2 4 8 12
## 3 5 9 13
rowSums(ex)
## [1] 21 24 27
colSums(ex)
## c1 c2 c3
## 12 24 36
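# For a single numeric column, apply sum() and mean() to that column
sum(friends_data$age)
## [1] 86
mean(friends_data$age)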
## [1] 21.5
apply() function
apply(X, MARGIN, FUN)
Here:
- X: an array or matrix
- MARGIN: a value or range between 1 and 2 that defines where to apply the function:
- MARGIN=1: the manipulation is performed on rows
- MARGIN=2: the manipulation is performed on columns
- MARGIN=c(1,2): the manipulation is performed on rows and columns, that is, on each individual element
- FUN: the function to apply. Built-in functions such as mean, median, sum, min and max, and even user-defined functions, can be applied.
apply(ex,2, mean)
## c1 c2 c3
## 4 8 12
apply(ex,1, mean)
## [1] 7 8 9
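The list above mentions user-defined functions and MARGIN=c(1,2), neither of which appears in the examples. A minimal sketch of both follows; the helper `value_spread` is hypothetical and is defined here only for illustration.

```r
# Rebuild the numeric data frame so that the block runs on its own
ex <- as.data.frame(cbind(c1 = c(3, 4, 5), c2 = c(7, 8, 9), c3 = c(11, 12, 13)))

# Hypothetical helper: largest value minus smallest value
value_spread <- function(v) max(v) - min(v)

col_spread <- apply(ex, 2, value_spread)  # per column: 2 2 2
row_spread <- apply(ex, 1, value_spread)  # per row: 8 8 8

# MARGIN = c(1, 2) applies the function to every element separately
doubled <- apply(ex, c(1, 2), function(v) v * 2)
print(doubled)
```

For simple element-wise arithmetic such as doubling, writing ex * 2 gives the same values more directly; apply() is mainly useful when the function needs a whole row or column at once.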