7 Descriptive Statistics
Before proceeding further, please go through the first 6 chapters of my book Textbook of Agricultural Statistics.
You can download iris data from here or use the inbuilt data set as discussed in chapter 6.
To do any analysis we first import data, see Chapter 5. You can inspect your data using the functions head() and tail(), which will display the first and the last part of the data, respectively.
# import data set into R
# here we are using the iris dataset
my_data<-iris # the iris dataset is assigned to the variable my_data
# display first 6 rows
head(my_data, 6)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# display last 6 rows
tail(my_data,6)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
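Besides head() and tail(), a few base R functions give a quick overview of the size and structure of a data frame. The following is a minimal sketch using the same my_data object; which of these checks is worth running is a matter of preference.

```r
my_data <- iris

# number of rows and columns
dim(my_data)

# column names
names(my_data)

# compact display of the structure: type and first values of each column
str(my_data)
```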
7.1 List of R functions
Descriptive Statistics | R Function |
---|---|
Mean | mean() |
Standard deviation | sd() |
Variance | var() |
Minimum | min() |
Maximum | max() |
Median | median() |
Range of values (minimum and maximum) | range() |
Sample quantiles | quantile() |
Generic function | summary() |
Interquartile range | IQR() |
The function mfv(), for most frequent value, [in modeest package] can be used to find the statistical mode of a numeric vector.
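If the modeest package is not available, the mode can also be found with base R alone by tabulating the values. The helper below, simple_mode(), is a hypothetical name introduced only for this illustration; it is a sketch, not a replacement for mfv().

```r
# Hypothetical helper (not part of base R or modeest):
# returns the most frequent value(s) of a numeric vector
simple_mode <- function(x) {
  counts <- table(x)   # frequency of each distinct value
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}

simple_mode(iris$Sepal.Length)
```

When several values share the highest frequency, this sketch returns all of them.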
7.2 Measures of central tendency
Know more about Measures of central tendency
# Compute the mean value
mean(iris$Sepal.Length)
## [1] 5.843333
# Compute the median value
median(iris$Sepal.Length)
## [1] 5.8
# Compute the mode (most frequent value); requires the modeest package
library(modeest)
mfv(iris$Sepal.Length)
## [1] 5
# Compute the minimum value
min(iris$Sepal.Length)
## [1] 4.3
# Compute the maximum value
max(iris$Sepal.Length)
## [1] 7.9
# Quartiles
quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
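By default quantile() returns the minimum, the three quartiles and the maximum. Other percentiles can be requested through its probs argument, as in this short sketch:

```r
x <- iris$Sepal.Length

# 10th and 90th percentiles
quantile(x, probs = c(0.10, 0.90))

# deciles: 0%, 10%, ..., 100%
quantile(x, probs = seq(0, 1, by = 0.1))
```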
7.3 Measures of dispersion
Know more about Measures of dispersion
# Range
# minimum value and maximum value are displayed
range(iris$Sepal.Length)
## [1] 4.3 7.9
# Interquartile range
IQR(iris$Sepal.Length)
## [1] 1.3
# Compute the variance
var(iris$Sepal.Length)
## [1] 0.6856935
# Compute the standard deviation
sd(iris$Sepal.Length)
## [1] 0.8280661
# Compute the median absolute deviation
mad(iris$Sepal.Length)
## [1] 1.03782
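The measures above are closely related to each other. The sketch below recomputes some of them by hand to show what each function does; the coefficient of variation at the end is an extra measure added here for illustration.

```r
x <- iris$Sepal.Length

# standard deviation is the square root of the variance
sqrt(var(x))

# interquartile range is the third quartile minus the first quartile
q <- quantile(x, probs = c(0.25, 0.75))
unname(q[2] - q[1])

# mad() scales the median absolute deviation by the constant 1.4826 by default
1.4826 * median(abs(x - median(x)))

# coefficient of variation (standard deviation relative to the mean)
sd(x) / mean(x)
```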
7.4 Overall summary
The overall summary statistics of one variable or an entire data frame can be displayed using the function summary().
7.4.1 Summary of single variable
The minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile) and maximum are all returned by a single function call.
summary(iris$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
7.4.2 Summary of a Data frame
In this case, summary() is applied automatically to each column. The kind of data in the column determines how the output is formatted. For instance:
The mean, median, min, max, and quartiles are returned if the column is a numeric variable.
The total number of observations in each group is returned if the column is a factor variable.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
7.5 The sapply function
The sapply() function can also be used to apply a specific function to each element of a list or vector. For example, we may use it to compute the mean, sd, var, min, quantile, … for each column in a data frame.
sapply(iris[, -5], mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
# The command above computes the mean of every column in the iris dataset except the 5th column (Species), which is non-numeric
# Compute quartiles
sapply(iris[, -5], quantile)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 4.3 2.0 1.00 0.1
## 25% 5.1 2.8 1.60 0.3
## 50% 5.8 3.0 4.35 1.3
## 75% 6.4 3.3 5.10 1.8
## 100% 7.9 4.4 6.90 2.5
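sapply() also accepts a function written on the spot, which makes it possible to compute several statistics for every column in one call. This is a minimal sketch; the names mean, sd and cv given to the results are chosen here for illustration.

```r
# each numeric column of iris is passed to the anonymous function as x
res_multi <- sapply(iris[, -5], function(x) {
  c(mean = mean(x), sd = sd(x), cv = sd(x) / mean(x))
})

# one row per statistic, one column per variable
round(res_multi, 2)
```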
7.6 Some other useful functions
7.6.1 stat.desc()
The function stat.desc() [in pastecs package] provides a lot of useful statistics in a single call.
# Install the 'pastecs' package once, if needed: install.packages("pastecs")
# load the library
library(pastecs)
res <- stat.desc(my_data[, -5]) # results are stored in the variable 'res'
# values are rounded to 2 decimal places using the round() function
round(res, 2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## nbr.val 150.00 150.00 150.00 150.00
## nbr.null 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 4.30 2.00 1.00 0.10
## max 7.90 4.40 6.90 2.50
## range 3.60 2.40 5.90 2.40
## sum 876.50 458.60 563.70 179.90
## median 5.80 3.00 4.35 1.30
## mean 5.84 3.06 3.76 1.20
## SE.mean 0.07 0.04 0.14 0.06
## CI.mean.0.95 0.13 0.07 0.28 0.12
## var 0.69 0.19 3.12 0.58
## std.dev 0.83 0.44 1.77 0.76
## coef.var 0.14 0.14 0.47 0.64
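Some rows of this output, such as SE.mean, CI.mean.0.95 and coef.var, may be less familiar. The sketch below recomputes them for Sepal.Length with base R, assuming that the confidence interval is the usual t-based one and that CI.mean.0.95 reports its half-width; the rounded results agree with the table above.

```r
x <- iris$Sepal.Length
n <- length(x)

se_mean  <- sd(x) / sqrt(n)                  # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% confidence interval
coef_var <- sd(x) / mean(x)                  # coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 2)
```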
7.7 Missing values
Note that some R functions return an error or NA when the data contain missing values, even if just one item is missing. For instance, if a vector has even a single missing value, the mean() function will return NA. The option na.rm = TRUE, which instructs the function to remove any NAs before the computation, can be used to prevent this. Here’s an example of how to use it with the mean function:
mean(iris$Sepal.Length, na.rm = TRUE)
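The iris data has no missing values, so the effect of na.rm is not visible there. The following sketch uses a small made-up vector with one NA to show the difference.

```r
# a small made-up vector with one missing value
x <- c(4.3, 5.1, NA, 6.4)

mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # mean of the three observed values

# count the missing values before deciding how to handle them
sum(is.na(x))
```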
7.8 Frequency table
For more theoretical details see here. The dataset below shows the number of children in 50 households, collected as part of a survey. I will demonstrate how a frequency table can be constructed.
## [1] 5 1 6 9 5 9 6 2 3 1 5 5 1 6 6 6 3 1 0 1 2 0 8 5 7 3 0 8 0 2 5 3 3 5 0 4 6 8
## [39] 6 9 8 4 7 9 4 9 3 4 5 1
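The code that created this vector is not shown. One way to enter the printed values, keeping the name survey_data used in the rest of this section, is:

```r
# number of children in each of the 50 households, copied from the printed output
survey_data <- c(5, 1, 6, 9, 5, 9, 6, 2, 3, 1, 5, 5, 1, 6, 6, 6, 3, 1, 0, 1,
                 2, 0, 8, 5, 7, 3, 0, 8, 0, 2, 5, 3, 3, 5, 0, 4, 6, 8, 6, 9,
                 8, 4, 7, 9, 4, 9, 3, 4, 5, 1)

# check that all 50 households were entered
length(survey_data)
```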
The table() function in R counts how often each distinct value occurs in a vector or in a chosen column of a data frame. The result is presented as a two-row table, with the first row displaying each distinct value and the second row displaying the corresponding frequencies.
# Frequency Table
freq_table<- table(survey_data)
freq_table
## survey_data
## 0 1 2 3 4 5 6 7 8 9
## 5 6 3 6 4 8 7 2 4 5
7.8.1 Cumulative frequency
The cumulative frequency of a class in a frequency distribution table is the total of the frequencies of that class and all classes below it. The value at each position is calculated by adding the current frequency to all the previous frequencies. This can be computed using the cumsum() function.
# cumulative frequency
cumsum(freq_table)
## 0 1 2 3 4 5 6 7 8 9
## 5 11 14 20 24 32 39 41 45 50
From the above results you can now easily answer the question: how many families have 8 or fewer children? The answer is 45.
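Individual cumulative frequencies can be picked out by name, which makes it easy to answer other questions of this kind. The sketch below rebuilds the data from the frequency table shown in this section, so it runs on its own; the object names children, freq and cum_freq are chosen here for illustration.

```r
# rebuild the data from the frequency table shown in this section
children <- rep(0:9, times = c(5, 6, 3, 6, 4, 8, 7, 2, 4, 5))
freq <- table(children)
cum_freq <- cumsum(freq)

# families with 8 or fewer children
cum_freq["8"]

# families with more than 8 children
sum(freq) - cum_freq["8"]

# cumulative relative frequency (proportion of families)
round(cum_freq / sum(freq), 2)
```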
7.8.2 Grouped Frequency distribution
What is a grouped frequency table?
Before going into the details of constructing a frequency table, read thoroughly The construction of grouped frequency table.
# Constructing a frequency table
data<-iris$Sepal.Length
# Decide how many classes to use with Sturges' formula
N<- length(data)
k<-1+3.322*log(N,base=10)
n_class<-ceiling(k) # rounds k up to the next integer
# Determine the class width
c<-(max(data)-min(data))/k
# Round c to one digit after the decimal point
class_width<-round(c,1)
# Find lower limit
L<-min(data)-((c*k-(max(data)-min(data)))/2)
# Upper limit
U<-L+(class_width*n_class)
#Break the range [L,U] into non-overlapping sub-intervals by defining a sequence of equal distance break points (class_width).
breaks = seq(L, U, by=class_width)
#Classify the data according to the sub-intervals (each of class_width length) with cut. As the intervals are to be closed on the left, and open on the right, we set the right argument as FALSE.
data.cut = cut(data, breaks, right=FALSE)
# Generate frequency table of classes
data.freq = table(data.cut)
data.freq
## data.cut
## [4.3,4.7) [4.7,5.1) [5.1,5.5) [5.5,5.9) [5.9,6.3) [6.3,6.7) [6.7,7.1) [7.1,7.5)
## 9 23 20 28 19 23 16 6
## [7.5,7.9)
## 5
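The counts printed above add up to 149, although the data contain 150 observations. A likely explanation is that the largest value, 7.9, coincides with the last break point and is therefore excluded from the right-open class [7.5,7.9). The sketch below shows one way to check for such unclassified values. The break points are written out by hand here as an assumption, and include.lowest = TRUE is one possible remedy, not necessarily the one the original procedure intends.

```r
data <- iris$Sepal.Length

# break points written out by hand for illustration (4.3, 4.7, ..., 7.9)
breaks <- c(4.3, 4.7, 5.1, 5.5, 5.9, 6.3, 6.7, 7.1, 7.5, 7.9)

# right-open classes: a value equal to the last break gets no class (NA)
data.cut <- cut(data, breaks, right = FALSE)
sum(is.na(data.cut))      # number of unclassified observations
data[is.na(data.cut)]     # which values were left out

# include.lowest = TRUE closes the last class on the right: [7.5,7.9]
data.cut2 <- cut(data, breaks, right = FALSE, include.lowest = TRUE)
sum(table(data.cut2)) == length(data)
```

Comparing the total of a frequency table with length(data) is a quick check worth doing whenever classes are built by hand.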
7.9 R function
This section is just for your information. I don’t want to go deep into the programming part, as we are concentrating only on data analysis. If you are not interested, you can skip this section.
A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.
7.9.1 Creating a Function
To create a function, use the function() keyword:
#General Format
function_name<- function(){
#function_body
}
For example, here I’m creating a function named print_me(). On calling print_me(), it automatically prints “Hello World!”
# create a function with the name print_me
print_me <- function() {
print("Hello World!")
}
# calling function print_me()
print_me()
## [1] "Hello World!"
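print_me() takes no input and returns nothing of interest. The sketch below shows a function that accepts parameters and returns a value; cv_percent() is a hypothetical function written only for this illustration.

```r
# hypothetical example function: coefficient of variation in percent
# x is the parameter; na.rm is an optional parameter with a default value
cv_percent <- function(x, na.rm = FALSE) {
  cv <- sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
  return(cv)   # value handed back to the caller
}

# calling the function and storing the returned value
result <- cv_percent(iris$Sepal.Length)
round(result, 2)
```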
In the above section we wrote at least 12 lines of code to construct a grouped frequency table from the data. Running all of this code line by line every time is tedious. You can create a function to perform this task in a single call. I’m creating a function called construct() to create a grouped frequency table.
construct <- function(x)
{
#This function will take in a numeric vector x and gives its frequency table
N<- length(x)
k<-1+3.322*log(N,base=10)
n_class<-ceiling(k)
c<-(max(x)-min(x))/k
class_width<-round(c,1)
L<-min(x)-((c*k-(max(x)-min(x)))/2)
U<-L+(class_width*n_class)
breaks = seq(L, U, by=class_width)
data.cut = cut(x, breaks, right=FALSE)
data.freq = table(data.cut)
data.freq
}
Constructing a frequency table of Petal.Length in the iris dataset using the construct function we just created.
construct(iris$Petal.Length)
## data.cut
## [1,1.7) [1.7,2.4) [2.4,3.1) [3.1,3.8) [3.8,4.5) [4.5,5.2) [5.2,5.9) [5.9,6.6)
## 44 6 1 6 22 37 21 9
## [6.6,7.3)
## 4
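The number of classes inside construct() comes from Sturges' formula. Base R offers the same rule as nclass.Sturges(), which uses log base 2 instead of the constant 3.322 (approximately 1/log10(2)). The sketch below compares the two for Petal.Length; both should give the 9 classes seen in the table above.

```r
N <- length(iris$Petal.Length)

# formula used in this chapter
ceiling(1 + 3.322 * log(N, base = 10))

# built-in version: ceiling(log2(N) + 1)
nclass.Sturges(iris$Petal.Length)
```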
Self assessment
- Download the dataset by clicking Here
- Find the descriptive statistics for the data using the functions above.
- Construct a grouped frequency table of the Petal.Width column in the iris data.