8 Descriptive Statistics

Before proceeding, please read the first six chapters of my book, Textbook of Agricultural Statistics.

You can download the iris data from here or use the built-in dataset, as discussed in Chapter 7.

Any analysis starts with importing the data (see Chapter 6). You can inspect your data with the functions head() and tail(), which display the first and last rows of the data, respectively.

# Import the dataset into R
# Here we use the built-in iris dataset

my_data <- iris # copy the iris dataset into the variable my_data

# display first 6 rows
head(my_data, 6)  
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# display last 6 rows
tail(my_data,6)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
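
Besides head() and tail(), a quick structural check can help before computing any statistics. The following is a minimal sketch using the base R functions dim(), names() and str(); it assumes the built-in iris data copied into my_data as above.

```r
my_data <- iris

# Number of rows and columns
dim(my_data)

# Column names
names(my_data)

# Compact overview: the type of each column and its first few values
str(my_data)
```

Here str() is handy because it shows at a glance which columns are numeric and which are factors, which matters for the functions used later in this chapter.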

8.1 List of R functions

Descriptive statistic                     R function
Mean                                      mean()
Standard deviation                        sd()
Variance                                  var()
Minimum                                   min()
Maximum                                   max()
Median                                    median()
Range of values (minimum and maximum)     range()
Sample quantiles                          quantile()
Generic summary                           summary()
Interquartile range                       IQR()

The function mfv() (most frequent value) in the modeest package can be used to find the statistical mode of a numeric vector.
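
If installing modeest is not an option, the mode can also be found with base R alone. The sketch below is one possible approach, not the method used by mfv(); the helper name stat_mode() is hypothetical and introduced here only for illustration.

```r
# Hypothetical helper (not part of base R or modeest):
# returns the most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)                             # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  as.numeric(modes)                              # convert the labels back to numbers
}

stat_mode(iris$Sepal.Length)
```

When several values share the highest frequency, this helper returns all of them.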

8.2 Measures of central tendency

Know more about Measures of central tendency

# Compute the mean value
mean(iris$Sepal.Length)
## [1] 5.843333
# Compute the median value
median(iris$Sepal.Length)  
## [1] 5.8
# Compute the mode
# install.packages("modeest") # run once if the package is not installed
library(modeest)
mfv(iris$Sepal.Length)
## [1] 5
# Compute the minimum value
min(iris$Sepal.Length)
## [1] 4.3
# Compute the maximum value
max(iris$Sepal.Length)  
## [1] 7.9
# Quartiles  
quantile(iris$Sepal.Length)  
##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9

8.3 Measures of dispersion

Know more about Measures of dispersion

# Range
# displays the minimum and maximum values
range(iris$Sepal.Length)
## [1] 4.3 7.9
# Interquartile range
IQR(iris$Sepal.Length)  
## [1] 1.3
# Compute the variance
var(iris$Sepal.Length)
## [1] 0.6856935
# Compute the standard deviation 
sd(iris$Sepal.Length)
## [1] 0.8280661
# Compute the median absolute deviation (mad() scales the result by 1.4826 by default)
mad(iris$Sepal.Length)  
## [1] 1.03782
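
The measures above are related to one another, and checking those relationships is a useful way to understand them. The sketch below recomputes a few of them by hand in base R; the coefficient of variation at the end is an extra illustration and does not appear in the output above.

```r
x <- iris$Sepal.Length

# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))

# The interquartile range is the third quartile minus the first quartile
q <- quantile(x, probs = c(0.25, 0.75))
unname(q[2] - q[1])

# Width of the range: maximum minus minimum
diff(range(x))

# Coefficient of variation: standard deviation relative to the mean
sd(x) / mean(x)
```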

8.4 Overall summary

The summary statistics of a single variable or an entire data frame can be displayed using the function summary().

8.4.1 Summary of single variable

The minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile) and maximum are all returned by a single call.

summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900

8.4.2 Summary of a Data frame

When summary() is called on a data frame, it is applied to each column automatically. The type of data in the column determines how the output is formatted. For instance:

  • If the column is a numeric variable, the minimum, first quartile, median, mean, third quartile and maximum are returned.

  • If the column is a factor variable, the number of observations in each level is returned.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

8.5 The sapply function

The sapply() function applies a given function to each element of a list or vector. Because a data frame is a list of columns, we can use it to compute the mean, sd, var, min, quantile, … for each column in a data frame.

# Compute the mean of every column of iris except the 5th (Species), which is non-numeric
sapply(iris[, -5], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

# Compute quartiles
sapply(iris[, -5], quantile)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0%            4.3         2.0         1.00         0.1
## 25%           5.1         2.8         1.60         0.3
## 50%           5.8         3.0         4.35         1.3
## 75%           6.4         3.3         5.10         1.8
## 100%          7.9         4.4         6.90         2.5
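
Dropping the fifth column by position works for iris, but it relies on knowing where the non-numeric column sits. As a sketch under that caveat, the numeric columns can instead be selected by type, and an anonymous function can return several statistics at once. The object names num_cols and stats are chosen here for illustration.

```r
# Identify numeric columns by type rather than by position
num_cols <- sapply(iris, is.numeric)

# Apply an anonymous function that returns several statistics per column
stats <- sapply(iris[, num_cols], function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
})

# One column per variable, one row per statistic
round(stats, 2)
```

Because each call returns a named vector of the same length, sapply() simplifies the result into a matrix.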

8.6 Some other useful functions

8.6.1 stat.desc()

The function stat.desc() [in the pastecs package] provides many useful statistics in a single call.

# Install the 'pastecs' package (run once)
# install.packages("pastecs")
# Load the library
library(pastecs)

res <- stat.desc(my_data[, -5]) # store the results in the variable 'res'

# Round the results to 2 decimal places using round()
round(res, 2)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## nbr.val            150.00      150.00       150.00      150.00
## nbr.null             0.00        0.00         0.00        0.00
## nbr.na               0.00        0.00         0.00        0.00
## min                  4.30        2.00         1.00        0.10
## max                  7.90        4.40         6.90        2.50
## range                3.60        2.40         5.90        2.40
## sum                876.50      458.60       563.70      179.90
## median               5.80        3.00         4.35        1.30
## mean                 5.84        3.06         3.76        1.20
## SE.mean              0.07        0.04         0.14        0.06
## CI.mean.0.95         0.13        0.07         0.28        0.12
## var                  0.69        0.19         3.12        0.58
## std.dev              0.83        0.44         1.77        0.76
## coef.var             0.14        0.14         0.47        0.64
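
Some rows of this output may be unfamiliar. The sketch below recomputes three of them for Sepal.Length in base R. It assumes that CI.mean.0.95 is the half-width of a t-based 95% confidence interval for the mean; consult the pastecs documentation to confirm the exact definition.

```r
x <- iris$Sepal.Length
n <- length(x)

se_mean  <- sd(x) / sqrt(n)                  # SE.mean: standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # CI.mean.0.95 (assumed: t-based half-width)
coef_var <- sd(x) / mean(x)                  # coef.var: coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 2)
```

Under that assumption, the rounded values agree with the Sepal.Length column of the table above (0.07, 0.13 and 0.14).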

8.7 Missing values

Note that some R functions return NA or an error when the data contain missing values. For instance, if a vector has even a single missing value, mean() returns NA. The argument na.rm = TRUE, which instructs the function to remove any NAs before the computation, prevents this. Here is an example using mean():

mean(iris$Sepal.Length, na.rm = TRUE)
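
The iris data contain no missing values, so the call above behaves the same with or without na.rm. To see the difference, here is a small sketch with a made-up vector that has one missing entry.

```r
# Hypothetical values with one missing entry
x <- c(5.1, 4.9, NA, 4.6)

mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # mean of the three remaining values
sum(is.na(x))           # number of missing values
```

Counting the NAs first, as in the last line, is a sensible habit: it tells you how many observations the statistic is actually based on.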

8.8 Frequency table

For more theoretical details see here. The dataset below shows the number of children in 50 households, collected as part of a survey. I will demonstrate how a frequency table can be constructed from it.

# Create example vector  
set.seed(99)
survey_data<- floor(runif(50, min=0, max=10)) 
survey_data                                                  
##  [1] 5 1 6 9 5 9 6 2 3 1 5 5 1 6 6 6 3 1 0 1 2 0 8 5 7 3 0 8 0 2 5 3 3 5 0 4 6 8
## [39] 6 9 8 4 7 9 4 9 3 4 5 1

The table() function in R counts how often each distinct value occurs in a vector or in a chosen column of a data frame. The result is printed in two rows: the first row lists the distinct values and the second row gives the corresponding frequencies.

# Frequency Table

freq_table<- table(survey_data)
freq_table
## survey_data
## 0 1 2 3 4 5 6 7 8 9 
## 5 6 3 6 4 8 7 2 4 5
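
Counts are often easier to interpret as proportions. As a sketch, prop.table() converts a frequency table into relative frequencies. To keep the example self-contained, the 50 observations are rebuilt here from the counts shown above rather than regenerated with runif().

```r
# Rebuild the 50 observations from the frequency counts shown above
survey_data <- rep(0:9, times = c(5, 6, 3, 6, 4, 8, 7, 2, 4, 5))
freq_table <- table(survey_data)

# Relative frequency (proportion) of each value
rel_freq <- prop.table(freq_table)
rel_freq

# The same information as percentages
round(100 * rel_freq, 1)
```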

8.8.1 Cumulative frequency

The cumulative frequency of a class is the sum of the frequencies of that class and all classes below it in the frequency distribution table. The value in each cell is therefore the current frequency plus all the frequencies before it. It can be computed using the cumsum() function.

# cumulative frequency

cumsum(freq_table)
##  0  1  2  3  4  5  6  7  8  9 
##  5 11 14 20 24 32 39 41 45 50

From the above results you can easily answer questions such as: how many families have 8 or fewer children? The answer is 45.
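
The same kind of question can be answered in code, either by reading the cumulative table or by counting in the raw data. The sketch below rebuilds the observations from the frequency counts shown earlier so that it runs on its own.

```r
survey_data <- rep(0:9, times = c(5, 6, 3, 6, 4, 8, 7, 2, 4, 5))
cum_freq <- cumsum(table(survey_data))

# Families with 8 or fewer children, read from the cumulative table
cum_freq[["8"]]

# The same count obtained directly from the raw data
sum(survey_data <= 8)

# Families with more than 5 children: total minus the cumulative frequency at 5
length(survey_data) - cum_freq[["5"]]
```

Note that cum_freq is indexed by the value's label ("8"), not by its position in the table.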

8.8.2 Grouped Frequency distribution

What is a grouped frequency table?

Before going into the details of constructing a frequency table, read The construction of grouped frequency table thoroughly.

# Constructing a grouped frequency table
data <- iris$Sepal.Length

# Decide how many classes to use with Sturges' formula
N <- length(data)
k <- 1 + 3.322 * log(N, base = 10)
n_class <- ceiling(k) # rounds k up to the next integer

# Determine the class width
c <- (max(data) - min(data)) / k

# Round c to one digit after the decimal point
class_width <- round(c, 1)

# Find the lower limit
# (c * k equals the range, so the adjustment is zero up to rounding and L is effectively the minimum)
L <- min(data) - ((c * k - (max(data) - min(data))) / 2)

# Upper limit
U <- L + (class_width * n_class)

# Break the range [L, U] into non-overlapping sub-intervals by defining a sequence of equally spaced break points (class_width apart).
breaks <- seq(L, U, by = class_width)

# Classify the data into the sub-intervals (each of length class_width) with cut(). As the intervals are closed on the left and open on the right, we set right = FALSE.
data.cut <- cut(data, breaks, right = FALSE)

# Generate the frequency table of classes
# Note: the counts below sum to 149, not 150, because the maximum (7.9) lies on the open right end of the last class and is dropped.
data.freq <- table(data.cut)
data.freq
## data.cut
## [4.3,4.7) [4.7,5.1) [5.1,5.5) [5.5,5.9) [5.9,6.3) [6.3,6.7) [6.7,7.1) [7.1,7.5) 
##         9        23        20        28        19        23        16         6 
## [7.5,7.9) 
##         5
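
With right = FALSE, an observation equal to the last break point is left unclassified, which is what happens to the maximum of Sepal.Length here. One possible guard, offered as a sketch rather than as the method of this chapter, is to append one more class whenever the maximum reaches the last break. The names x, classes and freq are chosen for illustration; nclass.Sturges() from the grDevices package is used only to cross-check the class count.

```r
x <- iris$Sepal.Length

k <- 1 + 3.322 * log10(length(x))               # Sturges' formula
n_class <- ceiling(k)
class_width <- round((max(x) - min(x)) / k, 1)

# Equally spaced breaks starting at the minimum
breaks <- seq(min(x), by = class_width, length.out = n_class + 1)

# Add one extra class if the maximum is not strictly below the last break
if (max(x) >= breaks[length(breaks)]) {
  breaks <- c(breaks, breaks[length(breaks)] + class_width)
}

classes <- cut(x, breaks, right = FALSE)
freq <- table(classes)
freq

sum(freq) == length(x)   # every observation is now counted
nclass.Sturges(x)        # built-in Sturges class count, for comparison
```

The trade-off is an additional class at the top of the table; an alternative is to keep the breaks and close the last interval on the right with the include.lowest argument of cut().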

8.9 R function

This section is for your information only. I don’t want to go deep into programming, as we are concentrating on data analysis. If you are not interested, you can skip this section.

A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.

8.9.1 Creating a Function

To create a function, use the function() keyword:

#General Format
function_name<- function(){
  #function_body
}

For example, here I create a function named print_me(); calling print_me() prints “Hello World!”.

# create a function with the name print_me
print_me <- function() { 
  print("Hello World!")
}
# calling function print_me()
print_me()
## [1] "Hello World!"
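
print_me() takes no input and returns nothing useful. As a sketch of a function that accepts parameters and returns a result, here is a small helper computing the coefficient of variation. The name coef_var() and its na.rm parameter are hypothetical, chosen for this illustration.

```r
# Hypothetical helper: coefficient of variation of a numeric vector
coef_var <- function(x, na.rm = FALSE) {
  # x and na.rm are parameters; na.rm has a default value of FALSE
  # The value of the last expression is what the function returns
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Call with one argument; na.rm takes its default
coef_var(c(2, 4, 6))

# Override the default to ignore a missing value
coef_var(c(2, 4, 6, NA), na.rm = TRUE)
```

In R the value of the last evaluated expression is returned automatically, so an explicit return() is optional.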

In the section above we wrote at least 12 lines of code to construct a grouped frequency table. Running all of these lines one by one every time is tedious. You can instead create a function that performs the task in a single call. Here I create a function called construct() that builds a grouped frequency table.

construct <- function(x) {
  # Takes a numeric vector x and returns its grouped frequency table
  N <- length(x)
  k <- 1 + 3.322 * log(N, base = 10) # Sturges' formula
  n_class <- ceiling(k)
  c <- (max(x) - min(x)) / k
  class_width <- round(c, 1)
  L <- min(x) - ((c * k - (max(x) - min(x))) / 2)
  U <- L + (class_width * n_class)
  breaks <- seq(L, U, by = class_width)
  data.cut <- cut(x, breaks, right = FALSE)
  data.freq <- table(data.cut)
  data.freq
}

We can now construct the frequency table of Petal.Length in the iris dataset using the construct() function we just created.

construct(iris$Petal.Length)
## data.cut
##   [1,1.7) [1.7,2.4) [2.4,3.1) [3.1,3.8) [3.8,4.5) [4.5,5.2) [5.2,5.9) [5.9,6.6) 
##        44         6         1         6        22        37        21         9 
## [6.6,7.3) 
##         4

Self-assessment

  1. Download the dataset by clicking Here
  2. Find the descriptive statistics for the data using the functions above.
  3. Construct a grouped frequency table of the Petal.Width column in the iris data.