Univariate analysis

Introduction

  • Univariate statistics summarize and help us understand the characteristics of a single variable, providing a foundation for further data analysis and interpretation.

  • They provide insights into the distribution, central tendency, and variability of the data.

Categorical series

In statistical analysis, a categorical variable is a variable that can take on a limited, and usually fixed, number of possible values, assigning each individual or other observational unit to a particular group or nominal category. Let us consider a categorical data series \(x = (x_1, \cdots, x_n)\) consisting of \(n\) observations, where each observation \(x_i\) belongs to one of \(K\) categories \(a_1, \cdots, a_K\).

Statistic table

Definition 1 (Frequency) The frequency \(n_k\) of a category \(a_k\) is given by: \[ n_k = \sum_{i=1}^n \mathbf{1}_{(x_i = a_k)}.\]

  • \(n_k\) is the number of occurrences of category \(a_k\) in \(x\).
  • The sample size is the sum of the frequencies: \(n=\sum_{k=1}^K n_k\).

Definition 2 (Proportion - Percentage)  

  • The proportion \(f_k\) of a category \(a_k\) is given by: \[f_k = \dfrac{n_k}{n}\]

  • The percentage of \(a_k\) is \[p_k=f_k\times 100.\]

Definition 3 (Mode) The mode is the most common category of the series, i.e., the category with the highest frequency. If several categories share the highest frequency, each of them is a mode.

Frequencies table for nominal series

Levels  \(a_1\)  \(\cdots\)  \(a_k\)  \(\cdots\)  \(a_K\)  Total
Freq.   \(n_1\)  \(\cdots\)  \(n_k\)  \(\cdots\)  \(n_K\)  \(n\)
Prop.   \(f_1\)  \(\cdots\)  \(f_k\)  \(\cdots\)  \(f_K\)  \(1\)
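
As a minimal sketch of Definitions 1 to 3, the frequencies table can be computed in base R. The series `x` below is hypothetical and only serves as an illustration; the variable names mirror the notation above.

```r
# Hypothetical nominal series (illustrative values, not from the course data)
x <- c("Blue", "Brown", "Brown", "Green", "Blue", "Brown", "Hazel", "Brown")

tab <- table(x)                              # one count per category a_k
n_k <- setNames(as.vector(tab), names(tab))  # frequencies n_k as a named vector
n   <- sum(n_k)                              # sample size n
f_k <- n_k / n                               # proportions f_k
p_k <- 100 * f_k                             # percentages p_k

# Mode: every category that reaches the highest frequency
mode_x <- names(n_k)[n_k == max(n_k)]

# Frequencies table with a Total column
freq_table <- rbind(Freq. = c(n_k, Total = n),
                    Prop. = c(f_k, Total = sum(f_k)))
print(freq_table)
print(mode_x)
```

The `Total` column reproduces the identities \(\sum_k n_k = n\) and \(\sum_k f_k = 1\).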

Frequencies table for ordinal series

Ordinal data is a type of categorical data with a clear ordering of the categories. Such data often appear in surveys and questionnaires where responses indicate the degree of preference or agreement.

Now assume that the \(K\) categories are ordered: \(a_1 < a_2 < \cdots < a_K\).

Definition 4 (Cumulative frequencies and proportions)  

  • Cumulative frequency: \[N_k=\sum_{h=1}^kn_h\]

  • Cumulative proportion: \[F_k=\sum_{h=1}^kf_h\]

Levels    \(a_1\)  \(\cdots\)  \(a_k\)  \(\cdots\)  \(a_K\)  Total
Freq.     \(n_1\)  \(\cdots\)  \(n_k\)  \(\cdots\)  \(n_K\)  \(n\)
C. Freq.  \(N_1\)  \(\cdots\)  \(N_k\)  \(\cdots\)  \(N_K\)  -
Prop.     \(f_1\)  \(\cdots\)  \(f_k\)  \(\cdots\)  \(f_K\)  \(1\)
C. Prop.  \(F_1\)  \(\cdots\)  \(F_k\)  \(\cdots\)  \(F_K\)  -
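
The cumulative quantities of Definition 4 can be sketched with `cumsum()` on an ordered factor. The survey answers and level names below are hypothetical; what matters is that the factor levels are declared in the order \(a_1 < a_2 < \cdots < a_K\).

```r
# Hypothetical ordinal series: answers to a survey question
levels_ord <- c("Disagree", "Neutral", "Agree")   # a_1 < a_2 < a_3
x <- factor(c("Agree", "Neutral", "Agree", "Disagree", "Agree", "Neutral"),
            levels = levels_ord, ordered = TRUE)

tab <- table(x)                              # counts follow the level order
n_k <- setNames(as.vector(tab), names(tab))  # frequencies n_k
f_k <- n_k / length(x)                       # proportions f_k
N_k <- cumsum(n_k)                           # cumulative frequencies N_k
F_k <- cumsum(f_k)                           # cumulative proportions F_k

print(rbind("Freq." = n_k, "C. Freq." = N_k, "Prop." = f_k, "C. Prop." = F_k))
```

The last cumulative frequency equals \(n\) and the last cumulative proportion equals \(1\).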

Graphics

Bar chart

  • Each category is represented by a bar.

  • The height (in a vertical bar chart) or length (in a horizontal bar chart) of the bar corresponds to the value or frequency of that category.

  • \(X\)-Axis: For a vertical bar chart, the x-axis usually represents the categories.

  • \(Y\)-Axis: For a vertical bar chart, the y-axis represents the values or frequencies.

  • Labels: Each bar is labeled to indicate which category it represents.

Example 1 (Eye colors) The following data represent the eye colors of \(n = 100\) individuals. The eye colors observed are Brown, Blue, Green, Hazel, and Gray, which reflect typical variations in human eye color.

      Brown Blue Green Hazel Gray Total
freq     50   20    10    15    5   100
color brown blue green coral gray     -

Barplot
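
A bar chart of the eye color data of Example 1 could be drawn as follows. This is only a sketch: it uses base R graphics to stay free of package dependencies, and the fill colors are the ones listed in the `color` row of the table.

```r
# Frequencies and bar colors from Example 1
freq <- c(Brown = 50, Blue = 20, Green = 10, Hazel = 15, Gray = 5)
fill <- c("brown", "blue", "green", "coral", "gray")

# barplot() returns the x-position of the middle of each bar
mid <- barplot(freq, col = fill, ylim = c(0, 60),
               xlab = "Eye color", ylab = "Frequency",
               main = "Eye colors of 100 individuals")

# Write each frequency just above its bar
text(x = mid, y = freq, labels = freq, pos = 3)
```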

Pie chart

  • A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportions.

  • Each slice represents a category of data, and the size of each slice is proportional to the quantity it represents.

  • Pie charts are particularly useful for showing the parts of a whole and are commonly used to visualize percentages or proportions.

  • The angle \(\alpha_k\) (in degrees) of each slice is proportional to the value it represents relative to the total: \[ \alpha_k=\dfrac{360\times n_k}{n}=360\times f_k. \]

  • Each slice is usually labeled with the category it represents and often with the percentage or value of that category.
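
To make the angle formula concrete, here is a short sketch that applies it to the eye color frequencies of Example 1 and then draws the chart with base R's `pie()`. The label format is an arbitrary choice.

```r
# Frequencies from Example 1
freq <- c(Brown = 50, Blue = 20, Green = 10, Hazel = 15, Gray = 5)

f_k     <- freq / sum(freq)   # proportions f_k
alpha_k <- 360 * f_k          # slice angles in degrees
print(alpha_k)                # Brown 180, Blue 72, Green 36, Hazel 54, Gray 18

# Label each slice with its category and percentage
slice_labels <- sprintf("%s (%.0f%%)", names(freq), 100 * f_k)
pie(freq, labels = slice_labels,
    col = c("brown", "blue", "green", "coral", "gray"))
```

The angles add up to 360 degrees because the proportions add up to 1.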

Pie chart

  1. Pie chart using ggplot2. Use print(gg) to display it.

  2. Pie chart using plotly

Quantitative series

  • Consider the following quantitative series: \(x=\left(x_1,\cdots,x_n\right)\in\mathbb{R}^n\).

  • Sample Size: \(n\)

Measures of Central Tendency and Position

  • Minimum: \(x_{(1)}=\min\left\{x_1,\cdots,x_n\right\}\)
  • Maximum: \(x_{(n)}=\max\left\{x_1,\cdots,x_n\right\}\)
  • Mean (average): \(\overline{x}=\dfrac{1}{n}\sum_{i=1}^nx_i\)
  • Order \(\alpha\) quantile: Any number \(q_{\alpha}\) such that \[ \dfrac{1}{n}\sum_{i=1}^n \mathbf{1}_{(x_i\leq q_{\alpha})} \geq \alpha \quad\text{and}\quad \dfrac{1}{n}\sum_{i=1}^n \mathbf{1}_{(x_i\geq q_{\alpha})} \geq 1-\alpha. \]
  • First Quartile: \(q_{0.25}\)
  • Second Quartile or median: \(q_{0.5}\)
  • Third Quartile: \(q_{0.75}\)
  • Mode: The most common value in the series
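
The measures above can be sketched in base R. The series `x` is hypothetical, and `is_quantile()` is an illustrative helper (not a built-in function) that simply checks the two inequalities of the quantile definition. It shows that a quantile need not be unique: here every number between 5 and 7 is a median.

```r
# Hypothetical quantitative series
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

x_min  <- min(x)    # x_(1)
x_max  <- max(x)    # x_(n)
x_mean <- mean(x)   # arithmetic mean

# Illustrative helper: TRUE if q satisfies both inequalities for order alpha
is_quantile <- function(q, x, alpha) {
  mean(x <= q) >= alpha && mean(x >= q) >= 1 - alpha
}

is_quantile(5, x, 0.5)           # TRUE
is_quantile(median(x), x, 0.5)   # TRUE: median() returns the midpoint 6
is_quantile(4, x, 0.5)           # FALSE: only 3 of 8 values are <= 4
is_quantile(4, x, 0.25)          # TRUE: 4 is a first quartile

# Mode: most common value(s) of the series
tab    <- table(x)
mode_x <- as.numeric(names(tab)[tab == max(tab)])
```

R's `quantile()` offers several conventions through its `type` argument; they all pick one particular value, which may or may not be the one you expect from the definition above.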

Measures of Variability

  • Range: \(x_{(n)}-x_{(1)}=\max\left\{x_i\right\} - \min\left\{x_i\right\}\)
  • Variance: \(\widetilde{S}^2=\dfrac{1}{n}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2\)
  • Sample variance: \(S^2=\dfrac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2\)
  • Standard Deviation (sample standard deviation): \(S=\sqrt{S^2}\)
  • Interquartile Range: \(IQR=q_{0.75}-q_{0.25}\)
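
A minimal sketch of these measures in base R, on a hypothetical series. Note that `var()` and `sd()` use the \(n-1\) denominator, so the \(1/n\) variance has to be computed by hand. The quartiles are taken with `type = 1` (inverse of the empirical distribution function), which is one possible convention; the default of `quantile()` and `IQR()` interpolates and can give a slightly different value.

```r
# Hypothetical quantitative series
x <- c(2, 4, 4, 5, 7, 9, 10, 12)
n <- length(x)

range_x  <- max(x) - min(x)          # range
var_n    <- mean((x - mean(x))^2)    # variance with denominator n
var_samp <- var(x)                   # sample variance, denominator n - 1
sd_samp  <- sd(x)                    # sample standard deviation S

# Interquartile range from quartiles that satisfy the quantile definition
q1  <- unname(quantile(x, 0.25, type = 1))
q3  <- unname(quantile(x, 0.75, type = 1))
iqr <- q3 - q1

print(c(range = range_x, var_n = var_n, var_samp = var_samp,
        sd = sd_samp, IQR = iqr))
```

The two variances are linked by \(\widetilde{S}^2 = \frac{n-1}{n} S^2\), so they are close for large \(n\).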

Measures of Distribution Shape

Skewness

  • Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. The skewness value can be positive, zero, negative, or undefined.

\[ \text{Sk} = \dfrac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{S} \right)^3 \]

  • Multiplying by \(\dfrac{n}{(n-1)(n-2)}\) adjusts for bias in the skewness calculation.

  • This method gives you a dimensionless number. A skewness close to \(0\) indicates a symmetrical distribution.

  • A positive skewness indicates a distribution that is skewed to the right (longer right tail), while a negative skewness indicates a distribution that is skewed to the left (longer left tail).
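
The skewness formula can be transcribed directly into R. The function name `skewness_adj()` and the test series are assumptions made for this sketch; base R has no built-in skewness function.

```r
# Adjusted skewness: n / ((n-1)(n-2)) * sum of cubed standardized values,
# where S is the sample standard deviation (denominator n - 1)
skewness_adj <- function(x) {
  n <- length(x)
  stopifnot(n >= 3)              # the factor is undefined for n < 3
  z <- (x - mean(x)) / sd(x)
  n / ((n - 1) * (n - 2)) * sum(z^3)
}

x_sym   <- c(1, 2, 3, 4, 5)        # symmetric around 3
x_right <- c(1, 1, 2, 2, 3, 10)    # one large value stretches the right tail

skewness_adj(x_sym)      # essentially 0
skewness_adj(x_right)    # positive
skewness_adj(-x_right)   # negative: mirroring the data flips the sign
```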

Kurtosis

  • Kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
  • A higher kurtosis implies a high likelihood of outliers.

\[ \begin{aligned} \text{Ku} =& \dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \times\\ &\sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{S} \right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}\\ \end{aligned} \]

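  • The factor \(\dfrac{n(n+1)}{(n-1)(n-2)(n-3)}\) and the subtracted term \(\dfrac{3(n-1)^2}{(n-2)(n-3)}\) adjust for the bias in the estimation of the population kurtosis for small sample sizes. Because of the subtracted term, which tends to \(3\) as \(n\) grows, \(\text{Ku}\) measures excess kurtosis relative to the normal distribution.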

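Here are the steps to calculate kurtosis:

  1. Compute the mean \(\bar{x}\) and the sample standard deviation \(S\).
  2. Standardize each observation, \(z_i = (x_i - \bar{x})/S\), and sum the fourth powers \(z_i^4\).
  3. Multiply this sum by \(\dfrac{n(n+1)}{(n-1)(n-2)(n-3)}\).
  4. Subtract \(\dfrac{3(n-1)^2}{(n-2)(n-3)}\).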

Usual interpretation
  • A kurtosis close to 0 indicates a distribution with tails similar to the normal distribution.

  • A positive kurtosis indicates a distribution with heavier tails than the normal distribution, suggesting a higher likelihood of outliers.

  • A negative kurtosis indicates a distribution with lighter tails than the normal distribution, suggesting fewer outliers.
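
As a sketch, the kurtosis formula translates into a few lines of R. The function name `kurtosis_adj()` and the two test series are assumptions made for illustration; base R has no built-in kurtosis function.

```r
# Adjusted (excess) kurtosis following the formula above
kurtosis_adj <- function(x) {
  n <- length(x)
  stopifnot(n >= 4)                  # the factors are undefined for n < 4
  z <- (x - mean(x)) / sd(x)         # standardize with the sample sd S
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

x_flat  <- c(1, 2, 3, 4, 5)          # evenly spread values, light tails
x_heavy <- c(rep(0, 18), -10, 10)    # two extreme values, heavy tails

kurtosis_adj(x_flat)    # -1.2
kurtosis_adj(x_heavy)   # positive
```

Like the skewness, this statistic is unchanged when the data are shifted or rescaled, since it only depends on the standardized values.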

Graphics

Histogram

To construct a histogram manually from a data set \(x = (x_1, x_2, \dots, x_n)\), you need a set of breakpoints or class boundaries that define the bins of the histogram. Let these breakpoints be denoted as \(b_0, b_1, \dots, b_K\), where \(K\) is the number of bins.

Computation of Bin Heights

Each bin of the histogram represents a range of data values, and the height of each bin (or bar) is calculated based on the frequency of data points within these ranges.

Step 1: Define the Breakpoints

Assume the breakpoints \(b_0, b_1, \dots, b_K\) are given. These should satisfy:

\[ b_0 < b_1 < b_2 < \cdots < b_K \]

These breakpoints partition the range of the data into \(K\) intervals or bins where each bin \(k\) represents the interval \([b_{k-1}, b_k)\).

Step 2: Count the Data Points in Each Bin

For each bin \(k\), count the number of data points \(x_i\) such that:

\[ b_{k-1} \leq x_i < b_k \]

This count is denoted as \(n_k\) for bin \(k\).

Step 3: Compute the Bin Heights

The height \(h_k\) of each bin in the histogram is determined by either the frequency or the relative frequency of the data points within that bin.

  • The formula for the height \(h_k\) when using frequency is:

\[ h_k = \dfrac{n_k}{\text{bin width}} = \frac{n_k}{b_k - b_{k-1}} \]

where \(n_k\) is the number of data points in bin \(k\) and \(b_k - b_{k-1}\) is the width of the bin. This height represents the density of data points per unit of measurement in the bin.

  • The formula for the height \(h_k\) when using the proportion \(f_k=\dfrac{n_k}{n}\) is:

\[ h_k = \dfrac{f_k}{\text{bin width}} = \frac{f_k}{b_k - b_{k-1}} \]

Step 4: Representation

The histogram draws one bar per bin, with height \(h_k\) over the interval \([b_{k-1}, b_k)\). The area of each bar (height times width) equals the number of data points \(n_k\) in the bin when heights are computed from frequencies, and the proportion \(n_k/n\) when they are computed from proportions, in which case the total area is \(1\). In both cases, the area of each bar is proportional to the frequency of observations falling within the bin.
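
Steps 1 to 3 can be sketched in base R as follows. The data `x` and the breakpoints `b` are hypothetical, with one deliberately wider bin to show why heights are divided by the bin width. The result is compared with `hist()`, using `right = FALSE` so that bins are left-closed as in Step 2.

```r
# Hypothetical data and breakpoints b_0 < b_1 < ... < b_K
x <- c(1.2, 1.9, 2.1, 2.4, 2.8, 3.5, 3.7, 4.0, 5.5, 7.9)
b <- c(1, 2, 3, 4, 8)
K <- length(b) - 1

# Step 2: bin index k such that b_{k-1} <= x_i < b_k, then count per bin
bin <- findInterval(x, b)
n_k <- tabulate(bin, nbins = K)

# Step 3: heights from frequencies and from proportions
width  <- diff(b)
h_freq <- n_k / width
h_prop <- (n_k / length(x)) / width

print(rbind(n_k = n_k, width = width, h_freq = h_freq, h_prop = h_prop))

# Cross-check against hist(), which reports proportion-based heights as density
h <- hist(x, breaks = b, right = FALSE, plot = FALSE)
all.equal(h$density, h_prop)
```

The last bin holds as many points as the second one (3 each) but is four times wider, so its bar is four times lower.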

Example 2 (Histogram)  

library(ggplot2)
library(plotly)  # provides ggplotly() and layout(), and re-exports %>%

# Choose breakpoints
breaks = c(3, 4, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6, 6.5, 7, 7.5, 8)

(
  iris %>%
    ggplot(mapping = aes(x = Sepal.Length, y = after_stat(density), fill = Species, color = Species)) +
      geom_histogram(breaks = breaks, position = position_identity(), alpha = 0.75) +
      # Add density estimation
      geom_density(alpha = 0.50)
) %>%
  ggplotly() %>%
    layout(legend = list(x = 0.75, y = 0.90))

Cumulative distribution function

The Empirical Cumulative Distribution Function (ECDF) gives, for each value, the proportion (or count) of observations in a dataset that are less than or equal to that value. It is particularly useful for understanding the distribution of the data, identifying outliers, and comparing distributions.

Example 3 (Cumulative distribution function)  

x = ( 6,6,1,1,1,5,6,4,1,3 )

The empirical cumulative distribution function of this series is: \[ \widehat{F}_s(x)=\left\{ \begin{array}{lll} 0 & \text{if} & x<1\\ 0.4 & \text{if} & 1\leq x < 3\\ 0.5 & \text{if} & 3\leq x < 4\\ 0.6 & \text{if} & 4 \leq x < 5\\ 0.7 & \text{if} & 5\leq x < 6\\ 1 & \text{if} & x\geq 6\\ \end{array} \right. \]
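
The steps of this function can be checked with base R's `ecdf()`, applied to the series of Example 3. The evaluation points in `pts` are an arbitrary choice for this sketch. The jumps occur at the distinct observed values, the last one being the largest observation, 6.

```r
# Series of Example 3
x <- c(6, 6, 1, 1, 1, 5, 6, 4, 1, 3)

F_hat <- ecdf(x)     # step function: proportion of observations <= t
knots(F_hat)         # jump points: 1 3 4 5 6

# Evaluate at a few points, including values between and beyond the jumps
pts <- c(0.5, 1, 3.5, 4, 5, 6, 7)
F_hat(pts)           # 0.0 0.4 0.5 0.6 0.7 1.0 1.0

# Same values computed directly from the definition
manual <- sapply(pts, function(t) mean(x <= t))

plot(F_hat, main = "Empirical cumulative distribution function",
     xlab = "x", ylab = "Proportion of observations <= x")
```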

  1. Generate data randomly from a normal distribution