Levels | Blue | Brown | Gray | Green | Hazel | Total |
---|---|---|---|---|---|---|
Freq. | 27 | 45 | 5 | 11 | 12 | 100 |
Prop. | 0.27 | 0.45 | 0.05 | 0.11 | 0.12 | 1 |
1 Univariate analysis
1.1 Introduction
Univariate analysis focuses on analyzing and summarizing a single variable at a time.
Univariate statistics help summarize and understand the characteristics of that variable, providing a foundation for further data analysis and interpretation.
They provide insights into the distribution, central tendency, and variability of the data.
These statistics depend on the nature of the statistical series we have.
1.2 Nominal series
Let \(x_1,\cdots,x_K\) denote the categories.
1.2.1 Statistic table
- Frequency of \(x_k\): \(n_k\) is the number of occurrences (counts) of category \(x_k\) in the nominal series;
- Sample size: \(n=\sum_{k=1}^Kn_k\);
- Proportion of \(x_k\): \(f_k=\dfrac{n_k}{n}\);
- Percentage of \(x_k\): \(p_k=f_k\times 100\);
- Mode (M): the most common category (or categories) in the nominal series, i.e., the category with the highest frequency.
Levels | \(x_1\) | \(\cdots\) | \(x_k\) | \(\cdots\) | \(x_K\) | Total |
---|---|---|---|---|---|---|
Freq. | \(n_1\) | \(\cdots\) | \(n_k\) | \(\cdots\) | \(n_K\) | \(n\) |
Prop. | \(f_1\) | \(\cdots\) | \(f_k\) | \(\cdots\) | \(f_K\) | \(1\) |
- When the levels are ordered, the statistical table is supplemented with cumulative frequencies and cumulative proportions;
- Assume the levels are ordered, \(x_1\preceq\cdots\preceq x_K\):
  - Cumulative freq.: \(N_k=\sum_{h=1}^k n_h\)
  - Cumulative prop.: \(F_k=\sum_{h=1}^k f_h\)
Levels | \(x_1\) | \(\cdots\) | \(x_k\) | \(\cdots\) | \(x_K\) | Total |
---|---|---|---|---|---|---|
Freq. | \(n_1\) | \(\cdots\) | \(n_k\) | \(\cdots\) | \(n_K\) | \(n\) |
C. Freq. | \(N_1\) | \(\cdots\) | \(N_k\) | \(\cdots\) | \(N_K\) | - |
Prop. | \(f_1\) | \(\cdots\) | \(f_k\) | \(\cdots\) | \(f_K\) | \(1\) |
C. Prop. | \(F_1\) | \(\cdots\) | \(F_k\) | \(\cdots\) | \(F_K\) | - |
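As a quick illustration, the whole statistic table can be computed in base R; the data and factor levels below are hypothetical:

```r
# Hypothetical ordered categorical data
x <- factor(c("Low", "Mid", "Mid", "High", "Low", "Mid", "High", "Mid"),
            levels = c("Low", "Mid", "High"), ordered = TRUE)

n_k <- table(x)          # frequencies n_k
f_k <- prop.table(n_k)   # proportions f_k
N_k <- cumsum(n_k)       # cumulative frequencies N_k
F_k <- cumsum(f_k)       # cumulative proportions F_k

rbind("Freq." = n_k, "C. Freq." = N_k, "Prop." = f_k, "C. Prop." = F_k)
```

Note that `cumsum()` only makes sense here because the factor levels are ordered.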
1.2.2 Graphics
Example 1.1 (Eye colors) The following data represent the eye colors of \(n=100\) individuals. The eye colors observed are Brown, Blue, Green, Hazel, and Gray, which reflect typical variations in human eye color.
Barplot

Pie chart using ggplot2 (use `print(gg)` to display it)

Pie chart using plotly
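The graphics above can be sketched with ggplot2; the counts come from Example 1.1's table, and the object names (`eye`, `gg`) are illustrative:

```r
library(ggplot2)

# Eye-color counts from Example 1.1
eye <- data.frame(
  color = c("Blue", "Brown", "Gray", "Green", "Hazel"),
  freq  = c(27, 45, 5, 11, 12)
)

# Barplot of the frequencies
ggplot(eye, aes(x = color, y = freq, fill = color)) +
  geom_col() +
  labs(x = "Eye color", y = "Frequency")

# Pie chart with ggplot2: a single stacked bar in polar coordinates
gg <- ggplot(eye, aes(x = "", y = freq, fill = color)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()
print(gg)
```

For an interactive version, `plotly::plot_ly(eye, labels = ~color, values = ~freq, type = "pie")` draws the same pie chart.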
1.3 Quantitative series
Consider the following quantitative series: \(x=\left(x_1,\cdots,x_n\right)\in\mathbb{R}^n\).
Sample Size: \(n\)
1.3.1 Measures of Central Tendency
- Minimum: \(x_{(1)}=\min\left\{x_1,\cdots,x_n\right\}\)
- Maximum: \(x_{(n)}=\max\left\{x_1,\cdots,x_n\right\}\)
- Mean (average): \(\overline{x}=\dfrac{1}{n}\sum_{i=1}^nx_i\)
- Order \(\alpha\) quantile: any number \(q_{\alpha}\) such that \[ \begin{aligned} \dfrac{1}{n}\sum_{i=1}^n 1_{[x_i\leq q_{\alpha}]}&\geq\alpha\quad\text{and}\\ \dfrac{1}{n}\sum_{i=1}^n 1_{[x_i\geq q_{\alpha}]}&\geq 1-\alpha \end{aligned} \]
- First Quartile: \(q_{0.25}\)
- Second Quartile or median: \(q_{0.5}\)
- Third Quartile: \(q_{0.75}\)
- Mode: The most common value in the series
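These measures map directly onto base R functions; the data vector below is made up for illustration:

```r
x <- c(3, 7, 7, 2, 9, 7, 4, 6, 8, 5)       # hypothetical series

min(x); max(x); mean(x)
quantile(x, probs = c(0.25, 0.50, 0.75))   # quartiles (default type = 7)
median(x)                                  # same as quantile(x, 0.5)

# Mode: R has no built-in mode for data; take the most frequent value
as.numeric(names(which.max(table(x))))     # 7
```

Note that `quantile()` implements nine interpolation algorithms; the default (type 7) interpolates between order statistics, while type 1 (the inverse ECDF) is closest to the order-\(\alpha\) definition given above.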
1.3.2 Measures of Variability
- Range: \(x_{(n)}-x_{(1)}=\max\left\{x_i\right\} - \min\left\{x_i\right\}\)
- Variance: \(\widetilde{S}^2=\dfrac{1}{n}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2\)
- Sample variance: \(S^2=\dfrac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2\)
- Standard Deviation (sample standard deviation): \(\sqrt{S^2}\)
- Interquartile range: \(IQR=q_{0.75}-q_{0.25}\)
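In R, note that `var()` and `sd()` use the \(n-1\) (sample) divisor, so the divisor-\(n\) variance \(\widetilde{S}^2\) has to be rescaled by hand:

```r
x <- c(3, 7, 7, 2, 9, 7, 4, 6, 8, 5)   # hypothetical series
n <- length(x)

diff(range(x))          # range = max - min
S2  <- var(x)           # sample variance S^2 (divisor n - 1)
St2 <- S2 * (n - 1) / n # variance with divisor n
sd(x)                   # sample standard deviation sqrt(S^2)
IQR(x)                  # interquartile range (uses quantile type 7)
```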
1.3.3 Measures of Distribution Shape
1.3.3.1 Skewness
- Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. The skewness value can be positive, zero, negative, or undefined.
\[ \text{Sk} = \dfrac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{S} \right)^3 \]
Multiplying by \(\dfrac{n}{(n-1)(n-2)}\) adjusts for bias in the skewness calculation.
This gives a dimensionless number. A skewness close to \(0\) indicates a symmetric distribution, a positive skewness indicates a distribution skewed to the right (a long right tail), and a negative skewness indicates a distribution skewed to the left (a long left tail).
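The adjusted skewness can be coded straight from the formula above (packages such as e1071 or moments provide similar functions); the helper name is made up:

```r
# Bias-adjusted sample skewness, following the formula above
skewness_adj <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)           # standardize with the sample sd S
  n / ((n - 1) * (n - 2)) * sum(z^3)
}

skewness_adj(c(1, 2, 2, 3, 3, 3, 10))  # positive: right-skewed
```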
1.3.3.2 Kurtosis
- Kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
- A higher kurtosis implies a high likelihood of outliers.
\[ \begin{aligned} \text{Ku} =& \dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \times\\ &\sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{S} \right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}\\ \end{aligned} \]
- This formula corrects the small-sample bias in estimating the population (excess) kurtosis through the factor \(\dfrac{n(n+1)}{(n-1)(n-2)(n-3)}\) and the subtracted term.
Interpretation of the kurtosis value:
A kurtosis close to 0 indicates a distribution with tails similar to the normal distribution.
A positive kurtosis indicates a distribution with heavier tails than the normal distribution, suggesting a higher likelihood of outliers.
A negative kurtosis indicates a distribution with lighter tails than the normal distribution, suggesting fewer outliers.
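Like the skewness, the adjusted excess kurtosis can be implemented directly from the formula (the function name is illustrative); the correction requires \(n\geq 4\):

```r
# Bias-adjusted excess kurtosis, following the formula above
kurtosis_adj <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)   # standardize with the sample sd S
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

kurtosis_adj(c(2, 4, 4, 4, 5, 5, 7, 9))
```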
1.3.4 Graphics
1.3.4.1 Histogram
To construct a histogram manually from a data set \(x = (x_1, x_2, \dots, x_n)\), you need a set of breakpoints or class boundaries that define the bins of the histogram. Let these breakpoints be denoted as \(b_0, b_1, \dots, b_K\), where \(K\) is the number of bins.
1.3.4.1.1 Computation of Bin Heights
Each bin of the histogram represents a range of data values, and the height of each bin (or bar) is calculated based on the frequency of data points within these ranges.
Step 1: Define the Breakpoints
Assume the breakpoints \(b_0, b_1, \dots, b_K\) are given. These should satisfy:
\[ b_0 < b_1 < b_2 < \cdots < b_K \]
These breakpoints partition the range of the data into \(K\) intervals or bins where each bin \(k\) represents the interval \([b_{k-1}, b_k)\).
Step 2: Count the Data Points in Each Bin
For each bin \(k\), count the number of data points \(x_i\) such that:
\[ b_{k-1} \leq x_i < b_k \]
This count is denoted as \(n_k\) for bin \(k\).
Step 3: Compute the Bin Heights
The height \(h_k\) of each bin in the histogram is determined by either the frequency or the relative frequency of the data points within that bin.
- The formula for the height \(h_k\) when using frequency is:
\[ h_k = \dfrac{n_k}{\text{bin width}} = \frac{n_k}{b_k - b_{k-1}} \]
where \(n_k\) is the number of data points in bin \(k\) and \(b_k - b_{k-1}\) is the width of the bin. This height represents the density of data points per unit of measurement in the bin.
- The formula for the height \(h_k\) when using proportion \(p_k=\dfrac{n_k}{n}\) is:
\[ h_k = \dfrac{p_k}{\text{bin width}} = \frac{p_k}{b_k - b_{k-1}} \]
Step 4: Representation
The histogram visually represents these calculations, where each bin's height corresponds to its computed \(h_k\). With these density-scaled heights, the area of each bin (height times width) equals its frequency \(n_k\) (or its proportion \(p_k\)), ensuring that bin areas are proportional to the number of observations falling within each bin.
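The four steps can be checked by hand without drawing anything; `faithful` is a built-in R dataset and the breakpoints below are hypothetical:

```r
x <- faithful$eruptions                    # built-in example data
breaks <- c(1.5, 2.5, 3.5, 4.0, 4.5, 5.5)  # hypothetical breakpoints b_0 < ... < b_K

n_k <- hist(x, breaks = breaks, plot = FALSE)$counts  # Step 2: counts per bin
p_k <- n_k / length(x)                                # proportions
h_k <- p_k / diff(breaks)                             # Step 3: h_k = p_k / bin width

sum(h_k * diff(breaks))  # areas sum to 1 on the proportion scale
```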
Example 1.2 (Histogram)
```r
library(plotly)  # attaches ggplot2 and the %>% pipe

breaks <- c(3, 4, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6, 6.5, 7, 7.5, 8)  # (1)

(iris %>%
    ggplot(mapping = aes(x = Sepal.Length, y = after_stat(density),
                         fill = Species, color = Species)) +
    geom_histogram(breaks = breaks, position = position_identity(), alpha = 0.75) +
    geom_density(alpha = 0.50)  # (2)
) %>%
  ggplotly() %>%
  layout(legend = list(x = 0.75, y = 0.90))
```

1. Choose breakpoints.
2. Add density estimation.
1.3.5 Cumulative distribution function
The Empirical Cumulative Distribution Function (ECDF) provides a visual representation of the proportion or count of observations below each value in a dataset. It is particularly useful for understanding the distribution of the data, identifying outliers, and comparing distributions.
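Base R also ships an `ecdf()` function that returns the ECDF as a callable step function; a minimal sketch on made-up data:

```r
x  <- c(2, 5, 5, 9, 12)  # small illustrative sample
Fn <- ecdf(x)            # Fn is a step function
Fn(5)                    # proportion of observations <= 5: 0.6
```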
Example 1.3 (Cumulative distribution function)
```r
library(plotly)  # attaches ggplot2 and the %>% pipe

# A hand-rolled ECDF: distinct sorted values and their cumulative proportions
myEcdf <- function(x) {
  xx  <- sort(x)                     # sort the data
  tx  <- table(xx)                   # frequency of each distinct value
  yy  <- cumsum(tx) / length(xx)     # cumulative proportions
  xxx <- as.numeric(unique(xx))      # distinct values in increasing order
  list(x = xxx, y = as.numeric(yy))  # return value assumed; lost in the source
}

data_vector <- rnorm(20, mean = 50, sd = 10)  # (1)

(ggplot(data = data.frame(x = data_vector), aes(x = x)) +
    stat_ecdf(geom = "step", color = "steelblue", lwd = 1) +
    xlim(min(data_vector) - 1, max(data_vector) + 5) +
    labs(x = "Data Values", y = "ECDF")
) %>% ggplotly()
```

1. Generate data randomly from a normal distribution.