Levels | Blue | Brown | Gray | Green | Hazel | Total |
---|---|---|---|---|---|---|
Freq. | 27 | 45 | 5 | 11 | 12 | 100 |
Prop. | 0.27 | 0.45 | 0.05 | 0.11 | 0.12 | 1 |
1 Univariate analysis
1.1 Introduction
Univariate analysis focuses on analyzing and summarizing a single variable at a time.
Univariate statistics help summarize and understand the characteristics of that variable, providing a foundation for further data analysis and interpretation.
They provide insights into the distribution, central tendency, and variability of the data.
These statistics depend on the nature of the statistical series we have.
1.2 Nominal series
Let \(x_1,\cdots,x_K\) denote the categories.
1.2.1 Statistic table
- Frequency of \(x_k\): \(n_k\) is the number of occurrences (counts) of category \(x_k\) in the nominal series;
- Sample size: \(n=\sum_{k=1}^Kn_k\);
- Proportion of \(x_k\): \(f_k=\dfrac{n_k}{n}\);
- Percentage of \(x_k\): \(p_k=f_k\times 100\);
- Mode (M): the most common category (or categories) in the nominal series, i.e., the category with the highest frequency.
Levels | \(x_1\) | \(\cdots\) | \(x_k\) | \(\cdots\) | \(x_K\) | Total |
---|---|---|---|---|---|---|
Freq. | \(n_1\) | \(\cdots\) | \(n_k\) | \(\cdots\) | \(n_K\) | \(n\) |
Prop. | \(f_1\) | \(\cdots\) | \(f_k\) | \(\cdots\) | \(f_K\) | \(1\) |
- When the levels are ordered, the statistical table is supplemented with cumulative frequencies and cumulative proportions;
- Assume the levels are ordered, \(x_1\preceq\cdots\preceq x_K\):
  - Cumulative freq.: \(N_k=\sum_{h=1}^k n_h\)
  - Cumulative prop.: \(F_k=\sum_{h=1}^k f_h\)
Levels | \(x_1\) | \(\cdots\) | \(x_k\) | \(\cdots\) | \(x_K\) | Total |
---|---|---|---|---|---|---|
Freq. | \(n_1\) | \(\cdots\) | \(n_k\) | \(\cdots\) | \(n_K\) | \(n\) |
C. Freq. | \(N_1\) | \(\cdots\) | \(N_k\) | \(\cdots\) | \(N_K\) | - |
Prop. | \(f_1\) | \(\cdots\) | \(f_k\) | \(\cdots\) | \(f_K\) | \(1\) |
C. Prop. | \(F_1\) | \(\cdots\) | \(F_k\) | \(\cdots\) | \(F_K\) | - |
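As a quick illustration, the whole statistic table can be computed in base R; the data and factor levels below are hypothetical:

```r
# Hypothetical ordered categorical data
x <- factor(c("Low", "Mid", "Mid", "High", "Low", "Mid", "High", "Mid"),
            levels = c("Low", "Mid", "High"), ordered = TRUE)

n_k <- table(x)          # frequencies n_k
f_k <- prop.table(n_k)   # proportions f_k
N_k <- cumsum(n_k)       # cumulative frequencies N_k
F_k <- cumsum(f_k)       # cumulative proportions F_k

rbind("Freq." = n_k, "C. Freq." = N_k, "Prop." = f_k, "C. Prop." = F_k)
```

Note that `cumsum()` only makes sense here because the factor levels are ordered.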
1.2.2 Graphics
Example 1.1 (Eye colors) The following data represent the eye colors of \(n=100\) individuals. The eye colors observed are Brown, Blue, Green, Hazel, and Gray, which reflect typical variations in human eye color.
Barplot

Pie chart using ggplot2 (use `print(gg)` to display it)

Pie chart using plotly
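The graphics above can be sketched with ggplot2; the counts come from Example 1.1's table, and the object names (`eye`, `gg`) are illustrative:

```r
library(ggplot2)

# Eye-color counts from Example 1.1
eye <- data.frame(
  color = c("Blue", "Brown", "Gray", "Green", "Hazel"),
  freq  = c(27, 45, 5, 11, 12)
)

# Barplot of the frequencies
ggplot(eye, aes(x = color, y = freq, fill = color)) +
  geom_col() +
  labs(x = "Eye color", y = "Frequency")

# Pie chart with ggplot2: a single stacked bar in polar coordinates
gg <- ggplot(eye, aes(x = "", y = freq, fill = color)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()
print(gg)
```

For an interactive version, `plotly::plot_ly(eye, labels = ~color, values = ~freq, type = "pie")` draws the same pie chart.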
1.3 Quantitative series
Consider the following quantitative series: \(x=\left(x_1,\cdots,x_n\right)\in\mathbb{R}^n\).
Sample Size: \(n\)
1.3.1 Measures of Central Tendency
- Minimum: \(x_{(1)}=\min\left\{x_1,\cdots,x_n\right\}\)
- Maximum: \(x_{(n)}=\max\left\{x_1,\cdots,x_n\right\}\)
- Mean (average): \(\overline{x}=\dfrac{1}{n}\sum_{i=1}^nx_i\)
- Order \(\alpha\) quantile: any number \(q_{\alpha}\) such that \[ \begin{aligned} \dfrac{1}{n}\sum_{i=1}^n 1_{[x_i\leq q_{\alpha}]}&\geq\alpha\quad\text{and}\\ \dfrac{1}{n}\sum_{i=1}^n 1_{[x_i\geq q_{\alpha}]}&\geq 1-\alpha \end{aligned} \]
- First Quartile: \(q_{0.25}\)
- Second Quartile or median: \(q_{0.5}\)
- Third Quartile: \(q_{0.75}\)
- Mode: The most common value in the series
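These measures map directly onto base R functions; the data vector below is made up for illustration:

```r
x <- c(3, 7, 7, 2, 9, 7, 4, 6, 8, 5)       # hypothetical series

min(x); max(x); mean(x)
quantile(x, probs = c(0.25, 0.50, 0.75))   # quartiles (default type = 7)
median(x)                                  # same as quantile(x, 0.5)

# Mode: R has no built-in mode for data; take the most frequent value
as.numeric(names(which.max(table(x))))     # 7
```

Note that `quantile()` implements nine interpolation algorithms; the default (type 7) interpolates between order statistics, while type 1 (the inverse ECDF) is closest to the order-\(\alpha\) definition given above.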
1.3.2 Measures of Variability
- Range: \(x_{(n)}-x_{(1)}=\max\left\{x_i\right\} - \min\left\{x_i\right\}\)
- Variance: \(\widetilde{S}^2=\dfrac{1}{n}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2\)
- Sample variance: \(S^2=\dfrac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2\)
- Standard Deviation (sample standard deviation): \(\sqrt{S^2}\)
- Interquartile range: \(IQR=q_{0.75}-q_{0.25}\)
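In R, note that `var()` and `sd()` use the \(n-1\) (sample) divisor, so the divisor-\(n\) variance \(\widetilde{S}^2\) has to be rescaled by hand:

```r
x <- c(3, 7, 7, 2, 9, 7, 4, 6, 8, 5)   # hypothetical series
n <- length(x)

diff(range(x))          # range = max - min
S2  <- var(x)           # sample variance S^2 (divisor n - 1)
St2 <- S2 * (n - 1) / n # variance with divisor n
sd(x)                   # sample standard deviation sqrt(S^2)
IQR(x)                  # interquartile range (uses quantile type 7)
```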
1.3.3 Measures of Distribution Shape
1.3.3.1 Skewness
- Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. The skewness value can be positive, zero, negative, or undefined.
\[ \text{Sk} = \dfrac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{S} \right)^3 \]
Multiplying by \(\dfrac{n}{(n-1)(n-2)}\) adjusts for bias in the skewness calculation.
This gives a dimensionless number. A skewness close to \(0\) indicates a symmetric distribution, a positive skewness indicates a distribution skewed to the right (a long right tail), and a negative skewness indicates a distribution skewed to the left (a long left tail).
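The adjusted skewness can be coded straight from the formula above (packages such as e1071 or moments provide similar functions); the helper name is made up:

```r
# Bias-adjusted sample skewness, following the formula above
skewness_adj <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)           # standardize with the sample sd S
  n / ((n - 1) * (n - 2)) * sum(z^3)
}

skewness_adj(c(1, 2, 2, 3, 3, 3, 10))  # positive: right-skewed
```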
1.3.3.2 Kurtosis
- Kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
- A higher kurtosis implies a high likelihood of outliers.
\[ \begin{aligned} \text{Ku} =& \dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \times\\ &\sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{S} \right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}\\ \end{aligned} \]
- This formula corrects the small-sample bias in estimating the population (excess) kurtosis through the factor \(\dfrac{n(n+1)}{(n-1)(n-2)(n-3)}\) and the subtracted term.
Interpretation of the kurtosis value:
A kurtosis close to 0 indicates a distribution with tails similar to the normal distribution.
A positive kurtosis indicates a distribution with heavier tails than the normal distribution, suggesting a higher likelihood of outliers.
A negative kurtosis indicates a distribution with lighter tails than the normal distribution, suggesting fewer outliers.
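Like the skewness, the adjusted excess kurtosis can be implemented directly from the formula (the function name is illustrative); the correction requires \(n\geq 4\):

```r
# Bias-adjusted excess kurtosis, following the formula above
kurtosis_adj <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)   # standardize with the sample sd S
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

kurtosis_adj(c(2, 4, 4, 4, 5, 5, 7, 9))
```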
1.3.4 Graphics
1.3.4.1 Histogram
To construct a histogram manually from a data set \(x = (x_1, x_2, \dots, x_n)\), you need a set of breakpoints or class boundaries that define the bins of the histogram. Let these breakpoints be denoted as \(b_0, b_1, \dots, b_K\), where \(K\) is the number of bins.
1.3.4.1.1 Computation of Bin Heights
Each bin of the histogram represents a range of data values, and the height of each bin (or bar) is calculated based on the frequency of data points within these ranges.
Step 1: Define the Breakpoints
Assume the breakpoints \(b_0, b_1, \dots, b_K\) are given. These should satisfy:
\[ b_0 < b_1 < b_2 < \cdots < b_K \]
These breakpoints partition the range of the data into \(K\) intervals or bins where each bin \(k\) represents the interval \([b_{k-1}, b_k)\).
Step 2: Count the Data Points in Each Bin
For each bin \(k\), count the number of data points \(x_i\) such that:
\[ b_{k-1} \leq x_i < b_k \]
This count is denoted as \(n_k\) for bin \(k\).
Step 3: Compute the Bin Heights
The height \(h_k\) of each bin in the histogram is determined by either the frequency or the relative frequency of the data points within that bin.
- The formula for the height \(h_k\) when using frequency is:
\[ h_k = \dfrac{n_k}{\text{bin width}} = \frac{n_k}{b_k - b_{k-1}} \]
where \(n_k\) is the number of data points in bin \(k\) and \(b_k - b_{k-1}\) is the width of the bin. This height represents the density of data points per unit of measurement in the bin.
- The formula for the height \(h_k\) when using proportion \(p_k=\dfrac{n_k}{n}\) is:
\[ h_k = \dfrac{p_k}{\text{bin width}} = \frac{p_k}{b_k - b_{k-1}} \]
Step 4: Representation
The histogram visually represents these calculations, where each bin's height corresponds to its computed \(h_k\). With these density-scaled heights, the area of each bin (height times width) equals its frequency \(n_k\) (or its proportion \(p_k\)), ensuring that bin areas are proportional to the number of observations falling within each bin.
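The four steps can be checked by hand without drawing anything; `faithful` is a built-in R dataset and the breakpoints below are hypothetical:

```r
x <- faithful$eruptions                    # built-in example data
breaks <- c(1.5, 2.5, 3.5, 4.0, 4.5, 5.5)  # hypothetical breakpoints b_0 < ... < b_K

n_k <- hist(x, breaks = breaks, plot = FALSE)$counts  # Step 2: counts per bin
p_k <- n_k / length(x)                                # proportions
h_k <- p_k / diff(breaks)                             # Step 3: h_k = p_k / bin width

sum(h_k * diff(breaks))  # areas sum to 1 on the proportion scale
```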
Example 1.2 (Histogram)
```r
library(plotly)  # attaches ggplot2 and the %>% pipe

breaks <- c(3, 4, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6, 6.5, 7, 7.5, 8)  # (1)

(iris %>%
    ggplot(mapping = aes(x = Sepal.Length, y = after_stat(density),
                         fill = Species, color = Species)) +
    geom_histogram(breaks = breaks, position = position_identity(), alpha = 0.75) +
    geom_density(alpha = 0.50)  # (2)
) %>%
  ggplotly() %>%
  layout(legend = list(x = 0.75, y = 0.90))
```

1. Choose breakpoints.
2. Add density estimation.
1.3.5 Cumulative distribution function
The Empirical Cumulative Distribution Function (ECDF) provides a visual representation of the proportion or count of observations below each value in a dataset. It is particularly useful for understanding the distribution of the data, identifying outliers, and comparing distributions.
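Base R also ships an `ecdf()` function that returns the ECDF as a callable step function; a minimal sketch on made-up data:

```r
x  <- c(2, 5, 5, 9, 12)  # small illustrative sample
Fn <- ecdf(x)            # Fn is a step function
Fn(5)                    # proportion of observations <= 5: 0.6
```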
Example 1.3 (Cumulative distribution function)
```r
library(plotly)  # attaches ggplot2 and the %>% pipe

# A hand-rolled ECDF: distinct sorted values and their cumulative proportions
myEcdf <- function(x) {
  xx  <- sort(x)                     # sort the data
  tx  <- table(xx)                   # frequency of each distinct value
  yy  <- cumsum(tx) / length(xx)     # cumulative proportions
  xxx <- as.numeric(unique(xx))      # distinct values in increasing order
  list(x = xxx, y = as.numeric(yy))  # return value assumed; lost in the source
}

data_vector <- rnorm(20, mean = 50, sd = 10)  # (1)

(ggplot(data = data.frame(x = data_vector), aes(x = x)) +
    stat_ecdf(geom = "step", color = "steelblue", lwd = 1) +
    xlim(min(data_vector) - 1, max(data_vector) + 5) +
    labs(x = "Data Values", y = "ECDF")
) %>% ggplotly()
```

1. Generate data randomly from a normal distribution.