STA 113 2.0 Descriptive Statistics

Summary Measures (Cont.)

Dr. Thiyanga S. Talagala
Department of Statistics, Faculty of Applied Sciences
University of Sri Jayewardenepura, Sri Lanka

Variance of a sample

The variance of a sample of n observations \(x_1, x_2, x_3, \ldots, x_n\) having mean \(\bar{x}\) is defined as

\[s^2 = \frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}\]

Alternatively, the sample variance can be written in the following forms:

\[s^2 = \frac{1}{n-1}[\sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^nx_i)^2}{n}]\]

or

\[s^2 = \frac{1}{n-1}[\sum_{i=1}^nx_i^2-n\bar{x}^2]\]
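The shortcut forms follow from expanding the squared deviations in the definition. The sketch below computes the definition and both shortcut forms (the second using \(n\bar{x}^2\)) and shows that they agree. The data set and function names are hypothetical, chosen only for illustration.

```python
def variance_definition(xs):
    """Sample variance from the definition: sum of squared deviations / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def variance_from_sums(xs):
    """Shortcut form using sum(x^2) and (sum(x))^2 / n."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def variance_from_mean(xs):
    """Shortcut form using sum(x^2) and n * xbar^2."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)


data = [2, 4, 6, 8, 10]  # hypothetical sample with mean 6

print(variance_definition(data))  # 10.0
print(variance_from_sums(data))   # 10.0
print(variance_from_mean(data))   # 10.0
```

With real-valued data the three results may differ in the last few decimal places because of floating-point rounding, so compare them with a tolerance rather than exact equality.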

Measures of relative standing/ Measures of noncentral location

  • Quartiles

  • Percentiles

Quartiles

Quartiles are descriptive measures that split the ordered data into four quarters (four equal parts).

Q1 - first (lower) quartile

Q2 - second (middle) quartile

Q3 - third (upper) quartile

First quartile

The value for which 25% of the observations are smaller and 75% are larger

\[Q_1 = \left(\frac{n+1}{4}\right)^{\text{th}} \text{ ordered observation}\]

Second quartile

Same as the median

Third quartile

The value for which 75% of the observations are smaller and 25% are larger

\[Q_3 = \left(\frac{3(n+1)}{4}\right)^{\text{th}} \text{ ordered observation}\]

Rules

  1. If the resulting positioning point is an integer, take the particular value corresponding to that positioning point.

  2. For a non-integer position \(p\), let \(k\) be the integer part and \(d\) be the fractional part (e.g., for \(p=2.75\), \(k=2\) and \(d=0.75\)), and interpolate between the two neighbouring ordered values:

\[Q_q = ((1-d) \times \text{ value at position }k )+ (d \times \text{ value at position } (k+1))\]
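The two rules can be sketched in a few lines of Python. The function names and both data sets are hypothetical. The position \((n+1)/2\) for Q2 is the usual median position, and the sketch assumes at least three observations so that every position falls inside the data.

```python
def value_at_position(sorted_data, p):
    """Value at 1-based position p in already sorted data.

    Rule 1: integer position -> take that ordered observation.
    Rule 2: otherwise interpolate between positions k and k + 1.
    """
    k = int(p)   # integer part of the position
    d = p - k    # fractional part of the position
    if d == 0:
        return sorted_data[k - 1]
    return (1 - d) * sorted_data[k - 1] + d * sorted_data[k]


def quartiles(data):
    """Q1, Q2, Q3 using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    xs = sorted(data)
    n = len(xs)
    return (
        value_at_position(xs, (n + 1) / 4),
        value_at_position(xs, (n + 1) / 2),
        value_at_position(xs, 3 * (n + 1) / 4),
    )


# n = 7: positions 2, 4 and 6 are integers, so rule 1 applies.
print(quartiles([2, 4, 4, 5, 7, 9, 10]))        # (4, 5, 9)

# n = 8: positions 2.25, 4.5 and 6.75 need interpolation (rule 2).
print(quartiles([13, 2, 7, 5, 11, 3, 17, 19]))  # (3.5, 9.0, 16.0)
```

Statistical software often uses other positioning rules, so its quartiles can differ slightly from the values this rule gives.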

Your turn

Find the quartiles of

1,3,4,6,7,8,10,12,14,15

Percentiles

  • Percentiles divide an ordered data set into 100 equal parts, each containing 1% of the observations. There are 99 percentiles, denoted P1, P2, P3, …, P99, known as the 1st percentile, 2nd percentile, …, 99th percentile respectively.

first decile = 10th percentile

Q1 = 25th percentile

Q2 = 50th percentile

Q3 = 75th percentile

ninth decile = 90th percentile

Location of a percentile

The following formula allows us to approximate the location of any percentile.

\[L_p = (n+1)\frac{p}{100}\]

where \(L_p\) is the location of the \(p^{th}\) percentile.
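As a quick check, setting \(p = 25, 50, 75\) in this formula recovers the quartile positions:

\[L_{25} = \frac{n+1}{4}, \qquad L_{50} = \frac{n+1}{2}, \qquad L_{75} = \frac{3(n+1)}{4}\]

For a hypothetical sample of \(n = 9\) ordered observations this gives

\[L_{25} = 10 \times 0.25 = 2.5, \qquad L_{50} = 10 \times 0.50 = 5, \qquad L_{75} = 10 \times 0.75 = 7.5\]

so P50 is the 5th ordered observation. Assuming the same interpolation rule as for quartiles, P25 lies halfway between the 2nd and 3rd ordered observations and P75 halfway between the 7th and 8th.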

Your turn

5, 5, 10, 12, 13, 14, 17, 19, 27, 38

Find \(L_{25}\), \(L_{50}\) and \(L_{75}\).

Interquartile Range (IQR)

  • Measure of dispersion

\[IQR = Q_3 - Q_1\]

  • It measures the spread of the middle 50% of the data.

  • Not influenced by extreme values.

Box and whisker Plot

1 4 8 9 11 5 4 3 2 20
3 7 8 10 2 6 7 2 20 30
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.00    6.50    8.10    9.25   30.00 
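The summary above can be reproduced with Python's standard library, treating the two printed rows as one sample of 20 values. Two assumptions are made here. First, `statistics.quantiles` with `method="inclusive"` is used because it matches the printed quartiles for this sample. The \((n+1)/4\) positioning rule would give a slightly different upper quartile (9.75). Second, the 1.5 × IQR fences used to place the whiskers and flag outliers are a common convention that these notes do not state.

```python
from statistics import mean, quantiles

values = [1, 4, 8, 9, 11, 5, 4, 3, 2, 20,
          3, 7, 8, 10, 2, 6, 7, 2, 20, 30]

# Quartiles by linear interpolation over the observed range.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)

# Whiskers end at the most extreme observations inside the fences.
whisker_low = min(v for v in values if v >= lower_fence)
whisker_high = max(v for v in values if v <= upper_fence)

print(min(values), q1, q2, mean(values), q3, max(values))  # 1 3.0 6.5 8.1 9.25 30
print(iqr)                        # 6.25
print(whisker_low, whisker_high)  # 1 11
print(outliers)                   # [20, 20, 30]
```

Under these assumptions the box spans 3 to 9.25 with the median line at 6.5, the whiskers reach 1 and 11, and the values 20, 20 and 30 are drawn as individual points.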

Let’s draw the box and whisker plot

Data

s1 s2 s3 s4
-0.4941825 0.9915941 1.2651757 -0.4547021
-0.6687905 0.7621325 0.4623908 -0.5227580
-0.5469909 0.4472162 0.1096497 -0.9385280
-0.6231902 0.9652850 1.6850990 -1.0626077
1.4458192 0.7953235 0.1734091 -0.7810612
1.4321517 0.9403680 0.2576630 -0.9400142
-1.7834318 0.2037412 0.9451751 -0.4921403
-0.1310000 0.5811571 0.0841218 -0.9899774
-1.1957376 0.5004076 0.9543987 -0.6680712
0.4298630 0.6224225 0.2428342 -0.2485180
0.6814681 0.3328808 0.8645305 -0.7541138
-0.7584425 0.6059351 0.8742853 -0.6532270
0.4903058 0.5572273 0.0927645 -1.1676951
-0.0359587 0.9757127 1.4798731 -1.0630715
-0.4126305 0.3127026 0.6411980 -0.9250913
-0.6958339 0.8282239 0.6666132 -1.2498425
1.1344751 0.5325348 0.1791371 -0.7756105
-0.2479113 0.2439792 0.5884629 -0.8957835
0.3880169 0.0597912 0.5377590 -1.2037738
-0.6955792 0.7714712 0.7566644 -0.3658032
-1.2602048 0.8607332 0.0237823 -0.8675267
-0.4121627 0.9078947 0.3648342 -1.3747030
-1.0600587 0.5853123 0.0991288 -0.9682996
-0.6901896 0.8314007 0.1282657 -0.8799173
-0.1512943 0.7785955 0.3227742 -1.3436420
-0.8817142 0.2555937 0.2120927 -0.8572044
2.3139477 0.3352818 3.4199927 -0.7016565
2.6254204 0.9125926 0.3246409 -1.2752869
0.0997495 0.1433173 0.5174605 -0.3899073
-0.6003863 0.5621560 0.8610009 -0.8642701
0.7316799 0.7373602 0.2427046 -1.1415947
0.5253369 0.4902503 2.0961923 -0.7712001
0.2979311 0.4586207 0.2769770 -0.1074875
34 0.1707726 0.9417774 0.5032580 -0.6044107
35 -0.3198814 0.8077437 0.0181263 -1.2421812
36 -0.8186460 0.6444308 6.8161623 -1.1468929
37 -0.0072054 0.8448965 0.3052239 -0.8844985
38 -0.4512637 0.3424624 1.3661048 -1.3386497
39 -0.1925807 0.5625122 0.9218917 -0.8877711
40 2.2657184 0.6736529 0.2075009 -0.4708897
41 -0.9951849 0.8909363 5.0017890 -0.9994745
42 -0.2856625 0.8033804 0.0968602 -0.7467907
43 0.2191878 0.1782850 1.5306568 -0.7682978
44 -1.0367488 0.8343284 0.3886233 -1.3388111
45 1.1718172 0.2599477 2.9639756 -0.7148029
46 0.1918960 0.5094981 0.5536213 -0.7586514
47 0.5286750 0.7987479 0.4308663 -0.9896018
48 1.5910981 0.6943393 0.6255345 -1.1681160
49 -1.1722861 0.9982673 0.4587839 -0.9371935
50 0.1934595 0.6716770 0.3116617 -1.1103634
51 -1.4356298 0.9698931 5.7776384 0.8623060
52 -0.7890743 0.5939492 0.0224797 1.2137335
53 -0.6106055 0.0478229 3.0745683 0.9983251
54 -2.3119511 0.5605901 0.8132757 0.6957828
55 0.8667858 0.3635374 0.2700010 0.6795177
56 0.4041022 0.7116650 0.6794256 0.8919491
57 2.0842797 0.6958857 0.3890802 1.4396962
58 -1.6350715 0.4293465 0.9316933 1.5557211
59 -1.0041166 0.7481887 2.2529575 0.9289430
60 -0.1833006 0.8444459 0.3561461 1.2954135
61 -0.6370463 0.6549463 1.5700440 1.4239294
62 0.7891126 0.9183704 3.4331633 0.9424039
63 -1.9343009 0.8069813 1.7880099 1.0407747
64 1.0142296 0.9055865 0.5375650 0.0505981
65 1.9339288 0.2774429 0.4135352 1.1525336
66 0.0475422 0.9947034 1.3488308 1.2897422
67 0.2263924 0.8236884 0.8586223 0.4717407
68 -0.4959669 0.5652228 1.1851333 1.1189474
69 0.8229532 0.8178838 0.1777389 0.1525808
70 0.9298399 0.9716167 2.0837673 1.1283459
71 -0.8396403 0.6955667 1.1255832 0.8254749
72 -2.6643294 0.5633113 2.6622062 0.4479653
73 1.6035095 0.3289823 0.7328378 1.1938157
74 -0.7712137 0.6223501 3.6241769 1.0612159
75 -0.3840300 0.3932550 0.1783265 0.9074435
76 0.4037572 0.8684061 0.0584704 0.7707421
77 -0.2177293 0.7424245 1.3926143 0.7044549
78 -0.0787400 0.8584837 0.1558730 0.7731302
79 -1.3780882 0.4996357 0.5496975 1.1174710
80 -0.4246498 0.9272462 0.0421114 1.0224212
81 0.3751303 0.2122582 0.0489879 1.2401247
82 1.0816647 0.8919946 5.6415263 1.1950053
83 -0.1589649 0.4317767 0.2723074 0.6052308
84 0.2382120 0.1271924 0.0322854 0.6628394
85 0.7576892 0.7173759 0.5761504 0.7795326
86 -0.9829056 0.3897945 0.4854600 0.8800581
87 0.3364830 0.4760740 2.5864094 1.1548429
88 -0.2949718 0.9238953 1.3136616 0.5118688
89 0.7969451 0.6609144 7.4977078 1.2616219
90 -0.6483374 0.3351259 1.1174526 0.1745040
91 -0.3334201 0.9155915 0.2567393 1.0479585
92 -0.9726304 0.6654774 0.3236723 0.8446453
93 -0.8009399 0.2512914 1.4986115 0.4152249
94 -1.2797944 0.9317872 2.0137554 0.8929735
95 -1.0862999 0.5974472 0.4877782 0.3485280
96 -1.1646197 0.9956436 0.0835399 0.6462357
97 -0.0201359 0.7754757 0.3588838 1.6385065
98 -0.4628470 0.6599176 2.7117010 0.7595622
99 -0.4063630 0.4857695 0.5379405 0.8379056
100 -0.0119749 0.8521035 1.3446080 0.5883827

Histogram

Box and whisker Plot

Your turn

Prices of Chocolate in Rupees (LKR):

50, 75, 100, 125, 150, 175, 200, 225, 250, 275

Prices of Chocolate in USD:

5, 7, 9, 12, 15, 18, 20, 23, 26, 0

Which group of chocolate prices exhibits the greater variation?

Coefficient of variation

  • Relative measure of variation

  • It is always expressed as a percentage rather than in the units of the particular data.

  • This is useful when comparing two or more sets of data that are measured in different units.

\[CV = \frac{s}{\bar{x}}\times 100\%\]
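Because \(s\) and \(\bar{x}\) carry the same units, the units cancel in the ratio. The sketch below illustrates this with hypothetical heights recorded in centimetres and again in metres: the standard deviations differ by a factor of 100, but the coefficient of variation is the same. The function name and data are assumptions made for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(xs):
    """CV as a percentage: sample standard deviation divided by the mean."""
    return stdev(xs) / mean(xs) * 100


heights_cm = [150, 160, 170, 180, 190]      # hypothetical heights in cm
heights_m = [h / 100 for h in heights_cm]   # the same heights in metres

print(stdev(heights_cm), stdev(heights_m))  # differ by a factor of 100
print(round(coefficient_of_variation(heights_cm), 2))  # 9.3
print(round(coefficient_of_variation(heights_m), 2))   # 9.3
```

The CV is only meaningful for data measured on a ratio scale with a positive mean. It becomes unstable when the mean is close to zero.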

The coefficient of variation of the heights of 30 people selected at random from a given village is found to be 15%. The mean weight of the same group is 72 kg, with a standard deviation of 8 kg.

The obtained results show that

  1. the weight is more variable than height.

  2. the weight is less variable than height.

  3. height and weight have the same degree of variation.

  4. height and weight values are identical.

Measures of shape

Skewness and the relationship of the mean, median and mode

In-class diagram:

Skewness describes the degree and direction of asymmetry in the data.

Formula used to calculate skewness in Excel:

\[\text{Skewness} = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{s}\right)^3\]
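A minimal sketch of this formula in Python, where the divisor inside the sum is the sample standard deviation \(s\) (computed with \(n-1\)). The function name and the three small data sets are hypothetical. The formula needs at least three observations and a non-zero standard deviation.

```python
from statistics import mean, stdev


def sample_skewness(xs):
    """Skewness: n / ((n-1)(n-2)) * sum of cubed standardized deviations."""
    n = len(xs)
    if n < 3:
        raise ValueError("at least three observations are required")
    xbar = mean(xs)
    s = stdev(xs)  # sample standard deviation (divisor n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in xs)


symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]      # one large value stretches the right tail
left_tail = [-10, -3, -2, -2, -1]  # mirror image of right_tail

print(sample_skewness(symmetric))   # approximately 0
print(sample_skewness(right_tail))  # positive
print(sample_skewness(left_tail))   # negative
```

The sign of the result gives the direction of the asymmetry, and its magnitude gives the degree.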

Skewness

  • Skewness is the degree of asymmetry of a distribution.

  • If the frequency distribution has a longer “tail” to the right of the central maximum than to the left, the distribution is said to be skewed to the right (or to have a positive skewness).

  • If the reverse is true, it is said to be skewed to the left (or to have a negative skewness)

Pearson’s first coefficient of skewness

\[\text{Skewness} = \frac{\text{mean}-\text{mode}}{\text{standard deviation}}\]

or

Pearson’s second coefficient of skewness

To avoid using the mode

\[\text{Skewness} = \frac{3(\text{mean}-\text{median})}{\text{standard deviation}}\]

Conditions

  1. If \(Mean = Mode = Median\), the coefficient of skewness is zero (symmetric distribution).

  2. If \(Mean > Mode\), the coefficient of skewness is positive.

  3. If \(Mean < Mode\), the coefficient of skewness is negative.

Karl Pearson’s coefficient of skewness is positive for positively skewed distributions and negative for negatively skewed distributions.
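Both Pearson coefficients are easy to compute from summary statistics. The sketch below uses a small hypothetical data set with a long right tail. The function names are assumptions, and `statistics.mode` returns the first mode encountered when there are ties, so the first coefficient is only meaningful for data with a single clear mode.

```python
from statistics import mean, median, mode, stdev


def pearson_first(xs):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    return (mean(xs) - mode(xs)) / stdev(xs)


def pearson_second(xs):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(xs) - median(xs)) / stdev(xs)


data = [2, 3, 3, 3, 4, 5, 6, 8, 11]  # hypothetical, right-skewed

print(mean(data), median(data), mode(data))  # 5 4 3
print(round(pearson_first(data), 3))         # 0.686
print(round(pearson_second(data), 3))        # 1.029
```

Here the mean exceeds the mode, so both coefficients are positive. The two coefficients need not be equal in a real sample.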

Empirical relationship between mean, median and mode

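Equating Pearson’s first and second coefficients of skewness gives \(Mean - Mode = 3(Mean - Median)\), which rearranges to the approximate (empirical) relationship

\(3 \times Median = Mode + 2 \times Mean\)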

  1. Symmetric

\(Mean = Median = Mode\)

  2. Positively Skewed

\(Mean > Median > Mode\)

  3. Negatively Skewed

\(Mean < Median < Mode\)

Kurtosis

  • Degree of peakedness of a distribution

  • Usually taken relative to a normal distribution

Type         Kurtosis  Excess Kurtosis
Mesokurtic   = 3       = 0
Leptokurtic  > 3       > 0
Platykurtic  < 3       < 0

Excess Kurtosis = Kurtosis - 3
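The classification in the table can be applied mechanically once a kurtosis value is known. The sketch below does so for the three kurtosis values in the flipper-length summary table later in these notes. The function name is hypothetical, and the strict comparison with 3 follows the table literally.

```python
def classify_kurtosis(kurtosis):
    """Return (excess kurtosis, type) following the table above."""
    excess = kurtosis - 3
    if excess > 0:
        label = "leptokurtic"
    elif excess < 0:
        label = "platykurtic"
    else:
        label = "mesokurtic"
    return excess, label


# Kurtosis values copied from the flipper-length summary table.
flipper_kurtosis = {
    "Adelie": 3.327738,
    "Chinstrap": 2.956063,
    "Gentoo": 2.322009,
}

for species, k in flipper_kurtosis.items():
    excess, label = classify_kurtosis(k)
    print(f"{species}: excess kurtosis = {excess:.3f} ({label})")
```

A sample kurtosis is almost never exactly 3, so in practice a value close to 3 (such as the Chinstrap value here) is usually described as approximately mesokurtic.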

Your turn

Find the formulas for kurtosis and excess kurtosis.

Interpret the following

Figure 1: Boxplot of Flipper Length by Species

species    Mean      Median  Mode  SD        Q1   Q2   Q3     Kurtosis  Skewness
Adelie     190.1027  190     190   6.521825  186  190  195.0  3.327738   0.0795785
Chinstrap  195.8235  196     187   7.131894  191  196  201.0  2.956063  -0.0092622
Gentoo     217.2353  216     215   6.585431  212  216  221.5  2.322009   0.3640858

  • The distributions of flipper lengths for each species appear to be roughly symmetric.

  • Gentoo penguins have longer flippers than both Adélie and Chinstrap penguins, which have similar flipper lengths.

  • There are two notable outliers in the flipper lengths for Adélie penguins, which are visible in the boxplot.