2  Data Structures

There are five main data structures in R. They are:

  1. vector

  2. matrix

  3. array

  4. data frame

  5. list

2.1 Vectors

  1. One dimensional data object.

  2. Homogeneous data structure. All elements of a vector must be of the same type or mode (for example numeric, character, or logical). If you mix different types, R automatically converts them to a single type.

2.2 Creating Vectors

Vectors can be made in four primary ways:

  1. using the c() function

  2. using the : operator

  3. using the seq function

  4. using the rep function

Methods 2 to 4 simplify vector creation. They are useful when there is a pattern in the data.

2.2.1 Concatenate: c()

Syntax: c(value1, value2, ...)

Example:

The following creates a vector without assigning it a name.

c(1996, 1998, 2000, 2005)
[1] 1996 1998 2000 2005

Assigning a name to a vector:

The advantage of assigning a name is that we can reuse the same set of values by calling the vector name.

a <- c(1996, 1998, 2000, 2005)
a
[1] 1996 1998 2000 2005
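
Because c() concatenates, its arguments can themselves be vectors. The following is a minimal sketch; the names b and years and the extra values are made up for illustration.

```r
a <- c(1996, 1998, 2000, 2005)
b <- c(2010, 2015)        # hypothetical second vector

# c() joins its arguments end to end into one flat vector
years <- c(a, b, 2020)
years
length(years)             # 4 + 2 + 1 elements
```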

2.2.2 Colon: :

The : operator creates a regular increasing or decreasing sequence in steps of one.

Examples:

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
10:1
 [1] 10  9  8  7  6  5  4  3  2  1
-0.5:10
 [1] -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5
-0.3:10
 [1] -0.3  0.7  1.7  2.7  3.7  4.7  5.7  6.7  7.7  8.7  9.7

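In all of the above sequences the step is one. The sequence starts at the first value and stops at the last value that does not go past the second, so the end point itself appears only when it is a whole number of steps from the start.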
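
The behaviour at the end point can be checked directly. This is a small sketch; the names x and steps are arbitrary.

```r
x <- -0.3:10

length(x)                 # number of values generated
max(x) <= 10              # TRUE: the sequence never passes the end point

# Consecutive values differ by one (compared with a tolerance,
# because the values are stored as doubles)
steps <- diff(x)
all.equal(steps, rep(1, length(steps)))

# Whole-number end points give an integer vector, fractional ones a double
typeof(1:10)
typeof(-0.3:10)
```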

2.2.3 Sequence: seq

The seq function can also be used to create a regular sequence. With seq you can control the increment (by) and the length of the output (length.out).

Example 1

seq(1, 19)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

Example 2

seq(1, 19, length.out=8)
[1]  1.000000  3.571429  6.142857  8.714286 11.285714 13.857143 16.428571
[8] 19.000000

Example 3

seq(1, 19, by = 3)
[1]  1  4  7 10 13 16 19
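
The arguments by and length.out can be contrasted in a short sketch. The values are chosen only for illustration, and s1, s2 and s3 are arbitrary names.

```r
# by: the sequence stops at the last value that does not pass `to`
s1 <- seq(1, 20, by = 3)
s1                        # 1 4 7 10 13 16 19

# length.out: the step is worked out so that both end points are included
s2 <- seq(0, 1, length.out = 5)
s2                        # 0.00 0.25 0.50 0.75 1.00

# from + by + length.out: the end point is implied
s3 <- seq(from = 10, by = -2, length.out = 4)
s3                        # 10 8 6 4
```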

2.3 Repeat: rep

The rep function can be used if there is a pattern of repetition in the data.

Example 1

The number 8 is repeated five times.

rep(8, 5)
[1] 8 8 8 8 8

Example 2

The sequence 1, 2, 3 is repeated five times.

rep(1:3, times=5)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Example 3

Same as Example 2 above: the second argument is matched to times by position.

rep(1:3, 5)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Example 4

Each element in the sequence is repeated five times.

rep(1:3, each=5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Example 5

First, each element is repeated five times. After that, the whole sequence is repeated three times.

rep(1:3, each=5, times=3)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2
[39] 2 2 3 3 3 3 3

Example 6

Same as before. Changing the order of the each and times arguments does not change the output, because each is always applied first.

rep(1:3, times=3, each=5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2
[39] 2 2 3 3 3 3 3
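
rep accepts two further forms that are not shown above: a vector of counts for times, and length.out. The sketch below uses made-up values, and r1, r2 and r3 are arbitrary names.

```r
# times can be a vector: one count per element of the input
r1 <- rep(c("a", "b", "c"), times = c(3, 1, 2))
r1                        # "a" "a" "a" "b" "c" "c"

# length.out recycles the pattern until the requested length is reached
r2 <- rep(1:3, length.out = 7)
r2                        # 1 2 3 1 2 3 1

# each is applied before times, so the result has 3 * 5 * 3 elements
r3 <- rep(1:3, each = 5, times = 3)
length(r3)                # 45
```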

2.4 Coercion

When you combine different types in one vector, they are coerced to the most flexible type present. From least to most flexible, the order is logical, integer, double, character.

a <- c(1, 3, "GPA", TRUE, 1L)
typeof(a)
[1] "character"
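
The coercion order can be seen by combining two types at a time. This is a small sketch; the names t1, t2 and t3 are arbitrary.

```r
t1 <- typeof(c(TRUE, 2L))       # logical + integer  -> "integer"
t2 <- typeof(c(2L, 3.5))        # integer + double   -> "double"
t3 <- typeof(c(3.5, "GPA"))     # double + character -> "character"
c(t1, t2, t3)

# TRUE/FALSE become 1/0 when coerced to numbers ...
c(TRUE, FALSE, 5)               # 1 0 5
# ... but "TRUE"/"FALSE" when coerced to character
c(TRUE, "GPA")                  # "TRUE" "GPA"
```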

Explicit coercion means intentionally converting data from one type to another using a specific function such as as.integer(), as.numeric(), or as.character(). For example,

b <- c(3.1, 3.2, 3.7, 5.9)
b
[1] 3.1 3.2 3.7 5.9
as.integer(b)
[1] 3 3 3 5
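
Note that as.integer() drops the fractional part rather than rounding. The sketch below contrasts it with round() and shows what happens when a value cannot be converted. The character vector is made up for illustration, and suppressWarnings() is used only to hide the "NAs introduced by coercion" warning that R would otherwise print.

```r
b <- c(3.1, 3.2, 3.7, 5.9)

as.integer(b)             # 3 3 3 5: fractional part dropped
round(b)                  # 3 3 4 6: nearest whole number, still a double
as.character(b)           # "3.1" "3.2" "3.7" "5.9"

# Values that cannot be converted become NA
n <- suppressWarnings(as.numeric(c("10", "2.5", "GPA")))
n                         # 10.0 2.5 NA
```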

2.5 Functions that can be used to inspect vectors

Consider the vector below

example.vec <- c(1, 2, 3, 4, 5, 6, 7, 8)
  1. To check the storage mode
typeof(example.vec)
[1] "double"
  2. To check the class
class(example.vec)
[1] "numeric"
  3. Testing functions
is.character(example.vec)
[1] FALSE
is.integer(example.vec)
[1] FALSE
is.logical(example.vec)
[1] FALSE
is.double(example.vec)
[1] TRUE
  4. Mathematical and statistical functions
sum(example.vec)
[1] 36
mean(example.vec)
[1] 4.5
summary(example.vec)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    2.75    4.50    4.50    6.25    8.00 
  5. To check if there are any missing values
is.na(example.vec)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
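
is.na() returns one value per element, which is hard to read for long vectors. The sketch below summarises the result with base R functions; with.na is a hypothetical vector made by appending a missing value.

```r
example.vec <- c(1, 2, 3, 4, 5, 6, 7, 8)

length(example.vec)             # 8 elements
anyNA(example.vec)              # FALSE: no missing values

with.na <- c(example.vec, NA)   # hypothetical vector with one missing value
sum(is.na(with.na))             # 1: number of missing values
mean(with.na)                   # NA: missing values propagate
mean(with.na, na.rm = TRUE)     # 4.5: missing values removed first
```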

There are many more functions that you can use with vectors. We will learn about them in the upcoming chapters.

2.6 Exercise

  1. Write R code to create the following vectors. If you see a pattern in the data, use the simplified methods (:, seq, or rep).
[1] 1990 1992 1934 1957 1970 2000 2005
 [1] 3 6 9 3 6 9 3 6 9 3 6 9 3 6 9
 [1] 3 3 3 3 3 6 6 6 6 6 9 9 9 9 9
 [1] 3 3 3 3 3 6 6 6 6 6 9 9 9 9 9 3 3 3 3 3 6 6 6 6 6 9 9 9 9 9
 [1]  1  4  7 10 13 16 19 22 25 28 31 34
  [1] 0.1000000 0.1020202 0.1040404 0.1060606 0.1080808 0.1101010 0.1121212
  [8] 0.1141414 0.1161616 0.1181818 0.1202020 0.1222222 0.1242424 0.1262626
 [15] 0.1282828 0.1303030 0.1323232 0.1343434 0.1363636 0.1383838 0.1404040
 [22] 0.1424242 0.1444444 0.1464646 0.1484848 0.1505051 0.1525253 0.1545455
 [29] 0.1565657 0.1585859 0.1606061 0.1626263 0.1646465 0.1666667 0.1686869
 [36] 0.1707071 0.1727273 0.1747475 0.1767677 0.1787879 0.1808081 0.1828283
 [43] 0.1848485 0.1868687 0.1888889 0.1909091 0.1929293 0.1949495 0.1969697
 [50] 0.1989899 0.2010101 0.2030303 0.2050505 0.2070707 0.2090909 0.2111111
 [57] 0.2131313 0.2151515 0.2171717 0.2191919 0.2212121 0.2232323 0.2252525
 [64] 0.2272727 0.2292929 0.2313131 0.2333333 0.2353535 0.2373737 0.2393939
 [71] 0.2414141 0.2434343 0.2454545 0.2474747 0.2494949 0.2515152 0.2535354
 [78] 0.2555556 0.2575758 0.2595960 0.2616162 0.2636364 0.2656566 0.2676768
 [85] 0.2696970 0.2717172 0.2737374 0.2757576 0.2777778 0.2797980 0.2818182
 [92] 0.2838384 0.2858586 0.2878788 0.2898990 0.2919192 0.2939394 0.2959596
 [99] 0.2979798 0.3000000
 [1] -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
 [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
[26] 52 54 56 58 60 62 64 66 68 70 72
  2. Use the typeof() function to check the storage mode, and class() to check the class, of each of the following vectors.
logical_vector <- c(TRUE, FALSE, TRUE, FALSE)
integer_vector <- c(1L, 2L, 3L, 4L)
double_vector <- c(1.1, 2.2, 3.3, 4.4)
complex_vector <- c(1+1i, 2+2i, 3+3i, 4+4i)
character_vector <- c("a", "b", "c", "d")
null_vector <- NULL
time_data <- 1996:2006
time_series_data <- ts(1996:2006)
  3. Create the vector (3, 3, ..., 3, 6, 6, ..., 6, 9, 9, ..., 9), where there are 10 occurrences of 3, 20 occurrences of 6 and 30 occurrences of 9.

  4. Find the value of each of the following expressions.

     1. \(\sum_{i=1}^{100}i\)

     2. \(\sum_{i=1}^{100}i^2\)

  5. Generate a sequence using the code seq(from=1, to=10, by=1). What other ways can you generate the same sequence?

  6. Create a vector to hold population values, and label each element with the corresponding province name. The plot will display population values when hovered over.

2.7 Vector Operations

To be added

2.8 Creating Matrix

2.9 Matrix Operations

2.10 Exercise

  1. Write R code to obtain the following matrix outputs.
     [,1] [,2] [,3] [,4] [,5]
[1,]   10   30   50   70   90
[2,]   20   40   60   80  100
     [,1] [,2] [,3] [,4] [,5]
[1,]   10   20   30   40   50
[2,]   60   70   80   90  100
     C1 C2 C3 C4
Row1  1  6 11 16
Row2  2  7 12 17
Row3  3  8 13 18
Row4  4  9 14 19
Row5  5 10 15 20
  2. Mr. Perera, who lives in Soratha Mawatha, Wijerama, wants to sell his house and needs to decide on a listing price. He believes that the size of the house is one likely determinant of price. He asked the owners of 10 homes in the neighbourhood, “What price would you ask for your home?”, and recorded the size of each house (in square feet). The collected data are shown below:
   size_x price_y
1    1000     810
2    1500    1210
3    2000    1450
4    2500    1610
5    3000    1690
6    3500    2010
7    4000    1490
8    4500    1690
9    5000    1890
10   5500    2410
     1. Write R code to input size_x and price_y into two separate vectors.

     2. Mr. Perera wants to compute the least squares estimates of the model \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\). Write R code to compute \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using the matrix operation \(\hat{\beta} = (X^TX)^{-1}X^TY\). Do not use the built-in function lm.

where

\(\hat{\beta} =\begin{pmatrix} \hat{\beta_0} \\ \hat{\beta_1} \\ \end{pmatrix}\), \(Y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}\) and \(X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\)
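
As a hint on the matrix operators involved, the formula \(\hat{\beta} = (X^TX)^{-1}X^TY\) can be sketched with the base R functions t() (transpose), %*% (matrix product) and solve() (inverse). The data below are made up and are not the house data; the names x, y, X, Y and beta_hat are illustrative only.

```r
# Made-up data, NOT the house data from the exercise
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

X <- cbind(1, x)              # design matrix: a column of ones, then x
Y <- matrix(y, ncol = 1)      # response as a column vector

# beta_hat = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat                      # first row: intercept, second row: slope
```

For larger problems, solve(t(X) %*% X, t(X) %*% Y) gives the same estimates without forming the inverse explicitly.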

2.11 Creating Arrays

2.12 Exercise

  1. Create a 3D array with 3 columns, 5 rows, and 2 layers in R, and enter the following values into it:
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30

2.13 Creating data frames

2.14 Exercise

Write code to store the following values in a data frame.

Girth Height Volume
  8.3     70   10.3
  8.6     65   10.3
  8.8     63   10.2
 10.5     72   16.4
 10.7     81   18.8
 10.8     83   19.7
 11.0     66   15.6
 11.0     75   18.2
 11.1     80   22.6
 11.2     75   19.9

2.15 Subsetting

2.16 Exercise

This exercise is based on mtcars, a built-in dataset in R. Write R code to answer the following.

  1. Open the help file for mtcars.
  2. How many cars are in the mtcars dataset?
  3. How many variables are in the mtcars dataset?
  4. What are the column names of the mtcars dataset?
  5. What is the mean miles per gallon (mpg) of the cars in the dataset?
  6. Which car has the highest horsepower (hp)?
  7. What is the mean weight (wt) of the cars in the dataset?
  8. How many cars have 8 cylinders (cyl)?
  9. What is the range of displacement (disp) values in the dataset?
  10. What is the median quarter mile time (qsec) for the cars?
  11. How many cars have a manual transmission (am = 1)?
  12. What is the maximum miles per gallon (mpg) in the dataset?
  13. What is the minimum horsepower (hp) recorded in the dataset?
  14. Which car has the lowest weight (wt)?
  15. How many cars have 4 gears (gear)?
  16. What is the standard deviation of the mpg variable?
  17. What is the total number of carburetors (carb) for all cars combined?
  18. How many cars have a quarter mile time (qsec) less than 18 seconds?
  19. What is the mean value of the gear variable for cars with 6 cylinders (cyl)?
  20. How many cars have more than 100 horsepower (hp)?
  21. What is the correlation between horsepower (hp) and miles per gallon (mpg)?