Data wrangling with dplyr

Rstats Tidyverse

Tidyverse suit of packages for R provide some nifty solutions to clean and arrange your data

Sitendu Goswami
06-23-2021

Introduction

Data science as a discipline finds it’s applications across all fields of knowledge. If you are from biological sciences, the amount of statistics in most of the papers, is a testament to this collusion. However statistics and data-science are not the same. You can perform statistical analysis with your calculator, MS Excel or other software.

library(tidyverse)
#install.packages("magrittr")
library(magrittr)
library(gapminder)
gapminder <- as_tibble(gapminder)
gapminder2 <-  gapminder
gapminder3 <- gapminder

Basic components of a R syntax -

All programming languages at their very cores, are basically a communication platform between the user and the computer. Since computers understand the binary language, programming languages act as the translating tools which have specific synatx and granmar that help us effectively communicate with our computers and tell it what we want to do. Most programming languages like R have strict grammar that needs to be followed to ensure that the computer understands our commands. Therefore, before embarking on the data science journey we need to understand the rules of communication. At the very base of any sentence is a grammar that includes nouns, verbs, adjectives and punctuation. In the case of R, we can translate these as the following.

Data types in R (Nouns)

R is a language that can handle a multitude of data types. Data can be single dataframes, vectors, lists or matrices.



- character: “a”, “swc”

- numeric: 2, 15.5

- integer: 2L (the L tells R to store this as an integer)

- logical: TRUE, FALSE

- complex: 1+4i

class: left ## Exercise Data types - class() - what kind of object is it (high-level)? - types() - what is the object’s data type (low-level)? - length() - how long is it? What about two dimensional objects? - attributes() - does it have any metadata?

class(gapminder)
[1] "tbl_df"     "tbl"        "data.frame"
typeof(gapminder$country)
[1] "integer"
length(gapminder$continent)
[1] 1704
attributes(gapminder$continent)
$levels
[1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania" 

$class
[1] "factor"

class: left ## Data structures in R There are three major types of Data Structures - Vector - single dimension - Matrix - two dimensional - Array- n-dimensional

a <- c(1:54)
class(a)
[1] "integer"

class: left ## Operators in R

Arithmetic operators



Operator Description
+ Addition of two operands
- subtraction of b from a
* Multiplication
/ Division of a by b
%% Remainder from division of a with b
%/% quotient from division of a and b
^ a to the power of b

Exercise for arithmetic operators

#find the quotient for  23456789 /15432
23456789 %/% 15432
[1] 1520
# R follows BODMAS rule
 ((12345 + 456789) * (987456/2345678)^212) / 1234568
[1] 8.330017e-81

Assignment operators

Operator Description
= Assigns right side value to left side operand
<- Assigns right side value to left side operand
-> Assigns left side value to right side operand
<<- Assigns right side value to left side operand
->> Assigns left side value to right side operand

Exercise for assignment operators

# R Operators - R Assignment Operators
a = 2
print ( a )
[1] 2
a <- TRUE
print ( a )
[1] TRUE
454 -> a
print ( a )
[1] 454
a <<- 2.9
print ( a )
[1] 2.9
c(6, 8, 9) -> a
print ( a )
[1] 6 8 9

Relational operators

Operator Description
< Is first operand less than second operand
“>” Is first operand greater than second operand
== Is first operand equal to second operand
<= Is first operand less than or equal to second operand
“>=” Is first operand greater than or equal to second operand
“!=” Is first operand not equal to second operand

Exercise for relational operators

a <- 7.5
b <- 2
print( a < b ) # less than
[1] FALSE
print( a > b ) # greater than
[1] TRUE
print( a == b ) # equal to
[1] FALSE
print( a <= b ) # less than or equal to
[1] FALSE
print( a >= b ) # greater than or equal to
[1] TRUE
print( a != b ) # not equal to
[1] TRUE

Logical operators

Operator Description
& Element wise logical AND operation
?| Element wise logical OR operation
! Element wise logical NOT operation.
&& Operand wise logical AND operation.
| | Operand wise logical OR operation.

Exercise for Logical operators

# R Operators - R Logical Operators Example for basic logical elements
 
a <- 12 # logical FALSE
b <- 2 # logical TRUE
 
print ( a > 5 & a < 15 ) # logical AND element wise
[1] TRUE
print ( TRUE & FALSE )
[1] FALSE
print ( TRUE | FALSE ) # logical OR element wise
[1] TRUE
print ( !a ) # logical NOT element wise
[1] FALSE
print ( a && b ) # logical AND consolidated for all elements
[1] TRUE
print ( a || b ) # logical OR consolidated for all elements
[1] TRUE

Miscellaneous operators

Operator Description
%>% pipe operator
%in% Identifies if an element(a) belongs to a vector(b)
: creates series of vectors from left to right

Exercise for miscellaneous operators

st <- c(1:234) # create a vector with values from 1 to 234
# R Operators - R Misc Operators
print ( a )
[1] 12
a = c(25, 27, 76)
b = 27
print( b %in% a ) # check if b is present within a
[1] TRUE
print(a %in% b)
[1] FALSE  TRUE FALSE
a = gapminder$country
b = "Afghanistan"
c = "Congo"
print(c %in% a)
[1] FALSE
print(b %in% a)
[1] TRUE
names(gapminder)
[1] "country"   "continent" "year"      "lifeExp"   "pop"      
[6] "gdpPercap"
gapminder$continent %>% summary()# prints a summary of the factor gapminder$country
  Africa Americas     Asia   Europe  Oceania 
     624      300      396      360       24 

Data wrangling in R

The endless scope and opportunities

Topics

Pipe operators

Pipe operators cont..

#example 
#library(gapminder)
#library(tidyverse)

Pipe operators help you read and write code from left to right

data %>% function1 () %>% function(2) %>% function3() %>% function4()

#example 
#library(gapminder)
gapminder1 <- tbl_df(gapminder)
gapminder1
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows

What to do to understand your data?

names(gapminder)
[1] "country"   "continent" "year"      "lifeExp"   "pop"      
[6] "gdpPercap"
str(gapminder)# investigate the structcure of the dataset
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Summarize your data

summary(gapminder)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       
View(gapminder)

glimpse function

glimpse(gapminder) # from here you can check the class of each variable as well
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …

Table function

table(gapminder$country)# count the number of times 

             Afghanistan                  Albania 
                      12                       12 
                 Algeria                   Angola 
                      12                       12 
               Argentina                Australia 
                      12                       12 
                 Austria                  Bahrain 
                      12                       12 
              Bangladesh                  Belgium 
                      12                       12 
                   Benin                  Bolivia 
                      12                       12 
  Bosnia and Herzegovina                 Botswana 
                      12                       12 
                  Brazil                 Bulgaria 
                      12                       12 
            Burkina Faso                  Burundi 
                      12                       12 
                Cambodia                 Cameroon 
                      12                       12 
                  Canada Central African Republic 
                      12                       12 
                    Chad                    Chile 
                      12                       12 
                   China                 Colombia 
                      12                       12 
                 Comoros         Congo, Dem. Rep. 
                      12                       12 
             Congo, Rep.               Costa Rica 
                      12                       12 
           Cote d'Ivoire                  Croatia 
                      12                       12 
                    Cuba           Czech Republic 
                      12                       12 
                 Denmark                 Djibouti 
                      12                       12 
      Dominican Republic                  Ecuador 
                      12                       12 
                   Egypt              El Salvador 
                      12                       12 
       Equatorial Guinea                  Eritrea 
                      12                       12 
                Ethiopia                  Finland 
                      12                       12 
                  France                    Gabon 
                      12                       12 
                  Gambia                  Germany 
                      12                       12 
                   Ghana                   Greece 
                      12                       12 
               Guatemala                   Guinea 
                      12                       12 
           Guinea-Bissau                    Haiti 
                      12                       12 
                Honduras         Hong Kong, China 
                      12                       12 
                 Hungary                  Iceland 
                      12                       12 
                   India                Indonesia 
                      12                       12 
                    Iran                     Iraq 
                      12                       12 
                 Ireland                   Israel 
                      12                       12 
                   Italy                  Jamaica 
                      12                       12 
                   Japan                   Jordan 
                      12                       12 
                   Kenya         Korea, Dem. Rep. 
                      12                       12 
             Korea, Rep.                   Kuwait 
                      12                       12 
                 Lebanon                  Lesotho 
                      12                       12 
                 Liberia                    Libya 
                      12                       12 
              Madagascar                   Malawi 
                      12                       12 
                Malaysia                     Mali 
                      12                       12 
              Mauritania                Mauritius 
                      12                       12 
                  Mexico                 Mongolia 
                      12                       12 
              Montenegro                  Morocco 
                      12                       12 
              Mozambique                  Myanmar 
                      12                       12 
                 Namibia                    Nepal 
                      12                       12 
             Netherlands              New Zealand 
                      12                       12 
               Nicaragua                    Niger 
                      12                       12 
                 Nigeria                   Norway 
                      12                       12 
                    Oman                 Pakistan 
                      12                       12 
                  Panama                 Paraguay 
                      12                       12 
                    Peru              Philippines 
                      12                       12 
                  Poland                 Portugal 
                      12                       12 
             Puerto Rico                  Reunion 
                      12                       12 
                 Romania                   Rwanda 
                      12                       12 
   Sao Tome and Principe             Saudi Arabia 
                      12                       12 
                 Senegal                   Serbia 
                      12                       12 
            Sierra Leone                Singapore 
                      12                       12 
         Slovak Republic                 Slovenia 
                      12                       12 
                 Somalia             South Africa 
                      12                       12 
                   Spain                Sri Lanka 
                      12                       12 
                   Sudan                Swaziland 
                      12                       12 
                  Sweden              Switzerland 
                      12                       12 
                   Syria                   Taiwan 
                      12                       12 
                Tanzania                 Thailand 
                      12                       12 
                    Togo      Trinidad and Tobago 
                      12                       12 
                 Tunisia                   Turkey 
                      12                       12 
                  Uganda           United Kingdom 
                      12                       12 
           United States                  Uruguay 
                      12                       12 
               Venezuela                  Vietnam 
                      12                       12 
      West Bank and Gaza              Yemen, Rep. 
                      12                       12 
                  Zambia                 Zimbabwe 
                      12                       12 

Change the class of a variable.

class(gapminder$country)
[1] "factor"
gapminder$country <- as.character(gapminder$country)

Select

glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …
gapminder %>% select(continent:gdpPercap) %>% glimpse()
Rows: 1,704
Columns: 5
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …
gapminder1
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows

Exercise

gapminder %>%  select(c(country, year, gdpPercap)) %>%  glimpse()
Rows: 1,704
Columns: 3
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …
gapminder$year <- as.integer(gapminder$year)

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Renaming variables

#gapminder <-  rename(gapminder,c(gdp = "gdp"))
names(gapminder)
[1] "country"   "continent" "year"      "lifeExp"   "pop"      
[6] "gdpPercap"

Arranging variables

library(gapminder)
gapminder <- arrange(gapminder,(year))
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year      <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 19…
$ lifeExp   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66…
$ pop       <int> 8425333, 1282697, 9279525, 4232095, 17876956, 8691…
$ gdpPercap <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.31…
View(gapminder)
gapminder  %>%  arrange(year) %>%  glimpse()
Rows: 1,704
Columns: 6
$ country   <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year      <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 19…
$ lifeExp   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66…
$ pop       <int> 8425333, 1282697, 9279525, 4232095, 17876956, 8691…
$ gdpPercap <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.31…
gapminder %>%  arrange(desc(year)) %>%  glimpse()
Rows: 1,704
Columns: 6
$ country   <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year      <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
$ lifeExp   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79…
$ pop       <int> 31889923, 3600523, 33333216, 12420476, 40301927, 2…
$ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3…

Summarize & aggregate 1

gapminder %>% group_by(continent, year) %>% glimpse()
Rows: 1,704
Columns: 6
Groups: continent, year [60]
$ country   <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year      <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 19…
$ lifeExp   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66…
$ pop       <int> 8425333, 1282697, 9279525, 4232095, 17876956, 8691…
$ gdpPercap <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.31…
gapminder %>% group_by(continent, year) %>% summarise(meangdp = mean(gdpPercap, na.rm = TRUE)) %>%  print()
# A tibble: 60 x 3
# Groups:   continent [5]
   continent  year meangdp
   <fct>     <int>   <dbl>
 1 Africa     1952   1253.
 2 Africa     1957   1385.
 3 Africa     1962   1598.
 4 Africa     1967   2050.
 5 Africa     1972   2340.
 6 Africa     1977   2586.
 7 Africa     1982   2482.
 8 Africa     1987   2283.
 9 Africa     1992   2282.
10 Africa     1997   2379.
# … with 50 more rows

summarize 2

new <- gapminder %>% group_by(continent, year) %>%
  summarise(meangdp = mean(gdpPercap, na.rm = TRUE)) %>% print()
# A tibble: 60 x 3
# Groups:   continent [5]
   continent  year meangdp
   <fct>     <int>   <dbl>
 1 Africa     1952   1253.
 2 Africa     1957   1385.
 3 Africa     1962   1598.
 4 Africa     1967   2050.
 5 Africa     1972   2340.
 6 Africa     1977   2586.
 7 Africa     1982   2482.
 8 Africa     1987   2283.
 9 Africa     1992   2282.
10 Africa     1997   2379.
# … with 50 more rows
new
# A tibble: 60 x 3
# Groups:   continent [5]
   continent  year meangdp
   <fct>     <int>   <dbl>
 1 Africa     1952   1253.
 2 Africa     1957   1385.
 3 Africa     1962   1598.
 4 Africa     1967   2050.
 5 Africa     1972   2340.
 6 Africa     1977   2586.
 7 Africa     1982   2482.
 8 Africa     1987   2283.
 9 Africa     1992   2282.
10 Africa     1997   2379.
# … with 50 more rows

Summarize 3 | some other functions

gapminder %>%
group_by(continent) %>%
summarize(n_obs = n(), n_countries = n_distinct(country)) %>%  print()
# A tibble: 5 x 3
  continent n_obs n_countries
  <fct>     <int>       <int>
1 Africa      624          52
2 Americas    300          25
3 Asia        396          33
4 Europe      360          30
5 Oceania      24           2

Distill is a publication format for scientific and technical writing, native to the web.

Learn more about using Distill at https://rstudio.github.io/distill.

Citation

For attribution, please cite this work as

Goswami (2021, June 23). The Thought Factory: Data wrangling with dplyr. Retrieved from https://sitendu.netlify.app/posts/2021-06-23-data-wranglng-with-r/

BibTeX citation

@misc{goswami2021data,
  author = {Goswami, Sitendu},
  title = {The Thought Factory: Data wrangling with dplyr},
  url = {https://sitendu.netlify.app/posts/2021-06-23-data-wranglng-with-r/},
  year = {2021}
}