Tidyverse suit of packages for R provide some nifty solutions to clean and arrange your data
Data science as a discipline finds it’s applications across all fields of knowledge. If you are from biological sciences, the amount of statistics in most of the papers, is a testament to this collusion. However statistics and data-science are not the same. You can perform statistical analysis with your calculator, MS Excel or other software.
All programming languages at their very cores, are basically a communication platform between the user and the computer. Since computers understand the binary language, programming languages act as the translating tools which have specific synatx and granmar that help us effectively communicate with our computers and tell it what we want to do. Most programming languages like R have strict grammar that needs to be followed to ensure that the computer understands our commands. Therefore, before embarking on the data science journey we need to understand the rules of communication. At the very base of any sentence is a grammar that includes nouns, verbs, adjectives and punctuation. In the case of R, we can translate these as the following.
R is a language that can handle a multitude of data types. Data can be single dataframes, vectors, lists or matrices.
- character: “a”, “swc”
- numeric: 2, 15.5
- integer: 2L (the L tells R to store this as an integer)
- logical: TRUE, FALSE
- complex: 1+4i
class: left ## Exercise Data types - class() - what kind of object is it (high-level)? - types() - what is the object’s data type (low-level)? - length() - how long is it? What about two dimensional objects? - attributes() - does it have any metadata?
class(gapminder)
[1] "tbl_df" "tbl" "data.frame"
typeof(gapminder$country)
[1] "integer"
length(gapminder$continent)
[1] 1704
attributes(gapminder$continent)
$levels
[1] "Africa" "Americas" "Asia" "Europe" "Oceania"
$class
[1] "factor"
class: left ## Data structures in R There are three major types of Data Structures - Vector - single dimension - Matrix - two dimensional - Array- n-dimensional
class: left ## Operators in R
Operators are used to define relationships between variables in R.
Operators are like punctuations in a sentence.
Different types of operators
| Operator | Description |
|---|---|
| + | Addition of two operands |
| - | subtraction of b from a |
| * | Multiplication |
| / | Division of a by b |
| %% | Remainder from division of a with b |
| %/% | quotient from division of a and b |
| ^ | a to the power of b |
#find the quotient for 23456789 /15432
23456789 %/% 15432
[1] 1520
# R follows BODMAS rule
((12345 + 456789) * (987456/2345678)^212) / 1234568
[1] 8.330017e-81
| Operator | Description |
|---|---|
| = | Assigns right side value to left side operand |
| <- | Assigns right side value to left side operand |
| -> | Assigns left side value to right side operand |
| <<- | Assigns right side value to left side operand |
| ->> | Assigns left side value to right side operand |
# R Operators - R Assignment Operators
a = 2
print ( a )
[1] 2
a <- TRUE
print ( a )
[1] TRUE
454 -> a
print ( a )
[1] 454
a <<- 2.9
print ( a )
[1] 2.9
[1] 6 8 9
| Operator | Description |
|---|---|
| < | Is first operand less than second operand |
| “>” | Is first operand greater than second operand |
| == | Is first operand equal to second operand |
| <= | Is first operand less than or equal to second operand |
| “>=” | Is first operand greater than or equal to second operand |
| “!=” | Is first operand not equal to second operand |
a <- 7.5
b <- 2
print( a < b ) # less than
[1] FALSE
print( a > b ) # greater than
[1] TRUE
print( a == b ) # equal to
[1] FALSE
print( a <= b ) # less than or equal to
[1] FALSE
print( a >= b ) # greater than or equal to
[1] TRUE
print( a != b ) # not equal to
[1] TRUE
| Operator | Description |
|---|---|
| & | Element wise logical AND operation |
| ?| | Element wise logical OR operation |
| ! | Element wise logical NOT operation. |
| && | Operand wise logical AND operation. |
| | | | Operand wise logical OR operation. |
# R Operators - R Logical Operators Example for basic logical elements
a <- 12 # logical FALSE
b <- 2 # logical TRUE
print ( a > 5 & a < 15 ) # logical AND element wise
[1] TRUE
print ( TRUE & FALSE )
[1] FALSE
print ( TRUE | FALSE ) # logical OR element wise
[1] TRUE
print ( !a ) # logical NOT element wise
[1] FALSE
print ( a && b ) # logical AND consolidated for all elements
[1] TRUE
print ( a || b ) # logical OR consolidated for all elements
[1] TRUE
| Operator | Description |
|---|---|
| %>% | pipe operator |
| %in% | Identifies if an element(a) belongs to a vector(b) |
| : | creates series of vectors from left to right |
st <- c(1:234) # create a vector with values from 1 to 234
# R Operators - R Misc Operators
print ( a )
[1] 12
[1] TRUE
print(a %in% b)
[1] FALSE TRUE FALSE
a = gapminder$country
b = "Afghanistan"
c = "Congo"
print(c %in% a)
[1] FALSE
print(b %in% a)
[1] TRUE
names(gapminder)
[1] "country" "continent" "year" "lifeExp" "pop"
[6] "gdpPercap"
gapminder$continent %>% summary()# prints a summary of the factor gapminder$country
Africa Americas Asia Europe Oceania
624 300 396 360 24
Fast
Reproducible
#example
#library(gapminder)
#library(tidyverse)
data %>% function1 () %>% function(2) %>% function3() %>% function4()
#example
#library(gapminder)
gapminder1 <- tbl_df(gapminder)
gapminder1
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
names(gapminder)
[1] "country" "continent" "year" "lifeExp" "pop"
[6] "gdpPercap"
str(gapminder)# investigate the structcure of the dataset
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
summary(gapminder)
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
View(gapminder)
glimpse(gapminder) # from here you can check the class of each variable as well
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …
table(gapminder$country)# count the number of times
Afghanistan Albania
12 12
Algeria Angola
12 12
Argentina Australia
12 12
Austria Bahrain
12 12
Bangladesh Belgium
12 12
Benin Bolivia
12 12
Bosnia and Herzegovina Botswana
12 12
Brazil Bulgaria
12 12
Burkina Faso Burundi
12 12
Cambodia Cameroon
12 12
Canada Central African Republic
12 12
Chad Chile
12 12
China Colombia
12 12
Comoros Congo, Dem. Rep.
12 12
Congo, Rep. Costa Rica
12 12
Cote d'Ivoire Croatia
12 12
Cuba Czech Republic
12 12
Denmark Djibouti
12 12
Dominican Republic Ecuador
12 12
Egypt El Salvador
12 12
Equatorial Guinea Eritrea
12 12
Ethiopia Finland
12 12
France Gabon
12 12
Gambia Germany
12 12
Ghana Greece
12 12
Guatemala Guinea
12 12
Guinea-Bissau Haiti
12 12
Honduras Hong Kong, China
12 12
Hungary Iceland
12 12
India Indonesia
12 12
Iran Iraq
12 12
Ireland Israel
12 12
Italy Jamaica
12 12
Japan Jordan
12 12
Kenya Korea, Dem. Rep.
12 12
Korea, Rep. Kuwait
12 12
Lebanon Lesotho
12 12
Liberia Libya
12 12
Madagascar Malawi
12 12
Malaysia Mali
12 12
Mauritania Mauritius
12 12
Mexico Mongolia
12 12
Montenegro Morocco
12 12
Mozambique Myanmar
12 12
Namibia Nepal
12 12
Netherlands New Zealand
12 12
Nicaragua Niger
12 12
Nigeria Norway
12 12
Oman Pakistan
12 12
Panama Paraguay
12 12
Peru Philippines
12 12
Poland Portugal
12 12
Puerto Rico Reunion
12 12
Romania Rwanda
12 12
Sao Tome and Principe Saudi Arabia
12 12
Senegal Serbia
12 12
Sierra Leone Singapore
12 12
Slovak Republic Slovenia
12 12
Somalia South Africa
12 12
Spain Sri Lanka
12 12
Sudan Swaziland
12 12
Sweden Switzerland
12 12
Syria Taiwan
12 12
Tanzania Thailand
12 12
Togo Trinidad and Tobago
12 12
Tunisia Turkey
12 12
Uganda United Kingdom
12 12
United States Uruguay
12 12
Venezuela Vietnam
12 12
West Bank and Gaza Yemen, Rep.
12 12
Zambia Zimbabwe
12 12
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …
gapminder %>% select(continent:gdpPercap) %>% glimpse()
Rows: 1,704
Columns: 5
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …
gapminder1
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
gapminder %>% select(c(country, year, gdpPercap)) %>% glimpse()
Rows: 1,704
Columns: 3
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 19…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, …
gapminder$year <- as.integer(gapminder$year)
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
#gapminder <- rename(gapminder,c(gdp = "gdp"))
names(gapminder)
[1] "country" "continent" "year" "lifeExp" "pop"
[6] "gdpPercap"
Rows: 1,704
Columns: 6
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 19…
$ lifeExp <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66…
$ pop <int> 8425333, 1282697, 9279525, 4232095, 17876956, 8691…
$ gdpPercap <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.31…
View(gapminder)
gapminder %>% arrange(year) %>% glimpse()
Rows: 1,704
Columns: 6
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 19…
$ lifeExp <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66…
$ pop <int> 8425333, 1282697, 9279525, 4232095, 17876956, 8691…
$ gdpPercap <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.31…
gapminder %>% arrange(desc(year)) %>% glimpse()
Rows: 1,704
Columns: 6
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
$ lifeExp <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79…
$ pop <int> 31889923, 3600523, 33333216, 12420476, 40301927, 2…
$ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3…
gapminder %>% group_by(continent, year) %>% glimpse()
Rows: 1,704
Columns: 6
Groups: continent, year [60]
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Ar…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, E…
$ year <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 19…
$ lifeExp <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66…
$ pop <int> 8425333, 1282697, 9279525, 4232095, 17876956, 8691…
$ gdpPercap <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.31…
gapminder %>% group_by(continent, year) %>% summarise(meangdp = mean(gdpPercap, na.rm = TRUE)) %>% print()
# A tibble: 60 x 3
# Groups: continent [5]
continent year meangdp
<fct> <int> <dbl>
1 Africa 1952 1253.
2 Africa 1957 1385.
3 Africa 1962 1598.
4 Africa 1967 2050.
5 Africa 1972 2340.
6 Africa 1977 2586.
7 Africa 1982 2482.
8 Africa 1987 2283.
9 Africa 1992 2282.
10 Africa 1997 2379.
# … with 50 more rows
new <- gapminder %>% group_by(continent, year) %>%
summarise(meangdp = mean(gdpPercap, na.rm = TRUE)) %>% print()
# A tibble: 60 x 3
# Groups: continent [5]
continent year meangdp
<fct> <int> <dbl>
1 Africa 1952 1253.
2 Africa 1957 1385.
3 Africa 1962 1598.
4 Africa 1967 2050.
5 Africa 1972 2340.
6 Africa 1977 2586.
7 Africa 1982 2482.
8 Africa 1987 2283.
9 Africa 1992 2282.
10 Africa 1997 2379.
# … with 50 more rows
new
# A tibble: 60 x 3
# Groups: continent [5]
continent year meangdp
<fct> <int> <dbl>
1 Africa 1952 1253.
2 Africa 1957 1385.
3 Africa 1962 1598.
4 Africa 1967 2050.
5 Africa 1972 2340.
6 Africa 1977 2586.
7 Africa 1982 2482.
8 Africa 1987 2283.
9 Africa 1992 2282.
10 Africa 1997 2379.
# … with 50 more rows
gapminder %>%
group_by(continent) %>%
summarize(n_obs = n(), n_countries = n_distinct(country)) %>% print()
# A tibble: 5 x 3
continent n_obs n_countries
<fct> <int> <int>
1 Africa 624 52
2 Americas 300 25
3 Asia 396 33
4 Europe 360 30
5 Oceania 24 2
Distill is a publication format for scientific and technical writing, native to the web.
Learn more about using Distill at https://rstudio.github.io/distill.
For attribution, please cite this work as
Goswami (2021, June 23). The Thought Factory: Data wrangling with dplyr. Retrieved from https://sitendu.netlify.app/posts/2021-06-23-data-wranglng-with-r/
BibTeX citation
@misc{goswami2021data,
author = {Goswami, Sitendu},
title = {The Thought Factory: Data wrangling with dplyr},
url = {https://sitendu.netlify.app/posts/2021-06-23-data-wranglng-with-r/},
year = {2021}
}