Web-scraping with rvest

Web scraping with R and rvest is very convenient.

Author

Affiliation

Sitendu Goswami

 

Published

May 13, 2022

Citation

Goswami, 2022

The Internet is filled with information, and as data enthusiasts it is our sacred duty to get data from wherever we can. Some websites provide access to their data through APIs, while others are not as forthcoming. First, we load the packages:

library(rvest)
library(httr)
library(htmltools)
library(stringr)
library(tidyverse)

There is a nice website called 91wheels that lets us look up car prices and compare them. We shall use it to gather up-to-date car prices for downstream analyses.

The first step is to identify the web page and copy its entire URL. Next, we use the read_html() function to read the whole page. This is how web scraping works: HTML is the markup language used to display content on web pages, so every piece of content we see sits inside a tag. Since the data we are interested in is usually text, we can gather it by reading the entire page and then picking out the relevant tags.

tiago<- read_html("https://www.91wheels.com/cars/between-5-lakh-to-10-lakh")
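To make the idea of "content inside tags" concrete, here is a minimal offline sketch. It parses a small hand-written HTML string instead of a live site, so the markup and the class names "car-name" and "car-price" are invented for illustration and are not taken from 91wheels.

```r
library(rvest)

# A tiny hand-written page standing in for a real website.
# The class names "car-name" and "car-price" are made up for this example.
html_string <- '<html><body>
  <div class="car"><h3 class="car-name">Car A</h3><p class="car-price">5.00 - 7.00</p></div>
  <div class="car"><h3 class="car-name">Car B</h3><p class="car-price">6.50 - 9.25</p></div>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(html_string)

# Select every node with a given class, then keep only the text inside the tags
car_names  <- page |> html_nodes(".car-name")  |> html_text()
car_prices <- page |> html_nodes(".car-price") |> html_text()

car_names
car_prices
```

The same two steps, selecting nodes and extracting their text, are what the rest of this post applies to the real page.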

In the next step, I use the SelectorGadget extension from the Chrome Web Store to find the CSS selectors for my two target variables, i.e. car name and car price range.

price<- tiago |> html_nodes("._cardofcar_price__3qcNH") |> html_text() 
price1<- price |> str_remove_all("[:alpha:]") |> str_remove_all("^\\W") |> as_tibble()
name<- tiago |> html_nodes("._cardofcar_titleName__2ZokJ") |> html_text() |>  as_tibble()
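Class names such as "._cardofcar_price__3qcNH" look machine-generated, and it is my assumption (not something the site documents) that the trailing characters may change when the site is rebuilt. A possible way to make the selectors less brittle is to match only the stable part of the class name with a CSS attribute selector. The sketch below uses hypothetical markup with made-up suffixes so that it runs offline.

```r
library(rvest)

# Hypothetical markup imitating class names that end in a generated suffix
page <- read_html('<html><body>
  <span class="_cardofcar_titleName__AAAAA">Car A</span>
  <span class="_cardofcar_price__BBBBB">5.00 - 7.00</span>
</body></html>')

# [class*="..."] matches any element whose class attribute contains the given text
name_text  <- page |> html_nodes('[class*="cardofcar_titleName"]') |> html_text()
price_text <- page |> html_nodes('[class*="cardofcar_price"]') |> html_text()

name_text
price_text
```

The trade-off is that a substring match can pick up unrelated elements if another class happens to contain the same text, so it is worth checking the number of matches.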

Now we have two tibbles called “name” and “price1” that hold all the information we need to create the dataset. But if we look at them, we can see that we are a long way from clean data that is ready for analysis.

head(name) # This tibble is fine since car names are character variables
# A tibble: 6 × 1
  value               
  <chr>               
1 "  KIA Carens  "    
2 "  Tata Punch  "    
3 "  Tata Nexon  "    
4 "  KIA Sonet  "     
5 "  Hyundai Venue  " 
6 "  Nissan Magnite  "
head(price1) # We want clean numeric prices without other strings and annoying characters like the rupee symbol
# A tibble: 6 × 1
  value                
  <chr>                
1 "₹ 9.60  - ₹ 17.70  "
2 "₹ 5.68  - ₹ 9.49  " 
3 "₹ 7.43  - ₹ 13.74  "
4 "₹ 7.15  - ₹ 13.79  "
5 "₹ 6.99  - ₹ 11.72  "
6 "₹ 5.76  - ₹ 10.15  "
price1<- price1 |> separate(value, into= c("lower", "higher"), sep = "-")
head(price1)
# A tibble: 6 × 2
  lower      higher      
  <chr>      <chr>       
1 "₹ 9.60  " " ₹ 17.70  "
2 "₹ 5.68  " " ₹ 9.49  " 
3 "₹ 7.43  " " ₹ 13.74  "
4 "₹ 7.15  " " ₹ 13.79  "
5 "₹ 6.99  " " ₹ 11.72  "
6 "₹ 5.76  " " ₹ 10.15  "

Now we join the car names with the prices to get a data set that is shaping up nicely. We can use the “bind_cols” or “cbind” functions to join them.

final<- bind_cols(name, price1)
head(final)
# A tibble: 6 × 3
  value                lower      higher      
  <chr>                <chr>      <chr>       
1 "  KIA Carens  "     "₹ 9.60  " " ₹ 17.70  "
2 "  Tata Punch  "     "₹ 5.68  " " ₹ 9.49  " 
3 "  Tata Nexon  "     "₹ 7.43  " " ₹ 13.74  "
4 "  KIA Sonet  "      "₹ 7.15  " " ₹ 13.79  "
5 "  Hyundai Venue  "  "₹ 6.99  " " ₹ 11.72  "
6 "  Nissan Magnite  " "₹ 5.76  " " ₹ 10.15  "
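One thing to keep in mind is that bind_cols() pairs rows purely by position. If a car card on the page were missing a price, the two tibbles would have different lengths or, worse, silently misaligned rows. A minimal sketch of a sanity check, using small stand-in tibbles with made-up values:

```r
library(dplyr)

# Hypothetical stand-ins for the scraped tibbles
name   <- tibble(value = c("Car A", "Car B"))
price1 <- tibble(lower = c("5.00", "6.50"), higher = c("7.00", "9.25"))

# bind_cols() matches rows by position, so the row counts must agree
stopifnot(nrow(name) == nrow(price1))
final <- bind_cols(name, price1)

final
```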

Still, the pesky rupee symbol continues to pursue us. Notice that there are spaces around the rupee symbol, which we can use to separate each column into multiple columns and then keep only the piece that contains the price. The lower and higher prices also need to be converted to numeric vectors.

final_prices<- final |> separate(lower, into = c("symbol", "lower", "space"), sep = " ") |> select(value,lower, higher )|> 
  separate(higher, into = c("symbol", "space", "higher"), sep = " ") |> 
  select(value,lower, higher ) |> 
  mutate(lower = as.double(lower),
         higher = as.double(higher))
head(final_prices)
# A tibble: 6 × 3
  value                lower higher
  <chr>                <dbl>  <dbl>
1 "  KIA Carens  "      9.6   17.7 
2 "  Tata Punch  "      5.68   9.49
3 "  Tata Nexon  "      7.43  13.7 
4 "  KIA Sonet  "       7.15  13.8 
5 "  Hyundai Venue  "   6.99  11.7 
6 "  Nissan Magnite  "  5.76  10.2 
final_prices |> knitr::kable(format = "html") # our final output looks like this.
value                 lower  higher
KIA Carens             9.60   17.70
Tata Punch             5.68    9.49
Tata Nexon             7.43   13.74
KIA Sonet              7.15   13.79
Hyundai Venue          6.99   11.72
Nissan Magnite         5.76   10.15
Tata Altroz            6.00   10.00
Tata Tiago             5.23    7.33
Toyota Glanza          6.39    9.69
MG Astor               9.98   17.73
Renault Kiger          5.84   10.39
Honda Amaze            6.44   11.27
Volkswagen Virtus     10.00   10.50
KIA Sonet CNG         11.00   13.00
Tata Curvv            20.00      NA
Tata Avinya EV        15.00   20.00
Tata Blackbird        11.00   15.00
Jeep Meridian         30.00      NA
Mahindra Scorpio 2022 12.00   17.00
Toyota Land Cruiser    1.50      NA
Tata HEXA             14.00   20.00
Audi Q3               40.00      NA
Bugatti Chiron        19.21   21.22
Citroen C3             7.00    9.00
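The chain of separate() calls works, but it depends on the exact number of spaces around the rupee symbol. As an alternative sketch (not the approach used in this post), readr::parse_number() pulls the first number out of a string and str_squish() trims the padded car names. The tibble below is a small stand-in built from the first two rows of the output shown earlier.

```r
library(dplyr)
library(stringr)
library(readr)

# A small stand-in for `final`, using the first two scraped rows
final_demo <- tibble(
  value  = c("  KIA Carens  ", "  Tata Punch  "),
  lower  = c("₹ 9.60  ", "₹ 5.68  "),
  higher = c(" ₹ 17.70  ", " ₹ 9.49  ")
)

final_prices_demo <- final_demo |>
  mutate(
    value  = str_squish(value),    # trim and collapse stray spaces
    lower  = parse_number(lower),  # keep only the first number in the string
    higher = parse_number(higher)
  )

final_prices_demo
```

This variant does not care how many spaces surround the symbol, which may make it a little more forgiving if the page layout changes.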

Scrape the Decathlon India website

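decathlon <- function(search, pagenum) {
  # pagenum is accepted but not used yet; only the first page of results is read
  library(rvest)
  library(tidyverse)
  page <- "https://www.decathlon.in/search?query="
  pagecont <- paste0(page, search)
  # Read the search results page once and reuse it for both selectors
  html <- read_html(pagecont)
  prices <- html |>
    html_nodes("._3wHKeni9X-") |>
    html_text()
  productname <- html |>
    html_nodes(".card-title") |>
    html_text()

  # Combine names and prices; both vectors must have the same length
  price <- tibble(productname, prices)
  # print() shows the tibble and returns it invisibly, so it is also the return value
  print(price)
}
data <- decathlon("merino", 1) |> janitor::clean_names()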
# A tibble: 20 × 2
   productname                                                  prices
   <chr>                                                        <chr> 
 1 "TREKKING MERINO WOOL SCARF  - MT500 - KHAKI "               ₹ 799 
 2 "TREKKING MERINO WOOL KNIT CAP - MT500 - BLACK "             ₹ 699 
 3 "Men's Mountain Trekking Long-sleeved T-Shirt - MT500 MERIN… ₹ 2,4…
 4 "Men's Mountain Trekking Merino Wool Base Layer Tights / Le… ₹ 2,1…
 5 "Men's Long-sleeved T-shirt Mountain Trek Merino Wool with … ₹ 3,5…
 6 "Men's Long-sleeve T-shirt Merino Wool  MT500"               ₹ 2,1…
 7 "Travel 50 Merino Wool t-shirt-Grey"                         ₹ 999 
 8 "Men's Trekking Merino Wool Short-Sleeved T-Shirt MT500 "    ₹ 1,9…
 9 "MERINO WOOL TREKKING HEADBAND -MT500 - BLUE "               ₹ 599 
10 "Women's Mountain Trekking Merino T-shirt Trek 500 - Purple" ₹ 999 
11 "Men's Mountain Trekking Short-Sleeved T-Shirt Trek 500 Mer… ₹ 1,9…
12 "Men's Merino wool trekking travel polo shirt - TRAVEL 500 … ₹ 999 
13 "Women’s merino wool legging underwear - MT500"              ₹ 2,1…
14 "Women's Long-sleeve T-shirt Merino Wool  MT500"             ₹ 2,1…
15 "Men's Mountain Trekking Merino Wool Boxer Shorts MT500"     ₹ 1,5…
16 "Men's Trekking Merino Wool Short-Sleeved T-Shirt MT500 - p… ₹ 1,9…
17 "Women's Travel Trekking Merino Wool Tank Top  Travel 500 -… ₹ 1,4…
18 "Adult Mountain Trekking Merino Wool Liner Gloves Trek 500 … ₹ 899 
19 "Men's Mountain Trekking Merino Wool Boxer Shorts MT500"     ₹ 1,5…
20 "Women’s Long-sleeved Merino Zipped Neck T-shirt - MT500"    ₹ 2,5…
head(data)
# A tibble: 6 × 2
  productname                                                   prices
  <chr>                                                         <chr> 
1 "TREKKING MERINO WOOL SCARF  - MT500 - KHAKI "                ₹ 799 
2 "TREKKING MERINO WOOL KNIT CAP - MT500 - BLACK "              ₹ 699 
3 "Men's Mountain Trekking Long-sleeved T-Shirt - MT500 MERINO… ₹ 2,4…
4 "Men's Mountain Trekking Merino Wool Base Layer Tights / Leg… ₹ 2,1…
5 "Men's Long-sleeved T-shirt Mountain Trek Merino Wool with Z… ₹ 3,5…
6 "Men's Long-sleeve T-shirt Merino Wool  MT500"                ₹ 2,1…
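The prices column is still a character vector with a rupee symbol and, for larger amounts, a thousands separator. A minimal sketch of converting such strings to numbers with readr::parse_number(); the first two values appear in the output above, while the third is a made-up value added only to show how the comma is handled.

```r
library(readr)

# "₹ 799" and "₹ 699" come from the scraped output; "₹ 1,250" is hypothetical
prices <- c("₹ 799", "₹ 699", "₹ 1,250")

# parse_number() drops the currency symbol and, with the default locale,
# treats "," as a grouping mark
prices_num <- parse_number(prices)

prices_num
```

Applied to the scraped tibble, this would be something like mutate(prices = parse_number(prices)).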

    Citation

    For attribution, please cite this work as

    Goswami (2022, May 14). The Thought Factory: Web-scraping with rvest. Retrieved from https://sitendu.netlify.app/posts/2022-05-14-webscraping/

    BibTeX citation

    @misc{goswami2022web-scraping,
      author = {Goswami, Sitendu},
      title = {The Thought Factory: Web-scraping with rvest},
      url = {https://sitendu.netlify.app/posts/2022-05-14-webscraping/},
      year = {2022}
    }