Imagine you want to gather a large amount of data from several websites as quickly as possible. Would you do it manually, or would you look for a practical way to get it all at once? Now you may be asking yourself why you would want to do that. Okay, follow along as we go over some examples in the introduction below to understand the need for web scraping.
Introduction
- Wego is a website where you can book flights and hotels; it shows you the lowest price after comparing about 1,000 booking sites, and web scraping is what makes that comparison possible.
- Plagiarismdetector is a tool you can use to check an article for plagiarism; it also uses web scraping to compare your words against thousands of other websites.
- Many companies also use web scraping to drive strategic marketing decisions, for example by scraping social network profiles to determine which posts get the most interactions.
Prerequisites
Before we dive right in, you will need the following:
- A good understanding of Python programming language.
- A basic understanding of HTML.
Now that we have a brief idea of what web scraping is, let’s talk about the most important thing: the legal issues surrounding the topic.
How to know if the website allows web scraping?
- Add “/robots.txt” to the end of the URL, such as www.facebook.com/robots.txt, so that you can see the website’s scraping rules and find out what you are forbidden to scrape.
For example:
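A crawl-delay rule of this form is what the explanation below refers to (robots.txt contents vary from site to site, so this is just an illustration):

```
User-agent: *
Crawl-delay: 5
```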
The rule above tells crawlers to wait 5 seconds between requests.
Another example:
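The rule in question looks roughly like this (Facebook’s robots.txt changes over time, so treat the exact path as an illustration rather than a verbatim copy):

```
User-agent: Discordbot
Allow: /*/videos/
```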
On www.facebook.com/robots.txt you can find a rule like the one listed above; it means that the Discord bot has permission to scrape Facebook video pages.
- You can run the following Python code that makes a GET request to the website server:
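A minimal sketch of such a check with the requests library (the URL here is just a placeholder; swap in the site you want to test):

```python
import requests

# Send a GET request and inspect the HTTP status code
response = requests.get("https://www.example.com/")
print(response.status_code)
```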
If the result is a 200 then you can generally go ahead and scrape the website, but you still have to take a look at its scraping rules first.
As an example, if you run the following code:
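For instance, here is the same check against the Reddit page we will scrape later in this tutorial (the URL and the User-Agent header are my own choices for illustration):

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # identify the client; see the User-Agent note later
response = requests.get("https://old.reddit.com/top/", headers=headers)
print(response.status_code)  # 200 means the request succeeded
```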
If the result is a 200 then you have permission to start crawling, but you must also be aware of the following points:
- You can only scrape data that is available to the public, like the prices of a product; you cannot scrape anything private, such as a Sign In page.
- You can’t use the scraped data for any commercial purposes.
- Some websites provide an API to use for web scraping, like Amazon; you can find their API here.
As we know, Python has different libraries for different purposes.
In this tutorial, we are going to use the Beautiful Soup 4, urllib, requests, and plyer libraries.
For Windows users you can install it using the following command in your terminal:
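A typical install command looks like this (urllib ships with Python, so only the third-party packages need installing; the package names below follow the libraries listed above):

```bash
pip install beautifulsoup4 requests plyer
```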
For Linux users you can use:
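On most Linux distributions the same install goes through pip3:

```bash
pip3 install beautifulsoup4 requests plyer
```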
You’re ready to go. Let’s get started and learn a bit more about web scraping through two real-life projects.
Reddit Web Scraper
One year ago, I wanted to build a smart AI bot. I aimed to make it talk like a human, but I had a problem: I didn’t have a good dataset to train my bot on, so I decided to use posts and comments from Reddit.
Here we will go through how to build the basics of the aforementioned app step by step, and we will use https://old.reddit.com/.
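A sketch of that starting code, which the next few paragraphs walk through (the User-Agent string is a placeholder you should replace with your own):

```python
import requests
from bs4 import BeautifulSoup

# URL of the TOP posts page on old Reddit
url = "https://old.reddit.com/top/"

# Pretend to be a regular browser; replace with your own User-Agent string
headers = {"User-Agent": "Mozilla/5.0"}

# Request the page and parse its HTML
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, "html.parser")
```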
First of all, we imported the libraries we want to use in our code.
The requests library allows us to send GET, PUT, and other request types to the website’s server, and the Beautiful Soup library is used for parsing a page and pulling specific items out of it. We’ll see it in a practical example soon.
Second, the URL we are going to use is for the TOP posts on Reddit.
Third, the headers dictionary with a “User-Agent” is a browser-related trick to keep the server from flagging you as a bot and restricting your number of requests; to find your own “User-Agent” you can do a web search for “what is my User-Agent?” in your browser.
Finally, we made a GET request to connect to that URL and then pulled out the HTML code for that page using the Beautiful Soup library.
Now let’s move on to the next step of building our app:
Open this URL, then press F12 to inspect the page and see its HTML code. To find the line that holds the HTML for the element you want to locate, right-click on that element and then click Inspect.
After doing the process above on the first title on the page, you can see the following code with a highlight for the tag that holds the data you right-clicked on:
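The highlighted tag looks roughly like this (the attributes vary from post to post, so this is an illustrative shape rather than a verbatim copy):

```html
<div id="siteTable" class="sitetable linklisting">
  ...
  <a class="title" href="https://example.com/some-link">The post title you right-clicked on</a>
  ...
</div>
```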
Now let’s pull out every title on that page. You can see that there is a “div” with the id siteTable that contains the listing, and each title sits within it.
First, we have to search for that div, then get every “a” element inside it that has the class “title”.
Now from each element we will extract the text, which is the title, and put every title in a dictionary before printing it.
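A sketch of that extraction, assuming the soup object from the snippet above (the numeric dictionary keys are something I added for illustration):

```python
# Locate the container div, then every link inside it with the class "title"
table = soup.find("div", id="siteTable")
titles = table.find_all("a", class_="title")

# Store each title's text in a dictionary, then print it
titles_dict = {}
for index, title in enumerate(titles):
    titles_dict[index] = title.text

print(titles_dict)
```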
After running our code you will see the following result, which is every title on that page:
Finally, you can do the same process for the comments and replies to build up a good dataset as mentioned before.
When it comes to web scraping, an API is often the first solution that comes to the mind of most data scientists. An API (Application Programming Interface) is an intermediary that allows one piece of software to talk to another. In simple terms, you can ask an API for specific data by sending it a request, and it will return the data in JSON format.
For example, Reddit has a publicly documented API that you can find here.
Also, it is worth mentioning that certain websites contain XHTML or RSS feeds that can be parsed as XML (Extensible Markup Language). XML does not define the form of the page, it defines the content, and it’s free of any formatting constraints, so it will be much easier to scrape a website that is using XML.
For example, Reddit provides RSS feeds that can be parsed as XML, which you can find here.
Let’s build another app to better understand how web scraping works.
COVID-19 Desktop Notifier
Now we are going to learn how to build a COVID-19 notification system, so we will be able to see the number of new cases and deaths in our country.
The data is taken from the Worldometer website, where you can find real-time COVID-19 updates for any country in the world.
Let’s get started by importing the libraries we are going to use:
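The imports would look something like this (a sketch based on the packages described just below):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
from plyer import notification
import time
```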
Here we are using urllib to make the request, but feel free to use the requests library that we used in the Reddit web scraper example above.
We are using the plyer package to show the notifications, and the time module to make the next notification pop up after a delay we set.
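Fetching the page could look like this (the exact URL format is my assumption; the country slug “us” is the part you change):

```python
# Open the Worldometer page for the country of interest ("us" here)
url = "https://www.worldometers.info/coronavirus/country/us/"
page = urlopen(url)
```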
In the code above you can change “us” in the URL to your own country, and urlopen does the same thing as opening the URL in your browser.
Now if we open this URL and scroll down to the UPDATES section, then right-click on the “new cases” and click on inspect, we will see the following HTML code for it:
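The relevant markup looks roughly like this (simplified; the real figures and surrounding tags will differ):

```html
<li class="news_li">
  <strong>30,000 new cases</strong> and
  <strong>500 new deaths</strong> in the United States
</li>
```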
We can see that the new cases and deaths part is within the “li” tag and the “news_li” class. Let’s write a code snippet to extract that data from it.
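A sketch of that snippet, assuming the page object from earlier and the markup shape shown above:

```python
soup = BeautifulSoup(page, "html.parser")

# The update line lives in an <li> with the class "news_li"
news = soup.find("li", class_="news_li")

# The first <strong> holds the new-cases figure; stepping over the text node
# between the two <strong> tags (the "next siblings") reaches the deaths figure
new_cases = news.find("strong").text.split()[0]
new_deaths = news.find("strong").next_sibling.next_sibling.text.split()[0]
```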
After pulling out the HTML code from the page and searching for that tag and class, we take the first strong element to get the new cases number, then move to its next siblings to reach the second strong element, which holds the new deaths number.
In the last part of our code, we make an infinite while loop that uses the data we pulled out before and shows it in a notification pop-up. The delay before the next notification pops up is set to 20 seconds, which you can change to whatever you want.
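A sketch of that loop, using plyer’s notification API and the values extracted above:

```python
while True:
    # Show a desktop notification with the scraped figures
    notification.notify(
        title="COVID-19 Update",
        message=f"New cases: {new_cases}\nNew deaths: {new_deaths}",
        timeout=10,
    )
    # Wait 20 seconds before the next pop-up
    time.sleep(20)
```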
After running our code you will see the following notification in the right-hand corner of your desktop.
Conclusion
We’ve just proven that anything on the web can be scraped and stored, and there are a lot of reasons why we would want to use that information. As an example:
Imagine you are working for a social media platform and your task is to delete any posts that may go against the community. The best way to do that is to develop a web scraper application that scrapes and stores the number of likes and comments for every post; if a post receives a lot of comments but no likes at all, we can deduce that it may be striking a chord in people and we should take a look at it.
There are a lot of possibilities, and it’s up to you (as a developer) to choose how you will use that information.
About the author
Ahmad Mardeni
Ahmad is a passionate software developer, an avid researcher, and a businessman. He began his journey to become a cybersecurity expert two years ago. He has also participated in a lot of hackathons and programming competitions. As he says, “Knowledge is power”, so he wants to deliver good content by being a technical writer.
Want to scrape the web with R? You’re at the right place!
We will teach you from the ground up how to scrape the web with R, and will take you through the fundamentals of web scraping (with examples in R).
Throughout this article, we won’t just take you through prominent R libraries like rvest and Rcrawler, but will also walk you through how to scrape information with barebones code.
Overall, here’s what you are going to learn:
- R web scraping fundamentals
- Handling different web scraping scenarios with R
- Leveraging rvest and Rcrawler to carry out web scraping
Let’s start the journey!
Introduction
The first step towards scraping the web with R requires you to understand HTML and web scraping fundamentals. You’ll learn how to get browsers to display the source code, and then you will develop the logic of markup languages, which sets you on the path to scraping that information. And, above all, you’ll master the vocabulary you need to scrape data with R.
We will be looking at the following basics that’ll help you scrape the web with R:
- HTML Basics
- Browser presentation
- Parsing HTML data in R
So, let’s get into it.
HTML Basics
HTML is behind everything on the web. Our goal here is to briefly understand how syntax rules, browser presentation, tags, and attributes help us learn how to parse HTML and scrape the web for the information we need.
Browser Presentation
Before we scrape anything using R, we need to know the underlying structure of a webpage. And the first thing you notice is that what you see when you open a webpage isn’t the HTML document itself; it’s rather how the underlying HTML code is rendered. You can open any HTML document using a text editor like Notepad.
HTML tells a browser how to show a webpage, what goes into a headline, what goes into a text, etc. The underlying marked up structure is what we need to understand to actually scrape it.
For example, here’s what ScrapingBee.com looks like when you see it in a browser.
And here’s what the underlying HTML looks like for it:
Looking at this source code might seem like a lot of information to digest at once, let alone scrape it! But don’t worry. The next section exactly shows how to see this information better.
HTML elements and tags
If you carefully checked the raw HTML of ScrapingBee.com earlier, you would have noticed something like <title>..</title>, <body>..</body>, etc. Those are tags that HTML uses, and each of those tags has its own unique property. For example, the <title> tag helps a browser render the title of a web page; similarly, the <body> tag defines the body of an HTML document.
Once you understand those tags, that raw HTML will start talking to you, and you’ll already start to get a feeling for how you would scrape the web using R. All you need to take away from this section is that a page is structured with the help of HTML tags, and knowing those tags while scraping helps you locate and extract the information easily.
Parsing a webpage using R
With what we know, let’s use R to scrape an HTML webpage and see what we get. Keep in mind, we only know about HTML page structures so far; we know what raw HTML looks like. That’s why, with this code, we will simply scrape a webpage and get the raw HTML. It is the first step towards scraping the web as well.
Earlier in this post, I mentioned that we can even use a text editor to open an HTML document. And in the code below, we will parse HTML in the same way we would parse a text document and read it with R.
I want to scrape the HTML code of ScrapingBee.com and see how it looks. We will use readLines() to map every line of the HTML document and create a flat representation of it.
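A minimal sketch of that step (flat_html is the object name referred to below):

```r
# Read the raw HTML of the page line by line into a character vector
flat_html <- readLines(con = "https://www.scrapingbee.com/")
```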
Now, when you see what flat_html looks like, you should see something like this in your R Console:
The whole output would be a hundred pages so I’ve trimmed it for you. But, here’s something you can do to have some fun before I take you further towards scraping web with R:
- Scrape www.google.com and try to make sense of the information you received
- Scrape a very simple web page like https://www.york.ac.uk/teaching/cws/wws/webpage1.html and see what you get
Remember, scraping is only fun if you experiment with it. So, as we move forward with the blog post, I’d love it if you try out each and every example as you go through them and bring your own twist. Share in comments if you found something interesting or feel stuck somewhere.
While our output above looks great, it still is something that doesn’t closely reflect an HTML document. In HTML we have a document hierarchy of tags which looks something like
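A bare-bones illustration of that hierarchy:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph of text.</p>
  </body>
</html>
```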
But clearly, our output from readLines() discarded the markup structure/hierarchy of the HTML. Given that I just wanted to give you a barebones look at scraping, this code works as a good illustration.
However, in reality, scraping code is a lot more complicated. But fortunately, we have a lot of libraries that simplify web scraping in R for us. We will go through four of these libraries in later sections.
First, we need to go through different scraping situations that you’ll frequently encounter when you scrape data through R.
Common web scraping scenarios with R
Access web data using R over FTP
FTP is one of the ways to access data over the web. And with the help of CRAN FTP servers, I’ll show you how you can request data over FTP with just a few lines of code. Overall, the whole process is:
- Save ftp URL
- Save names of files from the URL into an R object
- Save files onto your local directory
Let’s get started now. The URL that we are trying to get data from is ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/.
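A sketch of that request using the RCurl package (get_files is the object name referred to below):

```r
library(RCurl)

# Save the FTP URL and ask the server for a directory listing
ftp_url <- "ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/"
get_files <- getURL(ftp_url, dirlistonly = TRUE)
```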
Let’s check the name of the files we received with get_files
Looking at the string above can you see what the file names are?
The screenshot from the URL shows real file names
It turns out that when you download those file names you get carriage return representations too. And it is pretty easy to solve this issue. In the code below, I used str_split() and str_extract_all() to get the HTML file names of interest.
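A sketch of that clean-up, assuming get_files from the previous step (extracted_html_filenames is the object printed below):

```r
library(stringr)

# Split the listing on the carriage-return/newline pairs,
# then keep only the names that end in .html
extracted_filenames <- str_split(get_files, "\r\n")[[1]]
extracted_html_filenames <- unlist(str_extract_all(extracted_filenames, ".+(\\.html)"))
```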
Let’s print the file names to see what we have now:
extracted_html_filenames
Great! So, we now have a list of HTML files that we want to access. In our case, it was only one HTML file.
Now, all we have to do is to write a function that stores them in a folder and a function that downloads HTML docs in that folder from the web.
We are almost there now! All we have to do is download these files to a specified folder on your local drive. We will save those files in a folder called scrapingbee_html. To do so, we’ll use getCurlHandle().
After that, we’ll use plyr package’s l_ply() function.
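Here is a sketch of both pieces, building on ftp_url and extracted_html_filenames from the steps above (the downloader function itself is my own illustration):

```r
library(RCurl)
library(plyr)

# Download one file from the FTP server into the given folder
FTPDownloader <- function(filename, folder, handle) {
  dir.create(folder, showWarnings = FALSE)
  fileurl <- paste0(ftp_url, filename)
  destination <- file.path(folder, filename)
  if (!file.exists(destination)) {
    content <- getURL(fileurl, curl = handle)
    write(content, destination)
    Sys.sleep(1)  # be polite to the server
  }
}

# One curl handle reused across all downloads
curl_handle <- getCurlHandle(ftp.use.epsv = FALSE)

# Apply the downloader to every file name, storing results in scrapingbee_html/
l_ply(extracted_html_filenames, FTPDownloader,
      folder = "scrapingbee_html", handle = curl_handle)
```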
And, we are done!
I can see that on my local drive I have a folder named scrapingbee_html, where the index.html file is stored. But if you don’t want to manually go and check the scraped content, use this command to retrieve a list of the HTML files downloaded:
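For example (assuming the folder name used above):

```r
list.files("./scrapingbee_html")
```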
That was via FTP, but what about retrieving specific data from an HTML webpage? That’s what our next section covers.
Scraping information from Wikipedia using R
In this section, I’ll show you how to retrieve information from Leonardo Da Vinci’s Wikipedia page https://en.wikipedia.org/wiki/Leonardo_da_Vinci.
Let’s take the basic steps to parse information:
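A sketch of those steps using the RCurl and XML packages (parsed_wiki is the object name referred to below):

```r
library(RCurl)
library(XML)

wiki_url <- "https://en.wikipedia.org/wiki/Leonardo_da_Vinci"

# Fetch the raw HTML, then parse it into an HTML document tree
wiki_html <- getURL(wiki_url)
parsed_wiki <- htmlParse(wiki_html, asText = TRUE, encoding = "UTF-8")
```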
Leonardo Da Vinci’s Wikipedia HTML has now been parsed and stored in parsed_wiki.
But, let’s say you wanted to see what text we were able to parse. A very simple way to do that would be:
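A minimal way to do that, reusing parsed_wiki from the previous step (wiki_para is a name I’m introducing for illustration):

```r
# Collect every <p> node in the parsed document
wiki_para <- getNodeSet(parsed_wiki, "//p")

# Subset it like any list, e.g. look at the 4th paragraph
wiki_para[[4]]
```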
By doing that, we have essentially parsed everything that exists within the <p> nodes. And since it is an XML node set, we can easily use subsetting rules to access different paragraphs. For example, let’s say we pick the 4th paragraph at random. Here’s what you’ll see:
Reading text is fun, but let’s do something else - let’s get all links that exist on this page. We can easily do that by using getHTMLLinks() function:
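A sketch, again reusing the parsed document (wiki_links is a name I’m adding):

```r
# Extract every hyperlink found in the document
wiki_links <- getHTMLLinks(parsed_wiki)
head(wiki_links)
```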
Notice what you see above is a mix of actual links and links to files.
You can also see the total number of links on this page by using length() function:
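Continuing with the wiki_links object from above:

```r
length(wiki_links)
```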
I’ll throw in one more use case here which is to scrape tables off such HTML pages. And it is something that you’ll encounter quite frequently too for web scraping purposes. XML package in R offers a function named readHTMLTable() which makes our life so easy when it comes to scraping tables from HTML pages.
Leonardo’s Wikipedia page doesn’t have the kind of HTML tables we want though, so I will use a different page to show how we can scrape tables from a webpage using R. Here’s the new URL:
As usual, we will read this URL:
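A sketch of that step. Since the exact URL isn’t reproduced here, the table-heavy Wikipedia page below is a stand-in of my own choosing, so the table count you get will differ from the 108 mentioned next:

```r
library(RCurl)
library(XML)

# Stand-in URL: any Wikipedia page with several tables works the same way
table_url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

table_html <- getURL(table_url)
parsed_table_page <- htmlParse(table_html, asText = TRUE, encoding = "UTF-8")

wiki_tables <- readHTMLTable(parsed_table_page)
length(wiki_tables)
```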

If you look at the page you’ll disagree with the number “108”. For a closer inspection I’ll use the names() function to get the names of all 108 tables:
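Continuing with wiki_tables from above:

```r
names(wiki_tables)
```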
Our suspicion was right: there are a lot of “NULL” entries and only a few real tables. I’ll now read data from one of those tables in R:
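For instance, picking one of the non-NULL entries by position (the index is arbitrary and depends on the page you used):

```r
wiki_tables[[2]]
```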
Here’s how this table looks in HTML
Awesome, isn’t it? Imagine being able to access census, pricing, and other data with R and scraping it. Wouldn’t it be fun? That’s why I took a boring one and kept the fun part for you. Try something much cooler than what I did. Here’s an example of table data that you can scrape: https://en.wikipedia.org/wiki/United_States_Census
Let me know how it goes for you. But it usually isn’t that straightforward. We have forms and authentication that can block your R code from scraping. And that’s exactly what we are going to learn to get through here.
Handling HTML forms while scraping with R
Often we come across pages that aren’t that easy to scrape. Take a look at the Meteorological Service Singapore’s page (that lack of SSL though :O). Notice the dropdowns here
Imagine if you want to scrape information that you can only get upon clicking on the dropdowns. What would you do in that case?
Well, I’ll be jumping a few steps forward and will show you a preview of rvest package while scraping this page. Our goal here is to scrape data from 2016 to 2020.
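A sketch of that preview. The CSS selector and the station codes below are assumptions for illustration and need to be checked against the live page:

```r
library(rvest)

mss_url  <- "http://www.weather.gov.sg/climate-historical-daily"
mss_page <- read_html(mss_url)

# Peek at the dropdown entries (assumed selector)
dropdown_entries <- mss_page %>%
  html_nodes("ul.dropdown-menu li a") %>%
  html_text()
head(dropdown_entries)

# Build every combination of station, year (2016-2020) and month we want
scrape_frame <- expand.grid(
  station = c("S24", "S104"),          # hypothetical station codes
  year    = 2016:2020,
  month   = sprintf("%02d", 1:12),
  stringsAsFactors = FALSE
)
head(scrape_frame)
```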
Let’s check what kind of data we have been able to scrape. Here’s what our data frame looks like:
From the dataframe above, we can now easily generate URLs that provide direct access to data of our interest.
Now, we can download those files at scale using lapply().
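A sketch of those two steps, building on scrape_frame from above. The CSV URL pattern is my own assumption about how the site names its files; verify it against the links the page actually generates before running this:

```r
# Hypothetical URL pattern for the monthly CSV exports
scrape_frame$url <- sprintf(
  "http://www.weather.gov.sg/files/dailydata/DAILYDATA_%s_%d%s.csv",
  scrape_frame$station, scrape_frame$year, scrape_frame$month
)

dir.create("mss_daily", showWarnings = FALSE)

# Download every file into the mss_daily folder
lapply(scrape_frame$url, function(u) {
  download.file(u, destfile = file.path("mss_daily", basename(u)), quiet = TRUE)
})
```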
Note: This is going to download a ton of data once you execute it.
Web scraping using Rvest
Inspired by libraries like Beautiful Soup, rvest is probably one of the most popular packages in R for scraping the web. While it is simple enough to make scraping with R look effortless, it is flexible enough to support almost any scraping operation.
Let’s see rvest in action now. I will scrape information from IMDB and we will scrape Sharknado (because it is the best movie in the world!) https://www.imdb.com/title/tt8031422/
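Loading the page is a one-liner with rvest:

```r
library(rvest)

# Read the movie page referenced above
sharknado_url <- "https://www.imdb.com/title/tt8031422/"
sharknado <- read_html(sharknado_url)
```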
Awesome movie, awesome cast! Let’s find out who was in the cast of this movie.
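A sketch of the cast extraction; the CSS selector is an assumption, since IMDb’s markup changes often, so inspect the page and adjust it if this returns nothing:

```r
cast <- sharknado %>%
  html_nodes("a[data-testid='title-cast-item__actor']") %>%
  html_text()

cast
```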
Awesome cast! Probably that’s why it was such a huge hit. Who knows.
Still, there are skeptics of Sharknado. I guess the rating would prove them wrong? Here’s how you extract ratings of Sharknado from IMDB
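Again a sketch, with the selector being my assumption about IMDb’s current rating markup:

```r
rating <- sharknado %>%
  html_node("[data-testid='hero-rating-bar__aggregate-rating__score'] span") %>%
  html_text() %>%
  as.numeric()

rating
```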
I still stand by my words. But I hope you get the point, right? See how easy it is for us to scrape information using rvest, while we were writing 10+ lines of code in much simpler scraping scenarios.
Next on our list is Rcrawler.
Web Scraping using Rcrawler
Rcrawler is another R package that helps us harvest information from the web. But unlike rvest, we use Rcrawler a lot more for network-graph-related crawling tasks. For example, if you wish to scrape a very large website, you might want to try Rcrawler in a bit more depth.
Note: Rcrawler is more about crawling than scraping.
We will go back to Wikipedia and we will try to find the date of birth, date of death and other details of scientists.
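A sketch of that idea using Rcrawler’s ContentScraper() function. The pages and XPath expressions below are my own choices for illustration, and Wikipedia’s infobox markup can differ from page to page:

```r
library(Rcrawler)

scientist_pages <- c("https://en.wikipedia.org/wiki/Marie_Curie",
                     "https://en.wikipedia.org/wiki/Albert_Einstein")

# Pull the 'Born' and 'Died' infobox rows out of each page via XPath
details <- ContentScraper(
  Url = scientist_pages,
  XpathPatterns = c("//th[contains(., 'Born')]/following-sibling::td",
                    "//th[contains(., 'Died')]/following-sibling::td")
)

details
```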
Output looks like this:
And that’s it!
You pretty much know everything you need to get started with Web Scraping in R.
Try challenging yourself with interesting use cases and uncover challenges. Scraping the web with R can be really fun!
While this whole article tackles the main aspect of web scraping with R, it does not talk about web scraping without getting blocked.
If you want to learn how to do that, we have written a complete guide, and if you don’t want to deal with it yourself, you can always use our web scraping API.
Happy scraping.
