Following up on my popular tutorial on how to create an easy web crawler in Node.js, I decided to extend the idea a bit further by scraping a few popular websites. For now, I'll just append the results of web scraping to a .txt file, but in a future post I'll show you how to insert them into a database.
Web scraping is a technique for retrieving data from websites using a script, and it's one of the more powerful tools for data collection. You may have heard of web scraping with Puppeteer; in this tutorial you'll learn how to retrieve data from web pages using Node.js and cheerio. If you're not familiar with Node, check out my 3 Best Node.JS Courses. We'll also be using two open-source npm modules to make today's task a little easier. One of them is request-promise, built on Request, a simple HTTP client that allows us to make quick and easy HTTP calls.
Each scraper takes about 20 lines of code and they're pretty easy to modify if you want to scrape other elements of the site or web page.
Web Scraping Reddit
First I'll show you what it does and then explain it.
It first visits reddit.com and then collects all the post titles, the score, and the username of the user that submitted each post. It writes all of this to a .txt file named reddit.txt, separating each entry on a new line. Alternatively, it's easy to separate each entry with a comma or some other delimiter if you want to open the results in Excel or another spreadsheet.
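Here's a minimal sketch of that file-writing step, using only Node's built-in fs module. It assumes each scraped post is an object with title, score, and user fields. Those field names, the helpers formatPost and appendPost, and the sample data are all hypothetical and not taken from the original script.

```javascript
const fs = require('fs');

// Hypothetical helper: turn one scraped post into the text written to the file.
// With the default separator each field lands on its own line; pass ',' (or
// another delimiter) to keep a whole post on a single line for a spreadsheet.
function formatPost(post, separator = '\n') {
  return [post.title, post.score, post.user].join(separator) + '\n';
}

// Append one post to the output file, creating the file if it does not exist.
function appendPost(filePath, post, separator = '\n') {
  fs.appendFileSync(filePath, formatPost(post, separator));
}

// Made-up sample data standing in for scraped results (not real reddit content).
const samplePosts = [
  { title: 'Example post one', score: 123, user: 'alice' },
  { title: 'Example post two', score: 45, user: 'bob' },
];
```

Calling samplePosts.forEach((post) => appendPost('reddit.txt', post)) would then write the title, score, and username of each post on consecutive lines. Passing ',' as the third argument keeps each post on one line instead.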
Okay, so how did I do it?
Make sure you have Node.js and npm installed. If you're not familiar with them take a look at the paragraph here.
Open up your command line. You'll need to install just two Node.js dependencies. You can do that by running npm install for each one, as shown below:
Alternate option to install dependencies
Another option is copying over the dependencies and adding them to a package.json file and then running npm install. My package.json includes these:
The actual code to scrape reddit
Now to take a look at how I scraped reddit in about 20 lines of code. Open up your favorite text editor (I use Atom) and copy the following:
This is surprisingly simple. Save the file as scrape-reddit.js and then run it by typing node scrape-reddit.js. You should end up with a text file called reddit.txt that looks something like:
which is the post title, then the score, and finally the username.
Web Scraping Hacker News
Let's take a look at how the posts are structured:
As you can see, there are a bunch of tr HTML elements with a class of athing. So the first step will be to gather up all of the tr.athing elements.
We'll then want to grab the post titles by selecting the td.title child element and then the a element (the anchor tag of the hyperlink).
Note that we skip over any hiring posts by making sure we only gather up the tr.athing elements that have a td.votelinks child, as demonstrated in the following picture:
Here's the code
Run that and you'll get a hackernews.txt file that looks something like:
First you have the title of the post on Hacker News and then the URL of that post on the next line. If you want both the title and URL on the same line, you can change the code:
to something like:
This allows you to use a comma as a delimiter so you can open up the file in a spreadsheet like Excel or a different program. You may want to use a different delimiter, such as a semicolon, which is an easy change above.
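As a rough sketch of the two output styles, assume each scraped result is an object with title and url fields. Those field names and the helper functions below are hypothetical. Quoting fields that contain the delimiter is my own addition here (the usual CSV convention), since post titles often contain commas.

```javascript
// Hypothetical helper: wrap a field in double quotes when it contains the
// delimiter, a double quote, or a line break, doubling any embedded quotes
// (the usual CSV convention) so a spreadsheet keeps the field in one cell.
function quoteField(value, delimiter) {
  const text = String(value);
  const needsQuotes =
    text.includes(delimiter) || text.includes('"') || /[\r\n]/.test(text);
  return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
}

// Title and URL on separate lines, matching the output described above.
function toSeparateLines(post) {
  return `${post.title}\n${post.url}\n`;
}

// Title and URL on the same line, joined by a delimiter (comma by default).
function toDelimitedLine(post, delimiter = ',') {
  return [post.title, post.url]
    .map((field) => quoteField(field, delimiter))
    .join(delimiter) + '\n';
}

// Made-up sample data (not a real Hacker News post).
const samplePost = { title: 'Show HN: Widgets, fast', url: 'https://example.com/widgets' };
console.log(toDelimitedLine(samplePost));      // title is quoted because it contains a comma
console.log(toDelimitedLine(samplePost, ';')); // semicolon delimiter, no quoting needed
```

Switching to a semicolon is then just a matter of passing ';' as the second argument.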
Web Scraping BuzzFeed
Run that and you'll get something like the following in a .txt file:
I'll eventually update this post to explain how the web scraper works. Specifically, I'll talk about how I chose the selectors to pull the correct content from the right HTML element. There are great tools that make this process very easy, such as Chrome DevTools, which I use when I'm writing a web scraper for the first time.
I'll also show you how to iterate through the pages on each website to scrape even more content.
Finally, in a future post I'll detail how to insert these records into a database instead of a .txt file. Be sure to check back!