Web Scraping with PHP

Web scraping, or web data extraction, is a way to automatically extract meaningful data from websites. When the data is not directly available through an API or feeds, web scraping can be an effective way to automate its extraction.

In this article, we are going to use PHP to create a simple yet effective Web scraping script. This will be our starting point to dive deep into the field of web scraping using PHP. So, let’s begin!

In this example, we will write a simple script that scrapes book data from books.toscrape.com. We will scrape the following data from the page:

  • Book cover image URL (image_url)
  • Book title (title)
  • Price (price)
  • Availability (availability)
  • Book details page link (details_link)

The script will only scrape data from the home page (i.e., a single URL listing multiple books). We will try to grab all the books listed for sale at that URL.

To keep it simple, pagination and crawling more pages are left as an exercise. In this article, we will build the example scraper step by step. However, if you want to see the final code, scroll all the way to the bottom and check the GitHub Gist link.

Prerequisites

Before moving forward, we will make sure we have the PHP environment ready for the scraper:

  1. A working web server with PHP 7.2+ set up
  2. Composer configured *
  3. Knowledge of XPath Selectors **

* If you don’t have Composer set up, please follow the Composer installation documentation to make sure Composer works for you and that you can install any package available through it. You can install Composer locally (on a per-project basis) or globally (available to all projects on your system).

** In this article, we will use XPath selectors to locate DOM elements and read the data we need. If you don’t know how XPath selectors work, I suggest reading an introduction to XPath first and then continuing with this article.

We will set up our project by creating a folder in the web server path, say scraper, and making Composer available inside the project folder.

Step 1: Set Up the Guzzle HTTP Library

To scrape data from a web page, the first thing we need to do is read the HTML content of that page. For that, we simply need an HTTP client that can send a GET or POST request to the webpage URL and receive the HTML response.

There are different HTTP clients available with PHP. For our requirements, we will use the Guzzle HTTP client. Guzzle HTTP is one of the most widely used and feature-rich HTTP clients for PHP.

To set up the Guzzle HTTP client, simply add it as a dependency using Composer:

$ composer require guzzlehttp/guzzle

Once Guzzle is set up, we are ready to roll.

Step 2: Create the Scraper Structure

Let’s add a new file, index.php, to the project folder with the following minimal structure:
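The original listing is not reproduced here, so what follows is a minimal sketch reconstructed from the description below it, not the author's exact code. The property names, the https base URL, the 300-second timeout (Guzzle takes timeouts in seconds, so this is one reading of "5 minutes"), and the empty method bodies are all assumptions.

```php
<?php
// Load Composer's autoloader and import the Guzzle client.
require __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;

class BooksScraper {
    private $base_uri;
    private $client;
    private $html;
    private $doc;
    private $xpath;

    public function __construct() {
        // Assumed URL of the page to scrape.
        $this->base_uri = 'https://books.toscrape.com/';
        $this->client = new Client([
            'base_uri' => $this->base_uri,
            'timeout'  => 300, // seconds (assumed to mean "5 minutes max")
        ]);
    }

    public function run() {
        $this->load_html();
        $this->load_dom();
        $this->scrape();
    }

    // Placeholders; the bodies are written in the following steps.
    private function load_html() {}
    private function load_dom() {}
    private function scrape() {}
}

$scraper = new BooksScraper();
$scraper->run();
echo 'Success!';
```

With the method bodies still empty, run() sends no request yet; the next steps fill them in one by one.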

This is not the final code, but it is the bare minimum structure we will need for our scraper. If you read through the code carefully, it is largely self-explanatory. Still, for the sake of clarity, let’s walk through what’s going on.

  1. In the first two lines, we load Composer’s autoload file and import the Guzzle HTTP client.
  2. Then, we create a class, BooksScraper, with a constructor and a run() function.
    * The constructor sets a base_uri variable to the webpage URL and creates a Guzzle client configured with that base_uri and a maximum timeout of 5 minutes.
    * The run() function performs three tasks: it loads the HTML from the webpage, converts the HTML to parsable DOM components, and scrapes content from the DOM nodes.
  3. Finally, in the last three lines, we execute the scraper and display the “Success!” message in the output if everything goes well.

That is all the structure we need for our scraper to function. In the coming steps, we will add code for each private function called inside run().

Step 3: Write the Function Bodies

Let’s add a body for each private function called inside the run() function, starting with load_html() as shown below:

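private function load_html() {
    // Request the home page; the path is resolved against the client's base_uri.
    $response = $this->client->get('/');
    // Store the response body as an HTML string.
    $this->html = $response->getBody()->getContents();
}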

Easy enough. These two lines in the load_html() body are enough to make a call to the URL and store the HTML code of that page.

Now let’s load the HTML into a DOMDocument and then wrap it in a DOMXPath object, so that we can use XPath expressions to read the data we want from the page. Below is the load_dom() body:

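private function load_dom() {
    // Throw an exception if there is no HTML content.
    if ( !$this->html ) { throw new Exception('No HTML content.'); }

    // Parse the HTML; @ suppresses warnings about imperfect markup.
    $this->doc = new DOMDocument();
    @$this->doc->loadHTML($this->html);

    // Wrap the document so it can be queried with XPath expressions.
    $this->xpath = new DOMXPath($this->doc);
}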

Apart from the main code that loads the DOM, the first line adds a check so that the scraper throws an exception and stops if the HTTP client returned no HTML content. This guards against silently continuing with an empty response, for example after a failed request.
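The @ operator hides every warning from loadHTML(), which can make debugging harder. As a hedged alternative, the sketch below collects libxml's parse warnings internally instead. The stand-alone function name html_to_xpath() and the sample markup are hypothetical and used only for illustration; the article's own code keeps the @ operator.

```php
<?php
// Hypothetical stand-alone variant of load_dom() that avoids the @ operator.
function html_to_xpath(string $html): DOMXPath {
    if (trim($html) === '') {
        throw new InvalidArgumentException('No HTML content.');
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($doc);
}

// Sample markup (an assumption); HTML5 tags such as <article> may trigger libxml warnings.
$xpath = html_to_xpath('<html><body><article><h3>Sample</h3></article></body></html>');
$heading_count = $xpath->query('//article/h3')->length;
```

If the warnings are of interest, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.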

Let’s add the scrape() function body:

This is a more substantial function, with a lot more going on within the body. The scrape() function performs three tasks:

  1. Using XPath query expression, find all elements holding books data on the page
  2. Loop through each element and find book data
  3. Then write all the book data into a CSV file.
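The three tasks above can be sketched as follows. This is an illustration rather than the article's scrape() body: it runs as a stand-alone script against a simplified stand-in for the page markup (an assumption), reads only the title, and leaves the CSV step as a comment.

```php
<?php
// Simplified stand-in for the books.toscrape.com listing markup (an assumption).
$html = <<<HTML
<html><body>
<ol class="row">
  <li><article class="product_pod"><h3><a href="a.html">Book A</a></h3></article></li>
  <li><article class="product_pod"><h3><a href="b.html">Book B</a></h3></article></li>
</ol>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Task 1: find every element that holds one book.
$elements = $xpath->query("//ol[@class='row']//li//article");

// Task 2: loop through the elements and read the data for each book.
$data = [];
foreach ($elements as $element) {
    // The second argument makes the expression relative to this element.
    $link = $xpath->query('.//h3/a', $element)->item(0);
    $data[] = ['title' => trim($link->nodeValue)];
}

// Task 3: the article's scrape() would now pass $data to to_csv().
```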

Question — how did we get the XPath query expression?

If you look at line 3 of the scrape() function, we have the following code:

$elements = $this->xpath->query("//ol[@class='row']//li//article");

The value inside the query function, i.e. //ol[@class='row']//li//article, is an XPath expression that selects the elements containing each book. To find it, I opened the page in Google Chrome and inspected the HTML with Developer Tools. The books are organized inside an ordered list, an <ol> tag with class “row”, and within each list item the book is wrapped in an <article> tag.

To validate the XPath expression, I used a Chrome extension called ChroPath, which helps generate and validate XPath expressions.

Then within the scrape() function, we used two more private functions — parse_node() and to_csv().

  • parse_node() function takes each element node and finds data for each book
  • to_csv() function takes all the collected data items and writes to a .csv file

So, let’s finalize our scraper by writing those two functions in the next step.

Step 4: Add Supporting Functions

  1. The parse_node() function body looks like this:

In the parse_node() function, we created an $item array variable. Then we used more XPath expressions within the $element to extract the book data.

For image_url and details_link, we look at the src and href attributes of the corresponding elements, while for title, price, and availability, we look at the text value within the element. In each case, we use the extract() helper function to get the value based on an XPath expression.
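As a rough illustration of the attribute-versus-text distinction, here is a stand-alone sketch of what such a function might look like. The function name parse_book(), the relative XPath expressions, and the sample markup are assumptions based on a simplified version of the page, not the article's actual parse_node() code.

```php
<?php
// Hypothetical stand-alone version of parse_node(); expressions are assumptions.
function parse_book(DOMXPath $xpath, DOMNode $element): array {
    // First match of an expression relative to $element, as trimmed text.
    $extract = function (string $expression) use ($xpath, $element): string {
        $match = $xpath->query($expression, $element)->item(0);
        return $match ? trim($match->nodeValue) : '';
    };

    return [
        'image_url'    => $extract('.//img/@src'),   // attribute value
        'title'        => $extract('.//h3/a'),       // text value
        'price'        => $extract(".//p[@class='price_color']"),
        'availability' => $extract(".//p[contains(@class, 'availability')]"),
        'details_link' => $extract('.//h3/a/@href'), // attribute value
    ];
}

// Simplified sample of one book entry (currency symbol omitted to stay ASCII-only).
$html = <<<HTML
<html><body>
<article class="product_pod">
  <a href="catalogue/book-a/index.html"><img src="media/book-a.jpg" alt="Book A"></a>
  <h3><a href="catalogue/book-a/index.html">Book A</a></h3>
  <p class="price_color">51.77</p>
  <p class="instock availability"> In stock </p>
</article>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$item = parse_book($xpath, $xpath->query('//article')->item(0));
```

The availability paragraph carries two classes in this sample, which is why the sketch uses contains() there instead of an exact match on @class.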

  2. The extract() function body is pretty simple, as shown below:

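private function extract($node, $element) {
    // Run the XPath expression ($node) relative to $element and take the first match.
    $match = $this->xpath->query($node, $element)->item(0);

    // Return an empty string when nothing matches, instead of reading a property of null.
    return $match ? trim($match->nodeValue) : '';
}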
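One thing to keep in mind is that item(0) is null when an expression matches nothing. As a hedged alternative to checking for null, XPath's own normalize-space() function can be evaluated with DOMXPath::evaluate(), which returns a trimmed string and an empty string for a missing node. The markup and variable names below are made up for the example.

```php
<?php
// Sample markup (an assumption) with a title but no price element.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="book"><h3><a href="a.html">  Book A </a></h3></div></body></html>');
$xpath = new DOMXPath($doc);
$element = $xpath->query("//div[@id='book']")->item(0);

// normalize-space() takes the first matching node's text and trims it.
$title = $xpath->evaluate('normalize-space(.//h3/a)', $element);

// No match: the result is an empty string rather than an error.
$missing = $xpath->evaluate("normalize-space(.//p[@class='price_color'])", $element);
```

Note that normalize-space() also collapses runs of whitespace inside the text, which trim() does not do.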

Step 5: Save Data to CSV

Finally, the to_csv() function takes the $data array and saves it to a result.csv file within the project folder. Here is what the to_csv() function body looks like:
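The original listing is not reproduced here, so the following is only a sketch of how such a function could be written with PHP's built-in fputcsv(). The stand-alone function name write_csv(), the header row built from the array keys, and the temporary file path are assumptions.

```php
<?php
// Hypothetical stand-alone version of to_csv().
function write_csv(array $data, string $path): void {
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path} for writing.");
    }

    // Write a header row from the keys of the first item (an assumption).
    if ($data !== []) {
        fputcsv($handle, array_keys($data[0]), ',', '"', '\\');
    }

    // Write one CSV line per book.
    foreach ($data as $row) {
        fputcsv($handle, $row, ',', '"', '\\');
    }

    fclose($handle);
}

// Usage with made-up data, written to the system temp folder.
$path = sys_get_temp_dir() . '/result_sketch.csv';
write_csv([
    ['title' => 'Book A', 'price' => '51.77', 'availability' => 'In stock'],
], $path);
```

The delimiter, enclosure, and escape arguments are passed explicitly so the call behaves the same across PHP versions.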

That’s it. Congratulations, if you made it this far, you have a working PHP scraper script. If you missed something or just want to verify the final code, here is the final script: https://gist.github.com/scrapingace/72d35d3f813c23482bd361cacd61be9c

Let’s verify the script by opening a browser and going to the URL — http://localhost/scraper/index.php

If all is well, the script should run and display a “Success!” message on the page. Then, if you look inside the project folder, you should see a result.csv file with the book data, as shown below:

result.csv File

Make sure the script has the access rights it needs to create the result.csv file. If write permissions are missing, you might see a warning message and the file may not be created.
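One way to check this up front is a small helper like the one below. It is a sketch under stated assumptions: the function name can_write_file() is hypothetical, and the check uses PHP's is_writable(), which reports whether the current process may write to the file or folder.

```php
<?php
// Hypothetical helper: can the current process create or overwrite $path?
function can_write_file(string $path): bool {
    // An existing file must itself be writable; otherwise its folder must be.
    return file_exists($path) ? is_writable($path) : is_writable(dirname($path));
}

// The article writes result.csv into the project folder.
if (!can_write_file(__DIR__ . '/result.csv')) {
    echo 'The project folder is not writable; result.csv cannot be created.';
}
```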

Perfect! Congratulations on writing a nice Scraper Script in PHP.

This is the end of the tutorial.

Hope you liked it!

