A Comprehensive Guide to Building a Scalable Web Scraping Agent

by Vikash Rathee

- Scraping thousands of websites with one scraping agent.

Have you ever imagined scraping websites with completely different HTML structures using a single scraping agent? Last week, one of our clients wanted to scrape job listings from the career pages of thousands of corporate websites to feed the jobs database behind their portal. We started by analyzing a few of the sites and found that every website has a different HTML structure, so a separate agent would have to be created for each one in order to extract the data points.

Managing thousands of agents, with more coming soon, is not an easy task. You'd need a lot of time and resources to manage and maintain those agents, which increases the overall setup cost. So it's a good thing we put in some time beforehand to automate as much of the process as we could, and to centralize everything in one agent for maintenance, auditing, and finding out which website extractions are working fine and which ones require maintenance.

We started thinking about a solution that would meet our three requirements:

  • Scalable : Adding more sites should require less time. 
  • Automated : Things should run fine without our direct involvement.
  • Free : Adding more sites should not add extra cost to our users.

Introducing Dynamic Selectors

Simply put, a dynamic selector is an HTML pattern (a selector) within your agent that changes based on the website URL. Here is a basic example with two websites that expose the same data through different markup:

Website 1 : http://cdn.datascraping.co/sample_content/website-1.html

<section class="details">
 <p class="date">July 14, 2016</p>
 <h1 class="title">Product Developer - Seattle</h1>
</section>

Website 2 : https://s3-ap-southeast-1.amazonaws.com/static.pricingindia.in/wp-content/website-2.html

<div>
 <span id="post-date">July 20, 2016</span>
 <h2>Product Manager - London</h2>
</div>
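
To make the difference concrete, here is a minimal Python sketch (not the Data Scraping application's code) that pulls the same two fields out of both snippets with site-specific selectors. The `SimpleSelect` class and the `extract` helper are hypothetical names introduced for illustration. They only understand `tag`, `.class`, `#id` and `tag.class`; a real scraper would rely on a full CSS selector engine.

```python
from html.parser import HTMLParser


class SimpleSelect(HTMLParser):
    """Capture the text of the first element matching a very simple selector."""

    def __init__(self, selector):
        super().__init__()
        self.want_tag = self.want_class = self.want_id = None
        if selector.startswith("#"):
            self.want_id = selector[1:]
        elif "." in selector:
            tag, self.want_class = selector.split(".", 1)
            self.want_tag = tag or None
        else:
            self.want_tag = selector
        self.open_tag = None  # tag name of the matched element
        self.depth = 0        # nesting depth of open_tag while inside the match
        self.parts = []       # text chunks collected inside the match
        self.result = None    # stays None when nothing matches

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        if self.want_tag and tag != self.want_tag:
            return False
        classes = (attrs.get("class") or "").split()
        if self.want_class and self.want_class not in classes:
            return False
        if self.want_id and attrs.get("id") != self.want_id:
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if self.result is not None:
            return  # first match already captured
        if self.depth:
            if tag == self.open_tag:
                self.depth += 1
        elif self._matches(tag, attrs):
            self.open_tag, self.depth = tag, 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.open_tag:
            self.depth -= 1
            if self.depth == 0:
                self.result = "".join(self.parts).strip()

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract(html, selector):
    """Return the text of the first element matching selector, or None."""
    parser = SimpleSelect(selector)
    parser.feed(html)
    parser.close()
    return parser.result


SITE_1 = """<section class="details">
 <p class="date">July 14, 2016</p>
 <h1 class="title">Product Developer - Seattle</h1>
</section>"""

SITE_2 = """<div>
 <span id="post-date">July 20, 2016</span>
 <h2>Product Manager - London</h2>
</div>"""

# Same two fields, but each site needs its own pair of selectors.
for html, selectors in ((SITE_1, (".date", "h1.title")), (SITE_2, ("#post-date", "h2"))):
    print([extract(html, s) for s in selectors])
```

Note that the selectors written for website 1 find nothing on website 2, which is exactly why a single fixed set of selectors cannot serve both sites.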

Dynamic selectors are configured through the DynamicSelectors parameter. Once it is enabled, the Data Scraping application picks the right selectors from the dynamic data list by matching the website URL.

To provide website-specific information, we need to add an array of data with the website and selector details. In the example below, I've included the list of selectors for each field and website:

"DynamicSelectors":{
      "Enabled":true,
      "Data":[
         {
            "Website":"cdn.datascraping.co",
            "Selectors":[
               {
                  "Name":"DATE_TIME",
                  "Selector":".date",
                  "Extract":"TEXT",
                  "Attribute":""
               },
               {
                  "Name":"JOB_TITLE",
                  "Selector":"h1.title",
                  "Extract":"TEXT",
                  "Attribute":""
               }
            ]
         },
         {
            "Website":"s3-ap-southeast-1.amazonaws.com",
            "Selectors":[
               {
                  "Name":"DATE_TIME",
                  "Selector":"#post-date",
                  "Extract":"TEXT",
                  "Attribute":""
               },
               {
                  "Name":"JOB_TITLE",
                  "Selector":"h2",
                  "Extract":"TEXT",
                  "Attribute":""
               }
            ]
         }
      ]
}
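
As a rough illustration of the lookup this configuration enables, the Python sketch below picks the selectors for a URL by comparing its host name with each "Website" value. The `pick_selectors` function is a hypothetical helper, and the exact-host comparison is an assumption: the matching rule the application really uses is not documented here. The configuration is trimmed to the keys the lookup needs.

```python
from urllib.parse import urlparse

# Same shape as the "DynamicSelectors" fragment above, without "Extract"/"Attribute".
DYNAMIC_SELECTORS = {
    "Enabled": True,
    "Data": [
        {
            "Website": "cdn.datascraping.co",
            "Selectors": [
                {"Name": "DATE_TIME", "Selector": ".date"},
                {"Name": "JOB_TITLE", "Selector": "h1.title"},
            ],
        },
        {
            "Website": "s3-ap-southeast-1.amazonaws.com",
            "Selectors": [
                {"Name": "DATE_TIME", "Selector": "#post-date"},
                {"Name": "JOB_TITLE", "Selector": "h2"},
            ],
        },
    ],
}


def pick_selectors(config, url):
    """Return {field name: selector} for the entry whose Website equals the URL host.

    Assumption: an exact, case-insensitive host match. Returns {} when the
    feature is disabled or no entry matches.
    """
    if not config.get("Enabled"):
        return {}
    host = urlparse(url).hostname or ""  # hostname is already lowercased
    for entry in config["Data"]:
        if entry["Website"].lower() == host:
            return {s["Name"]: s["Selector"] for s in entry["Selectors"]}
    return {}


print(pick_selectors(DYNAMIC_SELECTORS, "http://cdn.datascraping.co/sample_content/website-1.html"))
```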

Making it work

So here's what creating a dynamic scraping agent typically involves:

  1. Create a simple agent for any one of the websites using the Chrome extension.
  2. Create a dynamic selectors list.
  3. Add selectors to the dynamic list using the Chrome extension.
  4. Enable dynamic selectors in the agent by selecting the dynamic list, then execute.

Step #1 : Creating a Scraping Agent

In this step, I went to the website page and enabled the Chrome extension. Then I added two fields and clicked on the elements to generate their CSS selectors.

[Image: website scraping in Google Chrome]

Once you are done with the setup, clicking the "Done" button opens the dialog box below. Enter the API Id and Key to save the agent in the cloud app.

You can get your account API key from here

[Image: extract html pages]

Step #2 : Creating a Dynamic Selector List

Log in to your account > Go to dynamic selector > Click on the "New List" button

[Image: dynamic selector list in site scraping]

Enter the name of the list and click on the "Create" button to finish the process. The new list will then appear in the table there.

[Image: my first dynamic selectors list]

Step #3 : Adding Dynamic Selectors To List

To add the dynamic selectors, go to each site you want to extract and launch the advanced web scraper Chrome extension to set up the selectors > Then click on the options button, which opens the dialog box below.

Go to the advanced tab in the dialog box > Select the list (created in step #2) > Click on the "Save" button to add those selectors to the list.

Be sure to use the same field names as in the main agent. For example: if your main agent has a JOB_TITLE field, you need to keep the same name for that field for each website in the dynamic list as well, so the values are mapped correctly.

[Image: add selectors to dynamic scraper list]

Note : Repeat step #3 for each site you want to add to the dynamic selectors list.
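
Since a mismatched field name is easy to miss once the list grows, the rule from step #3 can also be checked mechanically. The following Python sketch is only an illustration: `find_field_mismatches` is a hypothetical helper, and it assumes you have the list contents available in the same shape as the "Data" array shown earlier.

```python
def find_field_mismatches(agent_fields, dynamic_data):
    """Return {website: {"missing": [...], "unexpected": [...]}} for every
    entry whose field names differ from the main agent's field names."""
    expected = set(agent_fields)
    problems = {}
    for entry in dynamic_data:
        names = {s["Name"] for s in entry["Selectors"]}
        missing = sorted(expected - names)
        unexpected = sorted(names - expected)
        if missing or unexpected:
            problems[entry["Website"]] = {"missing": missing, "unexpected": unexpected}
    return problems


# Illustrative data: the second site uses TITLE instead of JOB_TITLE.
sample_data = [
    {"Website": "cdn.datascraping.co",
     "Selectors": [{"Name": "DATE_TIME", "Selector": ".date"},
                   {"Name": "JOB_TITLE", "Selector": "h1.title"}]},
    {"Website": "s3-ap-southeast-1.amazonaws.com",
     "Selectors": [{"Name": "DATE_TIME", "Selector": "#post-date"},
                   {"Name": "TITLE", "Selector": "h2"}]},
]

print(find_field_mismatches(["DATE_TIME", "JOB_TITLE"], sample_data))
```

An empty result means every website in the list uses exactly the field names of the main agent.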

Step #4 : Enable Dynamic Selector and Execution

By now, we have created a simple scraping agent and a dynamic selectors list, and added the selectors for 2 websites to our list.

So, it's time to enable the dynamic selector option in the agent. Go to the bottom of your agent page and click on the "Dynamic Selectors" button.

[Image: enable dynamic selectors]

This will open the dialog box below, where you can enable the dynamic selector option and select the list that contains the selectors.

[Image: select dynamic selectors to extract the data]

With the steps above, we've enabled the dynamic selectors list in our agent, which tells the application to read the selectors from the list by matching the website URL and the field name.

Now we can enter both website URLs in the agent setup to crawl the 2 websites with different HTML structures.

[Image: dynamic selector in website scraping]

Let's execute now. Click on the "Start" button to execute the agent; it will complete in a few seconds (depending on the number of pages). Once it has completed, refresh the page to see the extracted result. As the screenshot below shows, the agent extracted the DATE_TIME and JOB_TITLE fields from both sites by applying the selectors dynamically.

[Image: online web scraper in cloud]

Want to learn how it worked?

Go to the crawling logs tab. If you look at the log messages closely, you'll see that the application automatically applied different CSS selectors based on the website URL and extracted the results into their corresponding fields. This is application-level logic.
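
The behavior described by those logs can be sketched roughly as the loop below. This is an assumption about the general shape of the logic, not the application's actual code or log format: `run_agent` is a hypothetical function, and `fetch` and `extract` are caller-supplied stand-ins for downloading a page and applying one selector to it.

```python
from urllib.parse import urlparse


def run_agent(urls, dynamic_data, fetch, extract):
    """Return one row per URL, filled using the selectors listed for its website.

    fetch(url) -> html and extract(html, selector) -> text are hypothetical
    callables passed in by the caller. URLs whose host has no entry in
    dynamic_data produce a row that contains only the URL.
    """
    by_site = {entry["Website"]: entry["Selectors"] for entry in dynamic_data}
    rows = []
    for url in urls:
        host = urlparse(url).hostname or ""
        html = fetch(url)
        row = {"URL": url}
        for sel in by_site.get(host, []):
            # Each field is filled using the selector registered for this site.
            row[sel["Name"]] = extract(html, sel["Selector"])
        rows.append(row)
    return rows


# Tiny demonstration with stand-ins: "extract" just echoes the selector it
# was given, which makes it visible which selector was chosen per site.
demo_data = [
    {"Website": "a.example", "Selectors": [{"Name": "JOB_TITLE", "Selector": "h1.title"}]},
    {"Website": "b.example", "Selectors": [{"Name": "JOB_TITLE", "Selector": "h2"}]},
]
demo_rows = run_agent(
    ["http://a.example/jobs", "http://b.example/careers"],
    demo_data,
    fetch=lambda url: "<html></html>",
    extract=lambda html, selector: selector,
)
print(demo_rows)
```

Every row ends up with the same field names, even though each site was read with its own selectors.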

Using this dynamic selector feature, you can extract data from thousands or even millions of websites with different HTML structures using a single agent, as long as the expected output (output fields) is the same for each site.

[Image: multiple websites scraping with different HTML structure]

So whether you consider yourself a programmer or not, I'd definitely encourage you to play around with this feature, as it's all click-and-play and doesn't require any technical skills or programming to set up a dynamic agent and make it work. Do let me know what you think in the comments.
