Scraping Data From Websites With Pagination

When there are thousands or millions of records to show on a webpage, websites use pagination to divide them into pages. In this tutorial I will cover 5 types of commonly used pagination and how to scrape data from these interactive web pages using Data Scraping Studio.

  • Querystring Pagination
  • SEO Friendly Pagination
  • Next Previous Pagination
  • Infinite Scrolling Pagination
  • Load More Button Pagination

Querystring Pagination

[Image: Query String Pagination]

Query string pagination uses a simple URL with a query string parameter: page=1, page=2, page=3 and so on. We can just paste these URLs into a CSV/TSV file and configure the scraping agent to read its input from that file (a short script for generating such a list follows the example URLs below).

http://www.pricetree.com/price-drops.aspx?p=1
http://www.pricetree.com/price-drops.aspx?p=2
http://www.pricetree.com/price-drops.aspx?p=3
http://www.pricetree.com/price-drops.aspx?p=4
http://www.pricetree.com/price-drops.aspx?p=5
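
If the site has many pages, you don't need to type these URLs by hand. Here is a minimal sketch in plain JavaScript (runnable in the browser console or Node) that builds the list so it can be pasted into the CSV/TSV input file; the page count of 50 is just an assumed example:

// Build a list of query-string pagination URLs to paste into the CSV/TSV input file.
// The base URL comes from the example above; the page count (50) is an assumption.
var baseUrl = 'http://www.pricetree.com/price-drops.aspx?p=';
var urls = [];
for (var page = 1; page <= 50; page++) {
	urls.push(baseUrl + page);
}
// One URL per line, ready to copy into the input file.
console.log(urls.join('\n'));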

Step 1 : Create a list of URLs and enter them in your CSV/TSV file.

[Image: query string URL pagination scraping]

Step 2 : Configure your scraping agent to read input from that file. (Click the Preview button to see the preview.)

[Image: web scraping agent to read pagination URL from CSV]

Step 3 : Execute the scraping agent

[Image: scraping agent with pagination]

SEO Friendly Pagination

SEO friendly pagination also uses plain URLs, but in a form that ranks better on search engines, e.g. page/1, page/2. Since these are still simple URLs, nothing changes in our scraping agent: just paste all the URLs into a CSV/TSV file and configure the agent to read its input from that file during execution. Data Scraping Studio will traverse all the web pages and extract whatever matches the CSS selector or REGEX provided in the setup.

http://www.domain.com/some_article-1.html
http://www.domain.com/some_article-2.html
http://www.domain.com/some_article-3.html

Next Previous Pagination

[Image: next previous pagination]

Some websites use Next-Previous pagination, loading data from the server when the user clicks the next or previous button, or enters a particular page number in a text box. To scrape data from this kind of pagination:

  • Create/Edit the scraping agent
  • Go to Advanced Settings > Pagination
  • Enable pagination by clicking the check box.
  • Select the paging type as "CLICK"
  • Enter the CSS selector of the Next button/hyperlink that should be clicked (see the console check after this list).
  • Set the limit (n) to tell Data Scraping Studio how many pages need to be crawled.
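
To find the right selector, inspect the Next button/link in the Chrome developer tools (right click > Inspect). As a quick sanity check, you can test the selector in the browser console before saving the agent; the markup and selector below (a.next-page) are only hypothetical examples, so use whatever matches the target site:

// Hypothetical markup of a "Next" link: <a class="next-page" href="/page/2">Next »</a>
// Run in the Chrome console on the target page to confirm the selector matches an element.
var nextButton = document.querySelector('a.next-page');
if (nextButton) {
	console.log('Selector OK, matches: ' + nextButton.textContent.trim());
} else {
	console.log('Selector did not match anything - adjust it before saving the agent');
}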

[Image: paging in website scraping]

Save the agent and re-run it. Data Scraping Studio will now crawl the first page, find the "Next Page" selector, click it, and extract the next page's data after the page load event. The process keeps running in a loop until it reaches the limit given in the agent configuration.

The pagination feature can also be used for AJAX/JavaScript generated pages.

[Image: Data Scraping Studio pagination]

Infinite Scrolling Pagination

This type of paging is mostly used on newer websites to auto-load data from the server as the page is scrolled down in the browser, using client-side jQuery/AJAX (HTTP GET or HTTP POST) requests. To scrape data from infinite scrolling pagination, we can watch the network requests in Chrome/Firefox and then use those internal, structured JSON or XML pages directly in the scraping agent.

Step 1 : Navigate to the webpage. For example, I found one infinite scrolling pagination example on producthunt.com to test and demonstrate with.

Step 2 : Open the Developer tools (press F12) in Chrome and go to the Network tab. (You may also click the XHR tab to filter out images, CSS, etc.)

Step 3 : Scroll down to load more pages.

  1. An AJAX request fetches data from the server as you scroll.
  2. The HTTP response is received from the server in structured JSON format, which is then inserted into the HTML using client-side JavaScript/jQuery.

[Image: infinite scrolling AJAX pages scraping]

Step 4 : Go to the "Headers" tab to see the HTTP request method (GET in this case); you will also see the actual AJAX request URL and the other headers and cookie values sent by the browser.

[Image: HTTP GET method for scraping]

Step 5 : Now that we know the AJAX URLs the website uses to fetch the data that gets inserted into the HTML, let's make a list of URLs to crawl with our scraping agent.

Once we know the actual back-end pages the data is coming from, why extract it from HTML at all? With Data Scraping Studio we can extract data directly from these internal JSON or XML pages using the HTTP GET or POST method and JavaScript's native JSON parser (a quick check of one of these URLs is shown after the list below).

https://www.producthunt.com/tech?page=1&per_page=2
https://www.producthunt.com/tech?page=2&per_page=2
https://www.producthunt.com/tech?page=3&per_page=2
https://www.producthunt.com/tech?page=4&per_page=2
https://www.producthunt.com/tech?page=5&per_page=2
https://www.producthunt.com/tech?page=6&per_page=2
https://www.producthunt.com/tech?page=7&per_page=2
https://www.producthunt.com/tech?page=8&per_page=2
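
Before wiring these URLs into the agent, it can help to confirm that they really return JSON when requested directly. A rough sketch you can run in the browser console while on producthunt.com; the X-Requested-With header is an assumption (many sites only return JSON for XHR-style requests), and the response format may have changed since this tutorial was written:

// Request one of the paginated AJAX URLs and log the parsed JSON.
fetch('https://www.producthunt.com/tech?page=1&per_page=2', {
	headers: { 'X-Requested-With': 'XMLHttpRequest' }, // assumption: mimic an AJAX request
	credentials: 'include'                             // send the same cookies the browser uses
})
	.then(function (response) { return response.text(); })
	.then(function (body) {
		var parsed = JSON.parse(body); // throws if the server returned HTML instead of JSON
		console.log(parsed);
	})
	.catch(function (err) { console.error('Request or parsing failed:', err); });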

Step 6 : Create a scraping agent with the 2 built-in fields REQUEST_URL and RESPONSE_CONTENT.

[Image: agent setup for AJAX website scraping]

Step 7 : Enter the list of URLs in "input".

[Image: infinite scrolling input URLs]

Step 8 : Execute the Scraping Agent

[Image: crawler works with infinite scroll web pages]

Step 9 : Now we have the JSON response in the RESPONSE_CONTENT field for each URL. Let's use the "Modify Data with JavaScript" feature to write a JavaScript function that parses it into the structured table and fields we want to scrape.

// Parse the raw JSON captured in the RESPONSE_CONTENT field.
var results = JSON.parse(data[0].RESPONSE_CONTENT);
var tabularArray = [];
// Flatten each post into one row of the output table.
for (var i in results[0].posts) {
	var post = results[0].posts[i];
	var row = {
		"PageUri": data[0].REQUEST_URL,
		"ID": post.id,
		"Name": post.name,
		"Tagline": post.tagline,
		"Created_at": post.created_at,
		"Url": ('https://www.producthunt.com' + post.url),
		"Vote_count": post.vote_count,
		"Category": post.category.name,
		"UserName": post.user.username
	};
	tabularArray.push(row);
}
// Replace the scraped data with the parsed tabular rows.
data = tabularArray;
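
For reference, the script above assumes that the parsed RESPONSE_CONTENT looks roughly like the object below. The field names are taken from the code; the values are made up, and the real Product Hunt payload may differ or change over time:

// Assumed shape of JSON.parse(data[0].RESPONSE_CONTENT) - illustrative values only.
var exampleResponse = [{
	posts: [{
		id: 12345,
		name: 'Some Product',
		tagline: 'One-line description of the product',
		created_at: '2016-01-01T00:00:00.000Z',
		url: '/posts/some-product',
		vote_count: 42,
		category: { name: 'Tech' },
		user: { username: 'someuser' }
	}]
}];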

[Image: parse AJAX response with JavaScript JSON.parse]

Step 10 : Now that we have written a JavaScript function to parse the scraped JSON and tested it with the output preview, let's configure the scraping agent to execute this JS code automatically on each execution.

  1. Create/Edit your scraping agent.
  2. Go to the "Modify Data" tab.
  3. Enable the feature by checking the "Enable Modify Data" check box.
  4. Browse to the JavaScript we've written in step 9.
  5. Save the scraping agent.

[Image: modify pagination data]

Step 11 : Final execution. We're done with all the steps; now execute the scraping agent for any number of pages.

Scraping a page that uses AJAX for pagination is easier and faster than scraping a regular HTML website: the AJAX requests return well-structured JSON or XML, which can increase crawling speed by about 2x because no HTML, CSS or JavaScript has to be loaded in the data extraction engine.

[Image: scraping data from infinite scrolling pagination]

Load More Button Pagination

[Image: load more pagination scraping]

This is similar to infinite scroll; the only difference is that a "Load More" button is present, and clicking it loads the data instead of the OnScroll event. This client-side event sends an AJAX request to the server and inserts the data into the HTML webpage. So we can track those AJAX pages in the Chrome developer tools (as we did for infinite scroll above) and then extract data directly from them instead of from the HTML.

AJAX pagination is tricky, so in order to crawl data from it we first need to analyze the web page using the Chrome developer tools, or a tool like Fiddler, to understand the type of pagination used on the targeted website and how the data flows when the page changes. It may be...

  • Simple query string URL: When you click on the next or previous page, the URL in the browser changes to ?page=1, ?page=2, etc.
  • SEO friendly URL: When you click on the next or previous page, the URL changes to page/1, page/2, etc.
  • AJAX with HTTP GET: When you click on the next or previous button, a network request fetches data from the server with an HTTP GET query-string parameter; this can be tracked in the Chrome developer tools
    GET /api/v1/product?page=1
    GET /api/v1/product?page=2 
    GET /api/v1/product?page=3 
    GET /api/v1/product?page=4 
    GET /api/v1/product?page=5
  • AJAX with HTTP POST: When you click on the next or previous button, a network request fetches data from the server using the HTTP POST method; this can be tracked in the Chrome developer tools (a rough sketch of replicating such requests follows this list)
    POST /api/v1/product [PostData:p=1]
    POST /api/v1/product [PostData:p=2]
    POST /api/v1/product [PostData:p=3]
    POST /api/v1/product [PostData:p=4]
    POST /api/v1/product [PostData:p=5]
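
If a site paginates with AJAX POST requests like the hypothetical /api/v1/product example above, the same idea of hitting the back-end pages directly still works; you just send the post data for each page. A rough console sketch, with the endpoint and parameter name taken only from the placeholder example:

// Fetch pages 1-5 of a hypothetical POST-paginated API and log each JSON response.
var endpoint = '/api/v1/product'; // placeholder endpoint from the example above
for (var p = 1; p <= 5; p++) {
	fetch(endpoint, {
		method: 'POST',
		headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
		body: 'p=' + p
	})
		.then(function (response) { return response.json(); })
		.then(function (json) { console.log(json); })
		.catch(function (err) { console.error('Page request failed:', err); });
}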

And once we know the type of pagination, we can use any of the above techniques for scraping data from pagination.
