Crawling a Password-Protected Website

In my previous tutorials we learned how to crawl a public website, write CSS selectors and regular expressions, and export data to a file. In this tutorial we will learn how to crawl a password-protected website using the Data Scraping Studio desktop app and the DataScraping.co hosted app.

To access a protected website, we must first authenticate with a username and password; after that we can scrape the internal pages just as we do with public websites. Scraping the web with the DataScraping.co apps is quick to set up using the extension, and once your scraping agent or automation robot is configured and launched, there is nothing more for you to do.

A simple workflow for scraping a password-protected website looks like this:

  1. Navigate to the login page.
  2. Enter the username in the input field.
  3. Enter the password in the input field.
  4. Click the login button.
  5. Start scraping the internal pages.

The login engine provides the following commands, which use CSS selectors to interact with a login page and complete login steps 1 to 4 before scraping of the internal pages starts.

Command    Description                     Required*
NAVIGATE   To navigate to a web page       1. Value: the URL to navigate to
TYPE       To type text in a text box      1. CSS selector: selector of the text box
                                           2. Value: value to enter in the text box
CLICK      To click on a button or link    1. CSS selector: selector of the button/link to be clicked
WAIT       To wait (n) seconds             1. Value: number of seconds (int) to wait
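
The commands and their required fields can be modelled as plain data. The following is a minimal Python sketch, not Data Scraping Studio's actual file format or API: the `LoginEvent` class, the `REQUIRED` mapping and the `validate` function are hypothetical names used only to illustrate which fields each command needs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Required fields per command, as listed in the command table.
REQUIRED = {
    "NAVIGATE": ("value",),
    "TYPE": ("selector", "value"),
    "CLICK": ("selector",),
    "WAIT": ("value",),
}


@dataclass
class LoginEvent:
    """One login step (hypothetical structure, for illustration only)."""
    command: str
    selector: Optional[str] = None
    value: Optional[str] = None


def validate(event: LoginEvent) -> List[str]:
    """Return a list of problems; an empty list means the event is well formed."""
    if event.command not in REQUIRED:
        return [f"unknown command: {event.command}"]
    problems = [
        f"{event.command} requires '{name}'"
        for name in REQUIRED[event.command]
        if not getattr(event, name)
    ]
    # WAIT takes an integer number of seconds.
    if event.command == "WAIT" and event.value and not event.value.isdigit():
        problems.append("WAIT value must be a whole number of seconds")
    return problems
```

For example, `validate(LoginEvent("TYPE", selector="#Login1_UserName"))` reports that the value is missing, because TYPE needs both a selector and a value.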

Using Hosted App

To enable scraping data from behind a login:

  1. Scroll down to the bottom of the agent page and click the "Edit agent" button.
    [Screenshot: edit scraping agent]
  2. Go to the "Password Authentication" tab and check "Enable login to website", as in the screenshot below.

Now go to the website you want to log in to and inspect its login form. For this tutorial I'm going to use the ASP.NET website http://client.vnpglobal.com/, whose login form HTML looks like this:

<form method="post">
<input name="Login1$UserName" type="text" id="Login1_UserName" placeholder="Username" />
<input name="Login1$Password" type="password" id="Login1_Password" placeholder="Password" />
<input type="submit" name="Login1$LoginButton" value="Sign In" id="Login1_LoginButton" class="submit" />
</form>

So we need to navigate to this page, enter the username and password, and click the submit button to get authenticated; then we can access the internal pages. Our events will be as follows:

  • Navigate to http://client.vnpglobal.com/
  • Enter the username in the text box with CSS selector #Login1_UserName
  • Enter the password in the text box with CSS selector #Login1_Password
  • Click on the Sign In button with CSS selector #Login1_LoginButton

A CSS selector can be written using an element's name, class or ID. For example, all of these selectors are valid for clicking the "Sign In" button:

  • #Login1_LoginButton
  • .submit
  • input.submit
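
To check that these three selectors really point at the same element, here is a small Python sketch using only the standard library. It is not how DataScraping.co evaluates selectors: `matches`, `SelectorFinder` and `find_ids` are hypothetical helpers, and they understand only the simple forms `#id`, `.class` and `tag.class`.

```python
from html.parser import HTMLParser

# The login form from this tutorial.
LOGIN_FORM = """
<form method="post">
<input name="Login1$UserName" type="text" id="Login1_UserName" placeholder="Username" />
<input name="Login1$Password" type="password" id="Login1_Password" placeholder="Password" />
<input type="submit" name="Login1$LoginButton" value="Sign In" id="Login1_LoginButton" class="submit" />
</form>
"""


def matches(selector, tag, attrs):
    """Match only the simple selector forms '#id', '.class' and 'tag.class'."""
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    classes = (attrs.get("class") or "").split()
    wanted_tag, _, wanted_class = selector.partition(".")
    tag_ok = not wanted_tag or wanted_tag == tag
    class_ok = not wanted_class or wanted_class in classes
    return tag_ok and class_ok


class SelectorFinder(HTMLParser):
    """Collect the id of every element that matches the selector."""

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if matches(self.selector, tag, attr_map):
            self.found.append(attr_map.get("id"))


def find_ids(selector, html=LOGIN_FORM):
    finder = SelectorFinder(selector)
    finder.feed(html)
    return finder.found


for selector in ("#Login1_LoginButton", ".submit", "input.submit"):
    print(selector, "->", find_ids(selector))
```

A selector that matches exactly one element is the safest choice for CLICK and TYPE; on a page with several elements of class `submit`, the ID selector would be the unambiguous one.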

[Screenshot: enable login to site for crawling]

Now save the scraping agent and go back to the agent page. Enter some URLs in the setup and start crawling.

[Screenshot: password-protected site scraping]

It will take a few seconds, depending on the number of URLs to be crawled, and you can see the data in the output table as soon as the scraping job completes.

[Screenshot: crawled data behind login]

Using Desktop App

The DataScraping.co desktop app and hosted app use the same login engine, so you can execute the same agent in the desktop app, or vice versa, by downloading the *.scraping file. Alternatively, follow the steps below to enable "scrape data from behind a log-in" in the desktop app:

  • Edit the scraping agent from the agent explorer tree in Data Scraping Studio.
  • Go to the "Advance settings" tab.
  • Click the enable check box.

[Screenshot: password-protected site crawling]

Click the "Add" button to add all the events one by one to the scraping agent.

[Screenshot: add login commands for scraping]

[Screenshot: login to website for crawling]

Now save the scraping agent and re-run it. It will log in to the website first and then start scraping the internal URLs from the input file, since the agent is set up to extract data from internal pages.

[Screenshot: website login successful for crawling]

If you look closely at the logs window, you will see that Data Scraping Studio executes all the events in the setup list in the background to log in to the website, and then maintains the cookie/session while scraping all the internal pages. The interaction events must therefore be in order. For example, we can't click the login button first and then type the username.
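
Why the order matters can be shown with a toy simulation. This is a sketch under stated assumptions, not the real login engine: `FakeLoginPage` is a hypothetical stand-in for a browser session, and the credentials `demo-user` and `demo-pass` are placeholders.

```python
class FakeLoginPage:
    """Stand-in for a real browser session (purely illustrative)."""

    def __init__(self):
        self.fields = {}
        self.logged_in = False

    def run(self, events):
        """Apply (command, selector, value) tuples strictly in list order."""
        for command, selector, value in events:
            if command == "TYPE":
                self.fields[selector] = value
            elif command == "CLICK" and selector == "#Login1_LoginButton":
                # The form is submitted with whatever has been typed so far.
                self.logged_in = (
                    bool(self.fields.get("#Login1_UserName"))
                    and bool(self.fields.get("#Login1_Password"))
                )
        return self.logged_in


GOOD_ORDER = [
    ("NAVIGATE", None, "http://client.vnpglobal.com/"),
    ("TYPE", "#Login1_UserName", "demo-user"),
    ("TYPE", "#Login1_Password", "demo-pass"),
    ("CLICK", "#Login1_LoginButton", None),
]

# Same events, but the click happens before anything is typed.
BAD_ORDER = [GOOD_ORDER[0], GOOD_ORDER[3], GOOD_ORDER[1], GOOD_ORDER[2]]

print(FakeLoginPage().run(GOOD_ORDER))
print(FakeLoginPage().run(BAD_ORDER))
```

With the click moved to the front, the form is submitted while both fields are still empty, so the simulated login fails even though the same four events are present.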

[Executing] "Password-protected-crawling.scraping"
[Requesting] Inputs at 03/01/2016 09:45:50
[Queued] 4 input records from {DIRECT} agent
[NAVIGATE] Navigating to http://client.vnpglobal.com/ at 03/01/2016 09:45:50
[TYPE] Typing "**************" to selector "#Login1_UserName"
[TYPE] Typing "**************" to selector "#Login1_Password"
[CLICK] Clicked on "#Login1_LoginButton"
[Starting] Extraction at 03/01/2016 09:45:51
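
If you save a log like the one above to a file, the login steps can be pulled out with a short script. This is a sketch that assumes every line has the shape `[TAG] message` as in the sample; `parse_log` and `login_steps` are hypothetical helper names, not part of Data Scraping Studio.

```python
import re

# One log line: a tag in square brackets followed by a free-form message.
LOG_LINE = re.compile(r"^\[(?P<tag>[^\]]+)\]\s+(?P<message>.*)$")

LOGIN_COMMANDS = {"NAVIGATE", "TYPE", "CLICK", "WAIT"}

SAMPLE_LOG = """
[Executing] "Password-protected-crawling.scraping"
[Requesting] Inputs at 03/01/2016 09:45:50
[Queued] 4 input records from {DIRECT} agent
[NAVIGATE] Navigating to http://client.vnpglobal.com/ at 03/01/2016 09:45:50
[TYPE] Typing "**************" to selector "#Login1_UserName"
[TYPE] Typing "**************" to selector "#Login1_Password"
[CLICK] Clicked on "#Login1_LoginButton"
[Starting] Extraction at 03/01/2016 09:45:51
"""


def parse_log(text):
    """Return (tag, message) pairs for every line that looks like a log entry."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            entries.append((match.group("tag"), match.group("message")))
    return entries


def login_steps(entries):
    """Keep only the tags that correspond to login-engine commands."""
    return [tag for tag, _ in entries if tag in LOGIN_COMMANDS]


print(login_steps(parse_log(SAMPLE_LOG)))
```

Comparing the extracted sequence with the events you configured is a quick way to confirm that they ran in the intended order before extraction started.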

To verify that we've logged in successfully, I'm using the RESPONSE_URL default field here, because the website redirects me to the login page if authentication fails. We can also always go to the network tab to see the HTML response from the server.
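
The RESPONSE_URL check can also be automated on the exported data. The sketch below assumes that a failed login always ends on the login page URL itself; `looks_logged_in` is a hypothetical helper, and the internal page `orders.aspx` is an invented example, not a real page on the site.

```python
from urllib.parse import urlsplit

LOGIN_URL = "http://client.vnpglobal.com/"


def _normalise(url):
    """Compare URLs by host and path only, ignoring case in the host and a trailing slash."""
    parts = urlsplit(url)
    return parts.netloc.lower(), parts.path.rstrip("/")


def looks_logged_in(response_url, login_url=LOGIN_URL):
    """True when the final URL is something other than the login page."""
    return _normalise(response_url) != _normalise(login_url)


# "orders.aspx" is a made-up internal page used only for this example.
print(looks_logged_in("http://client.vnpglobal.com/orders.aspx"))
print(looks_logged_in("http://client.vnpglobal.com/"))
```

If the site redirects failed logins to a different address than the one you navigated to, pass that address as `login_url` instead.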

[Screenshot: HTML response of server after login]
