Using a Proxy for Anonymous Web Scraping

A proxy is an important part of web scraping and a widely used means of scraping data from websites anonymously, because sometimes you may not want to reveal your identity (network details) to the remote web servers you are scraping data from.

For example, Nick is a pricing manager at the eCommerce company Amazon and wants to scrape prices from the competitor website eBay. Would Nick ever want...

  • eBay to find out that their competitor is scraping their website for business intelligence?
  • His IP address and location to be revealed on eBay's servers, with the risk of getting blocked?

No - the solution is a proxy.

A proxy or proxy server is basically another computer which serves as a hub through which internet requests are processed. By connecting through one of these servers, your computer sends your requests to the proxy server, which then processes your request and returns what you were wanting (ref).

Data Scraping Studio lets you scrape data from websites anonymously with the help of a configurable rotating HTTP proxy server. To configure this feature, create or edit a scraping agent, go to the "Advance Setting" tab, and then click on the "Proxy" tab. You may add a single proxy address or import a list of proxy addresses in bulk, as shown below.

Only HTTP proxies are supported, and each proxy must use the format protocol://login:password@IP:port.
The login details and port are optional. Here are some examples:

  • http://66.12.121.140:8000
  • http://219.66.12.12
  • http://myproxy.com:80
  • http://username:password@219.66.12.14:8080
  • http://username:password@myproxy.com:80

[Screenshot: proxy settings in the scraping agent]
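
If you want to sanity-check a proxy address in this format yourself, Python's standard urllib.parse module splits it into its parts; the address below is one of the examples above:

from urllib.parse import urlsplit

parts = urlsplit("http://username:password@219.66.12.14:8080")
print(parts.scheme)    # http
print(parts.username)  # username
print(parts.password)  # password
print(parts.hostname)  # 219.66.12.14
print(parts.port)      # 8080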

Proxy Rotator

You may use this feature if you have more than one proxy and want to keep rotating them randomly. With this feature enabled, Data Scraping Studio will pick a random proxy from the list provided and rotate to another after every (n) page fetches. Here is a rough sketch of how it works:

# For example, you have 52 proxies in your list
import random

proxies = [...]  # your 52 proxy addresses

random_index = random.randrange(len(proxies))  # a number between 0 and len(proxies) - 1, i.e. 0 to 51

def proxy_rotator(index):
    # Set proxies[index] as the active proxy for the next (n) page fetches
    return proxies[index]
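
Outside the product, the same rotation idea can be sketched in Python with the requests library; the proxy list, rotation interval, and URLs here are placeholders, not values from Data Scraping Studio:

import random
import requests

proxies = ["http://66.12.121.140:8000", "http://myproxy.com:80"]  # placeholder list
ROTATE_EVERY = 10  # rotate after every (n) page fetches

def fetch_pages(urls):
    proxy = random.choice(proxies)  # start with a random proxy
    for i, url in enumerate(urls):
        if i > 0 and i % ROTATE_EVERY == 0:
            proxy = random.choice(proxies)  # switch to another random proxy
        yield requests.get(url, proxies={"http": proxy, "https": proxy})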

Check the "Proxy Rotator" box to enable random proxy selection from the list provided.

Bulk Import

Click on the "Import" button if you have a large list of proxies to add in bulk, and paste one proxy per line in the format protocol://login:password@IP:port.

[Screenshot: bulk proxy import]
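
If you keep such a list in a plain text file, loading it for reuse is a one-liner in Python (the file name proxies.txt is just an assumption):

# proxies.txt: one proxy per line in the format protocol://login:password@IP:port
with open("proxies.txt") as f:
    proxies = [line.strip() for line in f if line.strip()]  # skip blank lines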

Get Proxy from Input Files Dynamically

The dynamic input feature lets you take the proxy directly from your input file (JSON/CSV/TSV) and set it while scraping. Using this feature, you have full control over which proxy is sent with which URL, and you can skip the proxy for a few URLs if needed.

Enabling the proxy with the dynamic input feature while keeping the input file field blank will fetch the request without a proxy.

[Screenshot: dynamic proxy option]

To use this feature, click on the check box "Get proxy file from input file dynamically" and then provide the name of the field to populate during execution; for example, COLUMN2 in the format {{COLUMN2}} for the input file below.

[Screenshot: input file with proxy column]
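
To make the per-row behavior concrete, here is a hypothetical Python sketch (the file name and column layout are assumptions, not the product's internals): each URL in COLUMN1 is fetched through the proxy in COLUMN2, and a blank proxy field means the request goes out without a proxy.

import csv
import requests

# input.csv (hypothetical):
#   COLUMN1,COLUMN2
#   https://example.com/page1,http://66.12.121.140:8000
#   https://example.com/page2,
with open("input.csv", newline="") as f:
    for row in csv.DictReader(f):
        proxy = (row.get("COLUMN2") or "").strip()
        proxies = {"http": proxy, "https": proxy} if proxy else None  # blank -> no proxy
        response = requests.get(row["COLUMN1"], proxies=proxies)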

Free Proxies

You may use free proxies or paid ones; several popular websites publish free proxy lists that can help you get started and create your first anonymous web scraper. Keep in mind that these free proxy lists may not be reliable, because they are publicly visible and many people use them. So you may also consider a premium proxy, depending on your needs.

More Best Practices for Anonymous Data Scraping

Use a random user agent for each request: By default, Data Scraping Studio uses the Chrome user-agent string that is set during agent creation with the Chrome extension. You may change it to something else, or use the dynamic input feature to set the user agent on a per-request basis.

Use the random user-agent feature with care, as many websites change their HTML structure based on the user-agent string. For example, a website may serve mobile-rendered HTML for a Chrome-on-Android user agent and desktop-rendered HTML for a Chrome desktop user agent, which can break your CSS selectors because the HTML structure changed on the web server.
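
One way to get variety without triggering a layout switch is to rotate only among desktop user agents. A minimal Python sketch, with example strings and a placeholder URL:

import random
import requests

DESKTOP_CHROME_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

response = requests.get(
    "https://example.com/",
    headers={"User-Agent": random.choice(DESKTOP_CHROME_UAS)},
)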

Send a custom HTTP referrer: The Referer header (so spelled in the HTTP specification) is a request header that lets you send whatever referrer you want, either globally or on a per-page basis using dynamic inputs, if you do not want to let the site know where you are coming from. For example:

  • Set a custom referrer globally - Referrer: google.com
  • _self - The request URL is sent as the referrer
  • Blank - Do not send any referrer
  • Dynamic - {{COLUMN3}} gets the referrer from the input file and sets it
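
Outside the product, the referrer is just another request header; a short Python sketch with placeholder URLs:

import requests

# Pretend the visit came from a Google search results page
response = requests.get(
    "https://example.com/page",
    headers={"Referer": "https://www.google.com/"},
)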

[Screenshot: user-agent setting]
