Web Scraping {API}

Build Unique Web Scraping Experiences.

Introduction

Welcome to DataScraping.co for Developers! Just like you, we are enthusiastic web developers who love to automate things. With DataScraping.co we are building a next-generation data collection system, and we believe one of the best ways to do that is to give you direct control over your scraped data. The full documentation and API reference below will help you get there. Happy crawling!


Authentication

To start using the API, you need a DataScraping account. Sign up if you don’t have one.

Once you’re logged in, go to your Account page. Click Add a new API key to get your API ID and your API token. Remember to copy your API token and keep it somewhere safe, because we don’t store it as plain text!

We use HTTP basic authentication to authorize API requests. Send your API ID as the username and your API token as the password with each API request.
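As a rough illustration of the basic authentication scheme described above, the sketch below builds the value of the Authorization header from an API ID and an API token. The function name basicAuthHeader is hypothetical, and the snippet assumes a Node.js runtime (it uses Buffer). Note that the request examples elsewhere in this document pass the credentials as api-id and api-key query parameters instead, so check which form your account expects.

```javascript
// Build the value of the HTTP "Authorization" header for basic authentication.
// apiId and apiToken are placeholders for the credentials from your Account page.
// Assumes Node.js, where Buffer is available globally.
function basicAuthHeader(apiId, apiToken) {
  // Basic authentication sends base64("username:password").
  const credentials = Buffer.from(`${apiId}:${apiToken}`, "utf8").toString("base64");
  return `Basic ${credentials}`;
}
```

The returned string would typically be sent as the Authorization request header, for example headers: { Authorization: basicAuthHeader(apiId, apiToken) }.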


API

The root endpoint of the DataScraping API is https://api.datascraping.co/v1/. We recommend accessing the API over HTTPS.

Endpoints

Currently, the following endpoints are available once you are authenticated.

Method  Path            Description
POST    /newagent       Create a new scraping agent
GET     /agents         List all agents in your account
GET     /agents/{id}    Retrieve the configuration of a specific scraping agent
POST    /newlist        Create a new list
GET     /lists          List all input URL lists in your account
GET     /lists/{id}     Retrieve a specific list
POST    /start/{id}     Start a new scraping job for a specific agent
POST    /stop/{id}      Stop the scraping job
POST    /schedule/{id}  Schedule the scraping agent
GET     /output/{id}    Retrieve the scraped data of a specific scraping agent
GET     /logs/{id}      Retrieve the log messages of a specific scraping agent
DELETE  /agents/{id}    Delete a specific scraping agent
DELETE  /lists/{id}     Delete a specific list
POST    /inputs/{id}    Set the input URLs for the agent

Replace {id} with the actual ID of the object.
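The {id} substitution can be sketched as a small helper that turns a path from the endpoint table into a full URL. The names API_ROOT and endpointUrl are hypothetical and only serve this illustration; the root URL is the one given above.

```javascript
// Root endpoint taken from this document (without the trailing slash).
const API_ROOT = "https://api.datascraping.co/v1";

// Turn a path such as "/agents/{id}" into a full URL.
// The id argument is only needed for paths that contain the {id} placeholder.
function endpointUrl(pathTemplate, id) {
  if (!pathTemplate.includes("{id}")) {
    return API_ROOT + pathTemplate;
  }
  if (id === undefined || id === null || id === "") {
    throw new Error(`An ID is required for ${pathTemplate}`);
  }
  // Encode the ID so that unexpected characters cannot break the URL.
  return API_ROOT + pathTemplate.replace("{id}", encodeURIComponent(String(id)));
}
```

For example, endpointUrl("/output/{id}", "6413ef1c82") yields the URL of the output endpoint for the agent with that ID.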


List scraping agents

GET This API fetches all the active and archived scraping agents under an account.

$http.get("https://api.datascraping.co/v1/agents/?api-id={apiID}&api-key={apiKey}")
            .then(function (response) {
               console.log(response)
            });

Sample output

[{
	"id": "6413ef1c82",
	"name": "pricetree.com",
	"source_url": "http://www.pricetree.com/price-drops.aspx",
	"created_at": "05/13/2016 06:52:00 PM",
	"frequency": "",
	"schedule_description": "",
	"next_auto_run": null,
	"last_run": "05/14/2016 08:45:42 AM",
	"last_status": "Completed"
}]
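Since the response is a JSON array of agent summaries, it can be filtered client-side. The sketch below picks the names of agents whose last run ended with a given status. The helper name agentNamesByStatus is hypothetical, and "Completed" is the only last_status value shown in this document; other values are not documented here.

```javascript
// Return the names of the agents whose last run ended with the given status.
// `agents` is expected to have the shape of the sample output above.
function agentNamesByStatus(agents, status) {
  return agents
    .filter((agent) => agent.last_status === status)
    .map((agent) => agent.name);
}
```

With the sample output above, agentNamesByStatus(agents, "Completed") would contain "pricetree.com".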

Retrieve a scraping agent

GET Retrieve a specific scraping agent using the id.

$http.get("https://api.datascraping.co/v1/agents/{agentId}?api-id={apiID}&api-key={apiKey}")
            .then(function (response) {
               console.log(response);
            });

Sample output

{
	"id": "6413ef1c82",
	"name": "pricetree.com",
	"source_url": "http://www.pricetree.com/price-drops.aspx",
	"created_at": "05/13/2016 06:52:00 PM",
	"version": 2,
	"frequency": "",
	"schedule_description": "",
	"next_auto_run": null,
	"last_run": "05/14/2016 08:45:42 AM",
	"last_status": "Completed",
	"input_type": "manual",
	"list_id": "56b7e4",
	"pages_total": 1,
	"pages_crawled": 1,
	"agent": {
		"AgentName": "pricetree.com",
		"CreatedOn": "02/13/2016 10:00:59",
		"CreatedBy": "Data Scraping Studio",
		"Version": "1.6",
		"About": null,
		"SourceURL": "http://www.pricetree.com/price-drops.aspx",
		"Collection": [{
			"Name": "ProductName",
			"Pattern": "#priceDropList p",
			"ItemNumber": "TEXT",
			"Position": null,
			"Visible": true,
			"CollectionType": "CSS"
		}, {
			"Name": "PriceDrop%",
			"Pattern": ".price-drop",
			"ItemNumber": "TEXT",
			"Position": null,
			"Visible": true,
			"CollectionType": "CSS"
		}, {
			"Name": "ProductImage",
			"Pattern": ".product-list-img img",
			"ItemNumber": "ATTR",
			"Position": "src",
			"Visible": true,
			"CollectionType": "CSS"
		}, {
			"Name": "BestPrice",
			"Pattern": ".font-18+ .font-18",
			"ItemNumber": "TEXT",
			"Position": null,
			"Visible": true,
			"CollectionType": "CSS"
		}],
		"Header": {
			"Method": null,
			"TimeOut": 6000,
			"ReadWriteTimeout": 6000,
			"KeepSameURL": false,
			"ConnectionAlive": false,
			"Data": [{
				"Key": "Method",
				"Value": "GET"
			}, {
				"Key": "Accept",
				"Value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
			}, {
				"Key": "User-Agent",
				"Value": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
			}, {
				"Key": "Accept-Language",
				"Value": "*"
			}]
		},
		"Slicer": {
			"Enabled": false,
			"Data": []
		},
		"Proxy": {
			"Enabled": false,
			"RotateEnabled": false,
			"Rotate": 0,
			"EnableDynamicProxy": false,
			"DynamicColumn": "",
			"Data": []
		},
		"Throttling": {
			"Enabled": false,
			"DelayType": "",
			"DelaySeconds": 0,
			"AutoRedirectEnabled": true,
			"MaxAutoRedirect": 3
		},
		"Limit": {
			"Enabled": false,
			"StartAt": 4,
			"StopAt": 6
		},
		"FailRetry": {
			"Enabled": false,
			"RetryCount": 0,
			"DelaySeconds": 0
		},
		"Output": {
			"Type": null,
			"Path": null,
			"FileName": "",
			"PostDataOnAbortCancel": false,
			"IncludeHeader": false,
			"AppendMode": false
		},
		"ModifyOutput": {
			"Enabled": false,
			"Script": ""
		},
		"FormSubmit": null,
		"Input": {
			"Type": "WEB",
			"Path": null,
			"FileType": "JSON",
			"IncludeColumnInOutput": "0",
			"Data": []
		}
	}
}
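The agent.Collection array in the response lists the fields the agent extracts, each with a name and a pattern. As a minimal sketch, the hypothetical helper below reduces that array to a compact list of field names and selectors. It relies only on the property names visible in the sample output and makes no assumption about fields that are not shown there.

```javascript
// Summarize the fields an agent extracts, based on agent.Collection
// as it appears in the sample output above.
function fieldSelectors(agentResponse) {
  // Fall back to an empty list if the configuration is missing.
  const collection = (agentResponse.agent && agentResponse.agent.Collection) || [];
  return collection.map((field) => ({
    name: field.Name,
    selector: field.Pattern,
    type: field.CollectionType
  }));
}
```

For the sample agent this would give four entries, the first being { name: "ProductName", selector: "#priceDropList p", type: "CSS" }.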

Start the scraping agent

POST Start a new data collection job for a specific scraping agent using the id.

The start API endpoint requires the Admin API key.

$http.post("https://api.datascraping.co/v1/start/{agentId}?api-id={apiID}&api-key={apiKey}")
         .then(function (response) {
               console.log(response);
            });

Sample output

{
	"status_code": 201,
	"message": "New job started successfully",
	"version": 3
}

Stop the scraping agent

POST Stop the running data collection job for a specific scraping agent using the id.

The stop API endpoint requires the Admin API key.

$http.post("https://api.datascraping.co/v1/stop/{agentId}?api-id={apiID}&api-key={apiKey}")
         .then(function (response) {
               console.log(response);
            });

Sample output

{
	"status_code": 201,
	"message": "Stop request sent to workers successfully",
	"version": 3
}
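The start and stop requests differ only in the path segment, so their URLs can be built by one helper. The sketch below follows the query parameter style (api-id and api-key) used by the examples in this document. The names jobControlUrl and isAccepted are hypothetical, and treating status_code 201 as success is based solely on the two sample outputs above.

```javascript
// Build the URL for a start or stop request.
// Uses the api-id / api-key query parameters shown in the examples above.
function jobControlUrl(action, agentId, apiId, apiKey) {
  if (action !== "start" && action !== "stop") {
    throw new Error(`Unsupported action: ${action}`);
  }
  const url = new URL(`https://api.datascraping.co/v1/${action}/${encodeURIComponent(agentId)}`);
  // searchParams takes care of escaping the credential values.
  url.searchParams.set("api-id", apiId);
  url.searchParams.set("api-key", apiKey);
  return url.toString();
}

// Both sample outputs report status_code 201 when the request was accepted.
function isAccepted(responseBody) {
  return Boolean(responseBody) && responseBody.status_code === 201;
}
```

The resulting URL can be passed to $http.post in the same way as the hard-coded URLs in the examples.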

Schedule the scraping agent

POST Schedule the scraping agent to run every 15 minutes, every hour, daily, or using a custom cron expression.

The scheduling API endpoint requires the Admin API key, and all scheduled jobs are executed in Indian Standard Time (IST, UTC+5:30).

var scheduleData = {
	"cron_expression": "0 0/15 * 1/1 * ? *"
};

$http.post("https://api.datascraping.co/v1/schedule/{agentId}?api-id={apiID}&api-key={apiKey}", scheduleData)
         .then(function (response) {
               console.log(response);
            });

Sample output

{
	"status_code": 201,
	"message": "New schedule created successfully"
}

A minimum interval of 15 minutes is required between consecutive scheduled runs. Cron expressions that run more often than every 15 minutes are not supported.

You can use the following sample cron expressions when creating a schedule with the scheduling API.

Cron Expression            Meaning
0 0/15 * 1/1 * ? *         Run every 15 minutes
0 50 8 1/1 * ? *           At 08:50 AM (IST), every day
0 0 0/1 1/1 * ? *          Every hour, every day
0 30 5 1/1 * ?             At 05:30 AM (IST), every day
0 5 11 ? * MON-FRI *       At 11:05 AM (IST), Monday through Friday
0 15 10 ? * MON,THU,SAT *  At 10:15 AM (IST), only on Monday, Thursday, and Saturday
0 0 20 ? 1/1 MON#1 *       At 08:00 PM (IST), on the first Monday of every month
0 0 20 5 1/1 ? *           At 08:00 PM (IST), on day 5 of every month
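The sample expressions above can also be generated in code. The sketch below builds the "every N minutes" and "every day at a given time" forms and rejects intervals shorter than 15 minutes. The function names are hypothetical. The field order (seconds, minutes, hours, day of month, month, day of week, year) is inferred from the samples and looks like the Quartz-style format; that reading is an assumption, not something this document states. The minute interval is limited to values that divide 60 evenly because, under that reading, a step such as 0/50 would run at :00 and :50 and leave only 10 minutes before the next run.

```javascript
// Build a cron expression that runs every `minutes` minutes.
// Only 15, 20 and 30 are accepted: each is at least 15 and divides 60 evenly,
// so the gap across the hour boundary never drops below 15 minutes.
function everyMinutesCron(minutes) {
  if (!Number.isInteger(minutes) || minutes < 15 || minutes >= 60 || 60 % minutes !== 0) {
    throw new RangeError("minutes must be 15, 20 or 30");
  }
  return `0 0/${minutes} * 1/1 * ? *`;
}

// Build a cron expression that runs once a day at hour:minute (24-hour clock).
// According to this document, the time is interpreted in IST.
function dailyCron(hour, minute) {
  if (!Number.isInteger(hour) || hour < 0 || hour > 23) {
    throw new RangeError("hour must be an integer from 0 to 23");
  }
  if (!Number.isInteger(minute) || minute < 0 || minute > 59) {
    throw new RangeError("minute must be an integer from 0 to 59");
  }
  return `0 ${minute} ${hour} 1/1 * ? *`;
}
```

The result can be used as the request body, for example var scheduleData = { "cron_expression": dailyCron(8, 50) };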

Errors

Code  Meaning
200   OK. Everything seems good
201   OK. A new entry was created successfully
400   Bad request. Check the request arguments
401   Api-Key or Api-Id not present
403   Forbidden. Not authorized: X-DataScraping-API-Key or X-DataScraping-API-ID is invalid
404   Not found
409   Conflict with the server expectations
500   Something is wrong on our server
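For client-side error handling, the table above can be mirrored in a small lookup. This is a minimal sketch with hypothetical names (STATUS_MEANINGS, describeStatus, isSuccess); the descriptions are paraphrased from the table, and codes that the table does not list are reported as unexpected rather than guessed at.

```javascript
// Status codes and their meanings, paraphrased from the table above.
const STATUS_MEANINGS = new Map([
  [200, "OK"],
  [201, "Created"],
  [400, "Bad request"],
  [401, "API key or API ID not present"],
  [403, "Forbidden: API key or API ID is invalid"],
  [404, "Not found"],
  [409, "Conflict with the server expectations"],
  [500, "Server error"]
]);

// Return a short description for a status code.
function describeStatus(code) {
  return STATUS_MEANINGS.has(code)
    ? STATUS_MEANINGS.get(code)
    : `Unexpected status code ${code}`;
}

// Only 200 and 201 are listed as successful outcomes.
function isSuccess(code) {
  return code === 200 || code === 201;
}
```

A response handler could then log describeStatus(response.status) whenever isSuccess(response.status) is false.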
