Built-in Fields in Data Scraping Studio

The CSS Selector engine and the REGEX extractor engine in Data Scraping Studio let you extract anything you want from the source content, but sometimes you need a few fields in your output that are not available in the source content.

For example, you are extracting stock information for 100 companies from a stock website and also want a DATETIME field in your result grid, so you can see when a particular page was fetched and its information was extracted.

Data Scraping Studio has 11 built-in fields for such cases.

Name Type Description
REQUEST_URL string URL of the web page, exactly as provided in the input
RESPONSE_URL string

URL of the web page returned by the web server after any redirects.
(E.g. if your input has an old URL http://www.domain.com/some-old-product-page.html but the web server returns a 301 redirect and serves the new URL http://www.domain.com/new-page.html, the RESPONSE_URL field will contain the new URL.)

RESPONSE_HTTP_CODE int HTTP status code of a successful web request (e.g. 200, 301)
RESPONSE_HTTP_STATUS string HTTP status text of a successful web request (e.g. OK, Moved Permanently)
RESPONSE_ERROR_CODE int HTTP error code of a failed web request (e.g. 404, 408, 503)
RESPONSE_ERROR_MSG string

HTTP error message of a failed web request. E.g.

  • Not Found
  • Request Time Out
  • Website under maintenance etc.

See the W3 website for more about HTTP status codes.

RESPONSE_HEADER string

Collection of the web response headers. E.g.

Cache-Control:private
Connection:Keep-Alive
Content-Encoding:gzip
Content-Length:8922
Content-Type:text/html; charset=utf-8
Date:Thu, 10 Dec 2015 11:44:36 GMT
Proxy-Connection:Keep-Alive
Server:Microsoft-IIS/7.5
Vary:Accept-Encoding
X-AspNet-Version:4.0.30319
X-Powered-By:ASP.NET

 

RESPONSE_CONTENT string

Complete source code of the requested web page. E.g.

<!DOCTYPE html>
<html>
<head>
<title>
List Extraction
</title>
<meta name="robots" content="noindex, nofollow" />
</head>		
<body>
<h1>List extraction</h1>
<h2>This example is used to extract the list with Data Scraping Studio</h2>
.....
.....
.....
</body>
</html>

 

RESPONSE_DATETIME datetime System date and time when a particular request was fetched (format: MM/dd/yyyy hh:mm:ss)
RESPONSE_DATE date System date when a particular request was fetched (format: MM/dd/yyyy)
RESPONSE_TIME time System time when a particular request was fetched (format: hh:mm:ss)
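
If you post-process the extracted data outside Data Scraping Studio, the RESPONSE_HEADER and RESPONSE_DATETIME values may need to be parsed first. The following is only a minimal sketch in Python, not part of the product. It assumes that RESPONSE_HEADER holds one "Name:value" pair per line, as in the example above, and that the time part of RESPONSE_DATETIME uses a 24-hour clock. The helper names parse_response_header and parse_response_datetime are hypothetical.

```python
from datetime import datetime


def parse_response_header(raw):
    """Split a RESPONSE_HEADER value into a {name: value} dict.

    Assumption: one "Name:value" pair per line, as in the example above.
    """
    headers = {}
    for line in raw.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        name, sep, value = line.partition(":")
        if sep:  # skip blank lines and lines without a colon
            headers[name.strip()] = value.strip()
    return headers


def parse_response_datetime(raw):
    """Parse a RESPONSE_DATETIME value (MM/dd/yyyy hh:mm:ss).

    Assumption: the hour is on a 24-hour clock.
    """
    return datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")


sample_header = (
    "Content-Type:text/html; charset=utf-8\n"
    "Date:Thu, 10 Dec 2015 11:44:36 GMT\n"
    "Server:Microsoft-IIS/7.5"
)

headers = parse_response_header(sample_header)
fetched_at = parse_response_datetime("12/10/2015 11:44:36")

print(headers["Content-Type"])
print(fetched_at.isoformat())
```

Splitting on the first colon only matters for headers such as Date, whose values contain colons of their own.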

 

These fields can be added to or removed from a scraping agent while creating or editing the agent.
Steps: Go to New Agent/Edit Agent > click the add built-in field button > choose the fields you want to add > click the Add button > save your scraping agent to finish.
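
Once the built-in fields are part of an agent's output, they can be used to flag redirected and failed requests. The following Python sketch is an illustration only. It assumes the results were exported as CSV with column names matching the field names, and that RESPONSE_URL and RESPONSE_ERROR_CODE are left empty when they do not apply. The sample rows, the missing.html URL and the classify helper are hypothetical.

```python
import csv
import io

# Hypothetical export; the real column layout may differ.
SAMPLE_EXPORT = """REQUEST_URL,RESPONSE_URL,RESPONSE_ERROR_CODE
http://www.domain.com/some-old-product-page.html,http://www.domain.com/new-page.html,
http://www.domain.com/missing.html,,404
http://www.domain.com/new-page.html,http://www.domain.com/new-page.html,
"""


def classify(row):
    """Label one result row as 'failed', 'redirected' or 'ok'."""
    if row["RESPONSE_ERROR_CODE"]:
        return "failed"
    # A RESPONSE_URL that differs from REQUEST_URL means the server redirected.
    if row["RESPONSE_URL"] and row["RESPONSE_URL"] != row["REQUEST_URL"]:
        return "redirected"
    return "ok"


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
summary = [(row["REQUEST_URL"], classify(row)) for row in rows]

for url, status in summary:
    print(status, url)
```

The error check runs first, so a failed request is not reported as a redirect just because its RESPONSE_URL is empty.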

