Inputs for Batch Crawling in Data Scraping Studio

Data Scraping Studio supports many input formats for batch crawling, so it can integrate with your current solution and help automate the data collection process.

Enter the list of Inputs Manually

"Direct URL Lists", this option is used to copy paste the list of input URLs in the input grid manually and Data scraping studio will save this data in that particular scraping agent to crawl when executed.

Note 1: COLUMN1 must be a URL
Note 2: Up to 5 columns are supported for dynamic input population

[Screenshot: batch crawling]

This is how the input data is stored in the scraping agent file (*.scraping):

"Data": [
      {
        "COLUMN1": "https://cdn.datascraping.co/sample_content/simple-list.html",
        "COLUMN2": "",
        "COLUMN3": "",
        "COLUMN4": "",
        "COLUMN5": ""
      },
      {
        "COLUMN1": "https://cdn.datascraping.co/sample_content/links.html",
        "COLUMN2": "",
        "COLUMN3": "",
        "COLUMN4": "",
        "COLUMN5": ""
      }
    ]
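If you maintain your URL lists outside the app, a small script can rewrite this section of the agent file. Below is a minimal sketch, assuming the *.scraping file is plain JSON with a top-level "Data" array as shown above and that the other agent settings in the file should be left untouched; the file name MyAgent.scraping and the URLs are only examples.

# Minimal sketch: replace the "Data" array of a scraping agent file with a new URL list.
# Assumes the *.scraping file is plain JSON; MyAgent.scraping is a placeholder name.
import json

urls = [
    "https://cdn.datascraping.co/sample_content/simple-list.html",
    "https://cdn.datascraping.co/sample_content/links.html",
]

with open("MyAgent.scraping", "r", encoding="utf-8") as f:
    agent = json.load(f)

# COLUMN1 must be the URL; the remaining columns stay empty unless you use them.
agent["Data"] = [
    {"COLUMN1": u, "COLUMN2": "", "COLUMN3": "", "COLUMN4": "", "COLUMN5": ""}
    for u in urls
]

with open("MyAgent.scraping", "w", encoding="utf-8") as f:
    json.dump(agent, f, indent=2)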

Read Inputs from Local CSV/TSV file

Select the "Local File" radio button in agent setup and browse the local file containing URLs (e.g. C:/MyData/Inputs.csv ) and then select the File type (CSV, TSV, JSON). Data Scraping Studio will read the input file run time to traverse those pages and scrape data. You can also see the preview by clicking on the "Preview button" will also test the format, syntax errors etc.

Note: COLUMN1 must be a URL

[Screenshot: input from CSV for scraping]
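If you need to produce such a file, here is a minimal sketch (not a Data Scraping Studio utility) that writes an Inputs.csv in this layout, with COLUMN1 holding the URL to crawl; the file path and URLs are examples, and the COLUMN1-COLUMN5 header row is an assumption you can drop if your agent expects header-less files.

# Minimal sketch: write Inputs.csv with COLUMN1 as the URL column.
import csv

urls = [
    "https://cdn.datascraping.co/sample_content/simple-list.html",
    "https://cdn.datascraping.co/sample_content/links.html",
]

with open("C:/MyData/Inputs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["COLUMN1", "COLUMN2", "COLUMN3", "COLUMN4", "COLUMN5"]
    )
    writer.writeheader()
    writer.writerows(
        {"COLUMN1": u, "COLUMN2": "", "COLUMN3": "", "COLUMN4": "", "COLUMN5": ""}
        for u in urls
    )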

Include (Column*) in output: This option is used to display/include the input field in the output window. For example, if you have a list of 100 URLs and want to extract the heading from each page while also seeing the input URL in the output, just select 1 from the drop-down to display the input URL (COLUMN1) in the output window.

HEADING (extracted field)    COLUMN1 (populated from input)
heading 1                    uri1
heading 2                    uri2
heading 3                    uri3
heading 4                    uri4

Read Inputs from Web API

Select the "From Web" option radio button in agent setup and enter the URL of your JSON API endpoint. Data scraping studio will read the inputs run-time when executed.

[Screenshot: inputs from web API for batch crawling]

Sample JSON format

  1. COLUMN1 must be a URL
  2. The JSON format must be an array, as shown in the sample below:

[
  {
    "COLUMN1": "https://cdn.datascraping.co/sample_content/simple-list.html",
    "COLUMN2": "",
    "COLUMN3": "",
    "COLUMN4": "",
    "COLUMN5": ""
  },
  {
    "COLUMN1": "https://cdn.datascraping.co/sample_content/links.html",
    "COLUMN2": "",
    "COLUMN3": "",
    "COLUMN4": "",
    "COLUMN5": ""
  }
]
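If you do not already have an API, a throwaway endpoint is enough for testing. The sketch below (a hypothetical local server, not part of Data Scraping Studio) serves the array above as JSON on port 8000; the host, port, and URLs are examples, and you would point the "From Web" URL at wherever you host it.

# Minimal sketch: a local HTTP endpoint that returns the input rows as a JSON array.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Input rows in the required array format; COLUMN1 carries the URL to crawl.
INPUTS = [
    {"COLUMN1": "https://cdn.datascraping.co/sample_content/simple-list.html",
     "COLUMN2": "", "COLUMN3": "", "COLUMN4": "", "COLUMN5": ""},
    {"COLUMN1": "https://cdn.datascraping.co/sample_content/links.html",
     "COLUMN2": "", "COLUMN3": "", "COLUMN4": "", "COLUMN5": ""},
]

class InputHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every GET returns the full input array as application/json.
        body = json.dumps(INPUTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), InputHandler).serve_forever()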
