Multiple Output Collections in Scraping

Starting October 15, 2016, you can use the DataScraping.co edit agent feature to structure your scraping agent's data into models based on the elements you select on the page. This means that, using the hosted application, you can group like elements together in collections. For example, if you are looking at this PriceTree product page, the product information (title, price, image, rating) will be in a different collection from the product specifications (feature_name, feature_value), and a third collection will hold the price comparison information (store, price, shipping, store_url).

So the data model will look like the one below, and executing this agent will produce 3 output files, e.g. product_details.csv, spec_details.csv and price_comparison.csv.

[Screenshot: collections relationship in scraping]

Note: I have included url in each collection to create a relationship, which can later be used for a VLOOKUP in Excel or a join in SQL to relate the product data to its specifications or price comparison.
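As a rough illustration of that relationship, here is a minimal Python sketch that joins two of the output files on the shared url column, much as a VLOOKUP or SQL join would. The sample rows, the example.com URLs and the function name join_on_url are made up for this sketch; real output files will have more columns.

```python
import csv
import io

# Hypothetical sample rows standing in for product_details.csv and spec_details.csv.
PRODUCTS_CSV = """url,title,price
http://example.com/p1,Phone A,100
http://example.com/p2,Phone B,200
"""
SPECS_CSV = """url,feature_name,feature_value
http://example.com/p1,RAM,1 GB
http://example.com/p1,Display,4.7 inch
http://example.com/p2,RAM,2 GB
"""

def join_on_url(products_text, specs_text):
    """Attach each spec row to its product row using the shared url column."""
    # Index the product rows by url so each spec row can look up its product.
    products = {row["url"]: row for row in csv.DictReader(io.StringIO(products_text))}
    joined = []
    for spec in csv.DictReader(io.StringIO(specs_text)):
        product = products.get(spec["url"])
        if product is None:
            continue  # spec row without a matching product is skipped
        joined.append({**product, **spec})
    return joined

rows = join_on_url(PRODUCTS_CSV, SPECS_CSV)
for row in rows:
    print(row["title"], row["feature_name"], row["feature_value"])
```

With real files you would read them from disk with open() instead of io.StringIO. Note that price_comparison also has a field called price, so joining it with product_details this way would need one of the two columns renamed first.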

If you'd like to adjust the data model to create a new collection or merge collections and fields, navigate to edit agent and then select the "Collections" tab. There you can drag and drop the fields into any collection you want, add new fields, change selectors and more, as in the screenshot below.

The default names given to collections are Collection1, Collection2, and so on. You can also give your collections friendly names of your choice, as I did in this example by renaming Collection1 to product_details, and likewise for the other data collections. When you move two fields into the same collection, they are forced to associate within that collection. For example, if we move the rating field from the product_details collection to the price_comparison collection, it will be associated with the price_comparison collection. So you can move any field to any collection manually to change your data model.

[Screenshot: multiple collections in scraping]

And this is what the Collections array will look like in the scraping agent.

    "Collections": [
      {
        "Name": "product_details",
        "Fields": [
          {
            "Name": "title",
            "Selector": "#right-content h1",
            "Extract": "TEXT",
            "Attribute": null,
            "CollectionType": "CSS",
            "From": null,
            "Visible": null,
            "PostProcessing": null
          },
          {
            "Name": "price",
            "Selector": "meta[itemprop=lowPrice]",
            "Extract": "ATTR",
            "Attribute": "content",
            "CollectionType": "CSS",
            "From": null,
            "Visible": null,
            "PostProcessing": null
          },
          {
            "Name": "image",
            "Selector": ".product-image div div img",
            "Extract": "ATTR",
            "Attribute": "src",
            "CollectionType": "CSS",
            "From": null,
            "Visible": null,
            "PostProcessing": null
          },
          {
            "Name": "rating",
            "Selector": ".fg-color-lightGray",
            "Extract": "TEXT",
            "Attribute": null,
            "CollectionType": "CSS",
            "From": null,
            "Visible": null,
            "PostProcessing": null
          }
        ]
      },
      {
        "Name": "spec_details",
        "Fields": [
          {
            "Name": "feature_name",
            "Selector": ".spec-table-th",
            "Extract": "TEXT",
            "Attribute": null,
            "CollectionType": "CSS",
            "From": null,
            "Visible": null,
            "PostProcessing": null
          },
          {
            "Name": "feature_value",
            "Selector": ".spec-table-th+ .text-left",
            "Extract": "TEXT",
            "Attribute": null,
            "CollectionType": "CSS",
            "From": null,
            "Visible": null,
            "PostProcessing": null
          }
        ]
      },
      {
        "Name": "price_comparison",
        "Fields": [
          {
            "Name": "store",
            "Selector": "#PriceCompareTable .lazy",
            "Extract": "ATTR",
            "Attribute": "data-original",
            "CollectionType": "CSS",
            "From": null,
            "Visible": true,
            "PostProcessing": []
          },
          {
            "Name": "price",
            "Selector": ".fg-color-red",
            "Extract": "TEXT",
            "Attribute": null,
            "CollectionType": "CSS",
            "From": null,
            "Visible": true,
            "PostProcessing": []
          },
          {
            "Name": "shipping",
            "Selector": "#PriceCompareTable td:nth-child(3)",
            "Extract": "TEXT",
            "Attribute": null,
            "CollectionType": "CSS",
            "From": null,
            "Visible": true,
            "PostProcessing": []
          },
          {
            "Name": "store_url",
            "Selector": "#PriceCompareTable .ios-button-white",
            "Extract": "ATTR",
            "Attribute": "href",
            "CollectionType": "CSS",
            "From": null,
            "Visible": true,
            "PostProcessing": []
          }
        ]
      }
    ]
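To double-check a data model like this outside the editor, a short Python sketch can list each collection with its fields. This is only an illustration: the JSON below is a trimmed copy of the array above, wrapped in an object so that it parses on its own, and output_columns is a hypothetical helper, not part of DataScraping.co.

```python
import json

# Trimmed copy of the Collections array (only the field names are kept),
# wrapped in an object so it is valid standalone JSON.
AGENT_JSON = """
{
  "Collections": [
    {"Name": "product_details",
     "Fields": [{"Name": "title"}, {"Name": "price"}, {"Name": "image"}, {"Name": "rating"}]},
    {"Name": "spec_details",
     "Fields": [{"Name": "feature_name"}, {"Name": "feature_value"}]},
    {"Name": "price_comparison",
     "Fields": [{"Name": "store"}, {"Name": "price"}, {"Name": "shipping"}, {"Name": "store_url"}]}
  ]
}
"""

def output_columns(agent):
    """Map each collection name to the ordered list of its field names."""
    return {
        collection["Name"]: [field["Name"] for field in collection["Fields"]]
        for collection in agent["Collections"]
    }

columns = output_columns(json.loads(AGENT_JSON))
for name, fields in columns.items():
    # One output file per collection, named after the collection.
    print(f"{name}.csv -> {', '.join(fields)}")
```

Each key corresponds to one output file, so a field that ends up in the wrong collection is easy to spot before running the agent.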

Now just enter some URLs in the agent setup and start the scraping.

I'm entering the 2 URLs below in the setup for this sample test job.

http://www.pricetree.com/mobile/apple-iphone-6-price-756
http://www.pricetree.com/mobile/samsung-galaxy-s4-price-10822

[Screenshot: running online web scraper]

Once the job is completed, go back to the Output tab and you will see that 3 output tabs have been created, one for each collection in the agent we've created, with each collection's data under its own tab.

You can click on the corresponding tab to see or download the product specifications and price comparison data in CSV, TSV or JSON format.

[Screenshot: collection data extracted]

So the multiple collection feature is useful for several reasons:

  • Structured data - You can extract product details in a better, well-structured format by grouping like fields together, which makes the scraped data easier to manage and to use in production applications.
  • Faster crawling - We extract 3 collections with just one web request and the already parsed HTML, which makes the overall process very fast.
  • Re-usability - The existing web request is reused to parse the additional data fields.

Now, if we look at the specification or price comparison data by downloading the CSV or just clicking on its tab, you'll notice that it's not clear whether a particular record was extracted from page 1 or page 2. For example, in the screenshot below there are 87 rows extracted from the two products' specifications, and it's not clear which specifications belong to the iPhone (page 1) and which to the Samsung Galaxy (page 2).

[Screenshot: collection data extracted]

So to solve this I'm going to add one more field, url, to each of the 3 collections, which can be used to tell whether the data was extracted from page 1 or page 2. To add the new field, I went to one of the webpages and found the URL in the HTML canonical tag.

[Screenshot: html scraper]

If we look at this in the Chrome extension, the selector will be like below:

[Screenshot: link canonical scraping]
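The selector itself is only visible in the screenshot, so it is not reproduced here. As a separate illustration of what the canonical tag holds, this minimal Python sketch reads the canonical URL from a page's HTML using only the standard library. The sample HTML, the example.com URL and the names CanonicalFinder and find_canonical are assumptions made for this sketch; it is not how the scraping agent itself works.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")

def find_canonical(html):
    """Return the canonical URL of an HTML document, or None if it has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# Made-up page head used only to exercise the parser.
SAMPLE_HTML = """<html><head>
<title>Sample product</title>
<link rel="stylesheet" href="/site.css">
<link rel="canonical" href="http://example.com/mobile/sample-phone-price-1">
</head><body></body></html>"""

print(find_canonical(SAMPLE_HTML))
```

Because the canonical URL appears once per page, it makes a convenient key for telling the rows of page 1 and page 2 apart.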

We can also use the built-in field REQUEST_URL here to see the page URL in all three output collections.

And finally I clicked on edit agent and added a new field named url. Then I set the post-processing function (AutoFillBlankCells: true) to fill in the blank cells, because in the specification table there are 87 matching records (around 40 from each page) but only one url per page. So I want the url to be repeated in the blank cells.

[Screenshot: add new CSS selector field in scraper]
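To make the effect of filling blank cells concrete, here is a small Python sketch of the behaviour as I understand it from the description above: the last non-blank url is copied down into the blank cells that follow it. This is an assumption about what AutoFillBlankCells does, not the platform's actual implementation, and the sample rows and the function name fill_blank_cells are invented for the example.

```python
def fill_blank_cells(rows, column):
    """Copy the last non-blank value of `column` down into the blank cells below it."""
    filled = []
    last_value = ""
    for row in rows:
        value = row.get(column) or ""
        if value.strip():
            last_value = value  # a new non-blank value starts a new run
        filled.append({**row, column: last_value})
    return filled

# Invented sample: the url matches once per page, the spec rows many times.
spec_rows = [
    {"url": "http://example.com/p1", "feature_name": "RAM", "feature_value": "1 GB"},
    {"url": "", "feature_name": "Display", "feature_value": "4.7 inch"},
    {"url": "http://example.com/p2", "feature_name": "RAM", "feature_value": "2 GB"},
    {"url": "", "feature_name": "Display", "feature_value": "5 inch"},
]

filled_rows = fill_blank_cells(spec_rows, "url")
for row in filled_rows:
    print(row["url"], row["feature_name"], row["feature_value"])
```

After filling, every specification row carries the url of the page it came from, which is what makes the later join against product_details possible.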

Note: The multiple collection feature is available on the hosted application only, and not on the desktop app.

Want to try out this feature? Click here to download this example scraping agent.
