Extract data from a lost site using archive.org

I've tried to use the hosted app to extract data from a lost site: https://web.archive.org/web/20160110033729/http://persianpreneur.com/ . The data is inside a div that appears when hovering over the images, and the result looks like the screenshot below.

The agent ID is: 20fce2dc52

[Screenshot: archive.org scraping result]

Posted by Hadi Farnoud 10 months ago


The data was extracted, but because of inner HTML tags, spaces, and line breaks, each value was spread across several lines instead of a single line. Downloading the CSV confirmed this (screenshot below).

[Screenshot: line breaks in the scraped website data]

And this is how the .description element looks on the website:

[Screenshot: HTML of the scraped website]

To fix the issue, the Clean and Trim functions have been enabled in the scraping agent. They will also be enabled in every agent going forward, to tidy up the data when there are line breaks or extra spaces caused by inner HTML in the web page.
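For anyone post-processing the data on their own, here is a rough sketch of what a clean-and-trim step could look like in Python. It is not the agent's actual implementation. The function name `clean_and_trim` and the sample markup are made up for illustration, and only the standard library is used. It drops the inner HTML tags, then collapses runs of spaces and line breaks into a single space so each value fits on one CSV line.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (tags are dropped)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_and_trim(raw_html: str) -> str:
    """Hypothetical helper: strip inner tags, collapse whitespace, trim ends."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    # Join text nodes with a space so adjacent blocks do not run together.
    text = " ".join(parser.parts)
    # Collapse spaces, tabs and line breaks into one space, then trim.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample resembling a multi-line .description block.
sample = """
<div class="description">
    <h3>Some   title</h3>
    <p>First line
       second line</p>
</div>
"""
print(clean_and_trim(sample))
```

Joining the text nodes with a space keeps separate blocks apart, but it also puts a space where an inline tag splits a word, so adjust that if your markup needs it.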

[Screenshot: online web scraper for archive.org]

Posted by anonymous 10 months ago

Topic closed! This question is closed and doesn't accept new posts.
