HTML Table Scraping

Learn how to extract data from an HTML table using Data Scraping Studio and export the output in CSV format.


In this demo we will extract an HTML table from a web page and export the data in CSV format.

Step 1 : Create a new web scraping agent and name it "HTML_Table_extraction".

Step 2 : Now we will review the HTML source code.

<!DOCTYPE html>
<html>
<head>
<style>
table, th, td { border: 1px solid black; border-collapse: collapse; }
th, td { padding: 15px; }
</style>
</head>
<body>
<h1>HTML Table Extraction</h1>
<table style="width:60%">
  <tr>
    <td>Jill</td>
    <td>Smith</td>		
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>		
    <td>94</td>
  </tr>
  <tr>
    <td>John</td>
    <td>Doe</td>		
    <td>80</td>
  </tr>
  <tr>
    <td>Altay</td>
    <td>Doe</td>		
    <td>30</td>
  </tr>
  <tr>
    <td>Nick</td>
    <td>Smith</td>		
    <td>34</td>
  </tr>
  <tr>
    <td>Rob</td>
    <td>Milbern</td>		
    <td>45</td>
  </tr>
  <tr>
    <td>Scoot</td>
    <td>Sam</td>		
    <td>65</td>
  </tr>
</table>
</body>
</html>

Source: https://cdn.datascraping.co/sample_content/html-table.html

Step 3 : Now we will analyze the HTML source code and write a regex that matches the pattern of the items we need to extract.

You can use the built-in regex tester tool to test the pattern, or any other regex tester of your choice.

I used the Rubular online tool to write and test the regex, and found that the regular expression below works for this table.

<tr>\s*<td>([^<]+)<\/td>\s*<td>([^<]+)<\/td>\s*<td>([^<]+)<\/td>\s*<\/tr>
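
If you would rather check the pattern outside a regex tester, the sketch below applies the same expression to two rows copied from the sample page. It is only an illustration using Python's standard `re` module, not part of Data Scraping Studio; the names `ROW_PATTERN` and `SAMPLE_HTML` are made up for this example.

```python
import re

# The pattern from this tutorial: one <tr> holding exactly three <td> cells.
# \s* absorbs the newlines, spaces, and tabs between the tags.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>([^<]+)<\/td>\s*<td>([^<]+)<\/td>\s*<td>([^<]+)<\/td>\s*<\/tr>"
)

# Two rows copied from the sample page, including the trailing tabs
# that follow the second cell of each row.
SAMPLE_HTML = """
<table style="width:60%">
  <tr>
    <td>Jill</td>
    <td>Smith</td>\t\t
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>\t\t
    <td>94</td>
  </tr>
</table>
"""

# findall returns one tuple per matched row, one element per capture group.
rows = ROW_PATTERN.findall(SAMPLE_HTML)
for row in rows:
    print(row)
# ('Jill', 'Smith', '50')
# ('Eve', 'Jackson', '94')
```

Each tuple has three elements because the pattern has three parenthesized groups, one per table cell.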

Now paste this regex into the new web scraping agent for each of COLUMN1, COLUMN2, and COLUMN3, and set the item number to 1, 2, and 3 for the corresponding column.

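The item number in this setup grid is the position of the capture group in the regex: item 1 is the first parenthesized group, item 2 the second, and item 3 the third.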
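
To make the mapping between item numbers and capture groups concrete, here is a minimal Python sketch that mimics the setup grid and writes the result as CSV. It assumes the agent behaves like a plain regex match with numbered groups; the `FIELDS` dictionary and the `extract_rows` and `to_csv` helpers are hypothetical and are not the Data Scraping Studio API.

```python
import csv
import io
import re

# Same pattern as in the tutorial.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>([^<]+)<\/td>\s*<td>([^<]+)<\/td>\s*<td>([^<]+)<\/td>\s*<\/tr>"
)

# Hypothetical stand-in for the setup grid: column name -> item number.
FIELDS = {"COLUMN1": 1, "COLUMN2": 2, "COLUMN3": 3}


def extract_rows(html):
    """Return one dict per table row, keyed by column name."""
    rows = []
    for match in ROW_PATTERN.finditer(html):
        # match.group(n) is the text captured by the n-th parenthesized group.
        rows.append({name: match.group(item) for name, item in FIELDS.items()})
    return rows


def to_csv(rows):
    """Serialize the extracted rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELDS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One row from the sample page.
html = "<tr>\n    <td>John</td>\n    <td>Doe</td>\t\t\n    <td>80</td>\n  </tr>"
print(to_csv(extract_rows(html)))
# COLUMN1,COLUMN2,COLUMN3
# John,Doe,80
```

Changing an item number in `FIELDS` changes which cell lands in that column, which is the same effect as editing the item number in the grid.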

Step 4 : Click on Save to create.

Step 5 : Right-click and execute this agent.
