HTML Scraping with Java

One rainy Sunday afternoon, since I couldn't get out to go anywhere, I decided to create an organized Excel file (for now) listing the birds commonly found in the Philippines. I found a good site to start with, but I didn't want to copy each detail into an Excel document by hand because that would take too long. So I searched the internet for HTML scraping tools. I've used Html Agility Pack for .NET before, and I'd still use it if I were working with .NET again, but I wanted to do it in Java today.

Here's a list of the most used HTML scrapers for different programming languages:

I chose jsoup for Java since it's the simplest to implement, with minimal dependencies compared to HtmlUnit and the rest.
Here's how I wrote my implementation:
package org.ipiel.ipielHtmlParser;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * @author Edward P. Legaspi
 * @since Jul 29, 2012
 */
public class JsoupParserImpl {
 public static void main(String args[]) {
  try {
   new JsoupParserImpl();
  } catch (IOException e) {
   e.printStackTrace();
  }
 }

 public JsoupParserImpl() throws IOException {
  File input = new File("input.html");
  Document doc = Jsoup.parse(input, "UTF-8");
  // The page lays its paragraphs out in groups of three: names, location, spacer
  Elements birdNames = doc.select("p[class=MsoNormal]");
  Iterator<Element> ite = birdNames.iterator();
  PrintWriter pw = new PrintWriter(new FileOutputStream("out.txt"));

  while (ite.hasNext()) {
   Element bird = ite.next(); // comm name + sci name
   Element birdName = bird.select("span[class=comname]").first();
   Element sciName = bird.select("span[class=sciname]").first();
   List<Element> endemics = bird.select("span[class=endemic]");
   Element endemic = null;
   if (endemics.size() > 0) {
    endemic = endemics.get(0);
   }
   Element location = ite.next(); // where found
   String out = birdName.text().trim() + "," + sciName.text() + "," + ((endemic != null) ? endemic.text() : "") + "," + location.text();
   pw.write(out);
   pw.write("\n");
   if (ite.hasNext()) {
    ite.next(); // spacer
   }
  }
  pw.close(); // flush the buffered output to out.txt
 }
}
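One thing to watch for when joining fields with a plain "," as above: a location such as "Luzon, Samar" contains commas of its own, so Excel would split it across several columns. The sketch below is one possible way to quote such fields before writing the line. The class name CsvLine and its methods are hypothetical helpers, not part of jsoup or of the original implementation, and the bird row in main is made-up sample data.

```java
import java.util.StringJoiner;

// Hypothetical helper (not part of jsoup): builds one CSV line from raw fields.
final class CsvLine {

    // Wraps a field in double quotes when it contains a comma, a quote,
    // or a line break, doubling any embedded quotes (RFC 4180 style).
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Escapes each field and joins them with commas.
    static String join(String... fields) {
        StringJoiner joiner = new StringJoiner(",");
        for (String field : fields) {
            joiner.add(escape(field));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Sample row for illustration only
        System.out.println(join("Philippine Eagle", "Pithecophaga jefferyi",
                "endemic", "Luzon, Samar, Leyte, Mindanao"));
    }
}
```

In the parser, the string concatenation that builds out could then be swapped for something like CsvLine.join(birdName.text().trim(), sciName.text(), (endemic != null) ? endemic.text() : "", location.text()), assuming the helper sits on the same classpath.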
Reviewed by Edward Legaspi on Sunday, July 29, 2012
