HTML Scraping with Java

One rainy Sunday afternoon, since I couldn't go out anywhere, I decided to create an organized Excel file (for now) listing the birds commonly found in the Philippines. I found a good site to start with, http://www.birding2asia.com/tours/reports/PhilFeb2010_list.html, but I didn't want to copy each detail into an Excel document by hand because that would take too long. So I searched the internet for HTML scraping tools. I've used Html Agility Pack for .NET before, and I'd probably use it again if I were working with .NET, but today I wanted to do it in Java.

Here's a list of the most used HTML scrapers for different programming languages: http://stackoverflow.com/questions/2861/options-for-html-scraping

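I chose jsoup for Java since it's the simplest to implement and has minimal dependencies compared to HtmlUnit and the rest.
Here's how I've written my implementation. It reads a copy of the page saved locally as input.html and writes one comma-separated line per bird to out.txt: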
package org.ipiel.ipielHtmlParser;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Reads the saved bird list (input.html) and writes one
 * comma-separated line per bird to out.txt.
 *
 * @author Edward P. Legaspi
 * @since Jul 29, 2012
 **/
public class JsoupParserImpl {
 public static void main(String args[]) {
  try {
   new JsoupParserImpl();
  } catch (IOException e) {
   e.printStackTrace();
  }
 }

 public JsoupParserImpl() throws IOException {
  File input = new File("input.html");
  Document doc = Jsoup.parse(input, "UTF-8");
  // The paragraphs come in groups of three: names, location, spacer.
  Elements paragraphs = doc.select("p[class=MsoNormal]");
  Iterator<Element> ite = paragraphs.iterator();

  // try-with-resources closes the writer even if parsing fails midway
  try (PrintWriter pw = new PrintWriter("out.txt", "UTF-8")) {
   while (ite.hasNext()) {
    Element bird = ite.next(); // comm name + sci name
    Element birdName = bird.select("span[class=comname]").first();
    Element sciName = bird.select("span[class=sciname]").first();
    // first() returns null when the bird is not marked as endemic
    Element endemic = bird.select("span[class=endemic]").first();

    if (birdName == null || sciName == null || !ite.hasNext()) {
     continue; // not a bird entry, or the location paragraph is missing
    }
    Element location = ite.next(); // where found

    String out = birdName.text().trim() + "," + sciName.text() + ","
      + ((endemic != null) ? endemic.text() : "") + "," + location.text();

    System.out.println(out);
    pw.println(out);

    if (ite.hasNext()) {
     ite.next(); // spacer
    }
   }
  }
 }
}
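
One thing to watch for: the program above joins the fields with plain commas, so a location such as "Luzon, Mindanao" would be split into two columns when the file is opened in Excel. Below is a minimal sketch of one way to quote such fields, following the usual CSV convention of wrapping the field in double quotes and doubling any embedded quotes. The CsvLine class and its escape and join methods are hypothetical helpers, not part of jsoup or the original program, and the sample values in main are made up.

```java
import java.util.StringJoiner;

/** Hypothetical helper, not part of jsoup or the original program. */
final class CsvLine {

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Double any embedded quotes, then wrap the whole field in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the escaped fields with commas into a single CSV line. */
    static String join(String... fields) {
        StringJoiner line = new StringJoiner(",");
        for (String field : fields) {
            line.add(escape(field));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the quoting.
        System.out.println(join("Sample Bird", "Genus species", "", "Luzon, Mindanao"));
    }
}
```

With something like this in place, the line that builds out could call CsvLine.join(...) on the four text values instead of concatenating them with "," by hand.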
Reviewed by Edward Legaspi on Sunday, July 29, 2012
