Information, Intelligence, Knowledge

Web Data Extraction, XPath and XQuery

In his article on semantic screenscraping, Jon Udell describes how to use XPath to scrape data from web pages and combine it with open APIs such as the Google Maps API.

Regular expressions once dominated my screenscraping code. Now XPath expressions do. Screenscraping is becoming more declarative, more query-like.

Jon outlines three developments that make it easier to obtain data from the web:

1. HTML is readily convertible to XHTML

2. The resulting XHTML is semantically richer

3. XPath and XQuery are maturing to the point where they are very useful in extracting information
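
To make the three points above concrete, here is a minimal sketch of XPath-style extraction in Python. It is not Jon's code. The XHTML snippet, the class names (event, title, location) and the extract_events helper are all made up for illustration, and it assumes the page has already been converted to well-formed XHTML. It uses only the limited XPath subset that the standard library's ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, already converted to well-formed XHTML and inlined
# here so the sketch runs without network access.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="event">
      <span class="title">Sample Meetup</span>
      <span class="location">Springfield</span>
    </div>
    <div class="sidebar">
      <span class="title">Not an event</span>
    </div>
    <div class="event">
      <span class="title">Another Meetup</span>
      <span class="location">Riverside</span>
    </div>
  </body>
</html>"""

# XHTML elements live in this namespace, so each query step needs a prefix.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_events(xhtml):
    """Return a list of {'title', 'location'} dicts, one per event div."""
    root = ET.fromstring(xhtml)
    events = []
    # One declarative query selects every event block; no regex needed.
    for div in root.findall(".//x:div[@class='event']", NS):
        title = div.findtext("x:span[@class='title']", default="", namespaces=NS)
        location = div.findtext("x:span[@class='location']", default="", namespaces=NS)
        events.append({"title": title.strip(), "location": location.strip()})
    return events


events = extract_events(XHTML)
for event in events:
    print(event["title"], "@", event["location"])
```

The query picks out the two event blocks and skips the sidebar, which is the kind of selection that used to take a fragile regular expression. The resulting dicts could then be passed to a mapping API or merged into another page.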

This topic is very interesting to me. About five years ago, we attempted a product called Information Integrator. The goal was to step interactively through web pages, mark the portions you were interested in, convert them to XML, and integrate them into a single page, so that your home page would be a set of transclusions. After a few attempts working with tidy and a mapper UI, we gave it up in favor of our current InfoMinder product. I still think a variation of that idea has some merit.