Analyzing Unstructured Content

I am reading the UIMA overview document. It is a fascinating description of an architecture for analyzing unstructured documents. In analyzing unstructured content, UIMA-based applications make use of a variety of analysis technologies, including:

• Statistical and rule-based Natural Language Processing (NLP)
• Information Retrieval (IR)
• Machine learning
• Ontologies
• Automated reasoning
• Knowledge sources (e.g., CYC, WordNet, FrameNet)

As the amount of unstructured information increases, it becomes important to make sense of it. The type of analysis is normally domain- and application-specific. You can take a collection of related documents and derive several analysis views from it, using different analysis engines depending on the type of analysis you want.

Let us take a current topic: Apple vs Samsung. If we gather a set of news items from the time the case started, we can analyze them in different ways.

  • An analysis of innovation, covering the levels of innovation and what counts as an innovation and what does not
  • An analysis of patents which may be useful to other vendors of smart phones and tablets
  • An analysis of human interest stories from both companies (and the style of product management)
  • An analysis of product development processes

Same documents, different views, depending on what interests you. This is a fascinating area.
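To make the "same documents, different views" idea concrete, here is a minimal sketch in Python. It does not use the UIMA API. The names `Document`, `keyword_engine`, and `build_views` are hypothetical, the sample news text is made up, and each "engine" is just a keyword matcher standing in for a real analysis engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Document:
    doc_id: str
    text: str


# In this toy model an "engine" maps a document to the terms it found.
# This is an assumption for illustration, not how UIMA defines an engine.
Engine = Callable[[Document], List[str]]


def keyword_engine(keywords: List[str]) -> Engine:
    """Build a toy engine that reports which keywords occur in a document."""
    lowered = [k.lower() for k in keywords]

    def analyze(doc: Document) -> List[str]:
        text = doc.text.lower()
        return [k for k in lowered if k in text]

    return analyze


def build_views(
    docs: List[Document], engines: Dict[str, Engine]
) -> Dict[str, Dict[str, List[str]]]:
    """Run every engine over the same documents, giving one view per engine."""
    return {
        view_name: {doc.doc_id: engine(doc) for doc in docs}
        for view_name, engine in engines.items()
    }


# Made-up sample news items.
docs = [
    Document("n1", "The jury reviewed design patents covering the phone's rounded corners."),
    Document("n2", "Engineers described the product development process behind the tablet."),
]

# Two interests, two engines, one shared document collection.
engines = {
    "patents": keyword_engine(["patent", "jury"]),
    "process": keyword_engine(["development process", "engineers"]),
}

views = build_views(docs, engines)
for name, view in views.items():
    print(name, view)
```

Each entry in `views` is a separate reading of the same two documents. Adding a new interest, such as human interest stories, would only mean registering another engine under a new name.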

The UIMA overview document explains how to develop simple (primitive) and aggregate analysis engines. I found this document gripping, which is not a term you normally associate with technical documents. It not only explains the conceptual thinking behind UIMA, but also triggers several ideas and thoughts for further reading.

 
