LinkLog: Automated Sentiment Analysis

From Five Myths About Automatic Sentiment Analysis

Sentiment analysis using natural language processing. Yes, it is done by a machine and no, it’s not 100% accurate. The industry estimates that it’s at 70 to 80%. We are very open about that and recommend that it be used as an overview.

It would take hours to manually review the same amount and one still wouldn’t have an overall sense of the percentage positive vs negative.

Research: How Do You Find What Blogs to Read?

Research question: What blogs should you read to be up to date on newsworthy stories?

Given a budget of 100 blogs, the biggest bang for the buck belonged to the popular Instapundit blog, which featured more than 4,500 postings throughout the year. Assuming a budget of 5,000 posts, however, the top-scoring blog was the less well-known sisu site, which featured only 331 posts for all of 2006.

How do you find something like this? How do you even go looking for this information?

During the past couple of weeks, I have been reading about Sentiment Analysis. I started with a post by Seth Grimes on the Text Analytics mailing list and followed the links to many good posts. I read a few and understood the concept. It is a fascinating idea.

I came across this article today. It covers a different topic, but it is very useful. How do you find what blogs to read? Researchers at Carnegie Mellon created an algorithm called Cascades.

A team of researchers and graduate students from Carnegie Mellon eventually created a complex mathematical equation called the cost-effective lazy forward-selection algorithm, later dubbed the Cascades algorithm for simplicity’s sake.

One part seeks to maximize reward, in this case detecting the most news in the least amount of time. Within the algorithm, that reward concept is captured by tallying the number of people who read a news item after it appears on a specific blog. If 10 million people read a story after its initial posting on Blog A but only 1,000 had read it beforehand, the story would be deemed both newsworthy and early-breaking for Blog A’s readers.

A second part of the algorithm seeks to minimize cost, namely the inordinate time that could be spent reading blogs. The team also exploited a mathematical relationship known as the law of diminishing returns.
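
To make the two parts concrete, here is a minimal Python sketch of a lazy greedy selection under a budget. It is only an illustration of the general idea described above, not the researchers' implementation. The blog names, story IDs, post counts, and the function name lazy_greedy are all made up for the example, and the reward is simplified to "number of distinct stories covered" instead of a readership tally. Diminishing returns is what makes the lazy step safe: a blog's extra contribution can only shrink as more blogs are chosen, so an old estimate is always an upper bound.

```python
import heapq


def lazy_greedy(stories_by_blog, cost, budget):
    """Pick blogs by reward-per-cost, re-evaluating gains only when needed.

    stories_by_blog: dict mapping blog name -> set of story IDs it covers
    cost:            dict mapping blog name -> number of posts to read
    budget:          maximum total number of posts we are willing to read
    """
    selected, covered, spent = [], set(), 0
    round_no = 0  # increases each time a blog is selected

    # Max-heap via negated ratios: (-gain/cost, blog, round the gain was computed in)
    heap = [(-len(s) / cost[b], b, 0) for b, s in stories_by_blog.items()]
    heapq.heapify(heap)

    while heap:
        _, blog, computed_at = heapq.heappop(heap)
        if spent + cost[blog] > budget:
            continue  # too expensive now, and spending never goes down
        if computed_at == round_no:
            # The gain is up to date, and every other entry is an upper
            # bound that is no larger, so this blog is the best choice.
            selected.append(blog)
            covered |= stories_by_blog[blog]
            spent += cost[blog]
            round_no += 1
        else:
            # Stale estimate: recompute the marginal gain and push it back.
            gain = len(stories_by_blog[blog] - covered)
            if gain > 0:
                heapq.heappush(heap, (-gain / cost[blog], blog, round_no))

    return selected, len(covered), spent


# Hypothetical data: four blogs, six stories, cost measured in posts.
example_stories = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {1, 2}, "D": {6}}
example_cost = {"A": 4, "B": 1, "C": 1, "D": 2}

chosen, n_covered, n_posts = lazy_greedy(example_stories, example_cost, budget=3)
print(chosen, n_covered, n_posts)
```

In this toy data, blog A covers the most stories but costs more posts than the whole budget, so two small blogs win. That mirrors the point in the article: with a budget counted in posts, a less prolific blog can be the better buy.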

The Cascades algorithm is useful not only for detecting newsworthy blogs but also for detecting water pollution. The sensors are just different.