Web Data Mining

One of my articles on Web Data Mining appeared in i.t.magazine. They were kind enough to permit me to make it available from my blog.

Almost all of us need information. A lot of information is freely available on the Web. Learning a few techniques on how to mine information on the Web is a useful skill. Here are some sample usage scenarios:

  • You are an entrepreneur who is planning to start a new software business. You hear that Web 2.0 and social applications are hot. You want to do some research to understand the marketplace, and want to prototype a few product ideas.
  • You are part of the CTO office of a software company, and are interested in short-, medium-, and long-term technology and business trends in your industry. You need this information to build skills in your organization, and to build a few concept prototypes.
  • You are part of the CIO office of an organization. You need to balance early adoption of technologies with providing a stable environment for your business; you don’t want to jump at every new technology. In addition to finding new tools and techniques, you want to understand the risks and maturity level of these technologies and which ones are being used for building applications. You also want to track many non-technical factors.
  • You are an outsourcing company and want to find customers for your business and track trends in outsourcing. Being a jump ahead of your competition and carving a niche are important differentiators.
  • You are part of HR, or a Learning Officer, and need to plan for the skill development of your employees. You want to keep your software team happy and so need to know the latest technologies, tools and resources to plan training and skill development.
  • You are a development lead, and need to provide the team with the latest information on product releases, and access to product/technology knowledge bases. You need to know of any problems, including security issues, in the tools or software that you are currently using for your projects.

Broadly, there are several components to finding, using and sharing information.

  • Identifying and discovering information sources
  • Tracking information from various sources and filtering them for their relevance to your needs
  • Organizing collected information and sharing it with others
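
As a rough illustration of the tracking, filtering and organizing steps above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Item` record, the sample items, the keyword list and the simple keyword-count score are hypothetical, and a real setup would pull items from feeds or bookmarks and would likely need a better relevance measure.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One collected piece of information (hypothetical record layout)."""
    source: str   # e.g. "blog", "news", "wiki"
    title: str
    summary: str


def relevance(item, keywords):
    """Count how many keywords occur in the title or summary (case-insensitive)."""
    text = f"{item.title} {item.summary}".lower()
    return sum(1 for kw in keywords if kw.lower() in text)


def filter_items(items, keywords, min_score=1):
    """Keep items scoring at least min_score, most relevant first."""
    scored = [(relevance(it, keywords), it) for it in items]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in kept]


def group_by_source(items):
    """Organize item titles by the kind of source they came from."""
    grouped = {}
    for it in items:
        grouped.setdefault(it.source, []).append(it.title)
    return grouped


# Made-up sample data, only to show the flow.
items = [
    Item("blog", "Web 2.0 and social networks",
         "How social applications are changing the web"),
    Item("news", "Quarterly earnings report",
         "Financial results for the quarter"),
    Item("wiki", "Outsourcing trends",
         "Social bookmarking for tracking outsourcing news"),
]
keywords = ["social", "web 2.0"]

relevant = filter_items(items, keywords)
by_source = group_by_source(relevant)
print(by_source)
```

The grouped result could then be shared with others, for example as a short digest per source type.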

There can be many information sources. The typical ones can be categorized as:

  • News sources
  • Company websites
  • Blogs
  • Search engines
  • Wikis
  • Discussion groups
  • Social bookmarking sites
  • Social networks


This article (webdata-mining.pdf) describes these sources and their significance in more detail (the article uses British spelling, which is common in India).


I love this quote from Dan Brickley in his blog post Open Social Networks: Bring Back Iran:

For me, one of the big motivations for working (through FOAF, SPARQL, XMPP and other technologies) on social networking interop, is so young people in the future can grow up naturally having friends in distant nations, regardless of whether their government thinks that’s a priority.

I do believe that with social networks and other tools of people connectivity, we can incrementally “Change” the world for the better.

Web 2.0 Elephant

When Tim O’Reilly coined the term Web 2.0, the collective intelligence went to work and added its own definitions. I just came out of the Web Innovation 2007 conference in Bangalore. A bunch of us there were deep in a quest to understand how Web 2.0 can help the common man. But that did not prevent us from fantasizing a bit. The descriptions were as varied as the participants and the speakers. Here is a small sample.


Web Innovation 2007: On Building a Social Network

This was undoubtedly the best presentation of the two days. It had information, insights and some great advice. It was by Rohit Agarwal, Founder and CEO of techTribe. He kindly agreed to email his presentation on request, so I will try to get it and upload it.

Here is how Rohit sees Social Networks:

  • MySpace, Orkut – See Me
  • LinkedIn, techTribe – Meet Me
  • Facebook – See what I am doing

It is cute but a bit oversimplified.

Web 2.0 is:

Participatory Web

Is not about technology

Is about people

Is about connections

Is about community

Is about self-expression

Is about reach and pervasiveness

Rohit shares some of his lessons learned in building his social network.


  • Build a solid foundation
  • Focus on necessary infrastructure


  • Page load time
  • Server response, latency

Pay attention to:

  • Hosting Infrastructure
  • Email Infrastructure (viral invites, messaging, suppression lists)
  • Usability (cognitive behavior, colors, widgets)
  • Think through the flow
  • Driving User Behavior (understand why users behave in a certain way, soft suggestions drive behavior)

That was quite valuable. But Rohit made it even better by covering two more aspects:

Key Hires

  • Performance Engineer
  • System Administrator
  • UI Designer

Marketing is key:

  • Personalized
  • Simple Messaging
  • Clear value proposition
  • Not in your face

This was the best session I attended at the conference. I plan to check out both techTribe and startTribe. Rohit did two sessions, one in the business track and one in the tech track. Talking to him and listening to him, you understand why some entrepreneurs are more successful than others. He was cool about admitting mistakes and sharing knowledge, and his easy, informal presentation felt much more like a conversation.

Web Information Sources

Here is a mind map of various web information sources. This is not an exhaustive list. A few follow-up posts will describe each of these in more detail.


Look at this entry for some contextual information.

Update Jul 1, 2009

There is a whole host of new sources, so I will add them in the comments and try to update this mind map once in a while.

Here are some:

Freebase is a social database of open data.
Twine is a smart way to keep track of information and share it with others. It goes beyond simple bookmarking.
data.gov is a fabulous source of US government information. I will try to find and add similar resources for other governments.