Sometimes the simplest problems, the ones we solve effortlessly by hand, take quite a bit of effort to solve with a program. However, thinking through the steps (and actually implementing them) was worthwhile. It taught me a lot more about the problem and made explicit many thoughts I had not even paid attention to.
The problem was a simple one: given the name of a company, find the company's website. I can do it manually in less than a minute: search for the company name, scan the results, click the link that looks like the closest match, and copy the URL. Ask me to do that a hundred times, though, and it is an entirely different story.
Now let us try to write a program to find the website URLs for a set of company names. We need to do the following:
- Simulate the process of searching through a search engine
- Simulate the process of looking through the results
- Simulate the process of picking the correct result
- Verify that the result we picked is correct
Let us think through each one of these. Fortunately, most search engines provide an API (application programming interface) that you can program against, so step 1 is pretty easy.
Steps 2 and 3 can be combined. But if you have only the logic of the machine, without the benefit of human eyes and mind to look through the results, how will you simulate these steps? Try it. It is an interesting study, because within a couple of iterations you will start noticing aspects of the problem that were not obvious when you first formulated it.
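To make steps 2 and 3 concrete, here is a minimal sketch of one possible hypothesis: the right result is the one whose domain name looks most like the company name. This is not the program I wrote, just an illustration. The suffix list, the function names and the example URLs are all assumptions, and the list of result URLs is assumed to come from whatever search API you use in step 1.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Assumption: legal suffixes that rarely appear in a company's domain name.
SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co", "company"}

def normalize(name):
    """Lowercase the name, drop punctuation and legal suffixes, join the words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "".join(w for w in words if w not in SUFFIXES)

def domain_label(url):
    """Return the label just before the top-level domain ('acme' for www.acme.com).

    This is naive: it gets multi-part suffixes such as .co.uk wrong.
    """
    host = urlparse(url).hostname or ""
    parts = [p for p in host.split(".") if p != "www"]
    if len(parts) >= 2:
        return parts[-2]
    return parts[0] if parts else ""

def score(company, url):
    """Similarity between the normalized name and the domain label, from 0 to 1."""
    return SequenceMatcher(None, normalize(company), domain_label(url)).ratio()

def pick_best(company, urls):
    """Return the highest-scoring URL, or None if there are no results."""
    return max(urls, key=lambda u: score(company, u), default=None)

# Made-up search results for a made-up company.
results = [
    "https://en.wikipedia.org/wiki/Acme",
    "https://www.acmewidgets.com/about",
    "https://www.linkedin.com/company/acme-widgets",
]
best = pick_best("Acme Widgets Inc.", results)
```

Run it on a few real company names and you will quickly find the cases this hypothesis misses (abbreviated names, directory sites, domains that have nothing to do with the name), which is exactly the kind of thing you only notice once you try.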
Step 4 is even more fascinating. Given a company name and a URL, what logic will you use to verify that the URL represents the company?
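One simple hypothesis for step 4, sketched below, is that a company's own site usually mentions the company name in the page title. Again, this is only an assumption to try out, not a definitive check. The function names, the ignored-word list and the 0.5 threshold are all made up for illustration, and the HTML is assumed to have been downloaded already.

```python
import re
from html.parser import HTMLParser

# Assumption: words that say little about which company a page belongs to.
IGNORED = {"inc", "llc", "ltd", "corp", "corporation", "co", "company", "the"}

class TitleParser(HTMLParser):
    """Collects the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def tokens(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def looks_like_company_site(company, html, threshold=0.5):
    """True if enough of the company's name words appear in the page title."""
    parser = TitleParser()
    parser.feed(html)
    wanted = tokens(company) - IGNORED
    if not wanted:
        return False
    found = wanted & tokens(parser.title)
    return len(found) / len(wanted) >= threshold

# A made-up page for a made-up company.
page = "<html><head><title>Acme Widgets | Home</title></head><body></body></html>"
is_match = looks_like_company_site("Acme Widgets Inc.", page)
```

A news article about the company would pass this test too, so in practice you would want to combine it with other signals.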
Solving a problem like this requires you to think a lot more about how you solve problems. Something that comes to us so naturally becomes an order of magnitude more difficult when you try to automate it. When you program a solution to the problem, you do not have the benefit of human eyes and the mysterious thought process behind them. You need to form a set of hypotheses, try them out through small programs, verify the results, and iterate.
I was working on one such program last week and got up to 92% accuracy with limited test data. I may test it a bit more, clean it up, and make it available for people to use. I still need to put a web interface on it (right now it is a set of Python programs run from the command line).
If you have to find hundreds of thousands of websites, however, you may need a different approach, probably one involving machine learning techniques.
I heard Marvin Minsky mention in his class that software provides a new language for problem solving. It gives you ways to model, abstract, simulate and study processes and behavior. Even more important, it gives you a framework to express them more easily.
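The hypothesize, test and iterate loop can itself be a small program. The sketch below scores two competing guesses against a handful of hand-labeled examples. Everything in it is hypothetical: the company names and domains are made up, and the two guessing functions are deliberately naive stand-ins for whatever hypothesis you are testing.

```python
import re

# Assumption: legal suffixes that the second hypothesis strips out.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation"}

def words(name):
    """Split a company name into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.lower())

def guess_all_words(name):
    """Hypothesis A: the domain is every word of the name joined together."""
    return "".join(words(name)) + ".com"

def guess_without_suffix(name):
    """Hypothesis B: same as A, but legal suffixes are dropped first."""
    kept = [w for w in words(name) if w not in LEGAL_SUFFIXES]
    return "".join(kept) + ".com"

def accuracy(guess, labeled):
    """Fraction of labeled examples for which the guess matches the known answer."""
    hits = sum(1 for name, domain in labeled.items() if guess(name) == domain)
    return hits / len(labeled)

# Made-up test data: company name -> domain verified by hand.
LABELED = {
    "Acme Widgets": "acmewidgets.com",
    "Globex Corporation": "globex.com",
    "Initech": "initech.com",
    "Hooli Inc": "hooli.com",
}

for guess in (guess_all_words, guess_without_suffix):
    print(guess.__name__, accuracy(guess, LABELED))
```

With a score for each hypothesis, deciding what to try next becomes a matter of looking at the examples that still fail.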