New "Similar Pages" Feature On My Site

10/18/2009 17:23:00

For a while I have been interested in using computers to analyze the content of text documents, and in how programs can find similar documents. I found a web site with a “Similar entries” feature that listed a couple of posts similar to the one you were viewing. That’s clearly a useful way to help visitors find content on other pages they may be interested in.

After much digging around the internet, I was able to find some descriptions and examples of algorithms to implement such a feature. Basically, I count all the instances of each word in a given document. These counts are used to generate a normalized vector, which represents the overall content of the document in terms of which words are used more frequently. One such vector is calculated for each document on the site.
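Here is a minimal sketch of that counting step in Python. It is not the code running on this site: the function name `term_vector` and the rule for splitting text into words are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Return a unit-length word-frequency vector as a dict of word -> weight."""
    # Assumption: a "word" is a lowercase run of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)

    # Euclidean length of the raw count vector.
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty document, nothing to normalize

    # Dividing by the length makes every document vector length 1,
    # so long and short documents can be compared fairly.
    return {word: c / norm for word, c in counts.items()}
```

Storing the vector as a dict keeps only the words that actually appear in a document, which matters because most documents use a small fraction of the site's total vocabulary.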

Then, the vectors for any given pair of documents on the site are compared (technically, by using the cosine of the angle between them). Because word counts are never negative, this comparison yields a number between 0 and 1 that indicates how “parallel” the vectors are, which serves as an estimate of how similar the documents are.
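The comparison can be sketched roughly like this. Again, this is an illustration under assumptions rather than the site's actual code: the name `cosine_similarity` is made up, and each document is assumed to be a dict mapping words to non-negative weights.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())

    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing

    return dot / (norm_a * norm_b)
```

If the vectors have already been normalized to length 1, both norms are 1 and the result is simply the dot product. Two documents that share no words score 0, and two documents with the same word proportions score 1.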

I then wrote a quick script to query this database and show similar pages that meet a similarity threshold, limiting the number of matches shown so the list is not overwhelming.
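The filtering step might look something like the sketch below. It works on a plain dict instead of a database, and the name `similar_pages` and the default threshold and limit values are assumptions picked for the example, not the values used on this site.

```python
def similar_pages(scores, threshold=0.2, limit=5):
    """Pick the best matches from a dict of page -> similarity score."""
    # Keep only pages that meet the similarity threshold.
    matches = [(page, score) for page, score in scores.items() if score >= threshold]

    # Most similar pages first.
    matches.sort(key=lambda item: item[1], reverse=True)

    # Cap the list so it does not overwhelm the reader.
    return matches[:limit]
```

For example, given hypothetical scores `{"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}`, a threshold of 0.25 and a limit of 2, this returns pages "a" and "c" in that order.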

I am really interested in generalizing this to automatically tag documents, and to find new tags I should be using… We’ll see!
