New tag feature enabled
After rebuilding my site to run mostly off of static HTML files, with a healthy dose of server-side includes, I have been experimenting with how to add other features. For example, I used Google’s Custom Search Engine to handle searches, and I added a feature to suggest similar pages, custom Atom feeds, and more.
I was also interested in re-enabling tags, but wasn’t sure of the best way to do it. Ideally, the system would be intelligent — suggesting tags for new content, providing a mechanism to analyze the tag framework as it grows to find redundancy, and ensuring that pages are tagged appropriately.
I’ve started working on a system. To back up for a moment, though, I want to describe the similar pages feature. Basically, it creates a “vector” for each page in the site that corresponds to the number of occurrences of each word in the document. The words are stemmed, certain common words are removed, and everything is lower-cased. The vector is then scaled to unit length, so a document with three remaining words that each occur once gives every word a weight of 1/√3 ≈ 0.577. For example, a document consisting of “The Cow jumped over the Moon” would become:
cow 0.577350269189626
moon 0.577350269189626
jump 0.577350269189626
Because a word that is missing from a document contributes nothing when two vectors are compared, you don’t need to store a “0” for each word that is not in a given document but occurs in other documents. This makes things much easier.
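As a rough illustration of the vector-building step, here is a minimal Python sketch. The stop-word list and the crude_stem helper are hypothetical stand-ins (the post does not say which stemmer or word list is used), and page_vector is just a name chosen for this example. Only words that actually occur are stored, so no zeros are kept.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the real list of removed words is not given.
STOP_WORDS = {"the", "a", "an", "and", "of", "over", "to", "in"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def page_vector(text):
    """Return a sparse, unit-length term vector as a dict of term -> weight."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(crude_stem(w) for w in words if w not in STOP_WORDS)
    length = math.sqrt(sum(c * c for c in counts.values()))
    if length == 0:
        return {}
    return {term: c / length for term, c in counts.items()}


vector = page_vector("The Cow jumped over the Moon")
# Terms kept: cow, jump, moon. Each weight is 1/sqrt(3), about 0.577.
for term, weight in sorted(vector.items()):
    print(term, round(weight, 6))
```

A real stemmer would handle far more cases than this toy one; the point is only the shape of the data that the rest of the system works with.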
To compare two documents, one simply calculates the cosine of the angle between the two respective vectors. The closer the vectors align with each other, the closer the cosine is to 1 (a perfect match).
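The comparison can be sketched in a few lines of Python. This assumes each vector is a dict mapping a word to its weight (my representation for the example, not necessarily what the site uses); the function name cosine and the sample vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    # Words missing from either vector contribute nothing to the dot product.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


nursery = {"cow": 1.0, "moon": 1.0, "jump": 1.0}
farm = {"cow": 2.0, "milk": 1.0}
space = {"rocket": 1.0, "orbit": 1.0}

print(cosine(nursery, nursery))  # identical documents: 1 up to rounding
print(cosine(nursery, farm))     # one shared word: somewhere between 0 and 1
print(cosine(nursery, space))    # no shared words: 0.0
```

If the vectors are already scaled to unit length, the two norms are both 1 and the cosine is just the dot product.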
The way I suggest similar pages is to compare each document with every other document and calculate their similarities. When a page is shown, this list is scanned, and matches are found that exceed a certain threshold. I also cap the list at the top 5 choices so a visitor isn’t overwhelmed with 12 similar pages.
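Here is one way that process might look in Python. It is a sketch under assumptions: the threshold of 0.2 is a placeholder (the post does not give the real value), the page names and vectors are invented, and pairwise_similarities and suggestions_for are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pairwise_similarities(vectors):
    """Compare every page with every other page exactly once."""
    names = sorted(vectors)
    table = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            table[(first, second)] = cosine(vectors[first], vectors[second])
    return table


def suggestions_for(page, table, threshold=0.2, limit=5):
    """Scan the precomputed table for matches above the threshold."""
    matches = []
    for (first, second), score in table.items():
        if score <= threshold:
            continue
        if first == page:
            matches.append((score, second))
        elif second == page:
            matches.append((score, first))
    matches.sort(key=lambda m: (-m[0], m[1]))  # best match first
    return [name for _, name in matches[:limit]]


pages = {
    "ice-cream": {"ice": 1, "cream": 1},
    "gelato": {"ice": 1, "cream": 1, "italy": 1},
    "travel": {"italy": 1, "train": 1},
    "code": {"python": 1},
}
table = pairwise_similarities(pages)
print(suggestions_for("gelato", table))
```

Comparing every pair grows with the square of the number of pages, which is fine for a personal site where the table can be rebuilt offline whenever a page changes.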
Back to tags. I have a routine I can run locally that compares pages to the overall vector for the pages contained within each tag. In other words, I can find pages that look similar to pages already in a tag, with the idea that perhaps this new page belongs there as well. This can also suggest tags for a new document that has yet to be posted (thanks, TextMate!).
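A sketch of how such a routine could work, in Python. I am assuming here that the "overall vector" for a tag is the sum of the vectors of its pages and that a tag is suggested when the cosine exceeds a cutoff; the 0.5 cutoff, the sample pages, and the names tag_vectors and suggest_tags are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tag_vectors(page_vectors, page_tags):
    """Sum the vectors of all pages carrying each tag (an assumed definition)."""
    totals = defaultdict(lambda: defaultdict(float))
    for page, tags in page_tags.items():
        for tag in tags:
            for term, weight in page_vectors[page].items():
                totals[tag][term] += weight
    return {tag: dict(terms) for tag, terms in totals.items()}


def suggest_tags(vector, tags, threshold=0.5):
    """Return tags whose overall vector is similar enough, best first."""
    scored = [(cosine(vector, tag_vec), tag) for tag, tag_vec in tags.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [tag for score, tag in scored if score > threshold]


page_vectors = {
    "sundae": {"ice": 1, "cream": 1, "cherry": 1},
    "gelato": {"ice": 1, "cream": 1, "italy": 1},
    "rome": {"italy": 1, "train": 1},
}
page_tags = {
    "sundae": ["dessert"],
    "gelato": ["dessert", "travel"],
    "rome": ["travel"],
}
tags = tag_vectors(page_vectors, page_tags)

# An unpublished draft gets the same treatment as an existing page.
draft = {"ice": 1, "cream": 1, "cone": 1}
print(suggest_tags(draft, tags))
```

Lowering the cutoff makes the routine more generous with suggestions, so the right value depends on how many false positives are tolerable when reviewing them by hand.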
I still would like to create a tool to visualize the relationships between pages based on similarities and tags to more formally look for a structure to the pages and the relationships between them. This might help suggest new tags that aren’t being used — for example, if there is a cluster of closely related pages about “ice cream”, that would likely be a tag that should be added.