Tag Categorizer

10/29/2005 10:30:25

What is it?

TagCategorizer is a Perl script designed to help you discover the relationships among tags in a collection of tagged objects.

Tags (under various names) are increasingly used to label objects in online and offline programs (Flickr, delicious, and iPhoto, to name a few). Most tagging systems encourage you to add as many tags as you like to each object, and you don’t have to plan ahead to decide what the relationships between tags will be. You can then search using these tags to locate the particular object you desire.

This lack of imposed structure is one of the main strengths of tagging - an object can carry many different tags at once. It can also be a hindrance: it is difficult to make sense of the overall organization of the tags, precisely because none is enforced. It becomes hard to look for patterns among the objects, or to identify objects that are similar to the one you already have.

Kaspar Schiess has created a tool to take a particular user’s delicious bookmarks, analyze their tags, and create a hierarchical graph of the relationship between tags.

It’s not perfect, but it does a pretty good job of identifying “top-level” tags, and then placing subgroups underneath them. The key feature is that it does this based on the tags within the collection of data without any top-down structure. The hierarchy emerges as you tag more and more objects in different ways. It also identifies tags that are effectively synonyms (meaning they tag the exact same collection of objects).

I liked this idea, and I have been looking for something similar to use with my Oddmuse wiki. I currently use a ClusterMap to organize my site. This has some useful features, but the downside is that a page can only belong to one cluster at a time. There is no way to express a hierarchy of clusters without breaking some of the functionality. Additionally, because you have to intentionally plan such a hierarchy, you might discover that you planned it incorrectly.

What I wanted to do was to allow you to tag individual pages, and then create just such a hierarchy of the tags based on the way the tags are used, not based on some externally imposed structure. Kaspar Schiess' tool inspired me to go ahead and create this.

How do you use it?

TagCategorizer is still evolving, but basically it takes an XML chunk specifying a list of object identifiers and the tags assigned to each object. These objects can be anything you like; they just need to have a unique identifier. The tags can be whatever you want as well, but should contain only letters, numbers, and (if necessary) spaces - no punctuation. Note that some tagging systems allow spaces in tags and some do not.

An example of the input format:

<taglist>
	<object>
		<id>unique id</id>
		<tag>tag</tag>
		<tag>long tag</tag>
	</object>
</taglist>
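
If your data already lives in a program, the input above can be generated rather than typed by hand. Here is a minimal sketch in Python (not the Perl that TagCategorizer itself is written in); the function name build_taglist and the sample page names are made up for illustration.

```python
import xml.etree.ElementTree as ET

def build_taglist(objects):
    """Build a <taglist> element from a mapping of object id -> list of tags."""
    root = ET.Element("taglist")
    for object_id, tags in objects.items():
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "id").text = object_id
        for tag in tags:
            ET.SubElement(obj, "tag").text = tag
    return root

# Hypothetical sample data: wiki page names as ids, free-form tags.
pages = {
    "HomePage": ["wiki", "start"],
    "PerlTips": ["wiki", "perl", "long tag"],
}

xml_text = ET.tostring(build_taglist(pages), encoding="unicode")
print(xml_text)
```

Using an XML library instead of string concatenation means any special characters in ids or tags are escaped for you.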

Once you create the XML, you pass it to TagCategorizer, which processes it, and then outputs another XML chunk that describes the hierarchy of tags and which objects belong to which tag.

The output format:

<tagHierarchy>
	<tag title="sometag">
		<tag title="anothertag">
			<object>object identifier</object>
		</tag>
		<object>another object</object>
	</tag>
</tagHierarchy>
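
Since it is up to your program to decide what to do with the output, here is one way it could be read back in. This is a sketch in Python, assuming the output looks exactly like the sample above; the function name walk is my own.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<tagHierarchy>
	<tag title="sometag">
		<tag title="anothertag">
			<object>object identifier</object>
		</tag>
		<object>another object</object>
	</tag>
</tagHierarchy>"""

def walk(element, path=()):
    """Yield (tag_path, object_id) pairs for every object in the hierarchy."""
    for child in element:
        if child.tag == "tag":
            # Descend into the nested tag, extending the path with its title.
            yield from walk(child, path + (child.get("title"),))
        elif child.tag == "object":
            yield path, child.text

for tag_path, object_id in walk(ET.fromstring(SAMPLE)):
    print(" > ".join(tag_path), "->", object_id)
```

Each object comes back with the full chain of tags above it, which is enough to build a menu, a sitemap, or a breadcrumb trail.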

Important things to note:

  • You do not pass entire objects to TagCategorizer, just some meaningful unique id. It’s up to your program to decide how to use the output.

  • TagCategorizer does NOT verify that your object ids are unique. If you reuse an id, the two objects will be treated as one.

  • Tags are case sensitive, so tag and Tag are not the same. Depending on your application, you may wish to force the case prior to passing the XML to TagCategorizer.
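
Because TagCategorizer does not check ids or fold case for you, it can be worth doing both before you build the XML. A small Python sketch of that pre-processing step follows; the function name prepare and the sample records are hypothetical, and lowercasing is just one possible way to force the case.

```python
from collections import Counter

def prepare(records, lowercase=True):
    """records: list of (object_id, tags) pairs.

    Returns (cleaned_records, duplicate_ids). Tags are lowercased when
    `lowercase` is true; ids that occur more than once are reported so
    the caller can decide what to do about them.
    """
    counts = Counter(object_id for object_id, _ in records)
    duplicates = sorted(oid for oid, n in counts.items() if n > 1)
    cleaned = [
        (oid, [t.lower() if lowercase else t for t in tags])
        for oid, tags in records
    ]
    return cleaned, duplicates

# Hypothetical input with a reused id and mixed-case tags.
records = [
    ("page1", ["Perl", "wiki"]),
    ("page2", ["perl"]),
    ("page1", ["Oddmuse"]),
]
cleaned, duplicates = prepare(records)
print(cleaned)
print("reused ids:", duplicates)
```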

How does it work?

The algorithm used by TagCategorizer is pretty basic. Each tag is treated as the set of objects it has been applied to. Each tag is compared with every other tag, and there are five possible relationships that can emerge:

  1. A can be a subset of B - tag A is a subset of tag B if every object contained in set A is also found in set B.

  2. A can be a superset of B - which means that B is a subset of A.

  3. A can be a synonym for B - if A is a subset of B, and B is also a subset of A, then they are synonyms - the same set.

  4. A and B may intersect - they share some members, but neither is a subset of the other.

  5. A and B do not share any members.

As far as TagCategorizer is concerned, for now, conditions 4 and 5 are treated the same. We are currently only concerned with subsets and synonyms. This could be changed in the future.
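
The five relationships map directly onto ordinary set comparisons. Here is a sketch in Python (again, not the actual Perl code); the function name relationship and the returned labels are my own choices.

```python
def relationship(a, b):
    """Classify how the object set of tag A relates to the object set of tag B."""
    a, b = set(a), set(b)
    if a == b:
        return "synonym"    # each is a subset of the other
    if a < b:
        return "subset"     # every object in A is also in B
    if a > b:
        return "superset"   # every object in B is also in A
    if a & b:
        return "intersect"  # some shared objects, neither contains the other
    return "disjoint"       # nothing in common

# Hypothetical tags, each written as the set of object ids it was applied to.
print(relationship({"p1", "p2"}, {"p1", "p2", "p3"}))
print(relationship({"p1", "p2"}, {"p2", "p3"}))
```

The synonym test has to come first, because two identical sets also pass the plain "is a subset of" test in both directions.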

After all tags are compared with each other, we have a list of which tags are synonyms, which tags are subsets of another set, and which tags are supersets of another set.

Synonyms are combined into single tags in order to simplify the results, and the new set is named something like tag1/tag2/tag3 where tag1, tag2, and tag3 are synonyms of each other.
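
One way to picture the synonym step: group tags by the exact set of objects they cover, then join each group's names with slashes. A Python sketch, with the assumption (mine, not stated above) that names within a group are joined in alphabetical order:

```python
from collections import defaultdict

def merge_synonyms(tag_objects):
    """tag_objects: dict of tag -> set of object ids.

    Returns a new dict in which tags covering exactly the same objects
    are collapsed into one entry named like tag1/tag2/tag3.
    """
    groups = defaultdict(list)
    for tag, objects in tag_objects.items():
        groups[frozenset(objects)].append(tag)
    return {
        "/".join(sorted(tags)): set(objects)
        for objects, tags in groups.items()
    }

# Hypothetical data: "car" and "auto" tag exactly the same two objects.
merged = merge_synonyms({
    "car": {"p1", "p2"},
    "auto": {"p1", "p2"},
    "boat": {"p3"},
})
print(merged)
```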

Tags that are not subsets of another set are designated as top-level tags.

All the tags that list a top-level tag (toptag, for example) as a superset are its descendants. Each descendant is compared with the other descendants to see if there are subset relationships between them. If so, then the subset is moved under the superset, creating a nested hierarchy. This process is continued as needed.
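
The end result of that nesting can be sketched in Python as follows. This is not how the Perl script is organized; it computes the same kind of tree in one pass by attaching each tag to every superset that has no other superset in between. It assumes synonyms have already been merged, and the function name build_hierarchy and the sample tags are invented.

```python
def build_hierarchy(tag_objects):
    """tag_objects: dict of tag -> set of object ids (synonyms already merged).

    Returns (top_level, children), where children maps each tag to the
    list of tags nested directly beneath it.
    """
    tags = sorted(tag_objects)
    # For every tag, the tags whose object sets strictly contain it.
    supersets = {
        t: [u for u in tags if tag_objects[t] < tag_objects[u]] for t in tags
    }
    top_level = [t for t in tags if not supersets[t]]
    children = {t: [] for t in tags}
    for t in tags:
        for parent in supersets[t]:
            # Direct parent: no other superset of t sits between t and parent.
            between = any(
                tag_objects[mid] < tag_objects[parent] for mid in supersets[t]
            )
            if not between:
                children[parent].append(t)
    return top_level, children

# Hypothetical tags and the objects they were applied to.
tag_objects = {
    "animal": {"p1", "p2", "p3", "p4"},
    "dog": {"p1", "p2"},
    "puppy": {"p1"},
    "cat": {"p3"},
}
top_level, children = build_hierarchy(tag_objects)
print(top_level)
print(children)
```

Note that in this sketch a tag with two unrelated supersets is listed under both of them.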

Once the hierarchy is determined, the objects associated with a given tag are pruned, such that a given object is displayed only once in a given branch. If an object is shared between tagA and its descendant tagB, it will not be listed as belonging to tagA, but rather will belong to tagB, which, in turn, belongs to tagA. It makes sense when you see it in action.
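
Here is the pruning step in action, as a Python sketch with made-up data. It assumes you already have each tag's objects and its direct sub-tags; the function name prune is hypothetical.

```python
def prune(tag_objects, children):
    """Return dict of tag -> objects listed directly under that tag.

    An object is dropped from a tag whenever one of the tag's direct
    sub-tags also holds it, so it shows up only at the deepest level.
    """
    shown = {}
    for tag, objects in tag_objects.items():
        covered = set()
        for child in children.get(tag, []):
            covered |= tag_objects[child]
        shown[tag] = set(objects) - covered
    return shown

# Hypothetical hierarchy: tagB is nested under tagA.
tag_objects = {
    "tagA": {"p1", "p2", "p3"},
    "tagB": {"p1", "p2"},
}
children = {"tagA": ["tagB"], "tagB": []}

print(prune(tag_objects, children))
```

Only direct sub-tags need to be checked, because anything held by a deeper descendant is, by definition, also held by the sub-tag above it.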

Things to do

I like the way that Kaspar’s version shows you other related parts of the tree. In other words, a given tag can be a subset of multiple other tags, and this allows you to find additional related tags. I would like to add this feature to TagCategorizer.

One question I have is how exact the process should be. For instance, if you have two tags, and a human would readily agree that tag B should be a subset of tag A, but tag B contains a single object that does not belong to tag A, it will not be identified as a subset. Is this the right behavior? For now, this is how TagCategorizer works. It is likely to remain this way, but it’s an interesting point for discussion.
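
To make the question concrete, here is a tiny Python illustration with invented data. It shows the exact test rejecting tag B because of a single stray object, and a hypothetical helper, stray_objects, that reports what a looser rule would have to overlook. TagCategorizer itself does nothing like the second part.

```python
def stray_objects(candidate, container):
    """Objects that keep `candidate` from being a subset of `container`."""
    return set(candidate) - set(container)

# Hypothetical tags: B looks like it belongs under A, except for p9.
tag_a = {"p1", "p2", "p3", "p4"}
tag_b = {"p2", "p3", "p9"}

is_exact_subset = tag_b <= tag_a
print("exact subset:", is_exact_subset)
print("stray objects:", stray_objects(tag_b, tag_a))
```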

Where can you find it?

You can download the current version of TagCategorizer here:
