05.12.08

Algorithm to generate tag clouds

I was talking to a friend who’s trying to figure out a formula for generating tag cloud. The purpose of a tag cloud is to convey the extent of relevance of a tag among all other tags being used. Trying to display the frequency of tags sounds easy but there are different issues for different algorithms out there. ( the ones found by quick google search)

The first one is from WordPress Codex. The summary for the algorithm is they sort the tags by frequencies and then group them into each font size.  The problem with this algorithm is if the max frequency is 5000 and the second to max frequency is 50. They will be displayed in the same font size since the algorithm does not take the frequency range into account.  Also when frequency range is too narrow, a tag with 24 occurrences and a tag with 25 occurrences may just have different font sizes if the cut off happens to be between 24 and 25.

The second algorithm we found is this one. This one takes the frequency range into account but it tries to display tags in font sizes that only differ by 1 unit each.  The problem here is human eye’s inability to detect varying font sizes if there are too many steps between the max font and the min font. So 100 different tags with sizes ranging from 10px to 48 px will not be visually distinguishable.

I figure if we take the second algorithm and modify it so that font size gap is big enough to be visible, it may just work.

count the frequencies for all tags

find min freq and max freq

x =  freq of tag we want to calculate the font size

scaling factor, K = (x – min freq) / (max freq – min freq)

font range = max font size – min font size

font step = C  (the constant font step size)

font for tag =    min font size  + (C * floor (K * (font range/ C)))

so if we reuse the example from the second algorithm

min freq = 6 , max freq = 91, freq for  current tag  = x

scaling factor K = (x – 6)/85

min font = 10 , max font = 30, font step = 4

font for tag = 10 + ( 4 * floor (( (x-6)/85) *( 20/4)))

so if x = 64,  font for tag = 22

if x = 79, font for tag =  26

if x = 14, font for tag = 10

if x = 32,  font for tag =14

Although it does not solve the problem of the max frequency so far off from the rest, it will pile  more tags with lower frequencies into smaller font sizes. But a spike can be handled as a special edge case.  this will work for most cases as long as the distribution curve is not too far off from the bell curve.

I need to look into another algorithm that takes median or mean with standard deviation as a way to generate banding. But it again won’t work since we are still assuming the distribution curve to be bell shaped.

I need to keep thinking about how we can display tag clouds with frequencies with a distribution curve that doesn’t fit bell shaped.  It’d be nice to get real data of tag occurrences from live sources like flickr, youtube, digg etc.

Tags: , ,
| Posted in UI | 2 Comments »