I recently wrote about the R statistics and graphics platform, and introduced it as a way to do further analytics or visualizations for data from Oracle databases. This article will take a look at a graphical idea called tree maps. They are an extension of the more basic visualization technique: heat maps. Whereas heat maps use the idea of color intensity to express an extra dimension or quality about the data, tree maps do this in addition to expressing a hierarchy of quantity values.
These types of visualizations have become extremely popular; they can be seen all over the web. This fellow seems to have been a seminal figure and in any case has written a comprehensive historical overview of tree mapping, which he claims to have originated about 1990. Nowadays, everybody from Google Docs to cloud tool BI firms are clamoring for your tree map attention. Here’s a nice source for a test drive. We are going to continue using R for our investigation.
English Wordlists
We’ll continue to explore the dataset previously developed which contains the most common 76,000 English words. Suppose now that we wish to convey some deeper insights about this wordlist. The plot, barplot, and pie chart which we’ve seen all were used to display essentially 2-D data: word counts against alphabet letters. We could imagine adding hierarchical layers to the alphabet letter dimension. Let’s say we want to also capture word count data for all existing initial 2-letter combinations in English? (AB, AC, AD, AF, …) With a tree map graph, we can render this type of data spatially, within the larger category “A”, using inscribed rectangles. The relative size (area) of the rectangles will be proportional to their word counts.
This conveys a deeper level of linguistic analysis; you can think of the initial level (first letters only) as being the tree, and the second level (first two letters) as the branches. But we can go further, and describe a third level, the leaves. Suppose we wish to show not only how many English words begin with the letters ‘SC’, but also how many allowable unique 3rd letters exist in English for this group? The answer turns out to be 9: (SCA, SCE, SCH, SCI, SCL, SCO, SCR, SCU, SCY); and this quantity could be conveyed with the color intensity of the corresponding rectangle. You can think of it as a kind of measure of word-building flexibility for any English 2-letter starting sequence.We need to return to Oracle to compute this data and produce a new CSV. Here’s the appropriate query. We will eliminate small clusters (less than 20 words) as well as all words beginning with Q, X, Y or Z to allow for a more readable tree map.
w h i t e s p a c e
SELECT
MAX(SUBSTR(UPPER(lexeme),1,1)) AS first,
SUBSTR(UPPER(lexeme),1,2) AS first_two,
COUNT(*) AS wc,
COUNT(UNIQUE(SUBSTR(UPPER(lexeme),3,1))) AS tuple3
FROM work_wordlist
WHERE SUBSTR(UPPER(lexeme),1,1) NOT IN (‘Q’,’Z’,’Y’,’X’)
GROUP BY SUBSTR(UPPER(lexeme),1,2)
HAVING COUNT(*) > 20
/
w h i t e s p a c e
The query output is alphabetically sorted; take a look at the entries for ‘CO’ and ‘EF’ to better understand. 2579 English words begin with ‘CO’, and fully 25 different letters occur as the 3rd letter within this wordset. Of all the English letters, only the letter ‘J’ does not appear as a 3rd letter. Contrast this with ‘EF’ which begins only 59 words and permits exactly 1 possible 3rd letter, namely another ‘F’. So, the first two-letter sequence is highly malleable in English spelling, whereas the latter sequence is rigid and determinating. We could express this visually as a bright hue contrasted with a dark hue, and of course many grades of malleability would exist in between these two poles.
R Method map.market()
Having exported the CSV file from this query into the R working directory, (‘first_2letters.csv’), let’s inspect the R code needed to generate the tree map. Note that we’ll need to load a non-standard package into our local library called ‘portfolio’ to draw the tree map. The function called is map.market(), which was originally designed for stock market visuals.
w h i t e s p a c e
# tree map needs external library package — only necessary once
install.packages("portfolio")
# load package into R memory for present workload — necessary each time
library(portfolio)
# grab the data from CSV into R data frame
df_eng2 <- read.csv("english_first2letters.csv", header=TRUE, sep=",",
+ as.is=TRUE)
# execute tree map function call
map.market(id=df_eng2$FIRST_TWO, area=df_eng2$WC, group=df_eng2$FIRST,
+ color=df_eng2$TUPLE3, main="English Word Beginnings",
+ lab=c(TRUE, FALSE) )
w h i t e s p a c e
It’s worthwhile taking a moment to clarify the important arguments for map.market():
• id= ~points to scalar of label values for smaller-grain rectangles
• area= ~points to scalar of quantity to indicate rectangle sizes
• group= ~points to scalar identifying group to which each smaller-grain entry belongs
• color= ~which value to indicate via hue intensity for smaller-grain rectangles
• lab= ~Boolean vector of 2 values indicating if labels should display for group, id rectangles
Here’s the R output:
w h i t e s p a c e
w h i t e s p a c e
Just for kicks, I’ve reproduced the same graph twice. Labels have been assigned to the larger rectangles in the first case and to the smaller rectangles in the second.. This was accomplished by flipping the value vector within the lab argument: lab=(FALSE, TRUE). As you can see, tree map #2 needs a little cleanup. The function code can only reduce the fonts within certain limits. To make an interactive tree map with clickable components and mouse-over enlargements would require Java code and is beyond the present scope.
w h i t e s p a c e
R Nerd Addendum
This isn’t really an R blog, rather an Oracle blog. But since I came across an idiosyncratic R error, which looks fairly Martian at first glance, I thought I’d report on it. The error in question, while executing the map.market() function, was:
Error: all(!is.na(id)) is not TRUE
This happened because of the innocent looking 121st entry within the $FIRST_TWO vector, which is “NA”. In R, “NA” means roughly the same thing which NULL means to Oracle SQL. Strangely, this is true even when R encounters “NA” inside of a CSV file (within double quotes!). So, the datapoint for the English words starting with the letters ‘NA’ were being misinterpreted by R. After playing around a little, I was able to bypass this problem by explicitly reassigning the value inside of R. But noticing the problem, and solving it, were not obvious, so I’ve included the relevant R session below, with annotations, in case anyone else is afflicted by this. R bears some resemblance to APL syntactically, and I recall these same kind of obscure errors back in my university days with that language.
Funkier Tree Map Variations
The hierarchical quality of many datasets can be more complex, containing nested levels of detail. The de rigeur method for handling this layered aspect of data presentation is with the use of zooming techniques. Normally these would require more code. The example illustrated at right is a circular tree map, taken from the New York Times. It depicts average U.S. consumer spending during a 12-month period between the spring of 2007 and 2008. The link allows you to mouse over every small spending fragment to see a pop-up with the relevant data. This was a particularly volatile period for the U.S. consumer economy: the coloration changes depict percentage shifts in costs during the year, with the darker reddish hue indicating an increase of up to 40% (think eggs and oil)! This PDF describes something called a clock map, with similar zoomability. Finally, here’s an abstract from a French research project laying out the principles for effecting user navigation of complex layered data with hundreds of thousands of elements. I’ve not yet found an R library which can handle these kinds of zoom maps, but let me know if you’ve got one. Likely, the main issue is with the detail pop-ups and mouse sensing — factors which make other tools such as JavaScript more suitable.~RS