Comments on: the frequency of a word: Awesomeness

By: Wojciech

Wojciech — Fri, 28 Mar 2008 06:56:34 +0000

Awesome post! I’ve been meaning to reply for forever, so apologies for the delay.

Regarding Pam’s last comment: I think one way to test how well proper writing techniques are being used is to compare the growth of commonly misspelled words against their correct spellings (e.g. http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/).

For example, look at “believe” versus “beleive” at Google Trends (sorry, my web analytics aren’t as fancy as yours 🙂 ): http://www.google.com/trends?q=believe%2C+beleive&ctab=0&geo=all&date=all&sort=0 Notice the weird drop in “beleive” back in 2004? I’m trying to find some long-term patterns but haven’t had much success… Regardless, this could be a good proxy for grammar or “proper spelling”.

Are you still playing around with this?
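
A minimal sketch of the misspelling comparison in Python. The yearly counts are made-up placeholders, not Google Trends data, and `misspelling_rate` is a hypothetical helper name. It tracks the misspelled share of all uses rather than the raw count, so overall growth in search volume does not masquerade as worse spelling.

```python
def misspelling_rate(correct_counts, misspelled_counts):
    """Share of all uses of a word that are misspelled, one value per period."""
    rates = []
    for good, bad in zip(correct_counts, misspelled_counts):
        total = good + bad
        rates.append(bad / total if total else 0.0)
    return rates

# Hypothetical yearly counts (assumed for illustration only).
believe = [900, 1800, 3600]
beleive = [100, 200, 300]

rates = misspelling_rate(believe, beleive)
print([round(r, 3) for r in rates])
```

In this toy data the raw count of “beleive” triples, yet its share falls in the last period, which is why the ratio is likely the more useful proxy.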

By: Math Coach Pam

Math Coach Pam — Fri, 21 Mar 2008 21:51:58 +0000

Hi all,
My response has been a bit delayed due to some travel to a no-internet zone.

Quick thoughts: I actually would not consider quadratic growth, since the line fit the log(frequency) so well. This strongly suggests that the growth is close to exponential and not quadratic, at least over the times considered. Exponential may also make more sense as a growth model in. Re: the slope differences, there is a statistical test that could be done here….
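
One way the slope comparison could be sketched in Python. If frequency grows as A·exp(b·t), then log(frequency) = log(A) + b·t, so a straight-line fit on the log scale recovers the growth rate b. The data below are synthetic (a deterministic wiggle stands in for noise), and the z statistic assumes the two series are independent with roughly normal errors; a real analysis might instead fit one regression with an interaction term. `fit_line` and `slope_difference_z` are hypothetical helper names.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, se_b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    # Residual variance with n - 2 degrees of freedom gives the slope's standard error.
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return a, b, se_b

def slope_difference_z(fit1, fit2):
    """Approximate z statistic for the difference between two independent slopes."""
    (_, b1, se1), (_, b2, se2) = fit1, fit2
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Synthetic log-frequencies (assumed): true slopes 0.0001 and 0.0003 per day.
days = [100 * i for i in range(10)]
wiggle = [0.01 if i % 2 == 0 else -0.01 for i in range(10)]
log_awesome = [2.0 + 0.0001 * d + w for d, w in zip(days, wiggle)]
log_awesomeness = [1.0 + 0.0003 * d - w for d, w in zip(days, wiggle)]

fit_a = fit_line(days, log_awesome)
fit_b = fit_line(days, log_awesomeness)
z = slope_difference_z(fit_a, fit_b)
print(f"slopes: {fit_a[1]:.6f} vs {fit_b[1]:.6f}, z = {z:.1f}")
```

A |z| well above 2 would suggest the slopes differ by more than the scatter around the lines explains; whether that holds for the real counts depends on how noisy they are.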

Regarding what word to normalize against: the key factor for that comparison word is not length. How often the comparison word is used is not the relevant quantity either. The quantity of interest is really the change over time, which is exactly what the slope gives you. The intercept absorbs some of the absolute differences in usage (specifically the difference at time 0).
For this effort, what you care about in an index word is that its rate of usage (number per blog or per page or whatever Ray is counting) is fairly stable over the time considered. So a basic connector word is a good idea: something inert that wouldn’t flare up due to a news story, the way the word “mortgage” has over the last few years. A linguist may have more thoughts on this. Articles in general may be experiencing declines. Good grammar and a word like “are” may also be experiencing changes (declines) in usage. But if these changes are happening at a rate much slower than that of a trendy word like awesomeness, then I think it still makes a decent comparator.
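
A small Python sketch of why a stable index word works, using synthetic numbers that are assumptions rather than Ray’s data. Total pages grow 5% per month, the index word appears on a constant share of them, and the target word’s share grows 2% per month. Dividing by the index cancels the overall growth, so the slope of the log ratio is the target’s own rate of change and the intercept holds the level difference at time 0.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

months = list(range(12))
# Assumed: the web itself grows exponentially.
total_pages = [1000 * math.exp(0.05 * m) for m in months]
# Assumed: the index word sits on a stable 50% of pages.
index_counts = [0.5 * t for t in total_pages]
# Assumed: the target word starts at 0.1% of pages and its share grows 2% a month.
target_counts = [0.001 * math.exp(0.02 * m) * t
                 for m, t in zip(months, total_pages)]

log_ratio = [math.log(t / i) for t, i in zip(target_counts, index_counts)]
intercept, slope = fit_line(months, log_ratio)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

If the index word’s share drifted too, the fitted slope would be the difference between the two rates, which is why a slow-moving index word is still usable against a fast-moving target.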

By: Ray

Ray — Wed, 19 Mar 2008 05:19:11 +0000

To Franklin, my brain just exploded a little bit. But I’m amazed at how subtle this growth is to capture.

And to everyone nice enough to read this post and also comment: your awesomeness is undeniable.

By: Ray

Ray — Wed, 19 Mar 2008 05:02:39 +0000

Aleks: Thanks for the link, I will look into it. You raise an important point: there is a lot to try to normalize against. I wasn’t that worried about other languages, but I’m curious to know how often “are” appears on non-English-language sites. Your spammy pages point is a good one too. This was a paper-napkin calculation, and it points to a lot more questions that I’m not sure I’m qualified to answer but that are nevertheless interesting.

By: Franklin

Franklin — Wed, 19 Mar 2008 02:58:40 +0000

Ray, it is nice to hear that you are studying a word that I’ve never heard of. Even the examples on the web don’t make any sense to me. Am I too old?

It is not so easy to tell whether the slopes 0.0001 and 0.0003 are significantly different. I would suggest that you do a regression where the log frequency of awesome is also entered as a quadratic term, like this (in R, the squared term has to be wrapped in I(), otherwise the formula drops it):
awesomeness ~ awesome + I(awesome^2)
So if awesomeness is growing faster than awesome, then the quadratic term will be significant. If this doesn’t make sense, I would suggest reading Baayen’s chapter on regression modeling and downloading the R stats package.
http://www.ualberta.ca/~baayen/publications/BaayenCUPstats.pdf
http://cran.r-project.org/
Also, another problem with using “are” as a control is that word frequency is related to word length, so it would be better to control for word length somehow. Maybe you could enter a word of a similar length like “specialness” into the equation.
awesomeness ~ awesome + I(awesome^2) + specialness
Let me know if you need some help with this…
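
For readers without R, the first model above can be sketched in plain Python as a least-squares fit of y = c0 + c1·x + c2·x², where x is the log frequency of “awesome” and y is the log frequency of “awesomeness”. The values are synthetic and `fit_quadratic` is a hypothetical helper. This only estimates the quadratic coefficient; judging whether it is significant also needs its standard error, which a stats package reports and this sketch does not.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2 via the normal equations."""
    # Build the 3x3 system (X^T X) c = X^T y for the columns 1, x, x^2.
    rows = [[1.0, x, x * x] for x in xs]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 3):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (b[i] - tail) / a[i][i]
    return coef

# Synthetic log-frequencies (assumed): awesomeness curves upward relative to awesome.
log_awesome = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
log_awesomeness = [0.2 + 0.5 * x + 0.1 * x * x for x in log_awesome]

c0, c1, c2 = fit_quadratic(log_awesome, log_awesomeness)
print(f"c0 = {c0:.3f}, c1 = {c1:.3f}, c2 = {c2:.3f}")
```

A positive c2 here corresponds to awesomeness pulling ahead of awesome as both grow; with real, noisy counts the estimate could easily be indistinguishable from zero.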

By: Aleks

Aleks — Tue, 18 Mar 2008 20:38:24 +0000

Well, “are” is a dangerous word to normalize against: it doesn’t control for uneven growth of other languages and spammy pages, it doesn’t control for the context, and so on. Try engaging the http://datamining.typepad.com/ blog via a trackback?

By: Ray

Ray — Tue, 18 Mar 2008 14:58:39 +0000

MCP: Ah… that is interesting. You are quite correct that I was assuming that the demographics of blog writers were increasing at a uniform rate, which you suggest is not true. I am inclined to agree with you. I think the hipster blog index is worth investigating. That “bend” in early 2005 deserves another look. But downloading all the data gets time-consuming. I think I need an intern, or to learn how to write bots.

By: Ray

Ray — Tue, 18 Mar 2008 14:52:49 +0000

Noah: Actually, to be truthful, I didn’t know how Google is grouping the pages. But I should have been clearer about differentiating between word frequency and page counts. Glad you liked the post. I’m hoping to do more of these kinds of ad-hoc “studies.”

By: Noah Brier

Noah Brier — Tue, 18 Mar 2008 12:01:05 +0000

This is awesome.

One note that you’ve probably already realized: When you were searching Google you were getting the total number of pages that included the word, not the total number of mentions of the word itself. This becomes a problem when Google groups its results from a single site, because the most serious purveyors of awesomeness are only getting two pages at most (which I guess may apply equally to “are”, so the two effects may cancel out).
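
The difference between the two counts can be shown with a toy Python example. The three “pages” are invented, and splitting on whitespace is a simplifying assumption about tokenization; this says nothing about how Google actually counts or groups results.

```python
def page_count(pages, word):
    """Number of pages that contain the word at least once."""
    return sum(1 for page in pages if word in page.lower().split())

def mention_count(pages, word):
    """Total occurrences of the word across all pages."""
    return sum(page.lower().split().count(word) for page in pages)

# Invented pages for illustration.
pages = [
    "awesomeness awesomeness awesomeness everywhere",
    "we are here and they are there",
    "no match on this page",
]

for word in ("awesomeness", "are"):
    print(word, page_count(pages, word), mention_count(pages, word))
```

A page that repeats a word many times still counts once, so page counts understate usage most for the heaviest users of the word.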

By: Math Coach Pam

Math Coach Pam — Tue, 18 Mar 2008 11:49:24 +0000

Ray,
There is a lot of awesomeness here. Regarding your conclusions, I think now it’s time to think about all those yahoos who are spewing blogs. One might think of the index of are to awesome or awesomeness as being influenced not only by the relative uptake of 80s tongue, and its 00s incarnations, but also by the proportion of blog entries that are being created by hipsters versus the vanilla people. So we have a hidden confounder, the hipster blog index. My theory is that, as blogs are being incorporated into everything from news shows to Cousin Susie’s wedding, the proportion of blogs generated by those inclined to use the word awesome or awesomeness is decreasing. So the use of 80s slang may in fact be experiencing a wicked comeback in certain subpopulations that use blogs, but the general population of bloggers is experiencing a coincident demographic shift that is preventing awesomeness from keeping up with the general growth of blogs.

m.c.p.
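
The hidden-confounder argument can be checked with a little arithmetic, sketched here in Python. All numbers are invented: the group sizes and usage rates are assumptions chosen to show the effect, and `overall_rate` is a hypothetical helper.

```python
def overall_rate(groups):
    """groups: list of (num_blogs, share_using_word) pairs; returns the pooled share."""
    total = sum(n for n, _ in groups)
    users = sum(n * rate for n, rate in groups)
    return users / total

# Invented figures: (number of blogs, share that use "awesomeness").
year1 = [(500, 0.10),   # hipster blogs
         (500, 0.01)]   # vanilla blogs
year2 = [(1000, 0.12),  # hipster blogs: usage rate went UP
         (9000, 0.01)]  # vanilla blogs: same rate, far more of them

print(f"year 1 overall: {overall_rate(year1):.3f}")
print(f"year 2 overall: {overall_rate(year2):.3f}")
```

Usage within the hipster group rises from 10% to 12%, yet the pooled rate falls because the mix of bloggers shifted. A flat or slow-growing overall index would therefore not rule out a comeback inside one subpopulation.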