Sunday, December 20, 2015

Wikipedia Calculations

This is a little write-up explaining some statistics and calculations I did for my most recent video, “Can You Read All of Wikipedia?” While the statistics from the video are all approximately correct (as I mentioned), they should have been a little more accurate. I want to explain why.

Let's begin with this Wikipedia article showing Wikipedia's word count. 


Two things should stand out: 

  1. The word count here is 2.95 billion, not 2.9275 billion like I claimed, and...
  2. After the word count the brackets literally say “not in citation given”

These two are related. It was September 1st* when I last checked this page and finalized my calculations. Back then, it said that there were 2.9275 billion words and it didn’t have that bracket saying the number wasn’t in the citation. I tried to look through the edit history to find this number but I couldn’t. I really remember seeing that number there (because where else would I have gotten it?) but it's seeming more and more like I pulled it out of thin air.ª

ANYWAY the point is that even if that number had been there, I didn’t verify it by checking the source. As someone who attempts to research material with academic curiosity and legitimacy, this is embarrassingly irresponsible. I will say that with virtually every other topic I’ve ever done through YT, I did verify the source, but for this one I chose not to. Maybe I trusted the source because it was Wikipedia writing about itself, or maybe because I thought this would be a short video and it wouldn’t be a big deal. Either way, I should have checked the source and noticed that both 2.9275 and 2.95 billion aren’t supported by the available data. 

The actual source links to the English Wikipedia's statistics page. Here we do find a word count...but its most recent count comes from January 2010. 


Since the total word count hasn't been updated since 2010, we really have no way of definitively knowing from this source what the current word count is. Yet I think I have a guess as to where the contributor got 2.95 billion. If you divide Jan. 2010’s 1.798 billion words by 3.1 million, the Jan. 2010 article count, then you get an average of 580 words-per-article (WpA). Now if we assume that this WpA doesn’t change over the years, then multiplying it by the current article count of 5.1 million would give us a current word count of 2.9 billion. Not exactly 2.95 billion (we’d need an average of 590 WpA for that), but close enough that I feel comfortable saying we’re in the right ballpark.
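
If you'd like to check that arithmetic yourself, here's a minimal Python sketch of it. The function and variable names are made up for illustration, and the inputs are just the rounded figures quoted above. One thing worth knowing before you run it: taken literally, 5.1 million articles works out to about 2.96 billion words, while the 2.9 billion and 590 WpA figures line up with a count of about 5.0 million, so the sketch prints both.

```python
# Rounded figures quoted in the post (January 2010 snapshot).
WORDS_JAN_2010 = 1.798e9     # total words
ARTICLES_JAN_2010 = 3.1e6    # total articles
WIKI_PAGE_CLAIM = 2.95e9     # word count shown on the Wikipedia page


def words_per_article(total_words, articles):
    """Average words per article (WpA)."""
    return total_words / articles


def projected_words(wpa, articles):
    """Total words if every article averaged `wpa` words."""
    return wpa * articles


wpa_2010 = words_per_article(WORDS_JAN_2010, ARTICLES_JAN_2010)
print(f"2010 WpA: {wpa_2010:.0f}")

# Assumption: the "current" article count is somewhere around 5.0-5.1 million.
for articles_now in (5.0e6, 5.1e6):
    total = projected_words(wpa_2010, articles_now)
    needed = words_per_article(WIKI_PAGE_CLAIM, articles_now)
    print(
        f"{articles_now / 1e6:.1f}M articles: "
        f"{total / 1e9:.3f} billion words at the 2010 WpA; "
        f"{needed:.0f} WpA needed to reach 2.95 billion"
    )
```

Either way the projection lands within a couple of percent of 2.95 billion, which is the "right ballpark" point.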

Speaking of the 590 WpA count, I actually made my first ever Wikipedia edit to point this out. You can see me explaining what I did and also asking what methodology the previous contributor used to arrive at their conclusions. 


So we’ve worked out the number of total words. The other stat I mentioned is that Wikipedia was adding 440,000 words daily to the site. Is this number of words accurate?

Probably not, because I used the exact same methodology as before (checked it Sept. 1st, didn’t check source, used a random wiki page). So if we’re working with the 2010 WpA count, multiplying that by August’s° new article count gives us an average of 492,420 words added daily. Again, not exactly the same, but close enough that I’m okay saying my original data was approximately correct.
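
Here's the same kind of sanity check for the daily figure, again as a rough Python sketch with made-up names. The post doesn't quote the new-article count directly, so the sketch works backwards from the 492,420 figure and the 580 WpA to show what daily article count it implies, and how far the video's 440,000 is from it.

```python
WPA_2010 = 580                   # words per article from the Jan. 2010 figures
VIDEO_DAILY_WORDS = 440_000      # number used in the video
ESTIMATED_DAILY_WORDS = 492_420  # number from the 2010 WpA estimate


def implied_articles_per_day(daily_words, wpa):
    """New articles per day implied by a daily word count (back-calculated)."""
    return daily_words / wpa


def relative_gap(claimed, estimate):
    """How far the estimate sits above the claimed value, as a fraction."""
    return (estimate - claimed) / claimed


articles_per_day = implied_articles_per_day(ESTIMATED_DAILY_WORDS, WPA_2010)
gap = relative_gap(VIDEO_DAILY_WORDS, ESTIMATED_DAILY_WORDS)

print(f"Implied new articles per day: {articles_per_day:.0f}")
print(f"Estimate is {gap:.1%} above the video's figure")
```

This only checks that the numbers hang together; it can't tell us what the real daily word count was.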

Looking back, I’m not sure if it would have been smarter to use the 2010 numbers. Obviously they’re better sourced, but Wikipedia’s clearly grown so much since then that while the 2010 data may be more numerically accurate, it might actually be a less accurate representation of Wikipedia today. 

. . . . .

I made this post detailing how a couple of numbers I used were off. But the error turned out to be relatively small...so why did I write this, publicly announcing my (seemingly) small mistake? 

I value academic transparency and sources. I think if you’re involved in education (i.e. professing to know something others don’t) it’s your responsibility to not mislead others and to publicly provide access to the information which informed you. I believe learners have a right to know where knowledge came from so they may examine it for themselves. If I make mistakes I want to explain why they happened and make sure people have access to the correct information.

However, I’m okay with making these mistakes because I’m trying a lot of new things. Almost all of my videos up to this point have been simply reciting information I’ve learned. But my last two videos** haven’t been reciting information; they’ve been creating new information from scratch. No source told me how to convert video to words or to measure humans’ ability to read Wikipedia. That was my math, my thought process, my original work. 

So I’m okay with making these types of mistakes because (1) I learn from them, and (2) it means I’m trying more complex things I haven’t done before. I like that :)


~~~Footnote~~~
ªIf you’d like to look for it please be my guest :) https://en.wikipedia.org/w/index.php?title=Wikipedia:Size_in_volumes&action=history

*I did think I would publish this video sooner than I did (more than three months after the research), but ultimately I should have done recalculations to check and see if the stats changed much. They didn’t change much, but as I’ve said the point of this post is to highlight my exact methodology as well as point out what I did irresponsibly and/or incorrectly. 

°Technically I’m not sure what month was used as the source for determining the article count, but I’m using August here since I collected my data on September 1st. 

**excluding the P4A one
