Sunday, December 20, 2015

Wikipedia Calculations

This is a little write-up explaining some statistics and calculations I did for my most recent video, “Can You Read All of Wikipedia?” While the statistics from the video are all approximately correct (as I mentioned), they should have been a little more accurate. I want to explain why.

Let's begin with this Wikipedia article showing Wikipedia's word count. 


Two things should stand out: 

  1. The word count here is 2.95 billion, not 2.9275 billion like I claimed, and...
  2. After the word count the brackets literally say “not in citation given”
These two are related. It was September 1st* when I last checked this page and finalized my calculations. Back then, it said that there were 2.9275 billion words and it didn’t have that bracket saying that number wasn’t in the citation. I tried to look through the edit history to find this number but I couldn’t. I really remember seeing that number there (because where else would I have gotten it?) but it's seeming more and more like I pulled it out of thin air.ª

ANYWAY the point is that even if that number had been there, I didn’t verify it by checking the source. As someone who attempts to research material with academic curiosity and legitimacy, this is embarrassingly irresponsible. I will say that with virtually every other topic I’ve ever done through YT, I did verify the source, but for this one I chose not to. Maybe I trusted the source because it was Wikipedia writing about itself, or maybe because I thought this would be a short video and it wouldn’t be a big deal. Either way, I should have checked the source and noticed that neither 2.9275 nor 2.95 billion is supported by the available data.

The actual source links to the English Wikipedia's statistics page. Here we do find a word count, but its most recent count comes from January 2010.


Since the total word count hasn't been updated since 2010, we really have no way of knowing definitively from this source what the current word count is. Yet I think I have a guess as to where the contributor got 2.95 billion. If you divide Jan. 2010’s 1.798 billion words by 3.1 million, the Jan. 2010 article count, you get an average of 580 words-per-article (WpA). Now if we assume that WpA doesn’t change over the years, then multiplying it by the current article count of 5.1 million would give us a current word count of 2.9 billion. That's not exactly 2.95 billion (we’d need an average of 590 WpA for that), but it's close enough that I feel comfortable saying we’re in the right ballpark.
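For anyone who wants to redo the extrapolation, here is a minimal Python sketch of it. The function and variable names are my own, and the inputs are just the rounded figures quoted above, so treat the output as approximate. With the rounded 5.1 million article count the product lands nearer 2.96 billion, while an even 5.0 million gives 2.9 billion, so the estimate is sensitive to how the article count is rounded.

```python
# Rounded figures quoted in the post; these are approximate inputs,
# not exact values pulled from Wikipedia's statistics page.
WORDS_JAN_2010 = 1.798e9     # total words, January 2010
ARTICLES_JAN_2010 = 3.1e6    # article count, January 2010
ARTICLES_NOW = 5.1e6         # current article count as quoted


def words_per_article(total_words, article_count):
    """Average words-per-article (WpA) for a given snapshot."""
    return total_words / article_count


def extrapolate_word_count(wpa, article_count):
    """Estimate total words, assuming WpA has stayed constant."""
    return wpa * article_count


wpa_2010 = words_per_article(WORDS_JAN_2010, ARTICLES_JAN_2010)
estimate = extrapolate_word_count(wpa_2010, ARTICLES_NOW)

print(f"{wpa_2010:.0f} words per article")
print(f"{estimate / 1e9:.2f} billion words (estimated)")
```

The big assumption is the constant WpA. If articles have gotten longer or shorter on average since 2010, the estimate shifts proportionally.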

Speaking of the 590 WpA count, I actually made my first ever Wikipedia edit to point this out. You can see me explaining what I did and also asking what methodology the previous contributor used to arrive at their conclusions. 


So we’ve worked out the number of total words. The other stat I mentioned is that Wikipedia was adding 440,000 words daily to the site. Is this number of words accurate?

Probably not, because I used the exact same methodology as before (checked it Sept. 1st, didn’t check the source, used a random wiki page). So if we’re working with the 2010 WpA count, multiplying that by August’s° daily new article count gives us an average of 492,420 words added daily. Again, not exactly the same, but close enough that I’m okay saying my original data was approximately correct.
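The daily figure is the same kind of multiplication, sketched below in Python. The names are my own, and the daily new-article count is not quoted in this post, so the sketch only back-solves the count implied by the two numbers that are given (492,420 words and 580 WpA). It is an inference, not a sourced statistic.

```python
WPA_2010 = 580               # words-per-article from the January 2010 figures
WORDS_PER_DAY_EST = 492_420  # daily estimate stated in the post
WORDS_PER_DAY_VIDEO = 440_000  # daily figure used in the video


def words_added_per_day(wpa, new_articles_per_day):
    """Daily words added, assuming every new article has the average length."""
    return wpa * new_articles_per_day


def implied_new_articles_per_day(words_per_day, wpa):
    """Back-solve the daily article count implied by a daily word total."""
    return words_per_day / wpa


implied_articles = implied_new_articles_per_day(WORDS_PER_DAY_EST, WPA_2010)
relative_gap = (WORDS_PER_DAY_EST - WORDS_PER_DAY_VIDEO) / WORDS_PER_DAY_VIDEO

print(f"Implied new articles per day: {implied_articles:.0f}")
print(f"Estimate is {relative_gap:.0%} above the video's figure")
```

Note that this only counts words in brand-new articles. Edits that lengthen existing articles aren't captured, so it is a rough measure either way.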

Looking back, I’m not sure if it would have been smarter to use the 2010 numbers. They’re the ones the source actually supports, but Wikipedia has clearly grown so much since then that, while the 2010 data may be more numerically accurate, it might actually be a less accurate representation of Wikipedia today.

. . . . .

I made this post detailing how a couple of numbers I used were off by a certain degree. But this degree turned out to be relatively small, so why did I write this, publicly announcing my (seemingly) small mistake?

I value academic transparency and sources. I think if you’re involved in education (i.e. professing to know something others don’t) it’s your responsibility to not mislead others and to publicly provide access to the information which informed you. I believe learners have a right to know where knowledge came from so they may examine it for themselves. If I make mistakes, I want to explain why they happened and make sure people have access to the correct information.

However, I’m okay with making these mistakes because I’m trying a lot of new things. Almost all of my videos up to this point have been simply reciting information I’ve learned. But my last two videos** haven’t been reciting information; they’ve been creating new information from scratch. No source told me how to convert video to words or how to measure humans’ ability to read Wikipedia. That was my math, my thought process, my original work.

So I’m okay with making these types of mistakes because (1) I learn from them, and (2) it means I’m trying more complex things I haven’t done before. I like that :)


~~~Footnote~~~
ªIf you’d like to look for it please be my guest :) https://en.wikipedia.org/w/index.php?title=Wikipedia:Size_in_volumes&action=history

*I did think I would publish this video sooner than I did (more than three months after the research), but ultimately I should have done recalculations to check and see if the stats changed much. They didn’t change much, but as I’ve said the point of this post is to highlight my exact methodology as well as point out what I did irresponsibly and/or incorrectly. 

°Technically I’m not sure what month was used as the source for determining the article count, but I’m using August here since I collected my data on September 1st. 

**excluding the P4A one

Sunday, December 6, 2015

Interpreting the Law


I found this statement of academic integrity in a study guide for one of my classes. I think this statement is intended to prevent one person from e-mailing the study guide to everyone in the class, but regardless, the language is way too broad: “Distributing this information in any form…in any way…is a violation of academic integrity and the student code of conduct.” This means that if I physically give the unmodified study guide to a friend in the same class who had the ability to print it for himself, that’d be cheating. And that’s ridiculous.

This makes me think of two philosophies that exist for interpreting the laws: (1) that they should be followed to the letter with no exceptions or wiggle room and (2) that the laws are meant to represent general concepts and there can be extenuating circumstances where they shouldn’t apply. (Obviously I think the study guide falls into the second interpretation.) 

Now, there are good and bad components to each of these interpretations. For (1), it is very clear and obvious what constitutes the law, which I think makes applying and understanding it easier. The bad thing is that it might penalize someone who shares an unmodified study guide even though their friend had the ability to print it for themselves. For (2), the good is that it wouldn’t penalize that person, but the bad is that it can be time-consuming to decide things on a case-by-case basis and it could also be difficult to “draw the line” in terms of where the extenuating circumstances lie. Still, I lean towards the second one because laws are written without a full understanding of how they might play out in the real world. If my teacher saw me hand a printed study guide to a friend, I doubt she’d even think of it as cheating.


I thought about all of this because my friend actually e-mailed me a study guide she had filled out. I didn’t think twice about this until I realized that it was technically cheating. But it doesn’t feel like it…I mean, we could have just as easily met up in person and read each other our thoughts and research for each question. If we were called into the office of student academic integrity, I’d make a good case for us.