Understanding the Math of Big Data

I was doing some reading about big data recently and came across an article asserting that "the most adept" edtech companies are tracking up to 10 million data points per student, per day. The intent of the piece was to scare us with the idea that companies are watching every move our children make. Instead, the reference spurred me to think a bit more about how much data our schools are generating or using and what that means for kids.

10 Million Data Points?

Take that figure of 10 million data points a day: At first I thought it was utterly ridiculous; there are, after all, only 86,400 seconds in an entire day. However, upon further reflection, it is possible that "10 million data points" is just a greatly exaggerated way of saying that some companies store around 10 MB of data per day, which is not an unreasonable amount. As an example, phonics software might store G.711 recitation samples for a teacher to review later, which could easily total 10 MB of "data" per day (approximately 20 minutes of audio at G.711's 64 kbit/s). Mystery solved!
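As a rough check on those two numbers, here is a minimal sketch in Python. It assumes decimal megabytes (1 MB = 1,000,000 bytes) and G.711's standard 64 kbit/s bit rate; the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400 seconds
G711_BYTES_PER_SECOND = 64_000 // 8   # G.711 runs at 64 kbit/s = 8,000 bytes/s


def points_per_second(points_per_day: float) -> float:
    """How many data points per second a daily total implies, around the clock."""
    return points_per_day / SECONDS_PER_DAY


def g711_minutes(megabytes: float) -> float:
    """Minutes of G.711 audio that fit in the given number of decimal megabytes."""
    total_bytes = megabytes * 1_000_000
    return total_bytes / G711_BYTES_PER_SECOND / 60


if __name__ == "__main__":
    print(f"{points_per_second(10_000_000):.1f} data points per second")
    print(f"{g711_minutes(10):.1f} minutes of G.711 audio in 10 MB")
```

Ten million points a day works out to more than a hundred per second with no breaks, which is why the "10 MB" reading seems more plausible.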

Then I began to wonder what the real cost of those 10 MB is for schools, so I did some more math.

The Cost of Big Data

Try this thought experiment: Take a school of 300 students and assume they use the aforementioned phonics software 20 days every month. So 10 MB X 300 students X 20 days = 60,000 MB (60 GB) of data transferred and stored every month. Over a nine-month school year, that comes out to 540 GB of storage.
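The same thought experiment can be written as a few lines of Python, which makes it easy to plug in your own school's numbers. This is only a sketch: the inputs are the ones from the paragraph above, 1 GB is treated as 1,000 MB, and the function name is invented for illustration.

```python
def monthly_volume_gb(mb_per_student_per_day: float,
                      students: int,
                      days_per_month: int) -> float:
    """Data generated by the whole school in one month, in GB (1 GB = 1,000 MB)."""
    return mb_per_student_per_day * students * days_per_month / 1000


# Inputs taken from the thought experiment in the text.
MB_PER_STUDENT_PER_DAY = 10
STUDENTS = 300
DAYS_PER_MONTH = 20
MONTHS_PER_SCHOOL_YEAR = 9

monthly_gb = monthly_volume_gb(MB_PER_STUDENT_PER_DAY, STUDENTS, DAYS_PER_MONTH)
yearly_gb = monthly_gb * MONTHS_PER_SCHOOL_YEAR

if __name__ == "__main__":
    print(f"Per month: {monthly_gb:.0f} GB")
    print(f"Per school year: {yearly_gb:.0f} GB")
```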

Now take that 540 GB: Allowing for 100 server operations per second, 60 GB of data both in and out each month, and a 60 GB monthly snapshot to back up the data, Amazon EC2 estimates the cost to be $49.59/month, or roughly an additional $1.50/year per student ($50 X 9 months / 300 students = $1.50). Not a trivial amount, but not a deal-breaker either. The good news is that storage and bandwidth tend to get cheaper every year, so that price should trend even lower as time goes on.
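For anyone who wants to rerun the per-student figure with a different quote, here is a small sketch. The $49.59 monthly estimate is the one quoted above; the function name is hypothetical, and the calculation does nothing more than spread the hosting bill evenly across students.

```python
def cost_per_student_per_year(monthly_cost_usd: float,
                              months: int,
                              students: int) -> float:
    """Spread the hosting bill for the school year evenly across all students."""
    return monthly_cost_usd * months / students


MONTHLY_COST_USD = 49.59   # estimate quoted in the text
MONTHS = 9                 # length of the school year
STUDENTS = 300

exact = cost_per_student_per_year(MONTHLY_COST_USD, MONTHS, STUDENTS)
rounded = cost_per_student_per_year(50, MONTHS, STUDENTS)  # the $50 shortcut

if __name__ == "__main__":
    print(f"Using $49.59/month: ${exact:.2f} per student per year")
    print(f"Using $50/month:    ${rounded:.2f} per student per year")
```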

However, this doesn't quite tell the whole story.

Data Bridges

Internet speed can be divided into two basic components: bandwidth and latency. These components can be compared to the number of lanes on a bridge (bandwidth) and the speed limit (latency: the higher the speed limit, the lower the latency). Increasing the number of lanes will get more cars into the city during times of heavy traffic but doesn't much matter when traffic is light. Increasing the speed limit, by contrast, only improves travel times when traffic is light.

School internet traffic is similar to bridge traffic in that it occurs in bursts: 100 kids simultaneously cramming 10 MB of data each through the pipes is the equivalent of morning rush hour on the San Francisco Bay Bridge. Any school using web technology needs to have a very wide bridge.
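To put rough numbers on that rush hour, here is a simplified sketch. It models transfer time as a fixed per-trip delay (usually called latency) plus the data volume divided by the width of the bridge (bandwidth). The link speeds below are hypothetical examples, not measurements, and real networks add protocol overhead that this ignores.

```python
def transfer_seconds(total_megabytes: float,
                     link_mbit_per_s: float,
                     latency_s: float = 0.0) -> float:
    """Idealized time to move a burst of data over one shared link."""
    megabits = total_megabytes * 8          # 1 byte = 8 bits
    return latency_s + megabits / link_mbit_per_s


# Rush hour from the text: 100 students sending 10 MB each at the same time.
RUSH_HOUR_MB = 100 * 10

# Hypothetical school connections, chosen only for illustration.
LINKS = {
    "wide bridge (100 Mbit/s)": 100,
    "narrow bridge (10 Mbit/s)": 10,
}

if __name__ == "__main__":
    for name, mbit_per_s in LINKS.items():
        minutes = transfer_seconds(RUSH_HOUR_MB, mbit_per_s) / 60
        print(f"{name}: about {minutes:.1f} minutes")
```

In this model a fraction of a second of extra delay barely changes the total for a burst this large, while a tenfold cut in bandwidth multiplies the wait by ten.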

Consider that as a bridge gets longer, it gets far more expensive to widen. This is a problem that we might take for granted in places like San Francisco, since the Bay is home to both the widest bridge in the world and fantastic network infrastructure. Unfortunately, schools in the rest of the world are often much farther from their local internet providers, and those providers are much farther from wherever the web servers are hosted. Imagine your morning commute over a Bay Bridge that's four times as long and half as wide and you'll get an idea of what downloading a web app in South Africa is like.

Slim Data

So can you collect big data in places with narrow bandwidth? Well, technically, yes: it will just take a little bit longer. But that "little bit longer" is classroom time -- minutes when a student is sitting idle in front of a screen, or time when data transfer could be interrupted by a power failure or even a closed browser window. In other words, in a classroom where time is the most valuable commodity, the costs of big data can be far greater than $1.50 per student.

This reflection isn't so much a critique of big data as it is a general examination of the cost of data transfer for schools -- and keep in mind, downloading an app is data transfer as well. In order for web-based software to reach as many students as possible, we should shift our focus from "big data" to "slim data" and lightweight apps.