Data De-Identification: Useful Tool, But No Magic Bullet

By Jules Polonetsky     Feb 10, 2015

The systematic analysis of student data is the only way to know if children in different schools are reading at grade level, or to discover if children in large classes have lower scores than those in smaller ones. But how can edtech companies achieve these goals while protecting the privacy of individual students?

De-identifying data offers a solution.

Because it is separated from the identity of the individual student, de-identified data doesn’t share personally identifiable information, so it doesn’t pose the same threat to privacy. Student privacy laws, including FERPA and California’s SOPIPA, usually include a major exception for data that has been de-identified. Common technical steps for de-identifying data include masking (hiding or deleting personal identifiers such as name or student number), perturbation (changing or swapping the values in a data field, such as test scores, in a way that preserves accurate analysis without disclosing the original values), redaction (expunging sensitive data before consolidating records for analysis), and suppression (hiding or deleting a data field, such as zip code, that would too easily narrow the focus to a particular school).
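
To make those steps concrete, here is a minimal, hypothetical sketch in Python, using invented field names and fabricated student records. It illustrates the general idea of masking identifiers, perturbing test scores, and suppressing a zip code field before releasing data for analysis; it is not a compliant de-identification pipeline.

import random

# A toy "roster" of entirely fabricated student records.
records = [
    {"name": "Ada Smith", "student_id": "S-1001", "zip": "94110", "score": 82},
    {"name": "Ben Jones", "student_id": "S-1002", "zip": "94110", "score": 74},
    {"name": "Cam Lee", "student_id": "S-1003", "zip": "10027", "score": 91},
]

def de_identify(record, noise=3):
    """Return a copy with identifiers masked, the score perturbed,
    and the zip code suppressed (dropped entirely)."""
    return {
        # Masking: hide direct identifiers such as name and student number.
        "name": "MASKED",
        "student_id": "MASKED",
        # Perturbation: add small random noise so aggregate statistics stay
        # roughly accurate while the original value is not disclosed.
        "score": record["score"] + random.randint(-noise, noise),
        # Suppression: the zip code field is simply omitted from the output.
    }

for row in (de_identify(r) for r in records):
    print(row)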

But can personally identifiable information truly be removed from student data? It seems that every month, we read a new report describing how a team of researchers has been able to re-identify data from a purportedly anonymous data set. Some argue that in this era of “Big Data,” true de-identification is impossible. Others disagree, noting that many of the famous examples of re-identification are attacks on data sets that weren’t correctly de-identified in the first place.

When deciding how to appropriately protect data for analysis, school service providers handling student data should look closely at FERPA’s discussion of de-identification. First, FERPA allows and recognizes the importance of research. The U.S. Department of Education explains: “While FERPA is a privacy statute and not a research statute, it should not be a barrier to conducting useful and valid educational research that uses de-identified student data.”

However, FERPA regulations require educational agencies and institutions--and other parties that release de-identified education records--to take into account information that is “linked or linkable to a specific student,” as well as other reasonably available information about the student, so that the cumulative effect does not allow a “reasonable person in the school community to identify the student with reasonable certainty.”
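
One way to reason about “linked or linkable” information is to check how many records share each combination of quasi-identifiers, the idea behind k-anonymity. The hypothetical Python sketch below, with invented fields and values, flags any combination of zip code and grade level held by only one student; such a record is effectively identifying even after names are removed. This is an illustration of the concept, not FERPA’s prescribed test.

from collections import Counter

# Hypothetical records with names already removed; zip code and grade level
# remain as quasi-identifiers that outside information could link back to a
# specific student.
records = [
    {"zip": "94110", "grade": 7, "score": 82},
    {"zip": "94110", "grade": 7, "score": 74},
    {"zip": "10027", "grade": 8, "score": 91},
]

# Count how many records share each quasi-identifier combination.
counts = Counter((r["zip"], r["grade"]) for r in records)

for r in records:
    if counts[(r["zip"], r["grade"])] == 1:
        # Only one student has this combination, so a "reasonable person in
        # the school community" might identify them; the fields should be
        # generalized or suppressed before release.
        print("Unique combination:", r["zip"], "grade", r["grade"])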

But FERPA isn’t all that needs to be considered when analyzing whether information is personal. Companies that collect information directly from students under 13 years of age are often also subject to the federal Children’s Online Privacy Protection Act (COPPA). COPPA includes a broader definition of personal information than FERPA, covering identifiers a company may assign to a user’s device or online behavior over time and across online services. COPPA makes only a narrow exemption for such “persistent identifiers” when they are used to support internal operations, and only so long as they are not used to contact specific individuals, serve behavioral advertising, or build individual profiles.

As a result, data linked to cookies or device identifiers is automatically considered personal information unless it falls under that narrow exemption. Companies subject to COPPA need to be very careful not to allow third-party ad networks or analytics companies to track users, unless they have very clear terms that restrict how data is used. For example, popular social sharing plug-ins AddThis and ShareThis provide cookie-linked web surfing data to third-party ad targeting companies. On a school site serving children, COPPA treats this type of data as personal information and forbids sharing it.

The consequences of running afoul of this law can be severe. COPPA allows penalties of up to $16,500 per violation, per student, assessed directly against companies. In case after case, the Federal Trade Commission has used this statute to impose whopping fines on violators.

In addition, state laws and school contracts may further regulate the use of de-identified data, or might define de-identification differently.

De-identification is not a single on-off switch. Nor is it a magic bullet. Instead, it’s a process.

By understanding the general privacy protections at different de-identification levels, as well as the specific legal requirements unique to student data, edtech companies, schools, and researchers can put data to work for the important analysis needed to improve educational outcomes while avoiding risk to students.

For more detail and information, visit FERPA|SHERPA and the Department of Education's Privacy Technical Assistance Center.
