A district's data team publishes the annual enrollment report to the board. Names are removed. Everyone agrees it's a "de-identified" summary safe to post on the district website.
Six months later, a local reporter uses the report to identify a single student. The report didn't name the student. It didn't have to.
This is the re-identification problem, and it's why FERPA's definition of personally identifiable information explicitly includes combinations of attributes that could identify a student "with reasonable certainty," even if no single field names anyone.
The Intuition Behind K-Anonymity
K-anonymity is a simple idea with a fancy name. The "k" stands for a number. A dataset is "k-anonymous" if, for every row in the data, at least k rows (counting that row itself) share the exact same combination of quasi-identifying attributes — fields like grade, gender, or program status that name no one on their own but can identify someone in combination.
If k = 5, then every row describes a group of at least 5 students who look identical in the data. Even a very determined person with access to the dataset can only narrow any individual down to "one of 5." That's considered reasonably safe.
If k = 1, every row is unique. Each row describes exactly one student. The data is, in effect, identified — just with the names stripped off.
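The definition can be checked mechanically: group the rows by their quasi-identifier combination and take the size of the smallest group. A minimal Python sketch, using made-up student-level records (the fields and values here are illustrative):

```python
from collections import Counter

# Hypothetical de-identified records. Each tuple is a quasi-identifier
# combination: (grade, gender, english_learner).
records = [
    ("1", "F", "No"), ("1", "F", "No"), ("1", "F", "No"),
    ("1", "M", "No"), ("1", "M", "No"),
    ("2", "M", "Yes"),  # a unique combination: this student has k = 1
]

def k_anonymity(rows):
    """k for the dataset: the size of the smallest group of rows that
    share the exact same quasi-identifier combination."""
    counts = Counter(rows)
    return min(counts.values())

print(k_anonymity(records))  # -> 1, because one row is unique
```

If the unique `("2", "M", "Yes")` record were generalized or removed, k would rise to 2, the size of the next-smallest group.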
A Worked Example
Suppose a district publishes the following enrollment breakdown for one of its elementary schools:
| Grade | Gender | English Learner | Special Education | Free/Reduced Lunch | Students |
|---|---|---|---|---|---|
| K | Female | No | No | No | 22 |
| K | Male | No | No | No | 19 |
| 1 | Female | Yes | No | Yes | 12 |
| 1 | Male | Yes | Yes | Yes | 1 |
| 2 | Female | No | Yes | No | 3 |
| 2 | Male | No | No | Yes | 18 |
The combination in row 4 — first grade, male, English Learner, receiving special education services, free/reduced lunch — describes exactly one student.
Anyone with access to this report can go to the school, look at the first-grade boys, and (knowing which families are new to the country, which student has an IEP, and which families qualify for free lunch) narrow down to that one child with high confidence. The dataset is technically "de-identified" — no names are listed. But for this row, k = 1, and the student is effectively named.
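For an aggregate table like this one, the published count in each cell *is* the group size, so checking k reduces to scanning for small counts. A quick sketch over the table above (the threshold of 5 is chosen here only for illustration):

```python
# The published table, as (grade, gender, EL, SpEd, FRL) -> student count.
# In an aggregate release, the count in each cell IS k for that group.
table = {
    ("K", "F", "No",  "No",  "No"):  22,
    ("K", "M", "No",  "No",  "No"):  19,
    ("1", "F", "Yes", "No",  "Yes"): 12,
    ("1", "M", "Yes", "Yes", "Yes"):  1,
    ("2", "F", "No",  "Yes", "No"):   3,
    ("2", "M", "No",  "No",  "Yes"): 18,
}

K_MIN = 5  # illustrative threshold
risky = {cell: n for cell, n in table.items() if n < K_MIN}
for cell, n in sorted(risky.items()):
    print(cell, "-> k =", n)
```

Two cells fail: the first-grade male cell with k = 1, and the second-grade female special-education cell with k = 3 — which is also too small to publish under most guidance.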
Why Small Cells Are Dangerous
Educational researchers have used 10 as a rough safe threshold for decades. The U.S. Department of Education's own guidance recommends not publishing any cell with fewer than 10 students, and suppressing adjacent cells when a single cell would be suppressed (because otherwise subtraction can reveal the suppressed value).
The problem compounds when tables disaggregate by multiple dimensions:
- District-level totals are almost always safe
- School-level totals are usually safe
- Grade-by-gender breakdowns are sometimes safe
- Grade-by-gender-by-demographic-by-program cross-tabulations are frequently unsafe
The more dimensions you add to a breakdown, the smaller the cells get, and the more likely one of them drops to 1, 2, or 3.
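A back-of-envelope calculation shows why. Assuming 600 students spread evenly across the dimensions from the example table (real enrollments are skewed, so the smallest cells are smaller still than this average):

```python
# Average cell size under a uniform spread. Real data is skewed, so the
# minimum cell is typically far below this average.
students = 600
dims = {"grade": 6, "gender": 2, "english_learner": 2,
        "special_ed": 2, "frl": 2}

cells = 1
for levels in dims.values():
    cells *= levels

print(cells, students / cells)  # 96 cells, averaging 6.25 students each
```

Five modest dimensions already push the *average* cell below 10; the skewed tail (small programs, small demographic groups) is where k = 1 cells appear.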
Common Places Small Cells Appear
1. Board Reports With Demographic Breakdowns
Tables like the one above are routine in board packets. They're informative for trustees. They can also be publicly posted as part of the meeting materials, at which point a small cell becomes a permanent public record.
2. State Data Transparency Portals
State-level dashboards often publish school-level breakdowns. Some states apply suppression rules automatically; others leave it to the district. Districts that copy state-portal data onto their own websites without checking suppression rules can end up publishing more granular data than the state itself.
3. Discipline and Attendance Reports
A report titled "Suspensions by Grade and Demographic Group, Third Quarter" can easily contain cells with single-student counts, especially at smaller schools. "One Asian American female in grade 7 was suspended" is not an anonymous statement in a 600-student middle school.
4. Open Data Releases
Open data initiatives are valuable. But published datasets that include student-level records — even with names removed — are essentially always re-identifiable if they contain enough attributes. The classic academic finding is that knowing a person's ZIP code, birth date, and gender uniquely identifies about 87% of Americans. Student datasets often contain much richer information than that.
What Districts Can Do
1. Apply a Minimum Cell Size Rule
Adopt a policy that any publicly reported cross-tabulation of student data must have cells of at least 10 (some districts use 11 or higher). When a cell would be smaller, either combine categories or suppress the cell and its complement.
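A minimal sketch of such a rule applied before publication, using hypothetical grade-level counts and a threshold of 10 (the marker string is a common convention, not a mandate):

```python
# Primary suppression: replace any cell below the minimum with a marker
# instead of publishing the count.
MIN_CELL = 10
counts = {"K": 41, "1": 13, "2": 21, "3": 4}  # hypothetical counts

published = {grade: (n if n >= MIN_CELL else "<10")
             for grade, n in counts.items()}
print(published)  # {'K': 41, '1': 13, '2': 21, '3': '<10'}
```

Note that this alone is not enough when a total is also published — that is the complementary-suppression problem addressed below under "Think About Adjacent Rows."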
2. Apply the Rule to Published Files, Not Just Reports
It's one thing to apply suppression rules when the data team prepares a formal board report. It's another thing entirely to apply them to ad hoc spreadsheets that staff members link from staff pages, bury in PDFs, or share with community groups. Those are frequently the leakiest artifacts.
3. Think About Adjacent Rows
Suppressing a single cell is insufficient if the row total and the remaining cells allow the suppressed value to be calculated by subtraction. "Complementary suppression" — suppressing a second cell to prevent back-calculation — is part of doing this right.
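A sketch of complementary suppression for a single row, assuming the row total is published alongside it (the threshold, categories, and helper name are all illustrative):

```python
# Complementary suppression: if exactly one cell in a row is suppressed
# and the row total is public, the hidden value is recoverable by
# subtraction -- so suppress the next-smallest cell as well.
MIN_CELL = 10

def suppress_row(cells, total_published=True):
    """cells: dict of category -> count. Returns a dict with suppressed
    counts replaced by None."""
    out = dict(cells)
    primary = [c for c, n in cells.items() if n < MIN_CELL]
    for c in primary:
        out[c] = None
    if total_published and len(primary) == 1:
        # One suppressed cell + a published total = back-calculable.
        remaining = [c for c in out if out[c] is not None]
        second = min(remaining, key=lambda c: cells[c])
        out[second] = None
    return out

row = {"Asian": 4, "Black": 35, "Hispanic": 61, "White": 112}
print(suppress_row(row))
# {'Asian': None, 'Black': None, 'Hispanic': 61, 'White': 112}
```

The smallest remaining cell is suppressed as the complement because it leaks the least information; production disclosure-control tools also handle the harder case where suppressions must stay consistent across multiple overlapping tables.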
4. Use Statistical Testing Tools
Manual k-anonymity checking is tedious. Statistical disclosure control tools (including the analysis built into SchoolScan) can test entire datasets and flag rows where k falls below a threshold — both rows that are risky on their own and rows that become risky only when combined with other published data.
5. Think About What's Already Online
Re-identification risk compounds across datasets. A single published file may be fine in isolation, but combined with a second file elsewhere on the district website, it becomes identifiable. Districts that audit their entire public data footprint — not just individual releases — catch risks that are invisible at the single-file level.
The Deeper Point
"De-identification" in K-12 is not as simple as deleting the name column. A dataset can be de-identified in name and still identified in effect. FERPA recognizes this: the regulatory definition of personally identifiable information explicitly includes "other information that, alone or in combination, is linked or linkable to a specific student that would allow a reasonable person in the school community... to identify the student with reasonable certainty."
K-anonymity gives districts a concrete, measurable way to evaluate whether a published dataset actually meets that standard. It's not the only tool — differential privacy and synthetic data are more rigorous alternatives — but it's the most accessible one for a typical K-12 data team.
The goal isn't perfection. The goal is not publishing a table where k = 1.
Want to know where re-identification risks live in your district's data?
SchoolScan automatically runs k-anonymity analysis on published spreadsheets and CSV files across your web presence, flagging rows that describe individual students.
Request a Demo