hobby 2020

Color Polygraph

A self-directed experiment from the second year of upper-secondary school. I built a twenty-question color survey, sent it to around 160,000 students through the Oslo school directory, and ended up with 6,731 cleaned sessions and a small neural net that could guess a participant's age, gender and self-reported mood from nothing but which color swatches they picked.

Explore the data Model leaderboard Test it yourself

Explore the data

Before the write-up, the toy. Drag the cube to rotate it, double-click to resume the slow auto-rotate. The buttons swap between men, women, the difference between them, the colors both groups love, and a free pair of sliders. Tap "Invert" to flip the view to what each group rejects instead. Age filtering is off by default; tap "All ages" to switch the age slider on. Each glowing dot is one cell of an 8×8×8 RGB grid, sized by how often colors from that cell were picked compared to how often they were shown.

R G B

all all ages n = 0

Men gender Women

6 age · 13–18 68

0 happiness 1

0 hours invert 23

Most-loved colors

Voxels with the strongest preference score: shown to many people and picked far more often than chance.

Most-rejected colors

Voxels people were shown over and over but almost never picked. Muddy yellow-browns lose every popularity contest.

The question

Does which colors you prefer, and how you click on them, carry enough signal to predict who you are? I had a hunch the answer was yes, and I wanted the dataset to be big enough that the answer wasn't a coincidence.

The survey

Participants answered three demographic questions (age, gender and a 0 to 60 mood rating where 0 is happy and 60 is glum), then ran through a single-elimination color bracket. Each round showed four swatches and the favourite carried into the next round:

64 colors
→
16 rounds16 winners
→
4 rounds4 winners
→
1 round1 winner

Alongside the picks, the survey recorded interaction signals: how long each response took, where on the swatch the click landed, and the order participants scanned the options.

Test it yourself

Reach and response

Around 160,000 students opened the mail through the Oslo school directory
Around 20,000 finished the full survey (roughly 12.5% completion)
Every response stored with a timestamp and a session-local salt

At sixteen the most surprising part wasn't the model. It was watching a dataset of twenty thousand human responses arrive in two days.

Who took it, and how

Once the dataset was cleaned, the cohort itself was almost as interesting as the color picks. Below is what the 6,731 valid sessions look like sliced different ways. Each line plot draws men and women separately so the gap, where there is one, is visible.

Headline numbers

Responses over the first days

Daily completions from launch. The second wave on day 9 is the second batch of emails I sent.

men women

Age distribution

Every valid session, by reported age. The 13–18 peak is students; the long tail is teachers, substitutes, admins, and other municipal staff who work or have worked in the school system.

Men vs women by age

Where the gender split lived. The student years are roughly balanced; the adult tail skews female (Norwegian primary-school staff is mostly women).

men women

Happiness by age

Self-reported mood, mapped to happiness (1 = happy, 0 = glum). The adolescence dip lands almost exactly where you'd expect: a long valley from 13 to 18.

men women

Responses through the day

When people actually clicked through the test, by hour they started (Europe/Oslo). The school-day window dominates; the 21:00 evening bump is mostly older respondents at home.

men women

Happiness through the day

Mean happiness of the people who started the test, by when they started (Europe/Oslo). 15-minute buckets through the active stretch (08–22), hourly at the shoulders, and 01–05 skipped since the dataset is too thin to read there.

men women

How long the test took, by age

Mean total time to finish the color bracket. Sessions over ten minutes are dropped here as tabs left open.

men women

Per-question speed by age

Average seconds spent on each swatch decision. Reaction time grows with age; teens click fastest.

men women

Time per slide through the bracket

Mean seconds spent on each of the 21 picks. Q1 carries the startup overhead (reading the page, settling in), then the clicks accelerate as the bracket progresses. Round 2 starts at Q17 and the final at Q21.

men women

How long did people take?

Distribution of total test time in 15-second bins, men and women overlaid. Most people finish inside two minutes.

men women

Cleaning the data

The dataset arrived dirty. Around 20,000 people finished the survey, but the cube at the top only runs on 6,731 of them. Everything else got filtered out by a chain of sanity checks that I built up as I noticed the failure modes:

Self-reported gender had to match the gender on file in the school directory. People who said the opposite of what was on file were almost always joking.
Self-reported age had to be plausibly close to the directory age. A first-grader who claims to be 47 isn't answering seriously.
Everyone under 6 was dropped. So was everyone above 68. Roughly half of all "adults" in the raw data claimed to be exactly 69.
Sessions where almost every click landed on the same screen coordinate were dropped. Someone tapping one corner twenty times is not picking colors.
Sessions where the average answer took less than 0.3 seconds were dropped. You can't read four swatches that fast.
Plus a handful of smaller heuristics for duplicate IDs, missing fields, impossible total times, and people who clicked through without ever moving the mouse.

What survives is the 6,731 sessions where I'm reasonably sure a real human was making a real choice. That's the cleaned set the cube at the top is drawing from.

The model

Features

Per-trial: chosen color (one-hot), three rejected colors, dwell time, click coordinates relative to swatch centre
Per-session: total time, time variance, first-click bias

Architecture

A small fully-connected network. Three targets, trained jointly: age (regression), gender (binary classification), mood (0 to 60 ordinal, lower is happier). I leaned on the meta-features as much as the color picks themselves. They turned out to carry a lot of the signal, especially for age.

Six years later I keep coming back to this dataset to test new architectures on it. Stacked gradient boosting, a small transformer, a BiGRU, an LSTM, plain linear baselines. Current numbers live on the architecture leaderboard.

What I learned

Three things. First, that how you click is at least as informative as what you click. Reaction time and click coordinates carried more weight in the age head than the color picks themselves. Second, that running a study at scale is less about the model and more about the boring plumbing around it: surveys, storage, deduplication, abuse handling, the cleaning pipeline I just walked through. The model itself was the smallest file in the project.

Third, that almost two thirds of a "completed" dataset can be noise. The 13,000 sessions I threw away weren't all malicious; plenty were kids speed-running for the sake of it, or people typing 69 as their age out of habit. The lesson stuck. Every dataset I've touched since then I treat as guilty until proven clean.

I'd do this differently today: clearer consent, better aggregation, a published write-up. It was a high-school project and it shows in places, but it taught me that data beats cleverness, and I've never quite let go of that.

Explore the data yourself

Everything that powers the cube above lives in the public repo. The raw export, the Python cleaner that turned 20,000 rows into 6,731, and the aggregator that produced the JSON the viz reads, are all in the project folder. If you want to try a different binning, a different filter, or train your own little net on it, this is where to start.

Project folder on GitHub Raw cleaned export (save.json, ~10 MB) Aggregator script (treat_data.py) Aggregated voxel data (preferences.json)

Back to timeline