Color Polygraph
A self-directed experiment from the second year of upper-secondary school. I built a twenty-question color survey, sent it to around 160,000 students through the Oslo school directory, and ended up with 6,731 cleaned sessions and a small neural net that could guess a participant's age, gender and self-reported mood from nothing but which color swatches they picked.
Explore the data
Before the write-up, the toy. Drag the cube to rotate it, double-click to resume the slow auto-rotate. The buttons swap between men, women, the difference between them, the colors both groups love, and a free pair of sliders. Tap "Invert" to flip the view to what each group rejects instead. Age filtering is off by default; tap "All ages" to switch the age slider on. Each glowing dot is one cell of an 8×8×8 RGB grid, sized by how often colors from that cell were picked compared to how often they were shown.
Each session ran 64 colors through a knockout bracket. Every round showed 4 colors at once and the favourite advanced.
- 64 colors
- 16 winners
- 4 winners
- 1 winner
Every time a user picks a color, it adds to that color's score and every round-1 loss subtracts. Points are cumulative across every session, and the cube is normalized to the strongest score on display, so pushing one weight high can squeeze weaker colors below the visibility threshold.
Most-loved colors
Voxels with the strongest preference score: shown to many people and picked far more often than chance.
Most-rejected colors
Voxels people were shown over and over but almost never picked. Muddy yellow-browns lose every popularity contest.
The question
Does which colors you prefer, and how you click on them, carry enough signal to predict who you are? I had a hunch the answer was yes, and I wanted the dataset to be big enough that the answer wasn't a coincidence.
The survey
Participants answered three demographic questions (age, gender and a 0 to 60 mood rating where 0 is happy and 60 is glum), then ran through a single-elimination color bracket. Each round showed four swatches and the favourite carried into the next round:
- 64 colors
- 16 rounds16 winners
- 4 rounds4 winners
- 1 round1 winner
Alongside the picks, the survey recorded interaction signals: how long each response took, where on the swatch the click landed, and the order participants scanned the options.
Reach and response
- Around 160,000 students opened the mail through the Oslo school directory
- Around 20,000 finished the full survey (roughly 12.5% completion)
- Every response stored with a timestamp and a session-local salt
At sixteen the most surprising part wasn't the model. It was watching a dataset of twenty thousand human responses arrive in two days.
Who took it, and how
Once the dataset was cleaned, the cohort itself was almost as interesting as the color picks. Below is what the 6,731 valid sessions look like sliced different ways. Each line plot draws men and women separately so the gap, where there is one, is visible.
Headline numbers
Responses over the first days
Daily completions from launch. The second wave on day 9 is the second batch of emails I sent.
Age distribution
Every valid session, by reported age. The 13–18 peak is students; the long tail is teachers, substitutes, admins, and other municipal staff who work or have worked in the school system.
Men vs women by age
Where the gender split lived. The student years are roughly balanced; the adult tail skews female (Norwegian primary-school staff is mostly women).
Happiness by age
Self-reported mood, mapped to happiness (1 = happy, 0 = glum). The adolescence dip lands almost exactly where you'd expect: a long valley from 13 to 18.
Responses through the day
When people actually clicked through the test, by hour they started (Europe/Oslo). The school-day window dominates; the 21:00 evening bump is mostly older respondents at home.
Happiness through the day
Mean happiness of the people who started the test, by when they started (Europe/Oslo). 15-minute buckets through the active stretch (08–22), hourly at the shoulders, and 01–05 skipped since the dataset is too thin to read there.
How long the test took, by age
Mean total time to finish the color bracket. Sessions over ten minutes are dropped here as tabs left open.
Per-question speed by age
Average seconds spent on each swatch decision. Reaction time grows with age; teens click fastest.
Time per slide through the bracket
Mean seconds spent on each of the 21 picks. Q1 carries the startup overhead (reading the page, settling in), then the clicks accelerate as the bracket progresses. Round 2 starts at Q17 and the final at Q21.
How long did people take?
Distribution of total test time in 15-second bins, men and women overlaid. Most people finish inside two minutes.
Cleaning the data
The dataset arrived dirty. Around 20,000 people finished the survey, but the cube at the top only runs on 6,731 of them. Everything else got filtered out by a chain of sanity checks that I built up as I noticed the failure modes:
- Self-reported gender had to match the gender on file in the school directory. People who said the opposite of what was on file were almost always joking.
- Self-reported age had to be plausibly close to the directory age. A first-grader who claims to be 47 isn't answering seriously.
- Everyone under 6 was dropped. So was everyone above 68. Roughly half of all "adults" in the raw data claimed to be exactly 69.
- Sessions where almost every click landed on the same screen coordinate were dropped. Someone tapping one corner twenty times is not picking colors.
- Sessions where the average answer took less than 0.3 seconds were dropped. You can't read four swatches that fast.
- Plus a handful of smaller heuristics for duplicate IDs, missing fields, impossible total times, and people who clicked through without ever moving the mouse.
What survives is the 6,731 sessions where I'm reasonably sure a real human was making a real choice. That's the cleaned set the cube at the top is drawing from.
The model
Features
- Per-trial: chosen color (one-hot), three rejected colors, dwell time, click coordinates relative to swatch centre
- Per-session: total time, time variance, first-click bias
Architecture
A small fully-connected network. Three targets, trained jointly: age (regression), gender (binary classification), mood (0 to 60 ordinal, lower is happier). I leaned on the meta-features as much as the color picks themselves. They turned out to carry a lot of the signal, especially for age.
Six years later I keep coming back to this dataset to test new architectures on it. Stacked gradient boosting, a small transformer, a BiGRU, an LSTM, plain linear baselines. Current numbers live on the architecture leaderboard.
What I learned
Three things. First, that how you click is at least as informative as what you click. Reaction time and click coordinates carried more weight in the age head than the color picks themselves. Second, that running a study at scale is less about the model and more about the boring plumbing around it: surveys, storage, deduplication, abuse handling, the cleaning pipeline I just walked through. The model itself was the smallest file in the project.
Third, that almost two thirds of a "completed" dataset can be noise. The 13,000 sessions I threw away weren't all malicious; plenty were kids speed-running for the sake of it, or people typing 69 as their age out of habit. The lesson stuck. Every dataset I've touched since then I treat as guilty until proven clean.
I'd do this differently today: clearer consent, better aggregation, a published write-up. It was a high-school project and it shows in places, but it taught me that data beats cleverness, and I've never quite let go of that.
Explore the data yourself
Everything that powers the cube above lives in the public repo. The raw export, the Python cleaner that turned 20,000 rows into 6,731, and the aggregator that produced the JSON the viz reads, are all in the project folder. If you want to try a different binning, a different filter, or train your own little net on it, this is where to start.