LLM-Powered Sorting with TrueSkill

Some things are easier to compare than to score. Ask someone to rate a photo from 1–10 and you’ll get noise; ask them which of two photos they prefer and you’ll get signal. The same dynamic shows up when using LLMs as judges: pairwise comparisons tend to be more reliable than absolute scores.

TrueSkill, originally designed for Xbox Live matchmaking, is a Bayesian rating system: it keeps a skill estimate and an uncertainty for each competitor and updates both from match outcomes. Pair it with an LLM judge that picks the better of two items and you get a practical sorting pipeline for anything that resists direct scoring: creative writing, research directions, design options, and more.
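To make the idea concrete, here is a minimal sketch rather than the pipeline from the full write-up. It hand-rolls the standard two-player, no-draw TrueSkill update using the conventional defaults (mu = 25, sigma = 25/3, beta = sigma/2) and leaves out the dynamics term that real implementations add. The names `Rating`, `update`, `trueskill_sort`, and `judge` are all hypothetical, and `longer_wins` is only a deterministic stand-in for where an LLM call would go.

```python
import itertools
import math
import random
from dataclasses import dataclass

MU0 = 25.0           # conventional TrueSkill prior mean
SIGMA0 = MU0 / 3     # conventional prior standard deviation
BETA = SIGMA0 / 2    # performance noise per comparison


@dataclass
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner, loser):
    """Two-player, no-draw TrueSkill update. Returns (new_winner, new_loser)."""
    c2 = 2 * BETA**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / max(_cdf(t), 1e-12)  # mean shift; large when the result is an upset
    w = v * (v + t)                    # variance shrink factor, between 0 and 1
    new_winner = Rating(
        winner.mu + winner.sigma**2 / c * v,
        winner.sigma * math.sqrt(1 - winner.sigma**2 / c2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma**2 / c * v,
        loser.sigma * math.sqrt(1 - loser.sigma**2 / c2 * w),
    )
    return new_winner, new_loser


def trueskill_sort(items, judge, rounds=3, seed=0):
    """Rank items best-first using judge(a, b) -> True when a beats b.

    Assumption: every pair is compared once per round. A real pipeline
    would likely sample pairs to cut down on judge calls.
    """
    rng = random.Random(seed)
    ratings = [Rating() for _ in items]
    pairs = list(itertools.combinations(range(len(items)), 2))
    for _ in range(rounds):
        rng.shuffle(pairs)
        for i, j in pairs:
            if judge(items[i], items[j]):
                ratings[i], ratings[j] = update(ratings[i], ratings[j])
            else:
                ratings[j], ratings[i] = update(ratings[j], ratings[i])
    order = sorted(range(len(items)), key=lambda k: ratings[k].mu, reverse=True)
    return [(items[k], ratings[k]) for k in order]


def longer_wins(a, b):
    """Stand-in judge (NOT an LLM): the longer string wins."""
    return len(a) > len(b)


if __name__ == "__main__":
    for item, rating in trueskill_sort(["a", "bbb", "cc", "dddd"], longer_wins):
        print(f"{item:5s} mu={rating.mu:.2f} sigma={rating.sigma:.2f}")
```

Swapping `longer_wins` for a function that prompts a model with both items and parses its preference is the only change needed to turn this into an LLM-judged sort. Because each rating carries a `sigma`, a noisy or inconsistent judge shows up as uncertainty that shrinks more slowly instead of silently corrupting the order.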

This post walks through the approach, the tradeoffs, and some practical lessons from building it.

Full write-up coming soon.



