So-called reasoning AI models are becoming easier — and cheaper — to develop.
On Friday, NovaSky, a team of researchers based out of UC Berkeley’s Sky Computing Lab, released Sky-T1-32B-Preview, a reasoning model that’s competitive with an earlier version of OpenAI’s o1 on a number of key benchmarks. Sky-T1 appears to be the first truly open source reasoning model in the sense that it can be replicated from scratch; the team released the data set they used to train it as well as the necessary training code.
“Remarkably, Sky-T1-32B-Preview was trained for less than $450,” the team wrote in a blog post, “demonstrating that it is possible to replicate high-level reasoning capabilities affordably and efficiently.”
$450 might not sound that affordable. But it wasn’t long ago that the price tag for training a model with comparable performance often ran into the millions of dollars. Synthetic training data, or training data generated by other models, has helped drive costs down. Palmyra X 004, a model recently released by AI company Writer and trained almost entirely on synthetic data, reportedly cost just $700,000 to develop.
Unlike most AI, reasoning models effectively fact-check themselves, which helps them to avoid some of the pitfalls that normally trip up models. Reasoning models take a little longer, usually seconds to minutes longer, to arrive at solutions than a typical non-reasoning model. The upside is that they tend to be more reliable in domains such as physics, mathematics, and other sciences.
The NovaSky team says it used another reasoning model, Alibaba’s QwQ-32B-Preview, to generate the initial training data for Sky-T1, then “curated” the data mixture and leveraged OpenAI’s GPT-4o-mini to refactor the data into a more workable format. Training the 32-billion-parameter Sky-T1 took about 19 hours using a rack of 8 Nvidia H100 GPUs. (Parameters roughly correspond to a model’s problem-solving skills.)
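As a rough sanity check on the sub-$450 figure, the reported training time and GPU count can be turned into an implied hourly rate. The sketch below assumes the quoted cost covers only GPU rental for the 19-hour run and treats $450 as an upper bound; the function and variable names are illustrative and do not come from the NovaSky team.

```python
# Back-of-the-envelope estimate using only figures reported in the article.
# Assumption: the cost reflects GPU rental alone (no data generation,
# storage, or engineering time), and $450 is an upper bound.

TRAINING_HOURS = 19        # "about 19 hours"
NUM_GPUS = 8               # 8 Nvidia H100 GPUs
REPORTED_COST_USD = 450    # "less than $450"


def gpu_hours(hours: float, gpus: int) -> float:
    """Total GPU-hours consumed by a run."""
    return hours * gpus


def implied_hourly_rate(cost_usd: float, hours: float, gpus: int) -> float:
    """Cost per GPU-hour implied by a total cost and run size."""
    return cost_usd / gpu_hours(hours, gpus)


total = gpu_hours(TRAINING_HOURS, NUM_GPUS)
rate = implied_hourly_rate(REPORTED_COST_USD, TRAINING_HOURS, NUM_GPUS)
print(f"{total} GPU-hours, at most ${rate:.2f} per GPU-hour")
# prints "152 GPU-hours, at most $2.96 per GPU-hour"
```

Under those assumptions, the run works out to 152 GPU-hours at just under $3 per H100-hour.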
According to the NovaSky team, Sky-T1 performs better than an early preview version of o1 on MATH500, a collection of “competition-level” math challenges. The model also beats the preview of o1 on a set of difficult problems from LiveCodeBench, a coding evaluation.
However, Sky-T1 falls short of the o1 preview on GPQA-Diamond, …