Ask anyone in the open source AI community, and they will tell you the gap between them and the big private companies is more than just computing power. AI2 is working to fix that, first with fully open source databases and models, and now with an open and easily adapted post-training regimen to turn “raw” large language models into usable ones.
Contrary to what many think, “foundation” language models don’t come out of the training process ready to put to work. The pre-training process is necessary, of course, but far from sufficient. Some even believe that pre-training may soon no longer be the most important part at all.
That’s because the post-training process is increasingly being shown to be where real value can be created. That’s where a giant, know-it-all network, one that will as readily produce Holocaust denial talking points as it will cookie recipes, is molded into something usable. You generally don’t want a model that does the former!
Companies are secretive about their post-training regimens because, while everyone can scrape the web and make a model using state-of-the-art methods, making that model useful to, say, a therapist or research analyst is a completely different challenge.
AI2 (formerly known as the Allen Institute for AI) has spoken out about the lack of openness in ostensibly “open” AI projects, like Meta’s Llama. While the model is indeed free for anyone to use and tweak, the sources and process of making the raw model and the method of training it for general use remain carefully guarded secrets. It’s not bad — but it also isn’t really “open.”
AI2, on the other hand, is committed to being as open as it can possibly be, from exposing its data collection, curation, cleaning, and other pipelines to the exact training methods it used to produce LLMs like OLMo.
But the simple truth is that few …