This article is part of VentureBeat’s special issue, “The Real Cost of AI: Performance, Efficiency and ROI at Scale.”
Model providers continue to roll out increasingly sophisticated large language models (LLMs) with longer context windows and enhanced reasoning capabilities.
This allows models to process and “think” more, but it also increases compute: The more a model takes in and puts out, the more energy it expends and the higher the costs.
Couple this with all the tinkering involved in prompting (it can take several tries to reach the intended result, and sometimes the question at hand simply doesn’t need a model that can think like a PhD) and compute spend can quickly get out of control.
This is giving rise to prompt ops, a whole new discipline in the dawning age of AI.
“Prompt engineering is kind of like writing, the actual creating, whereas prompt ops is like publishing, where you’re evolving the content,” Crawford Del Prete, IDC president, told VentureBeat. “The content is alive, the content is changing, and you want to make sure you’re refining that over time.”
The challenge of compute use and cost
Compute use and cost are two “related but separate concepts” in the context of LLMs, explained David Emerson, applied scientist at the Vector Institute. Generally, the price users pay scales with both the number of input tokens (what the user prompts) and the number of output tokens (what the model delivers). However, they are not charged for behind-the-scenes actions like meta-prompts, steering instructions or retrieval-augmented generation (RAG).
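A minimal sketch of how that token-based billing adds up (the per-1,000-token rates below are hypothetical placeholders, not any provider’s published prices):

```python
# Minimal sketch of token-based LLM billing. Rates are hypothetical
# placeholders; real providers publish their own per-token prices.
INPUT_PRICE_PER_1K = 0.005   # assumed $ per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.015  # assumed $ per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Price scales with both what the user sends and what the model returns."""
    return (
        (input_tokens / 1000) * INPUT_PRICE_PER_1K
        + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    )

# A verbose prompt and a long answer both drive the bill up.
print(f"${estimate_cost(input_tokens=2_000, output_tokens=800):.4f}")
```

Tightening the prompt or capping the response length shrinks both terms of that sum, which is exactly the lever prompt ops pulls.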
While longer context allows models to process much more text at once, it directly translates to significantly more FLOPs (floating-point operations, a measure of compute), he explained. Some aspects of transformer models even scale quadratically with input length if not well managed. Unnecessarily long responses can also slow down processing time and require additional …
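To see why that quadratic scaling bites, here is a back-of-the-envelope sketch (the model dimension is an assumed placeholder, not any specific model’s size): the self-attention score matrix grows with the square of the sequence length, so doubling the context roughly quadruples that slice of the compute.

```python
# Back-of-the-envelope FLOP count for one self-attention layer.
# d_model = 4096 is an assumed placeholder, not a specific model's size.
def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    # Computing the QK^T score matrix and the attention-weighted sum over
    # the values each take roughly seq_len * seq_len * d_model
    # multiply-adds, so this term grows quadratically with input length.
    return 2 * seq_len * seq_len * d_model

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_flops(n):.2e} attention FLOPs")
# Each doubling of context length roughly quadruples this component.
```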