Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

Jan 10, 2025 | Technology


As large language models (LLMs) continue to improve in coding, the benchmarks used to evaluate their performance are steadily becoming less useful.

That’s because many LLMs now post similarly high scores on these benchmarks, which makes it difficult to tell which ones to use for specific software development projects and enterprises.

A new paper by researchers at Yale University and Tsinghua University presents a novel method for testing the ability of models to tackle “self-invoking code generation” problems, which require reasoning, generating code, and reusing existing code in problem-solving.

Self-invoking code generation is much closer to realistic programming scenarios and gives a better picture of current LLMs’ ability to solve real-world coding problems.

Self-invoking code generation

Two popular benchmarks used to evaluate the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of handcrafted problems that require the model to write code for simple tasks. 

However, these benchmarks only cover a subset of the challenges software developers face in the real world. In practical scenarios, software developers don’t just write new code—they must also understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and subsequently leverage one’s own generated code, namely self-invoking code generation, plays an important role for LLMs to leverage their reasoning capabilities to code generation that current benchmarks fail to capture,” the researchers write.

To test the ability of LLMs in self-invoking code generation, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on an existing example in the original dataset and adds elements that require the model to first solve the base problem and then invoke that solution to solve a more complex one.
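To make the idea concrete, here is a minimal Python sketch of what a base problem and its self-invoking extension could look like. The task and the function names `is_palindrome` and `longest_palindrome` are hypothetical, chosen only for illustration; they are not taken from HumanEval Pro or MBPP Pro.

```python
from typing import List, Optional


def is_palindrome(text: str) -> bool:
    """Base problem (hypothetical): does text read the same in both directions?

    Case and non-alphanumeric characters are ignored.
    """
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]


def longest_palindrome(candidates: List[str]) -> Optional[str]:
    """Self-invoking problem (hypothetical): reuse the base solution.

    Returns the longest candidate that is a palindrome, or None if there
    is none. The point is that this function calls is_palindrome instead
    of re-implementing the check.
    """
    palindromes = [c for c in candidates if is_palindrome(c)]
    return max(palindromes, key=len) if palindromes else None


if __name__ == "__main__":
    print(longest_palindrome(["noon", "racecar", "code"]))
```

In a setup like this, a model would be expected to get both functions right: a correct base solution is of little use if the second function fails to call it properly or mishandles the added requirement.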

Self-invoking code generation (source: arXiv)

For example, the original problem can be something simple, like writ …
