Which LLM for coding in COBOL? A dedicated benchmark emerges


How do you create a benchmark specific to COBOL code generation? By adapting an existing dataset.

That is the route this American start-up took: it built on the HumanEval reference benchmark. The result: COBOLEval.

Of the 164 Python problems in HumanEval, 146 were actually translated. Those that accept or return types that are difficult to represent in COBOL, such as Any and Dict, were excluded.
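A minimal sketch of that filtering step (the helper name, the signature strings, and the string-matching approach are assumptions for illustration, not bloop's actual harness):

```python
# Hypothetical sketch: dropping HumanEval-style tasks whose signatures
# use types the article says are hard to map onto COBOL.
HARD_TYPES = {"Any", "Dict"}

def uses_hard_type(signature: str) -> bool:
    """Return True if a Python signature mentions a hard-to-translate type."""
    return any(t in signature for t in HARD_TYPES)

tasks = [
    "def add(a: int, b: int) -> int",
    "def lookup(d: Dict[str, int], k: str) -> Any",
]
kept = [t for t in tasks if not uses_hard_type(t)]
# only the first task survives the filter
```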

For functions, which do not exist in COBOL, subroutines had to be used, with arguments and return variables defined in the LINKAGE SECTION.
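To make the idea concrete, here is a hedged sketch of how a harness might wrap a generated solution as a COBOL subroutine; the template, program name, and PIC choices are illustrative assumptions, not COBOLEval's actual scaffolding:

```python
# Assumed template: a COBOL subroutine whose parameters and result
# live in the LINKAGE SECTION and are passed via PROCEDURE DIVISION USING.
SKELETON = """\
       IDENTIFICATION DIVISION.
       PROGRAM-ID. {name}.
       DATA DIVISION.
       LINKAGE SECTION.
{params}
       01 RESULT PIC S9(9) COMP.
       PROCEDURE DIVISION USING {using} RESULT.
      * generated solution body goes here
           GOBACK.
"""

def cobol_skeleton(name: str, params: list[str]) -> str:
    decls = "\n".join(f"       01 {p} PIC S9(9) COMP." for p in params)
    return SKELETON.format(name=name, params=decls, using=" ".join(params))
```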

Another complication: COBOL has no variable-length strings. The PICTURE clause must specify the number of characters a field occupies in memory. The problem was solved by setting an upper bound: COBOLEval neither accepts nor returns elements longer than 100 characters.
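The fixed-length convention can be sketched in a few lines (the helper below is an assumed illustration of the principle, mirroring a `PIC X(100)` field):

```python
# A COBOL PIC X(100) field always occupies 100 characters, so a harness
# must pad short values (spaces are COBOL's default fill) and truncate
# long ones to fit the declared size.
MAX_LEN = 100  # the article's upper bound

def to_fixed(s: str) -> str:
    """Pad or truncate a Python string to the fixed COBOL field length."""
    return s[:MAX_LEN].ljust(MAX_LEN)
```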

Nor does COBOL have local variables: they must be declared in advance, in the WORKING-STORAGE SECTION. Reconciling this strict structure with the way LLMs generate code required dedicated decomposition techniques.
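One way such a technique could work is to scan a generated body for assigned names and hoist their declarations up front; the regex, the PIC choice, and the function itself are assumptions for illustration only:

```python
import re

# Hypothetical sketch: collect targets of MOVE statements in a generated
# COBOL body and emit WORKING-STORAGE declarations for them.
def hoist_working_storage(body: str) -> str:
    names = sorted(set(re.findall(r"MOVE .+ TO (\w[\w-]*)", body)))
    decls = "\n".join(f"       01 {n} PIC S9(9) COMP." for n in names)
    return "       WORKING-STORAGE SECTION.\n" + decls

body = "           MOVE 1 TO TMP\n           MOVE 2 TO COUNTER\n"
storage = hoist_working_storage(body)
```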

Each task comes with an average of six tests; an answer must pass them all to be considered correct. COBOLEval compiles solutions with GnuCOBOL, a compiler that recently entered France's Interministerial Free Software Base (SILL).
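The compile step can be sketched around GnuCOBOL's `cobc` command (`cobc -x -o` is the compiler's real executable-building invocation; the file layout and helper names here are assumptions):

```python
import subprocess

# Build the GnuCOBOL invocation: -x produces an executable, -o names it.
def compile_cmd(source: str, binary: str) -> list[str]:
    return ["cobc", "-x", "-o", binary, source]

# Run the compiler and report whether the generated COBOL compiles,
# which is the "compilation rate" column in the results table below.
def compiles(source: str, binary: str = "solution") -> bool:
    proc = subprocess.run(compile_cmd(source, binary), capture_output=True)
    return proc.returncode == 0
```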

The authors of COBOLEval also developed COBOL-specialized LLMs: the mAInframer-1 series, built on Code Llama. They report performance under very specific conditions: a single generated solution per task (pass@1) at temperature 0.

Model           | Correct-answer rate (%) | Compilation rate with GnuCOBOL (%)
GPT-3.5 Turbo   | 4.11                    | 19.17
GPT-4           | 8.9                     | 47.94
Code Llama 7B   | 0.68                    | 25.34
Code Llama 13B  | 1.36                    | 13.01
Code Llama 34B  | 2.05                    | 78.76
mAInframer 7B   | 6.16                    | 69.17
mAInframer 13B  | 8.9                     | 54.1
mAInframer 34B  | 10.27                   | 73.97


Illustration © Quardia Inc. – Adobe Stock


