IBM’s Project CodeNet will test how far AI can go in writing code.

IBM’s AI research division has published a 14-million-sample dataset to help develop machine learning models that can understand and work with code. The dataset, dubbed Project CodeNet, is named after ImageNet, the famous repository of labelled images that ignited a revolution in computer vision and deep learning.

Machine learning models trained on the CodeNet dataset won’t render human programmers useless, but there’s reason to be optimistic that they will increase developer productivity.

Automating programming with deep learning

Developments in machine learning (ML) in the early 2010s ignited excitement that artificial intelligence (AI) would soon automate many tasks, including programming. However, the use of AI in software development has so far been limited.

Human programmers draw on a range of conscious and subconscious thought processes to discover new problems and explore alternative solutions. Most machine learning algorithms, by contrast, need well-defined problems and large amounts of annotated data before they can build models that solve the same problems.

There have been many attempts to build and test “AI for code” systems by creating datasets and benchmarks. But given the open-ended and creative nature of software development, creating the ideal programming dataset is extremely difficult.

The CodeNet dataset

With Project CodeNet, IBM researchers have attempted to build a multi-purpose dataset that can be used to train machine learning models for various tasks. CodeNet is a “diverse, vast scale, and high-quality dataset to accelerate algorithmic developments in AI for Code,” according to its developers.

The dataset contains over 14 million code samples, totalling 500 million lines of code written in 55 different programming languages. The researchers gathered the code samples from nearly 4,000 challenges posted on AIZU and AtCoder, two popular online coding platforms. The code samples include both correct and incorrect answers to the challenges.

One of CodeNet’s main features is the amount of annotation applied to the examples. Every coding challenge in the dataset has a textual description along with CPU and memory limits. And every code submission comes with a dozen pieces of metadata, including the language, the date of submission, the code size, execution time, acceptance status, and error types.
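To make that concrete, here is a minimal sketch of how such metadata might be filtered with pandas. The file path and column names (language, status, cpu_time, memory, code_size) are illustrative assumptions, not the exact CodeNet schema.

```python
import pandas as pd

# Hypothetical example: load the metadata for one problem and keep only
# accepted C++ submissions. File layout and column names are assumptions.
meta = pd.read_csv("metadata/p00001.csv")

accepted_cpp = meta[(meta["language"] == "C++") & (meta["status"] == "Accepted")]

# The resource metrics attached to each submission can serve as labels.
print(accepted_cpp[["submission_id", "cpu_time", "memory", "code_size"]].head())
```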

The IBM researchers also went to great lengths to ensure that the dataset is balanced across several dimensions, such as programming language, acceptance, and error types.

Programming tasks for machine learning with CodeNet

CodeNet isn’t the only dataset for training machine learning models on programming tasks, but a few features set it apart. The first is the sheer scale of the dataset, both in the number of samples and in the diversity of languages.

More valuable, however, may be the metadata that accompanies the code samples. In contrast to other coding datasets specialized for particular programming tasks, CodeNet’s rich annotations make it suitable for a variety of functions.

CodeNet can be used to build machine learning models for a variety of programming tasks. The first is code translation. Because each coding challenge includes submissions in multiple programming languages, data scientists can use the dataset to train machine learning models that convert code from one language to another. This could help companies port legacy code to newer languages, making it accessible to younger generations of programmers and maintainable with modern development tools.
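As a rough sketch of the idea, one could pair accepted solutions to the same problem written in two different languages to produce (source, target) training examples for a translation model. The directory layout, file extensions, and column names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical example: build Python-to-Java translation pairs from accepted
# submissions to a single problem. Paths and column names are assumptions.
meta = pd.read_csv("metadata/p00001.csv")
accepted = meta[meta["status"] == "Accepted"]

python_ids = accepted[accepted["language"] == "Python"]["submission_id"]
java_ids = accepted[accepted["language"] == "Java"]["submission_id"]

def read_source(submission_id, ext):
    with open(f"data/p00001/{submission_id}.{ext}") as f:
        return f.read()

# Cross-pair every accepted Python solution with every accepted Java solution.
pairs = [(read_source(p, "py"), read_source(j, "java"))
         for p in python_ids
         for j in java_ids]
```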

CodeNet can also help create machine learning models for code recommendation. Recommendation tools range from simple autocomplete-style models that finish the current line of code to more complex systems that write full functions or blocks of code.
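A simple way to see what training data for such a tool might look like is to split each code sample into (context, next line) pairs. This is a generic preparation step sketched for illustration, not part of CodeNet’s own tooling.

```python
# Illustrative sketch: turn one code sample into (context, next_line) pairs
# for a line-completion model. The window size is an arbitrary choice.
def completion_pairs(source: str, context_lines: int = 5):
    lines = source.splitlines()
    pairs = []
    for i in range(1, len(lines)):
        context = "\n".join(lines[max(0, i - context_lines):i])
        target = lines[i]
        if target.strip():  # skip blank lines as prediction targets
            pairs.append((context, target))
    return pairs
```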

Because CodeNet contains a wealth of metadata about memory usage and execution time, data scientists can also use it to build code optimization systems. And the error-type metadata can be used to train machine learning systems that flag potential weaknesses in source code.
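For instance, here is a minimal sketch, under the same assumed metadata layout as above, of labelling accepted solutions to a problem as fast or slow based on their recorded CPU time, a plausible starting point for a performance-prediction model.

```python
import pandas as pd

# Hypothetical example: label accepted submissions "fast" or "slow" by
# comparing their CPU time to the problem's median. Columns are assumptions.
meta = pd.read_csv("metadata/p00001.csv")
accepted = meta[meta["status"] == "Accepted"].copy()

threshold = accepted["cpu_time"].median()
accepted["speed_label"] = accepted["cpu_time"].apply(
    lambda t: "fast" if t <= threshold else "slow"
)
```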

Code generation is a more sophisticated use case that will be interesting to watch. CodeNet is an extensive database of textual problem descriptions paired with their corresponding source code. Developers have already used advanced language models such as GPT-3 to generate code from natural language descriptions in several cases. It will be interesting to see whether CodeNet can help fine-tune these language models to make code generation more reliable.

CodeNet: a monstrous engineering effort

To curate the CodeNet dataset and build complementary resources, IBM engineers undertook a complex software and data engineering effort.

First, they had to collect the code samples from AIZU and AtCoder. One of the platforms had an application programming interface (API) that made it easy to retrieve the code, but the other didn’t, so the researchers had to build tools that scraped the data from the platform’s web pages and decomposed it into a tabular format. They then had to manually merge the two datasets into a single schema.

Next, they created software to clean the data by finding and removing duplicates and samples with a large share of dead code (source code that is not executed at runtime).
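One common way to catch such duplicates, sketched here as an assumption rather than a description of IBM’s actual pipeline, is to normalize away comments and whitespace and hash the result, so near-identical submissions collapse to the same key.

```python
import hashlib
import re

# Illustrative sketch: normalize a sample, hash it, and drop repeats.
# CodeNet's real cleaning pipeline may work quite differently.
def normalized_hash(source: str) -> str:
    code = re.sub(r"//.*|#.*", "", source)    # strip single-line comments
    code = re.sub(r"\s+", " ", code).strip()  # collapse whitespace
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

def deduplicate(samples):
    seen, unique = set(), []
    for sample in samples:
        digest = normalized_hash(sample)
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique
```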

They also created preprocessing tools to make it easier to train machine learning models on the CodeNet corpus. These resources include tokenizers for various programming languages, parse trees, and a graph representation generator for graph neural networks.
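To give a sense of what tokenization produces, here is a stand-in built on Python’s standard library; it covers Python code only and is not CodeNet’s own tokenizer.

```python
import io
import tokenize

# Illustrative sketch: produce a token stream for a Python sample using the
# standard library. CodeNet ships its own multi-language tokenizers.
def python_tokens(source: str):
    stream = io.BytesIO(source.encode("utf-8"))
    return [tok.string
            for tok in tokenize.tokenize(stream.readline)
            if tok.type not in (tokenize.ENCODING, tokenize.ENDMARKER)]

print(python_tokens("def add(a, b):\n    return a + b\n"))
```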

All of this work is a reminder of the enormous human effort required to develop effective machine learning systems. Artificial intelligence is not about to replace human programmers (at least for the time being). However, it may change the kinds of tasks that demand programmers’ effort and creativity.

Ben Dickson is the founder of TechTalks and a software developer. He covers technology, industry, and politics in his writing.