Commit 63bbaa9

author
Miltos Allamanis
committed
Add Puri et al.
1 parent 1d54e3b commit 63bbaa9

1 file changed

Lines changed: 34 additions & 0 deletions

@@ -0,0 +1,34 @@
+---
+layout: publication
+title: "Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks"
+authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler
+conference:
+year: 2021
+bibkey: puri2021project
+additional_links:
+- {name: "GitHub", url: "https://github.com/IBM/Project_CodeNet"}
+tags: ["dataset"]
+---
+Advancements in deep learning and machine learning algorithms have enabled
+breakthrough progress in computer vision, speech recognition, natural language
+processing and beyond. In addition, over the last several decades, software has
+been built into the fabric of every aspect of our society. Together, these two
+trends have generated new interest in the fast-emerging research area of “AI for
+Code”. As software development becomes ubiquitous across all industries and code
+infrastructure of enterprise legacy applications ages, it is more critical than ever
+to increase software development productivity and modernize legacy applications.
+Over the last decade, datasets like ImageNet, with its large scale and diversity,
+have played a pivotal role in algorithmic advancements from computer vision to
+language and speech understanding. In this paper, we present "Project CodeNet",
+a first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate
+the algorithmic advancements in AI for Code. It consists of 14M code samples
+and about 500M lines of code in 55 different programming languages. Project
+CodeNet is not only unique in its scale, but also in the diversity of coding tasks
+it can help benchmark: from code similarity and classification for advances in
+code recommendation algorithms, and code translation between a large variety of
+programming languages, to advances in code performance (both runtime and
+memory) improvement techniques. CodeNet also provides sample input and output
+test sets for over 7M code samples, which can be critical for determining code
+equivalence in different languages. As a usability feature, we provide several
+preprocessing tools in Project CodeNet to transform source code into representations
+that can be readily used as inputs into machine learning models.
