|
| 1 | +--- |
| 2 | +layout: publication |
| 3 | +title: "Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks" |
| 4 | +authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler |
| 5 | +conference: |
| 6 | +year: 2021 |
| 7 | +bibkey: puri2021project |
| 8 | +additional_links: |
| 9 | + - {name: "GitHub", url: "https://github.com/IBM/Project_CodeNet"} |
| 10 | +tags: ["dataset"] |
| 11 | +--- |
| 12 | +Advancements in deep learning and machine learning algorithms have enabled |
| 13 | +breakthrough progress in computer vision, speech recognition, natural language |
| 14 | +processing and beyond. In addition, over the last several decades, software has |
| 15 | +been built into the fabric of every aspect of our society. Together, these two |
| 16 | +trends have generated new interest in the fast-emerging research area of “AI for |
| 17 | +Code”. As software development becomes ubiquitous across all industries and code |
| 18 | +infrastructure of enterprise legacy applications ages, it is more critical than ever |
| 19 | +to increase software development productivity and modernize legacy applications. |
| 20 | +Over the last decade, datasets like ImageNet, with its large scale and diversity, |
| 21 | +have played a pivotal role in algorithmic advancements from computer vision to |
| 22 | +language and speech understanding. In this paper, we present "Project CodeNet", |
| 23 | +a first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate |
| 24 | +the algorithmic advancements in AI for Code. It consists of 14M code samples |
| 25 | +and about 500M lines of code in 55 different programming languages. Project |
| 26 | +CodeNet is not only unique in its scale, but also in the diversity of coding tasks |
| 27 | +it can help benchmark: from code similarity and classification for advances in |
| 28 | +code recommendation algorithms, and code translation between a large variety of |
| 29 | +programming languages, to advances in code performance (both runtime and |
| 30 | +memory) improvement techniques. CodeNet also provides sample input and output |
| 31 | +test sets for over 7M code samples, which can be critical for determining code |
| 32 | +equivalence in different languages. As a usability feature, we provide several |
| 33 | +preprocessing tools in Project CodeNet to transform source code into representations |
| 34 | +that can be readily used as inputs into machine learning models. |