Add Puri et al.

Miltos Allamanis · Miltos Allamanis · commit 63bbaa9d59b6 · 2021-05-21T16:18:49.000+01:00
diff --git a/_publications/puri2021project.markdown b/_publications/puri2021project.markdown
@@ -0,0 +1,34 @@
+---
+layout: publication
+title: "Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks"
+authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler
+conference:
+year: 2021
+bibkey: puri2021project
+additional_links:
+   - {name: "GitHub", url: "https://github.com/IBM/Project_CodeNet"}
+tags: ["dataset"]
+---
+Advancements in deep learning and machine learning algorithms have enabled
+breakthrough progress in computer vision, speech recognition, natural language
+processing and beyond. In addition, over the last several decades, software has
+been built into the fabric of every aspect of our society. Together, these two
+trends have generated new interest in the fast-emerging research area of “AI for
+Code”. As software development becomes ubiquitous across all industries and code
+infrastructure of enterprise legacy applications ages, it is more critical than ever
+to increase software development productivity and modernize legacy applications.
+Over the last decade, datasets like ImageNet, with its large scale and diversity,
+have played a pivotal role in algorithmic advancements from computer vision to
+language and speech understanding. In this paper, we present "Project CodeNet",
+a first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate
+the algorithmic advancements in AI for Code. It consists of 14M code samples
+and about 500M lines of code in 55 different programming languages. Project
+CodeNet is not only unique in its scale, but also in the diversity of coding tasks
+it can help benchmark: from code similarity and classification for advances in
+code recommendation algorithms, and code translation between a large variety of
+programming languages, to advances in code performance (both runtime and
+memory) improvement techniques. CodeNet also provides sample input and output
+test sets for over 7M code samples, which can be critical for determining code
+equivalence in different languages. As a usability feature, we provide several
+preprocessing tools in Project CodeNet to transform source code into representations
+that can be readily used as inputs into machine learning models.