
Commit 5299f3d

bdqnghi authored and mallamanis committed

update some new papers

1 parent: c7b221f

5 files changed: 36 additions & 24 deletions

File tree

_publications/bui2020efficient.markdown

Lines changed: 0 additions & 12 deletions
This file was deleted.
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
layout: publication
title: "Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations"
authors: Nghi D. Q. Bui, Y. Yu, L. Jiang
conference: SIGIR
year: 2021
bibkey: bui2020efficient
additional_links:
   - {name: "ArXiV", url: "https://arxiv.org/abs/2009.02731"}
tags: ["self-supervised", "pretraining", "code-search"]
---
We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data in code retrieval and code summarization tasks. The pre-trained Corder model can be used in two ways: (1) it can produce vector representations of code that can be applied to code retrieval tasks without labeled data; (2) it can be fine-tuned for tasks that still require labeled data, such as code summarization. The key innovation is that we train the source code model to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we show that code models pretrained by Corder substantially outperform the other baselines on code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
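The training signal described in the abstract — treat a snippet and its semantics-preserving transform as a positive pair, and unrelated snippets as negatives — can be sketched as follows. This is a toy illustration, not Corder's implementation: the identifier-renaming operator, the bag-of-tokens encoder, and the InfoNCE-style loss are all stand-ins for the components the paper actually uses.

```python
import math
import re

def rename_identifiers(code, mapping):
    """Semantic-preserving transform: consistently rename identifiers.
    (Variable renaming is one plausible operator; the paper describes
    Corder's actual operator set.)"""
    pattern = r"\b(" + "|".join(map(re.escape, mapping)) + r")\b"
    return re.sub(pattern, lambda m: mapping[m.group(0)], code)

def embed(code, vocab):
    """Toy stand-in encoder: L2-normalized bag-of-tokens vector."""
    toks = re.findall(r"\w+", code)
    vec = [float(toks.count(w)) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive objective: pull the transformed variant close to the
    anchor, push unrelated snippets away."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    sims = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    exps = [math.exp(s / tau) for s in sims]
    return -math.log(exps[0] / sum(exps))
```

Minimizing this loss over mini-batches pushes syntactically different but semantically equivalent snippets toward nearby vectors, which is what makes the frozen encoder usable for retrieval without labels.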
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
layout: publication
title: "MulCode: A Multi-task Learning Approach for Source Code Understanding"
authors: D. Wang, Y. Yu, S. Li, W. Dong, J. Wang, L. Qing
conference: SANER
year: 2021
bibkey: deze2021mulcode
additional_links:
   - {name: "PDF", url: "https://yuyue.github.io/res/paper/mulcode_saner2021.pdf"}
tags: ["representation", "multi task"]
---
Recent years have witnessed a significant rise of Deep Learning (DL) techniques applied to source code. Researchers exploit DL for a multitude of tasks and achieve impressive results. However, most tasks are explored separately, resulting in a lack of generalization of the solutions. In this work, we propose MulCode, a multi-task learning approach for source code understanding that learns a unified representation space across tasks, using a pre-trained BERT model for the token sequence and a Tree-LSTM model for the abstract syntax tree. Furthermore, we integrate the two source code views into a hybrid representation via an attention mechanism, and set learnable uncertainty parameters to adjust the relationships among tasks. We train and evaluate MulCode on three downstream tasks: comment classification, author attribution, and duplicate function detection. In all tasks, MulCode outperforms the state-of-the-art techniques. Moreover, experiments on three unseen tasks demonstrate the generalization ability of MulCode compared with state-of-the-art embedding methods.
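The "learnable uncertainty parameters" used to balance the tasks can be sketched with the common homoscedastic-uncertainty weighting of Kendall et al.; whether MulCode uses exactly this formulation is an assumption here, not a claim from the abstract.

```python
import math

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Combine per-task losses: task i is scaled by 1/(2*sigma_i^2) and
    pays a log(sigma_i) penalty, so a noisy task is down-weighted but
    its sigma cannot grow without cost."""
    total = 0.0
    for loss, log_sigma in zip(task_losses, log_sigmas):
        precision = math.exp(-2.0 * log_sigma)  # = 1 / sigma_i^2
        total += 0.5 * precision * loss + log_sigma
    return total
```

In training, the `log_sigmas` would be parameters updated by the same optimizer as the model weights, letting the relative task weights adjust themselves.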

_publications/waunakh2019evaluating.markdown

Lines changed: 0 additions & 12 deletions
This file was deleted.
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
layout: publication
title: "IdBench: Evaluating Semantic Representations of Identifier Names in Source Code"
authors: Y. Wainakh, M. Rauf, M. Pradel
conference: ICSE
year: 2021
bibkey: waunakh2019idbench
additional_links:
   - {name: "ArXiV", url: "https://arxiv.org/abs/1910.05177"}
tags: ["representation"]
---
Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.
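Evaluating a representation against such a benchmark amounts to correlating the model's similarity scores with the developers' ratings; a minimal sketch using Spearman rank correlation follows. The identifier pairs and scores below are invented for illustration, and IdBench's actual scoring protocol (including tie handling) may differ.

```python
def _ranks(values):
    """Rank positions of values (no tie handling in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    mean = (len(xs) - 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Hypothetical identifier pairs with developer ratings and model scores.
pairs = [("len", "size"), ("cnt", "count"), ("len", "port")]
developer_ratings = [0.9, 0.8, 0.1]
model_scores = [0.7, 0.6, 0.2]
```

A higher `spearman(developer_ratings, model_scores)` means the embedding's notion of similarity tracks what developers perceive.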
