Commit 5e6e104

Added more work.
1 parent 34668c0 commit 5e6e104

19 files changed (+271, −3 lines)

_data/allamanistaxonomy.yml

Lines changed: 3 additions & 3 deletions

@@ -61,17 +61,17 @@
 - {bibkey: chae2016automatically, categories: [representational], input_rep: Data Flow Graph , modeled_target: Static Analysis , intermediate_rep: Localized , application: Program Analysis }
 - {bibkey: corley2015exploring, categories: [representational], input_rep: Tokens , modeled_target: Feature Location , intermediate_rep: Distributed , application: Feature Location }
 - {bibkey: cummins2017end, categories: [representational], input_rep: Tokens , modeled_target: Optimization Flags , intermediate_rep: Distributed , application: Optimization Heuristics }
-- {bibkey: green2017learning, categories: [representational], input_rep: Statements, modeled_target: Alignment , intermediate_rep: Distributed , application: Decompiling }
 - {bibkey: gu2016deep, categories: [representational], input_rep: Natural Language , modeled_target: API Calls , intermediate_rep: Distributed , application: API Search }
 - {bibkey: guo2017semantically, categories: [representational], input_rep: Tokens , modeled_target: Traceability link , intermediate_rep: Distributed , application: Traceability }
-- {bibkey: gupta2016deepfix, categories: [representational], input_rep: Tokens , modeled_target: Code Fix , intermediate_rep: Distributed , application: Code Fixing }
+- {bibkey: gupta2017deepfix, categories: [representational], input_rep: Tokens , modeled_target: Code Fix , intermediate_rep: Distributed , application: Code Fixing }
 - {bibkey: hu2017codesum, categories: [representational], input_rep: Linearized AST , modeled_target: Natural Language , intermediate_rep: Distributed , application: Summarization }
 - {bibkey: iyer2016summarizing, categories: [representational], input_rep: Tokens , modeled_target: Natural Language , intermediate_rep: Distributed , application: Summarization }
 - {bibkey: jiang2017automatically, categories: [representational], input_rep: Tokens (Diff) , modeled_target: Natural Language , intermediate_rep: Distributed , application: Commit Message }
 - {bibkey: koc2017learning, categories: [representational], input_rep: Bytecode , modeled_target: False Positives , intermediate_rep: Distributed , application: Program Analysis }
 - {bibkey: kremenek2007factor, categories: [representational], input_rep: Partial PDG , modeled_target: Ownership , intermediate_rep: Factor (GM) , application: Pointer Ownership }
-- {bibkey: loyola2017neural, categories: [representational], input_rep: Tokens (Diff) , modeled_target: Natural Language , intermediate_rep: Distributed , application: Explain code changes }
+- {bibkey: levy2017learning, categories: [representational], input_rep: Statements, modeled_target: Alignment , intermediate_rep: Distributed , application: Decompiling }
 - {bibkey: li2015gated, categories: [representational], input_rep: Memory Heap , modeled_target: Separation Logic , intermediate_rep: Distributed , application: Verification }
+- {bibkey: loyola2017neural, categories: [representational], input_rep: Tokens (Diff) , modeled_target: Natural Language , intermediate_rep: Distributed , application: Explain code changes }
 - {bibkey: mangal2015user, categories: [representational], input_rep: Logic + Feedback , modeled_target: Prob. Analysis , intermediate_rep: MaxSAT , application: Program Analysis }
 - {bibkey: movshovitz2013natural, categories: [representational], input_rep: Tokens , modeled_target: Code Comments , intermediate_rep: Directed GM , application: Comment Prediction }
 - {bibkey: mou2016convolutional, categories: [representational], input_rep: Syntax , modeled_target: Classification , intermediate_rep: Distributed , application: Task Classification }
Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+---
+layout: publication
+title: "Statistical Deobfuscation of Android Applications"
+authors: B. Bichsel, V. Raychev, P. Tsankov, M. Vechev
+conference: CCS
+year: 2016
+bibkey: bichsel2016statistical
+---
+This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed "Big Code"). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions.
+
+We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.
+
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "Automatically generating features for learning program analysis heuristics"
+authors: K. Chae, H. Oh, K. Heo, H. Yang
+conference: ArXiV 1612.09394
+year: 2016
+bibkey: chae2016automatically
+---
+We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been proposed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the highly nontrivial task of designing so-called features to the hands of the designers. Our technique automates this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world program analyses. Our experimental evaluation shows that these analyses with automatically-generated features perform comparably to those with manually crafted features.

_publications/dam2016deep.markdown

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "A deep language model for software code"
+authors: H. K. Dam, T. Tran, T. Pham
+conference: ArXiV 1608.02715
+year: 2016
+bibkey: dam2016deep
+---
+Existing language models such as n-grams for software code often fail to capture a long context where dependent code elements scatter far apart. In this paper, we propose a novel approach to build a language model for software code to address this particular issue. Our language model, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term dependencies which occur frequently in software code. Results from our intrinsic evaluation on a corpus of Java projects have demonstrated the effectiveness of our language model. This work contributes to realizing our vision for DeepSoft, an end-to-end, generic deep learning-based framework for modeling software and its development process.
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "Semantically enhanced software traceability using deep learning techniques"
+authors: J. Guo, J. Cheng, J. Cleland-Huang
+conference: ICSE
+year: 2017
+bibkey: guo2017semantically
+---
+In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus and RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly outperformed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.
Lines changed: 21 additions & 0 deletions

@@ -0,0 +1,21 @@
+---
+layout: publication
+title: "DeepFix: Fixing Common C Language Errors by Deep Learning"
+authors: R. Gupta, S. Pal, A. Kanade, S. Shevade
+conference: AAAI
+year: 2017
+bibkey: gupta2017deepfix
+---
+The problem of automatically fixing programming errors is a very active research topic in software engineering. This is a challenging problem as fixing even a single error may require analysis of the entire program. In practice, a number of errors arise due to programmer’s inexperience with the programming language or lack of attention to detail. We call these common programming errors. These are analogous to grammatical errors in natural languages. Compilers detect such errors, but their error messages are usually inaccurate. In this work, we present an end-to-end solution, called DeepFix, that can fix multiple such errors in a program without relying on any external tool to locate or fix them. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network with attention which is trained to predict erroneous program locations along with the required correct statements. On a set of 6971 erroneous C programs written by students for 93 programming tasks, DeepFix could fix 1881 (27%) programs completely and 1338 (19%) programs partially.
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "CodeSum: Translate Program Language to Natural Language"
+authors: X. Hu, Y. Wei, G. Li, Z. Jin
+conference: ArXiV 1708.01837
+year: 2017
+bibkey: hu2017codesum
+---
+During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task in software engineering, code summarization aims to generate brief natural language descriptions for source code. In this paper, we propose a new code summarization model named CodeSum. CodeSum exploits the attention-based sequence-to-sequence (Seq2Seq) neural network with Structure-based Traversal (SBT) of Abstract Syntax Trees (AST). The AST sequences generated by SBT can better present the structure of ASTs and keep unambiguous. We conduct experiments on three large-scale corpora in different program languages, i.e., Java, C#, and SQL, in which Java corpus is our new proposed industry code extracted from Github. Experimental results show that our method CodeSum outperforms the state-of-the-art significantly.
Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+---
+layout: publication
+title: "Summarizing Source Code using a Neural Attention Model"
+authors: S. Iyer, I. Konstas, A. Cheung, L. Zettlemoyer
+conference: ACL
+year: 2016
+bibkey: iyer2016summarizing
+---
+High quality source code is often paired with high level summaries of the computation it performs, for example in code documentation or in descriptions posted in online forums. Such summaries are extremely useful for applications such as code search but are expensive to manually author, hence only done for a small fraction of all code that is produced. In this paper, we present the first completely data-driven approach for generating high level summaries of source code. Our model, CODE-NN, uses Long Short Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries. CODE-NN is trained on a new corpus that is automatically collected from StackOverflow, which we release. Experiments demonstrate strong performance on two tasks: (1) code summarization, where we establish the first end-to-end learning results and outperform strong baselines, and (2) code retrieval, where our learned model improves the state of the art on a recently introduced C# benchmark by a large margin.
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "Automatically Generating Commit Messages from Diffs using Neural Machine Translation"
+authors: S. Jiang, A. Armaly, C. McMillan
+conference: ArXiV 1708.09492
+year: 2017
+bibkey: jiang2017automatically
+---
+Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These techniques are effective at describing what changed, but are often verbose and lack context for understanding the rationale behind a change. In contrast, humans write messages that are short and summarize the high level rationale. In this paper, we adapt Neural Machine Translation (NMT) to automatically "translate" diffs into commit messages. We trained an NMT algorithm using a corpus of diffs and human-written commit messages from the top 1k Github projects. We designed a filter to help ensure that we only trained the algorithm on higher-quality commit messages. Our evaluation uncovered a pattern in which the messages we generate tend to be either very high or very low quality. Therefore, we created a quality-assurance filter to detect cases in which we are unable to produce good messages, and return a warning instead.
Lines changed: 19 additions & 0 deletions

@@ -0,0 +1,19 @@
+---
+layout: publication
+title: "A Factor Graph Model for Software Bug Finding"
+authors: T. Kremenek, A.Y. Ng, D. Engler
+conference: IJCAI
+year: 2007
+bibkey: kremenek2007factor
+---
+Automatic tools for finding software errors require knowledge of the rules a program must obey, or “specifications,” before they can identify bugs. We present a method that combines factor graphs and static program analysis to automatically infer specifications directly from programs. We illustrate the approach on inferring functions in C programs that allocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and the OS kernel for Mac OS X (XNU). The inferred specifications are highly accurate and with them we have discovered numerous bugs.
+
