Commit 5e6e104

Added more work.
1 parent 34668c0 commit 5e6e104

19 files changed (+271, −3 lines)

_data/allamanistaxonomy.yml

Lines changed: 3 additions & 3 deletions

@@ -61,17 +61,17 @@
 - {bibkey: chae2016automatically, categories: [representational], input_rep: Data Flow Graph , modeled_target: Static Analysis , intermediate_rep: Localized , application: Program Analysis }
 - {bibkey: corley2015exploring, categories: [representational], input_rep: Tokens , modeled_target: Feature Location , intermediate_rep: Distributed , application: Feature Location }
 - {bibkey: cummins2017end, categories: [representational], input_rep: Tokens , modeled_target: Optimization Flags , intermediate_rep: Distributed , application: Optimization Heuristics }
-- {bibkey: green2017learning, categories: [representational], input_rep: Statements, modeled_target: Alignment , intermediate_rep: Distributed , application: Decompiling }
 - {bibkey: gu2016deep, categories: [representational], input_rep: Natural Language , modeled_target: API Calls , intermediate_rep: Distributed , application: API Search }
 - {bibkey: guo2017semantically, categories: [representational], input_rep: Tokens , modeled_target: Traceability link , intermediate_rep: Distributed , application: Traceability }
-- {bibkey: gupta2016deepfix, categories: [representational], input_rep: Tokens , modeled_target: Code Fix , intermediate_rep: Distributed , application: Code Fixing }
+- {bibkey: gupta2017deepfix, categories: [representational], input_rep: Tokens , modeled_target: Code Fix , intermediate_rep: Distributed , application: Code Fixing }
 - {bibkey: hu2017codesum, categories: [representational], input_rep: Linearized AST , modeled_target: Natural Language , intermediate_rep: Distributed , application: Summarization }
 - {bibkey: iyer2016summarizing, categories: [representational], input_rep: Tokens , modeled_target: Natural Language , intermediate_rep: Distributed , application: Summarization }
 - {bibkey: jiang2017automatically, categories: [representational], input_rep: Tokens (Diff) , modeled_target: Natural Language , intermediate_rep: Distributed , application: Commit Message }
 - {bibkey: koc2017learning, categories: [representational], input_rep: Bytecode , modeled_target: False Positives , intermediate_rep: Distributed , application: Program Analysis }
 - {bibkey: kremenek2007factor, categories: [representational], input_rep: Partial PDG , modeled_target: Ownership , intermediate_rep: Factor (GM) , application: Pointer Ownership }
-- {bibkey: loyola2017neural, categories: [representational], input_rep: Tokens (Diff) , modeled_target: Natural Language , intermediate_rep: Distributed , application: Explain code changes }
+- {bibkey: levy2017learning, categories: [representational], input_rep: Statements, modeled_target: Alignment , intermediate_rep: Distributed , application: Decompiling }
 - {bibkey: li2015gated, categories: [representational], input_rep: Memory Heap , modeled_target: Separation Logic , intermediate_rep: Distributed , application: Verification }
+- {bibkey: loyola2017neural, categories: [representational], input_rep: Tokens (Diff) , modeled_target: Natural Language , intermediate_rep: Distributed , application: Explain code changes }
 - {bibkey: mangal2015user, categories: [representational], input_rep: Logic + Feedback , modeled_target: Prob. Analysis , intermediate_rep: MaxSAT , application: Program Analysis }
 - {bibkey: movshovitz2013natural, categories: [representational], input_rep: Tokens , modeled_target: Code Comments , intermediate_rep: Directed GM , application: Comment Prediction }
 - {bibkey: mou2016convolutional, categories: [representational], input_rep: Syntax , modeled_target: Classification , intermediate_rep: Distributed , application: Task Classification }
Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+---
+layout: publication
+title: "Statistical Deobfuscation of Android Applications"
+authors: B. Bichsel, V. Raychev, P. Tsankov, M. Vechev
+conference: CCS
+year: 2016
+bibkey: bichsel2016statistical
+---
+This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed "Big Code"). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions.
+
+We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.
+
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "Automatically generating features for learning program analysis heuristics"
+authors: K. Chae, H. Oh, K. Heo, H. Yang
+conference: ArXiV 1612.09394
+year: 2016
+bibkey: chae2016automatically
+---
+We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been proposed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the highly nontrivial task of designing so-called features to the hands of the designers. Our technique automates this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world program analyses. Our experimental evaluation shows that these analyses with automatically-generated features perform comparably to those with manually crafted features.

_publications/dam2016deep.markdown

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "A deep language model for software code"
+authors: H. K. Dam, T. Tran, T. Pham
+conference: ArXiV 1608.02715
+year: 2016
+bibkey: dam2016deep
+---
+Existing language models such as n-grams for software code often fail to capture a long context where dependent code elements scatter far apart. In this paper, we propose a novel approach to build a language model for software code to address this particular issue. Our language model, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term dependencies which occur frequently in software code. Results from our intrinsic evaluation on a corpus of Java projects have demonstrated the effectiveness of our language model. This work contributes to realizing our vision for DeepSoft, an end-to-end, generic deep learning-based framework for modeling software and its development process.
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "Semantically enhanced software traceability using deep learning techniques"
+authors: J. Guo, J. Cheng, J. Cleland-Huang
+conference: ICSE
+year: 2017
+bibkey: guo2017semantically
+---
+In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus and RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly outperformed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.
Lines changed: 21 additions & 0 deletions

@@ -0,0 +1,21 @@
+---
+layout: publication
+title: "DeepFix: Fixing Common C Language Errors by Deep Learning"
+authors: R. Gupta, S. Pal, A. Kanade, S. Shevade
+conference: AAAI
+year: 2017
+bibkey: gupta2017deepfix
+---
+The problem of automatically fixing programming errors is a very active research topic in software engineering. This is a challenging problem as fixing even a single error may require analysis of the entire program. In practice, a number of errors arise due to programmer’s inexperience with the programming language or lack of attention to detail. We call these common programming errors. These are analogous to grammatical errors in natural languages. Compilers detect such errors, but their error messages are usually inaccurate. In this work, we present an end-to-end solution, called DeepFix, that can fix multiple such errors in a program without relying on any external tool to locate or fix them. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network with attention which is trained to predict erroneous program locations along with the required correct statements. On a set of 6971 erroneous C programs written by students for 93 programming tasks, DeepFix could fix 1881 (27%) programs completely and 1338 (19%) programs partially.
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "CodeSum: Translate Program Language to Natural Language"
+authors: X. Hu, Y. Wei, G. Li, Z. Jin
+conference: ArXiV 1708.01837
+year: 2017
+bibkey: hu2017codesum
+---
+During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task in software engineering, code summarization aims to generate brief natural language descriptions for source code. In this paper, we propose a new code summarization model named CodeSum. CodeSum exploits the attention-based sequence-to-sequence (Seq2Seq) neural network with Structure-based Traversal (SBT) of Abstract Syntax Trees (AST). The AST sequences generated by SBT can better present the structure of ASTs and keep unambiguous. We conduct experiments on three large-scale corpora in different program languages, i.e., Java, C#, and SQL, in which Java corpus is our new proposed industry code extracted from Github. Experimental results show that our method CodeSum outperforms the state-of-the-art significantly.
Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+---
+layout: publication
+title: "Summarizing Source Code using a Neural Attention Model"
+authors: S. Iyer, I. Konstas, A. Cheung, L. Zettlemoyer
+conference: ACL
+year: 2016
+bibkey: iyer2016summarizing
+---
+High quality source code is often paired with high level summaries of the computation it performs, for example in code documentation or in descriptions posted in online forums. Such summaries are extremely useful for applications such as code search but are expensive to manually author, hence only done for a small fraction of all code that is produced. In this paper, we present the first completely data-driven approach for generating high level summaries of source code. Our model, CODE-NN, uses Long Short Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries. CODE-NN is trained on a new corpus that is automatically collected from StackOverflow, which we release. Experiments demonstrate strong performance on two tasks: (1) code summarization, where we establish the first end-to-end learning results and outperform strong baselines, and (2) code retrieval, where our learned model improves the state of the art on a recently introduced C# benchmark by a large margin.
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+---
+layout: publication
+title: "Automatically Generating Commit Messages from Diffs using Neural Machine Translation"
+authors: S. Jiang, A. Armaly, C. McMillan
+conference: ArXiV 1708.09492
+year: 2017
+bibkey: jiang2017automatically
+---
+Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These techniques are effective at describing what changed, but are often verbose and lack context for understanding the rationale behind a change. In contrast, humans write messages that are short and summarize the high level rationale. In this paper, we adapt Neural Machine Translation (NMT) to automatically "translate" diffs into commit messages. We trained an NMT algorithm using a corpus of diffs and human-written commit messages from the top 1k Github projects. We designed a filter to help ensure that we only trained the algorithm on higher-quality commit messages. Our evaluation uncovered a pattern in which the messages we generate tend to be either very high or very low quality. Therefore, we created a quality-assurance filter to detect cases in which we are unable to produce good messages, and return a warning instead.
Lines changed: 19 additions & 0 deletions

@@ -0,0 +1,19 @@
+---
+layout: publication
+title: "A Factor Graph Model for Software Bug Finding"
+authors: T. Kremenek, A.Y. Ng, D. Engler
+conference: IJCAI
+year: 2007
+bibkey: kremenek2007factor
+---
+Automatic tools for finding software errors require knowledge of the rules a program must obey, or “specifications,” before they can identify bugs. We present a method that combines factor graphs and static program analysis to automatically infer specifications directly from programs. We illustrate the approach on inferring functions in C programs that allocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and the OS kernel for Mac OS X (XNU). The inferred specifications are highly accurate and with them we have discovered numerous bugs.
+
