Commit 502a7aa

Author: Max Schaefer
JavaScript: Start documenting extension points provided by the standard library.
1 parent ae07546 commit 502a7aa

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
1+
Customizing the JavaScript analysis
2+
===================================
3+
4+
This document describes the main extension points offered by the JavaScript analysis for customizing
5+
analysis behavior without editing the queries or libraries themselves.
6+
7+
Customization mechanisms
8+
------------------------
9+
10+
The two mechanisms used for customization are subclassing and overriding.
11+
12+
By subclassing an abstract class used by the JavaScript analysis and implementing its abstract
13+
member predicates, we can teach the analysis to handle further instances of abstract concepts it
14+
already understands. For example, the standard library defines an abstract class
15+
``SystemCommandExecution`` that covers various APIs for executing operating-system commands. This
16+
class is used by the command-injection analysis to identify potentially problematic flows where
17+
input from a potentially malicious user is interpreted as the name of a system command to execute.
18+
By defining additional subclasses of ``SystemCommandExecution``, we can make this analysis more
19+
powerful without touching its implementation.
20+
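As a minimal sketch of the subclassing mechanism, the following models a call to a function ``run``
exported by a hypothetical npm package ``my-exec`` (both names are made up for illustration) as a
system-command execution. It assumes that ``getACommandArgument`` is the only abstract member
predicate of ``SystemCommandExecution``; other library versions may declare further abstract
members that would also need to be implemented.

.. code-block:: ql

   import javascript

   /**
    * A call to `run` from the hypothetical package `my-exec`,
    * modeled as a system-command execution.
    */
   class MyExecRunCall extends SystemCommandExecution, DataFlow::CallNode {
     MyExecRunCall() { this = DataFlow::moduleMember("my-exec", "run").getACall() }

     /** Assumption: the first argument holds the command to run. */
     override DataFlow::Node getACommandArgument() { result = this.getArgument(0) }
   }
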
21+
By overriding a member predicate defined in the library, we can change its behavior either for all
22+
its receivers or only a subset. For example, the standard library predicate
23+
``ControlFlowNode::getASuccessor`` implements the basic control-flow graph on which many further
24+
analyses are based. By overriding it, we can add, suppress or modify control-flow graph edges.
25+
26+
Once a customization has been defined, it needs to be brought into scope so that the default
27+
analysis queries pick it up. This can be done by adding the customizing definitions to
28+
``Customizations.qll``, an initially empty library file that is imported by the default library
29+
``javascript.qll``.
30+
31+
Sometimes you may want to apply both customization mechanisms to the same base class: subclassing
32+
to provide new implementations of an API, and overriding to selectively change existing
33+
implementations. This is not always easy to do, since the former requires the base class to be
34+
abstract, while the latter requires it to be concrete.
35+
36+
To work around this, the JavaScript library uses the so-called `range pattern`: the base class
37+
``Base`` itself is concrete, but it has an abstract companion class called ``Base::Range`` with the
38+
same member predicates and covering the same set of values. The default implementation of all
39+
predicates in ``Base`` simply delegates to their implementations in ``Base::Range``. To extend
40+
``Base`` with new implementations, we subclass ``Base::Range`` and implement its API. To customize
41+
``Base``, on the other hand, we subclass ``Base`` itself and override the predicates we want to
42+
adjust.
43+
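The range pattern can be sketched as follows. ``MyLogCall`` is a hypothetical concept invented for
this illustration and is not part of the standard library. The sketch defines the concrete class and
its abstract companion, then shows one extension (via ``MyLogCall::Range``) and one customization
(via ``MyLogCall`` itself).

.. code-block:: ql

   import javascript

   /**
    * A hypothetical concept: a call that writes a message to a log.
    * The concrete class delegates to its abstract companion `MyLogCall::Range`.
    */
   class MyLogCall extends DataFlow::Node {
     MyLogCall::Range range;

     MyLogCall() { this = range }

     /** Gets the data-flow node holding the logged message. */
     DataFlow::Node getMessage() { result = range.getMessage() }
   }

   module MyLogCall {
     /** The abstract companion class; subclass it to model new logging APIs. */
     abstract class Range extends DataFlow::Node {
       abstract DataFlow::Node getMessage();
     }
   }

   /** Extension: teach the concept about `console.log(msg)`. */
   class ConsoleLogCall extends MyLogCall::Range, DataFlow::CallNode {
     ConsoleLogCall() { this = DataFlow::globalVarRef("console").getAMemberCall("log") }

     override DataFlow::Node getMessage() { result = this.getArgument(0) }
   }

   /** Customization: for calls with two or more arguments, use the second as the message. */
   class MultiArgLogCall extends MyLogCall {
     MultiArgLogCall() { this.(DataFlow::CallNode).getNumArgument() >= 2 }

     override DataFlow::Node getMessage() { result = this.(DataFlow::CallNode).getArgument(1) }
   }
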
44+
Note that currently the range pattern is not yet used everywhere, so you will find some abstract
45+
classes without a concrete companion. We are planning on eventually migrating most abstract classes
46+
to use the range pattern.
47+
48+
Analysis layers
49+
---------------
50+
51+
The JavaScript analysis libraries have a layered structure with higher-level analyses based on
52+
lower-level ones. Usually, classes and predicates in a lower layer should not depend on a higher
53+
layer to avoid performance problems and negative recursion issues.
54+
55+
We briefly survey the most important analysis layers here, starting from the lowest layer. Below we
56+
will discuss the extension points offered by the individual layers.
57+
58+
AST
59+
~~~
60+
61+
The abstract syntax tree, implemented by class ``ASTNode`` and its subclasses, is the lowest layer
62+
and more or less directly represents the information stored in the snapshot database. It
63+
corresponds closely to the syntactic structure of the program, only abstracting away from
64+
typographical details such as whitespace and indentation.
65+
66+
CFG
67+
~~~
68+
69+
The (intra-procedural) control-flow graph, implemented by class ``ControlFlowNode`` and its
70+
subclasses, is the next higher level. It models flow of control inside functions and top-level
71+
scripts, and is overlaid on top of the AST in that each AST node has a corresponding CFG node. There
72+
are also synthetic CFG nodes that do not correspond to an AST node: entry and exit nodes
73+
(``ControlFlowEntryNode`` and ``ControlFlowExitNode``) mark the beginning and end, respectively, of
74+
the execution of a function or top-level, while guard nodes (``GuardControlFlowNode``) record the
75+
fact that some condition is known to hold at some point in the program.
76+
77+
Basic blocks (class ``BasicBlock``) organize control-flow nodes into maximal sequences of
78+
straight-line code, which is vital for efficiently reasoning about control flow.
79+
80+
SSA
81+
~~~
82+
83+
The static single-assignment representation (classes ``SsaVariable`` and ``SsaDefinition``) uses
84+
control-flow information to split up local variables into SSA variables that each only have a single
85+
definition. In addition to regular definitions from assignments and increment/decrement expressions,
86+
the SSA form also introduces pseudo-definitions such as `phi nodes` where multiple possible values
87+
for a variable are merged and `refinement nodes` (also known as `pi nodes`) marking program points
88+
where additional information about a variable becomes available that may restrict its possible set
89+
of values.
90+
91+
Local data flow
92+
~~~~~~~~~~~~~~~
93+
94+
The (intra-procedural) data-flow graph, implemented by class ``DataFlow::Node`` and its subclasses,
95+
represents the flow of data within a function or top-level. Each expression has a corresponding
96+
data-flow node. Additionally, there are data-flow nodes that do not correspond to syntactic
97+
elements; for example, each SSA variable has a corresponding data-flow node. Note that flow between
98+
functions (through arguments and return values) is not modelled in this layer, except for the
99+
special case of immediately-invoked function expressions. Flow through object properties is also not
100+
modelled.
101+
102+
This layer also implements the widely-used source-node API: class ``DataFlow::SourceNode`` and its
103+
subclasses represent data-flow nodes where new objects are created (such as object expressions), or
104+
where non-local data flow enters the intra-procedural data-flow graph (such as function parameters
105+
or property reads). The source-node API provides convenience predicates for reasoning about these
106+
nodes without having to explicitly encode data-flow graph traversal.
107+
108+
Type inference
109+
~~~~~~~~~~~~~~
110+
111+
Class ``AnalyzedNode`` and its subclasses implement (intra-procedural) type inference on top of the
112+
local data-flow graph. Some reasoning about properties is implemented as well, but more advanced
113+
features such as the prototype chain are not considered.
114+
115+
Call graph
116+
~~~~~~~~~~
117+
118+
The call graph is implemented as a predicate ``getACallee`` on ``DataFlow::InvokeNode``, the class
119+
of data-flow nodes representing function calls (with or without ``new``). It uses local data flow and
120+
type information, as well as type annotations where available.
121+
122+
Type tracking
123+
~~~~~~~~~~~~~
124+
125+
The type-tracking framework (classes ``DataFlow::TypeTracker`` and ``DataFlow::TypeBackTracker``) is
126+
a library for implementing custom type inference systems that track values inter-procedurally,
127+
including tracking through one level of object properties.
128+
129+
Framework models
130+
~~~~~~~~~~~~~~~~
131+
132+
The libraries under ``semmle/javascript/frameworks`` model a broad range of popular JavaScript
133+
libraries and frameworks, such as Express or Vue.js.
134+
135+
Global data flow and taint tracking
136+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
137+
138+
The inter-procedural data flow and taint-tracking libraries can be used to implement static
139+
information-flow analyses. Most of our security queries are based on this approach.
140+
141+
Extension points
142+
----------------
143+
144+
Below we discuss the most important extension points for the individual analysis layers introduced above.
145+
146+
AST
147+
~~~
148+
149+
This layer should not normally be customized. It is technically possible to override, say,
150+
``ASTNode.getChild`` to change the way the AST structure is represented, but this should normally be
151+
avoided in the interest of keeping a close correspondence between AST and concrete syntax.
152+
153+
CFG
154+
~~~
155+
156+
You can override ``ControlFlowNode.getASuccessor`` to customize the control-flow graph. Note that overriding ``ControlFlowNode.getAPredecessor`` is not normally useful, since it is rarely used in higher-level libraries.
157+
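For illustration only, the following sketch suppresses all outgoing control-flow edges from calls
of the form ``process.exit(...)``, on the assumption that execution never continues past such a
call. The class name is hypothetical, and the receiver is matched purely by name, so a local
variable that happens to be called ``process`` would be matched as well. The sketch deliberately
uses only AST-level classes so that the CFG layer does not come to depend on a higher layer.

.. code-block:: ql

   import javascript

   /** A method call of the form `process.exit(...)`, matched syntactically. */
   class ProcessExitCall extends MethodCallExpr {
     ProcessExitCall() {
       this.getReceiver().(VarAccess).getName() = "process" and
       this.getMethodName() = "exit"
     }

     /** Assumption: control never flows past this call, so it has no successors. */
     override ControlFlowNode getASuccessor() { none() }
   }
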
158+
SSA
159+
~~~
160+
161+
It is not normally necessary to customize this layer.
162+
163+
Local data flow
164+
~~~~~~~~~~~~~~~
165+
166+
The ``DataFlow::SourceNode`` class implements the range pattern, so new kinds of source nodes can be
167+
added by extending ``DataFlow::SourceNode::Range``. Some of its subclasses can similarly be
168+
extended: ``DataFlow::ModuleImportNode`` models module imports, and ``DataFlow::ClassNode`` models
169+
class definitions. The former provides default implementations covering CommonJS, AMD and ECMAScript
170+
2015 modules, while the latter handles ECMAScript 2015 classes as well as traditional function-based
171+
classes. You can extend their corresponding ``::Range`` classes to add support for other module or
172+
class systems.
173+
174+
Type inference
175+
~~~~~~~~~~~~~~
176+
177+
You can override ``AnalyzedNode::getAValue`` to customize the type inference. Note that the type inference is expected to be sound, that is, (as far as practical) the abstract values inferred for a data-flow node should cover all possible concrete values this node may take on at runtime.
178+
179+
You can also extend the set of abstract values in one of two ways:
180+
181+
1. To add individual abstract values that are independent of the program being analyzed, define a
182+
subclass of ``CustomAbstractValueTag`` describing the new abstract value. There will then be a
183+
corresponding value of class ``CustomAbstractValue`` that you can use in overriding
184+
definitions of the ``getAValue`` predicate.
185+
2. To add abstract values that are induced by a program element, define a subclass of
186+
``CustomAbstractValueDefinition``, and use its corresponding
187+
``CustomAbstractValueFromDefinition``.
188+
189+
Call graph
190+
~~~~~~~~~~
191+
192+
You can override ``DataFlow::InvokeNode::getACallee(int)`` to customize the call graph. Note that overriding the zero-argument version ``getACallee()`` is not enough since higher layers use the one-argument version.
193+
194+
Type tracking
195+
~~~~~~~~~~~~~
196+
197+
It is not normally necessary to customize this layer.
198+
199+
Framework models
200+
~~~~~~~~~~~~~~~~
201+
202+
The ``semmle.javascript.frameworks.HTTP`` module defines many abstract classes that can be extended
203+
to implement support for new web frameworks. These classes, in turn, are used by some of the
204+
security queries (such as the cross-site scripting queries) to define sources and sinks, so these
205+
queries will automatically benefit from the additional modeling.
206+
207+
Similarly, the ``semmle.javascript.frameworks.SQL`` module defines abstract classes for modeling SQL
208+
connector libraries, and the ``semmle.javascript.JsonParsers`` and
209+
``semmle.javascript.frameworks.XML`` modules for modeling JSON and XML parsers, respectively.
210+
211+
The ``semmle.javascript.Concepts`` module defines a few very broad concepts such as system-command
212+
executions or file-system accesses, which are concretely instantiated in some of the existing
213+
framework libraries, but can of course be further extended to model additional frameworks.
214+
215+
Global data flow and taint tracking
216+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
217+
218+
Most security queries consist of one QL file defining the query, one configuration module defining
219+
the taint-tracking configuration, and one customization module defining sources, sinks and
220+
sanitizers. For example, ``Security/CWE-078/CommandInjection.ql`` defines the command-injection
221+
query. It imports module ``semmle.javascript.security.dataflow.CommandInjection``, which defines the
222+
configuration class ``CommandInjection::Configuration``, and itself imports module
223+
``semmle.javascript.security.dataflow.CommandInjectionCustomizations``, which defines sources, sinks
224+
and sanitizers by means of three abstract classes ``CommandInjection::Source``,
225+
``CommandInjection::Sink`` and ``CommandInjection::Sanitizer``, respectively.
226+
227+
To define additional sources, sinks or sanitizers for this or any other security query, import the
228+
customization module and extend these abstract classes with additional subclasses.
229+
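For instance, an additional command-injection sink might look roughly like this. The package
``my-exec`` and its function ``run`` are hypothetical, and the sketch assumes that
``CommandInjection::Sink`` has no abstract member predicates that need implementing.

.. code-block:: ql

   import javascript
   import semmle.javascript.security.dataflow.CommandInjectionCustomizations

   /** The first argument of `run` from the hypothetical package `my-exec`. */
   class MyExecSink extends CommandInjection::Sink {
     MyExecSink() {
       this = DataFlow::moduleMember("my-exec", "run").getACall().getArgument(0)
     }
   }
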
230+
Note that you should normally only import the configuration module from a QL file. Importing it into
231+
the standard library (for example by importing it in ``Customizations.qll``) will slow down all the
232+
other security queries, since the configuration class will now always be in scope and flow from its
233+
sources to sinks will be tracked in addition to all the other configuration classes.
234+
235+
Another useful extension point is the class ``RemoteFlowSource``, which is used as a source by most
236+
queries looking for injection vulnerabilities (such as SQL injection or cross-site scripting). By
237+
extending it with new subclasses modelling other sources of user-controlled input you can
238+
simultaneously improve all of these queries.
239+
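A new remote flow source could be sketched as follows. The package ``my-http`` and its function
``readBody`` are hypothetical, and the sketch assumes that ``getSourceType`` is the only abstract
member predicate of ``RemoteFlowSource``.

.. code-block:: ql

   import javascript

   /** The result of calling `readBody` from the hypothetical package `my-http`. */
   class MyHttpBodySource extends RemoteFlowSource {
     MyHttpBodySource() { this = DataFlow::moduleMember("my-http", "readBody").getACall() }

     /** A short description of this kind of source. */
     override string getSourceType() { result = "my-http request body" }
   }
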
240+
Finally, you can extend the classes ``DataFlow::AdditionalSource``, ``DataFlow::AdditionalSink``,
241+
``DataFlow::AdditionalFlowStep`` and ``DataFlow::AdditionalBarrierGuardNode`` (and its subclasses)
242+
to define new sources, sinks, flow steps and sanitizers for all configurations, or only for specific
243+
configurations.
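
As a rough sketch, an additional flow step that applies to all configurations might be written like
this. The package ``my-utils`` and its function ``identity`` are hypothetical, and the sketch
assumes that ``DataFlow::AdditionalFlowStep`` provides a two-argument ``step`` predicate that can
be overridden.

.. code-block:: ql

   import javascript

   /**
    * A call to `identity` from the hypothetical package `my-utils`,
    * which is assumed to return its first argument unchanged.
    */
   class IdentityCallStep extends DataFlow::AdditionalFlowStep, DataFlow::CallNode {
     IdentityCallStep() { this = DataFlow::moduleMember("my-utils", "identity").getACall() }

     /** Data flows from the first argument to the result of the call. */
     override predicate step(DataFlow::Node pred, DataFlow::Node succ) {
       pred = this.getArgument(0) and
       succ = this
     }
   }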
