| 1 | +Customizing the JavaScript analysis |
| 2 | +=================================== |
| 3 | + |
| 4 | +This document describes the main extension points offered by the JavaScript analysis for customizing |
| 5 | +analysis behavior without editing the queries or libraries themselves. |
| 6 | + |
| 7 | +Customization mechanisms |
| 8 | +------------------------ |
| 9 | + |
| 10 | +The two mechanisms used for customization are subclassing and overriding. |
| 11 | + |
| 12 | +By subclassing an abstract class used by the JavaScript analysis and implementing its abstract |
| 13 | +member predicates, we can teach the analysis to handle further instances of abstract concepts it
| 14 | +already understands. For example, the standard library defines an abstract class |
| 15 | +``SystemCommandExecution`` that covers various APIs for executing operating-system commands. This |
| 16 | +class is used by the command-injection analysis to identify potentially problematic flows where |
| 17 | +input from a potentially malicious user is interpreted as the name of a system command to execute. |
| 18 | +By defining additional subclasses of ``SystemCommandExecution``, we can make this analysis more |
| 19 | +powerful without touching its implementation. |
| 20 | + |
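As a minimal sketch of the subclassing mechanism, the following class models a call to a function ``runCommand`` exported by a package ``my-shell``. Both names are hypothetical and only serve as illustration. The sketch assumes that ``getACommandArgument`` is the only abstract member predicate of ``SystemCommandExecution``; depending on the library version, further abstract predicates may need to be implemented as well.

.. code-block:: ql

   import javascript

   /**
    * A call to `runCommand` from the hypothetical package `my-shell`,
    * modeled as a system command execution.
    */
   class MyShellExecution extends SystemCommandExecution, DataFlow::CallNode {
     MyShellExecution() { this = DataFlow::moduleMember("my-shell", "runCommand").getACall() }

     // Assumption: the first argument holds the command to run.
     override DataFlow::Node getACommandArgument() { result = this.getArgument(0) }
   }
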
| 21 | +By overriding a member predicate defined in the library, we can change its behavior either for all |
| 22 | +its receivers or only a subset. For example, the standard library predicate |
| 23 | +``ControlFlowNode::getASuccessor`` implements the basic control-flow graph on which many further |
| 24 | +analyses are based. By overriding it, we can add, suppress or modify control-flow graph edges. |
| 25 | + |
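For illustration, the following sketch suppresses all outgoing control-flow edges of calls to a function named ``abort``, which is assumed never to return. The function name is hypothetical, and matching on the callee name alone is deliberately simplistic.

.. code-block:: ql

   import javascript

   /** A call to a hypothetical function named `abort`, assumed never to return. */
   class AbortCall extends CallExpr {
     AbortCall() { this.getCalleeName() = "abort" }

     // Overrides the library predicate for these calls only: they have no successors.
     override ControlFlowNode getASuccessor() { none() }
   }
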
| 26 | +Once a customization has been defined, it needs to be brought into scope so that the default |
| 27 | +analysis queries pick it up. This can be done by adding the customizing definitions to |
| 28 | +``Customizations.qll``, an initially empty library file that is imported by the default library |
| 29 | +``javascript.qll``. |
| 30 | + |
| 31 | +Sometimes you may want to apply both customization mechanisms to the same base class: subclassing
| 32 | +to provide new implementations of an API, and overriding to selectively change existing
| 33 | +implementations. This is not always easy to do, since the former requires the base class to be
| 34 | +abstract, while the latter requires it to be concrete.
| 35 | + |
| 36 | +To work around this, the JavaScript library uses the so-called `range pattern`: the base class |
| 37 | +``Base`` itself is concrete, but it has an abstract companion class called ``Base::Range`` with the |
| 38 | +same member predicates and covering the same set of values. The default implementation of all |
| 39 | +predicates in ``Base`` simply delegates to their implementations in ``Base::Range``. To extend |
| 40 | +``Base`` with new implementations, we subclass ``Base::Range`` and implement its API. To customize |
| 41 | +``Base``, on the other hand, we subclass ``Base`` itself and override the predicates we want to |
| 42 | +adjust. |
| 43 | + |
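The range pattern can be sketched as follows. The concept ``AuditCall``, the global function ``audit`` and the two-argument convention are all hypothetical; only the overall shape (a concrete class delegating to an abstract ``::Range`` companion) reflects the pattern described above.

.. code-block:: ql

   import javascript

   /** A call that records an audit message (hypothetical concept). */
   class AuditCall extends DataFlow::CallNode {
     AuditCall::Range range;

     AuditCall() { this = range }

     /** Gets the recorded message; by default this delegates to the range class. */
     DataFlow::Node getMessage() { result = range.getMessage() }
   }

   module AuditCall {
     /** Extend this class to add new kinds of audit calls. */
     abstract class Range extends DataFlow::CallNode {
       abstract DataFlow::Node getMessage();
     }
   }

   /** Extension: a call to a hypothetical global function `audit(message)`. */
   class GlobalAuditCall extends AuditCall::Range {
     GlobalAuditCall() { this = DataFlow::globalVarRef("audit").getACall() }

     override DataFlow::Node getMessage() { result = this.getArgument(0) }
   }

   /** Customization: assume two-argument audit calls pass the message second. */
   class TwoArgumentAuditCall extends AuditCall {
     TwoArgumentAuditCall() { this.getNumArgument() = 2 }

     override DataFlow::Node getMessage() { result = this.getArgument(1) }
   }
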
| 44 | +Note that currently the range pattern is not yet used everywhere, so you will find some abstract |
| 45 | +classes without a concrete companion. We are planning on eventually migrating most abstract classes |
| 46 | +to use the range pattern. |
| 47 | + |
| 48 | +Analysis layers |
| 49 | +--------------- |
| 50 | + |
| 51 | +The JavaScript analysis libraries have a layered structure with higher-level analyses based on |
| 52 | +lower-level ones. Usually, classes and predicates in a lower layer should not depend on a higher |
| 53 | +layer to avoid performance problems and negative recursion issues. |
| 54 | + |
| 55 | +We briefly survey the most important analysis layers here, starting from the lowest layer. Below we |
| 56 | +will discuss the extension points offered by the individual layers. |
| 57 | + |
| 58 | +AST |
| 59 | +~~~ |
| 60 | + |
| 61 | +The abstract syntax tree, implemented by class ``ASTNode`` and its subclasses, is the lowest layer |
| 62 | +and more or less directly represents the information stored in the snapshot database. It
| 63 | +corresponds closely to the syntactic structure of the program, only abstracting away from |
| 64 | +typographical details such as whitespace and indentation. |
| 65 | + |
| 66 | +CFG |
| 67 | +~~~ |
| 68 | + |
| 69 | +The (intra-procedural) control-flow graph, implemented by class ``ControlFlowNode`` and its |
| 70 | +subclasses, is the next higher level. It models flow of control inside functions and top-level |
| 71 | +scripts, and is overlaid on top of the AST in that each AST node has a corresponding CFG node. There |
| 72 | +are also synthetic CFG nodes that do not correspond to an AST node: entry and exit nodes |
| 73 | +(``ControlFlowEntryNode`` and ``ControlFlowExitNode``) mark the beginning and end, respectively, of |
| 74 | +the execution of a function or top-level, while guard nodes (``GuardControlFlowNode``) record the |
| 75 | +fact that some condition is known to hold at some point in the program. |
| 76 | + |
| 77 | +Basic blocks (class ``BasicBlock``) organize control-flow nodes into maximal sequences of |
| 78 | +straight-line code, which is vital for efficiently reasoning about control flow. |
| 79 | + |
| 80 | +SSA |
| 81 | +~~~ |
| 82 | + |
| 83 | +The static single-assignment representation (classes ``SsaVariable`` and ``SsaDefinition``) uses
| 84 | +control-flow information to split up local variables into SSA variables that each only have a single |
| 85 | +definition. In addition to regular definitions from assignments and increment/decrement expressions, |
| 86 | +the SSA form also introduces pseudo-definitions such as `phi nodes` where multiple possible values |
| 87 | +for a variable are merged and `refinement nodes` (also known as `pi nodes`) marking program points |
| 88 | +where additional information about a variable becomes available that may restrict its possible set |
| 89 | +of values. |
| 90 | + |
| 91 | +Local data flow |
| 92 | +~~~~~~~~~~~~~~~ |
| 93 | + |
| 94 | +The (intra-procedural) data-flow graph, implemented by class ``DataFlow::Node`` and its subclasses, |
| 95 | +represents the flow of data within a function or top-level. Each expression has a corresponding |
| 96 | +data-flow node. Additionally, there are data-flow nodes that do not correspond to syntactic |
| 97 | +elements; for example, each SSA variable has a corresponding data-flow node. Note that flow between |
| 98 | +functions (through arguments and return values) is not modelled in this layer, except for the |
| 99 | +special case of immediately-invoked function expressions. Flow through object properties is also not |
| 100 | +modelled. |
| 101 | + |
| 102 | +This layer also implements the widely-used source-node API: class ``DataFlow::SourceNode`` and its |
| 103 | +subclasses represent data-flow nodes where new objects are created (such as object expressions), or |
| 104 | +where non-local data flow enters the intra-procedural data-flow graph (such as function parameters |
| 105 | +or property reads). The source-node API provides convenience predicates for reasoning about these |
| 106 | +nodes without having to explicitly encode data-flow graph traversal. |
| 107 | + |
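As a small example of the source-node API, the following query finds calls to ``readFile`` on an import of the Node.js ``fs`` module without spelling out any data-flow graph traversal. The choice of module and method is purely illustrative.

.. code-block:: ql

   import javascript

   from DataFlow::CallNode call
   where call = DataFlow::moduleImport("fs").getAMemberCall("readFile")
   select call, "Call to fs.readFile."
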
| 108 | +Type inference |
| 109 | +~~~~~~~~~~~~~~ |
| 110 | + |
| 111 | +Class ``AnalyzedNode`` and its subclasses implement (intra-procedural) type inference on top of the |
| 112 | +local data-flow graph. Some reasoning about properties is implemented as well, but more advanced |
| 113 | +features such as the prototype chain are not considered. |
| 114 | + |
| 115 | +Call graph |
| 116 | +~~~~~~~~~~ |
| 117 | + |
| 118 | +The call graph is implemented as a predicate ``getACallee`` on ``DataFlow::InvokeNode``, the class |
| 119 | +of data-flow nodes representing function calls (with or without ``new``). It uses local data flow and
| 120 | +type information, as well as type annotations where available. |
| 121 | + |
| 122 | +Type tracking |
| 123 | +~~~~~~~~~~~~~ |
| 124 | + |
| 125 | +The type-tracking framework (classes ``DataFlow::TypeTracker`` and ``DataFlow::TypeBackTracker``) is |
| 126 | +a library for implementing custom type inference systems that track values inter-procedurally, |
| 127 | +including tracking through one level of object properties. |
| 128 | + |
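A typical use of the framework is a pair of predicates like the ones sketched below, which track references to an import of the ``fs`` module across function boundaries. The predicate name ``fsRef`` is made up for this example, and the sketch assumes a library version that provides ``DataFlow::TypeTracker::end()``.

.. code-block:: ql

   import javascript

   /** Gets a node that may refer to an import of `fs`, as described by tracker `t`. */
   DataFlow::SourceNode fsRef(DataFlow::TypeTracker t) {
     // Start tracking at the import itself.
     t.start() and
     result = DataFlow::moduleImport("fs")
     or
     // Follow one tracking step from a previously reached node.
     exists(DataFlow::TypeTracker t2 | result = fsRef(t2).track(t2, t))
   }

   /** Gets a node that may refer to an import of `fs`. */
   DataFlow::SourceNode fsRef() { result = fsRef(DataFlow::TypeTracker::end()) }

   from DataFlow::CallNode call
   where call = fsRef().getAMemberCall("readFile")
   select call, "Call to fs.readFile, possibly through an indirect reference."
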
| 129 | +Framework models |
| 130 | +~~~~~~~~~~~~~~~~ |
| 131 | + |
| 132 | +The libraries under ``semmle/javascript/frameworks`` model a broad range of popular JavaScript |
| 133 | +libraries and frameworks, such as Express or Vue.js. |
| 134 | + |
| 135 | +Global data flow and taint tracking |
| 136 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 137 | + |
| 138 | +The inter-procedural data flow and taint-tracking libraries can be used to implement static |
| 139 | +information-flow analyses. Most of our security queries are based on this approach. |
| 140 | + |
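A minimal taint-tracking query in this style might look as follows. The configuration name is made up, the choice of ``eval`` as a sink is only an example, and the sketch assumes the class-based configuration API (``TaintTracking::Configuration``), which later library versions may replace with a different interface.

.. code-block:: ql

   import javascript

   /** A sketch of a configuration tracking remote user input into `eval`. */
   class EvalTaintConfig extends TaintTracking::Configuration {
     EvalTaintConfig() { this = "EvalTaintConfig" }

     override predicate isSource(DataFlow::Node source) { source instanceof RemoteFlowSource }

     override predicate isSink(DataFlow::Node sink) {
       sink = DataFlow::globalVarRef("eval").getACall().getArgument(0)
     }
   }

   from EvalTaintConfig cfg, DataFlow::Node source, DataFlow::Node sink
   where cfg.hasFlow(source, sink)
   select sink, "Value passed to eval depends on $@.", source, "user input"
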
| 141 | +Extension points |
| 142 | +---------------- |
| 143 | + |
| 144 | +Below we discuss the most important extension points for the individual analysis layers introduced above. |
| 145 | + |
| 146 | +AST |
| 147 | +~~~ |
| 148 | + |
| 149 | +This layer should not normally be customized. It is technically possible to override, say, |
| 150 | +``ASTNode.getChild`` to change the way the AST structure is represented, but this should normally be |
| 151 | +avoided in the interest of keeping a close correspondence between AST and concrete syntax. |
| 152 | + |
| 153 | +CFG |
| 154 | +~~~ |
| 155 | + |
| 156 | +You can override ``ControlFlowNode.getASuccessor`` to customize the control-flow graph. Note that overriding ``ControlFlowNode.getAPredecessor`` is not normally useful, since it is rarely used in higher-level libraries. |
| 157 | + |
| 158 | +SSA |
| 159 | +~~~ |
| 160 | + |
| 161 | +It is not normally necessary to customize this layer. |
| 162 | + |
| 163 | +Local data flow |
| 164 | +~~~~~~~~~~~~~~~ |
| 165 | + |
| 166 | +The ``DataFlow::SourceNode`` class implements the range pattern, so new kinds of source nodes can be |
| 167 | +added by extending ``DataFlow::SourceNode::Range``. Some of its subclasses can similarly be
| 168 | +extended: ``DataFlow::ModuleImportNode`` models module imports, and ``DataFlow::ClassNode`` models |
| 169 | +class definitions. The former provides default implementations covering CommonJS, AMD and ECMAScript |
| 170 | +2015 modules, while the latter handles ECMAScript 2015 classes as well as traditional function-based |
| 171 | +classes. You can extend their corresponding ``::Range`` classes to add support for other module or |
| 172 | +class systems. |
| 173 | + |
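For instance, support for a custom module loader could be sketched like this. The global function ``loadModule`` is hypothetical, and the sketch assumes that ``getPath`` is the only abstract member predicate of ``DataFlow::ModuleImportNode::Range``.

.. code-block:: ql

   import javascript

   /** A call `loadModule("path")` to a hypothetical custom module loader. */
   class LoadModuleCall extends DataFlow::ModuleImportNode::Range, DataFlow::CallNode {
     LoadModuleCall() { this = DataFlow::globalVarRef("loadModule").getACall() }

     // The imported path is taken from the first argument if it is a known string.
     override string getPath() { this.getArgument(0).mayHaveStringValue(result) }
   }

With this class in scope, ``DataFlow::moduleImport("some-path")`` would also be expected to cover ``loadModule("some-path")`` calls.
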
| 174 | +Type inference |
| 175 | +~~~~~~~~~~~~~~ |
| 176 | + |
| 177 | +You can override ``AnalyzedNode::getAValue`` to customize the type inference. Note that the type inference is expected to be sound, that is (as far as practical) the abstract values inferred for a data-flow node should cover all possible concrete values this node may take on at runtime.
| 178 | + |
| 179 | +You can also extend the set of abstract values in one of two ways: |
| 180 | + |
| 181 | + 1. To add individual abstract values that are independent of the program being analyzed, define a |
| 182 | + subclass of ``CustomAbstractValueTag`` describing the new abstract value. There will then be a |
| 183 | + corresponding value of class ``CustomAbstractValue`` that you can use in overriding |
| 184 | + definitions of the ``getAValue`` predicate. |
| 185 | + 2. To add abstract values that are induced by a program element, define a subclass of |
| 186 | + ``CustomAbstractValueDefinition``, and use its corresponding |
| 187 | + ``CustomAbstractValueFromDefinition``. |
| 188 | + |
| 189 | +Call graph |
| 190 | +~~~~~~~~~~ |
| 191 | + |
| 192 | +You can override ``DataFlow::InvokeNode::getACallee(int)`` to customize the call graph. Note that overriding the zero-argument version ``getACallee()`` is not enough since higher layers use the one-argument version. |
| 193 | + |
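As a rough sketch, the following class adds call-graph edges for a hypothetical global function ``invokeLater(f)`` that is assumed to invoke its argument. The function name is an assumption, as is the decision to report the extra callee with an imprecision of zero.

.. code-block:: ql

   import javascript

   /** A call to a hypothetical function `invokeLater(f)`, assumed to invoke `f`. */
   class InvokeLaterCall extends DataFlow::CallNode {
     InvokeLaterCall() { this = DataFlow::globalVarRef("invokeLater").getACall() }

     override Function getACallee(int imprecision) {
       // Keep the callees found by the default call graph.
       result = super.getACallee(imprecision)
       or
       // Additionally treat a function passed as the first argument as a callee.
       imprecision = 0 and
       result = this.getArgument(0).getALocalSource().(DataFlow::FunctionNode).getFunction()
     }
   }
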
| 194 | +Type tracking |
| 195 | +~~~~~~~~~~~~~ |
| 196 | + |
| 197 | +It is not normally necessary to customize this layer. |
| 198 | + |
| 199 | +Framework models |
| 200 | +~~~~~~~~~~~~~~~~ |
| 201 | + |
| 202 | +The ``semmle.javascript.frameworks.HTTP`` module defines many abstract classes that can be extended |
| 203 | +to implement support for new web frameworks. These classes, in turn, are used by some of the |
| 204 | +security queries (such as the cross-site scripting queries) to define sources and sinks, so these |
| 205 | +queries will automatically benefit from the additional modeling. |
| 206 | + |
| 207 | +Similarly, the ``semmle.javascript.frameworks.SQL`` module defines abstract classes for modeling SQL |
| 208 | +connector libraries, and the ``semmle.javascript.JsonParsers`` and |
| 209 | +``semmle.javascript.frameworks.XML`` modules for modeling JSON and XML parsers, respectively. |
| 210 | + |
| 211 | +The ``semmle.javascript.Concepts`` module defines a few very broad concepts such as system-command
| 212 | +executions or file-system accesses, which are concretely instantiated in some of the existing |
| 213 | +framework libraries, but can of course be further extended to model additional frameworks. |
| 214 | + |
| 215 | +Global data flow and taint tracking |
| 216 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 217 | + |
| 218 | +Most security queries consist of one QL file defining the query, one configuration module defining |
| 219 | +the taint-tracking configuration, and one customization module defining sources, sinks and |
| 220 | +sanitizers. For example, ``Security/CWE-078/CommandInjection.ql`` defines the command-injection |
| 221 | +query. It imports module ``semmle.javascript.security.dataflow.CommandInjection``, which defines the |
| 222 | +configuration class ``CommandInjection::Configuration``, and itself imports module |
| 223 | +``semmle.javascript.security.dataflow.CommandInjectionCustomizations``, which defines sources, sinks |
| 224 | +and sanitizers by means of three abstract classes ``CommandInjection::Source``, |
| 225 | +``CommandInjection::Sink`` and ``CommandInjection::Sanitizer``, respectively.
| 226 | + |
| 227 | +To define additional sources, sinks or sanitizers for this or any other security query, import the |
| 228 | +customization module and extend these abstract classes with additional subclasses. |
| 229 | + |
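For example, an additional sink for the command-injection query could be sketched as follows, assuming a hypothetical global function ``runShell`` whose first argument is executed as a command.

.. code-block:: ql

   import javascript
   import semmle.javascript.security.dataflow.CommandInjectionCustomizations

   /** The first argument of a call to the hypothetical function `runShell`. */
   class RunShellSink extends CommandInjection::Sink {
     RunShellSink() { this = DataFlow::globalVarRef("runShell").getACall().getArgument(0) }
   }
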
| 230 | +Note that you should normally only import the configuration module from a QL file. Importing it into |
| 231 | +the standard library (for example by importing it in ``Customizations.qll``) will slow down all the |
| 232 | +other security queries, since the configuration class will now always be in scope and flow from its
| 233 | +sources to sinks will be tracked in addition to all the other configuration classes. |
| 234 | + |
| 235 | +Another useful extension point is the class ``RemoteFlowSource``, which is used as a source by most |
| 236 | +queries looking for injection vulnerabilities (such as SQL injection or cross-site scripting). By |
| 237 | +extending it with new subclasses modelling other sources of user-controlled input you can |
| 238 | +simultaneously improve all of these queries. |
| 239 | + |
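A new remote flow source could be sketched like this. The global function ``getUserInput`` is hypothetical, and the sketch assumes a library version in which ``RemoteFlowSource`` is an abstract class with the abstract member predicate ``getSourceType``; versions that apply the range pattern to this class would require extending its ``::Range`` companion instead.

.. code-block:: ql

   import javascript

   /** A call to a hypothetical function `getUserInput`, returning user-controlled data. */
   class UserInputCall extends RemoteFlowSource {
     UserInputCall() { this = DataFlow::globalVarRef("getUserInput").getACall() }

     override string getSourceType() { result = "result of getUserInput" }
   }
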
| 240 | +Finally, you can extend the classes ``DataFlow::AdditionalSource``, ``DataFlow::AdditionalSink``,
| 241 | +``DataFlow::AdditionalFlowStep`` and ``DataFlow::AdditionalBarrierGuardNode`` (and its subclasses) |
| 242 | +to define new sources, sinks, flow steps and sanitizers for all configurations, or only for specific |
| 243 | +configurations. |
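
As an illustration, the following sketch adds a flow step through a hypothetical global function ``identity`` that is assumed to return its first argument unchanged. It assumes a library version in which ``DataFlow::AdditionalFlowStep`` extends ``DataFlow::Node`` and offers an overridable ``step`` predicate.

.. code-block:: ql

   import javascript

   /** A call to a hypothetical function `identity(x)`, assumed to return `x`. */
   class IdentityCallStep extends DataFlow::AdditionalFlowStep, DataFlow::CallNode {
     IdentityCallStep() { this = DataFlow::globalVarRef("identity").getACall() }

     // Data flows from the first argument to the result of the call.
     override predicate step(DataFlow::Node pred, DataFlow::Node succ) {
       pred = this.getArgument(0) and
       succ = this
     }
   }
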