forked from marijnh/Eloquent-JavaScript
<!doctype html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Regular Expressions :: Eloquent JavaScript</title>
<link rel=stylesheet href="js/node_modules/codemirror/lib/codemirror.css">
<script src="js/acorn_codemirror.js"></script>
<link rel=stylesheet href="css/ejs.css">
<script src="js/sandbox.js"></script>
<script src="js/ejs.js"></script><script>var chapNum = 9;</script></head>
<article>
<nav><a href="08_error.html" title="previous chapter">◀</a> <a href="index.html" title="cover">◆</a> <a href="10_modulos.html" title="next chapter">▶</a></nav>
<h1><div class=chap_num>Chapter 9</div>Regular Expressions</h1>
<blockquote>
<p><a class="p_ident" id="p_MWUwIAb0uO" href="#p_MWUwIAb0uO" tabindex="-1" role="presentation"></a>Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.</p>
<footer>Jamie Zawinski</footer>
</blockquote>
<blockquote>
<p><a class="p_ident" id="p_icxlw7+18l" href="#p_icxlw7+18l" tabindex="-1" role="presentation"></a>Yuan-Ma said, ‘When you cut against the grain of the wood, much strength is needed. When you program against the grain of the problem, much code is needed.’</p>
<footer>Master Yuan-Ma, <cite>The Book of Programming</cite></footer>
</blockquote>
<p><a class="p_ident" id="p_mYvGNMWwx9" href="#p_mYvGNMWwx9" tabindex="-1" role="presentation"></a>Programming tools and techniques survive and spread in a chaotic, evolutionary way. It’s not always the pretty or brilliant ones that win but rather the ones that function well enough within the right niche or happen to be integrated with another successful piece of technology.</p>
<p><a class="p_ident" id="p_iH3Aqi6y2A" href="#p_iH3Aqi6y2A" tabindex="-1" role="presentation"></a>In this chapter, I will discuss one such tool, <em>regular
expressions</em>. Regular expressions are a way to describe patterns in string data. They form a small, separate language that is part of JavaScript and many other languages and systems.</p>
<p><a class="p_ident" id="p_cxbejyPUGl" href="#p_cxbejyPUGl" tabindex="-1" role="presentation"></a>Regular expressions are both terribly awkward and extremely useful. Their syntax is cryptic, and the programming interface JavaScript provides for them is clumsy. But they are a powerful tool for inspecting and processing strings. Properly understanding regular expressions will make you a more effective programmer.</p>
<h2><a class="h_ident" id="h_5w4yGFJRYl" href="#h_5w4yGFJRYl" tabindex="-1" role="presentation"></a>Creating a regular expression</h2>
<p><a class="p_ident" id="p_u/9SKAI2Yi" href="#p_u/9SKAI2Yi" tabindex="-1" role="presentation"></a>A regular expression is a type of object. It can either be constructed with the <code>RegExp</code> constructor or written as a literal value by enclosing a pattern in forward slash (<code>/</code>) characters.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_O1I2rl+HTy" href="#c_O1I2rl+HTy" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">re1</span> <span class="cm-operator">=</span> <span class="cm-keyword">new</span> <span class="cm-variable">RegExp</span>(<span class="cm-string">"abc"</span>);
<span class="cm-keyword">let</span> <span class="cm-def">re2</span> <span class="cm-operator">=</span> <span class="cm-string-2">/abc/</span>;</pre>
<p><a class="p_ident" id="p_uNMQxzr01n" href="#p_uNMQxzr01n" tabindex="-1" role="presentation"></a>Both of these regular expression objects represent the same pattern: an <em>a</em> character followed by a <em>b</em> followed by a <em>c</em>.</p>
<p><a class="p_ident" id="p_qv8UWLVrTv" href="#p_qv8UWLVrTv" tabindex="-1" role="presentation"></a>When using the <code>RegExp</code> constructor, the pattern is written as a normal string, so the usual rules apply for backslashes.</p>
<p><a class="p_ident" id="p_0mNIcPpslS" href="#p_0mNIcPpslS" tabindex="-1" role="presentation"></a>The second notation, where the pattern appears between slash characters, treats backslashes somewhat differently. First, since a forward slash ends the pattern, we need to put a backslash before any forward slash that we want to be <em>part</em> of the pattern. In addition, backslashes that aren’t part of special character codes (like <code>\n</code>) will be <em>preserved</em>, rather than ignored as they are in strings, and change the meaning of the pattern. Some characters, such as question marks and plus signs, have special meanings in regular expressions and must be preceded by a backslash if they are meant to represent the character itself.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_uRzUiBSrul" href="#c_uRzUiBSrul" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">eighteenPlus</span> <span class="cm-operator">=</span> <span class="cm-string-2">/eighteen\+/</span>;</pre>
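<p>The difference between the two notations is easiest to see side by side. The following is a small sketch, with binding names made up for this illustration, that writes the same pattern both ways. It compares the patterns through the standard <code>source</code> property of regular expression objects, which holds the pattern text.</p>

```javascript
// Hypothetical binding names, chosen only for this illustration.
// In a string, every backslash meant for the pattern must be doubled.
const fromString = new RegExp("\\d+\\.\\d+");
const fromLiteral = /\d+\.\d+/;
console.log(fromString.source === fromLiteral.source);
// → true

// A forward slash needs a backslash in a literal but not in a string.
const slashLiteral = /a\/b/;
const slashString = new RegExp("a/b");
console.log(slashLiteral.test("a/b"), slashString.test("a/b"));
// → true true
```

<p>Forgetting to double the backslashes in the string form is a common slip: <code>"\d"</code> is simply the string <code>"d"</code>, so the resulting pattern would look for the letter <em>d</em> rather than a digit.</p>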
<h2><a class="h_ident" id="h_vPyyYjMEtz" href="#h_vPyyYjMEtz" tabindex="-1" role="presentation"></a>Testing for matches</h2>
<p><a class="p_ident" id="p_SHaMWlzFzk" href="#p_SHaMWlzFzk" tabindex="-1" role="presentation"></a>Regular expression objects have a number of methods. The simplest one is <code>test</code>. If you pass it a string, it will return a Boolean telling you whether the string contains a match of the pattern in the expression.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_Szn1CmrIV5" href="#c_Szn1CmrIV5" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/abc/</span>.<span class="cm-property">test</span>(<span class="cm-string">"abcde"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/abc/</span>.<span class="cm-property">test</span>(<span class="cm-string">"abxde"</span>));
<span class="cm-comment">// → false</span></pre>
<p><a class="p_ident" id="p_YGcbMDV493" href="#p_YGcbMDV493" tabindex="-1" role="presentation"></a>A regular expression consisting of only nonspecial characters simply represents that sequence of characters. If <em>abc</em> occurs anywhere in the string we are testing against (not just at the start), <code>test</code> will return <code>true</code>.</p>
<h2><a class="h_ident" id="h_8EFR0DU1xd" href="#h_8EFR0DU1xd" tabindex="-1" role="presentation"></a>Sets of characters</h2>
<p><a class="p_ident" id="p_ZyB7HeLr75" href="#p_ZyB7HeLr75" tabindex="-1" role="presentation"></a>Finding out whether a string contains <em>abc</em> could just as well be done with a call to <code>indexOf</code>. Regular expressions allow us to express more complicated patterns.</p>
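<p>As a quick check of that claim, here is a sketch that asks the same question both ways. The binding name <code>sample</code> is only there for this example.</p>

```javascript
// Hypothetical binding name used only for this comparison.
const sample = "abcde";
// indexOf returns -1 when the substring is absent.
console.log(sample.indexOf("abc") !== -1);
// → true
console.log(/abc/.test(sample));
// → true
```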
<p><a class="p_ident" id="p_i/99SEfu9y" href="#p_i/99SEfu9y" tabindex="-1" role="presentation"></a>Say we want to match any number. In a regular expression, putting a set of characters between square brackets makes that part of the expression match any of the characters between the brackets.</p>
<p><a class="p_ident" id="p_sC+2E08KnL" href="#p_sC+2E08KnL" tabindex="-1" role="presentation"></a>Both of the following expressions match all strings that contain a digit:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_Z3UJdL//cY" href="#c_Z3UJdL//cY" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/[0123456789]/</span>.<span class="cm-property">test</span>(<span class="cm-string">"in 1992"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/[0-9]/</span>.<span class="cm-property">test</span>(<span class="cm-string">"in 1992"</span>));
<span class="cm-comment">// → true</span></pre>
<p><a class="p_ident" id="p_i0WYLVUede" href="#p_i0WYLVUede" tabindex="-1" role="presentation"></a>Within square brackets, a dash (<code>-</code>) between two characters can be used to indicate a range of characters, where the ordering is determined by the character’s Unicode number. Characters 0 to 9 sit right next to each other in this ordering (codes 48 to 57), so <code>[0-9]</code> covers all of them and matches any digit.</p>
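<p>To make the role of character codes concrete, the sketch below looks up the codes with the standard <code>charCodeAt</code> method and then tries a letter range. The binding name is an assumption made for the example.</p>

```javascript
console.log("0".charCodeAt(0), "9".charCodeAt(0));
// → 48 57

// Hypothetical binding name. The range a-z covers codes 97 to 122 only.
const lowercaseLetter = /[a-z]/;
console.log(lowercaseLetter.test("Q"), lowercaseLetter.test("q"));
// → false true
// "\u00e9" is an accented e with code 233, outside the range.
console.log(lowercaseLetter.test("\u00e9"));
// → false
```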
<p><a class="p_ident" id="p_wRTFHKw9PD" href="#p_wRTFHKw9PD" tabindex="-1" role="presentation"></a>There are a number of common character groups that have their own built-in shortcuts. Digits are one of them: <code>\d</code> means the same thing as <code>[0-9]</code>.</p>
<table>
<tr><td><code>\d</code></td><td>Any digit character</td>
</tr>
<tr><td><code>\w</code></td><td>An alphanumeric character or underscore (“word character”)</td>
</tr>
<tr><td><code>\s</code></td><td>Any whitespace character (space, tab, newline, and similar)</td>
</tr>
<tr><td><code>\D</code></td><td>A character that is <em>not</em> a digit</td>
</tr>
<tr><td><code>\W</code></td><td>A nonalphanumeric character</td>
</tr>
<tr><td><code>\S</code></td><td>A nonwhitespace character</td>
</tr>
<tr><td><code>.</code></td><td>Any character except for newline</td>
</tr>
</table>
<p><a class="p_ident" id="p_yXMUKEYpwG" href="#p_yXMUKEYpwG" tabindex="-1" role="presentation"></a>So you could match a date and time format like 30-01-2003 15:20 with the following expression:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_OD88m8wrrs" href="#c_OD88m8wrrs" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">dateTime</span> <span class="cm-operator">=</span> <span class="cm-string-2">/\d\d-\d\d-\d\d\d\d \d\d:\d\d/</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">dateTime</span>.<span class="cm-property">test</span>(<span class="cm-string">"30-01-2003 15:20"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">dateTime</span>.<span class="cm-property">test</span>(<span class="cm-string">"30-jan-2003 15:20"</span>));
<span class="cm-comment">// → false</span></pre>
<p><a class="p_ident" id="p_gdY0EhLXlE" href="#p_gdY0EhLXlE" tabindex="-1" role="presentation"></a>That looks completely awful, doesn’t it? Half of it is backslashes, producing a background noise that makes it hard to spot the actual pattern expressed. We’ll see a slightly improved version of this expression <a href="09_regexp.html#date_regexp_counted">later</a>.</p>
<p><a class="p_ident" id="p_P0qAMYu0C/" href="#p_P0qAMYu0C/" tabindex="-1" role="presentation"></a>These backslash codes can also be used inside square brackets. For example, <code>[\d.]</code> means any digit or a period character. But the period itself, between square brackets, loses its special meaning. The same goes for other special characters, such as <code>+</code>.</p>
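<p>A short sketch may help here. The binding names are invented for the example, and the last line is included only to contrast the period's behavior outside brackets.</p>

```javascript
// Hypothetical binding names for this illustration.
const digitOrPeriod = /[\d.]/;
console.log(digitOrPeriod.test("v."), digitOrPeriod.test("v7"));
// → true true
// Inside brackets the period is literal, so plain letters do not match.
console.log(digitOrPeriod.test("version"));
// → false

// The plus sign is also an ordinary character inside brackets.
const plusOrMinus = /[+-]/;
console.log(plusOrMinus.test("a+b"), plusOrMinus.test("ab"));
// → true false

// Outside brackets, the period matches any character except newline.
console.log(/./.test("version"));
// → true
```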
<p><a class="p_ident" id="p_HqQEZsitdl" href="#p_HqQEZsitdl" tabindex="-1" role="presentation"></a>To <em>invert</em> a set of characters—that is, to express that you want to match any character <em>except</em> the ones in the set—you can write a caret (<code>^</code>) character after the opening bracket.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_XH8deAcckk" href="#c_XH8deAcckk" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">notBinary</span> <span class="cm-operator">=</span> <span class="cm-string-2">/[^01]/</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">notBinary</span>.<span class="cm-property">test</span>(<span class="cm-string">"1100100010100110"</span>));
<span class="cm-comment">// → false</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">notBinary</span>.<span class="cm-property">test</span>(<span class="cm-string">"1100100010200110"</span>));
<span class="cm-comment">// → true</span></pre>
<h2><a class="h_ident" id="h_iFI1qvUwY9" href="#h_iFI1qvUwY9" tabindex="-1" role="presentation"></a>Repeating parts of a pattern</h2>
<p><a class="p_ident" id="p_crYiu/oAUM" href="#p_crYiu/oAUM" tabindex="-1" role="presentation"></a>We now know how to match a single digit. What if we want to match a whole number—a sequence of one or more digits?</p>
<p><a class="p_ident" id="p_B4wupHzbR+" href="#p_B4wupHzbR+" tabindex="-1" role="presentation"></a>When you put a plus sign (<code>+</code>) after something in a regular expression, it indicates that the element may be repeated more than once. Thus, <code>/\d+/</code> matches one or more digit characters.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_9/5mFF4Ih4" href="#c_9/5mFF4Ih4" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/'\d+'/</span>.<span class="cm-property">test</span>(<span class="cm-string">"'123'"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/'\d+'/</span>.<span class="cm-property">test</span>(<span class="cm-string">"''"</span>));
<span class="cm-comment">// → false</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/'\d*'/</span>.<span class="cm-property">test</span>(<span class="cm-string">"'123'"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/'\d*'/</span>.<span class="cm-property">test</span>(<span class="cm-string">"''"</span>));
<span class="cm-comment">// → true</span></pre>
<p><a class="p_ident" id="p_/oNBIVm41F" href="#p_/oNBIVm41F" tabindex="-1" role="presentation"></a>The star (<code>*</code>) has a similar meaning but also allows the pattern to match zero times. Something with a star after it never prevents a pattern from matching—it’ll just match zero instances if it can’t find any suitable text to match.</p>
<p><a class="p_ident" id="p_77Y5t9C8NV" href="#p_77Y5t9C8NV" tabindex="-1" role="presentation"></a>A question mark makes a part of a pattern <em>optional</em>, meaning it may occur zero or one time. In the following example, the <em>u</em> character is allowed to occur, but the pattern also matches when it is missing.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_EiCIowdq+d" href="#c_EiCIowdq+d" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">neighbor</span> <span class="cm-operator">=</span> <span class="cm-string-2">/neighbou?r/</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">neighbor</span>.<span class="cm-property">test</span>(<span class="cm-string">"neighbour"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">neighbor</span>.<span class="cm-property">test</span>(<span class="cm-string">"neighbor"</span>));
<span class="cm-comment">// → true</span></pre>
<p><a class="p_ident" id="p_B4ikd8xN8i" href="#p_B4ikd8xN8i" tabindex="-1" role="presentation"></a>To indicate that a pattern should occur a precise number of times, use curly braces. Putting <code>{4}</code> after an element, for example, requires it to occur exactly four times. It is also possible to specify a range this way: <code>{2,4}</code> means the element must occur at least twice and at most four times.</p>
<p id="date_regexp_counted"><a class="p_ident" id="p_a1yNHuI+49" href="#p_a1yNHuI+49" tabindex="-1" role="presentation"></a>Here is another version of the date and time pattern that allows both single- and double-digit days, months, and hours. It is also slightly easier to decipher.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_MjrOJNLVF5" href="#c_MjrOJNLVF5" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">dateTime</span> <span class="cm-operator">=</span> <span class="cm-string-2">/\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">dateTime</span>.<span class="cm-property">test</span>(<span class="cm-string">"30-1-2003 8:45"</span>));
<span class="cm-comment">// → true</span></pre>
<p><a class="p_ident" id="p_RjqN6VMQIa" href="#p_RjqN6VMQIa" tabindex="-1" role="presentation"></a>You can also specify open-ended ranges when using curly braces by omitting the number after the comma. So <code>{5,}</code> means five or more times.</p>
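<p>The sketch below tries out both kinds of range. As in the earlier examples, the quote characters in the patterns are there to pin down where the digits must start and stop. The binding names are assumptions made for this illustration.</p>

```javascript
// Hypothetical binding names for this illustration.
const twoToFour = /'\d{2,4}'/;
console.log(twoToFour.test("'1'"), twoToFour.test("'123'"),
            twoToFour.test("'12345'"));
// → false true false

const fiveOrMore = /'\d{5,}'/;
console.log(fiveOrMore.test("'1234'"), fiveOrMore.test("'1234567'"));
// → false true
```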
<h2><a class="h_ident" id="h_uICSDspz1I" href="#h_uICSDspz1I" tabindex="-1" role="presentation"></a>Grouping subexpressions</h2>
<p><a class="p_ident" id="p_pKTOYUDGIr" href="#p_pKTOYUDGIr" tabindex="-1" role="presentation"></a>To use an operator like <code>*</code> or <code>+</code> on more than one element at a time, you have to use parentheses. A part of a regular expression that is enclosed in parentheses counts as a single element as far as the operators following it are concerned.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_P/f6a65XwI" href="#c_P/f6a65XwI" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">cartoonCrying</span> <span class="cm-operator">=</span> <span class="cm-string-2">/boo+(hoo+)+/i</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">cartoonCrying</span>.<span class="cm-property">test</span>(<span class="cm-string">"Boohoooohoohooo"</span>));
<span class="cm-comment">// → true</span></pre>
<p><a class="p_ident" id="p_S5jkv2dMC+" href="#p_S5jkv2dMC+" tabindex="-1" role="presentation"></a>The first and second <code>+</code> characters apply only to the second <em>o</em> in <em>boo</em> and <em>hoo</em>, respectively. The third <code>+</code> applies to the whole group <code>(hoo+)</code>, matching one or more sequences like that.</p>
<p><a class="p_ident" id="p_c4RlIM4/HI" href="#p_c4RlIM4/HI" tabindex="-1" role="presentation"></a>The <code>i</code> at the end of the expression in the example makes this regular expression case insensitive, allowing it to match the uppercase <em>B</em> in the input string, even though the pattern is itself all lowercase.</p>
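<p>To see what difference the parentheses make, compare a pattern that repeats a group with one that repeats only a single character. This is a sketch with made-up binding names, and the quotes again mark where the repeated part must begin and end.</p>

```javascript
// Hypothetical binding names for this illustration.
const repeatedPair = /'(ab)+'/;  // + applies to the group "ab"
const repeatedB = /'ab+'/;       // + applies only to "b"

console.log(repeatedPair.test("'ababab'"), repeatedB.test("'ababab'"));
// → true false
console.log(repeatedPair.test("'abbb'"), repeatedB.test("'abbb'"));
// → false true

// With the i option, case differences are ignored.
console.log(/'(ab)+'/i.test("'ABab'"));
// → true
```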
<h2><a class="h_ident" id="h_CV5XL/TADP" href="#h_CV5XL/TADP" tabindex="-1" role="presentation"></a>Matches and groups</h2>
<p><a class="p_ident" id="p_K3KRDzatsp" href="#p_K3KRDzatsp" tabindex="-1" role="presentation"></a>The <code>test</code> method is the absolute simplest way to match a regular expression. It tells you only whether it matched and nothing else. Regular expressions also have an <code>exec</code> (execute) method that will return <code>null</code> if no match was found and return an object with information about the match otherwise.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_JJMWZpk0iD" href="#c_JJMWZpk0iD" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">match</span> <span class="cm-operator">=</span> <span class="cm-string-2">/\d+/</span>.<span class="cm-property">exec</span>(<span class="cm-string">"one two 100"</span>);
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">match</span>);
<span class="cm-comment">// → ["100"]</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">match</span>.<span class="cm-property">index</span>);
<span class="cm-comment">// → 8</span></pre>
<p><a class="p_ident" id="p_fJSwbQyG6w" href="#p_fJSwbQyG6w" tabindex="-1" role="presentation"></a>An object returned from <code>exec</code> has an <code>index</code> property that tells us <em>where</em> in the string the successful match begins. Other than that, the object looks like (and in fact is) an array of strings, whose first element is the string that was matched—in the previous example, this is the sequence of digits that we were looking for.</p>
<p><a class="p_ident" id="p_VT4fpht7D7" href="#p_VT4fpht7D7" tabindex="-1" role="presentation"></a>String values have a <code>match</code> method that behaves similarly.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_uAkAqNYx+q" href="#c_uAkAqNYx+q" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">"one two 100"</span>.<span class="cm-property">match</span>(<span class="cm-string-2">/\d+/</span>));
<span class="cm-comment">// → ["100"]</span></pre>
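<p>Because <code>exec</code> returns <code>null</code> when nothing matches, code that reads properties from the result usually has to check for that case first. The following sketch shows one way to do so. The function <code>firstNumber</code> and the shape of the object it returns are assumptions made for this example.</p>

```javascript
// Hypothetical helper: returns the first whole number in a string
// together with its position, or null when there is none.
function firstNumber(text) {
  const match = /\d+/.exec(text);
  if (match === null) return null;
  return {value: Number(match[0]), index: match.index};
}

console.log(firstNumber("one two 100"));
// → {value: 100, index: 8}
console.log(firstNumber("no digits here"));
// → null
```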
<p><a class="p_ident" id="p_/9rdcJO9zZ" href="#p_/9rdcJO9zZ" tabindex="-1" role="presentation"></a>When the regular expression contains subexpressions grouped with parentheses, the text that matched those groups will also show up in the array. The whole match is always the first element. The next element is the part matched by the first group (the one whose opening parenthesis comes first in the expression), then the second group, and so on.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_5E2M1BBsUm" href="#c_5E2M1BBsUm" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">quotedText</span> <span class="cm-operator">=</span> <span class="cm-string-2">/'([^']*)'/</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">quotedText</span>.<span class="cm-property">exec</span>(<span class="cm-string">"she said 'hello'"</span>));
<span class="cm-comment">// → ["'hello'", "hello"]</span></pre>
<p><a class="p_ident" id="p_f4bciMASJ1" href="#p_f4bciMASJ1" tabindex="-1" role="presentation"></a>When a group does not end up being matched at all (for example, when followed by a question mark), its position in the output array will hold <code>undefined</code>. Similarly, when a group is matched multiple times, only the last match ends up in the array.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_j9t+gn+1eT" href="#c_j9t+gn+1eT" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/bad(ly)?/</span>.<span class="cm-property">exec</span>(<span class="cm-string">"bad"</span>));
<span class="cm-comment">// → ["bad", undefined]</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/(\d)+/</span>.<span class="cm-property">exec</span>(<span class="cm-string">"123"</span>));
<span class="cm-comment">// → ["123", "3"]</span></pre>
<p><a class="p_ident" id="p_GvLofvnQnz" href="#p_GvLofvnQnz" tabindex="-1" role="presentation"></a>Groups can be useful for extracting parts of a string. If we don’t just want to verify whether a string contains a date but also extract it and construct an object that represents it, we can wrap parentheses around the digit patterns and directly pick the date out of the result of <code>exec</code>.</p>
<p><a class="p_ident" id="p_B9SEqDbr+Y" href="#p_B9SEqDbr+Y" tabindex="-1" role="presentation"></a>But first, a brief detour, in which we discuss the built-in way to represent date and time values in JavaScript.</p>
<h2><a class="h_ident" id="h_8U7L7LCU27" href="#h_8U7L7LCU27" tabindex="-1" role="presentation"></a>The Date class</h2>
<p><a class="p_ident" id="p_2NeTRvucQq" href="#p_2NeTRvucQq" tabindex="-1" role="presentation"></a>JavaScript has a standard class for representing dates—or rather, points in time. It is called <code>Date</code>. If you simply create a date object using <code>new</code>, you get the current date and time.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_AjgqFetryg" href="#c_AjgqFetryg" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-keyword">new</span> <span class="cm-variable">Date</span>());
<span class="cm-comment">// → Mon Nov 13 2017 16:19:11 GMT+0100 (CET)</span></pre>
<p><a class="p_ident" id="p_IcV7kv3B1y" href="#p_IcV7kv3B1y" tabindex="-1" role="presentation"></a>You can also create an object for a specific time.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_2VCU0f4HsQ" href="#c_2VCU0f4HsQ" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-keyword">new</span> <span class="cm-variable">Date</span>(<span class="cm-number">2009</span>, <span class="cm-number">11</span>, <span class="cm-number">9</span>));
<span class="cm-comment">// → Wed Dec 09 2009 00:00:00 GMT+0100 (CET)</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-keyword">new</span> <span class="cm-variable">Date</span>(<span class="cm-number">2009</span>, <span class="cm-number">11</span>, <span class="cm-number">9</span>, <span class="cm-number">12</span>, <span class="cm-number">59</span>, <span class="cm-number">59</span>, <span class="cm-number">999</span>));
<span class="cm-comment">// → Wed Dec 09 2009 12:59:59 GMT+0100 (CET)</span></pre>
<p><a class="p_ident" id="p_cYaexzwHiw" href="#p_cYaexzwHiw" tabindex="-1" role="presentation"></a>JavaScript uses a convention where month numbers start at zero (so December is 11), yet day numbers start at one. This is confusing and silly. Be careful.</p>
<p><a class="p_ident" id="p_gVdQSb0Lv9" href="#p_gVdQSb0Lv9" tabindex="-1" role="presentation"></a>The last four arguments (hours, minutes, seconds, and milliseconds) are optional and taken to be zero when not given.</p>
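<p>A small sketch of both points, using a made-up binding name. It reads the components back with the standard <code>getMonth</code>, <code>getDate</code>, <code>getHours</code>, <code>getMinutes</code>, and <code>getSeconds</code> methods.</p>

```javascript
// Hypothetical binding name. Month 11 is December.
const christmas = new Date(2009, 11, 25);
console.log(christmas.getMonth());
// → 11
console.log(christmas.getDate());
// → 25
// The omitted time arguments default to zero.
console.log(christmas.getHours(), christmas.getMinutes(),
            christmas.getSeconds());
// → 0 0 0
```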
<p><a class="p_ident" id="p_1mIMU5T5MA" href="#p_1mIMU5T5MA" tabindex="-1" role="presentation"></a>Timestamps are stored as the number of milliseconds since the start of 1970, in the UTC time zone. This follows a convention set by “Unix time”, which was invented around that time. You can use negative numbers for times before 1970. The <code>getTime</code> method on a date object returns this number. It is big, as you can imagine.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_lMlCuckMIc" href="#c_lMlCuckMIc" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-keyword">new</span> <span class="cm-variable">Date</span>(<span class="cm-number">2013</span>, <span class="cm-number">11</span>, <span class="cm-number">19</span>).<span class="cm-property">getTime</span>());
<span class="cm-comment">// → 1387407600000</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-keyword">new</span> <span class="cm-variable">Date</span>(<span class="cm-number">1387407600000</span>));
<span class="cm-comment">// → Thu Dec 19 2013 00:00:00 GMT+0100 (CET)</span></pre>
<p><a class="p_ident" id="p_Cn9WUyCZhq" href="#p_Cn9WUyCZhq" tabindex="-1" role="presentation"></a>If you give the <code>Date</code> constructor a single argument, that argument is treated as such a millisecond count. You can get the current millisecond count by creating a new <code>Date</code> object and calling <code>getTime</code> on it or by calling the <code>Date.now</code> function.</p>
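<p>The sketch below illustrates the single-argument form. It uses the standard <code>toISOString</code> method, which prints the time in UTC and so gives the same output in any time zone. The binding name <code>now</code> is only there for this example.</p>

```javascript
// A count of zero is the start of 1970 in UTC.
console.log(new Date(0).getTime());
// → 0
console.log(new Date(0).toISOString());
// → 1970-01-01T00:00:00.000Z

// Hypothetical binding name. A count from Date.now survives a
// round trip through a Date object unchanged.
const now = Date.now();
console.log(new Date(now).getTime() === now);
// → true
```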
<p><a class="p_ident" id="p_KBadEflbjz" href="#p_KBadEflbjz" tabindex="-1" role="presentation"></a>Date objects provide methods like <code>getFullYear</code>, <code>getMonth</code>, <code>getDate</code>, <code>getHours</code>, <code>getMinutes</code>, and <code>getSeconds</code> to extract their components. Besides <code>getFullYear</code>, there’s also <code>getYear</code>, which gives you the year minus 1900 (such as <code>93</code> or <code>114</code>) and is mostly useless.</p>
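<p>As an illustration, these methods can be combined to turn a date object back into text. The function <code>formatDate</code> below is a hypothetical helper written for this example, not something built into the language.</p>

```javascript
// Hypothetical helper that formats a date as day-month-year.
function formatDate(date) {
  // getMonth counts from zero, so add one for display.
  return `${date.getDate()}-${date.getMonth() + 1}-${date.getFullYear()}`;
}

console.log(formatDate(new Date(2003, 0, 30)));
// → 30-1-2003
```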
<p><a class="p_ident" id="p_/RCtQyD3w/" href="#p_/RCtQyD3w/" tabindex="-1" role="presentation"></a>Putting parentheses around the parts of the expression that we are interested in, we can now create a date object from a string.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_eIqBpPGjqP" href="#c_eIqBpPGjqP" tabindex="-1" role="presentation"></a><span class="cm-keyword">function</span> <span class="cm-def">getDate</span>(<span class="cm-def">string</span>) {
  <span class="cm-keyword">let</span> [<span class="cm-def">_</span>, <span class="cm-def">day</span>, <span class="cm-def">month</span>, <span class="cm-def">year</span>] <span class="cm-operator">=</span>
    <span class="cm-string-2">/(\d{1,2})-(\d{1,2})-(\d{4})/</span>.<span class="cm-property">exec</span>(<span class="cm-variable-2">string</span>);
  <span class="cm-keyword">return</span> <span class="cm-keyword">new</span> <span class="cm-variable">Date</span>(<span class="cm-variable-2">year</span>, <span class="cm-variable-2">month</span> <span class="cm-operator">-</span> <span class="cm-number">1</span>, <span class="cm-variable-2">day</span>);
}
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">getDate</span>(<span class="cm-string">"30-1-2003"</span>));
<span class="cm-comment">// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)</span></pre>
<p><a class="p_ident" id="p_YUOJEGEtSI" href="#p_YUOJEGEtSI" tabindex="-1" role="presentation"></a>The <code>_</code> (underscore) binding is ignored. It is there only to skip the full match element in the array returned by <code>exec</code>.</p>
<h2><a class="h_ident" id="h_26ixny78VY" href="#h_26ixny78VY" tabindex="-1" role="presentation"></a>Word and string boundaries</h2>
<p><a class="p_ident" id="p_xdYJVr9vlf" href="#p_xdYJVr9vlf" tabindex="-1" role="presentation"></a>Unfortunately, <code>getDate</code> will also happily extract the nonsensical date 00-1-3000 from the string <code>"100-1-30000"</code>. A match may happen anywhere in the string, so in this case, it’ll just start at the second character and end at the second-to-last character.</p>
<p><a class="p_ident" id="p_kLS7rqRrqG" href="#p_kLS7rqRrqG" tabindex="-1" role="presentation"></a>If we want to enforce that the match must span the whole string, we can add the markers <code>^</code> and <code>$</code>. The caret matches the start of the input string, while the dollar sign matches the end. So, <code>/^\d+$/</code> matches a string consisting entirely of one or more digits, <code>/^!/</code> matches any string that starts with an exclamation mark, and <code>/x^/</code> does not match any string (there cannot be an <em>x</em> before the start of the string).</p>
<p><a class="p_ident" id="p_fEYS5Ev94W" href="#p_fEYS5Ev94W" tabindex="-1" role="presentation"></a>If, on the other hand, we just want to make sure the date starts and ends on a word boundary, we can use the marker <code>\b</code>. A word boundary is any point that has a word character (as in <code>\w</code>) on one side and, on the other, either a nonword character or the start or end of the string.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_6U0b866tUk" href="#c_6U0b866tUk" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/cat/</span>.<span class="cm-property">test</span>(<span class="cm-string">"concatenate"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/\bcat\b/</span>.<span class="cm-property">test</span>(<span class="cm-string">"concatenate"</span>));
<span class="cm-comment">// → false</span></pre>
<p><a class="p_ident" id="p_btxd6luedx" href="#p_btxd6luedx" tabindex="-1" role="presentation"></a>Note that a boundary marker doesn’t match an actual character. It just enforces that the regular expression matches only when a certain condition holds at the place where it appears in the pattern.</p>
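<p>To tie the two kinds of markers back to the date problem, here is a minimal sketch. <code>getDateStrict</code> and <code>dateInText</code> are hypothetical names, not part of the original example, and returning <code>null</code> on a failed match is my own choice for illustration.</p>

```javascript
// Hypothetical stricter variant of getDate: the match must span the
// whole string, and a failed match yields null instead of an error.
function getDateStrict(string) {
  let match = /^(\d{1,2})-(\d{1,2})-(\d{4})$/.exec(string);
  if (match == null) return null;
  let [_, day, month, year] = match;
  return new Date(year, month - 1, day);
}
console.log(getDateStrict("100-1-30000"));
// → null
console.log(getDateStrict("30-1-2003").getFullYear());
// → 2003

// With \b instead of ^ and $, the date may sit inside a longer text
// but can no longer start or end in the middle of a run of digits.
let dateInText = /\b\d{1,2}-\d{1,2}-\d{4}\b/;
console.log(dateInText.test("due on 30-1-2003."));
// → true
console.log(dateInText.test("100-1-30000"));
// → false

// Boundary markers take up no room in the match itself.
let cat = /\bcat\b/.exec("a cat sat");
console.log(cat[0].length, cat.index);
// → 3 2
```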
<h2><a class="h_ident" id="h_In3b+t6uOO" href="#h_In3b+t6uOO" tabindex="-1" role="presentation"></a>Choice patterns</h2>
<p><a class="p_ident" id="p_G5RTt0AFku" href="#p_G5RTt0AFku" tabindex="-1" role="presentation"></a>Say we want to know whether a piece of text contains not only a number but a number followed by one of the words <em>pig</em>, <em>cow</em>, or <em>chicken</em>, or any of their plural forms.</p>
<p><a class="p_ident" id="p_GcEbQJT+nS" href="#p_GcEbQJT+nS" tabindex="-1" role="presentation"></a>We could write three regular expressions and test them in turn, but there is a nicer way. The pipe character (<code>|</code>) denotes a choice between the pattern to its left and the pattern to its right. So I can say this:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_z0soEIN8RB" href="#c_z0soEIN8RB" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">animalCount</span> <span class="cm-operator">=</span> <span class="cm-string-2">/\b\d+ (pig|cow|chicken)s?\b/</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">animalCount</span>.<span class="cm-property">test</span>(<span class="cm-string">"15 pigs"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">animalCount</span>.<span class="cm-property">test</span>(<span class="cm-string">"15 pigchickens"</span>));
<span class="cm-comment">// → false</span></pre>
<p><a class="p_ident" id="p_bPWKulKcxf" href="#p_bPWKulKcxf" tabindex="-1" role="presentation"></a>Parentheses can be used to limit the part of the pattern that the pipe operator applies to, and you can put multiple such operators next to each other to express a choice between more than two alternatives.</p>
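<p>A small sketch of why the parentheses matter. Without them, the pipe splits the entire pattern, including any markers on either side. The <code>loose</code> and <code>tight</code> bindings are illustrative names.</p>

```javascript
// "starts with cat" OR "ends with dog"
let loose = /^cat|dog$/;
// exactly "cat" or exactly "dog"
let tight = /^(cat|dog)$/;

console.log(loose.test("catalog"));
// → true
console.log(tight.test("catalog"));
// → false
console.log(tight.test("dog"));
// → true
```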
<h2><a class="h_ident" id="h_AzxCBCKdvY" href="#h_AzxCBCKdvY" tabindex="-1" role="presentation"></a>The mechanics of matching</h2>
<p><a class="p_ident" id="p_SXQOi9ZwwH" href="#p_SXQOi9ZwwH" tabindex="-1" role="presentation"></a>Conceptually, when you use <code>exec</code> or <code>test</code> the regular expression engine looks for a match in your string by trying to match the expression first from the start of the string, then from the second character, and so on until it finds a match or reaches the end of the string. It’ll either return the first match that can be found or fail to find any match at all.</p>
<p><a class="p_ident" id="p_HJjJAo8dQp" href="#p_HJjJAo8dQp" tabindex="-1" role="presentation"></a>To do the actual matching, the engine treats a regular expression something like a flow diagram. This is the diagram for the livestock expression in the previous example:</p><figure><img src="img/re_pigchickens.svg" alt="Visualization of /\b\d+ (pig|cow|chicken)s?\b/"></figure>
<p><a class="p_ident" id="p_SNiUdMyezk" href="#p_SNiUdMyezk" tabindex="-1" role="presentation"></a>Our expression matches if we can find a path from the left side of the diagram to the right side. We keep a current position in the string, and every time we move through a box, we verify that the part of the string after our current position matches that box.</p>
<p><a class="p_ident" id="p_MB99a8uIlE" href="#p_MB99a8uIlE" tabindex="-1" role="presentation"></a>So if we try to match <code>"the 3 pigs"</code> from position 4, our progress through the flow chart would look like this:</p>
<ul>
<li>
<p><a class="p_ident" id="p_bgxoerTVW4" href="#p_bgxoerTVW4" tabindex="-1" role="presentation"></a>At position 4, there is a word boundary, so we can move past the first box.</p></li>
<li>
<p><a class="p_ident" id="p_YCV1/H+Rbe" href="#p_YCV1/H+Rbe" tabindex="-1" role="presentation"></a>Still at position 4, we find a digit, so we can also move past the second box.</p></li>
<li>
<p><a class="p_ident" id="p_fQdWHxKgCF" href="#p_fQdWHxKgCF" tabindex="-1" role="presentation"></a>At position 5, one path loops back to before the second (digit) box, while the other moves forward through the box that holds a single space character. There is a space here, not a digit, so we must take the second path.</p></li>
<li>
<p><a class="p_ident" id="p_KItk5iNp9m" href="#p_KItk5iNp9m" tabindex="-1" role="presentation"></a>We are now at position 6 (the start of “pigs”) and at the three-way branch in the diagram. We don’t see “cow” or “chicken” here, but we do see “pig”, so we take that branch.</p></li>
<li>
<p><a class="p_ident" id="p_SowlGZC6lM" href="#p_SowlGZC6lM" tabindex="-1" role="presentation"></a>At position 9, after the three-way branch, one path skips the <em>s</em> box and goes straight to the final word boundary, while the other path matches an <em>s</em>. There is an <em>s</em> character here, not a word boundary, so we go through the <em>s</em> box.</p></li>
<li>
<p><a class="p_ident" id="p_oJRMcnDoAt" href="#p_oJRMcnDoAt" tabindex="-1" role="presentation"></a>We’re at position 10 (the end of the string) and can match only a word boundary. The end of a string counts as a word boundary, so we go through the last box and have successfully matched this string.</p></li></ul>
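<p>The positions in this walk-through can be checked with <code>exec</code>, whose result records where the match started. This is only a verification sketch. The <code>livestock</code> and <code>found</code> bindings are illustrative names.</p>

```javascript
let livestock = /\b\d+ (pig|cow|chicken)s?\b/;
let found = livestock.exec("the 3 pigs");
console.log(found.index);
// → 4
console.log(found[0]);
// → 3 pigs
console.log(found[1]);
// → pig
// The match ends where the walk-through ends: at position 10.
console.log(found.index + found[0].length);
// → 10
```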
<h2 id="backtracking"><a class="h_ident" id="h_NFMtGK0tD3" href="#h_NFMtGK0tD3" tabindex="-1" role="presentation"></a>Backtracking</h2>
<p><a class="p_ident" id="p_tCd15MFAty" href="#p_tCd15MFAty" tabindex="-1" role="presentation"></a>The regular expression <code>/<wbr>\b([01]+b|[\da-f]+h|\d+)\b/<wbr></code> matches either a binary number followed by a <em>b</em>, a hexadecimal number (that is, base 16, with the letters <em>a</em> to <em>f</em> standing for the digits 10 to 15) followed by an <em>h</em>, or a regular decimal number with no suffix character. This is the corresponding diagram:</p><figure><img src="img/re_number.svg" alt="Visualization of /\b([01]+b|\d+|[\da-f]+h)\b/"></figure>
<p><a class="p_ident" id="p_MypvfTaiTG" href="#p_MypvfTaiTG" tabindex="-1" role="presentation"></a>When matching this expression, it will often happen that the top (binary) branch is entered even though the input does not actually contain a binary number. When matching the string <code>"103"</code>, for example, it becomes clear only at the 3 that we are in the wrong branch. The string <em>does</em> match the expression, just not the branch we are currently in.</p>
<p><a class="p_ident" id="p_SjTCKE9hvf" href="#p_SjTCKE9hvf" tabindex="-1" role="presentation"></a>So the matcher <em>backtracks</em>. When entering a branch, it remembers its current position (in this case, at the start of the string, just past the first boundary box in the diagram) so that it can go back and try another branch if the current one does not work out. For the string <code>"103"</code>, after encountering the 3 character, it will start trying the branch for hexadecimal numbers, which fails again because there is no <em>h</em> after the number. So it tries the decimal number branch. This one fits, and a match is reported after all.</p>
<p><a class="p_ident" id="p_VymH7raTcU" href="#p_VymH7raTcU" tabindex="-1" role="presentation"></a>The matcher stops as soon as it finds a full match. This means that if multiple branches could potentially match a string, only the first one (ordered by where the branches appear in the regular expression) is used.</p>
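<p>A brief sketch of both effects. The first three lines run the number expression on inputs that end up in different branches. The last two use a deliberately simplified pair of patterns (my own, for illustration) to show that the order of the branches decides which one wins.</p>

```javascript
let number = /\b([01]+b|[\da-f]+h|\d+)\b/;
console.log(number.exec("103")[0]);
// → 103
console.log(number.exec("101b")[0]);
// → 101b
console.log(number.exec("7fh")[0]);
// → 7fh

// Both branches can match at the start of "10h".
// Whichever is written first is the one that gets used.
console.log(/\d+|\d+h/.exec("10h")[0]);
// → 10
console.log(/\d+h|\d+/.exec("10h")[0]);
// → 10h
```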
<p><a class="p_ident" id="p_zEBIV8lYeb" href="#p_zEBIV8lYeb" tabindex="-1" role="presentation"></a>Backtracking also happens for repetition operators like <code>+</code> and <code>*</code>. If you match <code>/^.*x/</code> against <code>"abcxe"</code>, the <code>.*</code> part will first try to consume the whole string. The engine will then realize that it needs an <em>x</em> to match the pattern. Since there is no <em>x</em> past the end of the string, the star operator tries to match one character less. But the matcher doesn’t find an <em>x</em> after <code>abcx</code> either, so it backtracks again, matching the star operator to just <code>abc</code>. <em>Now</em> it finds an <em>x</em> where it needs it and reports a successful match from positions 0 to 4.</p>
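<p>The result of that backtracking is visible in the match itself. In this sketch, the second input is my own addition, chosen to show that the star gives back only as many characters as it has to.</p>

```javascript
let starMatch = /^.*x/.exec("abcxe");
console.log(starMatch[0]);
// → abcx
console.log(starMatch.index, starMatch.index + starMatch[0].length);
// → 0 4

// With several x characters, the match runs up to the last one,
// because the star only backs off until the rest of the pattern fits.
console.log(/^.*x/.exec("axbxc")[0]);
// → axbx
```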
<p><a class="p_ident" id="p_0MBBMH8aI2" href="#p_0MBBMH8aI2" tabindex="-1" role="presentation"></a>It is possible to write regular expressions that will do a <em>lot</em> of backtracking. This problem occurs when a pattern can match a piece of input in many different ways. For example, if we get confused while writing a binary-number regular expression, we might accidentally write something like <code>/([01]+)+b/</code>.</p><figure><img src="img/re_slow.svg" alt="Visualization of /([01]+)+b/"></figure>
<p><a class="p_ident" id="p_5cI0Ma3Wy8" href="#p_5cI0Ma3Wy8" tabindex="-1" role="presentation"></a>If that tries to match some long series of zeros and ones with no trailing <em>b</em> character, the matcher will first go through the inner loop until it runs out of digits. Then it notices there is no <em>b</em>, so it backtracks one position, goes through the outer loop once, and gives up again, trying to backtrack out of the inner loop once more. It will continue to try every possible route through these two loops. This means the amount of work <em>doubles</em> with each additional character. For even just a few dozen characters, the resulting match will take practically forever.</p>
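<p>One rough way to observe this is to time the expression on inputs of growing length. This is only a sketch: the lengths are kept small on purpose, the timings depend entirely on the machine and the JavaScript engine, and an engine is free to apply optimizations that hide the effect. The binding names are illustrative.</p>

```javascript
let slow = /([01]+)+b/;
for (let length of [12, 16, 20]) {
  // A run of zeros and ones with no trailing b, so the match must fail.
  let input = "01".repeat(length / 2);
  let start = Date.now();
  let matched = slow.test(input);
  let elapsed = Date.now() - start;
  console.log(`${length} characters: ${matched}, about ${elapsed}ms`);
}
```

<p>Every line should report <code>false</code>. If the elapsed time grows noticeably from one line to the next, that is the repeated doubling described above at work.</p>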
<h2><a class="h_ident" id="h_k0YuTOu54D" href="#h_k0YuTOu54D" tabindex="-1" role="presentation"></a>The replace method</h2>
<p><a class="p_ident" id="p_HMQv5qrs78" href="#p_HMQv5qrs78" tabindex="-1" role="presentation"></a>String values have a <code>replace</code> method, which can be used to replace part of the string with another string.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_dPdIdK/Wyi" href="#c_dPdIdK/Wyi" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">"papa"</span>.<span class="cm-property">replace</span>(<span class="cm-string">"p"</span>, <span class="cm-string">"m"</span>));
<span class="cm-comment">// → mapa</span></pre>
<p><a class="p_ident" id="p_jjBKX9l81o" href="#p_jjBKX9l81o" tabindex="-1" role="presentation"></a>The first argument can also be a regular expression, in which case the first match of the regular expression is replaced. When a <code>g</code> option (for <em>global</em>) is added to the regular expression, <em>all</em> matches in the string will be replaced, not just the first.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_ztGnSKyKy1" href="#c_ztGnSKyKy1" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">"Borobudur"</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/[ou]/</span>, <span class="cm-string">"a"</span>));
<span class="cm-comment">// → Barobudur</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">"Borobudur"</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/[ou]/g</span>, <span class="cm-string">"a"</span>));
<span class="cm-comment">// → Barabadar</span></pre>
<p><a class="p_ident" id="p_BTzyExrWv3" href="#p_BTzyExrWv3" tabindex="-1" role="presentation"></a>It would have been sensible if the choice between replacing one match or all matches was made through an additional argument to <code>replace</code> or by providing a different method, <code>replaceAll</code>. Newer versions of JavaScript (ES2021 onward) do provide a <code>replaceAll</code> method, but for <code>replace</code> itself, the choice still relies on a property of the regular expression.</p>
<p><a class="p_ident" id="p_/5YU/Qo2Np" href="#p_/5YU/Qo2Np" tabindex="-1" role="presentation"></a>The real power of using regular expressions with <code>replace</code> comes from the fact that we can refer back to matched groups in the replacement string. For example, say we have a big string containing the names of people, one name per line, in the format <code>Lastname, Firstname</code>. If we want to swap these names and remove the comma to get a <code>Firstname Lastname</code> format, we can use the following code:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_5P5aZAbVLL" href="#c_5P5aZAbVLL" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(
  <span class="cm-string">"Liskov, Barbara\nMcCarthy, John\nWadler, Philip"</span>
    .<span class="cm-property">replace</span>(<span class="cm-string-2">/(\w+), (\w+)/g</span>, <span class="cm-string">"$2 $1"</span>));
<span class="cm-comment">// → Barbara Liskov</span>
<span class="cm-comment">//   John McCarthy</span>
<span class="cm-comment">//   Philip Wadler</span></pre>
<p><a class="p_ident" id="p_sEudLRqyzC" href="#p_sEudLRqyzC" tabindex="-1" role="presentation"></a>The <code>$1</code> and <code>$2</code> in the replacement string refer to the parenthesized groups in the pattern. <code>$1</code> is replaced by the text that matched against the first group, <code>$2</code> by the second, and so on, up to <code>$9</code>. The whole match can be referred to with <code>$&</code>.</p>
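<p>A few more replacement strings, as a sketch. The inputs are made up for illustration. The last line relies on <code>$$</code>, which stands for a single literal dollar sign in a replacement string.</p>

```javascript
// $& inserts the whole match.
console.log("2 + 3".replace(/\d/g, "[$&]"));
// → [2] + [3]

// Group references and $& can be mixed freely.
console.log("Hopper, Grace".replace(/(\w+), (\w+)/, "$2 $1 (was: $&)"));
// → Grace Hopper (was: Hopper, Grace)

// $$ produces one literal dollar sign, followed here by the match.
console.log("price: 5".replace(/\d+/, "$$$&"));
// → price: $5
```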
<p><a class="p_ident" id="p_BpgnqwKFHn" href="#p_BpgnqwKFHn" tabindex="-1" role="presentation"></a>It is possible to pass a function—rather than a string—as the second argument to <code>replace</code>. For each replacement, the function will be called with the matched groups (as well as the whole match) as arguments, and its return value will be inserted into the new string.</p>
<p><a class="p_ident" id="p_GbNoBizUD+" href="#p_GbNoBizUD+" tabindex="-1" role="presentation"></a>Here’s a small example:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_fwgl3+oeyX" href="#c_fwgl3+oeyX" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">s</span> <span class="cm-operator">=</span> <span class="cm-string">"the cia and fbi"</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">s</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/\b(fbi|cia)\b/g</span>,
            <span class="cm-def">str</span> <span class="cm-operator">=></span> <span class="cm-variable-2">str</span>.<span class="cm-property">toUpperCase</span>()));
<span class="cm-comment">// → the CIA and FBI</span></pre>
<p><a class="p_ident" id="p_cDMXCsyNOw" href="#p_cDMXCsyNOw" tabindex="-1" role="presentation"></a>And here’s a more interesting one:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_Zo/y2Vv93l" href="#c_Zo/y2Vv93l" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">stock</span> <span class="cm-operator">=</span> <span class="cm-string">"1 lemon, 2 cabbages, and 101 eggs"</span>;
<span class="cm-keyword">function</span> <span class="cm-def">minusOne</span>(<span class="cm-def">match</span>, <span class="cm-def">amount</span>, <span class="cm-def">unit</span>) {
  <span class="cm-variable-2">amount</span> <span class="cm-operator">=</span> <span class="cm-variable">Number</span>(<span class="cm-variable-2">amount</span>) <span class="cm-operator">-</span> <span class="cm-number">1</span>;
  <span class="cm-keyword">if</span> (<span class="cm-variable-2">amount</span> <span class="cm-operator">==</span> <span class="cm-number">1</span>) { <span class="cm-comment">// only one left, remove the 's'</span>
    <span class="cm-variable-2">unit</span> <span class="cm-operator">=</span> <span class="cm-variable-2">unit</span>.<span class="cm-property">slice</span>(<span class="cm-number">0</span>, <span class="cm-variable-2">unit</span>.<span class="cm-property">length</span> <span class="cm-operator">-</span> <span class="cm-number">1</span>);
  } <span class="cm-keyword">else</span> <span class="cm-keyword">if</span> (<span class="cm-variable-2">amount</span> <span class="cm-operator">==</span> <span class="cm-number">0</span>) {
    <span class="cm-variable-2">amount</span> <span class="cm-operator">=</span> <span class="cm-string">"no"</span>;
  }
  <span class="cm-keyword">return</span> <span class="cm-variable-2">amount</span> <span class="cm-operator">+</span> <span class="cm-string">" "</span> <span class="cm-operator">+</span> <span class="cm-variable-2">unit</span>;
}
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">stock</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/(\d+) (\w+)/g</span>, <span class="cm-variable">minusOne</span>));
<span class="cm-comment">// → no lemon, 1 cabbage, and 100 eggs</span></pre>
<p><a class="p_ident" id="p_bv4e/DVilz" href="#p_bv4e/DVilz" tabindex="-1" role="presentation"></a>This takes a string, finds all occurrences of a number followed by an alphanumeric word, and returns a string wherein every such occurrence is decremented by one.</p>
<p><a class="p_ident" id="p_H94SX/MJX8" href="#p_H94SX/MJX8" tabindex="-1" role="presentation"></a>The <code>(\d+)</code> group ends up as the <code>amount</code> argument to the function, and the <code>(\w+)</code> group gets bound to <code>unit</code>. The function converts <code>amount</code> to a number—which always works, since it matched <code>\d+</code>—and makes some adjustments in case there is only one or zero left.</p>
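<p>To see exactly what such a function receives, this sketch simply echoes its arguments. After the whole match and the groups, <code>replace</code> also passes the position at which the match was found. The <code>describe</code> function and its <code>offset</code> parameter name are illustrative.</p>

```javascript
function describe(match, amount, unit, offset) {
  return `[${match}|${amount}|${unit}|${offset}]`;
}
console.log("1 lemon, 2 cabbages".replace(/(\d+) (\w+)/g, describe));
// → [1 lemon|1|lemon|0], [2 cabbages|2|cabbages|9]
```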
<h2><a class="h_ident" id="h_kiECehz+i+" href="#h_kiECehz+i+" tabindex="-1" role="presentation"></a>Greed</h2>
<p><a class="p_ident" id="p_VccKwuX/1m" href="#p_VccKwuX/1m" tabindex="-1" role="presentation"></a>It is possible to use <code>replace</code> to write a function that removes all comments from a piece of JavaScript code. Here is a first attempt:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_u0oKSJTOA2" href="#c_u0oKSJTOA2" tabindex="-1" role="presentation"></a><span class="cm-keyword">function</span> <span class="cm-def">stripComments</span>(<span class="cm-def">code</span>) {
  <span class="cm-keyword">return</span> <span class="cm-variable-2">code</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/\/\/.*|\/\*[^]*\*\//g</span>, <span class="cm-string">""</span>);
}
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">stripComments</span>(<span class="cm-string">"1 + /* 2 */3"</span>));
<span class="cm-comment">// → 1 + 3</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">stripComments</span>(<span class="cm-string">"x = 10;// ten!"</span>));
<span class="cm-comment">// → x = 10;</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">stripComments</span>(<span class="cm-string">"1 /* a */+/* b */ 1"</span>));
<span class="cm-comment">// → 1  1</span></pre>
<p><a class="p_ident" id="p_DkzBCJQQdu" href="#p_DkzBCJQQdu" tabindex="-1" role="presentation"></a>The part before the <em>or</em> operator matches two slash characters followed by any number of non-newline characters. The part for multiline comments is more involved. We use <code>[^]</code> (any character that is not in the empty set of characters) as a way to match any character. We cannot just use a period here because block comments can continue on a new line, and the period character does not match newline characters.</p>
<p><a class="p_ident" id="p_s9E9JYjAYp" href="#p_s9E9JYjAYp" tabindex="-1" role="presentation"></a>But the output for the last line appears to have gone wrong. Why?</p>
<p><a class="p_ident" id="p_atS1ERkauC" href="#p_atS1ERkauC" tabindex="-1" role="presentation"></a>The <code>[^]*</code> part of the expression, as I described in the section on backtracking, will first match as much as it can. If that causes the next part of the pattern to fail, the matcher moves back one character and tries again from there. In the example, the matcher first tries to match the whole rest of the string and then moves back from there. It will find an occurrence of <code>*/</code> after going back four characters and match that. This is not what we wanted—the intention was to match a single comment, not to go all the way to the end of the code and find the end of the last block comment.</p>
<p><a class="p_ident" id="p_eNtLSVH65f" href="#p_eNtLSVH65f" tabindex="-1" role="presentation"></a>Because of this behavior, we say the repetition operators (<code>+</code>, <code>*</code>, <code>?</code>, and <code>{}</code>) are <em>greedy</em>, meaning they match as much as they can and backtrack from there. If you put a question mark after them (<code>+?</code>, <code>*?</code>, <code>??</code>, <code>{}?</code>), they become nongreedy and start by matching as little as possible, matching more only when the remaining pattern does not fit the smaller match.</p>
<p><a class="p_ident" id="p_0L47KZXZKa" href="#p_0L47KZXZKa" tabindex="-1" role="presentation"></a>And that is exactly what we want in this case. By having the star match the smallest stretch of characters that brings us to a <code>*/</code>, we consume one block comment and nothing more.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_MCNF7GxfR1" href="#c_MCNF7GxfR1" tabindex="-1" role="presentation"></a><span class="cm-keyword">function</span> <span class="cm-def">stripComments</span>(<span class="cm-def">code</span>) {
  <span class="cm-keyword">return</span> <span class="cm-variable-2">code</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/\/\/.*|\/\*[^]*?\*\//g</span>, <span class="cm-string">""</span>);
}
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">stripComments</span>(<span class="cm-string">"1 /* a */+/* b */ 1"</span>));
<span class="cm-comment">// → 1 + 1</span></pre>
<p><a class="p_ident" id="p_o+3JCFC4Dr" href="#p_o+3JCFC4Dr" tabindex="-1" role="presentation"></a>A lot of bugs in regular expression programs can be traced to unintentionally using a greedy operator where a nongreedy one would work better. When using a repetition operator, consider the nongreedy variant first.</p>
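<p>The difference between the two kinds of operators is easiest to see side by side. A minimal sketch, with inputs made up for illustration:</p>

```javascript
let quoted = 'say "yes" or "no"';
// Greedy: runs to the last quote in the string.
console.log(/".*"/.exec(quoted)[0]);
// → "yes" or "no"
// Nongreedy: stops at the first quote that completes the match.
console.log(/".*?"/.exec(quoted)[0]);
// → "yes"

console.log(/\d{2,4}/.exec("123456")[0]);
// → 1234
console.log(/\d{2,4}?/.exec("123456")[0]);
// → 12

// A nongreedy operator still matches more when the rest of the
// pattern would otherwise fail.
console.log(/a+?b/.exec("aaab")[0]);
// → aaab
```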
<h2><a class="h_ident" id="h_Rhu25fogrG" href="#h_Rhu25fogrG" tabindex="-1" role="presentation"></a>Dynamically creating RegExp objects</h2>
<p><a class="p_ident" id="p_34PsyHYX4x" href="#p_34PsyHYX4x" tabindex="-1" role="presentation"></a>There are cases where you might not know the exact pattern you need to match against when you are writing your code. Say you want to look for the user’s name in a piece of text and enclose it in underscore characters to make it stand out. Since you will know the name only once the program is actually running, you can’t use the slash-based notation.</p>
<p><a class="p_ident" id="p_KAQggWa80Y" href="#p_KAQggWa80Y" tabindex="-1" role="presentation"></a>But you can build up a string and use the <code>RegExp</code> constructor on that. Here’s an example:</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_3yQimfD35d" href="#c_3yQimfD35d" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">name</span> <span class="cm-operator">=</span> <span class="cm-string">"harry"</span>;
<span class="cm-keyword">let</span> <span class="cm-def">text</span> <span class="cm-operator">=</span> <span class="cm-string">"Harry is a suspicious character."</span>;
<span class="cm-keyword">let</span> <span class="cm-def">regexp</span> <span class="cm-operator">=</span> <span class="cm-keyword">new</span> <span class="cm-variable">RegExp</span>(<span class="cm-string">"\\b("</span> <span class="cm-operator">+</span> <span class="cm-variable">name</span> <span class="cm-operator">+</span> <span class="cm-string">")\\b"</span>, <span class="cm-string">"gi"</span>);
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">text</span>.<span class="cm-property">replace</span>(<span class="cm-variable">regexp</span>, <span class="cm-string">"_$1_"</span>));
<span class="cm-comment">// → _Harry_ is a suspicious character.</span></pre>
<p><a class="p_ident" id="p_J6H1NBoQy/" href="#p_J6H1NBoQy/" tabindex="-1" role="presentation"></a>When creating the <code>\b</code> boundary markers, we have to use two backslashes because we are writing them in a normal string, not a slash-enclosed regular expression. The second argument to the <code>RegExp</code> constructor contains the options for the regular expression—in this case <code>"gi"</code> for global and case-insensitive.</p>
<p><a class="p_ident" id="p_UPAgEiKHfS" href="#p_UPAgEiKHfS" tabindex="-1" role="presentation"></a>But what if the name is <code>"dea+hl[]rd"</code> because our user is a nerdy teenager? That would result in a nonsensical regular expression, which won’t actually match the user’s name.</p>
<p><a class="p_ident" id="p_Q+hqmMv8NT" href="#p_Q+hqmMv8NT" tabindex="-1" role="presentation"></a>To work around this, we can add backslashes before any character that has a special meaning.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_lvBZoIypZE" href="#c_lvBZoIypZE" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">name</span> <span class="cm-operator">=</span> <span class="cm-string">"dea+hl[]rd"</span>;
<span class="cm-keyword">let</span> <span class="cm-def">text</span> <span class="cm-operator">=</span> <span class="cm-string">"This dea+hl[]rd guy is super annoying."</span>;
<span class="cm-keyword">let</span> <span class="cm-def">escaped</span> <span class="cm-operator">=</span> <span class="cm-variable">name</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/[\\\[.+*?(){|^$]/g</span>, <span class="cm-string">"\\$&"</span>);
<span class="cm-keyword">let</span> <span class="cm-def">regexp</span> <span class="cm-operator">=</span> <span class="cm-keyword">new</span> <span class="cm-variable">RegExp</span>(<span class="cm-string">"\\b"</span> <span class="cm-operator">+</span> <span class="cm-variable">escaped</span> <span class="cm-operator">+</span> <span class="cm-string">"\\b"</span>, <span class="cm-string">"gi"</span>);
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">text</span>.<span class="cm-property">replace</span>(<span class="cm-variable">regexp</span>, <span class="cm-string">"_$&_"</span>));
<span class="cm-comment">// → This _dea+hl[]rd_ guy is super annoying.</span></pre>
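<p>The escaping step can be wrapped in a small helper. This is a sketch under my own assumptions: <code>escapeRegExp</code> and <code>highlight</code> are hypothetical names, and the character set also escapes <code>]</code> and <code>}</code>, which is slightly more than the example above does.</p>

```javascript
// Hypothetical helper: put a backslash before every character that
// has a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[\\\[\].+*?(){}|^$]/g, "\\$&");
}

// Hypothetical helper: wrap every occurrence of name in underscores.
function highlight(text, name) {
  let regexp = new RegExp("\\b" + escapeRegExp(name) + "\\b", "gi");
  return text.replace(regexp, "_$&_");
}

console.log(escapeRegExp("dea+hl[]rd"));
// → dea\+hl\[\]rd
console.log(highlight("This dea+hl[]rd guy is super annoying.", "dea+hl[]rd"));
// → This _dea+hl[]rd_ guy is super annoying.
console.log(highlight("Harry is a suspicious character.", "harry"));
// → _Harry_ is a suspicious character.
```

<p>Note that the <code>\b</code> markers assume the name starts and ends with a word character. A name ending in a symbol, such as <code>"c++"</code>, would not be found when followed by a space, because there is no word boundary between two nonword characters.</p>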
<h2><a class="h_ident" id="h_Txg7z4j/ei" href="#h_Txg7z4j/ei" tabindex="-1" role="presentation"></a>The search method</h2>
<p><a class="p_ident" id="p_3QlEdRm5L2" href="#p_3QlEdRm5L2" tabindex="-1" role="presentation"></a>The <code>indexOf</code> method on strings cannot be called with a regular expression. But there is another method, <code>search</code>, which does expect a regular expression. Like <code>indexOf</code>, it returns the first index on which the expression was found, or -1 when it wasn’t found.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_diUfxE6ifs" href="#c_diUfxE6ifs" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">"  word"</span>.<span class="cm-property">search</span>(<span class="cm-string-2">/\S/</span>));
<span class="cm-comment">// → 2</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">" "</span>.<span class="cm-property">search</span>(<span class="cm-string-2">/\S/</span>));
<span class="cm-comment">// → -1</span></pre>
<p><a class="p_ident" id="p_tqlyvUKoi5" href="#p_tqlyvUKoi5" tabindex="-1" role="presentation"></a>Unfortunately, there is no way to indicate that the match should start at a given offset (like we can with the second argument to <code>indexOf</code>), which would often be useful.</p>
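<p>One possible workaround, sketched here under the assumption that cutting off the start of the string is acceptable, is to search a slice and add the offset back. <code>searchFrom</code> is a hypothetical helper, not a built-in method.</p>

```javascript
// Hypothetical helper: like search, but starts looking at offset.
function searchFrom(string, regexp, offset) {
  let index = string.slice(offset).search(regexp);
  return index == -1 ? -1 : index + offset;
}

console.log(searchFrom("a1b2c3", /\d/, 0));
// → 1
console.log(searchFrom("a1b2c3", /\d/, 2));
// → 3
console.log(searchFrom("a1b2c3", /\d/, 6));
// → -1
```

<p>The catch is that the pattern only sees the sliced string, so markers such as <code>^</code> and <code>\b</code> treat the cut as the start of the input.</p>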
<h2><a class="h_ident" id="h_duFTd2hqd0" href="#h_duFTd2hqd0" tabindex="-1" role="presentation"></a>The lastIndex property</h2>
<p><a class="p_ident" id="p_MvO8+re1D+" href="#p_MvO8+re1D+" tabindex="-1" role="presentation"></a>The <code>exec</code> method similarly does not provide a convenient way to start searching from a given position in the string. But it does provide an <em>in</em>convenient way.</p>
<p><a class="p_ident" id="p_F+JgzwxLtK" href="#p_F+JgzwxLtK" tabindex="-1" role="presentation"></a>Regular expression objects have properties. One such property is <code>source</code>, which contains the string that expression was created from. Another property is <code>lastIndex</code>, which controls, in some limited circumstances, where the next match will start.</p>
<p><a class="p_ident" id="p_Ld5Vcdy0jB" href="#p_Ld5Vcdy0jB" tabindex="-1" role="presentation"></a>Those circumstances are that the regular expression must have the global (<code>g</code>) or sticky (<code>y</code>) option enabled, and the match must happen through the <code>exec</code> method. Again, a less confusing solution would have been to just allow an extra argument to be passed to <code>exec</code>, but confusion is an essential feature of JavaScript’s regular expression interface.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_nXsHtqIJdF" href="#c_nXsHtqIJdF" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">pattern</span> <span class="cm-operator">=</span> <span class="cm-string-2">/y/g</span>;
<span class="cm-variable">pattern</span>.<span class="cm-property">lastIndex</span> <span class="cm-operator">=</span> <span class="cm-number">3</span>;
<span class="cm-keyword">let</span> <span class="cm-def">match</span> <span class="cm-operator">=</span> <span class="cm-variable">pattern</span>.<span class="cm-property">exec</span>(<span class="cm-string">"xyzzy"</span>);
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">match</span>.<span class="cm-property">index</span>);
<span class="cm-comment">// → 4</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">pattern</span>.<span class="cm-property">lastIndex</span>);
<span class="cm-comment">// → 5</span></pre>
<p><a class="p_ident" id="p_hjLQ+57mDd" href="#p_hjLQ+57mDd" tabindex="-1" role="presentation"></a>If the match was successful, the call to <code>exec</code> automatically updates the <code>lastIndex</code> property to point after the match. If no match was found, <code>lastIndex</code> is set back to zero, which is also the value it has in a newly constructed regular expression object.</p>
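<p>To see these updates happen, we can call <code>exec</code> repeatedly on the same string and watch <code>lastIndex</code> move. The second half of this small sketch shows the contrast case: without <code>g</code> or <code>y</code>, the property is neither read nor written. The binding names are just for this example.</p>

```javascript
let letterO = /o/g;
console.log(letterO.exec("foo").index, letterO.lastIndex);
// → 1 2
console.log(letterO.exec("foo").index, letterO.lastIndex);
// → 2 3
// No more matches: exec returns null and lastIndex goes back to zero
console.log(letterO.exec("foo"), letterO.lastIndex);
// → null 0

// Without the g or y option, lastIndex is ignored and left alone
let plain = /y/;
plain.lastIndex = 3;
console.log(plain.exec("xyzzy").index, plain.lastIndex);
// → 1 3
```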
<p><a class="p_ident" id="p_dQPVkpMm7y" href="#p_dQPVkpMm7y" tabindex="-1" role="presentation"></a>The difference between the global and the sticky options is that, when sticky is enabled, the match will only succeed if it starts directly at <code>lastIndex</code>, whereas with global, it will search ahead for a position where a match can start.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_98GwGRIMj8" href="#c_98GwGRIMj8" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">global</span> <span class="cm-operator">=</span> <span class="cm-string-2">/abc/g</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">global</span>.<span class="cm-property">exec</span>(<span class="cm-string">"xyz abc"</span>));
<span class="cm-comment">// → ["abc"]</span>
<span class="cm-keyword">let</span> <span class="cm-def">sticky</span> <span class="cm-operator">=</span> <span class="cm-string-2">/abc/y</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">sticky</span>.<span class="cm-property">exec</span>(<span class="cm-string">"xyz abc"</span>));
<span class="cm-comment">// → null</span></pre>
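<p>The sticky expression fails above only because <code>lastIndex</code> is still zero. As a small illustration (the binding name <code>stickyAbc</code> is ours), pointing <code>lastIndex</code> at the exact position of the match makes it succeed.</p>

```javascript
let stickyAbc = /abc/y;
// "abc" starts at index 4 in "xyz abc"
stickyAbc.lastIndex = 4;
let match = stickyAbc.exec("xyz abc");
console.log(match.index);
// → 4
console.log(stickyAbc.lastIndex);
// → 7
```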
<p><a class="p_ident" id="p_042bNmzNZK" href="#p_042bNmzNZK" tabindex="-1" role="presentation"></a>When using a shared regular expression value for multiple <code>exec</code> calls, these automatic updates to the <code>lastIndex</code> property can cause problems. Your regular expression might be accidentally starting at an index that was left over from a previous call.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_wrx2wO0P8M" href="#c_wrx2wO0P8M" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">digit</span> <span class="cm-operator">=</span> <span class="cm-string-2">/\d/g</span>;
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">digit</span>.<span class="cm-property">exec</span>(<span class="cm-string">"here it is: 1"</span>));
<span class="cm-comment">// → ["1"]</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">digit</span>.<span class="cm-property">exec</span>(<span class="cm-string">"and now: 1"</span>));
<span class="cm-comment">// → null</span></pre>
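<p>There are at least two ways to sidestep this problem, sketched below. One is to reset <code>lastIndex</code> before every independent use. The other is to leave off the <code>g</code> option when only a single match is needed. The function name <code>firstDigit</code> is invented for the example.</p>

```javascript
let digit = /\d/g;

// Option 1: reset the position before each independent use
function firstDigit(string) {
  digit.lastIndex = 0;
  return digit.exec(string);
}
console.log(firstDigit("here it is: 1")[0]);
// → 1
console.log(firstDigit("and now: 1")[0]);
// → 1

// Option 2: a non-global expression always starts at the beginning
let plainDigit = /\d/;
console.log(plainDigit.exec("here it is: 1").index);
// → 12
console.log(plainDigit.exec("and now: 1").index);
// → 9
```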
<p><a class="p_ident" id="p_9l7tQ3SsME" href="#p_9l7tQ3SsME" tabindex="-1" role="presentation"></a>Another interesting effect of the global option is that it changes the way the <code>match</code> method on strings works. When called with a global expression, instead of returning an array similar to that returned by <code>exec</code>, <code>match</code> will find <em>all</em> matches of the pattern in the string and return an array containing the matched strings.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_weT/d5+8vE" href="#c_weT/d5+8vE" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">"Banana"</span>.<span class="cm-property">match</span>(<span class="cm-string-2">/an/g</span>));
<span class="cm-comment">// → ["an", "an"]</span></pre>
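<p>A side-by-side comparison may make the difference clearer. In this sketch the same pattern, with a group added for illustration, is used once without and once with the global option. Note that the global form drops the group and <code>index</code> information, and that it still returns <code>null</code> when nothing matches.</p>

```javascript
let single = "Banana".match(/(a)n/);
// Full match, first group, and position of the first match
console.log(single[0], single[1], single.index);
// → an a 1

let all = "Banana".match(/(a)n/g);
// Only the matched strings, for every match
console.log(all);
// → ["an", "an"]

console.log("Banana".match(/x/g));
// → null
```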
<p><a class="p_ident" id="p_zFHO63a2iV" href="#p_zFHO63a2iV" tabindex="-1" role="presentation"></a>So be cautious with global regular expressions. The cases where they are necessary—calls to <code>replace</code> and places where you want to explicitly use <code>lastIndex</code>—are typically the only places where you want to use them.</p>
<h3><a class="i_ident" id="i_m0fs21dHEg" href="#i_m0fs21dHEg" tabindex="-1" role="presentation"></a>Looping over matches</h3>
<p><a class="p_ident" id="p_Rhy/hnaaT+" href="#p_Rhy/hnaaT+" tabindex="-1" role="presentation"></a>A common thing to do is to scan through all occurrences of a pattern in a string, in a way that gives us access to the match object in the loop body. We can do this by using <code>lastIndex</code> and <code>exec</code>.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_rSzEnbVHja" href="#c_rSzEnbVHja" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">input</span> <span class="cm-operator">=</span> <span class="cm-string">"A string with 3 numbers in it... 42 and 88."</span>;
<span class="cm-keyword">let</span> <span class="cm-def">number</span> <span class="cm-operator">=</span> <span class="cm-string-2">/\b\d+\b/g</span>;
<span class="cm-keyword">let</span> <span class="cm-def">match</span>;
<span class="cm-keyword">while</span> (<span class="cm-variable">match</span> <span class="cm-operator">=</span> <span class="cm-variable">number</span>.<span class="cm-property">exec</span>(<span class="cm-variable">input</span>)) {
  <span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string">"Found"</span>, <span class="cm-variable">match</span>[<span class="cm-number">0</span>], <span class="cm-string">"at"</span>, <span class="cm-variable">match</span>.<span class="cm-property">index</span>);
}
<span class="cm-comment">// → Found 3 at 14</span>
<span class="cm-comment">// Found 42 at 33</span>
<span class="cm-comment">// Found 88 at 40</span></pre>
<p><a class="p_ident" id="p_ZdCI2+edqA" href="#p_ZdCI2+edqA" tabindex="-1" role="presentation"></a>This makes use of the fact that the value of an assignment expression (<code>=</code>) is the assigned value. So by using <code>match = number.<wbr>exec(input)</code> as the condition in the <code>while</code> statement, we perform the match at the start of each iteration, save its result in a binding, and stop looping when no more matches are found.</p>
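<p>Newer JavaScript environments also provide a <code>matchAll</code> method on strings, which this chapter does not otherwise cover. Assuming it is available, the same loop can be written without an assignment in the condition. The expression passed to <code>matchAll</code> must have the <code>g</code> option, and each value produced is a match object like the ones <code>exec</code> returns.</p>

```javascript
let input = "A string with 3 numbers in it... 42 and 88.";
// matchAll works on its own copy of the expression, so no lastIndex
// state is left behind on a shared regular expression object
for (let match of input.matchAll(/\b\d+\b/g)) {
  console.log("Found", match[0], "at", match.index);
}
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40
```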
<h2 id="ini"><a class="h_ident" id="h_RGsf6ah1EY" href="#h_RGsf6ah1EY" tabindex="-1" role="presentation"></a>Parsing an INI file</h2>
<p><a class="p_ident" id="p_JbrLORqV9r" href="#p_JbrLORqV9r" tabindex="-1" role="presentation"></a>To conclude the chapter, we’ll look at a problem that calls for regular expressions. Imagine we are writing a program to automatically collect information about our enemies from the Internet. (We will not actually write that program here, just the part that reads the configuration file. Sorry.) The configuration file looks like this:</p>
<pre class="snippet cm-s-default" data-language="text/plain" ><a class="c_ident" id="c_RV3f5fiptq" href="#c_RV3f5fiptq" tabindex="-1" role="presentation"></a>searchengine=https://duckduckgo.com/?q=$1
spitefulness=9.7
; comments are preceded by a semicolon...
; each section concerns an individual enemy
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451
[davaeorn]
fullname=Davaeorn
type=evil wizard
outputdir=/home/marijn/enemies/davaeorn</pre>
<p><a class="p_ident" id="p_OgIQS1TJxB" href="#p_OgIQS1TJxB" tabindex="-1" role="presentation"></a>The exact rules for this format (which is a widely used format, usually called an <em>INI</em> file) are as follows:</p>
<ul>
<li>
<p><a class="p_ident" id="p_jIewfc/40B" href="#p_jIewfc/40B" tabindex="-1" role="presentation"></a>Blank lines and lines starting with semicolons are ignored.</p></li>
<li>
<p><a class="p_ident" id="p_O/dGCr+aR5" href="#p_O/dGCr+aR5" tabindex="-1" role="presentation"></a>Lines wrapped in <code>[</code> and <code>]</code> start a new section.</p></li>
<li>
<p><a class="p_ident" id="p_l2Yjl1fUVB" href="#p_l2Yjl1fUVB" tabindex="-1" role="presentation"></a>Lines containing an alphanumeric identifier followed by an <code>=</code> character add a setting to the current section.</p></li>
<li>
<p><a class="p_ident" id="p_bCaQwCXJCi" href="#p_bCaQwCXJCi" tabindex="-1" role="presentation"></a>Anything else is invalid.</p></li></ul>
<p><a class="p_ident" id="p_clbD+OAS4y" href="#p_clbD+OAS4y" tabindex="-1" role="presentation"></a>Our task is to convert a string like this into an object whose properties hold strings for sectionless settings and sub-objects for sections, with each sub-object holding that section’s settings.</p>
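<p>For the example file above, the object we are aiming for would look roughly like this. It is written out by hand here as an illustration (the binding name <code>expected</code> is ours). Note that every value stays a string, since the format itself has no notion of numbers.</p>

```javascript
// Hand-written target structure for the example configuration file
let expected = {
  searchengine: "https://duckduckgo.com/?q=$1",
  spitefulness: "9.7",
  larry: {
    fullname: "Larry Doe",
    type: "kindergarten bully",
    website: "http://www.geocities.com/CapeCanaveral/11451"
  },
  davaeorn: {
    fullname: "Davaeorn",
    type: "evil wizard",
    outputdir: "/home/marijn/enemies/davaeorn"
  }
};
console.log(expected.larry.type);
// → kindergarten bully
console.log(typeof expected.spitefulness);
// → string
```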
<p><a class="p_ident" id="p_8U3vMRn7g4" href="#p_8U3vMRn7g4" tabindex="-1" role="presentation"></a>Since the format has to be processed line by line, splitting up the file into separate lines is a good start. We used <code>string.<wbr>split("\n")</code> to do this in <a href="04_data.html#split">Chapter 4</a>. Some operating systems, however, use not just a newline character to separate lines but a carriage return character followed by a newline (<code>"\r\n"</code>). Given that the <code>split</code> method also allows a regular expression as its argument, we can split on a regular expression like <code>/\r?\n/</code> to split in a way that allows both <code>"\n"</code> and <code>"\r\n"</code> between lines.</p>
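<p>A quick check, with made-up input strings, shows the difference between the two ways of splitting.</p>

```javascript
let unix = "one\ntwo\nthree";
let windows = "one\r\ntwo\r\nthree";
console.log(unix.split(/\r?\n/));
// → ["one", "two", "three"]
console.log(windows.split(/\r?\n/));
// → ["one", "two", "three"]
// Splitting on just "\n" leaves stray carriage returns behind
console.log(windows.split("\n"));
// → ["one\r", "two\r", "three"]
```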
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_neI86/XXg2" href="#c_neI86/XXg2" tabindex="-1" role="presentation"></a><span class="cm-keyword">function</span> <span class="cm-def">parseINI</span>(<span class="cm-def">string</span>) {
  <span class="cm-comment">// Start with an object to hold the top-level fields</span>
  <span class="cm-keyword">let</span> <span class="cm-def">result</span> <span class="cm-operator">=</span> {};
  <span class="cm-keyword">let</span> <span class="cm-def">section</span> <span class="cm-operator">=</span> <span class="cm-variable-2">result</span>;
  <span class="cm-variable-2">string</span>.<span class="cm-property">split</span>(<span class="cm-string-2">/\r?\n/</span>).<span class="cm-property">forEach</span>(<span class="cm-def">line</span> <span class="cm-operator">=></span> {
    <span class="cm-keyword">let</span> <span class="cm-def">match</span>;
    <span class="cm-keyword">if</span> (<span class="cm-variable-2">match</span> <span class="cm-operator">=</span> <span class="cm-variable-2">line</span>.<span class="cm-property">match</span>(<span class="cm-string-2">/^(\w+)=(.*)$/</span>)) {
      <span class="cm-variable-2">section</span>[<span class="cm-variable-2">match</span>[<span class="cm-number">1</span>]] <span class="cm-operator">=</span> <span class="cm-variable-2">match</span>[<span class="cm-number">2</span>];
    } <span class="cm-keyword">else</span> <span class="cm-keyword">if</span> (<span class="cm-variable-2">match</span> <span class="cm-operator">=</span> <span class="cm-variable-2">line</span>.<span class="cm-property">match</span>(<span class="cm-string-2">/^\[(.*)\]$/</span>)) {
      <span class="cm-variable-2">section</span> <span class="cm-operator">=</span> <span class="cm-variable-2">result</span>[<span class="cm-variable-2">match</span>[<span class="cm-number">1</span>]] <span class="cm-operator">=</span> {};
    } <span class="cm-keyword">else</span> <span class="cm-keyword">if</span> (<span class="cm-operator">!</span><span class="cm-string-2">/^\s*(;.*)?$/</span>.<span class="cm-property">test</span>(<span class="cm-variable-2">line</span>)) {
      <span class="cm-keyword">throw</span> <span class="cm-keyword">new</span> <span class="cm-variable">Error</span>(<span class="cm-string">"Line '"</span> <span class="cm-operator">+</span> <span class="cm-variable-2">line</span> <span class="cm-operator">+</span> <span class="cm-string">"' is not valid."</span>);
    }
  });
  <span class="cm-keyword">return</span> <span class="cm-variable-2">result</span>;
}
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">parseINI</span>(<span class="cm-string-2">`</span>
<span class="cm-string-2">name=Vasilis</span>
<span class="cm-string-2">[address]</span>
<span class="cm-string-2">city=Tessaloniki`</span>));
<span class="cm-comment">// → {name: "Vasilis", address: {city: "Tessaloniki"}}</span></pre>
<p><a class="p_ident" id="p_86q0K3iF4C" href="#p_86q0K3iF4C" tabindex="-1" role="presentation"></a>The code goes over the file’s lines and builds up an object. Properties at the top are stored directly into that object, whereas properties found in sections are stored in a separate section object. The <code>section</code> binding points at the object for the current section.</p>
<p><a class="p_ident" id="p_ixTfvSC1VN" href="#p_ixTfvSC1VN" tabindex="-1" role="presentation"></a>There are two kinds of significant lines—section headers or property lines. When a line is a regular property, it is stored in the current section. When it is a section header, a new section object is created, and <code>section</code> is set to point at it.</p>
<p><a class="p_ident" id="p_FPzqsloIkT" href="#p_FPzqsloIkT" tabindex="-1" role="presentation"></a>Note the recurring use of <code>^</code> and <code>$</code> to make sure the expression matches the whole line, not just part of it. Leaving these out results in code that mostly works but behaves strangely for some input, which can be a difficult bug to track down.</p>
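<p>To make the danger concrete, here is a small sketch comparing the property expression with and without its markers. The input line is invented: it is a comment that happens to contain something that looks like a setting.</p>

```javascript
let anchored = /^(\w+)=(.*)$/;
let unanchored = /(\w+)=(.*)/;
let line = "; old setting: name=Vasilis";

// With ^ and $, the comment is correctly rejected as a property line
console.log(anchored.test(line));
// → false

// Without them, the expression finds "name=Vasilis" in the middle of
// the comment, so a parser would quietly store a setting from it
console.log(unanchored.exec(line)[1]);
// → name
```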
<p><a class="p_ident" id="p_ACT8bIScp+" href="#p_ACT8bIScp+" tabindex="-1" role="presentation"></a>The pattern <code>if (match = string.<wbr>match(.<wbr>.<wbr>.<wbr>))</code> is similar to the trick of using an assignment as the condition for <code>while</code>. You often aren’t sure that your call to <code>match</code> will succeed, so you can access the resulting object only inside an <code>if</code> statement that tests for this. To not break the pleasant chain of <code>else if</code> forms, we assign the result of the match to a binding and immediately use that assignment as the test for the <code>if</code> statement.</p>
<p><a class="p_ident" id="p_mwlBKfUu5D" href="#p_mwlBKfUu5D" tabindex="-1" role="presentation"></a>If a line is not a section header or a property, the function checks whether it is a comment or an empty line using the expression <code>/^\s*(;.*)?$/</code>. Do you see how it works? The part between the parentheses will match comments, and the <code>?</code> makes sure it also matches lines containing only whitespace. When a line doesn’t match any of the expected forms, the function throws an exception.</p>
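<p>If the expression is not obvious at first glance, trying it on a few sample lines may help. The lines below are made up for the purpose, and the binding name <code>ignorable</code> is ours.</p>

```javascript
let ignorable = /^\s*(;.*)?$/;
let samples = ["", "   ", "; a comment", "  ; indented comment",
               "name=value", "oops"];
for (let line of samples) {
  // JSON.stringify makes the whitespace in each sample visible
  console.log(JSON.stringify(line), ignorable.test(line));
}
// → "" true
//   "   " true
//   "; a comment" true
//   "  ; indented comment" true
//   "name=value" false
//   "oops" false
```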
<h2><a class="h_ident" id="h_+y54//b0l+" href="#h_+y54//b0l+" tabindex="-1" role="presentation"></a>International characters</h2>
<p><a class="p_ident" id="p_2zJ37rLrbl" href="#p_2zJ37rLrbl" tabindex="-1" role="presentation"></a>Because of JavaScript’s initial simplistic implementation and the fact that this simplistic approach was later set in stone as standard behavior, JavaScript’s regular expressions are rather dumb about characters that do not appear in the English language. For example, as far as JavaScript’s regular expressions are concerned, a “word character” is only one of the 26 letters of the Latin alphabet (uppercase or lowercase), a decimal digit, or, for some reason, the underscore character. Things like <em>é</em> or <em>β</em>, which most definitely are word characters, will not match <code>\w</code> (and <em>will</em> match uppercase <code>\W</code>, the nonword category).</p>
<p><a class="p_ident" id="p_H4r1oRJB6J" href="#p_H4r1oRJB6J" tabindex="-1" role="presentation"></a>By a strange historical accident, <code>\s</code> (whitespace) does not have this problem and matches all characters that the Unicode standard considers whitespace, including things like the nonbreaking space and the Mongolian vowel separator.</p>
<p><a class="p_ident" id="p_Ln5OarYp4l" href="#p_Ln5OarYp4l" tabindex="-1" role="presentation"></a>Another problem is that, by default, regular expressions work on code units, as discussed in <a href="05_higher_order.html#code_units">Chapter 5</a>, not actual characters. This means that characters that are composed of two code units behave strangely.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_CfMTYxun8D" href="#c_CfMTYxun8D" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/🍎{3}/</span>.<span class="cm-property">test</span>(<span class="cm-string">"🍎🍎🍎"</span>));
<span class="cm-comment">// → false</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/<.>/</span>.<span class="cm-property">test</span>(<span class="cm-string">"<🌹>"</span>));
<span class="cm-comment">// → false</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/<.>/u</span>.<span class="cm-property">test</span>(<span class="cm-string">"<🌹>"</span>));
<span class="cm-comment">// → true</span></pre>
<p><a class="p_ident" id="p_j4Kcv6J/rF" href="#p_j4Kcv6J/rF" tabindex="-1" role="presentation"></a>The problem is that the 🍎 in the first line is treated as two code units, and the <code>{3}</code> part is applied only to the second one. Similarly, the dot matches a single code unit, not the two that make up the rose emoji.</p>
<p><a class="p_ident" id="p_1OZOJ3sk/b" href="#p_1OZOJ3sk/b" tabindex="-1" role="presentation"></a>You must add a <code>u</code> option (for Unicode) to your regular expression to make it treat such characters properly. The wrong behavior remains the default, unfortunately, because changing that might cause problems for existing code that depends on it.</p>
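<p>As a small sketch of the effect, the checks below run against a string of three apple emoji, with and without the option. The <code>^</code> and <code>$</code> markers are added so that the expression has to account for the whole string.</p>

```javascript
let apples = "🍎🍎🍎";
// Each apple takes up two code units
console.log(apples.length);
// → 6
// With u, the quantifier applies to the whole character
console.log(/^🍎{3}$/u.test(apples));
// → true
// With u, the dot matches one full character at a time
console.log(/^.{3}$/u.test(apples));
// → true
// Without u, the dot matches single code units, and there are six
console.log(/^.{3}$/.test(apples));
// → false
```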
<p><a class="p_ident" id="p_MmzTSqcyKg" href="#p_MmzTSqcyKg" tabindex="-1" role="presentation"></a>Though this was only just standardized and is, at the time of writing, not widely supported yet, it is possible to use <code>\p</code> in a regular expression (that must have the Unicode option enabled) to match all characters to which the Unicode standard assigns a given property.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_+jV1oln0sr" href="#c_+jV1oln0sr" tabindex="-1" role="presentation"></a><span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/\p{Script=Greek}/u</span>.<span class="cm-property">test</span>(<span class="cm-string">"α"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/\p{Script=Arabic}/u</span>.<span class="cm-property">test</span>(<span class="cm-string">"α"</span>));
<span class="cm-comment">// → false</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/\p{Alphabetic}/u</span>.<span class="cm-property">test</span>(<span class="cm-string">"α"</span>));
<span class="cm-comment">// → true</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">/\p{Alphabetic}/u</span>.<span class="cm-property">test</span>(<span class="cm-string">"!"</span>));
<span class="cm-comment">// → false</span></pre>
<p><a class="p_ident" id="p_5NxQGhRrMq" href="#p_5NxQGhRrMq" tabindex="-1" role="presentation"></a>Unicode defines a number of useful properties, though finding the one that you need may not always be trivial. You can use the <code>\p{Property=Value}</code> notation to match any character that has the given value for that property. If the property name is left off, as in <code>\p{Name}</code>, the name is assumed to either be a binary property such as <code>Alphabetic</code>, or a category such as <code>Number</code>.</p>
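<p>One place where this helps is in finding words in text that is not plain English. The sketch below assumes an environment that supports <code>\p</code>, and compares <code>\w</code> with <code>\p{Alphabetic}</code> on a made-up string. The non-English letters are written as escape sequences so that the example does not depend on how this file is encoded.</p>

```javascript
// "élan βeta", with é and β written as escapes
let words = "\u00e9lan \u03b2eta";
// \w skips the accented and Greek letters
console.log(words.match(/\w+/g));
// → ["lan", "eta"]
// \p{Alphabetic} needs the u option and treats them as letters
console.log(words.match(/\p{Alphabetic}+/gu));
// → ["élan", "βeta"]
// A category name can be used without a property name
console.log(/\p{Number}/u.test("7"));
// → true
```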
<h2 id="summary_regexp"><a class="h_ident" id="h_ErccPg/l98" href="#h_ErccPg/l98" tabindex="-1" role="presentation"></a>Summary</h2>
<p><a class="p_ident" id="p_/hQX04GtpS" href="#p_/hQX04GtpS" tabindex="-1" role="presentation"></a>Regular expressions are objects that represent patterns in strings. They use their own language to express these patterns.</p>
<table>
<tr><td><code>/abc/</code></td><td>A sequence of characters</td>
</tr>
<tr><td><code>/[abc]/</code></td><td>Any character from a set of characters</td>
</tr>
<tr><td><code>/[^abc]/</code></td><td>Any character <em>not</em> in a set of characters</td>
</tr>
<tr><td><code>/[0-9]/</code></td><td>Any character in a range of characters</td>
</tr>
<tr><td><code>/x+/</code></td><td>One or more occurrences of the pattern <code>x</code></td>
</tr>
<tr><td><code>/x+?/</code></td><td>One or more occurrences, nongreedy</td>
</tr>
<tr><td><code>/x*/</code></td><td>Zero or more occurrences</td>
</tr>
<tr><td><code>/x?/</code></td><td>Zero or one occurrence</td>
</tr>
<tr><td><code>/x{2,4}/</code></td><td>Two to four occurrences</td>
</tr>
<tr><td><code>/(abc)/</code></td><td>A group</td>
</tr>
<tr><td><code>/a|b|c/</code></td><td>Any one of several patterns</td>
</tr>
<tr><td><code>/\d/</code></td><td>Any digit character</td>
</tr>
<tr><td><code>/\w/</code></td><td>An alphanumeric character (“word character”)</td>
</tr>
<tr><td><code>/\s/</code></td><td>Any whitespace character</td>
</tr>
<tr><td><code>/./</code></td><td>Any character except newlines</td>
</tr>
<tr><td><code>/\b/</code></td><td>A word boundary</td>
</tr>
<tr><td><code>/^/</code></td><td>Start of input</td>
</tr>
<tr><td><code>/$/</code></td><td>End of input</td>
</tr>
</table>
<p><a class="p_ident" id="p_AVY5pFcEyH" href="#p_AVY5pFcEyH" tabindex="-1" role="presentation"></a>A regular expression has a method <code>test</code> to test whether a given string matches it. It also has a method <code>exec</code> that, when a match is found, returns an array containing all matched groups. Such an array has an <code>index</code> property that indicates where the match started.</p>
<p><a class="p_ident" id="p_FoVJlvxp9q" href="#p_FoVJlvxp9q" tabindex="-1" role="presentation"></a>Strings have a <code>match</code> method to match them against a regular expression and a <code>search</code> method to search for one, returning only the starting position of the match. Their <code>replace</code> method can replace matches of a pattern with a replacement string or function.</p>
<p><a class="p_ident" id="p_APfM9C3A6j" href="#p_APfM9C3A6j" tabindex="-1" role="presentation"></a>Regular expressions can have options, which are written after the closing slash. The <code>i</code> option makes the match case insensitive. The <code>g</code> option makes the expression <em>global</em>, which, among other things, causes the <code>replace</code> method to replace all instances instead of just the first. The <code>y</code> option makes it sticky, which means that it will not search ahead and skip part of the string when looking for a match. The <code>u</code> option turns on Unicode mode, which fixes a number of problems around the handling of characters that take up two code units.</p>
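<p>A few one-liners, using a made-up input string, may help to recall what the options do. The <code>flags</code> property used at the end is a standard property of regular expression objects that holds the options as a string.</p>

```javascript
// No options: only the first match is replaced
console.log("Banana BANANA".replace(/an/, "_"));
// → B_ana BANANA
// g: all matches are replaced
console.log("Banana BANANA".replace(/an/g, "_"));
// → B__a BANANA
// g and i together: case is ignored as well
console.log("Banana BANANA".replace(/an/gi, "_"));
// → B__a B__A
// y: the match must start exactly at lastIndex (zero here)
console.log(/an/y.test("Banana"));
// → false
console.log(/an/gi.flags);
// → gi
```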
<p><a class="p_ident" id="p_mvLGdyUb97" href="#p_mvLGdyUb97" tabindex="-1" role="presentation"></a>Regular expressions are a sharp tool with an awkward handle. They simplify some tasks tremendously but can quickly become unmanageable when applied to complex problems. Part of knowing how to use them is resisting the urge to try to shoehorn things that they cannot cleanly express into them.</p>
<h2><a class="h_ident" id="h_TcUD2vzyMe" href="#h_TcUD2vzyMe" tabindex="-1" role="presentation"></a>Exercises</h2>
<p><a class="p_ident" id="p_meNfX2B/+s" href="#p_meNfX2B/+s" tabindex="-1" role="presentation"></a>It is almost unavoidable that, in the course of working on these exercises, you will get confused and frustrated by some regular expression’s inexplicable behavior. Sometimes it helps to enter your expression into an online tool like <a href="https://www.debuggex.com/"><em>debuggex.com</em></a> to see whether its visualization corresponds to what you intended and to experiment with the way it responds to various input strings.</p>
<h3><a class="i_ident" id="i_vDM8PzwQWU" href="#i_vDM8PzwQWU" tabindex="-1" role="presentation"></a>Regexp golf</h3>
<p><a class="p_ident" id="p_V79Usnw26S" href="#p_V79Usnw26S" tabindex="-1" role="presentation"></a><em>Code golf</em> is a term used for the game of trying to express a particular program in as few characters as possible. Similarly, <em>regexp golf</em> is the practice of writing as tiny a regular expression as possible to match a given pattern, and <em>only</em> that pattern.</p>
<p><a class="p_ident" id="p_VGCqgCur6C" href="#p_VGCqgCur6C" tabindex="-1" role="presentation"></a>For each of the following items, write a regular expression to test whether any of the given substrings occur in a string. The regular expression should match only strings containing one of the substrings described. Do not worry about word boundaries unless explicitly mentioned. When your expression works, see whether you can make it any smaller.</p>
<ol>
<li>
<p><a class="p_ident" id="p_togdFO+/b9" href="#p_togdFO+/b9" tabindex="-1" role="presentation"></a><em>car</em> and <em>cat</em></p></li>
<li>
<p><a class="p_ident" id="p_2Q37Tsr9DS" href="#p_2Q37Tsr9DS" tabindex="-1" role="presentation"></a><em>pop</em> and <em>prop</em></p></li>
<li>
<p><a class="p_ident" id="p_2Ah4dFikw1" href="#p_2Ah4dFikw1" tabindex="-1" role="presentation"></a><em>ferret</em>, <em>ferry</em>, and <em>ferrari</em></p></li>
<li>
<p><a class="p_ident" id="p_ttiBCcePDl" href="#p_ttiBCcePDl" tabindex="-1" role="presentation"></a>Any word ending in <em>ious</em></p></li>
<li>
<p><a class="p_ident" id="p_XnqTy5SopM" href="#p_XnqTy5SopM" tabindex="-1" role="presentation"></a>A whitespace character followed by a period, comma, colon, or semicolon</p></li>
<li>
<p><a class="p_ident" id="p_Ku7hE3qqDn" href="#p_Ku7hE3qqDn" tabindex="-1" role="presentation"></a>A word longer than six letters</p></li>
<li>
<p><a class="p_ident" id="p_2Tx4SPp5Wm" href="#p_2Tx4SPp5Wm" tabindex="-1" role="presentation"></a>A word without the letter <em>e</em></p></li>
</ol>
<p><a class="p_ident" id="p_Tzjl1Axr+h" href="#p_Tzjl1Axr+h" tabindex="-1" role="presentation"></a>Refer to the table in the <a href="09_regexp.html#summary_regexp">chapter summary</a> for help. Test each solution with a few test strings.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_6JENPvxVBy" href="#c_6JENPvxVBy" tabindex="-1" role="presentation"></a><span class="cm-comment">// Fill in the regular expressions</span>
<span class="cm-variable">verify</span>(<span class="cm-string-2">/.../</span>,
[<span class="cm-string">"my car"</span>, <span class="cm-string">"bad cats"</span>],
[<span class="cm-string">"camper"</span>, <span class="cm-string">"high art"</span>]);
<span class="cm-variable">verify</span>(<span class="cm-string-2">/.../</span>,
[<span class="cm-string">"pop culture"</span>, <span class="cm-string">"mad props"</span>],
[<span class="cm-string">"plop"</span>]);
<span class="cm-variable">verify</span>(<span class="cm-string-2">/.../</span>,
[<span class="cm-string">"ferret"</span>, <span class="cm-string">"ferry"</span>, <span class="cm-string">"ferrari"</span>],
[<span class="cm-string">"ferrum"</span>, <span class="cm-string">"transfer A"</span>]);
<span class="cm-variable">verify</span>(<span class="cm-string-2">/.../</span>,
[<span class="cm-string">"how delicious"</span>, <span class="cm-string">"spacious room"</span>],
[<span class="cm-string">"ruinous"</span>, <span class="cm-string">"consciousness"</span>]);
<span class="cm-variable">verify</span>(<span class="cm-string-2">/.../</span>,
[<span class="cm-string">"bad punctuation ."</span>],
[<span class="cm-string">"escape the period"</span>]);
<span class="cm-variable">verify</span>(<span class="cm-string-2">/.../</span>,
[<span class="cm-string">"hottentottententen"</span>],
[<span class="cm-string">"no"</span>, <span class="cm-string">"hotten totten tenten"</span>]);
<span class="cm-variable">verify</span>(<span class="cm-string-2">/.../</span>,
[<span class="cm-string">"red platypus"</span>, <span class="cm-string">"wobbling nest"</span>],
[<span class="cm-string">"earth bed"</span>, <span class="cm-string">"learning ape"</span>]);
<span class="cm-keyword">function</span> <span class="cm-def">verify</span>(<span class="cm-def">regexp</span>, <span class="cm-def">yes</span>, <span class="cm-def">no</span>) {
  <span class="cm-comment">// Ignore unfinished exercises</span>
  <span class="cm-keyword">if</span> (<span class="cm-variable-2">regexp</span>.<span class="cm-property">source</span> <span class="cm-operator">==</span> <span class="cm-string">"..."</span>) <span class="cm-keyword">return</span>;
  <span class="cm-keyword">for</span> (<span class="cm-keyword">let</span> <span class="cm-def">str</span> <span class="cm-keyword">of</span> <span class="cm-variable-2">yes</span>) <span class="cm-keyword">if</span> (<span class="cm-operator">!</span><span class="cm-variable-2">regexp</span>.<span class="cm-property">test</span>(<span class="cm-variable-2">str</span>)) {
    <span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">`Failure to match '${</span><span class="cm-variable-2">str</span><span class="cm-string-2">}</span><span class="cm-string-2">'`</span>);
  }
  <span class="cm-keyword">for</span> (<span class="cm-keyword">let</span> <span class="cm-def">str</span> <span class="cm-keyword">of</span> <span class="cm-variable-2">no</span>) <span class="cm-keyword">if</span> (<span class="cm-variable-2">regexp</span>.<span class="cm-property">test</span>(<span class="cm-variable-2">str</span>)) {
    <span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">`Unexpected match for '${</span><span class="cm-variable-2">str</span><span class="cm-string-2">}</span><span class="cm-string-2">'`</span>);
  }
}</pre>
<h3><a class="i_ident" id="i_dTiEW14oG0" href="#i_dTiEW14oG0" tabindex="-1" role="presentation"></a>Quoting style</h3>
<p><a class="p_ident" id="p_x7xoQ6mk60" href="#p_x7xoQ6mk60" tabindex="-1" role="presentation"></a>Imagine you have written a story and used single quotation marks throughout to mark pieces of dialogue. Now you want to replace all the dialogue quotes with double quotes, while keeping the single quotes used in contractions like <em>aren’t</em>.</p>
<p><a class="p_ident" id="p_k3Y0NF9w4b" href="#p_k3Y0NF9w4b" tabindex="-1" role="presentation"></a>Think of a pattern that distinguishes these two kinds of quote usage and craft a call to the <code>replace</code> method that does the proper replacement.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_sPrcOR+s/4" href="#c_sPrcOR+s/4" tabindex="-1" role="presentation"></a><span class="cm-keyword">let</span> <span class="cm-def">text</span> <span class="cm-operator">=</span> <span class="cm-string">"'I'm the cook,' he said, 'it's my job.'"</span>;
<span class="cm-comment">// Change this call.</span>
<span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-variable">text</span>.<span class="cm-property">replace</span>(<span class="cm-string-2">/A/g</span>, <span class="cm-string">"B"</span>));
<span class="cm-comment">// → "I'm the cook," he said, "it's my job."</span></pre>
<div class="solution"><div class="solution-text">
<p><a class="p_ident" id="p_rNoBQVCfFp" href="#p_rNoBQVCfFp" tabindex="-1" role="presentation"></a>The most obvious solution is to replace only quotes that have a nonword character on at least one side, with something like <code>/\W'|'\W/</code>. But you also have to take the start and end of the line into account, since a quote there has no neighboring character on one side.</p>
<p><a class="p_ident" id="p_1SUsrUgWek" href="#p_1SUsrUgWek" tabindex="-1" role="presentation"></a>In addition, you must ensure that the replacement also includes the characters that were matched by the <code>\W</code> pattern so that those are not dropped. This can be done by wrapping them in parentheses and including their groups in the replacement string (<code>$1</code>, <code>$2</code>). Groups that are not matched will be replaced by nothing.</p>
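<p>Putting these hints together, one possible call looks like the sketch below. The variable name <code>fixed</code> is only illustrative, and other patterns would work equally well. The first alternative captures the start of the string or a nonword character before the quote, the second captures a nonword character or the end of the string after it, and both groups are put back around the new double quote.</p>

```javascript
let text = "'I'm the cook,' he said, 'it's my job.'";
// A quote is treated as dialogue punctuation when it follows the start
// of the string or a nonword character, or when it precedes a nonword
// character or the end of the string. Whichever group did not take
// part in the match is inserted as an empty string.
let fixed = text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2');
console.log(fixed);
// → "I'm the cook," he said, "it's my job."
```

<p>The quotes in <em>I'm</em> and <em>it's</em> are left alone because they have word characters on both sides, so neither alternative matches there.</p>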
</div></div>
<h3><a class="i_ident" id="i_izldJoT3uv" href="#i_izldJoT3uv" tabindex="-1" role="presentation"></a>Numbers again</h3>
<p><a class="p_ident" id="p_0OQXsuIIcQ" href="#p_0OQXsuIIcQ" tabindex="-1" role="presentation"></a>Write an expression that matches only JavaScript-style numbers. It must support an optional minus <em>or</em> plus sign in front of the number, the decimal dot, and exponent notation (<code>5e-3</code> or <code>1E10</code>), again with an optional sign in front of the exponent. Also note that it is not necessary for there to be digits in front of or after the dot, but the number cannot be a dot alone. That is, <code>.5</code> and <code>5.</code> are valid JavaScript numbers, but a lone dot <em>isn’t</em>.</p>
<pre class="snippet cm-s-default" data-language="javascript" ><a class="c_ident" id="c_aHAzeMYYGe" href="#c_aHAzeMYYGe" tabindex="-1" role="presentation"></a><span class="cm-comment">// Fill in this regular expression.</span>
<span class="cm-keyword">let</span> <span class="cm-def">number</span> <span class="cm-operator">=</span> <span class="cm-string-2">/^...$/</span>;
<span class="cm-comment">// Tests:</span>
<span class="cm-keyword">for</span> (<span class="cm-keyword">let</span> <span class="cm-def">str</span> <span class="cm-keyword">of</span> [<span class="cm-string">"1"</span>, <span class="cm-string">"-1"</span>, <span class="cm-string">"+15"</span>, <span class="cm-string">"1.55"</span>, <span class="cm-string">".5"</span>, <span class="cm-string">"5."</span>,
                 <span class="cm-string">"1.3e2"</span>, <span class="cm-string">"1E-4"</span>, <span class="cm-string">"1e+12"</span>]) {
  <span class="cm-keyword">if</span> (<span class="cm-operator">!</span><span class="cm-variable">number</span>.<span class="cm-property">test</span>(<span class="cm-variable">str</span>)) {
    <span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">`Failed to match '${</span><span class="cm-variable">str</span><span class="cm-string-2">}</span><span class="cm-string-2">'`</span>);
  }
}
<span class="cm-keyword">for</span> (<span class="cm-keyword">let</span> <span class="cm-def">str</span> <span class="cm-keyword">of</span> [<span class="cm-string">"1a"</span>, <span class="cm-string">"+-1"</span>, <span class="cm-string">"1.2.3"</span>, <span class="cm-string">"1+1"</span>, <span class="cm-string">"1e4.5"</span>,
                 <span class="cm-string">".5."</span>, <span class="cm-string">"1f5"</span>, <span class="cm-string">"."</span>]) {
  <span class="cm-keyword">if</span> (<span class="cm-variable">number</span>.<span class="cm-property">test</span>(<span class="cm-variable">str</span>)) {
    <span class="cm-variable">console</span>.<span class="cm-property">log</span>(<span class="cm-string-2">`Incorrectly accepted '${</span><span class="cm-variable">str</span><span class="cm-string-2">}</span><span class="cm-string-2">'`</span>);
  }
}</pre>
<div class="solution"><div class="solution-text">
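<p><a class="p_ident" id="p_sWIFtGBNR7" href="#p_sWIFtGBNR7" tabindex="-1" role="presentation"></a>First, do not forget the backslash in front of the period. An unescaped period matches any character except a newline, so the pattern would accept far more than a decimal dot.</p>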
<p><a class="p_ident" id="p_ShOca+aF11" href="#p_ShOca+aF11" tabindex="-1" role="presentation"></a>Matching the optional sign in front of the number, as well as in front of the exponent, can be done with <code>[+\-]?</code> or <code>(\+|-|)</code> (plus, minus, or nothing).</p>
<p><a class="p_ident" id="p_z9QJjd6IxQ" href="#p_z9QJjd6IxQ" tabindex="-1" role="presentation"></a>The more complicated part of the exercise is the problem of matching both <code>"5."</code> and <code>".5"</code> without also matching <code>"."</code>. For this, a good solution is to use the <code>|</code> operator to separate the two cases—either one or more digits optionally followed by a dot and zero or more digits <em>or</em> a dot followed by one or more digits.</p>
<p><a class="p_ident" id="p_WHNmLsGl4C" href="#p_WHNmLsGl4C" tabindex="-1" role="presentation"></a>Finally, to make the <em>e</em> case-insensitive, either add an <code>i</code> option to the regular expression or use <code>[eE]</code>.</p>
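<p>As a sketch of how these pieces might fit together, the pattern below combines the optional sign, the two alternatives for the digits around the dot, and an optional exponent. It is one possible answer rather than the only one. The <code>valid</code> and <code>invalid</code> arrays are illustrative names that reuse the test strings from the exercise.</p>

```javascript
// Optional sign, then either digits with an optional fractional part
// or a dot followed by digits, then an optional exponent that has its
// own optional sign.
let number = /^[+\-]?(\d+(\.\d*)?|\.\d+)([eE][+\-]?\d+)?$/;

let valid = ["1", "-1", "+15", "1.55", ".5", "5.", "1.3e2", "1E-4", "1e+12"];
let invalid = ["1a", "+-1", "1.2.3", "1+1", "1e4.5", ".5.", "1f5", "."];

// Every valid string should match, and no invalid string should.
console.log(valid.every(str => number.test(str)));
// → true
console.log(invalid.some(str => number.test(str)));
// → false
```

<p>Because the first alternative requires at least one digit before the dot and the second requires at least one digit after it, a lone <code>"."</code> satisfies neither.</p>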
</div></div><nav><a href="08_error.html" title="previous chapter">◀</a> <a href="index.html" title="cover">◆</a> <a href="10_modulos.html" title="next chapter">▶</a></nav>
</article>