:chap_num: 9
:prev_link: 08_error
:next_link: 10_modules
= Regular Expressions =
[quote,Jamie Zawinski]
____
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
____
[quote, Master Yuan-Ma, The Book of Programming]
____
Yuan-Ma said, 'When you cut against the grain of the wood, much
strength is needed. When you program against the grain of a problem,
much code is needed.'
____
Programming conventions and techniques survive and spread in a
chaotic, evolutionary way. It's not usually the pretty or brilliant
ones that win, but rather the ones that work passably well and sit in
the right niche, for example by being integrated with another
successful piece of technology.
In this chapter I will discuss one such technology, _regular
expressions_. Regular expressions are a way to describe patterns in
string data. They form a small, separate language that is part of
JavaScript (as well as various other programming languages and tools).
Regular expressions are both extremely useful and terribly awkward.
Learning them properly will make it much easier to do various kinds of
string processing. But the syntax used to express them is
ridiculously cryptic. In addition, the programming interface
JavaScript provides for them is quite clumsy.
== Notation ==
A regular expression is an object. It can either be constructed with
the `RegExp` constructor or written as a literal value by enclosing
the pattern in slash (‘`/`’) characters.
[source,javascript]
----
var re1 = new RegExp("abc");
var re2 = /abc/;
----
Such an object represents a pattern. In this case, the pattern is an
“a” character followed by a “b”, followed by a “c”.
When using the `RegExp` constructor, the pattern is written as a
normal string, so the usual rules apply for backslashes. In the second
notation, we are using slashes to delimit the pattern, so we'd have to
backslash-escape slash characters that are part of the pattern. Some
other characters, such as question marks and plus signs, are used as
special markers in regular expressions, and must be preceded by a
backslash if they are meant to represent the character itself.
[source,javascript]
----
var onePlusOne = /1 \+ 1/;
----
Knowing precisely which characters to backslash-escape when writing
regular expressions requires you to know all the special
meanings this syntax assigns to characters. For now, this may not
be realistic. So as an alternative, when in doubt, just put a
backslash before any character that is not alphanumeric or whitespace.
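To make the difference between the two notations concrete, here is a small sketch that writes the same pattern both ways. The variable names are made up for this illustration. Note the doubled backslash in the string version: the string literal uses up one level of escaping before the `RegExp` constructor ever sees the pattern.

```javascript
// The same pattern, once as a literal and once as a string.
var literal = /1 \+ 1/;
var constructed = new RegExp("1 \\+ 1");
console.log(literal.source == constructed.source);
// → true

// In a literal, a slash that is part of the pattern needs a backslash.
var path = /usr\/bin/;
console.log(path.test("/usr/bin/node"));
// → true

// Without the backslash, the plus sign is a special marker,
// and the pattern no longer matches the text "1 + 1".
console.log(/1 + 1/.test("1 + 1"));
// → false
```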
== Testing for matches ==
(((test method)))Regular expression objects have a number of methods.
The simplest one is `test`. When you pass it a string, it returns a
Boolean that tells you whether the string contains a match of the
pattern in the expression.
[source,javascript]
----
console.log(/abc/.test("abcdef"));
// → true
console.log(/abc/.test("12345"));
// → false
----
A regular expression consisting of only normal characters simply
represents that sequence of characters. If “abc” occurs anywhere (not
just at the start) in the string we are testing against, the result
will be true.
== Matching a set of characters ==
(((indexOf method)))Finding out whether a string contains “abc” could
just as well be done with a call to `indexOf`. The point of regular
expressions is that they allow more complicated patterns to be
expressed.
If I want to match any number, for example, I want to find digit
characters. In a regular expression, putting a set of characters
between square brackets makes that part of the expression match any of
the characters between the brackets.
The first expression below matches any string that contains a
three-character sequence that starts with “s”, ends with “n”, and has
a vowel in between.
[source,javascript]
----
console.log(/s[auieo]n/.test("son"));
// → true
console.log(/[0123456789]/.test("in 1992"));
// → true
console.log(/[0-9]/.test("in 1992"));
// → true
----
(((digit (regexp))))Between square brackets, a dash (“-”) between two
characters can be used to indicate a range of characters. Since the
Unicode character codes for “0” to “9” sit right next to each other
(codes 48 to 57), `[0-9]` matches any digit.
(((whitespace (regexp))))(((alphanumeric character (regexp))))(((dot
(regexp)))) There are a number of commonly used character groups that
have their own built-in shortcuts. Digits are in fact one of them—you
can write backslash-d (`\d`) to mean the same thing as `[0-9]`.
[cols="1,5"]
|====
|`\d` |Digit characters.
|`\w` |Alphanumeric characters and the underscore (“word characters”).
|`\s` |Whitespace characters (space, tab, newline, and similar).
|`\D` |Characters that are _not_ digits.
|`\W` |Characters that are _not_ word characters.
|`\S` |Non-whitespace characters.
|`.` (dot) |All characters except newlines.
|====
For each of the backslash-prefixed categories, there is an uppercase
variant that means the exact opposite.
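A few quick checks of these shortcuts may help them stick. The test strings are arbitrary; the only point is to show which characters each category accepts.

```javascript
console.log(/\w/.test("_"));
// → true (the underscore counts as a word character)
console.log(/\s/.test("tab\there"));
// → true
console.log(/\D/.test("2003"));
// → false
console.log(/\S/.test(" \n\t"));
// → false
console.log(/./.test("\n"));
// → false
```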
So you could express a date and time format like “30/01/2003 15:20” with
the following expression:
[source,javascript]
----
var dateTime = /\d\d\/\d\d\/\d\d\d\d \d\d:\d\d/;
console.log(dateTime.test("30/01/2003 15:20"));
// → true
console.log(dateTime.test("30/jan/2003 15:20"));
// → false
----
(((backslash)))Looks completely awful, doesn't it? Way too many
backslashes, producing a background noise that makes it hard to spot
the actual pattern expressed. Such is life with regular expressions.
These category markers can also be used inside of square brackets, so
`[\d\.]` means any digit or a dot.
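As a quick check of that set (the variable name is only for illustration): inside square brackets a dot simply stands for itself, so the backslash in front of it is harmless but optional.

```javascript
var digitOrDot = /[\d.]/;
console.log(digitOrDot.test("version 2"));
// → true
console.log(digitOrDot.test("the end."));
// → true
console.log(digitOrDot.test("no match here"));
// → false
// The escaped form behaves the same way.
console.log(/[\d\.]/.test("no match here"));
// → false
```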
To “invert” a set of characters, to express that you want to match any
character _except_ the ones in the set, a caret (‘^’) character is
written after the opening bracket.
[source,javascript]
----
var notBinary = /[^01]/;
console.log(notBinary.test("01101"));
// → false
console.log(notBinary.test("01201"));
// → true
----
== Repeating parts of a pattern ==
We found out how to match a single digit. But I wanted to match a
number, a sequence of one or more digits.
(((repetition (regexp))))(((\* operator)))(((+ operator)))When you put
a plus sign (‘+’) after something in a regular expression, that
indicates that the element may be repeated: it must occur at least
once, but may occur more often. So `/\d+/` matches one or more digit
characters.
[source,javascript]
----
console.log(/wo+w/.test("woow"));
// → true
console.log(/wo+w/.test("ww"));
// → false
console.log(/wo*w/.test("wooooow"));
// → true
console.log(/wo*w/.test("ww"));
// → true
----
The star (‘*’) has a similar meaning, but also allows the pattern to
match zero times. So something with a star after it never prevents a
pattern from matching—it'll just match zero instances if it can't find
any suitable text to match.
A question mark makes a part of a pattern “optional”, meaning it may
occur zero or one times. In this example, the “u” character is allowed
to occur, but the pattern also matches when it is missing.
[source,javascript]
----
var neighbor = /neighbou?r/;
console.log(neighbor.test("neighbour"));
// → true
console.log(neighbor.test("neighbor"));
// → true
----
To allow a pattern to occur a precisely defined number of times, curly
braces are used. Putting `{4}` after an element requires it to occur
exactly four times. Similarly, `{2,4}` is used when the element must
occur at least twice, and at most four times.
Here is another version of the date and time pattern. It allows
single-digit day, month, and hour numbers, and is slightly more
readable:
[source,javascript]
----
var dateTime = /\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}:\d{2}/;
console.log(dateTime.test("30/1/2003 8:45"));
// → true
----
It is also possible to leave the maximum number of occurrences
open-ended by omitting the number after the comma. So `{5,}` means
five or more times. The minimum cannot be left out in JavaScript:
`{,5}` is not a repetition marker, so write `{0,5}` to allow zero to
five times.
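Here is a small sketch of the brace forms, reusing the “wow” pattern from earlier. The last line shows what JavaScript does with braces that lack a minimum: they are not a repetition marker at all, and are matched as literal text.

```javascript
console.log(/wo{2,}w/.test("wow"));
// → false
console.log(/wo{2,}w/.test("wooooow"));
// → true
console.log(/wo{0,2}w/.test("ww"));
// → true
console.log(/wo{0,2}w/.test("wooow"));
// → false
// Without a minimum, the braces are matched as literal characters.
console.log(/wo{,2}w/.test("wo{,2}w"));
// → true
```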
== Grouping sub-expressions ==
(((grouping (regexp))))It is possible to use an operator like ‘*’ or
‘+’ on more than one character at a time. When a piece of a regular
expression is surrounded in parentheses, it counts as a single unit as
far as the operators following it are concerned.
[source,javascript]
----
var cartoonCrying = /boo+(hoo+)+/i;
console.log(cartoonCrying.test("Boohoooohoohooo"));
// → true
----
The third ‘+’ applies to the whole group `(hoo+)`, matching one or
more sequences like that.
(((case-sensitivity (regexp))))The “i” at the end of the expression in
the example above makes this regular expression case-insensitive,
allowing it to match the uppercase “B” in the input string, even
though the pattern is itself all lowercase.
== Matches and groups ==
The `test` method is the absolute simplest way to match a regular
expression. It only tells you whether it matched, and nothing else.
Regular expressions also have an `exec` (execute) method, that will
return `null` when no match was found, and an object with information
about the match otherwise.
[source,javascript]
----
var match = /\d+/.exec("one two 100");
console.log(match);
// → ["100"]
console.log(match.index);
// → 8
----
(((match method)))String values have a `match` method that behaves
very similarly.
[source,javascript]
----
console.log("one two 100".match(/\d+/));
// → ["100"]
----
(((index property)))An object returned from `exec` or `match` has an
`index` property that tells us _where_ in the string the successful
match started. Otherwise, the object looks like (and in fact is) an
array of strings, whose first element is the string that was
matched—in the example above, that is the sequence of digits that we
were looking for.
When the regular expression contains expressions grouped with
parentheses, the text that matched those groups will also show up in
the array. The first element is always the whole match, after that
follows the part matched by the first group (the one whose opening
parenthesis comes first in the expression), then the second, and so
on.
[source,javascript]
----
var quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'hello'"));
// → ["'hello'", "hello"]
----
When a group does not end up being matched at all, its position in the
output array will hold `undefined`. For example, it might have a
question mark after it and not match the string. Similarly, when a
group is matched multiple times, the last match ends up in the array.
[source,javascript]
----
console.log(/(\d)+/.exec("123"));
// → ["123", "3"]
----
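The other case, a group that does not take part in the match, can be seen with an optional group. The pattern below is just an illustration; its second array slot holds `undefined` when the optional part is absent.

```javascript
// The optional group takes part in the match here...
console.log(/bad(ly)?/.exec("badly"));
// → ["badly", "ly"]

// ...but not here, so its slot in the array holds undefined.
var result = /bad(ly)?/.exec("bad");
console.log(result[1] === undefined);
// → true
console.log(result.length);
// → 2
```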
Groups can be very useful for extracting parts of a string. For
example, we may not just want to verify whether a string contains a
date, but also extract it, and construct an object that represents it.
If we wrap parentheses around the digit patterns, we'll be able to
directly pick them out of the result of `exec`.
But first, a short detour.
== The date type ==
JavaScript has a standard object type for representing dates—or
rather, points in time. It is called `Date`. If you simply create a
date object using `new`, you get the current date and time.
// test: no
[source,javascript]
----
console.log(new Date());
// → Wed Dec 04 2013 14:24:57 GMT+0100 (CET)
----
You can also create an object for a specific time.
[source,javascript]
----
console.log(new Date(2009, 11, 9));
// → Wed Dec 09 2009 00:00:00 GMT+0100 (CET)
console.log(new Date(2009, 11, 9, 12, 59, 59, 999));
// → Wed Dec 09 2009 12:59:59 GMT+0100 (CET)
----
JavaScript uses a convention where month numbers start at zero (so
December is 11), yet day numbers start at one. This is unfortunate and
rather confusing, so be careful.
The last four arguments (hours, minutes, seconds, and milliseconds)
are optional, and taken to be zero when not given.
Internally, times are stored as the number of milliseconds since the
start of 1970. The `getTime` method on a date returns this number. It
is quite big, as you can imagine.
[source,javascript]
----
console.log(new Date(1980, 1, 1).getTime());
// → 318207600000
console.log(new Date(318207600000));
// → Fri Feb 01 1980 00:00:00 GMT+0100 (CET)
----
When giving the constructor a single argument, that argument is
treated as such a millisecond number.
Date objects provide methods like `getFullYear` (`getYear` gets you
the useless two-digit version), `getMonth`, `getDate`, `getHours`,
`getMinutes`, and `getSeconds` to extract their components.
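For example, taking apart the date constructed earlier (the variable name is mine). Since both the constructor and these methods work in local time, the numbers come back exactly as they went in.

```javascript
var moment = new Date(2009, 11, 9, 12, 59, 59, 999);
console.log(moment.getFullYear());
// → 2009
console.log(moment.getMonth());
// → 11 (December, since months count from zero)
console.log(moment.getDate());
// → 9
console.log(moment.getHours(), moment.getMinutes(), moment.getSeconds());
// → 12 59 59
```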
So now we can build an expression that matches a date, and add
parentheses around the parts that we are interested in.
[source,javascript]
----
function findDate(string) {
  var dateTime = /(\d{1,2})\/(\d{1,2})\/(\d{4})/;
  var match = dateTime.exec(string);
  // Date months count from zero, so subtract one from the matched month.
  return new Date(Number(match[3]), Number(match[2]) - 1,
                  Number(match[1]));
}
console.log(findDate("30/1/2003"));
// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)
----
== Word and string boundaries ==
The `findDate` function above will also happily extract a date from a
string like `"100/1/30000"`. A match may happen anywhere in the
string, so in this case it'll just start at the second character and
end at the second-to-last character.
(((boundary (regexp))))If we want to enforce that the match must span
the whole string, we can add the markers ‘^’ and ‘$’. The first
matches the start of the input string, and the second the end. So
`/^\d+$/` matches a string consisting only of one or more digits,
`/^!/` matches any string that starts with an exclamation sign, and
`/x^/` does not match anything (the start of a string can not be after
a character).
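Applied to the date pattern, the two markers could be used like this. This is a sketch with a made-up variable name, not a full date validator: it still accepts nonsense such as month 99.

```javascript
var wholeDate = /^\d{1,2}\/\d{1,2}\/\d{4}$/;
console.log(wholeDate.test("30/1/2003"));
// → true
console.log(wholeDate.test("100/1/30000"));
// → false
console.log(/^!/.test("!important"));
// → true
console.log(/x^/.test("x^"));
// → false
```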
(((word boundary (regexp))))If, on the other hand, we just want to
make sure the date starts and ends on a word boundary, we can use the
marker `\b`. A word boundary is a point that has a word character on
one side, and a non-word character (or the start or end of the
string) on the other.
[source,javascript]
----
console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false
----
Note that these boundary markers don't “cover” any actual characters,
they just enforce that the pattern only matches when a certain
condition holds at the place where they appear.
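One way to see that the markers take up no space is to look at what was actually matched. In this small sketch, the matched text is exactly three characters long even though the pattern contains two boundary markers.

```javascript
var found = /\bcat\b/.exec("the cat sat");
console.log(found[0]);
// → cat
console.log(found[0].length);
// → 3
console.log(found.index);
// → 4
// The start and end of the string also count as boundaries here.
console.log(/\bcat\b/.test("cat"));
// → true
```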
== Alternatives ==
Next, I want to know whether a piece of text contains not only a
number, but a number followed by one of the words “pig”, “cow”, or
“chicken”, or their plural forms.
I could write three regular expressions, and test them in turn, but
there is a nicer way. The pipe character (`|`) denotes a choice
between the pattern to its left and the pattern to its right. So I can
say this:
[source,javascript]
----
var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs"));
// → true
console.log(animalCount.test("15 pigchickens"));
// → false
----
Parentheses can be used to limit the part of the pattern that the pipe
operator applies to, and you can put multiple such operators next to
each other to express a choice between more than two patterns.
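The effect of those parentheses is easy to get wrong, so here is a small comparison. Both patterns are made up for this illustration. Without parentheses, the pipe splits the _whole_ pattern, including the `^` and `$` markers.

```javascript
var grouped = /^(cat|dog)s$/;
var ungrouped = /^cat|dogs$/;
console.log(grouped.test("dogs"));
// → true
console.log(grouped.test("catalog"));
// → false
// The ungrouped version means "starts with cat" OR "ends with dogs".
console.log(ungrouped.test("catalog"));
// → true
console.log(ungrouped.test("hotdogs"));
// → true
```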
== The mechanics of matching ==
Regular expressions can be thought of as flow diagrams. This is the
diagram for the livestock expression in the previous example:
image::img/re_pigchickens.svg[alt="Visualization of /\b\d+ (pig|cow|chicken)s?\b/"]
A string matches the expression if a path from the start (left) to the
end (right) of the diagram can be found, with a corresponding start
and end position in the string, such that every time we go through a
box, we verify that our current position in the string corresponds to
the element described by the box and, for elements that match actual
characters (which the word boundaries do not), move our position
forward.
So if we match `"the 3 pigs"` there is a match between character 4
(the digit “3”) and 10 (the end of the string).
- At position 4, there is a word boundary, so we can move past the
first box.
- Still at position 4, we find a digit, so we can also move past the
second box.
- At position 5, we could go back to before the second (digit) box,
or move forward through the box that holds a single space
character. There is a space here, not a digit, so we choose the
second path.
- We are now at position 6 (the start of “pigs”) and at the three-way
branch in the diagram. We don't see “cow” or “chicken” here, but we
do see “pig”, so we take that branch.
- At position 9, after the three-way branching, we could either skip
the “s” box and go straight to the final word boundary, or first
match an “s”. There is an “s” character here, not a word boundary,
so we go through the “s” box.
- We're at position 10 (end of the string) and can only match a word
boundary. The end of a string counts as a word boundary, so we go
through the last box and have successfully matched this string.
Conceptually, the way the regular expression engine in a JavaScript
system looks for a match in a string is simple. It starts at
the start of the string and tries a match there. In this case, there
_is_ a word boundary there, so it'd get past the first box, but there
is no digit, so it'd fail at the second box. Then it moves on to the
second character in the string, and tries there. And so on, until it
finds a match, or reaches the end of the string and decides that there
really is no match.
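That outer loop can be imitated in a few lines. This is only a rough sketch of the idea, not how a real engine is built: `findByScanning` is a hypothetical helper that hands ever-shorter slices of the string to a pattern that starts with `^`. Because slicing hides the character before the slice, I left the leading `\b` off the livestock pattern here.

```javascript
// Hypothetical helper: try the pattern at each start position in turn.
// `anchored` must begin with ^ (and must not be global), so that it can
// only match at the very start of the slice it is given.
function findByScanning(anchored, string) {
  for (var start = 0; start <= string.length; start++) {
    var match = anchored.exec(string.slice(start));
    if (match)
      return {index: start, text: match[0]};
  }
  return null;
}

var livestock = /^\d+ (pig|cow|chicken)s?\b/;
console.log(findByScanning(livestock, "the 3 pigs"));
// → {index: 4, text: "3 pigs"}
console.log(findByScanning(livestock, "no animals here"));
// → null
```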
== Backtracking ==
(((backtracking)))The regular expression `/\b([01]+b|\d+|[\da-f]+h)\b/`
matches either a binary number followed by a “b”, a regular decimal
number without suffix character, or a hexadecimal number (base 16,
with the letters “a” to “f” standing for the digits 10 to 15) followed
by an “h”. This is the corresponding diagram:
image::img/re_number.svg[alt="Visualization of /\b([01]+b|\d+|[\da-f]+h)\b/"]
When matching this expression, it will often happen that the top
(binary) branch is entered although the input does not actually
contain a binary number. When matching the string `"103"`, it is only
at the “3” that it becomes clear that we are in the wrong branch. The
string does match the expression, just not the branch we are currently
in.
What happens then is that the matcher _backtracks_. When entering a
branch, it remembers where it was when it entered the current branch
(in this case, at the start of the string, just past the first
boundary box in the diagram), so that it can go back and try another
branch if the current one does not work out. So for the string
`"103"`, after encountering the “3” character, it will start trying
the decimal (second) branch. This one matches, so a match is reported
after all.
When more than one branch could match, the first one (in the order in
which the branches appear in the expression) will be taken.
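The outcome of this branch-hopping can be observed from the outside. The sketch below uses the number pattern with a `+` after the hexadecimal digit set, so that multi-digit hexadecimal numbers match, and then two tiny patterns that show the “first branch wins” rule.

```javascript
var number = /\b([01]+b|\d+|[\da-f]+h)\b/;
// The binary branch is tried first and abandoned at the "3".
console.log(number.exec("103")[0]);
// → 103
console.log(number.exec("101b")[0]);
// → 101b
console.log(number.exec("7fh")[0]);
// → 7fh

// When several branches could match, the leftmost one is taken.
console.log(/a|ab/.exec("ab")[0]);
// → a
console.log(/ab|a/.exec("ab")[0]);
// → ab
```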
Backtracking also happens, in slightly different forms, when matching
repeat operators. If you match `/^.*x/` against `"abcxe"`, the `.*`
part will first try to match the whole string. It'll then realize that
it can only match when it is followed by an “x”, and there is no “x”
past the end of the string. So it tries to match one character less.
And then another character less. And _now_ it finds an “x” where it
needs it, and reports a successful match from position 0 to 4.
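The result of that walk backward is visible in what `exec` returns. In this sketch (the variable names are mine), the second example shows a consequence of starting from the longest possible match: when there are several “x” characters, the _last_ one ends up closing the match.

```javascript
var upToX = /^.*x/;
var found = upToX.exec("abcxe");
console.log(found[0]);
// → abcx
console.log(found.index, found.index + found[0].length);
// → 0 4
console.log(upToX.exec("axbxc")[0]);
// → axbx
```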
It is possible to write regular expressions that will do a _lot_ of
backtracking. The problem occurs when a pattern can match a piece of
input in a lot of ways. For example, if we get confused while writing
our binary-number regexp and accidentally write something like
`/([01]+)+b/`.
image::img/re_number.svg[alt="Visualization of /([01]+)+b/"]
If that tries to match some long series of zeroes and ones _without_ a
“b” character after them, it will first go through the inner loop
until it runs out of digits. Then it notices there is no “b”, so it
backtracks one position, goes through the _outer_ loop once, and gives
up again, backtracking out of the inner loop once more. It will
continue to try every possible route through these two loops, which
means the amount of work it needs to do doubles with each additional
character. For a few dozen characters, the resulting match will
already take practically forever.
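In this particular case the nested loop adds nothing: a single loop accepts exactly the same strings and leaves only one way to divide up the digits. The sketch below compares the two on a handful of deliberately _short_ inputs (the list is arbitrary). Don't feed the nested version a long run of digits without a “b” unless you have time to spare.

```javascript
var nested = /([01]+)+b/;
var single = /[01]+b/;
var inputs = ["0110b", "b", "0110", "10b1", ""];
var agree = inputs.every(function(input) {
  return nested.test(input) == single.test(input);
});
console.log(agree);
// → true
```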
== The replace method ==
(((replace method)))(((regular expression!replacing)))String values
have a `replace` method, which can be used to replace parts of the
string with another string.
[source,javascript]
----
console.log("papa".replace("p", "m"));
// → mapa
----
The first argument can also be a regular expression, in which case the
first match of the regular expression is replaced.
[source,javascript]
----
console.log("Borobudur".replace(/[ou]/, "a"));
// → Barobudur
console.log("Borobudur".replace(/[ou]/g, "a"));
// → Barabadar
----
When a “g” option (for “global”) is added to the regular expression,
_all_ matches in the string will be replaced, not just the first. It
would have been more sensible if this choice were made through an
additional argument to `replace`, rather than through a property of the
regular expression we pass it. This is one of the poor interface
choices that I was referring to earlier.
The real power of using regular expressions with `replace` comes from
the fact that we can refer back to the matched groups in the
expression. For example, say we have a big string containing the names
of people, one name per line, in the format `Lastname, Firstname`. If
we want to swap these names and remove the comma to get a simple
`Firstname Lastname` format, we can use the following code:
[source,javascript]
----
console.log(
  "Hopper, Grace\nMcCarthy, John\nRitchie, Dennis"
    .replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));
// → Grace Hopper
//   John McCarthy
//   Dennis Ritchie
----
The `$1` and `$2` in the replacement string refer to the parenthesized
parts in the pattern. `$1` is replaced by the text that matched
against the first pair of parentheses, `$2` by the second, and so on,
up to `$9`.
It is also possible to pass a function, rather than a string, as the
second argument to `replace`. For each replacement, the function
will be called with the matched groups (as well as the whole match) as
arguments, and the value it returns will be inserted into the new
string.
Here's a simple example:
[source,javascript]
----
var s = "the cia and fbi";
console.log(s.replace(/\b(fbi|cia)\b/g, function(str) {
  return str.toUpperCase();
}));
// → the CIA and FBI
----
And here's a cuter one:
[source,javascript]
----
var stock = "1 lemon, 2 cabbages, and 101 eggs";
function minusOne(match, amount, unit) {
  amount = Number(amount) - 1;
  if (amount == 1) // only one left, remove the 's'
    unit = unit.slice(0, unit.length - 1);
  else if (amount == 0)
    amount = "no";
  return amount + " " + unit;
}
console.log(stock.replace(/(\d+) (\w+)/g, minusOne));
// → no lemon, 1 cabbage, and 100 eggs
----
This takes a string, finds all occurrences of a number followed by an
alphanumeric word, and returns a string wherein every such occurrence
is decremented by one.
The `(\d+)` group ends up as the `amount` argument to the function,
and the `(\w+)` group gets bound to `unit`. The function converts the
amount to a number—which always works, since it matched `\d+`—and
makes some adjustments in case there is only one or zero left.
== Greed ==
It isn't hard to use `replace` to write a function that removes all
comments from a piece of JavaScript code. Here is the first attempt:
// test: wrap
[source,javascript]
----
function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[\w\W]*\*\//g, "");
}
console.log(stripComments("1 + /* 2 */3"));
// → 1 + 3
console.log(stripComments("x = 10;// ten!"));
// → x = 10;
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1  1
----
The `[\w\W]` part is an (ugly) way to match any character. Remember
that a dot will not match a newline character. Block comments can
continue on a new line, so we can't use a dot here. Matching something
that is either a word character or not a word character will match all
possible characters.
But the output of the last example appears to have gone wrong. Why?
The `[\w\W]*` part of the expression, as I described in the section on
backtracking, will first match as much as it can, and then, if that
causes the part of the pattern after it to fail, move back one
character at a time and try from there. In this case, we are first
matching the whole rest of the string, and then moving back from
there. It will find an occurrence of `*/` after going back four
characters, and match that. This is not what we wanted. The intention
was to match a single comment, not to go all the way to the end of the
code and find the end of the last block comment.
There are two variants of the repetition operators in regular
expressions (‘+’, ‘*’, and ‘{}’). By default, they are _greedy_,
meaning they match as much as they can and backtrack from there.
If you put a question mark after them, they become non-greedy, and
start by matching as little as possible, only matching more when
the remaining pattern does not fit with the smaller match.
And that is exactly what we want in this case. By having the star
match the smallest stretch of characters that brings us to a `*/`
closing marker, we consume one block comment, and nothing more.
// test: wrap
[source,javascript]
----
function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[\w\W]*?\*\//g, "");
}
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1 + 1
----
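The same difference shows up in smaller patterns too. Here is a sketch with an arbitrary piece of marked-up text, matching from a “<” to a “>” both greedily and non-greedily.

```javascript
var text = "<b>bold</b> and <i>italic</i>";
// Greedy: runs on to the last ">" in the string.
console.log(text.match(/<.+>/)[0]);
// → <b>bold</b> and <i>italic</i>
// Non-greedy: stops at the first ">" it can reach.
console.log(text.match(/<.+?>/)[0]);
// → <b>
// The question mark works after braces as well.
console.log("aaa".match(/a{1,3}?/)[0]);
// → a
```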
== Dynamically creating RegExp objects ==
(((RegExp type)))There are cases where you might not know the exact
pattern you need to match against when you are writing your code. Say
you want to look for the user's name in a piece of text, and enclose
it in underscore characters to make it stand out. The name is only
known when the program is actually running, so we can not use the
slash-based notation.
But we can build up a string and use the `RegExp` constructor on that.
For example:
[source,javascript]
----
var name = "marijn";
var text = "Marijn is a suspicious character.";
var regexp = new RegExp("\\b(" + name + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → _Marijn_ is a suspicious character.
----
When creating the `\b` boundary markers, we have to use two
backslashes, because we are writing them in a normal string, not a
slash-enclosed regular expression. The options (global and
case-insensitive) for the regular expression can be given as a second
argument to the `RegExp` constructor.
But what if the variable `name` holds `"dea+hl[]rd"` because our user
is a nerdy teenager? That would cause us to produce a bogus regular
expression, which will cause unexpected results.
To work around this, we can add backslashes before any character that
we don't trust. Adding backslashes before alphabetic characters is a
bad idea, because things like `\b` and `\n` have a special meaning.
But escaping everything that's not alphanumeric or whitespace is safe.
[source,javascript]
----
var name = "dea+hl[]rd";
var text = "This dea+hl[]rd guy is quite annoying.";
var escaped = name.replace(/[^\w\s]/g, "\\$&");
var regexp = new RegExp("\\b(" + escaped + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → This _dea+hl[]rd_ guy is quite annoying.
----
The `$&` placeholder in the replacement string acts similarly to `$1`,
but will be replaced by the whole match, rather than a matched group.
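If you need this in more than one place, the escaping step could be wrapped in a small function. The name `escapeForRegExp` is my own invention, not something JavaScript provides; the rule it applies is the one described above.

```javascript
// Hypothetical helper: put a backslash before every character that
// is not alphanumeric, an underscore, or whitespace.
function escapeForRegExp(text) {
  return text.replace(/[^\w\s]/g, "\\$&");
}

console.log(escapeForRegExp("dea+hl[]rd"));
// → dea\+hl\[\]rd

var sum = "1+1=2";
// Used as a pattern directly, the plus sign is a repetition operator.
console.log(new RegExp(sum).test(sum));
// → false
console.log(new RegExp(escapeForRegExp(sum)).test(sum));
// → true
```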
== The search method ==
(((indexOf method)))(((search method)))The `indexOf` method on strings
can not be called with a regular expression. But there is another
method, `search`, which does expect a regular expression, and, like
`indexOf` returns the first index on which the expression was found,
or -1 when it wasn't found.
[source,javascript]
----
console.log("  word".search(/\S/));
// → 2
console.log(" ".search(/\S/));
// → -1
----
Unfortunately, there is no way to indicate that the match should start
at a given offset (as with the second argument to `indexOf`), which
would often be very useful.
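One possible workaround is to search in a slice of the string and add the offset back afterward. `searchFrom` below is a hypothetical helper, not a built-in method, and it has a catch: the pattern only sees the slice, so markers like `^` and `\b` behave as if the string started at the offset.

```javascript
// Hypothetical helper: like search, but starting at the given offset.
function searchFrom(string, regexp, offset) {
  var index = string.slice(offset).search(regexp);
  return index == -1 ? -1 : index + offset;
}

console.log(searchFrom("xyzzy", /y/, 2));
// → 4
console.log(searchFrom("xyzzy", /x/, 1));
// → -1
```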
== The lastIndex property ==
(((exec method)))The `exec` method also does not provide a convenient
way to start searching from a given position in the string. But it
does provide an inconvenient way.
(((source property)))(((lastIndex property)))Regular expression
objects have properties (such as `source`, which contains the string
that the expression was created from). One such property, `lastIndex`,
controls, in some limited circumstances, where the next match will
start.
Those circumstances are that the regular expression must have the
“global” (`g`) option enabled, and the match must happen through the
`exec` method. Again, the sane way would have been to just allow an
extra argument to be passed to `exec`, but sanity is not a defining
characteristic of JavaScript's regular expression interface.
[source,javascript]
----
var pattern = /y/g;
pattern.lastIndex = 3;
var match = pattern.exec("xyzzy");
console.log(match.index);
// → 4
console.log(pattern.lastIndex);
// → 5
----
The `lastIndex` property is updated by the call to `exec` to point
after the match, when the match was successful. When no match was
found, `lastIndex` is set back to zero, which is also the value it
has in a newly constructed regular expression object.
When using a global regular expression value for multiple `exec`
calls, this changing of the `lastIndex` property can cause
problems—your regular expression might be accidentally starting at an
index that was left over from a previous call.
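To make the problem concrete, here is a small sketch in which the same
global expression is used on two unrelated strings. The variable names
are chosen just for this example.

```javascript
var digit = /\d/g;

var first = digit.exec("a1");
console.log(first.index, digit.lastIndex);
// → 1 2

// lastIndex is still 2, which is past the end of "7", so this call
// fails even though the string clearly contains a digit.
var second = digit.exec("7");
console.log(second, digit.lastIndex);
// → null 0

// The failed match reset lastIndex to zero, so the same call now works.
var third = digit.exec("7");
console.log(third.index);
// → 0
```

Setting `lastIndex` back to zero before reusing the expression, or
leaving off the `g` option, avoids this surprise.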
(((match method)))Another interesting effect of the global option is
that it changes the way the `match` method on strings works. When called
with a global expression, instead of returning an array similar to
that returned by `exec`, `match` will find _all_ matches of the
pattern in the string, and return an array containing the matched
strings.
[source,javascript]
----
console.log("Banana".match(/an/g));
// → ["an", "an"]
----
So be cautious with global regular expressions. The cases where they
are necessary—calls to `replace` and places where you want to
explicitly use `lastIndex`—are typically the only places where you
want to use them.
A common pattern is to scan through all occurrences of a pattern in a
string, with full access to matched groups and the `index` property,
by using `lastIndex` and `exec`.
[source,javascript]
----
var input = "A text with 3 numbers in it... 42 and 88.";
var re = /\b(\d+)\b/g;
var match;
while (match = re.exec(input))
  console.log("Found", match[1], "at", match.index);
// → Found 3 at 12
// Found 42 at 31
// Found 88 at 38
----
This makes use of the fact that the value of an assignment (`=`)
expression is the assigned value. So by using `match = re.exec(input)`
as the condition in the `while` statement, we perform the match
at the start of each iteration, save its result in a variable, and
stop looping when no more matches are found.
== Parsing an ini file ==
(((ini file)))Now let's look at a real problem that calls for regular
expressions. Imagine we are writing a program to automatically harvest
information about our enemies from the Internet. (We will not actually
write such a program here, just the part that reads the configuration
file.) This file looks like this:
[source,text/plain]
----
searchengine=http://www.google.com/search?q=$1
spitefulness=9.7
; comments are preceded by a semicolon...
; these are sections, concerning individual enemies
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451
[gargamel]
fullname=Gargamel
type=evil sorcerer
outputdir=/home/marijn/enemies/gargamel
----
(((grammar)))The exact rules for this format (which is actually a
widely used format, usually called an _.ini_ file) are as follows:
- Blank lines and lines starting with semicolons are ignored.
- Lines wrapped in `[` and `]` start a new section.
- Lines containing an alphanumeric identifier followed by an `=`
character add a setting to the current section.
- Anything else is invalid.
Our task is to convert a string like this into an array of objects,
each with a `name` property and an array of `name`/`value` pairs.
We'll need one such object for each section and one for the
section-less settings.
Since the format has to be processed line by line, splitting it up
into separate lines is a good start. We've used the `split` method
once before for this, as `string.split("\n")`. Some operating systems,
however, use not just a newline character to separate lines but a
carriage return character followed by a newline (`"\r\n"`).
Given that the `split` method of strings also allows a regular
expression as its argument, we can split on a regular expression like
`/\r?\n/` to split in a way that allows both `"\n"` and `"\r\n"`
between lines.
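A quick check, using two made-up input strings, shows the difference
between splitting on `"\n"` and splitting on the regular expression.

```javascript
var unixText = "a=1\nb=2";
var windowsText = "a=1\r\nb=2";

console.log(unixText.split(/\r?\n/));
// → ["a=1", "b=2"]
console.log(windowsText.split(/\r?\n/));
// → ["a=1", "b=2"]

// Splitting on "\n" alone leaves a carriage return at the end of the
// first line, so it is four characters long instead of three.
console.log(windowsText.split("\n")[0].length);
// → 4
```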
[source,javascript]
----
function parseINI(string) {
  var categories = [];
  function newCategory(name) {
    var cat = {name: name, fields: []};
    categories.push(cat);
    return cat;
  }
  var currentCategory = newCategory("TOP");
  string.split(/\r?\n/).forEach(function(line) {
    var match;
    if (/^\s*(;.*)?$/.test(line))
      return;
    else if (match = line.match(/^\[(.*)\]$/))
      currentCategory = newCategory(match[1]);
    else if (match = line.match(/^(\w+)=(.*)$/))
      currentCategory.fields.push({name: match[1],
                                   value: match[2]});
    else
      throw new Error("Line '" + line + "' is invalid.");
  });
  return categories;
}
----
The code goes over every line in the file. It keeps a “current
category” object, and when it finds a normal directive, it adds it to
this object. When it encounters a line that starts a new category, it
replaces the current category with a new one, to which subsequent
directives will get added. Finally, it returns an array containing all
the categories it came across.
Note the recurring use of `^` and `$` to make sure the expression
matches the whole line, not just part of it. Leaving these out is a
common mistake, which results in code that mostly works but behaves
strangely for some input.
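As an illustration of that mistake, compare the section-header
expression with and without its markers. The input line here is
invented for the example.

```javascript
var anchored = /^\[(.*)\]$/;
var unanchored = /\[(.*)\]/;
var line = "name=[larry] the bully";

// With ^ and $, a line that merely contains brackets is rejected.
console.log(anchored.test(line));
// → false
// Without them, the same line is mistaken for a section header.
console.log(unanchored.test(line));
// → true
console.log(anchored.test("[larry]"));
// → true
```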
The expression `/^\s*(;.*)?$/` can be used to test for lines that can
be ignored. Do you see how it works? The part between the parentheses
will match comments, and the `?` after that will make sure it also
matches lines with only whitespace.
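Trying the expression on a few sample lines (made up for this sketch)
may help to see which ones it accepts.

```javascript
var ignorable = /^\s*(;.*)?$/;

console.log(ignorable.test(""));
// → true
console.log(ignorable.test("   "));
// → true
console.log(ignorable.test("  ; an indented comment"));
// → true
console.log(ignorable.test("fullname=Larry Doe"));
// → false
// A semicolon after other content does not make the line ignorable.
console.log(ignorable.test("type=bully ; not a comment line"));
// → false
```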
The pattern `if (match = string.match(...))` is similar to the trick
where I used an assignment as the condition for `while` a little
earlier. You usually aren't sure that your expression will match. But
you only want to do something with the resulting match array if it is not
null, so you need to test for that first. To not break the pleasant
chain of `if` forms, we can assign this result to a variable as the
test for `if` and do the matching and the testing in a single line.
== International characters ==
Due to an initial simplistic implementation and the fact that this
simplistic approach was later set in stone as standard behavior,
JavaScript's regular expressions are rather dumb about characters that
do not appear in the English language. For example, “word” characters,
in this context, are only the 26 characters in the Latin
alphabet, their upper-case variants, the decimal digits, and, for some
reason, the underscore character. Things like “é” or “β”, which most
definitely are word characters, will not match `\w` (and _will_ match
upper-case `\W`).
Through a strange historical accident, `\s` (whitespace) is different,
and will match all characters that the Unicode standard considers
whitespace, such as a non-breaking space or, in engines based on
Unicode versions before 6.3, a Mongolian vowel separator.
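A few quick tests illustrate both behaviors. The characters are
written as Unicode escapes here only to keep the example independent
of the file's encoding.

```javascript
var eAcute = "\u00e9";   // é
var beta = "\u03b2";     // β
var nbsp = "\u00a0";     // non-breaking space

console.log(/\w/.test(eAcute));
// → false
console.log(/\W/.test(beta));
// → true
console.log(/\w/.test("_"));
// → true
console.log(/\s/.test(nbsp));
// → true
```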
Some regular expression implementations in other languages have syntax
to match specific Unicode character categories, such as all uppercase
letters, all punctuation, control characters, or similar. There are
plans to add support for this to JavaScript, but they unfortunately
look like they won't be realized in the near future.
== Summary ==
Regular expressions are objects that represent patterns in strings.
They use their own syntax for expressing these patterns.
[cols="1,5"]
|====
|`/abc/` |Sequence of characters.
|`/[abc]/` |Any character from a set of characters.
|`/[^abc]/` |Any character _not_ in a set of characters.
|`/[0-9]/` |Any character in a range of characters.
|`/x+/` |One or more occurrences of a pattern.
|`/x+?/` |One or more occurrences, non-greedy.
|`/x*/` |Zero or more occurrences.
|`/x?/` |Zero or one occurrence.
|`/x{2,4}/` |Between two and four occurrences.
|`/(abc)+/` |Grouping.
|`/a|b|c/` |Alternative patterns.
|`/\d/` |Digit characters.