Commit 9c72a42

updates for version 1.2
1 parent e3ab1c1 commit 9c72a42

8 files changed

Lines changed: 64 additions & 34 deletions

Version_changes.md

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
 <br>
 
+### 1.2
+
+* Added link to exercise solutions
+* Corrected typo in a solution
+* Two of the buffer examples simplified
+* Corrected line anchor explanations to be referred as string anchor instead
+
+<br>
+
 ### 1.1
 
 * Clarified BRE vs ERE difference for line anchor escaping

code_snippets/Gotchas_and_Tips.sh

Lines changed: 6 additions & 0 deletions
@@ -18,6 +18,12 @@ printf 'mat dog\r\n123 789\r\n' | awk -v RS='\r\n' '{print $2, $1}'
 
 printf 'mat dog\r\n123 789\r\n' | awk -v RS='\r\n' '{sub(/$/, ".")} 1'
 
+## Behavior of ^ and $ when string contains newline
+
+printf 'apple\n,mustard,grape,\nmango' | awk -v RS=, '/e$/'
+
+printf 'apple\n,mustard,grape,\nmango' | awk -v RS=, '/^m/'
+
 ## Word boundary differences
 
 echo 'I have 12, he has 2!' | awk '{gsub(/\y..\y/, "[&]")} 1'

code_snippets/Processing_multiple_records.sh

Lines changed: 3 additions & 3 deletions
@@ -70,14 +70,14 @@ seq 30 | awk -v n=2 '/4/{f=1; c++} f && c!=n; /6/{f=0}'
 
 seq 30 | awk '/4/{f=1; buf=$0; m=0; next}
 f{buf=buf ORS $0}
-/6/{f=0; if(buf && m) print buf; buf=""}
-/^1/{m=1}'
+/6/{f=0; if(m) print buf}
+$0=="15"{m=1}'
 
 ## Broken blocks
 
 cat broken.txt
 
 awk '/error/{f=1; buf=$0; next}
 f{buf=buf ORS $0}
-/state/{f=0; if(buf) print buf; buf=""}' broken.txt
+/state/{if(f) print buf; f=0}' broken.txt
 
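
The simplified `/state/` rule in this hunk tests the flag `f` instead of checking and clearing `buf`. A minimal sketch of why that is sufficient, using a hypothetical inline input rather than `broken.txt`: the first `error` block is never closed by a `state` line, and a stray `state` line appears after the only complete block.

```bash
# Hypothetical input (not broken.txt):
#   'error 1' block is abandoned when 'error 2' starts,
#   'state 3' arrives when no block is open.
blocks=$(printf 'error 1\nfoo\nerror 2\nbar\nstate 2\nbaz\nstate 3\n' |
    awk '/error/{f=1; buf=$0; next}        # start a new block, discarding any unfinished one
         f{buf=buf ORS $0}                 # accumulate records while a block is open
         /state/{if(f) print buf; f=0}')   # print only if a block was actually open

echo "$blocks"
```

Only the `error 2` block should be printed. The stale contents of `buf` are harmless at `state 3` because `f` is already `0`, so there is no need to reset `buf` to an empty string.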

code_snippets/Regular_Expressions.sh

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ printf 'spared no one\ngrasped\nspar\n' | awk '/ed/'
 
 printf 'spared no one\ngrasped\nspar\n' | awk '{r = @/ed/} $0 ~ r'
 
-## Line Anchors
+## String Anchors
 
 printf 'spared no one\ngrasped\nspar\n' | awk '/^sp/'
 
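
As a quick check of the renamed section's point, the sketch below compares the three ways of changing only a record whose entire content is `spar` that the book text in this commit mentions (`sub` with both anchors, a `/^spar$/` condition, and a `$0=="spar"` comparison). The variable names are illustrative only.

```bash
input='spared no one\ngrasped\nspar\n'

# 1) substitution restricted by both string anchors
with_sub=$(printf "$input" | awk '{sub(/^spar$/, "123")} 1')

# 2) anchored regexp as the condition, then reassign the record
with_regex=$(printf "$input" | awk '/^spar$/{$0 = 123} 1')

# 3) plain string comparison, no regexp involved
with_compare=$(printf "$input" | awk '$0=="spar"{$0 = 123} 1')

echo "$with_sub"
```

All three are expected to produce the same output for this input. The string comparison form has the advantage that the search text needs no escaping if it happens to contain metacharacters.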

exercises/Exercise_solutions.md

Lines changed: 2 additions & 2 deletions
@@ -923,12 +923,12 @@ Believe it
 pink blue white yellow
 car,mat,ball,basket
 
-$ awk -v n=2 '/^### /{f=1; c++} c==n' concat.txt
+$ awk -v n=2 '/^### /{c++} c==n' concat.txt
 ### broken.txt
 top
 1234567890
 bottom
-$ awk -v n=4 '/^### /{f=1; c++} c==n' concat.txt
+$ awk -v n=4 '/^### /{c++} c==n' concat.txt
 ### mixed_fs.txt
 pink blue white yellow
 car,mat,ball,basket
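
The `f=1` assignment removed in this hunk was never read, since the counter `c` alone decides which section is printed. A small sketch with a hypothetical stand-in for `concat.txt` (three `### ` headers, each followed by some content):

```bash
# Hypothetical input, not the real concat.txt
second=$(printf '### a.txt\none\n### b.txt\ntwo\nthree\n### c.txt\nfour\n' |
    awk -v n=2 '/^### /{c++} c==n')   # c counts headers; print while inside the n-th section

echo "$second"
```

The header line itself is included because `c` is incremented before the `c==n` condition is evaluated for that record.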

exercises/Exercises.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # awk introduction
 
->![info](../images/info.svg) Exercise related files are available from [exercises folder of learn_gnuawk repo](https://github.com/learnbyexample/learn_gnuawk/tree/master/exercises).
+>![info](../images/info.svg) Exercise related files are available from [exercises folder of learn_gnuawk repo](https://github.com/learnbyexample/learn_gnuawk/tree/master/exercises). For solutions, see [Exercise_solutions.md](https://github.com/learnbyexample/learn_gnuawk/blob/master/exercises/Exercise_solutions.md).
 
 **a)** For the input file `addr.txt`, display all lines containing `is`.
 

gnu_awk.md

Lines changed: 42 additions & 27 deletions
@@ -66,7 +66,7 @@ Resources mentioned in Acknowledgements section are available under original lic
 
 ## Book version
 
-1.1
+1.2
 See [Version_changes.md](https://github.com/learnbyexample/learn_gnuawk/blob/master/Version_changes.md) to track changes across book versions.
 
 # Installation and Documentation
@@ -386,7 +386,7 @@ Next chapter is dedicated solely for regular expressions. The features introduce
 
 ## Exercises
 
->![info](images/info.svg) Exercise related files are available from [exercises folder of learn_gnuawk repo](https://github.com/learnbyexample/learn_gnuawk/tree/master/exercises).
+>![info](images/info.svg) Exercise related files are available from [exercises folder of learn_gnuawk repo](https://github.com/learnbyexample/learn_gnuawk/tree/master/exercises). All the exercises are also collated together in one place at [Exercises.md](https://github.com/learnbyexample/learn_gnuawk/blob/master/exercises/Exercises.md). For solutions, see [Exercise_solutions.md](https://github.com/learnbyexample/learn_gnuawk/blob/master/exercises/Exercise_solutions.md).
 
 **a)** For the input file `addr.txt`, display all lines containing `is`.
 

@@ -490,48 +490,50 @@ spared no one
 grasped
 ```
 
-## Line Anchors
+## String Anchors
 
-In the examples seen so far, the regexp was a simple string value without any special characters. Also, the regexp pattern evaluated to `true` if it was found anywhere in the string. Instead of matching anywhere in the line, restrictions can be specified. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as **metacharacters** in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a `\` (discussed in [Matching the metacharacters](#matching-the-metacharacters) section).
+In the examples seen so far, the regexp was a simple string value without any special characters. Also, the regexp pattern evaluated to `true` if it was found anywhere in the string. Instead of matching anywhere in the string, restrictions can be specified. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as **metacharacters** in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a `\` (discussed in [Matching the metacharacters](#matching-the-metacharacters) section).
 
-There are two line anchors:
+There are two string anchors:
 
-* `^` metacharacter restricts the matching to start of line
-* `$` metacharacter restricts the matching to end of line
+* `^` metacharacter restricts the matching to the start of string
+* `$` metacharacter restricts the matching to the end of string
 
 ```bash
-$ # lines starting with 'sp'
+$ # string starting with 'sp'
 $ printf 'spared no one\ngrasped\nspar\n' | awk '/^sp/'
 spared no one
 spar
 
-$ # lines ending with 'ar'
+$ # string ending with 'ar'
 $ printf 'spared no one\ngrasped\nspar\n' | awk '/ar$/'
 spar
 
-$ # change only whole line 'spar'
-$ # can also use: awk '/^spar$/{$0 = 123} 1'
+$ # change only whole string 'spar'
+$ # can also use: awk '/^spar$/{$0 = 123} 1' or awk '$0=="spar"{$0 = 123} 1'
 $ printf 'spared no one\ngrasped\nspar\n' | awk '{sub(/^spar$/, "123")} 1'
 spared no one
 grasped
 123
 ```
 
-The anchors can be used by themselves as a pattern. Helps to insert text at start or end of line, emulating string concatenation operations. These might not feel like useful capability, but combined with other features they become quite a handy tool.
+The anchors can be used by themselves as a pattern. Helps to insert text at the start or end of string, emulating string concatenation operations. These might not feel like useful capability, but combined with other features they become quite a handy tool.
 
 ```bash
 $ printf 'spared no one\ngrasped\nspar\n' | awk '{gsub(/^/, "* ")} 1'
 * spared no one
 * grasped
 * spar
 
-$ # append only if line doesn't contain space characters
+$ # append only if string doesn't contain space characters
 $ printf 'spared no one\ngrasped\nspar\n' | awk '!/ /{gsub(/$/, ".")} 1'
 spared no one
 grasped.
 spar.
 ```
 
+>![info](images/info.svg) See also [Behavior of ^ and $ when string contains newline](#behavior-of--and--when-string-contains-newline) section.
+
 ## Word Anchors
 
 The second type of restriction is word anchors. A word character is any alphabet (irrespective of case), digit and the underscore character. You might wonder why there are digits and underscores as well, why not only alphabets? This comes from variable and function naming conventions — typically alphabets, digits and underscores are allowed. So, the definition is more programming oriented than natural language.
@@ -604,7 +606,7 @@ c:o:p:p:e:r
 Before seeing next regexp feature, it is good to note that sometimes using logical operators is easier to read and maintain compared to doing everything with regexp.
 
 ```bash
-$ # lines starting with 'b' but not containing 'at'
+$ # string starting with 'b' but not containing 'at'
 $ awk '/^b/ && !/at/' table.txt
 blue cake mug shirt -7
 

@@ -621,12 +623,11 @@ Many a times, you'd want to search for multiple terms. In a conditional expressi
 Alternation is similar to using `||` operator between two regexps. Having a single regexp helps to write terser code and `||` cannot be used when substitution is required.
 
 ```bash
-$ # lines with whole word 'par' or lines ending with 's'
+$ # match whole word 'par' or string ending with 's'
 $ # same as: awk '/\<par\>/ || /s$/'
 $ awk '/\<par\>|s$/' word_anchors.txt
 sub par
 two spare computers
-
 $ # replace 'cat' or 'dog' or 'fox' with '--'
 $ echo 'cats dog bee parrot foxed' | awk '{gsub(/cat|dog|fox/, "--")} 1'
 --s -- bee parrot --ed
@@ -691,7 +692,7 @@ part time
 
 You have seen a few metacharacters and escape sequences that help to compose a regular expression. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a `\` character. To indicate a literal `\` character, use `\\`.
 
-Unlike `grep` and `sed`, the line anchors have to be always escaped to match them literally as there is no BRE mode in `awk`. They do not lose their special meaning when not used in their customary positions.
+Unlike `grep` and `sed`, the string anchors have to be always escaped to match them literally as there is no BRE mode in `awk`. They do not lose their special meaning when not used in their customary positions.
 
 ```bash
 $ # awk '/b^2/' will not work even though ^ isn't being used as anchor
@@ -931,7 +932,7 @@ $ # same as: awk '{gsub(/\<(s|o|t)(o|n)\>/, "X")} 1'
 $ echo 'no so in to do on' | awk '{gsub(/\<[sot][on]\>/, "X")} 1'
 no X in X do X
 
-$ # lines made up of letters 'o' and 'n', line length at least 2
+$ # strings made up of letters 'o' and 'n', string length at least 2
 $ # /usr/share/dict/words contains dictionary words, one word per line
 $ awk '/^[on]{2,}$/' /usr/share/dict/words
 no
@@ -1109,7 +1110,7 @@ universe: 42
 >![info](images/info.svg) If a metacharacter is specified by ASCII value, it will still act as the metacharacter. Undefined escape sequences will result in a warning and treated as the character it escapes.
 
 ```bash
-$ # \x5e is ^ character, acts as line anchor here
+$ # \x5e is ^ character, acts as string anchor here
 $ printf 'cute\ncot\ncat\ncoat\n' | awk '/\x5eco/'
 cot
 coat
@@ -1171,7 +1172,7 @@ $ # duplicate first column value as final column
 $ echo 'one,2,3.14,42' | awk '{print gensub(/^([^,]+).*/, "&,\\1", 1)}'
 one,2,3.14,42,one
 
-$ # add something at start and end of line
+$ # add something at start and end of string
 $ # as only '&' is used, gensub isn't needed here
 $ echo 'hello world' | awk '{sub(/.*/, "Hi. &. Have a nice day")} 1'
 Hi. hello world. Have a nice day
@@ -1283,7 +1284,7 @@ $ echo 'f*(a^b) - 3*(a^b)' |
 awk -v s='(a^b)' '{gsub(/[{[(^$*?+.|\\]/, "\\\\&", s); gsub(s, "c")} 1'
 f*c - 3*c
 
-$ # match given input string literally, but only at end of line
+$ # match given input string literally, but only at the end of string
 $ echo 'f*(a^b) - 3*(a^b)' |
 awk -v s='(a^b)' '{gsub(/[{[(^$*?+.|\\]/, "\\\\&", s); gsub(s "$", "c")} 1'
 f*(a^b) - 3*c
@@ -3113,10 +3114,10 @@ a+b,pi=3.14,5e12
 The return value is also useful to ensure match is found at specific positions only. For example start or end of input string.
 
 ```bash
-$ # start of line
+$ # start of string
 $ awk 'index($0, "a+b")==1' eqns.txt
 a+b,pi=3.14,5e12
-$ # end of line
+$ # end of string
 $ awk -v s="a+b" 'index($0, s)==length()-length(s)+1' eqns.txt
 i*(t+9-g)/8,4-a+b
 ```
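
One caveat worth noting for the end-of-string check in this hunk: `index` reports the first occurrence only, so the comparison fails when the search string also appears earlier in the record. A hedged sketch of the difference, using a hypothetical record `a+b=a+b` and a `substr` based comparison as an alternative (the `substr` variant is a suggestion here, not something from the book text):

```bash
s='a+b'

# index() finds the first 'a+b' at position 1, so the end-of-string test is false
with_index=$(printf 'a+b=a+b\n' |
    awk -v s="$s" 'index($0, s)==length()-length(s)+1')

# compare the trailing length(s) characters directly instead
with_substr=$(printf 'a+b=a+b\n' |
    awk -v s="$s" 'substr($0, length()-length(s)+1)==s')

echo "index: [$with_index] substr: [$with_substr]"
```

For records where the search string occurs only once, such as the `eqns.txt` line shown in the hunk, both forms select the same records.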
@@ -3968,11 +3969,11 @@ $ seq 30 | awk -v n=2 '/4/{f=1; c++} f && c!=n; /6/{f=0}'
 All blocks, only if the records between the markers match an additional condition.
 
 ```bash
-$ # additional condition here is a line starting with '1'
+$ # additional condition here is a record with entire content as '15'
 $ seq 30 | awk '/4/{f=1; buf=$0; m=0; next}
 f{buf=buf ORS $0}
-/6/{f=0; if(buf && m) print buf; buf=""}
-/^1/{m=1}'
+/6/{f=0; if(m) print buf}
+$0=="15"{m=1}'
 14
 15
 16
@@ -4002,7 +4003,7 @@ zzzzzzzzzzzzzzzz
 
 $ awk '/error/{f=1; buf=$0; next}
 f{buf=buf ORS $0}
-/state/{f=0; if(buf) print buf; buf=""}' broken.txt
+/state/{if(f) print buf; f=0}' broken.txt
 error 2
 1234
 6789
@@ -4769,6 +4770,20 @@ mat dog.
 123 789.
 ```
 
+## Behavior of ^ and $ when string contains newline
+
+In some regular expression implementations, `^` matches the start of a line and `$` matches the end of a line (with newline as the line separator). In `awk`, these anchors always match the start of the entire string and end of the entire string respectively. This comes into play when `RS` is other than the newline character, or if you have a string value containing newline characters.
+
+```bash
+$ # 'apple\n' doesn't match as there's newline character
+$ printf 'apple\n,mustard,grape,\nmango' | awk -v RS=, '/e$/'
+grape
+
+$ # '\nmango' doesn't match as there's newline character
+$ printf 'apple\n,mustard,grape,\nmango' | awk -v RS=, '/^m/'
+mustard
+```
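
If line-wise anchoring inside a multi-line record is actually wanted, one possible workaround (an assumption on my part, not part of this commit) is to spell out the newline as an alternative to the anchor. The sketch below counts matching records for the same sample input, first with the plain `^` anchor and then with `(^|\n)`:

```bash
# Same sample input as above: records are 'apple\n', 'mustard', 'grape', '\nmango'
input='apple\n,mustard,grape,\nmango'

# ^ matches only at the start of the whole record, so '\nmango' is not counted
string_start=$(printf "$input" | awk -v RS=, '/^m/{c++} END{print c+0}')

# (^|\n) also accepts a position right after a newline inside the record
line_start=$(printf "$input" | awk -v RS=, '/(^|\n)m/{c++} END{print c+0}')

echo "string start: $string_start, line start: $line_start"
```

The same idea applies to the end anchor, for example `e(\n|$)` in place of `e$`.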
+
 ## Word boundary differences
 
 The word boundary `\y` matches both start and end of word locations. Whereas, `\<` and `\>` match exactly the start and end of word locations respectively. This leads to cases where you have to choose which of these word boundaries to use depending on results desired. Consider `I have 12, he has 2!` as sample text, shown below as an image with vertical bars marking the word boundaries. The last character `!` doesn't have end of word boundary as it is not a word character.
1.84 KB
Binary file not shown.
