Exercise solutions

cat and tac

1) The given sample data has empty lines at the start and end of the input. Also, there are multiple empty lines between the paragraphs. How would you get the output shown below?

# note that there's an empty line at the end of the output
+$ printf '\n\n\ndragon\n\n\n\nunicorn\nbee\n\n\n' | cat -sb
+
+     1  dragon
+
+     2  unicorn
+     3  bee
+
+

2) Pass appropriate arguments to the cat command to get the output shown below.

$ cat greeting.txt
+Hi there
+Have a nice day
+
+$ echo '42 apples and 100 bananas' | cat - greeting.txt
+42 apples and 100 bananas
+Hi there
+Have a nice day
+

3) What does the -v option of the cat command do?

Displays nonprinting characters using the caret notation.

4) Which options of the cat command do the following stand in for?

-e option is equivalent to -vE
-t option is equivalent to -vT
-A option is equivalent to -vET

5) Will the two commands shown below produce the same output? If not, why not?

$ cat fruits.txt ip.txt | tac
+
+$ tac fruits.txt ip.txt
+

No. The first command concatenates the input files before reversing the content linewise. With the second command, each file content will be reversed separately.

6) Reverse the contents of blocks.txt file as shown below, considering ---- as the separator.

$ cat blocks.txt
+----
+apple--banana
+mango---fig
+----
+3.14
+-42
+1000
+----
+sky blue
+dark green
+----
+hi hello
+
+$ tac -bs '----' blocks.txt
+----
+hi hello
+----
+sky blue
+dark green
+----
+3.14
+-42
+1000
+----
+apple--banana
+mango---fig
+

7) For the blocks.txt file, write solutions to display only the last such group and last two groups.

# can also use: tac -bs '----' blocks.txt | awk '/----/ && ++c==2{exit} 1'
+$ tac blocks.txt | sed '/----/q' | tac
+----
+hi hello
+
+$ tac -bs '----' blocks.txt | awk '/----/ && ++c==3{exit} 1' | tac -bs '----'
+----
+sky blue
+dark green
+----
+hi hello
+

8) Reverse the contents of items.txt as shown below. Consider digits at the start of lines as the separator.

$ cat items.txt
+1) fruits
+apple 5
+banana 10
+2) colors
+green
+sky blue
+3) magical beasts
+dragon 3
+unicorn 42
+
+$ tac -brs '^[0-9]' items.txt
+3) magical beasts
+dragon 3
+unicorn 42
+2) colors
+green
+sky blue
+1) fruits
+apple 5
+banana 10
+

head and tail

1) Use appropriate commands and shell features to get the output shown below.

$ printf 'carpet\njeep\nbus\n'
+carpet
+jeep
+bus
+
+# use the above 'printf' command for input data
+$ c=$(printf 'carpet\njeep\nbus\n' | head -c3)
+$ echo "$c"
+car
+

2) How would you display all the input lines except the first one?

$ printf 'apple\nfig\ncarpet\njeep\nbus\n' | tail -n +2
+fig
+carpet
+jeep
+bus
+

3) Which command would you use to get the output shown below?

$ cat fruits.txt
+banana
+papaya
+mango
+$ cat blocks.txt
+----
+apple--banana
+mango---fig
+----
+3.14
+-42
+1000
+----
+sky blue
+dark green
+----
+hi hello
+
+$ head -n2 fruits.txt blocks.txt
+==> fruits.txt <==
+banana
+papaya
+
+==> blocks.txt <==
+----
+apple--banana
+

4) Use a combination of head and tail commands to get the 11th to 14th characters from the given input.

# can also use: tail -c +11 | head -c4
+$ printf 'apple\nfig\ncarpet\njeep\nbus\n' | head -c14 | tail -c +11
+carp
+

5) Extract the starting six bytes from the input files ip.txt and fruits.txt.

$ head -q -c6 ip.txt fruits.txt
+it is banana
+

6) Extract the last six bytes from the input files fruits.txt and ip.txt.

$ tail -q -c6 fruits.txt ip.txt
+mango
+erish
+

7) For the input file ip.txt, display except the last 5 lines.

$ head -n -5 ip.txt
+it is a warm and cozy day
+listen to what I say
+go play in the park
+come back before the sky turns dark
+

8) Display the third line from the given stdin data. Consider the NUL character as the line separator.

$ printf 'apple\0fig\0carpet\0jeep\0bus\0' | head -z -n3 | tail -z -n1
+carpet
+

tr

1) What's wrong with the following command?

$ echo 'apple#banana#cherry' | tr # :
+tr: missing operand
+Try 'tr --help' for more information.
+
+$ echo 'apple#banana#cherry' | tr '#' ':'
+apple:banana:cherry
+

As a good practice, always quote the arguments passed to the tr command to avoid conflict with shell metacharacters. Unless of course, you need the shell to interpret them.

2) Retain only alphabets, digits and whitespace characters.

$ printf 'Apple_42  cool,blue\tDragon:army\n' | tr -dc '[:alnum:][:space:]'
+Apple42  coolblue       Dragonarmy
+

3) Similar to rot13, figure out a way to shift digits such that the same logic can be used both ways.

$ echo '4780 89073' | tr '0-9' '5-90-4'
+9235 34528
+
+$ echo '9235 34528' | tr '0-9' '5-90-4'
+4780 89073
+

4) Figure out the logic based on the given input and output data. Hint: use two ranges for the first set and only 6 characters in the second set.

$ echo 'apple banana cherry damson etrog' | tr 'a-ep-z' '12345X'
+1XXl5 21n1n1 3h5XXX 41mXon 5XXog
+

5) Which option would you use to truncate the first set so that it matches the length of the second set?

The -t option is needed for this.

6) What does the * notation do in the second set?

The [c*n] notation repeats a character c by n times. You can specify n in decimal or octal formats. If n is omitted, the character c is repeated as many times as needed to equalize the length of the sets.

7) Change : to - and ; to the newline character.

$ echo 'tea:coffee;brown:teal;dragon:unicorn' | tr ':;' '-\n'
+tea-coffee
+brown-teal
+dragon-unicorn
+

8) Convert all characters to * except digit and newline characters.

$ echo 'ajsd45_sdg2Khnf4v_54as' | tr -c '0-9\n' '*'
+****45****2****4**54**
+

9) Change consecutive repeated punctuation characters to a single punctuation character.

$ echo '""hi..."", good morning!!!!' | tr -s '[:punct:]'
+"hi.", good morning!
+

10) Figure out the logic based on the given input and output data.

$ echo 'Aapple    noon     banana!!!!!' | tr -cs 'a-z\n' ':'
+:apple:noon:banana:
+

11) The books.txt file has items separated by one or more : characters. Change this separator to a single newline character as shown below.

$ cat books.txt
+Cradle:::Mage Errant::The Weirkey Chronicles
+Mother of Learning::Eight:::::Dear Spellbook:Ascendant
+Mark of the Fool:Super Powereds:::Ends of Magic
+
+$ <books.txt tr -s ':' '\n'
+Cradle
+Mage Errant
+The Weirkey Chronicles
+Mother of Learning
+Eight
+Dear Spellbook
+Ascendant
+Mark of the Fool
+Super Powereds
+Ends of Magic
+

cut

1) Display only the third field.

$ printf 'tea\tcoffee\tchocolate\tfruit\n' | cut -f3
+chocolate
+

2) Display the second and fifth fields. Consider , as the field separator.

$ echo 'tea,coffee,chocolate,ice cream,fruit' | cut -d, -f2,5
+coffee,fruit
+

3) Why does the below command not work as expected? What other tools can you use in such cases?

cut ignores all repeated fields and the output is always presented in the ascending order.

# not working as expected
+$ echo 'apple,banana,cherry,fig' | cut -d, -f3,1,3
+apple,cherry
+
+# expected output
+$ echo 'apple,banana,cherry,fig' | awk -F, -v OFS=, '{print $3, $1, $3}'
+cherry,apple,cherry
+

4) Display except the second field in the format shown below. Can you construct two different solutions?

# solution 1
+$ echo 'apple,banana,cherry,fig' | cut -d, --output-delimiter=' ' -f1,3-
+apple cherry fig
+
+# solution 2
+$ echo '2,3,4,5,6,7,8' | cut -d, --output-delimiter=' ' --complement -f2
+2 4 5 6 7 8
+

5) Extract the first three characters from the input lines as shown below. Can you also use the head command for this purpose? If not, why not?

$ printf 'apple\nbanana\ncherry\nfig\n' | cut -c-3
+app
+ban
+che
+fig
+

head cannot be used because it acts on the input as a whole, whereas cut works line wise.

6) Display only the first and third fields of the scores.csv input file, with tab as the output field separator.

$ cat scores.csv
+Name,Maths,Physics,Chemistry
+Ith,100,100,100
+Cy,97,98,95
+Lin,78,83,80
+
+$ cut -d, --output-delimiter=$'\t' -f1,3 scores.csv
+Name    Physics
+Ith     100
+Cy      98
+Lin     83
+

7) The given input data uses one or more : characters as the field separator. Assume that no field content will have the : character. Display except the second field, with : as the output field separator.

$ cat books.txt
+Cradle:::Mage Errant::The Weirkey Chronicles
+Mother of Learning::Eight:::::Dear Spellbook:Ascendant
+Mark of the Fool:Super Powereds:::Ends of Magic
+
+$ <books.txt tr -s ':' | cut --complement -d':' --output-delimiter=' : ' -f2
+Cradle : The Weirkey Chronicles
+Mother of Learning : Dear Spellbook : Ascendant
+Mark of the Fool : Ends of Magic
+

8) Which option would you use to not display lines that do not contain the input delimiter character?

You can use the -s option to suppress such lines.

9) Modify the command to get the expected output shown below.

$ printf 'apple\nbanana\ncherry\n' | cut -c-3 --output-delimiter=:
+app
+ban
+che
+
+$ printf 'apple\nbanana\ncherry\n' | cut -c1,2,3 --output-delimiter=:
+a:p:p
+b:a:n
+c:h:e
+

10) Figure out the logic based on the given input and output data.

$ printf 'apple\0fig\0carpet\0jeep\0' | cut -z --complement -c-2 | cat -v
+ple^@g^@rpet^@ep^@
+

seq

1) Generate numbers from 42 to 45 in ascending order.

$ seq 42 45
+42
+43
+44
+45
+

2) Why does the command shown below produce no output?

You have to explicitly provide a negative step value to generate numbers in descending order.

# no output
+$ seq 45 42
+
+# expected output
+$ seq 45 -1 42
+45
+44
+43
+42
+

3) Generate numbers from 25 to 10 in descending order, with a step value of 5.

$ seq 25 -5 10
+25
+20
+15
+10
+

4) Is the sequence shown below possible to generate with seq? If so, how?

$ seq -w -s, 01.5 6
+01.5,02.5,03.5,04.5,05.5
+

5) Modify the command shown below to customize the output numbering format.

$ seq 30.14 3.36 40.72
+30.14
+33.50
+36.86
+40.22
+
+$ seq -f'%.3e' 30.14 3.36 40.72
+3.014e+01
+3.350e+01
+3.686e+01
+4.022e+01
+

shuf

1) What's wrong with the given command?

shuf doesn't accept multiple input files. You can use cat to concatenate them first.

$ shuf --random-source=greeting.txt fruits.txt books.txt
+shuf: extra operand ‘books.txt’
+Try 'shuf --help' for more information.
+
+# expected output
+$ cat fruits.txt books.txt | shuf --random-source=greeting.txt
+banana
+Cradle:::Mage Errant::The Weirkey Chronicles
+Mother of Learning::Eight:::::Dear Spellbook:Ascendant
+papaya
+Mark of the Fool:Super Powereds:::Ends of Magic
+mango
+

2) What do the -r and -n options do? Why are they often used together?

The -r option helps if you want to allow input lines to be repeated. This option is usually paired with -n to limit the number of lines in the output. Otherwise, shuf -r will produce output lines indefinitely.

3) What does the following command do?

The -e option is useful to specify multiple input lines as arguments to the command.

$ shuf -e apple banana cherry fig mango
+cherry
+banana
+mango
+fig
+apple
+

4) Which option would you use to generate random numbers? Given an example.

The -i option helps generate random positive integers.

$ shuf -n3 -i 100-200
+128
+177
+193
+

5) How would you generate 5 random numbers between 0.125 and 0.789 with a step value of 0.023?

$ seq 0.125 0.023 0.789 | shuf -n5
+0.378
+0.631
+0.447
+0.746
+0.723
+

paste

1) What's the default delimiter character added by the paste command? Which option would you use to customize this separator?

Tab. You can use the -d option to change the delimiter between the columns.

2) Will the following two commands produce equivalent output? If not, why not?

$ paste -d, <(seq 3) <(printf '%s\n' item_{1..3})
+1,item_1
+2,item_2
+3,item_3
+
+$ printf '%s\n' {1..3},item_{1..3}
+1,item_1
+1,item_2
+1,item_3
+2,item_1
+2,item_2
+2,item_3
+3,item_1
+3,item_2
+3,item_3
+

The outputs are not equivalent because brace expansion creates all combinations when multiple braces are used.

3) Combine the two data sources as shown below.

$ printf '1)\n2)\n3)'
+1)
+2)
+3)
+$ cat fruits.txt
+banana
+papaya
+mango
+
+$ paste -d '' <(printf '1)\n2)\n3)') fruits.txt
+1)banana
+2)papaya
+3)mango
+

4) Interleave the contents of fruits.txt and books.txt.

$ paste -d'\n' fruits.txt books.txt
+banana
+Cradle:::Mage Errant::The Weirkey Chronicles
+papaya
+Mother of Learning::Eight:::::Dear Spellbook:Ascendant
+mango
+Mark of the Fool:Super Powereds:::Ends of Magic
+

5) Generate numbers 1 to 9 in two different formats as shown below.

$ seq 9 | paste -d: - - -
+1:2:3
+4:5:6
+7:8:9
+
+$ paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6) /dev/null /dev/null <(seq 7 9)
+1 : 4 : 7
+2 : 5 : 8
+3 : 6 : 9
+

6) Combine the contents of fruits.txt and colors.txt as shown below.

$ cat fruits.txt
+banana
+papaya
+mango
+$ cat colors.txt
+deep blue
+light orange
+blue delight
+
+$ paste -d'\n' fruits.txt colors.txt | paste -sd,
+banana,deep blue,papaya,light orange,mango,blue delight
+

pr

1) What does the -t option do?

The -t option turns off the pagination features like headers and trailers.

2) Generate numbers 1 to 16 in two different formats as shown below.

$ seq -w 16 | pr -4ats,
+01,02,03,04
+05,06,07,08
+09,10,11,12
+13,14,15,16
+
+$ seq -w 16 | pr -4ts,
+01,05,09,13
+02,06,10,14
+03,07,11,15
+04,08,12,16
+

3) How'd you solve the issue shown below?

$ seq 100 | pr -37ats,
+pr: page width too narrow
+
+$ seq 100 | pr -J -w73 -37ats,
+1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37
+38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74
+75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
+

(N-1)*length(separator) + N is the minimum page width you need, where N is the number of columns required. So, for 37 columns and a separator of length 1, you'll need a minimum width of 73. -J option ensures input lines aren't truncated.

4) Combine the contents of fruits.txt and colors.txt in two different formats as shown below.

$ cat fruits.txt
+banana
+papaya
+mango
+$ cat colors.txt
+deep blue
+light orange
+blue delight
+
+$ pr -mts' : ' fruits.txt colors.txt
+banana : deep blue
+papaya : light orange
+mango : blue delight
+
+$ pr -n:2 -mts, fruits.txt colors.txt
+ 1:banana,deep blue
+ 2:papaya,light orange
+ 3:mango,blue delight
+

5) What does the -d option do?

You can use the -d option to double space the input contents. That is, every newline character is doubled.

fold and fmt

1) What's the default wrap length of the fold and fmt commands?

80 bytes and 93% of 75 columns respectively.

2) Fold the given stdin data at 9 bytes.

$ echo 'hi hello, how are you?' | fold -w9
+hi hello,
+ how are 
+you?
+

3) Figure out the logic based on the given input and output data using the fold command.

$ cat ip.txt
+it is a warm and cozy day
+listen to what I say
+go play in the park
+come back before the sky turns dark
+
+There are so many delights to cherish
+Apple, Banana and Cherry
+Bread, Butter and Jelly
+Try them all before you perish
+
+$ head -n2 ip.txt | fold -sw10
+it is a 
+warm and 
+cozy day
+listen to 
+what I say
+

4) What does the fold -b option do?

The -b option will cause fold to treat tab, backspace, and carriage return characters as if they were a single byte character.

5) How'd you get the expected output shown below?

# wrong output
+$ echo 'fig appleseed mango pomegranate' | fold -sw7
+fig 
+applese
+ed 
+mango 
+pomegra
+nate
+
+# expected output
+$ echo 'fig appleseed mango pomegranate' | fmt -w7
+fig
+appleseed
+mango
+pomegranate
+

6) What do the options -s and -u of the fmt command do?

By default, the fmt command joins lines together that are shorter than the specified width. The -s option will disable this behavior.

The -u option changes multiple spaces to a single space. Excess spacing between sentences will be changed to two spaces.

sort

1) Default sort doesn't work for numbers. Which option would you use to get the expected output shown below?

$ printf '100\n10\n20\n3000\n2.45\n' | sort -n
+2.45
+10
+20
+100
+3000
+

2) Which sort option will help you ignore case? LC_ALL=C is used here to avoid differences due to locale.

$ printf 'Super\nover\nRUNE\ntea\n' | LC_ALL=C sort -f
+over
+RUNE
+Super
+tea
+

3) The -n option doesn't work for all sorts of numbers. Which sort option would you use to get the expected output shown below?

# wrong output
+$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort -n
+-1.53
++120
+3.14e+4
+42.1e-2
+
+# expected output
+$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort -g
+-1.53
+42.1e-2
++120
+3.14e+4
+

4) What do the -V and -h options do?

The -V option is useful when you have a mix of alphabets and digits. It also helps when you want to treat digits after a decimal point as whole numbers, for example 1.10 should be greater than 1.2.

Commands like du (disk usage) have the -h and --si options to display numbers with SI suffixes like k, K, M, G and so on. In such cases, you can use sort -h to order them.

5) Is there a difference between shuf and sort -R?

The sort -R option will display the output in random order. Unlike shuf, this option will always place identical lines next to each other due to the implementation.

6) Sort the scores.csv file numerically in ascending order using the contents of the second field. Header line should be preserved as the first line as shown below.

$ cat scores.csv
+Name,Maths,Physics,Chemistry
+Ith,100,100,100
+Cy,97,98,95
+Lin,78,83,80
+
+$ (sed -u '1q' ; sort -t, -k2,2n) < scores.csv
+Name,Maths,Physics,Chemistry
+Lin,78,83,80
+Cy,97,98,95
+Ith,100,100,100
+

7) Sort the contents of duplicates.csv by the fourth column numbers in descending order. Retain only the first copy of lines with the same number.

$ cat duplicates.csv
+brown,toy,bread,42
+dark red,ruby,rose,111
+blue,ruby,water,333
+dark red,sky,rose,555
+yellow,toy,flower,333
+white,sky,bread,111
+light red,purse,rose,333
+
+$ sort -t, -k4,4nr -u duplicates.csv
+dark red,sky,rose,555
+blue,ruby,water,333
+dark red,ruby,rose,111
+brown,toy,bread,42
+

8) Sort the contents of duplicates.csv by the third column item. Use the fourth column numbers as the tie-breaker.

$ sort -t, -k3,3 -k4,4n duplicates.csv
+brown,toy,bread,42
+white,sky,bread,111
+yellow,toy,flower,333
+dark red,ruby,rose,111
+light red,purse,rose,333
+dark red,sky,rose,555
+blue,ruby,water,333
+

9) What does the -s option provide?

The -s option is useful to retain the original order of input lines when two or more lines are deemed equal.

10) Sort the given input based on the numbers inside the brackets.

$ printf '(-3.14)\n[45]\n(12.5)\n{14093}' | sort -k1.2n
+(-3.14)
+(12.5)
+[45]
+{14093}
+

11) What do the -c, -C and -m options do?

The -c option helps you spot the first unsorted entry in the given input. The uppercase -C option is similar but only affects the exit status. Note that these options will not work for multiple inputs.

The -m option is useful if you have one or more sorted input files and need a single sorted output file. This would be faster than normal sorting.

uniq

1) Will uniq throw an error if the input is not sorted? What do you think will be the output for the following input?

uniq doesn't necessarily require the input to be sorted. Adjacent lines are used for comparison purposes.

$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
+red
+green
+red
+blue
+

2) Are there differences between sort -u file and sort file | uniq?

Yes. For example, you may need to sort based on some specific criteria and then identify duplicates based on the entire line contents. Here's an example:

# can't use sort -n -u here
+$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
+2 balls
+2 pins
+13 pens
+

3) What are the differences between sort -u and uniq -u options, if any?

sort -u retains the first copy of duplicates that are deemed to be equal. uniq -u retains only the unique copies (i.e. not even a single copy of the duplicates will be part of the output).

4) Filter the third column items from duplicates.csv. Construct three solutions to display only unique items, duplicate items and all duplicates.

$ cat duplicates.csv
+brown,toy,bread,42
+dark red,ruby,rose,111
+blue,ruby,water,333
+dark red,sky,rose,555
+yellow,toy,flower,333
+white,sky,bread,111
+light red,purse,rose,333
+
+# unique
+$ cut -d, -f3 duplicates.csv | sort | uniq -u
+flower
+water
+
+# duplicates
+$ cut -d, -f3 duplicates.csv | sort | uniq -d
+bread
+rose
+
+# all duplicates
+$ cut -d, -f3 duplicates.csv | sort | uniq -D
+bread
+bread
+rose
+rose
+rose
+

5) What does the --group option do? What customization features are available?

The --group options allows you to visually separate groups of similar lines with an empty line. This option can accept four values — separate, prepend, append and both. The default is separate, which adds a newline character between the groups. prepend will add a newline before the first group as well and append will add a newline after the last group. both combines the prepend and append behavior.

6) Count the number of times input lines are repeated and display the results in the format shown below.

$ s='brown\nbrown\nbrown\ngreen\nbrown\nblue\nblue'
+$ printf '%b' "$s" | sort | uniq -c | sort -n
+      1 green
+      2 blue
+      4 brown
+

7) For the input file f1.txt, retain only unique entries based on the first two characters of each line. For example, abcd and ab12 should be considered as duplicates and neither of them will be part of the output.

$ cat f1.txt
+3) cherry
+1) apple
+2) banana
+1) almond
+4) mango
+2) berry
+3) chocolate
+1) apple
+5) cherry
+
+$ sort f1.txt | uniq -u -w2
+4) mango
+5) cherry
+

8) For the input file f1.txt, display only the duplicate items without considering the first two characters of each line. For example, abcd and 12cd should be considered as duplicates. Assume that the third character of each line is always a space character.

$ sort -k2 f1.txt | uniq -d -f1
+1) apple
+3) cherry
+

9) What does the -s option do?

The -s option allows you to skip the first N characters (calculated as bytes).

10) Filter only unique lines, but ignore differences due to case.

$ printf 'cat\nbat\nCAT\nCar\nBat\nmat\nMat' | sort -f | uniq -iu
+Car
+

comm

1) Get the common lines between the s1.txt and s2.txt files. Assume that their contents are already sorted.

$ paste s1.txt s2.txt
+apple   banana
+coffee  coffee
+fig     eclair
+honey   fig
+mango   honey
+pasta   milk
+sugar   tea
+tea     yeast
+
+$ comm -12 s1.txt s2.txt
+coffee
+fig
+honey
+tea
+

2) Display lines present in s1.txt but not s2.txt and vice versa.

# lines unique to the first file
+$ comm -23 s1.txt s2.txt
+apple
+mango
+pasta
+sugar
+
+# lines unique to the second file
+$ comm -13 s1.txt s2.txt
+banana
+eclair
+milk
+yeast
+

3) Display lines unique to the s1.txt file and the common lines when compared to the s2.txt file. Use ==> to separate the output columns.

$ comm -2 --output-delimiter='==>' s1.txt s2.txt
+apple
+==>coffee
+==>fig
+==>honey
+mango
+pasta
+sugar
+==>tea
+

4) What does the --total option do?

Gives you the count of lines for each of the three columns.

5) Will the comm command fail if there are repeated lines in the input files? If not, what'd be the expected output for the command shown below?

The number of duplicate lines in the common column will be minimum of the duplicate occurrences between the two files. Rest of the duplicate lines, if any, will be considered as unique to the file having the excess lines.

$ cat s3.txt
+apple
+apple
+guava
+honey
+tea
+tea
+tea
+
+$ comm -23 s3.txt s1.txt
+apple
+guava
+tea
+tea
+

join

Assume that the input files are already sorted for these exercises.

1) Use appropriate options to get the expected outputs shown below.

# no output
+$ join <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')
+
+# expected output 1
+$ join -i <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')
+fig 5 10
+
+# expected output 2
+$ join -i -a1 -a2 <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')
+apple 2
+fig 5 10
+mango 4
+

2) Use the join command to display only the non-matching lines based on the first field.

$ cat j1.txt
+apple   2
+fig     5
+lemon   10
+tomato  22
+$ cat j2.txt
+almond  33
+fig     115
+mango   20
+pista   42
+
+# first field items present in j1.txt but not j2.txt
+$ join -v1 j1.txt j2.txt
+apple 2
+lemon 10
+tomato 22
+
+# first field items present in j2.txt but not j1.txt
+$ join -v2 j1.txt j2.txt
+almond 33
+mango 20
+pista 42
+

3) Filter lines from j1.txt and j2.txt that match the items from s1.txt.

$ cat s1.txt
+apple
+coffee
+fig
+honey
+mango
+pasta
+sugar
+tea
+
+# note that sort -m is used since the input files are already sorted
+$ join s1.txt <(sort -m j1.txt j2.txt)
+apple 2
+fig 115
+fig 5
+mango 20
+

4) Join the marks_1.csv and marks_2.csv files to get the expected output shown below.

$ cat marks_1.csv
+Name,Biology,Programming
+Er,92,77
+Ith,100,100
+Lin,92,100
+Sil,86,98
+$ cat marks_2.csv
+Name,Maths,Physics,Chemistry
+Cy,97,98,95
+Ith,100,100,100
+Lin,78,83,80
+
+$ join -t, --header marks_1.csv marks_2.csv
+Name,Biology,Programming,Maths,Physics,Chemistry
+Ith,100,100,100,100,100
+Lin,92,100,78,83,80
+

5) By default, the first field is used to combine the lines. Which options are helpful if you want to change the key field to be used for joining?

You can use -1 and -2 options followed by a field number to specify a different field number. You can use the -j option if the field number is the same for both the files.

6) Join the marks_1.csv and marks_2.csv files to get the expected output with specific fields as shown below.

$ join -t, --header -o 1.1,1.3,2.2,1.2 marks_1.csv marks_2.csv
+Name,Programming,Maths,Biology
+Ith,100,100,100
+Lin,100,78,92
+

7) Join the marks_1.csv and marks_2.csv files to get the expected output shown below. Use 50 as the filler data.

$ join -t, --header -o auto -a1 -a2 -e '50' marks_1.csv marks_2.csv
+Name,Biology,Programming,Maths,Physics,Chemistry
+Cy,50,50,97,98,95
+Er,92,77,50,50,50
+Ith,100,100,100,100,100
+Lin,92,100,78,83,80
+Sil,86,98,50,50,50
+

8) When you use the -o auto option, what'd happen to the extra fields compared to those in the first lines of the input data?

If you use auto as the argument for the -o option, first line of both the input files will be used to determine the number of output fields. If the other lines have extra fields, they will be discarded.

9) From the input files j3.txt and j4.txt, filter only the lines are unique — i.e. lines that are not common to these files. Assume that the input files do not have duplicate entries.

$ cat j3.txt
+almond
+apple pie
+cold coffee
+honey
+mango shake
+pasta
+sugar
+tea
+$ cat j4.txt
+apple
+banana shake
+coffee
+fig
+honey
+mango shake
+milk
+tea
+yeast
+
+$ join -t '' -v1 -v2 j3.txt j4.txt
+almond
+apple
+apple pie
+banana shake
+coffee
+cold coffee
+fig
+milk
+pasta
+sugar
+yeast
+

10) From the input files j3.txt and j4.txt, filter only the lines are common to these files.

$ join -t '' j3.txt j4.txt
+honey
+mango shake
+tea
+

nl

1) nl and cat -n are always equivalent for numbering lines. True or False?

True if there are no empty lines in the input data. cat -b and nl are always equivalent.

2) What does the -n option do?

You can use the -n option to customize the number formatting. The available styles are:

rn right justified with space fillers (default)
rz right justified with leading zeros
ln left justified with space fillers

3) Use nl to produce the two expected outputs shown below.

$ cat greeting.txt
+Hi there
+Have a nice day
+
+# expected output 1
+$ nl -w3 -n'rz' greeting.txt
+001     Hi there
+002     Have a nice day
+
+# expected output 2
+$ nl -w3 -n'rz' -s') ' greeting.txt
+001) Hi there
+002) Have a nice day
+

4) Figure out the logic based on the given input and output data.

$ cat s1.txt
+apple
+coffee
+fig
+honey
+mango
+pasta
+sugar
+tea
+
+$ nl -w2 -s'. ' -v15 -i-2 s1.txt
+15. apple
+13. coffee
+11. fig
+ 9. honey
+ 7. mango
+ 5. pasta
+ 3. sugar
+ 1. tea
+

5) What are the three types of sections supported by nl?

nl recognizes three types of sections with the following default patterns:

\:\:\: as header
\:\: as body
\: as footer

These special lines will be replaced with an empty line after numbering. The numbering will be reset at the start of every section unless the -p option is used.

6) Only number the lines that start with ---- in the format shown below.

$ cat blocks.txt
+----
+apple--banana
+mango---fig
+----
+3.14
+-42
+1000
+----
+sky blue
+dark green
+----
+hi hello
+
+$ nl -w2 -s') ' -bp'^----' blocks.txt
+ 1) ----
+    apple--banana
+    mango---fig
+ 2) ----
+    3.14
+    -42
+    1000
+ 3) ----
+    sky blue
+    dark green
+ 4) ----
+    hi hello
+

7) For the blocks.txt file, determine the logic to produce the expected output shown below.

$ nl -w1 -s'. ' -d'--' blocks.txt
+
+1. apple--banana
+2. mango---fig
+
+1. 3.14
+2. -42
+3. 1000
+
+1. sky blue
+2. dark green
+
+1. hi hello
+

8) What does the -l option do?

The -l option controls how many consecutive empty lines should be considered as a single entry. Only the last empty line of such groupings will be numbered.

9) Figure out the logic based on the given input and output data.

$ cat all_sections.txt
+\:\:\:
+Header
+teal
+\:\:
+Hi there
+How are you
+\:\:
+banana
+papaya
+mango
+\:
+Footer
+
+$ nl -p -w2 -s') ' -ha all_sections.txt
+
+ 1) Header
+ 2) teal
+
+ 3) Hi there
+ 4) How are you
+
+ 5) banana
+ 6) papaya
+ 7) mango
+
+    Footer
+

wc

1) Save the number of lines in the greeting.txt input file to the lines shell variable.

$ lines=$(wc -l <greeting.txt)
+$ echo "$lines"
+2
+

2) What do you think will be the output of the following command?

Word count is based on whitespace separation.

$ echo 'dragons:2 ; unicorns:10' | wc -w
+3
+

3) Use appropriate options and arguments to get the output as shown below. Also, why is the line count showing as 2 instead of 3 for the stdin data?

Line count is based on the number of newline characters. So, if the last line of the input doesn't end with the newline character, it won't be counted.

$ printf 'apple\nbanana\ncherry' | wc -lc greeting.txt -
+      2      25 greeting.txt
+      2      19 -
+      4      44 total
+

4) Use appropriate options and arguments to get the output shown below.

$ printf 'greeting.txt\0scores.csv' | wc --files0-from=-
+2 6 25 greeting.txt
+4 4 70 scores.csv
+6 10 95 total
+

5) What is the difference between wc -c and wc -m options? And which option would you use to get the longest line length?

The -c option is useful to get the byte count. Use the -m option instead of -c if the input has multibyte characters.

# byte count
+$ printf 'αλεπού' | wc -c
+12
+
+# character count
+$ printf 'αλεπού' | wc -m
+6
+

You can use the -L option to report the length of the longest line in the input (excluding the newline character of a line).

6) Calculate the number of comma separated words from the scores.csv file.

$ cat scores.csv
+Name,Maths,Physics,Chemistry
+Ith,100,100,100
+Cy,97,98,95
+Lin,78,83,80
+
+$ <scores.csv tr ',' ' ' | wc -w
+16
+

split

Remove the output files after every exercise.

1) Split the s1.txt file 3 lines at a time.

$ split -l3 s1.txt
+
+$ head xa?
+==> xaa <==
+apple
+coffee
+fig
+
+==> xab <==
+honey
+mango
+pasta
+
+==> xac <==
+sugar
+tea
+
+$ rm xa?
+

2) Use appropriate options to get the output shown below.

$ echo 'apple,banana,cherry,dates' | split -t, -l1
+
+$ head xa?
+==> xaa <==
+apple,
+==> xab <==
+banana,
+==> xac <==
+cherry,
+==> xad <==
+dates
+
+$ rm xa?
+

3) What do the -b and -C options do?

The -b option allows you to split the input by the number of bytes. This option also accepts suffixes such as K for 1024 bytes, KB for 1000 bytes, M for 1024 * 1024 bytes and so on.

The -C option is similar to the -b option, but it will try to break on line boundaries if possible. The break will happen before the given byte limit. If a line exceeds the given limit, it will be broken down into multiple parts.

4) Display the 2nd chunk of the ip.txt file after splitting it 4 times as shown below.

$ split -nl/2/4 ip.txt
+come back before the sky turns dark
+
+There are so many delights to cherish
+

5) What does the r prefix do when used with the -n option?

This creates output files with interleaved lines.

6) Split the ip.txt file 2 lines at a time. Customize the output filenames as shown below.

$ split -l2 -a1 -d --additional-suffix='.txt' ip.txt ip_
+
+$ head ip_*
+==> ip_0.txt <==
+it is a warm and cozy day
+listen to what I say
+
+==> ip_1.txt <==
+go play in the park
+come back before the sky turns dark
+
+==> ip_2.txt <==
+
+There are so many delights to cherish
+
+==> ip_3.txt <==
+Apple, Banana and Cherry
+Bread, Butter and Jelly
+
+==> ip_4.txt <==
+Try them all before you perish
+
+$ rm ip_*
+

7) Which option would you use to prevent empty files in the output?

The -e option prevents empty files in the output.

8) Split the items.txt file 5 lines at a time. Additionally, remove lines starting with a digit character as shown below.

$ cat items.txt
+1) fruits
+apple 5
+banana 10
+2) colors
+green
+sky blue
+3) magical beasts
+dragon 3
+unicorn 42
+
+$ split -l5 --filter='grep -v "^[0-9]" > $FILE' items.txt
+
+$ head xa?
+==> xaa <==
+apple 5
+banana 10
+green
+
+==> xab <==
+sky blue
+dragon 3
+unicorn 42
+
+$ rm xa?
+

csplit

Remove the output files after every exercise.

1) Split the blocks.txt file such that the first 7 lines are in the first file and the rest are in the second file as shown below.

$ csplit -q blocks.txt 8
+
+$ head xx*
+==> xx00 <==
+----
+apple--banana
+mango---fig
+----
+3.14
+-42
+1000
+
+==> xx01 <==
+----
+sky blue
+dark green
+----
+hi hello
+
+$ rm xx*
+

2) Split the input file items.txt such that the text before a line containing colors is part of the first file and the rest are part of the second file as shown below.

$ csplit -q items.txt '/colors/'
+
+$ head xx*
+==> xx00 <==
+1) fruits
+apple 5
+banana 10
+
+==> xx01 <==
+2) colors
+green
+sky blue
+3) magical beasts
+dragon 3
+unicorn 42
+
+$ rm xx*
+

3) Split the input file items.txt such that the line containing magical and all the lines that come after are part of the single output file.

$ csplit -q items.txt '%magical%'
+
+$ cat xx00
+3) magical beasts
+dragon 3
+unicorn 42
+
+$ rm xx00
+

4) Split the input file items.txt such that the line containing colors as well the line that comes after are part of the first output file.

$ csplit -q items.txt '/colors/2'
+
+$ head xx*
+==> xx00 <==
+1) fruits
+apple 5
+banana 10
+2) colors
+green
+
+==> xx01 <==
+sky blue
+3) magical beasts
+dragon 3
+unicorn 42
+
+$ rm xx*
+

5) Split the input file items.txt on the line that comes before a line containing magical. Generate only a single output file as shown below.

$ csplit -q items.txt '%magical%-1'
+
+$ cat xx00
+sky blue
+3) magical beasts
+dragon 3
+unicorn 42
+
+$ rm xx00
+

6) Split the input file blocks.txt on the 4th occurrence of a line starting with the - character. Generate only a single output file as shown below.

$ csplit -q blocks.txt '%^-%' '{3}'
+
+$ cat xx00
+----
+sky blue
+dark green
+----
+hi hello
+
+$ rm xx00
+

7) For the input file blocks.txt, determine the logic to produce the expected output shown below.

$ csplit -qz --suppress-matched blocks.txt '/----/' '{*}'
+
+$ head xx*
+==> xx00 <==
+apple--banana
+mango---fig
+
+==> xx01 <==
+3.14
+-42
+1000
+
+==> xx02 <==
+sky blue
+dark green
+
+==> xx03 <==
+hi hello
+
+$ rm xx*
+

8) What does the -k option do?

By default, csplit will remove the created output files if there's an error or a signal that causes the command to stop. You can use the -k option to keep such files. One use case is line number based splitting with the {*} modifier.

$ seq 7 | csplit -q - 4 '{*}'
+csplit: ‘4’: line number out of range on repetition 1
+$ ls xx*
+ls: cannot access 'xx*': No such file or directory
+
+# -k option will allow you to retain the created files
+$ seq 7 | csplit -qk - 4 '{*}'
+csplit: ‘4’: line number out of range on repetition 1
+$ head xx*
+==> xx00 <==
+1
+2
+3
+
+==> xx01 <==
+4
+5
+6
+7
+
+$ rm xx*
+

9) Split the books.txt file on every line as shown below.

# can also use: split -l1 -d -a1 books.txt row_
+$ csplit -qkz -f'row_' -n1 books.txt 1 '{*}'
+csplit: ‘1’: line number out of range on repetition 3
+
+$ head row_*
+==> row_0 <==
+Cradle:::Mage Errant::The Weirkey Chronicles
+
+==> row_1 <==
+Mother of Learning::Eight:::::Dear Spellbook:Ascendant
+
+==> row_2 <==
+Mark of the Fool:Super Powereds:::Ends of Magic
+
+$ rm row_*
+

10) Split the items.txt file on lines starting with a digit character. Matching lines shouldn't be part of the output and the files should be named group_0.txt, group_1.txt and so on.

$ csplit -qz --suppress-matched -q -f'group_' -b'%d.txt' items.txt '/^[0-9]/' '{*}'
+
+$ head group_*
+==> group_0.txt <==
+apple 5
+banana 10
+
+==> group_1.txt <==
+green
+sky blue
+
+==> group_2.txt <==
+dragon 3
+unicorn 42
+
+$ rm group_*
+

expand and unexpand

1) The items.txt file has space separated words. Convert the spaces to be aligned at 10 column widths as shown below.

$ cat items.txt
+1) fruits
+apple 5
+banana 10
+2) colors
+green
+sky blue
+3) magical beasts
+dragon 3
+unicorn 42
+
+$ <items.txt tr ' ' '\t' | expand -t 10
+1)        fruits
+apple     5
+banana    10
+2)        colors
+green
+sky       blue
+3)        magical   beasts
+dragon    3
+unicorn   42
+

2) What does the expand -i option do?

The -i option converts only the tab characters present at the start of a line. The first occurrence of a character that is not tab or space characters will stop the expansion.

3) Expand the first tab character to stop at the 10th column and the second one at the 16th column. Rest of the tabs should be converted to a single space character.

$ printf 'app\tfix\tjoy\tmap\ttap\n' | expand -t 10,16
+app       fix   joy map tap
+
+$ printf 'appleseed\tfig\tjoy\n' | expand -t 10,16
+appleseed fig   joy
+
+$ printf 'a\tb\tc\td\te\n' | expand -t 10,16
+a         b     c d e
+

4) Will the following code give back the original input? If not, is there an option that can help?

By default, the unexpand command converts only the initial blank characters (space or tab) to tabs. The -a option will allow you to convert all sequences of two or more blanks at tab boundaries.

$ printf 'a\tb\tc\n' | expand | unexpand
+a       b       c
+
+$ printf 'a\tb\tc\n' | expand | unexpand | cat -T
+a       b       c
+
+$ printf 'a\tb\tc\n' | expand | unexpand -a | cat -T
+a^Ib^Ic
+

5) How do the + and / prefix modifiers affect the -t option?

If you prefix a / character to the last width, the remaining tab characters will use multiple of this position instead of a single space default.

If you use + instead of / as the prefix for the last width, the multiple calculation will use the second last width as an offset.

basename and dirname

1) Is the following command valid? If so, what would be the output?

Yes, it is valid. Multiple slashes will be considered as a single slash. Any trailing slashes will be removed before determining the portion to be extracted.

$ basename -s.txt ~///test.txt///
+test
+

2) Given the file path in the shell variable p, how'd you obtain the outputs shown below?

$ p='~/projects/square_tictactoe/python/game.py'
+$ dirname $(dirname "$p")
+~/projects/square_tictactoe
+
+$ p='/backups/jan_2021.tar.gz'
+$ dirname $(dirname "$p")
+/
+

3) What would be the output of the basename command if the input has no leading directory component or only has the / character?

If there's no leading directory component or if slash alone is the input, the argument will be returned as is after removing any trailing slashes.

$ basename filename.txt
+filename.txt
+$ basename /
+/
+

4) For the paths stored in the shell variable p, how'd you obtain the outputs shown below?

$ p='/a/b/ip.txt /c/d/e/f/op.txt'
+
+# expected output 1
+$ basename -s'.txt' $p
+ip
+op
+
+# expected output 2
+$ dirname $p
+/a/b
+/c/d/e/f
+

5) Given the file path in the shell variable p, how'd you obtain the outputs shown below?

$ p='~/projects/python/square_tictactoe/game.py'
+$ basename $(dirname "$p")
+square_tictactoe
+
+$ p='/backups/aug_2024/ip.tar.gz'
+$ basename $(dirname "$p")
+aug_2024
+

sort

The sort command provides a wide variety of features. In addition to lexicographic ordering, it supports various numerical formats. You can also sort based on particular columns. And there are nifty features like merging already sorted input, debugging, determining whether the input is already sorted and so on.

Default sort and Collating order

By default, sort orders the input in ascending order. If you know about ASCII codepoints, do you agree that the following two examples are showing the correct expected output?

$ cat greeting.txt
+Hi there
+Have a nice day
+# extract and sort space separated words
+$ <greeting.txt tr ' ' '\n' | sort
+a
+day
+Have
+Hi
+nice
+there
+
+$ printf '(banana)\n{cherry}\n[apple]' | sort
+[apple]
+(banana)
+{cherry}
+

From the sort manual:

Unless otherwise specified, all comparisons use the character collating sequence specified by the LC_COLLATE locale.
If you use a non-POSIX locale (e.g., by setting LC_ALL to en_US), then sort may produce output that is sorted differently than you're accustomed to. In that case, set the LC_ALL environment variable to C. Note that setting only LC_COLLATE has two problems. First, it is ineffective if LC_ALL is also set. Second, it has undefined behavior if LC_CTYPE (or LANG, if LC_CTYPE is unset) is set to an incompatible value. For example, you get undefined behavior if LC_CTYPE is ja_JP.PCK but LC_COLLATE is en_US.UTF-8.

My locale settings are based on en_IN, which is different from the POSIX sorting order. So, the fact to remember is that sort obeys the rules of the current locale. If you want POSIX sorting, one option is to use LC_ALL=C as shown below.

$ <greeting.txt tr ' ' '\n' | LC_ALL=C sort
+Have
+Hi
+a
+day
+nice
+there
+
+$ printf '(banana)\n{cherry}\n[apple]' | LC_ALL=C sort
+(banana)
+[apple]
+{cherry}
+

Another benefit of C locale is that it will be significantly faster compared to Unicode parsing and sorting rules.

Use the -f option if you want to explicitly ignore case. See also GNU Core Utilities FAQ: Sort does not sort in normal order!.

See this unix.stackexchange thread if you want to create your own custom sort order.

Ignoring headers

You can use sed -u to consume only the header lines and leave the rest of the input for the sort command. Note that this unbuffered option is supported by GNU sed, might not be available with other implementations.

$ cat scores.csv
+Name,Maths,Physics,Chemistry
+Ith,100,100,100
+Cy,97,98,95
+Lin,78,83,80
+
+# 1q is used to quit after the first line
+$ ( sed -u '1q' ; sort ) <scores.csv
+Name,Maths,Physics,Chemistry
+Cy,97,98,95
+Ith,100,100,100
+Lin,78,83,80
+

See this unix.stackexchange thread for more ways of ignoring headers. See bash manual: Grouping Commands for more details about the () grouping used in the above example.

Dictionary sort

The -d option will consider only alphabets, numbers and blanks for sorting. Space and tab characters are considered as blanks, but this would also depend on the locale.

$ printf '(banana)\n{cherry}\n[apple]' | LC_ALL=C sort -d
+[apple]
+(banana)
+{cherry}
+

Use the -i option if you want to ignore only the non-printing characters.

Reversed order

The -r option will reverse the output order. Note that this doesn't change how sort performs comparisons, only the output is reversed. You'll see an example later where this distinction becomes clearer.

$ printf 'peace\nrest\nquiet' | sort -r
+rest
+quiet
+peace
+

In case you haven't noticed yet, sort adds a newline character to the final line even if it wasn't present in the input.

Numeric sort

The sort command provides various options to work with numeric formats. For most cases, the -n option is enough. Here's an example:

# lexicographic ordering isn't suited for numbers
+$ printf '20\n2\n3\n111\n314' | sort
+111
+2
+20
+3
+314
+
+# -n helps in this case
+$ printf '20\n2\n3\n111\n314' | sort -n
+2
+3
+20
+111
+314
+

The -n option can handle negative and floating-point numbers as well. The decimal point and the thousands separator characters will depend on the locale settings.

$ cat mixed_numbers.txt
+12,345
+42
+31.24
+-100
+42
+5678
+
+# , is the thousands separator in en_IN
+# . is the decimal point in en_IN
+$ sort -n mixed_numbers.txt
+-100
+31.24
+42
+42
+5678
+12,345
+

Use the -g option if your input can have the + prefix for positive numbers or follows the E-scientific notation.

$ cat e_notation.txt
++120
+-1.53
+3.14e+4
+42.1e-2
+
+$ sort -g e_notation.txt
+-1.53
+42.1e-2
++120
+3.14e+4
+

Unless otherwise specified, sort will break ties by using the entire input line content. In the case of -n, sorting will work even if there are extra characters after the number. Those extra characters will affect the output order if the numbers are equal. If a line doesn't start with a number (excluding blanks), it will be treated as 0.
# 'b' comes before 'p'
+$ printf '2 pins\n13 pens\n2 balls' | sort -n
+2 balls
+2 pins
+13 pens
+
+# 'z' and 'a2p' will be treated as '0'
+# 'a' comes before 'z'
+$ printf 'z\na2p\n13p\n2b\n-1\n    10' | sort -n
+-1
+a2p
+z
+2b
+    10
+13p
+

Human numeric sort

Commands like du (disk usage) have the -h and --si options to display numbers with SI suffixes like k, K, M, G and so on. In such cases, you can use sort -h to order them.

$ cat file_size.txt
+104K    power.log
+316M    projects
+746K    report.log
+20K     sample.txt
+1.4G    games
+
+$ sort -hr file_size.txt
+1.4G    games
+316M    projects
+746K    report.log
+104K    power.log
+20K     sample.txt
+

Version sort

$ printf '1.10\n1.2' | sort -n
+1.10
+1.2
+$ printf '1.10\n1.2' | sort -V
+1.2
+1.10
+
+$ cat versions.txt
+file2
+cmd5.2
+file10
+cmd1.6
+file5
+cmd5.10
+$ sort -V versions.txt
+cmd1.6
+cmd5.2
+cmd5.10
+file2
+file5
+file10
+

Here's an example of dealing with numbers reported by the time command (assuming all the entries have the same format).

$ cat timings.txt
+5m35.363s
+3m20.058s
+4m11.130s
+3m42.833s
+4m3.083s
+
+$ sort -V timings.txt
+3m20.058s
+3m42.833s
+4m3.083s
+4m11.130s
+5m35.363s
+

See Version sort ordering for more details. Note that the ls command uses lowercase -v for this task.

Random sort

The -R option will display the output in random order. Unlike shuf, this option will always place identical lines next to each other due to the implementation.

# the two lines with '42' will always be next to each other
+# use 'shuf' if you don't want this behavior
+$ sort -R mixed_numbers.txt
+31.24
+5678
+42
+42
+12,345
+-100
+

Unique sort

The -u option will keep only the first copy of lines that are deemed equal.

# (10) and [10] are deemed equal with dictionary sorting
+$ printf '(10)\n[20]\n[10]' | sort -du
+(10)
+[20]
+
+$ cat purchases.txt
+coffee
+tea
+washing powder
+coffee
+toothpaste
+tea
+soap
+tea
+$ sort -u purchases.txt
+coffee
+soap
+tea
+toothpaste
+washing powder
+

As seen earlier, the -n option will work even if there are extra characters after the number. When the -u option is also used, only the first such copy will be retained. Use the uniq command if you want to remove duplicates based on the whole line.

$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -nu
+2 balls
+13 pens
+
+# note that only the output order is reversed
+# use tac if you want the last duplicate to be preserved instead of the first
+$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -r -nu
+13 pens
+2 balls
+
+# use uniq when the entire line contents should be compared
+$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
+2 balls
+2 pins
+13 pens
+

You can use the -f option to ignore case while determining duplicates.

$ printf 'mat\nbat\nMAT\ncar\nbat\n' | sort -u
+bat
+car
+mat
+MAT
+
+# the first copy between 'mat' and 'MAT' is retained
+$ printf 'mat\nbat\nMAT\ncar\nbat\n' | sort -fu
+bat
+car
+mat
+

Column sort

The -k option allows you to sort based on specific columns instead of the entire input line. By default, the empty string between non-blank and blank characters is considered as the separator and thus the blanks are also part of the field contents. The effect of blanks and mitigation will be discussed later.

The -k option accepts arguments in various ways. You can specify the starting and ending column numbers separated by a comma. If you specify only the starting column, the last column will be used as the ending column. Usually you just want to sort by a single column, in which case the same number is specified as both the starting and ending columns. Here's an example:

$ cat shopping.txt
+apple   50
+toys    5
+Pizza   2
+mango   25
+Banana  10
+
+# sort based on the 2nd column numbers
+$ sort -k2,2n shopping.txt
+Pizza   2
+toys    5
+Banana  10
+mango   25
+apple   50
+

Note that in the above example, the -n option was also appended to the -k option. This makes it specific to that column and overrides global options, if any. Also, remember that the entire line will be used to break ties, unless otherwise specified.

You can use the -t option to specify a single byte character as the field separator. Use \0 to specify NUL as the separator. Depending on your shell you can use ANSI-C quoting to use escapes like \t instead of a literal tab character. When the -t option is used, the field separator won't be part of the field contents.

# department,name,marks
+$ cat marks.csv
+ECE,Raj,53
+ECE,Joel,72
+EEE,Moi,68
+CSE,Surya,81
+EEE,Raj,88
+CSE,Moi,62
+EEE,Tia,72
+ECE,Om,92
+CSE,Amy,67
+
+# name column is the primary sort key
+# entire line content will be used for breaking ties
+$ sort -t, -k2,2 marks.csv
+CSE,Amy,67
+ECE,Joel,72
+CSE,Moi,62
+EEE,Moi,68
+ECE,Om,92
+ECE,Raj,53
+EEE,Raj,88
+CSE,Surya,81
+EEE,Tia,72
+

You can use the -k option multiple times to specify your own order of tie breakers. Entire line will still be used to break ties if needed.

# second column is the primary key
+# reversed numeric sort on the third column is the secondary key
+# entire line will be used only if there are still tied entries
+$ sort -t, -k2,2 -k3,3nr marks.csv
+CSE,Amy,67
+ECE,Joel,72
+EEE,Moi,68
+CSE,Moi,62
+ECE,Om,92
+EEE,Raj,88
+ECE,Raj,53
+CSE,Surya,81
+EEE,Tia,72
+
+# sort by month first and then the day
+# -M option sorts based on abbreviated month names
+$ printf 'Aug-20\nMay-5\nAug-3' | sort -t- -k1,1M -k2,2n
+May-5
+Aug-3
+Aug-20
+

Use the -s option to retain the original order of input lines when two or more lines are deemed equal. You can still use multiple keys to specify your own tie breakers, -s only prevents the last resort comparison.

# -s prevents last resort comparison
+# so, lines having the same value in the 2nd column will retain input order
+$ sort -t, -s -k2,2 marks.csv
+CSE,Amy,67
+ECE,Joel,72
+EEE,Moi,68
+CSE,Moi,62
+ECE,Om,92
+ECE,Raj,53
+EEE,Raj,88
+CSE,Surya,81
+EEE,Tia,72
+

The -u option, as discussed earlier, will retain only the first copy of lines that are deemed equal.

# only the first copy of duplicates in the 2nd column will be retained
+$ sort -t, -u -k2,2 marks.csv
+CSE,Amy,67
+ECE,Joel,72
+EEE,Moi,68
+ECE,Om,92
+ECE,Raj,53
+CSE,Surya,81
+EEE,Tia,72
+

Character positions within columns

The -k option also accepts starting and ending character positions within the columns. These are specified after the column number, separated by a . character. If the character position is not specified for the ending column, the last character of that column is assumed.

The character positions start with 1 for the first character. Recall that when the -t option is used, the field separator is not part of the field contents.

# based on the second column number
+# 2.2 helps to ignore first character, otherwise -n won't have any effect here
+$ printf 'car,(20)\njeep,[10]\ntruck,(5)\nbus,[3]' | sort -t, -k2.2,2n
+bus,[3]
+truck,(5)
+jeep,[10]
+car,(20)
+
+# first character of the second column is the primary key
+# entire line acts as the last resort tie breaker
+$ printf 'car,(20)\njeep,[10]\ntruck,(5)\nbus,[3]' | sort -t, -k2.1,2.1
+car,(20)
+truck,(5)
+bus,[3]
+jeep,[10]
+

The default separation based on blank characters works differently. The empty string between non-blank and blank characters is considered as the separator and thus the blanks are also part of the field contents. You can use the -b option to ignore such leading blanks of field contents.

# the second column here starts with blank characters
+# adjusting the character position isn't feasible due to varying blanks
+$ printf 'car   (20)\njeep  [10]\ntruck (5)\nbus   [3]' | sort -k2.2,2n
+bus   [3]
+car   (20)
+jeep  [10]
+truck (5)
+
+# use -b in such cases to ignore the leading blanks
+$ printf 'car   (20)\njeep  [10]\ntruck (5)\nbus   [3]' | sort -k2.2b,2n
+bus   [3]
+truck (5)
+jeep  [10]
+car   (20)
+

Debugging

The --debug option can help you identify issues if the output isn't what you expected. Here's the previously seen -b example, now with --debug enabled. The underscores in the debug output shows which portions of the input are used as primary key, secondary key and so on. The collating order being used is also shown in the output.

$ printf 'car (20)\njeep [10]\ntruck (5)\nbus [3]' | sort -k2.2,2n --debug
+sort: text ordering performed using ‘en_IN’ sorting rules
+sort: leading blanks are significant in key 1; consider also specifying 'b'
+sort: note numbers use ‘.’ as a decimal point in this locale
+bus [3]
+    ^ no match for key
+_______
+car (20)
+    ^ no match for key
+________
+jeep [10]
+     ^ no match for key
+_________
+truck (5)
+      ^ no match for key
+_________
+
+$ printf 'car (20)\njeep [10]\ntruck (5)\nbus [3]' | sort -k2.2b,2n --debug
+sort: text ordering performed using ‘en_IN’ sorting rules
+sort: note numbers use ‘.’ as a decimal point in this locale
+bus [3]
+     _
+_______
+truck (5)
+       _
+_________
+jeep [10]
+      __
+_________
+car (20)
+     __
+________
+

Check if sorted

$ cat shopping.txt
+apple   50
+toys    5
+Pizza   2
+mango   25
+Banana  10
+
+$ sort -c shopping.txt
+sort: shopping.txt:3: disorder: Pizza   2
+$ echo $?
+1
+
+$ sort -C shopping.txt
+$ echo $?
+1
+

Specifying output file

The -o option can be used to specify the output file to be used for saving the results.

$ sort -R nums.txt -o rand_nums.txt
+
+$ cat rand_nums.txt
+1000
+3.14
+42
+

You can use -o for in-place editing as well, but the documentation gives this warning:

However, it is often safer to output to an otherwise-unused file, as data may be lost if the system crashes or sort encounters an I/O or other serious error while a file is being sorted in place. Also, sort with --merge (-m) can open the output file before reading all input, so a command like cat F | sort -m -o F - G is not safe as sort might start writing F before cat is done reading it.

Merge sort

The -m option is useful if you have one or more sorted input files and need a single sorted output file. Typically the use case is that you want to add newly obtained data to existing sorted data. In such cases, you can sort only the new data separately and then combine all the sorted inputs using the -m option. Here's a sample timing comparison between different combinations of sorted/unsorted inputs.

$ shuf -n1000000 -i1-999999999999 > n1.txt
+$ shuf -n1000000 -i1-999999999999 > n2.txt
+$ sort -n n1.txt > n1_sorted.txt
+$ sort -n n2.txt > n2_sorted.txt
+
+$ time sort -n n1.txt n2.txt > op1.txt
+real    0m1.010s
+$ time sort -mn n1_sorted.txt <(sort -n n2.txt) > op2.txt
+real    0m0.535s
+$ time sort -mn n1_sorted.txt n2_sorted.txt > op3.txt
+real    0m0.218s
+
+$ diff -sq op1.txt op2.txt
+Files op1.txt and op2.txt are identical
+$ diff -sq op1.txt op3.txt
+Files op1.txt and op3.txt are identical
+
+$ rm n{1,2}{,_sorted}.txt op{1..3}.txt
+

You might wonder if you can improve the performance of a single large file using the -m option. By default, sort already uses the available processors to split the input and merge. You can use the --parallel option to customize this behavior.

NUL separator

Use the -z option if you want to use NUL character as the line separator. In this scenario, sort will ensure to add a final NUL character even if not present in the input.

$ printf 'cherry\0apple\0banana' | sort -z | cat -v
+apple^@banana^@cherry^@
+

Exercises

The exercises directory has all the files used in this section.

1) Default sort doesn't work for numbers. Which option would you use to get the expected output shown below?

$ printf '100\n10\n20\n3000\n2.45\n' | sort ##### add your solution here
+2.45
+10
+20
+100
+3000
+

2) Which sort option will help you ignore case? LC_ALL=C is used here to avoid differences due to locale.

$ printf 'Super\nover\nRUNE\ntea\n' | LC_ALL=C sort ##### add your solution here
+over
+RUNE
+Super
+tea
+

3) The -n option doesn't work for all sorts of numbers. Which sort option would you use to get the expected output shown below?

# wrong output
+$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort -n
+-1.53
++120
+3.14e+4
+42.1e-2
+
+# expected output
+$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort ##### add your solution here
+-1.53
+42.1e-2
++120
+3.14e+4
+

4) What do the -V and -h options do?

5) Is there a difference between shuf and sort -R?

6) Sort the scores.csv file numerically in ascending order using the contents of the second field. Header line should be preserved as the first line as shown below.

$ cat scores.csv
+Name,Maths,Physics,Chemistry
+Ith,100,100,100
+Cy,97,98,95
+Lin,78,83,80
+
+##### add your solution here
+Name,Maths,Physics,Chemistry
+Lin,78,83,80
+Cy,97,98,95
+Ith,100,100,100
+

7) Sort the contents of duplicates.csv by the fourth column numbers in descending order. Retain only the first copy of lines with the same number.

$ cat duplicates.csv
+brown,toy,bread,42
+dark red,ruby,rose,111
+blue,ruby,water,333
+dark red,sky,rose,555
+yellow,toy,flower,333
+white,sky,bread,111
+light red,purse,rose,333
+
+##### add your solution here
+dark red,sky,rose,555
+blue,ruby,water,333
+dark red,ruby,rose,111
+brown,toy,bread,42
+

8) Sort the contents of duplicates.csv by the third column item. Use the fourth column numbers as the tie-breaker.

##### add your solution here
+brown,toy,bread,42
+white,sky,bread,111
+yellow,toy,flower,333
+dark red,ruby,rose,111
+light red,purse,rose,333
+dark red,sky,rose,555
+blue,ruby,water,333
+

9) What does the -s option provide?

10) Sort the given input based on the numbers inside the brackets.

$ printf '(-3.14)\n[45]\n(12.5)\n{14093}' | ##### add your solution here
+(-3.14)
+(12.5)
+[45]
+{14093}
+

11) What do the -c, -C and -m options do?

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils

CLI text processing with GNU Coreutils