Text manipulation

(This example was kindly provided by John Simpson.)

In this section we’ll use two tools for text manipulation: sed and tr. Our goal is to calculate the frequency of all dictionary words in the novel “The Invisible Man” by Herbert Wells (public domain). First, let’s apply our knowledge of grep to this text:

$ cd /path/to/data-shell
$ ls   # shows wellsInvisibleMan.txt
$ wc wellsInvisibleMan.txt                          # number of lines, words, characters
$ grep invisible wellsInvisibleMan.txt              # see the invisible man
$ grep invisible wellsInvisibleMan.txt | wc -l      # returns 60; adding -w gives the same count
$ grep -i invisible wellsInvisibleMan.txt | wc -l   # returns 176 (includes: invisible Invisible INVISIBLE)

Let’s sidetrack for a second and see how we can use the “stream editor” sed:

$ sed 's/[iI]nvisible/visible/g' wellsInvisibleMan.txt > visible.txt   # make him visible
$ cat wellsInvisibleMan.txt | sed 's/[iI]nvisible/visible/g' > visible.txt   # this also works (standard input)
$ grep -i invisible visible.txt   # see what was not converted
$ man sed
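
To see why some lines survive the substitution, here is a minimal sketch on a made-up sentence (not taken from the novel). It compares the substitution with and without the g flag and shows that the bracket expression [iI] does not match the all-caps spelling:

```shell
# Toy input, invented for illustration
sample='The Invisible Man was invisible; the INVISIBLE MAN stayed invisible.'

# Without g: only the first match on each line is replaced
first_only=$(printf '%s\n' "$sample" | sed 's/[iI]nvisible/visible/')

# With g: every match on the line is replaced,
# but INVISIBLE is left alone because [iI] only covers the first letter
all_matches=$(printf '%s\n' "$sample" | sed 's/[iI]nvisible/visible/g')

printf '%s\n' "$first_only" "$all_matches"
```

The all-caps form is one kind of line that grep -i will still report in visible.txt.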

Now let’s remove punctuation from the original file using the tr (translate) command:

$ cat wellsInvisibleMan.txt | tr -d "[:punct:]" > nopunct.txt    # tr only takes standard input
$ tail wellsInvisibleMan.txt
$ tail nopunct.txt

Next, convert all upper case to lower case:

$ cat nopunct.txt | tr '[:upper:]' '[:lower:]' > lower.txt
$ tail lower.txt
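
The two tr steps can be tried on a single made-up sentence before running them on the whole novel. This is only a sketch; the sample text is an assumption chosen to show that apostrophes count as punctuation too:

```shell
# Toy input, invented for illustration
sample="\"Don't move,\" said KEMP."

# Delete every punctuation character (quotes, apostrophe, comma, period)
no_punct=$(printf '%s\n' "$sample" | tr -d '[:punct:]')

# Map each upper-case letter to its lower-case counterpart
lower=$(printf '%s\n' "$no_punct" | tr '[:upper:]' '[:lower:]')

printf '%s\n' "$no_punct" "$lower"
```

Note that "Don't" becomes "dont", so contractions end up as separate dictionary entries. Since tr reads only standard input, input redirection (tr ... < file) works as an alternative to cat.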

Next, replace spaces with new lines:

$ cat lower.txt | sed 's/ /\'$'\n/g' > words.txt   # \'$'\n inserts a literal newline (uses bash $'...' quoting)
$ more words.txt
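
If the quoting in the sed command looks awkward, tr can do the same job. The sketch below is an alternative, not the workflow used in this section, and it assumes a made-up sample line. The -s flag squeezes the repeated newlines that runs of spaces would otherwise produce:

```shell
# Toy input with repeated spaces, invented for illustration
sample='the  stranger came   early'

# Translate each space to a newline, then squeeze repeated newlines into one
words=$(printf '%s\n' "$sample" | tr -s ' ' '\n')

printf '%s\n' "$words"
```

Each word lands on its own line, with no empty lines between them.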

Next, remove empty lines:

$ sed '/^$/d' words.txt  > compact.txt

Next, sort the list alphabetically, count each word’s occurrence, and remove duplicate words:

$ cat compact.txt | sort | uniq -c > dictionary.txt
$ more dictionary.txt

Next, sort the list into most frequent words:

$ cat dictionary.txt | sort -gr > frequency.txt   # -g: numeric sort, -r: reverse (largest first); see 'man sort'
$ more frequency.txt
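
The sort step before uniq matters because uniq only merges identical lines that are adjacent. Here is a small sketch on a made-up word list; it uses sort -nr (the POSIX numeric option, which is enough for integer counts) in place of -gr, and the final sed only strips the padding that uniq -c puts in front of each count:

```shell
# Toy word list, one word per line (invented for illustration)
words=$(printf '%s\n' man the man invisible the man)

# Without sorting, no duplicates are adjacent, so uniq merges nothing
unsorted_lines=$(printf '%s\n' "$words" | uniq -c | wc -l | tr -d '[:space:]')

# Sorting groups identical words so that uniq -c can count them;
# sort -nr then orders the result by count, largest first
frequency=$(printf '%s\n' "$words" | sort | uniq -c | sort -nr | sed 's/^ *//')

echo "lines without sort: $unsorted_lines"
printf '%s\n' "$frequency"
```

Without the first sort, all six input lines come back with a count of 1.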

You can watch a video for this topic after the workshop.

Quick reference:

sed 's/pattern1/pattern2/' filename    # replace pattern1 with pattern2, one per line
sed 's/pattern1/pattern2/g' filename   # same but multiple per line
sed 's|pattern1|pattern2|g' filename   # same

cat wellsInvisibleMan.txt | tr -d "[:punct:]" > nopunct.txt # remove punctuation; tr only takes standard input
cat nopunct.txt | tr '[:upper:]' '[:lower:]' > lower.txt    # convert all upper case to lower case
cat lower.txt | sed 's/ /\'$'\n/g' > words.txt              # replace spaces with new lines
sed '/^$/d' words.txt  > compact.txt                # remove empty lines
cat compact.txt | sort | uniq -c > dictionary.txt   # sort the list alphabetically, count each word's occurrence
cat dictionary.txt | sort -gr > frequency.txt       # sort the list into most frequent words

Question 10.1

Can you shorten our novel-manipulation workflow by putting it into a single line using pipes?

Question 10.2

Write a script that takes an English-language text file and prints the list of its 100 most common words, along with the word count in its dictionary. Hint: use the workflow we just studied. Next, convert this script into a bash function.

Column-based text processing with the `awk` scripting language

cd /path/to/data-shell/writing
cat haiku.txt   # 11 lines

You can define inline awk scripts with braces surrounded by single quotation marks:

awk '{print $1}' haiku.txt        # $1 is the first field (word) in each line => processing columns
awk '{print $0}' haiku.txt        # $0 is the whole line
awk '{print}' haiku.txt           # the whole line is the default action
awk -Fa '{print $1}' haiku.txt    # can specify another separator with -F ("a" in this case)
awk -F, '{print $1}' cities.csv   # for a CSV file

You can use multiple commands inside your awk script:

echo Hello Tom > hello.txt
echo Hello John >> hello.txt
awk '{$2="Adam"; print $0}' hello.txt   # we replaced the second word in each line with "Adam"
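
One detail worth knowing when you assign to a field: awk rebuilds the whole line ($0) using its output field separator OFS, which is a single space by default. The sketch below uses made-up input lines; the OFS variable is standard awk but is not covered elsewhere in this section:

```shell
# Extra spaces between fields collapse once a field is modified
greeting=$(printf 'Hello    Tom\n' | awk '{$2="Adam"; print $0}')

# With -F, the input is split on commas, but the output is still joined with a space
csv=$(printf 'Hello,Tom\n' | awk -F, '{$2="Adam"; print $0}')

# Setting OFS keeps the comma in the output
csv_fixed=$(printf 'Hello,Tom\n' | awk -F, -v OFS=, '{$2="Adam"; print $0}')

printf '%s\n' "$greeting" "$csv" "$csv_fixed"
```

This is relevant whenever you edit a column of a CSV file and want the result to remain a CSV file.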

The most common use of awk is to postprocess the output of other commands:

/bin/ps aux    # display all running processes as multi-column output
/bin/ps aux | awk '{print $2 " " $11}'   # print only the process ID (PID) and the command

Awk also takes patterns in addition to actions:

awk '/Yesterday/' haiku.txt         # print the lines that contain "Yesterday"
awk '/Yesterday|Today/' haiku.txt   # print the lines that contain "Yesterday" or "Today"

You can then combine a pattern with an action: the action runs only on lines where the pattern matches:

awk '/Yesterday|Today/{print $3}' haiku.txt
awk '/Yesterday|Today/' haiku.txt | awk '{print $3}'   # same as previous line

Everything inside '' is processed by the awk language. Nested further inside:
- everything inside // is a search pattern
- everything inside {} is an action to run
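
The three combinations (pattern with action, pattern only, action only) can be compared side by side. This is a sketch on made-up lines, not the contents of haiku.txt:

```shell
# Toy input, invented for illustration
lines=$(printf '%s\n' 'Yesterday it worked' 'Today it fails' 'Tomorrow who knows')

# Pattern and action: print the third field of matching lines only
matched=$(printf '%s\n' "$lines" | awk '/Yesterday|Today/ {print $3}')

# Pattern only: the default action prints the whole matching line
no_action=$(printf '%s\n' "$lines" | awk '/Tomorrow/')

# Action only: with no pattern, the action runs on every line
no_pattern=$(printf '%s\n' "$lines" | awk '{print $1}')

printf '%s\n' "$matched" "$no_action" "$no_pattern"
```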

Awk has a number of built-in variables; the most commonly used is NR:

awk 'NR>1' haiku.txt    # NR is the current record (line) number; print (default action) if NR>1, i.e. skip the first line
awk 'NR>1{print $0}' haiku.txt     # last command expanded
awk 'NR>1 && NR < 5' haiku.txt     # print lines 2-4
awk 'NR>=2 && NR <= 4' haiku.txt   # the same
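
To make the built-in variables more concrete, the sketch below prints NR next to NF, the number of fields in the current line. NF is standard awk but is not used elsewhere in this section, and the input lines are made up:

```shell
# Toy input, invented for illustration
lines=$(printf '%s\n' 'first line here' 'second' 'third line')

# Skip the first record, print the rest
body=$(printf '%s\n' "$lines" | awk 'NR>1')

# For each line, print its record number and its number of fields
numbered=$(printf '%s\n' "$lines" | awk '{print NR, NF}')

printf '%s\n' "$body" "$numbered"
```

A comma inside print separates the values with a space in the output.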

Question 10.3

Write an awk script to process cities.csv to print only town/city names and their population and store it in a separate file populations.csv. Try to do everything in a single-line command.

Question 10.4

Write an awk script that prints every 10th line from cities.csv starting from line 2, i.e. the first line after the header. Hint: use the NR variable.

Question `copy every 10th file`

Imagine that the directory /project/def-sponsor00/shared/toyModel contains results from a numerical simulation. Write a command to copy every 10th file (starting from yB31_oneblock_00000.vti) from this directory to one of your own directories. Hint: create an alphabetically sorted list of files in that directory and then use awk’s NR variable.

Question `archive every 20th file`

Similarly to the previous exercise, write a command to create a tar archive that includes every 20th file from the simulation directory /project/def-sponsor00/shared/toyModel. Is it possible to do this in one command? Why does it remove leading ‘/’ from file paths?

Quick reference:

ls -l | awk 'NR>3 {print $5 "  " $9}'     # print 5th and 9th columns starting with line 4
awk 'NR>1 && NR < 5' haiku.txt            # print lines 2-4
awk 'NR>1 && NR < 5 {print $1}' haiku.txt # print lines 2-4, column 1
awk '/Yesterday|Today/' haiku.txt         # print lines that contain Yesterday or Today

Question 10.7

Write a one-line command that finds the 5 largest files in the current directory (and all of its subdirectories) and prints only their names and file sizes in human-readable format (indicating bytes, kB, MB, GB, …) in decreasing file-size order. Hint: use find, xargs, and awk.

Question `ps`

Use the ps command to see how many processes you are running on the training cluster. Explore its flags. Write commands to reduce the ps output to a few essential columns.

You can watch a video for this topic after the workshop.
