Class 11: Regular Expressions

November 21, 2019

How to copy without copying

Efficient usage of the disk

For today’s class we need to “copy” several files to each one of your folders

But we will not modify the files. They will be read only

And we do not want 30 copies of the same file. That would use too much disk space

Copy without copy

Instead of copying, we will link the files to your folder

mkdir Gutenberg
cd Gutenberg
ln /home/andres/Gutenberg/* .
ls -l

Result should be like this

total 6316
-rw-r--r-- 1 andres andres  594933 Nov 14 12:46 Adventures_of_Sherlock_Holmes.txt
-rw-r--r-- 1 andres andres 2347825 Nov 14 12:46 Don_Quixote.txt
-rw-r--r-- 1 andres andres  405900 Nov 14 12:46 Dubliners.txt
-rw-r--r-- 1 andres andres  748536 Nov 14 12:46 Educating_by_Story-Telling.txt
-rw-rw-r-- 1 andres andres 1938980 Nov 15 09:52 english
-rw-r--r-- 1 andres andres  141420 Nov 14 12:46 Metamorphosis.txt
-rw-r--r-- 1 andres andres  272277 Nov 14 12:46 Study_In_Scarlet.txt

`ln` gives new names to existing files

Each physical file on the disk can have several names

ln creates a new name for an existing file

The second column in the output of ls -l shows how many names each file has

What happens when we use rm?
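A minimal sketch of the idea, run in a throwaway directory so it does not depend on the class files. The names original.txt and second_name.txt are made up for illustration:

```shell
# Work in a temporary directory so nothing real is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'some text\n' > original.txt
ln original.txt second_name.txt      # a second name for the same file

# The second column of ls -l is the number of names of the file.
links_before=$(ls -l original.txt | awk '{print $2}')

rm original.txt                      # removes one name, not the data
links_after=$(ls -l second_name.txt | awk '{print $2}')
content=$(cat second_name.txt)

echo "names before rm: $links_before"
echo "names after rm:  $links_after"
echo "content still readable: $content"

# Clean up the temporary directory.
cd / && rm -rf "$tmp"
```

In this sketch rm only takes away one of the names. The data stays on the disk as long as at least one name still points to it.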

Searching for a pattern

We use grep to look for a pattern in one or more files

We use these options:

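grep --color 'regex' file ...    # highlight the matching text
grep --only 'regex' file ...     # print only the matching text (GNU grep accepts --only as short for --only-matching)
grep --count 'regex' file ...    # print the number of matching lines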

The pattern is a regex. This means Regular Expression

A regex describes many different strings with a single pattern
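To see what the options do, here is a small sketch on a made-up three-line file instead of the class files. It uses the short forms -o and -c, which are equivalent to --only-matching and --count:

```shell
# A tiny stand-in for a text file (hypothetical content).
sample=$(mktemp)
printf '%s\n' 'the cat sat' 'a dog barked' 'cats and more cats' > "$sample"

# -o prints just the matched text, one match per output line.
matches=$(grep -o 'cat' "$sample")

# -c prints how many LINES match, not how many matches there are.
line_count=$(grep -c 'cat' "$sample")

echo "$matches"
echo "matching lines: $line_count"
rm -f "$sample"
```

Note the difference: the last line contains "cat" twice, so -o reports three matches while -c reports two matching lines.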

One word can match several lines

grep --color 'analyze' english

analyze
analyzed
analyzer
analyzer's
analyzers
analyzes
psychoanalyze
psychoanalyzed
psychoanalyzes

Only lines starting with “analyze”

grep --color '^analyze' english

analyze
analyzed
analyzer
analyzer's
analyzers
analyzes

The symbol ^ represents “start of line”

Only lines ending with “analyze”

grep --color 'analyze$' english

analyze
psychoanalyze

The symbol $ represents “end of line”

Only “analyze”

grep --color '^analyze$' english

analyze

Symbols ^ and $ are called “anchors”
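The four searches above can be compared side by side. This sketch uses a four-word list as an assumed stand-in for the english file:

```shell
# Small word list, one word per line (assumed sample data).
words=$(mktemp)
printf '%s\n' analyze analyzed psychoanalyze psychoanalyzed > "$words"

starts=$(grep -c '^analyze' "$words")    # analyze, analyzed
ends=$(grep -c 'analyze$' "$words")      # analyze, psychoanalyze
exact=$(grep -c '^analyze$' "$words")    # analyze only
anywhere=$(grep -c 'analyze' "$words")   # all four lines

echo "start: $starts  end: $ends  exact: $exact  anywhere: $anywhere"
rm -f "$words"
```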

Exercise

Count how many times the word “Sherlock” appears in each text file in the Gutenberg folder

At the beginning of a line
At the end of a line
Anywhere in a line

Searching American and British

There are small differences between American and British versions of the English language

grep --color '^analyze$' english

analyze

grep --color '^analyse$' english

analyse

Looking for both at the same time

grep --color '^analy[sz]e$' english

analyse
analyze

The symbols [ and ] indicate a character class

That is, one letter from a set of letters
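A short sketch with an assumed word list shows that [sz] accepts exactly one character, and only s or z:

```shell
# Assumed sample words; "analyte" and "analyzed" should not match.
words=$(mktemp)
printf '%s\n' analyse analyze analyte analyzed > "$words"

# [sz] stands for exactly one character: either s or z.
both=$(grep '^analy[sz]e$' "$words")
echo "$both"
rm -f "$words"
```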

Character classes

A character class allows you to match a range or set of characters

Example: [aeiou] will match any (English) vowel

This matches “c”, followed by a vowel, followed by “t”

grep --only 'c[aeiou]t' english | sort | uniq

cat
cet
cit
cot
cut

Negated Character Classes

We can also use character classes to specify characters we don’t want to match. These are called negated character classes

They are created by putting a caret ^ at the beginning of the class

This will match a “c”, followed by a non-vowel, followed by a “t”:

grep --only 'c[^aeiou]t' english | sort | uniq

cht
ckt
cst
cyt

Exercise

Show the complete matching line with the pattern in color for the following regex

'c[aeiou]t'
'c[^aeiou]t'

Ranges

You can also match a range of characters using a character class. For example,

[a-i]

will match any of the letters between a and i (inclusive)

Character classes work with numbers too

This matches a four-digit number between 1000 and 9999, such as a year (it also matches four digits inside a longer number):

grep '[1-9][0-9][0-9][0-9]' *txt
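To see which parts of a line this pattern really picks up, the sketch below runs it with -o on a few invented lines:

```shell
# Invented sample lines, not taken from the Gutenberg files.
lines=$(mktemp)
printf '%s\n' 'born in 1859' 'room 221' 'code 0042' 'serial 123456' > "$lines"

# -o shows only the matched digits.
found=$(grep -o '[1-9][0-9][0-9][0-9]' "$lines")
echo "$found"
rm -f "$lines"
```

In this sample "221" is too short and "0042" has no run of four digits starting with 1 to 9, so neither matches. "123456" does match, but only its first four digits, which is why the pattern alone cannot tell a year from part of a longer number.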

Any character

The symbol . represents any character

grep --only 'c.t' english | sort | uniq

cat
cet
cht
cit
ckt
cot
cst
cut
cyt
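The dot stands for exactly one character of any kind, including a space or punctuation. A sketch on assumed sample lines:

```shell
# Assumed sample lines.
words=$(mktemp)
printf '%s\n' cat c-t 'c t' ct coat > "$words"

# Matches cat, c-t and "c t"; "ct" has no middle character
# and "coat" has two.
one_char=$(grep -c '^c.t$' "$words")
echo "lines with exactly one character between c and t: $one_char"
rm -f "$words"
```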

Exercise

Find all the lines ending with “Holmes” followed by a single character

Repetitions

The * symbol means that something should be repeated zero or more times

That is, it follows an expression and makes it optional and repeatable

grep '^colou*r$' english

color
colour
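Because * means zero or more, the same pattern also accepts more than one "u". A sketch with a made-up list (colouur and colr are not real words, they are there only to test the pattern):

```shell
# Made-up test words.
words=$(mktemp)
printf '%s\n' color colour colouur colr > "$words"

# u* accepts zero, one, or many u's; "colr" fails because the
# second "o" is missing, not because of the star.
matched=$(grep '^colou*r$' "$words")
echo "$matched"
rm -f "$words"
```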

Escaping

The characters ., *, [, ], ^, $ are special

They are called meta-characters

How can we look for them?

To take out the “superpowers”, we use \

\., \*, \[, \], \^, \$, and \\

The rest of the characters match themselves
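A sketch of the difference between a plain dot and an escaped dot, on three invented lines. The -F option used in the last command is not part of the slides; it is a standard grep option that treats the whole pattern as plain text:

```shell
# Invented sample lines.
f=$(mktemp)
printf '%s\n' '3.14' '3514' '3-14' > "$f"

unescaped=$(grep -c '3.14' "$f")   # . matches any character: all 3 lines
escaped=$(grep -c '3\.14' "$f")    # \. matches only a real dot: 1 line
fixed=$(grep -c -F '3.14' "$f")    # -F: no character is special: 1 line

echo "unescaped: $unescaped  escaped: $escaped  fixed: $fixed"
rm -f "$f"
```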

Exercise

Look for “Holmes.”

Extended regular expressions

Instead of grep we will use egrep (the same as grep -E)

Now the characters .?*[]^${}()+|\ are special

As before, we can always escape them

Zero or one time

? is like * but means “zero or one time”

egrep --only 'lo?k' english | sort | uniq

lk
lok

egrep --only 'lo*k' english | sort | uniq

lk
lok
look
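The two operators can be compared on whole lines. This sketch uses grep -E, which behaves like egrep, and a made-up list of strings:

```shell
# Made-up strings with zero, one, two and three o's.
words=$(mktemp)
printf '%s\n' lk lok look loook > "$words"

optional=$(grep -E -c '^lo?k$' "$words")   # lk, lok
star=$(grep -E -c '^lo*k$' "$words")       # lk, lok, look, loook

echo "with ?: $optional   with *: $star"
rm -f "$words"
```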

One or more times

+ means one or more times

That is, [a-z][a-z]* is the same as [a-z]+
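The equivalence can be checked directly. In this sketch (sample lines are assumptions, including one empty line) both patterns should select the same lines:

```shell
# Sample lines: a word, an empty line, digits, a single letter.
words=$(mktemp)
printf '%s\n' abc '' 123 x > "$words"

plus=$(grep -E '^[a-z]+$' "$words")        # extended regex with +
star=$(grep '^[a-z][a-z]*$' "$words")      # basic regex with *

echo "$plus"
[ "$plus" = "$star" ] && echo "both patterns select the same lines"
rm -f "$words"
```

The empty line is rejected by both patterns, since at least one letter is required.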

Controlling the number of times

We can use curly braces to repeat something a number of times within a given range:

^a{3,5}$

That will match a line made up only of the letter “a” repeated 3, 4, or 5 times.

Controlling the number of times

If you want to match something repeated up to a certain number of times, you can use 0 as the first number.

If you want to match something repeated at least a certain number of times with no maximum, you can just leave the second number blank:

^a{3,}$
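A sketch of the three forms of the braces, run with grep -E on lines of one to six a's (made-up test data):

```shell
# Lines with 1, 2, 3, 4, 5 and 6 a's.
words=$(mktemp)
printf '%s\n' a aa aaa aaaa aaaaa aaaaaa > "$words"

range=$(grep -E -c '^a{3,5}$' "$words")     # 3, 4 or 5 a's
at_least=$(grep -E -c '^a{3,}$' "$words")   # 3 or more a's
at_most=$(grep -E -c '^a{0,2}$' "$words")   # up to 2 a's

echo "3 to 5: $range   at least 3: $at_least   at most 2: $at_most"
rm -f "$words"
```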

Alternatives

If you want to match either of two different expressions, you can use |

egrep 'cat|dog' english

Alcatraz    Yucatan     adjudicate  advocate
Decatur     abdicate    adjudicated advocated
Hecate      abdicated   adjudicates advocates
Ladoga      abdicates   adjudicating    advocating
Mercator    abdicating  adjudication    allocate
Muscat      abdication  adjudicator allocated
Popocatepetl    abdications adjudicators    allocates

Look for cats and dogs in all the text files
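One thing to keep in mind when combining | with anchors: the | separates everything on its left from everything on its right. A sketch with grep -E on an assumed word list:

```shell
# Assumed sample words.
words=$(mktemp)
printf '%s\n' cat dog advocate hotdog catalog dogma bird > "$words"

# "cat" or "dog" anywhere in the line: every word except bird.
anywhere=$(grep -E -c 'cat|dog' "$words")

# Read as (^cat) or (dog$): starts with cat, or ends with dog.
# That is cat, dog, hotdog and catalog.
anchored=$(grep -E -c '^cat|dog$' "$words")

echo "anywhere: $anywhere   start-cat or end-dog: $anchored"
rm -f "$words"
```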

Grouping

We can use ( and ) to define groups of expressions

egrep --only '([aeiou][^aeiou]){2}' english | sort | uniq

aliy    uran    arer    ured    alit    ole'    amat    edim    erat    uter
aron    urim    arin    ures    aliz    olic    elod    edom    emun    uper
asid    urin    aris    urin    anim    itic    amat    eter    erat    uter
elar    urit    ines    umin    ated    itiv    ical    edom    emun    uper
ilen    anis    aten    umul    ates    ilat    elod    eter    erat    uter
aham    urit    ater    ativ    atin    enat    elon    edom    emun    uper
alom    anis    atif    umul    ator    ened    ane'    eter    erat    ivit
uja'    urit    icat    ativ    idis    enin    anes    ekab    emun    uper
apul    anis    atif    umul    ilat    oses    eme'    eked    erat    ivit
ure'    urus    icat    ifor    eral    osis    emen    ekin    enal    uper
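A smaller sketch of the same grouping idea, using grep -E on an assumed word list. It also shows a group used to put anchors around an alternative:

```shell
# Assumed sample words.
words=$(mktemp)
printf '%s\n' banana tree cat dog hotdog > "$words"

# The group (vowel, then non-vowel) must appear twice in a row.
# Only "banana" contains such a run: "anan".
pairs=$(grep -E -o '([aeiou][^aeiou]){2}' "$words")

# The group keeps both anchors around the whole alternative,
# so only the exact words cat and dog match.
exact=$(grep -E -c '^(cat|dog)$' "$words")

echo "vowel/non-vowel pairs: $pairs"
echo "exact cat or dog: $exact"
rm -f "$words"
```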
