November 21, 2019
For today’s class we need to “copy” several files to each one of your folders
But we will not modify the files. They will be read only
And we do not want 30 copies of the same file. That would use too much disk space
Instead of copying, we will link the files to your folder
mkdir Gutenberg
cd Gutenberg
ln /home/andres/Gutenberg/* .
ls -l
total 6316
-rw-r--r-- 1 andres andres  594933 Nov 14 12:46 Adventures_of_Sherlock_Holmes.txt
-rw-r--r-- 1 andres andres 2347825 Nov 14 12:46 Don_Quixote.txt
-rw-r--r-- 1 andres andres  405900 Nov 14 12:46 Dubliners.txt
-rw-r--r-- 1 andres andres  748536 Nov 14 12:46 Educating_by_Story-Telling.txt
-rw-rw-r-- 1 andres andres 1938980 Nov 15 09:52 english
-rw-r--r-- 1 andres andres  141420 Nov 14 12:46 Metamorphosis.txt
-rw-r--r-- 1 andres andres  272277 Nov 14 12:46 Study_In_Scarlet.txt
ln gives new names to existing files
Each physical file on the disk can have several names
ln creates a new name for an existing file
You can see the number of names of a file in the second column of the output of ls -l
What happens when we use rm?
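One way to explore that question is a small experiment in a scratch directory. This is only a sketch: the file names are invented for illustration, and it assumes a filesystem that supports this kind of link.

```shell
# Hypothetical file names, used only for this experiment.
dir=$(mktemp -d)                  # scratch directory, nothing real is touched
cd "$dir" || exit 1

printf 'hello\n' > original.txt   # one physical file with one name
ln original.txt second_name.txt   # give the same file a second name

# The second column of ls -l is the number of names (links).
links_before=$(ls -l original.txt | awk '{print $2}')

rm original.txt                   # removes one name only
links_after=$(ls -l second_name.txt | awk '{print $2}')
content=$(cat second_name.txt)    # the data is still reachable

echo "names before rm: $links_before, after rm: $links_after, content: $content"
```

In this experiment rm takes away one name, and the data stays on disk while at least one name remains.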
We use grep to look for a pattern in one or more files
We use these options:
grep --color 'regex' file ...
grep --only 'regex' file ...
grep --count 'regex' file ...
The pattern is a regex, short for Regular Expression
A regex is a single piece of text that describes a whole set of words
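A small experiment can make the options concrete. This sketch builds its own sample file (the text is invented) and uses the short spellings -o and -c, which GNU grep documents as equivalents of --only-matching and --count.

```shell
sample=$(mktemp)
printf '%s\n' 'Sherlock met Sherlock' 'Watson' 'Sherlock' > "$sample"

# -c (--count) reports how many LINES match, not how many matches there are.
line_count=$(grep -c 'Sherlock' "$sample")

# -o (--only-matching) prints each match on its own line, so counting
# those lines gives the number of matches.
match_count=$(grep -o 'Sherlock' "$sample" | wc -l | tr -d ' ')

echo "matching lines: $line_count, matches: $match_count"
rm -f "$sample"
```

The two numbers differ whenever a line contains the pattern more than once.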
grep --color 'analyze' english
analyze analyzed analyzer analyzer's analyzers analyzes psychoanalyze psychoanalyzed psychoanalyzes
grep --color '^analyze' english
analyze analyzed analyzer analyzer's analyzers analyzes
The symbol ^ represents “start of line”
grep --color 'analyze$' english
analyze psychoanalyze
The symbol $ represents “end of line”
grep --color '^analyze$' english
analyze
Symbols ^ and $ are called “anchors”
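The effect of each anchor can be checked on a tiny word list. This is a sketch with a made-up four-word file standing in for the english word list.

```shell
words=$(mktemp)
printf '%s\n' analyze analyzed psychoanalyze psychoanalyzed > "$words"

anywhere=$(grep -c 'analyze' "$words")    # all four words contain it
starts=$(grep -c '^analyze' "$words")     # analyze, analyzed
ends=$(grep -c 'analyze$' "$words")       # analyze, psychoanalyze
exact=$(grep -c '^analyze$' "$words")     # analyze only

echo "anywhere=$anywhere starts=$starts ends=$ends exact=$exact"
rm -f "$words"
```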
Count how many times the word “Sherlock” appears in each text file in the Gutenberg folder
There are small differences between American and British versions of the English language
grep --color '^analyze$' english
analyze
grep --color '^analyse$' english
analyse
grep --color '^analy[sz]e$' english
analyse analyze
The symbols [ and ] indicate a character class
That is, one letter from a set of letters
A character class allows you to match a range or set of characters
Example: [aeiou] will match any (English) vowel
This matches “c”, followed by a vowel, followed by “t”
grep --only 'c[aeiou]t' english | sort |uniq
cat cet cit cot cut
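The same idea can be tried on a short, invented word list. This sketch sets LC_ALL=C only to keep the sort order predictable.

```shell
export LC_ALL=C                     # predictable sort order
words=$(mktemp)
printf '%s\n' cat cot cut scatter yacht analyse analyze > "$words"

# Unique three-letter pieces of the form c + vowel + t.
pieces=$(grep -o 'c[aeiou]t' "$words" | sort | uniq | paste -s -d ' ' -)

# Both spellings are matched by one regex.
spellings=$(grep -c '^analy[sz]e$' "$words")

echo "pieces: $pieces, spellings matched: $spellings"
rm -f "$words"
```

Here "scatter" contributes a second "cat", which uniq removes, and "yacht" contributes nothing because "h" is not in the class.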
We can also use character classes to specify characters we don’t want to match. These are called negated character classes
They are created by putting a caret ^ at the beginning of the class
This will match a “c”, followed by a non-vowel, followed by a “t”:
grep --only 'c[^aeiou]t' english | sort |uniq
cht ckt cst cyt
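A negated class can be tried the same way. The words below are invented test data. Note that "not a vowel" means any other character, so a space or a digit also qualifies.

```shell
export LC_ALL=C                     # predictable sort order
words=$(mktemp)
printf '%s\n' cat yacht cocktail ecstasy > "$words"

# c + anything that is not a vowel + t
pieces=$(grep -o 'c[^aeiou]t' "$words" | sort | uniq | paste -s -d ' ' -)

# A space is "not a vowel" too, so this line matches.
with_space=$(printf 'arc two\n' | grep -c 'c[^aeiou]t')

echo "pieces: $pieces, line with a space matched: $with_space"
rm -f "$words"
```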
Show the complete matching line with the pattern in color for the following regex
You can also match a range of characters using a character class. For example, [a-i] will match any of the letters between a and i (inclusive)
Character classes work with numbers too
This matches a year between 1000 and 9999:
grep '[1-9][0-9][0-9][0-9]' *txt
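Ranges can be checked on invented sample text. The sketch below also shows a limit of the four-digit pattern: it has no anchors, so it will also match the first four digits of a longer number.

```shell
export LC_ALL=C                     # ranges follow plain ASCII order

# Only a, e and i fall inside the range a-i.
letters=$(printf '%s\n' a e i j z | grep -c '^[a-i]$')

sample=$(mktemp)
printf '%s\n' 'Published in 1887' 'Chapter 12' 'Serial 123456' > "$sample"

# -o prints just the matched digits.
found=$(grep -o '[1-9][0-9][0-9][0-9]' "$sample" | paste -s -d ' ' -)

echo "letters in range: $letters, four-digit matches: $found"
rm -f "$sample"
```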
The symbol . represents any single character
grep --only 'c.t' english | sort |uniq
cat cet cht cit ckt cot cst cut cyt
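A quick check on invented input shows that the dot stands for exactly one character of any kind, including a space or a digit.

```shell
# cat, "c t" and c9t match; ct has nothing between c and t,
# and coat has two characters between them.
hits=$(printf '%s\n' cat 'c t' c9t ct coat | grep -c 'c.t')

echo "matching lines: $hits"
```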
Find all the lines ending with “Holmes” followed by a single character
The * symbol means that the expression before it is repeated zero or more times
That is, it follows an expression that is optional and may be repeated
grep '^colou*r$' english
color colour
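The star can be tested on a few made-up spellings. "colouur" is not a real word; it is there only to show that the "u" may repeat.

```shell
# u* accepts no u, one u, or several; the o before it is still required.
hits=$(printf '%s\n' color colour colouur colr | grep '^colou*r$' | paste -s -d ' ' -)

echo "matched: $hits"
```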
The characters . * [ ] ^ $ are special
They are called meta-characters
How can we look for them?
To take out the “superpowers”, we put \ before the character: \. , \* , \[ , \] , \^ , \$ , and \\
The rest of the characters match themselves
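The difference an escape makes can be seen on invented input. In this sketch the unescaped dot matches any character, while the escaped one matches only a real dot.

```shell
# '3.14' matches both lines because . accepts the 5 as well.
unescaped=$(printf '%s\n' '3.14' '3514' | grep -c '3.14')

# '3\.14' matches only the line with a literal dot.
escaped=$(printf '%s\n' '3.14' '3514' | grep -c '3\.14')

# A literal backslash is written as \\ in the pattern.
backslash=$(printf '%s\n' 'a\b' 'ab' | grep -c 'a\\b')

echo "unescaped=$unescaped escaped=$escaped backslash=$backslash"
```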
Look for “Holmes.”
Instead of grep we will use egrep
Now the characters .?*[]^${}()+|\ are special
As before, we can always escape them
? is like * but means “zero or one time”
egrep --only 'lo?k' english |sort |uniq
lk lok
egrep --only 'lo*k' english |sort |uniq
lk lok look
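The two operators can be compared side by side on a three-word list. This sketch uses grep -E, which accepts the same extended syntax as egrep, and adds anchors so that whole lines are compared.

```shell
# ? allows at most one o.
zero_or_one=$(printf '%s\n' lk lok look | grep -E '^lo?k$' | paste -s -d ' ' -)

# * allows any number of o.
zero_or_more=$(printf '%s\n' lk lok look | grep -E '^lo*k$' | paste -s -d ' ' -)

echo "with ?: $zero_or_one"
echo "with *: $zero_or_more"
```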
+ means one or more times
That is, [a-z][a-z]* is the same as [a-z]+
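The equivalence can be checked by running both patterns on the same invented input and comparing the results. This is a sketch using grep -E, which accepts the same extended syntax as egrep.

```shell
export LC_ALL=C                     # keep a-z to plain ASCII letters
input=$(printf '%s\n' 'abc 123 de' '42 x')

plus=$(printf '%s\n' "$input" | grep -E -o '[a-z]+' | paste -s -d ' ' -)
star=$(printf '%s\n' "$input" | grep -E -o '[a-z][a-z]*' | paste -s -d ' ' -)

echo "with +: $plus"
echo "with *: $star"
```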
We can use curly braces to repeat something a number of times within a range:
^a{3,5}$
That will match the letter “a” repeated 3, 4, or 5 times.
If you want to match something repeated up to a certain number of times, you can use 0 as the first number.
If you want to match something repeated at least a certain number of times, with no maximum, you can just leave the second number blank:
^a{3,}$
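The three forms of the braces can be compared on a few invented lines of repeated "a". This sketch uses grep -E, which accepts the same extended syntax as egrep.

```shell
lines=$(printf '%s\n' a aa aaa aaaaa aaaaaa)

three_to_five=$(printf '%s\n' "$lines" | grep -E -c '^a{3,5}$')   # aaa, aaaaa
up_to_two=$(printf '%s\n' "$lines" | grep -E -c '^a{0,2}$')       # a, aa
three_or_more=$(printf '%s\n' "$lines" | grep -E -c '^a{3,}$')    # aaa, aaaaa, aaaaaa

echo "{3,5}: $three_to_five  {0,2}: $up_to_two  {3,}: $three_or_more"
```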
If you want to match either of two different expressions, you can use |
egrep 'cat|dog' english
Alcatraz      Yucatan      adjudicate    advocate
Decatur       abdicate     adjudicated   advocated
Hecate        abdicated    adjudicates   advocates
Ladoga        abdicates    adjudicating  advocating
Mercator      abdicating   adjudication  allocate
Muscat        abdication   adjudicator   allocated
Popocatepetl  abdications  adjudicators  allocates
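As the output shows, the pattern matches inside longer words. A possible way to keep only whole words is the -w option of grep. It is not part of these notes and is shown here only as a sketch on invented input.

```shell
words=$(mktemp)
printf '%s\n' Alcatraz Ladoga 'a cat sat' bird > "$words"

# Matches anywhere in the line: Alcatraz, Ladoga and "a cat sat".
anywhere=$(grep -E -c 'cat|dog' "$words")

# -w accepts a match only when it is a complete word.
whole_words=$(grep -E -c -w 'cat|dog' "$words")

echo "anywhere: $anywhere, whole words: $whole_words"
rm -f "$words"
```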
Look for cats and dogs in all the text files
We can use ( and ) to define groups of expressions
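A small sketch of why grouping matters: a repetition such as {2} applies to the whole group when there are parentheses, and only to the last item when there are none. The two test words are chosen for illustration, and grep -E accepts the same extended syntax as egrep.

```shell
# With a group: (vowel, non-vowel) twice in a row.
grouped=$(printf 'banana\n' | grep -E -o '([aeiou][^aeiou]){2}')

# Without a group: one vowel followed by two non-vowels.
ungrouped=$(printf 'apple\n' | grep -E -o '[aeiou][^aeiou]{2}')

echo "grouped: $grouped, ungrouped: $ungrouped"
```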
egrep --only '([aeiou][^aeiou]){2}' english |sort |uniq
aliy uran arer ured alit ole' amat edim erat uter aron urim arin ures aliz olic elod edom emun uper asid urin aris urin anim itic amat eter erat uter elar urit ines umin ated itiv ical edom emun uper ilen anis aten umul ates ilat elod eter erat uter aham urit ater ativ atin enat elon edom emun uper alom anis atif umul ator ened ane' eter erat ivit uja' urit icat ativ idis enin anes ekab emun uper apul anis atif umul ilat oses eme' eked erat ivit ure' urus icat ifor eral osis emen ekin enal uper