November 28, 2019

Regular Expression Flavors

The idea of regular expressions comes from Linguistics and Mathematics

Different people implemented slightly different computational versions

As a result, there are small differences between programs. We say that regular expressions come in different “flavors”

Basic vs. Extended Regular Expressions

In both cases the characters .*[]^$\ are special

In extended regex the symbols ?{}()+| are also special

You need to use \ to make them “normal”

In basic regex the symbols ?{}()+| are “normal”

But in many programs (GNU grep, for example) you can use \ to make them “special”

The extended regex (ab)+ can be written as the basic regex \(ab\)\+
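
As a small illustration of the two flavors, the sketch below runs the same pattern through grep in both modes. The file flavors_demo.txt and its three lines are made up for this example, and the \+ form assumes GNU grep.

```shell
# Hypothetical sample input: three short lines.
printf '%s\n' 'ab' 'abab' '(ab)+' > flavors_demo.txt

# Extended regex: ( ) groups and + repeats, so "ab" and "abab" match.
grep -E '^(ab)+$' flavors_demo.txt

# Basic regex: ( ) and + are normal characters, so only the line "(ab)+" matches.
grep '^(ab)+$' flavors_demo.txt

# Basic regex with backslashes (assumes GNU grep): same meaning as the extended one.
grep '^\(ab\)\+$' flavors_demo.txt
```

The ^ and $ anchors force the whole line to match, which makes the difference between the flavors easy to see.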

grep, egrep, fgrep

grep
search for basic regular expressions
egrep
search for extended regular expressions
fgrep
search for fixed patterns

Example: searching for [ab]|[cd] has a different meaning in each case
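
The three meanings can be checked on a tiny input. This is only a sketch: the file meaning_demo.txt and its lines are invented for the example, and it uses grep -E and grep -F, the option forms equivalent to egrep and fgrep.

```shell
# Hypothetical sample input, one candidate per line.
printf '%s\n' 'a' 'c' 'a|c' '[ab]|[cd]' 'xyz' > meaning_demo.txt

# Basic regex: | is a normal character, so we need a or b, then |, then c or d.
grep '[ab]|[cd]' meaning_demo.txt      # prints: a|c

# Extended regex: | means OR, so any line containing a, b, c or d matches.
grep -E '[ab]|[cd]' meaning_demo.txt   # prints every line except xyz

# Fixed pattern: no character is special, only the literal text matches.
grep -F '[ab]|[cd]' meaning_demo.txt   # prints: [ab]|[cd]
```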

grep patterns must be written in quotes

grep and egrep patterns are regular expressions

Regular expressions can contain ? and *

The shell treats unquoted ? and * as wildcards

Wildcards are expanded by the shell before being passed as arguments to the program

If a file name matches, grep receives that file name instead of the pattern you wrote

Always wrap the grep pattern in single quotes '
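
To make the expansion visible, the sketch below uses echo in place of grep, so we can see exactly which arguments the program would receive. The directory quoting_demo and the two file names are invented for the example.

```shell
# Hypothetical scratch directory with two files whose names start with "ab".
mkdir -p quoting_demo
touch quoting_demo/abc quoting_demo/abd

# Unquoted: the shell replaces the wildcard with the matching file names.
echo quoting_demo/ab*      # prints: quoting_demo/abc quoting_demo/abd

# Single-quoted: the text reaches the program exactly as written.
echo 'quoting_demo/ab*'    # prints: quoting_demo/ab*
```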

Examples from last year’s midterm exam

Data file is world_2007.txt

world_2007.txt is a TAB-separated file. The columns represent “country”, “population”, “continent”, “life expectancy”, and “GDP per capita”.

head world_2007.txt
Afghanistan 31889923    Asia    43.828  974.5803384
Albania 3600523 Europe  76.423  5937.029526
Algeria 33333216    Africa  72.301  6223.367465
Angola  12420476    Africa  42.731  4797.231267
Argentina   40301927    Americas    75.32   12779.37964
Australia   20434176    Oceania 81.235  34435.36744
Austria 8199783 Europe  79.829  36126.4927
Bahrain 708573  Asia    75.635  29796.04834
Bangladesh  150448339   Asia    64.062  1391.253792
Belgium 10392226    Europe  79.441  33692.60508

Some questions

  • Show the life expectancy and population of Turkey
  • Which is the African country with 135031164 inhabitants?
  • Which world countries have bigger populations than Nigeria?
  • Which countries have a little more than 9 million people?
  • What is the total GDP of India?

Show the life expectancy and population of Turkey

world_2007.txt is a TAB-separated file. The columns represent “country”, “population”, “continent”, “life expectancy”, and “GDP per capita”.

grep 'Turkey' world_2007.txt | cut -f 4,2
71158647    71.777

But that is the wrong order! cut always prints fields in the order they appear in the file, whatever order we give to -f

Which is the African country with 135031164 inhabitants?

grep '135031164.*Africa' world_2007.txt
Nigeria 135031164   Africa  46.859  2013.977305

Which world countries have bigger populations than Nigeria?

Which countries have a little more than 9 million people?
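
One possible approach with an extended regex, offered as a sketch and not necessarily the expected exam answer: a population of "a little more than 9 million" is a 9 followed by exactly six digits, standing alone in its column. The file sample_9m.txt is a hypothetical five-row extract in the same layout as world_2007.txt.

```shell
# Hypothetical five-row sample, TAB-separated (rows copied from the data file).
printf '%s\t%s\t%s\t%s\t%s\n' \
  Bahrain 708573 Asia 75.635 29796.04834 \
  Belgium 10392226 Europe 79.441 33692.60508 \
  Bolivia 9119152 Americas 65.554 3822.137084 \
  Philippines 91077287 Asia 71.688 3190.481016 \
  Sweden 9031088 Europe 80.884 33859.74835 > sample_9m.txt

# A 9 followed by exactly six digits, with whitespace (the TABs) on both sides.
grep -E '[[:space:]]9[0-9]{6}[[:space:]]' sample_9m.txt
```

Only Bolivia and Sweden are printed. The Philippines row also has a population starting with 9, but it has eight digits, so the whitespace on both sides rules it out.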

What is the total GDP of India?

Total GDP = GDP per capita * population

This is a question that we cannot solve with grep and cut

For these advanced questions we have another tool, called awk

AWK

Created by Aho, Weinberger, and Kernighan; the name AWK comes from their initials

AWK is a programming language that permits:

  • easy manipulation of structured data
  • generation of formatted reports

What is the total GDP of India?

awk '$1=="India" {print $2 * $5}' world_2007.txt
2.72293e+12

In awk, each statement has two parts:

  • a condition
  • a block of commands to run when the condition is true

AWK works row by row

We write

awk 'statements' file1 file2 ...

awk processes each line of each file, one by one

It can process billions of lines, one by one

Each line is split on whitespace into fields

The columns of the file are called fields, the lines are called records
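
Splitting on whitespace works for world_2007.txt because country names are written with underscores (New_Zealand). The sketch below shows what would happen otherwise, using a made-up record in fields_demo.txt and the -F option, which these notes do not use elsewhere.

```shell
# A made-up record whose first column contains a space.
printf 'New Zealand\t4115771\tOceania\n' > fields_demo.txt

# Default splitting: any run of spaces or TABs separates fields.
awk '{print NF}' fields_demo.txt           # prints: 4

# With -F we ask awk to split on TAB only.
awk -F '\t' '{print NF}' fields_demo.txt   # prints: 3
```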

AWK commands

awk has only a few commands

You can always run man awk to see the full list

For today we will use only print

print can take one or more arguments, separated by commas

The values of the arguments are printed to the standard output on one line, separated by a space. Each print starts a new line
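
A minimal sketch of print with two arguments. The file turkey_demo.txt is a hypothetical one-row extract of world_2007.txt. Unlike cut in the Turkey question, awk prints the fields in the order we write them.

```shell
# One row copied from world_2007.txt, kept TAB-separated.
printf 'Turkey\t71158647\tEurope\t71.777\t8458.276384\n' > turkey_demo.txt

# Arguments separated by commas are printed with a space between them,
# in the order we wrote them: life expectancy first, then population.
awk '{print $4, $2}' turkey_demo.txt   # prints: 71.777 71158647
```

On the full file the same command needs the condition $1=="Turkey" in front of the braces.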

AWK automatic variables

There are many automatic variables in awk

$1, $2, and so on are the fields of each row

$0 is the complete input line

NF is the Number of Fields in the current record

NR is the Number of the current Record

AWK is not the shell

This is a different language

Things are different than in the shell

In particular the symbol $ is only used for fields

awk variables do not need $ to be read
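
The difference can be seen in a short sketch. The variable name n and the input text are arbitrary choices for this example, and the assignment n = 2 is a feature not covered in these notes.

```shell
# n is an awk variable that we set ourselves.
# Plain n is its value; $n means "the field whose number is n".
# In the same way, NF is a number and $NF is the field in that position.
printf 'alpha beta gamma\n' | awk '{n = 2; print n, $n, NF, $NF}'
# prints: 2 beta 3 gamma
```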

AWK conditions

The basic conditions are comparisons

awk '$1=="Turkey"' world_2007.txt
Turkey  71158647    Europe  71.777  8458.276384
awk '$2>135031164' world_2007.txt
Bangladesh  150448339   Asia    64.062  1391.253792
Brazil  190010647   Americas    72.39   9065.800825
China   1318683096  Asia    72.961  4959.114854
India   1110396331  Asia    64.698  2452.210407
Indonesia   223547000   Asia    70.65   3540.651564
Pakistan    169270617   Asia    65.483  2605.94758
United_States   301139947   Americas    78.242  42951.65309

awk statements have two optional parts

Each awk statement is like this

condition {command}

We can omit the command. awk will then print the whole line

We can omit the condition. The command will then run for every line

We can omit {print $0}

awk '$1=="Turkey" {print $0}' world_2007.txt
Turkey  71158647    Europe  71.777  8458.276384

can be shortened to

awk '$1=="Turkey"' world_2007.txt
Turkey  71158647    Europe  71.777  8458.276384

Complex conditions

We can combine small conditions to make longer conditions

Population over 9 million AND (&&) population less than 10 million

awk '$2 > 9000000 && $2 < 10000000' world_2007.txt
Bolivia 9119152 Americas    65.554  3822.137084
Dominican_Republic  9319622 Americas    72.235  6025.374752
Guinea  9947814 Africa  56.007  942.6542111
Hungary 9956108 Europe  73.338  18008.94444
Somalia 9118773 Africa  48.159  926.1410683
Sweden  9031088 Europe  80.884  33859.74835

Complex condition: Eurasian countries

Continent is “Europe” OR (||) “Asia”

awk '$3=="Europe" || $3=="Asia"' world_2007.txt
Afghanistan 31889923    Asia    43.828  974.5803384
Albania 3600523 Europe  76.423  5937.029526
Austria 8199783 Europe  79.829  36126.4927
Bahrain 708573  Asia    75.635  29796.04834
Bangladesh  150448339   Asia    64.062  1391.253792
Belgium 10392226    Europe  79.441  33692.60508
Bosnia_and_Herzegovina  4552198 Europe  74.852  7446.298803
Bulgaria    7322858 Europe  73.005  10680.79282
Cambodia    14131858    Asia    59.723  1713.778686
China   1318683096  Asia    72.961  4959.114854
Croatia 4493312 Europe  75.748  14619.22272
Czech_Republic  10228744    Europe  76.486  22833.30851
Denmark 5468120 Europe  78.332  35278.41874
Finland 5238460 Europe  79.313  33207.0844
France  61083916    Europe  80.657  30470.0167
Germany 82400996    Europe  79.406  32170.37442
Greece  10706290    Europe  79.483  27538.41188
Hong_Kong,_China    6980412 Asia    82.208  39724.97867
Hungary 9956108 Europe  73.338  18008.94444
Iceland 301931  Europe  81.757  36180.78919
India   1110396331  Asia    64.698  2452.210407
Indonesia   223547000   Asia    70.65   3540.651564
Iran    69453570    Asia    70.964  11605.71449
Iraq    27499638    Asia    59.545  4471.061906
Ireland 4109086 Europe  78.885  40675.99635
Israel  6426679 Asia    80.745  25523.2771
Italy   58147733    Europe  80.546  28569.7197
Japan   127467972   Asia    82.603  31656.06806
Jordan  6053193 Asia    72.535  4519.461171
Korea,_Dem._Rep.    23301725    Asia    67.297  1593.06548
Korea,_Rep. 49044790    Asia    78.623  23348.13973
Kuwait  2505559 Asia    77.588  47306.98978
Lebanon 3921278 Asia    71.993  10461.05868
Malaysia    24821286    Asia    74.241  12451.6558
Mongolia    2874127 Asia    66.803  3095.772271
Montenegro  684736  Europe  74.543  9253.896111
Myanmar 47761980    Asia    62.069  944
Nepal   28901790    Asia    63.785  1091.359778
Netherlands 16570613    Europe  79.762  36797.93332
Norway  4627926 Europe  80.196  49357.19017
Oman    3204897 Asia    75.64   22316.19287
Pakistan    169270617   Asia    65.483  2605.94758
Philippines 91077287    Asia    71.688  3190.481016
Poland  38518241    Europe  75.563  15389.92468
Portugal    10642836    Europe  78.098  20509.64777
Romania 22276056    Europe  72.476  10808.47561
Saudi_Arabia    27601038    Asia    72.777  21654.83194
Serbia  10150265    Europe  74.002  9786.534714
Singapore   4553009 Asia    79.972  47143.17964
Slovak_Republic 5447502 Europe  74.663  18678.31435
Slovenia    2009245 Europe  77.926  25768.25759
Spain   40448191    Europe  80.941  28821.0637
Sri_Lanka   20378239    Asia    72.396  3970.095407
Sweden  9031088 Europe  80.884  33859.74835
Switzerland 7554661 Europe  81.701  37506.41907
Syria   19314747    Asia    74.143  4184.548089
Taiwan  23174294    Asia    78.4    28718.27684
Thailand    65068149    Asia    70.616  7458.396327
Turkey  71158647    Europe  71.777  8458.276384
United_Kingdom  60776238    Europe  79.425  33203.26128
Vietnam 85262356    Asia    74.249  2441.576404
West_Bank_and_Gaza  4018332 Asia    73.422  3025.349798
Yemen,_Rep. 22211743    Asia    62.698  2280.769906

Advanced conditions

We can use regular expressions as conditions

awk '/Turkey/' world_2007.txt
Turkey  71158647    Europe  71.777  8458.276384

is the same as

grep 'Turkey' world_2007.txt
Turkey  71158647    Europe  71.777  8458.276384
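
A regular expression matches anywhere in the line, which is not always what we want. The sketch below compares it with an exact comparison, using a hypothetical three-row extract of world_2007.txt saved as guinea_demo.txt.

```shell
# Three rows copied from world_2007.txt, kept TAB-separated.
printf '%s\t%s\t%s\t%s\t%s\n' \
  Equatorial_Guinea 551201 Africa 51.579 12154.08975 \
  Guinea 9947814 Africa 56.007 942.6542111 \
  Guinea-Bissau 1472041 Africa 46.388 579.231743 > guinea_demo.txt

# Regular expression: matches any line that contains the text "Guinea".
awk '/Guinea/ {print $1}' guinea_demo.txt      # prints all three names

# Comparison: matches only when the first field is exactly "Guinea".
awk '$1=="Guinea" {print $1}' guinea_demo.txt  # prints: Guinea
```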

Print the record number

Avoiding head

We can decide to print or not based on the row number

awk 'NR <= 10 {print NR, $0}' science.txt
1 The Electronic Telegraph  Thursday 28 September 1995  Science
2 
3 This summer the Royal Observatory at Herstmonceux
4 found new life as a science centre. Andro Linklater
5 celebrates a partial victory for the heritage
6 
7 THE SIGHT of a child's top spinning unsupported in mid-air should have been
8 surprising. Rotating there in space, it not only defied the rules of gravity,
9 it defied common sense, and at least three Fellows of the Royal Society gazed
10 at it in something close to wonder.

Counting words

Each word is a field

awk 'NR <= 10 {print NR, NF, $0}' science.txt
1 8 The Electronic Telegraph  Thursday 28 September 1995  Science
2 0 
3 7 This summer the Royal Observatory at Herstmonceux
4 9 found new life as a science centre. Andro Linklater
5 7 celebrates a partial victory for the heritage
6 0 
7 13 THE SIGHT of a child's top spinning unsupported in mid-air should have been
8 13 surprising. Rotating there in space, it not only defied the rules of gravity,
9 14 it defied common sense, and at least three Fellows of the Royal Society gazed
10 7 at it in something close to wonder.

Print non-empty lines
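
One way to do this, sketched on a made-up file lines_demo.txt: an empty line has no fields, so a condition on NF is enough.

```shell
# Hypothetical four-line text with one empty line in the middle.
printf 'first line\n\nthird line here\nlast\n' > lines_demo.txt

# An empty line has no fields, so NF is 0 and the condition is false.
# With no command, awk prints the lines where the condition is true.
awk 'NF > 0' lines_demo.txt
```

The three lines with text are printed. Lines that contain only spaces or TABs also have NF equal to 0, so they are skipped as well.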

Print the last word of every line
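
A possible solution, sketched on a made-up file words_demo.txt: NF is the number of fields, so $NF is the field in the last position.

```shell
# Hypothetical three-line text; every line has a different number of words.
printf 'first line\nthird line here\nlast\n' > words_demo.txt

# $NF is the last field of the current line, whatever its position.
awk '{print $NF}' words_demo.txt
```

This prints line, here and last, one per line. On an empty line NF is 0, so $NF is $0 and an empty line is printed; adding the condition NF > 0 avoids that.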

Command without condition

If there is no condition, the command is run for every line

awk '{print $0, $2 * $5}' world_2007.txt
Afghanistan 31889923    Asia    43.828  974.5803384 3.10793e+10
Albania 3600523 Europe  76.423  5937.029526 2.13764e+10
Algeria 33333216    Africa  72.301  6223.367465 2.07445e+11
Angola  12420476    Africa  42.731  4797.231267 5.95839e+10
Argentina   40301927    Americas    75.32   12779.37964 5.15034e+11
Australia   20434176    Oceania 81.235  34435.36744 7.03658e+11
Austria 8199783 Europe  79.829  36126.4927 2.96229e+11
Bahrain 708573  Asia    75.635  29796.04834 2.11127e+10
Bangladesh  150448339   Asia    64.062  1391.253792 2.09312e+11
Belgium 10392226    Europe  79.441  33692.60508 3.50141e+11
Benin   8078314 Africa  56.728  1441.284873 1.16432e+10
Bolivia 9119152 Americas    65.554  3822.137084 3.48546e+10
Bosnia_and_Herzegovina  4552198 Europe  74.852  7446.298803 3.3897e+10
Botswana    1639131 Africa  50.728  12569.85177 2.06036e+10
Brazil  190010647   Americas    72.39   9065.800825 1.7226e+12
Bulgaria    7322858 Europe  73.005  10680.79282 7.82139e+10
Burkina_Faso    14326203    Africa  52.295  1217.032994 1.74355e+10
Burundi 8390505 Africa  49.58   430.0706916 3.60851e+09
Cambodia    14131858    Asia    59.723  1713.778686 2.42189e+10
Cameroon    17696293    Africa  50.43   2042.09524 3.61375e+10
Canada  33390141    Americas    80.653  36319.23501 1.2127e+12
Central_African_Republic    4369038 Africa  44.741  706.016537 3.08461e+09
Chad    10238807    Africa  50.651  1704.063724 1.74476e+10
Chile   16284741    Americas    78.553  13171.63885 2.14497e+11
China   1318683096  Asia    72.961  4959.114854 6.5395e+12
Colombia    44227550    Americas    72.889  7006.580419 3.09884e+11
Comoros 710960  Africa  65.152  986.1478792 7.01112e+08
Congo,_Dem._Rep.    64606759    Africa  46.462  277.5518587 1.79317e+10
Congo,_Rep. 3800610 Africa  55.322  3632.557798 1.38059e+10
Costa_Rica  4133884 Americas    78.782  9645.06142 3.98716e+10
Cote_d'Ivoire   18013409    Africa  48.328  1544.750112 2.78262e+10
Croatia 4493312 Europe  75.748  14619.22272 6.56887e+10
Cuba    11416987    Americas    78.273  8948.102923 1.0216e+11
Czech_Republic  10228744    Europe  76.486  22833.30851 2.33556e+11
Denmark 5468120 Europe  78.332  35278.41874 1.92907e+11
Djibouti    496374  Africa  54.791  2082.481567 1.03369e+09
Dominican_Republic  9319622 Americas    72.235  6025.374752 5.61542e+10
Ecuador 13755680    Americas    74.994  6873.262326 9.45464e+10
Egypt   80264543    Africa  71.338  5581.180998 4.47971e+11
El_Salvador 6939688 Americas    71.878  5728.353514 3.9753e+10
Equatorial_Guinea   551201  Africa  51.579  12154.08975 6.69935e+09
Eritrea 4906585 Africa  58.04   641.3695236 3.14693e+09
Ethiopia    76511887    Africa  52.947  690.8055759 5.28548e+10
Finland 5238460 Europe  79.313  33207.0844 1.73954e+11
France  61083916    Europe  80.657  30470.0167 1.86123e+12
Gabon   1454867 Africa  56.735  13206.48452 1.92137e+10
Gambia  1688359 Africa  59.448  752.7497265 1.27091e+09
Germany 82400996    Europe  79.406  32170.37442 2.65087e+12
Ghana   22873338    Africa  60.022  1327.60891 3.03668e+10
Greece  10706290    Europe  79.483  27538.41188 2.94834e+11
Guatemala   12572928    Americas    70.259  5186.050003 6.52038e+10
Guinea  9947814 Africa  56.007  942.6542111 9.37735e+09
Guinea-Bissau   1472041 Africa  46.388  579.231743 8.52653e+08
Haiti   8502814 Americas    60.916  1201.637154 1.02173e+10
Honduras    7483763 Americas    70.198  3548.330846 2.65549e+10
Hong_Kong,_China    6980412 Asia    82.208  39724.97867 2.77297e+11
Hungary 9956108 Europe  73.338  18008.94444 1.79299e+11
Iceland 301931  Europe  81.757  36180.78919 1.09241e+10
India   1110396331  Asia    64.698  2452.210407 2.72293e+12
Indonesia   223547000   Asia    70.65   3540.651564 7.91502e+11
Iran    69453570    Asia    70.964  11605.71449 8.06058e+11
Iraq    27499638    Asia    59.545  4471.061906 1.22953e+11
Ireland 4109086 Europe  78.885  40675.99635 1.67141e+11
Israel  6426679 Asia    80.745  25523.2771 1.6403e+11
Italy   58147733    Europe  80.546  28569.7197 1.66126e+12
Jamaica 2780132 Americas    72.567  7320.880262 2.0353e+10
Japan   127467972   Asia    82.603  31656.06806 4.03513e+12
Jordan  6053193 Asia    72.535  4519.461171 2.73572e+10
Kenya   35610177    Africa  54.11   1463.249282 5.21066e+10
Korea,_Dem._Rep.    23301725    Asia    67.297  1593.06548 3.71212e+10
Korea,_Rep. 49044790    Asia    78.623  23348.13973 1.1451e+12
Kuwait  2505559 Asia    77.588  47306.98978 1.1853e+11
Lebanon 3921278 Asia    71.993  10461.05868 4.10207e+10
Lesotho 2012649 Africa  42.592  1569.331442 3.15851e+09
Liberia 3193942 Africa  45.678  414.5073415 1.32391e+09
Libya   6036914 Africa  73.952  12057.49928 7.27901e+10
Madagascar  19167654    Africa  59.443  1044.770126 2.00258e+10
Malawi  13327079    Africa  48.303  759.3499101 1.01199e+10
Malaysia    24821286    Asia    74.241  12451.6558 3.09066e+11
Mali    12031795    Africa  54.467  1042.581557 1.25441e+10
Mauritania  3270065 Africa  64.164  1803.151496 5.89642e+09
Mauritius   1250882 Africa  72.801  10956.99112 1.37059e+10
Mexico  108700891   Americas    76.195  11977.57496 1.30197e+12
Mongolia    2874127 Asia    66.803  3095.772271 8.89764e+09
Montenegro  684736  Europe  74.543  9253.896111 6.33648e+09
Morocco 33757175    Africa  71.164  3820.17523 1.28958e+11
Mozambique  19951656    Africa  42.082  823.6856205 1.64339e+10
Myanmar 47761980    Asia    62.069  944 45087309120
Namibia 2055080 Africa  52.906  4811.060429 9.88711e+09
Nepal   28901790    Asia    63.785  1091.359778 3.15423e+10
Netherlands 16570613    Europe  79.762  36797.93332 6.09764e+11
New_Zealand 4115771 Oceania 80.204  25185.00911 1.03656e+11
Nicaragua   5675356 Americas    72.899  2749.320965 1.56034e+10
Niger   12894865    Africa  56.867  619.6768924 7.99065e+09
Nigeria 135031164   Africa  46.859  2013.977305 2.7195e+11
Norway  4627926 Europe  80.196  49357.19017 2.28421e+11
Oman    3204897 Asia    75.64   22316.19287 7.15211e+10
Pakistan    169270617   Asia    65.483  2605.94758 4.4111e+11
Panama  3242173 Americas    75.537  9809.185636 3.18031e+10
Paraguay    6667147 Americas    71.752  4172.838464 2.78209e+10
Peru    28674757    Americas    71.421  7408.905561 2.12449e+11
Philippines 91077287    Asia    71.688  3190.481016 2.9058e+11
Poland  38518241    Europe  75.563  15389.92468 5.92793e+11
Portugal    10642836    Europe  78.098  20509.64777 2.18281e+11
Puerto_Rico 3942491 Americas    78.746  19328.70901 7.62033e+10
Reunion 798094  Africa  76.442  7670.122558 6.12148e+09
Romania 22276056    Europe  72.476  10808.47561 2.4077e+11
Rwanda  8860588 Africa  46.242  863.0884639 7.64747e+09
Sao_Tome_and_Principe   199579  Africa  65.528  1598.435089 3.19014e+08
Saudi_Arabia    27601038    Asia    72.777  21654.83194 5.97696e+11
Senegal 12267493    Africa  63.062  1712.472136 2.10077e+10
Serbia  10150265    Europe  74.002  9786.534714 9.93359e+10
Sierra_Leone    6144562 Africa  42.568  862.5407561 5.29994e+09
Singapore   4553009 Asia    79.972  47143.17964 2.14643e+11
Slovak_Republic 5447502 Europe  74.663  18678.31435 1.0175e+11
Slovenia    2009245 Europe  77.926  25768.25759 5.17747e+10
Somalia 9118773 Africa  48.159  926.1410683 8.44527e+09
South_Africa    43997828    Africa  49.339  9269.657808 4.07845e+11
Spain   40448191    Europe  80.941  28821.0637 1.16576e+12
Sri_Lanka   20378239    Asia    72.396  3970.095407 8.09036e+10
Sudan   42292929    Africa  58.556  2602.394995 1.10063e+11
Swaziland   1133066 Africa  39.613  4513.480643 5.11407e+09
Sweden  9031088 Europe  80.884  33859.74835 3.0579e+11
Switzerland 7554661 Europe  81.701  37506.41907 2.83348e+11
Syria   19314747    Asia    74.143  4184.548089 8.08235e+10
Taiwan  23174294    Asia    78.4    28718.27684 6.65526e+11
Tanzania    38139640    Africa  52.517  1107.482182 4.2239e+10
Thailand    65068149    Asia    70.616  7458.396327 4.85304e+11
Togo    5701579 Africa  58.42   882.9699438 5.03432e+09
Trinidad_and_Tobago 1056608 Americas    69.819  18008.50924 1.90279e+10
Tunisia 10276158    Africa  73.923  7092.923025 7.2888e+10
Turkey  71158647    Europe  71.777  8458.276384 6.0188e+11
Uganda  29170398    Africa  51.542  1056.380121 3.0815e+10
United_Kingdom  60776238    Europe  79.425  33203.26128 2.01797e+12
United_States   301139947   Americas    78.242  42951.65309 1.29345e+13
Uruguay 3447496 Americas    76.384  10611.46299 3.6583e+10
Venezuela   26084662    Americas    73.747  11415.80569 2.97777e+11
Vietnam 85262356    Asia    74.249  2441.576404 2.08175e+11
West_Bank_and_Gaza  4018332 Asia    73.422  3025.349798 1.21569e+10
Yemen,_Rep. 22211743    Asia    62.698  2280.769906 5.06599e+10
Zambia  11746035    Africa  42.384  1271.211593 1.49317e+10
Zimbabwe    12311143    Africa  43.487  469.7092981 5.78266e+09