April 26th, 2016
DNA replication is not perfect. Some bases can change
If these changes are lethal, the cell dies (by definition)
Therefore we only see changes that are compatible with cell life
This is one component of natural selection
Naturally, if two genes have the same sequence, they encode the same protein
If they differ in a few bases, the proteins will also differ a little, or less (why?)
So if two proteins are very similar, they probably do the same function
A few changes will probably not change they way it folds
Hypothesis: Same shape, same function.
A graphical method to compare two protein or DNA sequences and identify regions of close similarity between them.
A two-dimensional image, with the two sequences along the axes. The dotPlot displays a dot at points where there is an identical letter in the two sequences.
We can create a dotPlot for two sequences using the “dotPlot()” function in the SeqinR R package.
prot <- read.fasta("dehydrin.faa") dotPlot(prot$AA3, prot$K102)
dotPlot(prot[[1]], prot[[2]])
dotPlot(prot[[1]], prot[[2]], wsize = 2, nmatch = 2)
dotPlot(prot[[1]], prot[[2]], wsize = 3, nmatch = 3)
Try on your own with different sequences
How do you interpret them?
Can you do it in Excel?
dotPlot(prot$Dicktoo, prot$Seq4982, wsize=3, nmatch=3)
dotPlot(prot$Dicktoo, prot$LK8, wsize=3, nmatch=3)
what does this mean?
dotPlot shows a visual representation of the similarity between sequences.
The question is to what extent the two sequences are similar.
To quantify similarity, it is necessary to align the two sequences, and then you can calculate a similarity score based on the alignment.
To align two sequences we simply put each one in consecutive rows, inserting “gaps” if necessary to make them equal length. For example
G A A T T C G A T T - A
Another option is
G A A T T C G A - T T A
Which one is better?
The first step in computing a alignment is to decide on a scoring system. For example, we may decide to give a score of +2 to a match and a penalty of -1 to a mismatch, and a penalty of -2 to a gap. Thus, for the alignment:
G A A T T C G A T T - A
we would compute a score of 2 + 2 -1 + 2 -2 - 1 = 2. Similarly, the score for this
G A A T T C G A - T T A
is 2 + 2 -2 + 2 + 2 -1 = 5:
If there are several ways to align two sequences, which one do we choose?
An alignment is optimal if it has the maximum score.
That is, the optimal alignment has a score greater or equal to the scores of all possible alignments
There are “efficient” methods to find the optimal alginment
The scoring system can be represented by a scoring matrix (also known as a substitution matrix).
The scoring matrix has one row and one column for each possible letter in our alphabet of letters (eg. 4 rows and 4 columns for DNA sequences).
The (i,j) element of the matrix has a value of +2 in case of a match and -1 in case of a mismatch.
We can make a scoring matrix in R by using the nucleotideSubstitutionMatrix()
function in the Biostrings package.
The inputs for the nucleotideSubstitutionMatrix()
function are the score that we want to assign to a match and the score that we want to assign to a mismatch.
We can also specify that we use only the four nucleotides (ie. A, C, G, T) by setting ‘baseOnly=TRUE’, or whether we also use the letters that represent ambiguous cases (eg. ‘N’ = A/C/G/T).
To make a scoring matrix which assigns a score of +2 to a match and -1 to a mismatch, and store it in the variable sigma, we type:
library(Biostrings)
Loading required package: methods
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel': clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from 'package:ade4': score
The following objects are masked from 'package:stats': IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base': anyDuplicated, append, as.data.frame, cbind, colMeans, colnames, colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following object is masked from 'package:base': expand.grid
Loading required package: IRanges
Loading required package: XVector
Attaching package: 'Biostrings'
The following object is masked from 'package:seqinr': translate
The following object is masked from 'package:base': strsplit
sigma <- nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = TRUE) sigma # Print out the matrix
A C G T A 2 -1 -1 -1 C -1 2 -1 -1 G -1 -1 2 -1 T -1 -1 -1 2
Instead of assigning the same penalty (eg. -8) to every gap position, it is common instead to assign a gap opening penalty to the first position in a gap (eg. -8), and a smaller gap extension penalty to every subsequent position in the same gap.
It is likely that adjacent gap positions were created by the same insertion or deletion event, rather than by several independent insertion or deletion events.
We don’t want to penalise a 3-letter gap as much as we would penalise three separate 1-letter gaps, as the 3-letter gap may have arisen due to just one insertion or deletion event, while the 3 separate 1-letter gaps probably arose due to three independent insertion or deletion events.
There are two types of alignment in general.
A global alignment is an alignment of the full length of two sequences, for example, of two protein sequences or of two DNA sequences.
A local alignment is an alignment of part of one sequence to part of another sequence.
The pairwiseAlignment()
function finds the score for the optimal global alignment between two sequences, given a particular scoring system.
The inputs are the two sequences that you want to align, the scoring matrix, the gap opening penalty, and the gap extension penalty.
You can also tell the function that you want to just have the optimal global alignment’s score by setting scoreOnly = TRUE
, or that you want to have both the optimal global alignment and its score by setting scoreOnly = FALSE
.
For example, to find the score for the optimal global alignment between the sequences GAATTC
and GATTA
, we type:
s1 <- "GAATTC" s2 <- "GATTA" globalAligns1s2 <- pairwiseAlignment(s1, s2, substitutionMatrix = sigma, gapOpening = -2, gapExtension = -8, scoreOnly = FALSE) globalAligns1s2 # Print out the optimal alignment and its score
Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [1] GAATTC subject: [1] GA-TTA score: -3
Note that we set gapOpening
to be -2 and “gapExtension” to be -8, which means that the first position of a gap is assigned a score of \[(-8-2=)-10\] and every subsequent position in a gap is given a score of -8. Here the alignment contains four matches, one mismatch, and one gap of length 1, so its score is \[(4\times 2)+(1\times -1)+(1\times -10) = -3\]
To align proteins we need to use a scoring matrix for amino acids rather than for nucleotides.
There are several well known scoring matrices, such as the BLOSUM series of matrices.
BLOSUM with high numbers are designed for comparing closely related sequences, while BLOSUM with low numbers are designed for comparing distantly related sequences.
For example, BLOSUM62 is used for less divergent alignments (alignments of sequences that differ little), and BLOSUM30 is used for more divergent alignments (alignments of sequences that differ a lot).
We can use the data()
function to load a data set of BLOSUM matrices that comes with Biostrings package.
data(BLOSUM62) BLOSUM62 # Print out the data
A R N D C Q E G H I L K M F P S T W Y V B J Z X A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 -1 -1 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2 0 -1 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 4 -3 0 -1 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 -3 1 -1 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 -2 4 -1 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 -3 4 -1 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -4 -2 -1 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 -3 0 -1 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 3 -3 -1 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 3 -3 -1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 -3 1 -1 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 2 -1 -1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 0 -3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -3 -1 -1 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 -2 0 -1 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 -1 -1 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -2 -2 -1 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -1 -2 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 2 -2 -1 B -2 -1 4 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 -3 0 -1 J -1 -2 -3 -3 -1 -2 -3 -4 -3 3 3 -3 2 0 -3 -2 -1 -2 -1 2 -3 3 -3 -1 Z -1 0 0 1 -3 4 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -2 -2 -2 0 -3 4 -1 X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 * A -4 R -4 N -4 D -4 C -4 Q -4 E -4 G -4 H -4 I -4 L -4 K -4 M -4 F -4 P -4 S -4 T -4 W -4 Y -4 V -4 B -4 J -4 Z -4 X -4 * 1
Which other scoring matrices are used for protein alignment?
Hint: use data(package="Biostrings")
What is the biological meaning of these matrices? How are they built?
To find the optimal global alignment between the protein sequences “PAWHEAE” and “HEAGAWGHEE” using the Needleman-Wunsch algorithm using the BLOSUM62 matrix, we type:
data(BLOSUM62) s3 <- "PAWHEAE" s4 <- "HEAGAWGHEE" globalAligns3s4 <- pairwiseAlignment(s3, s4, substitutionMatrix = "BLOSUM62", gapOpening = -2, gapExtension = -8, scoreOnly = FALSE) globalAligns3s4 # Print out the optimal global alignment and its score
Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [1] P---AWHEAE subject: [1] HEAGAWGHEE score: -11