Class 4: Assembly statistics

September 28th, 2018

Depth

The average number of times that a particular nucleotide is represented in a collection of reads

Average Depth is sometimes called Coverage

Breadth of coverage: the percentage of bases that are sequenced at least a given number of times

Example: genome sequencing at 30× average depth can achieve a 95% breadth of coverage of the reference genome at a minimum depth of ten reads
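To make the two definitions concrete, here is a minimal sketch on a made-up depth vector. The values in toy_depth and the helper name breadth_of_coverage are illustrative assumptions, not part of the class material.

```r
# Toy per-base depth vector (made-up values, not real data)
toy_depth <- c(0, 3, 12, 15, 10, 9, 30, 0, 11, 20)

# Hypothetical helper: fraction of positions covered by
# at least `min_depth` reads
breadth_of_coverage <- function(depth, min_depth = 1) {
  mean(depth >= min_depth)
}

avg_depth  <- mean(toy_depth)                               # average depth
breadth_1  <- breadth_of_coverage(toy_depth)                # seen at least once
breadth_10 <- breadth_of_coverage(toy_depth, min_depth = 10)

avg_depth
breadth_1
breadth_10
```

The same average depth can go with very different breadths, which is why both numbers are usually reported.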

The number of contigs depends on

In general L can be different for each read. In this simulation we will assume all reads have the same length

G <- 1000   # genome length

N <- 10     # number of reads
L <- 100    # length of each read

The total number of nucleotides sequenced is

N*L

[1] 1000

Thus, the average depth (coverage) is

N*L/G

[1] 1

start <- sample.int(G, size=N)   # random start position of each read
end <- start + L - 1             # last position of a read of length L

depth <- rep(0, G)
par(mar=c(7,4,2,2)+0.1)
plot(depth, type = "l", ylim=c(0,5))

read_pos <- start[1]:min(end[1], G)
depth[read_pos] <- depth[read_pos] + 1
plot(depth, type = "l")

for(i in 2:N) {
    # we assume Linear Chromosome
    read_pos <- start[i]:min(end[i], G)
    depth[read_pos] <- depth[read_pos] + 1
}

Sometimes end[i] is greater than G, so part of the read falls outside the chromosome. We only count the part inside, which is why the code uses min(end[i], G).
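The clipping can be checked on a single read. This is a small sketch with a hypothetical start position of 950 (the names start_toy, end_toy and visible are made up for illustration), assuming the read covers exactly L positions.

```r
G <- 1000
L <- 100

start_toy <- 950                  # hypothetical read starting near the end
end_toy   <- start_toy + L - 1    # where the read would end on an unbounded chromosome

# Only positions up to G exist on a linear chromosome
visible <- start_toy:min(end_toy, G)

length(visible)       # number of bases of this read that we actually see
L - length(visible)   # number of bases that fall outside
```

Reads that start close to the end therefore contribute fewer than L bases, so the total depth can be a little below N*L.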

How would you handle a circular genome?
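One possible answer, as a sketch only: let positions wrap around with modular arithmetic, so a read that runs past G continues at position 1. The function name simulate_depth_circular and the seed are assumptions for illustration, and the sketch assumes L is not larger than G.

```r
# Hypothetical helper: per-base depth on a circular chromosome
simulate_depth_circular <- function(G, N, L) {
  start <- sample.int(G, size = N)
  depth <- rep(0, G)
  for (i in seq_len(N)) {
    # positions start, start+1, ..., start+L-1, wrapped into 1..G
    read_pos <- ((start[i] + 0:(L - 1) - 1) %% G) + 1
    depth[read_pos] <- depth[read_pos] + 1
  }
  depth
}

set.seed(1)   # arbitrary seed, only to make the run repeatable
depth_circ <- simulate_depth_circular(G = 1000, N = 10, L = 100)

sum(depth_circ)   # no read is clipped, so this equals N * L
```

The subtraction and addition of 1 are needed because R vectors are indexed from 1, while the %% operator returns values from 0 to G-1.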

plot(depth, type = "l")

barplot(table(depth), xlab="depth",ylab="Num bases")

If depth is 0, then we did not see that part of the genome

What percentage of the genome did we see?

sum(depth > 0) / G

[1] 0.659

In this case we use the theoretical value of G, since we do not know the real genome (yet)

What percentage of the genome has depth 2 or more?

sum(depth >= 2) / G

[1] 0.22
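As a rough sanity check, the simulated values can be compared with a simple approximation. This is an assumption added here, not something derived in these notes: if reads land uniformly at random, the depth at one position is approximately Poisson with mean equal to the average depth N*L/G. The names c_avg, expected_breadth and expected_depth2 are made up for this sketch.

```r
G <- 1000
N <- 10
L <- 100

c_avg <- N * L / G   # average depth

# Poisson approximation (assumed): P(depth >= 1) and P(depth >= 2)
expected_breadth <- 1 - exp(-c_avg)
expected_depth2  <- 1 - ppois(1, lambda = c_avg)

round(expected_breadth, 3)   # about 0.632
round(expected_depth2, 3)    # about 0.264
```

With only 10 reads, a single simulation can differ noticeably from these expected values, as the results above show.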