guvenmathbio: Transcript quantification assessment

Tuesday, September 27, 2016

Transcript quantification assessment

Suppose the lengths

lengths = c(100,200,300,100,100)

The matrix which identifies transcripts with exons and junctions:

mat = cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))

The length of the transcripts is then: lengths %*% mat

Suppose we align 1000 reads to this gene. So w = 1000.

Suppose we observe the following read counts for the exons and for the two junctions:

counts = c(125,350,300,125,100)

Given the formula above, and the assumption of uniform read distribution and Poisson counts, we can get a rough estimate of theta by just solving the linear system above.

First try a guess at theta:

theta.hat = c(1, 2, 3) / 10000

We can see with this guess, our counts are too low, and not properly apportioned to the exons and junctions:

mat %*% theta.hat * lengths * w

We can roughly estimate theta by solving the linear system of equations above:

LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients

(recall the linear models covered in PH525.2x).

Q:What would be the rough estimate of theta for the first transcript if the counts for exons and junctions were 60,320,420,60,140?

A:counts = c(60,320,420,60,140)

LHS = counts/(lengths * w)

lm.fit(mat, LHS)$coefficients

Q:What is the estimate of theta using our rough estimator, for the first transcript (the transcript with exon 1 and exon 2)?

A:By following the code above, solving the linear system of equations give a rough estimate of theta as:

theta.hat = c(.00075, .0005, .0005)

this reproduced the observed counts exactly:

mat %*% theta.hat * lengths * w

guvenmathbio

Tuesday, September 27, 2016

Transcript quantification assessment

No comments:

Post a Comment