Tuesday, September 27, 2016

Transcript quantification assessment

Suppose the lengths

lengths = c(100,200,300,100,100)

The matrix which identifies transcripts with exons and junctions:
mat = cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))
The length of the transcripts is then: lengths %*% mat 
Suppose we align 1000 reads to this gene. So w = 1000.
Suppose we observe the following read counts for the exons and for the two junctions:
counts = c(125,350,300,125,100)
Given the formula above, and the assumption of uniform read distribution and Poisson counts, we can get a rough estimate of theta by just solving the linear system above. 
First try a guess at theta:
theta.hat = c(1, 2, 3) / 10000
We can see with this guess, our counts are too low, and not properly apportioned to the exons and junctions:
mat %*% theta.hat * lengths * w
We can roughly estimate theta by solving the linear system of equations above:
LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients
 (recall the linear models covered in PH525.2x).
Q:What would be the rough estimate of theta for the first transcript if the counts for exons and junctions were 60,320,420,60,140?
A:counts = c(60,320,420,60,140)
LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients
Q:What is the estimate of theta using our rough estimator, for the first transcript (the transcript with exon 1 and exon 2)?
A:By following the code above, solving the linear system of equations give a rough estimate of theta as:
theta.hat = c(.00075, .0005, .0005)
this reproduced the observed counts exactly:
mat %*% theta.hat * lengths * w

No comments:

Post a Comment