guvenmathbio: September 2016

Tuesday, September 27, 2016

Unstable quantification

There is a statistical ambiguity(artifact) which causes a unstable quantification.
Non-uniform coverage bias is a suggested solution for this ambiguity.

Transcript quantification assessment

Suppose the lengths

lengths = c(100,200,300,100,100)

The matrix which identifies transcripts with exons and junctions:

mat = cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))

The length of the transcripts is then: lengths %*% mat

Suppose we align 1000 reads to this gene. So w = 1000.

Suppose we observe the following read counts for the exons and for the two junctions:

counts = c(125,350,300,125,100)

Given the formula above, and the assumption of uniform read distribution and Poisson counts, we can get a rough estimate of theta by just solving the linear system above.

First try a guess at theta:

theta.hat = c(1, 2, 3) / 10000

We can see with this guess, our counts are too low, and not properly apportioned to the exons and junctions:

mat %*% theta.hat * lengths * w

We can roughly estimate theta by solving the linear system of equations above:

LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients

(recall the linear models covered in PH525.2x).

Q:What would be the rough estimate of theta for the first transcript if the counts for exons and junctions were 60,320,420,60,140?

A:counts = c(60,320,420,60,140)

LHS = counts/(lengths * w)

lm.fit(mat, LHS)$coefficients

Q:What is the estimate of theta using our rough estimator, for the first transcript (the transcript with exon 1 and exon 2)?

A:By following the code above, solving the linear system of equations give a rough estimate of theta as:

theta.hat = c(.00075, .0005, .0005)

this reproduced the observed counts exactly:

mat %*% theta.hat * lengths * w

Monday, September 26, 2016

RNA-Seq Quantifiying Transcript Levels

Y_f ~ Poisson(theta_f l_f ) where Y_f is the number of reads coming from a transcript (fragment) f.
Q:why multiplication?
A:because the larger the fragment the more reads there are to pick from
Q: why poisson
A: big N, small P_f => Poisson is proposed.

See the papers for more information for how Poisson model is described and more of techniques for RNA-seq.

Next module is going to be about maximum likelihood estimator , the transcript levels of each transcript within a gene.

RNA-Seq Quantifiying Transcript Levels

Next module is going to be about maximum likelihood estimator , the transcript levels of each transcript within a gene.

Friday, September 23, 2016

Ordinary Least Squares vs Linear Regression

Least squares is a method for performing the linear regression analysis. They can be used interchangeably, to explain how to fit the data to a "linear" line.

Another method is the traditional SSE method.

Sum of squared error: One thing is that can be done is to find a linear line, and then for each of the data points, measure the vertical distance between the point and that line, square it, and add these up; the fitted line would be the one where this sum of distances is as small as possible (minimizing the SSE).

My focus is today, Least squares method (LSM)

Nice source for normalization vs standardization.
https://docs.google.com/document/d/1x0A1nUz1WWtMCZb5oVzF0SVMY7a_58KQulqQVT8LaVA/edit

Scaling the data sets:

Normalization: I use data [(x-mean)/sd] normalization whenever differences in variable ranges could potentially affect negatively to the performance of my algorithm. This is the case of PCA, regression or simple correlation analysis for example.

I use [x/max] when I'm just interested in some internal structure of the samples and not in the absolute differences between samples. This might be the case of peak detection in spectra for samples in which the strength of the signal which I'm seeking changes from sample to sample.

Finally I use [x-mean] normalization when some samples could be potentially using just a part of a bigger scale. This is the case of ratings for movies for example, in which by some user tend to give more positive ratings than others.

Standardization: (z-score) We do data normalization when seeking for relations. Some people do this methods, unfortunately, in experimental designs, which is not correct except if the variable is a transformed one, and all the data needs the same normalization method, such as pH in sum agricultural studies. Normalization in experimental designs are meaningless because we can't compare the mean of, for instance, a treatment with the mean of another treatment logarithmically normalized. In regression and multivariate analysis which the relationships are of interest, however, we can do the normalization to reach a linear, more robust relationship. Commonly when the relationship between two dataset is non-linear we transform data to reach a linear relationship. Here, normalization doesn't mean normalizing data, it means normalizing residuals by transforming data. So normalization of data implies to normalize residuals using the methods of transformation.

Notice that do not confuse normalization with standardization (e.g. Z-score). Do this when the distribution of the data is normal (gaussian) distribution.

x" =(x-mean)/sd

Coefficient of Variance = Sd(DataSet)/Mean(DataSet)

Coefficient of variance: It tells us about the noise level in data.
The smaller CoVar tends to give less noise since small CoVar indicate less noisy and more robustness.

effectiveness test

randomization is done to avoid bias
a blind vs a double-blind experiment to further eliminate the chance of bias
in a double-blind experiment the researchers are unaware of which treatment group a subject is in. This is a lot of work but in order to eliminate the effects of other variables other than we like to test besides the treatment (confounding variables) that may affect the results. This is how we draw a cause and effect relation.
How one chooses to compare or present results can have a dramatic effect on what is implied.

Example: it took too many years to prove the detrimental effects of smoking on health even though there were a lot of results of observational studies.

Some R studio exercise:

Giving data vectors have a type

>simpsons = c("Jane","John","Ada","Adam")

>names(simpsons)= c("mum,"dad","sister","brother")

simpsons

mum dad sister brother

"Jane" "John" "Ada" "Adam"

RNA Sequencing-functional genomics course

Good lecture is on edX

https://www.edx.org/course/case-studies-functional-genomics-harvardx-ph525-7x

And another one:

https://www.youtube.com/watch?v=hksQlJLwKqo

useful links:
Study information at the Sequence Read Archive
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP033351

Himes et al paper at PubMed Central
https://www.ncbi.nlm.nih.gov/pubmed/24926665

European Nucleotide Archive (EMBL-EBI)
http://www.ebi.ac.uk/ena

Sequence Read Archive (NCBI)
https://www.ncbi.nlm.nih.gov/sra/

A sample table stored in course repo on github
https://github.com/genomicsclass/labs/blob/master/rnaseq/airway_sample_table.csv

(details on creating this table available at the airway package vignette)
http://www.bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html

RNA Sequencing-functional genomics course

Good lecture is on edX

https://www.edx.org/course/case-studies-functional-genomics-harvardx-ph525-7x

And another one: