My name is Emine Guven. I am an applied mathematician and study quantitative biology. My interests are cellular aging, VEGF receptors clustering, math modeling of biological systems with a broad focus on data analysis and simulations.This site is reserve as a notebook to keep my studies fresh and open to my students and collaborators.
Tuesday, September 27, 2016
Unstable quantification
There is a statistical ambiguity(artifact) which causes a unstable quantification.
Non-uniform coverage bias is a suggested solution for this ambiguity.
Transcript quantification assessment
Suppose the lengths
lengths = c(100,200,300,100,100)
lengths = c(100,200,300,100,100)
The matrix which identifies transcripts with exons and junctions:
mat = cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))
The length of the transcripts is then:
lengths %*% mat
Suppose we align 1000 reads to this gene. So w = 1000.
Suppose we observe the following read counts for the exons and for the two junctions:
counts = c(125,350,300,125,100)
Given the formula above, and the assumption of uniform read distribution and Poisson counts, we can get a rough estimate of theta by just solving the linear system above.
First try a guess at theta:
theta.hat = c(1, 2, 3) / 10000
We can see with this guess, our counts are too low, and not properly apportioned to the exons and junctions:
mat %*% theta.hat * lengths * w
We can roughly estimate theta by solving the linear system of equations above:
LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients
(recall the linear models covered in PH525.2x).
Q:What would be the rough estimate of theta for the first transcript if the counts for exons and junctions were 60,320,420,60,140?
A:counts = c(60,320,420,60,140)
LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients
Q:What is the estimate of theta using our rough estimator, for the first transcript (the transcript with exon 1 and exon 2)?
A:By following the code above, solving the linear system of equations give a rough estimate of theta as:
theta.hat = c(.00075, .0005, .0005)
this reproduced the observed counts exactly:
mat %*% theta.hat * lengths * w
Monday, September 26, 2016
RNA-Seq Quantifiying Transcript Levels
Q:why multiplication?
A:because the larger the fragment the more reads there are to pick from
Q: why poisson
A: big N, small P_f => Poisson is proposed.
See the papers for more information for how Poisson model is described and more of techniques for RNA-seq.
Next module is going to be about maximum likelihood estimator , the transcript levels of each transcript within a gene.
RNA-Seq Quantifiying Transcript Levels
Q:why multiplication?
A:because the larger the fragment the more reads there are to pick from
Q: why poisson
A: big N, small P_f => Poisson is proposed.
See the papers for more information for how Poisson model is described and more of techniques for RNA-seq.
Next module is going to be about maximum likelihood estimator , the transcript levels of each transcript within a gene.
Friday, September 23, 2016
Ordinary Least Squares vs Linear Regression
Least squares is a method for performing the linear regression analysis. They can be used interchangeably, to explain how to fit the data to a "linear" line.
I use [x/max] when I'm just interested in some internal structure of the samples and not in the absolute differences between samples. This might be the case of peak detection in spectra for samples in which the strength of the signal which I'm seeking changes from sample to sample.
Finally I use [x-mean] normalization when some samples could be potentially using just a part of a bigger scale. This is the case of ratings for movies for example, in which by some user tend to give more positive ratings than others.
Notice that do not confuse normalization with standardization (e.g. Z-score). Do this when the distribution of the data is normal (gaussian) distribution.
Another method is the traditional SSE method.
Sum of squared error: One thing is that can be done is to find a linear line, and then for each of the data points, measure the vertical distance between the point and that line, square it, and add these up; the fitted line would be the one where this sum of distances is as small as possible (minimizing the SSE).
My focus is today, Least squares method (LSM)
Nice source for normalization vs standardization.
https://docs.google.com/document/d/1x0A1nUz1WWtMCZb5oVzF0SVMY7a_58KQulqQVT8LaVA/edit
Nice source for normalization vs standardization.
https://docs.google.com/document/d/1x0A1nUz1WWtMCZb5oVzF0SVMY7a_58KQulqQVT8LaVA/edit
Scaling the data sets:
Normalization: I use data [(x-mean)/sd] normalization whenever differences in variable ranges could potentially affect negatively to the performance of my algorithm. This is the case of PCA, regression or simple correlation analysis for example.
I use [x/max] when I'm just interested in some internal structure of the samples and not in the absolute differences between samples. This might be the case of peak detection in spectra for samples in which the strength of the signal which I'm seeking changes from sample to sample.
Finally I use [x-mean] normalization when some samples could be potentially using just a part of a bigger scale. This is the case of ratings for movies for example, in which by some user tend to give more positive ratings than others.
Standardization: (z-score) We do data normalization when seeking for relations. Some people do this methods, unfortunately, in experimental designs, which is not correct except if the variable is a transformed one, and all the data needs the same normalization method, such as pH in sum agricultural studies. Normalization in experimental designs are meaningless because we can't compare the mean of, for instance, a treatment with the mean of another treatment logarithmically normalized. In regression and multivariate analysis which the relationships are of interest, however, we can do the normalization to reach a linear, more robust relationship. Commonly when the relationship between two dataset is non-linear we transform data to reach a linear relationship. Here, normalization doesn't mean normalizing data, it means normalizing residuals by transforming data. So normalization of data implies to normalize residuals using the methods of transformation.
Notice that do not confuse normalization with standardization (e.g. Z-score). Do this when the distribution of the data is normal (gaussian) distribution.
x" =(x-mean)/sd
Coefficient of Variance = Sd(DataSet)/Mean(DataSet)
Coefficient of variance: It tells us about the noise level in data.
The smaller CoVar tends to give less noise since small CoVar indicate less noisy and more robustness.
Coefficient of Variance = Sd(DataSet)/Mean(DataSet)
Coefficient of variance: It tells us about the noise level in data.
The smaller CoVar tends to give less noise since small CoVar indicate less noisy and more robustness.
effectiveness test
- randomization is done to avoid bias
- a blind vs a double-blind experiment to further eliminate the chance of bias
- in a double-blind experiment the researchers are unaware of which treatment group a subject is in. This is a lot of work but in order to eliminate the effects of other variables other than we like to test besides the treatment (confounding variables) that may affect the results. This is how we draw a cause and effect relation.
- How one chooses to compare or present results can have a dramatic effect on what is implied.
Example: it took too many years to prove the detrimental effects of smoking on health even though there were a lot of results of observational studies.
Some R studio exercise:
Some R studio exercise:
Giving data vectors have a type
>simpsons = c("Jane","John","Ada","Adam")
>names(simpsons)= c("mum,"dad","sister","brother")
simpsons
mum dad sister brother
"Jane" "John" "Ada" "Adam"
RNA Sequencing-functional genomics course
Good lecture is on edX
And another one:
https://www.youtube.com/watch? v=hksQlJLwKqo
useful links:
Study information at the Sequence Read Archive
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP033351
Himes et al paper at PubMed Central
https://www.ncbi.nlm.nih.gov/pubmed/24926665
European Nucleotide Archive (EMBL-EBI)
http://www.ebi.ac.uk/ena
Sequence Read Archive (NCBI)
https://www.ncbi.nlm.nih.gov/sra/
A sample table stored in course repo on github
https://github.com/genomicsclass/labs/blob/master/rnaseq/airway_sample_table.csv
(details on creating this table available at the airway package vignette)
http://www.bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html
useful links:
Study information at the Sequence Read Archive
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP033351
Himes et al paper at PubMed Central
https://www.ncbi.nlm.nih.gov/pubmed/24926665
European Nucleotide Archive (EMBL-EBI)
http://www.ebi.ac.uk/ena
Sequence Read Archive (NCBI)
https://www.ncbi.nlm.nih.gov/sra/
A sample table stored in course repo on github
https://github.com/genomicsclass/labs/blob/master/rnaseq/airway_sample_table.csv
(details on creating this table available at the airway package vignette)
http://www.bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html
RNA Sequencing-functional genomics course
Good lecture is on edX
And another one:
https://www.youtube.com/watch? v=hksQlJLwKqo
useful links:
Study information at the Sequence Read Archive
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP033351
Himes et al paper at PubMed Central
https://www.ncbi.nlm.nih.gov/pubmed/24926665
European Nucleotide Archive (EMBL-EBI)
http://www.ebi.ac.uk/ena
Sequence Read Archive (NCBI)
https://www.ncbi.nlm.nih.gov/sra/
A sample table stored in course repo on github
https://github.com/genomicsclass/labs/blob/master/rnaseq/airway_sample_table.csv
(details on creating this table available at the airway package vignette)
http://www.bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html
useful links:
Study information at the Sequence Read Archive
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP033351
Himes et al paper at PubMed Central
https://www.ncbi.nlm.nih.gov/pubmed/24926665
European Nucleotide Archive (EMBL-EBI)
http://www.ebi.ac.uk/ena
Sequence Read Archive (NCBI)
https://www.ncbi.nlm.nih.gov/sra/
A sample table stored in course repo on github
https://github.com/genomicsclass/labs/blob/master/rnaseq/airway_sample_table.csv
(details on creating this table available at the airway package vignette)
http://www.bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html
Subscribe to:
Posts (Atom)