My name is Emine Guven. I am an applied mathematician and study quantitative biology. My interests are cellular aging, VEGF receptor clustering, and mathematical modeling of biological systems, with a broad focus on data analysis and simulations. This site is reserved as a notebook to keep my studies fresh and open to my students and collaborators.
Thursday, December 15, 2016
state charts for gene network modeling
The Methods section is taken from http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009376
Wednesday, December 14, 2016
gene network model
https://courses.edx.org/courses/course-v1:IEEEx+SysBio1x+3T2016/courseware/579a1cdc89624cfeaa18dabdd0785fcf/75147fb49bec4286979ab95e3b28326a/?child=first
Read the article: State charts for gene network modeling
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009376
Tuesday, December 13, 2016
Monday, December 5, 2016
time-lapse image analysis for yeast RLS measurements
http://imagej.net/Batch_Processing
http://imagej.net/Scripting
http://imagej.net/How_to_apply_a_common_operation_to_a_complete_directory
Tuesday, November 29, 2016
lab meeting 11/29/2016
1)a-DE gene lists for RNAseq project
Q1: There are various time points between control and treatment. Should we use the consensus DEG list?
In the BGI report, go to /Differential.../DEGList/
(There is an R package, NOISeq.)
BGI already did the analysis; we need to do a better job on items 11 and 12.
Check NOISeq in the report, pages 16/23-17/23-...
Do a good job on these parts for a manuscript:
11-Pathway Analysis of DEG: functional enrichment with phyper (hypergeometric distribution)
12- PPI Analysis of DEG
It seems the "GeneID" values in the BGI report are from NCBI. Example: GeneID 57573 is a standard ID.
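The functional enrichment step in item 11 could be sketched as follows. This is only a minimal illustration in base R: the background size, pathway size and DEG counts are made-up numbers, not values from the BGI report.

```r
# Hypothetical counts, for illustration only (not from the BGI report)
N <- 20000  # genes in the background (universe)
K <- 150    # background genes annotated to the pathway
n <- 500    # genes in the DEG list
k <- 12     # DEGs that fall in the pathway

# P(X >= k): chance of seeing k or more pathway genes in a random list of n genes.
# phyper(q, m, n, k) takes m = "white balls", n = "black balls", k = "draws",
# and lower.tail = FALSE returns P(X > q), hence q = k - 1.
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
print(p_enrich)
```

When many pathways are tested, the resulting p-values would usually be corrected for multiple testing, for example with p.adjust(p, method = "BH").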
1)b-Pathway analysis plan for DE gene lists
TODO: There are different sources of human gene/protein networks. We should try several for comparison.
TODO: We should try different clustering methods, such as hclust, mcl, etc. (refer to Qin's previous paper for clustering analysis).
2)a-time-lapse image analysis for yeast replicative lifespan
Software: ImageJ, MATLAB, R
https://www.mendeley.com/groups/ gene -pathways/ pathway analysis
Gene set analysis is a basic thing; we also need to do that.
A data visualization course would be good for making animations (.gifs).
Monday, November 28, 2016
RNA-seq further reading
Marioni JC et al (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18:1509–17.
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11:R106.
Auer PL, Doerge RW (2010) Statistical design and analysis of RNA sequencing data. Genetics 185:405–416.
Wang Z et al (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10:57–63.
Wednesday, November 23, 2016
tutorial RNA-seq
How to analyze RNA-Seq data? Find differentially expressed genes in your research
https://www.youtube.com/watch?v=xh_wpWj0AzM
What is RNA-seq?
"using next generation sequencing to reveal the presence and quantity of RNA in a biological sample at a given moment in time." (Wikipedia)
* differential expression of RNA-seq and pathway analysis is important
here is a nice set of slides:
http://www.mi.fu-berlin.de/wiki/pub/ABI/GenomicsLecture13Materials/rnaseq2.pdf
* nice discussion FPKM vs RPKM
https://www.biostars.org/p/124826/
Tuesday, November 22, 2016
RNA-seq lab meeting notes
* StringTie has a bug.
* Make sure you can follow the home path: $ echo $PATH
*rnaseq_hisat2 /ballgown/chrx_genes*.csv
/hisat
* FPKM : stands for Fragments Per Kilobase of transcript per Million mapped reads. In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it.
*Why FPKM ?
* Lior Pachter is a professor of computational biology; follow his blog.
*focus on pathway analysis in RNA-seq
*differential expression analysis
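To make the FPKM definition above concrete, here is a minimal sketch in R. The gene names, fragment counts and transcript lengths are hypothetical, and the helper fpkm() is written for this note only; it is not a function from StringTie or ballgown.

```r
# Hypothetical fragment counts and transcript lengths, for illustration only
fragments <- c(geneA = 100, geneB = 200, geneC = 700)     # mapped fragments per transcript
length_bp <- c(geneA = 1000, geneB = 2000, geneC = 1000)  # transcript length in bases

fpkm <- function(fragments, length_bp) {
  per_kb <- fragments / (length_bp / 1e3)  # fragments per kilobase of transcript
  per_kb / (sum(fragments) / 1e6)          # ... per million mapped fragments
}

fpkm_values <- fpkm(fragments, length_bp)
print(fpkm_values)
```

In this toy example geneB has twice as many fragments as geneA but is also twice as long, so both get the same FPKM. That length correction is one answer to "Why FPKM?".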
*make sure you can follow the home path$ echo $PATH
*rnaseq_hisat2 /ballgown/chrx_genes*.csv
/hisat
* FPKM : stands for Fragments Per Kilobase of transcript per Million mapped reads. In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it.
*Why FPKM ?
* Lior Pachter is a professor of computational biology follow his blog.
*focus on pathway analysis in RNA-seq
*differential expression analysis
Wednesday, November 16, 2016
language and scientific research editing
https://secure.authorservices.springernature.com/en/default/submit
Monday, November 14, 2016
methods for parameter estimations
1)Method of moments (MM)
2)maximum likelihood estimation (MLE)
3)least squares estimation (LSE)
4)simulated annealing (MCMC) or Bayesian statistics
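As a small illustration of the first two methods in the list, the sketch below estimates the parameters of a gamma distribution by the method of moments and by maximum likelihood. The gamma model and the simulated data are assumptions chosen for the example; they are not taken from any data set in this notebook.

```r
set.seed(1)
# Simulated data from a gamma distribution (an assumption for this sketch)
x <- rgamma(2000, shape = 2, rate = 0.5)

# 1) Method of moments: match the sample mean and variance
#    (gamma mean = shape/rate, variance = shape/rate^2)
mm <- c(shape = mean(x)^2 / var(x), rate = mean(x) / var(x))

# 2) Maximum likelihood: minimize the negative log-likelihood;
#    parameters are optimized on the log scale to keep them positive
negloglik <- function(log_par) {
  -sum(dgamma(x, shape = exp(log_par[1]), rate = exp(log_par[2]), log = TRUE))
}
opt <- optim(log(mm), negloglik)   # start from the method-of-moments estimate
mle <- setNames(exp(opt$par), c("shape", "rate"))

print(rbind(mm, mle))
```

Both estimates should land close to the values used in the simulation (shape 2, rate 0.5). Least squares and MCMC are not shown here.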
Thursday, November 3, 2016
article published
https://peerj.com/articles/2671/
Congrats to our team. And special thanks to our lab PI Dr. Hong Qin for designing this work.
Monday, October 31, 2016
nonlinear regression and nonlinear least squares in R
a useful document
http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Nonlinear-Regression.pdf
Tuesday, October 25, 2016
CV values
Noise level test: a bigger CV value means noisier data.
A higher G (shape) parameter in the model simulation gives smaller lifespan values.
A higher R (rate) parameter in the model simulation gives higher mean lifespans.
Saturday, October 15, 2016
free hard drive space command
df -H : hard drive allocation and usage check
top : memory usage check (slow method)
vm_stat : memory usage check (faster)
Academic writing tips
Ten simple rules from Zhang:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003453#pcbi.1003453-Watson1
Thursday, October 13, 2016
today's note
1) Copy number variation (CNV) is worth studying: http://www.nature.com/scitable/topicpage/copy-number-variation-445
2) Focus on C. elegans for the next project, as a source of experimental lifespan data sets.
3) Databases for lifespan:
http://kaeberleinlab.org/projects/lifespan-observations-database
*Observations database
*Sagaweb database
*Managed databases
Friday, October 7, 2016
solving ODEs
Some nice refresher examples are here:
http://mathinsight.org/ordinary_differential_equation_introduction
Thursday, October 6, 2016
Bayesian statistics vs Maximum Likelihood estimation
The screenshot is taken from Hartig et al. 2011, Statistical inference for stochastic simulation models – theory and application.
It describes why we normalize the posterior distribution in Bayesian statistics.
Wednesday, October 5, 2016
Metropolis Hastings -MCMC in R
MCMC is achieved by:
1) Starting at a random parameter value (old).
2) Choosing a new parameter value (new) that is close to the old value, drawn from some probability density called the proposal function.
3) Jumping to this new point with probability p(new)/p(old), where p is the target function; if p(new)/p(old) > 1, always jump.
https://github.com/florianhartig/LearningBayes/blob/master/CommentedCode/02-Samplers/MCMC/Convergence.md
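The three steps above can be sketched in a few lines of R. This is a minimal illustration, not the code from the linked tutorial: the target (a normal density with mean 3 and sd 1), the proposal width, the chain length and the burn-in are all assumptions chosen for the example.

```r
set.seed(42)

# Target density p (assumed for this sketch): normal with mean 3 and sd 1.
# Working on the log scale avoids numerical underflow.
log_target <- function(theta) dnorm(theta, mean = 3, sd = 1, log = TRUE)

n_iter <- 10000
chain <- numeric(n_iter)
chain[1] <- 0                                  # 1) starting value (old)

for (i in 2:n_iter) {
  current  <- chain[i - 1]
  proposal <- rnorm(1, mean = current, sd = 1) # 2) new value close to the old one
  log_ratio <- log_target(proposal) - log_target(current)  # log of p(new)/p(old)
  # 3) accept with probability min(1, p(new)/p(old)); otherwise stay at the old value
  chain[i] <- if (log(runif(1)) < log_ratio) proposal else current
}

samples <- chain[-(1:1000)]                    # drop the burn-in
print(c(mean = mean(samples), sd = sd(samples)))
```

The proposal here is symmetric (a normal centered on the current value), which is why the acceptance ratio only involves the target density. The sample mean and sd should come out close to 3 and 1.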
what is noise
Randomness: living organisms rely on inferences (guesses) about the best response to make, because the information they receive from the outside world (environment) is diluted by noise.
information(about world)
inferences(on present)------------------>modify behaviors (to optimize survival probability)
Monday, October 3, 2016
Integrative Genomics Viewer
http://software.broadinstitute.org/software/igv/download
IGV is an excellent way to visualize sequencing data, whether it is whole-genome sequencing, ChIP-seq or RNA-seq.
data analysis course sequences on edX
https://courses.edx.org/courses/course-v1:HarvardX+PH525.7x+3T2015/cd8cfac0f386436fa0cb1ed3d0012328/
Tuesday, September 27, 2016
Unstable quantification
There is a statistical ambiguity (artifact) that causes unstable quantification.
Accounting for non-uniform coverage bias is a suggested solution for this ambiguity.
Transcript quantification assessment
Suppose the lengths of the exons and junctions are:
lengths = c(100,200,300,100,100)
The matrix which identifies transcripts with exons and junctions:
mat = cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))
The length of the transcripts is then:
lengths %*% mat
Suppose we align 1000 reads to this gene, so:
w = 1000
Suppose we observe the following read counts for the exons and for the two junctions:
counts = c(125,350,300,125,100)
Given the formula above, and the assumption of uniform read distribution and Poisson counts, we can get a rough estimate of theta by just solving the linear system above.
First try a guess at theta:
theta.hat = c(1, 2, 3) / 10000
We can see with this guess, our counts are too low, and not properly apportioned to the exons and junctions:
mat %*% theta.hat * lengths * w
We can roughly estimate theta by solving the linear system of equations above:
LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients
(recall the linear models covered in PH525.2x).
Q: What is the estimate of theta using our rough estimator, for the first transcript (the transcript with exon 1 and exon 2)?
A: Following the code above, solving the linear system of equations gives a rough estimate of theta:
theta.hat = c(.00075, .0005, .0005)
so the estimate for the first transcript is 0.00075. This reproduces the observed counts exactly:
mat %*% theta.hat * lengths * w
Q: What would be the rough estimate of theta for the first transcript if the counts for exons and junctions were 60,320,420,60,140?
A:
counts = c(60,320,420,60,140)
LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients
which gives 0.0002 for the first transcript.
Monday, September 26, 2016
RNA-Seq Quantifying Transcript Levels
Q: Why multiplication?
A: Because the larger the fragment, the more reads there are to pick from.
Q: Why Poisson?
A: Big N, small P_f => a Poisson model is proposed.
See the papers for more information on how the Poisson model is described and on more techniques for RNA-seq.
The next module is going to be about the maximum likelihood estimator of the transcript levels of each transcript within a gene.
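The "big N, small P_f" argument can be checked numerically. The sketch below compares exact binomial probabilities with the Poisson approximation that has the same mean; the values of N and p are hypothetical and only chosen to be "large" and "small".

```r
# Hypothetical numbers: N fragments in the sample, each with a small
# probability p of coming from a given region
N <- 1e6
p <- 5e-6
k <- 0:20                                # read counts to compare

binom <- dbinom(k, size = N, prob = p)   # exact binomial probabilities
pois  <- dpois(k, lambda = N * p)        # Poisson approximation with the same mean

max_diff <- max(abs(binom - pois))       # largest pointwise difference
print(max_diff)
```

The largest difference between the two sets of probabilities is tiny, which is why the Poisson model is a convenient stand-in for the binomial in this setting.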
Friday, September 23, 2016
Ordinary Least Squares vs Linear Regression
Least squares is a method for performing linear regression analysis. The two terms are often used interchangeably to describe how to fit a straight line to the data.
Another way to describe it is the traditional SSE method.
Sum of squared errors: find a straight line and, for each data point, measure the vertical distance between the point and that line, square it, and add these up; the fitted line is the one for which this sum of squared distances is as small as possible (minimizing the SSE).
My focus today is the least squares method (LSM).
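A short R sketch of the SSE idea described above, using a small made-up data set: lm() finds the least squares line, and any other line (here, one with a slightly different slope) has a larger sum of squared errors.

```r
# Small made-up data set, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared vertical distances between the points and the line b0 + b1 * x
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit <- lm(y ~ x)            # least squares fit
b <- unname(coef(fit))      # b[1] = intercept, b[2] = slope

sse_fit   <- sse(b[1], b[2])        # SSE at the fitted line
sse_other <- sse(b[1], b[2] + 0.1)  # SSE for a slightly different slope
print(c(fitted = sse_fit, perturbed = sse_other))
```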
Nice source for normalization vs standardization.
https://docs.google.com/document/d/1x0A1nUz1WWtMCZb5oVzF0SVMY7a_58KQulqQVT8LaVA/edit
Scaling the data sets:
Normalization: I use [(x-mean)/sd] normalization whenever differences in variable ranges could negatively affect the performance of my algorithm. This is the case for PCA, regression or simple correlation analysis, for example.
I use [x/max] when I'm only interested in the internal structure of the samples and not in the absolute differences between samples. This might be the case for peak detection in spectra, where the strength of the signal I'm looking for changes from sample to sample.
Finally, I use [x-mean] normalization when some samples could potentially be using just a part of a bigger scale. This is the case for movie ratings, for example, where some users tend to give more positive ratings than others.
Standardization (z-score): We normalize data when looking for relationships. Unfortunately, some people also apply these methods in experimental designs, which is not correct unless the variable is a transformed one and all the data need the same normalization method, such as pH in some agricultural studies. Normalization in experimental designs is meaningless because we cannot compare, for instance, the mean of one treatment with the logarithmically normalized mean of another treatment. In regression and multivariate analysis, where the relationships are of interest, we can however normalize to reach a linear, more robust relationship. Commonly, when the relationship between two data sets is non-linear, we transform the data to reach a linear relationship. Here, normalization does not mean normalizing the data; it means normalizing the residuals by transforming the data. So normalizing the data means normalizing the residuals using transformation methods.
Do not confuse normalization with standardization (e.g. z-score). Standardize when the data follow a normal (Gaussian) distribution.
x" = (x-mean)/sd
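The three scalings mentioned in this post, [(x-mean)/sd], [x/max] and [x-mean], can be compared side by side in R. The data vector below is made up for illustration.

```r
# Made-up values, for illustration only
x <- c(2, 4, 4, 6, 9)

z_score  <- (x - mean(x)) / sd(x)  # [(x-mean)/sd]: mean 0, standard deviation 1
by_max   <- x / max(x)             # [x/max]: the largest value becomes 1
centered <- x - mean(x)            # [x-mean]: mean 0, original units kept

print(round(cbind(x, z_score, by_max, centered), 3))
```

The built-in scale(x) gives the same values as z_score, returned as a one-column matrix.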
Coefficient of variation = Sd(DataSet)/Mean(DataSet)
Coefficient of variation (CV): it tells us about the noise level in the data.
A smaller CV indicates less noisy, more robust data.
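A minimal R sketch of the CV as a noise measure. The helper cv() and the two data sets are made up for this note: both sets have the same mean, but the second one is much more spread out.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Two made-up data sets with the same mean but different spread
quiet <- c(24, 25, 26, 25, 25)
noisy <- c(10, 40, 25, 15, 35)

print(c(quiet = cv(quiet), noisy = cv(noisy)))
```

The noisier data set gets the larger CV, even though the means are identical.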
effectiveness test
- randomization is done to avoid bias
- a blind vs. a double-blind experiment, to further eliminate the chance of bias
- in a double-blind experiment, neither the subjects nor the researchers know which treatment group a subject is in. This is a lot of work, but it is needed to eliminate the effects of variables other than the treatment we want to test (confounding variables) that may affect the results. This is how we establish a cause-and-effect relation.
- How one chooses to compare or present results can have a dramatic effect on what is implied.
Example: it took many years to prove the detrimental effects of smoking on health, even though there were many results from observational studies.
Some RStudio exercises:
Data vectors have a type, and their elements can be given names:
> simpsons = c("Jane","John","Ada","Adam")
> names(simpsons) = c("mum","dad","sister","brother")
> simpsons
mum dad sister brother
"Jane" "John" "Ada" "Adam"
RNA Sequencing-functional genomics course
A good lecture is on edX.
And another one:
https://www.youtube.com/watch?v=hksQlJLwKqo
useful links:
Study information at the Sequence Read Archive
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP033351
Himes et al paper at PubMed Central
https://www.ncbi.nlm.nih.gov/pubmed/24926665
European Nucleotide Archive (EMBL-EBI)
http://www.ebi.ac.uk/ena
Sequence Read Archive (NCBI)
https://www.ncbi.nlm.nih.gov/sra/
A sample table stored in course repo on github
https://github.com/genomicsclass/labs/blob/master/rnaseq/airway_sample_table.csv
(details on creating this table available at the airway package vignette)
http://www.bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html
Saturday, January 30, 2016
Today's lab meeting
Yesterday, we met with Dr. Voit from Georgia Tech. It looks like their lab meetings are very helpful for understanding systems biology topics in more detail. I hope to take advantage of having so many nice people in their group.
The topic yesterday was "Single cell particle tracking". The speaker was James Wade, one of Dr. Voit's students.
Wednesday, January 27, 2016
Today in bio125 course notes linear regression
Students did not seem to know the meaning of linear regression. Why we do it and how we fit a straight line to the data do not make sense at all to most of them. The simple mathematical notion should be introduced before we teach it.
H2O2_LOH manuscript revising
Keep revising the Loss of Heterozygosity manuscript, especially the research and discussion parts.
Review the tables and figures before submitting a preprint.
Tuesday, January 26, 2016
oxidative stress and aging key terms and parameters
CLS: chronological life span measures the amount of time a single yeast mother cell stays alive.
RLS: replicative life span measures the number of divisions a mother cell undergoes before it stops dividing.
MR: mitotic recombination is the exchange of genetic information between homologous chromosomes in somatic cells.
ROS: reactive oxygen species, which have a free radical.
LOH: loss of heterozygosity can be used to measure the genomic integrity of cells.
MA: mitotic asymmetry, the generation of two dissimilar daughter cells following mitotic division.
MET15 locus: LOH is detected at this genetic locus via knockout of one allele using a resistance marker.
Cv: a variable in the H2O2 dose-response curve that represents the middle concentration at which cell viability decreases by half.
Cb: represents the middle concentration for black colonies on MLA plates.
Tg: from the biological survival curve; it represents the time at which there is a 50% decrease in genomic integrity (Qin et al. 2008).
Tc: represents the midpoint of chronological life span (CLS) (Qin et al. 2008).
L0: a ratio that measures the frequency of LOH events in daughter cells/mother cells.
Friday, January 22, 2016
Today in bio125 class we applied the Bradford protein assay of protein concentration and serial dilution. The goal of this assay is to find the (unknown) protein quantity in each experiment. There are 6 experiments that have 0.9% NaCl + BstStock + 5*Bradford solutions.
Experiment        : 1   2   3   4   5   6
NaCl              : 16  8   4   2   1   0
BstStock quantity : 640 720 760 780 790 800
5*Bradford        : 160 80  40  20  10  0
unknown = 50 ml +200B+750 ml
Once we measure the amount of amino acids in each tube, we will use linear regression, and later use the fit to predict an unknown solution's protein amount.
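The regression step could look like the sketch below. The protein amounts and absorbance readings are hypothetical numbers invented for the example (they are not measurements from the bio125 experiment), and the column names are arbitrary.

```r
# Hypothetical standards: known protein amounts and measured absorbance
# (made-up numbers, not from the bio125 experiment)
standards <- data.frame(
  protein    = c(0, 2, 4, 6, 8, 10),
  absorbance = c(0.00, 0.11, 0.20, 0.31, 0.39, 0.50)
)

fit <- lm(absorbance ~ protein, data = standards)  # standard curve
b <- unname(coef(fit))                             # b[1] = intercept, b[2] = slope

# Invert the fitted line to estimate the protein amount of the unknown
unknown_absorbance <- 0.25
unknown_protein <- (unknown_absorbance - b[1]) / b[2]
print(unknown_protein)
```

The estimate is only meaningful when the unknown's absorbance falls inside the range covered by the standards.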
As a young scientist, my goal is to become more objective and independent. Thanks for visiting my blog.