guvenmathbio: 2016

Thursday, December 15, 2016

state charts for gene network modeling

The Methods section is taken from http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009376


Wednesday, December 14, 2016

gene network model

https://courses.edx.org/courses/course-v1:IEEEx+SysBio1x+3T2016/courseware/579a1cdc89624cfeaa18dabdd0785fcf/75147fb49bec4286979ab95e3b28326a/?child=first

Read the article: State charts for gene network modeling

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009376

Tuesday, December 13, 2016

gene bank

Monday, December 5, 2016

time-lapse image analysis for yeast RLS measurements

http://imagej.net/Batch_Processing

http://imagej.net/Scripting

http://imagej.net/How_to_apply_a_common_operation_to_a_complete_directory

**************

Tuesday, November 29, 2016

lab meeting 11/29/2016

1)a-DE gene lists for RNAseq project

Q1: there are various time points between control and treatment. Should we use the consensus DEG list?

In the BGI report, go to /Differential.../DEGList/
(There is an R package, NOISeq.)
BGI already did the analysis; we need to do a better job on sections 11 and 12.

check NOISeq in the report page 16/23-17/23-...

Do a good job on those parts for a manuscript:

11- Pathway Analysis of DEG: functional enrichment with phyper (hypergeometric distribution)
12- PPI Analysis of DEG
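
A minimal sketch of the hypergeometric enrichment test mentioned in item 11, using R's phyper. All counts are made-up numbers for illustration, not values from the BGI report. The same one-sided test is repeated with fisher.test as a cross-check.

```r
# Hypothetical counts, for illustration only (not from the BGI report)
N <- 20000  # genes in the background
K <- 150    # background genes annotated to the pathway
n <- 500    # genes in the DEG list
k <- 12     # DEGs annotated to the pathway

# P(X >= k): chance of seeing at least k pathway genes in a random list of n genes
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same one-sided test written as Fisher's exact test on the 2x2 table
tab <- matrix(c(k, K - k, n - k, N - K - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(phyper = p_enrich, fisher = p_fisher))
```

Note the k - 1: phyper with lower.tail = FALSE gives P(X > q), so q = k - 1 is needed to include k itself.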

It seems the "GeneID" values in the BGI report are from NCBI. Example: GeneID 57573 is a standard ID.

https://www.ncbi.nlm.nih.gov/gene/?term=NP_065864

http://pantherdb.org/genes/gene.jsp?acc=HUMAN|HGNC=23226|UniProtKB=Q9BX82&showAllAlt=yes

1)b-Pathway analysis plan for DE gene lists
TODO: There are different sources of human gene/protein networks. We should try several for comparison.
TODO: We should try different clustering methods, such as hclust, mcl, etc. (refer to Qin's previous paper for clustering analysis).

2)a-time-lapse image analysis for yeast replicative lifespan
Software: ImageJ, MATLAB, R

https://www.mendeley.com/groups/ gene -pathways/ pathway analysis

Gene set analysis is a basic thing; we also need to do that.

A data visualization course would be good for making animated .gifs.

Monday, November 28, 2016

RNA-seq further reading

Marioni JC et al (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18:1509–17.

Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol. 11:R106.

Auer PL, Doerge RW (2010) Statistical design and analysis of RNA sequencing data. Genetics 185:405-416.

Wang Z et al (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10:57-63.

Theorem of the day

http://www.theoremoftheday.org/Theorems.html#112

Wednesday, November 23, 2016

tutorial RNA-seq

How to analyze RNA-Seq data? Find differentially expressed genes in your research

https://www.youtube.com/watch?v=xh_wpWj0AzM

What is RNA-seq?
"using next generation sequencing to reveal the presence and quantity of RNA in a biological sample at a given moment in time."(wikipedia)

* differential expression analysis of RNA-seq data and pathway analysis are important
here is a nice set of slides:

http://www.mi.fu-berlin.de/wiki/pub/ABI/GenomicsLecture13Materials/rnaseq2.pdf

* nice discussion FPKM vs RPKM

https://www.biostars.org/p/124826/

Tuesday, November 22, 2016

RNA-seq lab meeting notes

* StringTie has a bug.

* Make sure you can check your path: $ echo $PATH

*rnaseq_hisat2 /ballgown/chrx_genes*.csv
/hisat

* FPKM : stands for Fragments Per Kilobase of transcript per Million mapped reads. In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it.

*Why FPKM ?

* Lior Pachter is a professor of computational biology; follow his blog.

*focus on pathway analysis in RNA-seq
*differential expression analysis
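
A minimal sketch of how FPKM follows from its definition above. The gene names, fragment counts, lengths, and library size are hypothetical values chosen for illustration.

```r
# Hypothetical values, for illustration only
fragments <- c(geneA = 500, geneB = 1200, geneC = 300)   # fragments mapped to each transcript
length_bp <- c(geneA = 1000, geneB = 4000, geneC = 500)  # transcript lengths in base pairs
total_mapped <- 2e6                                      # total mapped fragments in the sample

# FPKM = fragments / (transcript length in kb) / (mapped fragments in millions)
fpkm <- fragments / (length_bp / 1e3) / (total_mapped / 1e6)
print(fpkm)
```

In this toy example geneB has the most fragments but the lowest FPKM, because it is the longest transcript. That length correction is one answer to "Why FPKM?".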

Wednesday, November 16, 2016

language and scientific research editing

https://secure.authorservices.springernature.com/en/default/submit

Monday, November 14, 2016

methods for parameter estimations

1) Method of moments (MM)
2) Maximum likelihood estimation (MLE)
3) Least squares estimation (LSE)
4) Markov chain Monte Carlo (MCMC) for Bayesian statistics, or simulated annealing
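
A small sketch comparing the first three methods on one problem: estimating the rate of an exponential distribution. The data are simulated. The true rate of 0.5, the sample size, and the search interval are arbitrary choices made for this example.

```r
set.seed(1)
x <- rexp(200, rate = 0.5)  # simulated data; the true rate 0.5 is an arbitrary choice

# 1) Method of moments: the exponential mean is 1/rate
rate_mm <- 1 / mean(x)

# 2) Maximum likelihood: minimize the negative log-likelihood numerically
negloglik <- function(rate) -sum(dexp(x, rate = rate, log = TRUE))
rate_mle <- optimize(negloglik, interval = c(1e-3, 10))$minimum

# 3) Least squares: fit the exponential survival curve to the empirical one
xs <- sort(x)
surv <- 1 - (seq_along(xs) - 0.5) / length(xs)
sse <- function(rate) sum((surv - exp(-rate * xs))^2)
rate_lse <- optimize(sse, interval = c(1e-3, 10))$minimum

print(c(MM = rate_mm, MLE = rate_mle, LSE = rate_lse))
```

For the exponential distribution the MM and MLE estimates coincide (both equal 1/mean), so they differ here only by the numerical tolerance of optimize. The least squares estimate is close but not identical.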

Thursday, November 3, 2016

article published

https://peerj.com/articles/2671/

Congrats to our team. And special thanks to our lab PI Dr. Hong Qin for designing this work.

Monday, October 31, 2016

nonlinear regression and nonlinear least squares in R

a useful document

http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Nonlinear-Regression.pdf

Tuesday, October 25, 2016

CV values

Noise level test: a bigger CV value means noisier data.
A higher G (shape) parameter in the model simulation gives smaller lifespan values.
A higher R (rate) parameter in the model simulation gives higher mean lifespans.

Saturday, October 15, 2016

free hard drive space command

df -H : hard drive allocation and usage check
top : memory usage check (slow method)
vm_stat : memory usage check (faster)

Academic writing tips

Ten easy rules from Zhang et al:

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003453#pcbi.1003453-Watson1

Thursday, October 13, 2016

today's note

1) Copy number variation (CNV) is worth studying: http://www.nature.com/scitable/topicpage/copy-number-variation-445
2) Focus on C. elegans experimental lifespan data sets for the next project.
3) Databases for lifespan data:

http://kaeberleinlab.org/projects/lifespan-observations-database

*Observations database
*Sagaweb database
*Managed databases

Friday, October 7, 2016

solving ODEs

Some nice refresher examples are here:

http://mathinsight.org/ordinary_differential_equation_introduction

Thursday, October 6, 2016

Bayesian statistics vs Maximum Likelihood estimation

The screenshot is taken from Hartig et al. 2011, Statistical inference for stochastic simulation models – theory and application

Here is a description of why we normalize the posterior distribution in Bayesian statistics.

Wednesday, October 5, 2016

Metropolis Hastings -MCMC in R

MCMC is achieved by:

1) Starting at a random parameter value (old).

2) Choosing a new parameter value (new) close to the old value, drawn from a probability density called the proposal function.

3) Jumping to this new point with probability p(new)/p(old), where p is the target function; if the ratio is greater than 1, always jump.
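
The three steps can be sketched as a short Metropolis sampler. The standard normal target, the normal proposal with sd = 1, the chain length, and the burn-in are assumptions made for illustration; they are not taken from the linked tutorial.

```r
set.seed(42)
target <- function(theta) dnorm(theta, mean = 0, sd = 1)  # assumed target density

n_iter <- 20000
chain <- numeric(n_iter)
chain[1] <- runif(1, -3, 3)  # step 1: random starting value

for (i in 2:n_iter) {
  old <- chain[i - 1]
  proposal <- rnorm(1, mean = old, sd = 1)  # step 2: symmetric proposal near the old value
  accept_prob <- min(1, target(proposal) / target(old))  # step 3: ratio capped at 1
  chain[i] <- if (runif(1) < accept_prob) proposal else old
}

samples <- chain[-(1:1000)]  # discard burn-in
print(c(mean = mean(samples), sd = sd(samples)))
```

Because the proposal is symmetric, the acceptance ratio needs only the target densities. An asymmetric proposal would need the Hastings correction term as well.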

https://github.com/florianhartig/LearningBayes/blob/master/CommentedCode/02-Samplers/MCMC/Convergence.md

what is noise

Randomness: living organisms rely on inferences (guesses) about the best response to make, because the information they receive from the outside world (environment) is diluted by noise.

information(about world)

inferences(on present)------------------>modify behaviors (to optimize survival probability)

Monday, October 3, 2016

Integrative Genomics Viewer

http://software.broadinstitute.org/software/igv/download

IGV is an excellent way to visualize sequencing data, whether it is whole-genome sequencing, ChIP-seq or RNA-seq.

data analysis course sequences on edX

https://courses.edx.org/courses/course-v1:HarvardX+PH525.7x+3T2015/cd8cfac0f386436fa0cb1ed3d0012328/

GitHub commands on MAC OSX

Nice link on when and how to perform statistical tests:

What statistical analysis should I use?

http://www.ats.ucla.edu/stat/stata/whatstat/whatstat.htm

Tuesday, September 27, 2016

Unstable quantification

There is a statistical ambiguity (artifact) which causes an unstable quantification.
Correcting for non-uniform coverage bias is a suggested solution for this ambiguity.

Transcript quantification assessment

Suppose the lengths of the exons and junctions are:

lengths = c(100,200,300,100,100)

The matrix which identifies transcripts with exons and junctions:

mat = cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))

The length of the transcripts is then: lengths %*% mat

Suppose we align 1000 reads to this gene. So w = 1000.

Suppose we observe the following read counts for the exons and for the two junctions:

counts = c(125,350,300,125,100)

Given the formula above, and the assumption of uniform read distribution and Poisson counts, we can get a rough estimate of theta by just solving the linear system above.

First try a guess at theta:

theta.hat = c(1, 2, 3) / 10000

We can see with this guess, our counts are too low, and not properly apportioned to the exons and junctions:

mat %*% theta.hat * lengths * w

We can roughly estimate theta by solving the linear system of equations above:

LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients

(recall the linear models covered in PH525.2x).

Q:What would be the rough estimate of theta for the first transcript if the counts for exons and junctions were 60,320,420,60,140?

A:counts = c(60,320,420,60,140)

LHS = counts/(lengths * w)

lm.fit(mat, LHS)$coefficients

Q:What is the estimate of theta using our rough estimator, for the first transcript (the transcript with exon 1 and exon 2)?

A: Following the code above, solving the linear system of equations gives a rough estimate of theta:

theta.hat = c(.00075, .0005, .0005)

This reproduces the observed counts exactly:

mat %*% theta.hat * lengths * w
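
The steps above can be collected into one runnable script. The wrapper name estimate_theta is a hypothetical helper introduced here; the data and the lm.fit call are the ones used in the exercise.

```r
lengths <- c(100, 200, 300, 100, 100)  # lengths of the exons and junctions
mat <- cbind(c(1, 1, 0, 1, 0), c(1, 1, 1, 1, 1), c(0, 1, 1, 0, 1))  # one column per transcript
w <- 1000                              # number of reads aligned to the gene

# Hypothetical helper: rough estimate of theta by least squares
estimate_theta <- function(counts) {
  LHS <- counts / (lengths * w)
  unname(lm.fit(mat, LHS)$coefficients)
}

theta1 <- estimate_theta(c(125, 350, 300, 125, 100))  # original counts
theta2 <- estimate_theta(c(60, 320, 420, 60, 140))    # counts from the first question

# Check that the first estimate reproduces the observed counts
fitted1 <- as.vector(mat %*% theta1 * lengths * w)
print(theta1)
print(theta2)
print(fitted1)
```

Both systems are consistent (five equations, three unknowns, rank 3), so the least squares solution is exact. For the second set of counts the first transcript comes out at 0.0002.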

Monday, September 26, 2016

Gene model assessment

Follow these instructions:

RNA-Seq Quantifying Transcript Levels

Y_f ~ Poisson(theta_f l_f), where Y_f is the number of reads coming from a transcript (fragment) f and l_f is its length.
Q: Why multiplication?
A: Because the larger the fragment, the more reads there are to pick from.
Q: Why Poisson?
A: Big N, small P_f => Poisson is proposed.
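
A quick numerical check of the "big N, small P_f" argument: with many reads and a small per-read probability, binomial counts are almost indistinguishable from Poisson counts with the same mean. N and p are hypothetical values chosen for illustration.

```r
N <- 1e6   # hypothetical total number of reads
p <- 2e-5  # hypothetical chance that a read comes from fragment f

k <- 0:60  # range of counts to compare; the expected count is N * p = 20
binom_probs <- dbinom(k, size = N, prob = p)
pois_probs  <- dpois(k, lambda = N * p)

# Largest pointwise difference between the two distributions
max_diff <- max(abs(binom_probs - pois_probs))
print(max_diff)
```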

See the papers for more information on how the Poisson model is described and for more RNA-seq techniques.

The next module is about the maximum likelihood estimator of the transcript levels of each transcript within a gene.


Friday, September 23, 2016

Ordinary Least Squares vs Linear Regression

Least squares is a method for performing linear regression analysis. The two terms are often used interchangeably to describe how to fit a straight line to the data.

The least squares criterion is the sum of squared errors (SSE).

Sum of squared errors: pick a candidate line, then for each data point measure the vertical distance between the point and that line, square it, and add these up; the fitted line is the one for which this sum of squared distances is as small as possible (minimizing the SSE).
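
A minimal sketch of the SSE idea on simulated data. The true line (intercept 2, slope 0.5), the noise level, and the 0.5 shift are arbitrary choices for this example. The line fitted by lm has a smaller SSE than a shifted copy of it.

```r
set.seed(7)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)  # simulated data; the true line is an arbitrary choice

# SSE of the line with intercept b0 and slope b1
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit <- lm(y ~ x)       # ordinary least squares fit
b <- unname(coef(fit))

sse_fit <- sse(b[1], b[2])          # SSE of the fitted line
sse_other <- sse(b[1] + 0.5, b[2])  # SSE after shifting the line up by 0.5
print(c(fitted = sse_fit, shifted = sse_other))
```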

My focus today is the least squares method (LSM).

Nice source for normalization vs standardization.
https://docs.google.com/document/d/1x0A1nUz1WWtMCZb5oVzF0SVMY7a_58KQulqQVT8LaVA/edit

Scaling the data sets:

Normalization: I use [(x-mean)/sd] normalization whenever differences in variable ranges could negatively affect the performance of my algorithm. This is the case for PCA, regression or simple correlation analysis, for example.

I use [x/max] when I'm just interested in some internal structure of the samples and not in the absolute differences between samples. This might be the case for peak detection in spectra, where the strength of the signal I'm seeking changes from sample to sample.

Finally, I use [x-mean] normalization when some samples could be using just a part of a bigger scale. This is the case for movie ratings, for example, in which some users tend to give more positive ratings than others.

Standardization (z-score): We normalize data when looking for relationships. Unfortunately, some people also apply these methods in experimental designs, which is not correct unless the variable is a transformed one and all the data need the same normalization method, such as pH in some agricultural studies. Normalization in experimental designs is meaningless because we can't compare the mean of, for instance, one treatment with the mean of another treatment that was logarithmically normalized. In regression and multivariate analysis, where the relationships are of interest, we can however normalize to reach a linear, more robust relationship. Commonly, when the relationship between two data sets is non-linear, we transform the data to reach a linear relationship. Here, normalization doesn't mean normalizing the data; it means normalizing the residuals by transforming the data.

Note: do not confuse normalization with standardization (e.g. z-score). Use standardization when the data follow a normal (Gaussian) distribution.

x' = (x - mean)/sd

Coefficient of variation (CV) = sd(DataSet)/mean(DataSet)

Coefficient of variation: it tells us about the noise level in the data.
A smaller CV indicates less noisy, more robust data.
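
The scalings discussed above and the coefficient of variation can be sketched in a few lines. The data vector is a hypothetical set of measurements.

```r
x <- c(12, 15, 9, 20, 14)  # hypothetical measurements

z        <- (x - mean(x)) / sd(x)  # z-score: mean 0, sd 1
x_max    <- x / max(x)             # scaled so that the maximum is 1
x_center <- x - mean(x)            # centered: mean 0, original units kept

cv <- sd(x) / mean(x)              # coefficient of variation

print(round(z, 3))
print(round(x_max, 3))
print(x_center)
print(cv)
```

The built-in scale(x) returns the same z-scores as a one-column matrix.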

effectiveness test

Randomization is done to avoid bias.
Blind vs. double-blind experiments further reduce the chance of bias.
In a double-blind experiment neither the subjects nor the researchers know which treatment group a subject is in. This is a lot of work, but it is needed to eliminate the effects of variables other than the treatment we want to test (confounding variables) that may affect the results. This is how we establish a cause-and-effect relation.
How one chooses to compare or present results can have a dramatic effect on what is implied.

Example: it took many years to prove the detrimental effects of smoking on health, even though there were many results from observational studies.

Some R studio exercise:

Data vectors have a type, and their elements can be given names:

> simpsons = c("Jane","John","Ada","Adam")

> names(simpsons) = c("mum","dad","sister","brother")

> simpsons

mum dad sister brother

"Jane" "John" "Ada" "Adam"

RNA Sequencing-functional genomics course

A good lecture is on edX:

https://www.edx.org/course/case-studies-functional-genomics-harvardx-ph525-7x

And another one:

https://www.youtube.com/watch?v=hksQlJLwKqo

useful links:
Study information at the Sequence Read Archive
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP033351

Himes et al paper at PubMed Central
https://www.ncbi.nlm.nih.gov/pubmed/24926665

European Nucleotide Archive (EMBL-EBI)
http://www.ebi.ac.uk/ena

Sequence Read Archive (NCBI)
https://www.ncbi.nlm.nih.gov/sra/

A sample table stored in course repo on github
https://github.com/genomicsclass/labs/blob/master/rnaseq/airway_sample_table.csv

(details on creating this table available at the airway package vignette)
http://www.bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html


Saturday, January 30, 2016

Today's lab meeting

Yesterday, we met with Dr. Voit from Georgia Tech. It looks like their lab meetings are very helpful for understanding systems biology topics in more detail. I hope to take advantage of having so many nice people in their group.
The topic yesterday was "Single cell particle tracking". The speaker was James Wade, one of Dr. Voit's students.

Wednesday, January 27, 2016

Today in bio125 course notes: linear regression

Students did not seem to know the meaning of linear regression. Why we do it and how we fit a straight line to the data did not make sense at all to most of them. The simple mathematical notion should be introduced before we teach it.

H2O2_LOH manuscript revising

Keep revising the Loss of Heterozygosity manuscript, especially the research and discussion parts.
Review the tables and figures before submitting a preprint.

Tuesday, January 26, 2016

oxidative stress and aging key terms and parameters

CLS: chronological life span measures the amount of time a single yeast mother cell stays alive.
RLS: replicative life span measures the number of cell divisions a mother cell undergoes before it stops dividing.
MR: mitotic recombination is the exchange of genetic information between homologous chromosomes in somatic cells.
ROS: reactive oxygen species, which include free radicals.
LOH: loss of heterozygosity, which can be used to measure the genomic integrity of cells.
MA: mitotic asymmetry, the generation of two dissimilar daughter cells following mitotic division.
MET15 locus: LOH is detected at this genetic locus via knockout of one allele using a resistance marker.
Cv: a variable in the H2O2 dose-response curve that represents the midpoint concentration at which cell viability decreases by half.
Cb: represents the midpoint concentration for black colonies on MLA plates.
Tg: from the biological survival curve; it represents the time at which there is a 50% decrease in genomic integrity (Qin et al. 2008).
Tc: represents the midpoint of chronological life span (CLS) (Qin et al. 2008).
L0: a ratio that measures the frequency of LOH events in daughter cells/mother cells.

Friday, January 22, 2016

Today in the bio125 class we applied the Bradford protein assay for protein concentration, together with serial dilution. The goal of this assay is to find the (unknown) protein quantity in each experiment. There are 6 experiments that contain 0.9% NaCl + BstStock + 5*Bradford solutions.

Experiment        : 1   2   3   4   5   6

NaCl              : 16  8   4   2   1   0
BstStock quantity : 640 720 760 780 790 800
5*Bradford        : 160 80  40  20  10  0

unknown = 50 ml + 200 B + 750 ml

We will use linear regression once we measure the amount of amino acids in each tube, and later use the fitted line to predict an unknown solution's protein amount.
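
A minimal sketch of the planned regression step: fit a standard curve to readings from known standards, then invert it for an unknown sample. The concentrations, absorbance readings, and units are hypothetical, idealized numbers, not measurements from this class.

```r
# Hypothetical, idealized standards (not class measurements)
conc       <- c(0, 1, 2, 4, 8, 16)                    # known protein amounts, arbitrary units
absorbance <- c(0.05, 0.09, 0.13, 0.21, 0.37, 0.69)   # made-up readings for those standards

fit <- lm(absorbance ~ conc)  # standard curve: absorbance = b0 + b1 * conc
b <- unname(coef(fit))

unknown_abs <- 0.29                           # made-up reading for the unknown sample
unknown_conc <- (unknown_abs - b[1]) / b[2]   # invert the fitted line
print(unknown_conc)
```

Inverting the line is only reasonable when the unknown reading falls inside the range covered by the standards.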

My name is Emine Guven. I study quantitative biology. My interests are cellular aging, VEGF receptor clustering, and math modeling of biological systems, with a broad focus on data analysis and simulations. This site is reserved as a notebook to keep my studies fresh and open to my students and collaborators. As a young scientist, my goal is to become more objective and independent. Thanks for visiting my blog.