Friday, September 23, 2016

Ordinary Least Squares vs Linear Regression

Least squares is a method for performing linear regression analysis; the two terms are often used interchangeably when describing how to fit a straight line to data.
A closely related formulation is the traditional sum of squared errors (SSE) criterion.
Sum of squared errors: one thing that can be done is to pick a candidate line and, for each data point, measure the vertical distance between the point and that line, square it, and add these up; the fitted line is the one for which this sum is as small as possible (minimizing the SSE).
My focus today is the least squares method (LSM).
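
To make this concrete, here is a minimal sketch in Python of fitting a line by minimizing the SSE; the synthetic data and coefficients below are assumptions for illustration, not from any real data set.

import numpy as np

# Illustrative synthetic data (an assumption, not real data): y ~ 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Closed-form least squares solution for the line y = b0 + b1*x:
#   b1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   b0 = y_mean - b1 * x_mean
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# The quantity this line minimizes: the sum of squared vertical distances (SSE).
sse = np.sum((y - (b0 + b1 * x)) ** 2)
print(b0, b1, sse)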

A nice source on normalization vs standardization:
https://docs.google.com/document/d/1x0A1nUz1WWtMCZb5oVzF0SVMY7a_58KQulqQVT8LaVA/edit

Scaling the data sets:

Normalization: I use [(x - mean)/sd] normalization whenever differences in variable ranges could negatively affect the performance of my algorithm. This is the case for PCA, regression, or simple correlation analysis, for example.

I use [x/max] when I'm interested only in some internal structure of the samples and not in the absolute differences between samples. This might be the case for peak detection in spectra, where the strength of the signal I'm seeking varies from sample to sample.

Finally, I use [x - mean] normalization when some samples may use only part of a bigger scale. This is the case for movie ratings, for example, where some users tend to give more positive ratings than others.
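
A minimal sketch of these rescalings in Python; the toy array below is an assumption for illustration (the (x - mean)/sd form is shown with standardization further down).

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # toy values, for illustration only

scaled_by_max = x / x.max()    # [x/max]: keeps the internal structure of a sample
mean_centered = x - x.mean()   # [x - mean]: removes a per-sample offset (e.g., rater bias)
print(scaled_by_max)
print(mean_centered)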

Standardization (z-score): We normalize data when looking for relationships. Some people, unfortunately, apply these methods in experimental designs, which is not correct unless the variable is a transformed one and all of the data need the same normalization method, such as pH in some agricultural studies. Normalization in experimental designs is meaningless because we cannot compare the mean of, for instance, one treatment with the mean of another treatment that has been logarithmically normalized. In regression and multivariate analysis, where the relationships are of interest, however, we can normalize to reach a linear, more robust relationship. Commonly, when the relationship between two data sets is non-linear, we transform the data to reach a linear relationship. Here, normalization does not mean normalizing the data; it means normalizing the residuals by transforming the data. So normalization of data implies normalizing the residuals using transformation methods.
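
As a sketch of that last point, consider an exponential relationship: a log transform of y linearizes it, so a least-squares fit on the transformed data gives well-behaved residuals (the data here are synthetic and purely illustrative).

import numpy as np

# Synthetic non-linear relationship (illustrative): y grows exponentially with x.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 5.0, 40)
y = np.exp(0.8 * x) * rng.lognormal(sigma=0.1, size=x.size)

# Transforming y logarithmically normalizes the residuals and linearizes the fit.
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, intercept)  # slope should come out near 0.8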

Note: do not confuse normalization with standardization (e.g., the z-score). Use standardization when the data follow a normal (Gaussian) distribution.

x" =(x-mean)/sd

Coefficient of Variation = sd(DataSet) / mean(DataSet)

Coefficient of variation: it tells us about the noise level in the data relative to its mean.
A smaller CoV indicates less noisy, more robust data.
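
A small sketch comparing the coefficient of variation of two toy series (values assumed for illustration): the series with the tighter spread around its mean gets the smaller, less noisy CoV.

import numpy as np

a = np.array([9.8, 10.1, 10.0, 9.9, 10.2])  # tight spread around the mean
b = np.array([5.0, 15.0, 8.0, 12.0, 10.0])  # wide spread around the same mean

def coefficient_of_variation(data):
    # Unitless relative spread: sd / mean.
    return data.std() / data.mean()

print(coefficient_of_variation(a))  # small CoV: less noisy, more robust
print(coefficient_of_variation(b))  # large CoV: noisier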


