
Just wanna share my experience

Missingness


Missing or incomplete data are a common scenario in many studies. An observation is considered an incomplete case if the value of any of its variables is missing. Even with the best design and monitoring, observations can be incomplete, usually for reasons such as missing by design, censoring and drop-out, or non-response. Most statistical packages exclude incomplete cases from analysis by default. This approach is easy to implement but has serious problems. Firstly, discarding the information in incomplete cases lowers the efficiency of the study. Secondly, it may lead to substantial bias in the analyses. Thus, missing data are important to consider in any analysis.

In statistical terminology, missingness in the data is assumed to be of three types: 1) missing completely at random (MCAR); 2) missing at random (MAR); and 3) missing not at random (MNAR).

Missing Completely at Random (MCAR)

A non-response process is said to be missing completely at random (MCAR) if the missingness is independent of both the unobserved and the observed data. Under MCAR, the observed data can be analyzed as though the pattern of missing values were predetermined: whichever analysis procedure is used, the process generating the missing values can be ignored.

Missing at Random (MAR)

A non-response process is said to be missing at random (MAR) if, conditional on the observed data, the missingness is independent of the unobserved measurements. Although, according to Molenberghs and Verbeke, the MAR assumption is particularly convenient in that it leads to considerable simplification in the issues surrounding the analysis of incomplete longitudinal data, it is rare in practice for an investigator to be able to justify its adoption, and so in such situations the final class of missing value mechanisms cannot be ruled out.

Missing Not at Random (MNAR)

A process that is neither MCAR nor MAR is termed non-random (MNAR). Under MNAR the probability of a measurement being missing depends on the unobserved data. Inference can then only be made by introducing further assumptions, about which the observed data alone carry no information.
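To make the three mechanisms concrete, here is a minimal simulation sketch in Python (assuming numpy; the variables, coefficients, and missingness rates are invented for illustration). Only under MCAR does the mean of the observed y stay close to the full-data mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)             # always observed
y = 0.5 * x + rng.normal(size=n)   # subject to missingness

# MCAR: the chance of being missing ignores both x and y.
mcar = rng.random(n) < 0.3
# MAR: the chance of being missing depends only on the observed x.
mar = rng.random(n) < 1 / (1 + np.exp(-2 * x))
# MNAR: the chance of being missing depends on the unobserved y itself.
mnar = rng.random(n) < 1 / (1 + np.exp(-2 * y))

for name, miss in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: mean of observed y = {y[~miss].mean():+.3f} "
          f"(full-data mean = {y.mean():+.3f})")
```

Under MAR the observed mean is biased unless we condition on x; under MNAR it is biased no matter which observed variables we condition on.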

Missingness frequently complicates the analysis of longitudinal data. In many clinical trials and other settings, the standard methodology used to analyze incomplete longitudinal data is based on methods such as complete case analysis (CC), last observation carried forward (LOCF), or simple forms of imputation (unconditional or conditional mean imputation). This is often done without questioning the possible influence of these assumptions on the final results.
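As a quick illustration of two of these naive approaches, here is a toy sketch using pandas (the subjects and values are invented). Complete case analysis drops every subject with any missing visit, while LOCF fills each gap with the last value observed for that subject:

```python
import numpy as np
import pandas as pd

# Toy longitudinal data: one row per subject, one column per visit.
wide = pd.DataFrame({
    "visit1": [10.0, 12.0, 9.0, 11.0],
    "visit2": [11.0, np.nan, 10.0, np.nan],
    "visit3": [12.0, np.nan, np.nan, 13.0],
})

cc = wide.dropna()         # complete case: keep only fully observed subjects
locf = wide.ffill(axis=1)  # LOCF: carry the last observation forward

print(cc)
print(locf)
```

Both are defensible only under strong assumptions: CC discards information and is generally valid only under MCAR, while LOCF assumes a subject's response stays flat after drop-out.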

References:

Fitzmaurice, G. M., Laird, N. M., Ware, J. H. (2004). Applied Longitudinal Analysis. Wiley Series in Probability and Statistics. John Wiley & Sons.

Folstein, M.F., Folstein, S., McHugh, P.R. (1975). “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12, pp. 189–198.

Jansen, I., Beunckens, C., Molenberghs, G., Verbeke, G. and Mallinckrodt, C. (2006). Analyzing incomplete discrete longitudinal clinical trial data. Statistical Science, 21(1), pp. 52–69.

Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. New York: Springer-Verlag.

Molenberghs, G. and Verbeke, G. (2007). Longitudinal Data Analysis. Censtat, Universiteit Hasselt.

May 12, 2008 | Statistics

Path Analysis


Path analysis is an extension of multiple regression analysis, used to test the fit of the correlation matrix against two or more causal models that the researcher is comparing. It was developed by Sewall Wright in the 1930s and is useful in illustrating a number of issues in causal analysis. In general, path diagrams are used to display the a priori hypothesized structure among the variables in the models. Four possible relationships can be conceived for any two variables X and Y. These, along with their diagrammatic notation, are given below.

- X -> Y implies that X might structurally influence Y, but not vice versa.

- X <- Y implies that Y might structurally influence X, but not vice versa.

- X <=> Y implies that X and Y might structurally influence each other, and finally,

- X <-> Y implies that no relationship can be hypothesised between X and Y.

Structures that include the first two and/or the last relations are called recursive models, while those that involve the third relation are called non-recursive models.

Since path analysis is an extension of regression analysis, it likewise carries some assumptions:

- Relations among the variables are linear, additive, and causal. Curvilinear, multiplicative, or interaction relations are not included.

- Residuals are uncorrelated with all other variables and residuals in the model.

- There is one-way causal flow.

- The variables are measured on an interval scale.

One major advantage of path analysis is that it allows testing hypothesized multivariate linear causal models. Other advantages are that it provides a breakdown of the covariance between variables into direct, indirect, and spurious or joint effects; it allows for the examination of intervening variables as well as the terminal variables; and it gives a parsimonious diagram of causal links with parameters that indicate the relationship between determined and determining variables.
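As a small worked sketch of these ideas (assuming statsmodels; the recursive model X -> M -> Y and its coefficients are invented), the direct, indirect, and total effects of X on Y can be read off two regressions on standardized variables:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
m = 0.6 * x + rng.normal(size=n)             # path a: X -> M
y = 0.4 * m + 0.3 * x + rng.normal(size=n)   # paths b: M -> Y, c: X -> Y

def z(v):
    return (v - v.mean()) / v.std()          # standardize: coefficients become path coefficients

x, m, y = z(x), z(m), z(y)

a = sm.OLS(m, sm.add_constant(x)).fit().params[1]                             # X -> M
b, c = sm.OLS(y, sm.add_constant(np.column_stack([m, x]))).fit().params[1:]   # M -> Y, X -> Y

print(f"direct effect of X on Y:  {c:+.3f}")
print(f"indirect effect via M:    {a * b:+.3f}")
print(f"total effect:             {c + a * b:+.3f}")
```

The sum of the direct and indirect effects reproduces (up to sampling error) the X-Y correlation, which is exactly the kind of covariance breakdown mentioned above.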

References:

1. Carey, G. (1998). Path Analysis Using Proc Calis. http://psych.colorado.edu/~carey/Courses/PSYC7291/ExampleCode.htm (accessed 12-02-07)

2. Shkedy Z. (no date). Structural Equation Modeling, an Introduction: Path Analysis and Confirmatory Factor Analysis. Hasselt University, Diepenbeek.

3. Johnson, R.A., Wichern, D.W. (2002). Applied Multivariate Statistical Analysis. Fifth edition. Prentice Hall.

May 12, 2008 | Statistics

Exploratory Factor Analysis


Factor analysis is a statistical technique used to uncover the unobserved structure (dimensions) of a set of variables. It reduces the attribute space from a larger number of variables to a smaller number of factors, and it does not assume that a dependent variable is specified. Factor analysis can be used to reduce a large number of variables; it is also useful in selecting a subset of variables out of a large set based on principal component factors. In addition, exploratory factor analysis assists in selecting a set of variables to be treated as uncorrelated when handling multicollinearity in multiple regression. It is equally used in identifying clusters of cases as well as outliers, to mention but a few applications.
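A minimal sketch of an exploratory factor analysis (using scikit-learn's FactorAnalysis; the two latent factors and the loading pattern are invented) looks like this; the estimated loadings should roughly recover the block structure, up to sign and rotation:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 300
f = rng.normal(size=(n, 2))                         # two latent factors
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = f @ loadings.T + 0.3 * rng.normal(size=(n, 6))  # six observed variables

fa = FactorAnalysis(n_components=2).fit(X)
print(np.round(fa.components_, 2))  # rows = factors, columns = observed variables
```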

May 12, 2008 | Statistics

Multivariate Analysis of Variance (MANOVA)


While analysis of variance (ANOVA) handles only one dependent variable, multivariate analysis of variance (MANOVA) performs an ANOVA-style analysis on several response variables simultaneously. This means that MANOVA tests whether the mean vectors of the response variables across different groups are equal or not.

The advantage of MANOVA derives from the fact that it forms the linear combination of the responses that maximizes the differences among the groups, thereby improving the chance of detecting group differences. Moreover, it takes the relationships among the response variables into account and can therefore be more powerful.

Several assumptions are made when using MANOVA:

  1. The random samples from different populations are independent.
  2. Each population is multivariate normal. However, small violations of the normality assumption are usually not fatal.
  3. The variance-covariance matrices are homogeneous (equal) across populations.
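For a concrete sketch, statsmodels ships a MANOVA implementation; in the invented data below the group means of (y1, y2) are shifted on purpose, and mv_test() reports the usual test statistics (Wilks' lambda, Pillai's trace, and so on):

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(3)
groups = np.repeat(["A", "B", "C"], 30)
shift = np.array([{"A": 0.0, "B": 0.5, "C": 1.0}[g] for g in groups])
df = pd.DataFrame({
    "group": groups,
    "y1": rng.normal(size=90) + shift,   # both responses shift with group,
    "y2": rng.normal(size=90) + shift,   # so the mean vectors differ
})

# Test whether the mean vector (y1, y2) is equal across the three groups.
fit = MANOVA.from_formula("y1 + y2 ~ group", data=df)
print(fit.mv_test())
```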

May 12, 2008 | Statistics

Canonical Correlation


To describe the association between two sets of variables, canonical correlation analysis plays a significant role. In canonical correlation, two sets of variables are related, and these variables may or may not be categorical. The main goal of canonical correlation analysis is to develop linear composites (canonical variables) by deriving a set of weights for each variable, thereby explaining the nature of the relationships between the sets of response and predictor variables, as measured by the relative contribution of each variable to the canonical functions (relationships).

The result of applying canonical correlation is a measure of the strength of the relationship between two sets of multiple variables, expressed as a canonical correlation coefficient (r) between the two sets. For interpretation of the results, standardized canonical coefficients (not raw canonical coefficients) are used to unify the units and scales of the original variables. The correlation between the original variables and the canonical variables is known as the intra-set structure correlation. The intra-set structure correlation is more stable than the raw or standardized canonical weights in a univariable context (Gittins, 1985). That is, a structure correlation describes a univariate relation between a variable and its canonical variable, without considering the existence of the other variables; it therefore provides no information about the multivariate contribution of a variable to its canonical variable. To quantify the contribution of each variable to the canonical variables in a multivariate context, Rencher (1998) and Johnson and Wichern (2002) recommend using the standardized coefficients instead of the intra-set structure correlations.

Canonical correlation requires a relatively large number of observations compared to the number of variables. It is also sensitive to collinearity in the independent variables and requires multivariate normal data. If the canonical correlation is done on standardized variables, each canonical variable is a principal component, and maximizing the correlation and the covariance amount to the same thing.
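Here is a minimal sketch with scikit-learn's CCA (the two variable sets are invented and share one latent variable, so the first canonical correlation should come out well above zero):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
n = 200
latent = rng.normal(size=n)  # common signal shared by the two sets
X = np.column_stack([latent + rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([latent + rng.normal(size=n), rng.normal(size=n)])

cca = CCA(n_components=1).fit(X, Y)
U, V = cca.transform(X, Y)   # canonical variables for each set
r = np.corrcoef(U[:, 0], V[:, 0])[0, 1]
print(f"first canonical correlation: {r:.3f}")
```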

Reference:

Johnson, R.A., and Wichern, D.W. (2002). Applied Multivariate Statistical Analysis. 5th edition. Pearson Education: Prentice-Hall.

May 12, 2008 | Statistics

What is the difference between “Multiple” and “Multivariate” Analysis?


As the complexity of data increases, we measure many variables, and some confusion comes up: what is a multiple analysis, and what is a multivariate analysis?

Some people think they are the same, others mix them up.

Actually they are easy to distinguish.

First we have to know some basics:

An independent variable is a variable that does not depend on other variables. Sometimes we refer to it as a covariate or predictor, and symbolize it by X.

A dependent variable is a variable that depends on other variables. Sometimes we call it the response, and symbolize it by Y.

If we have more than one predictor (X), then we are dealing with a multiple case.

If there is more than one response (Y), then we have a multivariate case.

For example:

In the univariate case we have simple linear regression:

Y = a + bX

Then multiple linear regression is:

Y = a + b1X1 + b2X2

For the multivariate case we can have models such as:

Y1 = a + bX

Y2 = a + bX

Or

Y1 = a + b1X1 + b2X2

Y2 = a + b1X1 + b2X2
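To see the multivariate case in code, here is a small numpy sketch (the design and coefficients are invented). Stacking the two responses as columns of Y lets a single least-squares call fit both equations at once, one column of coefficients per response:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + two predictors (multiple)

B_true = np.array([[1.0, 2.0],    # intercepts a for Y1, Y2
                   [0.5, -1.0],   # b1 for Y1, Y2
                   [1.5, 0.3]])   # b2 for Y1, Y2
Y = X @ B_true + rng.normal(size=(n, 2))  # two responses (multivariate)

B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(B_hat, 2))  # should be close to B_true
```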

May 12, 2008 | Statistics

All about Statistics



STATISTICS

There are lies, darned lies, and statistical outliers.

Statistics means never having to say you’re certain.

Statistics is the art of never having to say you’re wrong.

Variance is what any two statisticians are at. (C.J. Bradfield)

A statistician is a person who draws a mathematically precise line from an unwarranted assumption to a foregone conclusion.

A statistician is a person who stands in a bucket of ice water, sticks their head in an oven and says “on average, I feel fine!” (K. Dunnigan)

A statistician drowned while crossing a stream that was, on average, 6 inches deep.

Most people use statistics the way a drunk uses a lamp post, more for support than enlightenment.

Figures don’t lie, but liars figure. (Samuel Clemens, alias Mark Twain)

Are statisticians normal?

An engineer, a physicist, and a statistician were moose hunting in northern

The weather man is never wrong. Suppose he says that there’s an 80% chance of rain. If it rains, the 80% chance came up; if it doesn’t, the 20% chance came up!

May 12, 2008 | Statistics