R is a free, open-source implementation of the S statistical computing language and programming environment. Nowadays, R is widely used for statistical software development, data analysis, and machine learning. Since R is free and open source, there are now more than 2,000 user-contributed packages available, and anyone can fix bugs and/or add features. R can be integrated with other languages (C/C++, Java, and Python) and can interact with many data sources: ODBC-compliant databases (Excel and Access) and other statistical packages (SAS, Stata, SPSS, and Minitab). For high-performance computing tasks, several R packages can take advantage of multiple cores, either on a single machine or across a network.
Despite the aforementioned capabilities, R is a command line interface (CLI) where users type commands to perform a statistical analysis. The CLI is the preferred user interface for power users because it allows direct control over calculations and is flexible. However, this command-driven system requires good knowledge of the language and is difficult for beginners or less frequent users. To overcome this limitation, several R projects were developed to produce graphical user interfaces.
My talk, presented at the R Workshop at Hasselt University, Belgium, and at a seminar in the Department of Medical Epidemiology and Biostatistics at Karolinska Institutet, provides a review of some R GUI projects, illustrates how to develop a simple R GUI using tcltk, and discusses some future developments using R Service Bus and shiny. More details on R GUIs and some examples can be found in my PhD thesis.
Some examples of R GUIs:
Below is an example of an R interface developed using the tcltk package to demonstrate the central limit theorem.
The code can be obtained here.
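The tcltk GUI itself is linked above; as a language-neutral illustration of the simulation behind such a central limit theorem demo, here is a minimal Python sketch (the parent distribution, sample size, and number of replicates are arbitrary choices for this example):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 2000 samples of size 30 from a skewed exponential distribution.
samples = rng.exponential(scale=1.0, size=(2000, 30))
means = samples.mean(axis=1)

# Despite the skewed parent distribution, the sample means are
# approximately normal with mean 1 and standard deviation 1/sqrt(30).
print(round(means.mean(), 2), round(means.std(), 2))
```

In the GUI version, sliders for the sample size and number of replicates re-run exactly this computation and redraw a histogram of `means`.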
There are various classification algorithms that have been developed in different fields. Some algorithms are commonly used in genomics, such as linear discriminant analysis (LDA), the nearest neighbor classifier, and logistic regression. Many authors, such as Gohlmann and Talloen (2009) and Lee (2005), have comprehensively reviewed and compared these algorithms.
Logistic regression is a supervised method for binary or multi-class classification (Hosmer and Lemeshow, 1989). Because it is a simple, flexible, and straightforward model that is easy to extend, extensions of logistic regression have been widely used in genomics research (e.g., Liao and Chin, 2007; Sun and Wang, 2012).
In high-dimensional datasets, such as in microarray settings where there are usually more variables than observations and the variables are correlated (multicollinearity), classical logistic regression performs badly and provides inaccurate estimates. It gives a perfect fit to the data, with no bias but high variance, which can lead to poor prediction (overfitting). To prevent this problem, a penalty for model complexity should be introduced.
The presentation, which can be viewed here, gives a short overview of L1-penalized logistic regression. An example of the application of this method in genomics is to identify candidate classifier genes that distinguish two groups, e.g., a cancer and a non-cancer group.
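As a rough illustration of the idea (not the code from the presentation), here is a minimal Python sketch using scikit-learn's L1-penalized logistic regression on synthetic "gene expression" data; the dimensions, effect sizes, and penalty strength `C` are made up for this example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 100, 500                       # far more "genes" than samples
X = rng.standard_normal((n, p))
# Only the first 5 variables truly drive the class label.
signal = X[:, :5] @ np.array([2.0, -2.0, 1.5, -1.5, 1.0])
y = (signal + rng.standard_normal(n) > 0).astype(int)

# The L1 penalty shrinks most coefficients exactly to zero,
# yielding a sparse set of candidate classifier variables.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
n_selected = int(np.sum(clf.coef_ != 0))
print(n_selected, "of", p, "variables selected")
```

The variables with non-zero coefficients play the role of the candidate classifier genes; in practice `C` would be chosen by cross-validation.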
• Lee, J.W., et al. 2005. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48:869-885.
• Hosmer, D.W., and Lemeshow, S. 1989. Applied Logistic Regression. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, NY.
• Sun, H., and Wang, S. 2012. Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics 28(10):1368-1375.
• Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58:267-288.
• Goeman, J.J. 2010. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal 52(1):70-84.
• Gohlmann, H., and Talloen, W. 2009. Gene Expression Studies Using Affymetrix Microarrays. Chapman & Hall/CRC.
• Liao, J.G., and Chin, K.V. 2007. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 23(15):1945-1951.
DNA microarrays are a promising biotechnology that allows the monitoring of the expression of thousands of genes simultaneously.
Image courtesy: http://www.albany.edu/genomics/affymetrix-genechip.html
Microarray technology, as one of the recent biomedical technologies, produces high-dimensional data, which makes statistical analysis challenging.
The statistical components of a microarray experiment involve the following steps (Allison et al. 2006):
(1) Design. The development of an experimental plan to maximize the quality and quantity of information obtained.
(2) Pre-processing. Processing of the microarray image and normalization of the data to remove systematic variation. Other potential preprocessing steps include transformation of data, data filtering and background subtraction.
(3) Inference and/or classification. Inference entails testing statistical hypotheses which are usually about which genes are differentially expressed. Classification refers to analytical approaches that attempt to divide data into classes with no prior information (unsupervised classification) or into predefined classes (supervised classification).
(4) Validation of findings. The process of confirming the veracity of the inferences and conclusions drawn in the study.
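The inference step (3) can be sketched as gene-wise tests followed by a multiplicity correction. The following Python sketch uses simulated data, two-sample t-tests, and a hand-rolled Benjamini-Hochberg step-up adjustment; all the numbers (1000 genes, 50 truly changed, effect size 2) are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_per_group = 1000, 10
control = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
treated = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
treated[:50] += 2.0                   # 50 truly differentially expressed genes

# Gene-wise two-sample t-tests (the inference step) ...
_, pvals = stats.ttest_ind(treated, control, axis=1)

# ... followed by a Benjamini-Hochberg step-up adjustment at FDR 0.05:
# flag the largest k genes whose ordered p-values fall under k*alpha/m.
order = np.argsort(pvals)
thresholds = 0.05 * np.arange(1, n_genes + 1) / n_genes
below = pvals[order] <= thresholds
k = below.nonzero()[0].max() + 1 if below.any() else 0
significant = order[:k]
print(len(significant), "genes flagged at FDR 0.05")
```

A real analysis would follow the preprocessing step (2) first (normalization, filtering) before any testing.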
In a discussion session, I presented an overview of microarray analysis, specifically the use of gene expression profiling.
Statistical Methods for Microarray Experiments: Analysis of Dose-response Studies and Software Development in R. A Summary of My PhD Thesis
Drug development is defined as the entire process of bringing a new drug or device to the market, and is designed to ensure that only pharmaceutical products that are both safe and effective reach the market. The aim of this long process is to find good drugs that have strong effects on the intended disease and, on the other hand, minimal side effects (Marton et al., 1998).
Finding a safe and efficacious dose or dose range and establishing a dose-response relationship are two of the most important objectives in drug-discovery studies in the pharmaceutical industry. Bretz (2006) defined two major goals when performing a dose-response study. The first goal is to demonstrate a relationship between dose and response. Here, an overall dose-related trend is assessed by checking that, with increasing dose, the effect increases (or decreases) as well. This can be achieved by testing the global dose-response relationship under a monotonicity assumption, which assumes that as the dose increases, so does the effect (or, conversely, as the dose decreases, so does its impact). This assumption is founded on the fact that, in general, increasing the dose of a harmful agent will result in a proportional increase in both the incidence and the severity of an adverse effect (Ting, 2006). Once the first goal is established, the second goal is to estimate a target dose of interest, such as the dose of an experimental compound required to achieve 50% of the maximum effect (ED50), or the Minimum Effective Dose (MED).
In recent years, with the emergence of biomedical technologies, dose-response studies have been integrated with microarray studies, in which case the response is the gene expression measured at a certain dose level. The aim is usually to identify a subset of genes with an overall dose-related trend. Dose-response microarray experiments are usually performed in order to (1) gain insight into the biological target and the mechanism of the new medicine used to treat the target disease, (2) generate data on the pathways involved in potential unwanted side effects, (3) search for biomarkers that could be used as signatures for response, and (4) identify biomarkers that could eventually be used in phase-II or phase-III clinical trials as biomarker endpoints to replace the classical clinical endpoints.
In this dissertation, we mainly focus on (1) estimating target doses of interest by using parametric modeling and model averaging methods, (2) implementing resampling-based inference, including permutation and SAM, for the multiple contrast tests of Williams (1971, 1972) and Marcus (1976), the M statistic (Hu et al., 2005), and the modified M statistic (Lin et al., 2007), (3) analyzing dose-response microarray studies using a hierarchical Bayesian model, and (4) developing two new user-friendly software packages for performing dose-response microarray studies and biclustering analysis.
Part 1. Modeling Dose-response Microarray Studies: A Model-based Approach
The first part of this dissertation focuses on parametric modeling and estimating a target dose of interest, the ED50, in dose-response microarray experiments. To reach this aim, a three-step approach is proposed. The first step is to select genes with a monotonic trend. To perform this step, one of the testing procedures for a monotonic trend in dose-response microarray experiments discussed by Lin et al. (2007) is implemented. These testing procedures take into account the order restriction of the means with respect to the increasing doses. After the genes with a monotonic trend are identified, a model-based approach is carried out to estimate the ED50. However, the validity of the model-based approach depends on the choice of the dose-response model. In order to account for model uncertainty, several models are fitted, and in the third step, model averaging methods are used to estimate the model-averaged ED50.
In dose-response studies, the target dose is not only the ED50. Usually it is of interest to find the smallest dose with a discernible useful effect, or a maximum dose beyond which no further beneficial effect is seen (ICH-E4, 1994). It is common in dose-response studies to use the mean efficacy of the control dose (usually the zero dose) as a benchmark to decide whether a particular dose is clinically effective (Tamhane and Logan, 2006). The target dose for this purpose is known as the Minimum Effective Dose (MED). In this part, we also discuss the estimation of the MED in dose-response microarray studies. The definition and estimation of the MED for a single model, i.e., the Emax model, are given. Then, after several candidate models are fitted, model averaging techniques are implemented to obtain the model-averaged MED.
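This is not the thesis implementation, but the model-averaging idea can be sketched in Python: fit an Emax model and a linear model to simulated dose-response data, weight each model by its AIC weight, and average the per-model ED50 estimates. The doses, noise level, and the half-range definition of the linear model's ED50 are simplifying assumptions for this sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)
doses = np.repeat([0, 1, 2.5, 5, 10, 20], 5).astype(float)

def emax_model(d, e0, emax, ed50):
    return e0 + emax * d / (ed50 + d)

def linear_model(d, a, b):
    return a + b * d

# Simulate data from an Emax curve with true ED50 = 5.
y = emax_model(doses, 1.0, 4.0, 5.0) + rng.normal(0, 0.3, doses.size)

fits = {}
for name, f, p0, k in [("emax", emax_model, [1, 4, 5], 3),
                       ("linear", linear_model, [1, 0.2], 2)]:
    par, _ = curve_fit(f, doses, y, p0=p0, maxfev=10000)
    rss = np.sum((y - f(doses, *par)) ** 2)
    aic = doses.size * np.log(rss / doses.size) + 2 * k  # AIC up to a constant
    fits[name] = (par, aic)

# ED50 per model: a direct parameter for Emax; for the linear model we take
# the dose at half the fitted range (a simplifying convention here).
ed50 = {"emax": fits["emax"][0][2], "linear": doses.max() / 2}
aics = np.array([fits[m][1] for m in ed50])
w = np.exp(-0.5 * (aics - aics.min()))
w /= w.sum()                                   # AIC model weights
ed50_avg = sum(wi * ed50[m] for wi, m in zip(w, ed50))
print(round(ed50_avg, 2))
```

Because the data come from an Emax curve, the Emax model receives almost all the weight and the averaged ED50 lands near the true value of 5.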
Part 2. Multiple Contrast Test (MCT) for Detecting Monotonic Trends
In Part 2, we investigate the application of a resampling-based multiple contrast test procedure to detect a monotonic trend. Bretz (1999, 2006) implemented the multiple contrast test (MCT) procedure to extend Williams' and Marcus' tests for testing the monotonicity of a dose-response relationship, which can improve the power. The idea is that the order-restricted alternative hypothesis can be decomposed into several elementary alternatives with particular patterns of equalities and inequalities. These contrasts are then tested, and the contrast with the maximum test statistic is used. Bretz (2006) showed that, under the normality assumption, Williams' MCT and Marcus' MCT follow a multivariate t distribution. Since in practice it might not be valid to assume a particular distribution for a statistic, especially with the small sample sizes commonly found in microarray data (McLachlan et al., 2004), we propose to approximate the null distribution of Williams' MCT and Marcus' MCT (Bretz, 2006) using a permutation-based approach. In addition, we propose to apply the MCT to the M (Hu et al., 2005) and the modified M (M′, Lin et al., 2007) test statistics.
A common problem in microarray experiments is that some genes have small variance, which can make their (absolute) test statistics very large even though their fold changes are small (McLachlan et al., 2004). To overcome the problem of small-variance genes, we implement the Significance Analysis of Microarrays (SAM) procedure proposed by Tusher et al. (2001) for the MCT. We also performed simulation studies to investigate the performance of the proposed resampling-based MCTs in terms of FDR control and power.
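A stripped-down Python sketch of the permutation idea follows, using unstandardized Williams-type contrasts on four simulated dose groups. A real MCT would standardize each contrast by its estimated standard error and control the FDR across genes; everything here (group means, contrast matrix, number of permutations) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
# Four dose groups of 6 observations each, with an increasing trend.
groups = [rng.normal(mu, 1.0, 6) for mu in (0.0, 0.3, 0.8, 1.2)]
y = np.concatenate(groups)
labels = np.repeat(np.arange(4), 6)

# Williams-type elementary contrasts for 4 ordered doses (rows sum to zero).
contrasts = np.array([
    [-1.0, 0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.5, 0.5],
    [-1.0, 1 / 3, 1 / 3, 1 / 3],
])

def max_contrast_stat(y, labels):
    """Maximum over the elementary contrasts applied to the group means."""
    means = np.array([y[labels == g].mean() for g in range(4)])
    return np.max(contrasts @ means)

observed = max_contrast_stat(y, labels)

# Permutation null: shuffle the dose labels and recompute the statistic.
perm = np.array([max_contrast_stat(y, rng.permutation(labels))
                 for _ in range(2000)])
pval = np.mean(perm >= observed)
print(round(pval, 3))
```

For a genome-wide analysis, the same permutations would be reused across all genes so that the null distribution is shared, as in the SAM procedure.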
Part 3. Hierarchical Bayesian Dose-response Analysis
In the third part, we propose a hierarchical Bayesian modeling approach to identify genes with a monotonic trend. The Bayesian approach seeks evidence in the data under both the null and the order-restricted alternative hypotheses. Bayesian Variable Selection (BVS) is implemented to estimate the posterior probability of all possible monotone models. The posterior probability of the null model is then used to identify genes with strong evidence of a monotonic trend. Several simulations to investigate the performance of this approach are carried out and discussed. From the obtained posterior probabilities, gene-specific levels of evidence can be constructed from the Bayesian perspective.
Part 4. Software Development
In the last part of this dissertation, we discuss software development for dose-response microarray studies and biclustering analysis. The developed software consists of R packages intended to be easy to use, with more user-friendly interfaces. The software is developed using the GUI toolkits provided within R.
The first package is for analyzing dose-response microarray experiments. Lin et al. (2008) developed an R package, IsoGene, for testing for a monotonic relationship between gene expression and dose in a microarray experiment (Pramana et al., 2010). The IsoGene package is a command-based package and requires users to have basic knowledge of R, which can make it difficult to use for those who are not familiar with R or are infrequent R users. To overcome this limitation, a user-friendly package called IsoGeneGUI was developed (Pramana et al., 2010b, 2011a). We present the capabilities of the IsoGeneGUI package along with illustrative examples of the analysis of a dose-response microarray study.
The second package performs biclustering analysis. Biclustering is a technique to cluster the rows and columns of a matrix simultaneously. The goal is to find subgroups of rows and columns (called biclusters) that are as similar as possible to each other and as different as possible from the rest (Kaiser and Leisch, 2008). Note that in microarray studies, the rows and columns are genes and samples, respectively. Bicluster methods for gene expression therefore aim to discover groups of genes that have a similar regulatory mechanism within a group of samples (Madeira and Oliveira, 2004).
We developed a new R package, RcmdrPlugin.BiclustGUI (shortened to BiclustGUI), which is an interface to several available biclustering packages: biclust (Kaiser and Leisch, 2008), fabia (Hochreiter et al., 2010), and isa2 (Csárdi et al., 2010). The BiclustGUI is an extension of the R Commander package (an Rcmdr plug-in; Fox, 2005) and is intended to help beginners or infrequent R users perform biclustering analysis.
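The BiclustGUI wraps R packages; as a language-neutral illustration of what a biclustering call looks like, here is a Python sketch using scikit-learn's spectral co-clustering on a synthetic matrix with planted biclusters. This is an analogous method, not one of the algorithms wrapped by the GUI:

```python
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# A synthetic "expression matrix" (60 genes x 40 samples) with
# three planted biclusters plus noise.
data, row_truth, col_truth = make_biclusters(
    shape=(60, 40), n_clusters=3, noise=1.0, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(data)

# Every row (gene) and column (sample) is assigned to one bicluster.
print(np.bincount(model.row_labels_))
print(np.bincount(model.column_labels_))
```

In the microarray setting, each recovered bicluster corresponds to a group of genes behaving similarly within a group of samples.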
Biostatistics and Statistical Bioinformatics: a Lecture at the Faculty of Mathematics and Natural Sciences, Brawijaya University
In October 2012, I had the honor of giving a guest lecture in the Mathematics Department of Brawijaya University. I never imagined that I would return to the university that gave me so much valuable knowledge to deliver a lecture there.
Theory of Mean and Variance of Population Estimation Using Jackknife and Bootstrap Method
Abstract of my Bachelor Thesis (1999)
Most statistical methods, such as hypothesis testing and maximum likelihood estimation, were designed to be implemented on mechanical calculators. These classical methods require distributional assumptions, usually a normal distribution. Modern electronic computation has encouraged the development of new statistical methods, called resampling methods, that require fewer distributional assumptions than classical methods and can be applied to more complicated statistical estimators.
This research focuses on two resampling methods: the jackknife and the bootstrap. The accuracy of the mean and variance estimates is compared through the standard errors they produce.
The jackknife method, introduced by Quenouille, was not only a nonparametric device for estimating bias but also a way to obtain approximate confidence intervals in problems where standard statistical procedures could not be applied. One of the promising developments in computer-intensive statistics is the bootstrap method. The bootstrap was introduced primarily as a device for extending the standard-error-of-the-mean formula to estimators other than the mean. This method can be applied to almost any statistical estimation problem, and the dataset does not have to be a simple random sample from a single distribution.
Normally distributed data and data with an unknown distribution are used. Statistical software (Minitab) and Turbo Pascal are used for the resampling calculations.
The bootstrap and the jackknife have the same accuracy when used to estimate the population mean from normally distributed data. For data with an unknown distribution, the bootstrap is more accurate than the jackknife. For estimating the population variance, in both normally distributed data and data with an unknown distribution, the bootstrap has the highest accuracy. Overall, the bootstrap method is better than the jackknife for estimating the population mean and variance.
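The comparison can be reproduced in miniature. The following Python sketch computes the jackknife and bootstrap standard errors of the sample mean on simulated normal data; for the mean, the jackknife SE coincides exactly with the textbook s/√n, while the bootstrap approximates it:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(10.0, 2.0, size=50)
n = data.size

# Jackknife: leave one observation out at a time.
loo_means = np.array([np.delete(data, i).mean() for i in range(n)])
se_jack = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))

# Bootstrap: resample with replacement many times.
boot_means = np.array([rng.choice(data, size=n, replace=True).mean()
                       for _ in range(2000)])
se_boot = boot_means.std(ddof=1)

# Both approximate the textbook standard error s / sqrt(n).
print(round(data.std(ddof=1) / np.sqrt(n), 3),
      round(se_jack, 3), round(se_boot, 3))
```

For less smooth estimators than the mean (e.g., the median or the variance), the two methods diverge, which is what the thesis comparison examines.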
The aim of discriminant analysis is to find a set of independent variables that predicts group membership. It is a technique for classifying a set of observations into predefined categories, making it possible to predict the category of a new observation.
Applying discriminant analysis provides us with linear combinations of the original variables. These new values give the maximal separation between the groups of interest and are therefore better at showing the differences across the groups. Classification error rates can also be obtained to assess the performance of the derived discriminant functions.
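A minimal Python sketch of both ideas, using scikit-learn's LDA on synthetic two-group data: `transform` yields the discriminant scores (the separating linear combinations), and the held-out error rate assesses the discriminant function. The dataset dimensions and separation are arbitrary choices for this example:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic two-group data with a clear separation.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, class_sep=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# The discriminant scores are the linear combinations of the original
# variables that maximally separate the groups (one axis for two groups).
scores = lda.transform(X_test)
error_rate = 1.0 - lda.score(X_test, y_test)
print(scores.shape, round(error_rate, 3))
```

Using a held-out test set (rather than the training data) gives an honest estimate of the classification error rate.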
Missing or incomplete data are a common scenario in many studies. An observation is considered an incomplete case if the value of any of the variables is missing. Even with the best design and monitoring, observations can be incomplete, usually for the following possible reasons: missing by design; censoring and drop-out; or non-response. Most statistical packages exclude incomplete cases from analysis by default. This approach is easy to implement but has serious problems. Firstly, the loss of information on incomplete cases may lower the desired efficiency of the study. Secondly, it may lead to substantial biases in the analyses. Thus, missing data are important to consider in the analyses.
In statistical terminology, missingness in the data is classified into three types: 1) Missing Completely at Random (MCAR); 2) Missing at Random (MAR); and 3) Missing Not at Random (MNAR).
A non-response process is said to be missing completely at random (MCAR) if the missingness is independent of both the unobserved and the observed data. Under MCAR, the observed data can be analyzed as though the pattern of missing values were predetermined: whatever the analysis procedure, the process generating the missing values can be ignored.
A non-response process is said to be missing at random (MAR) if, conditional on the observed data, the missingness is independent of the unobserved measurements. According to Molenberghs and Verbeke, although the MAR assumption is particularly convenient, in that it leads to considerable simplification of the issues surrounding the analysis of incomplete longitudinal data, it is rare in practice for an investigator to be able to justify its adoption, so in such situations the final class of missing-value mechanisms cannot be ruled out.
A process that is neither MCAR nor MAR is termed non-random (MNAR). Under MNAR, the probability of a measurement being missing depends on the unobserved data. Inference can then only be made by making further assumptions, about which the observed data alone carry no information.
Missingness frequently complicates the analysis of longitudinal data. In many clinical trials and other settings, the standard methodology used to analyze incomplete longitudinal data is based on methods such as complete case analysis (CC), the last observation carried forward (LOCF) method, or simple forms of imputation (unconditional or conditional mean imputation). This is often done without questioning the possible influence of the underlying assumptions on the final results.
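The danger of complete case analysis can be demonstrated with a small simulation. In this Python sketch the missingness of y depends only on an observed covariate x (so the mechanism is MAR): the complete-case mean of y is biased, while a simple regression-based estimate using x roughly recovers the truth. The missingness model and effect sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(0.0, 1.0, n)             # fully observed covariate
y = 2.0 + x + rng.normal(0.0, 1.0, n)   # outcome; true mean is 2.0

# MAR mechanism: the probability that y is missing depends only on x.
p_missing = 1 / (1 + np.exp(-2 * x))    # large x -> y more often missing
missing = rng.random(n) < p_missing

cc_mean = y[~missing].mean()            # complete-case estimate (biased low)

# Regression of y on x among the observed cases is valid under MAR;
# predicting y for everyone roughly corrects the bias.
coef = np.polyfit(x[~missing], y[~missing], 1)
adj_mean = np.polyval(coef, x).mean()
print(round(cc_mean, 2), round(adj_mean, 2))
```

This is the same logic that makes likelihood-based and multiple-imputation analyses preferable to CC or LOCF under MAR.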
Fitzmaurice, G.M., Laird, N.M., and Ware, J.H. (2004). Applied Longitudinal Analysis. Wiley Series in Statistics. Wiley-IEEE.
Folstein, M.F., Folstein, S., and McHugh, P.R. (1975). "Mini-mental state": a practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12, pp. 189-198.
Jansen, I., Beunckens, C., Molenberghs, G., Verbeke, G., and Mallinckrodt, C. (2006). Analyzing incomplete discrete longitudinal clinical trial data. Statistical Science, 21(1), pp. 52-69.
Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. New York: Springer-Verlag.
Molenberghs, G. and Verbeke, G. (2007). Longitudinal Data Analysis. Censtat, Universiteit Hasselt.
Path analysis is an extension of multiple regression analysis, used to test the fit of the correlation matrix against two or more causal models that the researcher compares. It was developed by Sewall Wright in the 1930s and is useful in illustrating a number of issues in causal analysis. In general, path diagrams are used to display the a priori hypothesized structure among the variables in the models. Four possible relationships can be hypothesized for any two variables X and Y. These, along with their diagrammatic representation, are given below.
• X -> Y implies that X might structurally influence Y but not vice versa.
• X <- Y implies that Y might structurally influence X but not vice versa.
• X <=> Y implies that X and Y might structurally influence each other, and finally,
• X <-> Y implies that no relationship can be hypothesized between X and Y.
Structures that include the first two and/or the last relations are called recursive models, while those that involve the third relation are called non-recursive models.
Since path analysis is an extension of regression analysis, it comes with some assumptions:
• Relations among the variables are linear, additive, and causal. Curvilinear, multiplicative, or interaction relations are not included.
• Residuals are uncorrelated with all other variables and with the other residuals in the model.
• There is a one-way causal flow.
• The variables are measured on an interval scale.
One major advantage of path analysis is that it allows testing hypothesized multivariate linear causal models. Other advantages are that it provides a breakdown of the covariance between variables into direct, indirect, and spurious or joint effects; it allows for the examination of intervening variables as well as terminal variables; and it gives a parsimonious diagram of causal links, with parameters that indicate the relationships between determined and determining variables.
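The direct/indirect decomposition can be sketched numerically. In this Python example a hypothetical recursive model X -> M -> Y with an additional direct path X -> Y is simulated; the path coefficients are recovered from two least-squares fits, and the total effect of X on Y equals the direct effect plus the indirect effect through M. All path values are made up for this sketch:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
# Hypothetical recursive model: X -> M -> Y plus a direct path X -> Y.
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)            # path a = 0.5
y = 0.3 * m + 0.4 * x + rng.normal(size=n)  # paths b = 0.3, c = 0.4

# Path coefficients from least-squares fits (no intercepts; the
# variables are zero-mean by construction).
a = (x @ m) / (x @ x)
b, c = np.linalg.lstsq(np.column_stack([m, x]), y, rcond=None)[0]

direct = c                  # X -> Y
indirect = a * b            # X -> M -> Y
total = direct + indirect   # equals the simple regression slope of Y on X
print(round(direct, 2), round(indirect, 2), round(total, 2))
```

The identity total = direct + indirect holds exactly for least-squares fits in a recursive model, which is precisely the covariance breakdown described above.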