## Developing R Graphical User Interfaces

R is a free, open-source implementation of the S statistical computing language and programming environment. Nowadays, R is widely used for statistical software development, data analysis, and machine learning. Because R is free and open source, anyone can fix bugs or add features, and there are now more than 2000 user-contributed packages available. R can be integrated with other languages (C/C++, Java, and Python). R can also interact with many data sources: ODBC-compliant databases (Excel and Access) and other statistical packages (SAS, Stata, SPSS, and Minitab). For high-performance computing, several R packages take advantage of multiple cores, either on a single machine or across a network.

Despite these capabilities, R is driven through a command line interface (CLI), where users type commands to perform a statistical analysis. The CLI is the preferred interface for power users because it is flexible and gives direct control over calculations. However, this command-driven system requires good knowledge of the language, which makes it difficult for beginners and infrequent users. To overcome this limitation, several R projects were developed to provide graphical user interfaces.

My talk, presented at the R Workshop at Hasselt University, Belgium, and at a seminar in the Department of Medical Epidemiology and Biostatistics at Karolinska Institutet, reviews some R GUI projects and illustrates how to develop a simple R GUI using tcltk, along with some future development using R Service Bus and shiny. More details on R GUIs and some examples can be found in my PhD thesis.

Some examples of R GUIs:

## Classification using L1-Penalized Logistic Regression

Various classification algorithms have been developed in different fields. Some, such as linear discriminant analysis (LDA), the nearest neighbor classifier, and logistic regression, are commonly used in genomics. Authors such as Gohlmann and Talloen (2009) and Lee et al. (2005) have comprehensively reviewed and compared these algorithms.

Logistic regression is a supervised method for binary or multi-class classification (Hosmer and Lemeshow, 1989). Because it is a simple, flexible model that is easy to extend, extensions of logistic regression have been widely used in genomics research (e.g., Liao and Chin, 2007; Sun and Wang, 2012).

In high-dimensional datasets, such as microarray data, there are usually more variables than observations and the variables are correlated (multicollinearity). In this setting, classical logistic regression performs badly and provides inaccurate estimates: it fits the data perfectly, with no bias but high variance, which can lead to poor prediction (overfitting). To prevent this problem, a penalty for model complexity should be introduced.
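To make the idea of a complexity penalty concrete, here is a minimal sketch of L1-penalized logistic regression fitted by proximal gradient descent (soft-thresholding). It is an illustration under stated assumptions, not the method used in the presentation: the function names, the toy data, the step size, and the choice of optimizer are all hypothetical, and the objective is taken to be the average negative log-likelihood plus `lam` times the sum of absolute coefficients (the intercept is not penalized).

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within [-threshold, threshold] becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Proximal gradient descent for mean negative log-likelihood + lam * sum(|beta_j|)."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        probs = [sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) for row in X]
        resid = [pi - yi for pi, yi in zip(probs, y)]
        grad_intercept = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        intercept -= step * grad_intercept  # unpenalized intercept: plain gradient step
        # Gradient step followed by soft-thresholding, which sets small coefficients to 0.
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta


# Hypothetical toy data: 8 samples, 3 "genes"; only the first gene separates the groups.
X = [[-1.5, 0.2, -0.3], [-1.0, -0.4, 0.5], [-0.5, 0.1, 0.1], [0.3, -0.2, -0.4],
     [-0.2, 0.3, 0.2], [0.6, -0.1, 0.3], [1.0, 0.4, -0.2], [1.4, -0.3, 0.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

_, beta_weak_penalty = fit_l1_logistic(X, y, lam=0.05)
_, beta_strong_penalty = fit_l1_logistic(X, y, lam=1.0)
print("lam=0.05:", beta_weak_penalty)
print("lam=1.0: ", beta_strong_penalty)
```

A larger `lam` drives more coefficients to exactly zero, which is why the L1 penalty doubles as a gene selection device.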

The presentation, which can be viewed here, gives a short overview of L1-penalized logistic regression. An example application of this method in genomics is selecting candidate classifier genes to distinguish two groups, e.g., a cancer and a non-cancer group.

**References**

• Lee, J.W., et al. 2005. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis. 48:869-885.

• Hosmer, D.W., Lemeshow, S., 1989. Applied Logistic Regression. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, NY.

• Sun, H., and Wang, S. 2012. Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics. 28(10):1368-1375.

• Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 58:267-288.

• Goeman, J.J. 2010. L1 Penalized Estimation in the Cox Proportional Hazards Model. Biometrical Journal. 52 (1): 70-84.

• Gohlmann, H., and Talloen, W. 2009. Gene Expression Studies Using Affymetrix Microarrays. Chapman & Hall/CRC.

• Liao, J.G., and Chin, K.V. 2007. Logistic regression for disease classification using microarray data: model

## Statistical Methods for Microarray Experiments: Analysis of Dose-response Studies and Software Development in R. A Summary of My PhD Thesis

**Statistical Methods for Microarray Experiments: Analysis of Dose-response Studies and Software Development in R.**

Drug development is defined as the entire process of bringing a new drug or device to the market, and it is designed to ensure that only pharmaceutical products that are both safe and effective reach the market. The aim of this long process is to find good drugs that have strong effects on the intended disease and, at the same time, minimal side effects (Marton et al., 1998).

Finding a safe and efficacious dose or dose range and establishing a dose-response relationship are two of the most important objectives in drug-discovery studies in the pharmaceutical industry. Bretz (2006) defined two major goals when performing a dose-response study. The first goal is to demonstrate a relationship between dose and response. Here, an overall dose-related trend is assessed by checking that the effect increases (or decreases) with increasing dose. This can be achieved by testing the global dose-response relationship under a monotonicity assumption: as the dose increases, so does the effect, and conversely, as the dose decreases, so does its impact. This assumption is founded on the fact that, in general, increasing the dose of a harmful agent results in a proportional increase in both the incidence and the severity of an adverse effect (Ting, 2006). Once the first goal is established, the second goal is to estimate a target dose of interest, such as the dose of an experimental compound required to achieve 50% of the maximum effect (ED50) or the Minimum Effective Dose (MED).
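As an illustration of what the ED50 means, the sketch below uses the Emax model, one common parametric dose-response curve, in the parameterization E0 + Emax · d / (ED50 + d). The parameter values and function names are hypothetical. It checks that the response at the ED50 sits halfway between baseline and maximum effect, and recovers the ED50 numerically by bisection, which also works for other increasing curves where no closed form is available.

```python
def emax_response(dose, e0, emax, ed50):
    """Emax model: baseline e0 plus a saturating effect that approaches emax."""
    return e0 + emax * dose / (ed50 + dose)


def dose_reaching(curve, target, low, high, tol=1e-9):
    """Bisection: dose in [low, high] at which an increasing curve reaches target."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if curve(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical parameter values, for illustration only.
e0, emax, ed50 = 1.0, 2.0, 3.0

# Response halfway between the baseline and the maximum effect.
half_effect = e0 + 0.5 * emax

estimated_ed50 = dose_reaching(
    lambda d: emax_response(d, e0, emax, ed50), half_effect, 0.0, 100.0
)
print("response at ED50:", emax_response(ed50, e0, emax, ed50))
print("ED50 recovered by bisection:", estimated_ed50)
```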

In recent years, with the emergence of new biomedical technologies, dose-response studies have been integrated with microarray studies, in which the response is the gene expression measured at a certain dose level. The aim is usually to identify a subset of genes with an overall dose-related trend. Dose-response microarray experiments are usually performed in order to (1) gain insight into the biological target and the mechanism of the new medicine used to treat the target disease, (2) generate data on the pathways involved in potential unwanted side effects, (3) search for biomarkers that could be used as signatures for response, and (4) identify biomarkers that eventually could be used in phase-II or phase-III clinical trials as biomarker endpoints to replace the classical

In this dissertation, we mainly focus on (1) estimating target doses of interest by using parametric modeling and model averaging methods, (2) applying resampling-based inference, including permutation and SAM, to multiple contrast tests based on the Williams (Williams, 1971 and 1972), Marcus (Marcus, 1976), M (Hu et al., 2005), and modified M (Lin et al., 2007) statistics, (3) analyzing dose-response microarray studies using a hierarchical Bayesian model, and (4) developing two new user-friendly software packages for dose-response microarray studies and biclustering analysis.


**Part 1. Modeling Dose-response Microarray Studies: A Model-based Approach**

The first part of this dissertation focuses on parametric modeling and on estimating a target dose of interest, the ED50, in dose-response microarray studies. To reach this aim, a three-step approach is proposed. The first step is to select genes with a monotonic trend. To perform this step, one of the testing procedures for a monotonic trend in dose-response microarray experiments discussed by Lin et al. (2007) is implemented. These testing procedures take into account the order restriction of the means with respect to the increasing doses. In the second step, after the genes with a monotonic trend are identified, a model-based approach is carried out to estimate the ED50. However, the validity of the model-based approach depends on the choice of the dose-response model. To account for model uncertainty, several models are fitted, and in the third step model averaging methods are used to estimate the model-averaged ED50.
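The third step can be sketched as a weighted average of the per-model ED50 estimates. The text above does not say which weights are used, so the example below assumes Akaike weights, a common choice in which each model's weight is proportional to exp(-0.5 · ΔAIC). The model names, AIC values, and ED50 estimates are made up for illustration.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC values into weights that sum to 1 (lower AIC gives higher weight)."""
    best = min(aic_values)
    raw = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(raw)
    return [r / total for r in raw]


def model_average(estimates, weights):
    """Weighted average of per-model estimates of the same target quantity."""
    return sum(w * est for w, est in zip(weights, estimates))


# Hypothetical fits for a single gene: (model name, AIC, estimated ED50).
fits = [
    ("emax", 102.3, 2.8),
    ("logistic", 101.1, 3.4),
    ("exponential", 108.9, 5.1),
]

weights = akaike_weights([aic for _, aic, _ in fits])
averaged_ed50 = model_average([ed50 for _, _, ed50 in fits], weights)

for (name, _, ed50), w in zip(fits, weights):
    print(f"{name:12s} ED50={ed50:.2f} weight={w:.3f}")
print("model-averaged ED50:", round(averaged_ed50, 3))
```

Because the weights are non-negative and sum to one, the averaged ED50 always lies between the smallest and largest single-model estimates, and poorly fitting models contribute little.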

In dose-response studies, the ED50 is not the only target dose. Usually it is of interest to find the smallest dose with a discernible useful effect or a maximum dose beyond which no further beneficial effect is seen (ICH-E4, 1994). It is common in dose-response studies to use the mean efficacy of the control dose (usually zero dose) as a benchmark to decide whether a particular dose is clinically effective (Tamhane and Logan, 2006). The target dose for this purpose is known as the Minimum Effective Dose (MED). In this part, we also discuss the estimation of the MED in dose-response microarray studies. The definition and estimation of the MED for a single model, i.e., the Emax model, is given first. Then, after several candidate models are fitted, model averaging techniques are used to obtain the model-averaged MED.
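The exact MED definition used in the thesis is not reproduced here, so the following sketch rests on an assumption: the MED is taken to be the smallest dose whose mean response exceeds the control (zero-dose) response by a relevance threshold `delta`. Under the Emax model E0 + Emax · d / (ED50 + d) with a positive Emax, solving Emax · d / (ED50 + d) = delta gives a closed form. All names and values are hypothetical.

```python
def emax_response(dose, e0, emax, ed50):
    """Emax model: baseline e0 plus a saturating effect that approaches emax."""
    return e0 + emax * dose / (ed50 + dose)


def emax_med(emax, ed50, delta):
    """Dose at which the effect over control equals delta under the Emax model.

    Assumes an increasing curve (emax > 0) and a threshold the curve can reach
    (0 < delta < emax); otherwise no finite MED exists.
    """
    if not 0 < delta < emax:
        raise ValueError("delta must lie strictly between 0 and emax")
    return delta * ed50 / (emax - delta)


# Hypothetical parameter values, for illustration only.
e0, emax, ed50, delta = 1.0, 2.0, 1.5, 0.5

med = emax_med(emax, ed50, delta)
print("MED:", med)
print("effect over control at MED:", emax_response(med, e0, emax, ed50) - e0)
```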

**Part 2. Multiple Contrast Test (MCT) for Detecting Monotonic Trends**

In Part 2, we investigate the application of a resampling-based multiple contrast test procedure to detect a monotonic trend. Bretz (1999, 2006) implemented the multiple contrast test (MCT) procedure to extend the Williams' and Marcus' tests for monotonicity of the dose-response relationship, which can improve power. The idea is that the order-restricted alternative hypothesis can be decomposed into several elementary alternatives with particular patterns of equalities and inequalities. Each of these contrasts is tested, and the maximum of the contrast test statistics is used as the test statistic. Bretz (2006) showed that, under the normality assumption, the Williams' MCT and Marcus' MCT follow a multivariate t distribution. Since in practice it might not be valid to assume a particular distribution for a statistic, especially with the small sample sizes commonly found in microarray data (McLachlan et al., 2004), we propose to approximate the null distribution of the Williams' MCT and Marcus' MCT (Bretz, 2006) using a permutation-based approach. In addition, we propose to apply the MCT to the M (Hu et al., 2005) and the modified M (M′, Lin et al., 2007) test statistics.
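The permutation idea can be sketched for a single gene as follows. This is a simplified illustration, not the procedure implemented in the thesis: the contrast coefficients are Williams-type contrasts for a balanced design with a control and three doses, the statistic is the maximum of pooled-variance contrast t statistics, and the data, seed, and number of permutations are made up. The null distribution is approximated by shuffling the observations across dose groups.

```python
import math
import random


def max_contrast_statistic(groups, contrasts):
    """Maximum over contrasts of the pooled-variance contrast t statistic."""
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    pooled_var = ss_within / (n_total - len(groups))
    stats = []
    for c in contrasts:
        estimate = sum(ci * m for ci, m in zip(c, means))
        scale = math.sqrt(pooled_var * sum(ci ** 2 / len(g) for ci, g in zip(c, groups)))
        stats.append(estimate / scale)
    return max(stats)


def permutation_p_value(groups, contrasts, n_perm=999, seed=1):
    """One-sided p-value: share of label permutations with a statistic >= observed."""
    rng = random.Random(seed)
    observed = max_contrast_statistic(groups, contrasts)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted, start = [], 0
        for size in sizes:
            permuted.append(pooled[start:start + size])
            start += size
        if max_contrast_statistic(permuted, contrasts) >= observed:
            exceed += 1
    # Count the observed arrangement as one permutation so the p-value is never 0.
    return observed, (exceed + 1) / (n_perm + 1)


# Illustrative Williams-type contrasts (control, dose 1, dose 2, dose 3), balanced design.
williams_type = [
    (-1.0, 0.0, 0.0, 1.0),
    (-1.0, 0.0, 0.5, 0.5),
    (-1.0, 1 / 3, 1 / 3, 1 / 3),
]

# Hypothetical expression values of one gene, three arrays per dose, increasing trend.
expression = [[1.0, 1.2, 0.9], [1.1, 1.4, 1.3], [2.0, 2.2, 1.9], [3.1, 2.9, 3.3]]

observed_stat, p_value = permutation_p_value(expression, williams_type)
print("max contrast statistic:", round(observed_stat, 2), "permutation p-value:", p_value)
```

In a microarray setting this calculation is repeated for every gene, and the resulting p-values are then adjusted for multiplicity.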

A common problem in microarray experiments is that some genes have a small variance, which can make their (absolute) test statistics very large although their fold changes are small (McLachlan et al., 2004). To overcome the problem of small-variance genes, we implement the Significance Analysis of Microarrays (SAM) procedure proposed by Tusher et al. (2001) for the MCT. We also performed simulation studies to investigate the performance of the proposed resampling-based MCTs in terms of FDR control and power.
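The small-variance problem and the SAM-style remedy can be seen in a two-gene example. SAM adds a positive constant (the "fudge factor", here `s0`) to the standard error in the denominator of the test statistic. In SAM this constant is chosen from the data; in the sketch below it is simply fixed at a hypothetical value, and the effects and standard errors are invented for illustration.

```python
def ordinary_statistic(effect, std_error):
    """Classical t-type statistic: effect divided by its standard error."""
    return effect / std_error


def sam_type_statistic(effect, std_error, s0):
    """SAM-style statistic: a positive constant s0 is added to the standard error."""
    return effect / (std_error + s0)


# Hypothetical genes: (name, effect size, standard error).
genes = [
    ("gene_A", 0.05, 0.001),  # tiny effect, but an even tinier standard error
    ("gene_B", 2.00, 0.500),  # large effect with a moderate standard error
]
s0 = 0.25  # hypothetical fudge factor, fixed here rather than estimated

ordinary = {name: ordinary_statistic(eff, se) for name, eff, se in genes}
moderated = {name: sam_type_statistic(eff, se, s0) for name, eff, se in genes}

print("ordinary :", ordinary)
print("SAM-type :", moderated)
```

With the ordinary statistic the small-variance gene ranks first despite its negligible effect; after adding `s0` the gene with the large effect ranks first.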

**Part 3. Hierarchical Bayesian Dose-response Analysis**

In the third part, we propose a hierarchical Bayesian modeling approach to identify genes with a monotonic trend. The Bayesian approach seeks evidence in the data under both the null and the order-restricted alternative hypotheses. Bayesian Variable Selection (BVS) is implemented to estimate the posterior probabilities of all possible monotone models. The posterior probability of the null model is then used to identify the genes with strong evidence of a monotonic trend. Several simulations that investigate the performance of this approach are carried out and discussed. From the obtained posterior probabilities, we can construct posterior level of probabilities from the Bayesian perspective.

**Part 4. Software Development**

In the last part of this dissertation, we discuss software development for dose-response microarray studies and biclustering analysis. The developed software consists of R packages intended to be easy to use, with more user-friendly interfaces. The packages are built with the GUI tools provided within R.

The first package is for analyzing dose-response microarray experiments. Lin et al. (2008) developed an R package, IsoGene, for testing for a monotonic relationship between gene expression and dose in a microarray experiment (Pramana et al., 2010). The IsoGene package is command-based and requires users to have basic knowledge of R, which can make it difficult to use for those who are not familiar with R or who use it infrequently. To overcome this limitation, a user-friendly package called IsoGeneGUI was developed (Pramana et al., 2010b, 2011a). We present the capabilities of the IsoGeneGUI package along with illustrative examples of the analysis of a dose-response microarray study.

The second package performs biclustering analysis. Biclustering is a technique to cluster the rows and columns of a matrix simultaneously. The goal is to find subgroups of rows and columns (called biclusters) which are as similar as possible to each other and as different as possible from the rest (Kaiser and Leisch, 2008). Note that in microarray studies, rows and columns are genes and samples, respectively. Biclustering methods for gene expression data therefore aim to discover a group of genes that have a similar regulatory mechanism within a group of samples (Madeira and Oliveira, 2004).

We developed a new R package, RcmdrPlugin.BiclustGUI (shortened to BiclustGUI), which is an interface to several available biclustering packages: biclust (Kaiser and Leisch, 2008), fabia (Hochreiter et al., 2010), and isa2 (Csárdi et al., 2010). The BiclustGUI is an extension of the R Commander package (Rcmdr plug-in; Fox, 2005) and is intended to help beginners and infrequent R users perform biclustering analysis.