Bioinformatics Workshop, Jakarta, Indonesia, April 14-15 2015
R is a free, open-source implementation of the S statistical computing language and programming environment. Nowadays, R is widely used for statistical software development, data analysis, and machine learning. Because R is free and open source, anyone can fix bugs and/or add features, and more than 2000 user-contributed packages are now available. R can be integrated with other languages (C/C++, Java, and Python). R can also interact with many data sources, such as ODBC-compliant databases (Excel and Access) and other statistical packages (SAS, Stata, SPSS, and Minitab). For high-performance computing tasks, several R packages take advantage of multiple cores, either on a single machine or across a network.
Despite these capabilities, R is driven through a command-line interface (CLI), where users type commands to perform a statistical analysis. The CLI is the preferred user interface for power users because it is flexible and gives direct control over calculations. However, this command-driven system requires good knowledge of the language, which makes it difficult for beginners and less frequent users. To address this limitation, several R projects were developed to provide graphical user interfaces (GUIs).
My talk, presented at the R Workshop at Hasselt University, Belgium, and at a seminar in the Department of Medical Epidemiology and Biostatistics at Karolinska Institutet, reviews some R GUI projects, illustrates how to develop a simple R GUI using tcltk, and outlines some future developments using R Service Bus and shiny. More details on R GUIs and some examples can be found in my PhD thesis.
Some examples of R GUIs:
Every living thing is made up of a collection of cells, each working according to its task. The human body, for example, is made up of more than 10 trillion cells that all originated from the division of a single cell. Inside every cell is a nucleus, where the chromosomes (23 pairs) are located. Chromosomes themselves are assemblies of protein and deoxyribonucleic acid (DNA).
DNA carries all the genetic information about our bodies, which will later be passed on to the next generation. We can picture DNA as a book whose information is written in 23 chapters (chromosomes), each of which contains thousands of stories (genes). Unlike us, who write with the 26 letters of the alphabet, every word in this DNA book is written using only 4 letters: A, C, T, and G.
To read more, please continue at the following link: http://www.eramuslim.com/oase-iman/berkaca-dari-struktur-dna-kita.htm
A presentation on this topic can be viewed here: http://www.slideshare.net/hafidztio/gene-sebuah-nikmat-allah
Various classification algorithms have been developed in different fields. Some of them, such as linear discriminant analysis (LDA), the nearest neighbor classifier, and logistic regression, are commonly used in genomics. Authors such as Gohlmann and Talloen (2009) and Lee et al. (2005) have comprehensively reviewed and compared these algorithms.
Logistic regression is a supervised method for binary or multi-class classification (Hosmer and Lemeshow 1989). Because it is a simple, flexible, and straightforward model that is easy to extend, extensions of logistic regression have been widely used in genomics research (e.g., Liao and Chin, 2007; Sun and Wang, 2012).
In high-dimensional datasets, such as microarray data, there are usually more variables than observations and the variables are correlated (multicollinearity). In this setting, classical logistic regression performs poorly and provides inaccurate estimates. It fits the training data perfectly, with no bias but high variance, which leads to poor prediction on new data (overfitting). To prevent this problem, a penalty for model complexity should be introduced.
The presentation, which can be viewed here, gives a short overview of L1-penalized logistic regression. One application of this method in genomics is selecting candidate classifier genes that distinguish two groups, e.g., a cancer group and a non-cancer group.
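To make the idea of an L1 penalty concrete, here is a minimal sketch in plain Python, not the implementation used in the presentation. It fits a logistic regression by proximal gradient descent, where a soft-thresholding step sets small coefficients exactly to zero. The function names, the toy data generator, and the values of `lam`, `step`, and `n_iter` are all illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.2, step=0.1, n_iter=300):
    """Minimize mean negative log-likelihood + lam * sum(|w_j|).

    Uses proximal gradient descent; the intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Residual: predicted probability minus observed label.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        # Gradient step on the likelihood, then shrinkage for the penalty.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def make_toy_data(n_samples=40, n_genes=60, shift=1.5, seed=1):
    """Hypothetical data: more genes than samples; only genes 0 and 1 differ."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        offset = shift if label == 1 else -shift
        row[0] += offset
        row[1] += offset
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    w, b = fit_l1_logistic(X, y)
    selected = [j for j, wj in enumerate(w) if wj != 0.0]
    print("genes with non-zero coefficients:", selected)
```

The genes left with non-zero coefficients play the role of the candidate classifier genes. A larger `lam` selects fewer genes; in practice it would be chosen by cross-validation rather than fixed by hand.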
• Lee, J.W., et al. 2005. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis. 48:869-885.
• Hosmer, D.W., and Lemeshow, S. 1989. Applied Logistic Regression. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, NY.
• Sun, H., and Wang, S. 2012. Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics. 28(10):1368-1375.
• Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 58:267-288.
• Goeman, J.J. 2010. L1 Penalized Estimation in the Cox Proportional Hazards Model. Biometrical Journal. 52(1):70-84.
• Gohlmann, H., and Talloen, W. 2009. Gene Expression Studies Using Affymetrix Microarrays. Chapman & Hall/CRC.
• Liao, J.G., and Chin, K.V. 2007. Logistic regression for disease classification using microarray data: model
DNA microarrays are a new and promising biotechnology that allows the expression of thousands of genes to be monitored simultaneously.
Image courtesy: http://www.albany.edu/genomics/affymetrix-genechip.html
Microarray technology, one of the more recent biomedical technologies, produces high-dimensional data, which makes statistical analysis challenging.
The statistical components of a microarray experiment involve the following steps (Allison et al. 2006):
(1) Design. The development of an experimental plan to maximize the quality and quantity of information obtained.
(2) Pre-processing. Processing of the microarray image and normalization of the data to remove systematic variation. Other potential preprocessing steps include transformation of data, data filtering and background subtraction.
(3) Inference and/or classification. Inference entails testing statistical hypotheses which are usually about which genes are differentially expressed. Classification refers to analytical approaches that attempt to divide data into classes with no prior information (unsupervised classification) or into predefined classes (supervised classification).
(4) Validation of findings. The process of confirming the veracity of the inferences and conclusions drawn in the study.
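Steps (2) and (3) above can be sketched on a tiny example. The following is an illustrative sketch only: the intensity values are hypothetical, and median centering and Welch's t-statistic are simple stand-ins chosen for brevity, not methods prescribed by Allison et al. (2006). It assumes every group has at least two arrays and non-zero within-group variance.

```python
import math
import statistics


def log2_transform(intensities):
    """Log2-transform raw intensities (rows = genes, columns = arrays)."""
    return [[math.log2(v) for v in row] for row in intensities]


def median_center(matrix):
    """Subtract each array's (column's) median to remove array-wide shifts."""
    n_cols = len(matrix[0])
    medians = [statistics.median([row[j] for row in matrix])
               for j in range(n_cols)]
    return [[row[j] - medians[j] for j in range(n_cols)] for row in matrix]


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se


def rank_genes(matrix, labels):
    """Return (gene index, t-statistic) pairs sorted by |t|, largest first."""
    stats = []
    for g, row in enumerate(matrix):
        a = [v for v, lab in zip(row, labels) if lab == 1]
        b = [v for v, lab in zip(row, labels) if lab == 0]
        stats.append((g, welch_t(a, b)))
    return sorted(stats, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical raw intensities: 6 genes (rows) x 6 arrays (columns).
# Arrays 1-3 belong to group 1, arrays 4-6 to group 0.
RAW = [
    [4000, 4400, 3900, 1000, 1100, 950],  # gene 0: higher in group 1
    [500, 520, 480, 510, 495, 505],
    [200, 210, 190, 205, 195, 215],
    [800, 760, 820, 790, 810, 770],
    [64, 60, 70, 66, 62, 68],
    [350, 340, 360, 345, 355, 365],
]
LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    normalized = median_center(log2_transform(RAW))   # step (2)
    for gene, t in rank_genes(normalized, LABELS):    # step (3)
        print(f"gene {gene}: t = {t:.2f}")
```

With thousands of genes, the resulting test statistics would also need a multiple-testing correction before any gene is declared differentially expressed.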
In a discussion session, I presented an overview of microarray analysis, focusing on its use in gene expression profiling.
Biostatistics and Statistical Bioinformatics: A Lecture at the Faculty of Mathematics and Natural Sciences, Brawijaya University
Image courtesy: here
In October 2012, I had the honor of giving a guest lecture in the Mathematics Department of Brawijaya University. I never imagined that I would return to give a lecture at the university that gave me so much valuable knowledge.