COM. on C.A. 6:e7/39-44   Published online Jan. 17, 2012.
doi:10.4236/coca.2012.61007
REVIEW
Application of Partial Least Squares in high-dimensional genomic data analysis

Panpan Wang

MOE Laboratory of Contemporary Anthropology, School of Life Science, Fudan University, Shanghai 200433, China

ABSTRACT: Partial Least Squares (PLS) is a statistical regression technique that performs well on high-dimensional genomic data, such as microarray data, SNP data from GWAS, and proteomic data. In this article, we review the challenges faced by classical linear regression and show how they motivate the advantages of PLS. Through dimension reduction, PLS addresses both the co-linearity of the predictor variables and the singularity that arises in regression when the sample size is small and the predictors are high-dimensional. We also present several modified PLS algorithms together with their applications to real biological data. For example, sparse partial least squares achieves dimension reduction and variable selection simultaneously, and combining PLS with cluster analysis or generalized linear regression allows it to handle a wider range of data analysis problems.
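The singularity and co-linearity issues mentioned in the abstract can be illustrated with a minimal sketch. It is not the article's own code: it assumes a textbook single-response PLS (PLS1, NIPALS-style deflation), and the function names `simulate_collinear_data` and `pls1_fit`, the simulated data, and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np


def simulate_collinear_data(n=20, p=200, seed=0):
    """Hypothetical toy data: p collinear predictors driven by 2 latent factors."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, 2))            # latent factors
    A = rng.standard_normal((2, p))            # loadings shared by all predictors
    X = Z @ A + 0.1 * rng.standard_normal((n, p))
    y = Z @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(n)
    return X, y


def pls1_fit(X, y, n_components):
    """Textbook PLS1 with deflation; returns coefficients for centered X and y."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    p = Xk.shape[1]
    W = np.zeros((p, n_components))            # weight vectors
    P = np.zeros((p, n_components))            # X loadings
    q = np.zeros(n_components)                 # y loadings
    for k in range(n_components):
        w = Xk.T @ yk                          # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                             # latent score
        tt = t @ t
        p_k = Xk.T @ t / tt
        q_k = yk @ t / tt
        Xk = Xk - np.outer(t, p_k)             # deflate X
        yk = yk - q_k * t                      # deflate y
        W[:, k], P[:, k], q[k] = w, p_k, q_k
    # Map the latent-space regression back to the original predictors.
    return W @ np.linalg.solve(P.T @ W, q)


if __name__ == "__main__":
    X, y = simulate_collinear_data()
    Xc = X - X.mean(axis=0)
    # With n < p the centered design has rank at most n - 1, so X'X is singular
    # and the ordinary least squares normal equations have no unique solution.
    print("rank of centered X:", np.linalg.matrix_rank(Xc), "of", X.shape[1])
    beta = pls1_fit(X, y, n_components=2)
    y_hat = Xc @ beta + y.mean()
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print("training R^2 with 2 PLS components:", round(r2, 3))
```

The sketch regresses on a handful of latent scores instead of the 200 raw predictors, which is why it remains well defined even though there are only 20 samples. The number of components is fixed at 2 here to match the simulated latent structure; in practice it would normally be chosen by cross-validation.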

Key words: partial least squares, high-dimensional genomic data, dimension reduction, variable selection

Received: Dec. 7, 2011   Accepted: Dec. 14, 2011   Corresponding author: catherine64278@163.com


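Communication on Contemporary Anthropology, Vol. 6, Article e7, pp. 39-44. Published online January 17, 2012.

Review

Application of partial least squares in high-dimensional genomic data analysis

Panpan Wang

MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai 200433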

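Abstract: Partial least squares is a highly efficient statistical regression technique that applies well to the analysis of high-dimensional genomic data, such as gene expression microarray data, SNP data from genome-wide association studies, and even proteomic data. In this article, we start from ordinary linear regression and work up to the advantages of partial least squares regression in high-dimensional data analysis. It resolves not only the co-linearity of the predictor variables, through dimension reduction, but also the regression singularity caused by small sample sizes. We also present modified algorithms in the context of applications of partial least squares to real biological data. For example, sparse partial least squares performs variable selection while reducing dimension, and combining partial least squares with cluster analysis or generalized linear regression extends it to a wider variety of data analysis problems.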

Key words: partial least squares; high-dimensional genomic data; dimension reduction; variable selection

 

Received: December 7, 2011   Revised: December 14, 2011   Contact: Panpan Wang, catherine64278@163.com



Full text: [PDF]
