COM. on C.A. 6:e4/26-29. Published online Jan. 17, 2012.
doi:10.4236/coca.2012.61004
REVIEW
Variable Selection and Dimensional Reduction in Genetic Analysis

Kelin Xu

MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai 200433, China

ABSTRACT: With the development of next-generation sequencing, large amounts of genetic data have accumulated. Extracting meaningful information from these data is difficult, both because of the computational cost and because traditional statistical methods run into trouble in high dimensions. Consequently, variable selection and dimension reduction play important roles in genomics. Here, I review some popular methods for these tasks, classifying them into three main types and discussing their properties and fields of application.

Key words: genetic data, high-dimensional problems, variable selection, dimension reduction

Received: Dec. 7, 2011   Accepted: Dec. 11, 2011   Corresponding author: xukelin0202@gmail.com


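Communication on Contemporary Anthropology, Vol. 6, Article e4, pp. 26-29. Published online January 17, 2012.

Review

Variable Selection and Dimension Reduction in Genetic Data

Kelin Xu (徐珂琳)

MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai 200433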

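Abstract: With the development of next-generation sequencing and microarray hybridization technologies, massive amounts of high-dimensional data have become available to researchers, and extracting useful information from them has become a major challenge. In this high-dimensional setting, many statistical conclusions derived for low-dimensional problems no longer hold; in addition, the sheer volume of data places heavy demands on computational speed. Against this data-driven background, variable selection and dimension reduction have become major research directions. Starting from the characteristics of current genetic data, this paper reviews several mainstream methods for variable selection and dimension reduction, such as principal component analysis, partial least squares regression, sliced inverse regression, and the LASSO, and discusses their properties and ranges of application.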

Key words: genetic data; high-dimensional problems; variable selection; dimension reduction

 

Received: December 7, 2011   Revised: December 11, 2011   Corresponding author: Kelin Xu (徐珂琳), xukelin0202@gmail.com



Full text: [PDF]

References

1. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2(6): 418-427.
2. Yang X, Jiao R, Yang L, Wu LP, Li YR, Wang J (2011) Research strategies for human disease omics based on next-generation high-throughput technologies [in Chinese]. Hereditas (Beijing) 33(8): 829-846.
3. Wang H, van der Laan MJ (2011) Dimension reduction with gene expression data using targeted variable importance measurement. BMC Bioinformatics 12: 312.
4. Fan JQ, Lv JC (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Series B Stat Methodol 70: 849-883.
5. Wang SG, Chen M, Chen LP (1999) Linear Statistical Models [in Chinese]. Beijing: Higher Education Press. 61.
6. Fan JQ (1996) Test of significance based on wavelet thresholding and Neyman's truncation. J Am Stat Assoc 91: 674-688.
7. Hedenfalk I, Duggan D, Chen Y (2002) Gene-expression profiles in hereditary breast cancer. Advances in Anatomic Pathology 9(1): 1-4.
8. Dettling M, Bühlmann P (2003) Boosting for tumor classification with gene expression data. Bioinformatics 19(9): 1061-1069.
9. Ghosh D (2002) Singular value decomposition regression modeling for classification of tumors from microarray experiments. Pac Symp Biocomput 2002: 18-29.
10. Meng J (2011) Uncover cooperative gene regulations by microRNAs and transcription factors in glioblastoma using a nonnegative hybrid factor model. In International Conference on Acoustics, Speech and Signal Processing.
11. Nguyen DV (2005) Partial least squares dimension reduction for microarray gene expression data with a censored response. Math Biosci 193(1): 119-137.
12. Chun H, Keles S (2009) Expression quantitative trait loci mapping with multivariate sparse partial least squares regression. Genetics 182(1): 79-90.
13. Antoniadis A, Lambert-Lacroix S, Leblanc F (2003) Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 19(5): 563-570.
14. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Series B Stat Methodol 58(1): 267-288.
15. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Series B Stat Methodol 67: 91-108.
16. Shen YF (2010) A study of genetic analysis methods for multivariate data [in Chinese]. PhD dissertation, Zhejiang University.
17. Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory 267-281.
18. Schwarz G (1978) Estimating the dimension of a model. Ann Statist 6: 461-464.
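19. Yu XL, Ren XS (2007) Multivariate Statistical Analysis [in Chinese]. Beijing: China Statistics Press.
20. He XQ (1998) Modern Statistical Analysis Methods and Applications [in Chinese]. Beijing: China Renmin University Press.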
21. Xiong MM, Zhao JY, Boerwinkle E (2002) Generalized T² test for genome association studies. Am J Hum Genet 70(5): 1257-1268.
22. Rocke DM, Nguyen DV (2002) Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18(1): 39-50.
23. Chun H, Keles S (2010) Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Series B Stat Methodol 72(1): 3-25.
24. Li KC (1991) Sliced inverse regression for dimension reduction. J Am Stat Assoc 86: 316-327.
25. Li KC (1992) On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J Am Stat Assoc 87: 1025-1039.
26. Cook RD (2000) SAVE: A method for dimension reduction and graphics in regression. Commun Stat Theory Methods 29: 2109-2121.
27. Yang L (2010) Series on Fundamentals of Modern Mathematics [in Chinese]. Beijing: Science Press.
28. Naik P, Tsai CL (2000) Partial least squares estimator for single-index models. J R Stat Soc Series B Stat Methodol 62: 763-771.
29. Li LX, Cook RD, Tsai CL (2007) Partial inverse regression. Biometrika 94(3): 615-625.
30. Li LX, Cook RD, Nachtsheim CJ (2005) Model-free variable selection. J R Stat Soc Series B Stat Methodol 67: 285-299.
31. Wang Q, Yin XR (2008) A nonlinear multi-dimensional variable selection method for high dimensional data: Sparse MAVE. Computational Statistics & Data Analysis 52(9): 4512-4520.
32. Lu Y, Zhou Y, Qu W, Deng M, Zhang C (2011) A Lasso regression model for the construction of microRNA-target regulatory networks. Bioinformatics 27(17): 2406-2413.
33. Tibshirani RJ, Taylor J (2011) The solution path of the generalized lasso. Ann Statist 39(3): 1335-1371.
34. Fan JQ, Li RZ (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96: 1348-1360.
35. Zhang CH (2009) Penalized linear unbiased selection. Ann Statist.