COM.on C.A. 6:e4/26-29
Published online Jan. 17, 2012
Kelin Xu
MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai 200433, China
Keywords: genetic data, high-dimensional problem, variable selection, dimension reduction
Received: Dec. 7, 2011 Accepted: Dec. 11, 2011 Correspondence: xukelin0202@gmail.com
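Communication on Contemporary Anthropology, Vol. 6, Article e4, pp. 26-29. Published online January 17, 2012
Review
Variable Selection and Dimension Reduction in Genetic Data
Kelin Xu (徐珂琳)
MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai 200433, China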
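Abstract: With the development of next-generation sequencing and microarray hybridization technologies, researchers now face massive high-dimensional data sets, and extracting useful information from them has become a major challenge. In the high-dimensional setting, many statistical conclusions established for low dimensions no longer hold; in addition, the sheer volume of data places heavy demands on computational speed. Against this data-driven background, variable selection and dimension reduction have become major research directions. Starting from the characteristics of current genetic data, this paper reviews several mainstream variable selection and dimension reduction methods, such as principal component analysis, partial least squares regression, sliced inverse regression, and the LASSO, and discusses the properties and scope of application of these methods.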
Keywords: genetic data; high-dimensional problem; variable selection; dimension reduction
Received: December 7, 2011
Revised: December 11, 2011
Contact: Kelin Xu
xukelin0202@gmail.com