Board: NCTU-STAT98G
NCTU / NTHU Institute of Statistics Colloquium

Title: Entropy Based Statistical Inference for Some HDLSS Genomic Models: UI Tests in a Chen-Stein Perspective
Speaker: Prof. Ming-Tien Tsai (Institute of Statistical Science, Academia Sinica)
Time: Friday, October 23, 2009, 10:40-11:30 AM
      (tea reception 10:20-10:40 AM, Room 429, Institute of Statistics, NCTU)
Venue: Room 427, General Building I, NCTU

Abstract

One of the scientific foci is to classify the K genes into two subsets of disease genes and non-disease genes. For HDLSS (high-dimensional, low-sample-size) categorical data models, the number of associated parameters increases exponentially with K, creating an impasse for conventional discrete multivariate analysis and model selection tools. In this rather awkward environment, statistical appraisals are often based on marginal p-values, where the multiple hypothesis testing (MHT) problem can be handled with Fisher's original method (developed nearly 80 years ago) along with its various ramifications over the past 25 years or so.

During the past two decades, MHT has received considerable attention from data miners and statisticians across the disciplines, while more attention is now being paid to the variable selection (VS) problem, especially in the bioinformatics context. In this talk, some recent developments will be briefly reviewed, including the linear or log-linear model (embracing the shrinkage idea) oriented LASSO method (van de Geer, 2008), Akaike information (1974) type criteria, the FDR method (Benjamini and Hochberg, 1995), the k-FWER method (Lehmann and Romano, 2005), the empirical Bayes approach (Efron, 2004 and 2008), and the nonparametric method (Sen, 2008). Most work on the former two methods concentrates on the case K < n, though K may possibly be large. Serious roadblocks arise when K becomes exceedingly large while the sample size n is disproportionately small (i.e., K >> n), a situation abundant in genomics, bioinformatics, pharmacogenomics, clinical trials, financial and economic statistics, etc.
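As a quick point of reference for the marginal p-value approach mentioned above, the Benjamini-Hochberg (1995) FDR step-up procedure can be sketched as follows. This is a minimal illustration with hypothetical p-values, not tied to any data set from the talk.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg (1995) step-up procedure.

    Returns a boolean array marking which hypotheses are rejected
    while controlling the false discovery rate at level q.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                     # indices of sorted p-values
    thresholds = q * np.arange(1, m + 1) / m  # BH critical values q*i/m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()          # largest i with p_(i) <= q*i/m
        reject[order[: k + 1]] = True         # reject the k smallest p-values
    return reject

# Hypothetical marginal p-values for K = 8 genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(pvals, q=0.05))
```

At q = 0.05 only the two smallest p-values are rejected here; note that the step-up rule can reject more hypotheses than a per-test comparison against q*i/m would suggest, since a single p-value below its threshold salvages all smaller ones.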
The latter four methods may appear tempting in this case; however, they have their own limitations. On the other hand, just as maximum likelihood is the dominant paradigm in statistics, Shannon entropy (1948) is the dominant paradigm in information and coding theory. For qualitative data models, the Gini-Simpson index (Gini, 1912; Simpson, 1949) and Shannon entropy are commonly used in dissimilarity and diversity analysis, economic inequality and poverty analysis, and genetic variation studies, as well as in many other fields. Using the Lorenz curve, it is not difficult to show that Shannon entropy is more informative than the Gini-Simpson index. However, for HDLSS genomic models, we suspect that the information may not be fully captured in a pseudo-marginal setup (namely, the so-called multivariate version of Shannon entropy in the literature). To capture greater information, some new genuine multivariate analogues of Shannon entropy are proposed. The SARS-CoV data set is appraised as an illustration.
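The two diversity measures compared in the abstract have simple closed forms: Shannon entropy H(p) = -Σ p_k log p_k and the Gini-Simpson index GS(p) = 1 - Σ p_k². A minimal sketch computing both on hypothetical category-frequency profiles (the profiles are made up for illustration, not from the SARS-CoV data):

```python
import math

def shannon_entropy(p):
    """Shannon (1948) entropy: H(p) = -sum_k p_k * log(p_k), natural log."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def gini_simpson(p):
    """Gini-Simpson index: GS(p) = 1 - sum_k p_k**2."""
    return 1.0 - sum(pk * pk for pk in p)

# Hypothetical frequency profiles over 4 categories
uniform = [0.25, 0.25, 0.25, 0.25]  # maximal diversity
skewed  = [0.85, 0.05, 0.05, 0.05]  # one dominant category

for name, p in [("uniform", uniform), ("skewed", skewed)]:
    print(name, round(shannon_entropy(p), 4), round(gini_simpson(p), 4))
```

Both measures are maximized at the uniform distribution and shrink as one category dominates; the entropy-versus-Lorenz-curve argument for why Shannon entropy is the more informative of the two is developed in the talk itself.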