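Author: chenyutn (Life ends in death anyway, so why agonize.)
Board: Statistics
Title: Re: [Question] Running SEM with PLS: problems, pros and cons?
Time: Sat May 16 00:18:08 2009
※ Quoting danny789 (there must be some misunderstanding here):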
: : Is there any stronger (or more) evidence or literature showing that a larger resample size is always better?
: : Below is an excerpt from Chin (2001), PLS-Graph User's Guide
: : (already cited by papers in good MIS journals)
: : The default Bootstrap options are 100 resamples with each sample consisting of the same number of
: : cases as your original sample set. The bootstrap procedure samples with replacement from your
: : original sample set. It continues to sample until it reaches the number of cases you specify (or the
: : default). This procedure is repeated until it reaches the number of bootstrap resamples you specify (or
: : the default of 100). In general, resamples of 200 tend to provide reasonable standard error estimates.
: : Here are a few MIS papers I found quickly
: : Resample = 100
: : Henry, R.M., McCray, G.E., Purvis, R.L., and Roberts, T.L. (2007) "Exploiting Organizational Knowledge in Developing IS Project Cost and Schedule Estimates: An Empirical Study", Information & Management, Vol. 44 No.6, pp.598-612.
: : Resample = 500
: : Ko, D., Kirsch, J.L., King, W.R. (2005) "Antecedents of knowledge transfer from consultants to clients in enterprise system implementations", MIS Quarterly, Vol. 29 No.1, pp.59-85.
: : Resample = 100 & 500
: : Goodhue, D., Lewis, W., and Thompson, R., (2007) "Statistical Power in Analyzing Interaction Effects: Questioning the Advantage of PLS With Product Indicators", Information Systems Research, Vol. 18 No.2, pp.211-227.
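: : Perhaps Goodhue et al. (2007) has the answer, but I can't find a PDF to read (sweat)
: : → bmka: This question isn't that complicated; first understand how the bootstrap method works 05/12 23:01
: : → bmka: The more resamples the better, of course; as for how many, that depends on the data distribut 05/12 23:03
: : → bmka: Running it a bit longer won't hurt 05/12 23:04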
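: To me, PLS is just a tool.
: I only need to know how to use it and understand its assumptions and limitations, so that I can produce an outcome and interpret it.
: It's like operating a computer: do you know how semiconductors are manufactured? A computer is, after all, just a tool.
: Perhaps you are looking at this from a purely mathematical standpoint and so think a larger resample count is always better,
: but that manipulates the statistical tool too much. Would such statistical results really reflect reality?
: If you can provide literature showing that a larger resample count is better, I can revise my view.
: If, as you say, a larger resample count is always better, I have one reasonable doubt:
: surely some of the many scholars doing this research would have mentioned it, yet none have ...
: At least none of the papers I've read mention it.
: And I trust these scholars' computers aren't bad; setting resample to 1,000,000 shouldn't be a problem.
: So I don't think this is a matter of computing speed.
: I did later find the PDF of Goodhue et al. (2007) (ISR ranks among the top five MIS journals).
: Perhaps the excerpt below answers your question, so my suggestion is still that 500 is more appropriate,
: because that is the value most scholars use.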
: It might be suggested that we should use bootstrapping
: with 500 resamples (rather than 100). Five hundred
: resamples is the usual recommendation when
: using bootstrapping to estimate a parameter using a
: single sample (Chin 1998). However, we draw 500
: samples (500 researchers) from the same population
: for each cell in our analysis, and use bootstrapping
: with 100 resamples on each of those. This amounts to
: 50,000 resamples for each cell, and hence we expect
: that moving from 50,000 to 250,000 resamples in each
: cell would not affect the outcome.
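The whole purpose of bootstrapping is to
Estimate parameters that we don't know how to estimate analytically
(Howell, 2002, http://tinyurl.com/q6v3c2).
The passage further down is taken from Stata's guidelines
(http://www.stata.com/support/faqs/stat/reps.html),
quoted as-is; I'll just point out the key points.
The passage tells us one thing:
there is no fixed number to set, but a larger number always gives a more precise CI estimate.
The only question is whether we actually need that much precision.
I think danny789 was really trying to say the same thing; it's just that in my reply I focused too much on the number 500,
because I feel that more precision is of course better. :P
So the advice bmka gave in the earlier comments is very practical: try 500 and 1000 resamples,
then compare against 2000 to see whether there is much difference. If there isn't, go ahead and report the results.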
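The 500 / 1000 / 2000 comparison suggested above can be sketched roughly as follows. This is only a minimal illustration, not anything PLS-specific: the statistic (a sample mean), the synthetic data from `make_data`, and the helper name `bootstrap_se` are all assumptions made up for the example. In a real PLS analysis the statistic would be a path coefficient re-estimated on each resample.

```python
import random
import statistics


def bootstrap_se(data, n_resamples, seed):
    """Bootstrap standard error of the sample mean (illustrative statistic)."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n cases with replacement from the original sample.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    # The spread of the resampled means estimates the standard error.
    return statistics.stdev(means)


def make_data(n=100, seed=0):
    """Hypothetical skewed data set, used only for this illustration."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0) for _ in range(n)]


data = make_data()
# Each run gets its own seed so the three runs are independent of each other.
se_by_b = {b: bootstrap_se(data, b, seed=b) for b in (500, 1000, 2000)}
for b, se in se_by_b.items():
    print(f"B={b:5d}  bootstrap SE of mean = {se:.4f}")
```

If the three numbers agree to the precision you intend to report, the smaller setting was probably already enough for this data set; if they don't, raise the number of resamples and compare again.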
How large should the bootstrapped samples be relative to the total number
of cases in the dataset?
In terms of the number of replications, there is no fixed answer such as
“250” or “1,000” to the question. The right answer is that you should
choose an infinite number of replications because, at a formal level, that
is what the bootstrap requires.
The key to the usefulness of the bootstrap is that it converges in terms of
numbers of replications reasonably quickly, and so running a finite number
of replications is good enough—assuming the number of replications chosen
is large enough.
The above statement contains the key to choosing the right number of
replications. Here is the recipe:
1. Choose a large but tolerable number of replications. Obtain the
   bootstrap estimates.
2. Change the random-number seed. Obtain the bootstrap estimates
   again, using the same number of replications.
3. Do the results change meaningfully? If so, the first number you chose
   was too small. Try a larger number. If results are similar enough, you
   probably have a large enough number. To be sure, you should probably
   perform step 2 a few more times, but I seldom do.
Whether results change meaningfully is a matter of judgment and has to be
interpreted given the problem at hand. How accurately do you need the
standard errors, confidence intervals, etc.? Often, a few digits of precision
is good enough because, even if you had the standard error calculated
perfectly, you have to ask yourself how much you believe your model in terms
of all the other assumptions that went into it. For instance, in a Becker
earnings model of the return to schooling, you might tell me that return is
6% with a standard error of 1, and I might believe you. If you told me the
return is 6.10394884% and the standard error is .9899394, you have more
precision but have not provided any additional useful information.
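The recipe in the quoted passage (same number of replications, different random-number seed, then look at how much the results move) might be sketched like this. It is a rough illustration under stated assumptions: the data are hypothetical normal draws, the statistic is the mean, the interval is a simple percentile CI, and the helper names `percentile_ci` and `max_endpoint_shift` are invented for the example rather than taken from Stata or any PLS package.

```python
import random
import statistics


def percentile_ci(data, n_reps, seed, level=0.95):
    """Simple percentile bootstrap CI for the mean, using n_reps replications."""
    rng = random.Random(seed)
    n = len(data)
    # Each replication resamples n cases with replacement and records the mean.
    means = sorted(statistics.fmean(rng.choices(data, k=n)) for _ in range(n_reps))
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * n_reps)
    hi_idx = min(n_reps - 1, int((1.0 - tail) * n_reps))
    return means[lo_idx], means[hi_idx]


def max_endpoint_shift(intervals):
    """Largest movement of either CI endpoint across the seeded runs."""
    lows = [lo for lo, _ in intervals]
    highs = [hi for _, hi in intervals]
    return max(max(lows) - min(lows), max(highs) - min(highs))


# Hypothetical data: 150 draws from a normal distribution (illustration only).
data_rng = random.Random(2009)
data = [data_rng.gauss(6.0, 10.0) for _ in range(150)]

for n_reps in (100, 2000):
    # Steps 1 and 2 of the recipe: same n_reps, different seeds.
    intervals = [percentile_ci(data, n_reps, seed) for seed in (11, 22, 33)]
    shift = max_endpoint_shift(intervals)
    print(f"reps={n_reps:5d}  largest endpoint shift across seeds = {shift:.3f}")
    for lo, hi in intervals:
        print(f"    95% CI = ({lo:.2f}, {hi:.2f})")
```

Whether the printed shift counts as "meaningful" is the judgment call the passage describes: compare it with the number of digits you actually plan to report.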
--
※ Origin: PTT (批踢踢實業坊, ptt.cc)
◆ From: 114.45.174.165
Push bmka: Sometimes bootstrap is used because the asymptotic variance is too complicated (not because 05/16 03:19
→ bmka: it can't be derived). Also, moment estimators (the asymptotic variance 05/16 03:20
→ bmka: estimator is one of those) are less robust when outliers are 05/16 03:20
→ bmka: present. 05/16 03:20
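Push danny789: You say to try n resamples and check that there's no big difference; honestly, that doesn't fit 05/22 17:32
→ danny789: the scientific spirit. This analysis may be OK, but the next analysis on different data may not be. 05/22 17:33
→ danny789: You did cite a reference (I don't know whether it counts as strong evidence), but it leans too much 05/22 17:34
→ danny789: toward mathematical argument. Honestly, I don't object to running the analysis with n>50, 05/22 17:37
→ danny789: correction: n>500 05/22 17:38
→ danny789: but there is still no empirical research showing that doing so is meaningful, so for now I 05/22 17:39
→ danny789: still have reservations about n>500. 05/22 17:40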
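→ bmka: Have you actually read any of those references on the bootstrap method? 05/25 09:20
→ bmka: If not, at least look at the explanation on Wikipedia. 05/25 09:27
→ bmka: chenyutn has already explained it so clearly; why are you still so hung up on this XD 05/25 09:32
→ bmka: I'll say it once more: computing time is often the cheapest thing in a study. 05/25 09:33
→ bmka: Run as many resamples as you can so the estimates stabilize (in statistical terms, converge). 05/25 09:36
→ bmka: Also, the reason for a large number of resamples is exactly to avoid the "not OK" case you describe. 05/25 09:39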
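→ danny789: To B: have you actually read what I wrote? What you two have offered is all 05/27 08:02
→ danny789: mathematical discussion, and that has to be confirmed as correct by "empirical" research, 05/27 08:03
→ danny789: not adopted just by referring to one piece of mathematical reasoning; besides, the literature provided 05/27 08:06
→ danny789: is really thin. "Empirical" research exists precisely to test the gap between "theory" and "fact". 05/27 08:08
→ danny789: Research advances little by little; you can't get a result in your own lab 05/27 08:10
→ danny789: (or a mathematical result) and assume the practical environment outside 05/27 08:11
→ danny789: gives the same result. 05/27 08:12
→ danny789: I have at least cited literature from a top-five MIS journal to support my view; I hope 05/27 08:15
→ danny789: you can likewise cite solid "empirical" journal literature to prove me wrong, and I would 05/27 08:17
→ danny789: gladly accept it. 05/27 08:18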
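Push bmka: I wonder whether you've misunderstood the bootstrap... Bootstrap is a tool for approximation. 05/27 10:49
→ bmka: "In theory", as long as the number of resamples approaches infinity, 05/27 10:50
→ bmka: the bootstrap approximation approaches the true value 05/27 10:50
→ bmka: (asymptotically consistent). But when the number of resamples is 05/27 10:50
→ bmka: finite (e.g. 500), asymptotic consistency does not necessarily hold. 05/27 10:51
→ bmka: That is why most research on the bootstrap repetition number 05/27 10:52
→ bmka: focuses on how large the number of bootstrap resamples must be, at minimum, to be large enough. 05/27 10:52
→ bmka: The check I suggested (comparing the results for 500, 1000, 2000) 05/27 10:53
→ bmka: is just a rule of thumb for checking convergence; there are more rigorous ways to estimate 05/27 10:54
→ bmka: the resampling number (Google is your friend). 05/27 10:54
→ bmka: The bootstrap was first proposed by Efron and has been used to death in statistics; 05/27 10:54
→ bmka: there are also many systematic studies. If you're interested, you should go read those articles 05/27 10:55
→ bmka: (a quick Google search turns up plenty). I notice that everything you cite is from the last few years; 05/27 10:56
→ bmka: I wonder whether your field only suddenly "discovered" this method in recent years. 05/27 10:56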