[Question] A sampling question

Author: ardodo (米蟲)

Board: R_Language

Title: [Question] A sampling question

Time: Mon May 4 14:34:16 2015

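Hello everyone on the board, I have a question I'd like to ask. I have student data from 22 departments of a college (10,000 records in total), and I want to sample 30 students from each department for analysis. How should I do this?

The approach I came up with is: subset one department, randomly draw 30 students and store them as an object, repeat this 22 times, and finally rbind the 22 objects. But this is time-consuming and inefficient. Is there a faster way?

--
※ Origin: 批踢踢實業坊(ptt.cc), From: 163.14.191.172
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1430721259.A.0C8.html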

→ celestialgod: Assume dat is the dataset 05/04 16:03

→ celestialgod: dat %>% split(.$department)%>%lapply(function(x) 05/04 16:07

→ celestialgod: x[sample(1:nrow(x), 2),]) %>% rbindlist(.) 05/04 16:07

→ celestialgod: department is the department variable in dat 05/04 16:08

→ celestialgod: base::split, data.table::rbindlist, magrittr::%>% 05/04 16:08

→ celestialgod: 2 is the number of rows to sample per group 05/04 16:09

→ celestialgod: You can also do it with dplyr group_by 05/04 16:10

→ celestialgod: rbindlist(.) can be replaced with do.call(rbind, .) 05/04 16:10
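For reference, the pipeline in the comments above can also be written with base R only. This is a minimal sketch, not celestialgod's exact code: the toy data frame `dat`, its `id` and `score` columns, the seed, and the helper name `sample_per_group` are all assumptions made up for illustration.

```r
# Toy stand-in for the real data: 10,000 students in 22 departments (assumed).
set.seed(1)
dat <- data.frame(
  id = 1:10000,
  department = sample(paste0("dept", 1:22), 10000, replace = TRUE),
  score = rnorm(10000)
)

# Hypothetical helper: draw up to n random rows from every group.
sample_per_group <- function(df, group, n) {
  pieces <- split(df, df[[group]])          # one data frame per group
  picked <- lapply(pieces, function(x) {
    # seq_len() is safe for tiny groups; min() guards groups smaller than n
    x[sample(seq_len(nrow(x)), min(n, nrow(x))), , drop = FALSE]
  })
  do.call(rbind, picked)                    # stack the pieces back together
}

out <- sample_per_group(dat, "department", 30)
table(out$department)  # number of sampled rows per department
```

Passing 30 instead of the 2 used in the comments matches the sample size asked for in the original question.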

Push psinqoo: celestialgod, please post it as a reply article~ it's hard to read this way 05/04 16:59

→ celestialgod: I didn't want to start another post just for this QQ 05/04 18:01

→ celestialgod: It's only two lines of code 05/04 18:01

→ gsuper: Just use tapply() and sample() to get the row indices of the big matrix 05/13 11:33

→ gsuper: tapply(1:10000,groupFactor,function(s){sample(s,30)}) 05/13 11:36

→ celestialgod: To the commenter above: split can cut the data directly by a variable, which is much more convenient 05/13 11:36

→ gsuper: Then: bigMatrix[unlist(index),] 05/13 11:36

→ celestialgod: It's actually faster than tapply. 05/13 11:36

→ gsuper: Mm....split works too 05/13 11:39

→ gsuper: sapply(split(1:6,c(1,1,1,2,2,2)),sample,2) something like this 05/13 11:41
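The index-based idea from gsuper's comments could look roughly like the sketch below. It rests on assumptions: `groupFactor` and `bigMatrix` are made-up stand-ins for the real grouping variable and data, and `unlist()` is added because `split()` with `lapply()` (like `tapply()` here) returns one index vector per group rather than a single vector usable as a row subscript.

```r
set.seed(2)
# Assumed grouping vector: 10,000 rows spread across 22 groups.
groupFactor <- sample(1:22, 10000, replace = TRUE)
# Stand-in for the real data matrix (3 arbitrary numeric columns).
bigMatrix <- matrix(rnorm(10000 * 3), ncol = 3)

# For each group, pick 30 of its row numbers at random.
idx_list <- lapply(
  split(seq_along(groupFactor), groupFactor),
  function(s) s[sample.int(length(s), 30)]
)

# Flatten the per-group list into one plain vector of row numbers.
index <- unlist(idx_list, use.names = FALSE)

sampled <- bigMatrix[index, ]
dim(sampled)
```

Indexing through `sample.int(length(s), ...)` rather than calling `sample(s, ...)` directly avoids `sample()` treating a single number `x` as `1:x`, which would matter if a group ever held only one row.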