[心得] bootstrap long format

作者memphis (讓你喜歡這世界~)

看板R_Language

標題[心得] bootstrap long format

時間Tue May 23 15:04:31 2017

假設資料長這樣 ID V1 V2 1 10 11 1 11 12 1 12 13 2 13 14 2 14 15 2 15 16 3 16 17 3 17 18 4 18 19 4 19 20 先bootstrap ID s <- sample(unique(data$ID), replace=T) 再抓資料 data2 <- data[data$ID %in% s] #這樣就錯了 #s裡是有重複的ID沒錯 #可是 %in% 不會抓重複的值網路上查尋的的結果,是用grr:::matches s_idx <- as.numeric(unlist(matches(s, data$ID, list=T))) data2 <- data[s_idx] 看起來還算簡約, 只是為了一個小功能又要裝一個pkg..有點煩躁 -- ※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 140.109.73.105 ※ 文章網址: https://www.ptt.cc/bbs/R_Language/M.1495523074.A.34C.html

→ cywhale: 先給一欄row_id再bootstrap再取回真正ID和值，這樣不行嗎 05/23 15:25

→ a78998042a: s_idx = unlist(lapply(as.list(s),function(x) 05/23 19:47

→ a78998042a: which(data$ID%in%x))) 05/23 19:47

→ memphis: @cywhale:沒看懂你說的是什麼~ 我要的是如果ID2被sample 05/23 21:42

→ memphis: 2次..那再重購資料的時候, 整套ID2都要重複兩次 05/23 21:42

→ memphis: @a78998042a: 看起來好像是對的XD, 也就是loop一遍清單 05/23 21:44

→ memphis: 把符合某一ID的那些row蒐集起來 05/23 21:44

> -------------------------------------------------------------------------- < 作者: celestialgod (天) 看板: R_Language 標題: Re: [心得] bootstrap long format 時間: Tue May 23 20:19:18 2017 ※ 引述《memphis (讓你喜歡這世界~)》之銘言： : 假設資料長這樣 : ID V1 V2 : 1 10 11 : 1 11 12 : 1 12 13 : 2 13 14 : 2 14 15 : 2 15 16 : 3 16 17 : 3 17 18 : 4 18 19 : 4 19 20 : 先bootstrap ID : s <- sample(unique(data$ID), replace=T) : 再抓資料 : data2 <- data[data$ID %in% s] #這樣就錯了 : #s裡是有重複的ID沒錯 : #可是 %in% 不會抓重複的值 : 網路上查尋的的結果,是用grr:::matches : s_idx <- as.numeric(unlist(matches(s, data$ID, list=T))) : data2 <- data[s_idx] : 看起來還算簡約, 只是為了一個小功能又要裝一個pkg..有點煩躁 DF <- read.table(textConnection(" ID V1 V2 1 10 11 1 11 12 1 12 13 2 13 14 2 14 15 2 15 16 3 16 17 3 17 18 4 18 19 4 19 20"), header = TRUE) # 簡單的方式，只是可能unique ID多一點會久一些些 lenID <- length(unique(DF$ID)) s <- sample(unique(DF$ID), replace = TRUE) s_idx <- which(sweep(matrix(rep(DF$ID, lenID), lenID, byrow = TRUE), 1, t(s), `==`), arr.ind = TRUE) DF[s_idx[ , 2], ] # plyr 只是簡單的抓出來要的 library(plyr) ldply(s, function(i) DF[DF$ID == i, ]) # dplyr 有點複雜Orz library(dplyr) library(tidyr) DF %>% group_by(ID) %>% mutate(t = sum(s == ID[1])) %>% summarise_each(funs(list(rep(., times = t))), -t) %>% unnest # data.table 用key加速抓取速度 library(data.table) DT <- data.table(DF) setkey(DT, ID) rbindlist(lapply(s, function(i) DT[ID == i])) # Rcpp 自幹一個matches library(Rcpp) sourceCpp(code = " #include <RcppArmadillo.h> // [[Rcpp::depends(RcppArmadillo)]] // [[Rcpp::export]] SEXP bootstrapId(arma::Col<int> ID, arma::Col<int> idx) { arma::uvec out = find(ID == idx[0]); if (idx.n_elem > 1) { for (arma::uword i = 1; i < idx.n_elem; ++i) out.insert_rows(out.n_rows, find(ID == idx[i])); } if (out.n_elem == 0) return R_NilValue; return Rcpp::wrap(out); }") DF[bootstrapId(DF[["ID"]], s), ] bootstrapId(DF[["ID"]], c(5,5,6,6)) # NULL -- ※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 118.170.51.16 ※ 文章網址: https://www.ptt.cc/bbs/R_Language/M.1495541962.A.256.html

推 memphis: XD 居然是CPP版本最看得懂 05/23 21:49

自己寫一個matches就不用用套件了~~ ※ 編輯: celestialgod (118.170.51.16), 05/23/2017 21:51:30

推 memphis: 我投靠A78的解答了, 你也參考看看, 我覺得不裝新套件的 05/23 21:55

→ memphis: 情況下..他寫的是最直接的花費腦力少的 05/23 21:56