Re: [Question] Performance problem when building a dummy variable matrix

Author: celestialgod (天)

Board: R_Language

Title: Re: [Question] Performance problem when building a dummy variable matrix

Time: Mon Jan 8 19:27:57 2018

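※ Quoting 《Wush978 (拒看低質媒體)》:
: Your problem happens to be equivalent to building a document-term matrix in text mining.
: ps. Given a piece of text (one string), split it on spaces or other symbols and build a matrix.
: Thanks to the earlier posters for sharing, but once I approach the problem from this angle
: I can stand on the shoulders of giants (that is, the code below runs faster because
: the package author wrote it well).
: At the moment I think the best R package for this job is text2vec.
: Another small point is that the output matrix should preferably be sparse. Most of your
: data is 0, so a sparse matrix speeds things up considerably and saves memory.
: The longer your player list, the more noticeable the speedup.
: This is how I processed your sample data with text2vec:
: it <- itoken(data[[1]], tokenizer = word_tokenizer, progressbar = FALSE,
:              n_chunks = 10)
: it2 <- itoken(data[[2]], tokenizer = word_tokenizer, progressbar = FALSE,
:               n_chunks = 10)
: vocab <- create_vocabulary(player)
: vectorizer <- vocab_vectorizer(vocab)
: m1 <- create_dtm(it, vectorizer)
: m2 <- create_dtm(it2, vectorizer)
: m2@x[] <- -1
: cbind(m1, m2)
: Here is the comparison with the other posters' methods:
: http://rpubs.com/wush978/345283
: andrew43's version performs better,
: but with parallel processing turned on, text2vec is a bit faster than andrew43's method on my machine.
: ※ Quoting 《mowgur (PINNNNN)》:
: : - Question: use this category when you want to ask a question.
: : Please search the board's old posts first at http://tinyurl.com/mnerchs
: : [Question type]:
: : Performance advice (I want R to run faster)
: : [Software familiarity]:
: : User (has already built a fair number of things with R)
: : [Problem description]:
: : Hi everyone. My data records, for every play of a basketball game, which 5 offensive
: : and 5 defensive players are on the court.
: : What I want to do: assuming 500 players in total, build an n (750000) x p (1000) matrix.
: : The first 500 columns are offense and the last 500 columns are defense.
: : An element is 1 if the player is on the court on offense (-1 for defense) and 0 if not on the court.
: : So every row has five 1s, five -1s, and a lot of 0s.
: : The data looks roughly like this:
: :     data$p.combination    data$p.com.allowed
: : 1   A, B, C, D, E         J, K, L, M, N
: : 2   A, C, F, H, I         K, L, M, N, O
: : 3   C, D, X, Y, Z         K, M, O, Q, R
: :     ...                   ...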
: : Names are separated by a comma and one space.
: : The code I wrote myself has been running for almost 12 hours and still has not finished.
: : Could anyone on the board suggest a better way to write it?
: : [Code example]:
: : https://ideone.com/PaBtM4

It wasn't convenient for me to reply earlier; today I finally have time to offer my own approach XD

I use the fastmatch package directly: look up the indices that are needed and build the sparse matrix straight from them.

Compared with andrew's and wush's methods (single core at 3.87 GHz), my method is nearly 4 times faster than the faster of the two (wush's), going by the benchmark at the end.

Readable version: https://pastebin.com/ySxqNtxt

Code:

library(pipeR)
library(stringr)
library(data.table)
library(fastmatch)
library(plyr)
library(text2vec)
library(Matrix)

# Data generation
numPlayers <- 500
numGames <- 300000
namePlayers <- sprintf("P_%03d", 1:numPlayers)
getCombinedFunc <- function(data, numSampling, numGroup) {
  DT <- data.table(V = sample(data, numGroup * numSampling, TRUE),
                   i = rep(1:numSampling, each = numGroup,
                           length.out = numGroup * numSampling),
                   key = "i")
  # Make sure every row ends up with numGroup distinct player names
  # (the original hard-coded 5 here; numGroup is 5 in every call below)
  uniqueDT <- unique(DT)
  while (nrow(uniqueDT) < numSampling * numGroup) {
    # Rows that are still short; N becomes the number of names still missing
    tmpDT <- uniqueDT[ , .N, by = .(i)][N < numGroup][ , N := numGroup - N]
    # Draw one extra name for each short row, then drop duplicates again
    uniqueDT <- rbind(uniqueDT,
                      data.table(V = sample(data, nrow(tmpDT), TRUE),
                                 i = tmpDT$i)) %>>% unique
  }
  return(uniqueDT[ , .(combinedV = str_c(V, collapse = ",")), by = .(i)]$combinedV)
}

# Time the generation step
system.time(getCombinedFunc(namePlayers, numGames, 5)) # 1.64 seconds

# Generate the target data table
DT <- data.table(attack = getCombinedFunc(namePlayers, numGames, 5),
                 defence = getCombinedFunc(namePlayers, numGames, 5))

# Adapted from andrew's method
andrew <- function(data, name.player) {
  out.attack <- strsplit(data[[1]], ",") %>>%
    sapply(function(x) name.player %in% x) %>>% t %>>%
    `colnames<-`(str_c("attack_", name.player)) %>>%
    mapvalues(c(TRUE, FALSE), c(1L, 0L), FALSE)
  out.defence <- strsplit(data[[2]], ",") %>>%
    sapply(function(x) name.player %in% x) %>>% t %>>%
    `colnames<-`(str_c("defense_", name.player)) %>>%
    mapvalues(c(TRUE, FALSE), c(-1L, 0L), FALSE)
  cbind(out.attack, out.defence)
}

# Adapted from wush's method
wush <- function(data, name.player) {
  it <- itoken(data[[1]], tokenizer = word_tokenizer, progressbar = FALSE,
               n_chunks = 10)
  it2 <- itoken(data[[2]], tokenizer = word_tokenizer, progressbar = FALSE,
                n_chunks = 10)
  vocab <- create_vocabulary(name.player)
  vectorizer <- vocab_vectorizer(vocab)
  m1 <- create_dtm(it, vectorizer)
  colnames(m1) <- str_c("attack_", colnames(m1))
  m2 <- create_dtm(it2, vectorizer)
  colnames(m2) <- str_c("defense_", colnames(m2))
  m2@x[] <- -1
  cbind(m1, m2)
}

# My method
getLocMatFunc <- function(x, table, value = 1, colnames = NULL) {
  tmp <- str_split(x, ",")
  # Find the column positions
  j <- fmatch(do.call(c, tmp), table)
  # Find the row positions
  i <- do.call(c, mapply(function(i, x) rep(i, length(x)),
                         seq_along(tmp), tmp, SIMPLIFY = FALSE))
  # Build the sparse matrix
  sparseMatrix(i, j, x = value, dims = c(length(x), length(table)),
               dimnames = list(NULL, colnames))
}

getMatrixFunc <- function(DT, namePlayers) {
  cbind(getLocMatFunc(DT$attack, namePlayers, 1, str_c("attack_", namePlayers)),
        getLocMatFunc(DT$defence, namePlayers, -1, str_c("defense_", namePlayers)))
}

# Check the results
Andrew <- andrew(DT, namePlayers)
Wush <- wush(DT, namePlayers)
rownames(Wush) <- NULL
MyMethod <- getMatrixFunc(DT, namePlayers)
all.equal(Wush, Matrix(Andrew, sparse = TRUE)) # TRUE
all.equal(MyMethod, Wush) # TRUE
all.equal(MyMethod, Matrix(Andrew, sparse = TRUE)) # TRUE

# Benchmark with microbenchmark
library(microbenchmark)
microbenchmark(
  Andrew = andrew(DT, namePlayers),
  Wush = wush(DT, namePlayers),
  MyMethod = getMatrixFunc(DT, namePlayers),
  times = 10L
)
# Unit: seconds
#      expr       min        lq      mean    median        uq       max neval
#    Andrew 25.564674 25.631636 26.357786 26.429542 26.804092 27.312797    10
#      Wush  8.051787  8.127275  8.327858  8.319552  8.556822  8.621760    10
#  MyMethod  1.978885  2.033370  2.240003  2.145650  2.334539  2.959432    10

--
R data-wrangling package series:
magrittr #1LhSWhpH (R_Language) https://goo.gl/72l1m9
data.table #1LhW7Tvj (R_Language) https://goo.gl/PZa6Ue
dplyr (parts 1 and 2) #1LhpJCfB,#1Lhw8b-s (R_Language) https://goo.gl/I5xX9b
tidyr #1Liqls1R (R_Language) https://goo.gl/i7yzAz
pipeR #1NXESRm5 (R_Language) https://goo.gl/zRUISx
--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 125.224.102.242
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1515410883.A.E8E.html
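
For anyone who wants to see the (i, j) index idea behind getLocMatFunc on something small enough to check by eye, here is a minimal sketch. It is an illustration under assumptions, not the benchmarked code: the five-player table and the three lineups are made-up toy data, base match() stands in for fastmatch::fmatch(), and rep() with lengths() stands in for the mapply() call that builds the row index.

```r
library(Matrix)

# Hypothetical toy data: a five-player table and three comma-separated lineups
players <- c("A", "B", "C", "D", "E")
lineups <- c("A,B", "C,E", "A,D,E")

# Split each lineup into its player names
tokens <- strsplit(lineups, ",", fixed = TRUE)

# Row index: repeat each row number once per name found on that row
i <- rep(seq_along(tokens), lengths(tokens))

# Column index: position of each name in the player table
# (base match() is used here in place of fmatch())
j <- match(unlist(tokens), players)

# One non-zero entry per (i, j) pair; everything else stays an implicit 0
m <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(length(lineups), length(players)),
                  dimnames = list(NULL, players))
print(m)
```

For the defensive half, the same call with x = -1 gives the -1 entries, and cbind() of the two sparse matrices gives the full layout, as in getMatrixFunc.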

Push andrew43: This efficiency is seriously insane 01/08 20:01

※ Edited: celestialgod (125.224.102.242), 01/08/2018 20:36:12

Push cywhale: Push, thanks, I learned a new pkg~~ 01/09 01:40

Push cywhale: This series is archived under z-11-21 for everyone to consult and compare~ 01/11 22:47