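[Question] Extracting rows from a large dataset with a program

Author: sariel0322 (sariel)

Board: Python

Title: [Question] Extracting rows from a large dataset with a program

Time: Mon Dec 22 17:09:32 2014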

Hi everyone, I want to pull out of a CSV file the rows whose key appears only once (or twice, or three times). I wrote some code hoping to extract the data as fast as possible. It is already running on the server, but it seems to be stuck?

Current problem: I tried it on a smaller dataset and it ran fine. My data is probably too large, because once it prints the "start output" line I set up, it just sits there.

Here is my code:

    import csv
    from collections import defaultdict

    protein_table = defaultdict(list)
    P = []
    a = int(raw_input("times: "))
    out = str(a+1) + " domain protein.csv"
    o = open(out,"w")
    f = open("multiple domain protein.csv","r")
    for row in csv.reader(f):
        P.append(row[1])
        protein_table[row[1]].append(row[0]+","+row[1]+","+row[2]+","+row[3]+","+row[4]+"\n")
    print "----------------------start output-------------------"
    for i in [k for k in P if P.count(k)==a]:
        if i in protein_table:
            for protein in protein_table[i]:
                o.write(protein)
    o.flush()
    o.close()
    f.close()

Does anyone have suggestions on what I should change? Or do I have no choice but to write a loop that simply takes a long time to run?

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 120.126.36.171
※ Article URL: http://www.ptt.cc/bbs/Python/M.1419239375.A.EE5.html

→ alibuda174: Change the code after "start output" to the following; not sure whether it works 12/22 18:16

→ alibuda174: for i in protein_table: 12/22 18:16

→ alibuda174:     if len(protein_table[i]) == a: 12/22 18:16

→ alibuda174:         for p in protein_table[i]: 12/22 18:16

→ alibuda174:             o.write(p) 12/22 18:16

→ alibuda174: o.flush() 12/22 18:17

→ alibuda174: Ah, sorry... that's not quite right. 12/22 18:18

→ alibuda174: Give it a try anyway... :D 12/22 18:18

→ ccwang002: Replace P.count with Counter(P), or use [k for k in set(P) ...] 12/22 18:25

→ ccwang002: You would need to benchmark it. Ref: stackoverflow.com/a/12452678 12/22 18:26
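
A minimal sketch of the Counter idea above, not a tested replacement for the original script. In the posted code, P.count(k) rescans the whole list for every element, so the work grows quadratically with the number of rows, which would explain the apparent hang on large input. The function name select_rows and the sample data are made up for illustration, and the sketch is written for Python 3 while the original post uses Python 2. It also keeps each matching row once, which is assumed to be the intent; the posted list comprehension yields a key once per occurrence, so it would write each group of rows several times.

```python
import csv
import io
from collections import Counter


def select_rows(rows, times):
    """Return the rows whose key (column index 1) occurs exactly `times` times."""
    rows = list(rows)
    # One pass over the data to count every key, instead of calling
    # list.count() once per row.
    counts = Counter(row[1] for row in rows)
    # Second pass: keep rows whose key has the requested count,
    # in their original file order.
    return [row for row in rows if counts[row[1]] == times]


# Tiny in-memory stand-in for the CSV file (made-up data).
sample = io.StringIO(
    "d1,P1,x,y,z\n"
    "d2,P2,x,y,z\n"
    "d3,P1,x,y,z\n"
    "d4,P3,x,y,z\n"
)
selected = select_rows(csv.reader(sample), 2)
for row in selected:
    print(",".join(row))
```

With a real file, the same function could be fed csv.reader(f) and the result written out with csv.writer. Whether this is fast enough still needs to be measured on the actual data, as the comment above says.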

push jimmytzeng: Why not just import the CSV into sqlite and then query it with SQL? 12/23 08:25
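
A rough sketch of what the sqlite suggestion could look like, using the sqlite3 module from the standard library. The table name domains, the column names, and the function name select_rows_sql are all hypothetical, since the post does not say what the five CSV columns mean; only the second column being the key is taken from the original code. An in-memory database is used here to keep the example self-contained. Python 3 is assumed.

```python
import sqlite3


def select_rows_sql(rows, times):
    """Load rows into sqlite and return those whose key occurs `times` times."""
    conn = sqlite3.connect(":memory:")  # a file path would persist the data
    conn.execute(
        "CREATE TABLE domains (c0 TEXT, protein TEXT, c2 TEXT, c3 TEXT, c4 TEXT)"
    )
    # Only the first five columns are used, as in the original script.
    conn.executemany(
        "INSERT INTO domains VALUES (?, ?, ?, ?, ?)",
        (row[:5] for row in rows),
    )
    query = """
        SELECT c0, protein, c2, c3, c4
        FROM domains
        WHERE protein IN (
            SELECT protein FROM domains
            GROUP BY protein
            HAVING COUNT(*) = ?
        )
        ORDER BY rowid
    """
    result = conn.execute(query, (times,)).fetchall()
    conn.close()
    return result


# Made-up rows standing in for the output of csv.reader(f).
rows = [
    ["d1", "P1", "x", "y", "z"],
    ["d2", "P2", "x", "y", "z"],
    ["d3", "P1", "x", "y", "z"],
]
print(select_rows_sql(rows, 2))
```

The GROUP BY ... HAVING COUNT(*) subquery does the counting inside the database, so different values of "times" can be queried without rereading the CSV if the database is kept on disk.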

push polom: You could use the re module (regular expressions) to handle this in a more abstract way 01/26 22:18