[Question] Web crawler

Author: busystudent (busystudent)

Board: Python

Title: [Question] Web crawler

Time: Sun Nov 13 12:18:53 2016

Hi everyone, I have a question about a crawler script. I can't figure out what is wrong with it right now, so I'd appreciate some pointers. Thanks.

The problem is the line soup.findAll('div',{'class':'Titleinner'}) (news titles and URLs): it no longer finds anything. When I first wrote the script it could extract the URLs. I stepped back and inspected the soup itself, and not a single div with class Titleinner shows up. Any help would be appreciated.

Easier-to-read version: https://ideone.com/BtvcXV

import requests
from bs4 import BeautifulSoup

name_list = ['Joshtwery']
for a in name_list:
    links = ['http://www.diigo.com/user/Joshtwery?page_num=0&type=all&sort=updated']
    for link in links:
        print link
        res = requests.get(link,"lxml")
        soup = BeautifulSoup(res.text.encode("utf-8"))
        fol_table = soup.findAll('div',{'class':'Titleinner'})  # news titles and URLs
        print fol_table  # checkpoint
        for a_link in fol_table:
            a_links =str([tag['href']for tag in a_link.findAll('a',{'href':True})])[1:-1]
            a_links =([tag['href']for tag in a_link.findAll('a',{'href':True})])

--
Sent from my Windows
--
※ Origin: 批踢踢實業坊(ptt.cc), From: 1.172.114.184
※ Article URL: https://www.ptt.cc/bbs/Python/M.1479010736.A.899.html
※ Edited: busystudent (1.172.114.184), 11/13/2016 12:22:57
※ Edited: busystudent (223.139.128.15), 11/13/2016 12:27:16
※ Edited: busystudent (1.172.114.184), 11/13/2016 14:18:39
※ Edited: busystudent (1.172.114.184), 11/13/2016 14:22:59
※ Edited: busystudent (1.172.114.184), 11/13/2016 14:47:14
※ Edited: busystudent (1.172.114.184), 11/13/2016 14:49:23

→ s860134: Open the browser dev tools (F12). The page you're fetching never contained what you want in the first place 11/13 16:11

→ s860134: What you see in the browser is requested afterwards by JS that runs on the URL you're crawling 11/13 16:11

→ s860134: So crawling this URL only gets you a pile of code with no actual content in it 11/13 16:12
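
One way to confirm this kind of diagnosis is to check whether the class being searched for appears anywhere in the HTML the server returns before any JavaScript runs. The following is a minimal sketch using only the Python 3 standard library; `ClassCounter`, `count_class`, and the two sample strings are made up for illustration and are not part of the original script.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the raw HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical samples: a server-rendered page vs. a JS-only shell.
static_page = '<div class="Titleinner"><a href="http://example.com">x</a></div>'
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

print(count_class(static_page, "Titleinner"))
print(count_class(js_shell, "Titleinner"))
```

If the count is 0 for the text returned by the request while the element is clearly visible in the browser, the content is most likely being filled in by JavaScript after the page loads.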

→ busystudent: I see. The site didn't use JS before 11/13 16:15

→ busystudent: So how do I deal with the JS? 11/13 16:16

→ s860134: If you can hit the API URL directly, just do that. This site's API is very simple 11/13 16:29

→ s860134: Record the traffic with F12 in the browser and you'll find it. It even returns JSON 11/13 16:29
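
As a rough sketch of what "hitting the API directly" could look like in Python 3: request the URL copied from the browser's Network tab and decode the body as JSON. The actual API URL and response layout are not given in this thread, so the URL, the `items`/`title`/`url` fields, and the `fetch_json` and `fake_opener` helpers below are all placeholders.

```python
import io
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Request url and decode the response body as JSON."""
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Stand-in for the network so the sketch runs offline; a real call would
# simply be fetch_json(api_url) with the URL copied from the Network tab.
def fake_opener(url):
    return io.BytesIO(b'{"items": [{"title": "demo", "url": "http://example.com"}]}')


# The URL and the response layout here are assumptions, not the site's real API.
data = fetch_json("http://example.invalid/api", opener=fake_opener)
print(data["items"][0]["url"])
```

With the `requests` library already used in the post, `requests.get(api_url).json()` does the same decoding in one call.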

→ busystudent: Wow, slow down a bit. This is my first time dealing with JS 11/13 16:38

→ busystudent: I found where the JS keeps the original content 11/13 16:40

→ busystudent: It's in the place you said to record 11/13 16:40

→ busystudent: I'm still working out what to do next 11/13 16:41

Push koshi0413: Just treat it as a string? When I scrape pages I use bs4 or plain string handling, and for AJAX I open a 11/13 17:18

→ koshi0413: browser window 11/13 17:18

→ busystudent: Hi, how do you open a window for AJAX? 11/13 17:19

→ busystudent: Until now I've always just crawled the page directly 11/13 17:20

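Push koshi0413: Record the page in the browser and try the requests one by one to see whether one returns plain text; otherwise 11/13 17:27

→ koshi0413: use bs4. If bs4 can't parse it, go read its docs. I once spent half a day with bs4 11/13 17:27

→ koshi0413: just to pull out one row of titles 11/13 17:27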

→ s860134: http://imgur.com/a/qUcxk 11/13 20:24

Push koshi0413: Judging from s860134's screenshot, string processing is enough; re is faster 11/13 20:30
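
For reference, the string-processing route with `re` might look roughly like this in Python 3. The response text and the `"url"` field name are assumptions made for illustration, since the real response is only shown in the screenshot; `find_urls` is a hypothetical helper.

```python
import re

# Hypothetical snippet of a raw response; the real layout is an assumption.
raw = ('{"title": "a", "url": "http://example.com/1"}, '
       '{"title": "b", "url": "http://example.com/2"}')

# Capture whatever sits between the quotes that follow "url":
URL_FIELD = re.compile(r'"url"\s*:\s*"([^"]*)"')


def find_urls(text):
    """Return every value of a "url" field found in text, in order."""
    return URL_FIELD.findall(text)


print(find_urls(raw))
```

This pattern does not handle escaped quotes or JSON escapes such as `\/`, so it only suits simple, predictable responses.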

→ busystudent: I still don't understand where s860134's result came from. I recorded the page and what I got 11/13 20:59

→ busystudent: looks like this: http://imgur.com/a/h1Kj7 11/13 20:59

→ busystudent: How do I convert it to a string after that? 11/13 20:59

→ busystudent: Also, how did s860134 find the API URL? 11/13 20:59

→ s860134: The browser dev tools only start recording once they're open, so open them first and then reload the page 11/13 21:33

→ s860134: No need to convert it to a string, it's already JSON: http://imgur.com/a/50a4Z 11/13 21:36

→ s860134: JSON is a standard format on the web, and Python has a built-in lib for parsing it 11/13 21:38
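
The built-in library referred to here is the `json` module. A minimal Python 3 sketch of parsing a response and pulling out titles and URLs follows; the field names `items`, `title`, and `url` and the helper `extract_links` are hypothetical, since the real response structure appears only in the screenshot.

```python
import json

# Hypothetical response body; the real field names must be read off the
# Network tab, they are not known from this thread.
raw = '''
{"items": [
  {"title": "First bookmark", "url": "http://example.com/1"},
  {"title": "Second bookmark", "url": "http://example.com/2"},
  {"title": "No link here"}
]}
'''


def extract_links(text):
    """Parse JSON text and return (title, url) pairs for items that have a url."""
    data = json.loads(text)
    return [(item.get("title", ""), item["url"])
            for item in data.get("items", [])
            if "url" in item]


for title, url in extract_links(raw):
    print(title, url)
```

Once parsed, the result is ordinary Python dicts and lists, so no string conversion or HTML parsing is needed.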

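Push koshi0413: busystudent, once you've solved it s860134's way, take some time to look up 大數據學堂 online and go through it 11/13 21:49

→ koshi0413: from the beginning. It will help 11/13 21:49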

→ busystudent: Thanks to everyone above! 11/13 22:32

→ busystudent: One more question: the first line of s860134's screenshot shows an API 11/13 23:28

→ busystudent: URL, and I still can't figure out where it comes from 11/13 23:28

→ busystudent: Whatever I try, I only get something like the first screenshot I posted 11/13 23:29

→ hoho8: Huh? I found it right away. Try switching to the XHR filter, it's quicker to find there 11/14 04:54