[Question] How to handle https in an image crawler

Author: craig1122321 (半醉夜貓)

Board: Python

Title: [Question] How to handle https in an image crawler

Time: Thu May 11 18:47:15 2017

As the title says. My code is below:

import requests, threading
from bs4 import BeautifulSoup
from urllib.request import urlopen

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'
}
url = ('http://www.dcard.tw/f/photography/p/226364232.html')
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
imgs = soup.select('img')
for img in imgs:
    try:
        fn = img['src']
        print(fn)
        img = urlopen(fn)
    except Exception as e:
        print(e)
        continue
    with open('./imgs/' + str(fn), 'wb') as f:
        f.write(img.read())

The url above is only a test address. While googling I found posts that use this pattern:

if re.match(r'^https?://(i.)?(m.)?imgur.com', link['href']):

But Dcard keeps its image URLs in the src attribute, and I don't know how to adapt the check. This is my first post, so please point out anything I got wrong. Thanks, everyone.

--
※ Sent from: PTT (ptt.cc), from: 123.205.57.171
※ Article URL: https://www.ptt.cc/bbs/Python/M.1494499637.A.4E1.html
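One way to adapt the href check to src values is sketched below. It is only an illustration: the src strings in the list are made up, standing in for the img['src'] values the loop in the post would see, and the helper name image_urls is hypothetical. It resolves each src against the page URL and keeps only http or https results.

```python
import re
from urllib.parse import urljoin

# Page URL taken from the post; used to resolve relative src values.
PAGE_URL = "http://www.dcard.tw/f/photography/p/226364232.html"

# Accept both http and https, as the quoted imgur pattern does.
IMAGE_URL = re.compile(r"^https?://", re.IGNORECASE)


def image_urls(page_url, srcs):
    """Return absolute http(s) URLs for a list of raw src values."""
    urls = []
    for src in srcs:
        if not src:
            continue  # skip empty src attributes
        # urljoin turns "//host/x.png" and "/x.png" into full URLs.
        absolute = urljoin(page_url, src)
        if IMAGE_URL.match(absolute):
            urls.append(absolute)
    return urls


# Made-up src values for illustration only.
srcs = [
    "https://i.imgur.com/abc123.jpg",
    "//i.imgur.com/def456.png",
    "data:image/gif;base64,R0lGOD",
    "",
]
print(image_urls(PAGE_URL, srcs))
```

Inline data: URIs and empty values are dropped, so only addresses that can actually be requested reach the download step.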

→ uranusjr: You're already using requests, so why do you still need urllib? 05/11 22:02

→ zerof: ??? Dcard is served over https, and imgur doesn't block https crawlers either, does it? 05/11 22:33

→ craig1122321: Reply to the first comment: because the line that sets the res variable needs it 05/12 13:03

→ craig1122321: Reply to the second comment: I get HTTP Error 403; after changing the code it becomes OSError 05/12 13:04

→ craig1122321: Error 22 Invalid argument 05/12 13:05
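A possible cause of the 403, offered as an assumption and not confirmed in this thread: the code in the post passes the headers dict to requests.get but calls urlopen(fn) with no headers, so the image request goes out with urllib's default user-agent, which some hosts reject. A minimal sketch that sends the same header on the image request (the helper names build_image_request and fetch_image are hypothetical):

```python
from urllib.request import Request, urlopen

# Same browser-like user-agent string as in the original post.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36"
    )
}


def build_image_request(url):
    """Wrap the URL in a Request that carries the custom headers."""
    return Request(url, headers=HEADERS)


def fetch_image(url, timeout=10):
    """Download one image and return its raw bytes."""
    with urlopen(build_image_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling requests.get(fn, headers=headers).content for each image would have the same effect and drop the urllib import entirely.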

→ craig1122321: Here is my current code: https://repl.it/HsWh/8 05/12 13:07

→ zerof: ??? The code runs fine for me and I can't reproduce your bug. Do you have an error log? 05/12 14:21

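→ craig1122321: zerof, I should have put it this way: the program runs without errors, but it does not actually 05/12 21:41

→ craig1122321: download the images to my computer (the imgs folder) 05/12 21:41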

→ zerof: ....so there is nothing in your imgs/ folder? 05/12 22:19

→ craig1122321: Yes. Does it work on your side, zerof? 05/12 23:39

Push king4647: Dcard has an open API, and it's quite easy to pull data from 05/20 23:46

→ starcaspar: I think the slashes in the name will cause problems when it is used as a file name 06/03 23:23

→ starcaspar: with open('./imgs/' + str(fn.split("/")[-1]) 06/03 23:23

→ starcaspar: (fill in the rest yourself) When opening the file to save, keep only the last part as the file name 06/03 23:24
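The file-name suggestion above can be sketched as follows. A full URL contains ":" and "/", which cannot appear in a file name on Windows; that is a plausible source of the "Invalid argument" OSError, though the thread does not confirm it. The helper name filename_from_url and the "image" fallback are assumptions made for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image"):
    """Return the last path segment of a URL for use as a local file name."""
    # urlsplit separates scheme, host and query string from the path,
    # so "?..." suffixes do not end up in the file name.
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a placeholder when the URL ends with "/".
    return name or default


print(filename_from_url("https://i.imgur.com/abc123.jpg"))
```

The result can then be joined to the target folder, for example open('./imgs/' + filename_from_url(fn), 'wb'), in place of the full URL. Compared with fn.split("/")[-1], this also strips any query string.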

→ starcaspar: Addendum: https causes problems on repl; it's not a problem with the code 06/03 23:28