Board: Python
Hi everyone,

I've recently been trying to write a web scraper and want to extract this part of a page:
http://imgur.com/rNdE4hh

Here is my attempt:

    # -*- coding: utf-8 -*-
    from urllib2 import urlopen
    import xml.etree.ElementTree as ET
    from lxml import etree
    import mechanize
    import sys

    url = "http://www.tham.com.tw/recipe6.php"
    path = "//*[@id=\"left-inner\"]/div[2]/div[3]"
    html = urlopen(url).read()
    tree = etree.HTML(html)
    startindex = 4
    data = tree.xpath(path)
    print data[0].text

Output:

    >>> ================================ RESTART ================================
    >>>
    材料 2人份
    >>>

Looking at the page source, my guess is that the <br /> tags stop the text from being read past the first line. Is there a way around this?

--
※ Posted from: PTT (ptt.cc), from: 123.195.222.114
※ Article URL: https://www.ptt.cc/bbs/Python/M.1457968017.A.79E.html
ckc1ark: Try //*[@id=\"left-inner\"]/div[2]/div[3]//text() 03/15 00:37
girl5566: Thanks, that solved it. 03/15 19:43
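
For anyone reading later, here is a small sketch of why the //text() suggestion helps. In lxml, an element's .text holds only the text that comes before its first child element, so everything after the first <br /> is skipped. The markup below is a made-up stand-in for the recipe block (the "ingredient A" and "ingredient B" lines are placeholders, not taken from the real page), and the code is written for Python 3 with lxml installed, whereas the thread itself uses Python 2.

```python
from lxml import etree

# Hypothetical markup standing in for the recipe block; the real page
# is not reproduced here and the ingredient lines are placeholders.
SAMPLE_HTML = """
<div id="left-inner">
  <div>header</div>
  <div>
    <div>a</div>
    <div>b</div>
    <div>材料 2人份<br />ingredient A<br />ingredient B</div>
  </div>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)
path = '//*[@id="left-inner"]/div[2]/div[3]'

# .text is only the text before the first child element (the first <br />).
node = tree.xpath(path)[0]
first_text = node.text

# //text() selects every text node under the element, including the
# text that follows each <br />.
all_text = [t.strip() for t in tree.xpath(path + "//text()") if t.strip()]

print(first_text)
print(all_text)
```

The text after each <br /> is stored on that <br /> element's .tail, which is why it never shows up in the parent's .text.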
One more question: how do I go about debugging XPath? Are there any tricks? The output below looks wrong too.

    # -*- coding: utf-8 -*-
    from urllib2 import urlopen
    import xml.etree.ElementTree as ET
    from lxml import etree
    import mechanize
    import sys

    url = "https://icook.tw/recipes/133425"
    html = urlopen(url).read()
    tree = etree.HTML(html)
    path = "//*[@id=\"recipes_show\"]/div[3]"
    title = tree.xpath(path)
    print title

Output:

    >>>
    []

※ Edited: girl5566 (123.195.222.114), 03/15/2016 20:24:59
aweimeow: path = "//*[@itemprop=\"name\"]" 03/16 20:18
aweimeow: print title[0].text 03/16 20:19
aweimeow: Your XPath is selecting the wrong element. 03/16 20:19
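
On the debugging question, one approach that may help is to evaluate the path one step at a time and see where the match count drops to zero. The sketch below assumes Python 3 with lxml; the markup is invented for illustration (the real icook.tw page is not reproduced here), and count_matches_per_step is a hypothetical helper name, not part of lxml. The thread does not say why the original path returned []; a common cause is that a positional path copied from the browser's inspector does not line up with the HTML the script actually downloads.

```python
from lxml import etree

# Hypothetical markup; the real icook.tw page structure is not
# reproduced here. It only has two <div> children under #recipes_show.
SAMPLE_HTML = """
<html><body>
  <div id="recipes_show">
    <div><h1 itemprop="name">Sample recipe title</h1></div>
    <div>description</div>
  </div>
</body></html>
"""


def count_matches_per_step(tree, steps):
    """Return (partial_xpath, match_count) for each growing prefix of the path."""
    results = []
    for i in range(1, len(steps) + 1):
        partial = "".join(steps[:i])
        results.append((partial, len(tree.xpath(partial))))
    return results


tree = etree.HTML(SAMPLE_HTML)

# Split the failing path into steps; the first step with 0 matches is the culprit.
report = count_matches_per_step(tree, ['//*[@id="recipes_show"]', "/div[3]"])
for partial, count in report:
    print(count, partial)

# Selecting by attribute, as suggested above, avoids relying on element positions.
title = tree.xpath('//*[@itemprop="name"]')
print(title[0].text)
```

Attribute-based paths such as the @itemprop one tend to survive layout changes better than positional ones like div[3], though that depends on the site keeping the attribute.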