scrapy xpath extraction and its encoding problem

Author: stevec (steve)

Board: Python

Title: scrapy xpath extraction and its encoding problem

Time: Sat Nov 29 19:20:32 2014

I'm a bit puzzled by this and would appreciate it if someone could look at the code below. All I want is to use scrapy's xpath to extract some info.

raw_html_article_content_ holds the part I actually want to extract.
raw holds a larger range, so in theory raw should contain everything in raw_html_article_content_.
However, the text in raw differs slightly from the text in raw_html_article_content_. For example:

raw: 結婚並無Z&gt;B (this matches what Chrome shows in "view source")
raw_html_article_content_: 結婚並無Z>B

How do I make the content stored in raw match raw_html_article_content_?

ps: environment is Win 7, Python 2.7, scrapy 1.4

# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
import urllib
import urllib2

# Download the article page and wrap it in a scrapy response
address = "http://www.ptt.cc/bbs/Boy-Girl/M.1416362560.A.881.html"
response = urllib2.urlopen(address)
html = response.read()
html_response = HtmlResponse(address, body=html)
sel = Selector(html_response)

# Marker text of the footer span; everything before it is the article body
recog_assist_word = u"※ 文章網址: "
xpath = """/html/body/div[@id="main-container"]/div[@id="main-content"]/
span[@class="f2" and text()="%s"][last()]/preceding-sibling::node()""" % recog_assist_word

# node() matches both element nodes and bare text nodes
raw_html_article_content_ = sel.xpath(xpath).extract()
raw_html_article_content_ = "".join(raw_html_article_content_)

# The whole document, serialized from the <html> element
raw = sel.xpath(u"""/html""").extract()[0]

print raw_html_article_content_
print raw

--
※ Origin: 批踢踢實業坊(ptt.cc), From: 140.112.218.124
※ Article URL: http://www.ptt.cc/bbs/Python/M.1417260034.A.579.html
※ Edited: stevec (140.112.218.124), 11/29/2014 19:39:32
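A possible explanation for the mismatch, offered as a sketch rather than a confirmed diagnosis: as far as I understand, when an XPath result is an element, extract() serializes it back to markup (so special characters are escaped again), while a bare text node comes back as already-decoded text. The snippet below reproduces that difference with the standard library's xml.etree.ElementTree instead of scrapy, so it runs without a network connection. The one-line `snippet` is a made-up stand-in for the PTT page, not its real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page; the real PTT markup is more complex.
snippet = "<div>Z&gt;B</div>"
div = ET.fromstring(snippet)

# Text node: the parser has already decoded the entity.
text_node = div.text

# Element serialized back to markup: ">" gets escaped again.
serialized = ET.tostring(div).decode("ascii")

print(text_node)
print(serialized)
```

If this is indeed what happens in the script above, `/html` always goes through the serialization path, while `preceding-sibling::node()` returns a mix of elements (serialized) and text nodes (decoded).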

→ dritchie: That encoding is called an HTML entity 11/30 01:27

→ stevec: Thanks a lot! But in Python, how do I get named entities to display 11/30 11:03

→ stevec: correctly? 11/30 11:04

→ stevec: Why does scrapy sometimes fix this for me and sometimes not? 11/30 11:05

→ stevec: What's the trick here? 11/30 11:05
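On the question of displaying named entities in Python, one option is the standard library's entity decoder. This is a minimal sketch, not necessarily what the thread participants ended up using: the import fallback is there because the function lives in different modules on Python 2.7 and Python 3, and `raw_fragment` is a hypothetical sample string modeled on the example in the post.

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape          # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape

# Hypothetical fragment as it would appear in serialized markup.
raw_fragment = u"結婚並無Z&gt;B"

# Turn entities such as &gt; and &amp; back into literal characters.
decoded = unescape(raw_fragment)
print(decoded)
```

Decoding is best applied to text fragments after extraction. Running it over a whole serialized page would also decode entities that the markup needs in order to stay well-formed.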