[Question] HtmlAgilityPack parsing problem

Author: clliu168 (風)

Board: C_Sharp

Title: [Question] HtmlAgilityPack parsing problem

Time: Sun Feb 7 00:48:54 2010

I want to use HtmlAgilityPack together with XPath to scrape data from web pages, but I am not sure whether I am missing some option. Suppose my HTML file is:

    <html>
    <head>
    <title> XPath Test Page</title>
    </head>
    <body>
    <div class="content">
    <p> This is test </p>
    </div>
    </body>
    </html>

The above is a well-formed HTML file, and I can fetch the data without problems using the XPath //div[@class='content']/p. The code is roughly as follows:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.OptionAutoCloseOnEnd = false;
    // fileName is the HTML file shown above
    doc.Load(fileName, Encoding.Default);
    // xpath is //div[@class='content']/p
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

However, if the HTML is missing the </p>, so that the line becomes

    <p> This is test

then HtmlAgilityPack can no longer retrieve the "This is test" content.

HtmlAgilityPack can read non-well-formed HTML, but I need to go one step further and extract data from it with XPath. Does anyone know how to solve this? I have searched Google for a long time without finding anyone who mentions this problem. Am I using the library incorrectly?

--
My Blog: http://webapp-tech.blogspot.com/
--
※ Origin: 批踢踢實業坊 (ptt.cc)
◆ From: 140.113.0.109
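
To make the problem easier to reproduce, here is a minimal sketch that parses both variants of the page from a string and prints what each XPath actually matches. It assumes the HtmlAgilityPack assembly is referenced. The class name UnclosedParagraphRepro and the helper Dump are hypothetical names used only for this illustration, and LoadHtml is used in place of the file-based Load from the post. The last call is only a diagnostic probe: it lists every child node of the div, which should show where the parser placed the text when </p> is missing.

```csharp
using System;
using HtmlAgilityPack;

class UnclosedParagraphRepro
{
    // Hypothetical helper: parse an HTML string and print every node the XPath matches.
    static void Dump(string label, string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.OptionAutoCloseOnEnd = false; // same setting as in the post
        doc.LoadHtml(html);               // load from a string instead of a file

        Console.WriteLine("== " + label + " | " + xpath);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            // No collection came back, so there is nothing to print.
            Console.WriteLine("(no match)");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.NodeType
                + " OuterHtml=[" + node.OuterHtml + "]"
                + " InnerText=[" + node.InnerText.Trim() + "]");
        }
    }

    static void Main()
    {
        const string head =
            "<html><head><title> XPath Test Page</title></head>" +
            "<body><div class=\"content\">";
        const string tail = "</div></body></html>";

        string closed = head + "<p> This is test </p>" + tail;
        string unclosed = head + "<p> This is test" + tail;

        // The XPath from the post, run against both variants.
        Dump("closed", closed, "//div[@class='content']/p");
        Dump("unclosed", unclosed, "//div[@class='content']/p");

        // Diagnostic probe: list all children of the div, text nodes included.
        Dump("unclosed", unclosed, "//div[@class='content']/node()");
    }
}
```

Comparing the OuterHtml of the matched p element in the two runs shows whether the parser kept "This is test" inside the p or placed it elsewhere in the tree, which tells you whether the XPath or the parser settings need to change.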