Ticket #88 (new defect)

Opened 14 months ago

Last modified 14 months ago

Hpricot is unable to find a div for some pages of the same site

Reported by: scrubber
Owned by: why
Priority: major
Milestone: 0.6
Component: ext/hpricot_scan
Version:
Keywords:
Cc:

Description

Check out this code:

require 'rubygems'
require 'hpricot'
require 'open-uri'

# a working page
# doc = Hpricot(open('http://www.handango.com/PlatformProductDetail.jsp?siteId=1&osId=322&jid=5898CAFFB9E872CAA57847B6862AEX58&platformId=5&N=4294966622&R=121718&productId=121718'))

# a broken page
doc = Hpricot(open('http://www.handango.com/PlatformProductDetail.jsp?siteId=1&osId=322&jid=43BF1722D46DX2AFB2E166BE81ECE822&platformId=5&N=4294966622&R=171552&productId=171552'))

records = doc/"//div[@id='detailTabs']"

p records[0].inner_html

The output is

"\r\n"

even though the div is there and it contains a chunk of HTML.

Uncomment the first doc = ... line (and comment out the second :-)) and you will see how it should work. I see this behavior on about 15% of the pages of the same site.

Change History

Changed 14 months ago by lwu

I get the same behavior.

I don't see anything immediately wrong with the HTML in the second example when inspecting it with Firebug and the Safari web inspector. However, Hpricot is able to parse that div if it's by itself (with the rest of the file deleted) or surrounded only by divs on both sides.

It does, however, report a few bogusetags, so perhaps there's something screwy within that div itself (the table code)?
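
One way to narrow this down without Hpricot is to look for end tags in the raw page source that have no matching start tag, which is roughly what a bogus end tag is. The snippet below is only a rough sketch using the Ruby standard library. It is not Hpricot's parser, and the stray_end_tags helper and the VOID_TAGS list are names made up for this example. It uses a simple regex, so attribute values containing ">" or inline script content can confuse it.

```ruby
# Tags that never take an end tag; this list is an assumption for the sketch.
VOID_TAGS = %w[area base br col embed hr img input link meta param source track wbr].freeze

# Hypothetical helper: returns the names of end tags that have no open
# start tag before them. A heuristic only, not Hpricot's algorithm.
def stray_end_tags(html)
  stack = []
  stray = []
  # Matches start tags (<div ...>), end tags (</div>) and self-closing tags (<br />).
  # Comments and doctypes are skipped because a tag name must start with a letter.
  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>}) do |closing, name, self_closing|
    name = name.downcase
    next if VOID_TAGS.include?(name) || !self_closing.empty?

    if closing.empty?
      stack.push(name)
    elsif (idx = stack.rindex(name))
      # Close the nearest matching open tag and drop unclosed tags nested inside it.
      stack.slice!(idx..-1)
    else
      stray << name
    end
  end
  stray
end
```

Running stray_end_tags on the source of the broken page (for example, the string returned by open(url).read) and on the working page should show whether the broken one has extra end tags inside or before the detailTabs div.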
