Ticket #121 (new defect)

Opened 12 months ago

Last modified 12 months ago

Any and all tags inside of object tags become emptyelems and bogusetags

Reported by: jmhodges Owned by: why
Priority: major Milestone:
Component: ext/hpricot_scan Version:
Keywords: Cc:

Description (last modified by jmhodges) (diff)

All tags nested inside of object tags are marked as emptyelems. If the tag nested inside is a param tag, we also get a bogusetag returned. This makes it very hard to sanitize input that can contain embeddings of Flash videos and the like.

ohtml = '<object><p></p></object>'
Hpricot(ohtml) 
# => #<Hpricot::Doc {elem <object> {emptyelem <p>} </object>}>
phtml = '<object><param></param></object>'
Hpricot(phtml)
 # => #<Hpricot::Doc {elem <object> {emptyelem <param>} {bogusetag </param>} </object>}>

Change History

Changed 12 months ago by jmhodges

  • description modified (diff)

Changed 12 months ago by jmhodges

Hm.. on reflection, the former is expected behavior. Its the latter that is not. Apologies.

Changed 12 months ago by jmhodges

This appears to be a bug in the way that self-closing tags are marked as emptyelem. Tags that could be self-closing are always be represented internally as if they had not end tag, even though the end tag is right next door.

Evidence:

Hpricot::SELF_CLOSING_TAGS.map do |tag|
  h = Hpricot("<p><#{tag}></#{tag}></p>") 
  h.to_html
end
# => ["<p><base /></p>", "<p><meta /></p>", "<p><link /></p>", "<p><hr /></p>", "<p><br /></p>", "<p><param /></p>", "<p><img /></p>", "<p><area /></p>", "<p><input /></p>", "<p><col /></p>"]

Now, the question you'll ask me is why this matters. Well, in rfeedparser, I am having to iterate through traverse_all_element and remove everything not in the whitelist of acceptable elements. (You might ask why I'm not using Hpricot#scrub. Unfortunately, using scrub required running through the HTML multiple times to clean everything out with a serious performance hit.)

Because of this iteration, it would be very difficult and potentially dangerous for rfeedparser to determine if the "self-closed" tag Hpricot returned was, in fact, self-closed by checking the next tag over.

I hope to have some kind of a patch soon but maybe you'll beat me to it.

Changed 12 months ago by jmhodges

Erm, crap. I left out how this is really a problem. In traverse_all_element, we do not simply see #<Hpricot::Doc {elem <p> {emptyelem <base>} <p>}>, we see instead the vastly worse:

#<Hpricot::Doc {elem <p> {emptyelem <base>} {bogusetag </base>} </p>}>

Its that extra bogusetag at the end that is killing it. In the future, I'll be more exact with my language.

Note: See TracTickets for help on using tickets.