Hpricot Challenge

Trying to parse out some tricky HTML? Add your question to this page and we'll see if we can track down a simpler path to it.

Extracting multiple children from a table

Q: I am new to Ruby, Rails and Hpricot, and don't understand most of the XPath or CSS stuff! I _have_ managed to get Hpricot to scrape through a document to find the section that I want, but now I am stuck with a table which is in the form

<table>
  <tr>
    <td>...stuff I don't want...</td>
  </tr>
  <tr>
    <td>
       <table>
         ------------rows i want
         <tr>
           <td>
             <table>
               <tr>
                 <td>Field 1</td>
                 <td>Field 2</td>
               </tr>
             </table>
           </td>
           <td>Field 3</td>
           <td>Field 4, Field 5</td>
         </tr>
         ------------end of rows i want
       </table>
    </td>
  </tr>
</table>

and I really need to be able to have these in the form ["Field 1", "Field 2", "Field 3", "Field 4", "Field 5"] for each row [there will be many rows]. I tried telling it to remove the first child to get rid of the first <td> contents, however it seems to go through all the code and also removes the Field 4 <td>. Anybody able to help me do that please?


A: This might not be optimal, but it seems to get the job done for what you want:

(doc/"table//table//td").collect{|k| k.inner_html.split(',').map{|s| s.strip} unless k.inner_html =~ /</}.flatten.compact
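
To see what the string handling in that one-liner does, here is a sketch in plain Ruby with no Hpricot involved. The cells array is a made-up stand-in for the inner_html of each matched <td>; the extra strip removes the space left behind after each comma.

```ruby
# Hypothetical stand-ins for the inner_html of each <td> the search returns.
cells = [
  "<table><tr><td>Field 1</td><td>Field 2</td></tr></table>", # wrapper cell, still has markup
  "Field 1",
  "Field 2",
  "Field 3",
  "Field 4, Field 5"
]

fields = cells.collect { |html|
  # Cells that still contain tags yield nil; the rest are split on commas.
  html.split(',').map { |s| s.strip } unless html =~ /</
}.flatten.compact

p fields
```

Note that this gives one flat list. If each outer row has to stay separate, the same block would need to run once per row rather than once over every cell.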

Iterating over XML

Q: I have an XML product feed that has some nodes that are always named the same, and some that can be different for different products. I know how to parse nodes when I know their names. How do I parse nodes when I don't know in advance what they will be called? These "dynamic" nodes are always the children of a given node -- how do I parse dynamically just for one node?


A: Looking for a solution to this same problem, I came up with traversing the document using #containers: "Return all children of this node which can contain other nodes. This is a good way to get all HTML elements which aren't text, comment, doctype or processing instruction nodes."

doc.at(:parent_of_dynamic_nodes).containers.each do |node|
  #process node
end

Strip all HTML tags

Q: How do I strip all HTML tags from a page?

A: Use regex replace!

str=doc.to_s
print str.gsub(/<\/?[^>]*>/, "")
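
A quick way to see what that regex does and does not handle, using a plain string instead of a parsed document (the sample markup is made up). Tags are removed, but entities such as &amp; are left exactly as they are:

```ruby
# Made-up sample markup; in practice this would be doc.to_s.
str = "<p>Fish &amp; <b>chips</b></p><br />"

# Drop anything that looks like an opening, closing or self-closing tag.
stripped = str.gsub(/<\/?[^>]*>/, "")

puts stripped
```

The regex also knows nothing about comments or script bodies, so treat it as a rough tool rather than a full HTML-to-text conversion.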

Selecting only Immediate Children

Q: So, I've got an Hpricot::Elem, whose HTML looks like:

  <ul>
    <ul>
      <li>A</li>
    </ul>
    <li>B</li>
    <li>C</li>
  </ul>

How does one find only its immediate children li's (i.e. B and C, but not A)? For example, e.search("li") problematically gives me all of e's descendants, not just immediate children. I want something like e.search("./li"), but that totally doesn't work.


A: There are two possible selectors which may be used. The XPath selector would be /li. The CSS selector would be >li. Neither selector should have spaces in it. Spaces will trip up 0.5.

When you continue a search from an element, that element is treated as a root node.

Excepting the First

Q: Assume you have this HTML:

  <body>
    <div class="test">one</div>
    <div class="test">two</div>
    <div class="test">three</div> 
  </body>

I know I can select the first div element with the expression div.test:first-child, but how do I select the other two elements? I'd like to remove any divs of the test class which aren't first children.


A: This is a perfect place to use the :not operator. This operator is listed among the SupportedCssSelectors.

E:not(s) an E element that does not match simple selector s

We can negate the :first-child selector to select everything but the first child. Like this: div.test:not(:first-child).

Your removal code will end up like so:

 (doc/"div.test:not(:first-child)").remove

Searching Inner HTML

Q: Assume you have this HTML:

  <a href="http://www.somewebsite.com">Click Me!</a>

I know how to search for an element based on its attributes, but is it possible to do a search using the tag's inner_html? For example selecting all links that contain the text "Click."


A: In Hpricot 0.5, you can use the text() selector just like any attribute.

 click_links = doc.search("a[text()*='Click']")

Alternatively, in older Hpricots, you can simply scan the inner_text of selected elements. This is also handy if you want to search for a regular expression.

 click_links = doc.search("a").select { |ele| ele.inner_text =~ /Click/ }

This approach should be no faster or slower than the first search. They both must scan each node individually.

Does an Element Meet a Selector?

Q: Given:

doc = Hpricot.parse(%{<div class='outer'><div class='inner'>text</div></div>})

How can I write a matches? method such that:

doc.at('.outer').matches?('.inner') # => false
doc.at('.inner').matches?('.inner') # => true

A: On further investigation, it appears that this works:

! doc.at('.outer').search('../.inner').empty? # => false
! doc.at('.inner').search('../.inner').empty? # => true

That is easy enough to wrap up in a method, once I've worked out where to put it.

Checking for a Few Attributes

Q: Can I perform a single search and get all of the elements with "href" or "action" attributes? Something like this:

doc.search("[@href]|[@action]")

Similarly, is it possible to get all elements with both attributes present?


A: In recent Hpricots (after 2006 Mar 17), you can go ahead and use the search from the question: doc.search("[@href]|[@action]").

In earlier Hpricots, you'll need to do two searches:

 ele = doc.search("[@href]")
 ele.push *doc.search("[@action]")

As for doing a search which finds elements with both attributes, you can go ahead and stack the search in newer Hpricots:

 doc.search("[@href][@action]")

Ignoring Case

Q: Can I search for elements where the attribute has a specific value and ignore case? I guess I would like to use something similar to XPath string functions to normalize text:

doc.search("span[lower-case(@title)='yes']")

A: I've used this to remove case from my search:

Hpricot.parse(File.read(file_name).downcase)

It's not really an Hpricot-specific solution, but it worked well for my requirements.

A: I've come up with a half-assed solution that isn't exactly valid XPath, but works. It also involves editing hpricot source :( !

I've added the following into elements.rb around line 473

    filter :contains_lowercase do |arg, ignore|
       html.include? arg.downcase
    end

    filter :contains_uppercase do |arg, ignore|
       html.include? arg.upcase
    end

You can then do the following:

irb(main):010:0> doc/"strong:contains('one')"
=> #<Hpricot::Elements[{elem <strong> "this is strong one" </strong>}]>

irb(main):011:0> doc/"strong:contains('ONE')"
=> #<Hpricot::Elements[]>

irb(main):012:0> doc/"strong:contains_lowercase('ONE')"
=> #<Hpricot::Elements[{elem <strong> "this is strong one" </strong>}]>

Alternatively, you can just alter the :contains filter, but you will lose the ability to do case-sensitive searches:

    filter :contains do |arg, ignore|
       # Join the checks with ||; otherwise only the last line decides the result.
       html.include?(arg) ||
         html.include?(arg.upcase) ||
         html.include?(arg.downcase)
    end

You can then do the following:

irb(main):008:0> (doc/"strong:contains('ONE')")
=> #<Hpricot::Elements[{elem <strong> "this is strong one" </strong>}, {elem <strong class="hard"> "this is strong one" </strong>}]>

irb(main):009:0> (doc/"strong[@class='hard']:contains('ONE')")
=> #<Hpricot::Elements[{elem <strong class="hard"> "this is strong one" </strong>}]>

The same approach could be taken with gsub to implement something along the lines of a translate() function.
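
One limitation of changing the case of the argument only: mixed-case text such as "One" matches neither 'ONE' nor 'one'. A sketch of a comparison that normalizes both sides instead, in plain Ruby. The helper name is made up and is not part of Hpricot; the same expression could serve as the body of a custom filter like the ones above.

```ruby
# Hypothetical helper, not an Hpricot method: compare with both sides downcased.
def contains_ignoring_case?(html, arg)
  html.downcase.include?(arg.downcase)
end

puts contains_ignoring_case?("this is strong One", "ONE")
puts contains_ignoring_case?("this is strong One", "two")
```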

Still looking for a good way to do this...

Preceding / Following Children?

Q: So, I've got an Hpricot::Elem, whose HTML looks like:

<div>
  <A>...</A>
  ...
  <B>...</B>

  <a name='articlestart'/>

  <C>...</C>
  ...
  <D>...</D>
</div>

How do I find C to D? I suppose I somehow have to use preceding-sibling, but I can't seem to figure out how...


A: I think something like this might work, though it feels like there must be a better way:

a = doc.at('//a[@name="articlestart"]')
following = Hpricot::Elements.new
while a = a.next_sibling
  following << a
end

Follow-up to 'Preceding/Following Children' (text nodes)

Q: How do you solve the preceding question when the tags are interspersed with text nodes? For example,

<C>...</C>
Some text
<tag> </tag>
More text
Even more text
<D>...</D>

Retrieving non-text elements only?

Q: Seth wants to know, "How can I get a list of all non-text elements?"

A: Evan suggests perhaps:

doc.search("*").grep(Hpricot::Elem)

Getting the contents of a tag attribute?

Q: Say I needed to get the value of the href attribute in an <A> tag, how would I do it?

A: Use .first and then Hash syntax to get at the attributes.

doc.search('a').first[:href]

or, using the / shorthand:

(doc/:a).first[:href]

The confusing thing can be if you have some XML that only has one item in it. You still need to call .first so you're working with a single element and not an array.

doc = Hpricot.XML(open('http://feeds.feedburner.com/rubyonrailspodcast'))
item = (doc/:item).first
type = (item/:enclosure).first[:type]
# => 'audio/mpeg'

Also, if you only want to get the first element, you can use % or at instead of / or search.

doc.at('a')[:href]

or

(doc % :a)[:href]

Getting the contents of multiple tag attributes?

Q: (Merge with above post?) What if I needed to get the value of all href attributes on a page?

A:

doc.search('a[@href]').map { |x| x['href'] }

If anyone knows a more concise way please post.

Using Hpricot via a proxy?

Q: How would I use Hpricot through a proxy? Where would I setup the url and port before calling?

Hpricot(open('http://myurl'))

A: You need to tell open-uri about the proxy, not Hpricot. This works:

Hpricot(open('http://myurl', :proxy => 'http://myproxy:8080'))

Mailing list

Q: How do I subscribe to the mailing list? The email instructions on the front page of the wiki produce no response. Members only?

Selecting part of a String with Dynamic Contents

Q: How would I go about locating and removing part of a string if the contents are all different or generated dynamically? Here's the example:

<a href="out.php?id=1112&url=www.website.com"></a>
<a href="out.php?id=2232&url=www.website.com"></a>
<a href="out.php?id=3346&url=www.website.com"></a>

I would like to remove the part between

php?id= and &url=

A: Your question isn't clear. If you want to obtain the values (as strings) 1112, 2232, 3346 then

require 'cgi'
doc.search('a[@href]').map { |x| CGI.parse(x['href'][/\?.+/][1..-1])['id'].first }

should do.
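
If pulling in CGI feels heavy, a regex on the href string covers both readings of the question. This is only a sketch on plain strings copied from the question; with Hpricot the strings would come from x['href'] as above.

```ruby
# The hrefs from the question, as plain strings.
hrefs = [
  "out.php?id=1112&url=www.website.com",
  "out.php?id=2232&url=www.website.com",
  "out.php?id=3346&url=www.website.com"
]

# Reading 1: pull out the id values.
ids = hrefs.map { |href| href[/[?&]id=([^&]*)/, 1] }

# Reading 2: blank out the id value but keep the rest of the URL.
without_ids = hrefs.map { |href| href.sub(/([?&]id=)[^&]*/, '\1') }

p ids
p without_ids
```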

Selecting the text from actual node only

Q: How would I get the text 'sample text' from the example below? inner_text returns texts from all tags and not from the actual node only.

<div id="myid">
   <h4>title</h4>
   sample text
</div>

Selecting the value of an element's attribute using just XPath

Q: Using just an XPath query, is it possible to return the value of an element's attribute, such that:

<div id="foo">
    Fnord!
</div>
doc.someFuncIdontKnowYet("//div/@id")
=> "foo"

Selecting elements with no attributes

Q: How can I retrieve the elements that have no attributes associated?

<div class="foo"></div>
<div></div>
<div class="bar"></div>

I would like to select the middle div with something like

doc.search("//div[@=empty()]")

Putting embedded script in attributes

Q: How can I add embedded script, like PHP or Erb to an attribute without it being escaped?

a = Hpricot("<a href=\"http://w.w.w/\">Hello</a>")
=> #<Hpricot::Doc {elem <a href="http://w.w.w/"> "Hello" </a>}>
(a/"a").first[:href] = "<? echo 'boo' ?>"
=> "<? echo 'boo' ?>"
a.to_html
=> "<a href=\"&lt;? echo 'boo' ?&gt;\">Hello</a>"

A: Use the element's raw_attributes hash - it doesn't escape anything.

Wildcard in Attribute Search

Q: Is there a way to use the wildcard character in doc/"" or doc.search for attributes? More specifically I have a page where:

<a href="blah.com" id=p-1>Some text</a>
<a href="blah.com" id=p-2>Some text</a>
<a href="blah.com" id=p-3>Some text</a>

I am attempting to use pure xpath if at all possible, however I am willing to hear other suggestions even though I may have to rethink my design a little. There is enough other code on the page to make it difficult. You can see here:

http://blogsearch.google.com/blogsearch?q=New+York+Jets

doc/"a[@id=p-*]"

Would be the ideal statement

A: Try the following: doc/"a[@id^=p-]"

This operator matches the beginning of the attribute value. The same ^= operator appears in CSS3 attribute selectors and in the Trac query language; learn more on the TracQuery page.

Build a larger tree from several fragments

Q: What's the best way to combine several HTML fragments into one tree, without just concatenating the strings?

Suppose I have fragment 1:

<p>This is a paragraph la la la.</p>

and fragment 2:

<ul><li>This is a test.</li><li>This is only a test.</li></ul>

What's the best way to combine them into an Hpricot doc that contains:

<html>
<p>This is a paragraph la la la.</p>
<ul><li>This is a test.</li><li>This is only a test</li></ul>
</html>

without flattening to strings, concatenating them, and reparsing?

I'd like to stay in the Hpricot domain if possible. It seems to me that it's much faster to just join the trees than to round-trip through the emitter and parser, and I'm also concerned about what would happen if some of the input is bogus and produces invalid nesting.

Parsing invalid HTML

Q: What's the best way to parse invalid HTML?

I am trying to extract the body of an HTML page, http://www.c2.com/cgi/wiki?AtsUserStories, with doc_content = doc.search('html/body'). The problem is that the page doesn't have the <html> and </html> tags. That kind of problem happens to me a lot: pages that don't have </body>, or where <head> comes before <html>. I thought Hpricot already dealt with that kind of problem, but it doesn't seem to here.

So, how can I deal with that kind of problem? Thanks!

Outputting HTML instead of XHTML

Q: Hpricot seems to output XHTML instead of HTML by default. Is there a way to force HTML?

For example:

Hpricot('<br>').to_s

returns "<br />" and not "<br>" like I wanted.

Hpricot and character encoding

Q: Hpricot (or perhaps it's Ruby in general?) seems to struggle with character encoding. When using Hpricot with documents that contain "funny" characters such as `, the results are wonky. Does anyone have any advice on how to deal with this?

How to find elements' relative position?

Q: Given two elements A and B, how do I tell which comes before the other? If I could use something like the start position of the element in the HTML document, I could compare the positions and figure it out, but there's no such feature, or is there?

Warning while using :last-child

Q: Hpricot is throwing a warning, but doing what I want when I use :last-child. What gives?

Here is my code:

require 'hpricot'

foo = "<div class=\"blah\">
  <p>test</p>
  <p>go away</p>
</div>"

doc = Hpricot(foo)

(doc/"p:last-child").remove

It properly removes the last p element, but it gives the following warning:

c:/HIDDEN/gems/hpricot-0.6-x86-mswin32/lib/hpricot/elements.rb:429: warning: multiple values for a block parameter (2 for 1)

Starting from Scratch

Q: My code should output a snippet of HTML for inclusion in the middle of a document. Does Hpricot have a role? Can I use it to build my snippet in the abstract (using code that knows nothing about HTML, but enough about Hpricot), and then can I call Hpricot to output the HTML of the snippet? -- mailto: ptuklse02 att snekemail daht kahm

Getting a hold of malformed data that isn't in an element.

Q: I'm dealing with data on *HORRIBLY* designed HTML pages, using tables and presentational elements for everything. What's the best method to grab data from *after* a known element?

For example:

<tr><td class="oreinfoleft"><b class=ul>Veldspar</b><br> 
<b>Units per batch:</b> 333<br> 
<b>Volume:</b> 0.1<br> 
<b>Cargo per batch:</b> 33.33
<td class="oreinforight"><b>Minerals:</b> Tritanium 100%<br> 
<b>Variations:</b> Concentrated Veldspar (+5%), Dense Veldspar (+10%)<br> 
<b>Found in:</b> 1.0<br> 
<font class=comment>Veldspar has the best cargo/mineral rate for tritanium</font> 
</tr>

How can I get ahold of, say, the 'volume' of 0.1? It's outside any element except the TD itself - I guess I'm looking for a pseudo-selector of some sort for :test combined with a selector for :after, so I can grab whatever text is directly after a given <b> but before any later element.
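
A: One possible approach, offered as a sketch rather than a proper selector-based answer: take the inner_html of the cell as a string and use a regex to capture the text between the closing </b> of a label and the next tag. The value_after helper is made up for illustration, and the string below is a trimmed copy of the cell from the question.

```ruby
# A trimmed copy of the left-hand cell from the question.
html = <<HTML
<td class="oreinfoleft"><b class=ul>Veldspar</b><br>
<b>Units per batch:</b> 333<br>
<b>Volume:</b> 0.1<br>
<b>Cargo per batch:</b> 33.33
HTML

# Hypothetical helper: the text right after <b>label</b>, up to the next tag.
# Returns nil when the label is not found.
def value_after(html, label)
  value = html[/<b>#{Regexp.escape(label)}<\/b>\s*([^<]+)/, 1]
  value && value.strip
end

puts value_after(html, "Volume:")
puts value_after(html, "Cargo per batch:")
```

This only matches labels wrapped in a bare <b> tag, so the <b class=ul> heading would need its own pattern.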