Hpricot Searching with CSS

Part of AnHpricotShowcase.

CSS selectors are often the shortest and most readable technique for finding elements on an HTML page. Since more people are familiar with CSS than XPath, I recommend this approach.

Using CSS Selectors

When calling the search or the slash / methods, you may use CSS selectors as the search string.

 #!ruby
 doc = Hpricot(open("qwantz.html"))
 (doc/'div img[@src^="http://www.qwantz.com/comics/"]')
   #=> Elements[...]

For a complete list of selectors, see SupportedCssSelectors.

Selecting by ID

The quickest way to find a specific element is to search by ID. If an element is defined as <div id='menu'>, you can find it with the selector #menu.

 #!ruby
 puts doc.search('#menu').inner_html

Selecting by Tag Name

Another common search is to find all the elements with a given tag. The CSS selector for this is just the plain tag name.

To get a count of all span tags:

 #!ruby
 puts doc.search("span").length

A shortcut for this is to use a symbol :span as the search term, which will be converted to a string.

 #!ruby
 puts((doc/:span).length)

Selecting by Class

Search for elements with a certain class by placing a dot before the class name.

 #!ruby
 doc.search(".entryTitle").each do |title|
   puts title.inner_html
 end

Often the search will run faster if you add the tag name.

 #!ruby
 (doc/"div.entryTitle").remove

A similar XPath would be //div[@class='entryTitle']. Usually, though, the CSS selector is the better choice. If an element has more than one class, the CSS selector still matches it, whereas the XPath compares the whole class attribute and so matches only when entryTitle is the element's sole class.

So <div class="entryTitle dark"> is a match for doc.search("div.entryTitle"). You can also match a class with the ~= attribute operator, which looks for a whole word within a space-separated attribute value:

 #!ruby
 (doc/"div[@class~='entryTitle']").remove
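
The difference between the two comparisons can be modelled in plain Ruby. This is only a sketch of the matching rules, not Hpricot's own code, and the helper names exact_match? and word_match? are made up for illustration.

```ruby
# Plain-Ruby model of two ways to test a class attribute.
# Both helpers are hypothetical; Hpricot does not expose them.

# XPath-style @class='entryTitle': the whole attribute value must be equal.
def exact_match?(attr_value, wanted)
  attr_value == wanted
end

# CSS-style .entryTitle or [@class~='entryTitle']: one of the
# space-separated words in the value must be equal.
def word_match?(attr_value, wanted)
  attr_value.split.include?(wanted)
end

class_attr = "entryTitle dark"
puts exact_match?(class_attr, "entryTitle")  # false
puts word_match?(class_attr, "entryTitle")   # true
```

Note that the word comparison does not match partial words, so a class of entryTitleBig would not count as entryTitle.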

Selecting by Hierarchy

If you'd like to narrow your search for a certain tag, it often helps to identify its parents. By separating CSS selectors with a space, you can search deeper into the document.

 #!ruby
 (doc/"div.entryPermalink a").empty

That bit of code finds every link inside divs of the entryPermalink class and empties each one, removing any HTML inside the link. The links can sit at any depth inside the div: its children, the children of its children, and so on.

Stacking CSS selector calls also has the same effect:

 #!ruby
 (doc/"div.entryPermalink"/"a").empty

Selecting Close Children

If you want to limit your search to just the children of an element, use the > combinator.

 #!ruby
 doc.search("div.entryPermalink > a").
   prepend("<b>found you on the left</b>").
   append("<b>found you on the right</b>")

This code searches for all links which are immediate children of entryPermalink classed divs. It then adds some HTML inside each link, to the beginning and end of its inner_html.
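
To make the contrast between the space and > forms concrete, here is a plain-Ruby sketch over a tiny hand-built tree. It models the two rules only; it is not how Hpricot walks a document, and the hash layout and the helper names children_named and descendants_named are assumptions made for the example.

```ruby
# Each node is a hash with a tag name and a list of child nodes.
# This layout is invented for the example, not Hpricot's data model.

# Models "div > a": only immediate children are considered.
def children_named(node, name)
  node[:children].select { |child| child[:name] == name }
end

# Models "div a": children, grandchildren and so on are considered.
def descendants_named(node, name)
  node[:children].flat_map do |child|
    found = child[:name] == name ? [child] : []
    found + descendants_named(child, name)
  end
end

# Equivalent to <div><a></a><span><a></a></span></div>
div = {
  name: "div",
  children: [
    { name: "a", children: [] },
    { name: "span", children: [{ name: "a", children: [] }] }
  ]
}

puts children_named(div, "a").length     # 1
puts descendants_named(div, "a").length  # 2
```

The link nested inside the span is found by the descendant search but skipped by the child search.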

Searching Attributes

Most people figure that CSS selectors aren't as comprehensive as XPath functions when it comes to searching attributes. But that's just not so. There's quite a palette of ways to search attributes.

For example, you can search for all elements with an attribute. To search for all form fields that have a checked property:

 #!ruby
 doc.search("input[@checked]")

If you want to find elements whose attribute is set to a specific value, use the = operator. Let's search for an anchor named 'part_two' (at returns only the first match):

 #!ruby
 doc.at("a[@name='part_two']")

Another common search is to find an attribute containing a bit of search text. For this, use the *= operator. So, to find all elements with onclick handlers which reference document.location:

 #!ruby
 doc.search("*[@onclick*='document.location']").each do |ele|
   ele.remove_attribute('onclick')
 end

Other attribute operators are listed among the SupportedCssSelectors.
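
As a rough guide to what the operators used on this page test for, the sketch below models =, ^= and *= as ordinary string comparisons in plain Ruby. It is an illustration under that assumption, not Hpricot's implementation, and attribute_matches? is a hypothetical helper.

```ruby
# Plain-Ruby model of three attribute operators; not Hpricot's own code.
# attribute_matches? is a hypothetical helper used only for illustration.
def attribute_matches?(value, operator, text)
  return false if value.nil?              # the element lacks the attribute
  case operator
  when "="  then value == text            # whole value must be equal
  when "^=" then value.start_with?(text)  # value begins with the text
  when "*=" then value.include?(text)     # text appears anywhere in the value
  else raise ArgumentError, "unsupported operator: #{operator}"
  end
end

onclick = "document.location = '/next'"
puts attribute_matches?(onclick, "*=", "document.location")  # true
puts attribute_matches?(onclick, "=", "document.location")   # false
puts attribute_matches?("part_two", "=", "part_two")         # true
```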

Negating Searches

If you are having a difficult time tracking down a certain element, it may help to use the :not operator to narrow your search. So, to find paragraphs other than those in the blue class:

 #!ruby
 doc.search("p:not(.blue)")

Return to AnHpricotShowcase.