Hpricot Searching with CSS
Part of AnHpricotShowcase.
CSS selectors are often the shortest and most readable technique for finding elements on an HTML page. Since more people are familiar with CSS than XPath, I recommend this approach.
Using CSS Selectors
When calling the search method or its slash (/) alias, you may pass a CSS selector as the search string.
#!ruby
require 'hpricot'

doc = Hpricot(open("qwantz.html"))
# all img elements inside a div whose src starts with the comics URL
(doc/'div img[@src^="http://www.qwantz.com/comics/"]')
#=> Elements[...]
For a complete list of selectors, see SupportedCssSelectors.
Selecting by ID
The quickest way to find a specific element is to search by ID. If an element is defined as <div id='menu'>, you can find it with the selector #menu.
#!ruby
puts doc.search('#menu').inner_html
Selecting by Tag Name
Another common search is to find all the elements with a given tag. The CSS selector for this is just the plain tag name.
To get a count of all span tags:
#!ruby
puts doc.search("span").length
A shortcut for this is to use a symbol :span as the search term, which will be converted to a string.
#!ruby
puts (doc/:span).length
Selecting by Class
Search for elements with a certain class by placing a dot before the class name.
#!ruby
doc.search(".entryTitle").each do |title|
  puts title.inner_html
end
The search is often faster if you also give the tag name. This example removes every div of the entryTitle class from the document:
#!ruby
(doc/"div.entryTitle").remove
A similar XPath would be //div[@class='entryTitle']. The CSS selector is usually the better choice, though. The XPath compares the whole class attribute against the string 'entryTitle', so it only matches elements with exactly that one class. The CSS selector still matches when the element has more than one class.
So <div class="entryTitle dark"> is a match for doc.search("div.entryTitle"). You can also search for a class like this:
#!ruby
(doc/"div[@class~='entryTitle']").remove
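The difference between the XPath comparison and the class selector can be sketched in plain Ruby. This is only an illustration of how the two tests are generally defined, not Hpricot's actual implementation, and the helper names exact_match? and has_class? are made up for this example.

```ruby
# Illustration only: these helpers are hypothetical, not part of Hpricot.

# What [@class='entryTitle'] tests: the whole attribute value must be equal.
def exact_match?(class_attr, name)
  class_attr == name
end

# What .entryTitle and [@class~='entryTitle'] test: the name must be one of
# the whitespace-separated words in the attribute value.
def has_class?(class_attr, name)
  class_attr.to_s.split.include?(name)
end

single = "entryTitle"
double = "entryTitle dark"

puts exact_match?(single, "entryTitle")         # true
puts exact_match?(double, "entryTitle")         # false
puts has_class?(single, "entryTitle")           # true
puts has_class?(double, "entryTitle")           # true
puts has_class?("entryTitleBar", "entryTitle")  # false
```

The last line shows that the word test is not a substring test: a class that merely contains the text entryTitle does not count.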
Selecting by Hierarchy
If you'd like to narrow your search for a certain tag, it often helps to identify its parents. By separating CSS selectors with a space, you can search deeper into the document.
#!ruby
(doc/"div.entryPermalink a").empty
That bit of code finds all links anywhere inside divs of the entryPermalink class and empties each one, removing any HTML inside the link. The links can be at any depth inside the div: its children, their children, and so on.
Chaining searches has the same effect:
#!ruby
(doc/"div.entryPermalink"/"a").empty
Selecting Close Children
If you want to limit your search to just the immediate children of an element, put a > between the two selectors.
#!ruby
doc.search("div.entryPermalink > a").
  prepend("<b>found you on the left</b>").
  append("<b>found you on the right</b>")
This code searches for all links which are immediate children of entryPermalink classed divs. It then adds some HTML inside each link, to the beginning and end of its inner_html.
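The difference between the space and the > can be sketched with a toy tree in plain Ruby. This is an illustration only: the Node struct and its descendants method are invented for this example and are not Hpricot's node classes.

```ruby
# Illustration only: a toy tree, not Hpricot's node classes.
Node = Struct.new(:name, :children) do
  # Every node below this one, at any depth (what "div a" walks).
  def descendants
    children.flat_map { |child| [child] + child.descendants }
  end
end

# Models <div><a></a><p><a></a></p></div>
direct_link = Node.new("a", [])
nested_link = Node.new("a", [])
div = Node.new("div", [direct_link, Node.new("p", [nested_link])])

any_depth = div.descendants.select { |n| n.name == "a" }  # like "div a"
immediate = div.children.select { |n| n.name == "a" }     # like "div > a"

puts any_depth.length  # 2
puts immediate.length  # 1
```

The link wrapped in the paragraph is found by the any-depth search but skipped by the immediate-children search.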
Searching Attributes
Many people assume that CSS selectors are less capable than XPath functions when it comes to searching attributes. That's just not so. There's quite a palette of ways to search attributes.
For example, you can search for all elements that have a given attribute, whatever its value. To search for all form fields that have a checked attribute:
#!ruby
doc.search("input[@checked]")
If you want to find elements whose attribute is set to a specific value, use the = operator. Let's look up the first anchor named 'part_two':
#!ruby
doc.at("a[@name='part_two']")
Another common search is to find an attribute containing a bit of search text. For this, use the *= operator. So, to find all elements with onclick handlers which reference document.location, and strip those handlers:
#!ruby
doc.search("*[@onclick*='document.location']").each do |ele|
  ele.remove_attribute('onclick')
end
Other attribute operators are listed among the SupportedCssSelectors.
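As a rough guide to what the attribute forms used on this page test, here is a plain-Ruby sketch. It is an illustration only and not Hpricot's implementation: attr_match? is a hypothetical helper, each element is modelled as a simple Hash of attribute names to values, and the sample values are invented.

```ruby
# Illustration only: a hypothetical helper, not Hpricot's implementation.
# attrs is a Hash of attribute names to values for one element.
def attr_match?(attrs, name, op = nil, value = nil)
  return false unless attrs.key?(name)      # every form needs the attribute
  actual = attrs[name].to_s
  case op
  when nil  then true                       # [@name]          present at all
  when "="  then actual == value            # [@name='value']  exact value
  when "^=" then actual.start_with?(value)  # [@name^='value'] begins with
  when "*=" then actual.include?(value)     # [@name*='value'] contains
  else raise ArgumentError, "operator not covered by this sketch: #{op}"
  end
end

# Made-up sample elements
checkbox = { "type" => "checkbox", "checked" => "checked" }
anchor   = { "name" => "part_two" }
button   = { "onclick" => "document.location = '/home'" }
image    = { "src" => "http://www.qwantz.com/comics/example.png" }

puts attr_match?(checkbox, "checked")                                  # true
puts attr_match?(anchor, "name", "=", "part_two")                      # true
puts attr_match?(button, "onclick", "*=", "document.location")         # true
puts attr_match?(image, "src", "^=", "http://www.qwantz.com/comics/")  # true
puts attr_match?(anchor, "checked")                                    # false
```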
Negating Searches
If you are having a difficult time tracking down a certain element, it may help to use :not to exclude elements from your search. So, to find paragraphs other than those in the blue class:
#!ruby
doc.search("p:not(.blue)")
Return to AnHpricotShowcase.
