= Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
(like some JavaScript libs -- jQuery, Prototype -- give you). The XPath and CSS
parser, in fact, is based on John Resig's jQuery.

Also, Hpricot can be handy for reading broken XML files, since many of the same
techniques can be used. If a quote is missing, Hpricot tries to figure it out.
If tags overlap, Hpricot works on sorting them out. You know, that sort of
thing.

*Please read this entire document* before making assumptions about how this
software works.

== An Overview

Let's clear up what Hpricot is.

# Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
# While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
  pays a small penalty in order to get that right. So that's slightly more important
  to me than speed.
# *If you can see it in Firefox, then Hpricot should parse it.* That's
  how it should be! Let me know the minute it's otherwise.
# Primarily, Hpricot is used for reading HTML and tries to sort out troubled
  HTML by having some idea of what good HTML is. Some people still like to use
  Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!

== The Hpricot Kingdom

First, here are all the links you need to know:

* http://code.whytheluckystiff.net/hpricot is the Hpricot wiki and bug tracker.
  Go there for news and recipes and patches. It's the center of activity.
* http://code.whytheluckystiff.net/svn/hpricot/trunk is the main Subversion
  repository for Hpricot. You can get the latest code there.
* http://code.whytheluckystiff.net/doc/hpricot is the home for the latest copy of
  this reference.
* See COPYING for the terms of this software. (Spoiler: it's absolutely free.)

If you have any trouble, don't hesitate to contact the author. As always, I'm
not going to say "Use at your own risk" because I don't want this library to be
risky. If you trip on something, I'll share the liability by repairing things
as quickly as I can. Your responsibility is to report the inadequacies.

== Installing Hpricot

You may get the latest stable version from Rubyforge. Win32 binaries and source
gems are available.

  $ gem install hpricot

As Hpricot is still under active development, you can also try the most recent
candidate build here:

  $ gem install hpricot --source http://code.whytheluckystiff.net

The development gem is usually in pretty good shape actually. You can also
get the bleeding edge code or plain Ruby tarballs on the wiki.

== An Hpricot Showcase

We're going to run through a big pile of examples to get you jump-started.
Many of these examples are also found at
http://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics, in case you
want to add some of your own.

=== Loading Hpricot Itself

You have probably got the gem, right? To load Hpricot:

  require 'rubygems'
  require 'hpricot'

If you've installed the plain source distribution, go ahead and just:

  require 'hpricot'

=== Load an HTML Page

The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
contents into a document object.

  doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

  doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:

  require 'open-uri'
  # On Ruby 3.0 and later, call URI.open here; plain open no longer fetches URLs.
  doc = open("http://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once. However,
the parsed document object will be present in memory, in its entirety.

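To picture what reading through a buffer looks like, here's a little sketch in
plain Ruby. It is not Hpricot's real scanner (that part is written in C); the
<tt>each_chunk</tt> helper and the 16-byte buffer size are made up for
illustration. The point is only that the reader pulls a small piece of the IO
at a time instead of slurping the whole thing into one string.

```ruby
require 'stringio'

# Hypothetical helper, not part of Hpricot: hand the caller one small
# buffer at a time. IO#read(size) returns nil once the stream is used up.
def each_chunk(io, size = 16)
  while (chunk = io.read(size))
    yield chunk
  end
end

source = "<p>A simple <b>test</b> string.</p>"
io = StringIO.new(source)

chunks = []
each_chunk(io) { |chunk| chunks << chunk }

puts chunks.length   # how many buffers it took
puts chunks.join     # glued back together, it's the original markup
```

A parser built this way can chew on each piece as it arrives, so the raw
source never has to sit in memory whole. The finished document tree still
does, as noted above.
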
=== Search for Elements

Use <tt>doc.search</tt>:

  doc.search("//p[@class='posted']")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>

<tt>doc.search</tt> can take an XPath or CSS expression. In the above example,
all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
attribute of <tt>"posted"</tt>.

A shortcut is to use the division operator, <tt>/</tt>:

  (doc/"p.posted")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>

=== Finding Just One Element

If you're looking for a single element, the <tt>at</tt> method will return the
first element matched by the expression. In this case, you'll get back the
element itself rather than the <tt>Hpricot::Elements</tt> array.

  doc.at("body")['onload']

The above code will find the body tag and give you back the <tt>onload</tt>
attribute. This is the most common reason to use the element directly: when
reading and writing HTML attributes.

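Here's a small sketch of both directions, reading and then writing. It assumes
the hpricot gem is installed (the guard just skips the demo if it isn't), and
the markup and the <tt>init()</tt> and <tt>boot()</tt> handler names are
invented for the example.

```ruby
begin
  require 'rubygems'
  require 'hpricot'
rescue LoadError
  # Assumption: hpricot is available. Without it there is nothing to demo.
  puts "hpricot isn't installed (try: gem install hpricot)"
end

if defined?(Hpricot)
  doc = Hpricot('<html><body onload="init()"><p>Hi</p></body></html>')

  body = doc.at("body")        # a single element, not an Elements array
  before = body['onload']      # read the attribute
  body['onload'] = "boot()"    # write a new value
  after = doc.at("body")['onload']

  puts before
  puts after
end
```
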
=== Fetching the Contents of an Element

Just as with browser scripting, the <tt>inner_html</tt> property can be used to
get the inner contents of an element.

  (doc/"#elementID").inner_html
  #=> "..<b>contents</b>.."

If your expression matches more than one element, you'll get back the contents
of <em>all the matched elements</em>. So you may want to use <tt>first</tt> to be
sure you get back only one.

  (doc/"#elementID").first.inner_html
  #=> "..<b>contents</b>.."

=== Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use
<tt>to_html</tt>:

  (doc/"#elementID").to_html
  #=> "<div id='elementID'>...</div>"

=== Looping

All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
through them like you would an array.

  (doc/"p/a/img").each do |img|
    puts img.attributes['class']
  end

=== Continuing Searches

Searches can be continued from a collection of elements, in order to search deeper.

  # find all paragraphs.
  elements = doc.search("/html/body//p")
  # continue the search by finding any images within those paragraphs.
  (elements/"img")
  #=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

  require 'pp'  # older Rubies need this before calling pp

  # find all images within paragraphs.
  doc.search("/html/body//p").each do |para|
    puts "== Found a paragraph =="
    pp para

    imgs = para.search("img")
    if imgs.any?
      puts "== Found #{imgs.length} images inside =="
    end
  end

Of course, the most succinct ways to do the above are using CSS or XPath.

  # the xpath version
  (doc/"/html/body//p//img")
  # the css version
  (doc/"html > body > p img")
  # ..or symbols work, too!
  (doc/:html/:body/:p/:img)

=== Looping Edits

You may certainly edit objects from within your search loops. Then, when you
spit out the HTML, the altered elements will show.

  (doc/"span.entryPermalink").each do |span|
    span.attributes['class'] = 'newLinks'
  end
  puts doc

This changes all <tt>span.entryPermalink</tt> elements to
<tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
of doing this, such as the <tt>set</tt> method:

  (doc/"span.entryPermalink").set(:class => 'newLinks')

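To see the two approaches side by side, here's a sketch that runs both against
the same markup. It assumes the hpricot gem is installed (the guard skips the
demo otherwise); the sample HTML and the <tt>looped</tt> and <tt>batched</tt>
variable names are invented for the example.

```ruby
begin
  require 'rubygems'
  require 'hpricot'
rescue LoadError
  # Assumption: hpricot is available. Without it there is nothing to demo.
  puts "hpricot isn't installed (try: gem install hpricot)"
end

if defined?(Hpricot)
  html = '<div><span class="entryPermalink">one</span> ' \
         '<span class="entryPermalink">two</span></div>'

  # Version one: edit each element inside the loop.
  looped = Hpricot(html)
  (looped/"span.entryPermalink").each do |span|
    span.attributes['class'] = 'newLinks'
  end

  # Version two: let set do the looping.
  batched = Hpricot(html)
  (batched/"span.entryPermalink").set(:class => 'newLinks')

  # Count the spans that now answer to the new class in each document.
  looped_count  = (looped/"span.newLinks").length
  batched_count = (batched/"span.newLinks").length
  puts looped_count
  puts batched_count
end
```
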
=== Figuring Out Paths

Every element can tell you its unique path (either XPath or CSS) to get to the
element from the root tag.

The <tt>css_path</tt> method:

  doc.at("div > div:nth(1)").css_path
  #=> "div > div:nth(1)"
  doc.at("#header").css_path
  #=> "#header"

Or, the <tt>xpath</tt> method:

  doc.at("div > div:nth(1)").xpath
  #=> "/div/div:eq(1)"
  doc.at("#header").xpath
  #=> "//div[@id='header']"

== Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.

== :fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is. So Hpricot's default behavior is to keep things flexible:
it makes sure to open and close all the tags, but ignores any validation problems.

As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.

  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }

This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow
the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's
going to move the paragraph below the link, or up and out of other elements
where paragraphs don't belong.

If an unknown element is found, <tt>:fixup_tags</tt> ignores it.

== :xhtml_strict

So, let's go beyond just trying to fix the hierarchy. The
<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
1.0 Strict document, even at the cost of removing elements that get in the way.

  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }

What measures does <tt>:xhtml_strict</tt> take?

1. Shift elements into their proper containers just like <tt>:fixup_tags</tt>.
2. Remove unknown elements.
3. Remove unknown attributes.
4. Remove illegal content.
5. Alter the doctype to XHTML 1.0 Strict.

== Hpricot.XML()

The last option is the <tt>:xml</tt> option, which makes some slight variations
on the standard mode. The main difference is that <tt>:xml</tt> mode won't try to
output tags which are friendlier for browsers. For example, if an opening and closing
<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.

The primary way to use Hpricot's XML mode is to call the <tt>Hpricot.XML</tt> method:

  doc = open("http://redhanded.hobix.com/index.xml") do |f|
    Hpricot.XML(f)
  end

*Also, :fixup_tags is canceled out by the :xml option.* This is because
:fixup_tags makes assumptions based on how HTML is structured, specifically how
tags are defined in the XHTML 1.0 DTD.
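
If you don't have a feed handy, XML mode works on a plain string too. This is
a minimal sketch, assuming the hpricot gem is installed (the guard skips the
demo otherwise); the <tt>feed</tt>, <tt>entry</tt> and <tt>title</tt> tag names
are invented for the example.

```ruby
begin
  require 'rubygems'
  require 'hpricot'
rescue LoadError
  # Assumption: hpricot is available. Without it there is nothing to demo.
  puts "hpricot isn't installed (try: gem install hpricot)"
end

if defined?(Hpricot)
  xml = "<feed><entry><title>One</title></entry>" \
        "<entry><title>Two</title></entry></feed>"

  doc = Hpricot.XML(xml)

  # Searching works the same as in HTML mode: find the entries, then
  # continue the search for the titles inside them.
  titles = doc.search("entry").search("title").map { |title| title.inner_html }

  puts titles.inspect
end
```
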