Changeset 104
Timestamp: 01/31/2007 13:31:14 (22 months ago)
Files: 1 modified
trunk/README (modified) (15 diffs)
trunk/README
--- trunk/README (r103)
+++ trunk/README (r104)
@@ -2,14 +2,15 @@
 
 Hpricot is a fast, flexible HTML parser written in C.  It's designed to be very
-accommodating (like Tanaka Akira's HTree) and to have a very helpful library (like
-some JavaScript libs -- JQuery, Prototype -- give you.)  The XPath and CSS parser,
-in fact, is based on John Resig's JQuery.
+accommodating (like Tanaka Akira's HTree) and to have a very helpful library
+(like some JavaScript libs -- JQuery, Prototype -- give you.)  The XPath and CSS
+parser, in fact, is based on John Resig's JQuery.
 
 Also, Hpricot can be handy for reading broken XML files, since many of the same
-techniques can be used.  If a quote is missing, Hpricot tries to figure it out.  If
-tags overlap, Hpricot works on sorting them out.  You know, that sort of thing.
-
-*Please read this entire document* before making assumptions about how this software
-works.
+techniques can be used.  If a quote is missing, Hpricot tries to figure it out.
+If tags overlap, Hpricot works on sorting them out.  You know, that sort of
+thing.
+
+*Please read this entire document* before making assumptions about how this
+software works.
 
 == An Overview
@@ -39,9 +40,8 @@
 * See COPYING for the terms of this software. (Spoiler: it's absolutely free.)
 
-If you have any trouble, don't hesitate to contact the author.  As always, I'm not
-going to say "Use at your own risk" because I don't want this library to
-be risky.  If you trip on something, I'll share the liability by
-repairing things as quickly as I can.  Your responsibility is to report
-the inadequacies.
+If you have any trouble, don't hesitate to contact the author.  As always, I'm
+not going to say "Use at your own risk" because I don't want this library to be
+risky.  If you trip on something, I'll share the liability by repairing things
+as quickly as I can.  Your responsibility is to report the inadequacies.
 
 == Installing Hpricot
@@ -80,5 +80,6 @@
 === Load an HTML Page
 
-The <tt>Hpricot()</tt> method takes a string or any IO object and loads the contents into a document object.
+The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
+contents into a document object.
 
  doc = Hpricot("<p>A simple <b>test</b> string.</p>")
@@ -93,7 +94,7 @@
  doc = open("http://qwantz.com/") { |f| Hpricot(f) }
 
-Hpricot uses an internal buffer to parse the file, so the IO will stream properly and large documents won't be
-loaded into memory all at once.  However, the parsed document object will be present in memory, in its
-entirety.
+Hpricot uses an internal buffer to parse the file, so the IO will stream
+properly and large documents won't be loaded into memory all at once.  However,
+the parsed document object will be present in memory, in its entirety.
 
 === Search for Elements
@@ -104,5 +105,7 @@
  #=> #<Hpricot:Elements[{p ...}, {p ...}]>
 
-<tt>Doc.search</tt> can take an XPath or CSS expression.  In the above example, all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt> attribute of <tt>"posted"</tt>.
+<tt>Doc.search</tt> can take an XPath or CSS expression.  In the above example,
+all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
+attribute of <tt>"posted"</tt>.
 
 A shortcut is to use the divisor:
@@ -113,18 +116,25 @@
 === Finding Just One Element
 
-If you're looking for a single element, the <tt>at</tt> method will return the first element matched by the expression.  In this case, you'll get back the element itself rather than the <tt>Hpricot::Elements</tt> array.
+If you're looking for a single element, the <tt>at</tt> method will return the
+first element matched by the expression.  In this case, you'll get back the
+element itself rather than the <tt>Hpricot::Elements</tt> array.
 
  doc.at("body")['onload']
 
-The above code will find the body tag and give you back the <tt>onload</tt> attribute.  This is the most common reason to use the element directly: when reading and writing HTML attributes.
+The above code will find the body tag and give you back the <tt>onload</tt>
+attribute.  This is the most common reason to use the element directly: when
+reading and writing HTML attributes.
 
 === Fetching the Contents of an Element
 
-Just as with browser scripting, the <tt>inner_html</tt> property can be used to get the inner contents of an element.
+Just as with browser scripting, the <tt>inner_html</tt> property can be used to
+get the inner contents of an element.
 
  (doc/"#elementID").inner_html
  #=> "..<b>contents</b>.."
 
-If your expression matches more than one element, you'll get back the contents of ''all the matched elements''.  So you may want to use <tt>first</tt> to be sure you get back only one.
+If your expression matches more than one element, you'll get back the contents
+of ''all the matched elements''.  So you may want to use <tt>first</tt> to be
+sure you get back only one.
 
  (doc/"#elementID").first.inner_html
@@ -133,5 +143,6 @@
 === Fetching the HTML for an Element
 
-If you want the HTML for the whole element (not just the contents), use <tt>to_html</tt>:
+If you want the HTML for the whole element (not just the contents), use
+<tt>to_html</tt>:
 
  (doc/"#elementID").to_html
@@ -140,5 +151,6 @@
 === Looping
 
-All searches return a set of <tt>Hpricot::Elements</tt>.  Go ahead and loop through them like you would an array.
+All searches return a set of <tt>Hpricot::Elements</tt>.  Go ahead and loop
+through them like you would an array.
 
  (doc/"p/a/img").each do |img|
@@ -180,5 +192,6 @@
 === Looping Edits
 
-You may certainly edit objects from within your search loops.  Then, when you spit out the HTML, the altered elements will show.
+You may certainly edit objects from within your search loops.  Then, when you
+spit out the HTML, the altered elements will show.
 
  (doc/"span.entryPermalink").each do |span|
@@ -187,5 +200,7 @@
  puts doc
 
-This changes all <tt>span.entryPermalink</tt> elements to <tt>span.newLinks</tt>.  Keep in mind that there are often more convenient ways of doing this.  Such as the <tt>set</tt> method:
+This changes all <tt>span.entryPermalink</tt> elements to
+<tt>span.newLinks</tt>.  Keep in mind that there are often more convenient ways
+of doing this.  Such as the <tt>set</tt> method:
 
  (doc/"span.entryPermalink").set(:class => 'newLinks')
@@ -193,5 +208,6 @@
 === Figuring Out Paths
 
-Every element can tell you its unique path (either XPath or CSS) to get to the element from the root tag.
+Every element can tell you its unique path (either XPath or CSS) to get to the
+element from the root tag.
 
 The <tt>css_path</tt> method:
@@ -211,18 +227,22 @@
 == Hpricot Fixups
 
-When loading HTML documents, you have a few settings that can make Hpricot more or less intense about how it gets
-involved.
+When loading HTML documents, you have a few settings that can make Hpricot more
+or less intense about how it gets involved.
 
 == :fixup_tags
 
-Really, there are so many ways to clean up HTML and your intentions may be to keep the HTML as-is.  So Hpricot's
-default behavior is to keep things flexible.  Making sure to open and close all the tags, but ignore any validation problems.
-
-As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt to shift the document's tags to meet XHTML 1.0 Strict.
+Really, there are so many ways to clean up HTML and your intentions may be to
+keep the HTML as-is.  So Hpricot's default behavior is to keep things flexible.
+Making sure to open and close all the tags, but ignore any validation problems.
+
+As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
+to shift the document's tags to meet XHTML 1.0 Strict.
 
  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
 
-This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow the rules a bit better.  Like: say Hpricot finds
-a paragraph in a link, it's going to move the paragraph below the link.  Or up and out of other elements where paragraphs don't belong.
+This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow
+the rules a bit better.  Like: say Hpricot finds a paragraph in a link, it's
+going to move the paragraph below the link.  Or up and out of other elements
+where paragraphs don't belong.
 
 If an unknown element is found, it is ignored.  Again, <tt>:fixup_tags</tt>.
@@ -230,6 +250,7 @@
 == :xhtml_strict
 
-So, let's go beyond just trying to fix the hierarchy.  The <tt>:xhtml_strict</tt> option really tries to force the document to be an
-XHTML 1.0 Strict document.  Even at the cost of removing elements that get in the way.
+So, let's go beyond just trying to fix the hierarchy.  The
+<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
+1.0 Strict document.  Even at the cost of removing elements that get in the way.
 
  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
@@ -245,7 +266,8 @@
 == Hpricot.XML()
 
-The last option is the <tt>:xml</tt> option, which makes some slight variations on the standard mode.  The main difference is
-that :xml mode won't try to output tags which are friendlier for browsers.  For example, if an opening and closing <tt>br</tt>
-tag is found, XML mode won't try to turn that into an empty element.
+The last option is the <tt>:xml</tt> option, which makes some slight variations
+on the standard mode.  The main difference is that :xml mode won't try to output
+tags which are friendlier for browsers.  For example, if an opening and closing
+<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.
 
 The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
@@ -255,4 +277,5 @@
  end
 
-*Also, :fixup_tags is canceled out by the :xml option.*  This is because :fixup_tags makes assumptions based how HTML is
-structured.  Specifically, how tags are defined in the XHTML 1.0 DTD.
+*Also, :fixup_tags is canceled out by the :xml option.*  This is because
+:fixup_tags makes assumptions based how HTML is structured.  Specifically, how
+tags are defined in the XHTML 1.0 DTD.
