= Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C.  It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
(like the ones some JavaScript libs, such as jQuery and Prototype, give you).
The XPath and CSS parser, in fact, is based on John Resig's jQuery.

Also, Hpricot can be handy for reading broken XML files, since many of the same
techniques can be used.  If a quote is missing, Hpricot tries to figure it out.
If tags overlap, Hpricot works on sorting them out.  You know, that sort of
thing.

*Please read this entire document* before making assumptions about how this
software works.

== An Overview

Let's clear up what Hpricot is.

# Hpricot is *a standalone library*.  It requires no other libraries.  Just Ruby!
# While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
  pays a small penalty in order to get that right.  So that's slightly more important
  to me than speed.
# *If you can see it in Firefox, then Hpricot should parse it.*  That's
  how it should be!  Let me know the minute it's otherwise.
# Primarily, Hpricot is used for reading HTML and tries to sort out troubled
  HTML by having some idea of what good HTML is.  Some people still like to use
  Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!

== The Hpricot Kingdom

First, here are all the links you need to know:

* http://code.whytheluckystiff.net/hpricot is the Hpricot wiki and bug tracker.
  Go there for news and recipes and patches.  It's the center of activity.
* http://code.whytheluckystiff.net/svn/hpricot/trunk is the main Subversion
  repository for Hpricot.  You can get the latest code there.
* http://code.whytheluckystiff.net/doc/hpricot is the home for the latest copy of
  this reference.
* See COPYING for the terms of this software. (Spoiler: it's absolutely free.)

If you have any trouble, don't hesitate to contact the author.  As always, I'm
not going to say "Use at your own risk" because I don't want this library to be
risky.  If you trip on something, I'll share the liability by repairing things
as quickly as I can.  Your responsibility is to report the inadequacies.

== Installing Hpricot

You may get the latest stable version from RubyForge.  Win32 binaries and source
gems are available.

  $ gem install hpricot

As Hpricot is still under active development, you can also try the most recent
candidate build here:

  $ gem install hpricot --source http://code.whytheluckystiff.net

The development gem is usually in pretty good shape, actually.  You can also
get the bleeding edge code or plain Ruby tarballs on the wiki.

== An Hpricot Showcase

We're going to run through a big pile of examples to get you jump-started.
Many of these examples are also found at
http://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics, in case you
want to add some of your own.

=== Loading Hpricot Itself

You have probably got the gem, right?  To load Hpricot:

 require 'rubygems'
 require 'hpricot'

If you've installed the plain source distribution, go ahead and just:

 require 'hpricot'

=== Load an HTML Page

The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
contents into a document object.

 doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

 doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:

 require 'open-uri'
 doc = open("http://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once.  However,
the parsed document object will be present in memory, in its entirety.
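
To picture what reading through a buffer means, here is a minimal sketch in
plain Ruby.  It is not Hpricot's scanner (that part is written in C); the
<tt>each_chunk</tt> helper and the 16-byte buffer size are made up for
illustration only.

```ruby
require 'stringio'

# Hypothetical helper, not part of Hpricot: yields successive pieces of
# the IO, each at most +size+ bytes, so the whole input is never held
# in a single read.
def each_chunk(io, size = 16)
  while (chunk = io.read(size))   # read(size) returns nil at end of input
    yield chunk
  end
end

source = "<p>A simple <b>test</b> string.</p>"
io = StringIO.new(source)         # stands in for a File or socket

chunks = []
each_chunk(io) { |piece| chunks << piece }
```

Any object that answers <tt>read</tt> this way (a File, a StringIO, an
open-uri stream) can be consumed piece by piece like this.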

=== Search for Elements

Use <tt>Doc.search</tt>:

 doc.search("//p[@class='posted']")
 #=> #<Hpricot::Elements[{p ...}, {p ...}]>

<tt>Doc.search</tt> can take an XPath or CSS expression.  In the above example,
all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
attribute of <tt>"posted"</tt>.

A shortcut is to use the division operator (<tt>/</tt>):

 (doc/"p.posted")
 #=> #<Hpricot::Elements[{p ...}, {p ...}]>

=== Finding Just One Element

If you're looking for a single element, the <tt>at</tt> method will return the
first element matched by the expression.  In this case, you'll get back the
element itself rather than the <tt>Hpricot::Elements</tt> array.

 doc.at("body")['onload']

The above code will find the body tag and give you back the <tt>onload</tt>
attribute.  This is the most common reason to use the element directly: when
reading and writing HTML attributes.

=== Fetching the Contents of an Element

Just as with browser scripting, the <tt>inner_html</tt> property can be used to
get the inner contents of an element.

 (doc/"#elementID").inner_html
 #=> "..<b>contents</b>.."

If your expression matches more than one element, you'll get back the contents
of *all the matched elements*.  So you may want to use <tt>first</tt> to be
sure you get back only one.

 (doc/"#elementID").first.inner_html
 #=> "..<b>contents</b>.."

=== Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use
<tt>to_html</tt>:

 (doc/"#elementID").to_html
 #=> "<div id='elementID'>...</div>"

=== Looping

All searches return a set of <tt>Hpricot::Elements</tt>.  Go ahead and loop
through them like you would an array.

 (doc/"p/a/img").each do |img|
   puts img.attributes['class']
 end

=== Continuing Searches

Searches can be continued from a collection of elements, in order to search deeper.

 # find all paragraphs.
 elements = doc.search("/html/body//p")
 # continue the search by finding any images within those paragraphs.
 (elements/"img")
 #=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

 require 'pp'

 # find all images within paragraphs.
 doc.search("/html/body//p").each do |para|
   puts "== Found a paragraph =="
   pp para

   imgs = para.search("img")
   if imgs.any?
     puts "== Found #{imgs.length} images inside =="
   end
 end

Of course, the most succinct ways to do the above are using CSS or XPath.

 # the xpath version
 (doc/"/html/body//p//img")
 # the css version
 (doc/"html > body > p img")
 # ..or symbols work, too!
 (doc/:html/:body/:p/:img)
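
The symbol form reads the way it does because <tt>/</tt> is an ordinary Ruby
method, and each call hands back something that can be searched again.  Here is
a minimal sketch of that idea.  <tt>Node</tt> and <tt>NodeSet</tt> are
hypothetical toy classes, not Hpricot's own, and this toy only matches direct
children by tag name, which is much simpler than a real XPath or CSS search.

```ruby
# Node and NodeSet are hypothetical toy classes, not part of Hpricot.
Node = Struct.new(:name, :children)

class NodeSet
  def initialize(nodes)
    @nodes = nodes
  end

  # Simplified search: match direct children by tag name only.
  # Accepts a String or a Symbol and returns a new NodeSet, so calls chain.
  def /(name)
    matches = @nodes.flat_map do |node|
      node.children.select { |child| child.name == name.to_s }
    end
    NodeSet.new(matches)
  end

  # Tag names of the nodes currently in the set.
  def names
    @nodes.map(&:name)
  end
end

# Build a tiny tree: root > html > body > p > img
img  = Node.new("img", [])
para = Node.new("p", [img])
body = Node.new("body", [para])
html = Node.new("html", [body])
doc  = NodeSet.new([Node.new("root", [html])])

found = doc/:html/:body/:p/:img
puts found.names.inspect   # prints ["img"]
```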

=== Looping Edits

You may certainly edit objects from within your search loops.  Then, when you
spit out the HTML, the altered elements will show.

 (doc/"span.entryPermalink").each do |span|
   span.attributes['class'] = 'newLinks'
 end
 puts doc

This changes all <tt>span.entryPermalink</tt> elements to
<tt>span.newLinks</tt>.  Keep in mind that there are often more convenient ways
of doing this.  Such as the <tt>set</tt> method:

 (doc/"span.entryPermalink").set(:class => 'newLinks')

=== Figuring Out Paths

Every element can tell you its unique path (either XPath or CSS) to get to the
element from the root tag.

The <tt>css_path</tt> method:

 doc.at("div > div:nth(1)").css_path
   #=> "div > div:nth(1)"
 doc.at("#header").css_path
   #=> "#header"

Or, the <tt>xpath</tt> method:

 doc.at("div > div:nth(1)").xpath
   #=> "/div/div:eq(1)"
 doc.at("#header").xpath
   #=> "//div[@id='header']"

== Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.

== :fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is.  So Hpricot's default behavior is to keep things flexible:
it makes sure to open and close all the tags, but ignores any validation
problems.

As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.

 doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }

This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow
the rules a bit better.  Say Hpricot finds a paragraph in a link: it's going to
move the paragraph below the link.  Likewise, it moves paragraphs up and out of
other elements where they don't belong.

If an unknown element is found, it is ignored.  Again, <tt>:fixup_tags</tt>.

== :xhtml_strict

So, let's go beyond just trying to fix the hierarchy.  The
<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
1.0 Strict document.  Even at the cost of removing elements that get in the way.

 doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }

What measures does <tt>:xhtml_strict</tt> take?

 1. Shift elements into their proper containers just like <tt>:fixup_tags</tt>.
 2. Remove unknown elements.
 3. Remove unknown attributes.
 4. Remove illegal content.
 5. Alter the doctype to XHTML 1.0 Strict.
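
Measures 2 and 3 in the list above amount to filtering against a list of what
is allowed.  The sketch below shows that idea on plain Ruby hashes.  It does
not use Hpricot, and the <tt>ALLOWED</tt> table and <tt>strict_filter</tt>
method are invented for illustration; the real rules come from the XHTML 1.0
Strict DTD, which is far larger.  What happens to the children of a removed
element is not modeled here.

```ruby
# Hypothetical allowlist: element name => permitted attribute names.
# A toy subset only; the real XHTML 1.0 Strict DTD defines far more.
ALLOWED = {
  "p"   => %w[id class],
  "a"   => %w[id class href],
  "img" => %w[id class src alt]
}.freeze

# Each toy element is a Hash with :name and :attrs keys.
def strict_filter(elements)
  # Drop elements whose name is not in the allowlist.
  known = elements.select { |el| ALLOWED.key?(el[:name]) }

  # For the survivors, keep only the attributes permitted for that element.
  known.map do |el|
    kept = el[:attrs].select { |key, _value| ALLOWED[el[:name]].include?(key) }
    { name: el[:name], attrs: kept }
  end
end

input = [
  { name: "p",     attrs: { "class" => "posted", "align" => "left" } },
  { name: "blink", attrs: {} },
  { name: "img",   attrs: { "src" => "a.png", "alt" => "A", "border" => "0" } }
]

cleaned = strict_filter(input)
```

Here the <tt>blink</tt> entry is dropped entirely, and the <tt>align</tt> and
<tt>border</tt> attributes are stripped from the elements that remain.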

== Hpricot.XML()

The last option is the <tt>:xml</tt> option, which makes some slight variations
on the standard mode.  The main difference is that <tt>:xml</tt> mode won't try
to output tags which are friendlier for browsers.  For example, if an opening and
closing <tt>br</tt> tag is found, XML mode won't try to turn that into an empty
element.

The primary way to use Hpricot's XML mode is to call the <tt>Hpricot.XML</tt>
method:

 doc = open("http://redhanded.hobix.com/index.xml") do |f|
   Hpricot.XML(f)
 end

*Also, :fixup_tags is canceled out by the :xml option.*  This is because
:fixup_tags makes assumptions based on how HTML is structured, specifically on
how tags are defined in the XHTML 1.0 DTD.