Changeset 104

Timestamp:
01/31/2007 13:31:14
Author:
lwu
Message:

Make README line breaks a bit more consistent for readability

Files:
1 modified

  • trunk/README

--- trunk/README (r103)
+++ trunk/README (r104)
@@ -2,14 +2,15 @@
 
 Hpricot is a fast, flexible HTML parser written in C.  It's designed to be very
-accommodating (like Tanaka Akira's HTree) and to have a very helpful library (like
-some JavaScript libs -- JQuery, Prototype -- give you.)  The XPath and CSS parser,
-in fact, is based on John Resig's JQuery.
+accommodating (like Tanaka Akira's HTree) and to have a very helpful library
+(like some JavaScript libs -- JQuery, Prototype -- give you.)  The XPath and CSS
+parser, in fact, is based on John Resig's JQuery.
 
 Also, Hpricot can be handy for reading broken XML files, since many of the same
-techniques can be used.  If a quote is missing, Hpricot tries to figure it out.  If
-tags overlap, Hpricot works on sorting them out.  You know, that sort of thing.
-
-*Please read this entire document* before making assumptions about how this software
-works.
+techniques can be used.  If a quote is missing, Hpricot tries to figure it out.
+If tags overlap, Hpricot works on sorting them out.  You know, that sort of
+thing.
+
+*Please read this entire document* before making assumptions about how this
+software works.
 
 == An Overview
@@ -39,9 +40,8 @@
 * See COPYING for the terms of this software. (Spoiler: it's absolutely free.)
 
-If you have any trouble, don't hesitate to contact the author.  As always, I'm not
-going to say "Use at your own risk" because I don't want this library to
-be risky.  If you trip on something, I'll share the liability by
-repairing things as quickly as I can.  Your responsibility is to report
-the inadequacies.
+If you have any trouble, don't hesitate to contact the author.  As always, I'm
+not going to say "Use at your own risk" because I don't want this library to be
+risky.  If you trip on something, I'll share the liability by repairing things
+as quickly as I can.  Your responsibility is to report the inadequacies.
 
 == Installing Hpricot
@@ -80,5 +80,6 @@
 === Load an HTML Page
 
-The <tt>Hpricot()</tt> method takes a string or any IO object and loads the contents into a document object.
+The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
+contents into a document object.
 
  doc = Hpricot("<p>A simple <b>test</b> string.</p>")
@@ -93,7 +94,7 @@
  doc = open("http://qwantz.com/") { |f| Hpricot(f) }
 
-Hpricot uses an internal buffer to parse the file, so the IO will stream properly and large documents won't be
-loaded into memory all at once.  However, the parsed document object will be present in memory, in its
-entirety.
+Hpricot uses an internal buffer to parse the file, so the IO will stream
+properly and large documents won't be loaded into memory all at once.  However,
+the parsed document object will be present in memory, in its entirety.
 
 === Search for Elements
@@ -104,5 +105,7 @@
  #=> #<Hpricot:Elements[{p ...}, {p ...}]>
 
-<tt>Doc.search</tt> can take an XPath or CSS expression.  In the above example, all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt> attribute of <tt>"posted"</tt>.
+<tt>Doc.search</tt> can take an XPath or CSS expression.  In the above example,
+all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
+attribute of <tt>"posted"</tt>.
 
 A shortcut is to use the divisor:
@@ -113,18 +116,25 @@
 === Finding Just One Element
 
-If you're looking for a single element, the <tt>at</tt> method will return the first element matched by the expression.  In this case, you'll get back the element itself rather than the <tt>Hpricot::Elements</tt> array.
+If you're looking for a single element, the <tt>at</tt> method will return the
+first element matched by the expression.  In this case, you'll get back the
+element itself rather than the <tt>Hpricot::Elements</tt> array.
 
  doc.at("body")['onload']
 
-The above code will find the body tag and give you back the <tt>onload</tt> attribute.  This is the most common reason to use the element directly: when reading and writing HTML attributes.
+The above code will find the body tag and give you back the <tt>onload</tt>
+attribute.  This is the most common reason to use the element directly: when
+reading and writing HTML attributes.
 
 === Fetching the Contents of an Element
 
-Just as with browser scripting, the <tt>inner_html</tt> property can be used to get the inner contents of an element.
+Just as with browser scripting, the <tt>inner_html</tt> property can be used to
+get the inner contents of an element.
 
  (doc/"#elementID").inner_html
  #=> "..<b>contents</b>.."
 
-If your expression matches more than one element, you'll get back the contents of ''all the matched elements''.  So you may want to use <tt>first</tt> to be sure you get back only one.
+If your expression matches more than one element, you'll get back the contents
+of ''all the matched elements''.  So you may want to use <tt>first</tt> to be
+sure you get back only one.
 
  (doc/"#elementID").first.inner_html
@@ -133,5 +143,6 @@
 === Fetching the HTML for an Element
 
-If you want the HTML for the whole element (not just the contents), use <tt>to_html</tt>:
+If you want the HTML for the whole element (not just the contents), use
+<tt>to_html</tt>:
 
  (doc/"#elementID").to_html
@@ -140,5 +151,6 @@
 === Looping
 
-All searches return a set of <tt>Hpricot::Elements</tt>.  Go ahead and loop through them like you would an array.
+All searches return a set of <tt>Hpricot::Elements</tt>.  Go ahead and loop
+through them like you would an array.
 
  (doc/"p/a/img").each do |img|
@@ -180,5 +192,6 @@
 === Looping Edits
 
-You may certainly edit objects from within your search loops.  Then, when you spit out the HTML, the altered elements will show.
+You may certainly edit objects from within your search loops.  Then, when you
+spit out the HTML, the altered elements will show.
 
  (doc/"span.entryPermalink").each do |span|
@@ -187,5 +200,7 @@
  puts doc
 
-This changes all <tt>span.entryPermalink</tt> elements to <tt>span.newLinks</tt>.  Keep in mind that there are often more convenient ways of doing this.  Such as the <tt>set</tt> method:
+This changes all <tt>span.entryPermalink</tt> elements to
+<tt>span.newLinks</tt>.  Keep in mind that there are often more convenient ways
+of doing this.  Such as the <tt>set</tt> method:
 
  (doc/"span.entryPermalink").set(:class => 'newLinks')
@@ -193,5 +208,6 @@
 === Figuring Out Paths
 
-Every element can tell you its unique path (either XPath or CSS) to get to the element from the root tag.
+Every element can tell you its unique path (either XPath or CSS) to get to the
+element from the root tag.
 
 The <tt>css_path</tt> method:
@@ -211,18 +227,22 @@
 == Hpricot Fixups
 
-When loading HTML documents, you have a few settings that can make Hpricot more or less intense about how it gets
-involved.
+When loading HTML documents, you have a few settings that can make Hpricot more
+or less intense about how it gets involved.
 
 == :fixup_tags
 
-Really, there are so many ways to clean up HTML and your intentions may be to keep the HTML as-is.  So Hpricot's
-default behavior is to keep things flexible.  Making sure to open and close all the tags, but ignore any validation problems.
-
-As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt to shift the document's tags to meet XHTML 1.0 Strict.
+Really, there are so many ways to clean up HTML and your intentions may be to
+keep the HTML as-is.  So Hpricot's default behavior is to keep things flexible.
+Making sure to open and close all the tags, but ignore any validation problems.
+
+As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
+to shift the document's tags to meet XHTML 1.0 Strict.
 
  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
 
-This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow the rules a bit better.  Like: say Hpricot finds
-a paragraph in a link, it's going to move the paragraph below the link.  Or up and out of other elements where paragraphs don't belong.
+This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow
+the rules a bit better.  Like: say Hpricot finds a paragraph in a link, it's
+going to move the paragraph below the link.  Or up and out of other elements
+where paragraphs don't belong.
 
 If an unknown element is found, it is ignored.  Again, <tt>:fixup_tags</tt>.
@@ -230,6 +250,7 @@
 == :xhtml_strict
 
-So, let's go beyond just trying to fix the hierarchy.  The <tt>:xhtml_strict</tt> option really tries to force the document to be an
-XHTML 1.0 Strict document.  Even at the cost of removing elements that get in the way.
+So, let's go beyond just trying to fix the hierarchy.  The
+<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
+1.0 Strict document.  Even at the cost of removing elements that get in the way.
 
  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
@@ -245,7 +266,8 @@
 == Hpricot.XML()
 
-The last option is the <tt>:xml</tt> option, which makes some slight variations on the standard mode.  The main difference is
-that :xml mode won't try to output tags which are friendlier for browsers.  For example, if an opening and closing <tt>br</tt>
-tag is found, XML mode won't try to turn that into an empty element.
+The last option is the <tt>:xml</tt> option, which makes some slight variations
+on the standard mode.  The main difference is that :xml mode won't try to output
+tags which are friendlier for browsers.  For example, if an opening and closing
+<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.
 
 The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
@@ -255,4 +277,5 @@
  end
 
-*Also, :fixup_tags is canceled out by the :xml option.*  This is because :fixup_tags makes assumptions based how HTML is
-structured.  Specifically, how tags are defined in the XHTML 1.0 DTD.
+*Also, :fixup_tags is canceled out by the :xml option.*  This is because
+:fixup_tags makes assumptions based how HTML is structured.  Specifically, how
+tags are defined in the XHTML 1.0 DTD.
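
Note on the change: every hunk above re-wraps README prose and leaves the indented example lines and the headings alone. The sketch below shows one way such a re-wrap could be done in Ruby. It is not necessarily how the author produced r104 (an editor's fill command would do the same job). The 80-column limit is an assumption inferred from the longest added lines, and the helper names <tt>wrap_words</tt> and <tt>rewrap</tt> are hypothetical. Unlike the README, the sketch collapses the double space after a period, and it treats bullet lines as ordinary prose.

```ruby
# Hypothetical helper: greedily pack words into lines of at most +width+ columns.
def wrap_words(words, width)
  lines = []
  current = []
  words.each do |word|
    candidate = (current + [word]).join(" ")
    if current.empty? || candidate.length <= width
      current << word
    else
      lines << current.join(" ")
      current = [word]
    end
  end
  lines << current.join(" ") unless current.empty?
  lines
end

# Hypothetical helper: re-wrap prose paragraphs to +width+ columns (80 is an
# assumed limit). Blank lines, indented lines (RDoc verbatim examples), and
# heading lines starting with "=" are passed through untouched.
def rewrap(text, width = 80)
  out = []
  paragraph = []
  text.each_line do |raw|
    line = raw.chomp
    if line.strip.empty? || line.start_with?(" ", "=")
      out.concat(wrap_words(paragraph, width)) # flush the pending paragraph
      paragraph = []
      out << line
    else
      paragraph.concat(line.split) # collect words; runs of spaces collapse
    end
  end
  out.concat(wrap_words(paragraph, width))
  out.join("\n") + "\n"
end
```

Calling <tt>rewrap(File.read("README"))</tt> returns the re-wrapped text as a string; writing it back to disk is left to the caller.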