Ticket #105 (new enhancement)

Opened 13 months ago

Last modified 6 months ago

[PATCH] dynamically allocate memory when BUFSIZE is hit

Reported by: lwu Owned by: why
Priority: minor Milestone:
Component: lib/hpricot Version:
Keywords: Cc: oksteev

Description (last modified by oksteev) (diff)

From steve d on the mailing list:

you can find patched gem here: http://coderrr.wordpress.com/2007/09/14/hpricot-patch-to-support-arbitrarily-large-elements/

I got tired of running into stupid unpredictably sized .NET viewstates and I've thought for a while the parser should handle arbitrarily sized elements. So I finally made a patch. plz apply ;]

There's no difference in performance for html which has no elements greater than the starting buffer size.

Running Hpricot.parse(TestFiles::BASIC) 10000 times:

original code:
user     system      total        real
19.730000   0.810000  20.540000 ( 45.645450)

w patch:
user     system      total        real
19.920000   0.790000  20.710000 ( 46.001500)

Change History

  Changed 13 months ago by lwu

Index: test/test_parser.rb
===================================================================
--- test/test_parser.rb (revision 155)
+++ test/test_parser.rb (working copy)
@@ -299,10 +299,8 @@
    assert_equal "blah='blah'", doc.children[2].content
  end

-  def test_buffer_error
-    assert_raise Hpricot::ParseError, "ran out of buffer space on element <input>, starting on line 3." do
-      Hpricot(%{<p>\n\n<input type="hidden" name="__VIEWSTATE"  value="#{(("X" * 2000) + "\n") * 22}" />\n\n</p>})
-    end
+  def test_no_buffer_error
+    Hpricot(%{<p>\n\n<input type="hidden" name="__VIEWSTATE"  value="#{(("X" * 2000) + "\n") * 44}" />\n\n</p>})
  end

  def test_youtube_attr
Index:
 ext/hpricot_scan/hpricot_scan.rl
===================================================================
--- ext/hpricot_scan/hpricot_scan.rl    (revision 155)
+++ ext/hpricot_scan/hpricot_scan.rl    (working copy)
@@ -162,16 +162,23 @@

  while ( !done ) {
    VALUE str;
-    char *p = buf + have, *pe;
-    int len, space = buffer_size - have;
+    char *p, *pe;
+    int len, space = buffer_size - have, tokstart_diff, tokend_diff;

    if ( space == 0 ) {
-      /* We've used up the entire buffer storing an already-parsed token
-       * prefix that must be preserved.  Likely caused by super-long attributes.
-       * See ticket #13. */
-
 rb_raise(rb_eHpricotParseError, "ran out of buffer space on element <%s>, starting on line %d.", RSTRING(tag)->ptr, curline);
+      tokstart_diff = tokstart - buf;
+      tokend_diff = tokend - buf;
+
+      buffer_size += BUFSIZE;
+      buf = REALLOC_N(buf, char, buffer_size);
+
+      space = buffer_size - have;
+      tokstart = buf + tokstart_diff;
+      tokend = buf + tokend_diff;
    }

+    p = buf + have;
+
    if ( rb_respond_to( port, s_read ) )
    {
      str = rb_funcall( port, s_read, 1, INT2FIX(space) );

  Changed 13 months ago by oksteev

  • description modified (diff)

Thanks for posting this Leslie

in reply to: ↑ description   Changed 13 months ago by oksteev

  • description modified (diff)

Sorry, the last patch was actually broken. This one works and also includes an equality test for long elements.

Index: test/test_parser.rb
===================================================================
--- test/test_parser.rb (revision 155)
+++ test/test_parser.rb (working copy)
@@ -299,10 +299,10 @@
     assert_equal "blah='blah'", doc.children[2].content
   end

-  def test_buffer_error
-    assert_raise Hpricot::ParseError, "ran out of buffer space on element <input>, starting on line 3." do
-      Hpricot(%{<p>\n\n<input type="hidden" name="__VIEWSTATE"  value="#{(("X" * 2000) + "\n") * 22}" />\n\n</p>})
-    end
+  def test_no_buffer_error
+    viewstate = (("X" * 2000) + "\n") * 100
+    doc = Hpricot(%{<p>\n\n<input type="hidden" name="__VIEWSTATE"  value="#{viewstate}" />\n\n</p>})
+    assert_equal viewstate, (doc/:input)[0].get_attribute('value')
   end

   def test_youtube_attr
Index: ext/hpricot_scan/hpricot_scan.rl
===================================================================
--- ext/hpricot_scan/hpricot_scan.rl (revision 155)
+++ ext/hpricot_scan/hpricot_scan.rl (working copy)
@@ -162,16 +162,30 @@

   while ( !done ) {
     VALUE str;
-    char *p = buf + have, *pe;
-    int len, space = buffer_size - have;
+    char *p, *pe;
+    int len, space = buffer_size - have, tokstart_diff, tokend_diff, mark_tag_diff, mark_akey_diff, mark_aval_diff;

     if ( space == 0 ) {
-      /* We've used up the entire buffer storing an already-parsed token
-       * prefix that must be preserved.  Likely caused by super-long attributes.
-       * See ticket #13. */
-      rb_raise(rb_eHpricotParseError, "ran out of buffer space on element <%s>, starting on line %d.", RSTRING(tag)->ptr, curline);
+      tokstart_diff = tokstart - buf;
+      tokend_diff = tokend - buf;
+      mark_tag_diff = mark_tag - buf;
+      mark_akey_diff = mark_akey - buf;
+      mark_aval_diff = mark_aval - buf;
+
+      buffer_size += BUFSIZE;
+      buf = REALLOC_N(buf, char, buffer_size);
+
+      space = buffer_size - have;
+
+      tokstart = buf + tokstart_diff;
+      tokend = buf + tokend_diff;
+      mark_tag = buf + mark_tag_diff;
+      mark_akey = buf + mark_akey_diff;
+      mark_aval = buf + mark_aval_diff;
     }

+    p = buf + have;
+
     if ( rb_respond_to( port, s_read ) )
     {
       str = rb_funcall( port, s_read, 1, INT2FIX(space) );

Or, the pastie: http://pastie.caboo.se/97453

  Changed 7 months ago by oksteev

  • description modified (diff)

  Changed 7 months ago by oksteev

Here is a patch for the JRuby ragel

Index: ext/hpricot_scan/hpricot_scan.java.rl
===================================================================
--- ext/hpricot_scan/hpricot_scan.java.rl       (revision 161)
+++ ext/hpricot_scan/hpricot_scan.java.rl       (working copy)
@@ -259,7 +259,11 @@
       /* We've used up the entire buffer storing an already-parsed token
        * prefix that must be preserved.  Likely caused by super-long attributes.
        * See ticket #13. */
-      rb_raise(rb_eHpricotParseError, "ran out of buffer space on element <" + tag.toString() + ">, starting on line "+curline+".");
+      buffer_size += BUFSIZE;
+      char[] new_buf = new char[buffer_size];
+      System.arraycopy(buf, 0, new_buf, 0, buf.length);
+      buf = new_buf;
+      space = buffer_size - have;
     }
 
     if (port.respondsTo("read")) {

  Changed 6 months ago by oksteev

  • cc oksteev added
Note: See TracTickets for help on using tickets.