Ticket #4 (closed defect: fixed)

Opened 23 months ago

Last modified 22 months ago

incoming_entities is too eager

Reported by: yojimbo
Owned by: somebody
Priority: trivial
Milestone:
Component: component1
Version:
Keywords:
Cc:

Description

The regexp used in incoming_entities() to temporarily transform the & character tries to differentiate between bare & and & used in an HTML entity name.

/&(?![#a-z0-9]+;)/i

Unfortunately the lookahead will accept something like &enti#ty; as an entity, leaving its & untransformed, because the character class is not very constrained.
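To illustrate the problem, here is a minimal sketch in Ruby. Ruby is an assumption based on the regexp literal syntax, and the constant and helper names are hypothetical, not taken from the library. It checks which sample strings the quoted pattern treats as entities:

```ruby
# Pattern quoted in this ticket: matches an & that is NOT followed by
# something that looks like an entity body ending in ';'.
BARE_AMP = /&(?![#a-z0-9]+;)/i

# Hypothetical helper: true when the pattern finds no bare & in the text,
# i.e. every & is treated as the start of an entity.
def treated_as_entity?(text)
  text !~ BARE_AMP
end

["AT&T", "&amp;", "&#38;", "&enti#ty;"].each do |sample|
  puts "#{sample.inspect}: treated as entity? #{treated_as_entity?(sample)}"
end
```

Only "AT&T" is reported as containing a bare &; "&enti#ty;" passes as an entity because # is allowed anywhere in the character class.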

I can't find a definition of XHTML/HTML entity names, but a look at http://www.w3.org/2000/07/8378/xhtml/entities/entities.xml shows that the names are created from a reduced set ... but that means a larger regexp.

We either have an alphabetic name (which can finish with one or two digits) or a numeric specifier, which can be either decimal or hexadecimal.

A "better" regexp might be:

/&(?!(#\d+|#x[[:xdigit:]]+|[[:alpha:]]+\d{0,2});)/
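A sketch of how a stricter lookahead along these lines could behave, again assuming Ruby. The pattern below is one tidied reading of the idea (balanced parentheses, bracketed POSIX classes, a closing ;, and up to two trailing digits as described above); it is not the regexp that was committed for this ticket, and the names are hypothetical:

```ruby
# Stricter lookahead: an entity is "#" + decimal digits, "#x" + hex digits,
# or letters followed by at most two digits, always terminated by ';'.
STRICT_BARE_AMP = /&(?!(?:#\d+|#x[[:xdigit:]]+|[[:alpha:]]+\d{0,2});)/

# Hypothetical helper: true when the text contains an & that is not
# the start of a well-formed entity reference.
def bare_ampersand?(text)
  !(text =~ STRICT_BARE_AMP).nil?
end

["&amp;", "&#38;", "&#x26;", "&frac12;", "AT&T", "&enti#ty;"].each do |sample|
  puts "#{sample.inspect}: bare ampersand? #{bare_ampersand?(sample)}"
end
```

With this pattern the first four samples are recognised as entities, while "AT&T" and "&enti#ty;" are both flagged as containing a bare &. It still checks only the shape of the name, not whether the entity is actually defined.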

Change History

Changed 22 months ago by jgarber

  • status changed from new to closed
  • resolution set to fixed

Fixed in [133]. I copied the entities regexp used elsewhere.
