Ticket #4 (closed defect: fixed)
incoming_entities is too eager
| Reported by: | yojimbo | Owned by: | somebody |
|---|---|---|---|
| Priority: | trivial | Milestone: | |
| Component: | component1 | Version: | |
| Keywords: | | Cc: | |
Description
The regexp used in incoming_entities() to temporarily transform the & character tries to differentiate between bare & and & used in an HTML entity name.
/&(?![#a-z0-9]+;)/i
Unfortunately, the lookahead also accepts something like &enti#ty; as an entity, because the character class [#a-z0-9] allows # and digits anywhere in the name, so that & is never treated as bare.
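The behaviour can be checked with a minimal sketch. It assumes Python's `re` module as a stand-in for whatever engine incoming_entities() actually uses, and `bare_ampersands` is a hypothetical helper written only for this illustration:

```python
import re

# The pattern from the ticket, unchanged apart from Python's flag syntax.
EAGER = re.compile(r"&(?![#a-z0-9]+;)", re.I)

def bare_ampersands(text):
    """Return the offsets of every & that the pattern treats as bare."""
    return [m.start() for m in EAGER.finditer(text)]

print(bare_ampersands("AT&T"))       # [2]  bare &, correctly found
print(bare_ampersands("&amp;"))      # []   real entity, correctly skipped
print(bare_ampersands("&enti#ty;"))  # []   not an entity, but skipped anyway
```

The last call is the problem case: the string is not a well-formed entity, yet its & is left alone.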
I can't find a definition of XHTML/HTML entity names, but a look at http://www.w3.org/2000/07/8378/xhtml/entities/entities.xml shows that the names are created from a reduced set ... but that means a larger regexp.
We have either an alphabetic name (which can finish with one or two digits) or a numerical specifier, which can be either decimal or hexadecimal.
A "better" regexp might be :-
/&(?!(#\d+|#x[[:xdigit:]]+|[[:alpha:]]+\d{0,2});)/i
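A stricter lookahead along these lines can be sketched as follows. This is an illustration rather than a tested patch: it assumes Python's `re` module, which has no POSIX bracket classes, so the hex digits and letters are spelled out as explicit ranges. The trailing `;`, the balanced parentheses, and the allowance of up to two trailing digits are taken from the description above. `bare_ampersands` is a hypothetical helper.

```python
import re

# Stricter shape check: decimal reference, hex reference, or a
# letters-only name that may end in at most two digits, then ";".
STRICT = re.compile(
    r"&(?!(?:#[0-9]+|#x[0-9a-f]+|[a-z]+[0-9]{0,2});)",
    re.I,
)

def bare_ampersands(text):
    """Return the offsets of every & that the pattern treats as bare."""
    return [m.start() for m in STRICT.finditer(text)]

print(bare_ampersands("AT&T"))       # [2]
print(bare_ampersands("&#169;"))     # []
print(bare_ampersands("&#xA9;"))     # []
print(bare_ampersands("&frac12;"))   # []
print(bare_ampersands("&enti#ty;"))  # [0]  now treated as a bare &
```

This checks only the shape of the reference. A well-formed but nonexistent name such as &foo; would still be skipped, and rejecting those would need a lookup against the entity list itself.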