Roundcube Community Forum

Release Support => Pending Issues => Topic started by: garretg on June 22, 2010, 10:42:58 AM

Title: bad regexp in html2text.php
Post by: garretg on June 22, 2010, 10:42:58 AM
Greetings. I believe I've fixed a bug in the html2text library, which you use in your product.

I don't use Roundcube Webmail... I'm a moodle developer, and moodle uses the same html2text library.

In your file /program/lib/html2text.php...
---------------------------
478  // Remove unknown/unhandled entities (this cannot be done in search-and-replace block)
479           $text = preg_replace('/&[^&;]+;/i', '', $text);
---------------------------

That regular expression is too greedy... it matches an ampersand followed by any run of characters other than & and ; (spaces and newlines included), up to the next semicolon.

We've had numerous instances in moodle of huge chunks of content going missing when someone happens to include an ampersand in their text and a semicolon somewhere after it.

Here's an example...

Gin & Tonic
- 2oz gin;
- 5oz tonic water;
- 5 cubes of ice;
- 1 lime wedge.

If you ran that through html2text, it would output this:


Gin
- 5oz tonic water;
- 5 cubes of ice;
- 1 lime wedge.
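
To see exactly which span gets eaten, here is a minimal sketch that runs the original pattern from line 479 on the recipe. The variable names are mine, and it calls preg_match/preg_replace directly instead of going through html2text, so it only illustrates the regex step.

```php
<?php
// Recipe text from the example above.
$text = "Gin & Tonic\n- 2oz gin;\n- 5oz tonic water;\n- 5 cubes of ice;\n- 1 lime wedge.";

// Original pattern: '&', then one or more characters that are not '&' or ';', then ';'.
// The negated class also accepts spaces and newlines.
$pattern = '/&[^&;]+;/i';

// Capture the span the pattern matches: from '&' to the first ';', across the line break.
preg_match($pattern, $text, $m);
var_dump($m[0]);

// Remove it, the same way line 479 does.
$result = preg_replace($pattern, '', $text);
echo $result, "\n";
```

The match covers everything from the ampersand through "2oz gin;", which is why the rest of the first line and the whole second line disappear.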


The simple fix I am testing now is this:
479           $text = preg_replace('/&[^&;\s]+;/i', '', $text);

The additional \s means the match can no longer run across whitespace, so a bare ampersand followed by a space is left alone.
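
A quick sketch of how the patched pattern behaves, again looking only at the regex step. The sample strings and the made-up entity `&bogus;` are my own, just for illustration.

```php
<?php
// Patched pattern: \s in the negated class means a match cannot cross whitespace.
$fixed = '/&[^&;\s]+;/i';

$recipe = "Gin & Tonic\n- 2oz gin;\n- 5oz tonic water;";
$entity = "price &bogus; list"; // '&bogus;' is a made-up unknown entity

// The bare ampersand is followed by a space, so nothing matches and the text survives.
$recipeOut = preg_replace($fixed, '', $recipe);

// An entity-shaped token with no whitespace inside is still removed.
$entityOut = preg_replace($fixed, '', $entity);

var_dump($recipeOut === $recipe);
var_dump($entityOut);
```

Note that this is still a heuristic: text such as "Q&A;" has no whitespace between the ampersand and the semicolon, so "&A;" would still be stripped.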

Best regards,
-Garret
Title: bad regexp in html2text.php
Post by: SKaero on June 23, 2010, 03:53:18 AM
It has been fixed in r3777 http://trac.roundcube.net/changeset/3777