
removal of control characters (ASCII<32) from XML (RSS) (Technics)

by Auge ⌂, Thursday, April 25, 2019, 14:47

Hello

Some time ago, while MLF1 was being developed towards version 1.8, Alfie pointed out the need to remove control characters (ASCII < 32) from postings before they are included in the RSS feed. We developed a function that replaces these characters with nothing (empty string ""). A similar function has been part of MLF2 ever since (Alfie reported the issue on June 30th, 2010). It makes two exceptions: line breaks (\r\n, \r, \n) are kept, and the TAB (in the code: chr(9)) is replaced with a space.

While checking my bookmarks, I found a posting in another forum that proposed a different way to handle this issue: a regular expression. A reply to it contained a link to a Stack Overflow thread. Even with the exceptions mentioned above, this would be a much smaller solution, at the cost of somewhat lower readability.

preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

The shown expression addresses the chars 0 to 8 (\x00-\x08), 11 (\x0B), 12 (\x0C), 14 to 31 (\x0E-\x1F) and additionally 127 (\x7F) (is that necessary?). With a subsequent replacement of chr(9) with " ", we could handle the issue in only a few lines of code instead of the monster array currently in use.
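
To make the ranges in the character class concrete, here is a minimal sketch that loops over all 7-bit ASCII code points and collects the ones the pattern matches. The helper name matched_code_points is made up for this illustration and is not part of MLF.

```php
<?php
// Hypothetical helper (not part of MLF): returns the ASCII code points
// (0-127) that the given single-character pattern matches.
function matched_code_points(string $pattern): array {
  $matched = [];
  for ($code = 0; $code < 128; $code++) {
    // chr() turns the code point into a one-byte string to test against
    if (preg_match($pattern, chr($code)) === 1) {
      $matched[] = $code;
    }
  }
  return $matched;
}

$codes = matched_code_points('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/');
// TAB (9), LF (10) and CR (13) are not in the list,
// so they would survive the preg_replace() call.
echo implode(' ', $codes), "\n";
```

The list contains 0 to 8, 11, 12, 14 to 31 and 127, which matches the description above.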

Opinions?

Tschö, Auge

--
Never split "Müll" (garbage), for it has only one syllable!


removal of control characters (ASCII<32) from XML (RSS)

by Alfie ⌂, Vienna, Austria, Thursday, April 25, 2019, 21:06 @ Auge

Hi Auge,

While checking my bookmarks, I found a posting in another forum that proposed a different way to handle this issue: a regular expression. A reply to it contained a link to a Stack Overflow thread. Even with the exceptions mentioned above, this would be a much smaller solution, at the cost of somewhat lower readability.

preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

The shown expression addresses the chars 0 to 8 (\x00-\x08), 11 (\x0B), 12 (\x0C), 14 to 31 (\x0E-\x1F) and additionally 127 (\x7F) (is that necessary?).

Looks good! Add a comment to the source explaining what this regex does and why, and then I like it. I don’t see a case where 127 could turn up.
For readers who don’t know the background: such characters might be embedded in a PDF. If one copies text from a PDF into a post, it’s not a problem in the forum (since these characters are invisible), but they screw up the feed.

--
Cheers,
Alfie (Helmut Schütz)
BEBA-Forum (v1.8β)


removal of control characters (ASCII<32) from XML (RSS)

by Auge ⌂, Friday, April 26, 2019, 09:04 @ Alfie

Hello

While checking my bookmarks, I found a posting in another forum that proposed a regular expression. A reply to it contained a link to a Stack Overflow thread.

preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

The shown expression addresses the chars 0 to 8 (\x00-\x08), 11 (\x0B), 12 (\x0C), 14 to 31 (\x0E-\x1F) and additionally 127 (\x7F) (is that necessary?).

Char 127 (\x7F) is the DEL control character. I don't know whether it would break the RSS feed.

If not, the resulting function (based on the solution for MLF2, located in includes/functions.inc.php) would look like this:

/**
 * filters control characters
 *
 * @param string $string
 * @return string
 */
function filter_control_characters($string) {
  # remove the specified control chars (0-8, 11, 12, 14-31) from the string
  $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $string);
  # replace the control char 9 (TAB) with a space
  $string = str_replace(chr(9), ' ', $string);
  # control chars 10 and 13 (\n, \r) remain untouched
  return $string;
}

For Alfie's special version of the MLF1 forum (functions/funcs.output.php) it would be:

/**
 * Strips all control characters from output in case of XML output
 *
 * @param string $string
 * @return string $string
 */
function outputXMLclearedString($string) {
  # remove the specified control chars (0-8, 11, 12, 14-31) from the string
  $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $string);
  # replace the control char 9 (TAB) with a space
  $string = str_replace(chr(9), ' ', $string);
  # control chars 10 and 13 (\n, \r) remain untouched
  return $string;
} # End: outputXMLclearedString

So the only substantial difference is the name of the function.

Looks good! Add a comment to the source explaining what this regex does and why, and then I like it. I don’t see a case where 127 could turn up.

Char 127 (DEL) was never part of our solutions, so I removed it from the regex.
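
As a quick check of the behaviour, here is a small self-contained sketch that repeats the function from above and runs it on a made-up sample string. The sample input is an assumption for illustration only, not real forum data.

```php
<?php
// Same logic as the function shown above, repeated so the sketch runs on its own.
function filter_control_characters($string) {
  # remove the specified control chars (0-8, 11, 12, 14-31) from the string
  $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $string);
  # replace the control char 9 (TAB) with a space
  $string = str_replace(chr(9), ' ', $string);
  # control chars 10 and 13 (\n, \r) remain untouched
  return $string;
}

// Made-up input: TAB, NUL (0), vertical tab (11), a CRLF line break and DEL (127)
$dirty = "A\tB\x00C\x0B\r\nD\x7F";
$clean = filter_control_characters($dirty);

// NUL and vertical tab are gone, TAB became a space,
// the line break and DEL (no longer in the regex) are kept.
var_dump($clean === "A BC\r\nD\x7F"); // bool(true)
```

If DEL turns out to break the feed after all, adding \x7F back to the character class would be the only change needed.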

Tschö, Auge

--
Never split "Müll" (garbage), for it has only one syllable!
