removal of control characters (ASCII<32) from XML (RSS) (Technics)

by Auge ⌂, Friday, April 26, 2019, 09:04 (1828 days ago) @ Alfie

Hello

While going through my bookmarks I found a posting in another forum in which someone proposed a regular expression. A reply to it contained a link to a Stack Overflow thread.

preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

The expression shown addresses the chars 0 to 8 (\x00-\x08), 11 (\x0B), 12 (\x0C), 14 to 31 (\x0E-\x1F) and additionally 127 (\x7F). Is the last one necessary?

Char 127 (\x7F) is the DEL control character. I don't know whether it would break the RSS feed.
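
To see exactly which characters the quoted expression covers, one can run it against every 7-bit ASCII code. This is only a minimal sketch; the variable names are made up for illustration.

```php
<?php
// The expression quoted above, including \x7F (DEL).
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/';

// Collect every 7-bit ASCII code whose character the pattern matches.
$matched = array();
for ($code = 0; $code <= 127; $code++) {
  if (preg_match($pattern, chr($code)) === 1) {
    $matched[] = $code;
  }
}

// Prints the matched codes as a comma-separated list.
echo implode(', ', $matched), "\n";
```

The list should contain 0 to 8, 11, 12, 14 to 31 and 127, while 9 (TAB), 10 (\n) and 13 (\r) are left out.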

If not, the resulting function (based on the solution for MLF2, located in includes/functions.inc.php) would look like this:

/**
 * filters control characters
 *
 * @param string $string
 * @return string
 */
function filter_control_characters($string) {
  # remove the specified control chars (0-8, 11, 12, 14-31) from the string
  $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $string);
  # replace the control char 9 (TAB) with a space
  $string = str_replace(chr(9), ' ', $string);
  # control chars 10 and 13 (\n, \r) remain untouched
  return $string;
}
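
A quick way to check the effect on a feed is to parse a fragment before and after filtering. The following is a minimal sketch, assuming the SimpleXML extension is available. The sample text and the <description> wrapper are made up for illustration, and the function is repeated (with a guard) only so that the snippet runs on its own.

```php
<?php
// Repeated from above so that the snippet is self-contained.
if (!function_exists('filter_control_characters')) {
  function filter_control_characters($string) {
    $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $string);
    $string = str_replace(chr(9), ' ', $string);
    return $string;
  }
}

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical posting text with a vertical tab (11), a TAB (9) and a line break (10).
$text = "first\x0Bline\tindented\nsecond line";

$raw = simplexml_load_string('<description>' . $text . '</description>');
$clean = simplexml_load_string('<description>' . filter_control_characters($text) . '</description>');

var_dump($raw === false);   # whether the unfiltered fragment was rejected by the parser
var_dump((string) $clean);  # the filtered text: char 11 removed, TAB replaced, line break kept
```

The unfiltered fragment should be rejected because of char 11, while the filtered one should parse and still contain the line break.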

For Alfie's special version of the MLF1 forum (functions/funcs.output.php) it would be:

/**
 * Strips the control characters that are not allowed in XML from the output
 *
 * @param string $string
 * @return string $string
 */
function outputXMLclearedString($string) {
  # remove the specified control chars (0-8, 11, 12, 14-31) from the string
  $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $string);
  # replace the control char 9 (TAB) with a space
  $string = str_replace(chr(9), ' ', $string);
  # control chars 10 and 13 (\n, \r) remain untouched
  return $string;
} # End: outputXMLclearedString

So the only substantial difference is the name of the function.

Looks good! Add a comment to the source explaining what this regex does and why, then I like it. I don’t see a case where 127 could turn up.

Char 127 (DEL) was never part of our solutions, so I removed it from the regex.
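
As far as I can tell from the character range of XML 1.0, char 127 is a legal character there (everything from 0x20 upwards within ASCII is), so leaving DEL in the string should not make a feed ill-formed. A minimal sketch to check this, assuming the SimpleXML extension is available; the <title> wrapper and the sample text are made up for illustration.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical title containing DEL (127) between two words.
$with_del = simplexml_load_string("<title>foo\x7Fbar</title>");

// For comparison: the same title with ESC (27), one of the chars the function removes.
$with_esc = simplexml_load_string("<title>foo\x1Bbar</title>");

var_dump($with_del !== false);  # whether the title with DEL was parsed
var_dump($with_esc !== false);  # whether the title with ESC was parsed
```

The first check should report true and the second false, which matches the decision to drop \x7F from the expression.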

Tschö, Auge

--
Never separate Müll (waste), for it has only one syllable!

