removal of control characters (ASCII<32) from XML (RSS)

by Auge, Thursday, April 25, 2019, 14:47 (29 days ago)


Back when we were developing MLF1 towards version 1.8, Alfie pointed out that control characters (ASCII < 32) have to be removed from postings before they are included in the RSS feed. We developed a function that replaces these chars with nothing (empty string ""). A similar function has been part of MLF2 ever since (Alfie reported the issue on June 30th, 2010). It makes two exceptions: line breaks (\r\n, \r, \n) are kept, and the TAB (in the code: char(9)) is replaced with a space.

While going through my bookmarks, I found a posting in another forum in which someone proposed another way to handle this issue: a regular expression. A reply contained a link to a Stack Overflow thread. Even with the exceptions mentioned above, this would be a much shorter solution, although at the cost of some readability.

preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

The expression shown matches the chars 0 to 8 (\x00-\x08), 11 (\x0B), 12 (\x0C), 14 to 31 (\x0E-\x1F) and additionally 127 (\x7F) (is that necessary?). It leaves out 9 (TAB), 10 (\n) and 13 (\r), so the line breaks survive. By replacing char(9) with " " afterwards, we could handle the issue in only a few lines of code instead of the monster array that is currently in use.
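
To make the idea concrete, here is a minimal sketch of how the two steps could be combined. The function name strip_control_chars() is made up for this example and is not taken from the MLF code. The pattern runs without the /u modifier, so it works on single bytes; that should be safe for UTF-8 input, because every byte of a multi-byte sequence is 0x80 or higher. As for char 127: as far as I can tell, XML 1.0 permits it (its use is only discouraged), so removing it looks optional rather than required.

```php
<?php
// Hypothetical helper, not part of the MLF code base.
function strip_control_chars(string $input): string
{
    // Remove chars 0-8, 11, 12, 14-31 and 127.
    // TAB (9), LF (10) and CR (13) are not matched and stay in place.
    $cleaned = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

    // preg_replace() returns null if the regex engine reports an error.
    if ($cleaned === null) {
        throw new RuntimeException('preg_replace() failed: ' . preg_last_error());
    }

    // Replace every remaining TAB with a single space.
    return str_replace("\t", ' ', $cleaned);
}

// Usage: the BEL char (\x07) is dropped, the TAB becomes a space,
// the line break is kept.
$clean = strip_control_chars("Hello\x07\tWorld\n"); // "Hello World\n"
```

Doing the TAB replacement as a separate str_replace() call keeps the regular expression identical to the one quoted above, which should make it easier to compare both variants.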


