An unexpected character replacement

As a data auditor I'm used to seeing non-ASCII characters appearing as replacement characters, question marks and mojibake:

A few weeks ago I found a replacement in GBIF that I'd never seen before: M<fc>ller. It was a hexadecimal value for the character "ü" enclosed in angle brackets. That particular hex value for "ü" appears in Windows-1252 and other encodings, but what program did this replacement? And why?

Suspecting the worst, I did a search for other angle-bracket-enclosed strings in the dataset. The search turned up a lot of data items which had originally contained a non-breaking space, and which now contained that character's Unicode representation in brackets, for example Laevicardium<U+00A0>. Excluding these, the result is shown here:

The characters replaced by hexadecimal values all seemed to be in Windows-1252 encoding:

The Unicode replacements are a bit less obvious, as both are control characters. "U+0092" i…
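
A search like the one described above can be sketched in Python. This is only an illustration, not the method used in the post: the names `PATTERN`, `find_replacements` and `restore` are made up for the example, and it assumes that two-digit hex replacements such as `<fc>` are Windows-1252 bytes and that `<U+....>` replacements are Unicode code points. A two-digit pattern can also match genuine markup (for example an HTML `<dd>` tag), so any hits would need checking by eye before anything is restored.

```python
import re

# Hypothetical pattern: <U+00A0>-style code points (group 1)
# or <fc>-style two-digit hex bytes (group 2).
PATTERN = re.compile(r"<(?:U\+([0-9A-Fa-f]{4,6})|([0-9A-Fa-f]{2}))>")


def find_replacements(text):
    """Return every angle-bracket replacement found in text, in order."""
    return [m.group(0) for m in PATTERN.finditer(text)]


def restore(text, byte_encoding="cp1252"):
    """Swap each replacement for the character it appears to stand for.

    Assumption: two-digit hex values are single bytes in byte_encoding
    (Windows-1252 by default). Bytes that are undefined in that encoding
    come back as U+FFFD rather than raising an error.
    """
    def _sub(match):
        code_point, hex_byte = match.group(1), match.group(2)
        if code_point:
            return chr(int(code_point, 16))
        return bytes([int(hex_byte, 16)]).decode(byte_encoding, errors="replace")

    return PATTERN.sub(_sub, text)


if __name__ == "__main__":
    sample = "M<fc>ller | Laevicardium<U+00A0>"
    print(find_replacements(sample))
    print(repr(restore(sample)))
```

Note that `restore` converts a code-point replacement literally, so a control character written as `<U+....>` comes back as that same control character; whether that is what the original data contained is a separate question.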

datafix.com.au