Remove invalid character from file

I was having trouble opening a .C file. It would open fine with nano but geany didn't like it.

file -i MAIN.C only reported MAIN.C: application/octet-stream; charset=binary, which is what file prints when it can't recognize the charset.

Detecting the actual charset

Using chardetect, the command-line tool that ships with Python's chardet:

chardetect MAIN.C 
MAIN.C: Windows-1254 with confidence 0.3362399919117238

A confidence of about 0.34 makes this a weak guess. Windows-1254 in itself shouldn't be a problem for geany, but let's try converting the file from Windows-1254 to UTF-8 anyway.
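One possible reason for such a low score is that only a handful of bytes fall outside plain ASCII, which leaves a detector little to go on. A minimal sketch for counting them follows; count_high_bytes is a hypothetical helper name, not part of chardet or file:

```shell
# Hypothetical helper: count the bytes in a file whose value is above 0x7F,
# i.e. the bytes that are not plain ASCII.
count_high_bytes() {
    # LC_ALL=C makes tr work on raw bytes; delete everything in 0x00-0x7F,
    # then count what is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Usage (file name taken from the example above):
# count_high_bytes MAIN.C
```

A small count suggests the file is mostly ASCII and that the charset guess rests on very little evidence.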

Converting the charset

With recode :

recode Windows-1254..UTF8 MAIN.C
recode: MAIN.C failed: Invalid entry in « CP1254..UTF-8 »

With iconv :

iconv -f Windows-1254 -t utf-8 -o OUT.C MAIN.C
iconv: illegal input sequence at position 198

So it turns out there's some rogue data at position 198 that prevents the file from being interpreted correctly as Windows-1254.
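Before deleting anything, it can help to look at the offending bytes. The sketch below hex-dumps a few bytes starting at a given offset with od. It assumes that the position iconv prints is a 0-based byte offset into the input; dump_at is a hypothetical helper name.

```shell
# Hypothetical helper: hex-dump COUNT bytes of FILE starting at byte OFFSET.
# Usage: dump_at FILE OFFSET [COUNT]
dump_at() {
    # -An   : do not print the address column
    # -tx1  : one hexadecimal byte per field
    # -j/-N : skip OFFSET bytes, then read at most COUNT bytes (default 16)
    od -An -tx1 -j "$2" -N "${3:-16}" "$1"
}

# Usage: show the 8 bytes starting where iconv gave up
# dump_at MAIN.C 198 8
```

If the first byte shown has no assignment in the charset being converted from, that would explain why both recode and iconv refused the file.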

Removing non-printable characters from the file

Using sed :

sed $'s/[^[:print:]\t]//g' MAIN.C  > OUT.C

did the trick.

file OUT.C
OUT.C: C source, Non-ISO extended-ASCII text, with LF, NEL line terminators
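Note that file still reports extended-ASCII text, so OUT.C contains non-ASCII bytes and has not been converted to UTF-8; only the bytes that sed did not consider printable were removed. What [:print:] matches also depends on the current locale. Two alternatives are sketched below under stated assumptions: a locale-independent filter that keeps only ASCII, and a conversion that lets iconv omit whatever it cannot convert. Both helper names are hypothetical, and the source charset is passed as an argument because the chardet guess above was uncertain.

```shell
# Hypothetical helper: keep only tab, LF, CR and printable ASCII (0x20-0x7E).
# Caution: this also drops legitimate non-ASCII letters.
strip_to_ascii() {
    # -c complements the set, -d deletes; LC_ALL=C makes tr work on raw bytes
    LC_ALL=C tr -cd '\11\12\15\40-\176' < "$1"
}

# Hypothetical helper: convert FILE from CHARSET to UTF-8, omitting characters
# that are invalid in the input charset (iconv's -c option).
# Usage: convert_dropping_invalid FILE CHARSET
convert_dropping_invalid() {
    iconv -c -f "$2" -t UTF-8 "$1"
}

# Usage:
# strip_to_ascii MAIN.C > OUT.C
# convert_dropping_invalid MAIN.C WINDOWS-1254 > OUT.C
```

The first variant is the blunter tool; the second keeps accented characters intact if the charset guess happens to be right.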

Sources

https://unix.stackexchange.com/questions/304177/convert-binary-encoding-that-head-and-notepad-can-read-to-utf-8#304178

https://stackoverflow.com/questions/43108359/how-to-remove-all-special-characters-in-linux-text