10 Normalise UTF-8 accented text
10.1 Background
Accented text can cause problems in R when the encoding marked on a string does not match the bytes it actually contains.
This method shows how to convert accented text from mixed encodings to UTF-8.
Problems like this often emerge with data shared between PC/Mac/Linux and with data exported from Excel. The simple solution is to set the correct encoding with the base R Encoding() function and then convert the text with as_utf8() from the utf8 package.
10.1.1 Libraries
library(utf8)
10.1.2 Dummy data
Here we define a vector x whose entries contain accents. To simulate the problem, we mark the encodings as a mix of UTF-8 and bytes, but the second entry is actually encoded in Latin-1, as the single byte 0xE7, rather than in UTF-8.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
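To see why the second entry is the odd one out, it can help to look at the raw bytes behind each string. The following is a minimal sketch using only base R; the variable name `bytes` is introduced here for illustration and is not part of the original example.

```r
# Same dummy vector as in the text; the marked encodings are deliberately mixed.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# charToRaw() only converts the first element of its argument,
# so apply it to each entry separately.
bytes <- lapply(x, charToRaw)
bytes

# Entries 1 and 3 store the accented letter as the two-byte UTF-8 sequence c3 a7.
# Entry 2 stores it as the single Latin-1 byte e7, which is not valid UTF-8 here.
```

Entries 1 and 3 contain identical bytes even though they are marked differently, while entry 2 is marked as UTF-8 but is one byte shorter.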
If we try to convert all entries in the vector to UTF-8, the conversion fails:
try(as_utf8(x))
Error in as_utf8(x) :
entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4
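In a longer vector it may not be practical to read error messages one entry at a time. As a rough sketch, base R's validUTF8() can flag every entry whose bytes are not valid UTF-8. It ignores the marked encoding and checks the bytes directly. The variable name `valid` is an assumption made for this illustration.

```r
# Same dummy vector as in the text.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# validUTF8() tests the underlying bytes, regardless of the marked encoding.
valid <- validUTF8(x)
valid          # TRUE FALSE TRUE

# Positions of the entries that need their encoding corrected.
which(!valid)  # 2
```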
The simple fix is to change the marked encoding to match the real data. Here the second entry is marked with its correct encoding, latin1, after which the whole vector can be converted to UTF-8:
Encoding(x[2]) <- "latin1"
as_utf8(x)
[1] "façile" "façile" "façile"
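When many entries are affected, the same idea can be wrapped in a small helper. The sketch below uses only base R and rests on a strong assumption: every entry that is not valid UTF-8 is taken to be Latin-1. The function name `latin1_to_utf8` is hypothetical, and data in another encoding (for example Windows-1252) would need a different `from` value.

```r
# Hypothetical helper: treat every entry that is not valid UTF-8 as Latin-1
# (an assumption, not something the data can confirm) and re-encode it.
latin1_to_utf8 <- function(x) {
  bad <- !validUTF8(x)
  # Re-encode the invalid entries from Latin-1 to UTF-8.
  x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")
  # The remaining entries already hold valid UTF-8 bytes; mark them as such.
  Encoding(x[!bad]) <- "UTF-8"
  x
}

# Same dummy vector as in the text.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

fixed <- latin1_to_utf8(x)
fixed
```

This replaces the manual step of picking out entry 2 by hand, but it is only as reliable as the Latin-1 assumption, so it is worth checking a sample of the converted entries by eye.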