I recommend using UTF-8 for strings. The only disadvantage is the lack of random access to individual characters; you can still iterate over individual codepoints. A nice feature of UTF-8 is that it is mostly compatible with regular zero-terminated char* strings: as long as a function doesn't try to split the string or modify it in the wrong place, you can pass UTF-8 to char* functions like strcpy, strcat and similar.
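For example, here is a small sketch of that (the hex escapes are simply the UTF-8 bytes of 'ü' and 'ß' written out by hand, so the example doesn't depend on the encoding of the source file):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        // "Gr\xC3\xBC" is "Grü": 0xC3 0xBC is the UTF-8 encoding of U+00FC
        char greeting[32] = "Gr\xC3\xBC";

        // append "ße!": 0xC3 0x9F encodes U+00DF; the literal is split so the
        // hex escape doesn't swallow the following 'e'
        strcat(greeting, "\xC3\x9F" "e!");

        // strcat and strlen only ever see bytes: this prints 8 bytes for 6 characters
        printf("%s (%zu bytes)\n", greeting, strlen(greeting));
        return 0;
    }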
To work with UTF-8 you'll need, at a minimum, two small functions: one that decodes the next Unicode codepoint from a byte array, and one that encodes a Unicode codepoint into a byte array. You can easily write them from the information on the Wikipedia page about UTF-8.
As an example implementation, you can take a look at these functions:
u8_toucs - decodes the next Unicode codepoint:
https://github.com/JeffBezanson/cutef8/blob/master/utf8.c#L89
u8_wc_toutf8 - encodes a Unicode codepoint to bytes:
https://github.com/JeffBezanson/cutef8/blob/master/utf8.c#L171
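If you want to avoid even that dependency, the encoding direction is small enough to write yourself straight from the bit patterns described on Wikipedia. A minimal sketch (the function name and the return-0-on-error convention are mine, not part of cutef8):

    #include <stdint.h>

    // Encode one Unicode codepoint into buf (which must hold at least 4 bytes).
    // Returns the number of bytes written, or 0 if cp is not a valid codepoint.
    static int encode_codepoint(uint32_t cp, unsigned char *buf)
    {
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            if (cp >= 0xD800 && cp <= 0xDFFF) return 0; // surrogates are not valid codepoints
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                              // above U+10FFFF: not a codepoint
    }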
Basic string processing that operates on each character with this cutef8 library will look something like this:
    char* string = ...;
    size_t string_bytes = ...; // how many *bytes* (not characters) are available in string

    while (string_bytes != 0)
    {
        uint32_t unicode;
        if (u8_toucs(&unicode, 1, string, string_bytes) == 0 || unicode == 0xFFFD)
        {
            error("broken utf-8 encoding");
        }

        // do something with unicode codepoint here
        // ...

        // advance past the bytes this codepoint occupies
        string += u8_charlen(unicode);
        string_bytes -= u8_charlen(unicode);
    }
The Unicode codepoint in this case will be a 32-bit integer. Instead of putting '£' directly in your source code, you should use its Unicode value, 163.
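For example, inside the loop above you could recognise it like this (0xA3 is just 163 written in hex):

    if (unicode == 0xA3) // U+00A3 POUND SIGN, i.e. '£'
    {
        // handle the pound sign
    }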
For more feature-complete C libraries, take a look at these:
* https://github.com/JuliaLang/utf8proc/
* https://bitbucket.org/alekseyt/nunicode/
They will allow you, for example, to change the case of letters and to classify Unicode characters - is it an uppercase letter, is it a number, etc.
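As a rough illustration, here is what classification and case mapping look like with utf8proc. This is only a sketch - check utf8proc.h for the exact signatures of utf8proc_iterate, utf8proc_category and utf8proc_tolower, and link with -lutf8proc:

    #include <stdio.h>
    #include <string.h>
    #include <utf8proc.h>

    int main(void)
    {
        const char *text = "Abc123";  // any UTF-8 text
        const utf8proc_uint8_t *p = (const utf8proc_uint8_t *)text;
        utf8proc_ssize_t remaining = (utf8proc_ssize_t)strlen(text);

        while (remaining > 0) {
            utf8proc_int32_t cp;
            utf8proc_ssize_t len = utf8proc_iterate(p, remaining, &cp); // decode next codepoint
            if (len < 0) {
                fprintf(stderr, "broken utf-8 encoding\n");
                return 1;
            }

            utf8proc_category_t cat = utf8proc_category(cp); // Unicode general category
            printf("U+%04X uppercase-letter=%d digit=%d lowercase-form=U+%04X\n",
                   (unsigned)cp,
                   cat == UTF8PROC_CATEGORY_LU,   // "Lu" = uppercase letter
                   cat == UTF8PROC_CATEGORY_ND,   // "Nd" = decimal digit
                   (unsigned)utf8proc_tolower(cp));

            p += len;
            remaining -= len;
        }
        return 0;
    }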