Recommendations on handling unicode in C

Charlie Turner

#11399

March 20, 2017

Hey Handmaders,

I wrote `if (Char == '£') { ... }` in my C file and got the `character too large for enclosing character literal type` error. Uh oh. It's in the upper 128 values of whatever codepage I'm using (and hence not standard), and I figured I need to learn and accept characters outside the standard ASCII range.

I had a look on the Unicode website, and basically I feel overwhelmed by it. I glanced through the ICU library and just felt it was so crazy. Can anyone provide some guidance on how they handle things like this in a way they consider pleasing?

Sorry for the vagueness of the questions, just very much in the unknown unknowns stage at the moment :)

Edited by Charlie Turner on March 20, 2017, 5:56pm

Mārtiņš Možeiko

#11400

March 20, 2017

I recommend using UTF-8 for strings. The only disadvantage is the lack of random access to individual characters, but you will still be able to iterate over individual codepoints. A good feature of UTF-8 is that it is mostly compatible with regular zero-terminated char* strings: as long as a function doesn't try to split the string or modify it in the wrong way, you can pass UTF-8 to char* functions like strcpy, strcat and similar.
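To illustrate the point about byte compatibility, here is a minimal sketch. The helper name utf8_count_codepoints is made up for this example (it is not part of any library), and it assumes the input is already valid UTF-8. It counts codepoints by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from any library: counts the codepoints in a
   zero-terminated UTF-8 string. Assumes the string is valid UTF-8.
   Every byte that is NOT a continuation byte (10xxxxxx) starts a new
   codepoint, so counting those bytes counts the codepoints. */
static size_t utf8_count_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "\xC2\xA3" "5" (the pound sign followed by the digit 5), strlen reports 3 because it counts bytes, while this helper reports 2. strcpy and strcat work unchanged on such a string, since the zero terminator never appears inside a multi-byte sequence.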

To work with UTF-8 you'll need, at a minimum, two small functions: one that decodes the next Unicode codepoint from a byte array, and another that encodes a Unicode codepoint to a byte array. You can easily write them from the information about UTF-8 on Wikipedia.
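As a rough idea of how small these two functions can be, here is a sketch written directly from the UTF-8 bit layout. The names utf8_decode and utf8_encode and their signatures are made up for this example; they are not the cutef8 API. The decoder rejects truncated input, overlong forms, surrogates and values above U+10FFFF by returning 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: reads one codepoint from s (n bytes available)
   into *out. Returns bytes consumed, or 0 on invalid or truncated input. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len, i;

    if (n == 0) return 0;

    /* The lead byte gives the sequence length and the top payload bits. */
    if (s[0] < 0x80)                { cp = s[0];        len = 1; min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* sequence is cut off */

    /* Each continuation byte must be 10xxxxxx and adds 6 payload bits. */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min) return 0;                     /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogate */

    *out = cp;
    return len;
}

/* Hypothetical encoder: writes cp as UTF-8 into out (room for 4 bytes).
   Returns bytes written, or 0 if cp is not a valid codepoint. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With these, encoding codepoint 163 (the pound sign) yields the two bytes 0xC2 0xA3, and decoding those two bytes gives 163 back.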

As an example implementation, you can take a look at these functions:
u8_toucs - gets next unicode codepoint: https://github.com/JeffBezanson/cutef8/blob/master/utf8.c#L89
u8_wc_toutf8 - encodes unicode codepoint to bytes: https://github.com/JeffBezanson/cutef8/blob/master/utf8.c#L171

Basic string processing that operates on each character with this cutef8 library will look something like this:

char* string = ...;
size_t string_bytes = ...; // how many *bytes* (not characters) available in string
while (string_bytes != 0)
{
  uint32_t unicode;
  if (u8_toucs(&unicode, 1, string, string_bytes) == 0 || unicode == 0xFFFD)
  {
    error("broken utf-8 encoding");
    break; // if error() returns, stop: the length of the bad sequence is unknown
  }

  // do something with unicode codepoint here
  // ...

  string += u8_charlen(unicode);
  string_bytes -= u8_charlen(unicode);
}

The Unicode codepoint in this case will be a 32-bit integer. Instead of putting '£' in your code, you should use its Unicode value: 163 (0xA3, written U+00A3 in Unicode notation).
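Applied to the original `if (Char == '£')` check, that could look something like the sketch below. The constant and function names are made up for this example, and it assumes the value being tested is an already-decoded 32-bit codepoint, not a raw byte from the string.

```c
#include <stdint.h>

/* Hypothetical name for illustration: U+00A3 POUND SIGN, decimal 163. */
#define CODEPOINT_POUND_SIGN 0xA3u

/* Returns nonzero if the decoded codepoint is the pound sign.
   The argument must be a decoded codepoint: in UTF-8 the pound sign is
   the two bytes 0xC2 0xA3, so comparing a single byte would not work. */
static int is_pound_sign(uint32_t codepoint)
{
    return codepoint == CODEPOINT_POUND_SIGN;
}
```

Inside the decoding loop this becomes `if (is_pound_sign(unicode)) { ... }`, and the source file stays pure ASCII, so the compiler's codepage no longer matters.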

For more feature-complete C libraries, look at these:
* https://github.com/JuliaLang/utf8proc/
* https://bitbucket.org/alekseyt/nunicode/
They will let you, for example, change the case of letters or classify Unicode characters (is it an uppercase letter, is it a number, etc.).

Edited by Mārtiņš Možeiko on March 20, 2017, 8:01pm

Charlie Turner

#11403

March 20, 2017

Ta very much, that's what I was looking for :)