Unicode vs Ansi vs UTF8

Finalspace
At the moment, FPL supports ANSI strings in every function that takes a path or a string buffer, but some functions come in two versions: one for ANSI and one for widechar. I don't like having two different versions of each function, so I want you guys to help me settle on one solution once and for all.

If you are on a Win32 platform, you most likely want to use widechar for all your paths, especially when you write applications that deal with any kind of media.

On the other hand, on a Linux or Unix-based platform you don't want to deal with widechar at all and would rather work with UTF-8 strings only.

Also, sometimes you simply don't care and normal ANSI strings are just fine, so you don't want to deal with any of that shit.

This boils down to two kinds of strings: UTF-8 (char) and what Windows calls Unicode (wchar_t, which is UTF-16 on Windows).

So the options are:

1.) Leave it as it is and provide an "Ansi" and a "Wide" version of every function that uses paths or string buffers. On *nix, the ANSI version would always be expected to be UTF-8.

or

2.) Remove all wide functions and use UTF-8 everywhere (that's the way SDL handles it)

or

3.) Make a string buffer union that supports both and change every parameter to it

Something like this:

typedef enum fplCharBufferType {
  fplCharBufferType_AnsiOrUTF8 = 0,
  fplCharBufferType_Unicode,
} fplCharBufferType;

typedef struct fplCharBuffer {
  //! The union only selects between the first and the second option (the size is always the same)
  union {
    char *ansiData;
    wchar_t *wideData;
  };
  //! Only used for output buffers
  size_t capacity;
  //! Which type
  fplCharBufferType type;
} fplCharBuffer;

typedef fplCharBuffer fplPathString;

// Old version
fpl_platform_api bool fplAnsiDirectoryExists(const char *ansiPath);
fpl_platform_api bool fplWideDirectoryExists(const wchar_t *widePath);
fpl_common_api char *fplEnforceAnsiPathSeparatorLen(char *ansiPath);
fpl_common_api wchar_t *fplEnforceWidePathSeparatorLen(wchar_t *widePath);

// New version
fpl_platform_api bool fplDirectoryExists(const fplPathString *path);
fpl_common_api char *fplEnforcePathSeparatorLen(fplPathString *path);


Seems like a pain...

or

4.) Don't distinguish between ANSI and wide in the API and just have something like this:

#if defined(FPL_UNICODE)
//! Wide character (a 16-bit UTF-16 code unit on Windows)
typedef wchar_t fpl_char;
#else
//! UTF-8 or ansi character
typedef char fpl_char;
#endif

fpl_platform_api bool fplDirectoryExists(const fpl_char *path);
fpl_common_api fpl_char *fplEnforcePathSeparatorLen(fpl_char *path, size_t maxPathLen);


I would prefer the last solution. What do you think?

Of course, I would not remove string functions like fplGetAnsiStringLength or fplGetWideStringLength. But I would add an additional function that takes fpl_char and calls either the ANSI or the wide function.
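
To make that last idea concrete, here is a minimal sketch of how such a generic length function could dispatch at compile time. All names with the example_ prefix are hypothetical stand-ins and not part of FPL; the two specific functions only mimic what fplGetAnsiStringLength and fplGetWideStringLength are assumed to do.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical names (example_ prefix); none of this is FPL's actual API. */
#if defined(FPL_UNICODE)
typedef wchar_t example_char;
#else
typedef char example_char;
#endif

/* Stand-in for an ANSI/UTF-8 length function: counts char units, not characters. */
size_t example_ansi_string_length(const char *str) {
  return str != NULL ? strlen(str) : 0;
}

/* Stand-in for a wide length function: counts wchar_t units. */
size_t example_wide_string_length(const wchar_t *str) {
  return str != NULL ? wcslen(str) : 0;
}

/* Generic version: forwards to the matching function at compile time. */
size_t example_string_length(const example_char *str) {
#if defined(FPL_UNICODE)
  return example_wide_string_length(str);
#else
  return example_ansi_string_length(str);
#endif
}
```

Note that both variants return the number of code units, so for UTF-8 input the result is a byte count rather than a count of visible characters.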

Comments

I don't have a lot of experience with this, but as a user I would prefer to use UTF-8 everywhere and have your API convert strings internally if the OS API needs it. Supporting two sets of functions is, in my opinion, more error-prone for both the user and the dev. And a Linux user shouldn't have to worry about this "issue" caused by Windows.
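
A rough sketch of what "UTF-8 in the API, convert internally" could look like for a single function. The name example_directory_exists is made up for illustration and this is not how FPL implements it; the fixed MAX_PATH buffer is a simplifying assumption, so longer paths simply fail here.

```c
#include <stdbool.h>
#include <stddef.h>

#if defined(_WIN32)
#  include <windows.h>
#else
#  include <sys/stat.h>
#endif

/* Hypothetical function: always takes a UTF-8 path, on every platform. */
bool example_directory_exists(const char *utf8Path) {
  if (utf8Path == NULL) {
    return false;
  }
#if defined(_WIN32)
  /* Windows: convert UTF-8 to wide and call the W function. */
  wchar_t widePath[MAX_PATH];
  int count = MultiByteToWideChar(CP_UTF8, 0, utf8Path, -1, widePath, MAX_PATH);
  if (count == 0) {
    return false; /* conversion failed or the path did not fit */
  }
  DWORD attributes = GetFileAttributesW(widePath);
  return attributes != INVALID_FILE_ATTRIBUTES &&
         (attributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
#else
  /* POSIX: the path is passed through unchanged. */
  struct stat info;
  return stat(utf8Path, &info) == 0 && S_ISDIR(info.st_mode);
#endif
}
```

The caller never sees wchar_t; only the Windows branch pays for the conversion.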

If you end up with 2 versions of each function, please make sure the wide/ansi specifier is consistent:
/* This */
fpl_platform_api bool fplDirectoryExistsAnsi(const char *ansiPath);
fpl_platform_api bool fplDirectoryExistsWide(const wchar_t *widePath);
fpl_common_api char *fplEnforcePathSeparatorLenAnsi(char *ansiPath);
fpl_common_api wchar_t *fplEnforcePathSeparatorLenWide(wchar_t *widePath);
/* instead of */
fpl_platform_api bool fplAnsiDirectoryExists(const char *ansiPath);
fpl_platform_api bool fplWideDirectoryExists(const wchar_t *widePath);
fpl_common_api char *fplEnforceAnsiPathSeparatorLen(char *ansiPath);
fpl_common_api wchar_t *fplEnforceWidePathSeparatorLen(wchar_t *widePath);

I would go with a combination of 1 and 2.

At a minimum, you should do point 2: functions should use UTF-8 everywhere by default. That will cover 99.9% of use cases.

To make it easier for people to interoperate with your API from legacy apps, on the Windows platform (#ifdef WIN32 ... #endif) there should be functions that accept A and W strings and pass them to the A and W functions directly. This is only for compatibility reasons (if you care). For new code, people should just use the UTF-8 functions. And this is only needed on Windows; there is no need for wchar_t/W functions on other platforms.
Hello,

I vote UTF-8 only.

Windows has had the function MultiByteToWideChar() since Windows 2000; it can convert a UTF-8 string (or ANSI and a few other encodings, so this might not be a good point for UTF-8's cause...) to "Wide".

And I believe ANSI should be forgotten by any decent developer who wants to write an international application.

Note that since the Windows 10 April update, the *A functions can accept UTF-8 if the user has enabled this in the regional settings. GetACP() will then return CP_UTF8. ANSI is not so ANSI anymore in the Windows API.
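
For anyone who wants to check this at runtime, a tiny sketch of the test. The function name is hypothetical, and the non-Windows branch simply assumes that narrow strings are UTF-8, which is a convention rather than something the OS guarantees.

```c
#include <stdbool.h>

#if defined(_WIN32)
#  include <windows.h>
#endif

/* Hypothetical helper: reports whether narrow (char) strings passed to the
   OS can be treated as UTF-8. */
bool example_narrow_api_is_utf8(void) {
#if defined(_WIN32)
  /* True only when the active ANSI code page is UTF-8 (65001). */
  return GetACP() == CP_UTF8;
#else
  /* Assumption: on non-Windows platforms narrow strings are UTF-8. */
  return true;
#endif
}
```

Since the setting is under the user's control, a library cannot rely on it being enabled, so converting to wide and calling the W functions remains the safe path on Windows.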
Okay, so I think it's settled then.

I will do a transition to UTF-8 only, but leave the functions for converting Wide <-> UTF-8.

So this (current style):

// FPL supports wide/UTF8(ansi) strings

fplAnsiStringToWideString()
fplUTF8StringToWideString()

fplWideStringToUTF8String()
fplWideStringToAnsiString()

fplCopyAnsiString()
fplCopyWideString()

fplOpenAnsiBinaryFile()
fplOpenWideBinaryFile()

fplEnforceAnsiPathSeparatorLen()
fplEnforceWidePathSeparatorLen()


would become (new style):

// FPL supports only UTF-8 strings, but there are functions for converting them from Wide <-> UTF-8

// fplAnsiStringToWideString() -> Removed
fplUTF8StringToWideString()

fplWideStringToUTF8String()
// fplWideStringToAnsiString() -> Removed

fplCopyString()
// fplCopyWideString() -> Removed

fplOpenBinaryFile()
// fplOpenWideBinaryFile() -> Removed

fplEnforcePathSeparatorLen(const char *path)
// fplEnforceWidePathSeparatorLen() -> Removed
The transition from Ansi/Wide strings to UTF-8 only is now complete. All strings in the API are now expected to be UTF-8 and are converted internally to the native format the OS requires. The existing conversion functions for UTF-8 <-> WideString are left untouched because some libraries may require wchar_t. All other wide functions have been removed entirely.
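
For readers curious what the UTF-8 to wide step involves, here is a small portable sketch of a UTF-8 to UTF-16 conversion. It is not FPL's converter: the function name is hypothetical, the output uses uint16_t instead of wchar_t so that it behaves the same on every platform, and overlong encodings are not rejected to keep it short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts a null-terminated UTF-8 string to UTF-16.
   Returns the number of code units written (terminator excluded), or
   (size_t)-1 on malformed input or when the output buffer is too small. */
size_t example_utf8_to_utf16(const char *utf8, uint16_t *out, size_t outCapacity) {
  const unsigned char *p = (const unsigned char *)utf8;
  size_t written = 0;
  if (outCapacity == 0) {
    return (size_t)-1; /* no room even for the terminator */
  }
  while (*p != 0) {
    uint32_t cp;
    int extra;
    /* The lead byte determines how many continuation bytes follow. */
    if (*p < 0x80)                { cp = *p;        extra = 0; }
    else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
    else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
    else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
    else return (size_t)-1; /* stray continuation or invalid lead byte */
    ++p;
    for (int i = 0; i < extra; ++i, ++p) {
      /* Also catches a string that ends in the middle of a sequence. */
      if ((*p & 0xC0) != 0x80) return (size_t)-1;
      cp = (cp << 6) | (uint32_t)(*p & 0x3F);
    }
    /* Surrogates and values above U+10FFFF are not valid code points. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return (size_t)-1;
    size_t needed = (cp >= 0x10000) ? 2 : 1;
    /* Always keep one slot free for the terminator. */
    if (written + needed >= outCapacity) return (size_t)-1;
    if (cp >= 0x10000) {
      /* Code points beyond the BMP become a surrogate pair. */
      cp -= 0x10000;
      out[written++] = (uint16_t)(0xD800 + (cp >> 10));
      out[written++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    } else {
      out[written++] = (uint16_t)cp;
    }
  }
  out[written] = 0;
  return written;
}
```

On Windows the same job is normally done by MultiByteToWideChar with CP_UTF8; a hand-written version like this is mainly useful where no OS helper is available.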