feat(string): add UTF-8 string conversion and validation functions by bobtista · Pull Request #2528 · TheSuperHackers/GeneralsGameCode

bobtista · 2026-04-03T00:07:30Z

Relates to refactor(string): Add functions for handling UTF8 encoded strings #2045

Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.

Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: `WWLib/utf8.h` / `utf8.cpp`

Utf8_Num_Bytes(char lead) — byte count of a UTF-8 character from its lead byte
Utf8_Trailing_Invalid_Bytes(const char* str, size_t length) — count of invalid trailing bytes due to an incomplete multi-byte sequence
Utf8_Validate(const char* str) / Utf8_Validate(const char* str, size_t length) — returns true if the string is valid UTF-8
Get_Utf8_Size(const wchar_t* src) / Get_Wchar_Size(const char* src) — required buffer sizes including null terminator
Wchar_To_Utf8(char* dest, const wchar_t* src, size_t dest_size)
Utf8_To_Wchar(wchar_t* dest, const char* src, size_t dest_size)

Naming follows the Snake_Case convention used in WWVegas. Arguments are ordered dest, src matching memcpy convention. Implementation wraps Win32 WideCharToMultiByte / MultiByteToWideChar.

`AsciiString::translate` / `UnicodeString::translate`

Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

`ThreadUtils.cpp`

Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions. Also removes the manual dest[len] = 0 null terminator write, which was writing at the wrong position for multi-byte UTF-8 input.

greptile-apps · 2026-04-03T00:12:52Z

Greptile Summary

Replaces the long-standing 7-bit ASCII @todo implementations in AsciiString::translate and UnicodeString::translate with proper UTF-8 ↔ UTF-16 conversion backed by a new WWLib/utf8.h/utf8.cpp library, and refactors ThreadUtils.cpp to use the same wrappers instead of raw Win32 calls. The overlong-encoding issue raised in the prior review has been addressed; two smaller gaps remain:

utf8.cpp is added to the unconditional WWLIB_SRC list even though its entire implementation is #ifdef _WIN32 guarded with #error — other Win32-only sources live in the if(WIN32) block and this should too.
Utf8_Validate still accepts surrogate code points (U+D800–U+DFFF) encoded as 3-byte sequences; a single additional boundary check would close the gap per RFC 3629.

Confidence Score: 4/5

Safe to merge for Windows builds; a build-system inconsistency would break non-Windows compilation.

The core logic is correct and the overlong-encoding fix from the previous review cycle is properly implemented. The CMakeLists placement of utf8.cpp outside the if(WIN32) guard is a P1 defect that would fail non-Windows builds. The missing surrogate check is P2. Neither blocks the Windows target today, but the CMakeLists issue should be resolved before merging to avoid breaking any cross-platform build paths.

Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt (utf8.cpp placement) and utf8.cpp (surrogate rejection)

Important Files Changed

Filename	Overview
Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt	utf8.cpp added unconditionally to WWLIB_SRC, but its entire implementation is #ifdef _WIN32 guarded with #error on other platforms; other Win32-only sources are properly under if(WIN32).
Core/Libraries/Source/WWVegas/WWLib/utf8.cpp	New UTF-8 utility library; overlong encoding rejection now implemented, but surrogate range (U+D800–U+DFFF) not rejected by Utf8_Validate; Win32-only via #ifdef guard.
Core/Libraries/Source/WWVegas/WWLib/utf8.h	New header with correct #pragma once guard; function declarations match implementations; copyright and naming convention look correct.
Core/GameEngine/Source/Common/System/AsciiString.cpp	translate() now uses Get_Utf8_Len + Unicode_To_Utf8 correctly; null terminator fix to ensureUniqueBufferOfSize is correct; empty-string early-return path is safe.
Core/GameEngine/Source/Common/System/UnicodeString.cpp	translate() mirrors AsciiString changes correctly using Utf8_To_Unicode; buffer sizing, null termination, and failure handling are all handled properly.
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp	Replaced raw Win32 calls with WWLib wrappers; std::wstring/std::string resize() ensures null termination even in exact-fit calls; failure path in Utf8_To_Unicode sets dest[0]=L'\0' which is handled correctly.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant AsciiString
    participant UnicodeString
    participant utf8lib as WWLib/utf8
    participant Win32 as Win32 API

    Note over Caller,Win32: AsciiString::translate (UnicodeString → UTF-8)
    Caller->>AsciiString: translate(unicodeStr)
    AsciiString->>utf8lib: Get_Utf8_Len(src, srcLen)
    utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, size query)
    Win32-->>utf8lib: byte count
    utf8lib-->>AsciiString: len
    AsciiString->>AsciiString: ensureUniqueBufferOfSize(len+1)
    AsciiString->>utf8lib: Unicode_To_Utf8(buf, len+1, src, srcLen)
    utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, convert)
    Win32-->>utf8lib: written bytes
    utf8lib-->>AsciiString: written (or 0 on failure)

    Note over Caller,Win32: UnicodeString::translate (UTF-8 → UnicodeString)
    Caller->>UnicodeString: translate(asciiStr)
    UnicodeString->>utf8lib: Get_Unicode_Len(src, srcLen)
    utf8lib->>Win32: MultiByteToWideChar(CP_UTF8, size query)
    Win32-->>utf8lib: wchar count
    utf8lib-->>UnicodeString: len
    UnicodeString->>UnicodeString: ensureUniqueBufferOfSize(len+1)
    UnicodeString->>utf8lib: Utf8_To_Unicode(buf, len+1, src, srcLen)
    utf8lib->>Win32: MultiByteToWideChar(CP_UTF8, convert)
    Win32-->>utf8lib: written chars
    utf8lib-->>UnicodeString: written (or 0 on failure)

Prompt To Fix All With AI

This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt
Line: 136-137

Comment:
**`utf8.cpp` unconditionally listed outside the `if(WIN32)` block**

`utf8.cpp`'s entire implementation is wrapped in `#ifdef _WIN32 … #else #error "Not implemented" #endif`, and the callers (`AsciiString.cpp`, `UnicodeString.cpp`, `ThreadUtils.cpp`) now unconditionally include `utf8.h` and call its functions. Any non-Windows build will hit the `#error` immediately. Other Win32-specific sources (`registry.cpp`, `verchk.cpp`, etc.) are correctly placed inside the `if(WIN32)` block below — `utf8.cpp` should follow the same pattern.

```cmake
# Remove from unconditional WWLIB_SRC and add to the existing if(WIN32) block:
if(WIN32)
    list(APPEND WWLIB_SRC
        ...
        utf8.cpp
        utf8.h
        ...
    )
endif()
```

A proper cross-platform implementation (or `#ifdef` guards in the callers) would be needed before removing the `#error`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 67-98

Comment:
**`Utf8_Validate` does not reject surrogate code points**

The overlong-encoding checks added in this revision cover the cases the previous reviewer raised, but RFC 3629 also forbids surrogate code points (U+D800–U+DFFF) in UTF-8. They encode as `0xED 0xA0 0x80` – `0xED 0xBF 0xBF`: a 3-byte sequence whose lead byte is `0xED` and whose second byte falls in `0xA0–0xBF`. The current loop passes these through because the existing guard only catches the overlong case at `0xE0`.

On Windows, `MultiByteToWideChar(CP_UTF8, …)` will reject such input and return 0, so the downstream conversion functions degrade gracefully, but the validator returns `true` for strings that will fail conversion — which is misleading for any caller that uses `Utf8_Validate` as an acceptance gate before invoking `Utf8_To_Unicode`.

Adding one check after the existing 3-byte overlong guard is sufficient:

```cpp
if (bytes == 3 && s[i] == 0xED && s[i + 1] >= 0xA0)
    return false;
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (9): Last reviewed commit: "refactor(utf8): Update callers to use ne..." | Re-trigger Greptile}

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

bobtista · 2026-04-03T02:32:50Z

Fixed the if formatting
added RFC 3629 overlong and out-of-range checks
RE the theoretical memory leak, can that even happen here? set() allocates via the engine's custom memory allocator which crashes on failure rather than throwing, so the leak path can't really be reached right?

Core/Libraries/Source/WWVegas/WWLib/utf8.h

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

Core/GameEngine/Source/GameNetwork/GameInfo.cpp

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

Core/GameEngine/Source/Common/System/UnicodeString.cpp

Core/GameEngine/Source/Common/System/AsciiString.cpp

xezon · 2026-04-03T10:02:10Z

Core/GameEngine/Source/GameNetwork/GameInfo.cpp

 								DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting"));
 								break;
 							}
+							// TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name


This appears to be beyond the scope of this change. It is not describes in the title. Perhaps is a separate change?

Ok reverted

Core/GameEngine/Source/Common/System/AsciiString.cpp

Core/GameEngine/Source/GameNetwork/GameInfo.cpp

xezon

Get_Utf8_Size should not include the null terminator in its size.

Core/GameEngine/Source/Common/System/AsciiString.cpp

Core/Libraries/Source/WWVegas/WWLib/utf8.h

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

xezon · 2026-04-05T12:17:27Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

 	if (dest_size == 0)
-		return;
+		return false;
 	int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size);


What happens if dest_size does not have enough room for a null terminator?

The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?

Maybe make it behave like strncpy? Writes null if there is room, otherwise not.

The only issue with the current interface then is that we will not know if it wrote the null terminator. Maybe it should return size_t instead, returning the number of characters it writes or would like to write? MultiByteToWideChar also does that.

I suggest to think this through and design the function interface in a way that it can be conveniently be used for fixed size strings (std::string, AsciiString) and large throwaway buffers (char arr[512]).

The behavior definitely needs to be documented.

Core/GameEngine/Source/Common/System/UnicodeString.cpp

Core/GameEngine/Source/Common/System/AsciiString.cpp

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

xezon · 2026-04-06T18:52:25Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

+	return result != 0;
+}
+
+bool Utf8_To_Unicode(wchar_t* dest, const char* src, size_t srcLen, size_t dest_size)


Is the naming choice for src len and dest size deliberate?

no, using camelCase now

What I meant is one says len and the other says size. Is this intended? I would expect to use len or size consistently, unless one refers to number of characters and the other to number of bytes.

xezon · 2026-04-06T19:57:25Z

The diff now shows unrelated changes.

bobtista · 2026-04-06T20:04:14Z

The diff now shows unrelated changes.

Try again cleaned up the commits and force pushed

Mauller · 2026-04-07T07:35:06Z

Core/GameEngine/Source/Common/System/AsciiString.cpp

+	}
+	ensureUniqueBufferOfSize((Int)size + 1, false, nullptr, nullptr);
+	char* buf = peek();
+	if (!Unicode_To_Utf8(buf, src, srcLen, size))


So is this translating UTF16LE from windows into UTF8 that is then stored within AsciiString?

If so this may help with the paths issue with usernames and paths not using Latin characters, but file handling functions will need updating to use unicode variants instead of Ascii.

Mauller · 2026-04-07T07:47:21Z

I wonder if we should also add a flag to state that the Ascii string is holding a UTF8 string?

I guess all normal ascii characters will display properly, it's just extended character sets that will look garbled.

…ming, null-terminate when room

greptile-apps · 2026-04-07T18:49:33Z

Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt

+    utf8.cpp
+    utf8.h


utf8.cpp unconditionally listed outside the if(WIN32) block

utf8.cpp's entire implementation is wrapped in #ifdef _WIN32 … #else #error "Not implemented" #endif, and the callers (AsciiString.cpp, UnicodeString.cpp, ThreadUtils.cpp) now unconditionally include utf8.h and call its functions. Any non-Windows build will hit the #error immediately. Other Win32-specific sources (registry.cpp, verchk.cpp, etc.) are correctly placed inside the if(WIN32) block below — utf8.cpp should follow the same pattern.

# Remove from unconditional WWLIB_SRC and add to the existing if(WIN32) block: if(WIN32) list(APPEND WWLIB_SRC ... utf8.cpp utf8.h ... ) endif()

A proper cross-platform implementation (or #ifdef guards in the callers) would be needed before removing the #error.

Prompt To Fix With AI

This is a comment left during a code review. Path: Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt Line: 136-137 Comment: **`utf8.cpp` unconditionally listed outside the `if(WIN32)` block** `utf8.cpp`'s entire implementation is wrapped in `#ifdef _WIN32 … #else #error "Not implemented" #endif`, and the callers (`AsciiString.cpp`, `UnicodeString.cpp`, `ThreadUtils.cpp`) now unconditionally include `utf8.h` and call its functions. Any non-Windows build will hit the `#error` immediately. Other Win32-specific sources (`registry.cpp`, `verchk.cpp`, etc.) are correctly placed inside the `if(WIN32)` block below — `utf8.cpp` should follow the same pattern. ```cmake # Remove from unconditional WWLIB_SRC and add to the existing if(WIN32) block: if(WIN32) list(APPEND WWLIB_SRC ... utf8.cpp utf8.h ... ) endif() ``` A proper cross-platform implementation (or `#ifdef` guards in the callers) would be needed before removing the `#error`. How can I resolve this? If you propose a fix, please make it concise.

I don't think we need to change this - cross-platform implementation is a todo and the #error is intentional as a placeholder

OmniBlade · 2026-04-08T08:56:40Z

Slight nitpick on function naming, shouldn't it be Utf16 rather than Wchar as Wchar is only Utf16 on windows yet we will likely want the conversion functions on other platforms as well for at least csf parsing if nothing else.

bobtista · 2026-04-08T14:20:32Z

Slight nitpick on function naming, shouldn't it be Utf16 rather than Wchar as Wchar is only Utf16 on windows yet we will likely want the conversion functions on other platforms as well for at least csf parsing if nothing else.

Yeah, but it's consistent with the other naming as it is for now. How about we keep the naming and using a uint16_t/char16_t type internally rather than wchar_t when we make non windows paths? Or would you rather we rename to something like Utf16_To_Utf8?

greptile-apps bot reviewed Apr 3, 2026

View reviewed changes

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp Show resolved Hide resolved

xezon reviewed Apr 3, 2026

View reviewed changes

xezon added Enhancement Is new feature or request Minor Severity: Minor < Major < Critical < Blocker labels Apr 3, 2026

greptile-apps bot reviewed Apr 3, 2026

View reviewed changes

Core/GameEngine/Source/GameNetwork/GameInfo.cpp Outdated Show resolved Hide resolved

xezon reviewed Apr 4, 2026

View reviewed changes

xezon reviewed Apr 5, 2026

View reviewed changes

xezon reviewed Apr 6, 2026

View reviewed changes

feat(utf8): add UTF-8 string conversion and validation functions

40393b8

bobtista force-pushed the bobtista/feat/utf8-string-functions branch from 39d7229 to 40393b8 Compare April 6, 2026 20:02

Mauller reviewed Apr 7, 2026

View reviewed changes

bobtista added 2 commits April 7, 2026 14:37

refactor(utf8): Return size_t from conversions, use consistent len na…

abb71f0

…ming, null-terminate when room

refactor(utf8): Update callers to use new conversion API

0c9074d

greptile-apps bot reviewed Apr 7, 2026

View reviewed changes

Conversation

bobtista commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

New: WWLib/utf8.h / utf8.cpp

AsciiString::translate / UnicodeString::translate

ThreadUtils.cpp

Uh oh!

greptile-apps bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

bobtista commented Apr 3, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

xezon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xezon commented Apr 6, 2026

Uh oh!

bobtista commented Apr 6, 2026

Uh oh!

Mauller Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Mauller commented Apr 7, 2026

Uh oh!

greptile-apps bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

OmniBlade commented Apr 8, 2026

Uh oh!

bobtista commented Apr 8, 2026

bobtista commented Apr 3, 2026 •

edited

Loading

New: `WWLib/utf8.h` / `utf8.cpp`

`AsciiString::translate` / `UnicodeString::translate`

`ThreadUtils.cpp`

greptile-apps bot commented Apr 3, 2026 •

edited

Loading

Mauller Apr 7, 2026 •

edited

Loading