Unicode Text Processing in C++ — Part 2: Grapheme Clusters and Performance

Scope: This part covers grapheme clusters — the final piece of correct Unicode text processing — backed by working implementations, benchmarks, and practical guidance on where to apply ICU.

TL;DR — Iterate grapheme clusters, not bytes or code points. If a human typed it, use ICU. The cost is 10×–16× on grapheme operations and grows with input size — apply it only where correctness requires it.

Grapheme clusters: what a "character" actually is#

A character in Unicode is not a byte, and not a code point. It is a grapheme cluster: the smallest unit that a human perceives as a single symbol.

The family emoji 👨‍👩‍👧‍👦 is one grapheme cluster made of seven code points and 25 UTF-8 bytes:

The Family Emoji

The ZWJ (Zero Width Joiner, U+200D) is an invisible character with no visual width that signals the renderer to join the adjacent emojis into one combined glyph. Without ZWJ the four emojis render separately; with it they combine. On renderers that don't support the sequence, ZWJ is safely ignored and the four individual emojis appear as fallback — that degradation is intentional. ZWJ occupies no pixels but does occupy bytes: each ZWJ character (U+200D) is 3 bytes in UTF-8.

The user sees one character. Any operation that splits at a byte or code-point boundary — substring, truncation, duplicate elimination — will produce incorrect results without grapheme awareness. Cutting between a ZWJ and the emoji that follows it leaves a dangling joiner that renders incorrectly.

The boundary rules that define what constitutes a grapheme cluster are in UAX #29 §3.1 — a set of rules (GB1–GB999) that specify exactly where cluster boundaries fall for every combination of code point properties.

ICU's BreakIterator walks grapheme cluster boundaries. Here is a deduplication function that removes duplicate grapheme clusters from a UTF-8 string in place:

void RemoveDupsUnicode(std::string &str)
{
    if (str.empty()) return;

    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(str);

    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, nullptr, nullptr, 0, &status);
    if (U_FAILURE(status) || bi == nullptr) throw std::runtime_error("BreakIterator open failed");
    // RAII guard: ensures ubrk_close runs on every exit path, including exceptions
    std::unique_ptr<UBreakIterator, decltype(&ubrk_close)> bi_guard(bi, ubrk_close);
    ubrk_setText(bi, reinterpret_cast<const UChar *>(ustr.getBuffer()), ustr.length(), &status);
    if (U_FAILURE(status)) throw std::runtime_error("BreakIterator setText failed");

    // Map UTF-16 code unit indices → UTF-8 byte offsets for in-place writes
    std::vector<size_t> utf16ToUtf8;
    utf16ToUtf8.reserve(ustr.length() + 1);
    size_t byteOffset = 0;
    for (int32_t i = 0; i < ustr.length(); ++i) {
        utf16ToUtf8.push_back(byteOffset);
        UChar32 cp = ustr.char32At(i);
        if      (cp <= 0x7F)   byteOffset += 1;
        else if (cp <= 0x7FF)  byteOffset += 2;
        else if (cp <= 0xFFFF) byteOffset += 3;
        else                   byteOffset += 4;
        if (U_IS_SUPPLEMENTARY(cp)) { ++i; utf16ToUtf8.push_back(byteOffset); }
    }
    utf16ToUtf8.push_back(byteOffset);

    std::unordered_set<std::string> seen;
    size_t write = 0;
    int32_t start = ubrk_first(bi);
    for (int32_t end = ubrk_next(bi); end != UBRK_DONE;
         start = end, end = ubrk_next(bi))
    {
        size_t byteStart = utf16ToUtf8[static_cast<size_t>(start)];
        size_t byteEnd   = utf16ToUtf8[static_cast<size_t>(end)];
        std::string key(str.data() + byteStart, byteEnd - byteStart);

        if (seen.insert(key).second) {
            if (write != byteStart)
                std::memmove(&str[write], key.data(), key.size());
            write += key.size();
        }
    }
    str.resize(write);
}

The complexity relative to the byte version illustrates why this matters architecturally. The byte version is a bitset and a single pass:

void RemoveDups(std::string &str)
{
    std::bitset<256> seen;
    size_t write = 0;
    for (size_t read = 0; read < str.size(); ++read) {
        unsigned char uc = static_cast<unsigned char>(str[read]);
        if (!seen[uc]) {
            str[write++] = str[read];
            seen[uc] = true;
        }
    }
    str.resize(write);
}

ICU operates in UTF-16. The world speaks UTF-8. The translation layer between them is where the complexity lives. The manual utf16ToUtf8 index table in RemoveDupsUnicode exists because BreakIterator reports cluster boundaries as UTF-16 code unit offsets, but the original string is stored as UTF-8 bytes that need to be written in place. A simpler implementation could call icu::UnicodeString::toUTF8String() at the end to convert the result back — but that requires building a separate output string rather than operating in place, which adds an allocation and a copy. The index mapping trades memory for the ability to memmove directly within the original buffer.

The Unicode version must: convert to UTF-16, build that offset index, run the grapheme state machine, and track seen clusters as heap-allocated strings. There is no way to make this as cheap as the byte version — the operations are genuinely different.

Benchmark Results#

Benchmarks use Google Benchmark, which runs iterations until results stabilize and accounts for measurement overhead. Functions that mutate their input use PauseTiming() / ResumeTiming() to exclude the per-iteration copy from measurements. The benchmark code:

static void BM_RemoveDups_ASCII(benchmark::State &state)
{
    const std::string src = "aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz";
    for (auto _ : state) {
        state.PauseTiming();
        std::string s = src;
        state.ResumeTiming();
        ch01::RemoveDups(s);
        benchmark::DoNotOptimize(s);
    }
}

static void BM_RemoveDups_Unicode(benchmark::State &state)
{
    // Each Greek letter appears twice — BreakIterator walks every grapheme cluster
    const std::string src = "ααββγγδδεεζζηηθθιικκλλμμννξξοοππρρσσττυυφφχχψψωω";
    for (auto _ : state) {
        state.PauseTiming();
        std::string s = src;
        state.ResumeTiming();
        ch01::RemoveDupsUnicode(s);
        benchmark::DoNotOptimize(s);
    }
}

Results (Apple M2 Pro, arm64, Apple clang 17.0.0, C++26, ICU 78.2, -O3 -DNDEBUG):

Benchmark Time Relative
RemoveDups ASCII — bitset, in-place (52 bytes) 617 ns
RemoveDups Unicode — BreakIterator, grapheme clusters (96 bytes) 6,522 ns 10.5×
RemoveDups ASCII — bitset, in-place (~1 KB) 1,255 ns
RemoveDups Unicode — BreakIterator, grapheme clusters (~1 KB) 19,710 ns 15.7×

Note

The two inputs are not the same size (52 bytes ASCII vs 96 bytes Unicode for the small case; ~1 KB each for the large case). The ratio reflects both algorithmic overhead and the larger input; the analysis below accounts for both factors. Profile with your actual input sizes before drawing conclusions.

The BreakIterator cost is structural and grows with input size. The 10.5× overhead on small strings becomes 15.7× at ~1 KB. This is expected: the ASCII bitset path reaches steady-state quickly (once all 26 letters are seen, remaining bytes are skipped with a single bitset check), while the Unicode path must convert every byte to UTF-16, build the full offset index, and look up every grapheme cluster in a heap-allocated hash set — all O(n) costs with no early-exit. The overhead is not from a poorly written loop; it is the irreducible cost of grapheme-aware processing. Conversions accumulate. If the same string is processed repeatedly, convert once to icu::UnicodeString and keep it in that representation rather than converting on every operation.

Where this matters: Part 2#

Display and truncation. Splitting a string at a byte or code-point offset can produce invalid UTF-8 or split a grapheme cluster in half. Text editors, terminal emulators, and any UI that truncates to a display width must use grapheme-aware boundaries. Cursor positioning and column width calculations break at cluster boundaries — a terminal that naively counts code points will misplace the cursor after a multi-codepoint emoji.

String length and width. The length of a string is not the width of its display. A string of 5 code points that are all fullwidth characters (CJK, fullwidth digits) displays as 10 columns — each takes 2. A string of 10 code points where 9 are combining marks stacked on a single base displays as 1 column. A string of 1 grapheme cluster might contain 25 bytes. Any code that assumes str.length() correlates with visual width is broken.

Substring and slicing. Cutting a string at an arbitrary position will corrupt it if that position falls within a multi-byte code point or within a grapheme cluster. Safe substring operations must be grapheme-aware.

When std::string is sufficient. Internal tokens, protocol keywords, configuration keys, and any string whose character set is controlled and ASCII-only do not need ICU. The overhead is real; apply it where the correctness gap is real too.

Summary: Part 2#

  • Grapheme clusters are the atomic unit of text, not bytes or code points; BreakIterator gives you what a user would call "one character."
  • ICU provides complete Unicode text processing: normalization, case-folding, and grapheme segmentation.
  • Performance cost: 3×–4× for normalization and case-folding; 10×–16× for grapheme iteration depending on input size. The ratio worsens with longer strings — cache the icu::UnicodeString if processing the same string repeatedly.
  • The decision point: If a human typed it, it needs Unicode handling. For internal tokens and ASCII-only data, std::string is fine.

Final Summary#

Unicode text processing in C++ requires three pieces:

  1. Normalization — Normalize to NFC before comparison or hashing. Cost: 3×–4× for small strings, up to 13× for longer ones. Essential for search and deduplication.

  2. Case-folding — Use foldCase() instead of tolower. Cost: well below normalization (no isolated benchmark; Part 1 measures them combined). Essential for any international text comparison.

  3. Grapheme segmentation — Iterate clusters, not code points or bytes. Cost: 10×–16× and grows with input size. Essential for correct display, truncation, and user-facing operations.

ICU provides all three. The cost is real but localized — apply it where the correctness gap exists. For anything else, std::string is fine.

Share ->