| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: C++ Code Point Iterators |
| 4 | +nav_order: 24 |
| 5 | +parent: ICU4C |
| 6 | +--- |
| 7 | +<!-- |
| 8 | +© 2025 and later: Unicode, Inc. and others. |
| 9 | +License & terms of use: http://www.unicode.org/copyright.html |
| 10 | +--> |
| 11 | + |
| 12 | +# C++ Code Point Iterators |
| 13 | +{: .no_toc } |
| 14 | + |
| 15 | +## Contents |
| 16 | +{: .no_toc .text-delta } |
| 17 | + |
| 18 | +1. TOC |
| 19 | +{:toc} |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## Overview |
| 24 | + |
| 25 | +Sometimes you need to process a string one character at a time. |
| 26 | +This is trivial in a UTF-32 string, but those are not common. |
| 27 | +Most Unicode strings are UTF-8 or UTF-16 strings and may use multiple code units per Unicode code point. |
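To make the code unit counts concrete, here is a small stand-alone sketch (plain C++17, no ICU required) that stores the single code point U+1208D, a cuneiform sign, in each encoding form. The helper names are made up for this illustration.

```cpp
#include <cstddef>
#include <string_view>

// U+1208D is a supplementary code point (above U+FFFF).
// UTF-8 encodes it as the four bytes F0 92 82 8D.
inline std::size_t unitsInUtf8() {
    return std::string_view("\xF0\x92\x82\x8D").size();
}

// UTF-16 needs a surrogate pair: two 16-bit code units.
inline std::size_t unitsInUtf16() {
    return std::u16string_view(u"\U0001208D").size();
}

// UTF-32 always uses one code unit per code point.
inline std::size_t unitsInUtf32() {
    return std::u32string_view(U"\U0001208D").size();
}
```

The three functions return 4, 2, and 1, respectively, which is why indexing by code unit is not the same as indexing by code point in UTF-8 and UTF-16.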
| 28 | + |
| 29 | +(Note that a Unicode code point is not necessarily what you think of as a character. |
| 30 | +See the Wikipedia article on |
| 31 | +[combining characters](https://en.wikipedia.org/wiki/Combining_character) |
| 32 | +for some examples.) |
| 33 | + |
| 34 | +Starting with ICU 78, ICU4C has [C++ header-only APIs](../icu4c/header-only.md) |
| 35 | +for conveniently iterating over the code points of a Unicode string in any standard encoding form (UTF-8/16/32). |
| 36 | +They work seamlessly with modern C++ iterators and ranges. |
| 37 | +These APIs are fully inline-implemented and can be used without linking with the ICU libraries. |
| 38 | + |
| 39 | +As with the existing C macros, there are versions which validate the code unit sequences on the fly, |
| 40 | +as well as fast but “unsafe” versions which assume & require well-formed strings. |
| 41 | + |
| 42 | +Header file documentation: |
| 43 | +[unicode/utfiterator.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfiterator_8h.html) |
| 44 | +including some |
| 45 | +[sample code snippets](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfiterator_8h.html#details). |
| 46 | + |
| 47 | +## Old: C macros |
| 48 | + |
| 49 | +ICU continues to provide C macros for iterating through |
| 50 | +[UTF-8](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html) |
| 51 | +and |
| 52 | +[UTF-16](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf16_8h.html) |
| 53 | +strings. For example: |
| 54 | + |
| 55 | +```c++ |
| 56 | +int32_t rangeLoop16(std::u16string_view s) { |
| 57 | +    // We are just adding up the code points for minimal-code demonstration purposes. |
| 58 | +    const char16_t *p = s.data(); |
| 59 | +    size_t length = s.length(); |
| 60 | +    int32_t sum = 0; |
| 61 | +    for (size_t i = 0; i < length;) {  // loop body increments i |
| 62 | +        UChar32 c; |
| 63 | +        U16_NEXT(p, i, length, c); |
| 64 | +        sum += c;  // U16_NEXT returns an unpaired surrogate as is |
| 65 | +    } |
| 66 | +    return sum; |
| 67 | +} |
| 68 | +``` |
| 69 | + |
| 70 | +## C++ code point iterators and ranges |
| 71 | + |
| 72 | +The `unicode/utfiterator.h` APIs let you wrap the string in a “range” object that provides iterators over the string’s code points. You could rewrite the C macro example above like this, here choosing a negative value for ill-formed input: |
| 73 | + |
| 74 | +```c++ |
| 75 | +int32_t rangeLoop16(std::u16string_view s) { |
| 76 | +    // We are just adding up the code points for minimal-code demonstration purposes. |
| 77 | +    int32_t sum = 0; |
| 78 | +    for (auto units : utfStringCodePoints<UChar32, UTF_BEHAVIOR_NEGATIVE>(s)) { |
| 79 | +        sum += units.codePoint();  // < 0 if ill-formed |
| 80 | +    } |
| 81 | +    return sum; |
| 82 | +} |
| 83 | +``` |
| 84 | + |
| 85 | +This has a number of benefits compared with the C macros: |
| 86 | +- These C++ APIs provide iterator and range adaptors that are |
| 87 | + compatible with the C++ standard library, and thus look and feel natural. |
| 88 | + They are composable with standard library utilities, especially in C++20 and later. |
| 89 | +- Instead of raw pointer+length manipulation, |
| 90 | + they work with a large variety of code unit iterators. |
| 91 | + - This makes it possible to use constrained inputs without having to use an intermediate buffer of code units. |
| 92 | + - It also allows for safe code execution if the input supplies code unit iterators with bounds checking. |
| 93 | +- The same types and functions work for any of UTF-8/16/32.\ |
| 94 | + (There are different macros for UTF-8 vs. UTF-16, and none for UTF-32.) |
| 95 | +- The APIs offer a number of options for a good fit for many use cases. |
| 96 | + |
| 97 | +Here is an example for composing a `utfStringCodePoints()` range adaptor |
| 98 | +with C++20 language and standard library features: |
| 99 | +```c++ |
| 100 | +auto codePoint = [](const auto &codeUnits) { return codeUnits.codePoint(); }; |
| 101 | +const std::u16string text = u"𒂍𒁾𒁀𒀀𒂠 𒉌𒁺𒉈𒂗\n" |
| 102 | +    u"𒂍𒁾𒁀𒀀 𒀀𒈾𒀀𒀭 𒉌𒀝\n" |
| 103 | +    u"𒁾𒈬 𒉌𒋃 𒃻𒅗𒁺𒈬 𒉌𒅥\n"; |
| 104 | +auto lines2sqq = text | std::ranges::views::lazy_split(u'\n') | std::views::drop(1); |
| 105 | +auto codeUnits = *lines2sqq.begin(); |
| 106 | +assertTrue(std::ranges::equal( |
| 107 | +    utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(codeUnits) | |
| 108 | +        std::ranges::views::transform(codePoint), |
| 109 | +    std::u32string_view(U"𒂍𒁾𒁀𒀀 𒀀𒈾𒀀𒀭 𒉌𒀝"))); |
| 110 | +``` |
| 111 | + |
| 112 | +<!-- |
| 113 | +Simplified example from icu4c/source/test/intltest/utfiteratortest.cpp |
| 114 | + |
| 115 | +Eggsplanation: Split lines on U+000A without decoding, then decode the second line. |
| 116 | +In case anyone needs to read the Sumerian aloud, the three lines on the slide read |
| 117 | +edubbaʾaše iŋennen / edubbaʾa anam iak / dubŋu išid niŋzugubŋu igu; |
| 118 | +translation: |
| 119 | +I went to school. / what did you do at school? / I recited my tablet and ate my lunch. |
| 120 | +See https://cdli.earth/artifacts/464238/reader/213101 |
| 121 | +--> |
| 122 | + |
| 123 | +### Output: CodeUnits |
| 124 | + |
| 125 | +The iterators do not merely return code point integers. |
| 126 | +As you iterate over a string, you are getting a `CodeUnits` object representing a |
| 127 | +Unicode code point and its code unit sequence. |
| 128 | +This supports use cases that are not centered on the code point integer. |
| 129 | + |
| 130 | +Here is a simplified version of the class: |
| 131 | +```c++ |
| 132 | +class CodeUnits { |
| 133 | +public: |
| 134 | +    CodeUnits(const CodeUnits &other); |
| 135 | +    CodeUnits &operator=(const CodeUnits &other); |
| 136 | + |
| 137 | +    CP32 codePoint() const; |
| 138 | + |
| 139 | +    UnitIter begin() const; |
| 140 | +    UnitIter end() const; |
| 141 | +    uint8_t length() const; |
| 142 | + |
| 143 | +    std::basic_string_view<Unit> stringView() const; |
| 144 | + |
| 145 | +    bool wellFormed() const; |
| 146 | +}; |
| 147 | +``` |
| 148 | + |
| 149 | +The `CP32` code point type is a required template parameter. It must be a 32-bit integer type, but it can be signed or unsigned. |
| 150 | +You choose the code point integer type to fit your use case: |
| 151 | +It is typically an ICU `UChar32` (=`int32_t` / signed) or a `char32_t` or a `uint32_t` (both unsigned). |
| 152 | + |
| 153 | +Pick any of these if you do not read the code point value. |
| 154 | + |
| 155 | +Here is an example for just counting how many code points are in a string: |
| 156 | +```c++ |
| 157 | +int32_t countCodePoints16(std::u16string_view s) { |
| 158 | +    auto range = utfStringCodePoints<UChar32, UTF_BEHAVIOR_SURROGATE>(s); |
| 159 | +    return std::distance(range.begin(), range.end()); |
| 160 | +} |
| 161 | +``` |
| 162 | + |
| 163 | +Fetching the first code point’s code unit sequence if it is well-formed: |
| 164 | +```c++ |
| 165 | +std::string_view firstSequence8(std::string_view s) { |
| 166 | +    if (s.empty()) { return {}; } |
| 167 | +    auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s); |
| 168 | +    auto units = *(range.begin()); |
| 169 | +    if (units.wellFormed()) { |
| 170 | +        return units.stringView(); |
| 171 | +    } else { |
| 172 | +        return {}; |
| 173 | +    } |
| 174 | +} |
| 175 | +``` |
| 176 | + |
| 177 | +### Input: UTF-8/16/32 |
| 178 | + |
| 179 | +The iterators and range adaptors work with any of the Unicode standard in-memory string encodings. |
| 180 | +The appropriate types and implementations are usually auto-detected. |
| 181 | + |
| 182 | +Details: |
| 183 | + |
| 184 | +A `UTFIterator` is instantiated with the input code unit iterator type which may yield bytes for UTF-8, 16-bit values for UTF-16, or 32-bit values for UTF-32. Using the `utfIterator()` function deduces the code unit iterator type from its arguments. |
| 185 | + |
| 186 | +You may not need to work with a `UTFIterator` directly. |
| 187 | +A `UTFStringCodePoints` range adaptor can be constructed from a `std::string`, `std::string_view`, their variants (e.g., `std::u16string` or `std::u32string_view`), an `icu::UnicodeString`, or a wide variety of other code unit “ranges”. |
| 188 | + |
| 189 | +Again, if you use the `utfStringCodePoints()` function, the `Range` template parameter is deduced from the argument. |
| 190 | + |
| 191 | +### Input: Validation |
| 192 | + |
| 193 | +You choose whether you want to validate the input string on the fly, by |
| 194 | +using `utfStringCodePoints` or `unsafeUTFStringCodePoints`, |
| 195 | +and similarly named siblings of the other types and functions. |
| 196 | + |
| 197 | +The “unsafe” version compiles into smaller and faster code, |
| 198 | +especially for UTF-8 which is fairly complex, |
| 199 | +but it requires well-formed input. |
| 200 | + |
| 201 | +The C++ standard string and string_view types, as well as ICU’s UnicodeString, |
| 202 | +do not require or enforce well-formed Unicode strings. |
| 203 | +However, you may enforce well-formed strings in large parts of your code base |
| 204 | +by checking on input and checking or debug-checking between some processing steps. |
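As an illustration of "checking on input", here is a minimal stand-alone sketch of a UTF-16 well-formedness check. The function name `isWellFormedUtf16` is hypothetical and not part of ICU; it only shows the kind of test a code base might run once at a boundary before handing strings to non-validating code.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not ICU API: returns true if every surrogate
// code unit in s is part of a lead+trail pair.
inline bool isWellFormedUtf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u < 0xD800 || u > 0xDFFF) { continue; }  // not a surrogate
        bool isLead = u <= 0xDBFF;
        bool hasTrail = i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (!isLead || !hasTrail) { return false; }  // unpaired surrogate
        ++i;  // skip the trail surrogate of the well-formed pair
    }
    return true;
}
```

A debug-only variant of the same idea could wrap the call in `assert()` between processing steps, so that release builds do not pay for the scan.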
| 205 | + |
| 206 | +“Unsafe”, that is, non-validating iterators return `UnsafeCodeUnits`, |
| 207 | +which lack the `wellFormed()` function, |
| 208 | +but otherwise have the same API as `CodeUnits`. |
| 209 | + |
| 210 | +All of the validating classes take another required template parameter for |
| 211 | +what code point value should be returned for an ill-formed code unit sequence: |
| 212 | +[enum UTFIllFormedBehavior](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfiterator_8h.html#ae96b61b479fe4d7b8e525787353d1d46) |
| 213 | +- `UTF_BEHAVIOR_NEGATIVE`: |
| 214 | + - Returns a negative value (-1=`U_SENTINEL`) instead of a code point.\ |
| 215 | + (As usual, the intended check for a code point value from a well-formed sequence is |
| 216 | + `cp >= 0`, not `cp != U_SENTINEL`.) |
| 217 | + - If the `CP32` template parameter for the relevant classes is an unsigned type, |
| 218 | + then the negative value becomes 0xffffffff=`UINT32_MAX`. |
| 219 | +- `UTF_BEHAVIOR_FFFD`: Returns U+FFFD Replacement Character. |
| 220 | +- `UTF_BEHAVIOR_SURROGATE`: |
| 221 | + - UTF-8: Not allowed. |
| 222 | + - UTF-16: Returns the unpaired surrogate. |
| 223 | + - UTF-32: Returns the surrogate code point, or U+FFFD if out of range. |
| 224 | + |
| 225 | +Again, pick any of these if you do not read the code point value. |
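The effect of the three behaviors can be sketched without ICU. The following stand-alone C++17 code is not the ICU implementation; the enum `OnIllFormed` and the function `next16` are hypothetical names chosen for this illustration. It decodes one UTF-16 code point and shows what each behavior would yield for an unpaired surrogate, including the signed versus unsigned `CP32` case.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical names for this sketch; the ICU enum is UTFIllFormedBehavior.
enum class OnIllFormed { Negative, FFFD, Surrogate };

// Decodes the UTF-16 code point starting at s[i] and advances i past it.
// Precondition: i < s.size().
template<typename CP32, OnIllFormed behavior>
CP32 next16(std::u16string_view s, std::size_t &i) {
    char16_t u = s[i++];
    if (u < 0xD800 || u > 0xDFFF) { return static_cast<CP32>(u); }  // not a surrogate
    if (u <= 0xDBFF && i < s.size() && s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
        // Well-formed pair: combine the lead and trail surrogates.
        char16_t trail = s[i++];
        return static_cast<CP32>(0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00));
    }
    // Unpaired surrogate: the result depends on the chosen behavior.
    if constexpr (behavior == OnIllFormed::Negative) {
        return static_cast<CP32>(-1);  // becomes 0xffffffff if CP32 is unsigned
    } else if constexpr (behavior == OnIllFormed::FFFD) {
        return static_cast<CP32>(0xFFFD);
    } else {
        return static_cast<CP32>(u);  // pass the surrogate through
    }
}
```

With a signed `CP32` and the negative behavior, a `cp >= 0` test distinguishes well-formed results; with an unsigned `CP32` the same sentinel compares equal to `UINT32_MAX` instead.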
| 226 | + |
| 227 | +### Input: Code unit iterators |
| 228 | + |
| 229 | +C++ standard iterators are modeled after pointers, |
| 230 | +with operators like `*` and `->` for value access, |
| 231 | +`++` and `--` for iteration, and `==` for comparing with iteration limits. |
| 232 | +In fact, pointers to code units work as inputs to `UTFIterator`. However, they are not required. |
| 233 | + |
| 234 | +When you supply a pointer or a `contiguous_iterator` for the code units, |
| 235 | +`CodeUnits` supports the `stringView()` function. |
| 236 | + |
| 237 | +When you supply at least a `bidirectional_iterator` for the code units, the `UTFIterator` is also a `bidirectional_iterator`, |
| 238 | +`std::make_reverse_iterator(iter)` will return an efficient backward iterator, |
| 239 | +and using `utfStringCodePoints()` on a range of such iterators |
| 240 | +supports `rbegin()` and `rend()`. |
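Backward iteration requires a code unit iterator that can step backward, because the decoder has to inspect the units before the current position. A minimal stand-alone sketch for UTF-16 follows; `previousStart16` is a hypothetical helper, not ICU API, and it works on indexes rather than iterators to keep the example short.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not ICU API: given an index i > 0 that lies on a
// code point boundary, returns the start index of the preceding code point.
inline std::size_t previousStart16(std::u16string_view s, std::size_t i) {
    --i;
    bool isTrail = s[i] >= 0xDC00 && s[i] <= 0xDFFF;
    if (isTrail && i > 0 && s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF) {
        --i;  // also step over the lead surrogate of the pair
    }
    return i;
}
```

An unpaired trail surrogate is treated as a one-unit sequence here, which mirrors how forward decoding would see it.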
| 241 | + |
| 242 | +When you supply only a `forward_iterator`, |
| 243 | +the `UTFIterator` is also a `forward_iterator`, without backward iteration. |
| 244 | + |
| 245 | +The minimal input is an `input_iterator`, which does not even allow reading the same value more than once. |
| 246 | +The resulting `UTFIterator` is then also a single-pass `input_iterator`, and |
| 247 | +it returns `CodeUnits` which only support `codePoint()`, `length()`, and (if validating) `wellFormed()`. |
| 248 | + |
| 249 | +Each validating iterator needs to be instantiated with both |
| 250 | +the current-position code unit iterator as well as a “limit” (exclusive-end) or “sentinel” iterator. |
| 251 | +(Otherwise it would not know when to stop reading the variable number of code units.) |
| 252 | +The API supports “sentinel” types that differ from the code unit iterator, |
| 253 | +as long as the two can be compared. |
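To show what a sentinel type that differs from the iterator type can look like, here is a stand-alone sketch for NUL-terminated UTF-16 strings. `NulSentinel` and `countUnits` are hypothetical names for this illustration. Whether this exact type meets the requirements of the `unicode/utfiterator.h` functions is not verified here, so treat it as a demonstration of the concept only.

```cpp
#include <cstddef>

// Hypothetical sentinel type: compares equal to a pointer that has
// reached the terminating NUL code unit.
struct NulSentinel {};

inline bool operator==(const char16_t *p, NulSentinel) { return *p == 0; }
inline bool operator!=(const char16_t *p, NulSentinel s) { return !(p == s); }

// Counts code units up to the sentinel, without knowing the length in advance.
inline std::size_t countUnits(const char16_t *p) {
    std::size_t n = 0;
    for (; p != NulSentinel{}; ++p) { ++n; }
    return n;
}
```

The loop never computes an end pointer; the comparison with the sentinel inspects the current code unit instead, which is the property that makes such sentinels useful for inputs of unknown length.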
| 254 | + |
| 255 | +An example of an `input_iterator` is the standard-input stream. |
| 256 | +The API docs include this code example for that: |
| 257 | + |
| 258 | +```c++ |
| 259 | +template<typename InputStream> // some istream or streambuf |
| 260 | +std::u32string cpFromInput(InputStream &in) { |
| 261 | +    // This is a single-pass input_iterator. |
| 262 | +    std::istreambuf_iterator bufIter(in); |
| 263 | +    std::istreambuf_iterator<typename InputStream::char_type> bufLimit; |
| 264 | +    auto iter = utfIterator<char32_t, UTF_BEHAVIOR_FFFD>(bufIter); |
| 265 | +    auto limit = utfIterator<char32_t, UTF_BEHAVIOR_FFFD>(bufLimit); |
| 266 | +    std::u32string s32; |
| 267 | +    for (; iter != limit; ++iter) { |
| 268 | +        s32.push_back(iter->codePoint()); |
| 269 | +    } |
| 270 | +    return s32; |
| 271 | +} |
| 272 | + |
| 273 | +std::u32string cpFromStdin() { return cpFromInput(std::cin); } |
| 274 | +std::u32string cpFromWideStdin() { return cpFromInput(std::wcin); } |
| 275 | +``` |
| 276 | + |
| 277 | +### Compiled code size |
| 278 | + |
| 279 | +All of the code is inline-implemented in the header file. |
| 280 | +Where available for a compiler (e.g., g++ and clang), the code is force-inlined. |
| 281 | +As a result, the compiler will omit code whose output is not used. |
| 282 | +For example, if you do not use the code point integer, |
| 283 | +then the compiler will omit the code to assemble it from the code unit bits. |
| 284 | + |
| 285 | +The code has also been written to make it easy for the compiler to detect and eliminate redundant code, especially in typical use cases including range-based for loops. |
| 286 | + |
| 287 | +Some of the implementation code is necessarily fairly complex, |
| 288 | +especially for validating iteration over UTF-8. |
| 289 | +Compiler-friendly implementation techniques, force-inlining, and modern compiler optimizations yield code as small and fast as possible. |