Commit 3b44def

ICU-23303 document C++ Unicode string code point iterators
See #3830
1 parent 7df0d91 commit 3b44def

12 files changed

Lines changed: 467 additions & 28 deletions

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
1+
---
2+
layout: default
3+
title: C++ Header-Only APIs
4+
nav_order: 10
5+
parent: ICU4C
6+
---
7+
<!--
8+
© 2025 and later: Unicode, Inc. and others.
9+
License & terms of use: http://www.unicode.org/copyright.html
10+
-->
11+
12+
# C++ Header-Only APIs
13+
{: .no_toc }
14+
15+
## Contents
16+
{: .no_toc .text-delta }
17+
18+
1. TOC
19+
{:toc}
20+
21+
---
22+
23+
## Overview
24+
25+
Starting with ICU 76, ICU4C has what we call C++ header-only APIs.
26+
These are especially intended for users who rely on only binary stable DLL/library exports of C APIs
27+
(C++ APIs cannot be binary stable).
28+
29+
Some header-only APIs provide functionality that is not otherwise available in C++; for example, the code point iteration and range APIs in
30+
[`unicode/utfiterator.h`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfiterator_8h.html)
31+
and the string helpers in
32+
[`unicode/utfstring.h`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfstring_8h.html).
33+
34+
As before, regular C++ APIs can be hidden by callers defining `U_SHOW_CPLUSPLUS_API=0`.
35+
The header-only APIs can be separately enabled via `U_SHOW_CPLUSPLUS_HEADER_API=1`.
36+
37+
([GitHub query for `U_SHOW_CPLUSPLUS_HEADER_API` in public header files](https://github.com/search?q=repo%3Aunicode-org%2Ficu+U_SHOW_CPLUSPLUS_HEADER_API+path%3Aunicode%2F*.h&type=code))
38+
39+
C++ header-only APIs are C++ definitions that are not exported by the ICU DLLs/libraries,
40+
and are thus inlined into the calling code.
41+
They may call ICU C APIs,
42+
but they do not call any ICU C++ APIs except other header-only ones.
43+
(Therefore, these header-only C++ classes do not subclass UMemory or UObject.)
44+
45+
The header-only APIs are defined in a nested `header` namespace.
46+
If entry point renaming is turned off (the main namespace is `icu` rather than `icu_76` etc.),
47+
then the `U_HEADER_ONLY_NAMESPACE` is `icu::header`.
48+
49+
The following example iterates over the code point ranges in a `USet` (excluding the strings) using C++ header-only APIs on top of C-only functions.
50+
51+
```c++
52+
using icu::header::USetRanges;
53+
icu::LocalUSetPointer uset(uset_openPattern(u"[abcçカ🚴]", -1, &errorCode));
54+
for (auto [start, end] : USetRanges(uset.getAlias())) {
55+
    printf("uset.range U+%04lx..U+%04lx\n", (long)start, (long)end);
56+
}
57+
for (auto range : USetRanges(uset.getAlias())) {
58+
    for (UChar32 c : range) {
59+
        printf("uset.range.c U+%04lx\n", (long)c);
60+
    }
61+
}
62+
```
63+
64+
(Implementation note: On most platforms, when compiling ICU library code,
65+
the `U_HEADER_ONLY_NAMESPACE` is `icu_76::internal` / `icu::internal` etc.,
66+
so that any such symbols that get exported differ from the ones that calling code sees.
67+
On Windows, where DLL exports are explicit,
68+
the namespace is always the same, but these header-only APIs are not marked for export.)

docs/userguide/icu4c/plug-ins.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
layout: default
33
title: Plug-ins
4-
nav_order: 4
4+
nav_order: 40
55
parent: ICU4C
66
---
77
<!--

docs/userguide/strings/characteriterator.md

Lines changed: 12 additions & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
layout: default
33
title: CharacterIterator
4-
nav_order: 3
4+
nav_order: 30
55
parent: Chars and Strings
66
---
77
<!--
@@ -11,6 +11,17 @@ License & terms of use: http://www.unicode.org/copyright.html
1111

1212
# CharacterIterator Class
1313

14+
## Modern APIs
15+
16+
### Modern C++
17+
Starting with ICU 78, ICU4C has [C++ header-only APIs](../icu4c/header-only.md)
18+
for conveniently iterating over the code points of a Unicode string in any standard encoding form (UTF-8/16/32).
19+
See [C++ Code Point Iterators](cpp-code-point-iterator.md).
20+
21+
### Modern Java
22+
Starting with Java 8, interface `CharSequence` (and thus `String` and `StringBuilder`)
23+
has a `codePoints()` method which returns an `IntStream` of Unicode code points.
24+
1425
## Overview
1526

1627
CharacterIterator is the abstract base class that defines a protocol for
Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
1+
---
2+
layout: default
3+
title: C++ Code Point Iterators
4+
nav_order: 24
5+
parent: Chars and Strings
6+
---
7+
<!--
8+
© 2025 and later: Unicode, Inc. and others.
9+
License & terms of use: http://www.unicode.org/copyright.html
10+
-->
11+
12+
# C++ Code Point Iterators
13+
{: .no_toc }
14+
15+
## Contents
16+
{: .no_toc .text-delta }
17+
18+
1. TOC
19+
{:toc}
20+
21+
---
22+
23+
## Overview
24+
25+
Sometimes you need to process a string one character at a time.
26+
This is trivial in a UTF-32 string, but those are not common.
27+
Most Unicode strings are UTF-8 or UTF-16 strings and may use multiple code units per Unicode code point.
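
As a rough illustration of the code unit counts involved, the following sketch stores the single code point U+1F6B4 in each of the three encoding forms, using only standard-library string views. The constant and function names are made up for this example, and nothing here calls into ICU.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical names for illustration; U+1F6B4 is one supplementary code point.
inline constexpr std::string_view kBicyclist8 = "\xF0\x9F\x9A\xB4";  // UTF-8 bytes
inline constexpr std::u16string_view kBicyclist16 = u"\U0001F6B4";   // surrogate pair
inline constexpr std::u32string_view kBicyclist32 = U"\U0001F6B4";   // single unit

// One code point, but a different number of code units per encoding form.
inline void checkCodeUnitCounts() {
    assert(kBicyclist8.size() == 4);
    assert(kBicyclist16.size() == 2);
    assert(kBicyclist32.size() == 1);
}
```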
28+
29+
(Note that a Unicode code point is not necessarily what you think of as a character.
30+
See the Wikipedia article on
31+
[combining characters](https://en.wikipedia.org/wiki/Combining_character)
32+
for some examples.)
33+
34+
Starting with ICU 78, ICU4C has [C++ header-only APIs](../icu4c/header-only.md)
35+
for conveniently iterating over the code points of a Unicode string in any standard encoding form (UTF-8/16/32).
36+
They work seamlessly with modern C++ iterators and ranges.
37+
These APIs are fully inline-implemented and can be used without linking with the ICU libraries.
38+
39+
As with the existing C macros, there are versions which validate the code unit sequences on the fly,
40+
as well as fast but “unsafe” versions which assume & require well-formed strings.
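
To make the trade-off concrete, here is a hand-written sketch of both styles for UTF-16. The functions `nextValidated16()` and `nextUnsafe16()` are hypothetical names for this illustration; they are not ICU APIs and do not reflect ICU's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Validating sketch: returns -1 for an unpaired surrogate and consumes one unit.
inline std::int32_t nextValidated16(std::u16string_view s, std::size_t &i) {
    assert(i < s.size());
    char16_t lead = s[i++];
    if (lead < 0xD800 || lead > 0xDFFF) { return lead; }  // not a surrogate
    if (lead <= 0xDBFF && i < s.size()) {                 // lead surrogate, more input
        char16_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return -1;  // unpaired surrogate
}

// "Unsafe" sketch: no checks at all, so the input must be well-formed.
inline std::int32_t nextUnsafe16(std::u16string_view s, std::size_t &i) {
    char16_t lead = s[i++];
    if (lead < 0xD800 || lead > 0xDFFF) { return lead; }
    char16_t trail = s[i++];
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}
```

The validating function needs an extra bounds check and two range checks per surrogate, which is the kind of work that the "unsafe" style skips.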
41+
42+
Header file documentation:
43+
[unicode/utfiterator.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfiterator_8h.html)
44+
including some
45+
[sample code snippets](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfiterator_8h.html#details).
46+
47+
## Old: C macros
48+
49+
ICU continues to provide C macros for iterating through
50+
[UTF-8](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html)
51+
and
52+
[UTF-16](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf16_8h.html)
53+
strings. For example:
54+
55+
```c++
56+
int32_t rangeLoop16(std::u16string_view s) {
57+
    // We are just adding up the code points for minimal-code demonstration purposes.
58+
    const char16_t *p = s.data();
59+
    size_t length = s.length();
60+
    int32_t sum = 0;
61+
    for (size_t i = 0; i < length;) { // loop body increments i
62+
        UChar32 c;
63+
        U16_NEXT(p, i, length, c);
64+
        sum += c; // for an unpaired surrogate, U16_NEXT sets c to that surrogate (not a negative value)
65+
    }
66+
    return sum;
67+
}
68+
```
69+
70+
## C++ code point iterators and ranges
71+
72+
The `unicode/utfiterator.h` APIs let you wrap the string in a “range” object that provides iterators over the string’s code points. You could rewrite the C macro example above like this:
73+
74+
```c++
75+
int32_t rangeLoop16(std::u16string_view s) {
76+
    // We are just adding up the code points for minimal-code demonstration purposes.
77+
    int32_t sum = 0;
78+
    for (auto units : utfStringCodePoints<UChar32, UTF_BEHAVIOR_NEGATIVE>(s)) {
79+
        sum += units.codePoint(); // < 0 if ill-formed
80+
    }
81+
    return sum;
82+
}
83+
```
84+
85+
This has a number of benefits compared with the C macros:
86+
- These C++ APIs provide iterator and range adaptors that are
87+
compatible with the C++ standard library, and thus look and feel natural.
88+
They are composable with standard library utilities, especially in C++20 and later.
89+
- Instead of raw pointer+length manipulation,
90+
they work with a large variety of code unit iterators.
91+
- This makes it possible to use constrained inputs without having to use an intermediate buffer of code units.
92+
- It also allows for safe code execution if the input supplies code unit iterators with bounds checking.
93+
- The same types and functions work for any of UTF-8/16/32.\
94+
(There are different macros for UTF-8 vs. UTF-16, and none for UTF-32.)
95+
- The APIs offer a number of options for a good fit for many use cases.
96+
97+
Here is an example for composing a `utfStringCodePoints()` range adaptor
98+
with C++20 language and standard library features:
99+
```c++
100+
auto codePoint = [](const auto &codeUnits) { return codeUnits.codePoint(); };
101+
const std::u16string text = u"𒂍𒁾𒁀𒀀𒂠 𒉌𒁺𒉈𒂗\n"
102+
u"𒂍𒁾𒁀𒀀 𒀀𒈾𒀀𒀭 𒉌𒀝\n"
103+
u"𒁾𒈬 𒉌𒋃 𒃻𒅗𒁺𒈬 𒉌𒅥\n";
104+
auto lines2sqq = text | std::ranges::views::lazy_split(u'\n') | std::views::drop(1);
105+
auto codeUnits = *lines2sqq.begin();
106+
assertTrue(std::ranges::equal(
107+
utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(codeUnits) |
108+
std::ranges::views::transform(codePoint),
109+
std::u32string_view(U"𒂍𒁾𒁀𒀀 𒀀𒈾𒀀𒀭 𒉌𒀝")));
110+
```
111+
112+
<!--
113+
Simplified example from icu4c/source/test/intltest/utfiteratortest.cpp
114+
115+
Eggsplanation: Split lines on U+000A without decoding, then decode the second line.
116+
In case anyone needs to read the Sumerian aloud, the three lines on the slide read
117+
edubbaʾaše iŋennen / edubbaʾa anam iak / dubŋu išid niŋzugubŋu igu;
118+
translation:
119+
I went to school. / what did you do at school? / I recited my tablet and ate my lunch.
120+
See https://cdli.earth/artifacts/464238/reader/213101
121+
-->
122+
123+
### Output: CodeUnits
124+
125+
The iterators do not merely return code point integers.
126+
As you iterate over a string, you are getting a `CodeUnits` object representing a
127+
Unicode code point and its code unit sequence.
128+
This supports use cases that are not centered on the code point integer.
129+
130+
Here is a simplified version of the class:
131+
```c++
132+
class CodeUnits {
133+
public:
134+
    CodeUnits(const CodeUnits &other);
135+
    CodeUnits &operator=(const CodeUnits &other);
136+
137+
    CP32 codePoint() const;
138+
139+
    UnitIter begin() const;
140+
    UnitIter end() const;
141+
    uint8_t length() const;
142+
143+
    std::basic_string_view<Unit> stringView() const;
144+
145+
    bool wellFormed() const;
146+
};
147+
```
148+
149+
The `CP32` code point type is a required template parameter. It must be a 32-bit integer type, but it can be signed or unsigned.
150+
You choose the code point integer type to fit your use case:
151+
It is typically an ICU `UChar32` (=`int32_t` / signed) or a `char32_t` or a `uint32_t` (both unsigned).
152+
153+
Pick any of these if you do not read the code point value.
154+
155+
Here is an example for just counting how many code points are in a string:
156+
```c++
157+
int32_t countCodePoints16(std::u16string_view s) {
158+
    auto range = utfStringCodePoints<UChar32, UTF_BEHAVIOR_SURROGATE>(s);
159+
    return std::distance(range.begin(), range.end());
160+
}
161+
```
162+
163+
Fetching the first code point’s code unit sequence if it is well-formed:
164+
```c++
165+
std::string_view firstSequence8(std::string_view s) {
166+
    if (s.empty()) { return {}; }
167+
    auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);
168+
    auto units = *(range.begin());
169+
    if (units.wellFormed()) {
170+
        return units.stringView();
171+
    } else {
172+
        return {};
173+
    }
174+
}
175+
```
176+
177+
### Input: UTF-8/16/32
178+
179+
The iterators and range adaptors work with any of the Unicode standard in-memory string encodings.
180+
The appropriate types and implementations are usually auto-detected.
181+
182+
Details:
183+
184+
A `UTFIterator` is instantiated with the input code unit iterator type which may yield bytes for UTF-8, 16-bit values for UTF-16, or 32-bit values for UTF-32. Using the `utfIterator()` function deduces the code unit iterator type from its arguments.
185+
186+
You may not need to work with a `UTFIterator` directly.
187+
A `UTFStringCodePoints` range adaptor can be constructed from a `std::string`, `std::string_view`, their variants (e.g., `std::u16string` or `std::u32string_view`), an `icu::UnicodeString`, or a wide variety of other code unit “ranges”.
188+
189+
Again, if you use the `utfStringCodePoints()` function, the `Range` template parameter is deduced from the argument.
190+
191+
### Input: Validation
192+
193+
You choose whether you want to validate the input string on the fly, by
194+
using `utfStringCodePoints` or `unsafeUTFStringCodePoints`,
195+
and similarly named siblings of the other types and functions.
196+
197+
The “unsafe” version compiles into smaller and faster code,
198+
especially for UTF-8 which is fairly complex,
199+
but it requires well-formed input.
200+
201+
The C++ standard string and string_view types, as well as ICU’s UnicodeString,
202+
do not require or enforce well-formed Unicode strings.
203+
However, you may enforce well-formed strings in large parts of your code base
204+
by checking on input and checking or debug-checking between some processing steps.
205+
206+
“Unsafe”, that is, non-validating iterators return `UnsafeCodeUnits`,
207+
which lack the `wellFormed()` function,
208+
but otherwise have the same API as `CodeUnits`.
209+
210+
All of the validating classes take another required template parameter for
211+
what code point value should be returned for an ill-formed code unit sequence:
212+
[enum UTFIllFormedBehavior](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utfiterator_8h.html#ae96b61b479fe4d7b8e525787353d1d46)
213+
- `UTF_BEHAVIOR_NEGATIVE`:
214+
- Returns a negative value (-1=`U_SENTINEL`) instead of a code point.\
215+
(As usual, the intended check for a code point value from a well-formed sequence is
216+
`cp >= 0`, not `cp != U_SENTINEL`.)
217+
- If the `CP32` template parameter for the relevant classes is an unsigned type,
218+
then the negative value becomes 0xffffffff=`UINT32_MAX`.
219+
- `UTF_BEHAVIOR_FFFD`: Returns U+FFFD Replacement Character.
220+
- `UTF_BEHAVIOR_SURROGATE`:
221+
- UTF-8: Not allowed.
222+
- UTF-16: Returns the unpaired surrogate.
223+
- UTF-32: Returns the surrogate code point, or U+FFFD if out of range.
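
The signed versus unsigned behavior of `UTF_BEHAVIOR_NEGATIVE` follows from ordinary integer conversion. The sketch below uses plain integers and a hypothetical constant `kSentinel` standing in for `U_SENTINEL`; the helper functions are also made up for this illustration, and no ICU header is involved.

```cpp
#include <cstdint>

// Hypothetical stand-in for ICU's U_SENTINEL (-1), so that no ICU header is needed.
inline constexpr std::int32_t kSentinel = -1;

// With a signed code point type, test the sign rather than equality with the sentinel.
constexpr bool isFromWellFormed(std::int32_t cp) { return cp >= 0; }

// With an unsigned code point type, the negative value wraps around.
constexpr std::uint32_t toUnsigned(std::int32_t cp) { return static_cast<std::uint32_t>(cp); }

static_assert(!isFromWellFormed(kSentinel));
static_assert(isFromWellFormed(0x1F6B4));
static_assert(toUnsigned(kSentinel) == UINT32_MAX);
static_assert(toUnsigned(kSentinel) > 0x10FFFF);  // beyond the Unicode code point range
```

With an unsigned code point type, a range check such as `cp <= 0x10FFFF` can therefore play the role that `cp >= 0` plays for a signed type.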
224+
225+
Again, pick any of these if you do not read the code point value.
226+
227+
### Input: Code unit iterators
228+
229+
C++ standard iterators are modeled after pointers,
230+
with operators like `*` and `->` for value access,
231+
`++` and `--` for iteration, and `==` for comparing with iteration limits.
232+
In fact, pointers to code units work as inputs to `UTFIterator`. However, they are not required.
233+
234+
When supplying a pointer or a `contiguous_iterator` for the code units, then
235+
`CodeUnits` supports the `stringView()` function.
236+
237+
When supplying at least a `bidirectional_iterator` for the code units, then the `UTFIterator` is also a `bidirectional_iterator`,
238+
`std::make_reverse_iterator(iter)` will return an efficient backward iterator,
239+
and using `utfStringCodePoints()` on a range of such iterators
240+
supports `rbegin()` and `rend()`.
241+
242+
When supplying only a `forward_iterator`, then
243+
the `UTFIterator` is also a `forward_iterator`, without backward iteration.
244+
245+
The minimal input is an `input_iterator`, which does not even allow reading the same value more than once.
246+
The resulting `UTFIterator` is then also a single-pass `input_iterator`, and
247+
it returns `CodeUnits` which only support `codePoint()`, `length()`, and (if validating) `wellFormed()`.
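
The iterator categories named above can be inspected with the standard library alone. In this sketch, `hasCategory` is a hypothetical helper written for the illustration, and the containers merely stand in for possible sources of code units. It uses the C++17 iterator tags; in C++20 you could test the corresponding concepts such as `std::bidirectional_iterator` instead.

```cpp
#include <forward_list>
#include <iterator>
#include <list>
#include <type_traits>

// Hypothetical helper: does iterator type It provide at least category Tag?
template <typename It, typename Tag>
inline constexpr bool hasCategory =
    std::is_base_of_v<Tag, typename std::iterator_traits<It>::iterator_category>;

// Pointers to code units support random access.
static_assert(hasCategory<const char16_t *, std::random_access_iterator_tag>);
// A doubly linked list of code units can be traversed in both directions.
static_assert(hasCategory<std::list<char>::iterator, std::bidirectional_iterator_tag>);
// A singly linked list only goes forward.
static_assert(hasCategory<std::forward_list<char>::iterator, std::forward_iterator_tag>);
static_assert(!hasCategory<std::forward_list<char>::iterator, std::bidirectional_iterator_tag>);
// A stream buffer iterator is single-pass.
static_assert(hasCategory<std::istreambuf_iterator<char>, std::input_iterator_tag>);
static_assert(!hasCategory<std::istreambuf_iterator<char>, std::forward_iterator_tag>);
```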
248+
249+
Each validating iterator needs to be instantiated with both
250+
the current-position code unit iterator as well as a “limit” (exclusive-end) or “sentinel” iterator.
251+
(Otherwise it would not know when to stop reading the variable number of code units.)
252+
The API supports “sentinel” types that differ from the code unit iterator,
253+
as long as the two can be compared.
254+
255+
An example of an `input_iterator` is the standard-input stream.
256+
The API docs include this code example for that:
257+
258+
```c++
259+
template<typename InputStream> // some istream or streambuf
260+
std::u32string cpFromInput(InputStream &in) {
261+
    // This is a single-pass input_iterator.
262+
    std::istreambuf_iterator bufIter(in);
263+
    std::istreambuf_iterator<typename InputStream::char_type> bufLimit;
264+
    auto iter = utfIterator<char32_t, UTF_BEHAVIOR_FFFD>(bufIter);
265+
    auto limit = utfIterator<char32_t, UTF_BEHAVIOR_FFFD>(bufLimit);
266+
    std::u32string s32;
267+
    for (; iter != limit; ++iter) {
268+
        s32.push_back(iter->codePoint());
269+
    }
270+
    return s32;
271+
}
272+
273+
std::u32string cpFromStdin() { return cpFromInput(std::cin); }
274+
std::u32string cpFromWideStdin() { return cpFromInput(std::wcin); }
275+
```
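
The single-pass nature of such an input can be observed with standard-library types alone. The following sketch, with the hypothetical helpers `readAll()` and `demoSinglePass()`, drains a stream through `std::istreambuf_iterator`; a second pass over the same stream finds nothing left to read.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: copies all remaining bytes of the stream into a string.
inline std::string readAll(std::istream &in) {
    std::istreambuf_iterator<char> iter(in);
    std::istreambuf_iterator<char> limit;  // default-constructed: end of stream
    return std::string(iter, limit);
}

inline void demoSinglePass() {
    std::istringstream in("abc");
    assert(readAll(in) == "abc");  // the first pass consumes the input
    assert(readAll(in).empty());   // nothing is left for a second pass
}
```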
276+
277+
### Compiled code size
278+
279+
All of the code is inline-implemented in the header file.
280+
Where available for a compiler (e.g., g++ and clang), the code is force-inlined.
281+
As a result, the compiler will omit code whose output is not used.
282+
For example, if you do not use the code point integer,
283+
then the compiler will omit the code to assemble it from the code unit bits.
284+
285+
The code has also been written to make it easy for the compiler to detect and eliminate redundant code, especially in typical use cases including range-based for loops.
286+
287+
Some of the implementation code is necessarily fairly complex,
288+
especially for validating iteration over UTF-8.
289+
Compiler-friendly implementation techniques, force-inlining, and modern compiler optimizations are meant to keep the resulting code small and fast.
