# Reference-Typed Strings A minimum viable proposal to add a reference-typed strings to WebAssembly (the `stringref` proposal). ## Champions Andy Wingo `` ## Goals 1. Enable programs compiled to WebAssembly to efficiently create and consume JavaScript strings 2. Provide a good string implementation that many languages implemented on top of the GC proposal would find useful These goals are sometimes in tension! The operative words to help us find good compromises are "minimal" and "viable". ## Requirements 1. Zero-copy passing of strings from JavaScript to WebAssembly & back 2. No new string implementations on the web: allow re-use of JS engine's strings 3. Allow WebAssembly implementations to efficiently represent strings internally in either WTF-8 or WTF-16 encodings 4. Allow access to WTF-16 code units for Java, Dart, Kotlin and similar languages 5. Allow string literals as constant expressions ## Definitions - *codepoint*: An integer in the range [0,0x10FFFF]. - *surrogate*: A codepoint in the range [0xD800,0xDFFF]. - *unicode scalar value*: A codepoint that is not a surrogate. - *character*: An imprecise concept that we try to avoid in this document. - *code unit*: An indivisible unit of an encoded unicode scalar value. For UTF-8 encodings, an integer in the range [0,0xFF] (a byte); for UTF-16 encodings, an integer in the range [0,0xFFFF]; for UTF-32, the unicode scalar value itself. - *high surrogate*: A surrogate in the range [0xD800,0xDBFF]. - *low surrogate*: A surrogate which is not a high surrogate. - *surrogate pair*: A sequence of a *high surrogate* followed by a *low surrogate*, used by UTF-16 to encode a codepoint in the range [0x10000,0x10FFFF]. - *isolated surrogate*: Any surrogate which is not part of a surrogate pair. ## Design ### What's a string? Good question! It sure would be nice to say that a string is a sequence of unicode scalar values. However, to satisfy the goal of being a good compilation target for a wide range of programming languages, as well as the goal of good JavaScript interoperability, things are a little more complicated. Some languages present no problem to this idea that strings are composed of unicode scalar values. Python and Rust are in this category. Other languages, notably JavaScript and Java, define their strings to be sequences of 16-bit code units, for historical reasons. These code unit sequences aren't quite UTF-16, because they can contain isolated surrogates, which are prohibited by standard Unicode encoding forms. The facility we end up building should be able to represent all Java strings, as a compilation target, and also represent all JavaScript strings, for good web iteroperability. Therefore **we define a string to be a sequence of unicode scalar values and isolated surrogates**. The code units of a Java or JavaScript string can be interpreted to encode such a sequence, in the [WTF-16](https://simonsapin.github.io/wtf-8/) encoding form. This proposal does not require a WebAssembly implementation to use WTF-16 to represent its strings internally. It only requires that the implementation be able to represent all codepoint sequences that can be encoded with WTF-16, notably codepoint sequences containing isolated surrogates. ### Encodings and views There is an impedance-matching problem between the way that WebAssembly implementations want to represent strings, and the way that source languages want to access strings. The `stringref` API has to do the best it can to smooth over these differences. On the implementation side, some WebAssembly implementations will want to represent string contents in the WTF-16 encoding, notably implementations embedded on the web because of JavaScript's usage of WTF-16. Other implementations without legacy requirements will want to use WTF-8, as it is generally more space-efficient and closer to the common UTF-8 interchange format. From the source language side, some source languages such as Java will also want to consider strings as WTF-16 code unit sequences. Other source languages will want to think of strings as WTF-8 byte sequences, and others will want to think of strings as codepoint sequences. On the most basic level, when a source language wants to access string contents, it can encode the whole string to memory (or eventually to a GC-managed array) as UTF-8, WTF-8, or WTF-16, depending on the source language's needs. For example Java usually wants to treat strings as WTF-16 code unit sequences, so Java would encode to WTF-16. This proposal provides facilities for measuring how many bytes an encoding would take, and for actually doing the encoding. This proposal also includes the ability to get a WTF-8 or WTF-16 "view" on a string, which should provide near-constant-time random access to the bytes of a WTF-8 encoding of the string, or to the 16-bit code units of a WTF-16 encoding of a string. We also provide a view that allows an iterator interface over the codepoints in a string. An implementation using WTF-8 will have some costs for source languages that want WTF-16, and vice versa. Getting a view for a WebAssembly implementation's "native" string encoding is likely a constant-time operation, and possibly even free. For example, getting a WTF-8 view for an implementation that uses WTF-8 to represent strings will be free. Getting a WTF-16 view on a WTF-8 implementation could imply a copy, or possibly the computation of a [breadcrumbs](https://www.swift.org/blog/utf8-string/#breadcrumbs) table. This proposal defines positions in strings only with respect to a specific string view type. Any interface that needs to refer to a position in a string should take a string view and an offset whose meaning depends on the string view type: a byte offset for a WTF-8 string view, a code unit offset for a WTF-16 view, or a codepoint offset for a codepoint view. ### Handling invalid UTF-8 WTF-8 and WTF-16 are not interchange formats: they should not be used when communicating data over a network, for example. If a program goes to encode a string to UTF-8, and the string contains an isolated surrogate, the program can trap, or it can replace the codepoint with `U+FFFD` (the replacement character). The stringref facility provides an efficient mechanism for detecting strings which are not valid USV sequences. When encoding data as UTF-8, it allows the programmer to specify whether to replace any isolated surrogate or whether to trap. For some source languages, defining strings to be sequences of unicode scalar values and isolated surrogates is an antifeature: those source languages do not require the ability to represent isolated surrogates and would prefer to not be given strings with isolated surrogates. We sympathize. On a boundary where such a program might receive a string containing isolated surrogates, the program can check for such strings and trap using facilities defined in this proposal. ### Prior discussions * The component model subgroup chose to agree that on component boundaries, strings consist of sequences of unicode scalar values: https://docs.google.com/presentation/d/1qVbBsDFmremBGVKiOAzRk7svjinNq6LXfJ1DzeFwKtc *The CG discussion and decision inform but don't constrain this proposal.* * The AssemblyScript developers floated a "universal strings" proposal which explicitly provided for the WTF-8 and WTF-16 encodings that can support unpaired surrogates: https://github.com/AssemblyScript/universal-strings *An excellent early draft which seriously tackles the WTF-16 problem.* ## Proposal This proposal consists of a basic `stringref` facility and a (possibly post-MVP) `stringview` facility. The `stringref` facility defines a new reference type, `stringref`. Literal `stringref` values can be embedded in a WebAssembly module, with their contents taken from a new section. WebAssembly programs can also create `stringref` values from data encoded in memory or GC arrays in the WTF-8 or WTF-16 encodings, and can likewise write `stringref` contents to memory in these encodings. There is an instruction to concatenate `stringref` values. Finally, `stringref` values can be compared for equality. The `stringview` facility allows WebAssembly to obtain a "view" on the contents of a `stringref`, treating it as a sequence of values in the WTF-8 and WTF-16 encodings, as well as treating a string as a sequence of codepoints. WebAssembly programs can use stringviews to encode parts of strings to memory, access string contents by index, and to take substrings. ## The `stringref` facility One new reference type: `stringref`. Opaque, like `externref` and `funcref`. When reading or writing encoded bytes, the address in memory at which to read or write the bytes depends on the memory model of the WebAssembly module. ``` address ::= i32 | i64 ``` Such instructions also take the memory to which to read or write as an immediate. Although `stringref` is a nullable type, trap if a null `stringref` value reaches any instruction in this proposal. The one exception is `string.eq`. ### Creating strings ``` (string.new_utf8 $memory ptr:address bytes:i32) -> str:stringref (string.new_lossy_utf8 $memory ptr:address bytes:i32) -> str:stringref (string.new_wtf8 $memory ptr:address bytes:i32) -> str:stringref ``` Create a new string from the *`bytes`* bytes in memory at *`ptr`*. Out-of-bounds access will trap. The maximum value for *`bytes`* is 231–1; passing a higher value traps. These three instructions decode the bytes in three different ways: * `string.new_utf8` decodes using a strict UTF-8 decoder. If the bytes are not valid UTF-8, trap. * `string.new_lossy_utf8` decodes using a sloppy UTF-8 decoder: all maximal subparts of an invalid subsequence are decoded as if they were `U+FFFD` (the replacement character) instead. This instruction will never trap due to a decoding error. See the section entitled "U+FFFD Substitution of Maximal Subparts" in the Unicode standard, version 14.0.0, page 126. * `string.new_wtf8` decodes using a strict WTF-8 decoder, which is like UTF-8 but also allows isolated surrogates. If the bytes are not valid WTF-8, trap. ``` (string.new_wtf16 $memory ptr:address codeunits:i32) -> str:stringref ``` Create a new string from the *`codeunits`* code units encoded in memory at *`ptr`*. Out-of-bounds access will trap. *`ptr`* must be two-byte aligned, and will trap otherwise. The maximum value for *`codeunits`* is 230–1; passing a higher value traps. Each code unit is read from memory as if with `i32.load16`, and is therefore decoded using little-endian byte order. #### `string.new` size limits Creating a string is a form of dynamic allocation and can fail. The same implementation running on different machines can have different behaviors. The specification can only say that byte/code-unit sizes above a certain limit *must* fail; but for sizes within the limits, the allocations *may* fail. If an allocation fails, the implementation must trap. Fallible `string.new` is a possible future extension. ### String literals ``` (string.const contents:i32) -> str:stringref ``` Create a new string from the literal string *`contents`*, as in `(string.const "Hello, World!")`. This instruction is constant and can be used in global variable initializers. #### String literal section The `string.const` section indicates the literal as an `i32` index into a new regular section: a string table, encoded as a `vec(vec(u8))` of valid WTF-8 strings. Because literal strings can contain codepoint 0, strings in the string table do not use NUL as a terminator. The string table section must immediately precede the global section, or where the global section would be, in the binary. Though it is useful for string literals used in constant instructions to appear early in the module binary, it may be advantageous to defer string literals that are only used at run-time to later in the module binary. This can get bulky string data off the hot path, allowing a WebAssembly implementation to start compiling functions as soon as possible. The encoding of the string literals section is preceded by a placeholder `0x00` value, allowing for the possibility of a deferred string literal section as a future extension. #### `string.const` size limits The maximum size for the WTF-8 encoding of an individual string literal is 231–1 bytes. Embeddings may impose their own limits which are more restricted. But similarly to `string.new_wtf8`, instantiating a module with string literals may fail due to lack of memory resources, even if the string size is formally within the limits. However `string.const` itself never traps when passed a valid literal offset. ### Accessing string contents All parameters and return values measuring a number of codepoints or a number of code units represent these sizes as unsigned values. ``` (string.measure_utf8 str:stringref) -> codeunits:i32 ``` Measure the number of code units (bytes) that would be required to encode the contents of the string *`str`* to UTF-8. If the string contains an isolated surrogate, return -1. The maximum number of code units returned by `string.measure_utf8` is is 231-1. If an encoding would require more code units than the limit, the result is -1. ``` (string.measure_wtf8 str:stringref) -> codeunits:i32 ``` Measure the number of code units (bytes) that would be required to encode the codepoints of the string *`str`* to WTF-8. Note that this instruction also serves to measure an encoding length for UTF-8 when isolated surrogates are replaced with `U+FFFD` ("lossy UTF-8"); the same number of bytes is required to encode `U+FFFD` as would be required to encode an isolated surrogate to WTF-8. The maximum number of code units returned by `string.measure_wtf8` is is 231-1. If an encoding would require more code units than the limit, the result is -1. ``` (string.measure_wtf16 str:stringref) -> codeunits:i32 ``` Measure the number of code units that would be required to encode the contents of the string *`str`* to WTF-16. The maximum number of code units returned by `string.measure_wtf16` is is 230-1. If an encoding would require more code units than the limit, the result is -1. ``` (string.encode_utf8 $memory str:stringref ptr:address) -> codeunits:i32 ``` Encode the contents of the string *`str`* as UTF-8 to memory at *ptr*. If an isolated surrogate is seen, trap. Return the number of code units written, which will be the same as returned by the corresponding `string.measure_utf8`. The maximum number of bytes that can be encoded at once by `string.encode` is 231-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap). ``` (string.encode_lossy_utf8 $memory str:stringref ptr:address) -> codeunits:i32 ``` Encode the contents of the string *`str`* as UTF-8 to memory at *`ptr`*. If an isolated surrogate is seen, encode `U+FFFD` (the replacement character) instead. Return the number of code units written, which will be the same as returned by the corresponding `string.measure_wtf8`. The maximum number of bytes that can be encoded at once by `string.encode` is 231-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap). ``` (string.encode_wtf8 $memory str:stringref ptr:address) -> codeunits:i32 ``` Encode the contents of the string *`str`* as WTF-8 to memory at *`ptr`*. Return the number of code units written, which will be the same as returned by the corresponding `string.measure_wtf8`. The maximum number of bytes that can be encoded at once by `string.encode` is 231-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap). ``` (string.encode_wtf16 $memory str:stringref ptr:address) -> codeunits:i32 ``` Encode the contents of the string *`str`* as WTF-16 to memory at *`ptr`*. Return the number of code units written, which will be the same as returned by the corresponding `string.measure_wtf16`. Each code unit is written to memory as if stored by `i32.store16`, so WTF-16 code units are in little-endian byte order. The maximum number of bytes that can be encoded at once by `string.encode` is 231-1. If an encoding would require more bytes, it is as if the codepoints can't be encoded (a trap). ### Concatenation ``` (string.concat a:stringref b:stringref) -> stringref ``` Return a new stringref containing the codepoints from *`a`* followed by the codepoints from *`b`*. Note that the implementation should take care when, at any future time, treating the resulting string as a sequence of codepoints. If *`a`*'s last codepoint is a high surrogate and *`b`*'s first codepoint is a low surrogate, these two codepoints combine into one, as if they were the two code units of a UTF-16-encoded unicode scalar value. Concatenating two strings is a form of dynamic allocation and can fail. If an allocation fails, the implementation must trap. Fallible `string.concat` is a possible future extension. ### Predicates ``` (string.eq a:stringref b:stringref) -> i32 ``` If both *`a`* and *`b`* are null, return 1. If only one of them is null, return 0. Otherwise return 1 if the strings *`a`* and *`b`* contain the same codepoint sequence, or 0 otherwise. ``` (string.is_usv_sequence str:stringref) -> bool:i32 ``` Return 1 if the string *`str`* is a sequence of unicode scalar values, and 0 otherwise. A 0 result indicates that *`str`* contains isolated surrogates. ## The `stringview` facility Three new reference types: `stringview_wtf8`, `stringview_wtf16`, and `stringview_iter`. Opaque, like `externref` and `funcref`. ### `stringview_wtf8` ``` (string.as_wtf8 str:stringref) -> view:stringview_wtf8 ``` Obtain a view on a string's contents as WTF-8. The stringview can then be used to interpret the string as a byte sequence in the WTF-8 encoding. ``` (stringview_wtf8.advance view:stringview_wtf8 pos:i32 bytes:i32) -> next_pos:i32 ``` Starting at offset *`pos`* into the WTF-8 encoding of *`view`*, return the highest offset that is not greater than *`pos` + `bytes`*. If *`pos`* is greater than the WTF-8 byte length of *`view`*, it is as if it were instead given as the byte length. If *`pos`* is less than the byte length but does not indicate the byte offset of the start of a codepoint, it is advanced to the next codepoint (or the end of the string, for the last codepoint). Collectively these transformations are the "WTF-8 position treatment". If the mathematical value of *`next_pos`* would be greater than 231, trap. (Future extensions of the `stringref` proposal along the lines of the [memory64 proposal](https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md) may allow for 64-bit variants of the position-using instructions, which could relax this restriction.) ``` (stringview_wtf8.encode_utf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32) -> next_pos:i32, bytes:i32 (stringview_wtf8.encode_lossy_utf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32) -> next_pos:i32, bytes:i32 (stringview_wtf8.encode_wtf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32) -> next_pos:i32, bytes:i32 ``` Write a subsequence of the WTF-8 encoding of *`view`* to memory at *`ptr`*, starting at the WTF-8 offset *`pos`*, writing no more than *`bytes`* bytes. No NUL byte is written. Return the WTF-8 offset of the next characters to encode, along with the number of bytes written. *`pos`* receives the "WTF-8 position treatment", as for `stringview_wtf8.advance`. If the mathematical value of *`next_pos`* would be greater than 231, trap. (Future extensions of the `stringref` proposal along the lines of the [memory64 proposal](https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md) may allow for 64-bit variants of the position-using instructions, which could relax this restriction.) If an isolated surrogate is seen, the behavior depends on the instruction: * `stringview_wtf8.encode_utf8` will trap. * `stringview_wtf8.encode_lossy_utf8` will encode `U+FFFD`. * `stringview_wtf8.encode_wtf8` will encode the isolated surrogate. ``` (stringview_wtf8.slice view:stringview_wtf8 start:i32 end:i32) -> str:stringref ``` Return a substring of *`view`*, for the WTF-8 bytes starting at offset *`start`* and continuing to but not including *`end`*. *`start`* and *`end`* receive the "WTF-8 position treatment", as for `stringview_wtf8.advance`. ### `stringview_wtf16` ``` (string.as_wtf16 str:stringref) -> view:stringview_wtf16 ``` Obtain a view on a string's contents as WTF-16. The stringview can then be used to interpret the string as a sequence of 16-bit code units in the WTF-16 encoding. ``` (stringview_wtf16.length view:stringview_wtf16) -> length:i32 ``` Return the total number of 16-bit code units necessary to represent *`view`* in the WTF-16 encoding. ``` (stringview_wtf16.get_codeunit view:stringview_wtf16 pos:i32) -> codeunit:i32 ``` Return the 16-bit code unit at offset *`pos`* in the WTF-16 encoding of *`view`*. If *`pos`* is greater than or equal to the WTF-16 length of *`view`*, trap. ``` (stringview_wtf16.encode $memory view:stringview_wtf16 ptr:address pos:i32 len:i32) -> codeunits:i32 ``` Write a subsequence of the WTF-16 encoding of *`view`* to memory at *`ptr`*, starting at the WTF-16 offset *`pos`*, writing no more than *`len`* 16-bit code units. If *`ptr`* is not two-byte aligned, trap. Return the number of code units written. If *`pos`* is greater than the number of WTF-16 code units in *`view`*, it is as if it were instead given as the code unit length. This transformation is the "WTF-16 position treatment". ``` (stringview_wtf16.slice view:stringview_wtf16 start:i32 end:i32) -> str:stringref ``` Return a substring of *`view`*, for the WTF-16 code units starting at offset *`start`* and continuing to but not including *`end`*. *`start`* and *`end`* receive the "WTF-16 position treatment", as for `stringview_wtf16.encode`. ### `stringview_iter` ``` (string.as_iter str:stringref) -> view:stringview_iter ``` Obtain a view on a string's contents as an iterator over the codepoints in *`str`*, initially positioned at the beginning of the string. The stringview can then be used to iterate over the codepoints of the string. ``` (stringview_iter.next view:stringview_iter) -> codepoint:i32 ``` If *`view`* is already at the end of the string, return -1. Otherwise return the codepoint currently pointed to by the iterator, and advance the iterator's position by one codepoint. ``` (stringview_iter.advance view:stringview_iter codepoints:i32) -> codepoints:i32 ``` Advance the iterator *`view`* by up to *`codepoints`* codepoints. Return the number of codepoints that were actually consumed. ``` (stringview_iter.rewind view:stringview_iter codepoints:i32) -> codepoints:i32 ``` Rewind the iterator *`view`* by up to *`codepoints`* codepoints. Return the number of codepoints that were actually consumed. ``` (stringview_iter.slice view:stringview_iter codepoints:i32) -> str:stringref ``` Return a substring of *`view`*, starting at the current position of *`view`* and continuing for at most *`codepoints`* codepoints. ### GC integration Though this proposal does not have a dependency on the [GC proposal](https://github.com/WebAssembly/gc/blob/master/proposals/gc/MVP.md), compiler authors that target GC will likely want to be able to encode the contents of a stringref to a GC array, and vice versa. The primary use cases are: 1. String-builder interfaces, which will likely use a WTF-8 or WTF-16 array as intermediate storage, depending on the language being compiled. We will need to be able to create strings from arrays. When the string contents are ready, we will almost always decode from array offset 0 and continue to some offset before the end of the array. We'll also need to be able to append a string's contents to an array at a given offset. 2. Communicating strings with another process, possibly over the network. Here, UTF-8 and WTF-8 are the important encodings, and we need to be able to read and write to arbitrary slices of arrays. The instructions below shall be available in WebAssembly implementations that support both GC and stringrefs. ``` (string.new_utf8_array codeunits:$t start:i32 end:i32) if expand($t) => array i8 -> str:stringref (string.new_lossy_utf8_array codeunits:$t start:i32 end:i32) if expand($t) => array i8 -> str:stringref (string.new_wtf8_array codeunits:$t start:i32 end:i32) if expand($t) => array i8 -> str:stringref ``` Create a new string from a subsequence of the *`codeunits`* bytes in a GC-managed array, starting from offset *`start`* and continuing to but not including *`end`*. If *`end`* is less than *`start`* or is greater than the array length, trap. The bytes are decoded in the same way as `string.new_utf8`, `string.new_lossy_utf8`, and `string.new_wtf8`, respectively. The maximum value for *`end`*–*`start`* is 231–1; passing a higher value traps. ``` (string.new_wtf16_array codeunits:$t start:i32 end:i32) if expand($t) => array i16 -> str:stringref ``` Create a new string from a subsequence of the *`codeunits`* WTF-16 code units in a GC-managed array, starting from offset *`start`* and continuing to but not including *`end`*. If *`end`* is less than *`start`* or is greater than the array length, trap. The maximum value for *`end`*–*`start`* is 230–1; passing a higher value traps. ``` (string.encode_utf8_array str:stringref array:$t start:i32) if expand($t) => array (mut i8) -> codeunits:i32 (string.encode_lossy_utf8_array str:stringref array:$t start:i32) if expand($t) => array (mut i8) -> codeunits:i32 (string.encode_wtf8_array str:stringref array:$t start:i32) if expand($t) => array (mut i8) -> codeunits:i32 (string.encode_wtf16_array str:stringref array:$t start:i32) if expand($t) => array (mut i16) -> codeunits:i32 ``` Encode the contents of the string *`str`* as WTF-8 or WTF-16, respectively, to the GC-managed array *`array`*, starting at offset *`start`*. Return the number of code units written, which will be the same as the result of a the corresponding `string.measure_wtf8` or `string.measure_wtf16`, respectively. If there is not space for the code units in the array, trap. Note that no `NUL` terminator is ever written. For `string.encode_utf8_array`, trap if an isolated surrogate is seen. For `string.encode_lossy_utf8_array`, replace isolated surrogates with `U+FFFD`. ## Binary encoding ``` reftype ::= ... | 0x64 ⇒ stringref ; SLEB128(-0x1c) | 0x63 ⇒ stringview_wtf8 ; SLEB128(-0x1d) | 0x62 ⇒ stringview_wtf16 ; SLEB128(-0x1e) | 0x61 ⇒ stringview_iter ; SLEB128(-0x1f) instr ::= ... | 0xfb 0x80:u32 $mem:u32 ⇒ string.new_utf8 $mem | 0xfb 0x81:u32 $mem:u32 ⇒ string.new_wtf16 $mem | 0xfb 0x82:u32 $idx:u32 ⇒ string.const $idx | 0xfb 0x83:u32 ⇒ string.measure_utf8 | 0xfb 0x84:u32 ⇒ string.measure_wtf8 | 0xfb 0x85:u32 ⇒ string.measure_wtf16 | 0xfb 0x86:u32 $mem:u32 ⇒ string.encode_utf8 $mem | 0xfb 0x87:u32 $mem:u32 ⇒ string.encode_wtf16 $mem | 0xfb 0x88:u32 ⇒ string.concat | 0xfb 0x89:u32 ⇒ string.eq | 0xfb 0x8a:u32 ⇒ string.is_usv_sequence | 0xfb 0x8b:u32 $mem:u32 ⇒ string.new_lossy_utf8 $mem | 0xfb 0x8c:u32 $mem:u32 ⇒ string.new_wtf8 $mem | 0xfb 0x8d:u32 $mem:u32 ⇒ string.encode_lossy_utf8 $mem | 0xfb 0x8e:u32 $mem:u32 ⇒ string.encode_wtf8 $mem | 0xfb 0x90:u32 ⇒ string.as_wtf8 | 0xfb 0x91:u32 ⇒ stringview_wtf8.advance | 0xfb 0x92:u32 $mem:u32 ⇒ stringview_wtf8.encode_utf8 $mem | 0xfb 0x93:u32 ⇒ stringview_wtf8.slice | 0xfb 0x94:u32 $mem:u32 ⇒ stringview_wtf8.encode_lossy_utf8 $mem | 0xfb 0x95:u32 $mem:u32 ⇒ stringview_wtf8.encode_wtf8 $mem | 0xfb 0x98:u32 ⇒ string.as_wtf16 | 0xfb 0x99:u32 ⇒ stringview_wtf16.length | 0xfb 0x9a:u32 ⇒ stringview_wtf16.get_codeunit | 0xfb 0x9b:u32 $mem:u32 ⇒ stringview_wtf16.encode $mem | 0xfb 0x9c:u32 ⇒ stringview_wtf16.slice | 0xfb 0xa0:u32 ⇒ string.as_iter | 0xfb 0xa1:u32 ⇒ stringview_iter.next | 0xfb 0xa2:u32 ⇒ stringview_iter.advance | 0xfb 0xa3:u32 ⇒ stringview_iter.rewind | 0xfb 0xa4:u32 ⇒ stringview_iter.slice | 0xfb 0xb0:u32 [gc] ⇒ string.new_utf8_array | 0xfb 0xb1:u32 [gc] ⇒ string.new_wtf16_array | 0xfb 0xb2:u32 [gc] ⇒ string.encode_utf8_array | 0xfb 0xb3:u32 [gc] ⇒ string.encode_wtf16_array | 0xfb 0xb4:u32 [gc] ⇒ string.new_lossy_utf8_array | 0xfb 0xb5:u32 [gc] ⇒ string.new_wtf8_array | 0xfb 0xb6:u32 [gc] ⇒ string.encode_lossy_utf8_array | 0xfb 0xb7:u32 [gc] ⇒ string.encode_wtf8_array ;; New section. If present, must be present only once, and right before ;; the globals section (or where the globals section would be). Each ;; vec(u8) must be valid WTF-8. The 0x00 is a placeholder for future ;; expansion. One possible expansion would be to replace the 0x00 with ;; a u32 indicating a count of supplementary string literals that are in ;; a section that appears later in the binary, after the code section. stringrefs ::= section_14(0x00 vec(vec(u8))) ``` Note that the u32 (uleb) encoding for the opcode after the `0xfb` prefix takes two bytes, for opcode values between 0x80 and 0x3fff. ## Examples We assume that the textual syntax for instructions that take a memory operand allows you to elide the memory, in which case it defaults to 0. ### Make string from NUL-terminated UTF-8 in memory ```wasm (func $string-from-utf8 (param $ptr i32) (result stringref) local.get $ptr local.get $ptr call $strlen string.new_utf8) ``` If the bytes being decoded aren't actually valid UTF-8, this function will trap. Use `string.new_lossy_utf8` in contexts where replacing invalid data with `U+FFFD` is a better strategy than trapping. ### Make string from an array of WTF-8 code units in memory ```wasm (func $string-from-wtf8n (param $ptr i32) (param $len i32) (result stringref) local.get $ptr local.get $len string.new_wtf8) ``` Note that `string.new_wtf8` (and `string.new_wtf8_array`) are always strict decoders: if the bytes are not valid WTF-8, the instruction traps. ### Make string from UTF-16 in memory ```wasm (func $string-from-utf16 (param $ptr i32) (param $units i32) (result stringref) local.get $ptr local.get $units string.new_wtf16) ``` This proposal doesn't distinguish between UTF-16 and WTF-16 at all; rather it just deals in WTF-16, as most source languages that expose 16-bit code units to users actually expose WTF-16 strings. ### Number of codepoints in string ```wasm (func $codepoint-length (param $str stringref) (result i32) local.get $str string.as_iter ;; Get iterator view i32.const -1 ;; advance by all codepoints stringview_iter.advance) ;; return number of codepoints advanced ``` ### String literals ```wasm (global $hey stringref (string.const "Hey")) (func $howdy (result stringref) (string.const "Howdy")) (func $is-cowboy (param $str stringref) (result i32) local.get $str call $howdy string.eq) ``` ### Return the first codepoints of a string, as a stringref ```wasm (func $prefix (param $str stringref) (param $codepoints i32) (result stringref) local.get $str string.as_iter local.get $codepoints stringview_iter.slice) ``` ### Return a slice of WTF-16 code units of a string, as a stringref ```wasm (func $slice (param $str stringref) (param $offset i32) (param $codeunits i32) (result stringref) local.get $str string.as_wtf16 local.get $offset local.get $offset local.get $codeunits i32.add stringview_wtf16.slice) ``` ### Suffix, prefix comparisons There are a few ways to compare against a substring, but the easiest is probably to slice the string, which is something you can do only with respect to a particular encoding and view. Given that we're comparing against known strings, we know how long of a slice to take. ```wasm (func starts-with-hey? (param $str stringref) (result i32) local.get $str string.as_wtf8 i32.const 0 i32.const 3 stringview_wtf8.slice global.get $hey string.eq) (func ends-with-howdy?/wtf8 (param $str stringref) (result i32) (local $wtf8 stringview_wtf8) local.get $str string.as_wtf8 local.set $wtf8 local.get $wtf8 local.get $wtf8 ;; Get wtf-8 offset of end i32.const 0 i32.const -1 stringview_wtf8.advance ;; Subtract 5. Given WTF-8 position treatment, OK to wrap or ;; not be on codepoint boundary. i32.const 5 i32.sub ;; Slice until end. If string ends with "howdy", these will be ;; 5 1-byte codepoints. i32.const -1 stringview_wtf8.slice string.const "Howdy" string.eq) ;; WTF-16 flavor is similar. (func ends-with-howdy?/wtf16 (param $str stringref) (result i32) (local $wtf16 stringview_wtf16) local.get $str string.as_wtf16 local.set $wtf16 ;; Slice last 5 code units. local.get $wtf16 local.get $wtf16 stringview_wtf16.length i32.const 5 i32.sub i32.const -1 stringview_wtf16.slice string.const "Howdy" string.eq) ;; Finally, a version with the iterator API. (func ends-with-howdy?/iter (param $str stringref) (result i32) (local $iter stringview_iter) local.get $str string.as_iter local.set $iter ;; Advance to end. local.get $iter i32.const -1 stringview_iter.advance ;; Rewind by 5. local.get $iter i32.const 5 stringview_iter.rewind ;; Slice. local.get $iter i32.const 5 stringview_iter.slice ;; Compare. string.const "Howdy" string.eq) ``` Which version of `ends-with-howdy?` will a source language produce? They are essentially equivalent in this use case of comparing against a static string, but in the general case, a source language that processes strings in terms of codepoints would probably use the iterator, languages that treat strings as UTF-8 sequences would produce the WTF-8 version whereas those that process strings in terms of 16-bit code units will compile to the WTF-16 version. One could instead do a character-by-character comparison, to avoid creating the slice. Stepping back a bit, prefix and suffix checks are examples of operations for which the stringref proposal should facilitate high-performance implementations. The primary strategy of the stringref proposal is to allow any such operation to be build in terms of its primitives. However if there are important compound operations (e.g. prefix/suffix checks) that can be sped up with a dedicated instruction, we should be open to considering adding more instructions. ### Store a `stringref` without copying ```wasm (table $strings 100 stringref) (global $next-handle i32 (i32.const 0)) (func $intern-string (param $str stringref) (result i32) (local $handle i32) global.get $next-handle local.tee $handle local.get $str table.set $strings i32.const 1 i32.add global.set $next-handle local.get $handle) ``` ### Copy string contents to application-managed memory ```wasm (func $malloc (param i32) (result i32)) (func $utf8-contents (param $str stringref) (result i32) (local $cur i32) (local $len i32) (local $ptr i32) local.get $str string.measure_utf8 local.set $len block $valid local.get $len i32.const -1 i32.ne br_if $valid unreachable ;; trap on error end local.get $len i32.const 1 i32.add call $malloc ;; reserve space for bytes and NUL local.set $ptr local.get $str local.get $ptr string.encode_utf8 ;; push bytes written, same as $len local.get $ptr i32.add i32.const 0 i32.store8 ;; write NUL local.get $ptr return) ``` Using `string.measure_utf8` ensures that the encoded string is a valid unicode scalar value sequence. How to handle invalid UTF-8 is up to the user; instead of `unreachable` we could throw an exception. Note that in this case, the subsequent `string.encode_utf8` could just as well have been `string.encode_lossy_utf8` or `string.encode_wtf8`, as these instructions are all the same for strings that do not contain isolated surrogates, and we checked that there were none. If we meant to handle isolated surrogates, we could use `string.measure_wtf8` instead. ### Stream over contents of string Assume you have a 1024-byte array of memory at `$buf`. This function will encode isolated surrogates as WTF-8. ```wasm (global $buf i32) (func $process-wtf8 (param $ptr i32) (param $len i32)) (func $process-string (param $str stringref) (local $cursor i32) ;; initial value of 0 is start (local $bytes i32) loop local.get $str local.get $cursor global.get $buf i32.const 1024 string.encode_wtf8 ;; push bytes written local.tee $bytes (if i32.eqz (then return)) ;; if no bytes encoded, done local.get $bytes local.get $cursor i32.add local.set $cursor global.get $buf local.get $bytes call $process-utf8 end) ``` ### Stream over UTF-16 code units of string, handling isolated surrogates This function is probably slower than encoding chunks of the string to WTF-16 in linear memory, for longer strings. ```wasm (func $have-code-unit (param $codeunit i32)) (func $process-string (param $str stringref) (local $wtf16 stringview_wtf16) (local $cur i32) (local $len i32) local.get $str string.as_wtf16 local.set $wtf16 local.get $wtf16 stringview_wtf16.length local.set $len block $done loop $loop local.get $cur local.get $len i32.ge br_if $done local.get $wtf16 local.get $cur stringview_wtf16.get_codeunit call $have-code-unit i32.const 1 local.get $cur i32.add local.set $cur end end) ``` ### Stream over codepoints of string, handling isolated surrogates This function is probably slower than encoding chunks of the string to WTF-8 in memory, for longer strings. ```wasm (func $have-codepoint (param $codepoint i32)) (func $process-string (param $str stringref) (local $iter stringview_iter) (local $ch i32) local.get $str string.as_iter local.set $iter block $done loop $loop local.get $iter stringview_iter.next local.tee $ch i32.const -1 i32.eq br_if $done local.get $ch call $have-codepoint end end) ``` ### Concatenate two strings ```wasm (func $append (param $a stringref) (param $b stringref) (result stringref) local.get $a local.get $b string.concat) ``` ## FAQ ### What does Emscripten currently do for strings and how could WebAssembly strings help? Generally speaking, Emscripten [eagerly converts JavaScript strings to NUL-terminated UTF-8](https://github.com/emscripten-core/emscripten/blob/main/src/preamble.js#L91-L100), allocating space for the UTF-8 encoding in linear memory using stack allocation. The [stringToUTF8 function](https://github.com/emscripten-core/emscripten/blob/main/src/runtime_strings.js#L158) is written in JavaScript and handles surrogate pairs. However for isolated surrogates, [emscripten's decoder appears to produce garbage](https://github.com/emscripten-core/emscripten/issues/15324). For C functions that return strings, emscripten [parses NUL-terminated UTF-8 from memory](https://github.com/emscripten-core/emscripten/blob/main/src/preamble.js#L109), [either using TextDecoder or via hand-rolled JavaScript](https://github.com/emscripten-core/emscripten/blob/main/src/runtime_strings.js#L35). Presumably TextDecoder is significantly faster as it doesn't have to build rope strings. Memory management is an issue, of course; the memory for a returned string value may or may not be owned by the caller. This proposal avoids memory ownership issues entirely, via automatic memory management (implemented either via GC or reference counting). It also avoids eager string encoding onto the stack and the need for NUL termination, allowing string contents to be written to memory exactly where they are needed. ### Why should WebAssembly strings be able to represent isolated surrogates? The main motivation is to support source languages with WTF-16 strings (e.g. Java, Kotlin, C#). JVM-based and CLR-based languages treat strings as sequences of 16-bit code units. Sometimes programs written in e.g. Java will decode these sequences into codepoints encoded as WTF-16, but not always. Many common algorithms can be performed directly on the code units, for example prefix matching. Therefore to efficiently support Java and friends when compiled to WebAssembly, we need to support this view of strings as sequences of any 16-bit code units, without any validity constraints that enforce that surrogates always be properly paired. This is the main reason to support WTF-16 rather than just UTF-16. An important secondary reason is interoperability with JavaScript hosts. For zero-copy interoperation with JavaScript and DOM facilities, it would be good if for `stringref` to have the same semantics as a JavaScript string, which like Java is an arbitrary sequence of 16-bit code units. Isolated surrogates are rare in JavaScript, but can occur via: - Reading invalid UTF-16 from external sources. However this is not common, as most services prefer UTF-8 over UTF-16 as an interchange format. - JavaScript code that creates strings whose code units are not valid UTF-16. - JavaScript code that processes strings in chunks and happens to split a chunk on a surrogate boundary. - This happens most often in JavaScript code that processes strings one code unit at a time. - [JavaScript / DOM keyboard input event handlers](https://github.com/rustwasm/wasm-bindgen/issues/1348#issuecomment-473186426) (though this may be just a bug; [[1]](https://bugzilla.mozilla.org/show_bug.cgi?id=1541349), [[2]](https://bugs.chromium.org/p/chromium/issues/detail?id=949056)). Therefore we define a `stringref` as an arbitrary sequence of not just unicode scalar values, but also isolated surrogates. Note that this definition excludes codepoint sequences containing proper surrogate pairs. This restriction is enforced by construction for the WTF-8 and WTF-16 encoding schemes. ### Are stringrefs mutable? No. We don't need mutable strings when compiling Java, C#, or Python, and we don't need them when interoperating with JavaScript hosts. Immutable strings have the benefit that you can hand them to an untrusted interface without copying, and you know that interface won't be able to use the string to affect any of your own state. It is not a goal for `stringref` to be the main string representation for programming languages that need mutable strings. Fortunately there are fewer and fewer of these languages as time goes on. ### What do existing C++ APIs to JavaScript strings look like? While developing this proposal, we realized that we might already have a design oracle as regards JavaScript integration: `v8.h`. Perhaps for languages that tend to work on strings in linear memory (C++, Rust), we can use the C++ interface to a JS engine as an indication of what interfaces we might need. - We can assume that `v8.h` has all the interfaces that Chromium needs, so we expect that the interfaces in `v8.h` are sufficient. - V8 wants to minimize API surface and historically has removed API, so we expect V8's interface is close to minimal. - C++ interfaces to different JS engines are similar. We can look at `v8.h` and draw conclusions for any engine. The [V8 C++ String API](https://chromium.googlesource.com/v8/v8/+/refs/heads/main/include/v8-primitive.h#110) includes the following procedural interfaces: - Create a string from encoded bytes in memory - Supported encodings: `one-byte`, `utf-8`, `utf-16` - Get length of string when encoded as `one-byte`, `utf-8`, `utf-16` - Does not include unicode scalar value count - Predicate on string to identify strings represented using one byte per character (a cheap check) and strings that can be represented using one byte per character (possibly a linear search) - Write encoded bytes to memory - Supported encodings: `one-byte`, `utf-8`, `utf-16` - Options: hint that ropes should be flattened, include `NUL` terminator or not, whether to preserve `NUL` codepoints, whether to replace isolated surrogates with the replacement character or to trap - Support for strings whose characters are in linear memory and which shouldn't be copied ("external strings"); probably not appropriate for WebAssembly - Equality predicate - Concatenate two strings. Interestingly, `v8.h` has no interface to make a substring (slice). We used this set of interfaces as a starting point for the `stringref` design. The need to support WebAssembly implementations that use WTF-8 to represent strings internally did cause us, however, to separate out some functionality into `stringview`. ### What is the expected implementation on non-browser run-times? Assuming that the non-browser implementation uses WTF-8 as the native string representation, then a `stringref` could be just a pointer, a length, and a reference count. Some implementations may also want to keep a flag indicating whether a string is valid UTF-8. Generally speaking, WebAssembly doesn't specify the time or space complexity of its operations. In that regard, an implementation is free to implement e.g. `string.concat` via an eager copy. In practice however we expect the same dynamics that lead JavaScript implementations to natively support ropes and slices would hold with non-browser run-times. These implementations would also have their own heuristics for when to flatten strings. When creating a `stringview_wtf16` from a `stringref` on a system that represents `stringref` as WTF-8, we expect that some implementations will eagerly copy the string to a WTF-16 encoding. Others will want to implement a map from WTF-16 position to WTF-8 position via [breadcrumbs](https://www.swift.org/blog/utf8-string/#breadcrumbs). ### What's the expected implementation in web browsers? We expect that web browsers use JS strings as `stringref`. We expect also that web browsers use JS strings directly as their `stringview_wtf16` implementation, given that current web browsers represent strings internally as WTF-16 (with some optimizations for latin-1 strings). For `stringview_wtf8`, we expect either an eager copy or breadcrumbs, as in the non-browser runtime case. Some small strings may avoid the re-encoding and instead re-encode on the fly. There is a possibility that some web browsers may eventually switch from the one-byte/two-byte representation to WTF-8 with breadcrumbs, which would make those web browsers use the same strategy as the non-browser case. ### How should stringviews be represented in JavaScript hosts? It's possible for a WebAssembly module to define an exported function that returns a `stringview_iter`. This proposal leaves the question of the JS API for stringviews to a post-MVP proposal. We expect that until such a proposal lands, attempting to pass a stringview across the WebAssembly/JS boundary will throw an exception, as was the case for `i64` values before the BigInt proposal landed. ### How do we expect Rust to compile to `stringref`? Generally speaking, for Rust we expect eager copies to UTF-8 data when Rust receives a stringref. Rust represents strings natively as well-formed UTF-8. Rust string processing routines can therefore assume that a UTF-8 string is valid. `stringref` strings are WTF-8, though. So we can expect that for a Rust interface that exports a function that takes a `stringref` parameter, `wasm-bindgen` would then use a [WasmString](https://rustwasm.github.io/wasm-bindgen/reference/types/string.html) type, which could be transformed to an `Option` ([which could fail or replace with U+FFFD if there are isolated surrogates](https://rustwasm.github.io/wasm-bindgen/reference/types/str.html#utf-16-vs-utf-8)). This will remove the need for `TextDecoder`/`TextEncoder`. As an optimization for Rust modules that are designed to work with WebAssembly, `WasmString` may expose some methods to avoid an eager copy. ### How do we expect JVM and CLR languages to compile to `stringref`? We expect Java to use `stringref` directly to represent string values. Java deals with strings as immutable sequences of 16-bit code units. Access to individual code units would use `stringview_wtf16`. Alternately, a Java compiler might instead choose to use `stringview_wtf16`, eagerly obtaining WTF-16 views when it receives a `stringref` from the outside world. ### How do we expect Python to compile to `stringref`? We expect CPython to provide a wrapper around `stringref` for strings that come from "outside". We expect PyPy to use `stringref` directly for all strings. Python strings [are immutable sequences of Unicode code points](https://docs.python.org/3/library/stdtypes.html#textseq). This may include surrogates. CPython's string support is abstract: all codepoint access goes through an accessor API. Therefore when CPython receives a `stringref` on a public interface, CPython could store that `stringref` in a table and then forward any indexed codepoint access to that `stringref`. PyPy would instead use `stringref` directly to implement its strings. The PyPy maintainer notes that most strings in Python aren't accessed using indexed accessors, so probably PyPy would only obtain a view as needed. ### How do we expect C++ to compile to `stringref`? We expect that LLVM will be extended with an additional reference type, `stringref`, like the existing `externref` and `funcref` support, along with a number of builtins to expose the basic `stringref` operations. LLVM will be able to directly expose C++ functions to WebAssembly that take `stringref` parameters, removing the need for much Emscripten-side code. However as reference-typed values aren't storable to main memory, we expect that unless a C++ program is carefully built to integrate reference types, that most `stringref` values will be eagerly converted to WTF-8 on the WebAssembly boundary. ### Is the `stringref` type nullable? Oh God I guess so. `ref.null string` it is I guess!! :sob: :sob: :sob: ### What kinds of performance gains can we expect? * WebAssembly can receive encoded content of JS strings exactly where it is wanted: no need to stack-allocate then copy. * WebAssembly can process long strings in chunks rather than having to reserve space for the whole string. * WebAssembly can cheaply check incoming strings against literals, treating them as symbols. * Avoid JIT warmup for JS-implemented UTF-8 encode and decode. * Avoid allocation of subarrays when decoding; e.g. [as used by emscripten](https://github.com/emscripten-core/emscripten/blob/da842597941f425e92df0b902d3af53f1bcc2713/src/runtime_strings.js#L48) * Cheap prefix/suffix tests without reading whole string * WebAssembly can cheaply pass string literals to JS without decoding or copying ### What's the security implication? Right now, working with strings fundamentally means communicating UTF-8 via memory. To grant someone access to a string, you have to grant them access to all of your memory. This violates the principle of least privilege. Having reference-typed strings would limit the capability to just the immutable codepoint sequence in question, and not all of memory. Additionaly, interfacing between memory lifetimes in C/C++ and JavaScript is bug-prone. Using `stringref` would eliminate questions of memory ownership, reducing the risk of use-after-free, data corruption, write overruns, and privileged data leakage. ### Why might you want to eagerly copy string contents to and from linear memory? Some programming languages will be happy to deal with string contents via the `stringview` APIs, avoiding copies of string contents to linear or GC-managed memory. Some others will prefer to copy out a WTF-8 encoding to main memory, because that's how they are used to dealing with strings. This copying has an overhead but for algorithms that touch many code units it can be advantageous, as you get to inline the per-code-unit processing work rather than calling out to `stringview` interfaces. As the `stringview` interfaces may exhibit polymorphism, they may have some per-operation overheads. For example, a `stringview_wtf16` will be cheap to create in JavaScript, but accessing the code units still has to dispatch over whether the string is a rope or a slice, whether the codepoints are one-byte or two-byte, and so on. Even in non-browser WTF-8 implementations there will still be ropes and slices. ### Why not just use `externref` and imported functions? The instruction set could be implemented with imported functions, replacing the `stringref` type with `externref`. So why bother adding it to WebAssembly itself? Three reasons: platform expressivity, performance, and security. On the first point: if the strings feature required some capability from the host, then it would be clearly best as a library. For example, WebGL access falls in this category. But reference-typed strings are a more fundamental feature common to all languages that use automatic memory management. In that way they are closer to the GC proposal; although you could implement structs and arrays via `externref` and imports, if you did that you might as well compile to JavaScript instead of WebAssembly. It should be possible to make a WebAssembly program that uses reference-typed strings (because almost all such programs would have strings) without relying on any JavaScript at all. Also, the evolutionary endpoint of an `externref`-and-imports strategy is a JavaScript-specific string interface. Without any broader WebAssembly platform concern, strings-using WebAssembly code would find itself relying on details of JavaScript's string representation, for example having the only interface be to process strings one code unit at a time instead of one codepoint at a time. This is not a good platform outcome. Finally, though the WebAssembly platform should be able to stand alone, it should also interoperate smoothly with hosts, especially JavaScript on the web. This rules out any implementation of strings in terms of reference-typed structs and arrays: not only would such an implementation be likely slower than the host's strings, it would also be incompatible. On the web, WebAssembly and JavaScript should use the same string implementation. On the performance side, we expect that `stringref` will be faster than `externref`+imports: 1. Whereas an `externref` might need to be a tagged union, a `stringref` can be an unpacked pointer. 2. WebAssembly instructions are likely faster and less of an optimization barrier than callouts to imports. 3. Run-time helper code for WebAssembly instructions is probably implemented in C++/Rust/etc more directly, resulting in more predictable performance than e.g. an encoder implemented in JS (for web embeddings). 4. Reading string contents, either via `string.encode_wtf8`-then-process-inline or via `stringview_wtf16`, is likely faster than calling out to JavaScript to read code units one at a time. WebAssembly-to-JavaScript calls are cheap but not free. On the other hand, it's true that JS run-time routines can use adaptive JIT techniques to possibly inline representation-specific accessors. This is of limited use though for run-time routines with many different call sites. On the reliability and security side, adding `stringref` to WebAssembly removes a significant user of extra-module access to memory. Because the WebAssembly code can pick apart the string itself, that's one fewer reason for the WebAssembly module to have to expose its memory. ### How does this relate to the component model and interface types? The component model is [a vision of how to compose systems out of shared-nothing parts implemented in WebAssembly](https://github.com/WebAssembly/component-model/pull/1/files). The boundaries between these components are mediated by interface types, which specify how to communicate data from one component to another in the most efficient way possible. At one level, reference-typed strings don't appear have anything to do with the component model. Because components are specified to not share anything, even GC-managed data, zero-copy communication of reference-typed strings between components is strictly out of scope (though this operation may be zero-copy in practice; see below). From the perspective of the component model, reference-typed strings are rather an *intra-component* concern. A component may be composed internally of a number of WebAssembly modules, as well as possible host facilities such as JavaScript. The zero-copy properties provided by `stringref` are only assured on to the inter-module, intra-component boundaries of a program. That said, strings in the abstract are an important data type, and relate to interface types (a WebAssembly proposal based on the component model). Obviously you will want to be able to use `stringref` with interface types. The shared-nothing design choice of the component model then implies that `stringref` contents should be copied when they cross a component boundary. Incidentally, for inter-component interfaces that deal in strings, the component model specifies that abstractly, [strings are sequences of unicode scalar values](https://github.com/WebAssembly/interface-types/issues/135). This implies that some JavaScript strings can't traverse a component boundary, because of the potential for isolated surrogates, and also implies an eager check that a `stringref` is a valid USV sequence, for an interface-typed call. In practice this is not a problem because the `stringref` contents are being copied anyway and so can be validated at the same time. Interface types are used to specify a WebAssembly function's signature in an abstract way. This signature should then be compiled down to a concrete adapter function specialized to the data representations used by the caller and the callee. The instruction set in this proposal can be used to implement the adapter function for passing a `stringref` as a string; assuming that the adapter function is generated in such a way that it has access to the target memory, `string.encode_wtf8` can implement the copy and validation at the same time. `string.new_wtf8` would be the implementation of getting a `stringref` from an interface-typed string value, again assuming UTF-8 encoding for these values. Of course, because a `stringref` is immutable, whether it is copied or not on a component boundary or during a call to an interface-typed function is an implementation detail. Some implementations of the component model may wish to copy in all cases, for memory usage accounting reasons. Others will apply a zero-copy strategy when possible, for example when both the caller and the callee of an interface are implemented with `stringref`. In the zero-copy case, however, hosts have to eagerly verify that the string is a valid USV sequence. For this they would use `string.is_usv_sequence`.