# String Operations The default character set for all strings is [UTF-8](https://en.wikipedia.org/wiki/UTF-8). SQL defines two primary character types: `character varying(n)` and `character(n)`, where n is a positive integer. Both of these types can store strings up to n characters (not bytes) in length. An attempt to store a longer string into a column of these types will result in an error, unless the excess characters are all spaces, in which case the string will be truncated to the maximum length. (This somewhat bizarre exception is required by the SQL standard.) If the string to be stored is shorter than the declared length, values of type character will be space-padded; values of type character varying will simply store the shorter string. In addition, we provide the `text`, or `varchar` type, which stores strings of any length. ## String constants (literals) A string constant in SQL is an arbitrary sequence of characters bounded by single quotes (`'`), for example `'This is a string'`. To include a single-quote character within a string constant, write two adjacent single quotes, e.g., `'Dianne''s horse'`. Note that this is not the same as a double-quote character (`"`). Two string constants that are only separated by whitespace with at least one newline are concatenated and effectively treated as if the string had been written as one constant. For example: ```sql SELECT 'foo' 'bar' ``` is equivalent to: ```sql SELECT 'foobar' ``` but: ```sql SELECT 'foo' 'bar' ``` is not valid syntax. Newlines are supported within string literals: ``` SELECT 'a b' ``` produces a multi-line string literal. ## Unicode characters A string literal prefixed with the `U&` can contain Unicode characters described using their numeric code: `U&'hello\0041'` includes the Unicode character with code U+0041, which is the uppercase letter 'A'. A Unicode character is described by an escape character followed by four hexadecimal digits. The default escape character is backslash, but it can be changed using the following syntax: `U&'hello!0041' UESCAPE '!'`. The `UESCAPE` notation designates the escape character to use in interpreting the current literal, which is the exclamation sign in the previous example. ## Operations on string values
pattern does not contain percent signs or underscores, then the
pattern only represents the string itself; in that case `LIKE` acts
like the equals operator. An underscore (`_`) in pattern stands for
(matches) any single character; a percent sign (`%`) matches any
sequence of zero or more characters.
Some examples:
```sql
'abc' LIKE 'abc' true
'abc' LIKE 'a%' true
'abc' LIKE '_b_' true
'abc' LIKE 'c' false
```
`LIKE` pattern matching always covers the entire string. Therefore, if
it's desired to match a sequence anywhere within a string, the pattern
must start and end with a percent sign.
To match a literal underscore or percent sign without matching other
characters, the respective character in pattern must be preceded by
the escape character. The default escape character is the backslash
but a different one can be selected by using the ESCAPE clause. To
match the escape character itself, write two escape characters. The
escape character cannot be one of the special pattern characters `_`
or `%`.
Some examples where the escape character is changed to `#`:
```sql
SELECT 'hawkeye' LIKE 'h%' ESCAPE '#' true
SELECT 'hawkeye' NOT LIKE 'h%' ESCAPE '#' false
SELECT 'h%' LIKE 'h#%' ESCAPE '#' true
SELECT 'h%' NOT LIKE 'h#%' ESCAPE '#' false
SELECT 'h%wkeye' LIKE 'h#%' ESCAPE '#' false
SELECT 'h%wkeye' NOT LIKE 'h#%' ESCAPE '#' true
SELECT 'h%wkeye' LIKE 'h#%%' ESCAPE '#' true
SELECT 'h%wkeye' NOT LIKE 'h#%%' ESCAPE '#' false
SELECT 'h%awkeye' LIKE 'h#%a%k%e' ESCAPE '#' true
```
When either argument or `LIKE`, `NOT LIKE` is `NULL`, the result is `NULL`.
## `ILIKE`
string `ILIKE` pattern
string `NOT ILIKE` pattern
The `ILIKE` expression performs a case-insensitive pattern match. If
pattern does not contain percent signs or underscores,
then the pattern represents the string itself, and `ILIKE` acts like
the equals operator, ignoring character case. An underscore (`_`) in
pattern matches any single character, while a percent
sign (`%`) matches any sequence of zero or more characters.
Some examples:
```sql
SELECT 'hawkeye' ILIKE 'h%' true
SELECT 'hawkeye' NOT ILIKE 'h%' false
SELECT 'hawkeye' ILIKE 'H%' true
SELECT 'hawkeye' NOT ILIKE 'H%' false
SELECT 'hawkeye' ILIKE 'H%Eye' true
SELECT 'hawkeye' NOT ILIKE 'H%Eye' false
SELECT 'Hawkeye' ILIKE 'h%' true
SELECT 'Hawkeye' NOT ILIKE 'h%' false
SELECT 'ABC' ILIKE '_b_' true
SELECT 'ABC' NOT ILIKE '_b_' false
```
When either argument or `ILIKE`, `NOT ILIKE` is `NULL`, the result is `NULL`.
## Regular expressions
Regular expressions are matched using the `RLIKE` function. If either
argument of `RLIKE` is `NULL`, the result is also `NULL`. The
implementation is based on the Rust
[`Regex`](https://docs.rs/regex/latest/regex/) library. The full
syntax supported is described in the [Rust
documentation](https://docs.rs/regex/latest/regex/#syntax).
The description below uses fragments from the [Postgres
documentation](https://www.postgresql.org/docs/15/functions-matching.html#FUNCTIONS-POSIX-REGEXP),
where credit is given to Henry Spencer.
Currently, our compiler does *not* support `SIMILAR TO` regular
expressions.
A regular expression is a character sequence that is an abbreviated
definition of a set of strings (a regular set). A string is said to
match a regular expression if it is a member of the regular set
described by the regular expression. As with `LIKE`, pattern
characters match string characters exactly unless they are special
characters in the regular expression language — but regular
expressions use different special characters than `LIKE` does. Unlike
`LIKE` patterns, a regular expression is allowed to match anywhere
within a string, unless the regular expression is explicitly anchored
to the beginning or end of the string.
A *regular expression* is defined as one or more *branches*, separated
by `|`. It matches anything that matches one of the branches.
A *branch* is zero or more *quantified atoms* or *constraints*,
concatenated. It matches a match for the first, followed by a match
for the second, etc.; an empty branch matches the empty string.
A *quantified atom* is an *atom* possibly followed by a single
*quantifier*. Without a quantifier, it matches a match for the
atom. With a quantifier, it can match some number of matches of the
atom. An atom can be any of the possibilities shown in the Table
below.
| Atom | Description |
|---|---|
(re) |
where re is any regular expression: matches a match for re, with the match noted for possible reporting |
(?:re) |
as above, but the match is not noted for reporting (a “non-capturing” set of parentheses) |
. |
matches any single character |
[chars] |
a bracket expression, matching any one of the chars (see below for more details) |
\k |
where k is a non-alphanumeric character): matches that character taken as an ordinary character, e.g.,
\\ matches a backslash character |
\c |
where c is alphanumeric (possibly followed by other characters): is an escape, see below |
{ |
when followed by a character other than a digit, matches the left-brace character {;
when followed by a digit, it is the beginning of a bound (see below) |
| x | where x is a single character with no other significance, matches that character |