# String Operations The default character set for all strings is [UTF-8](https://en.wikipedia.org/wiki/UTF-8). SQL defines two primary character types: `character varying(n)` and `character(n)`, where n is a positive integer. Both of these types can store strings up to n characters (not bytes) in length. An attempt to store a longer string into a column of these types will result in an error, unless the excess characters are all spaces, in which case the string will be truncated to the maximum length. (This somewhat bizarre exception is required by the SQL standard.) If the string to be stored is shorter than the declared length, values of type character will be space-padded; values of type character varying will simply store the shorter string. In addition, we provide the `text`, or `varchar` type, which stores strings of any length. ## String constants (literals) A string constant in SQL is an arbitrary sequence of characters bounded by single quotes (`'`), for example `'This is a string'`. To include a single-quote character within a string constant, write two adjacent single quotes, e.g., `'Dianne''s horse'`. Note that this is not the same as a double-quote character (`"`). Two string constants that are only separated by whitespace with at least one newline are concatenated and effectively treated as if the string had been written as one constant. For example: ```sql SELECT 'foo' 'bar' ``` is equivalent to: ```sql SELECT 'foobar' ``` but: ```sql SELECT 'foo' 'bar' ``` is not valid syntax. Newlines are supported within string literals: ``` SELECT 'a b' ``` produces a multi-line string literal. ## Unicode characters A string literal prefixed with the `U&` can contain Unicode characters described using their numeric code: `U&'hello\0041'` includes the Unicode character with code U+0041, which is the uppercase letter 'A'. A Unicode character is described by an escape character followed by four hexadecimal digits. The default escape character is backslash, but it can be changed using the following syntax: `U&'hello!0041' UESCAPE '!'`. The `UESCAPE` notation designates the escape character to use in interpreting the current literal, which is the exclamation sign in the previous example. ## Operations on string values

Operation	Description	Examples
`\|\|`	String concatenation (infix). Note that concatenation does not strip trailing spaces from CHAR(N) values, unlike other SQL dialects. If such behavior is desired, an explicit cast to `varchar` can be added.	`'Post' \|\| 'greSQL'` => `PostgreSQL`
`string LIKE pattern [ESCAPE escape-character]` and `string NOT LIKE pattern [ESCAPE escape-character]`	The LIKE expression returns true if `string` matches `pattern`. (As expected, the `NOT LIKE` expression returns false if `LIKE` returns true.)	See [`LIKE`](#like) for details.
`string ILIKE pattern` and `string NOT ILIKE pattern`	The `ILIKE` expression returns true if `string` matches `pattern`, performing a case-insensitive comparison. This means that differences in character case between the string and the pattern are ignored. (Similarly, the `NOT ILIKE` expression returns false if `ILIKE` returns true.)	See [`ILIKE`](#ilike) for details.
`string RLIKE pattern` and `string NOT RLIKE pattern`	The RLIKE expression returns true if `string` matches `pattern`. The pattern is a standard Rust regular expression.	`'string' RLIKE 's..i.*'` => `TRUE`
`ASCII ( string )`	Returns the numeric code of the first character of `string`. In UTF8 encoding, returns the Unicode code point of the character. In other multibyte encodings, the argument must be an ASCII character. Returns 0 if the string is empty.	`ascii('x')` => `120`
`CHAR_LENGTH(string)` or `CHARACTER_LENGTH(string)` or `LENGTH(string)` or `LEN(string)`	Returns number of characters in `string`.	`char_length('josé')` => `4`
`CHR ( integer )`	Returns a string containing the character numbered `integer`. If the code is incorrect (e.g., negative), the result is an empty string.	`chr(65)` => `A`
`CONCAT(`string1, ..., stringN`)`	String concatenation. Can have any number of arguments.	`CONCAT('Post', 'greSQL', 1)` => `PostgreSQL1`
`CONCAT_WS(`sep, string1, ..., stringN`)`	String concatenation with separator `sep`. Can have any number of arguments. `sep` is intercalated between all strings. If `sep` is `NULL` result is `NULL`. Other `NULL` arguments are ignored.	`CONCAT_WS(',', 'Post', 'greSQL', NULL, '1')` => `Post,greSQL,1`
`INITCAP ( string )`	Converts the first letter of each word in `string` to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.	`initcap('hi THOMAS')` => `Hi Thomas`
`INITCAP_SPACES ( string )`	Converts the first letter of each word in `string` to upper case and the rest to lower case. Words are sequences of characters separated by spaces.	`initcap('hi THOMAS-SON')` => `Hi Thomas-son`
`LEFT ( string, count )`	Returns first `count` characters in `string`. If any argument is `NULL`, returns `NULL`.	`left('abcde', 2)` => `ab`
`LOWER ( string )`	Converts `string` to all lower case.	`lower('TOM')` => `tom`
`MD5`(string)	Calculates an MD5 128-bit checksum of `string` and returns it as a hex `VARCHAR` value. If the input is NULL, NULL is returned.	`SELECT md5('Feldera')` => `841afc2f65b5763600818ef42a56d7d1`
`OVERLAY ( string PLACING newsubstring FROM start [ FOR remove ] )`	Replaces the substring of `string` that starts at the `start`'th character and extends for `remove` characters with `newsubstring`. If `count` is omitted, it defaults to the length of `newsubstring`. If `start` is not positive, the original string is unchanged. If `start` is bigger than the length of `string`, the result is the concatenation of the two strings. If `remove` is negative it is considered 0.	`overlay('Txxxxas' placing 'hom' from 2 for 4)` => `Thomas`
`POSITION(substring IN string)`	Returns the first Unicode character index of `substring` within `string`, or zero if it's not present. First character has index 1.	`position('om' in 'Thomas')` => `3`
`REGEXP_REPLACE(expr, pat[, repl])`	Replaces occurrences in the string `expr` that match the regular expression specified by the pattern `pat` with the replacement string `repl`, and returns the resulting string. If any one of `expr`, `pat`, or `repl` is `NULL`, the return value is `NULL`. If `repl` is missing, it is assumed to be the empty string. If the regular expression is invalid, the original string is returned.
`REPEAT ( string, count )`	Repeats `string` the specified number of times. The result is an empty string for a negative or 0 count.	`repeat('Pg', 4)` => `PgPgPgPg`
`REPLACE ( haystack, needle, replacement )`	Replaces all occurrences of `needle` in `haystack` with `replacement`.	`replace('abcdefabcdef', 'cd', 'XX')` => `abXXefabXXef`
`RIGHT ( string, count )`	Returns last `count` characters in the string. If any argument is `NULL`, return `NULL`.	`right('abcde', 2)` => `de`
`RLIKE(string, pattern)`	A function equivalent to the `RLIKE` operator above.	`RLIKE('string', 's..i.*')` => `TRUE`
`SPLIT(string [, delimiter])`	Produce an array of strings, by splitting `string` at each occurrence of `delimiter`. If `delimiter` is empty, return an array containing just `string`. If `string` is empty, return an empty array. If either argument is NULL, return NULL. If `delimiter` is absent assume it is the string `','`.	`SPLIT('a\|b\|c\|', '\|')` => `['a', 'b', 'c', '']`
`SPLIT_PART(string, delimiter, n)`	This function uses 1-based indexing. It extracts the `n`'th part of `string` by splitting it at each occurrence of `delimiter`. `n = 1` refers to the first part of `string` after splitting. `n = 2` refers to the second part, and so on. If `n` is negative, it returns the `abs(n)`'th part from the end of `string`. If `n` is out of bounds, it returns an empty string.	`SPLIT_PART('a\|b\|c\|', '\|', 2)` => `b` `SPLIT_PART('a\|b\|c\|', '\|', -2)` => `c` `SPLIT_PART('a\|b\|c\|', '\|', 5)` => `''`
`SUBSTR (` string, start, `[ ,` length `]`	Extracts the substring of `string` starting at the `start`'th character, and stopping after `length` characters if the value is specified. If `start` is negative, the first character is chosen counting backwards from the end of `string`. If `count` is negative the empty string is returned. The index of the first character is 1.	`SUBSTR('Thomas', 2, 3)` => `hom` `SUBSTR('Thomas', 3)` => `omas`
`SUBSTRING (` string `FROM` start `[ FOR` count `] )`	Extracts the substring of `string` starting at the `start`'th character, and stopping after `count` characters if the value is specified. If `start` is negative, only `max(count + start - 1, 0)` characters are returned. If `count` is negative the empty string is returned. The index of the first character is 1.	`SUBSTRING('Thomas' from 2 for 3)` => `hom` `SUBSTRING('Thomas' from 3)` => `omas`
`TRIM ( [ LEADING \| TRAILING \| BOTH ]` characters `FROM` string `)`	Remove `characters` from the specified ends of `string`	`TRIM(both 'xyz' from 'yxTomxx')` => `Tom` `TRIM(leading 'xyz' from 'yxTomxx')` => `Tomxx`
`UPPER ( string )`	Converts `string` to all upper case.	`upper('tom')` => `TOM`

## `LIKE` string `LIKE` pattern [`ESCAPE` escape-character] string `NOT LIKE` pattern [`ESCAPE` escape-character] If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case `LIKE` acts like the equals operator. An underscore (`_`) in pattern stands for (matches) any single character; a percent sign (`%`) matches any sequence of zero or more characters. Some examples: ```sql 'abc' LIKE 'abc' true 'abc' LIKE 'a%' true 'abc' LIKE '_b_' true 'abc' LIKE 'c' false ``` `LIKE` pattern matching always covers the entire string. Therefore, if it's desired to match a sequence anywhere within a string, the pattern must start and end with a percent sign. To match a literal underscore or percent sign without matching other characters, the respective character in pattern must be preceded by the escape character. The default escape character is the backslash but a different one can be selected by using the ESCAPE clause. To match the escape character itself, write two escape characters. The escape character cannot be one of the special pattern characters `_` or `%`. Some examples where the escape character is changed to `#`: ```sql SELECT 'hawkeye' LIKE 'h%' ESCAPE '#' true SELECT 'hawkeye' NOT LIKE 'h%' ESCAPE '#' false SELECT 'h%' LIKE 'h#%' ESCAPE '#' true SELECT 'h%' NOT LIKE 'h#%' ESCAPE '#' false SELECT 'h%wkeye' LIKE 'h#%' ESCAPE '#' false SELECT 'h%wkeye' NOT LIKE 'h#%' ESCAPE '#' true SELECT 'h%wkeye' LIKE 'h#%%' ESCAPE '#' true SELECT 'h%wkeye' NOT LIKE 'h#%%' ESCAPE '#' false SELECT 'h%awkeye' LIKE 'h#%a%k%e' ESCAPE '#' true ``` When either argument or `LIKE`, `NOT LIKE` is `NULL`, the result is `NULL`. ## `ILIKE` string `ILIKE` pattern string `NOT ILIKE` pattern The `ILIKE` expression performs a case-insensitive pattern match. If pattern does not contain percent signs or underscores, then the pattern represents the string itself, and `ILIKE` acts like the equals operator, ignoring character case. An underscore (`_`) in pattern matches any single character, while a percent sign (`%`) matches any sequence of zero or more characters. Some examples: ```sql SELECT 'hawkeye' ILIKE 'h%' true SELECT 'hawkeye' NOT ILIKE 'h%' false SELECT 'hawkeye' ILIKE 'H%' true SELECT 'hawkeye' NOT ILIKE 'H%' false SELECT 'hawkeye' ILIKE 'H%Eye' true SELECT 'hawkeye' NOT ILIKE 'H%Eye' false SELECT 'Hawkeye' ILIKE 'h%' true SELECT 'Hawkeye' NOT ILIKE 'h%' false SELECT 'ABC' ILIKE '_b_' true SELECT 'ABC' NOT ILIKE '_b_' false ``` When either argument or `ILIKE`, `NOT ILIKE` is `NULL`, the result is `NULL`. ## Regular expressions Regular expressions are matched using the `RLIKE` function. If either argument of `RLIKE` is `NULL`, the result is also `NULL`. The implementation is based on the Rust [`Regex`](https://docs.rs/regex/latest/regex/) library. The full syntax supported is described in the [Rust documentation](https://docs.rs/regex/latest/regex/#syntax). The description below uses fragments from the [Postgres documentation](https://www.postgresql.org/docs/15/functions-matching.html#FUNCTIONS-POSIX-REGEXP), where credit is given to Henry Spencer. Currently, our compiler does *not* support `SIMILAR TO` regular expressions. A regular expression is a character sequence that is an abbreviated definition of a set of strings (a regular set). A string is said to match a regular expression if it is a member of the regular set described by the regular expression. As with `LIKE`, pattern characters match string characters exactly unless they are special characters in the regular expression language — but regular expressions use different special characters than `LIKE` does. Unlike `LIKE` patterns, a regular expression is allowed to match anywhere within a string, unless the regular expression is explicitly anchored to the beginning or end of the string. A *regular expression* is defined as one or more *branches*, separated by `|`. It matches anything that matches one of the branches. A *branch* is zero or more *quantified atoms* or *constraints*, concatenated. It matches a match for the first, followed by a match for the second, etc.; an empty branch matches the empty string. A *quantified atom* is an *atom* possibly followed by a single *quantifier*. Without a quantifier, it matches a match for the atom. With a quantifier, it can match some number of matches of the atom. An atom can be any of the possibilities shown in the Table below.

Atom	Description
`(re)`	where `re` is any regular expression: matches a match for `re`, with the match noted for possible reporting
`(?:re)`	as above, but the match is not noted for reporting (a “non-capturing” set of parentheses)
`.`	matches any single character
`[chars]`	a bracket expression, matching any one of the chars (see below for more details)
`\k`	where `k` is a non-alphanumeric character): matches that character taken as an ordinary character, e.g., `\\` matches a backslash character
`\c`	where `c` is alphanumeric (possibly followed by other characters): is an escape, see below
`{`	when followed by a character other than a digit, matches the left-brace character `{`; when followed by a digit, it is the beginning of a bound (see below)
x	where `x` is a single character with no other significance, matches that character

The possible quantifiers and their meanings are shown the Table below. | Quantifier | Matches | |--------------|------------------------------------------------------------------------------| | `*` | a sequence of 0 or more matches of the atom | | `+` | a sequence of 1 or more matches of the atom | | `?` | a sequence of 0 or 1 matches of the atom | | `{`m`}` | a sequence of exactly m matches of the atom | | `{`m`,}` | a sequence of m or more matches of the atom | | `{`m`,`n`}` | a sequence of m through n (inclusive) matches of the atom; m cannot exceed n | | `*?` | non-greedy version of * | | `+?` | non-greedy version of + | | `??` | non-greedy version of ? | | `{`m`}?` | non-greedy version of `{m}` | | `{`m`,}?` | non-greedy version of `{m,}` | | `{`m`,`n`}?` | non-greedy version of `{m,n}` | A constraint matches an empty string, but matches only when specific conditions are met. A constraint can be used where an atom could be used, except it cannot be followed by a quantifier. The simple constraints are shown in the Table below; some more constraints are described later. | Constraint | Description | |-------------|----------------------------------------| | `^` | matches at the beginning of the string | | `$` | matches at the end of the string | ### Bracket Expressions A bracket expression is a list of characters enclosed in `[]`. It normally matches any single character from the list (but see below). If the list begins with `^`, it matches any single character not from the rest of the list. If two characters in the list are separated by `-`, this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g., `[0-9]` in ASCII matches any decimal digit. It is illegal for two ranges to share an endpoint, e.g., `a-c-e`. Ranges are very collating-sequence-dependent, so portable programs should avoid relying on them. To include a literal `]` in the list, make it the first character (after `^`, if that is used). To include a literal `-`, make it the first or last character, or the second endpoint of a range. To use a literal `-` as the first endpoint of a range, enclose it in `[.` and `.]`. With the exception of these characters, some combinations using `[`, all other special characters lose their special significance within a bracket expression. In particular, `\` is not special. Within a bracket expression, the name of a character class enclosed in `[:` and `:]` stands for the list of all characters belonging to that class. A character class cannot be used as an endpoint of a range. The POSIX standard defines these character class names: | Class | Description | |----------|---------------------------------------| | `alnum` | letters and numeric digits | | `alpha` | letters | | `blank` | space and tab | | `cntrl` | control characters | | `digit` | numeric digits | | `graph` | printable characters except space | | `lower` | lower-case letters | | `print` | printable characters including space | | `punct` | punctuation | | `space` | any white space | | `upper` | upper-case letters | | `xdigit` | hexadecimal digits | Class-shorthand escapes provide shorthands for certain commonly-used character classes. They are shown in the table below. | Escape | Description | |---------|---------------------------------------------------------| | `\d` | matches any digit, like [[:digit:]] | | `\s` | matches any whitespace character, like [[:space:]] | | `\w` | matches any word character, like [[:word:]] | | `\D` | matches any non-digit, like [^[:digit:]] | | `\S` | matches any non-whitespace character, like [^[:space:]] | | `\W` | matches any non-word character, like [^[:word:]] | The behavior of these standard character classes is generally consistent across platforms for characters in the 7-bit ASCII set. Whether a given non-ASCII character is considered to belong to one of these classes depends on the collation that is used for the regular-expression function or operator. ### Regular Expression Escapes Escapes are special sequences beginning with `\` followed by an alphanumeric character. Escapes come in several varieties: character entry, class shorthands, and constraint escapes. A `\` followed by an alphanumeric character but not constituting a valid escape is illegal. Character-entry escapes exist to make it easier to specify non-printing and other inconvenient characters in REs. They are shown in the Table below. | Escape | Description | |----------|-------------------------------------------------------------------------------------------------| | `\n` | newline, as in C | | `\r` | carriage return, as in C | | `\t` | horizontal tab, as in C | | `\u`wxyz | (where wxyz is exactly four hexadecimal digits) the character whose hexadecimal value is 0xwxyz | | `\x`hh | (where hh is a pair of hexadecimal digits) the character whose hexadecimal value is 0xhh | | `\0` | the character whose value is 0 (the null byte) | A constraint escape is a constraint, matching the empty string if specific conditions are met, written as an escape. They are shown in the Table below. | Escape | Description | |---------|---------------------------------------------| | `\A` | matches only at the beginning of the string | | `\b` | matches word boundaries | | `\B` | not a word boundary | ### Capture groups A common way to use regexes is with _capture groups_. That is, instead of just looking for matches of an entire regex, parentheses are used to create groups that represent part of the match. For example, consider a string with multiple lines, and each line has three whitespace delimited fields where the second field is expected to be a number and the third field a boolean. This can be expressed with the following regular expression, where the capture groups have been labeled `$0` to `$4`. ``` (?m)^\s*(\S+)\s+([0-9]+)\s+(true|false)\s*$ ^^^^ ^^^^^ ^^^^^^^^ ^^^^^^^^^^^^ $1 $2 $3 $4 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ $0 ``` Capture group 0 always corresponds to an implicit unnamed group that includes the entire match. If a match is found, this group is always present. Subsequent groups may be named and are numbered, starting at 1, by the order in which the opening parenthesis appears in the pattern. For example, in the pattern `(?.(?.))(?.)`, `a`, `b` and `c` correspond to capture groups `$1`, `$2` and `$3`, respectively. ### Regular expression functions ```sql REGEXP_REPLACE(expr, pat[, repl]) ``` If `repl` is missing, it is assumed to be the empty string. Replaces occurrences in the string `expr` that match the regular expression specified by the pattern `pat` with the replacement string `repl`, and returns the resulting string. If any of `expr`, `pat`, or `repl` is `NULL`, the return value is `NULL`. #### Replacement string syntax All instances of `$N` in the replacement string are replaced with the substring corresponding to the capture group identified by `N`. `N` may be an integer corresponding to the index of the capture group (counted by order of opening parenthesis where 0 is the entire match) or it can be a name (consisting of letters, digits or underscores) corresponding to a named capture group. If `N` isn’t a valid capture group (whether the name doesn’t exist or isn’t a valid index), then it is replaced with the empty string. The longest possible name is used. For example, `$1a` looks up the capture group named `1a` and not the capture group at index `1`. To exert more precise control over the name, use braces, e.g., `${1}a`. To write a literal `$` use `$$`. Examples: ``` select regexp_replace('1078910', '[^01]'); 1010 select regexp_replace('deep fried', '(?\w+)\s+(?\w+)', '${first}_$second'); deep_fried select regexp_replace('Springsteen, Bruce', '([^,\s]+),\s+(\S+)', '$2 $1'); Bruce Springsteen select regexp_replace('Springsteen, Bruce', '(?[^,\s]+),\s+(?\S+)', '$first $last'); Bruce Springsteen ``` Note that using `$2` instead of `$first` or `$1` instead of `$last` would produce the same result.