Implement Data.String.CodePoints, repair the test suite#1
Conversation
The repo never had the spago-test.dhall that scripts/test refers to,
and the package set pins (psc-0.15.15-2024*) predate prelude v7.2.0
(unit = {}) and the assert override. Bump the sets, define the test
build on top of spago.dhall, and drop the Regex test wiring: the fork
has no Data.String.Regex implementation (tracked separately).
The flake inputs were two years stale as well: the pinned pslua
predates several codegen fixes; nixpkgs/easyps now match the pins of
the main pslua repo (purs-tidy is dropped from the shell - it does not
evaluate against the current nixpkgs and nothing in this repo uses it).
The prelude package is temporarily overridden with a local clone until
the Lua 5.1 compat fixes land in a released package set.
A single do-block with a hundred statements compiles to Lua nested
beyond the parser limit of stock Lua 5.1 interpreters ("chunk has too
many syntax levels", purescript-lua/purescript-lua#46). One function per log
section keeps the nesting flat. Assertions are unchanged except two
byte-semantics adaptations in NonEmpty.CodeUnits: charAt over "5 €"
yields the first UTF-8 byte of the euro sign, and the toCharArray case
uses an ASCII string because a multi-byte Char literal is not
expressible when Char is a byte.
Closes purescript-lua/purescript-lua#36. In pslua a String is a Lua byte string holding UTF-8, so all seven foreign functions decode/encode UTF-8 directly with plain Lua 5.1 arithmetic (no utf8 library, no bitops). The PureScript fallback arguments are UTF-16 surrogate-pair logic and are ignored: under the byte representation they would produce garbage. The pure uncons carried the same surrogate logic (charAt 0/1 yields bytes here, never in the surrogate range, so it silently returned the first byte as a code point); its tail now starts after the UTF-8 width of the head. The test suite replaces the upstream lone-surrogate string (not representable in UTF-8) with well-formed Unicode of widths 1-4 and keeps its structure; lone-surrogate-specific probes are dropped and toCodePointArray/fromCodePointArray roundtrips are added.
CodeUnits returned 1-based Lua positions where PureScript expects 0-based indices (charAt answered Nothing for index 0, indexOf was off by one, lastIndexOf also ignored the pattern width), and take/slice/ splitAt passed negative indices straight to string.sub, which counts from the end instead of clamping. Index handling now mirrors the upstream JS (indexOf guard, lastIndexOf clamp, slice wrapping). Common.replace/replaceAll/split treated Pattern as a Lua pattern and even applied gsub to the Replacement; PureScript semantics are literal matching, so the rewrite uses string.find in plain mode. replaceAll also no longer leaks gsub's second return value, and split keeps empty fields and multi-character separators.
| -- Decodes the code point starting at byte position i. | ||
| -- Returns the code point and the position of the next one. | ||
| -- An invalid leading byte is returned as-is (one byte consumed). | ||
| local function decode(s, i) |
There was a problem hiding this comment.
This is quite an implementation! It needs to be thoroughly covered by unit tests.
There was a problem hiding this comment.
Added in 9cd49b8: a encode/decode boundaries test round-trips a code point at every UTF-8 width edge (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000, 0x10FFFF) through fromCodePointArray→toCodePointArray and checks the byte widths via singleton, plus the malformed input test for the decoder's error path.
There was a problem hiding this comment.
Pull request overview
This PR ports Data.String.CodePoints to the pslua string model (Lua UTF-8 byte strings), fixes several Data.String.CodeUnits / Data.String.Common semantic mismatches (0-based indexing and literal pattern handling), and repairs/enables the test suite for this fork.
Changes:
- Implement UTF-8 encode/decode based
Data.String.CodePointsFFI (Lua 5.1 compatible, noutf8lib / bitops). - Fix
CodeUnitsindexing andCommonstring operations to treatPatternliterally (plainstring.find) and align behavior with upstream JS semantics. - Repair and restructure tests (remove Regex suite, split large do-blocks), add
spago-test.dhall, and update Nix/package-set pins.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Test/Main.purs | Removes Regex suite wiring from the test runner. |
| test/Test/Data/String/Regex.purs | Deletes Regex tests (module not provided in this fork). |
| test/Test/Data/String/NonEmpty/CodeUnits.purs | Splits large test block into per-section functions; adjusts expectations for byte/Char semantics. |
| test/Test/Data/String/NonEmpty.purs | Splits NonEmpty string tests into per-section functions. |
| test/Test/Data/String/CodeUnits.purs | Splits CodeUnits tests into per-section functions. |
| test/Test/Data/String/CodePoints.purs | Adapts upstream CodePoints tests to well-formed UTF-8 and adds array roundtrip coverage. |
| src/Data/String/Common.lua | Makes replace/replaceAll/split literal (no Lua pattern interpretation) and closer to JS semantics. |
| src/Data/String/CodeUnits.lua | Fixes 0-based indexing and (mostly) JS-like clamping/search behavior in Lua FFI. |
| src/Data/String/CodePoints.purs | Adjusts uncons to advance by UTF-8 width in pslua. |
| src/Data/String/CodePoints.lua | Replaces stub with a UTF-8 decoder/encoder implementation for CodePoints operations. |
| spago-test.dhall | Adds a Spago test configuration including test sources and pslua backend invocation. |
| packages.dhall | Updates package set pins and introduces a temporary prelude override. |
| flake.nix | Removes purs-tidy from devShell inputs. |
| flake.lock | Updates flake inputs/pins. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| -- In pslua strings are UTF-8 byte strings (a code unit is a byte), so the | ||
| -- tail starts after the UTF-8 encoding of the head code point. | ||
| uncons :: String -> Maybe { head :: CodePoint, tail :: String } | ||
| uncons s = case CU.length s of | ||
| 0 -> Nothing | ||
| 1 -> Just { head: CodePoint (fromEnum (Unsafe.charAt 0 s)), tail: "" } | ||
| _ -> | ||
| let | ||
| cu0 = fromEnum (Unsafe.charAt 0 s) | ||
| cu1 = fromEnum (Unsafe.charAt 1 s) | ||
| cp = unsafeCodePointAt0 s | ||
| in | ||
| if isLead cu0 && isTrail cu1 | ||
| then Just { head: unsurrogate cu0 cu1, tail: CU.drop 2 s } | ||
| else Just { head: CodePoint cu0, tail: CU.drop 1 s } | ||
| Just { head: cp, tail: CU.drop (utf8Width cp) s } |
There was a problem hiding this comment.
Fixed in 9cd49b8. You're right — the width was recomputed from the decoded value, but the decoder consumes a single byte for a malformed lead. uncons now takes the tail as drop 1, which is defined through the same native decoder that produces the head, so the two can no longer disagree (utf8Width is gone). Added a malformed input test: a lone continuation byte between ASCII anchors, and truncated two-/three-byte leads.
| if startAt < 0 or startAt > #s then return nothing end | ||
| local i = s:find(x, startAt + 1, true) |
There was a problem hiding this comment.
This actually matches upstream Data.String.CodeUnits, which is deliberate. The JS FFI is if (startAt < 0 || startAt > s.length) return nothing; before indexOf, and the upstream test suite pins it: indexOf' (Pattern "") (-1) "ab" == Nothing and indexOf' (Pattern "") 3 "ab" == Nothing (both in Test.Data.String.CodeUnits). Clamping would diverge from upstream and fail those. Kept as-is.
| -- Temporary override until purescript-lua-prelude v7.2.1 is released | ||
| -- and picked up by a package set release (Lua 5.1 compat fixes). | ||
| let overrides = { prelude = ../purescript-lua-prelude/spago.dhall as Location } | ||
|
|
||
| in upstream-ps // upstream-lua // overrides |
There was a problem hiding this comment.
Agreed — the relative path is a deliberate temporary stopgap, not the final state (called out in the "chore: repair the test build" commit). It is part of the release chain: prelude needs Lua 5.1 FFI fixes (purescript-lua/purescript-lua-prelude#5, merged) that are not in any released package set yet. Once those ship in a set release, this override is dropped and packages.dhall points at the released set — that commit closes the loop before merge. A fresh clone can't build until then, which is why the PR stays in draft.
| end | ||
|
|
||
| -- Encodes a code point as a UTF-8 byte string. | ||
| local function encode(cp) |
There was a problem hiding this comment.
Same here, make sure the unit-test coverage is decent.
There was a problem hiding this comment.
Covered by the new encode/decode boundaries and malformed input tests (9cd49b8).
| end | ||
| end | ||
| end), | ||
| _codePointAt = (function(_) |
There was a problem hiding this comment.
I see quite some nesting here and wonder how much of a hassle is it to use uncurried FFI functions?
There was a problem hiding this comment.
Assessment: declaring the foreign imports with FnN (and runFnN at the call sites) does flatten the .lua — e.g. _codePointAt = function(just, nothing, n, s) ... end instead of five nested closures — and lets me drop the vestigial fallback/at0 arguments the Lua ignores. The cost is that the .purs foreign signatures then diverge from upstream Data.String.CodePoints, where the curried fallback args exist precisely so one .purs is shared across the JS and Lua FFI; that makes future rebases onto upstream noisier. The nesting is confined to internal _-prefixed bindings, so it's invisible to users. I left it curried to keep the .purs diff to upstream minimal, but I'm happy to switch to Fn/runFn if you'd rather have the flatter FFI — your call on that trade-off.
| if i >= 1 and i <= #s then | ||
| return just(s:sub(i, i)) | ||
| if i >= 0 and i < #s then | ||
| return just(s:sub(i + 1, i + 1)) |
There was a problem hiding this comment.
Subtle, needs to be covered by tests
There was a problem hiding this comment.
Covered: the 0-based indexing and clamping are exercised by the split-out charAt/indexOf/indexOf'/lastIndexOf'/take/slice/splitAt sections, which are the upstream cases now actually running against this FFI.
Review follow-up on #1. uncons recomputed the tail offset from the head code point value (utf8Width), but the decoder consumes a single byte for a malformed leading byte (a lone continuation or a truncated multi-byte lead). For those inputs the width disagreed with what was actually read and uncons skipped extra bytes, desynchronising the rest of the string. The tail is now `drop 1`, which is defined via the same native decoder that produces the head, so the two can never disagree; utf8Width is gone. Add direct coverage of the encoder/decoder: round-trips at every UTF-8 width boundary (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000, 0x10FFFF) and malformed input (lone continuation byte between ASCII anchors, and truncated two- and three-byte leads), where the raw bytes are built with CodeUnits.singleton since a string literal would be re-encoded as valid UTF-8.
Switch packages.dhall from the local prelude override to the released package set (psc-0.15.15-20260613, which carries prelude v7.2.1), so a fresh clone and CI can build without a sibling checkout. Bump the flake pin of pslua to the merged Char-literal-escaping fix. Replace the inherited bower CI (no bower.json in this fork) with one that runs scripts/test and luacheck in the nix dev shell, and make scripts/test exit non-zero on failure so red tests fail CI. uncons now reaches take (via drop), closing a recursive group with the point-free `take = _take takeFallback`; eta-expand take so every member of the group is a function and purs accepts it.
Implement Data.String.CodePoints, repair the test suite
Closes purescript-lua/purescript-lua#36.
CodePoints. The FFI file was a stub: 2 functions out of 7, both throwing
Not implemented, with the wrong arity. In pslua aStringis a Lua byte string holding UTF-8, so all seven foreign functions now decode/encode UTF-8 directly in plain Lua 5.1 arithmetic (noutf8library, no bitops,string.charfed byte by byte). The PureScript fallback arguments are UTF-16 surrogate-pair logic and are deliberately ignored — under the byte representation they produce garbage. The pureunconscarried the same surrogate logic (charAt 0/1yields bytes here, which are never in the surrogate range, so it silently returned the first byte as the head code point); its tail now starts after the UTF-8 width of the head.CodeUnits and Common were broken too. Once the suite ran, it showed
CodeUnitsreturning 1-based Lua positions where 0-based indices are expected (charAt 0wasNothing,indexOfoff by one,lastIndexOfalso ignored the pattern width), negative indices passed straight tostring.sub(which counts from the end instead of clamping), andCommon.replace/replaceAll/splittreatingPatternas a Lua pattern —replaceAll (Pattern "[b]") ...replacedb. Index handling now mirrors the upstream JS, and matching is literal (string.findplain mode).Test suite. It had never run against this fork:
scripts/testreferenced aspago-test.dhallthat did not exist, the package set pins predated prelude v7.2.0 and theassertoverride, andTest.Mainimported Regex tests with noData.String.Regexin src (dropped here, tracked separately). Each suite is also split into per-section functions: one hundred-statement do block compiles to Lua nested beyond the parser limit of stock Lua 5.1 (purescript-lua/purescript-lua#46). The CodePoints suite keeps the upstream structure but swaps the lone-surrogate test string (not representable in UTF-8) for well-formed Unicode of widths 1–4, and gainstoCodePointArray/fromCodePointArrayroundtrips. Two byte-semantics adaptations inNonEmpty.CodeUnitsare commented inline.Merge order. The suite needs Unisay/purescript-lua PR with the Char-literal escaping fix (a
'\n'Char literal otherwise produces an unloadable chunk) and the preludefix/lua51-compatPR.packages.dhalltemporarily overridespreludewith a local clone; swapping it back to a released package set is a follow-up commit once a set with prelude v7.2.1 is out. With all three in place,scripts/testis fully green.