Implement Data.String.CodePoints, repair the test suite by Unisay · Pull Request #1 · purescript-lua/purescript-lua-strings

Unisay · 2026-06-13T07:11:58Z

Implement Data.String.CodePoints, repair the test suite

Closes purescript-lua/purescript-lua#36.

CodePoints. The FFI file was a stub: 2 functions out of 7, both throwing Not implemented, with the wrong arity. In pslua a String is a Lua byte string holding UTF-8, so all seven foreign functions now decode/encode UTF-8 directly in plain Lua 5.1 arithmetic (no utf8 library, no bitops, string.char fed byte by byte). The PureScript fallback arguments are UTF-16 surrogate-pair logic and are deliberately ignored — under the byte representation they produce garbage. The pure uncons carried the same surrogate logic (charAt 0/1 yields bytes here, which are never in the surrogate range, so it silently returned the first byte as the head code point); its tail now starts after the UTF-8 width of the head.

CodeUnits and Common were broken too. Once the suite ran, it showed CodeUnits returning 1-based Lua positions where 0-based indices are expected (charAt 0 was Nothing, indexOf off by one, lastIndexOf also ignored the pattern width), negative indices passed straight to string.sub (which counts from the end instead of clamping), and Common.replace/replaceAll/split treating Pattern as a Lua pattern — replaceAll (Pattern "[b]") ... replaced b. Index handling now mirrors the upstream JS, and matching is literal (string.find plain mode).

Test suite. It had never run against this fork: scripts/test referenced a spago-test.dhall that did not exist, the package set pins predated prelude v7.2.0 and the assert override, and Test.Main imported Regex tests with no Data.String.Regex in src (dropped here, tracked separately). Each suite is also split into per-section functions: one hundred-statement do block compiles to Lua nested beyond the parser limit of stock Lua 5.1 (purescript-lua/purescript-lua#46). The CodePoints suite keeps the upstream structure but swaps the lone-surrogate test string (not representable in UTF-8) for well-formed Unicode of widths 1–4, and gains toCodePointArray/fromCodePointArray roundtrips. Two byte-semantics adaptations in NonEmpty.CodeUnits are commented inline.

Merge order. The suite needs Unisay/purescript-lua PR with the Char-literal escaping fix (a '\n' Char literal otherwise produces an unloadable chunk) and the prelude fix/lua51-compat PR. packages.dhall temporarily overrides prelude with a local clone; swapping it back to a released package set is a follow-up commit once a set with prelude v7.2.1 is out. With all three in place, scripts/test is fully green.

The repo never had the spago-test.dhall that scripts/test refers to, and the package set pins (psc-0.15.15-2024*) predate prelude v7.2.0 (unit = {}) and the assert override. Bump the sets, define the test build on top of spago.dhall, and drop the Regex test wiring: the fork has no Data.String.Regex implementation (tracked separately). The flake inputs were two years stale as well: the pinned pslua predates several codegen fixes; nixpkgs/easyps now match the pins of the main pslua repo (purs-tidy is dropped from the shell - it does not evaluate against the current nixpkgs and nothing in this repo uses it). The prelude package is temporarily overridden with a local clone until the Lua 5.1 compat fixes land in a released package set.

A single do-block with a hundred statements compiles to Lua nested beyond the parser limit of stock Lua 5.1 interpreters ("chunk has too many syntax levels", purescript-lua/purescript-lua#46). One function per log section keeps the nesting flat. Assertions are unchanged except two byte-semantics adaptations in NonEmpty.CodeUnits: charAt over "5 €" yields the first UTF-8 byte of the euro sign, and the toCharArray case uses an ASCII string because a multi-byte Char literal is not expressible when Char is a byte.

Closes purescript-lua/purescript-lua#36. In pslua a String is a Lua byte string holding UTF-8, so all seven foreign functions decode/encode UTF-8 directly with plain Lua 5.1 arithmetic (no utf8 library, no bitops). The PureScript fallback arguments are UTF-16 surrogate-pair logic and are ignored: under the byte representation they would produce garbage. The pure uncons carried the same surrogate logic (charAt 0/1 yields bytes here, never in the surrogate range, so it silently returned the first byte as a code point); its tail now starts after the UTF-8 width of the head. The test suite replaces the upstream lone-surrogate string (not representable in UTF-8) with well-formed Unicode of widths 1-4 and keeps its structure; lone-surrogate-specific probes are dropped and toCodePointArray/fromCodePointArray roundtrips are added.

CodeUnits returned 1-based Lua positions where PureScript expects 0-based indices (charAt answered Nothing for index 0, indexOf was off by one, lastIndexOf also ignored the pattern width), and take/slice/ splitAt passed negative indices straight to string.sub, which counts from the end instead of clamping. Index handling now mirrors the upstream JS (indexOf guard, lastIndexOf clamp, slice wrapping). Common.replace/replaceAll/split treated Pattern as a Lua pattern and even applied gsub to the Replacement; PureScript semantics are literal matching, so the rewrite uses string.find in plain mode. replaceAll also no longer leaks gsub's second return value, and split keeps empty fields and multi-character separators.

Unisay · 2026-06-13T07:33:55Z

+-- Decodes the code point starting at byte position i.
+-- Returns the code point and the position of the next one.
+-- An invalid leading byte is returned as-is (one byte consumed).
+local function decode(s, i)


This is quite an implementation! It needs to be thoroughly covered by unit tests.

Added in 9cd49b8: a encode/decode boundaries test round-trips a code point at every UTF-8 width edge (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000, 0x10FFFF) through fromCodePointArray→toCodePointArray and checks the byte widths via singleton, plus the malformed input test for the decoder's error path.

Copilot

Pull request overview

This PR ports Data.String.CodePoints to the pslua string model (Lua UTF-8 byte strings), fixes several Data.String.CodeUnits / Data.String.Common semantic mismatches (0-based indexing and literal pattern handling), and repairs/enables the test suite for this fork.

Changes:

Implement UTF-8 encode/decode based Data.String.CodePoints FFI (Lua 5.1 compatible, no utf8 lib / bitops).
Fix CodeUnits indexing and Common string operations to treat Pattern literally (plain string.find) and align behavior with upstream JS semantics.
Repair and restructure tests (remove Regex suite, split large do-blocks), add spago-test.dhall, and update Nix/package-set pins.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
test/Test/Main.purs	Removes Regex suite wiring from the test runner.
test/Test/Data/String/Regex.purs	Deletes Regex tests (module not provided in this fork).
test/Test/Data/String/NonEmpty/CodeUnits.purs	Splits large test block into per-section functions; adjusts expectations for byte/Char semantics.
test/Test/Data/String/NonEmpty.purs	Splits NonEmpty string tests into per-section functions.
test/Test/Data/String/CodeUnits.purs	Splits CodeUnits tests into per-section functions.
test/Test/Data/String/CodePoints.purs	Adapts upstream CodePoints tests to well-formed UTF-8 and adds array roundtrip coverage.
src/Data/String/Common.lua	Makes replace/replaceAll/split literal (no Lua pattern interpretation) and closer to JS semantics.
src/Data/String/CodeUnits.lua	Fixes 0-based indexing and (mostly) JS-like clamping/search behavior in Lua FFI.
src/Data/String/CodePoints.purs	Adjusts `uncons` to advance by UTF-8 width in pslua.
src/Data/String/CodePoints.lua	Replaces stub with a UTF-8 decoder/encoder implementation for CodePoints operations.
spago-test.dhall	Adds a Spago test configuration including test sources and pslua backend invocation.
packages.dhall	Updates package set pins and introduces a temporary `prelude` override.
flake.nix	Removes `purs-tidy` from devShell inputs.
flake.lock	Updates flake inputs/pins.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Unisay · 2026-06-13T07:50:31Z

+-- In pslua strings are UTF-8 byte strings (a code unit is a byte), so the
+-- tail starts after the UTF-8 encoding of the head code point.
 uncons :: String -> Maybe { head :: CodePoint, tail :: String }
 uncons s = case CU.length s of
  0 -> Nothing
-  1 -> Just { head: CodePoint (fromEnum (Unsafe.charAt 0 s)), tail: "" }
  _ ->
    let
-      cu0 = fromEnum (Unsafe.charAt 0 s)
-      cu1 = fromEnum (Unsafe.charAt 1 s)
+      cp = unsafeCodePointAt0 s
    in
-      if isLead cu0 && isTrail cu1
-        then Just { head: unsurrogate cu0 cu1, tail: CU.drop 2 s }
-        else Just { head: CodePoint cu0, tail: CU.drop 1 s }
+      Just { head: cp, tail: CU.drop (utf8Width cp) s }


Fixed in 9cd49b8. You're right — the width was recomputed from the decoded value, but the decoder consumes a single byte for a malformed lead. uncons now takes the tail as drop 1, which is defined through the same native decoder that produces the head, so the two can no longer disagree (utf8Width is gone). Added a malformed input test: a lone continuation byte between ASCII anchors, and truncated two-/three-byte leads.

Unisay · 2026-06-13T07:50:32Z

+            if startAt < 0 or startAt > #s then return nothing end
+            local i = s:find(x, startAt + 1, true)


This actually matches upstream Data.String.CodeUnits, which is deliberate. The JS FFI is if (startAt < 0 || startAt > s.length) return nothing; before indexOf, and the upstream test suite pins it: indexOf' (Pattern "") (-1) "ab" == Nothing and indexOf' (Pattern "") 3 "ab" == Nothing (both in Test.Data.String.CodeUnits). Clamping would diverge from upstream and fail those. Kept as-is.

Unisay · 2026-06-13T07:50:33Z

+-- Temporary override until purescript-lua-prelude v7.2.1 is released
+-- and picked up by a package set release (Lua 5.1 compat fixes).
+let overrides = { prelude = ../purescript-lua-prelude/spago.dhall as Location }
+
+in  upstream-ps // upstream-lua // overrides


Agreed — the relative path is a deliberate temporary stopgap, not the final state (called out in the "chore: repair the test build" commit). It is part of the release chain: prelude needs Lua 5.1 FFI fixes (purescript-lua/purescript-lua-prelude#5, merged) that are not in any released package set yet. Once those ship in a set release, this override is dropped and packages.dhall points at the released set — that commit closes the loop before merge. A fresh clone can't build until then, which is why the PR stays in draft.

Unisay · 2026-06-13T07:34:33Z

+end
+
+-- Encodes a code point as a UTF-8 byte string.
+local function encode(cp)


Same here, make sure the unit-test coverage is decent.

Covered by the new encode/decode boundaries and malformed input tests (9cd49b8).

Unisay · 2026-06-13T07:35:19Z

+      end
+    end
+  end),
+  _codePointAt = (function(_)


I see quite some nesting here and wonder how much of a hassle is it to use uncurried FFI functions?

Assessment: declaring the foreign imports with FnN (and runFnN at the call sites) does flatten the .lua — e.g. _codePointAt = function(just, nothing, n, s) ... end instead of five nested closures — and lets me drop the vestigial fallback/at0 arguments the Lua ignores. The cost is that the .purs foreign signatures then diverge from upstream Data.String.CodePoints, where the curried fallback args exist precisely so one .purs is shared across the JS and Lua FFI; that makes future rebases onto upstream noisier. The nesting is confined to internal _-prefixed bindings, so it's invisible to users. I left it curried to keep the .purs diff to upstream minimal, but I'm happy to switch to Fn/runFn if you'd rather have the flatter FFI — your call on that trade-off.

Unisay · 2026-06-13T07:35:51Z

-          if i >= 1 and i <= #s then
-            return just(s:sub(i, i))
+          if i >= 0 and i < #s then
+            return just(s:sub(i + 1, i + 1))


Subtle, needs to be covered by tests

Covered: the 0-based indexing and clamping are exercised by the split-out charAt/indexOf/indexOf'/lastIndexOf'/take/slice/splitAt sections, which are the upstream cases now actually running against this FFI.

Review follow-up on #1. uncons recomputed the tail offset from the head code point value (utf8Width), but the decoder consumes a single byte for a malformed leading byte (a lone continuation or a truncated multi-byte lead). For those inputs the width disagreed with what was actually read and uncons skipped extra bytes, desynchronising the rest of the string. The tail is now `drop 1`, which is defined via the same native decoder that produces the head, so the two can never disagree; utf8Width is gone. Add direct coverage of the encoder/decoder: round-trips at every UTF-8 width boundary (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000, 0x10FFFF) and malformed input (lone continuation byte between ASCII anchors, and truncated two- and three-byte leads), where the raw bytes are built with CodeUnits.singleton since a string literal would be re-encoded as valid UTF-8.

Switch packages.dhall from the local prelude override to the released package set (psc-0.15.15-20260613, which carries prelude v7.2.1), so a fresh clone and CI can build without a sibling checkout. Bump the flake pin of pslua to the merged Char-literal-escaping fix. Replace the inherited bower CI (no bower.json in this fork) with one that runs scripts/test and luacheck in the nix dev shell, and make scripts/test exit non-zero on failure so red tests fail CI. uncons now reaches take (via drop), closing a recursive group with the point-free `take = _take takeFallback`; eta-expand take so every member of the group is a function and purs accepts it.

Unisay added 4 commits June 12, 2026 21:45

Unisay mentioned this pull request Jun 13, 2026

Data.String.CodePoints in purescript-lua-strings is not implemented purescript-lua/purescript-lua#36

Closed

Unisay self-assigned this Jun 13, 2026

Unisay requested a review from Copilot June 13, 2026 07:29

Unisay marked this pull request as ready for review June 13, 2026 07:29

Copilot started reviewing on behalf of Unisay June 13, 2026 07:29 View session

Unisay commented Jun 13, 2026

View reviewed changes

Copilot AI reviewed Jun 13, 2026

View reviewed changes

Unisay commented Jun 13, 2026

View reviewed changes

Unisay added 2 commits June 13, 2026 09:49

Unisay merged commit 395c4b6 into master Jun 13, 2026

Unisay deleted the issue/codepoints-ffi branch June 13, 2026 20:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Data.String.CodePoints, repair the test suite#1

Implement Data.String.CodePoints, repair the test suite#1
Unisay merged 6 commits into
masterfrom
issue/codepoints-ffi

Unisay commented Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Unisay Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if startAt < 0 or startAt > #s then return nothing end
		local i = s:find(x, startAt + 1, true)

Conversation

Unisay commented Jun 13, 2026

Implement Data.String.CodePoints, repair the test suite

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants