Skip to content

Implement Data.String.CodePoints, repair the test suite#1

Merged
Unisay merged 6 commits into
masterfrom
issue/codepoints-ffi
Jun 13, 2026
Merged

Implement Data.String.CodePoints, repair the test suite#1
Unisay merged 6 commits into
masterfrom
issue/codepoints-ffi

Conversation

@Unisay

@Unisay Unisay commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Implement Data.String.CodePoints, repair the test suite

Closes purescript-lua/purescript-lua#36.

CodePoints. The FFI file was a stub: 2 functions out of 7, both throwing Not implemented, with the wrong arity. In pslua a String is a Lua byte string holding UTF-8, so all seven foreign functions now decode/encode UTF-8 directly in plain Lua 5.1 arithmetic (no utf8 library, no bitops, string.char fed byte by byte). The PureScript fallback arguments are UTF-16 surrogate-pair logic and are deliberately ignored — under the byte representation they produce garbage. The pure uncons carried the same surrogate logic (charAt 0/1 yields bytes here, which are never in the surrogate range, so it silently returned the first byte as the head code point); its tail now starts after the UTF-8 width of the head.

CodeUnits and Common were broken too. Once the suite ran, it showed CodeUnits returning 1-based Lua positions where 0-based indices are expected (charAt 0 was Nothing, indexOf off by one, lastIndexOf also ignored the pattern width), negative indices passed straight to string.sub (which counts from the end instead of clamping), and Common.replace/replaceAll/split treating Pattern as a Lua pattern — replaceAll (Pattern "[b]") ... replaced b. Index handling now mirrors the upstream JS, and matching is literal (string.find plain mode).

Test suite. It had never run against this fork: scripts/test referenced a spago-test.dhall that did not exist, the package set pins predated prelude v7.2.0 and the assert override, and Test.Main imported Regex tests with no Data.String.Regex in src (dropped here, tracked separately). Each suite is also split into per-section functions: one hundred-statement do block compiles to Lua nested beyond the parser limit of stock Lua 5.1 (purescript-lua/purescript-lua#46). The CodePoints suite keeps the upstream structure but swaps the lone-surrogate test string (not representable in UTF-8) for well-formed Unicode of widths 1–4, and gains toCodePointArray/fromCodePointArray roundtrips. Two byte-semantics adaptations in NonEmpty.CodeUnits are commented inline.

Merge order. The suite needs Unisay/purescript-lua PR with the Char-literal escaping fix (a '\n' Char literal otherwise produces an unloadable chunk) and the prelude fix/lua51-compat PR. packages.dhall temporarily overrides prelude with a local clone; swapping it back to a released package set is a follow-up commit once a set with prelude v7.2.1 is out. With all three in place, scripts/test is fully green.

Unisay added 4 commits June 12, 2026 21:45
The repo never had the spago-test.dhall that scripts/test refers to,
and the package set pins (psc-0.15.15-2024*) predate prelude v7.2.0
(unit = {}) and the assert override. Bump the sets, define the test
build on top of spago.dhall, and drop the Regex test wiring: the fork
has no Data.String.Regex implementation (tracked separately).

The flake inputs were two years stale as well: the pinned pslua
predates several codegen fixes; nixpkgs/easyps now match the pins of
the main pslua repo (purs-tidy is dropped from the shell - it does not
evaluate against the current nixpkgs and nothing in this repo uses it).

The prelude package is temporarily overridden with a local clone until
the Lua 5.1 compat fixes land in a released package set.
A single do-block with a hundred statements compiles to Lua nested
beyond the parser limit of stock Lua 5.1 interpreters ("chunk has too
many syntax levels", purescript-lua/purescript-lua#46). One function per log
section keeps the nesting flat. Assertions are unchanged except two
byte-semantics adaptations in NonEmpty.CodeUnits: charAt over "5 €"
yields the first UTF-8 byte of the euro sign, and the toCharArray case
uses an ASCII string because a multi-byte Char literal is not
expressible when Char is a byte.
Closes purescript-lua/purescript-lua#36.

In pslua a String is a Lua byte string holding UTF-8, so all seven
foreign functions decode/encode UTF-8 directly with plain Lua 5.1
arithmetic (no utf8 library, no bitops). The PureScript fallback
arguments are UTF-16 surrogate-pair logic and are ignored: under the
byte representation they would produce garbage.

The pure uncons carried the same surrogate logic (charAt 0/1 yields
bytes here, never in the surrogate range, so it silently returned the
first byte as a code point); its tail now starts after the UTF-8 width
of the head.

The test suite replaces the upstream lone-surrogate string (not
representable in UTF-8) with well-formed Unicode of widths 1-4 and
keeps its structure; lone-surrogate-specific probes are dropped and
toCodePointArray/fromCodePointArray roundtrips are added.
CodeUnits returned 1-based Lua positions where PureScript expects
0-based indices (charAt answered Nothing for index 0, indexOf was off
by one, lastIndexOf also ignored the pattern width), and take/slice/
splitAt passed negative indices straight to string.sub, which counts
from the end instead of clamping. Index handling now mirrors the
upstream JS (indexOf guard, lastIndexOf clamp, slice wrapping).

Common.replace/replaceAll/split treated Pattern as a Lua pattern and
even applied gsub to the Replacement; PureScript semantics are literal
matching, so the rewrite uses string.find in plain mode. replaceAll
also no longer leaks gsub's second return value, and split keeps
empty fields and multi-character separators.
-- Decodes the code point starting at byte position i.
-- Returns the code point and the position of the next one.
-- An invalid leading byte is returned as-is (one byte consumed).
local function decode(s, i)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite an implementation! It needs to be thoroughly covered by unit tests.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 9cd49b8: a encode/decode boundaries test round-trips a code point at every UTF-8 width edge (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000, 0x10FFFF) through fromCodePointArraytoCodePointArray and checks the byte widths via singleton, plus the malformed input test for the decoder's error path.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ports Data.String.CodePoints to the pslua string model (Lua UTF-8 byte strings), fixes several Data.String.CodeUnits / Data.String.Common semantic mismatches (0-based indexing and literal pattern handling), and repairs/enables the test suite for this fork.

Changes:

  • Implement UTF-8 encode/decode based Data.String.CodePoints FFI (Lua 5.1 compatible, no utf8 lib / bitops).
  • Fix CodeUnits indexing and Common string operations to treat Pattern literally (plain string.find) and align behavior with upstream JS semantics.
  • Repair and restructure tests (remove Regex suite, split large do-blocks), add spago-test.dhall, and update Nix/package-set pins.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/Test/Main.purs Removes Regex suite wiring from the test runner.
test/Test/Data/String/Regex.purs Deletes Regex tests (module not provided in this fork).
test/Test/Data/String/NonEmpty/CodeUnits.purs Splits large test block into per-section functions; adjusts expectations for byte/Char semantics.
test/Test/Data/String/NonEmpty.purs Splits NonEmpty string tests into per-section functions.
test/Test/Data/String/CodeUnits.purs Splits CodeUnits tests into per-section functions.
test/Test/Data/String/CodePoints.purs Adapts upstream CodePoints tests to well-formed UTF-8 and adds array roundtrip coverage.
src/Data/String/Common.lua Makes replace/replaceAll/split literal (no Lua pattern interpretation) and closer to JS semantics.
src/Data/String/CodeUnits.lua Fixes 0-based indexing and (mostly) JS-like clamping/search behavior in Lua FFI.
src/Data/String/CodePoints.purs Adjusts uncons to advance by UTF-8 width in pslua.
src/Data/String/CodePoints.lua Replaces stub with a UTF-8 decoder/encoder implementation for CodePoints operations.
spago-test.dhall Adds a Spago test configuration including test sources and pslua backend invocation.
packages.dhall Updates package set pins and introduces a temporary prelude override.
flake.nix Removes purs-tidy from devShell inputs.
flake.lock Updates flake inputs/pins.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/Data/String/CodePoints.purs Outdated
Comment on lines +191 to +200
-- In pslua strings are UTF-8 byte strings (a code unit is a byte), so the
-- tail starts after the UTF-8 encoding of the head code point.
uncons :: String -> Maybe { head :: CodePoint, tail :: String }
uncons s = case CU.length s of
0 -> Nothing
1 -> Just { head: CodePoint (fromEnum (Unsafe.charAt 0 s)), tail: "" }
_ ->
let
cu0 = fromEnum (Unsafe.charAt 0 s)
cu1 = fromEnum (Unsafe.charAt 1 s)
cp = unsafeCodePointAt0 s
in
if isLead cu0 && isTrail cu1
then Just { head: unsurrogate cu0 cu1, tail: CU.drop 2 s }
else Just { head: CodePoint cu0, tail: CU.drop 1 s }
Just { head: cp, tail: CU.drop (utf8Width cp) s }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9cd49b8. You're right — the width was recomputed from the decoded value, but the decoder consumes a single byte for a malformed lead. uncons now takes the tail as drop 1, which is defined through the same native decoder that produces the head, so the two can no longer disagree (utf8Width is gone). Added a malformed input test: a lone continuation byte between ASCII anchors, and truncated two-/three-byte leads.

Comment on lines +65 to +66
if startAt < 0 or startAt > #s then return nothing end
local i = s:find(x, startAt + 1, true)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually matches upstream Data.String.CodeUnits, which is deliberate. The JS FFI is if (startAt < 0 || startAt > s.length) return nothing; before indexOf, and the upstream test suite pins it: indexOf' (Pattern "") (-1) "ab" == Nothing and indexOf' (Pattern "") 3 "ab" == Nothing (both in Test.Data.String.CodeUnits). Clamping would diverge from upstream and fail those. Kept as-is.

Comment thread packages.dhall Outdated
Comment on lines +9 to +13
-- Temporary override until purescript-lua-prelude v7.2.1 is released
-- and picked up by a package set release (Lua 5.1 compat fixes).
let overrides = { prelude = ../purescript-lua-prelude/spago.dhall as Location }

in upstream-ps // upstream-lua // overrides

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the relative path is a deliberate temporary stopgap, not the final state (called out in the "chore: repair the test build" commit). It is part of the release chain: prelude needs Lua 5.1 FFI fixes (purescript-lua/purescript-lua-prelude#5, merged) that are not in any released package set yet. Once those ship in a set release, this override is dropped and packages.dhall points at the released set — that commit closes the loop before merge. A fresh clone can't build until then, which is why the PR stays in draft.

end

-- Encodes a code point as a UTF-8 byte string.
local function encode(cp)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, make sure the unit-test coverage is decent.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered by the new encode/decode boundaries and malformed input tests (9cd49b8).

end
end
end),
_codePointAt = (function(_)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see quite some nesting here and wonder how much of a hassle is it to use uncurried FFI functions?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: declaring the foreign imports with FnN (and runFnN at the call sites) does flatten the .lua — e.g. _codePointAt = function(just, nothing, n, s) ... end instead of five nested closures — and lets me drop the vestigial fallback/at0 arguments the Lua ignores. The cost is that the .purs foreign signatures then diverge from upstream Data.String.CodePoints, where the curried fallback args exist precisely so one .purs is shared across the JS and Lua FFI; that makes future rebases onto upstream noisier. The nesting is confined to internal _-prefixed bindings, so it's invisible to users. I left it curried to keep the .purs diff to upstream minimal, but I'm happy to switch to Fn/runFn if you'd rather have the flatter FFI — your call on that trade-off.

Comment on lines -13 to +19
if i >= 1 and i <= #s then
return just(s:sub(i, i))
if i >= 0 and i < #s then
return just(s:sub(i + 1, i + 1))

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtle, needs to be covered by tests

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered: the 0-based indexing and clamping are exercised by the split-out charAt/indexOf/indexOf'/lastIndexOf'/take/slice/splitAt sections, which are the upstream cases now actually running against this FFI.

Unisay added 2 commits June 13, 2026 09:49
Review follow-up on #1.

uncons recomputed the tail offset from the head code point value
(utf8Width), but the decoder consumes a single byte for a malformed
leading byte (a lone continuation or a truncated multi-byte lead).
For those inputs the width disagreed with what was actually read and
uncons skipped extra bytes, desynchronising the rest of the string.
The tail is now `drop 1`, which is defined via the same native decoder
that produces the head, so the two can never disagree; utf8Width is
gone.

Add direct coverage of the encoder/decoder: round-trips at every UTF-8
width boundary (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000, 0x10FFFF) and
malformed input (lone continuation byte between ASCII anchors, and
truncated two- and three-byte leads), where the raw bytes are built
with CodeUnits.singleton since a string literal would be re-encoded as
valid UTF-8.
Switch packages.dhall from the local prelude override to the released
package set (psc-0.15.15-20260613, which carries prelude v7.2.1), so a
fresh clone and CI can build without a sibling checkout. Bump the flake
pin of pslua to the merged Char-literal-escaping fix.

Replace the inherited bower CI (no bower.json in this fork) with one
that runs scripts/test and luacheck in the nix dev shell, and make
scripts/test exit non-zero on failure so red tests fail CI.

uncons now reaches take (via drop), closing a recursive group with the
point-free `take = _take takeFallback`; eta-expand take so every member
of the group is a function and purs accepts it.
@Unisay Unisay merged commit 395c4b6 into master Jun 13, 2026
@Unisay Unisay deleted the issue/codepoints-ffi branch June 13, 2026 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Data.String.CodePoints in purescript-lua-strings is not implemented

2 participants