Package: purescript-lua-enums
File: src/Data/Enum.lua
Function: toCharCode
Class: semantics
Severity: high
toCharCode is implemented as c:byte(), which returns the value of the FIRST BYTE (0..255) of the argument's UTF-8 encoding, not the character's code point.

The pslua compiler emits a PureScript Char literal as a Lua string holding the character's UTF-8 bytes (Lua.hs LiteralChar -> String of Text.singleton c, printed raw via dquotes(pretty t)). Any Char above U+007F therefore arrives as a multi-byte UTF-8 string, and c:byte() returns only its leading byte. Confirmed on Lua 5.1: U+00E9 ('é', UTF-8 C3 A9) -> 195 instead of 233; U+FFFF (top, UTF-8 EF BF BF) -> 239 instead of 65535. The JS FFI uses c.charCodeAt(0), which returns the UTF-16 code unit (0..65535).

The bug also breaks Data.Enum's cardinality = toCharCode top - toCharCode bottom (evaluates to 239 - 0 = 239 instead of 65535) and fromEnum :: Char -> Int. The result is correct only for the ASCII subrange 0..127.
Current (Lua):
toCharCode = (function(c) return c:byte() end)
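The behaviour described above can be reproduced with a short standalone script. This is a minimal sketch, not part of the package: the samples table and the results table are hypothetical names introduced for illustration, and the byte sequences are written with decimal escapes so the script also runs on Lua 5.1.

```lua
-- Current implementation, copied from src/Data/Enum.lua.
local toCharCode = function(c) return c:byte() end

-- Hypothetical sample table: each entry is the UTF-8 encoding of one Char.
local samples = {
  { name = "A",      bytes = "A",            expected = 65 },
  { name = "U+00E9", bytes = "\195\169",     expected = 233 },   -- C3 A9
  { name = "U+FFFF", bytes = "\239\191\191", expected = 65535 }, -- EF BF BF
}

local results = {}
for _, s in ipairs(samples) do
  results[s.name] = toCharCode(s.bytes)
  print(s.name, "got " .. results[s.name], "expected " .. s.expected)
end
```

Only the ASCII sample matches its expected value; the other two yield the leading byte of the encoding.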
Expected (matching the JS FFI): c.charCodeAt(0) returns the UTF-16 code unit of the first character (0..65535), which equals the code point for non-surrogate characters in that range. For 'A' -> 65, 'é' (U+00E9) -> 233, top (U+FFFF) -> 65535.
Proposed fix:
Decode the first UTF-8 code point of the string instead of taking the first byte. E.g.:
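toCharCode = function(c)
  -- Assumes c is a non-empty, well-formed UTF-8 string; no validation is done.
  local b1 = c:byte(1)
  -- 1 byte: U+0000..U+007F
  if b1 < 0x80 then return b1 end
  -- 2 bytes: lead byte 0xC0..0xDF
  if b1 < 0xE0 then return (b1 - 0xC0) * 0x40 + (c:byte(2) - 0x80) end
  -- 3 bytes: lead byte 0xE0..0xEF
  if b1 < 0xF0 then return (b1 - 0xE0) * 0x1000 + (c:byte(2) - 0x80) * 0x40 + (c:byte(3) - 0x80) end
  -- 4 bytes: lead byte 0xF0 and above
  return (b1 - 0xF0) * 0x40000 + (c:byte(2) - 0x80) * 0x1000 + (c:byte(3) - 0x80) * 0x40 + (c:byte(4) - 0x80)
end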
Verified on Lua 5.1: 'A'->65, 'é'->233, top->65535.
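The verification above could be captured as a small regression check. The following is a sketch under stated assumptions: the decoder is repeated so the script is self-contained, bottom and top are hypothetical local names standing in for the Char bounds U+0000 and U+FFFF, and the cross-check against the standard utf8 library only runs where that library exists (Lua 5.3 and later).

```lua
-- Decoder from the proposed fix, repeated so this check runs on its own.
local function toCharCode(c)
  local b1 = c:byte(1)
  if b1 < 0x80 then return b1 end
  if b1 < 0xE0 then return (b1 - 0xC0) * 0x40 + (c:byte(2) - 0x80) end
  if b1 < 0xF0 then
    return (b1 - 0xE0) * 0x1000 + (c:byte(2) - 0x80) * 0x40 + (c:byte(3) - 0x80)
  end
  return (b1 - 0xF0) * 0x40000 + (c:byte(2) - 0x80) * 0x1000
       + (c:byte(3) - 0x80) * 0x40 + (c:byte(4) - 0x80)
end

-- Hypothetical stand-ins for the Char bounds, as UTF-8 byte strings.
local bottom = "\0"          -- U+0000
local top = "\239\191\191"   -- U+FFFF, UTF-8 EF BF BF

assert(toCharCode("A") == 65)
assert(toCharCode("\195\169") == 233)   -- U+00E9
assert(toCharCode(top) == 65535)

-- The cardinality expression from Data.Enum now yields the full range.
local cardinality = toCharCode(top) - toCharCode(bottom)
assert(cardinality == 65535)

-- Where the utf8 library is available (Lua 5.3+), use it as an
-- independent reference; on Lua 5.1 this branch is skipped.
if utf8 then
  for _, s in ipairs({ "A", "\195\169", top }) do
    assert(utf8.codepoint(s) == toCharCode(s))
  end
end
```

The check passes silently; any mismatch raises an assertion error.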
Found by the FFI audit; reproduced under Lua 5.1.