[fork-ffi] enums: toCharCode — toCharCode is implemented as c:byte(), which returns the value of… #79

@Unisay

Description

Package: purescript-lua-enums
File: src/Data/Enum.lua
Function: toCharCode
Class: semantics Severity: high

toCharCode is implemented as c:byte(), which returns the FIRST BYTE (0..255) of the argument's UTF-8 encoding, not the character's code point.

The pslua compiler emits a PureScript Char literal as a Lua string holding the character's UTF-8 bytes (Lua.hs: LiteralChar -> String of Text.singleton c, printed raw via dquotes(pretty t)). Any Char above U+007F therefore arrives as a multi-byte UTF-8 string, and c:byte() returns only its leading byte. Confirmed on Lua 5.1:

  U+00E9 ('é', UTF-8 C3 A9) -> 195 instead of 233
  U+FFFF (top, UTF-8 EF BF BF) -> 239 instead of 65535

The JS FFI uses c.charCodeAt(0), which returns the UTF-16 code unit (0..65535).

The bug also breaks Data.Enum's cardinality = toCharCode top - toCharCode bottom (evaluates to 239 - 0 = 239 instead of 65535) and fromEnum :: Char -> Int. The current implementation is correct only for the ASCII subrange 0..127.

Current (Lua):

toCharCode = (function(c) return c:byte() end)
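
The wrong values reported above can be reproduced with a few lines of plain Lua. This is a minimal sketch, not the audit's actual test: the local names e_acute and top are illustrative, and the strings are written with decimal escapes on the assumption that they match the UTF-8 bytes pslua emits for those Char literals. Treating bottom as the one-byte string "\0" is likewise an assumption.

```lua
-- Current behaviour, copied from the issue: returns only the first byte.
local toCharCode = function(c) return c:byte() end

-- Decimal escapes spell out the UTF-8 bytes explicitly, so the result does
-- not depend on the encoding of this source file.
local e_acute = "\195\169"      -- U+00E9, UTF-8 C3 A9 (illustrative name)
local top     = "\239\191\191"  -- U+FFFF, UTF-8 EF BF BF (illustrative name)

print(toCharCode("A"))      -- 65  (ASCII is unaffected)
print(toCharCode(e_acute))  -- 195 (expected 233)
print(toCharCode(top))      -- 239 (expected 65535)

-- The cardinality computation inherits the error.
-- Assumption: bottom (U+0000) arrives as the one-byte string "\0".
print(toCharCode(top) - toCharCode("\0"))  -- 239 (expected 65535)
```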

Expected (matching JS): c.charCodeAt(0) returns the UTF-16 code unit of the first character (0..65535), which equals the code point for every Char up to U+FFFF. For 'A' -> 65, 'é' (U+00E9) -> 233, top (U+FFFF) -> 65535.

Proposed fix:

Decode the first UTF-8 code point of the string instead of taking the first byte. E.g.:
  toCharCode = function(c)
    local b1 = c:byte(1)
    if b1 < 0x80 then return b1 end
    if b1 < 0xE0 then return (b1 - 0xC0) * 0x40 + (c:byte(2) - 0x80) end
    if b1 < 0xF0 then return (b1 - 0xE0) * 0x1000 + (c:byte(2) - 0x80) * 0x40 + (c:byte(3) - 0x80) end
    return (b1 - 0xF0) * 0x40000 + (c:byte(2) - 0x80) * 0x1000 + (c:byte(3) - 0x80) * 0x40 + (c:byte(4) - 0x80)
  end
Verified on Lua 5.1: 'A'->65, 'é'->233, top->65535.
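
Beyond the three spot checks, the proposed decoder could be exercised over the whole Char range with a round-trip test. The sketch below is an assumption-laden harness, not part of the package: encodeBmp is a hypothetical helper written only for this test, and it encodes surrogate values (0xD800..0xDFFF) as ordinary three-byte sequences, which may or may not match what pslua emits for such Chars. It sticks to math.floor and % so it should also run on Lua 5.1.

```lua
-- Proposed decoder, copied from the issue.
local toCharCode = function(c)
  local b1 = c:byte(1)
  if b1 < 0x80 then return b1 end
  if b1 < 0xE0 then return (b1 - 0xC0) * 0x40 + (c:byte(2) - 0x80) end
  if b1 < 0xF0 then return (b1 - 0xE0) * 0x1000 + (c:byte(2) - 0x80) * 0x40 + (c:byte(3) - 0x80) end
  return (b1 - 0xF0) * 0x40000 + (c:byte(2) - 0x80) * 0x1000 + (c:byte(3) - 0x80) * 0x40 + (c:byte(4) - 0x80)
end

-- Hypothetical test helper: UTF-8 encode a value in 0..0xFFFF.
-- Surrogates are encoded like any other three-byte value (an assumption).
local function encodeBmp(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  end
  return string.char(
    0xE0 + math.floor(cp / 0x1000),
    0x80 + math.floor(cp / 0x40) % 0x40,
    0x80 + cp % 0x40)
end

-- Encode every value in the Char range and decode it again.
local mismatches = 0
for cp = 0, 0xFFFF do
  if toCharCode(encodeBmp(cp)) ~= cp then
    mismatches = mismatches + 1
  end
end
print(mismatches)  -- 0
```

The four-byte branch is not covered here, since top is U+FFFF and every value in 0..0xFFFF encodes to at most three bytes.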

Found by the FFI audit; reproduced under Lua 5.1.
