Base64 by lemire · Pull Request #375 · simdutf/simdutf

lemire · 2024-03-16T17:59:12Z

This PR adds accelerated base64 encoding and decoding to simdutf. The standard adopted is WHATWG forgiving-base64: inputs may contain ASCII whitespace.
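For readers unfamiliar with forgiving-base64, here is a minimal scalar sketch of the decoding rules as the WHATWG text describes them: strip ASCII whitespace, drop up to two trailing `=` when the length is a multiple of 4, reject a length that is 1 modulo 4, then decode and discard the leftover bits. This is not the SIMD code from this PR; the function name `forgiving_base64_decode` is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical scalar reference for WHATWG forgiving-base64 decoding.
// Returns std::nullopt when the input is invalid.
std::optional<std::vector<uint8_t>> forgiving_base64_decode(std::string_view input) {
  // Step 1: remove ASCII whitespace (TAB, LF, FF, CR, SPACE).
  std::string data;
  data.reserve(input.size());
  for (char c : input) {
    if (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r') continue;
    data.push_back(c);
  }
  // Step 2: if the length is a multiple of 4, drop one or two trailing '='.
  if (data.size() % 4 == 0) {
    for (int k = 0; k < 2 && !data.empty() && data.back() == '='; ++k) {
      data.pop_back();
    }
  }
  // Step 3: a remainder of 1 can never encode a whole byte.
  if (data.size() % 4 == 1) return std::nullopt;

  std::vector<uint8_t> out;
  uint32_t buffer = 0;
  int bits = 0;
  for (char c : data) {
    int v;
    if (c >= 'A' && c <= 'Z') v = c - 'A';
    else if (c >= 'a' && c <= 'z') v = c - 'a' + 26;
    else if (c >= '0' && c <= '9') v = c - '0' + 52;
    else if (c == '+') v = 62;
    else if (c == '/') v = 63;
    else return std::nullopt;  // Step 4: not in the base64 alphabet.
    buffer = (buffer << 6) | static_cast<uint32_t>(v);
    bits += 6;
    if (bits == 24) {  // A full quantum: emit three bytes.
      out.push_back(static_cast<uint8_t>(buffer >> 16));
      out.push_back(static_cast<uint8_t>((buffer >> 8) & 0xFF));
      out.push_back(static_cast<uint8_t>(buffer & 0xFF));
      buffer = 0;
      bits = 0;
    }
  }
  // Step 5: handle the tail, discarding the unused low bits.
  assert(bits == 0 || bits == 12 || bits == 18);
  if (bits == 12) {
    out.push_back(static_cast<uint8_t>((buffer >> 4) & 0xFF));
  } else if (bits == 18) {
    out.push_back(static_cast<uint8_t>((buffer >> 10) & 0xFF));
    out.push_back(static_cast<uint8_t>((buffer >> 2) & 0xFF));
  }
  return out;
}
```

Under these rules, inputs such as "aGVs bG8h\n" and the unpadded "YQ" are accepted, while "Y" and "YQ=*" are rejected.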

References:

forgiving-base64 specification.

WojciechMula

Impressive work, looks good to me.

WojciechMula · 2024-03-16T20:04:00Z

+ * that is not a valid base64 character (INVALID_BASE64_CHARACTER).
+ *
+ * You should call this function with a buffer that is at least maximal_binary_length_from_base64(input, length) bytes long.
+ * If you fail to provide that much space, the function may cause a buffer overflow.


I would name this function base64_to_binary_unsafe and then put this in the requirements.

And then we should also provide base64_to_binary(const char* input) -> std::vector<uint8_t>, which calculates the safe size and allocates memory internally. Or maybe something like base64_to_binary<Container>(const char* input, Container& cont), with a static_assert that the Container has a resize method.

Currently, all of the transcoding functions have similar requirements (the caller is responsible to allocate enough memory as per the specification). In the scope of simdutf, I do not consider this function unsafe: if you use it according to its rather simple specification, it is safe (we have good tests). It is easy enough to add higher-level interfaces once we have the low-level ones. I have opened an issue: #377
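As a rough illustration of the higher-level interface discussed above, the sketch below sizes a container to a safe upper bound, calls a low-level decoder, and then shrinks the container to the number of bytes written. All names here (`max_binary_length`, `has_resize`, `decode_into`) are hypothetical, and the low-level decoder is passed in as a callable so the example stays self-contained; in simdutf the bound would come from maximal_binary_length_from_base64. Error reporting is deliberately left out.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical upper bound: every 4 base64 characters yield at most 3 bytes.
constexpr std::size_t max_binary_length(std::size_t base64_length) {
  return (base64_length + 3) / 4 * 3;
}

// Detects whether T has a resize(std::size_t) member (C++17 std::void_t idiom).
template <typename T, typename = void>
struct has_resize : std::false_type {};

template <typename T>
struct has_resize<T, std::void_t<decltype(std::declval<T&>().resize(std::size_t{}))>>
    : std::true_type {};

// Hypothetical wrapper. `decode` stands in for a low-level function with the
// shape (const char* input, size_t length, char* output) -> bytes written.
template <typename Container, typename LowLevelDecode>
void decode_into(std::string_view input, Container& out, LowLevelDecode&& decode) {
  static_assert(has_resize<Container>::value, "Container must provide resize()");
  // Allocate the worst case first so the low-level call cannot overflow.
  out.resize(max_binary_length(input.size()));
  const std::size_t written =
      decode(input.data(), input.size(), reinterpret_cast<char*>(out.data()));
  // Shrink to what was actually produced.
  out.resize(written);
}
```

The design keeps the low-level function unchanged and moves the allocation policy into a thin template, which matches the suggestion that such interfaces are easy to layer on top later.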

WojciechMula · 2024-03-16T20:06:00Z

  set_property(TARGET threaded PROPERTY CXX_STANDARD_REQUIRED ON)
 endif(Threads_FOUND)
+if(CMAKE_CXX_COMPILER_ID STREQUAL Clang AND "x${CMAKE_CXX_SIMULATE_ID}" STREQUAL "xMSVC")
+  message(STATUS "Not building base64 benchmarks when using clang-cl due to build errors.")


I'm hoping somebody will help us with MSVC.

The issue is with the dependency (aklomp/base64), which does not build under clang-cl without help. We encountered the same issue when trying to build Node.js under clang-cl. The workaround I found was to disable some kernels. In this instance, I do not care much about benchmarking under clang-cl.

I am adding a better explanation of why we disable benchmarking under clang-cl. It is not that we don't build under clang-cl; we definitely do.

WojciechMula · 2024-03-16T20:12:53Z

+  printf(" See https://github.com/lemire/base64data for test data.\n");
+}
+void pretty_print(size_t, size_t bytes, std::string name, event_aggregate agg) {
+  printf("%-40s : ", name.c_str());


Since we're requiring C++17, and this is an external tool, I'd love to see fmtlib used here. Maybe not for now, as this PR is already huge, but fmtlib is the de facto standard and works well.

I agree that fmtlib is better.

WojciechMula · 2024-03-16T20:18:30Z

+ * https://www.codeproject.com/Articles/276993/Base-Encoding-on-a-GPU. (2013).
+ */
+
+/*static simdutf_really_inline uint8x16_t lookup(const uint8x16_t input) {


remove please

WojciechMula · 2024-03-16T20:40:02Z

+    std::vector<char> source(len, 0);
+    std::vector<char> buffer;
+    buffer.resize(implementation.base64_length_from_binary(len));
+    std::mt19937 gen(std::mt19937::result_type(123456));


How about adding static seed? Then we may set it from command line, if needed.
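One way the static seed with an optional command-line override could look is sketched below. The names `default_seed` and `seed_from_args` are hypothetical, and the convention that the seed is the first positional argument is an assumption for the example.

```cpp
#include <cstdlib>
#include <random>

using seed_type = std::mt19937::result_type;

// Hypothetical static default, matching the literal used in the test above.
constexpr seed_type default_seed = 123456;

// Returns the seed given as the first command-line argument when it parses
// as a decimal number, and the static default otherwise.
seed_type seed_from_args(int argc, const char* const* argv) {
  if (argc > 1) {
    char* end = nullptr;
    const unsigned long value = std::strtoul(argv[1], &end, 10);
    // Accept only when at least one digit was consumed and nothing trails it.
    if (end != argv[1] && *end == '\0') {
      return static_cast<seed_type>(value);
    }
  }
  return default_seed;
}
```

A test would then construct its generator as `std::mt19937 gen(seed_from_args(argc, argv));`, so a failing run can be reproduced by passing the same seed.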

WojciechMula · 2024-03-16T20:41:49Z

+  }
+  std::uniform_int_distribution<size_t> index_dist(0, v.size() - padding);
+  size_t i = index_dist(gen);
+  std::uniform_int_distribution<int> char_dist(0, 255);


Why not std::uniform_int_distribution<uint8_t> char_dist(0, 255)?

WojciechMula · 2024-03-16T20:50:19Z

+      std::fread(input_data + offset, 1, chunk_size - offset, current_file);
+  if (std::ferror(current_file)) {
+    std::fclose(current_file);
+    throw std::runtime_error("Error while reading.");


Read errno and use the strerror function; this will be helpful.
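A minimal sketch of that suggestion follows. The helper `read_file` is hypothetical and is not the benchmark's actual loader. Note that ISO C does not require fread to set errno (POSIX does), so the message after a read failure is best effort.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file and reports failures with the
// system error text obtained from errno via strerror.
std::vector<char> read_file(const char* path) {
  std::FILE* file = std::fopen(path, "rb");
  if (file == nullptr) {
    throw std::runtime_error(std::string("Cannot open ") + path + ": " +
                             std::strerror(errno));
  }
  std::vector<char> data;
  char chunk[4096];
  std::size_t n = 0;
  while ((n = std::fread(chunk, 1, sizeof chunk, file)) > 0) {
    data.insert(data.end(), chunk, chunk + n);
  }
  if (std::ferror(file)) {
    // Save errno before fclose, which may overwrite it.
    const int saved_errno = errno;
    std::fclose(file);
    throw std::runtime_error(std::string("Error while reading ") + path + ": " +
                             std::strerror(saved_errno));
  }
  std::fclose(file);
  return data;
}
```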

WojciechMula

Just two proposals that would make the test code more concise.

Co-authored-by: Wojciech Muła <wojciech_mula@poczta.onet.pl>

lemire · 2024-03-18T14:23:50Z

@WojciechMula You may find this interesting:

static_assert failed: 'invalid template argument for uniform_int_distribution: N4950 [rand.req.genl]/1.5 requires one of short, int, long, long long, unsigned short, unsigned int, unsigned long, or unsigned long long'

So I am reverting.

lemire · 2024-03-18T14:28:23Z

It checks out btw:

The result type generated by the generator. The effect is undefined if this is not one of short, int, long, long long, unsigned short, unsigned int, unsigned long, or unsigned long long.
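A portable pattern, sketched here with a hypothetical helper `random_bytes`, is to draw from an int distribution over [0, 255] and narrow each value to uint8_t, which stays within the types the standard allows.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: fills a vector with uniformly distributed bytes.
// uniform_int_distribution<uint8_t> is not permitted by the standard,
// so we draw an int in [0, 255] and narrow it explicitly.
std::vector<uint8_t> random_bytes(std::size_t count, std::mt19937& gen) {
  std::uniform_int_distribution<int> byte_dist(0, 255);
  std::vector<uint8_t> out(count);
  for (auto& b : out) {
    b = static_cast<uint8_t>(byte_dist(gen));
  }
  return out;
}
```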

lemire · 2024-03-18T14:58:47Z

Thanks @WojciechMula

You are co-credited for the PR.

WojciechMula · 2024-03-18T15:01:05Z

> @WojciechMula You may find this interesting:
>
> static_assert failed: 'invalid template argument for uniform_int_distribution: N4950 [rand.req.genl]/1.5 requires one of short, int, long, long long, unsigned short, unsigned int, unsigned long, or unsigned long long'
>
> So I am reverting.

Oh man, another hairy standard corner... BTW, checked on godbolt: GCC and Clang do not complain.

lemire · 2024-03-18T17:15:51Z

@WojciechMula We can't blame Microsoft because it was explicitly documented in the standard.

Jarred-Sumner · 2024-03-24T22:51:28Z

very exciting

Jarred-Sumner · 2024-03-24T23:05:18Z

One thing to note

To do this efficiently for JavaScript runtimes, a version that accepts UTF-16 input (or other 2-byte characters) is important; otherwise one must first go through a UTF-16 -> Latin-1 conversion step. It is not unusual for an ASCII-only JavaScript string to be stored internally as UTF-16. There are many reasons why: for example, it may have been a string literal from source code that the runtime passed to the engine as UTF-16, or a substring of a non-ASCII string.
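To make the point concrete, here is a small scalar sketch that decodes directly from 8-bit or 16-bit code units through one template, with no intermediate Latin-1 copy. The names `base64_value` and `decode_quanta` are hypothetical, and the decoder is simplified to whitespace-free, unpadded input whose length is a multiple of 4. It compares full-width code units, because truncating a 16-bit unit such as U+0141 to 8 bits would turn it into 'A' and wrongly accept it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Maps one code unit (char or char16_t) to its 6-bit value, or -1 if it is
// outside the base64 alphabet. Comparisons use the full code unit width.
template <typename CharT>
constexpr int base64_value(CharT c) {
  if (c >= CharT('A') && c <= CharT('Z')) return static_cast<int>(c - CharT('A'));
  if (c >= CharT('a') && c <= CharT('z')) return static_cast<int>(c - CharT('a')) + 26;
  if (c >= CharT('0') && c <= CharT('9')) return static_cast<int>(c - CharT('0')) + 52;
  if (c == CharT('+')) return 62;
  if (c == CharT('/')) return 63;
  return -1;
}

// Simplified decoder: appends decoded bytes to `out` and returns false on
// any invalid code unit or on a length that is not a multiple of 4.
template <typename CharT>
bool decode_quanta(std::basic_string_view<CharT> in, std::string& out) {
  if (in.size() % 4 != 0) return false;
  for (std::size_t i = 0; i < in.size(); i += 4) {
    uint32_t triple = 0;
    for (std::size_t j = 0; j < 4; ++j) {
      const int v = base64_value(in[i + j]);
      if (v < 0) return false;
      triple = (triple << 6) | static_cast<uint32_t>(v);
    }
    out.push_back(static_cast<char>(static_cast<unsigned char>(triple >> 16)));
    out.push_back(static_cast<char>(static_cast<unsigned char>((triple >> 8) & 0xFF)));
    out.push_back(static_cast<char>(static_cast<unsigned char>(triple & 0xFF)));
  }
  return true;
}
```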

lemire · 2024-03-25T03:58:19Z

@Jarred-Sumner It is coming.

bakkot · 2024-03-25T23:19:08Z

Forgiving-base64 is also the algorithm we ended up choosing for the default behavior in my proposal for native base64 in JS, happily. I've just added a link to the readme pointing implementers to this library.

lemire · 2024-03-25T23:20:19Z

@bakkot Fantastic. Stay posted, feedback invited.

lemire · 2024-03-30T04:05:24Z

@Jarred-Sumner See #382 where support for UTF-16 inputs has been added.

Daniel Lemire added 6 commits March 16, 2024 12:30

adding base64 encoding and decoding following the WHATWG forgiving-base64 specification.

8c1a6f8

fixing fuzzer

0492510

adding missing files

a9385cd

adding missing file base64_tests.cpp

5af4611

adding more missing files

92c9616

missing CPM.cmake

b175c18

This was referenced Mar 16, 2024

buffer: fix DoS vector in atob nodejs/node#51670

Closed

performance of encodings (hex, base64, base64url) nodejs/performance#128

Open

src,lib,buffer: improve atob / btoa performance nodejs/node#38433

Open

lemire requested review from WojciechMula and anonrig March 16, 2024 18:03

fix big endian issues.

db44796

WojciechMula reviewed Mar 16, 2024


Daniel Lemire added 2 commits March 16, 2024 17:38

making the random fuzzer faster by default

4564063

various fixes

93c47cc

WojciechMula reviewed Mar 17, 2024


lemire and others added 3 commits March 17, 2024 09:21

Update tests/base64_tests.cpp

fcf7e6b

Co-authored-by: Wojciech Muła <wojciech_mula@poczta.onet.pl>

Update tests/base64_tests.cpp

917250c

Co-authored-by: Wojciech Muła <wojciech_mula@poczta.onet.pl>

simplifying.

7424d03

std::uniform_int_distribution cannot accept uint8_t.

4828ef8

lemire added 3 commits March 18, 2024 10:33

Merge branch 'master' into base64

a2a5c4c

extending rvv for base64

6244341

[skip ci] better doc for rvv

517c817

lemire merged commit c5357aa into master Mar 18, 2024

Porkepix mentioned this pull request Mar 18, 2024

simdutf 5.0.0 Homebrew/homebrew-core#166524

Merged
