Conversation
WojciechMula
left a comment
There was a problem hiding this comment.
Impressive work, looks good to me.
| * that is not a valid base64 character (INVALID_BASE64_CHARACTER). | ||
| * | ||
| * You should call this function with a buffer that is at least maximal_binary_length_from_base64(input, length) bytes long. | ||
| * If you fail to provide that much space, the function may cause a buffer overflow. |
There was a problem hiding this comment.
I would name this function base64_to_binary_unasfe and then put this in the requirements.
And then we should also provide base64_to_binary(const char* input) -> std::vector<uint8_t>, that does calculation of safe size and allocate memory internally. Or maybe something like base_to_binary<Container>(const char* input, cont: &Container) and static_assert that the Container has method resize.
There was a problem hiding this comment.
Currently, all of the transcoding functions have similar requirements (the caller is responsible to allocate enough memory as per the specification). In the scope of simdutf, I do not consider this function unsafe: if you use it according to its rather simple specification, it is safe (we have good tests). It is easy enough to add higher-level interfaces once we have the low-level ones. I have opened an issue: #377
| set_property(TARGET threaded PROPERTY CXX_STANDARD_REQUIRED ON) | ||
| endif(Threads_FOUND) | ||
| if(CMAKE_CXX_COMPILER_ID STREQUAL Clang AND "x${CMAKE_CXX_SIMULATE_ID}" STREQUAL "xMSVC") | ||
| message(STATUS "Not building base64 benchmarks when using clang-cl due to build errors.") |
There was a problem hiding this comment.
I'm hoping somebody will help us with MSVC.
There was a problem hiding this comment.
The issue is with the dependency (aklomp/base64) which does not build without help under clangcl. We encountered this same issue when trying to build Node.js under clangcl. The workaround I found was to disable some kernels. In this instance, I do not care much about benchmarking under clangcl.
There was a problem hiding this comment.
I am adding a better explanation as to why we disable benchmarking under clangcl. It is not that we don't build under clangcl, we definitively do.
| printf(" See https://github.com/lemire/base64data for test data.\n"); | ||
| } | ||
| void pretty_print(size_t, size_t bytes, std::string name, event_aggregate agg) { | ||
| printf("%-40s : ", name.c_str()); |
There was a problem hiding this comment.
Since we're requiring C++17, and this is an external tool, I'd love to see here use of fmtlib. Maybe not for now, as this PR is already huge, but fmtlib is de-facto standard and works well.
There was a problem hiding this comment.
I agree that fmtlib is better.
| * https://www.codeproject.com/Articles/276993/Base-Encoding-on-a-GPU. (2013). | ||
| */ | ||
|
|
||
| /*static simdutf_really_inline uint8x16_t lookup(const uint8x16_t input) { |
| std::vector<char> source(len, 0); | ||
| std::vector<char> buffer; | ||
| buffer.resize(implementation.base64_length_from_binary(len)); | ||
| std::mt19937 gen(std::mt19937::result_type(123456)); |
There was a problem hiding this comment.
How about adding static seed? Then we may set it from command line, if needed.
| } | ||
| std::uniform_int_distribution<size_t> index_dist(0, v.size() - padding); | ||
| size_t i = index_dist(gen); | ||
| std::uniform_int_distribution<int> char_dist(0, 255); |
There was a problem hiding this comment.
Why not std::uniform_int_distribution<uint8_t> char_dist(0, 255)?
| std::fread(input_data + offset, 1, chunk_size - offset, current_file); | ||
| if (std::ferror(current_file)) { | ||
| std::fclose(current_file); | ||
| throw std::runtime_error("Error while reading."); |
There was a problem hiding this comment.
Read errno and use strerror function - this will be helpful.
WojciechMula
left a comment
There was a problem hiding this comment.
Just a two proposals that would make test code more concise.
Co-authored-by: Wojciech Muła <wojciech_mula@poczta.onet.pl>
Co-authored-by: Wojciech Muła <wojciech_mula@poczta.onet.pl>
|
@WojciechMula You may find this interesting:
So I am reverting. |
|
It checks out btw:
|
|
Thanks @WojciechMula You are co-credited for the PR. |
Oh man, another hairy standard corner... BTW, checked on godbolt: GCC and Clang do not complain. |
|
@WojciechMula We can't blame Microsoft because it was explicitly documented in the standard. |
|
very exciting |
|
One thing to note To do this efficiently for JavaScript runtimes, a version which accepts UTF-16 input (or otherwise 2 byte chars) is important because otherwise one must go through a UTF-16 -> latin1 conversion step beforehand. It is not unusual for an ASCII-only JavaScript string to end up being stored internally as UTF-16. There are many reasons why, like if it was a string literal from source code which was passed from runtime to engine as UTF-16. Or if it was originally a substring from a non-ascii string |
|
@Jarred-Sumner It is coming. |
|
Forgiving-base64 is also the algorithm we ended up choosing for the default behavior in my proposal for native base64 in JS, happily. I've just added a link to the readme pointing implementers to this library. |
|
@bakkot Fantastic. Stay posted, feedback invited. |
|
@Jarred-Sumner See #382 where support for UTF-16 inputs has been added. |
This PR adds accelerated base64 encoding and decoding to simdutf. The standard adopted is WHATWG forgiving-base64: inputs may contain ASCII white spaces.
References: