Base64 part2 : base64url, UTF-8 inputs and base64_to_binary_safe#382
Base64 part2 : base64url, UTF-8 inputs and base64_to_binary_safe#382
Conversation
|
@bakkot @Jarred-Sumner : comments and review invited. This will be part of a new major release. |
|
Excellent! A few minor things at a quick glance:
|
|
@bakkot Thank you for the comments. I will apply fixes soon. |
|
@bakkot Thanks. I have answered your concerns in a later commit. A BASE64_INPUT_REMAINDER is not considered an error in the sense that you can continue the processing. When an invalid character is encountered, the error is considered fatal (I made this clear now). This means that the 'state' is purposefully undocumented: you are not expected to rely on the state. In the example provide, it was indeed input bytes, but you are correct that in the general case, it is in terms of units (which can be 8-bit or 16-bit units). I have temporarily disabled the RVV tests since they fail to even start due to setup failures. |
| simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output) const noexcept { | ||
| return compress_decode_base64(output, input, length); | ||
| simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output, base64_options options) const noexcept { | ||
| return (options & base64_url) ? compress_decode_base64<true>(output, input, length, options) : compress_decode_base64<false>(output, input, length, options); |
There was a problem hiding this comment.
We can remove the conditional and directly pass it to the template argument of the function. Removal of a branch is a good :-)
There was a problem hiding this comment.
There is indeed a runtime branch here, and it is not free, but pushing it down might not make disappear. A different option would be to have distinct functions for base64url and regular base64, but I thought it was not very nice from an API point of view.
| size_t input_index = safe_input; | ||
| while(offset > 0 && input_index > 0) { | ||
| chartype c = input[--input_index]; | ||
| if(c == '=' || c == '\n' || c == '\r' || c == '\t' || c == ' ') { |
There was a problem hiding this comment.
If this is common we can just create a table for it
There was a problem hiding this comment.
This code could be faster and simpler, but I am adding a comment to explain why it appears unoptimized:
// offset is a value that is no larger than 3. We backtrack
// by up to offset characters + an undetermined number of
// white space characters. It is expected that the next loop
// runs at most 3 times + the number of white space characters
// in between them, so we are not worried about performance.
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
WojciechMula
left a comment
There was a problem hiding this comment.
Looks good to me. Haven't spotted anything suspicious.
|
|
||
| // base64_options are used to specify the base64 encoding options. | ||
| using base64_options = uint64_t; | ||
| enum : base64_options { |
There was a problem hiding this comment.
Thanks. I am open to using an enum class but I am a bit concerned but I really want this to be a bitset that we can extend later to support different options.
There was a problem hiding this comment.
Yeah, I verified that enum classes do not work well as bitsets:
enum class Joe : uint64_t {
default_base64 = 0,
url_base64 = 1,
allow_spaces = 4,
allow_padding = 8,
};
int main() {
Joe t = Joe::default_base64 | Joe:: allow_padding; // Will not compile
}The idea here is to be able to extend the API without too much of a mess by allowing 'options' if people have a great need for them. Like we could disable white spaces in a later version. Enum classes makes this more difficult than need be.
Granted, they are safer but we could validate the values if we are concerned.
|
Thanks for the great reviews. Merging. |
References: