Replies: 5 comments 5 replies
-
Your benchmark is almost entirely a 'number parsing' benchmark (with some overhead due to the JSON structure). It is not a benchmark I would use because you are repeatedly parsing the same number. I recommend using something more realistic.
The simdjson library uses a state-of-the-art number parser, described in this paper:
Notice the title: "a gigabyte per second". Obviously, you can beat a gigabyte per second; you are seemingly achieving 2 GB/s (although you are repeatedly parsing the same number, so be careful). But you are just not going to achieve much higher speed than that. On a single core, I don't think we know how to parse numbers at much greater speed.
Your benchmark does suggest that we could improve integer parsing speed. That's interesting. Let us improve that.
You should get better speed with ClangCL, which is what we recommend. Your numbers are expected, by the way. Please see my blog post: Float-parsing benchmark: Regular Visual Studio, ClangCL and Linux GCC. Here was my conclusion at the time:
If you choose to develop under Windows and you are disappointed by the performance, I recommend reporting it to Microsoft. There is little that I, or anyone involved in simdjson, can do about it. It is in the hands of Microsoft. I have reported disappointing performance to Microsoft engineers at least twice. The one good piece of advice I got from them was to switch to ClangCL, which I recommend you do too.
-
Thanks for your quick, great reply! My Windows MSVC SIMD experience has been mixed; sometimes I get better results than g++/clang++, but it always involves lots of tuning and trial and error. My experience with reporting issues to MS has generally been poor. I've modified the benchmark (updated code attached).
Here are the results for throughput vs. (f, l, r) and compiler, 6 iterations each (error bars indicate the span from min to max throughput). Windows is O2, WSL is O3. Each bar group is for a different compiler; blue is Windows 11, green is WSL2 (Ubuntu). Hatched bars are for get_double(). My computer was otherwise idle when I ran these.
I've also looked at which routine the normal VS2022 compiler is using. int64_t seems to be slower than double only for WSL-g++. I can't explain why 'l' (linspace) is so different from 'r' (random); maybe 'r' hits certain float-parser exception cases and 'l' doesn't?
(*) More on linspace: this was motivated by wanting to save time when creating the 350M-entry JSON string. I create 9859 strings representing values from [min..max], then cycle through them again and again when creating the full JSON. I did it this way to try to get a good mix of digits.
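The pool-and-cycle generation described in the linspace footnote could be sketched like this (the pool size 9859 comes from the comment above; the formatting precision, range handling, and function name are my assumptions, not the author's actual code):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Sketch of the "linspace" generator: precompute a small pool of value
// strings evenly spaced across [min, max], then cycle through the pool
// while emitting one large JSON array. Assumes pool_size >= 2.
std::string make_linspace_json(double min, double max,
                               size_t pool_size, size_t total_values) {
    std::vector<std::string> pool(pool_size);
    char tmp[64];
    for (size_t i = 0; i < pool_size; ++i) {
        double v = min + (max - min) * double(i) / double(pool_size - 1);
        std::snprintf(tmp, sizeof(tmp), "%.9g", v);  // good mix of digits
        pool[i] = tmp;
    }
    std::string json = "[";
    for (size_t i = 0; i < total_values; ++i) {
        if (i) json += ',';
        json += pool[i % pool_size];  // cycle through the precomputed pool
    }
    json += ']';
    return json;
}
```

Cycling a precomputed pool avoids formatting 350M doubles individually, which is why it is so much faster to build the test input; the trade-off, as the measurements suggest, is that the parser sees only 9859 distinct digit patterns.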
-
Can you elaborate? To my knowledge they should all use exactly the same parsing routines, with slight specialization for MSVC.
-
I thought I'd provide an update. I'm really perplexed. Here's a summary of the current setup:
It's (4) that I'm really focusing on here: comparing similar code in two separate files, compiled into one executable with the standard VS2022 compiler. Here are sample timing results for VS2022. (For comparison, LLVM is 1.54 GB/s.) Yes, it's running ~3x faster outside of simdjson.h. I have no idea why. Below are some Performance Profiler clips.
-
I've done some exploring these past few days. Let me share my findings.
The reason is the routine the benchmark used. I also tried some newer gcc, e.g., gcc 12.0, and I think the result only appears on older gcc, like gcc 8.5. I also wrote some microbenchmarks to test dom::parse; unsurprisingly, parsing doubles is slower.
-
I'm using simdjson (CPU=14700K) to read JSON files with large 2D double arrays, and was wondering if I'm doing so at an expected speed or not.
Here's my simple C++ benchmark:
Using Win11/MSVC (O2, AVX2, c++20), I get:
Using WSL/g++ (O3), I get:
I'm a little surprised that
a) g++/WSL is so much faster (my main use case is Windows)
b) get_double is faster than get_int64 in WSL
I see #2135, but it seems to me that simdjson's parse_double is being used, not anything like from_chars from fast_float?
Any suggestions for JSON-based speed improvement are appreciated.