While doing fastplotlib/fastplotlib#913 I looked into the fastest ways to map an array of strings to int. This is useful when a user changes millions of scatter point marker shapes. Thought it'd be useful so posting what I did here:
Simple for loop:

```python
# `a` is an array of string marker names
markers_int = np.zeros(a.size, dtype=np.int32)
for i in range(a.size):
    markers_int[i] = pygfx.MarkerInt[a[i]]
```
To instead do the mapping with python's builtin `map` or `numpy.vectorize` we need a callable, which AFAIK you can't get directly from an enum, so convert the enum to a dict. Let me know if there's a better way!
```python
markers_mapping = dict.fromkeys(list(pygfx.MarkerShape))
for m in markers_mapping.keys():
    markers_mapping[m] = pygfx.MarkerInt[m]
```
builtin python `map`:

```python
np.asarray(list(map(markers_mapping.get, a)))
```
`numpy.vectorize`:

```python
vectorized_marker_to_int = np.vectorize(markers_mapping.get)
```
Execution times; `np.vectorize` is unsurprisingly the fastest:

```
simple:     1.57s
map:        0.84s
vectorized: 0.51s
```
Used this to benchmark:

<details>
<summary>Details</summary>

```python
import numpy as np
import pygfx
from math import ceil
from timeit import timeit

markers_mapping = dict.fromkeys(list(pygfx.MarkerShape))
for m in markers_mapping.keys():
    markers_mapping[m] = pygfx.MarkerInt[m]

n = ceil(5_000_000 / len(list(pygfx.MarkerShape)))
a = np.tile(list(pygfx.MarkerShape), n)

def simple(a):
    markers_int = np.zeros(a.size, dtype=np.int32)
    for i in range(a.size):
        markers_int[i] = pygfx.MarkerInt[a[i]]
    return markers_int

def python_map(a):
    return np.asarray(list(map(markers_mapping.get, a)))

vectorized_marker_to_int = np.vectorize(markers_mapping.get)

for func, name in zip([simple, python_map, vectorized_marker_to_int], ["simple", "map", "vectorized"]):
    t = timeit(lambda: func(a), number=10)
    print(f"{name}: {t / 10:.2f}s")
```

</details>
An important caveat: any control flow inside the mapped function defeats the purpose of vectorizing and performance degrades. I'm getting around this by first running `np.unique` on the large array (which is fairly fast and doesn't add much overhead) and then checking the much smaller unique array for any invalid markers. If all markers are valid, the vectorized function runs on the full array.
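The validate-then-vectorize pattern described above can be sketched like this; the mapping dict here is a hypothetical stand-in for the real pygfx marker mapping:

```python
import numpy as np

# Hypothetical stand-in for the pygfx marker name -> int mapping
valid_mapping = {"circle": 101, "ring": 102, "square": 201}
vectorized_to_int = np.vectorize(valid_mapping.get)

a = np.array(["circle", "square", "ring"] * 3)

# Check only the (much smaller) unique set for invalid markers,
# so the vectorized function itself stays free of control flow
invalid = set(np.unique(a)) - valid_mapping.keys()
if invalid:
    raise ValueError(f"invalid markers: {invalid}")

markers_int = vectorized_to_int(a)
```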
Edit: I was curious and tried match-case; it's slower than builtin `map` at 0.85s. It's hard to beat python dicts for arbitrary mapping.
```python
def match_case(m):
    match m:
        case "circle":
            return 101
        case "ring":
            return 102
        case "square":
            return 201
        case "diamond":
            return 202
        case "plus":
            return 203
        case "cross":
            return 204
        case "asterix":
            return 205
        case "tick":
            return 206
        case "tick_left":
            return 207
        case "tick_right":
            return 208
        case "triangle_up":
            return 301
        case "triangle_down":
            return 302
        case "triangle_left":
            return 303
        case "triangle_right":
            return 304
        case "heart":
            return 401
        case "spade":
            return 402
        case "club":
            return 403
        case "pin":
            return 404
        case "custom":
            return 901

vectorized_match_case = np.vectorize(match_case, otypes=[np.int32])
```
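Since the post asks if there's a better way: one numpy-native sketch that calls into Python only once per *unique* string is to build a lookup table from `np.unique(..., return_inverse=True)` and fancy-index it. The mapping dict below is again a hypothetical stand-in for the real pygfx mapping:

```python
import numpy as np

# Hypothetical stand-in for the pygfx marker name -> int mapping
markers_mapping = {"circle": 101, "ring": 102, "square": 201}

a = np.array(["circle", "square", "ring", "circle"])

# Map each unique string once in Python, then index the lookup
# table with the inverse array to cover the full-size array
uniques, inverse = np.unique(a, return_inverse=True)
lut = np.array([markers_mapping[u] for u in uniques], dtype=np.int32)
markers_int = lut[inverse]
```

This also gives the unique array "for free" for the invalid-marker check mentioned above. I haven't benchmarked it against `np.vectorize` here, so treat the speed claim as a guess.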