Fastest mapping arrays of strings to int for markers #1215

@kushalkolar

Description

While doing fastplotlib/fastplotlib#913 I looked into the fastest ways to map an array of strings to ints. This is useful when a user changes millions of scatter point marker shapes at once. Thought it'd be useful, so I'm posting what I did here:

Simple for loop:

# `a` is an array of string marker names
markers_int = np.zeros(a.size, dtype=np.int32)
for i in range(a.size):
    markers_int[i] = pygfx.MarkerInt[a[i]]

To instead do the mapping with Python's built-in map or numpy.vectorize we need a callable, which AFAIK you can't get from an enum directly, so convert the enum to a dict. Let me know if there's a better way!

markers_mapping = dict.fromkeys(list(pygfx.MarkerShape))

for m in markers_mapping.keys():
    markers_mapping[m] = pygfx.MarkerInt[m]
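For what it's worth, the fromkeys-plus-fill-loop above can be collapsed into a single dict comprehension: `{m: pygfx.MarkerInt[m] for m in pygfx.MarkerShape}`. A minimal runnable sketch with a stdlib stand-in enum (names and values invented for illustration):

```python
from enum import IntEnum

# Hypothetical stand-in for pygfx.MarkerInt (invented subset for the sketch)
class MarkerInt(IntEnum):
    circle = 101
    ring = 102
    square = 201

# One step instead of dict.fromkeys + a fill loop
markers_mapping = {m.name: m.value for m in MarkerInt}
# markers_mapping -> {"circle": 101, "ring": 102, "square": 201}
```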

Builtin Python map:

np.asarray(list(map(markers_mapping.get, a)))

numpy.vectorize:

vectorized_marker_to_int = np.vectorize(markers_mapping.get)

Execution times (np.vectorize is unsurprisingly the fastest):

simple: 1.57s
map: 0.84s
vectorized: 0.51s
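As an aside, a fully array-based alternative (a sketch I have not benchmarked against the above) is to sort the mapping's keys once and use np.searchsorted to locate each name, then index the matching value array. Note this silently assumes every name in the input actually exists in the table:

```python
import numpy as np

# Hypothetical marker table standing in for the pygfx mapping
mapping = {"circle": 101, "ring": 102, "square": 201}

keys = np.array(sorted(mapping))                              # sorted key array
vals = np.array([mapping[k] for k in keys], dtype=np.int32)   # values in key order

a = np.array(["square", "circle", "square", "ring"])

# searchsorted finds each name's position in the sorted keys;
# indexing vals with those positions yields the int codes.
# Assumes all names in `a` are present in `keys` (no validation here).
markers_int = vals[np.searchsorted(keys, a)]
# markers_int -> [201, 101, 201, 102]
```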

Used this to benchmark:

import numpy as np
import pygfx
from math import ceil
from timeit import timeit

markers_mapping = dict.fromkeys(list(pygfx.MarkerShape))

for m in markers_mapping.keys():
    markers_mapping[m] = pygfx.MarkerInt[m]

n = ceil(5_000_000 / len(list(pygfx.MarkerShape)))

a = np.tile(list(pygfx.MarkerShape), n)

def simple(a):
    markers_int = np.zeros(a.size, dtype=np.int32)
    for i in range(a.size):
        markers_int[i] = pygfx.MarkerInt[a[i]]
        
    return markers_int

def python_map(a):
    return np.asarray(list(map(markers_mapping.get, a)))

vectorized_marker_to_int = np.vectorize(markers_mapping.get)

for func, name in zip([simple, python_map, vectorized_marker_to_int], ["simple", "map", "vectorized"]):
    t = timeit(lambda: func(a), number=10)
    print(f"{name}: {t / 10:.2f}s")

An important caveat: any per-element control flow (e.g. validating each value) defeats the purpose of vectorizing, and performance degrades. I get around this by first running np.unique on the large array (which is fairly fast and adds little overhead) and then checking the much smaller unique array for any invalid markers. If all markers are valid, run the vectorized function on the full array.
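The validate-then-map idea, sketched in plain Python with a hypothetical marker table (the real code uses np.unique and the pygfx mapping instead of set() and this dict):

```python
# Hypothetical marker table standing in for the pygfx mapping
MARKER_INT = {"circle": 101, "ring": 102, "square": 201}

def map_markers(names):
    # Validate only the unique names (small set), not every element
    invalid = set(names) - MARKER_INT.keys()
    if invalid:
        raise ValueError(f"invalid marker(s): {sorted(invalid)}")
    # All names are known to be valid: map the full sequence
    # without any per-element checks
    return [MARKER_INT[n] for n in names]
```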

Edit: I was curious and tried match-case; it's slower than builtin map at 0.85s. It's hard to beat Python dicts for arbitrary mapping.

def match_case(m):
    match m:
        case "circle":
            return 101
        case "ring":
            return 102
        case "square":
            return 201
        case "diamond":
            return 202
        case "plus":
            return 203
        case "cross":
            return 204
        case "asterix":
            return 205
        case "tick":
            return 206
        case "tick_left":
            return 207
        case "tick_right":
            return 208
        case "triangle_up":
            return 301
        case "triangle_down":
            return 302
        case "triangle_left":
            return 303
        case "triangle_right":
            return 304
        case "heart":
            return 401
        case "spade":
            return 402
        case "club":
            return 403
        case "pin":
            return 404
        case "custom":
            return 901

vectorized_match_case = np.vectorize(match_case, otypes=[np.int32])
