[PERF]: Measure the performance impact of the "layered design" in cuda-bindings #1605

@mdboom

API calls in cuda-bindings are currently made through three layers.

As an experiment to measure the performance impact of calling through these layers, I "flattened" the call so the top layer just directly calls the C function pointer in the library (currently handled by the bottom layer). The overhead of each of these layers is pretty small, by design, but there is still some Python exception handling, as well as our library initialization check (cuPythonInit()) along the way.

While we lose some safety and version independence doing this, it is useful as an experiment to see what the cost of that flexibility is.

My changes

Measuring this with the benchmark in #659, I do not see any measurable change. Branch predictors must be pretty good these days.

Before: Mean +- std dev: 2.77 us +- 0.37 us
After: Mean +- std dev: 2.76 us +- 0.21 us
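As a rough sanity check on "no measurable change", the gap between the two means can be compared with the reported spread. The sketch below uses only the four numbers above; combining the two standard deviations in quadrature is an assumption made here for illustration (it treats the runs as independent), not a formal significance test.

```python
import math

# Values copied from the benchmark output above (microseconds).
before_mean, before_std = 2.77, 0.37
after_mean, after_std = 2.76, 0.21

# Absolute and relative difference between the two means.
diff = before_mean - after_mean
relative = diff / before_mean

# Assumed combined spread, treating the two runs as independent.
combined_std = math.sqrt(before_std**2 + after_std**2)
ratio = diff / combined_std

print(f"difference: {diff:.2f} us ({relative:.1%} of the 'before' mean)")
print(f"combined std dev: {combined_std:.2f} us")
print(f"difference / combined std dev: {ratio:.2f}")
```

The difference of about 0.01 us is far smaller than either standard deviation, which is consistent with the flattening having no effect that this benchmark can resolve.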

Labels: experiment (Describes an investigation or measurement)
