Functor GPU JIT (CUDA)¶
This notebook is about runtime GPU JIT for `functor_obj`: set `f.device = "gpu"` and ESCAPE will generate, nvcc-compile, and cache a CUDA kernel for vectorized array evaluation. There is no Python `compile(..., gpu=True)` and no separate "GPU compile" step in your code; the JIT runs inside the library when you select the GPU device.
- GPU path: `f.device = "gpu"` → JIT + kernel cache; arrays can run on the device.
- CPU path: `f.device = "cpu"` (default) → no CUDA launch for array eval.
Optional — CPU native extension: compiler.ipynb shows `compile(f)`, which runs a one-time C++ build of a CPU-only handler. That is unrelated to GPU JIT; many examples below still call `compile(f)` only to get a fast CPU baseline next to the JIT GPU lane.
Requirements¶
- NVIDIA GPU with a supported driver.
- CUDA Toolkit (`nvcc` on `PATH`, or `CUDA_HOME`/`CUDA_PATH`/`CUDAHOME` set).
- On Windows, Visual Studio build tools (the same as for building ESCAPE) so `nvcc` can invoke `cl.exe`.
The first time you use a functor on GPU after `f.device = "gpu"`, the JIT may take tens of seconds (CUDA codegen + nvcc + cache). Later calls reuse the cache. If you also use `compile(f)` for a CPU baseline, the first `compile(f)` costs a few seconds (CPU extension build only).
Note: if CUDA is missing, `fc.device = "gpu"` fails when the JIT tries to build the kernel (e.g. the toolkit is not on `PATH`).

Restart / Run all: the first code cell imports ESCAPE, runs `esc.require`, and prints whether CUDA was found — it does not stop the kernel. (Older versions raised `SystemExit`, which could kill the kernel, so no later cells ran and you saw no results.) If CUDA is not detected, add NVIDIA's `bin` directory to the environment Jupyter uses (the notebook's `PATH` may differ from your terminal's).
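If the notebook's environment lacks CUDA, one workaround is to prepend the toolkit's `bin` directory to `PATH` before importing ESCAPE. The path below is only an example of a typical Windows install location; adjust it to your system:

```python
import os

# Example location only: adjust drive and version directory to your install.
cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin"

if os.path.isdir(cuda_bin) and cuda_bin not in os.environ.get("PATH", ""):
    # Prepend so this nvcc is found first by subprocess launches.
    os.environ["PATH"] = cuda_bin + os.pathsep + os.environ["PATH"]
```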
import shutil
import os
import glob

def cuda_toolkit_available() -> bool:
    if shutil.which("nvcc") is not None:
        return True
    for k in ("CUDAHOME", "CUDA_HOME", "CUDA_PATH"):
        v = os.environ.get(k)
        if v and os.path.isdir(v):
            return True
    # Jupyter kernels often lack CUDA on PATH; probe default install locations (Windows)
    for base in (
        os.environ.get("CUDA_PATH", ""),
        os.path.join(os.environ.get("ProgramFiles", r"C:\Program Files"), "NVIDIA GPU Computing Toolkit", "CUDA"),
    ):
        if not base or not os.path.isdir(base):
            continue
        for nvcc in glob.glob(os.path.join(base, "v*", "bin", "nvcc.exe")):
            if os.path.isfile(nvcc):
                return True
    return False

HAVE_CUDA = cuda_toolkit_available()
if HAVE_CUDA:
    print("CUDA toolkit: OK (nvcc or CUDA install found).")
else:
    print(
        "WARNING: CUDA toolkit not detected. Add the CUDA `bin` folder to PATH or set CUDA_PATH. "
        "Cells that set fc.device = 'gpu' will fail until CUDA is visible."
    )
import numpy as np
import escape as esc
from escape.core.compilers import compile
esc.require("0.9.8")
CUDA toolkit: OK (nvcc or CUDA install found). Loading material database from C:\dev\escape-core\python\src\escape\scattering\..\data\mdb\materials.db
Quick start (JIT only)¶
Build an ordinary functor and switch the device to GPU — no compile() required for CUDA:
- Array evaluation can use the JIT kernel when `device` is `"gpu"`.
- Scalar calls still use the efficient CPU-side path for the expression tree (or a compiled handler if you used `compile()` elsewhere).
x = esc.var("x")
f_jit = esc.sin(x) + 1.0
f_jit.device = "gpu"
print("Scalar f_jit(0.5) =", f_jit(0.5))
print("device:", f_jit.device)
f_jit.show()
How it works (short)¶
- `f.device = "gpu"`: the functor handler enters GPU mode; on demand, ESCAPE emits CUDA, runs `nvcc`, `dlopen`s the module, and caches the result (JIT service).
- `compile(f)` (optional): the same CPU-only pipeline as compiler.ipynb; it builds a native host extension, not a GPU library.
- Arrays: with device `"gpu"`, vectorized eval can launch the JIT kernel; with `f.device = "cpu"`, there is no CUDA launch.
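ESCAPE's JIT internals are not exposed, but the pipeline described above (emit CUDA, run `nvcc`, `dlopen` the module) can be sketched in outline. Everything below is hypothetical illustration: the kernel, the `eval_array` launch stub, and the `jit_build` helper are invented names, and the demo only builds if `nvcc` is on `PATH`.

```python
import ctypes
import os
import shutil
import subprocess
import tempfile

# Hypothetical generated source: a kernel plus a host-side launch stub that
# performs the malloc/copy/launch/sync/copy/free sequence on every call.
CUDA_SRC = r"""
extern "C" {
__global__ void eval_kernel(const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sin(x[i]) + 1.0;
}
void eval_array(const double* hx, double* hy, int n) {
    double *dx, *dy;
    cudaMalloc(&dx, n * sizeof(double));
    cudaMalloc(&dy, n * sizeof(double));
    cudaMemcpy(dx, hx, n * sizeof(double), cudaMemcpyHostToDevice);
    eval_kernel<<<(n + 255) / 256, 256>>>(dx, dy, n);
    cudaDeviceSynchronize();
    cudaMemcpy(hy, dy, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
}
"""


def jit_build(src: str) -> str:
    """Compile CUDA source into a shared library; return the library path."""
    build = tempfile.mkdtemp()
    cu = os.path.join(build, "kernel.cu")
    lib = os.path.join(build, "kernel.so")
    with open(cu, "w") as fh:
        fh.write(src)
    subprocess.run(["nvcc", "--shared", "-Xcompiler", "-fPIC", "-o", lib, cu],
                   check=True)
    return lib


if shutil.which("nvcc"):
    module = ctypes.CDLL(jit_build(CUDA_SRC))  # dlopen the freshly built module
    print("built and loaded:", hasattr(module, "eval_array"))
else:
    print("nvcc not found; skipping the demo build")
```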
Use contiguous float64 arrays for benchmarks to avoid extra copies (ESCAPE may warn on non-owning views).
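A quick NumPy-only check that a benchmark input really is a contiguous, owning float64 buffer (strided views would otherwise force copies). The `as_bench_buffer` helper is just a convenience wrapper for this notebook's examples:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 1_000_000)
view = x[::2]  # strided, non-owning view: would force a copy on the way in


def as_bench_buffer(a: np.ndarray) -> np.ndarray:
    """Return `a` unchanged if it is already a contiguous float64 array,
    otherwise a contiguous float64 copy."""
    return np.ascontiguousarray(a, dtype=np.float64)


xb = as_bench_buffer(view)
print(xb.flags["C_CONTIGUOUS"], xb.dtype)  # True float64
```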
Why GPU timings can be slower than CPU in %timeit¶
This is expected with the current GPU JIT runtime, not a sign that the GPU is “broken”.
- Heavy fixed cost on every call: each array evaluation runs the JIT launch stub. It `cudaMalloc`s device buffers in the nvcc-built DLL, copies the input from NumPy to the device, launches the kernel, `cudaDeviceSynchronize()`s, copies the output back to the host array, then `cudaFree`s those buffers. That full sequence is what you time; with `-n1 -r1` it is a single run per `%timeit` line (no multi-repeat average).
- CPU compiled is extremely fast: `compile(f)` already generates a tight native loop over contiguous host memory. For moderate sizes (e.g. millions of points), that loop is often faster than a GPU path dominated by per-call setup.
- What would make GPU "win": much larger batches, reused device buffers (no alloc/free per call), async pipelines, and fewer global synchronizations. Those are future optimizations; the first implementation favors correctness and a simple memory model.
- How to interpret benchmarks: for a fair throughput sense, time one large evaluation (or wrap so setup happens outside the timed block). Compare interpreted `f(x)` vs CPU compiled `fc_cpu(x)` vs GPU `fc_gpu(x)` with that in mind.
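A generic way to keep setup outside the timed region when measuring one large evaluation; `np.sin(x) + 1.0` stands in for a functor call here:

```python
import time

import numpy as np

# Setup: allocation and data generation happen before timing starts.
x = np.ascontiguousarray(np.linspace(-8.0, 8.0, 2_000_000))

t0 = time.perf_counter()
y = np.sin(x) + 1.0          # the one evaluation we actually want to measure
elapsed = time.perf_counter() - t0

print(f"{x.size:,} points in {elapsed * 1e3:.1f} ms "
      f"({x.size / elapsed / 1e6:.0f} Mpts/s)")
```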
Note: `device` should read `"gpu"` after you assign `fc_gpu.device = "gpu"` (cluster mode still uses a `tcp://…` endpoint).
Timing cells: `%timeit` is run as `%timeit -n1 -r1` (one loop, one repeat), so each line times a single evaluation with no multi-repeat statistical average.
Example 1 — Gaussian peak (large arrays)¶
$$f(x) = A \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
We evaluate `f_gauss` on the CPU twice (a reference plus a repeat check), then set `f_gauss.device = "gpu"` for a third evaluation — that turns on JIT for arrays. This is only to compare the JIT GPU path against the CPU path on identical inputs; the JIT path does not require `compile()` (see Quick start).
We use 1M, 5M, and 10M samples to stress throughput. Timings include host/device work as implemented (unified memory + kernel).
x = esc.var("x")
A = esc.par("A", 1.0)
mu = esc.par("mu", 0.0, userlim=[-1, 1])
sigma = esc.par("sigma", 1.0)
f_gauss = A * esc.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
def bench_array(n: int):
    t = np.linspace(-8.0, 8.0, n, dtype=np.float64)
    xb = np.ascontiguousarray(t)
    # Always compare against CPU baseline (loop may leave device on "gpu").
    f_gauss.device = "cpu"
    y_orig = np.asarray(f_gauss(xb), dtype=np.float64)
    y_cpu = np.asarray(f_gauss(xb), dtype=np.float64)
    f_gauss.device = "gpu"
    y_gpu = np.asarray(f_gauss(xb), dtype=np.float64)
    err_cpu = float(np.max(np.abs(y_cpu - y_orig)))
    err_gpu = float(np.max(np.abs(y_gpu - y_orig)))
    return err_cpu, err_gpu

for n in (1_000_000, 5_000_000, 10_000_000):
    ec, eg = bench_array(n)
    print(f"n = {n:>10,}  max|cpu-orig| = {ec:.3e}  max|gpu-orig| = {eg:.3e}")
f_gauss.device = "gpu"
f_gauss.show()
n = 100_000_000
xb = np.ascontiguousarray(np.linspace(-8.0, 8.0, n, dtype=np.float64))
f_gauss.device = "cpu"
print(f"Array length {n:,} — interpreted:")
%timeit f_gauss(xb)
f_gauss.device = "gpu"
print(f"\nSame — GPU JIT (device gpu):")
%timeit f_gauss(xb)
yb = f_gauss(xb)
yb.size
Example 2 — Mixed trig + exp (matches unit test, larger grid)¶
$$f(x) = a \sin(b x) + e^{-x^2}$$
x = esc.var("x")
a = esc.par("a", 2.5)
b = esc.par("b", -1.3)
f_mix = a * esc.sin(b * x) + esc.exp(-x ** 2)
n = 8_000_000
xb = np.ascontiguousarray(np.linspace(-4.0, 4.0, n, dtype=np.float64))
f_mix.device = "cpu"
y_cpu = f_mix(xb)
f_mix.device = "gpu"
y_gpu = f_mix(xb)
print("max|gpu - cpu|:", float(np.max(np.abs(np.asarray(y_gpu) - np.asarray(y_cpu)))))
print(f"n = {n:,} — CPU (device cpu)")
f_mix.device = "cpu"
%timeit f_mix(xb)
f_mix.device = "gpu"
print(f"n = {n:,} — GPU JIT")
%timeit f_mix(xb)
Example 3 — Decaying exponential grid (10M points)¶
Same family as test_compile_gpu_large_array, with an even larger vector.
x = esc.var("x")
a = esc.par("a", 1.0)
f_exp = a * esc.exp(-x ** 2)
f_exp.device = "gpu"
n = 10_000_000
xb = np.ascontiguousarray(np.linspace(-12.0, 12.0, n, dtype=np.float64))
print(f"\nGPU functor — {n:,} points:")
%timeit f_exp(xb)
f_exp.device = "cpu"
%timeit f_exp(xb)
Example 4 — CPU fallback on the same GPU build¶
Setting device to "cpu" disables the CUDA array path for that functor; useful for debugging or A/B checks.
x = esc.var("x")
f = x * 2.0
fc = compile(f)
n = 2_000_000
xb = np.ascontiguousarray(np.linspace(-3.0, 3.0, n, dtype=np.float64))
y_ref = np.asarray(f(xb), dtype=np.float64)
fc.device = "gpu"
y_gpu = np.asarray(fc(xb), dtype=np.float64)
fc.device = "cpu"
y_cpu = np.asarray(fc(xb), dtype=np.float64)
print("max|y_gpu - ref|:", float(np.max(np.abs(y_gpu - y_ref))))
print("max|y_cpu - ref|:", float(np.max(np.abs(y_cpu - y_ref))))
fc.device = "gpu"
print(f"\nGPU device — {n:,} points:")
%timeit -n1 -r1 fc(xb)
fc.device = "cpu"
print(f"\nCPU device — {n:,} points:")
%timeit -n1 -r1 fc(xb)
Example 5 — Two variables (smaller grid)¶
2D inputs use a packed layout of length `n_points * n_vars`; for a quick check we use a moderate 2D grid (large full tensor-product grids get heavy on memory).
x = esc.var("x")
y = esc.var("y")
f2 = esc.exp(-x) * y + 1.0
f2.device = "gpu"
nx = ny = 10500
xs = np.linspace(-1.0, 1.0, nx, dtype=np.float64)
ys = np.linspace(-1.0, 1.0, ny, dtype=np.float64)
X, Y = np.meshgrid(xs, ys, indexing="xy")
# Packed layout: [x0, y0, x1, y1, ...] — length n_points * num_variables
pts = np.ascontiguousarray(np.column_stack([X.ravel(), Y.ravel()]).ravel())
z = np.asarray(f2(pts), dtype=np.float64)
z_ref = (np.exp(-X) * Y + 1.0).ravel()
print("max error:", float(np.max(np.abs(z - z_ref))))
print(f"grid {nx}x{ny} = {pts.size // 2:,} points")
%timeit -n1 -r1 f2(pts)
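The packed interleaved layout can be built and round-tripped with plain NumPy; this sketch checks that `[x0, y0, x1, y1, …]` unpacks back to the original grids:

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 4)
ys = np.linspace(-1.0, 1.0, 3)
X, Y = np.meshgrid(xs, ys, indexing="xy")

# Interleave: [x0, y0, x1, y1, ...], length n_points * n_vars
pts = np.ascontiguousarray(np.column_stack([X.ravel(), Y.ravel()]).ravel())

# Unpack by striding: even entries are x, odd entries are y
x_back, y_back = pts[0::2], pts[1::2]
print(np.array_equal(x_back, X.ravel()), np.array_equal(y_back, Y.ravel()))  # True True
```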
API reference¶
| Item | Description |
|---|---|
| `f.device = "gpu"` | Enable GPU JIT for array evaluation (nvcc + cache at runtime); requires the CUDA toolkit on PATH. |
| `f.device` | `"gpu"`, `"cpu"`, or a cluster `tcp://…` endpoint string. |
| `compile(f)` | Optional, CPU-only: native extension (see compiler.ipynb). Not required for GPU JIT. |
| `compile(f, keep_build=True)` | Keep CPU compile artifacts (`build_dir`). |
| `compile(f, build_dir=path)` | Fixed CPU build directory. |
JIT vs CPU compile()¶
GPU JIT and compile(f) are independent: JIT codegen runs when the functor uses the GPU device; compile() only builds a host extension. Supported subtree for CUDA/JIT matches what the emitter allows (same family as in compiler.ipynb for comparable ops).
See also¶
- `python/tests/core/test_compiler.py`: GPU JIT scenarios (e.g. `TestCompilerGPU`).
- CPU functor compiler: compiler.ipynb.
- This notebook file: `functor_gpu_jit.ipynb` (formerly `compiler_gpu.ipynb`).
Distribution Functors on GPU¶
Test all 6 distribution functions (normal, gamma, schulz, lognorm, uniform, triangular) with GPU JIT.
x = esc.var("x")
mean_p = esc.par("mean", 10.0)
sigma_p = esc.par("sigma", 2.0)
fwhm_p = esc.par("fwhm", 4.0)
distributions = {
"normal": esc.normal(x, mean_p, fwhm_p),
"gamma": esc.gamma(x, mean_p, sigma_p),
"schulz": esc.schulz(x, mean_p, sigma_p),
"lognorm": esc.lognorm(x, mean_p, sigma_p),
"uniform": esc.uniform(x, mean_p, fwhm_p),
"triangular": esc.triangular(x, mean_p, fwhm_p),
}
xarr = np.linspace(0.1, 25.0, 500)
for name, f in distributions.items():
    cpu_vals = f(xarr)
    f.device = "gpu"
    gpu_vals = f(xarr)
    f.device = "cpu"
    max_diff = np.max(np.abs(cpu_vals - gpu_vals))
    ok = max_diff < 1e-12
    print(f"{name:12s}  max|cpu-gpu| = {max_diff:.2e}  {'OK' if ok else 'MISMATCH'}")
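As a reference for the `fwhm` parameter used above: a normal distribution parameterized by FWHM (as the parameter name suggests) relates to σ by FWHM = 2√(2 ln 2)·σ ≈ 2.3548 σ. A plain NumPy version of this standard conversion (not necessarily ESCAPE's exact parameterization):

```python
import numpy as np

FWHM_TO_SIGMA = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # ≈ 0.4247


def normal_fwhm(x, mean, fwhm):
    """Normal pdf parameterized by full width at half maximum."""
    sigma = fwhm * FWHM_TO_SIGMA
    return np.exp(-(x - mean) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))


# Check the defining property: value at mean ± FWHM/2 is half the peak value.
peak = normal_fwhm(10.0, 10.0, 4.0)
print(abs(normal_fwhm(12.0, 10.0, 4.0) - peak / 2) < 1e-9)  # True
```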
Convolution on GPU¶
Test convolution h(x) = ∫ f(τ) * g(x - τ) dτ with GPU JIT.
x = esc.var("x")
tau = esc.var("tau")
mean_f = esc.par("mean_f", 10.0)
sigma_f = esc.par("sigma_f", 1.0)
mean_g = esc.par("mean_g", 0.0)
sigma_g = esc.par("sigma_g", 1.5)
f_conv = esc.normal(tau, mean_f, sigma_f)
g_conv = esc.normal(x - tau, mean_g, sigma_g)
h_conv = esc.convolution(f_conv, g_conv, x, tau, 0.0, 30.0, 1e-8, 1e-8, 200)
xarr = np.linspace(2.0, 18.0, 200)
cpu_vals = h_conv(xarr)
h_conv.device = "gpu"
gpu_vals = h_conv(xarr)
h_conv.device = "cpu"
max_diff = np.max(np.abs(cpu_vals - gpu_vals))
print(f"convolution max|cpu-gpu| = {max_diff:.2e} {'OK' if max_diff < 1e-6 else 'MISMATCH'}")
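This setup can also be cross-checked without ESCAPE: the convolution of two normals is a normal with summed means and variances, and a direct NumPy quadrature of h(x) = ∫ f(τ) g(x − τ) dτ over the same range reproduces it:

```python
import numpy as np


def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))


dx = 0.01
tau = np.arange(0.0, 30.0, dx)        # same integration range as above
xq = np.linspace(2.0, 18.0, 9)        # a few query points

# h(x) = integral of f(tau) * g(x - tau) dtau, by direct quadrature
f_tau = gauss(tau, 10.0, 1.0)         # mean_f, sigma_f
h = np.array([dx * np.sum(f_tau * gauss(xk - tau, 0.0, 1.5)) for xk in xq])

# Analytic result: normal with summed mean and variance
h_ref = gauss(xq, 10.0 + 0.0, np.sqrt(1.0**2 + 1.5**2))
print("max error:", float(np.max(np.abs(h - h_ref))))
```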
Weighted Average (Distribution-based) on GPU¶
Test H(x) = ∫ f(τ) * g(x, τ) dτ with GPU JIT.
x = esc.var("x")
mean_v = esc.var("mean")
sigma_p = esc.par("sigma", 0.1, userlim=[0, 10])
f_avg = esc.sin(x*1000)**2
g_avg = esc.normal(x, mean_v, sigma_p * x)
h_avg = esc.average_normal(f_avg, sigma_p, x, epsrel=1e-8, epsabs=1e-8, maxiter=200, numpoints=15)
xarr = np.linspace(1.0, 10.0, 100)
cpu_vals = h_avg(xarr)
h_avg.device = "gpu"
gpu_vals = h_avg(xarr)
h_avg.device = "cpu"
max_diff = np.max(np.abs(cpu_vals - gpu_vals))
print(f"weighted avg max|cpu-gpu| = {max_diff:.2e} {'OK' if max_diff < 1e-6 else 'MISMATCH'}")
h_avg.device = "gpu"
h_avg.show(coordinates=np.linspace(-10, 10, 1000))