Reason or Problem
generate_terrain is slow on the numpy backend. The octave loop in _gen_terrain (xrspatial/terrain.py) calls _perlin once per octave, and _perlin (xrspatial/perlin.py) is plain numpy: floor, astype, & 255, four fancy-index gathers (p[p[xi] + yi]), fade polynomials, lerps. Every one of those allocates a full-size temporary, and it all repeats for each octave (16 by default). The dask+numpy backend calls the same _gen_terrain per chunk, so it's slow too.
Measured here (numpy, default fbm, 16 octaves):
- 512x512: 244 ms
- 1024x1024: 1506 ms
- 2048x2048: 7368 ms
About 6.3s of the 2048x2048 run is the perlin octave loop. The GPU backend is already fast (28 ms at 2048x2048), so it's not the target.
Proposal
Replace the per-octave numpy work with one fused @njit(parallel=True) kernel that computes the whole field per pixel -- domain warp, the fbm/ridged octave loop, and the final cube -- with prange over rows for threading. No per-octave temporaries; permutation tables are built once and passed in as a stacked int32 array.
A prototype of just the octave loop does the 2048x2048 16-octave fbm in 27.7 ms versus 6258 ms for the current 16x _perlin, and agrees with the current output to 8e-7 (float32 epsilon).
Design:
- Add a fused scalar kernel in terrain.py with parallel jit. Per pixel: optionally compute the warp displacement (warp_octaves perlin sum, normalized, scaled by warp_strength), then run the fbm or ridged octave loop, divide by the amplitude norm, cube.
_gen_terrain takes the fast path when worley_blend == 0 (the default). When worley blending is on, keep the existing numpy code, since worley needs a global min/max pass and is off by default.
- numpy and dask+numpy both go through
_gen_terrain, so both benefit. cupy and dask+cupy are unchanged.
Value: Default CPU generate_terrain gets much faster with output unchanged within float32 epsilon; dask+numpy speeds up for free.
Drawbacks
The worley path (worley_blend > 0) isn't sped up here.
Alternatives
Fuse the GPU kernels too, but the GPU is already fast and its parity tests are tight, so leave it for a follow-up.
Reason or Problem
generate_terrainis slow on the numpy backend. The octave loop in_gen_terrain(xrspatial/terrain.py) calls_perlinonce per octave, and_perlin(xrspatial/perlin.py) is plain numpy: floor, astype,& 255, four fancy-index gathers (p[p[xi] + yi]), fade polynomials, lerps. Every one of those allocates a full-size temporary, and it all repeats for each octave (16 by default). The dask+numpy backend calls the same_gen_terrainper chunk, so it's slow too.Measured here (numpy, default fbm, 16 octaves):
About 6.3s of the 2048x2048 run is the perlin octave loop. The GPU backend is already fast (28 ms at 2048x2048), so it's not the target.
Proposal
Replace the per-octave numpy work with one fused
@njit(parallel=True)kernel that computes the whole field per pixel -- domain warp, the fbm/ridged octave loop, and the final cube -- withprangeover rows for threading. No per-octave temporaries; permutation tables are built once and passed in as a stacked int32 array.A prototype of just the octave loop does the 2048x2048 16-octave fbm in 27.7 ms versus 6258 ms for the current 16x
_perlin, and agrees with the current output to 8e-7 (float32 epsilon).Design:
_gen_terraintakes the fast path whenworley_blend == 0(the default). When worley blending is on, keep the existing numpy code, since worley needs a global min/max pass and is off by default._gen_terrain, so both benefit. cupy and dask+cupy are unchanged.Value: Default CPU
generate_terraingets much faster with output unchanged within float32 epsilon; dask+numpy speeds up for free.Drawbacks
The worley path (worley_blend > 0) isn't sped up here.
Alternatives
Fuse the GPU kernels too, but the GPU is already fast and its parity tests are tight, so leave it for a follow-up.