Skip to content

Speed up generate_terrain on the numpy backend with a fused parallel kernel #3565

Description

@brendancol

Reason or Problem

generate_terrain is slow on the numpy backend. The octave loop in _gen_terrain (xrspatial/terrain.py) calls _perlin once per octave, and _perlin (xrspatial/perlin.py) is plain numpy: floor, astype, & 255, four fancy-index gathers (p[p[xi] + yi]), fade polynomials, lerps. Every one of those allocates a full-size temporary, and it all repeats for each octave (16 by default). The dask+numpy backend calls the same _gen_terrain per chunk, so it's slow too.

Measured here (numpy, default fbm, 16 octaves):

  • 512x512: 244 ms
  • 1024x1024: 1506 ms
  • 2048x2048: 7368 ms

About 6.3s of the 2048x2048 run is the perlin octave loop. The GPU backend is already fast (28 ms at 2048x2048), so it's not the target.

Proposal

Replace the per-octave numpy work with one fused @njit(parallel=True) kernel that computes the whole field per pixel -- domain warp, the fbm/ridged octave loop, and the final cube -- with prange over rows for threading. No per-octave temporaries; permutation tables are built once and passed in as a stacked int32 array.

A prototype of just the octave loop does the 2048x2048 16-octave fbm in 27.7 ms versus 6258 ms for the current 16x _perlin, and agrees with the current output to 8e-7 (float32 epsilon).

Design:

  • Add a fused scalar kernel in terrain.py with parallel jit. Per pixel: optionally compute the warp displacement (warp_octaves perlin sum, normalized, scaled by warp_strength), then run the fbm or ridged octave loop, divide by the amplitude norm, cube.
  • _gen_terrain takes the fast path when worley_blend == 0 (the default). When worley blending is on, keep the existing numpy code, since worley needs a global min/max pass and is off by default.
  • numpy and dask+numpy both go through _gen_terrain, so both benefit. cupy and dask+cupy are unchanged.

Value: Default CPU generate_terrain gets much faster with output unchanged within float32 epsilon; dask+numpy speeds up for free.

Drawbacks

The worley path (worley_blend > 0) isn't sped up here.

Alternatives

Fuse the GPU kernels too, but the GPU is already fast and its parity tests are tight, so leave it for a follow-up.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformancePR touches performance-sensitive code

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions