Speed up generate_terrain on the numpy backend with a fused parallel kernel

## Reason or Problem
`generate_terrain` is slow on the numpy backend. The octave loop in `_gen_terrain` (xrspatial/terrain.py) calls `_perlin` once per octave, and `_perlin` (xrspatial/perlin.py) is plain numpy: floor, astype, `& 255`, four fancy-index gathers (`p[p[xi] + yi]`), fade polynomials, lerps. Every one of those allocates a full-size temporary, and it all repeats for each octave (16 by default). The dask+numpy backend calls the same `_gen_terrain` per chunk, so it's slow too.

Measured here (numpy, default fbm, 16 octaves):
- 512x512: 244 ms
- 1024x1024: 1506 ms
- 2048x2048: 7368 ms

About 6.3s of the 2048x2048 run is the perlin octave loop. The GPU backend is already fast (28 ms at 2048x2048), so it's not the target.

## Proposal
Replace the per-octave numpy work with one fused `@njit(parallel=True)` kernel that computes the whole field per pixel -- domain warp, the fbm/ridged octave loop, and the final cube -- with `prange` over rows for threading. No per-octave temporaries; permutation tables are built once and passed in as a stacked int32 array.

A prototype of just the octave loop does the 2048x2048 16-octave fbm in 27.7 ms versus 6258 ms for the current 16x `_perlin`, and agrees with the current output to 8e-7 (float32 epsilon).

**Design:**
- Add a fused scalar kernel in terrain.py with parallel jit. Per pixel: optionally compute the warp displacement (warp_octaves perlin sum, normalized, scaled by warp_strength), then run the fbm or ridged octave loop, divide by the amplitude norm, cube.
- `_gen_terrain` takes the fast path when `worley_blend == 0` (the default). When worley blending is on, keep the existing numpy code, since worley needs a global min/max pass and is off by default.
- numpy and dask+numpy both go through `_gen_terrain`, so both benefit. cupy and dask+cupy are unchanged.

**Value:** Default CPU `generate_terrain` gets much faster with output unchanged within float32 epsilon; dask+numpy speeds up for free.

## Drawbacks
The worley path (worley_blend > 0) isn't sped up here.

## Alternatives
Fuse the GPU kernels too, but the GPU is already fast and its parity tests are tight, so leave it for a follow-up.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up generate_terrain on the numpy backend with a fused parallel kernel #3565

Reason or Problem

Proposal

Drawbacks

Alternatives

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Speed up generate_terrain on the numpy backend with a fused parallel kernel #3565

Description

Reason or Problem

Proposal

Drawbacks

Alternatives

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions