## Micro-Patch Displacement Mapping

One major endeavour that I undertook this past summer was the creation of a novel displacement mapping system for Maverick Render.

The main motivations behind this have been:

- Our customers were running out of GPU memory too often when using displacement.
- Our main goal is that
**Maverick is an exquisite-quality tool for model visualization**. Hence displacement is very important to us. - Open the door to some other geometry-generation features beyond displacement itself.

So I undusted the old *virtualized geometry* solution I implemented for fryrender back in 2008 and re-engineered it for the Maverick core in the GPU.

### Say hi to: Micro-Patch Displacement Mapping

*Non-affine MPDM in arbitrarily-curved surfaces*

*Affine MPDM in flat surfaces*

Some good properties of Maverick’s displacement:

- No requirements
*whatsoever*on the base mesh or on the heightmap texture. - It costs near 0 in warm-up time.
- It costs near 0 in memory (only a small payload is appended to the height map).
- 0 information is lost; every single texel in the heightmap is turned into a displaced
*micro-patch*. - Implemented as a custom primitive in Maverick’s OptiX ray-tracing backend.
- Used every trick in the book to optimize it as much as I could.

### MPDM vs. brute-force

MPDM is meant to address some classic problems with the *subdivide-and-displace-vertices* (brute-force) approach. Using brute-force:

- There is some (often significant) wait time for warm-up.
- Memory usage goes through the roof, more than often blowing up all available memory.
- Height sampling happens at the vertices and not at the heightmap texels.

Points 1-2 are very obvious. It becomes impossible to abuse displacement, or to recklessly use it for large surfaces. Some techniques such as back-face culling or camera culling can be used to work around these difficulties, but such techniques pose problems on their own. Expecting the user to be careful not to exceed his memory budget is always an ill-fated plan.

Point 3 often means that the resulting displacement is smudgy, and not as crisp as it should be. *i.e.,* information is lost, and/or geometry is produced in excess. Some techniques (that Maverick itself has featured in the past) such as *autobump* can be used to work around this problem by kind-of-restoring the information that is lost in the process. But then again, optimal configuration would become the responsibility of the user.

None of these issues happen with MPDM. :-)

### Are there any benefits to the old brute-force approach?

As usual, nothing really good comes without a price. This is often very true in the realm of algorithm optimization.

- MPDM unfolds virtualized geometry during ray-traversal. The displaced geometry
*never*exists for real in memory (that is why it is said to be*virtualized*) and this is the reason why MPDM has a near 0-cost in memory usage. However, this virtualization comes at an algorithmic cost. Despite massively optimized, MPDM renders a bit more slowly than the same displaced geometry would if the vertices were pre-displaced during warm-up and entered the ray-tracing BVH as regular geometry. - For MPDM, part of the optimization process relies on the fact that the input map is discretized in
*texels*, (as in a regular bitmap). For brute-force displacement the only requirement is that the height map can be evaluated at each geometry vertex, which is true to bitmaps, procedurals, and all map types. So MPDM is a priori limited to`filetex`

maps, but we have added to Maverick a new map type called`bakemap`

so the user can bake any procedural map tree into a`filetex`

to feed MPDM if necessary.

### How does MPDM work?

Here’s a very broad explanation on how MPDM in Maverick works:

- A (rather classic) hierarchical partitioning scheme based in min-max mip-map levels is performed on the input heightmap. As depicted in the images below.
- Those voxels are efficiently ray-traceable… unless there is curvature in the base mesh. Hell breaks loose as soon as curvature is at play. Maverick does non-affine math here (visualized in a previous post on non-affine triangle sweeps). I’ve tried different approaches for this (non-affine, discretized, and hybrid), and the code allows to flag-switch each mode on and off.
- Because of the non-affine treatment, at the lowest hierarchy level you no longer have quads or triangles, but bilinear patches (hence the name:
*Micro-PATCH Displacement Mapping*).

*Non-affine MPDM in arbitrarily-curved surfaces*

*The input heightmap used*

*Affine MPDM in flat surfaces (special optimization branch)*

*The input heightmap used*

But the devil lies in the details. Efficient implementation of this all was a multi-week hell working with data structures, low-level optimizations, coordinating MPDM with OptiX and the rest of the Maverick core, and dealing with plenty of non-obvious fine-grain issues such as:

- Avoiding cracks across edges.
- Dealing with UV mapping projections.
- Transfer of shading normals.
- Real-time notifications from the UI.
- …

Youtube 4K videos:

https://youtube.com/shorts/2PdRx0sALgg?feature=share

https://youtube.com/shorts/O9qUCZTxkTQ?feature=share

## Diffraction vs. Multi-resolution

I’ve been working lately on glare/bloom/fringe and other post-processing effects in Maverick Render. Some of these inherit from our lovely ArionFX Adobe Photoshop and AfterEffects plug-in.

One complaint in ArionFX and also in Maverick is (was, because this post is about a successful fix) that Glare/Bloom diverge in *shape and power* when the input image is rendered at a different resolution, even if the Glare/Bloom parameters stay the same.

There are some relatively unobvious reasons for this. Basically, the challenges are:

*Hard challenge*: Diffraction is a frequency analysis effect. For a render, this happens in the discrete realm (pixels). The size (amount of pixels) of the images involved changes what frequencies and how they show up in the Fourier Transform.*Hard challenge*: Anti-Aliasing of neighboring pixels (more prevalent at low resolution) averages their power and dims the overall Glare/Bloom overlay. This can pose a real problem for thin geometries such as lightbulb filaments.*Easy challenge*: As illustrated in some of my previous posts, the FT itself has some properties that relate its scale and power to the scale and power of the aperture/obstacle of the lens iris. These of course must be compensated for.*Medium challenge*: Changes in aspect ratio, or in padding in the image buffers (such as the padding between the IPR size in the UI vs. the canvas size) must be taken into account as well.

The upcoming release of Maverick will address these issues.

Here’s a small video with a sequence of Maverick post-processing effects, all rendered alternating landscape and portrait aspect ratios between 512 and 2048. The video is cropped as landscape to be easier on the eyes. As can be seen, at lower resolutions there’s always some power divergence, and a little bit of blur. But those are unavoidable to some extent.

Youtube 4K version: youtu.be/4K8F8vP76…

## Flyby heatmaps (i)

These are some technical visualizations from the all-new 0-memory geometry generation subsystem in Maverick Render which we intend to release soon.

I will be posting more related images and videos in the upcoming days.

### Youtube versions of the above videos

**[EDIT]** I believe that micro.blog has some difficulty with (*i.e.,* won’t play back) videos encoded at a non-standard aspect ratio. So here are the same videos in Youtube form:

### Results in production

The first beneficiary of this subsystem is Displacement Mapping (now Micro-Patch Displacement Mapping *a.k.a.,* MPDM). This new MPDM system uses near zero memory, near zero warm-up time, and captures *literally and exactly* every single texel in the input texture.

Youtube 4K version: youtu.be/BfNTDcMCB…

## Bessel functions

*From Wikipedia*

### Iterative implementation

I came up with the above implementation, with which I plotted these.

`J0-4 in [0..2pi]`

`2*abs(J1-4(x)/x) in [0..2pi]`

Iterative methods are a terrible choice for many use cases because of error accumulation, jitter, the very presence of a loop, etc… But in this case this method also happens to be pretty weak and way too easy to break:

`J0-4 in [0..8pi]`

### Deus ex machina

After some unsuccessful googling I was about to work on my own Taylor Series approximation or at least study the proposed implementations in Numerical Recipes. But I learnt to my surprise :o) that `<math.h>`

provides `j0/j1/jn/y0/y1/yn`

and even CUDA provides `j0f/j1f/jnf/y0f/y1f/ynf`

.

I checked whether the C/CUDA implementations match my iterative implementation, and they do.

For reasons I will discuss in a future post, I am after `j1`

only, so… *my job here is done*.

## Fourier Transform of a unit pulse

One might need (as is the case for me now) to compute a FT without *actually* computing the FT. *i.e.,* to find the *explicit* formulation for the resulting FT of a certain 2D signal.

In particular, as part of some optimizations in Maverick Render I wish to create a map that looks like the diffraction pattern of a polygonal aperture shape, but without computing the FT of said aperture.

*From a previous post*

Looking at this video from my previous post, intuition says that each of the streaks of a diffraction pattern is “kind of independent from” or “overlaps onto” the rest and could be modeled on its own from a 1D FT. This intuition is incorrect in the general case. But let’s focus on the one case where this reasoning is correct: *the diffraction pattern of a square shape*.

### Separability of the FT

The Fourier Transform is *separable* in both directions:

- The 2D analysis formula can be written as a 1D analysis in the
`x`

direction followed by a 1D analysis in the`y`

direction. - The 2D synthesis formula can be written as a 1D analysis in the
`u`

direction followed by a 1D analysis in the`v`

direction.

So it seems natural to focus on the 1D case first and then combine 2x 1D transforms for the square. The cross section of a square is (one cycle of) a 1D rectangular pulse.

### Fourier Transform of a 1D unit pulse

*From Wikipedia*

A unit pulse of amplitude `A`

and width `T`

is a function `g(t)`

that yields `A`

for `t IN [-T/2..+T/2]`

and `0`

everywhere else. Plugging this `g(t)`

in the definition of the Fourier Transform, we obtain this:

So the FT of a unit pulse is the real-valued function `A*T*sinc(f*T)`

. The imaginary component in this case cancels out.

### The sinc function

The `sin(pi*f*T)/(pi*f*T)`

part of the solution is *such a fundamental building block* in the field of signal processing that it got a name of its own: The venerable sinc function, or alternatively the sampling function.

From Wikipedia: *The normalized sinc function is the Fourier Transform of the rectangular function with no scaling. It is used in the concept of reconstructing a continuous bandlimited signal from uniformly spaced samples of that signal.*

The sinc function can be expressed in unnormalized form (`sin(x)/x`

which integral over the real line is `pi`

) or in normalized form (`sin(pi*x)/(pi*x)`

which integral over the real line is `1`

).

In both cases there is an indetermination at `x=0`

but you can use L’Hôpital’s rule to show that `sinc(0)=1`

.

*From Wikipedia*

### Fourier Transform of a rectangular shape

Since the FT is separable as mentioned above, the 2D FT of a square shape must be the 1D FT of a unit pulse in the `x`

direction, followed by another 1D FT in the `y`

direction.

Note that this would work for rectangles too, by using a different `T`

value for each axis.

The top-right image is the actual FT of the square shape on the top-left. The bottom-right image is just the above piece of code. The bottom-left image is the profile along the `x`

direction of the above code at `y=0`

which is proportional to `abs(sinc)`

.

The top-right and bottom-right images match exactly, except for:

- Some numerical drift.
- Unlike the above code, the true FT is periodic and the streaks wrap-around. But this happens to be a desirable property in my case of use.

So: *success!*

### Can this method be generalized to n-sided polygons?

One might hope that the FT of `n`

-sided polygons, or even `inf`

-sided polygons (circles) could be conscructed similarly by *composing rotating FTs* of square functions somehow. But the truth is that things get a little bit more complicated than that.

In particular, the other case of interest for me in my application besides squares is *circles*. We’ll get into those in a future post.

### The double-slit experiment

In the basic version of the double-slit experiment, a laser beam illuminates a plate pierced by two parallel slits, and the light passing through the slits is observed on a screen behind the plate. The wave nature of light causes the light waves passing through the two slits to interfere, producing bright and dark bands on the screen.

The pattern formed on the screen is the (constructive/destructive) sum of two rectangular interference patterns; one per slit. The profile of each pattern happens to be a `sinc`

function.

In this image below, I have dimmed the slits vertically using a gaussian in order to generate an image that more closely resembles classic pictures of the experiment.

## The Avalanche effect

In the realm of integer hashing, the Avalanche effect is the notion that if the input of a hash function is changed slightly, its output must change significantly. *i.e.,* flipping one bit in the input must randomly flip about half the output bits.

This criterion can be hardened. Strict (full) avalanche says that whenever a single input bit is flipped, each of the output bits must change with a 50% probability.

In simpler terms this means that a good hash function is expected to break the correlation between the input and the output as much as possible. Avalanche is one possible measure of how well this property is achieved.

### Implementing avalanche

Let us focus on `u32->u32`

hash functions.

Implementing avalanche is fairly simple. All you need to do is pass each possible `u:u32`

value through your hash function, and then flip each bit in `u`

one by one, seeing what bits in `hash(u)`

are affected. For each flip in the output you `++`

the `(c,r)`

entry in a `32x32`

counters array, where `(c,r)`

corresponds to a flip in the c-th input bit and a flip in the r-th output bit.

An exact estimation of avalanche requires that you try each of the `2^32`

possible values for `u`

once, which may take longer than reasonable. But it is generally enough to just try a large bunch of (uniformly distributed) values to get a good estimate of what avalanche will converge to.

A good hash function should accumulate to a close-to-perfectly-even counters array. *i.e.,* if the array is normalized by the total amount of values we tested, the array will be flat 50% gray. On the other hand, poor hash functions quickly produce an evident pattern in the resulting array (see down below).

https://en.wikipedia.org/wiki/MurmurHash

*Avalanche matrix for Murmur3’s finalizer*

### I wrote a Shadertoy for this

Over the years I have implemented plenty of stress-tests for PRNGs, QRNs, hash functions, etc… as part of Maverick Render’s MK_api Unit Testing. This time I went ahead and wrote a Shadertoy instead as an exercise:

https://www.shadertoy.com/view/NtVczW

The code allows you to try out different popular `u32->u32`

hash functions. You can also type in your own. In a few seconds you will see what the `32x32`

avalanche array and the corresponding numerical estimation converge to.

The code can be edited to display the avalanche matrix, or the *bias* matrix, which is nothing but `bias=abs(avalanche-.5)`

. *i.e.,* How much avalanche deviates from 50%. Naturally, the bias is meant to converge to 0 for a high-quality hash function.

Some results with other hash functions:

*Avalanche matrix for iquint1*

*Avalanche matrix for Schechter and Bridson hash*

*Avalanche matrix for Wang’s hash*

### Valuable links

There is a methodology to create your own hash functions by combining certain reversible operations. I won’t get into the details here, but I will give a couple of fantastic links instead:

- The pretty amazing
**Hash Function prospector**project, by Chris Wellons (*a.k.a.,*skeeto):

https://github.com/skeeto/hash-prospector

skeeto found some extraordinary 32-bit and 64-bit direct/reverse hash functions himself. In particular the `lowbias32`

one, which comes implemented in my Shadertoy above.

- A dissertion on the search for perfect hashing, by Bob Jenkins:

http://burtleburtle.net/bob/hash/integer.html

### Takeaway

Many CG use cases where what matters the most is performance can get away with cheap hash/RNG functions. On the other hand, industrial-strength rendering applications such as Maverick Render just can’t.

But there is a sweet spot between high-performance and low-bias (as proven by, for example, the Murmur3 finalizer or skeeto’s low-bias hash) where you can have a negligible bias with just a handful of arithmetic instructions.

## Recent developments in Maverick Render

I am super-stoked about the new tech we’ve been creating for Maverick Render recently.

*Test render courtesy of Studio Podrini*

Our Product Manager wouldn’t like me to spoil any surprises by revealing what new features are involved in this image, so I will leave that to the viewer’s imagination. But here’s a hint: *new/updated BSDF stuff* + *new heavy raytracing-related stuff* + *even more new heavy raytracing-related stuff*.

All these will see the light in Maverick Render 2022.5 (as soon as it is ready).

## Barycentric coords in triangle sweeps

A “triangle sweep” can be defined as 2 triangles `(A,B,C)`

and `(D,E,F)`

and the 3 bilinear patches at the sides of both triangles. It’s kind of a “twisted” triangular prism.

We can define some sort of `(u,v,w)`

barycentric coordinates here, such that each point `P`

interior to the sweep could be written as `P=(A+AD*w)*u+(B+BE*w)*v+(C+CF*w)*(1-u-v)`

.

As soon as the sweep is “twisted” the behavior of those *barycentric* coords is no longer linear. *i.e.,* linear motion in the `(u,v,w)`

space does not translate to linear motion in 3D space and viceversa. However these coordinates can be useful for certain applications.

One way to find `(u,v,w)`

for a given `P`

is to solve for `w`

first (the red plane below) and then find `u`

and `v`

as regular barycentrics in the triangular section. After massaging the numbers a bit, solving for `w`

turns out to be a cubic equation, which is the wiggly orange curve below.

We are only interested in solutions where `w IN [0..1]`

.

## Octahedron unit-vector encoding

There are many situations in Computer Graphics where you may need to store massive amounts of unit-length vectors (*e.g.,* directions, surface normals, …). The normals AOV in a G-Buffer, incoming directions in a Photon map, or uploading a 3D model to the GPU for rendering are situations that greatly benefit from a bit-efficient encoding for normals.

### Spherical coordinates encoding

The straightforward way to encode a normal is to store its `(x,y,z)`

coordinates, which takes `3x32=96`

bits in single-precision floating-point. This naive approach doesn’t take advantage of the fact that in normals or directions the vector length is irrelevant. So assuming unit length, you may use spherical coordinates `(theta, phi)`

and encode each normal with 2 numbers instead of 3.

Since `(theta, phi)`

have a `[0..2pi]`

and `[0..pi]`

range, it sounds reasonable to quantize its values using fixed-point math. This way both angles become a couple of integers of, say, 16-bit each, with their ranges turned into `[0..65535]`

. Note that this gives twice as much resolution to `phi`

but that’s fine. We have achieved a reasonably fine-grained `16+16=32-bit`

encoding. Depending on how tolerable precision losses are in your application, you may go even lower (*e.g.,* `12+12=24-bit`

, or even `8+8=16-bit`

).

The problem with this simple approach is that spherical coordinates are not an equal-area transform, as they concentrate points near the poles.

*Spherical coordinates encoding:
phi gets more precision than theta, and density between the equator/poles is highly uneven.*

### Octahedron encoding

A smarter solution which produces a much more even distribution, while being strikingly simple and more efficient computationally (no trigonometry whatsoever) is *octahedron encoding*.

*Octahedron encoding: The distribution is pretty even.*

The idea is to project positions in the sphere (unit vectors) onto a sphere-centered octahedron. And then piece together the 8 faces of the octahedron in the unit cube. This diagram should be self-explanatory.

Reference: Octahedron environment maps

### Fibonacci encoding

Recently I have been using the Fibonacci lattice here and there and had the idea to use its equal-area spherical flavor for normal encoding. After googling a bit I came across this paper, so somebody had the idea first:

Unorganized Unit Vectors Sets Quantization

Actually, pulling from this thread I also came across these Shadertoy samples by Íñigo Quilez. They visualize what happens with either method as more bits are budgeted for normals encoding:

*Fibonacci:* https://www.shadertoy.com/view/4t2XWK

*Octahedron:* https://www.shadertoy.com/view/Mtfyzl

*Fibonacci encoding: Looks great, but comes at a price.*

This looks elegant (pretty even!). Unfortunately the implementation to encode/decode unit-vectors with `n`

bits is rather convoluted and involves `N=2^n`

and `1/N`

. As soon as `n>24`

or so numerical degradation makes this solution unusable in single-precision. Also, this method is significantly slower than octahedron encoding, and while the distribution looks prettier, the difference in practice is negligible.

## QRNs in the unit square/disk/sphere

Basic concepts in this post, but worth refreshing.

Very often in Computer Graphics (in Monte Carlo integration in particular) you need to sample `N`

points uniformly over the surface of a square, a disk, or a sphere.

Depending on many (many!) factors such as how critical the *quality* of the random distribution needs to be, performance or code simplicity considerations, availability of a high-quality randomization seed, etc… one may choose one method or another. There are plenty of options, and each comes with their own strengths and weaknesses.

In this post I will talk about:

- The
*Equal-Area transform*to turn a pair of uniformly distributed random numbers in the unit square into uniformly distributed points over the surface of the the unit disk/sphere. - A super-simple
*Fibonacci-based*method to generate`N`

points that uniformly cover the unit square (and hence the unit disk/sphere).

### Equal-Area transform

Given `u`

a uniform distribution of 2D points in the surface of the unit square, the transforms to produce uniform distributions of 2D points in the unit disk, or 3D points in the unit sphere are:

There are very clean-cut explanations for these on the always fantastic wolfram.com website.

These are based on the area (disk) and solid angle (sphere) differential elements.

### Fibonacci lattice

The algorithm to generate the first `N`

points in the Fibonacci lattice goes like this:

This produces a stratified distribution of points in the unit square.

The beauty of this method is that it’s a strikingly simple couple of divs and a mod. Unlike most series generators, however, it requires that you know `N`

beforehand. Other usual methods like the ones depicted later in this post are incremental, so each new sample is spawned next to the previous ones. The Fibonacci lattice, on the other hand, “squeezes” previous samples as new ones come in, as illustrated by the videos below.

### Fibonacci lattice (square/disk/sphere)

This videos simply plug the Fibonacci lattice formula to the Equal-Area transforms above.

As a bonus, the next videos are generated like the ones above, but using a uniformly distributed QRN generator (the popular and very strong Mersenne Twister) and the van Der Corput stratified series (in particular, the Halton sequence using primes `2`

and `3`

).

#### Mersenne Twister (Random)

#### Halton (Stratified)

## Recklessly encoding int as float

*Quiz:* Does this loop end, and if it does, what is the value of `k`

after the loop? (assuming IEEE-754 and usual 32-bit semantics for `int/float`

)

Running this code in VS you get `k=16777217`

which equals `2^24+1`

.

In other words, in line with the true intent of the quiz, `float`

can encode -exactly- (without any precission loss) all natural numbers up to `2^24`

(included). Because `float`

encodes sign in a dedicated bit, this property holds true for negative values as well. So you could *hypothetically* encode any `[-2^24..+2^24]`

`int`

in a `float`

with a static/implicit cast.

I said *hypothetically* because generally speaking I would never do such thing. Carelessly casting between `int`

and `float`

(without an explicit `floor`

, `ceil`

, `trunc`

, `round`

…) is often a sign of lousy coding, unless performance is at stake and you know well what you are doing.

However, I came across the situation in Unity recently, where the `Vertex ID`

node in their Shader Graph hands over `SV_VertexID`

as a `float`

(disappointed face :-P). I would’ve expected an `uint/int`

output or a `float`

that you could reinterpret-cast to retrieve the raw `SV_VertexID`

bits with asuint(). But nope. You are handed over a `float`

which seemingly carries `(float)SV_VertexID`

.

One may recklessly use this `float`

to index a `StructuredBuffer`

ignoring the way static-casting works. This is one case where ignorance is bliss, because the `float`

received exactly matches the original `int`

as long as the original value is `<=2^24`

. That is, as long as you are dealing with fewer than (roughly) 16.7 million vertices, which is usually the case.

I believe that Unity’s ShaderGraph doesn’t support `int/uint`

as `In/Out`

values between nodes, so I guess that the `Vertex ID`

and `Instance ID`

nodes just static-cast the corresponding `SV_...`

value to a float. But it would be better (in the *pedantic* sense) to reinterpret the bit pattern of the raw values with `asfloat`

and then let the user retrieve them with `asint()/asuint()`

.

### reinterpret-cast between int/float in HLSL

### A (loose) explanation of the 2^24 limit

This is the IEEE-754 standard for floating-point numbers, as described in Wikipedia:

The value of a float-encoded number is reconstructed as `(-1)^S * M * 2^E`

.

`S`

is the 1-bit sign.`M`

is the 23-bit mantissa, interpreted as`1.xxxxx`

(in binary).`E`

is the 8-bit exponent, used as`E-127`

where 127 is often called*bias*.

*e.g.,* a power-of-2 integer number like 4 would be encoded as:

`M=1.00`

which is the binary representation of 4, with the point at the left-most 1.`E=2`

(+bias).

The restored number is `1.00 shl 2=4`

.

We should be able to do the same for all power-of-2 integers until we max out `E`

.

For non-power-of-2 integers the process is similar. *e.g.,* number 5:

`M=1.01...00`

.`E=2`

(+bias).

The restored number is now `1.01 shl 2=5`

.

This works the same for all integer numbers as long as `M`

can hold the raw binary representation of the number. The tallest binary number that `M`

can hold is 23 consecutive 1s. That is: `1.11...11`

(`24x1s`

in total). With `E=23`

(+bias) this equals `2^24-1`

.

The next integer `2^24`

would be encoded as `1.00...00`

(`24x0s`

clamped to 23, but the trailing 0s are meaningless here). With `E=24`

(+bias) this equals `2^24`

(**the answer provided above!!**).

But the next integer `1.{23x0s.1}`

can’t be encoded in a 23-bit `M`

. So from `2^24`

onwards, there is necessarily a loss. Some integers beyond `2^24`

(such as powers-of-2) may be luckily encoded exactly by `float`

. But not all consecutive numbers will. Actually, `2^24+1`

is the first integer that won’t.

Whoa! As always, *apologies for any mistakes in any of my posts*.

**[EDIT] The same reasoning can be applied to double, where the mantisa M is 52-bit long.**

Hence `double`

can encode exactly all integers in the range `[-2^53..+2^53]`

.

## Viewport baked PBR materials

The technique described here is very basic, but useful nonetheless.

Let’s make the following broad assumptions:

- No interactions between surfaces (
*i.e.,*no GI, no occlusion, etc…). - The only light source is an IBL (environment) map placed at infinity which orientation is anchored to the camera.
- Each surface point
`P`

is made of a perfectly rough white material which simply “collects” the light that arrives to`P`

from above the hemiplane defined by`(P,N)`

.

Under such assumptions a ball floating in the void looks like this image rendered in Maverick Studio with a procedural sphere primitive and a standard material:

The illumination collected by each point `P`

and radiated back to the camera (+ tonemapping) happens to be, exactly, one particular pixel in the rendered ball as depicted in the hand-drawn diagram above. Note that due to the constraints set, **this color depends exclusively on the normal N at P but not on P itself**. In other words, the rendered ball effectively precomputes what any surface point

`P`

sees through its normal `N`

.The goal in this post is to benefit from this knowledge in a rasterization viewport. So let’s jump now to the world of pixel shaders. In our vertex shader, let’s transform the object’s normals `N`

to camera space `Nc`

and let’s drop the resulting Z coordinate. `(Nc.x,Nc.y)`

are the geometry normals projected to screen space. Let’s normalize these: `Nc'=normalize(Nc.x,Nc.y)`

Let’s upload the ball render as a regular 2D texture and let’s define the following UV space:

`Nc'`

is normalized, so the UVs generated will cover the interior of the ball only. Normals at grazing angles will be near the perimeter of the texture’s ball.

In our fragment shader, let’s simply output the color found in the ball image at the UVs computed as described. Here’s the result:

Of course, physically-based materials are more complicated than this. In general, the amount of light received by the surface from the `(P,N)`

hemiplane through each possible incoming direction and then bounced back through each possible outgoing direction depends on both directions, and also on the properties of the material. This is basically the definition of a BRDF, BTW.

But because we set the constraint that the lights are not moving (*i.e.,* the IBL stays fixed with respect to the camera) the incoming and outgoing directions and hence the behavior of the BRDF ends up being a function of `N`

and *nothing* else. So this method still works **no matter what material we render the ball with** (well… excluding refractions).

Here goes another example with the same IBL but adding some specularity to the material:

Of course this could be pushed further with some SSAO or any other technique that approximates self-shadowing. But this is a very compact and cheap (memory/code/effort) way to display models in an easy-to-author appealing way.

Some final notes:

- If you want to be accurate, the ball should be rendered with an
*orthographic*camera (*i.e.,*not with a*perspective*camera). - For blurry/glossy or diffuse reflections, you can get away with a low-res ball render. However, make sure that the UVs that you generate don’t step on the antialias pixels at the boundary of the rendered ball. Otherwise shading will look b0rked at grazing angles.
- This technique can be extended by using 2 ball images: one with the diffuse component of the material, and another with the specular component. This way the fragment shader could modulate the diffuse part and the specular part by separate colors that could be passed as inputs to the shader. Keep in mind that this would require that the ball images and the colors are passed in linear (instead of sRGB) space and that the resulting color is sent back to sRGB.

## Trimmed NURBS

I’m currently swimming in an ocean of NURBS with their trimming loops. This is experimental work for NURBS rendering in Maverick Render.

**[EDIT]** Throwing some GLSL vert/frag shaders in to the mix for normal/tangency inspection. 2 days in and this is still WIP.

## Lattice-Boltzmann Method (LBM)

A couple of weekends ago I was tinkering with fluid dynamics and implemented a few compute shaders in Unity. Among the ones I implemented, my favorite is the Lattice-Boltzmann Method featured here.

This early implementation allows to draw SDF entities in the simulation space behaving as blockers, inflows or outflows. I intend to expand on the features, but most likely at a slow pace, because this is a no-rush side-project. But I definitely wish to post now and then about my progress.

I also hope to find some spare time soon to write a quick article on how to implement an LBM cellular automata from a light-on-maths perspective.

## Colorful pet project

Some pet project I am making some progress on. More info soon. Definitely. Maybe.

**[EDIT] I had to install a micro.blog plug-in so Twitter cards display the main post image when cross-posting to Twitter.** Apologies for the multiple test posts.

**[EDIT] Here’s a hint.** It’s a cutsie CSG-based Solid modeling experiment in Unity Editor.

## Topology fun

Recently I have been revisiting in my spare time one of my favorite subjects: topology. I will probably be posting some stuff about topological operators, subdivision surfaces, DCEL structures, irregular tilings and such things in the near future.

Below: A little demo I’ve written that supports an extended flavor of Conway’s notation for polyhedra: `adotaocD`

.

**[EDIT] I paused this pet project for the time being…** because the 2022.2 release of Maverick has virtually blown all my productivity hours away. But I plan to write some entries on topological operators at some point further into this year… Hopefully…

## Uniform vs. stratified - 2015-03-10

**[EDIT] This post was migrated from my blog from 2011…**

This is a classic subject in numerical (Monte Carlo) integration.

*Uniform 2D distribution vs. Halton series for the first 2 dimensions*

To the left: 32768 points in a 512×512 image using a uniform random number generator (Mersenne Twister). To the right, the first 32768 pairs in the Halton series, using dimensions #0 and #1. Click to enlarge!

## Playing with the Fourier Transform (I) - 2016-07-28

**[EDIT] This post was migrated from my blog from 2011…**

**[EDIT] This post was extended on 2022.09.**

The beauty and power of the Fourier Transform never cease to amaze me. And since several effects in ArionFX are based on it, I have had to play with it a lot in recent times.

As explained in a previous post, diffraction patterns (*e.g.,* the glare simulated in ArionFX for Photoshop) come from the Fourier Transform of the lens aperture. I will use the FT of an aperture mask for visualization in this post.

I will use pow-2 square sizes (my FT implementation is an FFT). Let’s start by this *Aperture x Obstacle* map output directly from ArionFX for Photoshop v3.5.0.

*Aperture x Obstacle mask, rasterized @ 512×512*

The Fourier Transform of this image, rasterized at that size in said 512px buffer, is the following.

*FT of the mask above*

The faux-color is done on the magnitude of each complex number in the FT. All the FT images in this post are normalized equally, and offset to look “centered” around the mid-pixel.

Such diffraction patterns and some heavy convolution magic are what ArionFX uses to compute glare on HDR images:

*Resulting glare in ArionFX*

Now, let’s focus on what happens to the FT (frequency space) when one does certain operations on the source data (image space). Or, in this exemplification: what happens to the diffraction pattern when one plays with the rasterized aperture mask.

Note that we’re speaking of the *Discrete Fourier Transform*, so sampling (rasterization, pixelization, anti-aliasing, …) issues are mostly ignored.

### Rotation about the center

A rotation of the source buffer about its center doesn’t change the frequencies present in the data; only their orientation. So a rotation in the source data rotates the FT in the exact same way.

As we will see next, this property holds true regardless of the center of rotation, because the magnitude of the FT is invariant with respect to translations.

*Rotation about the center*

### Translation (with wrap-around)

Frequencies arise from the relative position of values in the data field, and not from their absolute position in said field. For this reason, shifting (wrap-around included) the source data does not affect the magnitude of the corresponding Fourier Transform in any way.

*Invariance to translation*

Let’s recall that the idea behind the FT is that *“any periodic function can be rewritten as a weighted sum of sines and cosines of different frequencies”*. Periodic being the keyword there.

### Repetition (tiling)

Tiling the data buffer `NxM`

times (e.g., `2×2`

in the example below) produces the same FT, but with frequencies “exploded” every `NxM`

cells, canceling out everywhere else.

This is because no new frequencies are introduced, since we are transforming the same source data. However, the source data is `NxM`

times smaller proportional to the data buffer size (*i.e.,* the frequencies become `NxM`

times higher).

*Exploded frequencies on tiling*

### Data scaling

Normalization and sampling issues aside, scaling the data within the source buffer scales the FT inversely.

This is because encoding smaller data produces higher fundamental frequencies, while encoding a larger version of the same data produces lower fundamental frequencies.

*Inverse effect on scaling*

In the particular case of glare (*e.g.,* ArionFX) this means that the diffraction pattern becomes blurrier if the iris is sampled small. Or, in other words, for a given iris, the sharpest diffraction pattern possible is achieved when the iris is sampled as large as the data buffer itself.

## Filling of missing image pixels - 2015-05-28

**[EDIT] This post was migrated from my blog from 2011…**

Here’s what we could call a mean pyramid of the 512×512 Lena image. *i.e.*, a sequence of progressive 1:2 downscalings, where each pixel in the i-th downscaled level is the average of the corresponding 4 pixels in the previous level. For people familiar with OpenGL and such, this is what happens when the mipmaps for a texture map are computed:

*The 9+1 mipmap levels in the 512×512 Lena image*

There are many smart, efficient, and relatively simple algorithms based on multi-resolution image pyramids. Modern GPUs can deal with upscaling and downscaling of RGB/A images at the speed of light, so usually such algorithms can be implemented at interactive or even real-time framerates.

Here is the result of upscaling all the mip levels of the Lena image by doing progressive 2:1 bi-linear upscalings until the original resolution is reached:

*Upscaled mipmap levels*

Note the characteristic “blocky” (bi-linear) appearance, specially evident in the images of the second row.

Lately, I have been doing some experimentation with **pixel extrapolation** algorithms that “restore” the missing pixels in an incomplete image for a 2D DOF filter based on the Depth AOV.

My pixel extrapolation algorithm works in 2 stages:

- The first stage (analysis) prepares the mean pyramid of the source image by doing progressive 1:2 downscalings. Only the meaningful (not-a-hole) pixels in each 2x2 packet are averaged down. If a 2x2 packet does not have any meaningful pixels, a hole is passed to the next (lower) level.
- The second stage (synthesis) starts at the smallest level and goes up, leaving meaningful pixels intact, while replacing holes by upscaled data from the previous (lower) level.

*Mean pyramid (with holes)*

Note that the analysis stage can stop as soon as a mip level doesn’t have any holes.

Here is the full algorithm, successfully filling the missing pixels in the image.

*Mean pyramid (filled)*

### Conclusions

This algorithm can be implemented in an extremely efficient fashion on the GPU, and allows for fantastic parallelization on the CPU as well. The locality of color/intensity is preserved reasonably well, although holes become smears of color “too easily”.

A small (classic) inconvenience is that the source image must be pre-padded to a power-of-2 size for the progressive 1:2 downscalings to be well defined. I picked an image that is a power-of-2 in size already in the above examples.

TL;DR: Given the ease of implementation and the extreme potential in terms of efficiency, the results are quite decent for many applications.

## Scrambled Halton - 2015-03-10

**[EDIT] This post was migrated from my blog from 2011…**

The Halton sequence, which is one of my favourite algorithms ever, can be used for efficient stratified multi-dimensional sampling. Some references:

It is possible to do stratified sampling of hyper-points in the *s-dimensional* unit hyper-cube by picking one consecutive dimension of the Halton series for each component. A convenient way to do so is to use the first `s`

prime numbers as the basis for each Halton sequence.

It is well-known, however, that while this approach works great for low dimensions, high dimensions often exhibit a great degree of undesired correlation. The following image displays a grid where each cell combines two components of the *32-dimensional* Halton hyper-cube.

*Raw 32-dimensional Halton hyper-cube*

One can easily spot how some pairs exhibit an obvious pattern, while others fill their corresponding 2D area very densely. This happens more aggressively for higher dimensions (*i.e.,* to the right/bottom in the image) and for pairs formed with close components (*i.e.,* near the diagonal in the image). Click to enlarge!

A successful approach to dissolve this problem without losing the good properties of stratification is to do *“random digit scrambling”* (*a.k.a.,* rds). During the construction of a Halton number, digits in the range `[0..base[`

are combined. Given a *Pseudo-Random Permutation* of `length=base`

, all that one must do is use `PRP(digit)`

instead of `digit`

directly. This somewhat shuffles Halton pairs in rows and columns in a strong way so the correlation disappears. However, since the PRP is a bijection, the good properties of stratification are generally preserved.

**[EDIT] While this is true, it has been proven that Owen-Scrambling is a superior method.**

How to build a strong and efficient randomised PRP of an arbitrary length is a very interesting subject which details I won’t get into here. You can build your own carefully combining a number of reversible hashing operations.

Here’s the scrambling strategy in action:

*Scrambled 32-dimensional Halton hyper-cube*

Now all the blocks in the grid look equally dense. Magic!

As long as one picks a good seedable PRP, it is possible to generate any number of different samplings, all with the same good properties.

**[EDIT] These are the guidelines of the QRN system that Arion used for many years.**

## The Error function (erf) - 2014-09-06

**[EDIT] This post was migrated from my blog from 2011…**

Here is the 1D Gaussian function:

Put in short, the Error function is the integral of the Gaussian function from 0 to a certain point `x`

:

At least, that is the way the formula is presented by Wikipedia and Wolfram|Alpha. But as soon as you try to work with it you find out that in order to really match the above Gaussian function, normalization and axis scaling must be taken care of:

The plots below display `G(x,sigma)`

(bell-shaped) in blue, and `erf(x,sigma)`

(S-shaped) in yellow.

A very typical use for `G(x,sigma)`

that I’ve been talking about on this blog lately, is to build convolution kernels for Gaussian blur. An image convolution kernel, for example, is a pixel-centered discretization of a certain underlying function (a Gaussian, in this case). Said discretization splits the x axis in uniform unit-length bins (*a.k.a.,* taps, or intervals) centered at `x=0`

.

For each bin, this is pretty much the definition of the integral of `G(x,sigma)`

along the bin. That is, the increment of `erf(x,sigma)`

between both end-points of the bin.

*Discretized 1D Gaussian ( sigma=0.5)*

*Discretized 1D Gaussian ( sigma=1.0)*

*Discretized 1D Gaussian ( sigma=1.5)*

### Doing it wrong

Most implementations of Gaussian convolution kernels simply evaluate `G(x,sigma)`

at the mid-point of each bin. This is a decent approximation that is also trivial to implement:

This value is represented in blue in the above plots.

### Implementing erf

While `G(x,sigma)`

has a trivial explicit formulation, `erf`

is an archetypical non-elementary function.

Some development environments provide an `erf`

implementation. But most (the ones I use) don’t. There are some smart example implementations out there if you Google a bit. Usually they are either numerical approximations, or piece-wise interpolations.

Assuming that one has an `erf`

implementation at his disposal, the correct value for the bin `[a..b]`

is:

This value is represented in red in the above plots.

### A poor-man’s approximation

If you are feeling lazy, an alternative approximation is to super-sample `G(x,sigma)`

along the bin. The amount of samples can be chosen to be proportional to how small `sigma`

is.

This method is easy to implement. However, it is far less elegant and also less efficient as more evaluation calls are required.

### Conclusion

When discretizing a Gaussian function, as sigma becomes small (close to or smaller than `(b-a)`

), the approximation `bin(a,b)`

(in blue above) deviates significantly from the correct `bin'(a,b)`

(in red above).

So if you are going to Gaussian-blur at radius 1px or smaller, and precision is important, then it is necessary to use `erf(x,sigma)`

or at least super-sample `G(x,sigma)`

.

## Downsampling and Gaussian blur - 2014-09-01

**[EDIT] This post was migrated from my blog from 2011…**

I talked about several strategies to optimize convolutions in some of my previous posts. I still got to talk about how to approximate a Gaussian blur using a multi-step Box blur in a future post. However, there is yet another good technique to optimize a Gaussian blur that may come handy in some cases.

This post is inspired by a need that I had some days ago: Say that you need to do a 3D Gaussian blur on a potentially humongous 3D data buffer. Working with downsampled data sounds ideal in terms of storage and performance. So that’s what I am going to talk about here:

*What happens when downsampled data is used as input for a Gaussian blur?*

### The idea

Here’s the 0-centered and un-normalized 1D Gaussian function:

The *sigma* parameter in the Gaussian function stretches the bell shape along the `x`

axis. So it is quite straightforward to understand that if one downsamples the input dataset by a scale factor `k`

, then applies a (smaller) Gaussian where `sigma’=s/k`

, and finally upscales the result by the same scale factor `k`

, the result will approximate a true Gaussian on the original dataset where `sigma=s`

.

In cleaner terms: if one has an input dataset (*e.g.,* an image) `I`

and wants to have it blurred by a Gaussian where `sigma=s`

:

`I’<=I`

downsampled by a certain scale factor`k`

.`I”<=I'`

blurred by a small Gaussian where`s’=s/k`

.`I”’<=I''`

upscaled by a scale factor`k`

.

### How good is this approximation?

The Sampling Theorem states that sampling a signal at (at least) twice its smallest wavelength is enough. Which means that downsampling cuts frequencies above the *Nyquist* limit (half the sampling rate). In other words: Downsampling means less data to process, but at the expense of introducing an error.

Fortunately, a Gaussian blur is a form of low-pass frequency filter. This means that blurring is quite tolerant to alterations in the high part of the frequency spectrum.

### Visual evaluation

In the examples below I am downsampling with a simple pixel average, and I am upscaling with a simple bilinear filter. The `2x2`

grids below compare:

- Top-left – The original image
`I`

. - Top-right –
`I`

downsampled and upscaled by k (note the blocky bilinear filter look). - Bottom-left – The resulting image
`I”’`

. - Bottom-right –
`I`

blurred by a true Gaussian where`sigma=s`

.

In these examples, `k=sigma`

was chosen for simplicity. This means that the small Gaussian uses `sigma’=1`

.

*Gaussian blur where sigma=4*

*Gaussian blur where sigma=16*

*Gaussian blur where sigma=64*

### Conclusion

As shown, the approximation (bottom-left vs. bottom-right) is pretty good.

The gain in speed depends on multiple implementation factors. However, as I explained above, this post was inspired by a need to cope with a cubic memory storage problem when doing Gaussian blurs on a 3D buffer. Working with a heavily downsampled buffer clearly helps in that sense. And it is needless to say that decreasing the amount of data to process by `k^3`

also brings a dramatic speed boost, making it possible to use tiny separable convolutions along the 3 (downsampled) axes.

Note that one might pick any downsampling scale factor `k>=1`

. The higher the value of `k`

, the higher the approximation error and the smaller and faster the convolution.

The choice `k=sigma`

offers a good trade-off between approximation error and speed gain as shown above.

## Diaphragm and f-stop - 2014-08-17

**[EDIT] This post was migrated from my blog from 2011…**

This is just another image taken from the Maverick Unit Testing system. The chart displays different polygonal diaphragms at different f-stop values. Doubling the f-stop number, halves the surface light can pass through.

*Polygonal diaphragms and f-stop*

## Fast convolutions (III) - 2014-07-19

**[EDIT] This post was migrated from my blog from 2011…**

Some more remarks about the performance of the convolution methods described so far. I will be leaving the brute-force algorithm out for obvious reasons.

The plot below represents input size (in pixels) in the `x`

axis, and convolution time (in seconds) in the `y`

axis. So, if you go to `x=4096`

that means *“a convolution of a 4096×4096 image by a 4096×4096 kernel”*.

*Even competition between FFT-based vs. separable convolution methods*

Two conclusions can be made from the above plot, which confirm what was explained in my previous post:

- For large kernels (as large as the image itself) the separable convolution method is
`O(n^3)`

and times get to absurd levels very quickly. If you are dealing with large generic images/kernels, the FFT-based method is the way to go. - The FFT-based method uses the Fast Fourier Transform, which is
`O(n^2·log(n))`

thanks to some decomposition technique that requires the size of the input data to be (padded to) a power-of-2. For this reason, it takes the same amount of time to do a convolution on a`257×257`

image/kernel than on a`512×512`

image/kernel, because both cases operate on`512×512`

buffers after all. This is why the graph for the FFT method is stepped. When`x`

crosses a power-of-2, the running time goes up all of a sudden and stays stable until the next power-of-2.

The plot was generated with my current implementation of both convolution methods in MK_api. My FFT uses the Cooley-Tukey algorithm, and everything (i.e., FFT, IFFT, point-wise products, and 1D separable convolutions) makes use of multi-threading. There’s always room for improvement, but the running times seem pretty decent, as we’re speaking of `<2s`

for images up to `4096×4096`

in a 16-thread CPU. An implementation in CUDA would be (orders of magnitude) faster, though. 😛

**[EDIT] It’s been 8 years since this post (2014..2022). Now we use the cuFFT implementation in Maverick, which is blazingly fast.**

*Separable convolution with a fixed 11×11 kernel (in orange)*

A couple more remarks:

- The graph of the FFT-based method wouldn’t change if smaller kernels were used, as the algorithm requires the kernel to be padded to the size of the image. However, the graph of the separable method becomes much less steep when very small kernels are used:
- The running time of the FFT/IFFT is not exactly a trivial subject. Speed in FFT/IFFT algorithms depends not only on the size of the data, but also on the data itself. While generating the above plots, I came across some anomalies in the time measurements. Some kernels or some images produced faster or slower results. Some combinations would even produce non-flat steps in the staircase-looking graph. That’s normal, but worth mentioning.

## Fast convolutions (II) - 2014-07-18

**[EDIT] This post was migrated from my blog from 2011…**

I will analyze the algorithmic complexity of the convolution algorithms described in my previous posts.

To make things simpler, let’s assume that the dimensions of the image are `>=`

the dimensions of the convolution kernel, and that both are square, with dimensions `S·S`

and `s·s`

, respectively.

### Naive algorithm - O(n^4)

*“wxh operations for each of the W·H image pixels”.*

*i.e.,* `S·S·s·s`

operations. This is quadratic with a heavy constant for tiny kernels, but quickly becomes quartic for medium-to-large kernels.

The auxiliary buffer can be identical in size and bit-depth to the original image. So the memory usage factor is 2x.

### Separable convolution - O(n^3)

*“One 1D convolution for each row + One 1D convolution for each column”.*

*i.e.,* `2·S·S·s`

operations. This is quadratic with a bad constant for small kernels, but becomes cubic for large kernels.

Remarkably, the total running time depends on the dimensions of the image -and- the dimensions of the kernel.

Again, the auxiliary buffer can be identical in size and bit-depth to the original image. So the memory usage factor is 2x.

### Convolution theorem - O(n^2·log(n))

*“Two FFTs + One point-wise product + One IFFT”.*

Let’s call `S’`

to the closest power-of-2 such that `S’>=S`

. Then a proper implementation of the FFT/IFFT does (approx.) `2·S’·S’·log(S)`

operations, while the point-wise product does `S’·S’`

operations. This makes the algorithm `O(S’·S’·log(S’))`

with some heavy (quadratic) overhead due to the memory copying, padding, and the point-wise product.

Remarkably, the total running time is independent of the size of the kernel.

This algorithm is quite memory hungry, though, because two `S’·S’`

complex-number buffers are required. This means two floating-point numbers per entry (*i.e.,* the real/imaginary coefficients). The algorithm starts by copying the image/kernel to the real part of the corresponding complex-number buffer, leaving the imaginary coefficient and the surface excess filled with zeroes. Then the FFTs/product/IFFT happen in-place.

So the auxiliary memory required is `4·S’·S’`

floating-point numbers.

In the worst case where `S`

is a power-of-2-plus-1, `S’`

gets nearly twice as large as `S`

. If the original image is 8-bit and we are using single-precision floating-point math for the FFT/IFFT, this means a memory usage factor of 64x. In the case of an HDR (single-precision floating-point) grayscale image, we are speaking of a worst case scenario of 16x. In average, however, the memory usage factor is around 8x. If `S`

is a power-of-2, then the memory usage factor goes down to 4x.

*Heavy glare using the FFT-based method in an HDR image by Paul Debevec*

This image with heavy glare has been output with some ArionFX-related experimental tonemapping code I am working on these days.

### Conclusions:

Assuming that we are only interested in sheer performance:

- The FFT-based method is (by far) the fastest for large images/kernels. Interestingly, the algorithm is not affected by the size of the kernel, which can be as large as the (padded) image itself without a penalty.
- The FFT-based method becomes even faster if the same kernel is applied multiple times. The kernel FFT can be calculated just once, and then be re-used.
- Due to the heavy setup overhead in the FFT-based method, the separable method can be faster for small (separable) kernels where
`s`

is in the range of`log(S’)`

.

Last, but certainly not least, **there is a much faster and more light-weight algorithm for the special case of Box/Gaussian Blur**. I will talk about this in a separate blog entry.