random number generation, part 1

From:

Rik

Subject:

Date:

Wed, 18 Dec 2019 16:02:06 -0800

On 12/17/2019 01:54 PM, address@hidden wrote:

Just for the sake of comparison here is the code used in numpy, the
convention (and seems to be widespread is to sample [0,1) )

This is also the convention used by C++. The documentation for the uniform random number generator at http://www.cplusplus.com/reference/random/uniform_real_distribution/ says "This distribution (also known as rectangular distribution) produces random numbers in a range [a,b) where all intervals of the same length within it are equally probable."

However, we have no freedom to follow this convention. The Mathworks has decreed that the range shall not include the endpoints. See https://www.mathworks.com/help/matlab/ref/rand.html where it is stated, "X = rand returns a single uniformly distributed random number in the interval (0,1)."

https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/random/distributions.h
https://github.com/numpy/numpy/blob/master/numpy/random/src/distributions/distributions.c

https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/random/bitgen.h

There we see

static NPY_INLINE float next_float(bitgen_t *bitgen_state) {
return (next_uint32(bitgen_state) >> 9) * (1.0f / 8388608.0f);
}

A good idea to compare to existing implementations. In this case, 32 bits of random data are generated, 9 low-order bits are discarded by shifting left, leaving 23 bits of valid data, and then it is normalized by dividing by 2^23. I believe that this unnecessarily discards one bit of information.

The IEEE single-precision (https://en.wikipedia.org/wiki/Single-precision_floating-point_format) stores 23 bits of mantissa, but because it uses a biased exponent it actually has 24-bits of effective mantissa. Hence, my implementation used 24 bits.

The other difference between my implementation and theirs is simply the selection of random bits. Do you choose randi32[31:8] or randi32[23:0]? All of the bits are random so the choice becomes which strategy is more efficient. I used AND masking

i = randi32 () & static_cast<uint32_t> (0xFFFFFF);

to get the lowest 24 bits. For this sort of thing you really need to know how your hardware is going to behave. There are dedicated souls who have measured the number of clock cycles per instruction for various architectures (see www.agner.org/optimize/instruction_tables.pdf). I chose, as reasonably representative of a modern computer, an Intel Nehalem-architecture CPU. Both AND with immediate data and SHR (SHift Right) are single clock-cycle instructions. However, because of parallel ALUs, when considered in a pipeline, the AND instruction takes 0.33 clock cycles while the SHR takes 0.5 clock cycles. Thus a slight preference for the coding strategy I used but nothing particularly definitive.

Dan mentioned that the algorithm doesn't normalize using division by a constant, but rather multiplication by a constant.

return i * (1.0f / 16777216.0f)He also mentioned vagaries in floating point hardware units. Both points are valid. Using the same data source as above, the FMUL (Floating point MULtiply) has a latency of 5 clock cycles, but can be fully pipelined so that sequential multiplies can be made every clock cycle. The FDIV (Floating point DIVision) instruction has a variable latency of 7-27 cycles and can't be pipelined effectively. Thus, multiplication is the preferred strategy and any decent compiler will optimize the division in to a single compile-time constant and then use multiplication.

Dan also mentioned directly copying bits from the integer to the floating point value. This is possible to code, but the reason to do this would be for performance rather than correctness. Right now, I'm more worried about correctness, and I would lean away from introducing further changes that could bring their own set of issues. For instance, which bits to set will depend on whether the processor is big-endian or little-endian, or even hermaphroditic like ARM CPUs that can switch on the fly. That could be an optimization for Octave 7, or we could just switch to using the C++ <random> libraries instead.

Gene suggested a way for getting rid of the while loop which tests for 0 with

return ((randi32 () & uint32_t (0x00FFFFFE)) + 1.f) / std::exp2 (std::numeric_limits<float>::digits);

The problem here is that this spreads values over only 2^23 floating point values rather the larger, but still possible, 2^24. To see this, consider that the mask 0xFF_FF_FE has only 23 bits and that the addition of 1.f is equivalent to forcibly setting the least-significant bit (i | 0x01). The values generated will always be odd (1, 3, 5) rather than also including the evens (1,2,3,4, ...).

So, after the very useful input, I think this code should work
static float randu32 (void) { uint32_t i; do { i = randi32 () & static_cast<uint32_t> (0xFFFFFF); } while (i == 0); return i * (1.0f / 16777216.0f);

which is nearly the same as before, except it removes the unnecessary exclusion of i == 0xFFFFFF which several people pointed out.

--Rik

[Prev in Thread]

Current Thread

[Next in Thread]

random number generation, part 1, Rik <=

Re: random number generation, part 1, Daniel J Sebald, 2019/12/18