Implementing the reals


The most common way to implement "real" numbers on computers nowadays is by using floating-point numbers. They are defined by three values, the mantissa $c$, the exponent $E$, and a single sign bit $s$, so that any number is represented as $$x = (-1)^{s} \times c \times b^E$$

with $b$ the basis used, generally either $2$ or $10$. These days, most computers use the IEEE Standard for Binary Floating-Point Arithmetic, which uses the basis $b = 2$, the exponent as a signed integer, and a mantissa expressed as a binary decimal number

$$c = c_0.c_1c_2c_3...$$ or in other words $$c = \sum_{i = 0}^n c_i 2^{-i}$$

There are many more characteristics to the IEEE 754 standard, but these are the basic mathematical notions involved for it. It has many benefits, hence its common use, but also suffer from a few drawbacks :

An alternative to floating-point arithmetics is fixed-point arithmetics, which is simply the same representation as integers, extended to decimals, so that two arrays of bits $c$ and $d$ will represent the number

$$x = \sum_{i = 0}^n 2^{c_i} + \sum_{i = 0}^m 2^{-d_i}$$

Unlike floating-point numbers, fixed-point numbers are evenly distributed and have no redundant values, although they trade this for a much shorter precision. While a 32 bits floating-point numbers, with 8 bits dedicated to the exponent, will range from order of $2^{-126}$ to $2^{127}$, a fixed-point number with 16 bits dedicated to the integer part and 16 to decimals, will range between the orders of $2^{-15}$ and $2^{15}$. They also still suffer from the same rounding problem.

1. Rationals as pairs of integers

The simplest way to implement rationals exactly is to represent them simply as the ratio of two integers, one signed and one unsigned.
struct fraction { int numerator; unsigned int denominator; }; { numerator; denominator; };
The basic operations then work simply in the usual manner on fractions, except for the case of the division, which must be treated with care : if we chose an unsigned integer for the denominator, the denominator of the division will possibly include a multiplication by a negative number.
fraction +(fraction a, fraction b) { return fraction( a.denominator * b.numerator + a.numerator * b.denominator , a.denominator * b.denominator); }; fraction -(fraction a, fraction b) { return fraction( a.denominator * b.numerator - a.numerator * b.denominator , a.denominator * b.denominator); }; fraction *(fraction a, fraction b) { return fraction( a.numerator* b.numerator, a.denominator * b.denominator); }; fraction /(fraction a, fraction b) { return fraction( a.numerator* b.denominator, a.denominator * b.numerator); }; stuff
The obvious drawback of this representation is that a number can be represented by many different pairs, such as $(1,2)$, $(2,4)$, $(4,8)$, etc. As it is, two fractions can't be directly compared, we first need to reduce the fraction.
int fraction::gcd(int a, int b) { return b?gcd(b, a%b):a; } void fraction::reduce() { int c = gcd(this->num, this->den); this->num /= c; this->den /= c; if(this->den < 0){this->den *= -1; this->num *= -1;} } stuff

Rationals as a pair of integer and fraction

$$q = n + \frac{1}{d}$$

Rationals as a Gödel encoding

The Stern-Brocot tree

The Calkin-Wilf tree


Last updated : 2018-02-18 17:23:16
Tags : arithmetics