In solving data analytics problems, occasionally we must bin data into intervals or categories based on some arbitrary stratification. A common example outside of the world of computer science is letter grades, a concept that can be abstracted1 to:
def grade(score):
grades = ['F', 'D', 'C', 'B', 'A']
if score >= 100:
return 'A'
if score <= 50:
return 'F'
return grades[(score - 50) // 10]
Thus, for:
scores = [88, 72, 61, 39, 97]
results = [grade(score) for score in scores]
We get:
['B', 'C', 'D', 'F', 'A']
However, if any of our score
values2 are not integers we will encounter a TypeError
:
>>> grade(61.2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
TypeError: list indices must be integers or slices, not float
This is caused by the following type coercion, as lists cannot be indexed by floats:
>>> (61.2 - 50) // 2
1.0
Of course, we can cast to an int:
def grade(score):
grades = ['F', 'D', 'C', 'B', 'A']
if score > 99:
return 'A'
if score <= 50:
return 'F'
return grades[(int(score) - 50) // 10]
However, for large datasets this is not an efficient operation:
chris% python -m timeit 'int(102.3 // 2)'
5000000 loops, best of 5: 99.6 nsec per loop
chris% python -m timeit '102.3 // 2'
50000000 loops, best of 5: 7.05 nsec per loop
In my real-world example, the score
equivalent is always a float, so we can map the float integer values to their respective categories3, similar to:
>>> grades
{0.0: 'F', 1.0: 'D', ...}
>>> grades[0]
'F'
>>> grades[0.]
'F'
>>> grades[1]
'D'
>>> grades[1.]
'D'
However, I was still curious why floor division was returning a float and not an integer.
According to Concrete Mathematics4:
βxβ = the greatest integer less than or equal to x
βxβ = the least integer greater than or equal to x
Note the key word integer; while this may be the official mathematical definition, Python does not seem to follow this to the letter and returns a type-coerced value.
It is documented when types are coerced. From the Python docs:
The same is true for binary operators:
The / (division) and // (floor division) operators yield the quotient of their arguments. The numeric arguments are first converted to a common type. Division of integers yields a float, while floor division of integers results in an integer; the result is that of mathematical division with the βfloorβ function applied to the result.
The float divmod
work occurs in float_floor_div
, which comes from the cpython source. All this function does is call float_divmod
and dereferences the quotient:
static PyObject *
float_floor_div(PyObject *v, PyObject *w)
{
PyObject *t, *r;
t = float_divmod(v, w);
if (t == NULL || t == Py_NotImplemented)
return t;
assert(PyTuple_CheckExact(t));
r = PyTuple_GET_ITEM(t, 0);
Py_INCREF(r);
Py_DECREF(t);
return r;
}
The bulk of the work occurs inside of the called function, float_divmod
.
The source for float_divmod
lives here. The first step of this function handles the type cast:
float_divmod(PyObject *v, PyObject *w)
{
double vx, wx;
double div, mod, floordiv;
CONVERT_TO_DOUBLE(v, vx);
CONVERT_TO_DOUBLE(w, wx);
...
Where vx
is the dividend, and wx
is the divisor. The rest of this function handles the various cases that can occur:
If the divisor is zero, we raise an error:
...
if (wx == 0.0) {
PyErr_SetString(PyExc_ZeroDivisionError, "float divmod()");
return NULL;
}
...
Before we can do any work, we must calculate the quotient and dividend:
mod = fmod(vx, wx);
div = (vx - mod) / wx;
We get the modulus using the fmod
function from the C Standard Library, and the quotient using float division, also leveraging the C standard library.
Once we have these values, we check to ensure the sign is correct.
...
if (mod) {
if ((wx < 0) != (mod < 0)) {
mod += wx;
div -= 1.0;
}
}
...
If the remainder (i.e., the modulus) is zero, we copy the sign to the divisor with copysign(0.0, wx)
5.
If the quotient is not zero, we call the floor
function:
...
if (div) {
floordiv = floor(div);
if (div - floordiv > 0.5)
floordiv += 1.0;
}
...
If the dividend is zero, we copy the sign of the quotient onto zero with copysign(0.0, vx / wx)
.
Once all of this has completed, we return a tuple like (floor_qotient, modulus)
.
As Python is a mature language, the changes that affected these behaviors underwent much debate.
Since the proposal of the Python 3 change to the division operator in PEP 238, it was decided that the result of floor division with floats would be a float, following the above arithmetic conversions:
In particular, if a and b are both ints or longs, the result has the same type and value as for classic division on these types (including the case of mixed input types;
int // long
andlong // int
will both return a long).For floating point inputs, the result is a float. For example:
3.5 // 2.0 == 1.0
This PEP defined the changes to the numerical stack in Python and explicitly noted the following:
__floor__(self)
, called frommath.floor(x)
, which returns the greatestIntegral <= x
.
Thus, it would follow that the floor
call in float_divmod
should return an integer and not a float value. However, this is not the case.
Thirteen years later, Alexander Belopolsky reported Issue 22444: Floor divide should return int:
PEP 3141 defines floor division as
floor(x/y)
and specifies thatfloor()
should return int type. Builtin float type has been made part of the PEP 3141 numerical tower, but floor division of two floats still results in a float.
In this thread, my exact issue accessing a list index above was raised:
This is one of the common uses of floor division - to find an index of a cell in a regular grid:
(x - start) // step
. In this situation, it is convenient to have the result ready to be used as an index without a cast.
However, the decision to not make this change ended with Raymond Hettinger's reply:
The current behavior has been around for a long time and is implemented in several modules, including decimal
and fraction
. As core devs, we need to keep focused on a priority of making the language stable (not making changes that truly necessary and invalidating all previously published material) and more importantly not adding yet more obstacles to converting from Python 2 to Python 3 (which Guido has called "death by a thousand cuts").
The current behavior can be useful it that it allows floor division operations without unexpected type conversions occurring in the middle of an expression. We really don't want to break those use cases.
Python's philosophy means the language provides high-level access to data structures. Here, the coercion avoids dealing with the nuances of numeric types that one would have to wrangle with when writing C. In C, given the different numerical types, you have to return a double, because the range of a double is so much larger than that of an integer.
These decisions were made during the early days of the Python 3 transition, where the core developers wanted to minimize the friction of the upgrade from 2 to 3. As a result, to not break code that depended on floor division being type preserving6, it was decided that floor division should return a value equal to an integer7 but agnostic of type.
In a post-Python 2 era, these concerns do not hold the same weight. This is an example of some technical debt that Python has accrued, as other parts of Python do not behave this way:
>>> type(timedelta(2.2) // timedelta(3))
<class 'int'>
>>> type(Fraction(2.3) // Fraction(1))
<class 'int'>
Python makes sacrifices to retain its beauty and simplicity, this being one of them. As a result, the programmer, in this instance, is unable to choose the functionality that best addresses a specific problem. A possible solution would be to provide both floor()
and float_floor()
for accessing these values in both types. Also possible would be to make lists accessible by the float counterparts of their integer indices, since 1 == 1.0
.
Further, the language is far more performant as a result of this decision: float // any == float
is a computationally simple and fast operation; float // any == int
is significantly more complicated and expensive, especially for large inputs due to the inefficient cast from float back to int.
chris% python -m timeit 'int(102.3 // 2)'
5000000 loops, best of 5: 99.6 nsec per loop
chris% python -m timeit '102.3 // 2'
50000000 loops, best of 5: 7.05 nsec per loop
This performance hit does not stem from the float conversion itself 8 but rather from the fact that integer and float values are stored in different registers: the CPU must first store the float from the FPU register in memory, then read that memory address into the int register. This is a classic example of a load-hit-store stall.
Since we can combine types in this way in Python, we need to be cognizant of the pitfalls that we can run into when writing polymorphic functions. Most of the time, the type of number will not matter to the Python interpreter. When type does matter, understanding how these data are handled internally leads to a smoother development process.
Discussion: Hacker News, Reddit
This is an oversimplification of the problem I was solving but serves to demonstrate the issue at hand. ↩︎
ex 61 -> 61.2
↩︎
Additionally, this means we can lookup based on integer or float keys, because in Python, 1.0 == 1
, thus hash(1.0) == hash(1)
. ↩︎
Graham, Ronald Lewis, et al. Concrete Mathematics: a Foundation for Computer Science. 2nd ed., Addison-Wesley, 2017. P 67. PDF. ↩︎
Like it was in Python 2. ↩︎
As mentioned above, 1.0 == 1
. ↩︎
The conversion operation just truncates at the decimal. ↩︎