Recently, I was writing an algorithm to solve a coding challenge that involved finding the point in a Cartesian plane with the minimum distance to all of the other points. In Python, the distance function would be expressed as `math.sqrt(dx ** 2 + dy ** 2)`. However, there are several different ways to express each squared term: `dx ** 2`, `math.pow(dx, 2)`, and `dx * dx`. Interestingly, these all perform differently, and I wanted to understand how and why.

Python provides a module called `timeit` for measuring performance, which makes testing these timings rather simple. With `x` set to `2`, we can run timing tests on all three of our options above:
| Expression | Timing (100k iterations) |
| --- | --- |
| `x * x` | 3.87 ms |
| `x ** 2` | 80.97 ms |
| `math.pow(x, 2)` | 83.60 ms |
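The exact timing harness isn't shown here, but a minimal sketch of how numbers like these could be collected with `timeit` (the 100,000-iteration count and the setup strings are my own choices) might look like this:

```python
import timeit

# Each result is the total time, in seconds, for 100,000 executions
# of the statement being measured.
iterations = 100_000

mult_time = timeit.timeit('x * x', setup='x = 2', number=iterations)
exp_time = timeit.timeit('x ** 2', setup='x = 2', number=iterations)
pow_time = timeit.timeit('math.pow(x, 2)', setup='import math; x = 2',
                         number=iterations)

print(f'x * x:          {mult_time * 1000:.2f} ms')
print(f'x ** 2:         {exp_time * 1000:.2f} ms')
print(f'math.pow(x, 2): {pow_time * 1000:.2f} ms')
```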
Python also provides a module called `dis` that disassembles code so we can see what each of these expressions is doing under the hood, which helps in understanding the performance differences.

Using `dis.dis(lambda x: x * x)`, we can see that the following bytecode gets executed:
```
  0 LOAD_FAST                0 (x)
  2 LOAD_FAST                0 (x)
  4 BINARY_MULTIPLY
  6 RETURN_VALUE
```
The program loads `x` twice, runs `BINARY_MULTIPLY`, and returns the value.
Using `dis.dis(lambda x: math.pow(x, 2))`, we can see that the following bytecode gets executed:
```
  0 LOAD_GLOBAL              0 (math)
  2 LOAD_ATTR                1 (pow)
  4 LOAD_FAST                0 (x)
  6 LOAD_CONST               1 (2)
  8 CALL_FUNCTION            2
 10 RETURN_VALUE
```
The `math` module is loaded from the global namespace, and then its `pow` attribute is loaded. Next, both arguments are loaded and the `pow` function is called, which returns the value.
Using `dis.dis(lambda x: x ** 2)`, we can see that the following bytecode gets executed:
```
  0 LOAD_FAST                0 (x)
  2 LOAD_CONST               1 (2)
  4 BINARY_POWER
  6 RETURN_VALUE
```
The program loads `x`, loads `2`, runs `BINARY_POWER`, and returns the value.
With the `math.pow()` bytecode as a point of comparison, multiplication and exponentiation differ from each other in only one instruction: `BINARY_MULTIPLY` versus `BINARY_POWER`.
For integers, `BINARY_MULTIPLY` is ultimately handled by `long_mul`, located here in the Python source code. It does a few interesting things:
```c
long_mul(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    /* fast path for single-digit multiplication */
    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
        return PyLong_FromLongLong((long long)v);
    }

    z = k_mul(a, b);
    /* Negate if exactly one of the inputs is negative. */
    if (((Py_SIZE(a) ^ Py_SIZE(b)) < 0) && z) {
        _PyLong_Negate(&z);
        if (z == NULL)
            return NULL;
    }
    return (PyObject *)z;
}
```
For small numbers (at most one internal digit each), this takes a fast path and multiplies the values directly. For larger values, the function uses `k_mul`, an implementation of Karatsuba multiplication, which is a fast multiplication algorithm for large numbers.
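To give a feel for what `k_mul` is doing, here is a rough pure-Python sketch of Karatsuba multiplication; the real implementation works on CPython's internal 30-bit digits and is far more careful, so treat this only as an illustration of the idea:

```python
def karatsuba(x: int, y: int) -> int:
    # Toy Karatsuba for non-negative integers: replace one multiplication
    # of full-size numbers with three multiplications of half-size numbers.
    if x < 10 or y < 10:  # small enough to multiply directly
        return x * y

    half = max(x.bit_length(), y.bit_length()) // 2
    high_x, low_x = x >> half, x & ((1 << half) - 1)
    high_y, low_y = y >> half, y & ((1 << half) - 1)

    z0 = karatsuba(low_x, low_y)                             # low halves
    z2 = karatsuba(high_x, high_y)                           # high halves
    z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2

    return (z2 << (2 * half)) + (z1 << half) + z0


assert karatsuba(123456789, 987654321) == 123456789 * 987654321
```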
We can see how this function gets called in `ceval.c`:
```c
case TARGET(BINARY_MULTIPLY): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *res = PyNumber_Multiply(left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    DISPATCH();
}
```
Integer exponentiation, on the other hand, is handled by `long_pow`, located here in the Python source code. It also does several interesting things. The source is too long to include in full, which itself partially explains the poorer performance; here are some interesting snippets:
```c
if (Py_SIZE(b) < 0) {  /* if exponent is negative */
    if (c) {
        PyErr_SetString(PyExc_ValueError, "pow() 2nd argument "
                        "cannot be negative when 3rd argument specified");
        goto Error;
    }
    else {
        /* else return a float.  This works because we know
           that this calls float_pow() which converts its
           arguments to double. */
        Py_DECREF(a);
        Py_DECREF(b);
        return PyFloat_Type.tp_as_number->nb_power(v, w, x);
    }
}
```
After creating some pointers, the function checks whether the exponent given is a float or is negative; in those cases it either raises an error or calls a different function to handle the exponentiation.
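We can see the float fallback for negative exponents directly from the interpreter:

```python
>>> 2 ** 3
8
>>> 2 ** -3        # negative exponent: the result is a float
0.125
>>> type(2 ** -3)
<class 'float'>
```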
If neither case applies, the function checks for a third argument, which is always `None` for the `**` operator according to `ceval.c`:
```c
case TARGET(BINARY_POWER): {
    PyObject *exp = POP();
    PyObject *base = TOP();
    PyObject *res = PyNumber_Power(base, exp, Py_None);
    Py_DECREF(base);
    Py_DECREF(exp);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    DISPATCH();
}
```
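The only way to supply a non-`None` third argument is through the built-in `pow()`, which performs modular exponentiation; the `**` operator always passes `Py_None`. For large exponents the difference matters, because the modular version can reduce as it goes instead of building a huge intermediate value:

```python
# ** builds the full 3 ** 100000 result before reducing it...
print((3 ** 100000) % 7)   # -> 4

# ...while pow() with a modulus reduces during the exponentiation.
print(pow(3, 100000, 7))   # -> 4, computed with far less work
```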
Finally, the function defines two helper macros: `REDUCE` for modular reduction and `MULT` for multiplication followed by reduction. The multiplication step uses `long_mul`, which is the same function used by `BINARY_MULTIPLY`.
```c
#define REDUCE(X)                                \
    do {                                         \
        if (c != NULL) {                         \
            if (l_divmod(X, c, NULL, &temp) < 0) \
                goto Error;                      \
            Py_XDECREF(X);                       \
            X = temp;                            \
            temp = NULL;                         \
        }                                        \
    } while(0)

#define MULT(X, Y, result)                       \
    do {                                         \
        temp = (PyLongObject *)long_mul(X, Y);   \
        if (temp == NULL)                        \
            goto Error;                          \
        Py_XDECREF(result);                      \
        result = temp;                           \
        temp = NULL;                             \
        REDUCE(result);                          \
    } while(0)
```
After that, the function uses left-to-right k-ary exponentiation, as described in chapter 14 of the Handbook of Applied Cryptography.
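The k-ary version is fairly involved, but its core is the simpler left-to-right binary ("square-and-multiply") method. A small Python sketch of that idea, with names of my own choosing, looks like this:

```python
def binary_pow(base: int, exp: int) -> int:
    # Left-to-right binary exponentiation: walk the exponent's bits from
    # most significant to least, squaring at every step and multiplying
    # by the base whenever the current bit is set.
    result = 1
    for i in reversed(range(exp.bit_length())):
        result = result * result        # square for every bit
        if (exp >> i) & 1:
            result = result * base      # multiply when the bit is 1
    return result


assert binary_pow(3, 13) == 3 ** 13
```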
We can use the `timeit` library from above to profile code at different values and see how the performance changes as the exponent grows.

To test the performance at different `power` values, we need to generate some functions. Since `**` and `math.pow()` are already built into Python, all we need to do for exponentiation is define a function we can call from inside a `timeit` call:
```python
exponent = lambda base, power: base ** power
```
Since the multiplication chain changes each time the power changes, we need to generate a new multiplication function for every power. To do this, we can generate a string like `x*x*x` and call `eval()` on it to return a function:
```python
def generate_mult_func(n):
    mult_steps = '*'.join(['q'] * n)
    func_string = f'lambda q: {mult_steps}'  # Keep the string so we can print it later
    return eval(func_string), func_string
```
Thus, we can make a `multiply` function like so:
```python
multiply, func_string = generate_mult_func(power)
```
If we call `generate_mult_func(4)`, `multiply` will be a lambda function that looks like this:
```python
lambda q: q*q*q*q
```
Using the code posted here, we can determine at what point `multiply` becomes less efficient than `exponent`.
Starting with these values:
```python
base = 2
power = 2
```
We loop until the time it takes to execute 100,000 iterations of `multiply` is slower than the time it takes to execute 100,000 iterations of `exponent`. Initially, here are the timings, with `math.pow()` serving as a point of comparison:
```
Starting speeds:
Multiply time 11.56 ms
Exponent time 35.82 ms
math.pow time 16.73 ms
```
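The linked code isn't reproduced in full here, but the loop is roughly the following sketch, reusing the `exponent` and `generate_mult_func` definitions from above (the real code differs in its details, such as timing via statement strings rather than callables):

```python
import timeit

base, power = 2, 2
iterations = 100_000

while True:
    multiply, func_string = generate_mult_func(power)

    # timeit also accepts a zero-argument callable, which is convenient
    # when the function being timed changes on every iteration.
    multiply_time = timeit.timeit(lambda: multiply(base), number=iterations)
    exponent_time = timeit.timeit(lambda: exponent(base, power), number=iterations)

    if multiply_time > exponent_time:
        print('Base, power', base, power)
        print('Multiply func', func_string)
        break

    power += 1
```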
When running on repl.it, Python finds the crossover in 1.2s:
```
Crossover found in 1.2 s:
Base, power 2, 15
Multiply time 43.05 ms
Exponent time 39.70 ms
math.pow time 16.42 ms
Multiply func lambda q: q*q*q*q*q*q*q*q*q*q*q*q*q*q*q
```
Thus, chaining multiplication together is faster up through 2^14; at 2^15, exponentiation becomes faster.
Using pandas, we can keep track of the timing (in seconds per 100,000 iterations) at each power:
```
Power  multiply  exponent  math.pow
2      0.011562  0.035822  0.016731
3      0.013043  0.033764  0.014614
4      0.015307  0.032323  0.015349
5      0.015974  0.033678  0.016470
6      0.017917  0.032465  0.015282
7      0.019147  0.034249  0.014993
8      0.020042  0.034530  0.015794
9      0.023667  0.038041  0.016430
10     0.029137  0.038911  0.016717
11     0.032310  0.040869  0.016580
12     0.033338  0.036693  0.014642
13     0.035552  0.037233  0.015178
14     0.037351  0.037806  0.015666
15     0.043046  0.039704  0.016415
```
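For reference, a DataFrame like the one above could be assembled inside the same loop, roughly like this (the column names are mine, chosen to match the printed output):

```python
import math
import timeit

import pandas as pd

base, iterations = 2, 100_000
powers = range(2, 16)

rows = []
for power in powers:
    multiply, _ = generate_mult_func(power)   # defined earlier in the post
    rows.append({
        'multiply': timeit.timeit(lambda: multiply(base), number=iterations),
        'exponent': timeit.timeit(lambda: base ** power, number=iterations),
        'math.pow': timeit.timeit(lambda: math.pow(base, power), number=iterations),
    })

# Index by power so the plot's x-axis is the exponent.
df = pd.DataFrame(rows, index=powers)
df.index.name = 'Power'
```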
From here, it is very simple to generate a line graph:
```python
plot = df.plot().get_figure()
plot.savefig('viz.png')
```
Interestingly, `math.pow()` and `exponent` mostly perform at the same rate, while our `multiply` functions vary wildly. Unsurprisingly, the longer the multiplication chain, the longer it takes to execute.
While the crossover is interesting, this doesn't show what happens at powers larger than 15. Going up through 1000, we get the following trend:
When we zoom in so that `math.pow()` and `exponent` are more pronounced, we see the same performance trend continue:
While the time for `**` gradually increases, `math.pow()` generally executes at around the same speed.
When writing algorithms that use small exponents, which here proved to be less than 15, it is faster to chain multiplication together than to use the `**` exponentiation operator. Additionally, `math.pow()` is more efficient than chained multiplication at powers larger than 5 and is always more efficient than the `**` operator, so there is never a reason to use `**`.
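Applied back to the distance function that started all of this, the takeaway looks like this:

```python
import math

# Squaring with the ** operator: slower for a small, fixed exponent.
def distance_pow(dx, dy):
    return math.sqrt(dx ** 2 + dy ** 2)

# Squaring with chained multiplication: faster for small exponents.
def distance_mul(dx, dy):
    return math.sqrt(dx * dx + dy * dy)
```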
This is also true in JavaScript. Thanks @juliancoleman for this comparison!