Recently, I was writing an algorithm to solve a coding challenge that involved finding a point in a Cartesian plane that had the minimum distance from all of the other points. In Python, the distance function would be expressed as `math.sqrt(dx ** 2 + dy ** 2)`

. However, there are several different ways to express each term: `dx ** 2`

, `math.pow(dx, 2)`

, and `dx * dx`

. Interestingly, these all perform differently, and I wanted to understand how and why.

Python provides a module called `timeit`

to test performance, which makes testing these timings rather simple. With `x`

set to `2`

, we can run timing tests on all three of our options above:

Expression | Timing (100k iterations) |
---|---|

`x * x` |
3.87 ms |

`x ** 2` |
80.97 ms |

`math.pow(x, 2)` |
83.60 ms |

Python also provides a model called `dis`

that disassembles code so we can see what each of these expressions are doing under the hood, which helps in understanding the performance differences.

Using `dis.dis(lambda x: x * x)`

, we can see that the following code gets executed:

```
0 LOAD_FAST 0 (x)
2 LOAD_FAST 0 (x)
4 BINARY_MULTIPLY
6 RETURN_VALUE
```

The program loads `x`

twice, runs `BINARY_MULTIPLY`

, and `return`

s the value.

Using `dis.dis(lambda x: math.pow(x, 2))`

, we can see the following code gets executed:

```
0 LOAD_GLOBAL 0 (math)
2 LOAD_ATTR 1 (pow)
4 LOAD_FAST 0 (x)
6 LOAD_CONST 1 (2)
8 CALL_FUNCTION 2
10 RETURN_VALUE
```

The `math`

module loads from the global space, and then the `pow`

attribute loads. Next, both arguments are loaded and the `pow`

function is called, which `return`

s the value.

Using `dis.dis(lambda x: x ** 2)`

, we can see that the following code gets executed:

```
0 LOAD_FAST 0 (x)
2 LOAD_CONST 1 (2)
4 BINARY_POWER
6 RETURN_VALUE
```

The program loads `x`

, loads `2`

, runs `BINARY_POWER`

, and `return`

s the value.

Using the `math.pow()`

functions as a point of comparison, both multiplication and exponentiation differ in only one part of their bytecode: calling `BINARY_MULTIPLY`

versus calling `BINARY_POWER`

.

This function is located here in the Python source code. It does a few interesting things:

```
long_mul(PyLongObject *a, PyLongObject *b)
{
PyLongObject *z;
CHECK_BINOP(a, b);
/* fast path for single-digit multiplication */
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
return PyLong_FromLongLong((long long)v);
}
z = k_mul(a, b);
/* Negate if exactly one of the inputs is negative. */
if (((Py_SIZE(a) ^ Py_SIZE(b)) < 0) && z) {
_PyLong_Negate(&z);
if (z == NULL)
return NULL;
}
return (PyObject *)z;
}
```

For small numbers, this uses binary multiplication. For larger values, the function uses Karatsuba multiplication, which is a fast multiplication algorithm for larger numbers.

We can see how this function gets called in `ceval.c`

:

```
case TARGET(BINARY_MULTIPLY): {
PyObject *right = POP();
PyObject *left = TOP();
PyObject *res = PyNumber_Multiply(left, right);
Py_DECREF(left);
Py_DECREF(right);
SET_TOP(res);
if (res == NULL)
goto error;
DISPATCH();
}
```

This function is located here in the Python source code. It also does several interesting things:

The source code is too long to fully include, which partially explains the detrimental performance. Here are some interesting snippets:

```
if (Py_SIZE(b) < 0) { /* if exponent is negative */
if (c) {
PyErr_SetString(PyExc_ValueError, "pow() 2nd argument "
"cannot be negative when 3rd argument specified");
goto Error;
}
else {
/* else return a float. This works because we know
that this calls float_pow() which converts its
arguments to double. */
Py_DECREF(a);
Py_DECREF(b);
return PyFloat_Type.tp_as_number->nb_power(v, w, x);
}
}
```

After creating some pointers, the function checks if the `power`

given is a float or is negative, where it either errors or calls a different function to handle exponentiation.

If neither cases hit, it checks for a third argument, which is always `None`

according to `ceval.c`

^{1}:

```
case TARGET(BINARY_POWER): {
PyObject *exp = POP();
PyObject *base = TOP();
PyObject *res = PyNumber_Power(base, exp, Py_None);
Py_DECREF(base);
Py_DECREF(exp);
SET_TOP(res);
if (res == NULL)
goto error;
DISPATCH();
}
```

Finally, the function defines two routines: `REDUCE`

for modular reduction and `MULT`

for multiplication and reduction. The multiplication function uses `long_mul`

for both values, which is the same function used in `BINARY_MULTIPLY`

.

```
#define REDUCE(X)
do {
if (c != NULL) {
if (l_divmod(X, c, NULL, &temp) < 0
goto Error;
Py_XDECREF(X);
X = temp;
temp = NULL;
}
} while(0)
#define MULT(X, Y, result)
do {
temp = (PyLongObject *)long_mul(X, Y);
if (temp == NULL)
goto Error;
Py_XDECREF(result);
result = temp;
temp = NULL;
REDUCE(result);
} while(0)
```

After that, the function uses left-to-right k-ary exponentiation defined in chapter 14.6^{2} of the Handbook of Applied Cryptography:

We can use the `timeit`

library above to profile code at different values and see how the performance changes over time.

To test the performance at different `power`

values, we need to generate some functions.

Since both of these are already in the Python source, all we need to do is define a function for exponentiation we can call from inside a `timeit`

call:

```
exponent = lambda base, power: base ** power
```

Since this changes each time the `power`

changes^{3}, we need to generate a new multiplication function each time the base changes. To do this, we can generate a string like `x*x*x`

and call `eval()`

on it to return a function:

```
def generate_mult_func(n):
mult_steps = '*'.join(['q'] * n)
func_string = f'lambda q: {mult_steps}' # Keep this so we can print later
return eval(func_string), func_string
```

Thus, we can make a `multiply`

function like so:

```
multiply, func_string = generate_mult_func(power)
```

If we call `generate_mult_func(4)`

, `multiply`

will be a lambda function that looks like this:

```
lambda q: q*q*q*q
```

Using the code posted here, we can determine at what point `multiply`

becomes less efficient than `exponent`

.

Staring with these values:

```
base = 2
power = 2
```

We loop until the time it takes to execute `100,000`

iterations of `multiply`

is slower than executing `100,000`

iterations of `exponent`

. Initially, here are the timings, with `math.pow()`

serving as a point of comparison:

```
Starting speeds:
Multiply time 11.56 ms
Exponent time 35.82 ms
math.pow time 16.73 ms
```

When running on repl.it, Python finds the crossover in 1.2s:

```
Crossover found in 1.2 s:
Base, power 2, 15
Multiply time 43.05 ms
Exponent time 39.70 ms
math.pow time 16.42 ms
Multiply func lambda q: q*q*q*q*q*q*q*q*q*q*q*q*q*q*q
```

Thus, chaining multiplication together is faster until our expression gets to `2^14`

; at `2^15`

exponentiation becomes faster.

Using Pandas, we can keep track of the timing at each power:

```
Power multiply exponent math.pow
2 0.011562 0.035822 0.016731
3 0.013043 0.033764 0.014614
4 0.015307 0.032323 0.015349
5 0.015974 0.033678 0.016470
6 0.017917 0.032465 0.015282
7 0.019147 0.034249 0.014993
8 0.020042 0.034530 0.015794
9 0.023667 0.038041 0.016430
10 0.029137 0.038911 0.016717
11 0.032310 0.040869 0.016580
12 0.033338 0.036693 0.014642
13 0.035552 0.037233 0.015178
14 0.037351 0.037806 0.015666
15 0.043046 0.039704 0.016415
```

From here, it is very simple to generate a line graph:

```
plot = df.plot().get_figure()
plot.savefig('viz.png')
```

Interestingly, `math.pow()`

and `exponent`

mostly perform at the same rate, while our `multiply`

functions vary wildly. Unsurprisingly, the longer the multiplication chain, the longer it takes to execute.

While the crossover is interesting, this doesnâ€™t show what happens at powers larger than 15. Going up through `1000`

, we get the following trend:

When we zoom in so that `math.pow()`

and `exponent`

are more pronounced, we see the same performance trend continue:

While using `**`

the time gradually increases, `math.pow()`

generally has executes at around the same speed.

When writing algorithms that use small exponents, here proved less than `15`

, it is faster to chain multiplication together than to use the `**`

exponentiation operator. Additionally, `math.pow()`

is more efficient than chained multiplication at powers larger than `5`

and always more efficient than the `**`

operator, so there is never a reason to use `**`

.

Additionally, this is also true in JavaScript^{4}. Thanks @juliancoleman for this comparison!

Discussion: r/Python, Hacker News | View as: PDF, Markdown