Recently, I was writing an algorithm to solve a coding challenge that involved finding the point in a Cartesian plane with the minimum distance to all of the other points. In Python, the distance function would be expressed as `math.sqrt(dx ** 2 + dy ** 2)`. However, there are several different ways to express each squared term: `dx ** 2`, `math.pow(dx, 2)`, and `dx * dx`. Interestingly, these all perform differently, and I wanted to understand how and why.

Python provides a module called `timeit` for measuring performance, which makes testing these timings rather simple. With `x` set to `2`, we can run timing tests on all three of our options above:
| Expression | Timing (100k iterations) |
| --- | --- |
| `x * x` | 3.87 ms |
| `x ** 2` | 80.97 ms |
| `math.pow(x, 2)` | 83.60 ms |
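The exact timing harness isn't shown here, but a minimal sketch of how numbers like these could be collected with `timeit` (the 100,000-iteration count and the setup strings are my own choices) might look like this:

```python
import timeit

# Each result is the total time, in seconds, for 100,000 executions
# of the statement being measured.
iterations = 100_000

mult_time = timeit.timeit('x * x', setup='x = 2', number=iterations)
exp_time = timeit.timeit('x ** 2', setup='x = 2', number=iterations)
pow_time = timeit.timeit('math.pow(x, 2)', setup='import math; x = 2',
                         number=iterations)

print(f'x * x:          {mult_time * 1000:.2f} ms')
print(f'x ** 2:         {exp_time * 1000:.2f} ms')
print(f'math.pow(x, 2): {pow_time * 1000:.2f} ms')
```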
Python also provides a module called `dis` that disassembles code so we can see what each of these expressions is doing under the hood, which helps in understanding the performance differences.

Using `dis.dis(lambda x: x * x)`, we can see that the following bytecode gets executed:
```
  0 LOAD_FAST                0 (x)
  2 LOAD_FAST                0 (x)
  4 BINARY_MULTIPLY
  6 RETURN_VALUE
```
The program loads `x` twice, runs `BINARY_MULTIPLY`, and returns the value.
Using `dis.dis(lambda x: math.pow(x, 2))`, we can see that the following bytecode gets executed:
```
  0 LOAD_GLOBAL              0 (math)
  2 LOAD_ATTR                1 (pow)
  4 LOAD_FAST                0 (x)
  6 LOAD_CONST               1 (2)
  8 CALL_FUNCTION            2
 10 RETURN_VALUE
```
The `math` module is loaded from the global namespace, and then its `pow` attribute is loaded. Next, both arguments are loaded and the `pow` function is called, which returns the value.
Using `dis.dis(lambda x: x ** 2)`, we can see that the following bytecode gets executed:
```
  0 LOAD_FAST                0 (x)
  2 LOAD_CONST               1 (2)
  4 BINARY_POWER
  6 RETURN_VALUE
```
The program loads `x`, loads `2`, runs `BINARY_POWER`, and returns the value.
With the `math.pow()` bytecode as a point of comparison, multiplication and exponentiation differ from each other in only one instruction: `BINARY_MULTIPLY` versus `BINARY_POWER`.
For integers, `BINARY_MULTIPLY` is ultimately handled by `long_mul`, located here in the Python source code. It does a few interesting things:
```c
long_mul(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    /* fast path for single-digit multiplication */
    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
        return PyLong_FromLongLong((long long)v);
    }

    z = k_mul(a, b);
    /* Negate if exactly one of the inputs is negative. */
    if (((Py_SIZE(a) ^ Py_SIZE(b)) < 0) && z) {
        _PyLong_Negate(&z);
        if (z == NULL)
            return NULL;
    }
    return (PyObject *)z;
}
```
For small numbers (at most one internal digit each), this takes a fast path and multiplies the values directly. For larger values, the function uses `k_mul`, an implementation of Karatsuba multiplication, which is a fast multiplication algorithm for large numbers.
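To give a feel for what `k_mul` is doing, here is a rough pure-Python sketch of Karatsuba multiplication; the real implementation works on CPython's internal 30-bit digits and is far more careful, so treat this only as an illustration of the idea:

```python
def karatsuba(x: int, y: int) -> int:
    # Toy Karatsuba for non-negative integers: replace one multiplication
    # of full-size numbers with three multiplications of half-size numbers.
    if x < 10 or y < 10:  # small enough to multiply directly
        return x * y

    half = max(x.bit_length(), y.bit_length()) // 2
    high_x, low_x = x >> half, x & ((1 << half) - 1)
    high_y, low_y = y >> half, y & ((1 << half) - 1)

    z0 = karatsuba(low_x, low_y)                             # low halves
    z2 = karatsuba(high_x, high_y)                           # high halves
    z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2

    return (z2 << (2 * half)) + (z1 << half) + z0


assert karatsuba(123456789, 987654321) == 123456789 * 987654321
```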
We can see how this function gets called in `ceval.c`:
```c
case TARGET(BINARY_MULTIPLY): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *res = PyNumber_Multiply(left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    DISPATCH();
}
```
Integer exponentiation, on the other hand, is handled by `long_pow`, located here in the Python source code. It also does several interesting things. The source is too long to include in full, which itself partially explains the poorer performance; here are some interesting snippets:
```c
if (Py_SIZE(b) < 0) {  /* if exponent is negative */
    if (c) {
        PyErr_SetString(PyExc_ValueError, "pow() 2nd argument "
                        "cannot be negative when 3rd argument specified");
        goto Error;
    }
    else {
        /* else return a float.  This works because we know
           that this calls float_pow() which converts its
           arguments to double. */
        Py_DECREF(a);
        Py_DECREF(b);
        return PyFloat_Type.tp_as_number->nb_power(v, w, x);
    }
}
```
After creating some pointers, the function checks whether the exponent given is a float or is negative; in those cases it either raises an error or calls a different function to handle the exponentiation.
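We can see the float fallback for negative exponents directly from the interpreter:

```python
>>> 2 ** 3
8
>>> 2 ** -3        # negative exponent: the result is a float
0.125
>>> type(2 ** -3)
<class 'float'>
```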
If neither case applies, the function checks for a third argument, which is always `None` for the `**` operator according to `ceval.c`:
```c
case TARGET(BINARY_POWER): {
    PyObject *exp = POP();
    PyObject *base = TOP();
    PyObject *res = PyNumber_Power(base, exp, Py_None);
    Py_DECREF(base);
    Py_DECREF(exp);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    DISPATCH();
}
```
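The only way to supply a non-`None` third argument is through the built-in `pow()`, which performs modular exponentiation; the `**` operator always passes `Py_None`. For large exponents the difference matters, because the modular version can reduce as it goes instead of building a huge intermediate value:

```python
# ** builds the full 3 ** 100000 result before reducing it...
print((3 ** 100000) % 7)   # -> 4

# ...while pow() with a modulus reduces during the exponentiation.
print(pow(3, 100000, 7))   # -> 4, computed with far less work
```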
Finally, the function defines two helper macros: `REDUCE` for modular reduction and `MULT` for multiplication followed by reduction. The multiplication step uses `long_mul`, which is the same function used by `BINARY_MULTIPLY`.
```c
#define REDUCE(X)                                \
    do {                                         \
        if (c != NULL) {                         \
            if (l_divmod(X, c, NULL, &temp) < 0) \
                goto Error;                      \
            Py_XDECREF(X);                       \
            X = temp;                            \
            temp = NULL;                         \
        }                                        \
    } while(0)

#define MULT(X, Y, result)                       \
    do {                                         \
        temp = (PyLongObject *)long_mul(X, Y);   \
        if (temp == NULL)                        \
            goto Error;                          \
        Py_XDECREF(result);                      \
        result = temp;                           \
        temp = NULL;                             \
        REDUCE(result);                          \
    } while(0)
```
After that, the function uses left-to-right k-ary exponentiation, as described in chapter 14 of the Handbook of Applied Cryptography.
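The k-ary version is fairly involved, but its core is the simpler left-to-right binary ("square-and-multiply") method. A small Python sketch of that idea, with names of my own choosing, looks like this:

```python
def binary_pow(base: int, exp: int) -> int:
    # Left-to-right binary exponentiation: walk the exponent's bits from
    # most significant to least, squaring at every step and multiplying
    # by the base whenever the current bit is set.
    result = 1
    for i in reversed(range(exp.bit_length())):
        result = result * result        # square for every bit
        if (exp >> i) & 1:
            result = result * base      # multiply when the bit is 1
    return result


assert binary_pow(3, 13) == 3 ** 13
```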
We can use the `timeit` library from above to profile code at different values and see how the performance changes as the exponent grows.

To test the performance at different `power` values, we need to generate some functions. Since `**` and `math.pow()` are already built into Python, all we need to do for exponentiation is define a function we can call from inside a `timeit` call:
```python
exponent = lambda base, power: base ** power
```
Since the multiplication chain changes each time the power changes, we need to generate a new multiplication function for every power. To do this, we can generate a string like `x*x*x` and call `eval()` on it to return a function:
```python
def generate_mult_func(n):
    mult_steps = '*'.join(['q'] * n)
    func_string = f'lambda q: {mult_steps}'  # Keep the string so we can print it later
    return eval(func_string), func_string
```
Thus, we can make a `multiply` function like so:
```python
multiply, func_string = generate_mult_func(power)
```
If we call `generate_mult_func(4)`, `multiply` will be a lambda function that looks like this:
```python
lambda q: q*q*q*q
```
Using the code posted here, we can determine at what point `multiply` becomes less efficient than `exponent`.
Starting with these values:
```python
base = 2
power = 2
```
We loop until the time it takes to execute 100,000 iterations of `multiply` is slower than the time it takes to execute 100,000 iterations of `exponent`. Initially, here are the timings, with `math.pow()` serving as a point of comparison:
```
Starting speeds:
Multiply time 11.56 ms
Exponent time 35.82 ms
math.pow time 16.73 ms
```
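The linked code isn't reproduced in full here, but the loop is roughly the following sketch, reusing the `exponent` and `generate_mult_func` definitions from above (the real code differs in its details, such as timing via statement strings rather than callables):

```python
import timeit

base, power = 2, 2
iterations = 100_000

while True:
    multiply, func_string = generate_mult_func(power)

    # timeit also accepts a zero-argument callable, which is convenient
    # when the function being timed changes on every iteration.
    multiply_time = timeit.timeit(lambda: multiply(base), number=iterations)
    exponent_time = timeit.timeit(lambda: exponent(base, power), number=iterations)

    if multiply_time > exponent_time:
        print('Base, power', base, power)
        print('Multiply func', func_string)
        break

    power += 1
```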
When running on repl.it, Python finds the crossover in 1.2s:
```
Crossover found in 1.2 s:
Base, power 2, 15
Multiply time 43.05 ms
Exponent time 39.70 ms
math.pow time 16.42 ms
Multiply func lambda q: q*q*q*q*q*q*q*q*q*q*q*q*q*q*q
```
Thus, chaining multiplication together is faster up through 2^14; at 2^15, exponentiation becomes faster.
Using pandas, we can keep track of the timing (in seconds per 100,000 iterations) at each power:
```
Power  multiply  exponent  math.pow
2      0.011562  0.035822  0.016731
3      0.013043  0.033764  0.014614
4      0.015307  0.032323  0.015349
5      0.015974  0.033678  0.016470
6      0.017917  0.032465  0.015282
7      0.019147  0.034249  0.014993
8      0.020042  0.034530  0.015794
9      0.023667  0.038041  0.016430
10     0.029137  0.038911  0.016717
11     0.032310  0.040869  0.016580
12     0.033338  0.036693  0.014642
13     0.035552  0.037233  0.015178
14     0.037351  0.037806  0.015666
15     0.043046  0.039704  0.016415
```
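For reference, a DataFrame like the one above could be assembled inside the same loop, roughly like this (the column names are mine, chosen to match the printed output):

```python
import math
import timeit

import pandas as pd

base, iterations = 2, 100_000
powers = range(2, 16)

rows = []
for power in powers:
    multiply, _ = generate_mult_func(power)   # defined earlier in the post
    rows.append({
        'multiply': timeit.timeit(lambda: multiply(base), number=iterations),
        'exponent': timeit.timeit(lambda: base ** power, number=iterations),
        'math.pow': timeit.timeit(lambda: math.pow(base, power), number=iterations),
    })

# Index by power so the plot's x-axis is the exponent.
df = pd.DataFrame(rows, index=powers)
df.index.name = 'Power'
```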
From here, it is very simple to generate a line graph:
```python
plot = df.plot().get_figure()
plot.savefig('viz.png')
```
Interestingly, `math.pow()` and `exponent` mostly perform at the same rate, while our `multiply` functions vary wildly. Unsurprisingly, the longer the multiplication chain, the longer it takes to execute.
While the crossover is interesting, this doesn't show what happens at powers larger than 15. Going up through 1000, we get the following trend:
When we zoom in so that `math.pow()` and `exponent` are more pronounced, we see the same performance trend continue:
While the time for `**` gradually increases, `math.pow()` generally executes at around the same speed.
When writing algorithms that use small exponents, which here proved to be less than 15, it is faster to chain multiplication together than to use the `**` exponentiation operator. Additionally, `math.pow()` is more efficient than chained multiplication at powers larger than 5 and is always more efficient than the `**` operator, so there is never a reason to use `**`.
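Applied back to the distance function that started all of this, the takeaway looks like this:

```python
import math

# Squaring with the ** operator: slower for a small, fixed exponent.
def distance_pow(dx, dy):
    return math.sqrt(dx ** 2 + dy ** 2)

# Squaring with chained multiplication: faster for small exponents.
def distance_mul(dx, dy):
    return math.sqrt(dx * dx + dy * dy)
```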
This is also true in JavaScript. Thanks @juliancoleman for this comparison!