Efficient pandas
I’ve been using pandas for years now, both for quick data analysis and for more complex data processing. I probably discovered it around 2010.
Still, writing correct and efficient pandas code can be tricky - I see this a lot in colleagues who are new to the library. Here is a short note with the main ideas to keep in mind in order to write fast code with pandas:
- Try to avoid explicit loops over (subsets of) rows.
- Try to do as much as you can with column-wise operations, so that pandas can be smart, vectorize whatever it can, and avoid expensive Python-level operations (see the small sketch right after this list).
- If there is a pandas built-in method or function that does precisely what you need, use it! It could be orders of magnitude faster than custom code.
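For instance, here is a minimal sketch of the difference between looping over rows and using a single column-wise operation (the toy dataframe and column names below are made up for illustration):

import numpy as np
import pandas as pd

# A made-up toy frame, just to contrast the two styles
toy = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

# Slow: an explicit Python-level loop over the rows
totals = [row["a"] + row["b"] for _, row in toy.iterrows()]

# Fast: a single vectorized, column-wise operation
totals = toy["a"] + toy["b"]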
A number of recipes are collected in the Cookbook on the official website. Other good pages in the documentation are those on indexing, grouping, text data (especially the magic .str accessor), and time series/dates. For the interested reader I’d also recommend Tom Augspurger’s great series of blog posts on “modern” pandas here – it’s 4 years old but it still contains a lot of great stuff.
Also, I won’t comment here on common pitfalls around the correctness of pandas code. That would deserve its own post.
Efficient apply
Calling apply row by row, with df.apply(f, axis=1), is no faster than looping over the rows, so it’s rather slow. True, it’s highly flexible, because the argument passed to f is the entire row, where you can access multiple fields at the same time. Vice versa, df.apply(f) will try to apply f column-wise, to each column (it can be used with any function that works on a column/vector), which is faster but clearly less flexible.
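As a quick illustration of that flexibility, a row-wise apply can combine several columns in one expression (on a dataframe df with hypothetical columns "price" and "quantity"):

# Row-wise apply sees the whole row, so it can mix columns freely
df.apply(lambda row: row["price"] * row["quantity"], axis=1)

(Of course, in this particular case df["price"] * df["quantity"] would be both faster and clearer.)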
Example:
def slow_apply(dat):
    return dat.apply(lambda row: np.sqrt(row["a"]), axis=1)

def fast_apply(dat):
    return dat["a"].apply(np.sqrt)

def faster_apply(dat):
    return np.sqrt(dat["a"])
> len(df)
14378
> %time slow_apply(df)
CPU times: user 484 ms, sys: 18.7 ms, total: 503 ms
Wall time: 503 ms
> %time fast_apply(df)
CPU times: user 994 µs, sys: 87 µs, total: 1.08 ms
Wall time: 532 µs
> %time faster_apply(df)
CPU times: user 653 µs, sys: 32 µs, total: 0.685 ms
Wall time: 386 µs
Roughly a 1000x difference, and even slightly better if we avoid apply completely*.
(*A nice but tricky thing about pandas: applying a numpy function to a pandas object returns a pandas object. In this case, the return type of np.sqrt(col) is again a pd.Series, not a np.ndarray! Personal preference: I’d rather take the (slightly) slower version and be more explicit by using apply(np.sqrt). All readers will understand that the output is going to be a pd.Series.)
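A quick sketch of that behaviour, on a throwaway Series:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["x", "y", "z"])
out = np.sqrt(s)   # numpy ufunc applied to a pandas object
print(type(out))   # <class 'pandas.core.series.Series'>, with the index preserved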
Efficient grouping
Iterating over the groups of a dataframe can be OK for quick data analysis or development; however, if you need to process a large number of groups (say, feature extraction on a big dataset, or something that needs to run in production), it can be extremely slow if not done right.
One should rather use groupby followed by .agg(), .transform(), or .apply(). In particular, transform is useful when you want to group, compute something per group, and then broadcast the (same) result back to all elements of each group. Note, however, that transform functions are applied to one column at a time.
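Before the full example, here is a minimal sketch (on a made-up toy frame) of how agg and transform differ in the shape of their output:

import pandas as pd

toy = pd.DataFrame({"id": [0, 0, 1], "x": [1.0, 2.0, 4.0]})

toy.groupby("id")["x"].agg("sum")        # one value per group: id 0 -> 3.0, id 1 -> 4.0
toy.groupby("id")["x"].transform("sum")  # one value per original row: [3.0, 3.0, 4.0]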
Example:
def slow(data):
    """For each object (identified by "id"), which might appear with multiple
    classes, assign the class with the maximum total confidence. Example:

        index  id  class   confidence
        0      0   apple   0.90
        1      0   banana  0.35
        2      0   apple   0.99
        3      0   banana  0.43
        4      0   banana  0.30

    Here the object with id = 0 would be assigned the class 'apple'.
    """
    for i in data["id"].unique():
        idx = data["id"] == i
        weights = data.loc[idx, ["class", "confidence"]].groupby("class").sum()
        data.loc[idx, "class"] = weights.idxmax()[0]

def fast(data):
    data["weights"] = data.groupby(["id", "class"])["confidence"].transform('sum')
    data["class"] = data.groupby("id")["weights"].transform(
        lambda g: data.loc[g.idxmax(), "class"])
    del data["weights"]
> len(df)
14378
> %time slow(df)
CPU times: user 3.42 s, sys: 45.8 ms, total: 3.46 s
Wall time: 3.46 s
> %time fast(df)
CPU times: user 760 ms, sys: 18.6 ms, total: 779 ms
Wall time: 204 ms
We obtained quite a big improvement: roughly a 20x speedup in wall time. There might be ways to speed things up even further.
The “fast” version relies on the fact that the index of each group in a groupby is the same as the index of the original dataframe (the groups are in fact views on that dataframe), so the index of the max element within a group can be used to index the original dataframe.
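A minimal check of this index alignment, on a made-up toy frame:

import pandas as pd

toy = pd.DataFrame({"id": [0, 0, 1], "confidence": [0.9, 0.2, 0.5]})

for group_id, group in toy.groupby("id"):
    # each group keeps the labels of the original dataframe's index...
    assert group.index.isin(toy.index).all()

# ...so the label of a group's max element addresses the original frame directly
toy.loc[toy.groupby("id")["confidence"].idxmax()]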
Another example, perhaps even simpler:
def slow(data, step=12):
    """Compute the relative change over time of quantities a and b for each element in a group."""
    for group_id in data["id"].unique():
        pos = (data["id"] == group_id)
        data.loc[pos, "a_change"] = data.loc[pos, "a"].diff(step) / data.loc[pos, "a"].shift(step)
        data.loc[pos, "b_change"] = data.loc[pos, "b"].diff(step) / data.loc[pos, "b"].shift(step)

def fast(data, step=12):
    data[["a_change", "b_change"]] = data.groupby("id")[["a", "b"]].transform(
        lambda g: g.diff(step) / g.shift(step))
> len(df)
14378
> %time slow(df)
CPU times: user 2.11 s, sys: 6.13 ms, total: 2.12 s
Wall time: 2.13 s
> %time fast(df)
CPU times: user 1.03 s, sys: 15.3 ms, total: 1.05 s
Wall time: 1.04 s
Here the speedup is 2x, and the code is in my opinion also easier to read… …but it’s still pretty slow. It can be made much faster once you find out that there is a dataframe method, pct_change(), that does exactly what we’re trying to do here! \o/
def faster(data, step=12):
    data[["a_change", "b_change"]] = data.groupby("id")[["a", "b"]].pct_change(periods=step)
> %time faster(df)
CPU times: user 9.43 ms, sys: 52 µs, total: 9.48 ms
Wall time: 10.9 ms
And this is much faster: roughly a 200x speedup over the original loop-based version, going by the wall times above.
Finally, when more flexibility is needed (e.g., for operations between columns), a general groupby().apply() can be used:
def slow(data):
    motions = []
    for _, group in data.groupby('id'):
        group = group.copy()
        group['going_left'] = group['right'].shift() > group['right']
        group['going_right'] = group['left'].shift() < group['left']
        group['is_shrinking'] = (group['right'] - group['left']) < (group['right'].shift() - group['left'].shift())
        motions.append(group[['id', 'timestamp', 'going_left', 'going_right', 'is_shrinking']])
    return pd.concat(motions, ignore_index=True)

def fast(data):
    return data.groupby('id')[["right", "left", "timestamp"]].apply(
        lambda g: pd.DataFrame(
            {
                "going_left": g["right"].shift() > g["right"],
                "going_right": g["left"].shift() < g["left"],
                "is_shrinking": (g["right"] - g["left"]) < (g["right"] - g["left"]).shift(),
                "timestamp": g["timestamp"],
            })
    ).reset_index(0)  # remove the id index added by the groupby
> %time slow(df)
CPU times: user 1.53 s, sys: 58 µs, total: 1.53 s
Wall time: 1.53 s
> %time fast(df)
CPU times: user 945 ms, sys: 7.99 ms, total: 953 ms
Wall time: 949 ms
The benefit of using groupby().apply() here is small in terms of efficiency, but still noticeable; and the code is in my opinion more terse and readable. Still, it’s not something you can call fast. For that, a bit more effort is required :-)
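For instance, one possible direction (a sketch only, not part of the timings above, and assuming the same column names) is to compute the per-group shifts once with groupby().shift() and then work column-wise on the whole dataframe:

import pandas as pd

def faster(data):
    # Per-group shifted values, aligned with the original index
    prev = data.groupby('id')[["right", "left"]].shift()
    return pd.DataFrame({
        "id": data["id"],
        "timestamp": data["timestamp"],
        "going_left": prev["right"] > data["right"],
        "going_right": prev["left"] < data["left"],
        "is_shrinking": (data["right"] - data["left"]) < (prev["right"] - prev["left"]),
    })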