Bug hunt: Rescaling DataFrames in Python

2025-03-29

This is one of the tricky bugs I fixed relatively early in my career. This one took a solid few days to truly understand, but it boiled down to a relatively simple fix.

The problem

This bug affected a data export pipeline in a legacy codebase. We were converting raw time series values into a proprietary data format (with permission) so that the data could be inspected in another program by the technicians.

Some of the technicians raised a bug report to us:

Some of the files would crash the program when opened

Their workaround was to re-create the file on a subset of the data, which suggested that some bad values weren’t being scaled properly, probably leading to an “out-of-bounds” error.

The steps that are taken to convert the raw data:

Clip values to +/- 5 mV, which are the limits of what can be represented in the proprietary filetype
Rescale the channels so that all values are within +/- 512, roughly the range of signed integers that can be encoded in 10 bits (-512 to 511)
Split up the study data into 1 hour frames
Convert each scaled frame of data into binary and append to proprietary file.
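To make the first two steps concrete, here is a minimal sketch of clipping and rescaling a single channel. The names LIMIT_V and FULL_SCALE and the sample values are made up for illustration; the 510 scale factor and the 511 "undefined" marker are taken from the pipeline code in this post.

```python
import pandas as pd

LIMIT_V = 5e-3    # +/- 5 mV clipping limit (step 1)
FULL_SCALE = 510  # scale factor used by the pipeline; 511 is kept for "undefined"

# Hypothetical samples in volts: in range, in range, at the limit, over the limit, missing.
samples = pd.Series([0.0, 0.0025, -0.005, 0.008, float("nan")])

# Step 1: clip to the representable range. NaN passes through untouched.
clipped = samples.clip(lower=-LIMIT_V, upper=LIMIT_V)

# Step 2: rescale to integers, then mark missing samples as 511.
scaled = (clipped / LIMIT_V * FULL_SCALE).round()
encoded = scaled.fillna(511).astype(int)

print(encoded.tolist())  # [0, 255, -510, 510, 511]
```

Because this sample contains a value above 5 mV, the divisor is the 5 mV limit, which matches what the pipeline does in that case.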

The part of the pipeline concerned with rescaling values was:

import pandas as pd

def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        _data = data[column]

        # CLIPPING
        if abs_max > absolute_max:
            abs_max = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max, upper=abs_max)

        # RESCALING
        _data = (_data / abs_max * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data

Can you spot the bug?

The solution

Turns out the issue was a variable defined outside the loop being overwritten inside it:

    # CLIPPING
    if abs_max > absolute_max:
        abs_max = 5 * (10**-3)  # <--- ISSUE OCCURS HERE
        _data = _data.clip(lower=-abs_max, upper=abs_max)

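abs_max is computed once across all channels. If it exceeds 5 mV, the first iteration overwrites it with 5 mV, so the highlighted if statement is only entered once. Every later channel skips the clipping but is still divided by 5 mV. In cases when 5mV < absmax(CH1) < absmax(CH2), for example, this results in values in CH2 that exceed +/- 512.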
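The failure is easy to reproduce on a tiny frame. The sketch below uses a condensed copy of the original function (bug included) under the hypothetical name rescale_channels_buggy, with made-up two-channel data in volts.

```python
import pandas as pd

def rescale_channels_buggy(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    # Condensed copy of the original function, with the bug left in.
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        _data = data[column]
        if abs_max > absolute_max:
            abs_max = 5 * (10**-3)  # overwritten here, so the branch never runs again
            _data = _data.clip(lower=-abs_max, upper=abs_max)
        data[column] = (_data / abs_max * 510).round().fillna(511).astype(int)
    return data

# Both channels contain a sample above the 5 mV limit.
frame = pd.DataFrame({"CH1": [0.001, 0.006], "CH2": [0.001, 0.010]})
result = rescale_channels_buggy(frame, ["CH1", "CH2"])

print(result["CH1"].tolist())  # clipped on the first iteration, stays within +/- 510
print(result["CH2"].tolist())  # never clipped, so the 10 mV sample overflows the range
```

Only the first channel in channel_headers is protected, which is why re-creating a file on a different subset of the data could make the crash disappear.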

The fix is thankfully quite simple:

# channel_headers = ["ECG1", "ECG2", "DIFF"]
def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # <--- CREATE A TEMP ITERATION VARIABLE TO AVOID OVERWRITE
        _data = data[column]

        # CLIPPING
        if abs_max_col > absolute_max:
            abs_max_col = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)

        # RESCALING
        _data = (_data / abs_max_col * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data

Once this was implemented, I added a unit test to capture this case and confirmed that the technicians could open their files with the fix in place.
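The original test isn't shown here, but a regression test for this case could look roughly like the sketch below. The test name and the sample data are hypothetical, and the function is a condensed version of the fixed code so that the example runs on its own.

```python
import pandas as pd

def rescale_channels(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    # Condensed version of the fixed function from this post.
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # fresh copy per iteration, so abs_max is never overwritten
        _data = data[column]
        if abs_max_col > absolute_max:
            abs_max_col = absolute_max
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)
        data[column] = (_data / abs_max_col * 510).round().fillna(511).astype(int)
    return data

def test_later_channels_are_clipped():
    # The second channel holds the out-of-range sample, which the buggy version missed.
    frame = pd.DataFrame({"CH1": [0.001, 0.006], "CH2": [-0.010, float("nan")]})
    result = rescale_channels(frame, ["CH1", "CH2"])
    assert result["CH1"].tolist() == [102, 510]
    assert result["CH2"].tolist() == [-510, 511]  # clipped, and NaN mapped to 511

test_later_channels_are_clipped()
```

The key design point is that the out-of-range value sits in a channel other than the first, since that is the only situation in which the old code misbehaved.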

Lesson I learned from this:

Don’t overwrite variables that are shared across loop iterations; this leads to tricky-to-spot bugs!
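One way to rule out this class of bug altogether is to drop the loop and compute the scale once. The following is a sketch of an alternative, not the fix that was shipped: the name rescale_channels_vectorised and the LIMIT_V constant are assumptions, and unlike the original it returns a copy instead of modifying the input frame.

```python
import pandas as pd

LIMIT_V = 5e-3  # +/- 5 mV, the clipping limit described in this post

def rescale_channels_vectorised(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    channels = data[channel_headers]

    # Computed once and never reassigned, so every channel sees the same value.
    scale = min(channels.abs().max().max(), LIMIT_V)

    # Clipping is a no-op when all values are already within the scale.
    clipped = channels.clip(lower=-scale, upper=scale)
    encoded = (clipped / scale * 510).round().fillna(511).astype(int)

    out = data.copy()  # leave the caller's frame untouched
    out[channel_headers] = encoded
    return out

frame = pd.DataFrame({"CH1": [0.001, 0.006], "CH2": [-0.010, float("nan")]})
result = rescale_channels_vectorised(frame, ["CH1", "CH2"])
print(result)
```

With no per-iteration state there is nothing left to overwrite, at the cost of holding an intermediate copy of the channel data in memory.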
