ProbableOdyssey

Bug hunt: Rescaling DataFrames in Python

This is one of the tricky bugs I fixed relatively early in my career. This one took a solid few days to truly understand, but it boiled down to a relatively simple fix.

The problem

This bug affected a data export pipeline in a legacy codebase. We were converting raw time series values into a proprietary data format (with permission) so that the data could be inspected in another program by the technicians.

Some of the technicians raised a bug report to us:

Some of the files would crash the program when opened

Their workaround was to re-create the file on a subset of the data, which indicated there were potentially some bad values that weren’t being scaled properly, probably leading to an “out-of-bounds” error.

The steps that are taken to convert the raw data:

The part of the pipeline concerned with rescaling values was:

import pandas as pd

def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        _data = data[column]

        # CLIPPING
        if abs_max > absolute_max:
            abs_max = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max, upper=abs_max)

        # RESCALING
        _data = (_data / abs_max * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data

Can you spot the bug?

The solution

Turns out the issue was a variable defined outside the loop being overwritten inside it:

    # CLIPPING
    if abs_max > absolute_max:
        abs_max = 5 * (10**-3)  # <--- ISSUE OCCURS HERE
        _data = _data.clip(lower=-abs_max, upper=abs_max)

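abs_max is computed once, across all channels, before the loop starts. If it is greater than 5mV, the first iteration overwrites it with 5mV and clips the first channel. On every later iteration abs_max > absolute_max is false, so the highlighted if statement is only entered once. Any later channel with samples beyond 5mV (for example when 5mV < absmax(CH1) < absmax(CH2)) is rescaled without being clipped, resulting in values in CH2 that land outside the +/- 510 range the rescaling is meant to produce.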
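To make the failure concrete, here is a minimal reproduction. The function is a condensed copy of the original logic (it works on a copy of the frame, which the original did not), and the channel names and sample values are made up for illustration.

```python
import pandas as pd


def rescale_channels_buggy(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    """Condensed copy of the original (buggy) logic, kept only to show the failure."""
    data = data.copy()  # work on a copy so the input frame is left untouched
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        _data = data[column]
        if abs_max > absolute_max:
            abs_max = 5 * (10**-3)  # overwritten here, so the check is false from now on
            _data = _data.clip(lower=-abs_max, upper=abs_max)
        _data = (_data / abs_max * 510).round()
        data[column] = _data.fillna(511).astype(int)
    return data


# Hypothetical samples in volts: both channels contain values beyond 5 mV.
frame = pd.DataFrame({
    "CH1": [0.001, 0.006, -0.002],
    "CH2": [0.002, 0.008, -0.007],
})
result = rescale_channels_buggy(frame, ["CH1", "CH2"])
print(result)
```

With these numbers CH1 comes out as 102, 510, -204, while CH2 comes out as 204, 816, -714. The first channel is clipped as intended; the second is scaled by 5mV without ever being clipped.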

The fix is thankfully quite simple:

# channel_headers = ["ECG1", "ECG2", "DIFF"]
def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # <--- CREATE A TEMP ITERATION VARIABLE TO AVOID OVERWRITE
        _data = data[column]

        # CLIPPING
        if abs_max_col > absolute_max:
            abs_max_col = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)

        # RESCALING
        _data = (_data / abs_max_col * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data

Once this was implemented, I added a unit test to capture this case, and confirmed that the technicians could open their files without crashes once the fix was in place.
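A regression test for this case could look something like the sketch below. The test name, channel names and sample values are illustrative assumptions rather than the original test, and the function is a condensed copy of the fix so the snippet runs on its own.

```python
import pandas as pd


def rescale_channels(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    """Condensed copy of the fixed function so this snippet is self-contained."""
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # fresh copy per column; abs_max itself is never overwritten
        _data = data[column]
        if abs_max_col > absolute_max:
            abs_max_col = absolute_max
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)
        _data = (_data / abs_max_col * 510).round()
        data[column] = _data.fillna(511).astype(int)
    return data


def test_every_channel_is_clipped_when_values_exceed_5mv():
    # Hypothetical samples in volts; both channels go beyond 5 mV,
    # and CH1 has one missing sample.
    frame = pd.DataFrame({
        "CH1": [0.001, 0.006, None],
        "CH2": [0.002, 0.008, -0.007],
    })
    result = rescale_channels(frame, ["CH1", "CH2"])

    # Every value is either inside the scaled range or the "undefined" marker.
    in_range = result.abs().le(510) | result.eq(511)
    assert in_range.all().all()

    # The second channel is clipped too, which the buggy version failed to do.
    assert result["CH1"].tolist() == [102, 510, 511]
    assert result["CH2"].tolist() == [204, 510, -510]
```

The important part is that the test data puts out-of-range values in a channel other than the first one, since the first channel was always handled correctly.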

Lesson I learned from this:
