Bug hunt: Rescaling DataFrames in Python
This is one of the tricky bugs I fixed relatively early in my career. This one took a solid few days to truly understand, but it boiled down to a relatively simple fix.
The problem
This bug affected a data export pipeline in a legacy codebase. We were converting raw time series values into a proprietary data format (with permission) so that the data could be inspected in another program by the technicians.
Some of the technicians raised a bug report to us:
Some of the files would crash the program when opened
Their workaround was to re-create the file from a subset of the data. This suggested that some bad values weren’t being scaled properly, probably leading to an “out-of-bounds” error.
The steps that are taken to convert the raw data:
- Clip values to +/- 5 mV, which are the limits of what can be represented in the proprietary filetype
- Rescale the channels so that all values are within +/- 512, which is the range of integers that can be encoded in 10 bits
- Split up the study data into 1 hour frames
- Convert each scaled frame of data into binary and append to proprietary file.
The part of the pipeline concerned with rescaling values was:
import pandas as pd

def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))
    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        _data = data[column]
        # CLIPPING
        if abs_max > absolute_max:
            abs_max = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max, upper=abs_max)
        # RESCALING
        _data = (_data / abs_max * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".
        # TYPECASTING
        _data = _data.astype(int)
        # RECOMBINING COLUMN IN DF
        data[column] = _data
    return data
Can you spot the bug?
The solution
Turns out the issue was a variable that is shared across loop iterations being overwritten inside the loop:
# CLIPPING
if abs_max > absolute_max:
    abs_max = 5 * (10**-3)  # <--- ISSUE OCCURS HERE
    _data = _data.clip(lower=-abs_max, upper=abs_max)
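In cases when 5mV < absmax(CH1) < absmax(CH2), the highlighted if statement is only entered once, on the first channel. After that, abs_max has already been overwritten with 5 mV, so the condition is false for every later channel. CH2 is therefore never clipped, resulting in values in CH2 that can exceed +/- 512.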
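To make the failure concrete, here is a minimal reproduction sketch. It condenses the original logic into a function I have named rescale_channels_buggy for illustration; the channel names and sample values (in volts) are made up, and the function works on a copy so the demo input is left untouched.

```python
import pandas as pd

LIMIT_V = 5e-3  # 5 mV clipping limit, same value as in the original function


def rescale_channels_buggy(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    """Condensed copy of the original (buggy) logic, kept only to reproduce the failure."""
    data = data.copy()  # assumption for this demo: do not mutate the caller's frame
    abs_max = max(
        data[channel_headers].max().max(),
        abs(data[channel_headers].min().min()),
    )
    for column in channel_headers:
        _data = data[column]
        if abs_max > LIMIT_V:
            # abs_max is overwritten here, so this branch is skipped
            # for every column after the first one.
            abs_max = LIMIT_V
            _data = _data.clip(lower=-abs_max, upper=abs_max)
        _data = (_data / abs_max * 510).round().fillna(511).astype(int)
        data[column] = _data
    return data


# Made-up sample values in volts; both channels exceed 5 mV.
frame = pd.DataFrame({"CH1": [0.001, 0.006], "CH2": [-0.002, 0.007]})
scaled = rescale_channels_buggy(frame, ["CH1", "CH2"])
print(scaled)
# CH1 is clipped and becomes [102, 510].
# CH2 is never clipped and becomes [-204, 714], outside the 10-bit range.
```

The 7 mV sample in CH2 is divided by 5 mV without being clipped first, which is exactly the kind of value that the file format cannot hold.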
The fix is thankfully quite simple:
# channel_headers = ["ECG1", "ECG2", "DIFF"]
def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))
    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # <--- CREATE A TEMP ITERATION VARIABLE TO AVOID OVERWRITE
        _data = data[column]
        # CLIPPING
        if abs_max_col > absolute_max:
            abs_max_col = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)
        # RESCALING
        _data = (_data / abs_max_col * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".
        # TYPECASTING
        _data = _data.astype(int)
        # RECOMBINING COLUMN IN DF
        data[column] = _data
    return data
Once this was implemented, I added a unit test to capture this case and confirmed that the technicians could open their files once the fix was in place.
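A regression test for this case might look roughly like the sketch below. This is not the original test: the test name and sample values (in volts) are made up for illustration, the fixed function is repeated in condensed form so the block runs on its own, and it assumes a pytest-style runner that picks up plain assert statements.

```python
import pandas as pd


def rescale_channels(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    """Condensed copy of the fixed function so the test is self-contained."""
    abs_max = max(
        data[channel_headers].max().max(),
        abs(data[channel_headers].min().min()),
    )
    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # per-iteration copy; abs_max itself is never reassigned
        _data = data[column]
        if abs_max_col > absolute_max:
            abs_max_col = absolute_max
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)
        _data = (_data / abs_max_col * 510).round().fillna(511)
        data[column] = _data.astype(int)
    return data


def test_every_channel_is_clipped_when_several_exceed_5mv():
    # Hypothetical input: both channels contain samples beyond +/- 5 mV,
    # which is the situation that triggered the original bug.
    frame = pd.DataFrame(
        {"CH1": [0.001, 0.006, -0.008], "CH2": [-0.002, 0.007, 0.009]}
    )
    scaled = rescale_channels(frame, ["CH1", "CH2"])

    # No scaled value may fall outside the +/- 510 range produced by clipping.
    assert scaled.abs().max().max() <= 510
    # The second channel must be clipped as well, not only the first.
    assert scaled["CH1"].tolist() == [102, 510, -510]
    assert scaled["CH2"].tolist() == [-204, 510, 510]
```

The important detail is that more than one channel exceeds the limit; a test with a single out-of-range channel listed first would have passed even with the bug present.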
Lesson I learned from this:
- Don’t overwrite a variable inside a loop when later iterations still depend on its original value. This leads to tricky-to-spot bugs!
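One way to sidestep this class of bug entirely is to compute the scale once and never reassign it. The sketch below is an alternative I am suggesting here, not the fix that was shipped: the function name rescale_channels_vectorized is hypothetical, it returns a new DataFrame instead of modifying the input, and it has not been checked against the proprietary format.

```python
import pandas as pd

LIMIT_V = 5e-3  # 5 mV clipping limit


def rescale_channels_vectorized(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    channels = data[channel_headers]
    # The scale is computed once, capped at the 5 mV limit, and never reassigned.
    scale = min(channels.abs().max().max(), LIMIT_V)
    # Clipping every channel to +/- scale is a no-op when nothing exceeds the limit.
    clipped = channels.clip(lower=-scale, upper=scale)
    # 511 marks "undefined" samples, as in the original function.
    scaled = (clipped / scale * 510).round().fillna(511).astype(int)
    # Return a new frame with the rescaled channels swapped in.
    return data.assign(**{column: scaled[column] for column in channel_headers})


# Made-up sample values in volts; both channels exceed 5 mV.
frame = pd.DataFrame({"CH1": [0.001, 0.006], "CH2": [-0.002, 0.007]})
result = rescale_channels_vectorized(frame, ["CH1", "CH2"])
print(result)
# CH1 becomes [102, 510] and CH2 becomes [-204, 510].
```

Because there is no per-column branch, there is no state that one iteration can leave behind for the next.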