
Suggestion to always cast the data to float and scale on load #4371

@JoeZiminski

Description


Currently for preprocessing, the data type and scaling are managed by each individual preprocessing step, in get_traces(). The majority of processing steps are performed in float, so the data is repeatedly passed in as int, converted to float, then converted back to int (e.g. phase_shift, filter, common_reference).

Managing the dtype like this is a frequent source of bugs, as it's easy to forget to cast back to int, or to update parameters that need to be cast down to int (e.g. #4175, #2311, #3505, #4297, #4370). Due to the nature of typing issues, these bugs are usually serious and hard to debug.
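To make the failure mode concrete, here is a minimal sketch of the current per-step pattern (a hypothetical step in plain NumPy, not the actual implementation); the final cast-back line is exactly the one that is easy to get wrong or forget:

```python
import numpy as np

def preprocessing_step(traces: np.ndarray) -> np.ndarray:
    original_dtype = traces.dtype                       # e.g. int16
    out = traces.astype(np.float32)                     # cast up for the actual computation
    out = out - np.median(out, axis=1, keepdims=True)   # e.g. a common-reference-like step
    # The easy-to-forget line: without it, the recording dtype silently becomes
    # float32 for everything downstream; with it, the float values are truncated
    # back into int16 and can wrap around on overflow.
    return out.astype(original_dtype)
```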

Although it is a nice idea to maintain the original dtype throughout, some data must be float (e.g. motion-corrected data), so it is not guaranteed that the dtype of your recording is the dtype you started with anyway. I think motion correction is the only such case (though bad channel interpolation may be another contender), but it still breaks the convention and adds uncertainty to the recording dtype for all downstream processes.

For these reasons I would suggest that the data is cast to float (and preferably scaled to uV) upfront, on the first get_traces() call; a sketch of this flow is included after the list below. This way there is a guarantee that, at all times, you are working on the scaled, float data. Concerning some of the possible benefits of keeping data as int16 for processing:

  1. performance

Modern CPUs are optimized for floating-point computation, and float32 operations are typically as fast as or faster than integer ones (1, 2), as SIMD is better optimized for floating point; at worst, any difference should be negligible (a micro-benchmark sketch to probe this on your own machine is included after the list). In our case, the data is typically cast to float anyway for the majority of operations, so there should be little performance decrease. Another benefit of the upfront cast and scaling is that the casting is done only once, not at every preprocessing step.

  2. memory

On disk, data should definitely be int16, casting from float to int when written (the write-time cast is shown in the sketch below). Otherwise, because many steps already cast to float32, the memory requirements for preprocessing don't really change by casting up front. If you need to set memory for a job, it has to be the maximum amount used at any point anyway, which is typically float32. The only place I can think of (though there are probably others I'm missing) where it will have a direct impact is visualizing data, though people typically only visualize a small chunk at a time (e.g. 1 second).
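As a rough sketch of the proposal (plain NumPy with hypothetical function names, not SpikeInterface API; the 0.195 uV/bit gain is just an example value): cast and scale once on load, keep everything in float32 through the chain, and cast back to int16 only at the on-disk boundary, with explicit rounding and clipping so out-of-range values cannot silently wrap:

```python
import numpy as np

def load_traces(raw: np.ndarray, gain_to_uV: float, offset_to_uV: float = 0.0) -> np.ndarray:
    """Cast to float32 and scale to uV once, on the first get_traces()-style read."""
    return raw.astype(np.float32) * gain_to_uV + offset_to_uV

def write_traces(traces_uV: np.ndarray, gain_to_uV: float, offset_to_uV: float = 0.0) -> np.ndarray:
    """Cast back to int16 only when writing to disk."""
    raw = (traces_uV - offset_to_uV) / gain_to_uV
    info = np.iinfo(np.int16)
    # Round and clip explicitly: a bare astype(np.int16) truncates toward zero
    # and wraps around on overflow.
    return np.clip(np.round(raw), info.min, info.max).astype(np.int16)

# Every step in between can assume scaled float32 data, with no per-step
# dtype bookkeeping:
traces = load_traces(np.array([[100, -200], [300, -400]], dtype=np.int16), gain_to_uV=0.195)
traces = traces - np.median(traces, axis=1, keepdims=True)   # e.g. common reference
on_disk = write_traces(traces, gain_to_uV=0.195)
```

Centralizing the round/clip in one write-side function is also where the bug class above disappears: no individual preprocessing step has to remember its own dtype handling.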
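On the performance point, here is an illustrative, machine-dependent micro-benchmark sketch for checking the int16-vs-float32 claim locally (the array size mimics roughly 1 s of 384-channel data at 30 kHz; results will vary with CPU and NumPy build):

```python
import time
import numpy as np

n = 30_000 * 384                      # ~1 s of 384-channel data at 30 kHz
rng = np.random.default_rng(0)
data_i16 = rng.integers(-1000, 1000, size=n, dtype=np.int16)
data_f32 = data_i16.astype(np.float32)

def bench(fn, repeats=10):
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats

t_int = bench(lambda: data_i16 * 2 + 1)        # stays int16 (and can overflow for large values)
t_flt = bench(lambda: data_f32 * 2.0 + 1.0)    # float32 throughout
print(f"int16:   {t_int * 1e3:.2f} ms per pass")
print(f"float32: {t_flt * 1e3:.2f} ms per pass")
```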

Interested to hear what people think; I'm sure there's loads of stuff I haven't considered. But at least given the above, guaranteeing the dtype and scaling of the raw data at all points in the processing chain seems like it would save a load of headaches without any performance decrease (and possibly with a performance boost).
