
Suggestion to always cast the data to float and scale on load #4371

@JoeZiminski

Description


Currently for preprocessing, the data type and scaling are managed by each individual preprocessing step, in get_traces(). The majority of processing steps are performed in float, so the data is repeatedly passed in as int, converted to float, then converted back to int (e.g. phase_shift, filter, common_reference).

Managing the dtype like this is a frequent source of bugs, as it's easy to forget to cast back to int, or to update parameters that need to be cast down to int (e.g. #4175, #2311, #3505, #4297, #4370). Due to the nature of typing issues, these bugs are usually serious and hard to debug.
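To make the failure mode concrete, here is a minimal sketch of the current per-step pattern (a hypothetical step in plain NumPy, not the actual implementation); the final cast-back line is exactly the one that is easy to get wrong or forget:

```python
import numpy as np

def preprocessing_step(traces: np.ndarray) -> np.ndarray:
    original_dtype = traces.dtype                       # e.g. int16
    out = traces.astype(np.float32)                     # cast up for the actual computation
    out = out - np.median(out, axis=1, keepdims=True)   # e.g. a common-reference-like step
    # The easy-to-forget line: without it, the recording dtype silently becomes
    # float32 for everything downstream; with it, the float values are truncated
    # back into int16 and can wrap around on overflow.
    return out.astype(original_dtype)
```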

Although it is a nice idea to maintain the original dtype throughout, some data must be float (e.g. motion-corrected data), so it is not guaranteed that the dtype of your recording is the dtype you started with anyway. I think motion correction is the only such case (though bad channel interpolation may be another contender), but it still breaks the convention and adds uncertainty to the recording dtype for all downstream processes.

For these reasons I would suggest that the data is cast to float (and preferably scaled to uV) upfront, on the first get_traces() call; a sketch of this flow is included after the list below. This way there is a guarantee that, at all times, you are working on the scaled, float data. Concerning some of the possible benefits of keeping data as int16 for processing:

  1. performance

Modern CPUs are optimized for floating-point computation, and float32 operations are typically as fast as or faster than integer ones (1, 2), as SIMD is better optimized for floating point; at worst, any difference should be negligible (a micro-benchmark sketch to probe this on your own machine is included after the list). In our case, the data is typically cast to float anyway for the majority of operations, so there should be little performance decrease. Another benefit of the upfront cast and scaling is that the casting is done only once, not at every preprocessing step.

  2. memory

On disk, data should definitely be int16, casting from float to int when written (the write-time cast is shown in the sketch below). Otherwise, because many steps already cast to float32, the memory requirements for preprocessing don't really change by casting up front. If you need to set memory for a job, it has to be the maximum amount used at any point anyway, which is typically float32. The only place I can think of (though there are probably others I'm missing) where it will have a direct impact is visualizing data, though people typically only visualize a small chunk at a time (e.g. 1 second).
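As a rough sketch of the proposal (plain NumPy with hypothetical function names, not SpikeInterface API; the 0.195 uV/bit gain is just an example value): cast and scale once on load, keep everything in float32 through the chain, and cast back to int16 only at the on-disk boundary, with explicit rounding and clipping so out-of-range values cannot silently wrap:

```python
import numpy as np

def load_traces(raw: np.ndarray, gain_to_uV: float, offset_to_uV: float = 0.0) -> np.ndarray:
    """Cast to float32 and scale to uV once, on the first get_traces()-style read."""
    return raw.astype(np.float32) * gain_to_uV + offset_to_uV

def write_traces(traces_uV: np.ndarray, gain_to_uV: float, offset_to_uV: float = 0.0) -> np.ndarray:
    """Cast back to int16 only when writing to disk."""
    raw = (traces_uV - offset_to_uV) / gain_to_uV
    info = np.iinfo(np.int16)
    # Round and clip explicitly: a bare astype(np.int16) truncates toward zero
    # and wraps around on overflow.
    return np.clip(np.round(raw), info.min, info.max).astype(np.int16)

# Every step in between can assume scaled float32 data, with no per-step
# dtype bookkeeping:
traces = load_traces(np.array([[100, -200], [300, -400]], dtype=np.int16), gain_to_uV=0.195)
traces = traces - np.median(traces, axis=1, keepdims=True)   # e.g. common reference
on_disk = write_traces(traces, gain_to_uV=0.195)
```

Centralizing the round/clip in one write-side function is also where the bug class above disappears: no individual preprocessing step has to remember its own dtype handling.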
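On the performance point, here is an illustrative, machine-dependent micro-benchmark sketch for checking the int16-vs-float32 claim locally (the array size mimics roughly 1 s of 384-channel data at 30 kHz; results will vary with CPU and NumPy build):

```python
import time
import numpy as np

n = 30_000 * 384                      # ~1 s of 384-channel data at 30 kHz
rng = np.random.default_rng(0)
data_i16 = rng.integers(-1000, 1000, size=n, dtype=np.int16)
data_f32 = data_i16.astype(np.float32)

def bench(fn, repeats=10):
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats

t_int = bench(lambda: data_i16 * 2 + 1)        # stays int16 (and can overflow for large values)
t_flt = bench(lambda: data_f32 * 2.0 + 1.0)    # float32 throughout
print(f"int16:   {t_int * 1e3:.2f} ms per pass")
print(f"float32: {t_flt * 1e3:.2f} ms per pass")
```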

Interested to hear what people think; I'm sure there's loads of stuff I haven't considered. But at least given the above, guaranteeing the dtype and scaling of the raw data at all points in the processing chain seems like it would save a load of headaches without any performance decrease (and possibly with a performance boost).
