Description
Currently for preprocessing, the data type and scaling are managed by each individual preprocessing step, in its get_traces(). In general, the majority of processing steps are performed in float, so the data is repeatedly passed in as int, converted to float, then cast back to int (e.g. phase_shift, filter, common_reference).
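To illustrate the pattern (a minimal sketch with hypothetical names, not the actual implementation of any of these steps):

```python
import numpy as np

# Hypothetical sketch of one preprocessing step's get_traces():
# the computation needs float, but the result is cast back to the
# recording dtype (here int16) before being returned.
def step_get_traces(parent_traces: np.ndarray) -> np.ndarray:
    dtype = parent_traces.dtype  # e.g. int16
    traces = parent_traces.astype("float32")
    traces -= np.median(traces, axis=1, keepdims=True)  # stand-in for the real operation
    return traces.astype(dtype)  # the cast back down that is easy to forget
```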
Managing the dtype like this is a recurring source of bugs, as it's easy to forget to cast back to int, or to forget to cast down parameters that need to be int (e.g. #4175, #2311, #3505, #4297, #4370), and, being typing issues, these bugs are usually serious and hard to debug.
Although maintaining the original dtype throughout is a nice idea, some data must be float (e.g. motion-corrected), so the dtype of your recording is not guaranteed to be the dtype you started with anyway. I think motion is the only such case (though bad-channel interpolation may be another contender), but it still breaks the convention and adds uncertainty about the recording dtype for all downstream processes.
For these reasons I would suggest that the data is cast to float (and preferably scaled to uV) upfront, on the first get_traces() call. This way there is a guarantee that, at all times, you are working on scaled, float data.
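A rough sketch of the idea (hypothetical names; in practice the per-channel gains and offsets would come from the recording's properties):

```python
import numpy as np

# Hypothetical sketch: convert raw int16 ADC values to float32 microvolts
# once, on the first get_traces() call, using per-channel gains/offsets.
def first_get_traces(raw: np.ndarray,
                     gain_to_uV: np.ndarray,
                     offset_to_uV: np.ndarray) -> np.ndarray:
    gains = np.asarray(gain_to_uV, dtype="float32")
    offsets = np.asarray(offset_to_uV, dtype="float32")
    return raw.astype("float32") * gains + offsets

# Every downstream step can then assume float32 / uV data and drop its
# per-step casting logic entirely.
```

Concerning some of the possible benefits of keeping data as int16 for processing: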
- performance
Modern CPUs are optimized for floating-point computation, and typically float32 operations are faster than integer ones (1, 2), as SIMD is better optimized for floating point. At the very least, any difference should be negligible. In our case the data is typically cast to float anyway for the majority of operations, so there should be little performance decrease. Another performance benefit of the upfront cast and scaling is that the cast is done only once, not at every preprocessing step (see the timing sketch after this list).
- memory
On disk, data should definitely be int16, casting from float to int when written (see the write-back sketch below). Otherwise, because many steps already cast to float32, the memory requirements for preprocessing don't really change by casting up front. If you need to set memory for a job, it has to be the maximum used at any point anyway, which is typically the float32 stage. The only place I can think of where it will have a direct impact (though there are probably others I'm missing) is in visualizing data, though people typically only visualize a small chunk at a time (e.g. 1 second).
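On the performance point, a quick (unscientific) way to check the claim on your own machine:

```python
import numpy as np
from timeit import timeit

x_i16 = np.random.randint(-2**15, 2**15, size=10_000_000, dtype=np.int16)
x_f32 = x_i16.astype(np.float32)

# elementwise arithmetic, the kind of operation SIMD units are tuned for
t_int = timeit(lambda: x_i16 * 2, number=100)
t_flt = timeit(lambda: x_f32 * 2.0, number=100)
print(f"int16:   {t_int:.3f}s\nfloat32: {t_flt:.3f}s")
```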
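And on the disk side, the float-to-int16 conversion at write time is small; a hypothetical sketch (rounding and clipping so out-of-range values saturate rather than wrap around):

```python
import numpy as np

# Hypothetical sketch: undo the uV scaling, round, and clip to the int16
# range before writing, so out-of-range samples saturate instead of
# wrapping around.
def to_int16_for_disk(traces_uV: np.ndarray,
                      gain_to_uV: np.ndarray,
                      offset_to_uV: np.ndarray) -> np.ndarray:
    raw = (traces_uV - offset_to_uV) / gain_to_uV
    return np.clip(np.round(raw), -32768, 32767).astype(np.int16)
```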
Interested to hear what people think; I'm sure there's loads of stuff I haven't considered. But at least given the above, guaranteeing the dtype and scaling of the data at all points in the processing chain seems like it will save a load of headaches without any performance decrease (and possibly with a performance boost).