Thank you very much for the great work. The potential of this framework is awesome.
I have a question about the context mechanism in the Audio Tokenizer, though.
It seems that there is a chunk size parameter in both the encoder and the decoder of the causal audio transformer (CAT).
What chunk size was effectively used during training?
I tried different durations, and the results seem quite sensitive to this value, both when encoding the prompt and on the decoding side.
P.S. Just to confirm: the KV state is basically reset at each chunk boundary, right?
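
To make the P.S. concrete, here is a minimal toy sketch of the behavior I mean, written against plain PyTorch. None of the names (`ToyCausalLayer`, `run_chunked`) come from this repo; it just illustrates "cache the KV within a chunk, drop it at each boundary", which is my guess at what CAT does:

```python
import torch
import torch.nn as nn

class ToyCausalLayer(nn.Module):
    """Toy stand-in for one CAT layer: causal self-attention over a KV cache."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_t, kv=None):
        # Append the current step's features to the running cache,
        # then attend over everything cached so far (causal by construction).
        kv = x_t if kv is None else torch.cat([kv, x_t], dim=1)
        out, _ = self.attn(x_t, kv, kv, need_weights=False)
        return out, kv

def run_chunked(layer, feats, chunk_len, reset_each_chunk=True):
    """Step through features one frame at a time; optionally drop the
    KV cache at every chunk boundary (my guess at what CAT does)."""
    outs, kv = [], None
    for t in range(feats.shape[1]):
        if reset_each_chunk and t % chunk_len == 0:
            kv = None  # <-- the reset I am asking about
        out, kv = layer(feats[:, t:t + 1], kv)
        outs.append(out)
    return torch.cat(outs, dim=1)

feats = torch.randn(1, 100, 64)  # (batch, frames, dim) dummy features
y = run_chunked(ToyCausalLayer(), feats, chunk_len=25)
print(y.shape)  # torch.Size([1, 100, 64])
```

If `reset_each_chunk=True` is what actually happens, the model would have no memory across chunk boundaries, which would explain the sensitivity to chunk duration I am seeing.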