Thank you very much for the great work. The potential of this framework is awesome.
I have a question about the context mechanism in the Audio Tokenizer, though.
It seems that there is a chunk size parameter in both the encoder and the decoder of the causal audio transformer (CAT).
What chunk size was effectively used during training?
I tried different durations, and the results seem quite sensitive to this value, both when encoding the prompt and on the decoding side.
P.S. Just to confirm: the KV state is basically reset at each chunk boundary, right?
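
To make the P.S. concrete, here is a minimal toy sketch of the behavior I mean, written against plain PyTorch. None of the names (`ToyCausalLayer`, `run_chunked`) come from this repo; it just illustrates "cache the KV within a chunk, drop it at each boundary", which is my guess at what CAT does:

```python
import torch
import torch.nn as nn

class ToyCausalLayer(nn.Module):
    """Toy stand-in for one CAT layer: causal self-attention over a KV cache."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_t, kv=None):
        # Append the current step's features to the running cache,
        # then attend over everything cached so far (causal by construction).
        kv = x_t if kv is None else torch.cat([kv, x_t], dim=1)
        out, _ = self.attn(x_t, kv, kv, need_weights=False)
        return out, kv

def run_chunked(layer, feats, chunk_len, reset_each_chunk=True):
    """Step through features one frame at a time; optionally drop the
    KV cache at every chunk boundary (my guess at what CAT does)."""
    outs, kv = [], None
    for t in range(feats.shape[1]):
        if reset_each_chunk and t % chunk_len == 0:
            kv = None  # <-- the reset I am asking about
        out, kv = layer(feats[:, t:t + 1], kv)
        outs.append(out)
    return torch.cat(outs, dim=1)

feats = torch.randn(1, 100, 64)  # (batch, frames, dim) dummy features
y = run_chunked(ToyCausalLayer(), feats, chunk_len=25)
print(y.shape)  # torch.Size([1, 100, 64])
```

If `reset_each_chunk=True` is what actually happens, the model would have no memory across chunk boundaries, which would explain the sensitivity to chunk duration I am seeing.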