Chunk Size and KV caching in the CAT (Audio Tokenizer) #34

@patrickltobing

Thank you very much for the great work. The potential of this framework is awesome.

I have a question about the context mechanism in the Audio Tokenizer, though.
There seems to be a chunk size in both the encoder and the decoder of the causal audio transformer (CAT).

What is the effective chunk size used during training?
I tried different durations, and the results seem quite sensitive to it, on both the prompt-encoding side and the decoding side.

P.S. Just to confirm: the KV state is reset for each chunk, right?
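To make the question concrete, here is a toy sketch of what I understand the mechanism to be, assuming the KV state really is reset at every chunk boundary. Nothing here is from the CAT codebase; `CHUNK_SIZE` and `process_stream` are illustrative names, and the "frames" stand in for audio token steps.

```python
CHUNK_SIZE = 4  # assumed number of frames per chunk, not the real value

def process_stream(frames, chunk_size=CHUNK_SIZE):
    """Process frames causally; each frame attends only to cached
    frames from the same chunk (including itself)."""
    kv_cache = []          # grows within a chunk
    context_lengths = []   # how many frames each step could attend to
    for i, frame in enumerate(frames):
        if i % chunk_size == 0:
            kv_cache = []  # KV state reset at every chunk boundary
        kv_cache.append(frame)
        context_lengths.append(len(kv_cache))
    return context_lengths

# With 10 frames and chunk_size=4 the visible context per step is
# 1,2,3,4, 1,2,3,4, 1,2 — i.e. it never exceeds the chunk size.
print(process_stream(list(range(10))))  # → [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
```

If this reading is right, the effective receptive field is bounded by the chunk size, which would explain why results are so sensitive to the duration used.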
