No word tokenizer under the hood?

Hi,

In the original BPE paper, as well as in the BPE dropout paper, the authors apply word-based tokenization (namely the Moses tokenizer, among others) before running the main algorithm. However, this project's README is somewhat vague about this detail. Do I understand correctly that the only word-based tokenization implemented is splitting on spaces, and nothing more?

What confuses me is this quote: `ours does not consider tokens that cross word boundaries`. For some languages, splitting on spaces alone does not identify word boundaries, so it is impossible to guarantee that no token crosses them. So my question is as follows: is there a more sophisticated word-based tokenizer under the hood after all?
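
To illustrate the concern, here is a minimal sketch of what space-only pre-tokenization would do. This is my assumption about the behavior, not this project's actual code; `split_on_spaces` is a hypothetical helper written only for this example.

```python
def split_on_spaces(text):
    """Naive pre-tokenization: split on whitespace only (hypothetical helper)."""
    return text.split()


english = "the cat sat on the mat"
japanese = "猫がマットの上に座った"  # same sentence; Japanese writes no spaces between words

english_words = split_on_spaces(english)
japanese_words = split_on_spaces(japanese)

# English yields one chunk per word, so merges learned inside a chunk stay inside a word.
print(english_words)

# Japanese yields a single chunk covering the whole sentence, so merges learned
# inside that chunk can freely span what a linguist would call word boundaries.
print(japanese_words)
```

If this is indeed all that happens before BPE, then for such languages the "word boundary" is effectively the sentence boundary.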

No word tokenizer under the hood? #87
