
HuggingFace Tokenizers compatibility #188

gabeorlanski opened this issue Feb 1, 2022 · 3 comments

@gabeorlanski

Hi, I have been trying to get SeqIO to work with HuggingFace's tokenizers, but I keep running into trouble with non-T5 tokenizers. Specifically, because they are not SentencePiece tokenizers, tokenizers for models such as GPT-2 are incompatible with SeqIO's SentencePieceVocabulary; they only provide the following vocab files:

{
  'vocab_file': 'vocab.json',
  'merges_file': 'merges.txt',
  'tokenizer_file': 'tokenizer.json'
}

Is there a currently supported way to use these tokenizers with SeqIO? Or would I need to make my own vocab class?

@adarob
Member

adarob commented Feb 1, 2022

You can make your own subclass of seqio.Vocabulary that provides this compatibility. It would be an excellent contribution to the codebase!
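For context, a subclass would need to supply the encode/decode interface that SeqIO expects from a vocabulary (token IDs in, strings out, plus `eos_id`, `unk_id`, and `vocab_size`). Below is a minimal, stdlib-only sketch of the shape such a wrapper could take. The class name is hypothetical, the tokenizer is stubbed with a toy word-level vocab so the example runs without seqio or transformers installed, and the real seqio.Vocabulary interface also includes TF-graph encode/decode variants that this sketch omits:

```python
# Hypothetical sketch of a Vocabulary-style wrapper around a
# HuggingFace-like tokenizer. The tokenizer is stubbed with a plain
# word-level vocab dict; a real implementation would delegate encode()
# and decode() to a HuggingFace tokenizer object instead.

class HFVocabularySketch:
    def __init__(self, token_to_id, unk_token="<unk>", eos_token="<eos>"):
        self._token_to_id = dict(token_to_id)
        self._id_to_token = {i: t for t, i in self._token_to_id.items()}
        self._unk_id = self._token_to_id[unk_token]
        self._eos_id = self._token_to_id[eos_token]

    @property
    def eos_id(self):
        return self._eos_id

    @property
    def unk_id(self):
        return self._unk_id

    @property
    def vocab_size(self):
        return len(self._token_to_id)

    def encode(self, s):
        # Toy word-level encoding; a real wrapper would call the
        # HuggingFace tokenizer's encode method here.
        return [self._token_to_id.get(tok, self._unk_id) for tok in s.split()]

    def decode(self, ids):
        # Stop at EOS, mirroring how SeqIO vocabularies truncate decodes.
        out = []
        for i in ids:
            if i == self._eos_id:
                break
            out.append(self._id_to_token.get(i, "<unk>"))
        return " ".join(out)


vocab = HFVocabularySketch({"<unk>": 0, "<eos>": 1, "hello": 2, "world": 3})
print(vocab.encode("hello world"))      # [2, 3]
print(vocab.decode([2, 3, 1, 2]))       # "hello world" (truncated at EOS)
```

A real subclass would also need to handle the GPT-2 specifics (byte-level BPE via `vocab.json` and `merges.txt`, and the fact that GPT-2 has no dedicated EOS/PAD distinction by default), which is likely where most of the integration work lies.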

@gauravmishra
Collaborator

+1 this would be great to have!

@OhadRubin

Hey, I'm about to implement this in the near future (and hopefully make a pull request).
Specifically for the GPT-2 tokenizer, but it doesn't really matter.
Are there any pitfalls I should look out for?
