[Tokenizers] How to load HF based tokenizers e.g. SmolLM #7197

Open
@nietras

LLMs are often distributed on Hugging Face or similar hubs, where the tokenizer is presumed to have been created with the transformers library. Such a repository usually contains a set of JSON and text files, and I have found it hard to know how to create a Microsoft.ML.Tokenizers tokenizer from them. For example, how would one create a tokenizer for:

https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct/tree/main

Could there be a getting-started document that explains how to load tokenizers from such files and how to identify which tokenizer type to use for a given set of files?
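
To illustrate the "how to identify what to use" part: when a repository ships a `tokenizer.json` in the Hugging Face tokenizers format, the `model.type` and `pre_tokenizer.type` fields usually state which algorithm the files belong to. The snippet below is only a minimal sketch under that assumption; `describe_tokenizer` and `describe_tokenizer_file` are hypothetical helper names, it is written in Python rather than .NET, and it does not show the ML.NET loading step itself.

```python
import json
from pathlib import Path


def describe_tokenizer(config: dict) -> dict:
    """Summarize the fields of a parsed tokenizer.json that hint at the tokenizer kind."""
    # Any of these sections may be missing or null, so fall back to empty values.
    model = config.get("model") or {}
    pre_tokenizer = config.get("pre_tokenizer") or {}
    vocab = model.get("vocab") or {}
    added = config.get("added_tokens") or []
    return {
        # e.g. "BPE", "WordPiece" or "Unigram"; None if the field is absent
        "model_type": model.get("type"),
        # e.g. "ByteLevel"; None if the field is absent
        "pre_tokenizer": pre_tokenizer.get("type"),
        "vocab_size": len(vocab),
        # BPE-style tokenizers carry a merge list next to the vocabulary
        "has_merges": bool(model.get("merges")),
        "added_tokens": [token.get("content") for token in added],
    }


def describe_tokenizer_file(path: str) -> dict:
    """Read a tokenizer.json from disk (path is supplied by the caller) and summarize it."""
    return describe_tokenizer(json.loads(Path(path).read_text(encoding="utf-8")))
```

Calling `describe_tokenizer_file("tokenizer.json")` on a downloaded file would then show, for instance, whether it is a byte-level BPE tokenizer with a vocabulary and merges, which is the information needed to pick a matching tokenizer type on the .NET side.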

Labels: Tokenizers, documentation, enhancement, question