[safetensors parser] RE_SAFETENSORS_SHARD_FILE #593

Merged on Mar 29, 2024 (1 commit)

Conversation

mishig25
Collaborator

@mishig25 mishig25 commented Mar 29, 2024

const RE_SAFETENSORS_SHARD_FILE = /\d{5}-of-\d{5}\.safetensors$/;

Is this regex good for detecting a shard file based on the safetensors filename?
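
As a quick way to reason about the question, here is a minimal sketch that runs the regex from the description against a few filenames. The helper name `isSafetensorsShard` and the sample filenames are assumptions made for illustration; they are not taken from the change itself.

```typescript
// Regex exactly as written in the pull request description.
const RE_SAFETENSORS_SHARD_FILE = /\d{5}-of-\d{5}\.safetensors$/;

// Hypothetical helper (not part of the original change): true when the
// filename ends with "<5 digits>-of-<5 digits>.safetensors".
function isSafetensorsShard(filename: string): boolean {
  return RE_SAFETENSORS_SHARD_FILE.test(filename);
}

// Illustrative filenames, assumed for this sketch.
isSafetensorsShard("model-00001-of-00002.safetensors"); // true
isSafetensorsShard("model.safetensors"); // false: no shard suffix
isSafetensorsShard("model.safetensors.index.json"); // false: wrong extension
isSafetensorsShard("model-0001-of-0002.safetensors"); // false: 4-digit fields

// The pattern is anchored only at the end, so an index field longer than
// five digits still matches through its last five digits.
isSafetensorsShard("model-123456-of-00002.safetensors"); // true
```

If that last case matters, one option would be to require a non-digit (or the start of the string) before the first digit group; whether that is desirable depends on the naming convention being targeted.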

@mishig25 mishig25 marked this pull request as ready for review March 29, 2024 15:25
@mishig25 mishig25 requested a review from coyotte508 as a code owner March 29, 2024 15:25
@mishig25 mishig25 requested review from Narsil and julien-c March 29, 2024 15:25
Member

@julien-c julien-c left a comment

looks good (cc @Wauplin @LysandreJik on the filename convention)

Actually, we could specify HF-wide that sharded files always follow this pattern, given that it is also the one used by the datasets library for sharded files, as @lhoestq just confirmed to me.

@mishig25 mishig25 merged commit 09d8032 into main Mar 29, 2024
@mishig25 mishig25 deleted the RE_SAFETENSORS_SHARD_FILE branch March 29, 2024 15:33
@lhoestq
Member

lhoestq commented Mar 29, 2024

FYI sharded files from datasets have a similar pattern, e.g.

train-00000-of-00004.parquet
train-00001-of-00004.parquet
train-00002-of-00004.parquet
train-00003-of-00004.parquet

BUT this is not the case for many datasets uploaded manually that can use arbitrary patterns, e.g.

0001.tar
0002.tar
0003.tar
part0.json
part1.json
part2.json
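
To make the contrast between the two groups of filenames concrete, here is a rough sketch of an extension-agnostic variant of the pattern. The regex `RE_GENERIC_SHARD_FILE`, the `ShardInfo` type, and the `parseShardFilename` helper are all hypothetical names introduced for this example; they are not an existing API of any Hugging Face library.

```typescript
// Hypothetical result type for this sketch.
interface ShardInfo {
  prefix: string; // text before the shard suffix, e.g. "train"
  index: number; // shard index as written in the filename
  total: number; // total shard count as written in the filename
  extension: string; // file extension without the dot
}

// Assumed generalization of the 5-digit convention: optional "<prefix>-",
// then "<5 digits>-of-<5 digits>", then a single extension.
const RE_GENERIC_SHARD_FILE = /^(?:(.*)-)?(\d{5})-of-(\d{5})\.([^.\/]+)$/;

// Returns the parsed shard fields, or null when the name does not follow
// the "NNNNN-of-NNNNN" convention.
function parseShardFilename(filename: string): ShardInfo | null {
  const match = RE_GENERIC_SHARD_FILE.exec(filename);
  if (match === null) {
    return null;
  }
  return {
    prefix: match[1] ?? "",
    index: Number.parseInt(match[2], 10),
    total: Number.parseInt(match[3], 10),
    extension: match[4],
  };
}

// Filenames taken from the comment above.
const parsed = parseShardFilename("train-00001-of-00004.parquet");
// parsed: { prefix: "train", index: 1, total: 4, extension: "parquet" }
const manualTar = parseShardFilename("0001.tar"); // null
const manualJson = parseShardFilename("part0.json"); // null
```

Manually uploaded names such as `0001.tar` or `part0.json` carry no total count, so a filename-only check like this cannot recognize them as shards.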

@Wauplin
Contributor

Wauplin commented Mar 29, 2024

> Is this regex good for detecting a shard file based on the safetensors filename?

Correct. Though it's not used yet, harmonization for model sharding is happening here.

@Wauplin
Contributor

Wauplin commented Mar 29, 2024

> FYI sharded files from datasets have a similar pattern, e.g.
>
> train-0000-of-0004.parquet
> train-0001-of-0004.parquet
> train-0002-of-0004.parquet
> train-0003-of-0004.parquet

In your example, @lhoestq, the numbers have 4 digits, whereas @mishig25's regex expects 5.
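
The digit-width point can be checked mechanically. The sketch below uses a deliberately lenient pattern, `RE_LENIENT_SHARD_FIELDS`, and a helper, `shardFieldWidths`, both of which are hypothetical names introduced only for this illustration, to report how many digits each field of a filename actually uses.

```typescript
// Lenient pattern (an assumption for this sketch): any number of digits
// on each side of "-of-", followed by a single extension.
const RE_LENIENT_SHARD_FIELDS = /(\d+)-of-(\d+)\.[^.]+$/;

// Returns [index width, total width] in digits, or null when the name
// has no "<digits>-of-<digits>" suffix at all.
function shardFieldWidths(filename: string): [number, number] | null {
  const match = RE_LENIENT_SHARD_FIELDS.exec(filename);
  return match === null ? null : [match[1].length, match[2].length];
}

const fourDigits = shardFieldWidths("train-0000-of-0004.parquet"); // [4, 4]
const fiveDigits = shardFieldWidths("train-00000-of-00004.parquet"); // [5, 5]
```

A strict `\d{5}` pattern accepts only the second form, which is why the width used in an example filename matters.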

@julien-c
Member

anyways not always the case for models either: https://huggingface.co/hpcai-tech/grok-1/tree/main

@lhoestq
Member

lhoestq commented Apr 8, 2024

> In your example, @lhoestq, the numbers have 4 digits, whereas @mishig25's regex expects 5.

oops it's actually 5 digits as well, sorry (edited my message)

4 participants