Skip to content

[safetensors] better RE_SAFETENSORS_SHARD_FILE #605

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 16 additions & 1 deletion packages/hub/src/lib/parse-safetensors-metadata.spec.ts
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import { assert, it, describe } from "vitest";
import { parseSafetensorsMetadata } from "./parse-safetensors-metadata";
import { RE_SAFETENSORS_SHARD_FILE, parseSafetensorsMetadata } from "./parse-safetensors-metadata";
import { sum } from "../utils/sum";

describe("parseSafetensorsMetadata", () => {
Expand Down Expand Up @@ -109,4 +109,19 @@ describe("parseSafetensorsMetadata", () => {
assert.deepStrictEqual(parse.parameterCount, { BF16: 8_537_680_896 });
assert.deepStrictEqual(sum(Object.values(parse.parameterCount)), 8_537_680_896);
});

it("should detect sharded safetensors filename", async () => {
const safetensorsPath = "model00002-of-00072.safetensors"; // https://huggingface.co/bigscience/bloom/blob/4d8e28c67403974b0f17a4ac5992e4ba0b0dbb6f/model_00002-of-00072.safetensors
const match = safetensorsPath.match(RE_SAFETENSORS_SHARD_FILE);

assert.strictEqual(RE_SAFETENSORS_SHARD_FILE.test(safetensorsPath), true);
assert.strictEqual(match?.[1], "00002");
assert.strictEqual(match?.[2], "00072");

const safetensorsPathWithDash = "model-00002-of-00072.safetensors"; // https://huggingface.co/google/gemma-7b/blob/7aeedade2bfdf69adddb754cff0461e74541e436/model-00001-of-00004.safetensors
assert.strictEqual(RE_SAFETENSORS_SHARD_FILE.test(safetensorsPathWithDash), true);

const safetensorsPathWithUnderscore = "model_00002-of-00072.safetensors"; // https://huggingface.co/bigscience/bloom/blob/4d8e28c67403974b0f17a4ac5992e4ba0b0dbb6f/model_00002-of-00072.safetensors
assert.strictEqual(RE_SAFETENSORS_SHARD_FILE.test(safetensorsPathWithUnderscore), true);
});
});
2 changes: 1 addition & 1 deletion packages/hub/src/lib/parse-safetensors-metadata.ts
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ export const SAFETENSORS_INDEX_FILE = "model.safetensors.index.json";
/// but in some situations safetensors weights have different filenames.
export const RE_SAFETENSORS_FILE = /\.safetensors$/;
export const RE_SAFETENSORS_INDEX_FILE = /\.safetensors\.index\.json$/;
export const RE_SAFETENSORS_SHARD_FILE = /\d{5}-of-\d{5}\.safetensors$/;
export const RE_SAFETENSORS_SHARD_FILE = /[-_]?(\d{5})-of-(\d{5})\.safetensors$/;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get [-_]?, since it's optional it doesn't do anything?

Maybe remove the ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get [-_]?, since it's optional it doesn't do anything?

Screenshot 2024-04-05 at 10 34 38 src here

on example above, I'd like to think that - is part of -xxxxx-of-xxxxx.safetensors rather than model-. The reason of this decision is that, if in the future, we wanna show common name of those tensors files it should be model rather than model- & with this regex we can achieve that by str.slice(0, -str.match(RE_SAFETENSORS_SHARD_FILE).length)

Maybe remove the ?

- or _ are optional. So we need to have ?. For example there can be model00002-of-00072.safetensors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm ok but that's a tenuous dependency between hub.js and the hub. It should be made an explicit function extractShardTensorsData(filename) that returns everything (the root, the numbers, ...) in a typed JSON.

Otherwise assumptions made in one codebase about another codebase can easily be broken, especially if there's no comment / warning that some things are there for a particular reason

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think [_-] can really be optional, i've always seen either of them

const PARALLEL_DOWNLOADS = 5;
const MAX_HEADER_LENGTH = 25_000_000;

Expand Down