Commit b0c71c7

scripts : platform independent script to verify sha256 checksums (#1203)
* python script to verify the checksum of the llama models

  Added Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability.

* Update README.md

  Update to the README for improved readability and to explain the usage of the Python checksum verification script.

* update the verification script

  I've extended the script based on suggestions by @prusnak. The script now checks the available RAM; if there is enough to read the file at once, it will do so. If not, the file is read in chunks.

* minor improvement

  Small change so that the available RAM is checked and not the total RAM.

* remove the part of the code that reads the file at once if enough ram is available

  Based on suggestions from @prusnak, I removed the part of the code that checks whether the user had enough RAM to read the entire model at once. The file is now always read in chunks.

* Update verify-checksum-models.py

  Quick fix to pass the git check.
1 parent a8a2efd commit b0c71c7

2 files changed: 98 additions, 12 deletions

README.md

Lines changed: 20 additions & 12 deletions
@@ -371,29 +371,37 @@ python3 convert.py models/gpt4all-7B/gpt4all-lora-quantized.bin
 
 - The newer GPT4All-J model is not yet supported!
 
-### Obtaining and verifying the Facebook LLaMA original model and Stanford Alpaca model data
+### Obtaining the Facebook LLaMA original model and Stanford Alpaca model data
 
 - **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**
 - The LLaMA models are officially distributed by Facebook and will **never** be provided through this repository.
 - Refer to [Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to request access to the model data.
-- Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
-- The following command will verify if you have all possible latest files in your self-installed `./models` subdirectory:
 
-  `sha256sum --ignore-missing -c SHA256SUMS` on Linux
+### Verifying the model files
 
-  or
+Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
+- The following python script will verify if you have all possible latest files in your self-installed `./models` subdirectory:
 
-  `shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
+```bash
+# run the verification script
+python3 .\scripts\verify-checksum-models.py
+```
+
+- On linux or macOS it is also possible to run the following commands to verify if you have all possible latest files in your self-installed `./models` subdirectory:
+    - On Linux: `sha256sum --ignore-missing -c SHA256SUMS`
+    - on macOS: `shasum -a 256 --ignore-missing -c SHA256SUMS`
+
+### Seminal papers and background on the models
 
-- If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
+If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
 - LLaMA:
-  - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
-  - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
+    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
 - GPT-3
-  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
 - GPT-3.5 / InstructGPT / ChatGPT:
-  - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
-  - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
+    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
 
 ### Perplexity (measuring model quality)
 
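
Note on the README hunk: the `sha256sum --ignore-missing -c SHA256SUMS` command it mentions reads one `<digest>  <filename>` entry per line and does not report files that are absent. The sketch below illustrates that behaviour in Python. It is not the script added by this commit; the names `parse_sums_line` and `check_sums` are hypothetical, the line layout is assumed to be the usual coreutils one (64 hex digits, a space, then a space or `*`, then the name), and escaped filenames are not handled.

```python
import hashlib
from pathlib import Path


def parse_sums_line(line):
    """Split one checksum line into (hex_digest, filename).

    Assumed layout: 64 hex digits, one space, then ' ' (text mode)
    or '*' (binary mode), then the file name.
    """
    digest, rest = line[:64], line[64:]
    if len(rest) < 3 or rest[0] != " " or rest[1] not in " *":
        raise ValueError(f"unrecognised checksum line: {line!r}")
    return digest.lower(), rest[2:]


def check_sums(sums_file, base_dir, ignore_missing=True):
    """Return {filename: "OK" | "FAILED" | "MISSING"} for the listed files.

    With ignore_missing=True, absent files are left out of the report.
    """
    base_dir = Path(base_dir)
    report = {}
    for line in Path(sums_file).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the list
        expected, name = parse_sums_line(line)
        target = base_dir / name
        if not target.is_file():
            if not ignore_missing:
                report[name] = "MISSING"
            continue
        digest = hashlib.sha256()
        with open(target, "rb") as f:
            # Read in 1 MiB chunks so large model files never sit in memory whole.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        report[name] = "OK" if digest.hexdigest() == expected else "FAILED"
    return report
```

Calling `check_sums("SHA256SUMS", ".")` from the repository root would, under these assumptions, give a per-file status comparable to the table the new script prints.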

scripts/verify-checksum-models.py

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+import os
+import hashlib
+
+def sha256sum(file):
+    block_size = 16 * 1024 * 1024  # 16 MB block size
+    b = bytearray(block_size)
+    file_hash = hashlib.sha256()
+    mv = memoryview(b)
+    with open(file, 'rb', buffering=0) as f:
+        while True:
+            n = f.readinto(mv)
+            if not n:
+                break
+            file_hash.update(mv[:n])
+
+    return file_hash.hexdigest()
+
+# Define the path to the llama directory (parent folder of script directory)
+llama_path = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))
+
+# Define the file with the list of hashes and filenames
+hash_list_file = os.path.join(llama_path, "SHA256SUMS")
+
+# Check if the hash list file exists
+if not os.path.exists(hash_list_file):
+    print(f"Hash list file not found: {hash_list_file}")
+    exit(1)
+
+# Read the hash file content and split it into an array of lines
+with open(hash_list_file, "r") as f:
+    hash_list = f.read().splitlines()
+
+# Create an array to store the results
+results = []
+
+# Loop over each line in the hash list
+for line in hash_list:
+    # Split the line into hash and filename
+    hash_value, filename = line.split("  ")
+
+    # Get the full path of the file by joining the llama path and the filename
+    file_path = os.path.join(llama_path, filename)
+
+    # Informing user of the progress of the integrity check
+    print(f"Verifying the checksum of {file_path}")
+
+    # Check if the file exists
+    if os.path.exists(file_path):
+        # Calculate the SHA256 checksum of the file using hashlib
+        file_hash = sha256sum(file_path)
+
+        # Compare the file hash with the expected hash
+        if file_hash == hash_value:
+            valid_checksum = "V"
+            file_missing = ""
+        else:
+            valid_checksum = ""
+            file_missing = ""
+    else:
+        valid_checksum = ""
+        file_missing = "X"
+
+    # Add the results to the array
+    results.append({
+        "filename": filename,
+        "valid checksum": valid_checksum,
+        "file missing": file_missing
+    })
+
+
+# Print column headers for results table
+print("\n" + "filename".ljust(40) + "valid checksum".center(20) + "file missing".center(20))
+print("-" * 80)
+
+# Output the results as a table
+for r in results:
+    print(f"{r['filename']:40} {r['valid checksum']:^20} {r['file missing']:^20}")
+
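
Note on the chunked read: the commit message says the file is now always read in chunks instead of checking available RAM first. This is safe because SHA-256 is computed incrementally, so feeding the data block by block gives the same digest as hashing it in one call. The sketch below checks this on in-memory data. The function name `sha256_stream` and the small block size are assumptions made for the demonstration; the committed script works on a file path with a 16 MB buffer.

```python
import hashlib
import io


def sha256_stream(stream, block_size=16 * 1024 * 1024):
    """Hash a binary stream block by block, reusing one preallocated buffer."""
    digest = hashlib.sha256()
    buf = bytearray(block_size)
    view = memoryview(buf)  # lets us slice the buffer without copying
    while True:
        n = stream.readinto(view)  # number of bytes written into the buffer
        if not n:
            break
        digest.update(view[:n])  # only hash the bytes actually read
    return digest.hexdigest()


# Compare chunked and one-shot hashing on the same bytes.
data = bytes(range(256)) * 1000
chunked = sha256_stream(io.BytesIO(data), block_size=4096)
whole = hashlib.sha256(data).hexdigest()
print(chunked == whole)
```

Reusing a single `bytearray` through a `memoryview` keeps peak memory at roughly one block regardless of file size, which is presumably why the RAM check could be dropped.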
