
Commit 26fbacb

Remove packing and default batch size from FT cli (#60)
1 parent 0e21703 commit 26fbacb

File tree (6 files changed: +11 -58 lines changed)

examples/embeddings/Code_search.ipynb
examples/finetuning/finetuning-classification.ipynb
examples/finetuning/olympics-3-train-qa.ipynb
openai/cli.py
openai/validators.py
openai/version.py

6 files changed

+11
-58
lines changed

examples/embeddings/Code_search.ipynb

Lines changed: 1 addition & 1 deletion
@@ -260,7 +260,7 @@
 "def format_inferrer_validator(df):\n",
 " \"\"\"\n",
 " This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.\n",
-" It will also suggest to use ada, --no_packing and explain train/validation split benefits.\n",
+" It will also suggest to use ada and explain train/validation split benefits.\n",
 " \"\"\"\n",
 " ft_type = infer_task_type(df)\n",
 " immediate_msg = None\n",

examples/finetuning/finetuning-classification.ipynb

Lines changed: 5 additions & 5 deletions
@@ -257,7 +257,7 @@
 "\n",
 "- Your file contains 1197 prompt-completion pairs\n",
 "- Based on your data it seems like you're trying to fine-tune a model for classification\n",
-"- For classification, we recommend you try one of the faster and cheaper models, such as `ada`. You should also set the `--no_packing` parameter when fine-tuning\n",
+"- For classification, we recommend you try one of the faster and cheaper models, such as `ada`\n",
 "- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training\n",
 "- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]\n",
 "For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.\n",
@@ -277,7 +277,7 @@
 "Feel free to take a look!\n",
 "\n",
 "Now use that file when fine-tuning:\n",
-"> openai api fine_tunes.create -t \"sport2_prepared_train.jsonl\" -v \"sport2_prepared_valid.jsonl\" --no_packing --compute_classification_metrics --classification_positive_class \" baseball\"\n",
+"> openai api fine_tunes.create -t \"sport2_prepared_train.jsonl\" -v \"sport2_prepared_valid.jsonl\" --compute_classification_metrics --classification_positive_class \" baseball\"\n",
 "\n",
 "After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\\n\\n###\\n\\n` for the model to start generating completions, rather than continuing with the prompt.\n",
 "Once your model starts training, it'll approximately take 30.8 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.\n"
@@ -301,7 +301,7 @@
 "cell_type": "markdown",
 "source": [
 "## Fine-tuning\n",
-"The tool suggests we run the following command to train the dataset. Since this is a classification task, we would like to know what the generalization performance on the provided validation set is for our classification use case. The tool suggests to add `--compute_classification_metrics --classification_positive_class \" baseball\"` in order to compute the classification metrics. Classification performs better with a hyperparameter `--no_packing`.\n",
+"The tool suggests we run the following command to train the dataset. Since this is a classification task, we would like to know what the generalization performance on the provided validation set is for our classification use case. The tool suggests to add `--compute_classification_metrics --classification_positive_class \" baseball\"` in order to compute the classification metrics.\n",
 "\n",
 "We can simply copy the suggested command from the CLI tool. We specifically add `-m ada` to fine-tune a cheaper and faster ada model, which is usually comperable in performance to slower and more expensive models on classification use cases. "
 ],
@@ -311,7 +311,7 @@
 "cell_type": "code",
 "execution_count": 9,
 "source": [
-"!openai api fine_tunes.create -t \"sport2_prepared_train.jsonl\" -v \"sport2_prepared_valid.jsonl\" --no_packing --compute_classification_metrics --classification_positive_class \" baseball\" -m ada"
+"!openai api fine_tunes.create -t \"sport2_prepared_train.jsonl\" -v \"sport2_prepared_valid.jsonl\" --compute_classification_metrics --classification_positive_class \" baseball\" -m ada"
 ],
 "outputs": [
 {
@@ -737,4 +737,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
+}

examples/finetuning/olympics-3-train-qa.ipynb

Lines changed: 2 additions & 2 deletions
@@ -373,7 +373,7 @@
 }
 ],
 "source": [
-"!openai api fine_tunes.create -t \"olympics-data/discriminator_train.jsonl\" -v \"olympics-data/discriminator_test.jsonl\" --no_packing --batch_size 16 --compute_classification_metrics --classification_positive_class \" yes\" --model ada"
+"!openai api fine_tunes.create -t \"olympics-data/discriminator_train.jsonl\" -v \"olympics-data/discriminator_test.jsonl\" --batch_size 16 --compute_classification_metrics --classification_positive_class \" yes\" --model ada"
 ]
 },
 {
@@ -391,7 +391,7 @@
 }
 ],
 "source": [
-"!openai api fine_tunes.create -t \"olympics-data/qa_train.jsonl\" -v \"olympics-data/qa_test.jsonl\" --no_packing --batch_size 16"
+"!openai api fine_tunes.create -t \"olympics-data/qa_train.jsonl\" -v \"olympics-data/qa_test.jsonl\" --batch_size 16"
 ]
 },
 {

openai/cli.py

Lines changed: 0 additions & 18 deletions
@@ -397,7 +397,6 @@ def create(cls, args):
 "batch_size",
 "learning_rate_multiplier",
 "prompt_loss_weight",
-"use_packing",
 "compute_classification_metrics",
 "classification_n_classes",
 "classification_positive_class",
@@ -891,23 +890,6 @@ def help(args):
     "learning rate is determined by the original learning rate used for "
     "pretraining multiplied by this value.",
 )
-sub.add_argument(
-    "--use_packing",
-    action="store_true",
-    dest="use_packing",
-    help="On classification tasks, we recommend not setting this flag. "
-    "On all other tasks, we recommend setting it. "
-    "When set, we pack as many prompt-completion pairs as possible into each "
-    "training example. This greatly increases the speed of a fine-tuning job, "
-    "often without negatively affecting model performance.",
-)
-sub.add_argument(
-    "--no_packing",
-    action="store_false",
-    dest="use_packing",
-    help="Disables the packing flag (see --use_packing for description).",
-)
-sub.set_defaults(use_packing=None)
 sub.add_argument(
     "--prompt_loss_weight",
     type=float,
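With the packing flags and their default gone, the create path appears to forward only the hyperparameters that remain in the list shown in the first hunk. A minimal sketch of that pass-through pattern, assuming a generic argparse namespace rather than the literal openai/cli.py code:

import argparse  # only to illustrate the kind of namespace the CLI hands over

# Hyperparameter names still forwarded after this commit; the surrounding
# helper function is an illustrative assumption, not code from the repo.
FORWARDED_PARAMS = [
    "batch_size",
    "learning_rate_multiplier",
    "prompt_loss_weight",
    "compute_classification_metrics",
    "classification_n_classes",
    "classification_positive_class",
]

def collect_create_kwargs(args: argparse.Namespace) -> dict:
    # Forward only the values the user explicitly set; "use_packing" is no
    # longer in the list, so no packing setting is ever sent to the API.
    return {
        name: getattr(args, name)
        for name in FORWARDED_PARAMS
        if getattr(args, name) is not None
    }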

openai/validators.py

Lines changed: 2 additions & 31 deletions
@@ -2,7 +2,6 @@
 import sys
 from typing import Any, Callable, NamedTuple, Optional
 
-import numpy as np
 import pandas as pd
@@ -535,12 +534,12 @@ def read_any_format(fname, fields=["prompt", "completion"]):
 def format_inferrer_validator(df):
     """
     This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.
-    It will also suggest to use ada, --no_packing and explain train/validation split benefits.
+    It will also suggest to use ada and explain train/validation split benefits.
     """
     ft_type = infer_task_type(df)
     immediate_msg = None
     if ft_type == "classification":
-        immediate_msg = f"\n- Based on your data it seems like you're trying to fine-tune a model for {ft_type}\n- For classification, we recommend you try one of the faster and cheaper models, such as `ada`. You should also set the `--no_packing` parameter when fine-tuning\n- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training"
+        immediate_msg = f"\n- Based on your data it seems like you're trying to fine-tune a model for {ft_type}\n- For classification, we recommend you try one of the faster and cheaper models, such as `ada`\n- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training"
     return Remediation(name="num_examples", immediate_msg=immediate_msg)
@@ -634,27 +633,6 @@ def get_classification_hyperparams(df):
     return n_classes, pos_class
 
 
-def get_batch_size_suggestion(df, no_packing):
-    """
-    Suggest the batch size based on the number of examples after packing optionally is applied.
-    """
-    n_examples, n_characters = (
-        len(df),
-        df.completion.str.len().sum() + df.prompt.str.len().sum(),
-    )
-    BATCH_SIZE_TO_N_EXAMPLES_RATIO = 0.002
-    BATCH_SIZE_TO_N_CHARACTERS_RATIO = BATCH_SIZE_TO_N_EXAMPLES_RATIO / 10_000
-
-    if no_packing:
-        batch_size = BATCH_SIZE_TO_N_EXAMPLES_RATIO * n_examples
-    else:
-        batch_size = BATCH_SIZE_TO_N_CHARACTERS_RATIO * n_characters
-
-    batch_size = max(1, int(2 ** np.ceil(np.log2(batch_size))))
-    batch_size_suggestion = f" --batch_size {batch_size}"
-    return batch_size_suggestion
-
-
 def write_out_file(df, fname, any_remediations, auto_accept):
     """
     This function will write out a dataframe to a file, if the user would like to proceed, and also offer a fine-tuning command with the newly created file.
@@ -670,14 +648,7 @@ def write_out_file(df, fname, any_remediations, auto_accept):
     if accept_suggestion(input_text, auto_accept):
         split = True
 
-    no_packing = ft_format == "classification" or (
-        ft_format == "conditional generation" and len(df) < 1000
-    )
     additional_params = ""
-    if no_packing:
-        additional_params = " --no_packing"
-    additional_params += get_batch_size_suggestion(df, no_packing)
-
     common_prompt_suffix_new_line_handled = common_prompt_suffix.replace("\n", "\\n")
     common_completion_suffix_new_line_handled = common_completion_suffix.replace(
         "\n", "\\n"

openai/version.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-VERSION = "0.11.4"
+VERSION = "0.11.5"
