Fix KeyError occurring using fine_tunes.prepare_data #125

Merged
merged 9 commits into from
Oct 16, 2022

Contributor

@serinamarie serinamarie commented Sep 27, 2022

Description

A KeyError occurs when running openai tools fine_tunes.prepare_data -f training_file.jsonl if training_file.jsonl contains a prompt/completion pair that is BOTH a duplicate and a long example. To keep the change small, this fix:

  • wraps the retrieval of long_examples and long_indexes in a function. The function is called once up front in long_examples_validator to report how many rows are long examples, and again inside optional_fn when the rows are actually dropped. If the indices to be dropped have changed in between, the user is told so.
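
As a rough illustration of the shape described above (not the merged code), the sketch below computes the long-row indices in a helper and calls it twice: once for the analysis message and once inside optional_fn against whatever DataFrame is current at that point. The helper name get_long_indexes, the threshold parameter and its default, the message wording, and the (message, optional_fn) return value are all assumptions made for this example.

```python
import pandas as pd


def get_long_indexes(df, threshold):
    """Return the index labels of rows whose prompt + completion is too long.

    `threshold` is a hypothetical character cutoff used only in this sketch.
    """
    lengths = df["prompt"].str.len() + df["completion"].str.len()
    return df.index[(lengths > threshold).to_numpy()].tolist()


def long_examples_validator(df, threshold=10_000):
    """Analyse `df` now and return (message, optional_fn) for later use."""
    # First call: used only to report what the analysis found.
    long_indexes = get_long_indexes(df, threshold)
    message = (
        f"There are {len(long_indexes)} examples that are very long. "
        f"These are rows: {long_indexes}"
    )

    def optional_fn(current_df):
        # Second call: recompute against the frame as it is now, since
        # earlier recommendations may have removed or renumbered rows.
        to_drop = get_long_indexes(current_df, threshold)
        if to_drop != long_indexes:
            print(
                f"The long examples to drop have moved; "
                f"{len(to_drop)} remain at indices {to_drop}"
            )
        return current_df.drop(to_drop)

    return message, optional_fn
```

Because optional_fn takes the current DataFrame as its argument rather than relying on the list captured during analysis, it never asks pandas to drop a label that an earlier step has already removed.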

Related Issue

Fixes #121

Other Notes

I am also happy to provide a file that can be used to reproduce this error.

Example Output:

When the error would normally occur, instead you would see:

❯ openai tools fine_tunes.prepare_data -f training_file_0927.jsonl
Analyzing...
- There are 2 duplicated prompt-completion sets. These are rows: [5, 6]
- There are 2 examples that are very long. These are rows: [3, 6]
Based on the analysis we will perform the following actions:
- [Recommended] Remove 2 duplicate rows [Y/n]: y
- [Recommended] Remove 2 long examples [Y/n]: y
The indices of the long examples has changed as a result of a previously applied recommendation.
The 1 long examples to be dropped are now at the following indices: [3]
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: y 

@serinamarie serinamarie marked this pull request as ready for review September 27, 2022 20:56
@serinamarie serinamarie changed the title Resolve key error through cli Resolve key error through CLI Sep 27, 2022
@serinamarie serinamarie changed the title Resolve key error through CLI Fix KeyError occurring using fine_tunes.prepare_data Sep 27, 2022
Collaborator

@hallacy hallacy left a comment

Awesome! Sorry for the delay. A few nits and then I think this is good to go. Can you explain a bit more about why this works?

@serinamarie
Contributor Author

> Awesome! Sorry for the delay. A few nits and then I think this is good to go. Can you explain a bit more about why this works?

Hi @hallacy, it works because the original code computes the indices to drop for long examples (originally at line 164) before any recommendation has been applied. For example, a row that is both a long example AND a duplicate may exist at index 4. When the current script removes the duplicates, it removes index 4, but index 4 is still 'scheduled' for deletion when the long examples are dropped at original line 170. When optional_fn() then runs, that key no longer exists, which causes the KeyError.
tl;dr sometimes optional_fn() tries to delete an index that no longer exists, because long_indexes was calculated before any recommendations were applied.

By recalculating long_indexes when optional_fn() is actually applied to the df, we ensure that the list of indices for the long examples is up to date rather than possibly stale. I hope that helps to clear it up; I could have included that in the description :)
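
The failure mode described above can be reproduced in isolation with pandas. This is a minimal sketch on made-up data, not the project's validator code: the helper long_rows, the 20-character threshold, and the sample rows are assumptions chosen so that the row numbers match the example output in the description (duplicates at rows 5 and 6, long examples at rows 3 and 6).

```python
import pandas as pd

# Hypothetical data: row 3 is long, row 5 duplicates row 1,
# and row 6 is both long and a duplicate of row 3.
long_prompt = "x" * 30
df = pd.DataFrame(
    {
        "prompt": ["p0", "p1", "p2", long_prompt, "p4", "p1", long_prompt],
        "completion": ["c0", "c1", "c2", "c3", "c4", "c1", "c3"],
    }
)


def long_rows(frame, threshold=20):
    """Index labels of rows whose prompt + completion exceeds `threshold` characters."""
    lengths = frame["prompt"].str.len() + frame["completion"].str.len()
    return frame.index[(lengths > threshold).to_numpy()].tolist()


# Computed up front, before any recommendation has been applied.
stale_indexes = long_rows(df)

# First recommendation: remove duplicates and renumber the remaining rows.
deduped = df.drop_duplicates(subset=["prompt", "completion"]).reset_index(drop=True)

# Old behaviour: drop using the list computed earlier.
try:
    deduped.drop(stale_indexes)
    stale_drop_failed = False
except KeyError:
    stale_drop_failed = True  # label 6 was already removed with the duplicates

# Fixed behaviour: recompute the labels against the current frame, then drop.
fresh_indexes = long_rows(deduped)
cleaned = deduped.drop(fresh_indexes)

print(stale_indexes, stale_drop_failed, fresh_indexes, len(cleaned))
# prints: [3, 6] True [3] 4
```

DataFrame.drop raises KeyError by default when any requested label is missing, which is why a single stale label is enough to abort the whole step.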

@serinamarie serinamarie requested a review from hallacy October 15, 2022 00:10
@hallacy
Collaborator

hallacy commented Oct 16, 2022

Love it! Thank you!

@hallacy hallacy merged commit d1769c1 into openai:main Oct 16, 2022
cgayapr pushed a commit to cgayapr/openai-python that referenced this pull request Dec 14, 2024
* Initial commit

* Add fix

* Reinstate reset_index()

* Add suggestions

* Remove print stmt

* punctuation

* Add test for fine_tunes.prepare_data

* Renamed file, added docstrings

* Move comment placement
Development

Successfully merging this pull request may close these issues.

KeyError when removing long examples after removing duplicate rows
2 participants