Create cleaning util module mirroring util.py #211
Conversation
…cause divide-by-zeros. In these special cases we now use logarithm + addition instead, to avoid floating-point underflows.
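A minimal sketch of the technique described in that commit, using hypothetical values rather than the PR's actual code: a product of many small probabilities underflows to zero in floating point, while accumulating the logarithms of the same factors stays representable.

```python
import numpy as np

probs = np.array([1e-200, 1e-180, 1e-150])

naive_product = np.prod(probs)        # underflows to 0.0 in float64
log_product = np.sum(np.log(probs))   # log of the same product, approx. -1220.4

print(naive_product, log_product)
```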
This is looking good. Can you please remove the original util.py and update the other notebooks in tutorials/corpus so that they get the shared functionality from its new location at text_extensions_for_pandas.cleaning?
I would recommend breaking the contents of util.py into categories of functionality and putting each category of function into a separate file. Some of the functions and classes may fit better at other parts of the namespace besides cleaning -- the BertActor class, for example.
Also, we don't want a hard dependency on Ray, so can you please make all usage of Ray on-demand?
…mit with unified cleaning.util
All changes requested have been made.
…which together encompass the same functionality. All of the CoNLL_* notebooks have been updated to use the new module. Ray dependencies are now loaded on demand: when a Ray-specific function is called, Ray is imported and then used. A new tutorial document, …
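A sketch of the on-demand pattern described here (the function and its body are hypothetical, not the PR's code): Ray is imported inside the function that needs it, so importing the cleaning module itself does not require Ray to be installed.

```python
def clean_docs_in_parallel(docs):
    """Hypothetical Ray-backed helper; Ray is imported only when this is called."""
    import ray  # deferred import keeps Ray an optional dependency

    if not ray.is_initialized():
        ray.init()

    @ray.remote
    def _clean(doc):
        return doc.strip().lower()

    return ray.get([_clean.remote(d) for d in docs])
```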
LGTM. Some minor comments inline.
#
################################################################################
# io module
Suggested change:
- # io module
+ # cleaning module
updated in latest commit
################################################################################
# io module
#
# Functions in text_extensions_for_pandas that create DataFrames and convert
This descriptive comment needs to be updated.
Updated in latest commit
import numpy as np
import pandas as pd
import sklearn.random_projection
Imports of sklearn and transformers need to be moved out of the top-level scope, because those packages are not part of our current requirements.txt.
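One way to address this, sketched here with hypothetical names rather than the module's actual functions, is to defer the heavy imports into the bodies of the functions that use them:

```python
import numpy as np


def project_embeddings(embeddings: np.ndarray, n_components: int = 64) -> np.ndarray:
    """Hypothetical example: sklearn is imported only when this function runs."""
    import sklearn.random_projection  # deferred; sklearn is not in requirements.txt

    proj = sklearn.random_projection.GaussianRandomProjection(n_components=n_components)
    return proj.fit_transform(embeddings)
```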
done in latest commit
corpus_label_col: str,
predicted_label_col: str,
print_output: bool = False,
):
I would remove the print_output argument here. The caller can call print() on the return value of this function if they want that functionality.
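Under that design the caller opts in to printing. A minimal sketch with a hypothetical stand-in function and made-up numbers, just to illustrate the calling convention:

```python
import pandas as pd


def analyze_model(corpus_label_col: str, predicted_label_col: str) -> pd.DataFrame:
    """Hypothetical stand-in: returns metrics but never prints them itself."""
    return pd.DataFrame({"precision": [0.9], "recall": [0.85], "F1": [0.87]})


# The caller decides whether to print the result.
print(analyze_model("ent_type", "predicted_ent_type"))
```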
done in latest commit
precision, recall and accuracy figures, as well as global averaged metrics, and
returns them as a pandas DataFrame. In the 'Simple' mode, calculates micro-averaged
precision, recall and F1 score and returns them as a dictionary.
:param predicted_ents: entities returned from the predictions of the model, in the
Can you add a forward reference here to the other arguments that specify column names of these DataFrames?
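For instance (the function name and wording here are hypothetical; the parameter names are taken from the snippets above), the docstring entry could point ahead to the column-name arguments:

```python
def analyze_predictions(predicted_ents, corpus_label_col: str, predicted_label_col: str):
    """
    :param predicted_ents: entities returned from the predictions of the model;
        see `corpus_label_col` and `predicted_label_col` below for the names of
        the label columns in these DataFrames.
    :param corpus_label_col: name of the gold-label column.
    :param predicted_label_col: name of the predicted-label column.
    """
```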
multiple subtokens all describe the same original token.
:param keep_cols: any columns that you wish to be carried over to the output dataframe;
by default the column 'raw_span' is the only column carried over, if it exists.
"""
Missing description of return value in docstring
updated in latest commit
the tokens of those documents as spans.
:param iob_col: the column containing the predicted IOB values from the model
:param entity_type_col: the column containing the predicted entity types from the model
"""
Missing description of returned DataFrame in docstring
Updated in latest commit
pred_aligned_doc = tp.io.bert.align_bert_tokens_to_corpus_tokens(
    pred_spans, raw_docs[fold][doc_num].rename({raw_docs_span_col_name: "span"})
)
pred_aligned_doc[[fold_col, doc_col]] = [fold, doc_num]
Should this function rename the ent_type column back to the user's requested column name here?
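A sketch of what that could look like, reusing names from the snippet above (`entity_type_col` here stands for the caller-supplied column name; this is not the PR's code):

```python
# Hypothetical: map the internal "ent_type" column back to the requested name.
pred_aligned_doc = pred_aligned_doc.rename(columns={"ent_type": entity_type_col})
```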
Updated in latest commit
PreTrainedTokenizerFast which supports `encode_plus` with
return_offsets_mapping=True.
A default tokenizer will be used if this is `None` or not specified.
:param token_col: the column in each of the dataframes in `doc` containing the spans
doc ==> document
Updated in latest commit
)
raw_tok_spans = (
    tp.TokenSpanArray.align_to_tokens(bert_toks["span"], document[token_col])
    .as_frame()
Might want to skip the round-trip through a new DataFrame and just pull out the begin_token and end_token attributes of the return value from align_to_tokens.
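A sketch of that suggestion, reusing the names from the snippet above: keep the array returned by `align_to_tokens` and read its `begin_token` / `end_token` attributes directly instead of converting it to a DataFrame first.

```python
aligned = tp.TokenSpanArray.align_to_tokens(bert_toks["span"], document[token_col])
begin_toks = aligned.begin_token  # token-level begin offsets
end_toks = aligned.end_token      # token-level end offsets
```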
Updated in latest commit
@ZachEichen can you please pull the latest fixes from the …
Addresses issue #196.
Creates a module in text extensions that folds the functionality of util.py into the main library.
Provides functionality for the following:
- use of ray to increase speed