
Create cleaning util module mirroring util.py #211


Merged: 22 commits merged into CODAIT:master on Jul 8, 2021

Conversation

ZachEichen (Collaborator)

Addresses issue #196

Creates a module in Text Extensions for Pandas that folds the functionality of util.py into the main library.

Provides functionality for the following:

  • preprocessing of models to BERT-compatible formats
  • creation and training of models, using Ray to increase speed
  • creation and training of reduced-accuracy model ensembles using user-defined or automatic kernel sizes and seeds
  • running inference on a dataset with either IOB or per-token style labels
  • running inference as above, with conversion back to the original tokenization of the dataset
  • flagging suspicious labels given a set of inferred features, as predicted by an ensemble of models and the model labels
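A hedged sketch of how those pieces might be strung together end to end; every function name below is a hypothetical placeholder for illustration, not the module's actual API:

import text_extensions_for_pandas as tp

# 1. Preprocess a labeled corpus into a BERT-compatible format (hypothetical name).
bert_docs = tp.cleaning.preprocess_documents(corpus_df)
# 2. Train a reduced-accuracy model ensemble, optionally via Ray (hypothetical name).
ensemble = tp.cleaning.train_model_ensemble(bert_docs)
# 3. Flag labels that the ensemble's predictions disagree with (hypothetical name).
flagged = tp.cleaning.flag_suspicious_labels(ensemble, bert_docs)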

@ZachEichen requested a review from @frreiss June 21, 2021 14:05

…cause divide-by-zeros. In these special cases, we now use logarithm + addition instead, to avoid floating-point underflows
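For context on that commit, a minimal illustration of the log-space trick: a direct product of many small probabilities underflows float64, while summing their logarithms stays representable.

import numpy as np

probs = np.full(200, 1e-5)           # 200 factors of 1e-5; true product is 1e-1000

direct = np.prod(probs)              # underflows to 0.0 (float64 bottoms out near 1e-308)
log_product = np.sum(np.log(probs))  # about -2302.6, i.e. log(1e-1000), no underflow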
@frreiss (Member) left a comment


This is looking good. Can you please remove the original util.py and update the other notebooks in tutorials/corpus so that they get the shared functionality from its new location at text_extensions_for_pandas.cleaning?

I would recommend breaking the contents of util.py into categories of functionality and putting each category of function into a separate file. Some of the functions and classes may fit better in other parts of the namespace besides cleaning -- the BertActor class, for example.

Also, we don't want a hard dependency on Ray, so can you please make all usage of Ray happen on-demand?
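In the notebooks, that change would look roughly like this (a sketch, assuming they previously imported the shared helpers as a local util module):

# Before: shared helpers loaded from tutorials/corpus/util.py
# import util

# After: the same functionality from the library's new cleaning module
import text_extensions_for_pandas as tp
# ... then call tp.cleaning.<function>(...) instead of util.<function>(...)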

@ZachEichen requested a review from @frreiss June 29, 2021 17:43
@ZachEichen (Collaborator, Author) commented Jun 30, 2021

All requested changes have been made.
The new util.py has been split into three sub-modules:

  • preprocess.py
  • ensemble.py
  • analyze.py

which together encompass the same functionality.

All of the CoNLL_* notebooks have been updated to use the new module.

Ray dependencies are now on-demand: when Ray-specific functions are called, Ray is imported and then used.
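A minimal sketch of that on-demand import pattern (the function name and body are illustrative, not the actual code):

def train_ensemble_with_ray(docs, num_models):
    # Ray is imported only when a Ray-specific code path actually runs,
    # keeping it an optional rather than hard dependency.
    import ray
    if not ray.is_initialized():
        ray.init()
    # ... dispatch the per-model training tasks via ray.remote(...) ...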

A new tutorial notebook, Cross_check_datapoints.ipynb, has been created to give an overview of the functionality and demonstrate it on a classification task.

@frreiss (Member) left a comment


LGTM. Some minor comments inline.

#

################################################################################
# io module
frreiss (Member):

Suggested change:
-# io module
+# cleaning module

ZachEichen (Collaborator, Author):

updated in latest commit

################################################################################
# io module
#
# Functions in text_extensions_for_pandas that create DataFrames and convert
frreiss (Member):

This descriptive comment needs to be updated.

ZachEichen (Collaborator, Author):

Updated in latest commit


import numpy as np
import pandas as pd
import sklearn.random_projection
frreiss (Member):

Imports of sklearn and transformers need to be moved out of the top-level scope, because those packages are not part of our current requirements.txt.

ZachEichen (Collaborator, Author):

done in latest commit
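The same deferred-import pattern as the Ray change applies here; a sketch, assuming the projection is only needed inside a single function (names illustrative):

def project_embeddings(features, n_components=256):
    # Imported at call time so sklearn is only required when this path runs.
    import sklearn.random_projection
    projector = sklearn.random_projection.GaussianRandomProjection(
        n_components=n_components)
    return projector.fit_transform(features)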

corpus_label_col: str,
predicted_label_col: str,
print_output: bool = False,
):
frreiss (Member):

I would remove the print_output argument here. The caller can call print() on the return value of this function if they want that functionality.

ZachEichen (Collaborator, Author):

done in latest commit
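The caller-side replacement is a one-liner (the function name here is a hypothetical stand-in for the one under review):

# Instead of passing print_output=True:
result = compare_labels(docs_df,               # hypothetical name
                        corpus_label_col="ent",
                        predicted_label_col="predicted_ent")
print(result)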

precision, recall and accuracy figures, as well as globally averaged metrics, and returns them as a pandas DataFrame. In the 'Simple' mode, calculates micro-averaged precision, recall and F1 score and returns them as a dictionary.
:param predicted_ents: entities returned from the predictions of the model, in the
frreiss (Member):

Can you add a forward reference here to the other arguments that specify column names of these DataFrames?
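For reference, the micro-averaged metrics that docstring describes reduce to the standard formulas over corpus-wide counts (not code from this PR):

# Micro-averaged precision/recall/F1 from global counts:
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2.0 * precision * recall / (precision + recall)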

multiple subtokens all describe the same original token.
:param keep_cols: any column that you wish to be carried over to the output dataframe, by default
the column 'raw_span' is the only column to be carried over, if it exists.
"""
frreiss (Member):

Missing description of return value in docstring

ZachEichen (Collaborator, Author):

updated in latest commit
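A sketch of the kind of addition being requested, in the reST docstring style the file already uses (wording illustrative):

:returns: a DataFrame aligned to the original tokenization, one row per
    token, containing the aligned spans plus any columns requested via
    `keep_cols`.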

the tokens of those documents as spans.
:param iob_col: the column containing the predicted iob values from the model
:param entity_type_col: the column containing the predicted element types from the model
"""
frreiss (Member):

Missing description of returned DataFrame in docstring

ZachEichen (Collaborator, Author):

Updated in latest commit

pred_aligned_doc = tp.io.bert.align_bert_tokens_to_corpus_tokens(
pred_spans, raw_docs[fold][doc_num].rename({raw_docs_span_col_name: "span"})
)
pred_aligned_doc[[fold_col, doc_col]] = [fold, doc_num]
frreiss (Member):

Should this function rename the ent_type column back to the user's requested column name here?

ZachEichen (Collaborator, Author):

Updated in latest commit
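If so, a pandas rename at the end of the function would cover it (a sketch; `entity_type_col` is the user-supplied column name from the function's arguments):

# Map the internal "ent_type" column back to the caller's requested name:
pred_aligned_doc = pred_aligned_doc.rename(columns={"ent_type": entity_type_col})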

PreTrainedTokenizerFast which supports `encode_plus` with
return_offsets_mapping=True.
A default tokenizer will be used if this is `None` or not specified
:param token_col: the column in each of the dataframes in `doc` containing the spans
frreiss (Member):

doc ==> document

ZachEichen (Collaborator, Author):

Updated in latest commit

)
raw_tok_spans = (
tp.TokenSpanArray.align_to_tokens(bert_toks["span"], document[token_col])
.as_frame()
frreiss (Member):

Might want to skip the round trip through a new DataFrame and just pull out the begin_token and end_token attributes of the return value from align_to_tokens.

ZachEichen (Collaborator, Author):

Updated in latest commit
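A sketch of the suggested shortcut, assuming align_to_tokens returns a TokenSpanArray whose begin_token and end_token attributes can be read directly:

aligned = tp.TokenSpanArray.align_to_tokens(bert_toks["span"], document[token_col])
# Pull the token offsets straight off the span array, skipping the
# intermediate DataFrame round trip:
begin_toks = aligned.begin_token
end_toks = aligned.end_token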

@frreiss (Member) commented Jul 7, 2021

@ZachEichen can you please pull the latest fixes from the master branch into your PR branch to unblock the CI tests?

@frreiss merged commit b75c406 into CODAIT:master Jul 8, 2021