
Create cleaning util module mirroring util.py #211


Merged: 22 commits merged into CODAIT:master on Jul 8, 2021

Conversation

ZachEichen (Collaborator)

Addresses issue #196

Creates a module in Text Extensions for Pandas that folds the functionality of util.py into the main library.

Provides functionality for the following:

  • preprocessing of models to BERT-compatible formats
  • creation and training of models, using Ray to increase speed
  • creation and training of reduced-accuracy model ensembles using user-defined or automatic kernel sizes and seeds
  • running inference on a dataset with either IOB or per-token style labels
  • running inference as above, with conversion back to the original tokenization of the dataset
  • flagging suspicious labels given a set of inferred features, as predicted by an ensemble of models and the model labels
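A hedged sketch of how those pieces might be strung together end to end; every function name below is a hypothetical placeholder for illustration, not the module's actual API:

import text_extensions_for_pandas as tp

# 1. Preprocess a labeled corpus into a BERT-compatible format (hypothetical name).
bert_docs = tp.cleaning.preprocess_documents(corpus_df)
# 2. Train a reduced-accuracy model ensemble, optionally via Ray (hypothetical name).
ensemble = tp.cleaning.train_model_ensemble(bert_docs)
# 3. Flag labels that the ensemble's predictions disagree with (hypothetical name).
flagged = tp.cleaning.flag_suspicious_labels(ensemble, bert_docs)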

@ZachEichen requested a review from @frreiss June 21, 2021 14:05

…cause divide-by-zeros. In these special cases, we now use logarithm + addition instead, to avoid floating-point underflows
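For context on that commit, a minimal illustration of the log-space trick: a direct product of many small probabilities underflows float64, while summing their logarithms stays representable.

import numpy as np

probs = np.full(200, 1e-5)           # 200 factors of 1e-5; true product is 1e-1000

direct = np.prod(probs)              # underflows to 0.0 (float64 bottoms out near 1e-308)
log_product = np.sum(np.log(probs))  # about -2302.6, i.e. log(1e-1000), no underflow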
@frreiss (Member) left a comment


This is looking good. Can you please remove the original util.py and update the other notebooks in tutorials/corpus so that they get the shared functionality from its new location at text_extensions_for_pandas.cleaning?

I would recommend breaking the contents of util.py into categories of functionality and putting each category of function into a separate file. Some of the functions and classes may fit better in other parts of the namespace besides cleaning -- the BertActor class, for example.

Also, we don't want a hard dependency on Ray, so can you please make all usage of Ray happen on-demand?
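In the notebooks, that change would look roughly like this (a sketch, assuming they previously imported the shared helpers as a local util module):

# Before: shared helpers loaded from tutorials/corpus/util.py
# import util

# After: the same functionality from the library's new cleaning module
import text_extensions_for_pandas as tp
# ... then call tp.cleaning.<function>(...) instead of util.<function>(...)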

@ZachEichen requested a review from @frreiss June 29, 2021 17:43
@ZachEichen (Collaborator, Author) commented Jun 30, 2021

All requested changes have been made.
The new util.py has been split into three sub-modules:

  • preprocess.py
  • ensemble.py
  • analyze.py

which together encompass the same functionality.

All of the CoNLL_* notebooks have been updated to use the new module.

Ray dependencies are now on-demand: when Ray-specific functions are called, Ray is imported and then used.
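A minimal sketch of that on-demand import pattern (the function name and body are illustrative, not the actual code):

def train_ensemble_with_ray(docs, num_models):
    # Ray is imported only when a Ray-specific code path actually runs,
    # keeping it an optional rather than hard dependency.
    import ray
    if not ray.is_initialized():
        ray.init()
    # ... dispatch the per-model training tasks via ray.remote(...) ...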

A new tutorial notebook, Cross_check_datapoints.ipynb, has been created to give an overview of the functionality and demonstrate it on a classification task.

@frreiss (Member) left a comment


LGTM. Some minor comments inline.

#

################################################################################
# io module
frreiss (Member):

Suggested change:
-# io module
+# cleaning module

ZachEichen (Collaborator, Author):

updated in latest commit

################################################################################
# io module
#
# Functions in text_extensions_for_pandas that create DataFrames and convert
frreiss (Member):

This descriptive comment needs to be updated.

ZachEichen (Collaborator, Author):

Updated in latest commit


import numpy as np
import pandas as pd
import sklearn.random_projection
frreiss (Member):

Imports of sklearn and transformers need to be moved out of the top-level scope, because those packages are not part of our current requirements.txt.

ZachEichen (Collaborator, Author):

done in latest commit
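The same deferred-import pattern as the Ray change applies here; a sketch, assuming the projection is only needed inside a single function (names illustrative):

def project_embeddings(features, n_components=256):
    # Imported at call time so sklearn is only required when this path runs.
    import sklearn.random_projection
    projector = sklearn.random_projection.GaussianRandomProjection(
        n_components=n_components)
    return projector.fit_transform(features)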

corpus_label_col: str,
predicted_label_col: str,
print_output: bool = False,
):
frreiss (Member):

I would remove the print_output argument here. The caller can call print() on the return value of this function if they want that functionality.

ZachEichen (Collaborator, Author):

done in latest commit
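The caller-side replacement is a one-liner (the function name here is a hypothetical stand-in for the one under review):

# Instead of passing print_output=True:
result = compare_labels(docs_df,               # hypothetical name
                        corpus_label_col="ent",
                        predicted_label_col="predicted_ent")
print(result)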

precision, recall and accuracy figures, as well as globally averaged metrics, and returns them as a pandas DataFrame. In the 'Simple' mode, calculates micro-averaged precision, recall and F1 score and returns them as a dictionary.
:param predicted_ents: entities returned from the predictions of the model, in the
frreiss (Member):

Can you add a forward reference here to the other arguments that specify column names of these DataFrames?
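For reference, the micro-averaged metrics that docstring describes reduce to the standard formulas over corpus-wide counts (not code from this PR):

# Micro-averaged precision/recall/F1 from global counts:
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2.0 * precision * recall / (precision + recall)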

multiple subtokens all describe the same original token.
:param keep_cols: any column that you wish to be carried over to the output dataframe, by default
the column 'raw_span' is the only column to be carried over, if it exists.
"""
frreiss (Member):

Missing description of return value in docstring

ZachEichen (Collaborator, Author):

updated in latest commit
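A sketch of the kind of addition being requested, in the reST docstring style the file already uses (wording illustrative):

:returns: a DataFrame aligned to the original tokenization, one row per
    token, containing the aligned spans plus any columns requested via
    `keep_cols`.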

the tokens of those documents as spans.
:param iob_col: the column containing the predicted iob values from the model
:param entity_type_col: the column containing the predicted element types from the model
"""
frreiss (Member):

Missing description of returned DataFrame in docstring

ZachEichen (Collaborator, Author):

Updated in latest commit

pred_aligned_doc = tp.io.bert.align_bert_tokens_to_corpus_tokens(
pred_spans, raw_docs[fold][doc_num].rename({raw_docs_span_col_name: "span"})
)
pred_aligned_doc[[fold_col, doc_col]] = [fold, doc_num]
frreiss (Member):

Should this function rename the ent_type column back to the user's requested column name here?

ZachEichen (Collaborator, Author):

Updated in latest commit
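If so, a pandas rename at the end of the function would cover it (a sketch; `entity_type_col` is the user-supplied column name from the function's arguments):

# Map the internal "ent_type" column back to the caller's requested name:
pred_aligned_doc = pred_aligned_doc.rename(columns={"ent_type": entity_type_col})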

PreTrainedTokenizerFast which supports `encode_plus` with
return_offsets_mapping=True.
A default tokenizer will be used if this is `None` or not specified
:param token_col: the column in each of the dataframes in `doc` containing the spans
frreiss (Member):

doc ==> document

ZachEichen (Collaborator, Author):

Updated in latest commit

)
raw_tok_spans = (
tp.TokenSpanArray.align_to_tokens(bert_toks["span"], document[token_col])
.as_frame()
frreiss (Member):

Might want to skip the round trip through a new DataFrame and just pull out the begin_token and end_token attributes of the return value from align_to_tokens.

ZachEichen (Collaborator, Author):

Updated in latest commit
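A sketch of the suggested shortcut, assuming align_to_tokens returns a TokenSpanArray whose begin_token and end_token attributes can be read directly:

aligned = tp.TokenSpanArray.align_to_tokens(bert_toks["span"], document[token_col])
# Pull the token offsets straight off the span array, skipping the
# intermediate DataFrame round trip:
begin_toks = aligned.begin_token
end_toks = aligned.end_token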

@frreiss (Member) commented Jul 7, 2021

@ZachEichen can you please pull the latest fixes from the master branch into your PR branch to unblock the CI tests?

@frreiss merged commit b75c406 into CODAIT:master Jul 8, 2021