Create cleaning util module mirroring util.py #211

Merged
merged 22 commits into from
Jul 8, 2021
eb49249
pretty functional replication of conll_3.ipynb for a different task
ZachEichen Jun 15, 2021
22727d6
created a cleaning module, and populated it with cleaning utilities t…
ZachEichen Jun 18, 2021
b49fd05
Merge branch 'master' of github.com:CODAIT/text-extensions-for-pandas
ZachEichen Jun 21, 2021
02386c6
reformatted
ZachEichen Jun 21, 2021
8a91922
edited license header comments to be accurate
ZachEichen Jun 21, 2021
9c46539
fixed a bug in one of the methods
ZachEichen Jun 21, 2021
2b267e4
fixed a bug in cleaning util where tokens with many sub tokens could …
ZachEichen Jun 21, 2021
ae0483a
redid some doc comments; added precision / accuracy metrics. Last com…
ZachEichen Jun 23, 2021
310cee8
broke up util.py into smaller, more descriptive modules
ZachEichen Jun 23, 2021
b2d1a56
made various modifications
ZachEichen Jun 24, 2021
a15911e
cleaned up formatting and documentation on files
ZachEichen Jun 28, 2021
d441a08
Merge branch 'master' of github.com:CODAIT/text-extensions-for-pandas
ZachEichen Jun 28, 2021
147019f
slightly modified csv prep method
ZachEichen Jun 29, 2021
0c25b9e
made changes to conll_3
ZachEichen Jun 29, 2021
62ae6df
updated conll_2.ipynb
ZachEichen Jun 29, 2021
439a7ca
re-removed util.py
ZachEichen Jun 29, 2021
bc7af99
updated Cross_check_datapoints notebook and relocated to new location
ZachEichen Jun 30, 2021
eeec4ae
ran and updated CoNLL4 notebook
ZachEichen Jun 30, 2021
ff73d7b
made various changes to address fred's comments.
ZachEichen Jul 1, 2021
e154cc1
Merge branch 'master' of github.com:CODAIT/text-extensions-for-pandas
ZachEichen Jul 2, 2021
81413fd
made small tweaks to cleaning module
ZachEichen Jul 8, 2021
660715c
Merge branch 'master' of github.com:CODAIT/text-extensions-for-pandas
ZachEichen Jul 8, 2021
3 changes: 2 additions & 1 deletion text_extensions_for_pandas/__init__.py
@@ -40,12 +40,13 @@
 # Sub-modules
 from text_extensions_for_pandas import io
 from text_extensions_for_pandas import spanner
+from text_extensions_for_pandas import cleaning
 
 # Sphinx autodoc needs this redundant listing of public symbols to list the contents
 # of this subpackage.
 __all__ = [
     "Span", "SpanDtype", "SpanArray",
     "TokenSpan", "TokenSpanDtype", "TokenSpanArray",
     "TensorElement", "TensorDtype", "TensorArray",
-    "io"
+    "io", 'cleaning'
 ]
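
Reviewer note: the hunk above extends a multi-line `__all__` list, which is a spot where a missing comma is easy to overlook. The change shown here does include the comma. The sketch below only illustrates the general Python behavior (adjacent string literals are concatenated); the list contents are illustrative and not copied from the repository.

```python
# Adjacent string literals are joined by the parser, so a missing comma
# silently merges two entries into one.
broken = [
    "TensorArray",
    "io"          # <- no comma here
    "cleaning",
]

# With the comma in place, both names are separate entries.
fixed = [
    "TensorArray",
    "io", "cleaning",
]

print(broken)  # two entries, the second is the merged string
print(fixed)   # three entries
```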
48 changes: 48 additions & 0 deletions text_extensions_for_pandas/cleaning/__init__.py
@@ -0,0 +1,48 @@
#
# Copyright (c) 2021 IBM Corp.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

################################################################################
# cleaning module
#
# Functions in text_extensions_for_pandas that help identify possibly
# incorrect labels and quickly train models on BERT embeddings of a
# corpus.

# Expose the public APIs that users should get from importing the top-level
# library.

from text_extensions_for_pandas.cleaning import ensemble
from text_extensions_for_pandas.cleaning import analysis
from text_extensions_for_pandas.cleaning import preprocess

# import important functions from each module
from text_extensions_for_pandas.cleaning.preprocess import (
    preprocess_documents,
    combine_raw_spans_docs,
)
from text_extensions_for_pandas.cleaning.analysis import (
    flag_suspicious_labels,
    create_f1_score_report,
    create_f1_score_report_iob,
)
from text_extensions_for_pandas.cleaning.ensemble import (
    train_reduced_model,
    train_model_ensemble,
    infer_and_extract_entities_iob,
    infer_and_extract_raw_entites,
    infer_on_df,
)

__all__ = ["ensemble", "analysis", "preprocess"]
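
Reviewer note: because `__all__` lists only the three submodules, a star import of this package would bind `ensemble`, `analysis`, and `preprocess`. The functions imported above it would not be star-exported, though they remain reachable as attributes of the package. The sketch below demonstrates that general Python behavior with a hypothetical stand-in module named `fake_cleaning`; it does not import the real library, and the placeholder function body is an assumption made purely for illustration.

```python
import sys
import types

# Hypothetical stand-in for the package; the names mirror the diff,
# but everything here is a placeholder.
pkg = types.ModuleType("fake_cleaning")
for sub in ("ensemble", "analysis", "preprocess"):
    setattr(pkg, sub, types.ModuleType(f"fake_cleaning.{sub}"))


def preprocess_documents():
    """Placeholder with no real behavior."""
    return "placeholder"


# Re-exported at package level, as the __init__.py in the diff does.
pkg.preprocess_documents = preprocess_documents
pkg.__all__ = ["ensemble", "analysis", "preprocess"]
sys.modules["fake_cleaning"] = pkg

# A star import binds only the names listed in __all__.
star_ns = {}
exec("from fake_cleaning import *", star_ns)
star_names = sorted(n for n in star_ns if not n.startswith("__"))

print(star_names)
print("preprocess_documents" in star_ns)     # not star-exported
print(hasattr(pkg, "preprocess_documents"))  # still a package attribute
```

If the intent is for the re-exported functions to appear in star imports or in Sphinx autodoc listings, adding their names to `__all__` would be one option; whether that is desired here is a design decision for the maintainers.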