Commit 21bb99c

ds-filipknefel, Filip Knefel, and pawel-kmiecik authored
chore: change table extraction defaults (Unstructured-IO#370)
Changed default value of `pdf_infer_table_structure` to `True`.
Changed default value of `skip_infer_table_types` to `[]`.
Marked `pdf_infer_table_structure` as deprecated and removed from documented usage examples.
Updated tests in line with above changes.

---------

Co-authored-by: Filip Knefel <[email protected]>
Co-authored-by: Paweł Kmiecik <[email protected]>
1 parent 57f5a68 commit 21bb99c

7 files changed: 31 additions and 48 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,8 +1,9 @@
-## 0.0.66-dev2
+## 0.0.66-dev3

 * Add support for `unique_element_ids` parameter.
 * Add max lifetime, via MAX_LIFETIME_SECONDS env-var, to API containers
 * Bump unstructured to 0.13.2
+* Change default values for `pdf_infer_table_structure` and `skip_infer_table_types`. Mark `pdf_infer_table_structure` deprecated.

 ## 0.0.65

README.md

Lines changed: 3 additions & 18 deletions
@@ -141,25 +141,10 @@ When elements are extracted from PDFs or images, it may be useful to get their b
   | jq -C . | less -R
 ```

-#### PDF Table Extraction
-
-To extract the table structure from PDF files using the `hi_res` strategy, ensure that the `pdf_infer_table_structure` parameter is set to `true`. This setting includes the table's text content in the response. By default, this parameter is set to `false` to avoid the expensive reading process.
-
-```
-curl -X 'POST' \
-  'https://api.unstructured.io/general/v0/general' \
-  -H 'accept: application/json' \
-  -H 'Content-Type: multipart/form-data' \
-  -F 'files=@sample-docs/layout-parser-paper.pdf' \
-  -F 'strategy=hi_res' \
-  -F 'pdf_infer_table_structure=true' \
-  | jq -C . | less -R
-```
-
 #### Skip Table Extraction

-Currently, we provide support for enabling and disabling table extraction for file types other than PDF files. Set parameter `skip_infer_table_types` to specify the document types that you want to skip table extraction with. By default, we skip table extraction
-for PDFs and Images, which are `pdf`, `jpg` and `png`. Again, please note that table extraction only works with `hi_res` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to `skip_infer_table_types` with:
+Currently, we provide support for enabling and disabling table extraction for all file types. Set parameter `skip_infer_table_types` to specify the document types that you want to skip table extraction with. By default, we enable table extraction
+for all file types (`skip_infer_table_types=[]`). Again, please note that table extraction only works with `hi_res` strategy. For example, if you want to skip table extraction for images, you can pass a list with matching image file types:

 ```
 curl -X 'POST' \
@@ -168,7 +153,7 @@ for PDFs and Images, which are `pdf`, `jpg` and `png`. Again, please note that t
   -H 'Content-Type: multipart/form-data' \
   -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
   -F 'strategy=hi_res' \
-  -F 'skip_infer_table_types=[]' \
+  -F 'skip_infer_table_types=["jpg"]' \
   | jq -C . | less -R
 ```

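For callers who build the request in Python rather than with curl, the form fields from the README example can be sketched as below. The helper name `build_form_fields` is hypothetical and not part of this repository; it only assembles the non-file fields and serializes the list as the same `["jpg"]` string that the curl example sends.

```python
import json
from typing import Dict, List, Optional


def build_form_fields(
    strategy: str = "hi_res",
    skip_infer_table_types: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Assemble the non-file multipart fields (hypothetical helper)."""
    fields = {"strategy": strategy}
    if skip_infer_table_types is not None:
        # Serialize the list the way the curl example writes it: '["jpg"]'
        fields["skip_infer_table_types"] = json.dumps(skip_infer_table_types)
    return fields


# Skip table extraction for JPEG images only.
fields = build_form_fields(skip_infer_table_types=["jpg"])
print(fields)  # {'strategy': 'hi_res', 'skip_infer_table_types': '["jpg"]'}
```

The resulting dictionary could then be passed as the `data` argument of an HTTP client call, alongside the uploaded file. Omitting `skip_infer_table_types` leaves the field out, so the server default (`[]` after this commit) applies.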
openapi.json

Lines changed: 2 additions & 2 deletions
@@ -169,7 +169,7 @@
       "pdf_infer_table_structure": {
         "type": "boolean",
         "title": "Pdf Infer Table Structure",
-        "description": "If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, 'text_as_html', where the value (string) is a just a transformation of the data into an HTML <table>."
+        "description": "Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents."
       },
       "skip_infer_table_types": {
         "items": {
@@ -178,7 +178,7 @@
         },
         "type": "array",
         "title": "Skip Infer Table Types",
-        "description": "The document types that you want to skip table extraction with. Default: ['pdf', 'jpg', 'png']"
+        "description": "The document types that you want to skip table extraction with. Default: []"
       },
       "xml_keep_tags": {
         "type": "boolean",

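The two parameter descriptions above, together with the README note that table extraction only works with `hi_res`, imply a small decision rule. The sketch below restates that rule; `tables_expected` is a hypothetical helper written for illustration, not the API's actual implementation, and it assumes the strategy has already been resolved (it does not model how `auto` picks a strategy).

```python
from typing import List, Optional


def tables_expected(
    file_type: str,
    strategy: str,
    skip_infer_table_types: Optional[List[str]] = None,
    pdf_infer_table_structure: bool = True,
) -> bool:
    """Return True if the documented rules suggest tables are extracted."""
    skip = skip_infer_table_types or []  # documented default after this commit: []
    # Assumption: `strategy` is the resolved strategy, not "auto".
    if strategy != "hi_res":
        return False
    # The deprecated flag still wins for PDFs, whatever the skip list says.
    if file_type == "pdf" and not pdf_infer_table_structure:
        return False
    return file_type not in skip


print(tables_expected("pdf", "hi_res"))           # True
print(tables_expected("pdf", "hi_res", ["pdf"]))  # False
print(tables_expected("pdf", "fast"))             # False
```

These three calls match the PDF cases in the smoke test's parametrized matrix, where `hi_res` with an empty skip list is the only combination expected to yield tables.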
prepline_general/api/general.py

Lines changed: 1 addition & 1 deletion
@@ -293,7 +293,7 @@ def pipeline_api(
     hi_res_model_name: Optional[str] = None,
     include_page_breaks: bool = False,
     ocr_languages: Optional[List[str]] = None,
-    pdf_infer_table_structure: bool = False,
+    pdf_infer_table_structure: bool = True,
     skip_infer_table_types: Optional[List[str]] = None,
     strategy: str = "auto",
     xml_keep_tags: bool = False,

prepline_general/api/models/form_params.py

Lines changed: 10 additions & 8 deletions
@@ -69,15 +69,13 @@ def as_form(
             List[str],
             Form(
                 title="Skip Infer Table Types",
-                description="The document types that you want to skip table extraction with. Default: ['pdf', 'jpg', 'png']",
+                description=(
+                    "The document types that you want to skip table extraction with. Default: []"
+                ),
                 example="['pdf', 'jpg', 'png']",
             ),
             BeforeValidator(SmartValueParser[List[str]]().value_or_first_element),
-        ] = [
-            "pdf",
-            "jpg",
-            "png",
-        ],  # noqa
+        ] = [],  # noqa
         gz_uncompressed_content_type: Annotated[
             Optional[str],
             Form(
@@ -132,10 +130,14 @@ def as_form(
             bool,
             Form(
                 title="Pdf Infer Table Structure",
-                description="If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, 'text_as_html', where the value (string) is a just a transformation of the data into an HTML <table>.",
+                description=(
+                    "Deprecated! Use skip_infer_table_types to opt out of table extraction for any "
+                    "file type. If False and strategy=hi_res, no Table Elements will be extracted "
+                    "from pdf files regardless of skip_infer_table_types contents."
+                ),
             ),
             BeforeValidator(SmartValueParser[bool]().value_or_first_element),
-        ] = False,
+        ] = True,
         strategy: Annotated[
             Literal["fast", "hi_res", "auto", "ocr_only"],
             Form(

scripts/smoketest.py

Lines changed: 8 additions & 11 deletions
@@ -22,9 +22,8 @@ def send_document(
     content_type: str = "",
     strategy: str = "auto",
     output_format: str = "application/json",
-    pdf_infer_table_structure: str = "false",
+    skip_infer_table_types: list[str] = [],
     uncompressed_content_type: str = "",
-    skip_infer_table_types: str = "['pdf', 'jpg', 'png']",
 ):
     if filenames_gzipped is None:
         filenames_gzipped = []
@@ -37,7 +36,6 @@ def send_document(
     options = {
         "strategy": strategy,
         "output_format": output_format,
-        "pdf_infer_table_structure": pdf_infer_table_structure,
         "skip_infer_table_types": skip_infer_table_types,
     }
     if uncompressed_content_type:
@@ -226,15 +224,15 @@ def test_strategy_performance():

 @pytest.mark.skipif(skip_inference_tests, reason="emulated architecture")
 @pytest.mark.parametrize(
-    "strategy, pdf_infer_table_structure, expected_table_num",
+    "strategy, skip_infer_table_types, expected_table_num",
     [
-        ("fast", "True", 0),
-        ("fast", "False", 0),
-        ("hi_res", "True", 2),
-        ("hi_res", "False", 0),
+        ("fast", [], 0),
+        ("fast", ["pdf"], 0),
+        ("hi_res", [], 2),
+        ("hi_res", ["pdf"], 0),
     ],
 )
-def test_table_support(strategy: str, pdf_infer_table_structure: str, expected_table_num: int):
+def test_table_support(strategy: str, skip_infer_table_types: list[str], expected_table_num: int):
     """
     Test that table extraction works on hi_res strategy
     """
@@ -243,8 +241,7 @@ def test_table_support(strategy: str, pdf_infer_table_structure: str, expected_t
         filenames=[test_file],
         content_type="application/pdf",
         strategy=strategy,
-        pdf_infer_table_structure=pdf_infer_table_structure,
-        skip_infer_table_types="[]",
+        skip_infer_table_types=skip_infer_table_types,
     )

     assert response.status_code == 200

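The new `send_document` signature uses a list literal as the default for `skip_infer_table_types`. Python evaluates default values once, so that list object is shared between calls; this is harmless as long as nothing mutates it. A common alternative is a `None` sentinel, sketched below. `build_options` is a hypothetical stand-in for the options dictionary built inside `send_document`, not code from the smoke test.

```python
from typing import Dict, List, Optional, Union


def build_options(
    strategy: str = "auto",
    output_format: str = "application/json",
    skip_infer_table_types: Optional[List[str]] = None,
) -> Dict[str, Union[str, List[str]]]:
    """Build request options with a fresh skip list per call (hypothetical)."""
    if skip_infer_table_types is None:
        # Create a new list on every call instead of sharing one default object.
        skip_infer_table_types = []
    return {
        "strategy": strategy,
        "output_format": output_format,
        "skip_infer_table_types": skip_infer_table_types,
    }


first = build_options()
first["skip_infer_table_types"].append("pdf")     # mutate one call's result
print(build_options()["skip_infer_table_types"])  # [] (later calls are unaffected)
```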
test_general/api/test_app.py

Lines changed: 5 additions & 7 deletions
@@ -177,9 +177,7 @@ def test_languages_param():


 def test_skip_infer_table_types_param():
-    """
-    Verify that we skip table instruction unless specified
-    """
+    """Verify that we extract table unless excluded by skip_infer_table_types"""
     client = TestClient(app)
     test_file = Path("sample-docs") / "layout-parser-paper-with-table.jpg"
     response = client.post(
@@ -191,19 +189,19 @@ def test_skip_infer_table_types_param():
     # test we skip table extraction by default
     elements = response.json()
     table = [el["metadata"]["text_as_html"] for el in elements if "text_as_html" in el["metadata"]]
-    assert len(table) == 0
+    assert len(table) == 1

     response = client.post(
         MAIN_API_ROUTE,
         files=[("files", (str(test_file), open(test_file, "rb")))],
-        data={"skip_infer_table_types": "['pdf']"},
+        data={"skip_infer_table_types": ["jpg"]},
     )

     assert response.status_code == 200
-    # test we didn't specify to skip table extration with image
+    # test we specified to skip extraction for jpg
     elements = response.json()
     table = [el["metadata"]["text_as_html"] for el in elements if "text_as_html" in el["metadata"]]
-    assert len(table) == 0
+    assert len(table) == 1
-    assert len(table) == 1
+    assert len(table) == 0
     # This text is not currently picked up
     # assert "Layouts of history Japanese documents" in table[0]

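The test above detects extracted tables by looking for a `text_as_html` key in each element's metadata. The same check can be sketched as a standalone function; `html_tables` is a hypothetical name, and the sample elements are made-up data shaped like the JSON the test iterates over, not real API output.

```python
from typing import Any, Dict, List


def html_tables(elements: List[Dict[str, Any]]) -> List[str]:
    """Collect the HTML rendering of every element that carries one."""
    return [
        el["metadata"]["text_as_html"]
        for el in elements
        if "text_as_html" in el.get("metadata", {})
    ]


# Made-up sample response: one table element and one plain-text element.
sample_elements = [
    {"type": "Table", "text": "a b", "metadata": {"text_as_html": "<table><tr><td>a</td><td>b</td></tr></table>"}},
    {"type": "NarrativeText", "text": "Some prose.", "metadata": {"filename": "example.jpg"}},
]

print(len(html_tables(sample_elements)))  # 1
```

With table extraction skipped for the file type, no element would carry `text_as_html`, so the list would be empty, which is what the second assertion in the test checks.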