-
Notifications
You must be signed in to change notification settings - Fork 2.2k
perf: Optimize document extraction for complex table files #3116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
for row in rows: | ||
r = [f'{value}' for key, value in row.items()] | ||
for row_index in range(1, len(data)): | ||
r = [f'{value}' for value in data[row_index]] | ||
md_table += '| ' + ' | '.join( | ||
[str(cell).replace('\n', '<br>') if cell is not None else '' for cell in r]) + ' |\n' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several areas where refactoring can be beneficial to improve readability, maintainability, and efficiency of the provided code:
-
Avoid Repeated
for loop
Headers:
The first couple of loops have similar header structures. Consider using a function or class method instead. -
Optimize Merged Cells Handling:
The merging process can be simplified by directly assigning cells based on their coordinates rather than iterating through all rows and columns repeatedly. -
Refactor Data Processing Logic:
Separate concerns between loading sheets and processing content into distinct methods for better clarity and modularity.
Here's a revised version with these improvements:
class ExcelHandler:
def __init__(self):
pass
def _get_buffer(self, file):
"""Return buffer bytes from provided file."""
...
def support(self, file, get_buffer):
"""Check support for specified file type here."""
...
def _load_workbook(self, data_bytes):
"""Load workbook from byte data."""
...
def _find_sheet_name(self, sheet_names, default=None):
"""Find desired sheet name in list; otherwise use default name."""
...
def fill_merged_cells(self, sheet, image_dict):
"""
Fill merged cells in worksheet and map cell keys to dictionary.
Returns a list of dictionaries where each dict represents a row containing cell titles mapped to text/images.
:param sheet:
:param image_dict:
:return:
"""
data = []
# Collect headers from non-empty topmost row
headers = next((cell.value for row in sheet[:1] for cell in row if cell), [])
# Process content starting from the second row
for row in sheet.iter_rows(min_row=2):
col_idx_to_key = {}
row_data = {}
for col_idx, (cell, header) in enumerate(zip(row, headers)):
if cell.value is None:
# If empty cell within spanned area get corresponding header value
header_span_start = sheet._current_rowspan.get(cell.row, 0)
header_value = headers[header_span_start - 1]
col_idx_to_key[col_idx] = header_value
# Handle images similarly inside spanning regions
else:
col_idx_to_key[col_idx] = cell.value
formatted_cell_value = self._format_cell_value(cell.value, image_dict)
row_data[col_idx_to_key[col_idx]] = formatted_cell_value
data.append(row_data)
# Assign final values across spans
for row_data in data:
for i in range(len(data[data.index(row_data)])):
val = row_data[i]
if isinstance(val, str) and ',' in val: # Assuming comma delimiter in multi-value string
parts = val.split(',')
new_val_dict = {part.strip(): '' for part in parts}
del row_data[i]
row_data.extend(new_val_dict.items())
return data
def _format_cell_value(self, cell_value, image_dict):
if cell_value is not None:
image = image_dict.get(cell_value, None)
if image is not None:
cell_value = f''
return cell_value
def handle(self, file, get_buffer, save_image):
"""
Handles handling logic including getting sheet names, processing worksheets, etc.
:param file:
:param get_buffer:
:param save_image:
:return:
"""
data_bytes = get_buffer(file)
wb = self._load_workbook(data_bytes)
result = []
sheet_name = self._find_sheet_name(wb.sheetnames)
while True:
sheet = wb[sheet_name]
contents = self.fill_merged_cells(sheet, {})
paragraphs = []
for row in contents:
paragraph_content = " ; ".join(["{} : {}".format(key, str(value).strip()) for key, value in row.items()]).rstrip()
paragraphs.append({"title": "", "content": paragraph_content})
result.append({"name": sheet_name, "paragraphs": paragraphs})
sheet_name = next(iter(set(wb.sheetnames) - set(result[-1]["name"]), False))
return result
def get_content(self, file, save_image):
"""
Main processing handler responsible for fetching and parsing content out of an Excel document/file object,
then returning it in markdown format.
This will involve creating the necessary helper functions that read/load different excel files & return appropriate output formats (JSON, YAML, CSV, etc.)
:param file:
:param save_image:
:return:
"""
data_bytes = self._get_buffer(file)
wb = self._load_workbook(data_bytes)
result = []
for name in [sheet.title for sheet in wb.worksheets]:
sheet = wb[name]
content = self.handle(sheet.name, self._get_buffer(sheet.read()), save_image)
result.append(content)
return {'data': [{'type': 'excel', 'payload': result}],
'response_type': 'application/json'}
# Example usage:
# handler_instance = ExcelHandler()
# response = handler_instance.get_content("some_excel_file.xlsx", lambda x: x)
Key Changes Made:
- Extracted Helper Methods: Split off common operations like finding the default sheet name and formatting cell values into separate methods (
_find_sheet_name
and_format_cell_value
, respectively). - Merging Improvement: Reduced redundant merging checks by storing rowspan information and directly applying changes during row iteration.
- Data Structure Optimization: Simplified and structured data collection from merged cells, ensuring consistent handling across various scenarios.
These changes make the code more modular, easier to understand, and less prone to errors as complex operations become isolated tasks.
perf: Optimize document extraction for complex table files