perf: Optimize document extraction for complex table files #3116


Merged: 1 commit merged into main on May 20, 2025

Conversation

shaohuzhang1
Contributor

perf: Optimize document extraction for complex table files


f2c-ci-robot bot commented May 20, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


f2c-ci-robot bot commented May 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaohuzhang1 shaohuzhang1 merged commit 35b662a into main May 20, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@perf_xlsx branch May 20, 2025 05:44
for row in rows:
r = [f'{value}' for key, value in row.items()]
for row_index in range(1, len(data)):
r = [f'{value}' for value in data[row_index]]
md_table += '| ' + ' | '.join(
[str(cell).replace('\n', '<br>') if cell is not None else '' for cell in r]) + ' |\n'
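The escaping step in the snippet above (embedded newlines become `<br>` so a multi-line cell cannot break the markdown row) can be checked in isolation. `to_md_row` is a hypothetical helper name, not part of the PR:

```python
def to_md_row(cells):
    """Render one list of cell values as a single markdown table row.
    None becomes an empty cell; embedded newlines are replaced with <br>
    so the row stays on one physical line."""
    return '| ' + ' | '.join(
        str(cell).replace('\n', '<br>') if cell is not None else ''
        for cell in cells) + ' |'

print(to_md_row(['a', 'line1\nline2', None, 3]))
# → | a | line1<br>line2 |  | 3 |
```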

Contributor Author

There are several areas where the provided code can be refactored to improve readability, maintainability, and efficiency:

  1. Avoid repeated for-loop headers:
    The first two loops share nearly identical structure. Consider extracting them into a function or class method.

  2. Optimize merged-cell handling:
    Merging can be simplified by assigning values directly from each range's coordinates, rather than repeatedly iterating through all rows and columns.

  3. Separate data-processing concerns:
    Split sheet loading and content processing into distinct methods for better clarity and modularity.
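Point 2 can be illustrated library-free. This sketch assumes merged ranges are given as 0-indexed, inclusive `(min_row, min_col, max_row, max_col)` tuples over a plain list-of-lists grid (openpyxl exposes the same bounds on its range objects); `fill_merged` is a hypothetical helper name:

```python
def fill_merged(grid, merged_ranges):
    """Copy each merged range's top-left value into every cell of the
    range, in place. grid is a list of row lists; ranges are 0-indexed,
    inclusive (min_row, min_col, max_row, max_col) tuples."""
    for min_row, min_col, max_row, max_col in merged_ranges:
        value = grid[min_row][min_col]
        for r in range(min_row, max_row + 1):
            for c in range(min_col, max_col + 1):
                grid[r][c] = value
    return grid

grid = [['Region', 'Q1', 'Q2'],
        ['North',  10,   15],
        [None,     20,   30]]
# 'North' spans rows 1-2 of column 0
fill_merged(grid, [(1, 0, 2, 0)])
# grid[2][0] is now 'North'
```

Each range is visited exactly once, so the cost is proportional to the total merged area instead of repeated full-sheet scans.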

Here's a revised version with these improvements:

class ExcelHandler:
    """Extract worksheet content from Excel files (openpyxl backend)."""

    def _get_buffer(self, file):
        """Return the raw bytes of the provided file."""
        ...

    def support(self, file, get_buffer):
        """Return True if this handler supports the given file type."""
        ...

    def _load_workbook(self, data_bytes):
        """Load an openpyxl workbook from raw bytes."""
        import io
        from openpyxl import load_workbook
        return load_workbook(io.BytesIO(data_bytes), data_only=True)

    def _find_sheet_name(self, sheet_names, default=None):
        """Return the desired sheet name if present in the list, else the default."""
        ...

    def fill_merged_cells(self, sheet, image_dict):
        """
        Expand merged cells so every cell carries its logical value, then
        return one dict per data row mapping the header-row titles to
        formatted cell values.
        :param sheet: openpyxl worksheet
        :param image_dict: mapping from cell values to stored images
        :return: list of row dicts
        """
        # Copy the top-left value of each merged range into the whole
        # range, then unmerge, so later iteration sees plain cells.
        for merged_range in list(sheet.merged_cells.ranges):
            top_left = sheet.cell(merged_range.min_row, merged_range.min_col).value
            sheet.unmerge_cells(str(merged_range))
            for row in sheet[merged_range.coord]:
                for cell in row:
                    cell.value = top_left

        # Headers come from the first row; data starts at row 2.
        headers = [cell.value for cell in sheet[1]]

        data = []
        for row in sheet.iter_rows(min_row=2):
            row_data = {}
            for cell, header in zip(row, headers):
                if header is None:
                    continue  # skip columns without a header title
                row_data[header] = self._format_cell_value(cell.value, image_dict)
            data.append(row_data)
        return data

    def _format_cell_value(self, cell_value, image_dict):
        """Replace a cell value with a markdown image link when it maps to a stored image."""
        if cell_value is not None:
            image = image_dict.get(cell_value)
            if image is not None:
                return f'![](/api/image/{image.id})'
        return cell_value

    def handle(self, file, get_buffer, save_image):
        """
        Extract every worksheet of the file into a list of
        {"name": ..., "paragraphs": [...]} entries.
        :param file:
        :param get_buffer:
        :param save_image:
        :return: list of per-sheet results
        """
        data_bytes = get_buffer(file)
        wb = self._load_workbook(data_bytes)

        result = []
        # Iterate the sheet names directly; the original `while True`
        # never advanced correctly and could not terminate.
        for sheet_name in wb.sheetnames:
            sheet = wb[sheet_name]
            contents = self.fill_merged_cells(sheet, {})

            paragraphs = []
            for row in contents:
                paragraph_content = " ; ".join(
                    "{} : {}".format(key, str(value).strip())
                    for key, value in row.items()).rstrip()
                paragraphs.append({"title": "", "content": paragraph_content})

            result.append({"name": sheet_name, "paragraphs": paragraphs})
        return result

    def get_content(self, file, save_image):
        """
        Top-level entry point: parse the Excel file and return its content
        as a structured payload.
        :param file:
        :param save_image:
        :return:
        """
        result = self.handle(file, self._get_buffer, save_image)
        return {'data': [{'type': 'excel', 'payload': result}],
                'response_type': 'application/json'}

# Example usage:
# handler_instance = ExcelHandler()
# response = handler_instance.get_content(uploaded_file, save_image)
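The row-to-paragraph formatting used in handle() can be checked in isolation, with no workbook needed (sample row data is made up for illustration):

```python
# One extracted row: header title -> cell value, as produced by fill_merged_cells.
row = {"Name": "Alice", "Dept": "R&D ", "Note": None}

# Join each key/value pair into a single " ; "-separated paragraph line,
# stripping stray whitespace from the stringified values.
paragraph_content = " ; ".join(
    "{} : {}".format(key, str(value).strip())
    for key, value in row.items()).rstrip()

print(paragraph_content)
# → Name : Alice ; Dept : R&D ; Note : None
```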

Key Changes Made:

  • Extracted Helper Methods: Split off common operations like finding the default sheet name and formatting cell values into separate methods (_find_sheet_name and _format_cell_value, respectively).
  • Merging Improvement: Merged ranges are expanded once up front (the top-left value is copied across each range), so row iteration no longer re-checks merge state.
  • Data Structure Optimization: Simplified and structured data collection from merged cells, ensuring consistent handling across various scenarios.

These changes make the code more modular, easier to understand, and less prone to errors as complex operations become isolated tasks.
