perf: Optimize document extraction for complex table files #3116


Merged: 1 commit merged into main on May 20, 2025

Conversation

shaohuzhang1
Contributor

perf: Optimize document extraction for complex table files


f2c-ci-robot bot commented May 20, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


f2c-ci-robot bot commented May 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaohuzhang1 shaohuzhang1 merged commit 35b662a into main May 20, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@perf_xlsx branch May 20, 2025 05:44
for row in rows:
r = [f'{value}' for key, value in row.items()]
for row_index in range(1, len(data)):
r = [f'{value}' for value in data[row_index]]
md_table += '| ' + ' | '.join(
[str(cell).replace('\n', '<br>') if cell is not None else '' for cell in r]) + ' |\n'
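The escaping step in the snippet above (embedded newlines become `<br>` so a multi-line cell cannot break the markdown row) can be checked in isolation. `to_md_row` is a hypothetical helper name, not part of the PR:

```python
def to_md_row(cells):
    """Render one list of cell values as a single markdown table row.
    None becomes an empty cell; embedded newlines are replaced with <br>
    so the row stays on one physical line."""
    return '| ' + ' | '.join(
        str(cell).replace('\n', '<br>') if cell is not None else ''
        for cell in cells) + ' |'

print(to_md_row(['a', 'line1\nline2', None, 3]))
# → | a | line1<br>line2 |  | 3 |
```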

Contributor Author

There are several areas where the provided code can be refactored to improve readability, maintainability, and efficiency:

  1. Avoid repeated for-loop headers:
    The first two loops share nearly identical structure. Consider extracting them into a function or class method.

  2. Optimize merged-cell handling:
    Merging can be simplified by assigning values directly from each range's coordinates, rather than repeatedly iterating through all rows and columns.

  3. Separate data-processing concerns:
    Split sheet loading and content processing into distinct methods for better clarity and modularity.
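Point 2 can be illustrated library-free. This sketch assumes merged ranges are given as 0-indexed, inclusive `(min_row, min_col, max_row, max_col)` tuples over a plain list-of-lists grid (openpyxl exposes the same bounds on its range objects); `fill_merged` is a hypothetical helper name:

```python
def fill_merged(grid, merged_ranges):
    """Copy each merged range's top-left value into every cell of the
    range, in place. grid is a list of row lists; ranges are 0-indexed,
    inclusive (min_row, min_col, max_row, max_col) tuples."""
    for min_row, min_col, max_row, max_col in merged_ranges:
        value = grid[min_row][min_col]
        for r in range(min_row, max_row + 1):
            for c in range(min_col, max_col + 1):
                grid[r][c] = value
    return grid

grid = [['Region', 'Q1', 'Q2'],
        ['North',  10,   15],
        [None,     20,   30]]
# 'North' spans rows 1-2 of column 0
fill_merged(grid, [(1, 0, 2, 0)])
# grid[2][0] is now 'North'
```

Each range is visited exactly once, so the cost is proportional to the total merged area instead of repeated full-sheet scans.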

Here's a revised version with these improvements:

class ExcelHandler:
    """Extract worksheet content from Excel files (openpyxl backend)."""

    def _get_buffer(self, file):
        """Return the raw bytes of the provided file."""
        ...

    def support(self, file, get_buffer):
        """Return True if this handler supports the given file type."""
        ...

    def _load_workbook(self, data_bytes):
        """Load an openpyxl workbook from raw bytes."""
        import io
        from openpyxl import load_workbook
        return load_workbook(io.BytesIO(data_bytes), data_only=True)

    def _find_sheet_name(self, sheet_names, default=None):
        """Return the desired sheet name if present in the list, else the default."""
        ...

    def fill_merged_cells(self, sheet, image_dict):
        """
        Expand merged cells so every cell carries its logical value, then
        return one dict per data row mapping the header-row titles to
        formatted cell values.
        :param sheet: openpyxl worksheet
        :param image_dict: mapping from cell values to stored images
        :return: list of row dicts
        """
        # Copy the top-left value of each merged range into the whole
        # range, then unmerge, so later iteration sees plain cells.
        for merged_range in list(sheet.merged_cells.ranges):
            top_left = sheet.cell(merged_range.min_row, merged_range.min_col).value
            sheet.unmerge_cells(str(merged_range))
            for row in sheet[merged_range.coord]:
                for cell in row:
                    cell.value = top_left

        # Headers come from the first row; data starts at row 2.
        headers = [cell.value for cell in sheet[1]]

        data = []
        for row in sheet.iter_rows(min_row=2):
            row_data = {}
            for cell, header in zip(row, headers):
                if header is None:
                    continue  # skip columns without a header title
                row_data[header] = self._format_cell_value(cell.value, image_dict)
            data.append(row_data)
        return data

    def _format_cell_value(self, cell_value, image_dict):
        """Replace a cell value with a markdown image link when it maps to a stored image."""
        if cell_value is not None:
            image = image_dict.get(cell_value)
            if image is not None:
                return f'![](/api/image/{image.id})'
        return cell_value

    def handle(self, file, get_buffer, save_image):
        """
        Extract every worksheet of the file into a list of
        {"name": ..., "paragraphs": [...]} entries.
        :param file:
        :param get_buffer:
        :param save_image:
        :return: list of per-sheet results
        """
        data_bytes = get_buffer(file)
        wb = self._load_workbook(data_bytes)

        result = []
        # Iterate the sheet names directly; the original `while True`
        # never advanced correctly and could not terminate.
        for sheet_name in wb.sheetnames:
            sheet = wb[sheet_name]
            contents = self.fill_merged_cells(sheet, {})

            paragraphs = []
            for row in contents:
                paragraph_content = " ; ".join(
                    "{} : {}".format(key, str(value).strip())
                    for key, value in row.items()).rstrip()
                paragraphs.append({"title": "", "content": paragraph_content})

            result.append({"name": sheet_name, "paragraphs": paragraphs})
        return result

    def get_content(self, file, save_image):
        """
        Top-level entry point: parse the Excel file and return its content
        as a structured payload.
        :param file:
        :param save_image:
        :return:
        """
        result = self.handle(file, self._get_buffer, save_image)
        return {'data': [{'type': 'excel', 'payload': result}],
                'response_type': 'application/json'}

# Example usage:
# handler_instance = ExcelHandler()
# response = handler_instance.get_content(uploaded_file, save_image)
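The row-to-paragraph formatting used in handle() can be checked in isolation, with no workbook needed (sample row data is made up for illustration):

```python
# One extracted row: header title -> cell value, as produced by fill_merged_cells.
row = {"Name": "Alice", "Dept": "R&D ", "Note": None}

# Join each key/value pair into a single " ; "-separated paragraph line,
# stripping stray whitespace from the stringified values.
paragraph_content = " ; ".join(
    "{} : {}".format(key, str(value).strip())
    for key, value in row.items()).rstrip()

print(paragraph_content)
# → Name : Alice ; Dept : R&D ; Note : None
```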

Key Changes Made:

  • Extracted Helper Methods: Split off common operations like finding the default sheet name and formatting cell values into separate methods (_find_sheet_name and _format_cell_value, respectively).
  • Merging Improvement: Merged ranges are expanded once up front (the top-left value is copied across each range), so row iteration no longer re-checks merge state.
  • Data Structure Optimization: Simplified and structured data collection from merged cells, ensuring consistent handling across various scenarios.

These changes make the code more modular, easier to understand, and less prone to errors as complex operations become isolated tasks.
