Does not correctly parse table rows with w:gridBefore or w:gridAfter elements #881
Comments
What are the semantics of those elements? |
Microsoft Word for Microsoft 365 MSO (16.0.13231.20372) 64-bit

<w:trPr>

When the object is serialized out as xml, its qualified name is w:gridAfter.

Remarks: gridAfter (Grid Columns After Last Cell)

This element specifies the number of grid columns in the parent table's table grid (§17.4.49; §17.4.48) which shall be left after the last cell in the table row. If this element conflicts with the remaining size of the document grid after all table cells in this row have been added to the grid, then it shall be ignored. If this element is not specified, then its value shall be assumed to be zero grid units.

When the object is serialized out as xml, its qualified name is w:gridBefore.

Remarks: gridBefore (Grid Columns Before First Cell)

This element specifies the number of grid columns in the parent table's table grid (§17.4.49; §17.4.48) which must be skipped before the contents of this table row (its table cells) are added to the parent table. [Note: This property is used to specify tables whose leading edge (left for left-to-right tables, right for right-to-left tables) does not start at the first grid column (the same shared edge). end note] If this element is omitted, then its value shall be assumed to be zero grid units. If this element's value is larger than the size of the table grid, then the value shall be ignored and the first cell in the row can span the full table grid (i.e. the second cell, if one exists, should start at the last shared edge in the table). |
I've added a small number of details here #564 (comment) |
I thought about fixing this by adding support on the Row class for gridAfter and gridBefore. Then the issue is how to expose the row and cell indexing to the user. It seems like the table cells are in a linear list, repeated for each gridSpan. It might be plausible to repeat the first cell in a row a number of times equal to gridBefore + gridSpan (defaulting to 0 and 1 respectively) and repeat the last cell in a row a number of times equal to gridAfter + gridSpan (again, defaulting to 0 and 1 respectively). However, I would prefer to iterate through the cells in a row and get a single cell element when that cell contains a gridSpan. Possibly exposing a gridSpan property on the Cell. If cell access was modified in this manner, then the Row could expose gridAfter and gridBefore properties and cell iteration would still only return a single Cell instance for each iteration, effectively "skipping" the cells that would otherwise exist for gridAfter, gridBefore, and any gridSpan values. |
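The two access schemes described in this comment can be compared on plain data. This is only a sketch: `CellInfo`, `RowInfo`, `expanded`, and `iter_cells` are hypothetical names invented here, not python-docx classes or methods.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class CellInfo:
    text: str
    grid_span: int = 1  # stand-in for the cell's w:gridSpan


@dataclass
class RowInfo:
    cells: List[CellInfo]
    grid_before: int = 0  # stand-in for w:gridBefore
    grid_after: int = 0   # stand-in for w:gridAfter

    def expanded(self) -> List[CellInfo]:
        """Scheme 1: one entry per grid column, repeating the edge cells
        to cover the gridBefore and gridAfter columns."""
        out: List[CellInfo] = []
        last = len(self.cells) - 1
        for i, cell in enumerate(self.cells):
            repeat = cell.grid_span
            if i == 0:
                repeat += self.grid_before
            if i == last:
                repeat += self.grid_after
            out.extend([cell] * repeat)
        return out

    def iter_cells(self) -> Iterator[CellInfo]:
        """Scheme 2: yield each cell exactly once; callers read grid_span,
        grid_before, and grid_after if they need grid positions."""
        return iter(self.cells)


row = RowInfo([CellInfo("a", grid_span=2), CellInfo("b")], grid_before=1, grid_after=1)
print([c.text for c in row.expanded()])                   # ['a', 'a', 'a', 'b', 'b']
print([(c.text, c.grid_span) for c in row.iter_cells()])  # [('a', 2), ('b', 1)]
```

Scheme 1 keeps every row the same length as the table grid, at the cost of returning the same cell several times; scheme 2 returns each cell once and leaves the grid arithmetic to the caller.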
Yeah, the interface design for cell access is tricky, and the current implementation is what we settled on after a lot of consideration and experimentation. Our primary concern is making it accessible for most folks for the most common cases, and keeping the cognitive load low. I'm inclined to think that any alternative cell access scheme should at least start out life as a parallel implementation. Then, if it's clearly simpler for folks to understand, maybe we replace what's there, but I doubt we'd ever get there. Tables are just complicated things and most folks only care about the simple cases. Anyway, you might want to create an alternate proxy object for tables, perhaps one that subclasses the existing One possibility might be a |
I was also thinking about a parallel access method so current users don't have to update code. I like your proposed implementation. I only need reading for the moment, so an iterator as you mention would be ideal. I have been exploring the object properties and getting access to the XML directly through the _tr and _tc objects. They seem to have access to the gridAfter XML element in my file and to the gridSpan element in the tcPr elements as well, so I think all the info is within the structure. I'm having difficulty extracting text from the XML element. From a long time ago, I remember accessing text by looking at the "pre" text, then the descendants in order, then the "post" text. The itertext() method on _tc seems to duplicate text in my case, but I'm still exploring. Thanks for your input! |
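One way to read a cell's text straight from the XML is to join the `w:t` runs of each paragraph that is a direct child of the `w:tc`. The sketch below uses the standard library on a hand-written fragment rather than python-docx's `_tc` object; `CELL_XML` and `cell_text` are names made up for this example, and it is not claimed to explain the duplication seen with `itertext()`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical cell with two paragraphs; the first is split across two runs.
CELL_XML = f"""
<w:tc xmlns:w="{W}">
  <w:tcPr><w:gridSpan w:val="2"/></w:tcPr>
  <w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
  <w:p><w:r><w:t>second</w:t></w:r></w:p>
</w:tc>
"""


def cell_text(tc):
    """Join the w:t text of each direct-child paragraph, one line per paragraph.

    Only paragraphs directly under the w:tc are visited, so text inside a
    nested table is not included.
    """
    paragraphs = []
    for p in tc.findall("w:p", NS):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return "\n".join(paragraphs)


tc = ET.fromstring(CELL_XML)
print(repr(cell_text(tc)))  # 'Hello world\nsecond'
```

Restricting the walk to `w:t` elements skips property elements such as `w:tcPr` and avoids picking up inter-element whitespace.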
I added the following definition to class Table in table.py:
Then to use this call, I needed to get the max row index. Ideally, this would be abstracted, but this seems the most direct way to access by row number rather than an iterator for row and for cell within a row. Note also that this is more properly called a "generator" and only creates Cell objects when needed rather than creating them all at once for the row (or table, for that matter). documentlist is a list of filenames each of which is suitable for using when calling Document(). This example prints out the text in each cell with a maximum horizontal spacing and a vertical line with all the cells in each row appearing on one line:
Edit: Here's another way to define iter_row_cells using map iterator instead of creating a generator with yield:
Edit2: Here's how to get a row iterator of cell iterators:
Or All In One (if
|
@HiGregSmith Nice :) Thanks for posting what you came up with :) |
Function for correctly retrieving the real cells of a row:

    def real_row_cells(row: '_Row') -> 'Sequence[_Cell]':
        from docx.oxml.simpletypes import ST_Merge
        from docx.table import _Cell
        cells = []
        table: 'Table' = row.table
        # Cache the computed rows on the table object (monkey patching).
        drows = getattr(table, '_real_rows', {})
        if not drows:
            table._real_rows = drows = {}
        row_ind = row._index
        if ocells := drows.get(row_ind):
            return ocells
        for i, tc in enumerate(row._tr.tc_lst):
            for grid_span_idx in range(tc.grid_span):
                if tc.vMerge == ST_Merge.CONTINUE:
                    # Vertically merged: reuse the cell from the row above.
                    prev_cels = real_row_cells(table.rows[row_ind - 1])
                    cells.append(prev_cels[i])
                elif grid_span_idx > 0:
                    # Horizontally merged: repeat the cell just added.
                    cells.append(cells[-1])
                else:
                    cells.append(_Cell(tc, table))
        ocells = tuple(cells)
        drows[row_ind] = ocells
        return ocells
|
This does not work for me. |
And here is a version that does not need monkey patching:
|
The following function
It returns a
with
|
When iterating over cells within rows, the index calculations for cells and rows within the table are off by the accumulated w:gridBefore and w:gridAfter from prior rows.
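The drift can be reproduced on plain data. The sketch below assumes the table keeps one flat list of cells (one entry per occupied grid column) and slices it by `row_idx * column_count`, which is consistent with the behaviour reported here but is an assumption about the implementation; `flat_cells` and `row_slice` are hypothetical helper names.

```python
def flat_cells(rows):
    """Build one flat list with an entry per grid column a cell occupies.

    Columns skipped by grid_before are NOT represented, which is the
    source of the drift.
    """
    flat = []
    for row in rows:
        for label, span in row["cells"]:
            flat.extend([label] * span)
    return flat


def row_slice(flat, row_idx, column_count):
    """Fetch a row by assuming every row contributes column_count entries."""
    start = row_idx * column_count
    return flat[start:start + column_count]


# A 3-column grid: row 0 skips the first grid column (gridBefore = 1).
rows = [
    {"grid_before": 1, "cells": [("r0c1", 1), ("r0c2", 1)]},
    {"grid_before": 0, "cells": [("r1c0", 1), ("r1c1", 1), ("r1c2", 1)]},
]

flat = flat_cells(rows)
print(row_slice(flat, 0, 3))  # ['r0c1', 'r0c2', 'r1c0']
print(row_slice(flat, 1, 3))  # ['r1c1', 'r1c2']
```

Row 0 contributes only two entries, so its slice borrows the first cell of row 1, and every later row starts one position too late. Each further row with `w:gridBefore` or `w:gridAfter` adds to the offset.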