lxml #3

Closed
etfre opened this issue Jan 7, 2014 · 11 comments

Comments

@etfre

etfre commented Jan 7, 2014

It looks like you want the user to work entirely through python-docx, as Etree elements are abstracted away through wrapper classes. If that's the case, what are you planning with regards to methods such as iter(), find(), xpath expressions etc.? I know that for simpler documents, statements like document.add_paragraph() are sufficient, but I've found lxml methods like the ones I mentioned above to be invaluable for more involved Docx scripting.

@scanny
Contributor

scanny commented Jan 7, 2014

Can you suggest a particular use case? I think it might be easier to think and talk about in terms of a concrete objective the user is trying to achieve.

@etfre
Author

etfre commented Jan 7, 2014

Sure. Say that the user wants to find all of the runs with size 14 text in a document and italicize those runs as well. With access to the Docx's <w:document> element, this wouldn't be too challenging using iter(run tagname), find(rPr tagname), set(), makeelement(), and append() (for rPr element if necessary).
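
The approach described above can be sketched roughly as follows. This is only an illustration, not python-docx code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (iter(), find(), get(), and insert() behave the same way here), and the helper names w() and italicize_runs_of_size() as well as the sample XML are assumptions invented for the example. Note that w:sz stores the size in half-points, so 14-point text is w:val="28".

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(local_name):
    """Return the namespace-qualified (Clark notation) name of a w: element."""
    return "{%s}%s" % (W_NS, local_name)


def italicize_runs_of_size(root, points):
    """Add <w:i/> to each run whose explicit <w:sz> equals `points`.

    Returns the number of matching runs. Runs that inherit their size
    from a style (no w:sz of their own) are not matched.
    """
    half_points = str(points * 2)  # w:sz is expressed in half-points
    matched = 0
    for run in root.iter(w("r")):
        rPr = run.find(w("rPr"))
        sz = rPr.find(w("sz")) if rPr is not None else None
        if sz is None or sz.get(w("val")) != half_points:
            continue
        matched += 1
        if rPr.find(w("i")) is None:
            # Simplification: place w:i directly before w:sz. A complete
            # implementation would follow the schema's full child ordering.
            rPr.insert(list(rPr).index(sz), ET.Element(w("i")))
    return matched


# Hypothetical sample: one 14 pt run, one 11 pt run, one run with no rPr.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:rPr><w:sz w:val="28"/></w:rPr><w:t>big</w:t></w:r>'
    '<w:r><w:rPr><w:sz w:val="22"/></w:rPr><w:t>small</w:t></w:r>'
    '<w:r><w:t>default</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)
root = ET.fromstring(SAMPLE)
count = italicize_runs_of_size(root, 14)
```

With lxml the same loop would work on the real <w:document> element; makeelement() and append() could replace ET.Element() and insert() where ordering does not matter.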

@scanny
Contributor

scanny commented Jan 8, 2014

Ah, good, that helps focus things, thanks :)

A couple notions, not completely coherent, and in no particular order:

  • I'd like to imagine that the feature set of the current version is the tip of a rather larger iceberg. In fact that was the primary motivator for me undertaking the project. It seemed to me that the function-based architecture of the prior versions had reached its limit and that an object-oriented and growable architecture was needed to allow it to be extended to fulfill the demands the community clearly had for it. The objective of this initial release is feature parity, just enough to allow a developer to make the switch and get on a robust, growing platform. I'm hoping there is quite a bit more to come from here.
  • I like your idea of making the Etree elements available somehow for advanced users. One of the things I've noticed in working on python-pptx, the PowerPoint companion project, is that little recipes that use the lxml element right under the covers of the high-level object are a great way to provide "temporary" features to folks that need them. Like for a while I didn't have the API yet to set font color, but a few lines using the spPr element as a starting point provided a very workable stop-gap. I'm sure there will always be cases where going in to manipulate the XML directly would be handy. Maybe I could add those properties to the published API for what I loosely refer to as the "proxy" objects (like Paragraph, Table, etc.) so an advanced user could find them without having to read the code.
  • There's a wrapper around the lxml layer I've named oxml (docx.oxml subpackage). Right now there's a fair amount of code there, but I'm about half-way through building a declarative meta-classy module to reduce most of that code to defining element and attribute names, sequences, etc. That module raises the abstraction level of dealing with the XML mechanics such that the thorny details of adding a child element in the right order and so on are mostly taken care of without bleeding into higher-level code. It's a little like a specialized and smarter version of lxml.objectify. That API could also possibly be published for use and extension by advanced users.
  • I'm thinking there are at least three broad categories of use-case for python-docx: document generation, document modification or repair, and document content extraction/transformation. Each of these has its own flavor as far as the type of features one naturally requires. The latter two require a way to traverse the existing content. The former not so much. I know for sure we'll need navigation facilities for finding spots in documents, iterating over ranges, and making "surgical" insertions, deletions, and changes. It's not yet clear to me though what form that should take. The Microsoft API is not a convincing guide to me on this count. It has the concept of Range that offers this type of operation, but it strikes me that's strongly tied to the fact that it operates on a "running" application where adhering closely to the GUI metaphors is relevant. The other aspect I haven't finished wrestling with is how to deal with revision "markers" in the XML. Just plain XPath can get things done if you know for sure there are no deleted paragraphs in there, but if there are you get unexpected results. And many files have that sort of thing in there without their users knowing about it, so it's too much to assume it just won't happen.
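
To illustrate the last point about revision markers, here is a small sketch of how a plain child lookup can give unexpected text once tracked changes are present. It uses the standard library's xml.etree.ElementTree rather than lxml, and the function names and sample paragraph are assumptions made up for the example. In the markup, an inserted run is wrapped in w:ins and a deleted run is wrapped in w:del with its text in w:delText.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(local_name):
    """Return the namespace-qualified (Clark notation) name of a w: element."""
    return "{%s}%s" % (W_NS, local_name)


# Subtrees that drop out once revisions are accepted.
REMOVED = {w("del"), w("moveFrom")}


def naive_text(paragraph):
    """Text of runs that are direct children of the paragraph only."""
    return "".join(
        t.text or ""
        for r in paragraph.findall(w("r"))
        for t in r.findall(w("t"))
    )


def final_text(element):
    """Text as it would read with all tracked changes accepted."""
    if element.tag in REMOVED:
        return ""
    if element.tag == w("t"):
        return element.text or ""
    return "".join(final_text(child) for child in element)


# Hypothetical paragraph: "quick " was deleted and "slow " inserted.
PARAGRAPH = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">The </w:t></w:r>'
    '<w:del w:id="1" w:author="a">'
    '<w:r><w:delText xml:space="preserve">quick </w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="a">'
    '<w:r><w:t xml:space="preserve">slow </w:t></w:r></w:ins>'
    '<w:r><w:t>fox</w:t></w:r>'
    '</w:p>' % W_NS
)
p = ET.fromstring(PARAGRAPH)
naive = naive_text(p)   # misses the inserted run nested under w:ins
final = final_text(p)   # includes insertions, skips deletions
```

The naive version silently loses the inserted word because its run is no longer a direct child of the paragraph, which is the kind of surprise described above.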

Anyway, that's probably enough reflection for one sitting. How does all that strike you?

@etfre
Author

etfre commented Jan 8, 2014

That strikes me as a lot more words than I was expecting!

From the standpoint of a user, my ideal interface would revolve around multipurpose objects that had both higher-level methods and properties like you’ve implemented with the Paragraph/Run/Text classes, as well as lxml-like functionality. For instance, it would be great if I had an instance of Paragraph p with which I could both perform p.add_text("Hello World") as well as p.iterchildren().

Implementing this could be tricky though. I don't think it would be a good idea to allow the user to interact with both python-docx wrapper objects as well as the etree objects themselves. There wouldn't really be any clean way to separate the two types of objects, and you'd end up with users trying to call para.add_run() on their etree._Element objects. The best idea I have right now is to implement a base element wrapper class. Each instance of the class maps to a particular etree element and would have methods overriding all of the normal etree methods. Something like:

class BaseOxmlClass:
    def __init__(self, element, root_document_instance):
        self.element = element  # the wrapped etree element
        self.root_document_instance = root_document_instance
    ...
    def getparent(self):  # sample implementation of an lxml method
        # look up the wrapper registered for the parent etree element
        return self.root_document_instance.map_of_wrappers_and_elements[self.element.getparent()]

The getparent() method would return the object wrapping the parent of the etree object. Table, Text etc would all be subclasses of BaseOxmlClass and have their own special methods and attributes.

Are the revision markers that you're referring to the RSID attributes that seem to break up perfectly good runs for no good reason? Those definitely gave me trouble when I first started using python-docx and I was wondering why some of my search() and replace() functions weren't working. I ended up writing replacement functions that ignored runs and just searched through the entire paragraph text as a single string and haven't had many issues since. What sort of challenges are you running into there?

@scanny
Contributor

scanny commented Jan 8, 2014

Yeah, apologies for that. There are a lot of related topics here and we just needed to find a focus spot; seems like we're on the trail of one now :)

So let's focus on the bit about iterating over the lxml children for a start. It would be straightforward to publish the likes of 'paragraph._p' as part of the API. Maybe the naming would be different, let's leave that to later. The idea would be you could get the lxml element of any proxy object you had in hand if you wanted to dive into the XML directly.

I would be strongly inclined to not try to combine these two in a single object. Rather you can access the back door if you need it, but then you're in an lxml world. My argument is to value preservation of distinct levels of abstraction over whatever convenience might be achieved by combining them. I confess I don't see that particular convenience, but maybe you can jot a couple of quick 5-line code samples that illustrate it and we could discuss that.
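
A minimal sketch of that "back door" arrangement might look like the following. It is not python-docx's actual implementation: the Paragraph class here is a toy, the property name element is a placeholder (the naming was left open above), and the standard library's xml.etree.ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(local_name):
    """Return the namespace-qualified (Clark notation) name of a w: element."""
    return "{%s}%s" % (W_NS, local_name)


class Paragraph:
    """Toy high-level proxy around a <w:p> element."""

    def __init__(self, p):
        self._p = p

    @property
    def element(self):
        """Back door: the underlying XML element (hypothetical name)."""
        return self._p

    @property
    def text(self):
        """Concatenated text of all <w:t> descendants."""
        return "".join(t.text or "" for t in self._p.iter(w("t")))

    def add_run(self, text):
        """High-level call: append a run containing `text`."""
        r = ET.SubElement(self._p, w("r"))
        ET.SubElement(r, w("t")).text = text


paragraph = Paragraph(ET.Element(w("p")))
paragraph.add_run("Hello ")
paragraph.add_run("World")

# Dropping down a level: from here on it is plain element calls only.
run_count = len(paragraph.element.findall(w("r")))
```

The two levels stay distinct: proxy methods live on Paragraph, and anything reached through element is ordinary element API with no proxy methods mixed in.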

On the question of revision markers, I don't have any direct experience of them being a problem. I actually don't work with .docx an awful lot, I do a lot more with .pptx :). But from a design standpoint, they initially presented as a challenge when trying to provide a document.paragraphs collection. Turns out just returning the objects found with body.xpath('./w:p') only works when there are no revision marks. Then it gets worse because if you take './/w:p' or whatever, you can get deleted ones in there (and other problems, but just for illustration). So where I am with that is you actually would have to specify which collection you wanted, like "Final showing markup" or something. I didn't know how to judge which behaviors would be most useful without actually having some use-cases to think it through with. Any insight you could provide there would be very helpful.

@etfre
Author

etfre commented Jan 8, 2014

I definitely see the value of keeping lxml functionality at least somewhat separate from the objects that the user will primarily be working through. My biggest concern is that moving between the two different levels depending on particular needs can get messy quickly. I think the simplest solution, and the one that you seem to be leaning towards, would be to provide access to the etree._Element object through an attribute of an object for exceptional cases, and trust that the user knows enough about what he/she is doing to not mess things up. If that's the case, the biggest issue will be adding functionality in the current API to ensure that the user will only need lxml for special cases. Looking over the italics example that I gave above, it looks like there is almost already a simple implementation available:

d = docx.Document('infile.docx')
for p in d.paragraphs:
  for r in p.runs:
    if r.font_size == 14:  # or 28?
      r.italic = True

Just writing that out, it does seem that adding functionality to the current API without requiring the user to dive down to lxml would be easier than I initially thought. A lot of the stuff that lxml does is already available by default, and it shouldn't be particularly difficult to add the equivalent for etree._Element methods like insert() (perhaps as an "index" keyword parameter in add_paragraph()/add_run() etc.) and getparent() (I can't think of an immediate use for going up the tree, but I suspect that there are plenty).

I'll admit that I don't see any special reason to treat modified/deleted text any differently. Correct me if I'm wrong, but from an xml standpoint, there are three different implementations (original, markup, final), as "original showing markup" and "final showing markup" are only GUI settings within Word itself. I think that the best default option would be to account for the markup in paragraphs by including deleted paragraphs in document.paragraphs, creating a boolean deleted_text attribute for runs and so on, as well as possibly giving the user methods to accept or reject the revisions.

@scanny
Contributor

scanny commented Jan 8, 2014

Oooh, now that's an idea. I like that a lot on first take. So the library could attend to just discovering all the bits that were in there (paragraphs and whatever), characterizing them as to revision status along the way, and then the developer could use that information in whatever way suited their use case.

Then if it turned out to be handy, it would be easy to add an additional collection like 'Final after revisions' or something that provided a high-level access point to commonly needed subsets.

Maybe the default document.paragraphs could be 'final', including inserts but not deletions, and then have a flag to include deletions. That way you wouldn't have to consider revision marks unless they were likely to be a factor in your particular use-case.

I suppose that would mean paragraphs would need to be a method instead of a property, something like document.paragraphs(include_deleted=False).

I'm liking this line of thinking a lot.

On the other bit, like swapping back and forth between lxml, yes, I would hope we could cover 95% of use cases without having to get lxml involved at all. That's one of the big benefits of the OO design, you can add API more or less indefinitely without overly complicating any one call signature. Maybe the lxml access bit could be a way to discover new common use cases and then those could be one-by-one added to the API.

This iteration question is big though. Once we have that down there's all kinds of things that can come after and build on that.

@scanny
Contributor

scanny commented Jan 8, 2014

The revision cases are pretty complex it looks like: Eric White article on revisions. But definitely not insurmountable. Especially if taken on an object-by-object basis.

For paragraph, for a start, it looks like these are the cases:

  • insert whole paragraph
  • delete whole paragraph
  • delete paragraph marker of this paragraph, i.e. it is now combined with the following paragraph
  • move source paragraph (this one seems a bit hairy)

However, getting the "effective after revisions" view is simpler, just digging one level into:

  • insert whole paragraph, and
  • move whole paragraph (moveTo, i.e. this paragraph was moved to here from somewhere else)
  • then combining two or more having delete paragraph marker

I could probably update the current .paragraphs to include these fellows in a single sitting. Might make sense to make it a method at the same time, just to hold open the options.

@scanny
Contributor

scanny commented Jan 10, 2014

Okay Evan, so after a day's reflection, here's what I think makes sense for document.paragraphs, in the tone of how it might appear in the documentation:

Document.paragraphs is the sequence of paragraphs corresponding to the "Final" view of the document. Inserted paragraphs appear in the sequence. Deleted paragraphs do not. Moved paragraphs appear in their new location.
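
One way the "Final" semantics described here could be computed is sketched below. This is an assumption-laden illustration, not the library's behavior: it uses the standard library's xml.etree.ElementTree instead of lxml, the function names are invented, and it assumes a removed paragraph mark is recorded as w:del (or w:moveFrom) inside w:pPr/w:rPr, in which case the paragraph's surviving text joins the next paragraph.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(local_name):
    """Return the namespace-qualified (Clark notation) name of a w: element."""
    return "{%s}%s" % (W_NS, local_name)


# Markers for content that is gone once revisions are accepted.
REMOVED = {w("del"), w("moveFrom")}


def final_text(element):
    """Text under `element`, skipping deleted and moved-away subtrees."""
    if element.tag in REMOVED:
        return ""
    if element.tag == w("t"):
        return element.text or ""
    return "".join(final_text(child) for child in element)


def mark_removed(paragraph):
    """True if the paragraph mark is tracked as deleted or moved away."""
    rPr = paragraph.find("%s/%s" % (w("pPr"), w("rPr")))
    return rPr is not None and any(child.tag in REMOVED for child in rPr)


def final_paragraph_texts(body):
    """Paragraph texts as they would read in the "Final" view."""
    texts, pending = [], ""
    for p in body.findall(w("p")):
        pending += final_text(p)
        if mark_removed(p):
            continue  # mark is gone: surviving text joins the next paragraph
        texts.append(pending)
        pending = ""
    if pending:
        texts.append(pending)
    return texts


# Hypothetical body: kept, wholly deleted, merged pair, inserted.
SAMPLE = (
    '<w:body xmlns:w="%s">'
    '<w:p><w:r><w:t>Kept</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:rPr><w:del w:id="1" w:author="a"/></w:rPr></w:pPr>'
    '<w:del w:id="2" w:author="a">'
    '<w:r><w:delText>Gone</w:delText></w:r></w:del></w:p>'
    '<w:p><w:pPr><w:rPr><w:del w:id="3" w:author="a"/></w:rPr></w:pPr>'
    '<w:r><w:t xml:space="preserve">Split </w:t></w:r></w:p>'
    '<w:p><w:r><w:t>joined</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:rPr><w:ins w:id="4" w:author="a"/></w:rPr></w:pPr>'
    '<w:ins w:id="5" w:author="a"><w:r><w:t>New</w:t></w:r></w:ins></w:p>'
    '</w:body>' % W_NS
)
result = final_paragraph_texts(ET.fromstring(SAMPLE))
```

A real implementation would return paragraph objects rather than strings and would need checking against files actually produced by Word, particularly for moved paragraphs.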

Additional features for accessing deleted paragraphs (and perhaps moved paragraphs in their original location) could be added later. I don't have a clear idea of a use case for accessing those, so I'm inclined to leave it alone until one surfaces.

The advantage of this approach is that the casual user doesn't have to even know about revision marks in order to operate on the document in the form it would most naturally appear in Word.

What do you think? I'll add it as a feature request in a separate issue thread if you concur.

@etfre
Author

etfre commented Jan 10, 2014

That sounds good. I suspect that manipulating documents with markup will be a very small subset of use cases anyways, so as long as there is a basic implementation in place for most users, it shouldn't be an issue. People who want to do crazier things can, again, dive into the lxml.

@scanny
Contributor

scanny commented Jan 15, 2014

@evfredericksen I've added issue #6 and issue #7 to capture the features we identified here, publishing the internal lxml element references and including inserted and moved paragraphs in Document.paragraphs. Let me know if there are other items you want to discuss.

@scanny scanny closed this as completed Jan 15, 2014