feature: InlineShape.image ? #249
Comments
The API doesn't support this; you'll need to dig into the XML with lxml, and maybe lean on some python-docx internals, if you want it bad enough :) The general gist of the XML is here: You can get a handle to the wp:inline element using InlineShape._inline: From there you can navigate to the pic:pic element using something like this:
From there you'll need to parse the embed link to get the relationship to the picture, which will be stored as a separate part. These are just general guidelines, I don't have time unfortunately to get it down to working details, but should give you an idea of what's involved. I'm sure there must be ways the existing internals can help but I can't remember just now how it all works. You'll need to trace through the code a bit if it's worth the effort to you. I would start that by tracing through how the .add_picture() bit works, you're basically looking to roughly reverse that. |
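The navigation described in the reply above, from wp:inline down to the picture and its embed id, can be sketched with the standard library alone. This is only an illustration under stated assumptions: the XML is a hand-written, abbreviated sample (a real wp:inline has more children), `embed_rid` is a hypothetical helper name, and `xml.etree.ElementTree` stands in for the lxml elements python-docx actually hands back.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by inline DrawingML pictures in a .docx body.
NS = {
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Build the xmlns declarations once so the sample below stays readable.
_XMLNS = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())

# Assumption: abbreviated sample, not copied from a real document.
SAMPLE_INLINE = f"""<wp:inline {_XMLNS}>
  <a:graphic>
    <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
      <pic:pic>
        <pic:blipFill>
          <a:blip r:embed="rId5"/>
        </pic:blipFill>
      </pic:pic>
    </a:graphicData>
  </a:graphic>
</wp:inline>"""


def embed_rid(inline):
    """Return the r:embed relationship id of the picture, or None if absent."""
    pic = inline.find("a:graphic/a:graphicData/pic:pic", NS)
    if pic is None:
        return None
    blip = pic.find("pic:blipFill/a:blip", NS)
    if blip is None:
        return None
    # Namespaced attributes are addressed in Clark notation: {uri}name.
    return blip.get("{%s}embed" % NS["r"])


print(embed_rid(ET.fromstring(SAMPLE_INLINE)))  # prints rId5
```

The returned id is a key into the document part's relationships, which is where the image part (and its bytes) is found.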
Hi Steve, I'm starting to explore the issue of loading images from docx files. I've figured out inline shapes and getting the (internal) image name and I've explored Relationships and have figured out how to link the I've been exploring the underbelly of runs, and have now determined a Any suggestions would be welcome. David On 01/26/2016 12:28 AM, Steve Canny wrote:
David K. Woods, Ph.D. |
Most of the grunt work with images is taken care of in the python-docx internals. You should be able to mostly leverage that for what you need, definitely all the bits about looking up image blobs from relationship ids and so on. The first key thing you need is the so-called With the rId in hand (something like 'rId5'), you can get the "related part", which will be an image part in this case:
document_part = document.part
# OR (if inline_shapes is already handy)
document_part = document.inline_shapes.part
# Then lookup the image part by rId
rId = however_you_get_the_rId_from_the_inline_shape()
image_part = document_part.related_parts[rId]
# docx.parts.image.ImagePart has some useful bits
filename = image_part.filename
# and also provides access to a docx.image.image.Image object which has even more goodies
image = image_part.image
bytes_of_image = image.blob
... and a bunch of other bits like dimensions, filename, content type, extension, etc. Let me know if you need more once you've had a look at these :) |
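Putting the pieces from this reply together, one possible sketch looks like the following. The helper names `picture_rids` and `image_info` are hypothetical and not part of python-docx, and the XPath used to obtain the rId is an assumption that fills in the step the reply leaves open; the lookups through `document.part.related_parts`, `image_part.filename`, `image_part.image`, and `image.blob` follow the reply above.

```python
def picture_rids(inline_shape):
    """Return the r:embed ids found under an InlineShape's wp:inline element.

    Assumes `inline_shape._inline` is a python-docx oxml element, whose
    `xpath()` already knows the `a:` and `r:` namespace prefixes.
    """
    inline = inline_shape._inline
    return [str(rid) for rid in inline.xpath(".//a:blip/@r:embed")]


def image_info(document, inline_shape):
    """Yield (filename, content_type, blob) for each embedded picture."""
    document_part = document.part
    for rid in picture_rids(inline_shape):
        image_part = document_part.related_parts[rid]
        image = image_part.image
        yield image_part.filename, image.content_type, image.blob


# Usage sketch (requires python-docx; "report.docx" is a made-up file name):
#
#   from docx import Document
#   document = Document("report.docx")
#   for shape in document.inline_shapes:
#       for filename, content_type, blob in image_info(document, shape):
#           with open(filename, "wb") as f:
#               f.write(blob)
```

The blob is the raw image file content, so it can be written to disk as-is or handed to whatever image loader the host toolkit provides.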
Hi David, what did you end up doing with this? Just wondering if a reasonably clear new API feature occurred to you after working this challenge that might be handy to have. Maybe Just sorting through the issue list here and wanted to encapsulate this one in the title if there was something you thought made sense. |
Hi Steve, I haven't done anything about it yet. I want to get feature/tabstops David On 04/21/2016 11:07 PM, Steve Canny wrote:
|
Hi Steve, I am struggling at the moment with the same issue as David. In particular, I need to copy tables, paragraphs, and images from one docx to another docx. To do so, the iter_block_items method outlined in #40 was a great help to me. However, this method only considers paragraphs and tables. Is there a way to extend this method to also consider inline shapes? Best regards |
Anything is possible for the diligent developer :) But not something that anyone is working on at the moment as far as I know, if that's what you're asking :) |
Hi Steve, I am turning my attention to this problem again after a number of distractions such as making sure my kids can eat most days. You know how that goes.
I've never been able to get a handle on a specific Inline object from a Run when reading a file or when using run.add_picture() rather than document.add_picture(). Document.add_picture() produces a CT_Inline object, and this allows access to everything that's needed. However, run.add_picture() produces a CT_Run object with something in drawing_lst[0], but that something is not a CT_Inline object and I haven't been able to crack what that something is. (It's an lxml.etree._Element object, but I can't figure out where to go from there.) When reading a file, images are held in Runs, not CT_Inline objects, no matter how they were created.
My impression is that the pieces are mostly there for reading files and seeing graphics within Runs, but the connections from one level of the XML to the next are getting lost on the oxml level at the w:drawing level. I believe this because the CT_Drawing object referenced in docs/dev/analysis/features/text/run-content.rst and docs/dev/analysis/features/shapes/shapes-inline.rst does not appear to actually exist in the python_docx code. CT_Inline exists, and all the objects needed down the line from there to get all the data needed from the XML exist, but without the CT_Drawing object, nothing appears to be accessible.
So I'm thinking that I need to add CT_Drawing to oxml/shape.py, and that if I do this right, linking it to CT_Inline correctly, then I'm just about where I need to be. I'll be able to read the CT_Drawing object in the Run's drawing_lst to get the information I need to proceed. Does that seem possible to you? Does this approach make sense? Let me know if you need more information to make sense of what I'm saying, beyond what's already in the existing features documentation. David
EDIT 30 minutes later: To answer my own question, yes, that approach makes sense.
IT WORKS. I can now gain access to images in runs in my modified python_docx source code. Tomorrow morning, I start working on how to submit it properly so it can be integrated into the release version of the code. |
Yes, that makes sense to me David. I notice the MS API has In practice, I think one finds at most a single picture in a run. But the schema places no limitation on how many can appear in the same run. That's good for us I think; we won't need |
@aschilling Have you fixed the problem ("copy tables, paragraphs, and images from one docx to another docx")? |
@scanny You mentioned in a comment that you were considering adding a I have managed to save a copy of the inline shape using this:
However, I can't figure out whether any particular paragraph or run has a picture in it, and navigate to the image object from there. Any suggestions? |
I got a little further with this: I found that a run
I assume the first part is the text. The last element has data on the picture. Digging deeper, I find that I can get the rId as
This is the chain of elements I went through to get there:
This doesn't look easy to do programmatically, but I could probably do it by inspecting the Is there any more direct way to figure out where an inline picture comes in a run and to find its size, type/filename and bytes? Edit
Then I can use the code above to get a copy of the image. But I'm having trouble understanding why some elements below |
@mfripp these secrets are revealed in this part of the code: https://github.com/python-openxml/python-docx/blob/master/docx/oxml/text/run.py#L22
To demonstrate:
>>> r = run._r
>>> type(r)
CT_R
>>> type(r.rPr)
CT_RPr # run properties, basically font with bold, italic, size, font name, etc.
>>> len(r.t_lst) # all sequence properties get `_lst` appended, so OneOrMore or ZeroOrMore "fields".
1
>>> dir(r)
... long list of all the available methods and properties, many of which are defined by the metaclass ...
>>> r.xml
... dump of the XML for this run element only ...
So whatever you want from a run element should be available by name. Accessing sub-elements by index is very unreliable because so many are optional and others can appear more than once, like If you know where you're headed then XPath is usually faster than digging through a hierarchy where there are optional elements, like:
>>> r.xpath(".//w:pic")
... a list of zero or more `<w:pic>` elements in this run (might be `a:pic`, check the XML dump) ...
Not all elements have custom element classes, only those we've provided some access to via That's a start toward understanding anyway. |
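The XPath suggestion above also gives one way to answer the earlier question of which runs contain a picture. The sketch below is an assumption-laden illustration rather than an existing python-docx feature: `iter_run_pictures` is a hypothetical name, and the expression assumes the picture element is `pic:pic` with an `a:blip` descendant carrying `r:embed`, which is worth confirming against the `r.xml` dump of a real run.

```python
def iter_run_pictures(document):
    """Yield (paragraph_index, run_index, rId) for each picture found in a run.

    Only top-level body paragraphs are visited (document.paragraphs), so
    pictures inside table cells are not reported. A run with no picture
    yields nothing; a run with several pictures yields one tuple per picture,
    since the schema does not limit how many can appear in a run.
    """
    for p_idx, paragraph in enumerate(document.paragraphs):
        for r_idx, run in enumerate(paragraph.runs):
            # run._r is the underlying w:r element; look pictures up by name
            # rather than by child position.
            for rid in run._r.xpath(".//pic:pic//a:blip/@r:embed"):
                yield p_idx, r_idx, str(rid)


# Usage sketch (requires python-docx; "report.docx" is a made-up file name):
#
#   from docx import Document
#   document = Document("report.docx")
#   for p_idx, r_idx, rid in iter_run_pictures(document):
#       image_part = document.part.related_parts[rid]
#       print(p_idx, r_idx, image_part.filename, len(image_part.image.blob))
```

Searching by element name avoids the index-based access that the reply above warns against, because optional children such as w:rPr shift positions from run to run.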
@scanny Thanks for your advice on this. I'd gotten part of the way down the path you recommended, but I was having trouble going further. When I use Where things get tough is when I use I previously tried looking at the source code you mentioned, but I ran into a problem there too: I had code that sort of worked for getting an image from a run. I also found that Word (at least my version) never puts an image in the same run as text or another image. So that makes them pretty easy to work with, i.e., I don't have to worry about whether there is text before or after the image. This is what that code looked like:
The part that was annoying was having to check all the sub-elements of
I can access I've now tried
|
Hmm, not sure why the Each name defined as a child element or attribute will get certain helper methods and properties. Like an attribute Glad it seems like you got things working. I'd definitely use the |
Love your module.
I'm trying to add *.docx import to my python qualitative analysis tool, and python-docx has allowed me to bring content to a wxPython RichTextCtrl really easily. I'm getting all the character and paragraph level formatting and all of the text, which has come together really quickly.
But I seem to be missing something in reading in images. I can get the size and type of the images, but how do I get the IMAGE DATA to convert into an image object?