Handling poorly formed XML #625
Comments
What approach are you taking? The three I can think of are:
All three of those would work. #1 is probably the least invasive and has the quickest time-to-value. #2 makes it hard to upgrade as new features come online. #3 is a little fancier than #1 and is probably how I would go in this sort of situation (but I have a lot of "internals" knowledge). In any case, you'd want to intervene in and around the
I'd say you want to start by identifying which parts of the python-docx interface you want to behave differently (I would keep this as narrow as possible). If you could get by with
HOWEVER: now that I've looked at the example more closely, it looks like you can accommodate the single problem of
Hi Steve,
I'm working on a data exchange project in which a group of developers of similar software are trying to work out a way to move data between programs. We're passing DOCX files as part of this. Unfortunately, some of the files I need to import don't follow the DOCX format specification very well. So far, I've handled EMF images, improper paragraph formatting parameter values ("exactly" instead of "exact"), improper color specifications (<w:color w:val="black"/> instead of RGB values), and a few other minor issues. But there's one issue I haven't been able to crack.
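For readers facing the same kind of files, here is a minimal sketch of that sort of attribute clean-up. It is not the code used in this project. It assumes the bad "exactly" value appears in the w:lineRule attribute of w:spacing (the post does not say which attribute), and both the NAMED_COLORS table and the normalize() helper are hypothetical names introduced for illustration. It uses the standard library's ElementTree so it runs on its own.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS

# Hypothetical lookup table: extend with whatever names the producer emits.
NAMED_COLORS = {"black": "000000", "white": "FFFFFF", "red": "FF0000"}


def normalize(root):
    """Rewrite known-bad attribute values in place and return the element."""
    for elem in root.iter():
        if elem.tag == W + "color":
            # <w:color w:val="black"/> becomes <w:color w:val="000000"/>
            val = elem.get(W + "val", "")
            if val.lower() in NAMED_COLORS:
                elem.set(W + "val", NAMED_COLORS[val.lower()])
        elif elem.tag == W + "spacing":
            # Assumption: the invalid "exactly" value sits in w:lineRule.
            if elem.get(W + "lineRule") == "exactly":
                elem.set(W + "lineRule", "exact")
    return root


SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:pPr><w:spacing w:line="240" w:lineRule="exactly"/></w:pPr>'
    '<w:r><w:rPr><w:color w:val="black"/></w:rPr><w:t>text</w:t></w:r>'
    "</w:p>" % W_NS
)

fixed = normalize(ET.fromstring(SAMPLE))
```

Values that are already valid (a hex color, "exact", "atLeast", "auto") are left untouched, so the pass should be safe to run on well-formed documents too.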
The documents I need to read specify line breaks incorrectly:
instead of:
My imports bring in only the first text block from each run (the text that comes before the first "<w:br/>" tag) and skip the rest of the text. Effectively, the run ends at the improper tag. (The file loads correctly in Word 2010.)
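The symptom is consistent with "<w:br/>" being nested inside "<w:t>", although that shape is an assumption here, since the XML samples did not survive in this copy of the post. In an ElementTree-style API (python-docx is built on lxml, which behaves the same way in this respect), an element's .text holds only the characters before its first child, and the remainder sits in each child's .tail. The sketch below uses the standard library to show the effect and one possible way to collect the full text. The helper names t_text() and run_text() are hypothetical and are not part of python-docx.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS

# Assumed shape of the malformed run: <w:br/> nested inside <w:t>.
MALFORMED_RUN = (
    '<w:r xmlns:w="%s">'
    "<w:t>first line<w:br/>second line<w:br/>third line</w:t>"
    "</w:r>" % W_NS
)


def t_text(t):
    """Text of a <w:t>, including text that follows nested child elements."""
    parts = [t.text or ""]
    for child in t:
        if child.tag == W + "br":
            parts.append("\n")
        # Text after a nested child is stored in that child's .tail.
        parts.append(child.tail or "")
    return "".join(parts)


def run_text(r):
    """Text of a <w:r>, handling both sibling and nested <w:br/> elements."""
    parts = []
    for child in r:
        if child.tag == W + "t":
            parts.append(t_text(child))
        elif child.tag == W + "br":
            parts.append("\n")
    return "".join(parts)


run = ET.fromstring(MALFORMED_RUN)
naive = run.find(W + "t").text  # stops at the first nested <w:br/>
full = run_text(run)            # keeps all three lines
```

Reading only .text reproduces the truncation described above, while walking the .tail values recovers the remaining lines. The same helper also handles a run where "<w:br/>" is a sibling of the "<w:t>" elements.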
The author of the program that outputs the offending files cites OpenOffice standards, blames his DOCX module, and promises to have his programmer look at it, but I've concluded that my best course of action is to figure out how to import these incorrect files correctly.
Despite feeling reasonably comfortable working with python-docx code, I can't seem to crack where and how to intervene regarding these extra line break tags. I'd appreciate any suggestions you can throw my way.
David Woods