|
| 1 | +--- |
| 2 | +date: 2024-09-12 |
| 3 | +--- |
| 4 | + |
| 5 | +# Conversion between JSON-DOC and other formats |
| 6 | + |
| 7 | +JSON-DOC is designed to be a versatile and interoperable format for representing structured block-based documents. To facilitate its adoption and integration with existing systems, we provide conversion capabilities between JSON-DOC and other common document formats. |
| 8 | + |
| 9 | +## HTML as intermediate format |
| 10 | + |
| 11 | +Writing a converter between a single pair of formats is an arduous task by itself. At the initial stage, it is infeasible to write converters between all the existing formats and JSON-DOC. |
| 12 | + |
| 13 | +To remedy this, we follow a pivot approach over HTML. HTML is a well-established and unambiguous format (unlike Markdown and its flavors) with rich markup elements, capable of representing most of the document constructs we encounter in practice. BeautifulSoup, a widely used Python library for parsing HTML, is efficient and well tested. Therefore, we choose HTML as the intermediate format. We implement the pair JSON-DOC <-> HTML and couple it with Pandoc to convert between JSON-DOC and the other formats that Pandoc supports. |
| 14 | + |
| 15 | +For example, to convert a Markdown file to JSON-DOC, we first use an existing converter such as Pandoc to convert the Markdown file to HTML, and then use our converter to go from HTML to JSON-DOC, i.e. Markdown --(pandoc)--> HTML --(this library)--> JSON-DOC. |
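
The two-step pipeline above could be scripted roughly as follows. This is a minimal sketch, not part of the library: the helper names `build_pipeline` and `markdown_to_jsondoc` are hypothetical, and it assumes that `convert_jsondoc` accepts HTML on standard input when given `-s html` (the usage section only demonstrates piping with `-s jsondoc`).

```python
import subprocess


def build_pipeline(markdown_path: str) -> list[list[str]]:
    """Return the two commands of the Markdown -> HTML -> JSON-DOC pivot."""
    # Step 1: Pandoc renders the Markdown file as HTML on stdout.
    to_html = ["pandoc", "-f", "markdown", "-t", "html", markdown_path]
    # Step 2 (assumption): convert_jsondoc reads HTML from stdin with -s html.
    to_jsondoc = ["convert_jsondoc", "-s", "html", "--indent", "2"]
    return [to_html, to_jsondoc]


def markdown_to_jsondoc(markdown_path: str) -> str:
    """Run both steps, feeding Pandoc's HTML output to the converter."""
    to_html, to_jsondoc = build_pipeline(markdown_path)
    html = subprocess.run(
        to_html, check=True, capture_output=True, text=True
    ).stdout
    result = subprocess.run(
        to_jsondoc, input=html, check=True, capture_output=True, text=True
    )
    return result.stdout
```

Keeping the command construction separate from execution makes it easy to swap the first step for any other source format that Pandoc can render as HTML.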
| 16 | + |
| 17 | +## Details of HTML conversion |
| 18 | + |
| 19 | +- Terminal text nodes are to be converted to rich text blocks. |
| 20 | +- The order of child blocks must be preserved. |
| 21 | +- Any node in the syntax tree can generate 3 types of objects: string, rich text, or block. |
| 22 | +- Consequently, any node can receive a list of these objects as children. Each must be handled properly. For example, adjacent rich texts are merged, rich text is appended to the current block, and objects that cannot be appended in the current node are passed up to the parent node, preserving the order of the children. |
| 23 | + |
| 24 | +Below is an example paragraph element with child elements: |
| 25 | + |
| 26 | +```html |
| 27 | +<p>This is a <b>bold</b> word and this is an <em>emphasized</em> word.</p> |
| 28 | +``` |
| 29 | + |
| 30 | +This yields the following syntax tree: |
| 31 | + |
| 32 | +```mermaid |
| 33 | +graph TD; |
| 34 | + root["<p>"] |
| 35 | + root --> node1["This is a"] |
| 36 | + root --> node2["<b>"] |
| 37 | + node2 --> node3["bold"] |
| 38 | + root --> node4["word and this is an"] |
| 39 | + root --> node5["<em>"] |
| 40 | + node5 --> node6["emphasized"] |
| 41 | + root --> node7["word."] |
| 42 | + |
| 43 | + classDef string fill:#28a745,color:white,font-weight:bold,stroke-width:2px; |
| 44 | + classDef rich_text fill:#ffc107,color:#343a40,font-weight:bold,stroke-width:2px; |
| 45 | + classDef block fill:#dc3545,color:white,font-weight:bold,stroke-width:2px; |
| 46 | + |
| 47 | + class root block; |
| 48 | + class node2,node5 rich_text; |
| 49 | + class node1,node3,node4,node6,node7 string; |
| 50 | +``` |
| 51 | + |
| 52 | +In this example, only the `<p>` element creates a JSON-DOC (paragraph) block. |
| 53 | + |
| 54 | +- Terminal string nodes (colored green) are returned as strings while recursing the tree. |
| 55 | +- HTML tags that don't create blocks (colored yellow), but apply some style, such as `<b>` and `<em>`, are returned as empty rich text objects with corresponding `Annotations`. |
| 56 | +- HTML tags that create blocks (colored red), such as `<p>`, `<blockquote>`, `<code>`, etc. are returned as empty JSON-DOC blocks. |
| 57 | + |
| 58 | +The function `process_tag(node)` receives the top-level node and recurses into its children, which are themselves either HTML elements or text nodes. |
| 59 | + |
| 60 | +```python |
| 61 | +def process_tag(node): |
| 62 | + children_objects = [] |
| 63 | + for child in node.children: |
| 64 | + if isinstance(child, NavigableString): |
| 65 | + children_objects.append(child.text) |
| 66 | + else: |
| 67 | + # Note that process_tag returns a list of objects and it is |
| 68 | + # concatenated to the children_objects list. |
| 69 | + children_objects.extend(process_tag(child)) |
| 70 | + |
| 71 | + # Get the empty object corresponding to the current node (rich text, block or None) |
| 72 | + current_node_object: BlockBase | RichTextBase | None = convert_current_node(node) |
| 73 | + |
| 74 | + # Reconcile the children objects with the current node object |
| 75 | + return_objects: list = reconcile_children(current_node_object, children_objects) |
| 76 | + return return_objects |
| 77 | +``` |
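
The reconciliation step is not shown above. The following is a minimal sketch of what it might look like, under stated assumptions: `RichText` and `Block` are simplified stand-ins for the library's rich text and block models, `merge_rich_texts` is a hypothetical helper, and the rules implemented (convert strings to rich text, push a styling tag's annotations onto its children, let a block absorb rich text and nest child blocks) are one plausible reading of the bullet points in this section, not the library's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class RichText:
    text: str = ""
    annotations: frozenset = frozenset()  # e.g. frozenset({"bold"})


@dataclass
class Block:
    type: str
    rich_text: list = field(default_factory=list)
    children: list = field(default_factory=list)


def merge_rich_texts(items):
    """Merge neighbouring rich texts that carry the same annotations."""
    merged = []
    for item in items:
        prev = merged[-1] if merged else None
        if (
            isinstance(item, RichText)
            and isinstance(prev, RichText)
            and item.annotations == prev.annotations
        ):
            merged[-1] = RichText(prev.text + item.text, prev.annotations)
        else:
            merged.append(item)
    return merged


def reconcile_children(current, children):
    # Strings become plain rich text, so only two kinds of object remain.
    items = [RichText(c) if isinstance(c, str) else c for c in children]
    if current is None:
        # The tag contributes nothing itself: hand everything to the parent.
        return merge_rich_texts(items)
    if isinstance(current, RichText):
        # Styling tag: add its annotations to every rich text child.
        # Blocks pass through untouched for an ancestor to place.
        items = [
            RichText(i.text, i.annotations | current.annotations)
            if isinstance(i, RichText)
            else i
            for i in items
        ]
        return merge_rich_texts(items)
    # Block: absorb rich text and nest child blocks.
    # Order is preserved within each of the two lists.
    for item in merge_rich_texts(items):
        if isinstance(item, RichText):
            current.rich_text.append(item)
        else:
            current.children.append(item)
    return [current]


# Usage for <p>This is a <b>bold</b> word.</p>, innermost tag first.
bold = reconcile_children(RichText(annotations=frozenset({"bold"})), ["bold"])
para = reconcile_children(Block("paragraph"), ["This is a ", *bold, " word."])
# para[0].rich_text now holds three rich texts: plain, bold, plain.
```

Converting strings to rich text up front keeps the later branches simple, since each only has to distinguish rich text from blocks.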
| 78 | + |
| 79 | +## Placeholder blocks |
| 80 | + |
| 81 | +Some HTML elements are not guaranteed to be converted to a JSON-DOC block: |
| 82 | + |
| 83 | +- For example, in JSON-DOC, images can have captions, but tables cannot. So HTML `<caption>` elements need to be handled separately. |
| 84 | +- HTML `<br>` elements do not resolve to a JSON-DOC block, but instead trigger a split in a parent block which can contain rich text. |
| 85 | + |
| 86 | +To handle these elements conditionally, we create corresponding placeholder blocks and resolve them in various ways while the tree is being processed. |
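
To illustrate the `<br>` case, here is a minimal sketch of how a placeholder could trigger a split in its parent. The names `Paragraph`, `BreakPlaceholder`, and `split_on_breaks` are hypothetical, and the choice to emit one paragraph per run of content and to drop empty runs is an assumption made for the example, not a description of the library's behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    rich_text: list = field(default_factory=list)


class BreakPlaceholder:
    """Stands in for <br> until the parent block decides what to do with it."""


def split_on_breaks(children):
    """Turn a paragraph's children into one paragraph per <br>-separated run."""
    paragraphs = [Paragraph()]
    for child in children:
        if isinstance(child, BreakPlaceholder):
            # The placeholder never reaches the output; it only starts a new block.
            paragraphs.append(Paragraph())
        else:
            paragraphs[-1].rich_text.append(child)
    # Drop empty runs, e.g. from a leading, trailing, or doubled <br>.
    return [p for p in paragraphs if p.rich_text]
```

For instance, `split_on_breaks(["first line", BreakPlaceholder(), "second line"])` produces two paragraphs, and the placeholder itself does not appear in the result.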
| 87 | + |
| 88 | +## Using the converter script |
| 89 | + |
| 90 | +The Python package `jsondoc` includes a command line script `convert_jsondoc` to convert between JSON-DOC and other formats. To see how to use the converter script, run: |
| 91 | + |
| 92 | +```bash |
| 93 | +convert_jsondoc --help |
| 94 | +``` |
| 95 | + |
| 96 | +Source and target formats can be specified with the `-s` and `-t` flags. If they are not specified, the converter will try to infer them from the file extensions of the input and output file names. |
| 97 | + |
| 98 | +Convert the example HTML file to JSON-DOC: |
| 99 | + |
| 100 | +```bash |
| 101 | +convert_jsondoc -i examples/html/html_all_elements.html --indent 2 |
| 102 | +``` |
| 103 | + |
| 104 | +- The script only converts between JSON-DOC and other formats, so either the source or the target must be JSON-DOC. |
| 105 | +- If the source is not JSON-DOC, then the target will be assumed to be JSON-DOC. |
| 106 | +- If the source is JSON-DOC, then the target format will have to be specified, either directly with `-t` or indirectly by providing a file name with an extension that can be used to infer the format. |
| 107 | + |
| 108 | +```bash |
| 109 | +# Will convert an_awesome_file.docx to JSON-DOC and save it in awesome_jsondoc.json |
| 110 | +convert_jsondoc -i an_awesome_file.docx --indent 2 -o awesome_jsondoc.json |
| 111 | +``` |
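
The format resolution rules listed above can be sketched as follows. This is an illustration only: `EXTENSION_TO_FORMAT`, `infer_format`, and `resolve_formats` are hypothetical names, the extension table is an assumption (including the mapping of `.json` to JSON-DOC), and the real script may resolve formats differently.

```python
from pathlib import Path

# Hypothetical extension table; the real script may recognise other names.
EXTENSION_TO_FORMAT = {
    ".json": "jsondoc",
    ".html": "html",
    ".md": "markdown",
    ".docx": "docx",
}


def infer_format(path):
    """Guess a format from a file name, or return None if it is unknown."""
    if path is None:
        return None
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())


def resolve_formats(source=None, target=None, input_path=None, output_path=None):
    """Apply the rules: explicit flags win, then extensions, then defaults."""
    source = source or infer_format(input_path)
    target = target or infer_format(output_path)
    if source is None:
        raise ValueError("cannot infer the source format; pass -s")
    if source != "jsondoc":
        # One side must be JSON-DOC, so a non-JSON-DOC source fixes the target.
        if target not in (None, "jsondoc"):
            raise ValueError("either the source or the target must be JSON-DOC")
        return source, "jsondoc"
    if target is None:
        raise ValueError("source is JSON-DOC; pass -t or an output file name")
    return source, target
```

With this sketch, `resolve_formats(input_path="an_awesome_file.docx", output_path="awesome_jsondoc.json")` resolves to `("docx", "jsondoc")`, matching the command shown above.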
| 112 | + |
| 113 | +You can also pipe the output of one converter to the input of another, to convert from one format to another. |
| 114 | + |
| 115 | +```bash |
| 116 | +# Convert from HTML to JSON-DOC and then to Markdown |
| 117 | +convert_jsondoc -i an_awesome_file.html --indent 2 | convert_jsondoc -s jsondoc -t markdown |
| 118 | +``` |
| 119 | + |
| 120 | +## Remaining tasks |
| 121 | + |
| 122 | +HTML->JSON-DOC tasks |
| 123 | + |
| 124 | +- [x] Convert lists `<ul>` and `<ol>` |
| 125 | +- [x] Convert line breaks `<br>` |
| 126 | +- [x] Convert `<caption>` and `<figcaption>` |
| 127 | +- [x] Force_page=true |
| 128 | +- [ ] Residual strings, newlines or empty paragraphs in the final output list (in progress) |
| 129 | +- [ ] Make sure `<a>` conversion is consistent |
| 130 | +- [ ] Cleanup empty blocks at the end |
| 131 | +- [ ] Table cells with colspan/rowspan |
| 132 | +- [ ] Add test for `<code>` and `<pre>` |
| 133 | +- [ ] Table thead/tbody/tfoot ordering |