Add HTML -> JSON-DOC converter #3

Merged
merged 45 commits into from
Sep 12, 2024
45 commits
4a23c57
Add converter script from Markdownify
osolmaz Aug 21, 2024
da4c811
Add HTML example
osolmaz Sep 3, 2024
665bba5
Minor
osolmaz Sep 3, 2024
f5aff3b
Add html converstion test, wip
osolmaz Sep 3, 2024
833b623
Add nested paragraphs
osolmaz Sep 3, 2024
852918d
Correct children relationships
osolmaz Sep 4, 2024
0e6ee44
Minor
osolmaz Sep 4, 2024
61c2dc3
Time tests
osolmaz Sep 4, 2024
4c127df
Implement ALLOWED_CHILDREN_BLOCK_TYPES
osolmaz Sep 4, 2024
06a105b
wip
osolmaz Sep 4, 2024
d1144dc
Default const values, create mermaid diagram for HTML
osolmaz Sep 5, 2024
5ba5a4e
wip
osolmaz Sep 5, 2024
062c913
Basic example works
osolmaz Sep 5, 2024
b2b10a5
Add note
osolmaz Sep 5, 2024
8a4c8ea
Add more doc
osolmaz Sep 6, 2024
126ba9c
Cleanup
osolmaz Sep 6, 2024
9557c76
Add JSON-DOC to Markdown converter, wip
osolmaz Sep 6, 2024
e8ccdea
Convert more blocks into markdown, wip
osolmaz Sep 6, 2024
73b3128
wip
osolmaz Sep 6, 2024
afe4c34
Implement reconcile_to_rich_text()
osolmaz Sep 7, 2024
ac75505
Add converter script
osolmaz Sep 8, 2024
b1d1267
Added 1 html to jsondoc test example
osolmaz Sep 8, 2024
7c817a9
Add another test
osolmaz Sep 8, 2024
3364f79
Implement reconcile_to_block(), wip
osolmaz Sep 9, 2024
b31f217
Minor
osolmaz Sep 9, 2024
f8ecf32
Implement table support
osolmaz Sep 10, 2024
d2b8411
Add <br> support
osolmaz Sep 10, 2024
914f23c
Add <ul> and <ol> support
osolmaz Sep 10, 2024
3b5d92d
Implement create_page()
osolmaz Sep 11, 2024
75363de
Handle table captions
osolmaz Sep 11, 2024
b0f79b1
Can convert html_all_elements.html
osolmaz Sep 11, 2024
f1b9d2f
Handle image captions
osolmaz Sep 11, 2024
8e6d922
Minor
osolmaz Sep 11, 2024
27087eb
Minor
osolmaz Sep 11, 2024
af521c0
Remove anthropic from deps
osolmaz Sep 11, 2024
488b8f6
Add test for <a> element, cleanup
osolmaz Sep 11, 2024
73d972c
Cleanup
osolmaz Sep 11, 2024
78accbe
Cleanup
osolmaz Sep 11, 2024
493e0e9
Minor
osolmaz Sep 11, 2024
bdc82b4
Rename docs
osolmaz Sep 12, 2024
06c72f9
Add Pypandoc for conversion from other formats
osolmaz Sep 12, 2024
c8e196e
Improve converter script logic
osolmaz Sep 12, 2024
9c69bba
Update doc
osolmaz Sep 12, 2024
91df280
Minor
osolmaz Sep 12, 2024
92d9727
Add test to workflow
osolmaz Sep 12, 2024
5 changes: 3 additions & 2 deletions .github/workflows/test.yaml
@@ -42,8 +42,9 @@ jobs:
       - name: Run tests
         run: |
           source .venv/bin/activate
-          python tests/run_validation_tests.py schema
-          python tests/run_serialization_tests.py
+          python tests/test_validation.py schema
+          python tests/test_serialization.py
+          python tests/test_html_to_jsondoc.py

       # - name: Upload test results
       #   uses: actions/upload-artifact@v2
133 changes: 133 additions & 0 deletions docs/conversion.md
@@ -0,0 +1,133 @@
---
date: 2024-09-12
---

# Conversion between JSON-DOC and other formats

JSON-DOC is designed to be a versatile and interoperable format for representing structured block-based documents. To facilitate its adoption and integration with existing systems, we provide conversion capabilities between JSON-DOC and other common document formats.

## HTML as intermediate format

Writing a converter between a single pair of formats is an arduous task by itself. At the initial stage, it is infeasible to write converters between all the existing formats and JSON-DOC.

To remedy this, we use HTML as a pivot format. HTML is well established and unambiguous (unlike Markdown and its flavors), and its rich markup can represent most of the document constructs we encounter in practice. BeautifulSoup, the most popular Python HTML parser, is efficient and well tested. We therefore implement the pair JSON-DOC <-> HTML and couple it with Pandoc to convert between JSON-DOC and the formats Pandoc supports.

For example, to convert a Markdown file to JSON-DOC, we first use an existing converter to turn the Markdown file into HTML, and then use our own converter to go from HTML to JSON-DOC, i.e. Markdown --(pandoc)--> HTML --(this library)--> JSON-DOC.
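
The first hop of this pipeline can be sketched as follows. This is a minimal illustration, assuming the `pandoc` executable is installed and on `PATH`; the helper names `pandoc_to_html_command` and `convert_to_html` are hypothetical and not part of this library. The second hop (HTML -> JSON-DOC) is left out because this document does not show the library's Python API.

```python
import subprocess


def pandoc_to_html_command(input_path: str, source_format: str) -> list[str]:
    """Build the pandoc command line that converts input_path to HTML."""
    # --from and --to are pandoc's long forms of -f and -t.
    return ["pandoc", "--from", source_format, "--to", "html", input_path]


def convert_to_html(input_path: str, source_format: str) -> str:
    """Run pandoc and return the generated HTML (requires pandoc to be installed)."""
    result = subprocess.run(
        pandoc_to_html_command(input_path, source_format),
        check=True,           # raise CalledProcessError if pandoc fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout as str
    )
    return result.stdout
```

The returned HTML string would then be handed to the HTML -> JSON-DOC converter described in the rest of this document.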

## Details of HTML conversion

- Terminal text nodes are to be converted to rich text blocks.
- The order of children blocks must be preserved.
- Any node in the syntax tree can generate 3 types of objects: string, rich text, or block.
- Consequently, any node can receive a list of these objects as children. Each must be handled properly: merge adjacent rich texts, append rich text to the current block, and, if an object cannot be appended in the current node, pass it to the parent node while preserving the order of the children.

Below is an example paragraph element with child elements:

```html
<p>This is a <b>bold</b> word and this is an <em>emphasized</em> word.</p>
```

This yields the following syntax tree:

```mermaid
graph TD;
root["&lt;p&gt;"]
root --> node1["This is a"]
root --> node2["&lt;b&gt;"]
node2 --> node3["bold"]
root --> node4["word and this is an"]
root --> node5["&lt;em&gt;"]
node5 --> node6["emphasized"]
root --> node7["word."]

classDef string fill:#28a745,color:white,font-weight:bold,stroke-width:2px;
classDef rich_text fill:#ffc107,color:#343a40,font-weight:bold,stroke-width:2px;
classDef block fill:#dc3545,color:white,font-weight:bold,stroke-width:2px;

class root block;
class node2,node5 rich_text;
class node1,node3,node4,node6,node7 string;
```

In this example, only the `<p>` element creates a JSON-DOC (paragraph) block.

- Terminal string nodes (colored green) are returned as strings while recursing through the tree.
- HTML tags that don't create blocks (colored yellow), but apply some style, such as `<b>` and `<em>`, are returned as empty rich text objects with corresponding `Annotations`.
- HTML tags that create blocks (colored red), such as `<p>`, `<blockquote>`, `<code>`, etc. are returned as empty JSON-DOC blocks.
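
The three-way classification above can be reproduced with a short sketch. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies; the `NodeClassifier` class and the two tag sets are hypothetical and only cover the tags mentioned in this section.

```python
from html.parser import HTMLParser

# Assumed, deliberately incomplete tag sets taken from the examples above.
BLOCK_TAGS = {"p", "blockquote", "code"}
RICH_TEXT_TAGS = {"b", "em"}


class NodeClassifier(HTMLParser):
    """Record each node as ("block" | "rich_text" | "string", value) in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.nodes.append(("block", tag))
        elif tag in RICH_TEXT_TAGS:
            self.nodes.append(("rich_text", tag))
        # Tags outside both sets are ignored in this sketch.

    def handle_data(self, data):
        # Skip whitespace-only text so it does not show up as a string node.
        if data.strip():
            self.nodes.append(("string", data))


classifier = NodeClassifier()
classifier.feed(
    "<p>This is a <b>bold</b> word and this is an <em>emphasized</em> word.</p>"
)
for kind, value in classifier.nodes:
    print(f"{kind:10} {value!r}")
```

The printed sequence follows the same order as the diagram: one block, then strings and rich text nodes interleaved.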

The function `process_tag(node)` receives the top-level node and recurses into its children, which are themselves either HTML elements or text nodes.

```python
from bs4 import NavigableString


def process_tag(node):
    children_objects = []
    for child in node.children:
        if isinstance(child, NavigableString):
            # Terminal text node: keep it as a plain string
            children_objects.append(child.text)
        else:
            # Note that process_tag returns a list of objects and it is
            # concatenated to the children_objects list.
            children_objects.extend(process_tag(child))

    # Get the empty object corresponding to the current node (rich text, block or None)
    current_node_object: BlockBase | RichTextBase | None = convert_current_node(node)

    # Reconcile the children objects with the current node object
    return_objects: list = reconcile_children(current_node_object, children_objects)
    return return_objects
```
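
To make the reconciliation step more concrete, here is a simplified sketch of what a function like `reconcile_children` could do. It is not the library's implementation: `RichText` and `Block` are hypothetical stand-ins for the real models, and the sketch skips the merging of adjacent rich texts and the case where a child cannot be attached to the current block.

```python
from dataclasses import dataclass, field


@dataclass
class RichText:
    # Hypothetical stand-in for a rich text object with annotations.
    text: str = ""
    bold: bool = False
    italic: bool = False


@dataclass
class Block:
    # Hypothetical stand-in for a JSON-DOC block.
    type: str
    rich_text: list = field(default_factory=list)
    children: list = field(default_factory=list)


def reconcile_children(current, children):
    """Combine the current node's object with its children, preserving order."""
    if current is None:
        # The node creates nothing itself: hand the children to the parent.
        return list(children)

    if isinstance(current, RichText):
        # Style node: stamp its annotations onto strings and rich texts,
        # and pass any blocks through untouched.
        styled = []
        for child in children:
            if isinstance(child, str):
                styled.append(RichText(child, current.bold, current.italic))
            elif isinstance(child, RichText):
                styled.append(
                    RichText(
                        child.text,
                        child.bold or current.bold,
                        child.italic or current.italic,
                    )
                )
            else:
                styled.append(child)
        return styled

    # Block node: strings and rich texts become its rich text, blocks its children.
    for child in children:
        if isinstance(child, str):
            current.rich_text.append(RichText(child))
        elif isinstance(child, RichText):
            current.rich_text.append(child)
        else:
            current.children.append(child)
    return [current]


bold = reconcile_children(RichText(bold=True), ["bold"])
paragraph = reconcile_children(Block("paragraph"), ["This is a ", *bold, " word."])
```

Here `paragraph` is a one-element list holding a paragraph block whose rich text has three segments, with only the middle one bold.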

## Placeholder blocks

Some HTML elements are not guaranteed to be converted to a JSON-DOC block:

- For example, in JSON-DOC, images can have captions, but tables cannot. So HTML `<caption>` elements need to be handled separately.
- HTML `<br>` elements do not resolve to a JSON-DOC block, but instead trigger a split in a parent block which can contain rich text.

To conditionally handle these elements, we create a corresponding placeholder block and handle them in various ways while the tree is being processed.
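
As an illustration of the `<br>` case, the sketch below splits a flat list of rich text items into separate segments at each placeholder. The `BREAK` sentinel and `split_on_breaks` are hypothetical names, and dropping empty segments is an assumption of this sketch; the library's placeholder blocks may behave differently.

```python
BREAK = object()  # hypothetical placeholder standing in for a <br> element


def split_on_breaks(items):
    """Split a flat list of items into segments at each BREAK placeholder."""
    segments = [[]]
    for item in items:
        if item is BREAK:
            # Start a new segment; the placeholder itself is discarded.
            segments.append([])
        else:
            segments[-1].append(item)
    # Drop empty segments, e.g. those caused by a trailing <br>.
    return [segment for segment in segments if segment]


segments = split_on_breaks(["first line", BREAK, "second ", "line", BREAK])
```

Each resulting segment could then become the rich text of its own parent-type block, for example one paragraph per segment.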

## Using the converter script

The Python package `jsondoc` includes a command line script `convert_jsondoc` to convert between JSON-DOC and other formats. To see how to use the converter script, run:

```bash
convert_jsondoc --help
```

Source and target formats can be specified with the `-s` and `-t` flags. If they are not specified, the converter will try to infer them from file extensions in input and output file names.

Convert the example HTML file to JSON-DOC:

```bash
convert_jsondoc -i examples/html/html_all_elements.html --indent 2
```

- The script will exclusively convert from JSON-DOC to other formats and vice versa. So either the source or the target must be JSON-DOC.
- If the source is not JSON-DOC, then the target will be assumed to be JSON-DOC.
- If the source is JSON-DOC, then the target format will have to be specified, either directly with `-t` or indirectly by providing a file name with an extension that can be used to infer the format.

```bash
# Will convert an_awesome_file.docx to JSON-DOC and save it in awesome_jsondoc.json
convert_jsondoc -i an_awesome_file.docx --indent 2 -o awesome_jsondoc.json
```
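
The inference rules listed above can be sketched in a few lines. This is only an illustration of the described behavior, not the script's actual code: the extension table and the function names are assumptions, and only the format names `jsondoc` and `markdown` appear in the examples in this document.

```python
from pathlib import Path

# Assumed mapping; the real script may recognize other extensions and names.
EXTENSION_TO_FORMAT = {
    ".json": "jsondoc",
    ".html": "html",
    ".md": "markdown",
    ".docx": "docx",
}


def infer_format(path):
    """Guess a format from the file extension, or return None if unknown."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())


def resolve_formats(input_path, output_path=None, source=None, target=None):
    """Apply the rules above and return (source, target)."""
    source = source or infer_format(input_path)
    if source is None:
        raise ValueError("could not infer the source format")
    if target is None and output_path is not None:
        target = infer_format(output_path)
    if target is None and source != "jsondoc":
        # A non-JSON-DOC source implies a JSON-DOC target.
        target = "jsondoc"
    if target is None:
        raise ValueError("a target format is required when the source is JSON-DOC")
    if "jsondoc" not in (source, target):
        raise ValueError("either the source or the target must be JSON-DOC")
    return source, target


print(resolve_formats("an_awesome_file.docx", "awesome_jsondoc.json"))
```

With the assumed table, the `.docx` input resolves to a `docx` source and a `jsondoc` target, matching the command shown above.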

You can also pipe the output of one converter to the input of another, to convert from one format to another.

```bash
# Convert from HTML to JSON-DOC and then to Markdown
convert_jsondoc -i an_awesome_file.html --indent 2 | convert_jsondoc -s jsondoc -t markdown
```

## Remaining tasks

HTML->JSON-DOC tasks

- [x] Convert lists `<ul>` and `<ol>`
- [x] Convert line breaks `<br>`
- [x] Convert `<caption>` and `<figcaption>`
- [x] Force_page=true
- [ ] Residual strings, newlines or empty paragraphs in the final output list (in progress)
- [ ] Make sure `<a>` conversion is consistent
- [ ] Cleanup empty blocks at the end
- [ ] Table cells with colspan/rowspan
- [ ] Add test for `<code>` and `<pre>`
- [ ] Table thead/tbody/tfoot ordering
9 changes: 8 additions & 1 deletion docs/differences-from-notion.md
@@ -1,4 +1,11 @@
 - Certain block types are not supported.
   - TBD: List which ones
 - Certain fields can be omitted, which would fall back to default values.
-- Metadata field is present in blocks.
+- Metadata field is present in blocks.
+- grep `// notion-diverge`
+
+
+## Ideas
+
+- Let TableBlocks have caption? In Notion, they cannot have captions.
+- Add subscript/superscript annotation to rich text?
26 changes: 26 additions & 0 deletions docs/json-doc-spec.md
@@ -0,0 +1,26 @@

# JSON-DOC Specification

TBD

## Children blocks

See `jsondoc.validate.rules` for more details.

### Block types with no restrictions on children types

- `type: paragraph`
- `type: toggle`
- `type: bulleted_list_item`
- `type: numbered_list_item`
- `type: quote`
- `type: synced_block`
- `type: to_do`
- `type: column`

### Block types where children are restricted to a specific type

- `type: column_list`
- Children type: `type: column`
- `type: table`
- Children type: `type: table_row`
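
A check for the restricted block types could look like the sketch below. The real rules live in `jsondoc.validate.rules`; the `ALLOWED_CHILDREN` table, the `invalid_children` function and the flat `{"type": ..., "children": [...]}` block shape used here are assumptions made for illustration.

```python
# Hypothetical rule table covering only the restricted types listed above.
ALLOWED_CHILDREN = {
    "column_list": {"column"},
    "table": {"table_row"},
}


def invalid_children(block):
    """Return the types of children that the block's type does not allow."""
    allowed = ALLOWED_CHILDREN.get(block["type"])
    if allowed is None:
        # Unrestricted block type (paragraph, toggle, quote, ...).
        return []
    return [
        child["type"]
        for child in block.get("children", [])
        if child["type"] not in allowed
    ]


table = {
    "type": "table",
    "children": [{"type": "table_row"}, {"type": "paragraph"}],
}
print(invalid_children(table))
```

An empty result means the children satisfy the restriction for that block type.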
File renamed without changes.
File renamed without changes.
26 changes: 13 additions & 13 deletions docs/roadmap.md
@@ -6,19 +6,19 @@ title: "JSON-DOC"

# JSON-DOC Implementation Roadmap

-- [ ] Create JSONSchema for each block type.
-- [ ] Implement converters into JSON-DOC
+- [x] Create JSONSchema for each block type.
+- [x] Implement converters into JSON-DOC
   - [ ] Multimodal-LLM based PDF/raster image -> JSON-DOC (Most important)
-  - [ ] HTML -> JSON-DOC
-  - [ ] DOCX -> JSON-DOC
-  - [ ] XLSX -> JSON-DOC
-  - [ ] PPTX -> JSON-DOC
-  - [ ] CSV -> JSON-DOC
-  - [ ] Google Docs -> JSON-DOC (lower priority compared to DOCX)
-  - [ ] Google Sheets -> JSON-DOC
-  - [ ] Google Slides -> JSON-DOC
+  - [x] HTML -> JSON-DOC
+  - [x] DOCX -> JSON-DOC
+  - [x] XLSX -> JSON-DOC
+  - [x] PPTX -> JSON-DOC
+  - [x] CSV -> JSON-DOC
+  - [x] Google Docs -> JSON-DOC (lower priority compared to DOCX)
+  - [x] Google Sheets -> JSON-DOC
+  - [x] Google Slides -> JSON-DOC
 - [ ] Implement converters from JSON-DOC
-  - [ ] JSON-DOC -> Markdown/plain text with tabular metadata for injecting into LLM context.
+  - [x] JSON-DOC -> Markdown/plain text with tabular metadata for injecting into LLM context.
   - [ ] Ability to reference, extract and render a certain table range. (Important for scrolling in spreadsheets)
 - [ ] Frontend for JSON-DOC
   - [ ] JavaScript renderer for JSON-DOC to render it in the browser.
@@ -121,5 +121,5 @@ These are not "official" blocks, but exist under the `rich_text` key in some blo
## Miscellaneous tasks

- [ ] Make non-essential fields optional with default values to make JSON files smaller. Start with rich text fields.
-- [ ] Reserve jsondoc PyPI package name.
-- [ ] Buy a JSON-DOC domain. json-doc.org and json-doc.com are available.
+- [x] Reserve jsondoc PyPI package name. (Reserved python-jsondoc since PyPI rejected jsondoc)
+- [x] Buy a JSON-DOC domain. json-doc.org and json-doc.com are available.