textcortex
diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml
Lines changed: 3 additions & 2 deletions
diff --git a/docs/conversion.md b/docs/conversion.md
Lines changed: 133 additions & 0 deletions
diff --git a/docs/differences-from-notion.md b/docs/differences-from-notion.md
Lines changed: 8 additions & 1 deletion
diff --git a/docs/json-doc-spec.md b/docs/json-doc-spec.md
Lines changed: 26 additions & 0 deletions
diff --git a/docs/notes-on-python-implementation.md b/docs/python-implementation.md
rename from docs/notes-on-python-implementation.md
rename to docs/python-implementation.md
diff --git a/docs/notes-on-notion.md b/docs/reverse-engineering-notion.md
rename from docs/notes-on-notion.md
rename to docs/reverse-engineering-notion.md
diff --git a/docs/roadmap.md b/docs/roadmap.md
Lines changed: 13 additions & 13 deletions
@@ -42,8 +42,9 @@ jobs:
     - name: Run tests
       run: |
         source .venv/bin/activate
-        python tests/run_validation_tests.py schema
-        python tests/run_serialization_tests.py
+        python tests/test_validation.py schema
+        python tests/test_serialization.py
+        python tests/test_html_to_jsondoc.py
 
     # - name: Upload test results
     #   uses: actions/upload-artifact@v2
 
@@ -0,0 +1,133 @@
+---
+date: 2024-09-12
+---
+
+# Conversion between JSON-DOC and other formats
+
+JSON-DOC is designed to be a versatile and interoperable format for representing structured block-based documents. To facilitate its adoption and integration with existing systems, we provide conversion capabilities between JSON-DOC and other common document formats.
+
+## HTML as intermediate format
+
+Writing a converter between a single pair of formats is an arduous task by itself. At the initial stage, it is infeasible to write converters between all the existing formats and JSON-DOC.
+
+To remedy this, we follow a pivot approach over HTML. HTML is a well-established and unambiguous format (unlike Markdown and its flavors) with rich markup elements, capable of representing most of the document constructs we encounter in practice. BeautifulSoup, the most popular Python HTML parser, is efficient and well tested. Therefore, we choose HTML as the intermediate format. We implement the pair JSON-DOC <-> HTML and couple it with Pandoc to convert between JSON-DOC and the formats Pandoc supports.
+
+For example, to convert a Markdown file to JSON-DOC, we first use an existing converter to convert the Markdown file to HTML, and then use our converter to convert from HTML to JSON-DOC, i.e. Markdown --(pandoc)--> HTML --(this library)--> JSON-DOC.
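The first hop of that pipeline can be sketched in Python as below. This is a minimal sketch, assuming `pandoc` is installed and on `PATH`; the helper names `pandoc_html_command` and `to_html` are hypothetical and not part of this library. Only Pandoc's documented `--from` and `--to` options are used. The second hop (HTML to JSON-DOC) is left to the library's own converter and is not shown here.

```python
import subprocess


def pandoc_html_command(input_path, source_format):
    """Build the Pandoc command line that renders input_path as HTML on stdout."""
    # Hypothetical helper: only the --from/--to options of the pandoc CLI are used.
    return ["pandoc", "--from", source_format, "--to", "html", input_path]


def to_html(input_path, source_format):
    """Run Pandoc and return the resulting HTML string (requires pandoc on PATH)."""
    completed = subprocess.run(
        pandoc_html_command(input_path, source_format),
        check=True,           # raise CalledProcessError if pandoc fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout to str
    )
    return completed.stdout
```

Calling `to_html("notes.md", "markdown")` would yield the HTML string that the HTML to JSON-DOC converter then consumes.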
+
+## Details of HTML conversion
+
+- Terminal text nodes are to be converted to rich text blocks.
+- The order of child blocks must be preserved.
+- Any node in the syntax tree can generate 3 types of objects: string, rich text, or block.
+- Consequently, any node can receive a list of these objects as children. Each of these must be handled properly: merge adjacent rich texts, append rich text to the current block, and, if an object cannot be appended in the current node, pass it to the parent node while preserving the order of the children.
+
+Below is an example paragraph element with child elements:
+
+```html
+<p>This is a <b>bold</b> word and this is an <em>emphasized</em> word.</p>
+```
+
+This yields the following syntax tree:
+
+```mermaid
+graph TD;
+    root["&lt;p&gt;"]
+    root --> node1["This is a"]
+    root --> node2["&lt;b&gt;"]
+    node2 --> node3["bold"]
+    root --> node4["word and this is an"]
+    root --> node5["&lt;em&gt;"]
+    node5 --> node6["emphasized"]
+    root --> node7["word."]
+
+    classDef string fill:#28a745,color:white,font-weight:bold,stroke-width:2px;
+    classDef rich_text fill:#ffc107,color:#343a40,font-weight:bold,stroke-width:2px;
+    classDef block fill:#dc3545,color:white,font-weight:bold,stroke-width:2px;
+
+    class root block;
+    class node2,node5 rich_text;
+    class node1,node3,node4,node6,node7 string;
+```
+
+In this example, only the `<p>` element creates a JSON-DOC (paragraph) block.
+
+- Terminal string nodes (colored green) are returned as strings while recursing the tree.
+- HTML tags that don't create blocks (colored yellow), but apply some style, such as `<b>` and `<em>`, are returned as empty rich text objects with corresponding `Annotations`.
+- HTML tags that create blocks (colored red), such as `<p>`, `<blockquote>`, `<code>`, etc. are returned as empty JSON-DOC blocks.
+
+The function `process_tag(node)` receives the top-level node and recurses into its children, which are themselves either HTML elements or text nodes.
+
+```python
+from bs4 import NavigableString
+
+
+def process_tag(node):
+    children_objects = []
+    for child in node.children:
+        if isinstance(child, NavigableString):
+            children_objects.append(child.text)
+        else:
+            # Note that process_tag returns a list of objects and it is
+            # concatenated to the children_objects list.
+            children_objects.extend(process_tag(child))
+
+    # Get the empty object corresponding to the current node (rich text, block or None)
+    current_node_object: BlockBase | RichTextBase | None = convert_current_node(node)
+
+    # Reconcile the children objects with the current node object
+    return_objects: list = reconcile_children(current_node_object, children_objects)
+    return return_objects
+```
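To make the recursion concrete, here is a self-contained toy version that runs on the example paragraph. It is a sketch under stated assumptions, not the library's implementation: it uses the standard library's `html.parser` instead of BeautifulSoup so it has no dependencies, represents rich texts and blocks as plain dicts rather than `RichTextBase` / `BlockBase`, and the tag tables and the bodies of `convert_current_node` and `reconcile_children` are hypothetical simplifications. It also assumes well-formed markup without void elements such as `<br>`.

```python
from html.parser import HTMLParser

# Hypothetical tag tables for this sketch only.
STYLE_TAGS = {"b": "bold", "strong": "bold", "em": "italic", "i": "italic"}
BLOCK_TAGS = {"p": "paragraph", "blockquote": "quote"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances or plain strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a tiny element tree; stands in for BeautifulSoup here."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def rich_text(content, annotations=()):
    return {"type": "text", "content": content, "annotations": set(annotations)}


def convert_current_node(node):
    """Return an empty rich text, an empty block, or None for the tag itself."""
    if node.tag in STYLE_TAGS:
        return rich_text("", [STYLE_TAGS[node.tag]])
    if node.tag in BLOCK_TAGS:
        return {"type": BLOCK_TAGS[node.tag], "rich_text": [], "children": []}
    return None


def reconcile_children(current, children):
    if current is None:
        return children  # nothing to attach to: pass children up in order
    if current["type"] == "text":
        out = []  # style tag: stamp its annotations onto strings and rich texts
        for child in children:
            if isinstance(child, str):
                child = rich_text(child, current["annotations"])
            elif child["type"] == "text":
                child["annotations"] |= current["annotations"]
            out.append(child)  # blocks are passed up untouched
        return out
    for child in children:  # block: absorb strings, rich texts and nested blocks
        if isinstance(child, str):
            current["rich_text"].append(rich_text(child))
        elif child["type"] == "text":
            current["rich_text"].append(child)
        else:
            current["children"].append(child)
    return [current]


def process_tag(node):
    children_objects = []
    for child in node.children:
        if isinstance(child, str):
            children_objects.append(child)
        else:
            children_objects.extend(process_tag(child))
    return reconcile_children(convert_current_node(node), children_objects)


builder = TreeBuilder()
builder.feed("<p>This is a <b>bold</b> word and this is an <em>emphasized</em> word.</p>")
result = process_tag(builder.root)
```

`result` holds a single paragraph dict whose `rich_text` list has five entries in document order, with the second annotated `bold` and the fourth annotated `italic`.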
+
+## Placeholder blocks
+
+Some HTML elements are not guaranteed to be converted to a JSON-DOC block:
+
+- For example, in JSON-DOC, images can have captions, but tables cannot. So HTML `<caption>` elements need to be handled separately.
+- HTML `<br>` elements do not resolve to a JSON-DOC block, but instead trigger a split in a parent block which can contain rich text.
+
+To conditionally handle these elements, we create a corresponding placeholder block and handle them in various ways while the tree is being processed.
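As an illustration of the `<br>` case, the sketch below splits a flat list of rich texts into separate paragraph blocks wherever a break placeholder appears. The dict shapes, the `BREAK` placeholder and the function name `split_paragraph_on_breaks` are all hypothetical; the library's actual placeholder classes and split logic may differ.

```python
# Hypothetical placeholder object emitted for a <br> element.
BREAK = {"type": "placeholder", "tag": "br"}


def text(content):
    """Hypothetical minimal rich text object."""
    return {"type": "text", "content": content}


def split_paragraph_on_breaks(items):
    """Split rich texts into paragraph blocks at every break placeholder."""
    groups = [[]]
    for item in items:
        if item.get("type") == "placeholder" and item.get("tag") == "br":
            groups.append([])  # a break starts a new group
        else:
            groups[-1].append(item)
    # Consecutive breaks produce empty groups, which are dropped here.
    return [{"type": "paragraph", "rich_text": group} for group in groups if group]


blocks = split_paragraph_on_breaks(
    [text("first line"), BREAK, text("second "), text("line"), BREAK, BREAK, text("third line")]
)
```

Dropping empty groups is a design choice of this sketch; a converter that wants to keep blank lines could emit empty paragraphs instead.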
+
+## Using the converter script
+
+The Python package `jsondoc` includes a command line script `convert_jsondoc` to convert between JSON-DOC and other formats. To see how to use the converter script, run:
+
+```bash
+convert_jsondoc --help
+```
+
+Source and target formats can be specified with the `-s` and `-t` flags. If they are not specified, the converter will try to infer them from file extensions in input and output file names.
+
+Convert the example HTML file to JSON-DOC:
+
+```bash
+convert_jsondoc -i examples/html/html_all_elements.html --indent 2
+```
+
+- The script will exclusively convert from JSON-DOC to other formats and vice versa. So either the source or the target must be JSON-DOC.
+- If the source is not JSON-DOC, then the target is assumed to be JSON-DOC.
+- If the source is JSON-DOC, then the target format will have to be specified, either directly with `-t` or indirectly by providing a file name with an extension that can be used to infer the format.
+
+```bash
+# Will convert an_awesome_file.docx to JSON-DOC and save it in awesome_jsondoc.json
+convert_jsondoc -i an_awesome_file.docx --indent 2 -o awesome_jsondoc.json
+```
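The format-resolution rules above can be sketched as follows. This is an illustration only: the extension table, the format names other than those appearing in the examples, and the function names `infer_format` and `resolve_formats` are assumptions, not the script's actual code.

```python
from pathlib import Path

# Hypothetical extension table; the script's real mapping is not shown here.
EXTENSIONS = {".json": "jsondoc", ".html": "html", ".md": "markdown", ".docx": "docx"}


def infer_format(path):
    """Guess a format from a file extension, or return None if unknown."""
    if path is None:
        return None
    return EXTENSIONS.get(Path(path).suffix.lower())


def resolve_formats(source=None, target=None, input_path=None, output_path=None):
    """Apply the rules: explicit flags win, and one side must be JSON-DOC."""
    source = source or infer_format(input_path)
    target = target or infer_format(output_path)
    if source is None:
        raise ValueError("cannot determine the source format")
    if source != "jsondoc":
        # Non-JSON-DOC source: the target can only be JSON-DOC.
        if target not in (None, "jsondoc"):
            raise ValueError("either the source or the target must be JSON-DOC")
        return source, "jsondoc"
    if target is None:
        raise ValueError("a target format is required when the source is JSON-DOC")
    return source, target
```

For instance, `resolve_formats(input_path="an_awesome_file.docx", output_path="awesome_jsondoc.json")` resolves to `("docx", "jsondoc")` under this table.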
+
+You can also pipe the output of one converter to the input of another, to convert from one format to another.
+
+```bash
+# Convert from HTML to JSON-DOC and then to Markdown
+convert_jsondoc -i an_awesome_file.html --indent 2 | convert_jsondoc -s jsondoc -t markdown
+```
+
+## Remaining tasks
+
+HTML->JSON-DOC tasks
+
+- [x] Convert lists `<ul>` and `<ol>`
+- [x] Convert line breaks `<br>`
+- [x] Convert `<caption>` and `<figcaption>`
+- [x] Force_page=true
+- [ ] Residual strings, newlines or empty paragraphs in the final output list (in progress)
+- [ ] Make sure `<a>` conversion is consistent
+- [ ] Cleanup empty blocks at the end
+- [ ] Table cells with colspan/rowspan
+- [ ] Add test for `<code>` and `<pre>`
+- [ ] Table thead/tbody/tfoot ordering
@@ -1,4 +1,11 @@
 - Certain block types are not supported.
   - TBD: List which ones
 - Certain fields can be omitted, which would fall back to default values.
-- Metadata field is present in blocks.
+- Metadata field is present in blocks.
+- grep `// notion-diverge`
+
+
+## Ideas
+
+- Let TableBlocks have caption? In Notion, they cannot have captions.
+- Add subscript/superscript annotation to rich text?
@@ -0,0 +1,26 @@
+
+# JSON-DOC Specification
+
+TBD
+
+## Children blocks
+
+See `jsondoc.validate.rules` for more details.
+
+### Block types with no restrictions on children types
+
+- `type: paragraph`
+- `type: toggle`
+- `type: bulleted_list_item`
+- `type: numbered_list_item`
+- `type: quote`
+- `type: synced_block`
+- `type: to_do`
+- `type: column`
+
+### Block types where children are restricted to a specific type
+
+- `type: column_list`
+  - Children type: `type: column`
+- `type: table`
+  - Children type: `type: table_row`
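A minimal sketch of how the restrictions listed above could be checked on plain dicts. It is not the code in `jsondoc.validate.rules`; the function name `child_type_errors` is hypothetical, and it assumes each block stores its children in a top-level `children` list. Block types without a listed restriction are simply not checked.

```python
# Restricted parent types and their allowed child types, as listed above.
RESTRICTED_CHILDREN = {
    "column_list": {"column"},
    "table": {"table_row"},
}


def child_type_errors(block, path="root"):
    """Return human-readable errors for children that violate the restrictions."""
    errors = []
    allowed = RESTRICTED_CHILDREN.get(block["type"])  # None: no restriction checked
    for index, child in enumerate(block.get("children", [])):
        child_path = f"{path}.children[{index}]"
        if allowed is not None and child["type"] not in allowed:
            errors.append(f"{child_path}: {block['type']} cannot contain {child['type']}")
        # Recurse so that nested violations are reported as well.
        errors.extend(child_type_errors(child, child_path))
    return errors


ok = {"type": "table", "children": [{"type": "table_row"}]}
bad = {"type": "column_list", "children": [{"type": "column"}, {"type": "paragraph"}]}
```

Here `child_type_errors(ok)` is empty, while `child_type_errors(bad)` reports the paragraph placed directly under the `column_list`.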
@@ -6,19 +6,19 @@ title: "JSON-DOC"
 
 # JSON-DOC Implementation Roadmap
 
-- [ ] Create JSONSchema for each block type.
-- [ ] Implement converters into JSON-DOC
+- [x] Create JSONSchema for each block type.
+- [x] Implement converters into JSON-DOC
   - [ ] Multimodal-LLM based PDF/raster image -> JSON-DOC (Most important)
-  - [ ] HTML -> JSON-DOC
-  - [ ] DOCX -> JSON-DOC
-  - [ ] XLSX -> JSON-DOC
-  - [ ] PPTX -> JSON-DOC
-  - [ ] CSV -> JSON-DOC
-  - [ ] Google Docs -> JSON-DOC (lower priority compared to DOCX)
-  - [ ] Google Sheets -> JSON-DOC
-  - [ ] Google Slides -> JSON-DOC
+  - [x] HTML -> JSON-DOC
+  - [x] DOCX -> JSON-DOC
+  - [x] XLSX -> JSON-DOC
+  - [x] PPTX -> JSON-DOC
+  - [x] CSV -> JSON-DOC
+  - [x] Google Docs -> JSON-DOC (lower priority compared to DOCX)
+  - [x] Google Sheets -> JSON-DOC
+  - [x] Google Slides -> JSON-DOC
 - [ ] Implement converters from JSON-DOC
-  - [ ] JSON-DOC -> Markdown/plain text with tabular metadata for injecting into LLM context.
+  - [x] JSON-DOC -> Markdown/plain text with tabular metadata for injecting into LLM context.
   - [ ] Ability to reference, extract and render a certain table range. (Important for scrolling in spreadsheets)
 - [ ] Frontend for JSON-DOC
   - [ ] JavaScript renderer for JSON-DOC to render it in the browser.
@@ -121,5 +121,5 @@ These are not "official" blocks, but exist under the `rich_text` key in some blo
 ## Miscellaneous tasks
 
 - [ ] Make non-essential fields optional with default values to make JSON files smaller. Start with rich text fields.
-- [ ] Reserve jsondoc PyPI package name.
-- [ ] Buy a JSON-DOC domain. json-doc.org and json-doc.com are available.
+- [x] Reserve jsondoc PyPI package name. (Reserved python-jsondoc since PyPI rejected jsondoc)
+- [x] Buy a JSON-DOC domain. json-doc.org and json-doc.com are available.