v0.1.0 schemas #1

Merged: 18 commits, Aug 9, 2024
128 changes: 128 additions & 0 deletions .flake8
@@ -0,0 +1,128 @@
[flake8]
max-line-length = 88
max-complexity = 15
extend-ignore =
    # E101: Indentation contains mixed spaces and tabs
    E101,
    # E111: Indentation is not a multiple of four
    E111,
    # E112: Expected an indented block
    E112,
    # E113: Unexpected indentation
    E113,
    # E114: Indentation is not a multiple of four (comment)
    E114,
    # E115: Expected an indented block (comment)
    E115,
    # E116: Unexpected indentation (comment)
    E116,
    # E117: Over-indented
    E117,
    # E121: Continuation line under-indented for hanging indent
    E121,
    # E122: Continuation line missing indentation or outdented
    E122,
    # E123: Closing bracket does not match indentation of opening bracket's line
    E123,
    # E124: Closing bracket does not match visual indentation
    E124,
    # E125: Continuation line with same indent as next logical line
    E125,
    # E126: Continuation line over-indented for hanging indent
    E126,
    # E127: Continuation line over-indented for visual indent
    E127,
    # E128: Continuation line under-indented for visual indent
    E128,
    # E129: Visually indented line with same indent as next logical line
    E129,
    # E131: Continuation line unaligned for hanging indent
    E131,
    # E133: Closing bracket is missing indentation
    E133,
    # E201: Whitespace after '('
    E201,
    # E202: Whitespace before ')'
    E202,
    # E203: Whitespace before ':'
    E203,
    # E211: Whitespace before '('
    E211,
    # E221: Multiple spaces before operator
    E221,
    # E222: Multiple spaces after operator
    E222,
    # E223: Tab before operator
    E223,
    # E224: Tab after operator
    E224,
    # E225: Missing whitespace around operator
    E225,
    # E226: Missing whitespace around arithmetic operator
    E226,
    # E227: Missing whitespace around bitwise or shift operator
    E227,
    # E228: Missing whitespace around modulo operator
    E228,
    # E231: Missing whitespace after ',', ';', or ':'
    E231,
    # E241: Multiple spaces after ','
    E241,
    # E242: Tab after ','
    E242,
    # E251: Unexpected spaces around keyword / parameter equals
    E251,
    # E261: At least two spaces before inline comment
    E261,
    # E262: Inline comment should start with '# '
    E262,
    # E265: Block comment should start with '# '
    E265,
    # E266: Too many leading '#' for block comment
    E266,
    # E271: Multiple spaces after keyword
    E271,
    # E272: Multiple spaces before keyword
    E272,
    # E273: Tab after keyword
    E273,
    # E274: Tab before keyword
    E274,
    # E275: Missing whitespace after keyword
    E275,
    # E301: Expected 1 blank line, found 0
    E301,
    # E302: Expected 2 blank lines, found 0
    E302,
    # E303: Too many blank lines (3)
    E303,
    # E304: Blank lines found after function decorator
    E304,
    # E305: Expected 2 blank lines after end of function or class
    E305,
    # E306: Expected 1 blank line before a nested definition
    E306,
    # E401: Multiple imports on one line
    E401,
    # E704: Multiple statements on one line (def)
    E704,
    # W191: Indentation contains tabs
    W191,
    # W291: Trailing whitespace
    W291,
    # W292: No newline at end of file
    W292,
    # W293: Blank line contains whitespace
    W293,
    # W391: Blank line at end of file
    W391,
    # W503: Line break before binary operator
    W503,
    # W504: Line break after binary operator
    W504,
    # F401: Imported but unused
    F401,
    # F841: Local variable is assigned to but never used
    F841
53 changes: 53 additions & 0 deletions .github/workflows/test.yaml
@@ -0,0 +1,53 @@
name: Test and Validate

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Check out repository
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.11'

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          version: 1.5.0
          virtualenvs-create: true
          virtualenvs-in-project: true

      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/cache@v2
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}

      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        run: poetry install --no-interaction

      - name: Run tests
        run: |
          source .venv/bin/activate
          python tests/run_validation_tests.py schema

      # - name: Upload test results
      #   uses: actions/upload-artifact@v2
      #   with:
      #     name: test-results
      #     path: test-results  # Adjust this path if your tests output results to a different directory


5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.env
.DS_Store
*.pdf
*.png
__pycache__
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.11
16 changes: 15 additions & 1 deletion README.md
@@ -1,2 +1,16 @@
# JSON-DOC
JSON-DOC is a block-based document file format and data model.

JSON-DOC is a simple and flexible format for storing structured content in JSON files. It is designed to support a wide variety of content types and use cases, such as paragraphs, headings, lists, tables, images, code blocks, HTML and more.

JSON-DOC is an attempt to standardize the data model used by [Notion](https://notion.so).

## Features

- Documents are represented as a list of blocks
- Each block is a JSON object
- Each block has a unique identifier, derived by hashing its RFC 8785 canonical JSON form
- Support for nested blocks
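
As a rough illustration of the hashed-identifier feature, the sketch below derives an ID from a block's JSON content. It is not the project's implementation: the function name `block_id` and the choice of SHA-256 are assumptions, and `json.dumps` with sorted keys and compact separators only approximates RFC 8785 (the RFC additionally fixes number formatting and sorts keys by UTF-16 code units).

```python
import hashlib
import json


def block_id(block: dict) -> str:
    """Return a hex digest of an approximately canonical JSON encoding.

    Hypothetical helper: SHA-256 and this encoding are assumptions,
    not something the README specifies.
    """
    # Sorted keys and compact separators make the encoding independent of
    # key insertion order. This matches RFC 8785 only for simple payloads
    # (ASCII keys; strings, booleans, nulls and small integers as values).
    canonical = json.dumps(
        block, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = {"type": "paragraph", "paragraph": {"rich_text": []}}
second = {"paragraph": {"rich_text": []}, "type": "paragraph"}
print(block_id(first) == block_id(second))
```

Because the digest depends only on content, two blocks with the same fields in a different key order receive the same identifier.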

## Motivation

TBD
64 changes: 64 additions & 0 deletions docs/notes-on-notion.md
@@ -0,0 +1,64 @@
# Reverse Engineering Notion Data Model and API

## UUIDs

Notion uses UUIDs (v4) as the ID of each object. We could possibly improve on this by:

- Using TypeIDs, which improve the readability and attribution of IDs.
- Using an ID format that is more efficient for database indices.
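
The examples in these notes show the same page ID both with and without dashes. A minimal sketch, using only Python's standard `uuid` module, of normalizing either spelling to one form (the helper name `normalize_id` is hypothetical):

```python
import uuid


def normalize_id(raw: str) -> str:
    """Return the dashed, lowercase form of a UUID given with or without dashes."""
    return str(uuid.UUID(raw))


page_id = uuid.UUID("8d7dbc6b5c554589826c1352450db04e")
print(normalize_id("8d7dbc6b5c554589826c1352450db04e"))
print(page_id.version)  # version field encoded in the UUID itself
```

Normalizing IDs before comparing or indexing them avoids treating the two spellings as different objects.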

## Blocks

A `Block` is (literally) the primary building block of documents in Notion.

See: https://developers.notion.com/reference/block

A `Block` is a container that allows stacking and nesting of various content types that Notion supports. In that way, it is a meta-object. It does not contain content itself, but it represents the relationship between content objects.

An example Block of type `child_database`:

```json
{
  "object": "block",
  "id": "91589676-9cab-40dd-8ace-52f31a225d0a",
  "parent": {
    "type": "page_id",
    "page_id": "8d7dbc6b-5c55-4589-826c-1352450db04e"
  },
  "created_by": {
    "object": "user",
    "id": "b9eb2a95-ab37-462d-b6ff-ff84080051f0"
  },
  "created_time": "2024-05-28T20:28:00.000Z",
  "last_edited_time": "2024-05-28T20:29:00.000Z",
  "last_edited_by": {
    "object": "user",
    "id": "b9eb2a95-ab37-462d-b6ff-ff84080051f0"
  },
  "has_children": false,
  "archived": false,
  "in_trash": false,
  "type": "child_database",
  "child_database": {
    "title": "Example database"
  }
}
```

### `type` field

The `type` field specifies what kind of content a block represents. The content is then contained in the corresponding field of the block object. For example, if the `type` field is `code`, the content is in the `code` field.
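
The lookup described here can be sketched in a few lines of Python. The helper name `block_content` is hypothetical; the block shape follows the `child_database` example in these notes.

```python
def block_content(block: dict) -> dict:
    """Return the payload stored under the key named by the block's `type`."""
    block_type = block["type"]
    # The content lives in the field whose name equals the type value.
    return block[block_type]


example = {
    "object": "block",
    "type": "child_database",
    "child_database": {"title": "Example database"},
}
print(block_content(example)["title"])
```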


## Pages

A `Page` is not a block, but a container for blocks. Pages can exist independently and contain other pages and blocks, creating a hierarchical structure. Blocks exist within pages (or other blocks) and do not have the capability to contain pages.

```json
{
  "id": "8d7dbc6b5c554589826c1352450db04e",
  "type": "page",
  "properties": {...},
  "children": [...]
}
```
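
To illustrate the hierarchy, here is a small sketch that walks a page and its nested blocks depth-first. It assumes, as in the page example, that nested content sits under a `children` key at every level; the function name `iter_blocks` and the sample data are hypothetical.

```python
from typing import Iterator


def iter_blocks(node: dict, depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (depth, type) for every block nested under `node`, depth-first."""
    for child in node.get("children", []):
        yield depth, child["type"]
        # Recurse into blocks that themselves contain blocks.
        yield from iter_blocks(child, depth + 1)


page = {
    "type": "page",
    "children": [
        {"type": "heading_1"},
        {
            "type": "column_list",
            "children": [
                {"type": "column", "children": [{"type": "paragraph"}]},
            ],
        },
    ],
}

for depth, block_type in iter_blocks(page):
    print("  " * depth + block_type)
```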
118 changes: 118 additions & 0 deletions docs/roadmap.md
@@ -0,0 +1,118 @@
---
author: "Onur Solmaz<[email protected]>"
date: 2024-08-01
title: "JSON-DOC"
---

# JSON-DOC Implementation Roadmap

- [ ] Create JSONSchema for each block type.
- [ ] Implement converters into JSON-DOC
  - [ ] Multimodal-LLM based PDF/raster image -> JSON-DOC (Most important)
  - [ ] HTML -> JSON-DOC
  - [ ] DOCX -> JSON-DOC
  - [ ] XLSX -> JSON-DOC
  - [ ] PPTX -> JSON-DOC
  - [ ] CSV -> JSON-DOC
  - [ ] Google Docs -> JSON-DOC (lower priority compared to DOCX)
  - [ ] Google Sheets -> JSON-DOC
  - [ ] Google Slides -> JSON-DOC
- [ ] Implement converters from JSON-DOC
  - [ ] JSON-DOC -> Markdown/plain text with tabular metadata for injecting into LLM context.
  - [ ] Ability to reference, extract and render a certain table range. (Important for scrolling in spreadsheets)
- [ ] Frontend for JSON-DOC
  - [ ] JavaScript renderer for JSON-DOC to render it in the browser.
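
As a sketch of what a JSON-DOC -> Markdown converter could look like, the example below handles only paragraphs and headings. It assumes blocks follow Notion's shape, where the text sits in a `rich_text` list whose items carry a `plain_text` field; the function names are hypothetical and the final JSON-DOC schema may differ.

```python
# Markdown prefixes for the heading block types listed in this roadmap.
HEADING_PREFIXES = {"heading_1": "# ", "heading_2": "## ", "heading_3": "### "}


def plain_text(block: dict) -> str:
    """Concatenate the plain text of a block's rich text items."""
    rich_text = block[block["type"]].get("rich_text", [])
    return "".join(part.get("plain_text", "") for part in rich_text)


def to_markdown(blocks: list[dict]) -> str:
    """Render paragraph and heading blocks as Markdown; skip other types."""
    lines = []
    for block in blocks:
        block_type = block["type"]
        if block_type in HEADING_PREFIXES:
            lines.append(HEADING_PREFIXES[block_type] + plain_text(block))
        elif block_type == "paragraph":
            lines.append(plain_text(block))
    return "\n\n".join(lines)


doc = [
    {"type": "heading_1", "heading_1": {"rich_text": [{"plain_text": "Title"}]}},
    {
        "type": "paragraph",
        "paragraph": {
            "rich_text": [{"plain_text": "Hello "}, {"plain_text": "world"}]
        },
    },
]
print(to_markdown(doc))
```

Keeping the per-type rendering in a lookup table makes it straightforward to add list items, code blocks and tables later.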

# JSON-DOC Schema

We will implement a JSONSchema for a Notion page and each block type.

## Page

- [x] Page block

See https://developers.notion.com/reference/block for the authoritative Notion specification.

## Blocks

### Rich text (See https://developers.notion.com/reference/rich-text)

These are not "official" blocks, but exist under the `rich_text` key in some blocks.

- [x] `type: text`
- [x] `type: equation`
  - Inline equations.
  - Will be rendered using KaTeX on the client side.
- [ ] ~~`type: mention`~~
  - Won't implement for now

### Other text-type blocks

- [x] `type: paragraph`
- [x] `type: heading_1`
- [x] `type: heading_2`
- [x] `type: heading_3`
- [x] `type: code`
- [x] `type: equation`
  - Block-level equations.
- [x] `type: quote`
- [ ] ~~`type: callout`~~
  - Won't implement for now

### List item blocks

- [x] `type: bulleted_list_item`
- [x] `type: numbered_list_item`
- [x] `type: to_do`

### Table blocks

- [x] `type: table`
- [x] `type: table_row`

### Non-text blocks

- [x] `type: image`
- [ ] ~~`type: file`~~
  - Won't implement for now
- [ ] ~~`type: pdf`~~
  - Won't implement for now
- [ ] ~~`type: embed`~~
  - Won't implement for now
- [ ] ~~`type: video`~~
  - Won't implement for now


### Page/Container type blocks

- [x] `type: column`
- [x] `type: column_list`
- [ ] `type: table_of_contents`
- [ ] `type: child_page`
- [x] `type: divider`
  - Might implement, might not be necessary for the current document conversion use case
- [ ] `type: synced_block`
  - Might implement, ditto
- [ ] ~~`type: toggle`~~
  - Won't implement

### Link-related blocks

- [ ] ~~`type: link_preview`~~
  - Won't implement
- [ ] ~~`type: link_to_page`~~
  - Won't implement
- [ ] ~~`type: bookmark`~~
  - Won't implement

### Notion-specific blocks

- [ ] `type: child_database`
- [ ] ~~`type: breadcrumb`~~
  - Won't implement
- [ ] ~~`type: unsupported`~~
  - Meta block, not needed

### Deprecated blocks

- ~~`type: template`~~