[Win][Config] Enable full support of UnstructuredIO API features on Windows #2

Merged
merged 17 commits into from
Jul 11, 2024
17 commits
fbdc6af
build(deps): version bumps for maintenance (#424)
christinestraub Jun 14, 2024
2bdd52a
build: replace rockylinux with chainguard/wolfi as a base image (#423)
christinestraub Jun 17, 2024
e8c6fa9
fix: build and push workflow failing due to missing `-f` option `buil…
micmarty-deepsense Jun 20, 2024
80a6627
fix: update SHA for the base images (both architectures) after `base-…
micmarty-deepsense Jun 21, 2024
d3564b6
fix: revert to rockylinux SHA that works (arm64) (#428)
micmarty-deepsense Jun 21, 2024
5b604b2
fix: re-add `DOCKER_IMAGE` env var in `Test image` step (#429)
micmarty-deepsense Jun 21, 2024
2f482e8
fix: invalid env var setting in `docker-publish` workflow (#430)
micmarty-deepsense Jun 21, 2024
d7acffe
fix: `docker-publish` workflow failing on main due to inexisting `ARC…
micmarty-deepsense Jun 21, 2024
d5a878f
build(deps): bump dependency versions (#434)
MthwRobinson Jun 24, 2024
6710df0
fix/Fix MS Office filetype errors and harden docker smoketest (#436)
awalker4 Jun 28, 2024
8ecc097
Merge branch 'main' of https://github.com/EmbeddedLLM/unstructured-ap…
tjtanaa Jul 9, 2024
5cb0857
Merge branch 'main' into merge-main-tj
tjtanaa Jul 9, 2024
53a22e9
merge main; validated format: xml, txv, csv, xml, json, html, docs, d…
tjtanaa Jul 9, 2024
7632778
compilable setting
tjtanaa Jul 11, 2024
6872590
update Windows markdown
tjtanaa Jul 11, 2024
1f4895c
Disable debug mode in `unstructuredio_api.spec`
tjtanaa Jul 11, 2024
2c01236
Enable pdf test case in `test_app.py`
tjtanaa Jul 11, 2024
6 changes: 6 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -112,3 +112,9 @@ jobs:
        source .venv/bin/activate
        make docker-build
        make docker-test
    - name: Scan image
      uses: anchore/scan-action@v3
      with:
        image: "pipeline-family-${{ env.PIPELINE_FAMILY }}-dev"
        # NOTE(robinson) - revert this to medium when we bump libreoffice
        severity-cutoff: high
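
The `severity-cutoff: high` setting in the scan step above acts as a threshold. The sketch below illustrates that idea in Python. It assumes the five severity names used by `anchore/scan-action` (negligible, low, medium, high, critical) and that the step fails when any finding is at or above the cutoff. `SEVERITY_ORDER` and `fails_scan` are hypothetical names for illustration, not part of the action.

```python
# Hypothetical illustration of a severity cutoff; not the scan action's code.
# Assumed ordering, from least to most severe.
SEVERITY_ORDER = ["negligible", "low", "medium", "high", "critical"]


def fails_scan(found_severities, cutoff):
    """Return True if any finding is at or above the cutoff severity."""
    threshold = SEVERITY_ORDER.index(cutoff)
    return any(SEVERITY_ORDER.index(s) >= threshold for s in found_severities)


if __name__ == "__main__":
    findings = ["low", "medium"]
    # With the cutoff at "high" these findings pass; at "medium" they would not.
    print(fails_scan(findings, "high"))
    print(fails_scan(findings, "medium"))
```

Under these assumptions, raising the cutoff from `medium` to `high` lets medium-severity findings through, which matches the intent of the `NOTE(robinson)` comment about reverting once libreoffice is bumped.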
22 changes: 10 additions & 12 deletions .github/workflows/docker-publish.yml
@@ -45,16 +45,17 @@ jobs:
   build-images:
     strategy:
       matrix:
-        docker-platform: ["linux/arm64", "linux/amd64"]
+        arch: ["arm64", "amd64"]
     runs-on: ubuntu-latest-m
     needs: [setup, set-short-sha]
     env:
       SHORT_SHA: ${{ needs.set-short-sha.outputs.short_sha }}
+      DOCKER_PLATFORM: linux/${{ matrix.arch }}
     steps:
     - name: Set up Docker Buildx
       uses: docker/setup-buildx-action@v3
       with:
-        driver: ${{ matrix.docker-platform == 'linux/amd64' && 'docker' || 'docker-container' }}
+        driver: ${{ matrix.arch == 'amd64' && 'docker' || 'docker-container' }}
     - name: Checkout code
       uses: actions/checkout@v4
     - name: Login to Quay.io
@@ -68,15 +69,15 @@ jobs:
         # Clear some space (https://github.com/actions/runner-images/issues/2840)
         sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/share/boost

-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
-        DOCKER_BUILDKIT=1 docker buildx build --platform=$ARCH --load \
+        DOCKER_BUILDKIT=1 docker buildx build --load -f Dockerfile-${{ matrix.arch }} \
+          --platform=$DOCKER_PLATFORM \
           --build-arg PIP_VERSION=$PIP_VERSION \
           --build-arg BUILDKIT_INLINE_CACHE=1 \
           --build-arg PIPELINE_PACKAGE=${{ env.PIPELINE_FAMILY }} \
           --provenance=false \
           --progress plain \
-          --cache-from $DOCKER_BUILD_REPOSITORY:$ARCH \
-          -t $DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA .
+          --cache-from $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }} \
+          -t $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA .
     - name: Set virtualenv cache
       uses: actions/cache@v4
       id: virtualenv-cache
@@ -88,20 +89,17 @@ jobs:
       uses: docker/setup-qemu-action@v3
     - name: Test image
       run: |
-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
         source .venv/bin/activate
-        if [ "${{ matrix.docker-platform }}" == "linux/arm64" ]; then
-          DOCKER_PLATFORM="${{ matrix.docker-platform }}" DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA" \
+        export DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA"
+        if [ "$DOCKER_PLATFORM" == "linux/arm64" ]; then
           SKIP_INFERENCE_TESTS=true make docker-test
         else
-          DOCKER_PLATFORM="${{ matrix.docker-platform }}" DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA" \
           make docker-test
         fi
     - name: Push image
       run: |
         # write to the build repository to cache for the publish-images job
-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
-        docker push $DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA
+        docker push $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA
   publish-images:
     runs-on: ubuntu-latest-m
     needs: [setup, set-short-sha, build-images]
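
The workflow change above replaces the `docker-platform` matrix value with a bare `arch` value and derives everything else from it. A minimal Python sketch of that derivation follows. The function name `build_plan` and the example repository and SHA are hypothetical; only the mapping rules are taken from the diff.

```python
# Hypothetical helper mirroring how the workflow derives settings from `arch`.
def build_plan(arch: str, repository: str, short_sha: str) -> dict:
    """Return the per-architecture build settings used by the build-images job."""
    if arch not in ("arm64", "amd64"):
        raise ValueError(f"unsupported arch: {arch}")
    platform = f"linux/{arch}"  # DOCKER_PLATFORM: linux/${{ matrix.arch }}
    return {
        "platform": platform,
        # -f Dockerfile-${{ matrix.arch }}
        "dockerfile": f"Dockerfile-{arch}",
        # matrix.arch == 'amd64' && 'docker' || 'docker-container'
        "buildx_driver": "docker" if arch == "amd64" else "docker-container",
        # --cache-from $DOCKER_BUILD_REPOSITORY:<arch>
        "cache_from": f"{repository}:{arch}",
        # -t $DOCKER_BUILD_REPOSITORY:<arch>-$SHORT_SHA
        "image": f"{repository}:{arch}-{short_sha}",
        # The arm64 test run sets SKIP_INFERENCE_TESTS=true
        "skip_inference_tests": platform == "linux/arm64",
    }


if __name__ == "__main__":
    # Repository name and SHA below are placeholders, not values from the PR.
    for arch in ("arm64", "amd64"):
        print(build_plan(arch, "registry.example/build-repo", "abc1234"))
```

Deriving the platform, Dockerfile name and image tag from one matrix value removes the repeated `cut -d "/" -f2` parsing that the old steps needed.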
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,16 @@
## 0.0.72

* Fix certain filetypes failing mimetype lookup in the new base image

## 0.0.71

* replace rockylinux with chainguard/wolfi as a base image for `amd64`

## 0.0.70

* Bump to `unstructured` 0.14.6
* Bump to `unstructured-inference` 0.7.35

## 0.0.69

* Bump to `unstructured` 0.14.4
43 changes: 43 additions & 0 deletions Dockerfile-amd64
@@ -0,0 +1,43 @@
# syntax=docker/dockerfile:experimental
FROM quay.io/unstructured-io/base-images:wolfi-base@sha256:7c3af225a39f730f4feee705df6cd8d1570739dc130456cf589ac53347da0f1d as base

# NOTE(crag): NB_USER ARG for mybinder.org compat:
# https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
ARG NB_USER=notebook-user
ARG NB_UID=1000
ARG PIP_VERSION
ARG PIPELINE_PACKAGE
ARG PYTHON_VERSION="3.11"

# Set up environment
ENV PYTHON python${PYTHON_VERSION}
ENV PIP ${PYTHON} -m pip

WORKDIR ${HOME}
USER ${NB_USER}

ENV PYTHONPATH="${PYTHONPATH}:${HOME}"
ENV PATH="/home/${NB_USER}/.local/bin:${PATH}"

FROM base as python-deps
COPY --chown=${NB_USER}:${NB_USER} requirements/base.txt requirements-base.txt
RUN ${PIP} install pip==${PIP_VERSION}
RUN ${PIP} install --no-cache -r requirements-base.txt

FROM python-deps as model-deps
RUN ${PYTHON} -c "import nltk; nltk.download('punkt')" && \
  ${PYTHON} -c "import nltk; nltk.download('averaged_perceptron_tagger')" && \
  ${PYTHON} -c "from unstructured.partition.model_init import initialize; initialize()"

FROM model-deps as code
COPY --chown=${NB_USER}:${NB_USER} CHANGELOG.md CHANGELOG.md
COPY --chown=${NB_USER}:${NB_USER} logger_config.yaml logger_config.yaml
COPY --chown=${NB_USER}:${NB_USER} prepline_${PIPELINE_PACKAGE}/ prepline_${PIPELINE_PACKAGE}/
COPY --chown=${NB_USER}:${NB_USER} exploration-notebooks exploration-notebooks
COPY --chown=${NB_USER}:${NB_USER} scripts/app-start.sh scripts/app-start.sh

ENTRYPOINT ["scripts/app-start.sh"]
# Expose a default port of 8000. Note: The EXPOSE instruction does not actually publish the port,
# but some tooling will inspect containers and perform work contingent on networking support declared.

EXPOSE 8000
2 changes: 1 addition & 1 deletion Dockerfile → Dockerfile-arm64
@@ -46,4 +46,4 @@ COPY --chown=${NB_USER}:${NB_USER} scripts/app-start.sh scripts/app-start.sh
 ENTRYPOINT ["scripts/app-start.sh"]
 # Expose a default port of 8000. Note: The EXPOSE instruction does not actually publish the port,
 # but some tooling will inspect containers and perform work contingent on networking support declared.
-EXPOSE 8000
+EXPOSE 8000
52 changes: 52 additions & 0 deletions _internal/config/logger_config.yaml
@@ -0,0 +1,52 @@
version: 1
disable_existing_loggers: False
formatters:
  default_format:
    "()": uvicorn.logging.DefaultFormatter
    format: '%(asctime)s %(name)s %(levelname)s %(message)s'
  access:
    "()": uvicorn.logging.AccessFormatter
    format: '%(asctime)s %(client_addr)s %(request_line)s - %(status_code)s'
handlers:
  access_handler:
    formatter: access
    class: logging.StreamHandler
    stream: ext://sys.stderr
  standard_handler:
    formatter: default_format
    class: logging.StreamHandler
    stream: ext://sys.stderr
loggers:
  uvicorn.error:
    level: INFO
    handlers:
      - standard_handler
    propagate: no
    # disable logging for uvicorn.error by not having a handler
  uvicorn.access:
    level: INFO
    handlers:
      - access_handler
    propagate: no
    # disable logging for uvicorn.access by not having a handler
  unstructured:
    level: INFO
    handlers:
      - standard_handler
    propagate: no
  unstructured.trace:
    level: CRITICAL
    handlers:
      - standard_handler
    propagate: no
  unstructured_inference:
    level: DEBUG
    handlers:
      - standard_handler
    propagate: no
  unstructured_api:
    level: DEBUG
    handlers:
      - standard_handler
    propagate: no
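
The logger configuration above follows the schema accepted by Python's `logging.config.dictConfig`. The sketch below shows how the per-logger `level` and `propagate` settings behave. It is a reduced, standard-library-only version and not the application's loading code: the uvicorn formatters are replaced with a plain format string, the stream is an in-memory buffer instead of `sys.stderr`, and the `configure` function is a hypothetical name. A YAML loader that follows YAML 1.1, such as PyYAML, reads `propagate: no` as the boolean `False` used here.

```python
import io
import logging
import logging.config


def configure(stream):
    """Apply a reduced version of the logger settings, writing to `stream`."""
    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            # Assumption: plain formatter instead of uvicorn.logging.DefaultFormatter.
            "default_format": {"format": "%(name)s %(levelname)s %(message)s"},
        },
        "handlers": {
            "standard_handler": {
                "class": "logging.StreamHandler",
                "formatter": "default_format",
                "stream": stream,
            },
        },
        "loggers": {
            "unstructured": {
                "level": "INFO", "handlers": ["standard_handler"], "propagate": False,
            },
            "unstructured.trace": {
                "level": "CRITICAL", "handlers": ["standard_handler"], "propagate": False,
            },
            "unstructured_api": {
                "level": "DEBUG", "handlers": ["standard_handler"], "propagate": False,
            },
        },
    })


if __name__ == "__main__":
    buffer = io.StringIO()
    configure(buffer)
    logging.getLogger("unstructured").info("partition finished")  # emitted
    logging.getLogger("unstructured").debug("verbose detail")     # below INFO, dropped
    logging.getLogger("unstructured.trace").error("trace event")  # below CRITICAL, dropped
    logging.getLogger("unstructured_api").debug("request parsed") # emitted
    print(buffer.getvalue(), end="")
```

Because every logger sets `propagate` to false and names its own handler, each record is written at most once, and the `CRITICAL` level on `unstructured.trace` silences that logger for nearly all messages.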