
Commit 392a12e

tjtanaa, christinestraub, micmarty-deepsense, MthwRobinson, and awalker4 authored
[Win][Config] Enable full support of UnstructuredIO API features on Windows (#2)
## PR Summary

1. Merged changes from upstream.
2. Update `unstructuredio_api.spec`.
3. Update `unstructuredio_api.py`.
4. Add additional setup dependencies to `docs/Windows.md`.

* build(deps): version bumps for maintenance (Unstructured-IO#424)

### Summary

Version bumps for regular maintenance and to address moderate CVEs from security scans.

- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`

* build: replace rockylinux with chainguard/wolfi as a base image (Unstructured-IO#423)

### Summary

Updates the Dockerfile to use the Chainguard wolfi-base image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing.

### Testing

Run `make docker-build` and `make docker-start-api`, then try:

```
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename=filename,
    api_url="http://localhost:8000/general/v0/general",
    api_key="<API-KEY>",
    strategy="hi_res",
)
print("\n\n".join([str(el) for el in elements]))
```

* fix: build and push workflow failing due to missing `-f` option `buildx build` command (Unstructured-IO#425)

I noticed that images on the main branch are failing to build (and push) because the `-f` parameter is missing from `docker buildx build`. By default it expects `Dockerfile` to exist, but we only have `Dockerfile-amd64` and `Dockerfile-arm64`.

![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/4527165a-909e-498d-b0ee-8bba4b1a13e4)

---------

Co-authored-by: christinestraub <[email protected]>

* fix: update SHA for the base images (both architectures) after `base-images` repo update (Unstructured-IO#427)

The build and publish CI steps are failing because the base images (and therefore their SHAs) have changed in quay.

![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fc4e9aac-0820-4c90-9ad9-68cc6d9aad03)
![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fafe2ca4-dab2-4610-a26b-a7a4d56723a5)

* fix: revert to rockylinux SHA that works (arm64) (Unstructured-IO#428)

Reverts an unnecessary SHA update introduced in Unstructured-IO#427.

* fix: re-add `DOCKER_IMAGE` env var in `Test image` step (Unstructured-IO#429)

A shell syntax error occurs in the docker-publish.yml workflow.

* fix: invalid env var setting in `docker-publish` workflow (Unstructured-IO#430)

A bug introduced in the previous PR causes a build failure on main.

* fix: `docker-publish` workflow failing on main due to nonexistent `ARCH` env var (Unstructured-IO#431)

* build(deps): bump dependency versions (Unstructured-IO#434)

### Summary

Bumps dependency versions for the API. Closes Unstructured-IO#432.

* fix/Fix MS Office filetype errors and harden docker smoketest (Unstructured-IO#436)

# Changes

**Fix for docx and other office files returning `{"detail":"File type None is not supported."}`**

After moving to the wolfi base image, the `mimetypes` lib no longer knows about these file extensions. To avoid issues like this, let's add an explicit mapping for all the file extensions we care about. I added a `filetypes.py` and moved `get_validated_mimetype` over. When this file is imported, we'll call `mimetypes.add_type` for all file extensions we support.

**Update smoke test coverage**

This bug snuck past because we were already providing the mimetype in the docker smoke test. I updated `test_happy_path` to test against the container with and without passing `content_type`. I added some missing filetypes, and sorted the test params by extension so we can see when new types are missing.

# Testing

The new smoke test will verify that all filetypes are working. You can also `make docker-build && make docker-start-api`, and test out the docx in the sample docs dir. On `main`, this file will give you the error above.

```
curl 'http://localhost:8000/general/v0/general' \
  --form 'files=@"fake.docx"'
```

* merge main; validated format: xml, txv, csv, xml, json, html, docs, docx, ppt, pptx, xlsx, xls, pdf

* compilable setting

* update Windows markdown

* Disable debug mode in `unstructuredio_api.spec`

* Enable pdf test case in `test_app.py`

---------

Co-authored-by: Christine Straub <[email protected]>
Co-authored-by: Michał Martyniak <[email protected]>
Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
1 parent 4c22810 commit 392a12e
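The mimetype fix described in the commit message (Unstructured-IO#436) can be illustrated with a minimal sketch. The message says only that `filetypes.py` calls `mimetypes.add_type` for every supported extension at import time; the mapping name `EXTENSION_TO_MIMETYPE`, the helper `register_mimetypes`, and the small subset of Office formats below are hypothetical and are not taken from the repository.

```python
import mimetypes

# Hypothetical subset of an explicit extension-to-mimetype mapping.
# The real filetypes.py may use different names and cover more formats.
EXTENSION_TO_MIMETYPE = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def register_mimetypes(mapping=EXTENSION_TO_MIMETYPE):
    """Register each extension so lookups do not depend on the base image's mime database."""
    for extension, mimetype in mapping.items():
        mimetypes.add_type(mimetype, extension)


# Run at import time, mirroring the behavior the commit message describes.
register_mimetypes()

# guess_type returns a (type, encoding) tuple; the type now comes from the explicit mapping.
docx_type, _ = mimetypes.guess_type("fake.docx")
```

Registering the types explicitly means a minimal base image without a system `mime.types` file should no longer cause the lookup to return `None` for these extensions.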

23 files changed: +965 -351 lines changed

.github/workflows/ci.yml

Lines changed: 6 additions & 0 deletions
@@ -112,3 +112,9 @@ jobs:
         source .venv/bin/activate
         make docker-build
         make docker-test
+    - name: Scan image
+      uses: anchore/scan-action@v3
+      with:
+        image: "pipeline-family-${{ env.PIPELINE_FAMILY }}-dev"
+        # NOTE(robinson) - revert this to medium when we bump libreoffice
+        severity-cutoff: high

.github/workflows/docker-publish.yml

Lines changed: 10 additions & 12 deletions
@@ -45,16 +45,17 @@ jobs:
   build-images:
     strategy:
       matrix:
-        docker-platform: ["linux/arm64", "linux/amd64"]
+        arch: ["arm64", "amd64"]
     runs-on: ubuntu-latest-m
     needs: [setup, set-short-sha]
     env:
       SHORT_SHA: ${{ needs.set-short-sha.outputs.short_sha }}
+      DOCKER_PLATFORM: linux/${{ matrix.arch }}
     steps:
     - name: Set up Docker Buildx
       uses: docker/setup-buildx-action@v3
       with:
-        driver: ${{ matrix.docker-platform == 'linux/amd64' && 'docker' || 'docker-container' }}
+        driver: ${{ matrix.arch == 'amd64' && 'docker' || 'docker-container' }}
     - name: Checkout code
       uses: actions/checkout@v4
     - name: Login to Quay.io
@@ -68,15 +69,15 @@ jobs:
         # Clear some space (https://github.com/actions/runner-images/issues/2840)
         sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/share/boost

-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
-        DOCKER_BUILDKIT=1 docker buildx build --platform=$ARCH --load \
+        DOCKER_BUILDKIT=1 docker buildx build --load -f Dockerfile-${{ matrix.arch }} \
+          --platform=$DOCKER_PLATFORM \
           --build-arg PIP_VERSION=$PIP_VERSION \
           --build-arg BUILDKIT_INLINE_CACHE=1 \
           --build-arg PIPELINE_PACKAGE=${{ env.PIPELINE_FAMILY }} \
           --provenance=false \
           --progress plain \
-          --cache-from $DOCKER_BUILD_REPOSITORY:$ARCH \
-          -t $DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA .
+          --cache-from $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }} \
+          -t $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA .
     - name: Set virtualenv cache
       uses: actions/cache@v4
       id: virtualenv-cache
@@ -88,20 +89,17 @@ jobs:
       uses: docker/setup-qemu-action@v3
     - name: Test image
       run: |
-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
         source .venv/bin/activate
-        if [ "${{ matrix.docker-platform }}" == "linux/arm64" ]; then
-          DOCKER_PLATFORM="${{ matrix.docker-platform }}" DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA" \
+        export DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA"
+        if [ "$DOCKER_PLATFORM" == "linux/arm64" ]; then
           SKIP_INFERENCE_TESTS=true make docker-test
         else
-          DOCKER_PLATFORM="${{ matrix.docker-platform }}" DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA" \
           make docker-test
         fi
     - name: Push image
       run: |
         # write to the build repository to cache for the publish-images job
-        ARCH=$(cut -d "/" -f2 <<< ${{ matrix.docker-platform }})
-        docker push $DOCKER_BUILD_REPOSITORY:$ARCH-$SHORT_SHA
+        docker push $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA
   publish-images:
     runs-on: ubuntu-latest-m
     needs: [setup, set-short-sha, build-images]
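The `docker-publish.yml` change derives the platform, Dockerfile, Buildx driver, and image tag from a single `arch` matrix value. The sketch below restates that mapping in Python purely as an illustration; the function `build_params` and its argument names are hypothetical, and the workflow itself computes these values with GitHub Actions expressions rather than Python.

```python
def build_params(arch: str, repository: str, short_sha: str) -> dict:
    """Return the per-architecture values the workflow derives from matrix.arch.

    Hypothetical helper for illustration only; it mirrors the expressions in the diff.
    """
    return {
        # DOCKER_PLATFORM: linux/${{ matrix.arch }}
        "platform": f"linux/{arch}",
        # -f Dockerfile-${{ matrix.arch }}
        "dockerfile": f"Dockerfile-{arch}",
        # matrix.arch == 'amd64' && 'docker' || 'docker-container'
        "driver": "docker" if arch == "amd64" else "docker-container",
        # --cache-from $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}
        "cache_from": f"{repository}:{arch}",
        # -t $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA
        "tag": f"{repository}:{arch}-{short_sha}",
    }


# Example values; the repository name and SHA here are placeholders.
params = {arch: build_params(arch, "example-repo", "abc1234") for arch in ("arm64", "amd64")}
```

Keying everything on the bare architecture name removes the repeated `cut -d "/" -f2` parsing of the old `docker-platform` value.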

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -1,3 +1,16 @@
+## 0.0.72
+
+* Fix certain filetypes failing mimetype lookup in the new base image
+
+## 0.0.71
+
+* replace rockylinux with chainguard/wolfi as a base image for `amd64`
+
+## 0.0.70
+
+* Bump to `unstructured` 0.14.6
+* Bump to `unstructured-inference` 0.7.35
+
 ## 0.0.69

 * Bump to `unstructured` 0.14.4

Dockerfile-amd64

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# syntax=docker/dockerfile:experimental
+FROM quay.io/unstructured-io/base-images:wolfi-base@sha256:7c3af225a39f730f4feee705df6cd8d1570739dc130456cf589ac53347da0f1d as base
+
+# NOTE(crag): NB_USER ARG for mybinder.org compat:
+#             https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
+ARG NB_USER=notebook-user
+ARG NB_UID=1000
+ARG PIP_VERSION
+ARG PIPELINE_PACKAGE
+ARG PYTHON_VERSION="3.11"
+
+# Set up environment
+ENV PYTHON python${PYTHON_VERSION}
+ENV PIP ${PYTHON} -m pip
+
+WORKDIR ${HOME}
+USER ${NB_USER}
+
+ENV PYTHONPATH="${PYTHONPATH}:${HOME}"
+ENV PATH="/home/${NB_USER}/.local/bin:${PATH}"
+
+FROM base as python-deps
+COPY --chown=${NB_USER}:${NB_USER} requirements/base.txt requirements-base.txt
+RUN ${PIP} install pip==${PIP_VERSION}
+RUN ${PIP} install --no-cache -r requirements-base.txt
+
+FROM python-deps as model-deps
+RUN ${PYTHON} -c "import nltk; nltk.download('punkt')" && \
+  ${PYTHON} -c "import nltk; nltk.download('averaged_perceptron_tagger')" && \
+  ${PYTHON} -c "from unstructured.partition.model_init import initialize; initialize()"
+
+FROM model-deps as code
+COPY --chown=${NB_USER}:${NB_USER} CHANGELOG.md CHANGELOG.md
+COPY --chown=${NB_USER}:${NB_USER} logger_config.yaml logger_config.yaml
+COPY --chown=${NB_USER}:${NB_USER} prepline_${PIPELINE_PACKAGE}/ prepline_${PIPELINE_PACKAGE}/
+COPY --chown=${NB_USER}:${NB_USER} exploration-notebooks exploration-notebooks
+COPY --chown=${NB_USER}:${NB_USER} scripts/app-start.sh scripts/app-start.sh
+
+ENTRYPOINT ["scripts/app-start.sh"]
+# Expose a default port of 8000. Note: The EXPOSE instruction does not actually publish the port,
+# but some tooling will inspect containers and perform work contingent on networking support declared.
+
+EXPOSE 8000

Dockerfile renamed to Dockerfile-arm64

Lines changed: 1 addition & 1 deletion
@@ -46,4 +46,4 @@ COPY --chown=${NB_USER}:${NB_USER} scripts/app-start.sh scripts/app-start.sh
 ENTRYPOINT ["scripts/app-start.sh"]
 # Expose a default port of 8000. Note: The EXPOSE instruction does not actually publish the port,
 # but some tooling will inspect containers and perform work contingent on networking support declared.
-EXPOSE 8000
+EXPOSE 8000

_internal/config/logger_config.yaml

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+version: 1
+disable_existing_loggers: False
+formatters:
+  default_format:
+    "()": uvicorn.logging.DefaultFormatter
+    format: '%(asctime)s %(name)s %(levelname)s %(message)s'
+  access:
+    "()": uvicorn.logging.AccessFormatter
+    format: '%(asctime)s %(client_addr)s %(request_line)s - %(status_code)s'
+handlers:
+  access_handler:
+    formatter: access
+    class: logging.StreamHandler
+    stream: ext://sys.stderr
+  standard_handler:
+    formatter: default_format
+    class: logging.StreamHandler
+    stream: ext://sys.stderr
+loggers:
+  uvicorn.error:
+    level: INFO
+    handlers:
+      - standard_handler
+    propagate: no
+  # disable logging for uvicorn.error by not having a handler
+  uvicorn.access:
+    level: INFO
+    handlers:
+      - access_handler
+    propagate: no
+  # disable logging for uvicorn.access by not having a handler
+  unstructured:
+    level: INFO
+    handlers:
+      - standard_handler
+    propagate: no
+  unstructured.trace:
+    level: CRITICAL
+    handlers:
+      - standard_handler
+    propagate: no
+  unstructured_inference:
+    level: DEBUG
+    handlers:
+      - standard_handler
+    propagate: no
+  unstructured_api:
+    level: DEBUG
+    handlers:
+      - standard_handler
+    propagate: no
+
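As a rough illustration of how the per-logger levels in `logger_config.yaml` behave, the sketch below applies a reduced version of the same structure with the standard library's `logging.config.dictConfig`. It is an assumption-laden simplification: the uvicorn formatter classes are replaced with the default `logging.Formatter` so the example runs without uvicorn, only three of the loggers are included, the YAML `propagate: no` is written as the Python value `False`, and the name `LOGGING_CONFIG` is hypothetical.

```python
import logging
import logging.config

# Reduced, hypothetical version of the YAML config expressed as a Python dict.
# The uvicorn formatters from the real file are swapped for the stdlib default.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "default_format": {
            "format": "%(asctime)s %(name)s %(levelname)s %(message)s",
        },
    },
    "handlers": {
        "standard_handler": {
            "formatter": "default_format",
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stderr",
        },
    },
    "loggers": {
        "unstructured": {
            "level": "INFO",
            "handlers": ["standard_handler"],
            "propagate": False,
        },
        "unstructured.trace": {
            "level": "CRITICAL",
            "handlers": ["standard_handler"],
            "propagate": False,
        },
        "unstructured_api": {
            "level": "DEBUG",
            "handlers": ["standard_handler"],
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

# DEBUG is enabled for unstructured_api, so this record reaches stderr.
logging.getLogger("unstructured_api").debug("debug message from the API logger")

# unstructured.trace only passes CRITICAL, so this ERROR record is dropped.
logging.getLogger("unstructured.trace").error("error message that is filtered out")
```

With `propagate` disabled, each logger's records go only to its own handler, which avoids duplicate lines from parent loggers.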
