This repository provides a Dockerfile to build a containerized environment for DeepDoctection. The image is based on NVIDIA CUDA 12.8 and comes pre-configured with PyTorch, Detectron2, and other necessary dependencies.
- Dockerfile Breakdown
- Building the Docker Image
- Running the Docker Container
- Getting a Docker image from Docker Hub for the latest published release
- Pulling images from Docker Hub
- Starting a container with docker compose
The Dockerfile consists of several key steps:
```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04
```

The image is built on NVIDIA CUDA 12.8, ensuring compatibility with GPU acceleration and deep learning frameworks.
```dockerfile
ARG USERNAME=developer
ARG USER_UID=1001
ARG USER_GID=1001
```

We define a non-root user (`developer`) with a configurable UID and GID.
```dockerfile
ARG DEEPDOCTECTION_VERSION=1.0.7
ENV DEEPDOCTECTION_VERSION=${DEEPDOCTECTION_VERSION}
```

The DeepDoctection version is set as an environment variable, making it easier to update the version.
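Because the version is a build argument, it can be overridden at build time without editing the Dockerfile. A sketch (the version and tag values are illustrative):

```bash
# Build with a different DeepDoctection version than the default
docker build --build-arg DEEPDOCTECTION_VERSION=1.0.7 \
  -t deepdoctection/base:1.0.0 -f Dockerfile .
```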
```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3-pip python3-dev python3-venv \
    git curl sudo \
    libsm6 libxext6 libxrender-dev \
    ninja-build && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
```

These system dependencies are required for Python, PyTorch, and Detectron2.
```dockerfile
RUN if ! getent group $USER_GID; then groupadd --gid $USER_GID $USERNAME; fi && \
    if ! id -u $USER_UID > /dev/null 2>&1; then useradd --uid $USER_UID --gid $USER_GID -m -s /bin/bash $USERNAME; fi && \
    echo "$USERNAME ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/$USERNAME && \
    chmod 0440 /etc/sudoers.d/$USERNAME
```

This ensures the `developer` user is created (only if the UID/GID do not already exist) and granted passwordless sudo.
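Since the UID and GID are build arguments, you can match them to your host user so that files written to mounted volumes have the right ownership. A sketch (the image tag is illustrative):

```bash
# Build the image with the host user's UID/GID to avoid
# permission problems on bind-mounted directories
docker build \
  --build-arg USER_UID=$(id -u) \
  --build-arg USER_GID=$(id -g) \
  -t deepdoctection/base:1.0.0 -f Dockerfile .
```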
```dockerfile
USER $USERNAME
WORKDIR /home/$USERNAME
ENV HOME="/home/$USERNAME"
```

The working directory is set, and the home directory is properly configured.
```dockerfile
RUN python3 -m venv $HOME/venv && \
    echo "source $HOME/venv/bin/activate" >> $HOME/.bashrc
```

This ensures that all Python dependencies are installed within a virtual environment, which is activated automatically in interactive shells.
```dockerfile
RUN /bin/bash -c "source $HOME/venv/bin/activate && \
    pip install --no-cache-dir uv"
```

`uv` is a modern package manager used for fast and efficient dependency installation.
```dockerfile
RUN /bin/bash -c "source $HOME/venv/bin/activate && \
    uv pip install --no-cache-dir uv wheel ninja && \
    uv pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121"
```

PyTorch and its dependencies are installed with CUDA 12.1 support; the `cu121` wheels bundle their own CUDA runtime, so they also run on the CUDA 12.8 base image.
```dockerfile
RUN /bin/bash -c "source $HOME/venv/bin/activate && \
    uv pip install --no-cache-dir 'detectron2 @ git+https://github.com/deepdoctection/detectron2.git' --no-build-isolation"
```

Detectron2 is installed from a specific GitHub repository.
```dockerfile
RUN /bin/bash -c "source $HOME/venv/bin/activate && \
    uv pip install --no-cache-dir deepdoctection[full]==$DEEPDOCTECTION_VERSION && \
    uv pip install --no-cache-dir opencv-python"
```

DeepDoctection and OpenCV are installed for document processing tasks.
```dockerfile
SHELL ["/bin/bash", "-c"]
CMD ["bash"]
```

The default shell is set to bash, and the container starts with an interactive shell.
To build the Docker image, use the following command:

```bash
docker build -t deepdoctection/base:1.0.0 -f Dockerfile .
```

This will create an image tagged as `deepdoctection/base:1.0.0`.
To start a container from the built image, run:

```bash
docker run --gpus all -it --rm \
  -v deepdoctection_cache:/home/developer/.cache/deepdoctection \
  deepdoctection/base:1.0.0
```

This enables GPU support and mounts a named volume for the model cache, so downloaded weights survive container restarts.
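Once inside the container (the virtual environment is activated by `.bashrc`), you can sanity-check that PyTorch sees the GPU. A sketch, assuming the image was built as above:

```bash
# Inside the running container: should print True if the GPU is visible
python3 -c "import torch; print(torch.cuda.is_available())"
```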
With the release of version v0.27.0, we started providing Docker images for the full installation, because the requirements and dependencies are complex and even building the Docker images yourself can be difficult.
```bash
docker pull deepdoctection/deepdoctection:<release_tag>
```
The container can be started with the `docker run` command shown above.
We provide a docker-compose.yaml file to start an image pulled from the hub. To use it, first replace the image argument with the tag you want to use. Second, set the two environment variables in the .env file:

- `CACHE_HOST`: Model weights and configuration files, as well as potentially datasets, are not baked into the image at build time but mounted into the container with volumes. For a local installation, this is usually `~/.cache/deepdoctection`.
- `WORK_DIR`: A temporary directory where documents can be loaded; it is also mounted into the container.
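A minimal `.env` might look like this (the paths are illustrative; replace them with locations on your machine, and note that `~` is typically not expanded in `.env` files, so use absolute paths):

```bash
# .env — host paths mounted into the container
CACHE_HOST=/home/you/.cache/deepdoctection
WORK_DIR=/home/you/work
```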
The container can be started as usual, for example, with:

```bash
docker compose up -d
```
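To reach a shell in the running container, you can use `docker compose exec`; the service name `deepdoctection` here is an assumption — check the docker-compose.yaml for the actual name:

```bash
# Open an interactive shell in the running service
docker compose exec deepdoctection bash
```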
Using the interpreter of the container, you can then run something like this:
```python
import deepdoctection as dd

if __name__ == "__main__":
    analyzer = dd.get_dd_analyzer()
    df = analyzer.analyze(path="/home/files/your_doc.pdf")
    df.reset_state()
    for dp in df:
        print(dp.text)
```