archivebox/archivebox

Sponsored OSS

By ArchiveBox

Updated over 1 year ago

Official Docker image for the ArchiveBox self-hosted internet archiving tool.


archivebox/archivebox repository overview

ArchiveBox
Open-source self-hosted web archiving.


▶️ Quickstart | Demo | GitHub | Documentation | Info & Motivation | Community


   



ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline.

Without active preservation effort, everything on the internet eventually disappears or degrades. Archive.org does a great job as a centralized service, but saved URLs have to be public, and it can't save every type of content.

ArchiveBox is an open-source tool that lets organizations & individuals archive both public & private web content while retaining control over their data. It can be used to save copies of bookmarks, preserve evidence for legal cases, back up photos from FB/Insta/Flickr or media from YT/Soundcloud/etc., save research papers, and more...

➡️ Get ArchiveBox with pip install archivebox on Linux, macOS, and Windows (WSL2), or via Docker ⭐️.

Once installed, it can be used as a CLI tool, self-hosted Web App, Python library, or one-off command.




📥 You can feed ArchiveBox URLs one at a time, or schedule regular imports from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our Browser Extension, and more.
See Input Formats for a full list of supported input formats...
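
As a rough sketch of feeding in a batch of URLs: the helper below (`clean_url_list` is a hypothetical name, not an ArchiveBox command) tidies a plain-text list, one URL per line, by keeping only lines that start with http:// or https:// and dropping duplicates. The commented-out last line assumes archivebox is installed and that the current directory is an initialized collection.

```shell
# clean_url_list: hypothetical helper, not part of ArchiveBox.
# Reads a URL list on stdin; prints unique http(s) URLs in first-seen order.
clean_url_list() {
    grep -E '^https?://' | awk '!seen[$0]++'
}

# Small demonstration with an inline list (comment, blank line, and duplicate are dropped):
printf '%s\n' \
    'https://example.com' \
    '# a comment line' \
    '' \
    'https://example.com' \
    'https://example.org/page' \
  | clean_url_list

# Assumes archivebox is installed and you are inside an initialized data directory:
# clean_url_list < urls.txt | archivebox add
```

Piping into `archivebox add` works because the command accepts URLs on stdin; the cleanup step is optional and only keeps the import list tidy.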



It saves snapshots of the URLs you feed it in several redundant formats.
It also detects any content featured inside pages & extracts it out into a folder:

  • 🌐 HTML/Any websites ➡️ original HTML+CSS+JS, singlefile HTML, screenshot PNG, PDF, WARC, title, article text, favicon, headers, ...
  • 🎥 Social Media/News ➡️ post content TXT, comments, title, author, images, ...
  • 🎬 YouTube/SoundCloud/etc. ➡️ MP3/MP4s, subtitles, metadata, thumbnail, ...
  • 💾 GitHub/GitLab/etc. links ➡️ clone of Git source code, README, images, ...
  • and more, see Output Formats below...

You can run ArchiveBox as a Docker web app to manage these snapshots, or continue accessing the same collection using the pip-installed CLI, Python API, and SQLite3 APIs. All the ways of using it are equivalent, and provide matching features like adding tags, scheduling regular crawls, viewing logs, and more...



🛠️ ArchiveBox uses standard tools like Chrome, wget, & yt-dlp, and stores data in ordinary files & folders.
(no complex proprietary formats, all data is readable without needing to run ArchiveBox)
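
Because the data is plain files, ordinary shell tools can browse a collection. The sketch below assumes a layout of `<data dir>/archive/<timestamp>/index.json` for each snapshot; `list_snapshot_dirs` is a hypothetical helper, not an ArchiveBox command, and the layout may differ between versions.

```shell
# list_snapshot_dirs: hypothetical helper, not part of ArchiveBox.
# Assumes each snapshot lives in <data dir>/archive/<timestamp>/ and has an index.json.
list_snapshot_dirs() {
    data_dir="${1:-.}"
    for index in "$data_dir"/archive/*/index.json; do
        [ -f "$index" ] || continue    # skip the unexpanded glob when nothing matches
        dirname "$index"
    done
}

# Example (path is the one used in the install commands in this document):
# list_snapshot_dirs ~/archivebox/data
```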

The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.



📦  Install ArchiveBox using your preferred method: docker / pip / apt / etc. (see full Quickstart below).

  Expand for quick copy-pastable install commands...   ⤵️
# Option A: Get ArchiveBox with Docker Compose (recommended):
mkdir -p ~/archivebox/data && cd ~/archivebox
curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml   # edit options in this file as-needed
docker compose run archivebox init --setup
# docker compose run archivebox add 'https://example.com'
# docker compose run archivebox help
# docker compose up


# Option B: Or use it as a plain Docker container:
mkdir -p ~/archivebox/data && cd ~/archivebox/data
docker run -it -v $PWD:/data archivebox/archivebox init --setup
# docker run -it -v $PWD:/data archivebox/archivebox add 'https://example.com'
# docker run -it -v $PWD:/data archivebox/archivebox help
# docker run -it -v $PWD:/data -p 8000:8000 archivebox/archivebox

# Option C: Or install it with your preferred pkg manager (see Quickstart below for apt, brew, and more)
pip install archivebox
mkdir -p ~/archivebox/data && cd ~/archivebox/data
archivebox init --setup
# archivebox add 'https://example.com'
# archivebox help
# archivebox server 0.0.0.0:8000

# Option D: Or use the optional auto setup script to install it
curl -fsSL 'https://get.archivebox.io' | sh

Open http://localhost:8000 to see your server's Web UI ➡️
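
The server can take a moment to start, so a script that opens the Web UI may want to wait for it first. This is a minimal sketch: `retry` is a hypothetical helper, and the commented-out probe assumes curl is installed and the server listens on port 8000 as in the commands above.

```shell
# retry: hypothetical helper, not part of ArchiveBox.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times and succeeds as soon as COMMAND does.
retry() {
    attempts="$1"; delay="$2"; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Pause only if another attempt is coming.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Assumes curl is installed and the server was started on port 8000:
# retry 30 2 curl -fsS -o /dev/null http://localhost:8000 && echo "Web UI is up"
```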




Demo | Screenshots | Usage

Key Features


🤝 Professional Integration

ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs, governments, and other organizations run ArchiveBox professionally:

  • Journalists: crawling during research, preserving cited pages, fact-checking & review
  • Lawyers: collecting & preserving evidence, detecting changes, tagging & review
  • Researchers: analyzing social media trends, getting LLM training data, crawling pipelines
  • Individuals: saving bookmarks, preserving portfolio content, legacy / memoirs archival
  • Governments: snapshotting public service sites, recordkeeping compliance

Contact us if your org wants help using ArchiveBox professionally.
We offer: setup & support, hosting, custom features, security, hashing & audit logging/chain-of-custody, etc.
ArchiveBox has 🏛️ 501(c)(3) nonprofit status and all our work supports open-source development.




Quickstart

🖥  Supported OSs: Linux/BSD, macOS, Windows (Docker)   👾  CPUs: amd64 (x86_64), arm64, arm7 (raspi>=3)
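
To see which of the CPU names above applies to a machine, one option is to map the output of `uname -m`. This is a sketch only: `cpu_label` is a hypothetical helper, and the mapping from `uname -m` strings to the names in the list is an assumption based on common values, not something ArchiveBox defines.

```shell
# cpu_label: hypothetical helper, not part of ArchiveBox.
# Maps a `uname -m` string (or the current machine's, if no argument is given)
# onto the CPU names used in the list above; prints "unlisted" otherwise.
cpu_label() {
    case "${1:-$(uname -m)}" in
        x86_64|amd64)   echo amd64 ;;
        aarch64|arm64)  echo arm64 ;;
        armv7*)         echo arm7 ;;
        *)              echo unlisted ;;
    esac
}

cpu_label    # prints the label for the current machine
```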


✳️  Easy Setup
docker-compose (macOS/Linux/Windows)   👈  recommended
👍 Docker Compose is recommended for the easiest install/update UX + best security + all extras out-of-the-box.

  1. Install Docker on your system (if not already installed).
  2. Download the docker-compose.yml file into a new empty directory (can be anywhere).
    mkdir -p ~/archivebox/data && cd ~/archivebox
    # Read and edit docker-compose.yml options as-needed after downloading
    curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
    
  3. Run the initial setup to create an admin user (or set ADMIN_USER/PASS in docker-compose.yml)
    docker compose run archivebox init --setup
    
  4. Next steps: Start the server, then log in to the Web UI at http://127.0.0.1:8000 ⇢ Admin.
    docker compose up
    # completely optional, CLI can always be used without running a server
    # docker compose run [-T] archivebox [subcommand] [--help]
    docker compose run archivebox add 'https://example.com'
    docker compose run archivebox help
    
    For more info, see Install: Docker Compose in the Wiki. ➡️

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

docker run (macOS/Linux/Windows)
  1. Install Docker on your system (if not already installed).
  2. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir -p ~/archivebox/data && cd ~/archivebox/data
    docker run -v $PWD:/data -it archivebox/archivebox init --setup
    
  3. Optional: Start the server, then log in to the Web UI at http://127.0.0.1:8000 ⇢ Admin.
    docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox
    # completely optional, CLI can always be used without running a server
    # docker run -v $PWD:/data -it archivebox/archivebox [subcommand] [--help]
    docker run -v $PWD:/data -it archivebox/archivebox help
    
    For more info, see Install: Docker Compose in the Wiki. ➡️

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

bash auto-setup script (macOS/Linux)
  1. Install Docker on your system (optional, highly recommended but not required).
  2. Run the automatic setup script.
    curl -fsSL 'https://get.archivebox.io' | sh
    For more info, see Install: Bare Metal in the Wiki. ➡️

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
See setup.sh for the source code of the auto-install script.
See "Against curl | sh as an install method" blog post for my thoughts on the shortcomings of this install method.


🛠  Package Manager Setup

pip (macOS/Linux/BSD)
  1. Install Python >= v3.10 and Node >= v18 on your system (if not already installed).
  2. Install the ArchiveBox package using pip3 (or pipx).
    pip3 install --upgrade archivebox yt-dlp playwright
    playwright install --with-deps chromium
    archivebox version
    # install any missing extras shown using apt/brew/pkg/etc. see Wiki for instructions
    #    python@3.10 node curl wget git ripgrep ...
    
    See the Install: Bare Metal Wiki for full install instructions for each OS...
  3. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir -p ~/archivebox/data && cd ~/archivebox/data   # for example
    archivebox init --setup   # initialize a new collection
    # (--setup auto-installs and links JS dependencies: singlefile, readability, mercury, etc.)
    
  4. Optional: Start the server, then log in to the Web UI at http://127.0.0.1:8000 ⇢ Admin.
    archivebox server 0.0.0.0:8000
    # completely optional, CLI can always be used without running a server
    # archivebox [subcommand] [--help]
    archivebox help
    

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

See the pip-archivebox repo for more details about this distribution.
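
Before a pip install, it can help to confirm the Python >= 3.10 and Node >= 18 minimums from step 1. The sketch below is one way to do that; `version_ge` is a hypothetical helper, and it assumes the interpreters are on your PATH as `python3` and `node`.

```shell
# version_ge: hypothetical helper, not part of ArchiveBox.
# Succeeds when dotted version $1 >= $2, comparing up to three numeric
# components one at a time (so 3.10 counts as newer than 3.9).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Check Python, if present (assumes the binary is named python3).
if command -v python3 >/dev/null 2>&1; then
    py="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    version_ge "$py" 3.10 || echo "Python $py is older than 3.10"
fi

# Check Node, if present (node --version prints e.g. v18.17.0, so strip the "v").
if command -v node >/dev/null 2>&1; then
    node_v="$(node --version)"
    version_ge "${node_v#v}" 18 || echo "Node ${node_v#v} is older than 18"
fi
```

A plain string comparison would rank 3.9 above 3.10, which is why the components are compared as numbers.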

apt (Ubuntu/Debian/etc.)
  1. Add the ArchiveBox repository to your sources.
    echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/archivebox.list
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
    sudo apt update
    
  2. Install the ArchiveBox package using apt.
    sudo apt install archivebox
    # update to newest version with pip (sometimes apt package is outdated)
    pip install --upgrade --ignore-installed archivebox yt-dlp playwright
    playwright install --with-deps chromium    # install chromium and its system dependencies
    archivebox version                         # make sure all dependencies are installed
    
  3. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir -p ~/archivebox/data && cd ~/archivebox/data
    archivebox init --setup
    
    Note: If you encounter issues or want more granular instructions, see the Install: Bare Metal Wiki.

  4. Optional: Start the server, then log in to the Web UI at http://127.0.0.1:8000 ⇢ Admin.
    archivebox server 0.0.0.0:8000
    # completely optional, CLI can always be used without running a server
    # archivebox [subcommand] [--help]
    archivebox help
    

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
See the debian-archivebox repo for more details about this distribution.

brew (macOS only)
  1. Install Homebrew on your system (if not already installed).
  2. Install the ArchiveBox package using brew.
    brew tap archivebox/archivebox
    brew install archivebox
    # update to newest version with pip (sometimes brew package is outdated)
    pip install --upgrade --ignore-installed archivebox yt-dlp playwright
    playwright install --with-deps chromium    # install chromium and its system dependencies
    archivebox version                         # make sure all dependencies are installed
    
    See the Install: Bare Metal Wiki for more granular instructions for macOS... ➡️
  3. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir -p ~/archivebox/data && cd ~/archivebox/data
    archivebox init --setup
    
  4. Optional: Start the server, then log in to the Web UI at http://127.0.0.1:8000 ⇢ Admin.
    archivebox server 0.0.0.0:8000
    # completely optional, CLI can always be used without running a server
    # archivebox [subcommand] [--help]
    archivebox help
    

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
See the homebrew-archivebox repo for more details about this distribution.

Arch pacman / FreeBSD pkg / Nix nix (Arch/FreeBSD/NixOS/more)

Warning: These are contributed by external volunteers and may lag behind the official pip channel.

Tag summary

  • Content type: Image
  • Digest: sha256:eb5c8c034
  • Size: 630.8 MB
  • Last updated: over 1 year ago

docker pull archivebox/archivebox:sha-baa3be75

This week's pulls: 41,872 (Apr 13 to Apr 19)