vib795/resume-screener

📄 Resume Screener – Plagiarism Detection (API + UI)

A FastAPI app to detect plagiarism in resumes using:

  • Exact duplicate hashing (100% hits for true duplicates)
  • Lexical overlap (MinHash + LSH, Jaccard)
  • Semantic similarity (sentence embeddings)
  • HTML evidence reports (exact/soft overlaps + similar bullets)
  • A clean drag-and-drop UI for non-technical users

Runs locally (Python 3.12) or in Docker.


✨ Features

  • PDF / DOCX support
  • PII masking (optional & consistent across CLI/API)
  • Exact dupes → instant 100% match
  • Near dupes → single-signal hard gates (e.g., sem ≥ 0.985 or lex ≥ 0.95)
  • Top matches with filenames in responses
  • Evidence report (HTML) per best match (with inline highlighting)
  • Tunable weights & thresholds → realistic unique / needs_review / plagiarized_likely
  • Drag & drop UI – screen or add to corpus in your browser
  • Hot reload in dev; non-root container in prod
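
The exact-duplicate path can be pictured as hashing a normalized form of the resume text and looking the digest up in a set of known digests. The sketch below is only an illustration under stated assumptions: normalize_text, content_hash, and the normalization rules (Unicode NFKC, lowercasing, whitespace collapsing) are hypothetical, and the project's own normalized_hash in parsing/ may work differently.

```python
import hashlib
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical normalization: NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


# In-memory stand-in for the digests of resumes already indexed.
seen = {content_hash("Built  REST APIs\nwith FastAPI.")}

# Same content with different spacing and case still hashes identically.
is_duplicate = content_hash("built rest apis with fastapi.") in seen
```

Because the digest depends on the normalized text, any step that changes that text (PII masking, for instance) has to be applied identically at index time and at screening time.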

🗂 Project Layout

.
├─ data/
│  ├─ corpus/        # put resumes here to index (PDF/DOCX)
│  ├─ index/         # generated index files (gitignored)
│  └─ reports/       # generated HTML evidence reports (gitignored)
├─ src/resumescreener/
│  ├─ app.py         # FastAPI app + endpoints + UI/static mounts
│  ├─ cli.py         # CLI: python -m resumescreener.cli index ./data/corpus
│  ├─ index/         # MinHash/LSH, FAISS-like stores, persistence
│  ├─ parsing/       # extract, normalize_and_split, normalized_hash
│  ├─ models/        # embedding & cross-encoder loaders
│  ├─ utils/         # pii masking, chunking, io helpers
│  ├─ evidence.py    # overlap & bullet pairing, HTML report builder
│  └─ scorer.py      # fused scoring, reranker, thresholds
├─ ui/
│  ├─ index.html     # drag & drop UI (Screen / Add to Corpus)
│  ├─ app.css
│  └─ app.js
├─ requirements.txt
├─ Dockerfile
├─ docker-compose.yml
├─ docker-compose.dev.yml
├─ .gitignore
└─ README.md

🚀 How to Run (all platforms)

Below are copy-paste blocks for Local (Python 3.12) and Docker (dev/prod). Choose one path.

Tip: First drop a few PDF/DOCX files into ./data/corpus/.


Option A — Local (Python 3.12)

Python 3.13 is not recommended (some deps don’t ship wheels yet). Use 3.12.

macOS / Linux (bash/zsh)

# 0) From the repo root, create and activate venv
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

# 1) (Optional) start fresh if switching models or indexes
rm -rf ./data/index

# 2) Configure model & tuning (no code changes needed)
export RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"
export RESUME_SCREENER_PII_MASKING=1
export RESUME_SCREENER_W_SEM=0.60
export RESUME_SCREENER_W_LEX=0.40
export RESUME_SCREENER_W_LAY=0.0
export RESUME_SCREENER_PLAGIARIZED=0.85
export RESUME_SCREENER_REVIEW=0.48
export RESUME_SCREENER_NEAR_SEM=0.985
export RESUME_SCREENER_NEAR_LEX=0.95

# 3) Build the index
export PYTHONPATH=src
python -m resumescreener.cli index ./data/corpus

# 4) Run the API (serves UI at /)
uvicorn resumescreener.app:app --host 0.0.0.0 --port 8000 --reload

Open: http://localhost:8000/

Windows (PowerShell)

# 0) Venv
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

# 1) Optional clean
Remove-Item -Recurse -Force .\data\index -ErrorAction SilentlyContinue

# 2) Environment (adjust as desired)
$env:RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"
$env:RESUME_SCREENER_PII_MASKING="1"
$env:RESUME_SCREENER_W_SEM="0.60"
$env:RESUME_SCREENER_W_LEX="0.40"
$env:RESUME_SCREENER_W_LAY="0.0"
$env:RESUME_SCREENER_PLAGIARIZED="0.85"
$env:RESUME_SCREENER_REVIEW="0.48"
$env:RESUME_SCREENER_NEAR_SEM="0.985"
$env:RESUME_SCREENER_NEAR_LEX="0.95"

# 3) Index
$env:PYTHONPATH="src"
python -m resumescreener.cli index .\data\corpus

# 4) Run API
uvicorn resumescreener.app:app --host 0.0.0.0 --port 8000 --reload

Open: http://localhost:8000/


Option B — Docker

Both compose files mount:

  • ./data → /app/data (indexes, reports)
  • ./ui → /app/ui (static UI)
  • ./model-cache → /app/.cache/huggingface (model cache)

Dev (hot reload, live code)

# Set model/env via .env or env section in docker-compose.dev.yml
docker compose -f docker-compose.dev.yml up --build

Index inside the running container:

docker exec -it resume-screener-dev \
  python -m resumescreener.cli index ./data/corpus

Prod-ish

docker compose up --build -d

Index:

docker exec -it resume-screener \
  python -m resumescreener.cli index ./data/corpus

Open: http://localhost:8000/ (assuming the compose files publish port 8000)


🔁 Switching Embedding Models (no code changes)

Switching models takes a single env var. Re-indexing is required when the embedding dimension changes, and advisable after any switch, since vectors produced by different models are not directly comparable.

Local (recommended flow):

# switch model
export RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"

# wipe old vectors if dim changes
rm -rf ./data/index

# rebuild
export PYTHONPATH=src
python -m resumescreener.cli index ./data/corpus

# run
uvicorn resumescreener.app:app --reload

Docker

Put this in docker-compose.yml or docker-compose.dev.yml:

# docker-compose.yml
environment:
  RESUME_SCREENER_EMBEDDER_MODEL: "sentence-transformers/all-mpnet-base-v2"
  RESUME_SCREENER_PII_MASKING: "1"

Then:

docker compose up --build -d
docker exec -it resume-screener python -m resumescreener.cli index ./data/corpus

Optional .env:

RESUME_SCREENER_EMBEDDER_MODEL=sentence-transformers/all-mpnet-base-v2
RESUME_SCREENER_PII_MASKING=1
RESUME_SCREENER_W_SEM=0.60
RESUME_SCREENER_W_LEX=0.40
RESUME_SCREENER_W_LAY=0.0
RESUME_SCREENER_PLAGIARIZED=0.85
RESUME_SCREENER_REVIEW=0.48
RESUME_SCREENER_NEAR_SEM=0.985
RESUME_SCREENER_NEAR_LEX=0.95

Then in compose:

env_file:
  - ./.env
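
To show how the RESUME_SCREENER_* variables could be consumed, here is a minimal sketch of a settings loader. It is not the project's configuration code: Settings, load_settings, and the fallback values (copied from the sample .env above) are assumptions made for illustration, and the real defaults may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping

PREFIX = "RESUME_SCREENER_"


@dataclass(frozen=True)
class Settings:
    embedder_model: str
    pii_masking: bool
    w_sem: float
    w_lex: float
    w_lay: float
    plagiarized: float
    review: float
    near_sem: float
    near_lex: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read RESUME_SCREENER_* values; fallbacks mirror the sample .env (assumed)."""

    def get(name: str, default: str) -> str:
        return env.get(PREFIX + name, default)

    return Settings(
        embedder_model=get("EMBEDDER_MODEL", "sentence-transformers/all-mpnet-base-v2"),
        pii_masking=get("PII_MASKING", "1") == "1",
        w_sem=float(get("W_SEM", "0.60")),
        w_lex=float(get("W_LEX", "0.40")),
        w_lay=float(get("W_LAY", "0.0")),
        plagiarized=float(get("PLAGIARIZED", "0.85")),
        review=float(get("REVIEW", "0.48")),
        near_sem=float(get("NEAR_SEM", "0.985")),
        near_lex=float(get("NEAR_LEX", "0.95")),
    )


# Override a single threshold; every other field falls back to the sample value.
settings = load_settings({"RESUME_SCREENER_REVIEW": "0.30"})
```

Passing the environment in as a mapping keeps the loader easy to test and makes it obvious which values the CLI and the API would share.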

📏 Model Compatibility Table

Model                                          | Dim | Speed  | Accuracy | Multilingual | Reindex needed when switching?
-----------------------------------------------|-----|--------|----------|--------------|-------------------------------
sentence-transformers/all-MiniLM-L6-v2         | 384 | Fast   | Good     | No           | Only if current dim ≠ 384
sentence-transformers/all-MiniLM-L12-v2        | 384 | Medium | Better   | No           | Only if current dim ≠ 384
sentence-transformers/paraphrase-MiniLM-L6-v2  | 384 | Fast   | Good     | No           | Only if current dim ≠ 384
sentence-transformers/paraphrase-MiniLM-L12-v2 | 384 | Medium | Good     | No           | Only if current dim ≠ 384
sentence-transformers/all-mpnet-base-v2        | 768 | Medium | High     | No           | Yes if current dim ≠ 768
sentence-transformers/paraphrase-mpnet-base-v2 | 768 | Medium | High     | No           | Yes if current dim ≠ 768
distiluse-base-multilingual-cased-v2           | 512 | Medium | Good     | Yes          | Yes if current dim ≠ 512
xlm-r-100langs-bert-base-nli-stsb-mean-tokens  | 768 | Slow   | Good     | Yes          | Yes if current dim ≠ 768

Rule of thumb: same Dim → keep old vectors; different Dim → rm -rf ./data/index and re-index.
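
The dimension comparison behind this rule can be written as a small lookup. This is a sketch only: MODEL_DIMS copies the Dim column from the table above, and dim_changed is a hypothetical helper, not something the CLI exposes.

```python
# Embedding sizes taken from the compatibility table.
MODEL_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-MiniLM-L12-v2": 384,
    "sentence-transformers/paraphrase-MiniLM-L6-v2": 384,
    "sentence-transformers/paraphrase-MiniLM-L12-v2": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "sentence-transformers/paraphrase-mpnet-base-v2": 768,
    "distiluse-base-multilingual-cased-v2": 512,
    "xlm-r-100langs-bert-base-nli-stsb-mean-tokens": 768,
}


def dim_changed(current_model: str, new_model: str) -> bool:
    """True when the stored vectors have a different size than the new model's."""
    return MODEL_DIMS[current_model] != MODEL_DIMS[new_model]


# MiniLM (384) to mpnet (768): the on-disk index has to be wiped and rebuilt.
must_wipe = dim_changed(
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
)
```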


🔌 Endpoints (quick)

  • POST /screen → returns scores + decision
  • POST /screen_with_report → saves HTML evidence under ./data/reports
  • POST /index → adds single resume to index
  • POST /persist → persist in-memory stores
  • POST /reload_index → reload from disk (useful after running CLI)
  • GET /stats → index stats (docs, vectors, hashes)
  • GET /health → health check

curl

curl -F 'file=@"/path/Resume.pdf"'  http://localhost:8000/screen
curl -F 'file=@"/path/Resume.pdf"'  http://localhost:8000/screen_with_report
curl -F 'file=@"/path/Resume.docx"' http://localhost:8000/index
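
For callers who prefer Python over curl, the same upload can be sketched with the standard library alone. The form field name file comes from the curl commands above; build_multipart and post_file are hypothetical helpers, and the assumption that the endpoints answer with a JSON body is not confirmed by this README.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/octet-stream"):
    """Encode one file as multipart/form-data; returns (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(url: str, path: str) -> dict:
    """POST a resume under the form field 'file' and parse the (assumed) JSON reply."""
    with open(path, "rb") as fh:
        body, ctype = build_multipart("file", os.path.basename(path), fh.read())
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": ctype}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example call (requires the API to be running):
# result = post_file("http://localhost:8000/screen", "/path/Resume.pdf")
```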

🧠 How detection works (super short)

  • Exact hash → 100% plagiarized if match.

  • Candidates = semantic neighbors ∪ LSH neighbors.

  • Score each candidate: total = w_sem*semantic + w_lex*lexical + w_lay*layout

  • Near-dup hard gate: sem ≥ 0.985 or lex ≥ 0.95 → treat as plagiarized.

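  • Decision (shown with cutoffs 0.85 and 0.30; the sample configuration earlier sets RESUME_SCREENER_REVIEW to 0.48 instead):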

    • ≥ 0.85 → plagiarized_likely
    • 0.30–0.85 → needs_review (component overrides also apply)
    • < 0.30 → unique
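
The steps above can be sketched as two small functions. This is a simplified illustration, not the code in scorer.py: fused_score and decide are hypothetical names, the defaults come from the values quoted in this README, the hard gate is assumed to map to plagiarized_likely, and the component overrides mentioned above are not modelled.

```python
def fused_score(sem, lex, lay=0.0, w_sem=0.60, w_lex=0.40, w_lay=0.0):
    """total = w_sem*semantic + w_lex*lexical + w_lay*layout"""
    return w_sem * sem + w_lex * lex + w_lay * lay


def decide(sem, lex, lay=0.0, exact_match=False,
           plagiarized=0.85, review=0.30, near_sem=0.985, near_lex=0.95):
    """Return one of: plagiarized_likely, needs_review, unique."""
    # 1) An exact hash match short-circuits everything else.
    if exact_match:
        return "plagiarized_likely"
    # 2) Near-duplicate hard gate: one strong signal is enough.
    if sem >= near_sem or lex >= near_lex:
        return "plagiarized_likely"
    # 3) Otherwise compare the fused score against the two cutoffs.
    total = fused_score(sem, lex, lay)
    if total >= plagiarized:
        return "plagiarized_likely"
    if total >= review:
        return "needs_review"
    return "unique"


labels = [
    decide(0.99, 0.20),  # hard gate on the semantic signal
    decide(0.70, 0.50),  # fused score 0.62, between the cutoffs
    decide(0.30, 0.20),  # fused score 0.26, below the review cutoff
]
```

Here labels evaluates to plagiarized_likely, needs_review, and unique, in that order.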

For the deep dive, see LOGIC.md.
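
As a companion illustration of the lexical signal, the sketch below computes Jaccard similarity over word shingles and a MinHash estimate of it. Everything here is an assumption made for teaching purposes: the 3-word shingle size, the 64 seeded SHA-1 hashes, and the function names are hypothetical, and the project's MinHash/LSH code in index/ may use a different library and parameters.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minhash_signature(items: set, num_perm: int = 64) -> list:
    """One minimum per seeded hash function; items must be non-empty."""
    def h(seed: int, s: str) -> int:
        digest = hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(h(seed, s) for s in items) for seed in range(num_perm)]


def minhash_estimate(sig_a: list, sig_b: list) -> float:
    """Share of signature positions that agree approximates the Jaccard value."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("led a team of five engineers building data pipelines")
b = shingles("led a team of five engineers building ml pipelines")
exact = jaccard(a, b)  # 5 shared shingles out of 9 distinct ones
approx = minhash_estimate(minhash_signature(a), minhash_signature(b))
```

The signatures are what make LSH practical: they are fixed-size, so near neighbors can be bucketed without comparing full shingle sets pairwise.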


🧩 Troubleshooting

  • Exact duplicate not 100%? Reindex after enabling PII masking; ensure CLI & API masking match. Call /reload_index or restart the API.

  • Different score when “Generate evidence report” is toggled? Both endpoints now share the same candidate path, so scores should match. If you still see drift, reindex and make sure the environment is identical for both runs.

  • FAISS / dim mismatch assertion? You switched to a model with a different dimension. rm -rf ./data/index and re-index.


🔒 License

This project is not open-licensed. All rights reserved.

Use, hosting, modification, distribution, or monetization require written permission from the copyright holder. Revenue sharing applies to commercial use. See LICENSE for full terms.

Contact: Utkarsh Singh – utkarshsingh795@icloud.com
