📄 Resume Screener – Plagiarism Detection (API + UI)

A FastAPI app to detect plagiarism in resumes using:

Exact duplicate hashing (100% hits for true duplicates)
Lexical overlap (MinHash + LSH, Jaccard)
Semantic similarity (sentence embeddings)
HTML evidence reports (exact/soft overlaps + similar bullets)
A clean drag-and-drop UI for non-technical users

Runs locally (Python 3.12) or in Docker.

✨ Features

PDF / DOCX support
PII masking (optional & consistent across CLI/API)
Exact dupes → instant 100% match
Near dupes → single-signal hard gates (e.g., sem ≥ 0.985 or lex ≥ 0.95)
Top matches with filenames in responses
Evidence report (HTML) per best match (with inline highlighting)
Tunable weights & thresholds → realistic unique / needs_review / plagiarized_likely decisions
Drag & drop UI – screen or add to corpus in your browser
Hot reload in dev; non-root container in prod

🗂 Project Layout

.
├─ data/
│  ├─ corpus/        # put resumes here to index (PDF/DOCX)
│  ├─ index/         # generated index files (gitignored)
│  └─ reports/       # generated HTML evidence reports (gitignored)
├─ src/resumescreener/
│  ├─ app.py         # FastAPI app + endpoints + UI/static mounts
│  ├─ cli.py         # CLI: python -m resumescreener.cli index ./data/corpus
│  ├─ index/         # MinHash/LSH, FAISS-like stores, persistence
│  ├─ parsing/       # extract, normalize_and_split, normalized_hash
│  ├─ models/        # embedding & cross-encoder loaders
│  ├─ utils/         # pii masking, chunking, io helpers
│  ├─ evidence.py    # overlap & bullet pairing, HTML report builder
│  └─ scorer.py      # fused scoring, reranker, thresholds
├─ ui/
│  ├─ index.html     # drag & drop UI (Screen / Add to Corpus)
│  ├─ app.css
│  └─ app.js
├─ requirements.txt
├─ Dockerfile
├─ docker-compose.yml
├─ docker-compose.dev.yml
├─ .gitignore
└─ README.md

🚀 How to Run (all platforms)

Below are copy-paste blocks for Local (Python 3.12) and Docker (dev/prod). Choose one path.

Tip: First drop a few PDFs/DOCX files into ./data/corpus/.

Option A — Local (Python 3.12)

Python 3.13 is not recommended (some deps don’t ship wheels yet). Use 3.12.

macOS / Linux (bash/zsh)

# 0) From the repo root, create and activate venv
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

# 1) (Optional) start fresh if switching models or indexes
rm -rf ./data/index

# 2) Configure model & tuning (no code changes needed)
export RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"
export RESUME_SCREENER_PII_MASKING=1
export RESUME_SCREENER_W_SEM=0.60
export RESUME_SCREENER_W_LEX=0.40
export RESUME_SCREENER_W_LAY=0.0
export RESUME_SCREENER_PLAGIARIZED=0.85
export RESUME_SCREENER_REVIEW=0.48
export RESUME_SCREENER_NEAR_SEM=0.985
export RESUME_SCREENER_NEAR_LEX=0.95

# 3) Build the index
export PYTHONPATH=src
python -m resumescreener.cli index ./data/corpus

# 4) Run the API (serves UI at /)
uvicorn resumescreener.app:app --host 0.0.0.0 --port 8000 --reload

Open:

UI: http://localhost:8000/
Docs: http://localhost:8000/docs

Windows (PowerShell)

# 0) Venv
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

# 1) Optional clean
Remove-Item -Recurse -Force .\data\index -ErrorAction SilentlyContinue

# 2) Environment (adjust as desired)
$env:RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"
$env:RESUME_SCREENER_PII_MASKING="1"
$env:RESUME_SCREENER_W_SEM="0.60"
$env:RESUME_SCREENER_W_LEX="0.40"
$env:RESUME_SCREENER_W_LAY="0.0"
$env:RESUME_SCREENER_PLAGIARIZED="0.85"
$env:RESUME_SCREENER_REVIEW="0.48"
$env:RESUME_SCREENER_NEAR_SEM="0.985"
$env:RESUME_SCREENER_NEAR_LEX="0.95"

# 3) Index
$env:PYTHONPATH="src"
python -m resumescreener.cli index .\data\corpus

# 4) Run API
uvicorn resumescreener.app:app --host 0.0.0.0 --port 8000 --reload

Open:

UI: http://localhost:8000/
API docs: http://localhost:8000/docs

Option B — Docker

Both compose files mount:

./data → /app/data (indexes, reports)

./ui → /app/ui (static UI)

./model-cache → /app/.cache/huggingface (model cache)

Dev (hot reload, live code)

# Set model/env via .env or env section in docker-compose.dev.yml
docker compose -f docker-compose.dev.yml up --build

Index inside the running container:

docker exec -it resume-screener-dev \
  python -m resumescreener.cli index ./data/corpus

Prod-ish

docker compose up --build -d

Index:

docker exec -it resume-screener \
  python -m resumescreener.cli index ./data/corpus

Open:

http://localhost:8000/

🔁 Switching Embedding Models (no code changes)

Switch models with a single env var, then reindex if the embedding dimension changes.

Local (recommended flow):

# switch model
export RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"

# wipe old vectors if dim changes
rm -rf ./data/index

# rebuild
export PYTHONPATH=src
python -m resumescreener.cli index ./data/corpus

# run
uvicorn resumescreener.app:app --reload

Docker

Put this in docker-compose.yml or docker-compose.dev.yml:

# docker-compose.yml
environment:
  RESUME_SCREENER_EMBEDDER_MODEL: "sentence-transformers/all-mpnet-base-v2"
  RESUME_SCREENER_PII_MASKING: "1"

Then:

docker compose up --build -d
docker exec -it resume-screener python -m resumescreener.cli index ./data/corpus

Optional .env:

RESUME_SCREENER_EMBEDDER_MODEL=sentence-transformers/all-mpnet-base-v2
RESUME_SCREENER_PII_MASKING=1
RESUME_SCREENER_W_SEM=0.60
RESUME_SCREENER_W_LEX=0.40
RESUME_SCREENER_W_LAY=0.0
RESUME_SCREENER_PLAGIARIZED=0.85
RESUME_SCREENER_REVIEW=0.48
RESUME_SCREENER_NEAR_SEM=0.985
RESUME_SCREENER_NEAR_LEX=0.95

Then in compose:

env_file:
  - ./.env
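
As a rough illustration of how the variables above could be consumed at startup, here is a minimal sketch. The `Settings` class, `load_settings`, and the fallback defaults are assumptions made for this example (the defaults simply mirror the sample values in this README); the project's actual loader may look different.

```python
import os
from dataclasses import dataclass

PREFIX = "RESUME_SCREENER_"


def _env_float(name: str, default: float) -> float:
    """Read a float from the environment, falling back to a default when unset or empty."""
    raw = os.environ.get(PREFIX + name, "")
    return float(raw) if raw.strip() else default


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names follow the env var suffixes.
    embedder_model: str
    pii_masking: bool
    w_sem: float
    w_lex: float
    w_lay: float
    plagiarized: float
    review: float
    near_sem: float
    near_lex: float


def load_settings() -> Settings:
    """Build settings from RESUME_SCREENER_* variables (defaults are assumed, not confirmed)."""
    return Settings(
        embedder_model=os.environ.get(
            PREFIX + "EMBEDDER_MODEL", "sentence-transformers/all-mpnet-base-v2"
        ),
        pii_masking=os.environ.get(PREFIX + "PII_MASKING", "1") == "1",
        w_sem=_env_float("W_SEM", 0.60),
        w_lex=_env_float("W_LEX", 0.40),
        w_lay=_env_float("W_LAY", 0.0),
        plagiarized=_env_float("PLAGIARIZED", 0.85),
        review=_env_float("REVIEW", 0.48),
        near_sem=_env_float("NEAR_SEM", 0.985),
        near_lex=_env_float("NEAR_LEX", 0.95),
    )
```

Reading everything once into an immutable object makes it easier to keep the CLI and the API on identical settings, which matters for the masking and scoring consistency discussed elsewhere in this README.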

📏 Model Compatibility Table

Model	Dim	Speed	Accuracy	Multilingual	Reindex Needed when Switching?
sentence-transformers/all-MiniLM-L6-v2	384	Fast	Good	No	Only if current dim ≠ 384
sentence-transformers/all-MiniLM-L12-v2	384	Medium	Better	No	Only if current dim ≠ 384
sentence-transformers/paraphrase-MiniLM-L6-v2	384	Fast	Good	No	Only if current dim ≠ 384
sentence-transformers/paraphrase-MiniLM-L12-v2	384	Medium	Good	No	Only if current dim ≠ 384
sentence-transformers/all-mpnet-base-v2	768	Medium	High	No	Only if current dim ≠ 768
sentence-transformers/paraphrase-mpnet-base-v2	768	Medium	High	No	Only if current dim ≠ 768
distiluse-base-multilingual-cased-v2	512	Medium	Good	Yes	Only if current dim ≠ 512
xlm-r-100langs-bert-base-nli-stsb-mean-tokens	768	Slow	Good	Yes	Only if current dim ≠ 768

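Rule of thumb: different Dim → rm -rf ./data/index and re-index (required, otherwise the vector store fails on the dimension mismatch). Same Dim → the old index still loads, but vectors produced by different models are not comparable, so re-index anyway for meaningful semantic scores.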

🔌 Endpoints (quick)

POST /screen → returns scores + decision
POST /screen_with_report → saves HTML evidence under ./data/reports
POST /index → adds single resume to index
POST /persist → persist in-memory stores
POST /reload_index → reload from disk (useful after running CLI)
GET /stats → index stats (docs, vectors, hashes)
GET /health → health check

curl

curl -F 'file=@"/path/Resume.pdf"'  http://localhost:8000/screen
curl -F 'file=@"/path/Resume.pdf"'  http://localhost:8000/screen_with_report
curl -F 'file=@"/path/Resume.docx"' http://localhost:8000/index
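
For scripted use, the same upload can be made from Python. The sketch below uses only the standard library and mirrors the curl calls (one multipart form field named `file`). The helper names `build_upload_request` and `screen` are made up for this example, and the response is returned as raw text because the response schema is not documented here.

```python
import os
import urllib.request
import uuid


def build_upload_request(url: str, filename: str, payload: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying a single form field named 'file'."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def screen(path: str, base_url: str = "http://localhost:8000") -> str:
    """Upload one resume to POST /screen and return the raw response body."""
    with open(path, "rb") as fh:
        req = build_upload_request(f"{base_url}/screen", os.path.basename(path), fh.read())
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling `screen("/path/Resume.pdf")` against a running server should be equivalent to the first curl line; swapping the URL suffix targets `/screen_with_report` or `/index` in the same way.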

🧠 How detection works (super short)

Exact hash → 100% plagiarized if match.
Candidates = semantic neighbors ∪ LSH neighbors.
Score each candidate: total = w_sem*semantic + w_lex*lexical + w_lay*layout
Near-dup hard gate: sem ≥ 0.985 or lex ≥ 0.95 → treat as plagiarized.
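Decision (with the thresholds shown here; the sample configuration sets RESUME_SCREENER_REVIEW=0.48, so your review cut-off may differ):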
- ≥ 0.85 → plagiarized_likely
- 0.30–0.85 → needs_review (component overrides also apply)
- < 0.30 → unique
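
The steps above can be sketched in a few lines of Python. This is an illustrative reading of the rules as listed, not the code in scorer.py: the names `Thresholds`, `fused_score`, and `decide` are hypothetical, the weights and cut-offs are the values quoted in this README, and the "component overrides" are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Values quoted in this README; treat them as tunable, not fixed.
    plagiarized: float = 0.85
    review: float = 0.30
    near_sem: float = 0.985
    near_lex: float = 0.95


def fused_score(sem: float, lex: float, lay: float,
                w_sem: float = 0.60, w_lex: float = 0.40, w_lay: float = 0.0) -> float:
    """Weighted sum: total = w_sem*semantic + w_lex*lexical + w_lay*layout."""
    return w_sem * sem + w_lex * lex + w_lay * lay


def decide(sem: float, lex: float, lay: float = 0.0, exact_match: bool = False,
           t: Thresholds = Thresholds()) -> tuple[float, str]:
    """Return (score, label) for one candidate, applying the gates in order."""
    if exact_match:
        # Exact hash hit short-circuits everything else.
        return 1.0, "plagiarized_likely"
    total = fused_score(sem, lex, lay)
    if sem >= t.near_sem or lex >= t.near_lex:
        # Near-duplicate hard gate: a single strong signal is enough.
        return total, "plagiarized_likely"
    if total >= t.plagiarized:
        return total, "plagiarized_likely"
    if total >= t.review:
        return total, "needs_review"
    return total, "unique"
```

For example, `decide(0.99, 0.10)` is labelled plagiarized_likely by the hard gate even though its fused score is well below 0.85, which is the point of having single-signal gates next to the weighted total.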

For the deep dive, see LOGIC.md.
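
To give a feel for the lexical signal, here is a small, dependency-free sketch of Jaccard similarity over word shingles and its MinHash estimate. It is a teaching example under stated assumptions: the shingle size, the number of hash functions, and the use of salted BLAKE2b are choices made for this snippet, the LSH banding step is omitted, and the project's own index code may rely on a library instead.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Lower-cased word k-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per salted hash function; the salts stand in for random permutations."""
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates the true Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Signatures have a fixed length regardless of document size, so comparing two resumes costs the same whether they are one page or ten; LSH then buckets similar signatures so that only likely neighbors are compared at all.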

🧩 Troubleshooting

Exact duplicate not 100%? Reindex after enabling PII masking; ensure CLI & API masking match. Call /reload_index or restart the API.
Different score when “Generate evidence report” is toggled? We now unify the candidate path for both endpoints so scores should match. If you still see drift, reindex and ensure env is identical.
FAISS / dim mismatch assertion? You switched to a model with a different dimension. rm -rf ./data/index and re-index.
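
The first item is easier to see with a toy example. The sketch below hashes normalized text with and without masking; `mask_pii` and this `normalized_hash` are simplified stand-ins written for illustration (the real helpers in parsing/ and utils/ are not shown in this README and may normalize differently).

```python
import hashlib
import re

# Deliberately simple e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(text: str) -> str:
    """Toy masking: replace e-mail addresses with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def normalized_hash(text: str, pii_masking: bool) -> str:
    """Hash of the text after optional masking, lower-casing, and whitespace collapsing."""
    if pii_masking:
        text = mask_pii(text)
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


resume = "Jane Doe\njane@example.com\nBuilt  REST APIs"

# Same file, same masking setting: hashes agree, so the exact-duplicate check fires.
indexed = normalized_hash(resume, pii_masking=True)
screened = normalized_hash(resume, pii_masking=True)

# Same file, but indexed without masking: the hashes differ and the duplicate is missed.
indexed_unmasked = normalized_hash(resume, pii_masking=False)
```

Here `indexed == screened`, while `indexed_unmasked != screened`, which is why an index built before masking was enabled has to be rebuilt.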

🔒 License

Use, hosting, modification, distribution, or monetization requires written permission from the copyright holder. Revenue sharing applies to commercial use. See LICENSE for full terms.

Contact: Utkarsh Singh – utkarshsingh795@icloud.com
