A FastAPI app to detect plagiarism in resumes using:
- Exact duplicate hashing (100% hits for true duplicates)
- Lexical overlap (MinHash + LSH, Jaccard)
- Semantic similarity (sentence embeddings)
- HTML evidence reports (exact/soft overlaps + similar bullets)
- A clean drag-and-drop UI for non-technical users
Runs locally (Python 3.12) or in Docker.
- PDF / DOCX support
- PII masking (optional & consistent across CLI/API)
- Exact dupes → instant 100% match
- Near dupes → single-signal hard gates (e.g., sem ≥ 0.985 or lex ≥ 0.95)
- Top matches with filenames in responses
- Evidence report (HTML) per best match (with inline highlighting)
- Tunable weights & thresholds → realistic unique / needs_review / plagiarized_likely outcomes
- Drag & drop UI – screen or add to corpus in your browser
- Hot reload in dev; non-root container in prod
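The lexical-overlap signal (MinHash + Jaccard) can be illustrated with a stdlib-only sketch. Helper names here are hypothetical; the real `index/` module may shingle and hash differently:

```python
import hashlib
from typing import List, Set

def shingles(text: str, k: int = 3) -> Set[str]:
    # k-word shingles over lowercased, whitespace-split text.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(shingle_set: Set[str], num_perm: int = 64) -> List[int]:
    # One salted hash per "permutation"; keep the minimum over the set.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    # Fraction of agreeing signature slots estimates the true Jaccard overlap.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

LSH then buckets signatures into bands so that only likely-similar pairs are compared at all.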
.
├─ data/
│ ├─ corpus/ # put resumes here to index (PDF/DOCX)
│ ├─ index/ # generated index files (gitignored)
│ └─ reports/ # generated HTML evidence reports (gitignored)
├─ src/resumescreener/
│ ├─ app.py # FastAPI app + endpoints + UI/static mounts
│ ├─ cli.py # CLI: python -m resumescreener.cli index ./data/corpus
│ ├─ index/ # MinHash/LSH, FAISS-like stores, persistence
│ ├─ parsing/ # extract, normalize_and_split, normalized_hash
│ ├─ models/ # embedding & cross-encoder loaders
│ ├─ utils/ # pii masking, chunking, io helpers
│ ├─ evidence.py # overlap & bullet pairing, HTML report builder
│ └─ scorer.py # fused scoring, reranker, thresholds
├─ ui/
│ ├─ index.html # drag & drop UI (Screen / Add to Corpus)
│ ├─ app.css
│ └─ app.js
├─ requirements.txt
├─ Dockerfile
├─ docker-compose.yml
├─ docker-compose.dev.yml
├─ .gitignore
└─ README.md
Below are copy-paste blocks for Local (Python 3.12) and Docker (dev/prod). Choose one path.
Tip: First drop a few PDF/DOCX files into ./data/corpus/.
Python 3.13 is not recommended (some deps don’t ship wheels yet). Use 3.12.
Linux / macOS:

# 0) From the repo root, create and activate venv
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# 1) (Optional) start fresh if switching models or indexes
rm -rf ./data/index
# 2) Configure model & tuning (no code changes needed)
export RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"
export RESUME_SCREENER_PII_MASKING=1
export RESUME_SCREENER_W_SEM=0.60
export RESUME_SCREENER_W_LEX=0.40
export RESUME_SCREENER_W_LAY=0.0
export RESUME_SCREENER_PLAGIARIZED=0.85
export RESUME_SCREENER_REVIEW=0.48
export RESUME_SCREENER_NEAR_SEM=0.985
export RESUME_SCREENER_NEAR_LEX=0.95
# 3) Build the index
export PYTHONPATH=src
python -m resumescreener.cli index ./data/corpus
# 4) Run the API (serves UI at /)
uvicorn resumescreener.app:app --host 0.0.0.0 --port 8000 --reload

Open:
- UI: http://localhost:8000/
- API docs: http://localhost:8000/docs

Windows (PowerShell):
# 0) Venv
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
# 1) Optional clean
Remove-Item -Recurse -Force .\data\index
# 2) Environment (adjust as desired)
$env:RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"
$env:RESUME_SCREENER_PII_MASKING="1"
$env:RESUME_SCREENER_W_SEM="0.60"
$env:RESUME_SCREENER_W_LEX="0.40"
$env:RESUME_SCREENER_W_LAY="0.0"
$env:RESUME_SCREENER_PLAGIARIZED="0.85"
$env:RESUME_SCREENER_REVIEW="0.48"
$env:RESUME_SCREENER_NEAR_SEM="0.985"
$env:RESUME_SCREENER_NEAR_LEX="0.95"
# 3) Index
$env:PYTHONPATH="src"
python -m resumescreener.cli index .\data\corpus
# 4) Run API
uvicorn resumescreener.app:app --host 0.0.0.0 --port 8000 --reload

Open:
- UI: http://localhost:8000/
- API docs: http://localhost:8000/docs
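In app code, the RESUME_SCREENER_* variables above can be read with `os.getenv` plus sensible defaults. A sketch (the actual config module may be structured differently):

```python
import os

def _f(name: str, default: float) -> float:
    # Read a float env var, falling back to the documented default.
    return float(os.getenv(name, str(default)))

def load_weights() -> dict:
    # Defaults mirror the env examples in this README.
    return {
        "w_sem": _f("RESUME_SCREENER_W_SEM", 0.60),
        "w_lex": _f("RESUME_SCREENER_W_LEX", 0.40),
        "w_lay": _f("RESUME_SCREENER_W_LAY", 0.0),
        "plagiarized": _f("RESUME_SCREENER_PLAGIARIZED", 0.85),
        "review": _f("RESUME_SCREENER_REVIEW", 0.48),
    }
```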
Both compose files mount:
- ./data → /app/data (indexes, reports)
- ./ui → /app/ui (static UI)
- ./model-cache → /app/.cache/huggingface (model cache)
# Set model/env via .env or env section in docker-compose.dev.yml
docker compose -f docker-compose.dev.yml up --build

Index inside the running container:
docker exec -it resume-screener-dev \
python -m resumescreener.cli index ./data/corpus

Prod:

docker compose up --build -d

Index:
docker exec -it resume-screener \
python -m resumescreener.cli index ./data/corpus

Open: http://localhost:8000/
You can switch embedding models with a single env var; reindex only if the embedding dimension changes.
Local (recommended flow):
# switch model
export RESUME_SCREENER_EMBEDDER_MODEL="sentence-transformers/all-mpnet-base-v2"
# wipe old vectors if dim changes
rm -rf ./data/index
# rebuild
export PYTHONPATH=src
python -m resumescreener.cli index ./data/corpus
# run
uvicorn resumescreener.app:app --reload

Docker: put this in docker-compose.yml or docker-compose.dev.yml:
# docker-compose.yml
environment:
RESUME_SCREENER_EMBEDDER_MODEL: "sentence-transformers/all-mpnet-base-v2"
RESUME_SCREENER_PII_MASKING: "1"

Then:
docker compose up --build -d
docker exec -it resume-screener python -m resumescreener.cli index ./data/corpus

Optional .env:
RESUME_SCREENER_EMBEDDER_MODEL=sentence-transformers/all-mpnet-base-v2
RESUME_SCREENER_PII_MASKING=1
RESUME_SCREENER_W_SEM=0.60
RESUME_SCREENER_W_LEX=0.40
RESUME_SCREENER_W_LAY=0.0
RESUME_SCREENER_PLAGIARIZED=0.85
RESUME_SCREENER_REVIEW=0.48
RESUME_SCREENER_NEAR_SEM=0.985
RESUME_SCREENER_NEAR_LEX=0.95
Then in compose:
env_file:
- ./.env

| Model | Dim | Speed | Accuracy | Multilingual | Reindex Needed when Switching? |
|---|---|---|---|---|---|
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Fast | Good | No | Only if current dim ≠ 384 |
| sentence-transformers/all-MiniLM-L12-v2 | 384 | Medium | Better | No | Only if current dim ≠ 384 |
| sentence-transformers/paraphrase-MiniLM-L6-v2 | 384 | Fast | Good | No | Only if current dim ≠ 384 |
| sentence-transformers/paraphrase-MiniLM-L12-v2 | 384 | Medium | Good | No | Only if current dim ≠ 384 |
| sentence-transformers/all-mpnet-base-v2 | 768 | Medium | High | No | Only if current dim ≠ 768 |
| sentence-transformers/paraphrase-mpnet-base-v2 | 768 | Medium | High | No | Only if current dim ≠ 768 |
| distiluse-base-multilingual-cased-v2 | 512 | Medium | Good | Yes | Only if current dim ≠ 512 |
| xlm-r-100langs-bert-base-nli-stsb-mean-tokens | 768 | Slow | Good | Yes | Only if current dim ≠ 768 |
Rule of thumb: same Dim → keep old vectors; different Dim → rm -rf ./data/index and re-index.
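The rule of thumb can be written down directly, with dimensions taken from the table above (helper name is hypothetical):

```python
# Embedding dimensions from the model table above (subset shown).
MODEL_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-MiniLM-L12-v2": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "distiluse-base-multilingual-cased-v2": 512,
}

def needs_reindex(current_dim: int, new_model: str) -> bool:
    # Same dim -> keep old vectors; different dim -> rm -rf ./data/index and re-index.
    return MODEL_DIMS[new_model] != current_dim
```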
- POST /screen → returns scores + decision
- POST /screen_with_report → saves HTML evidence under ./data/reports
- POST /index → adds a single resume to the index
- POST /persist → persist in-memory stores
- POST /reload_index → reload from disk (useful after running the CLI)
- GET /stats → index stats (docs, vectors, hashes)
- GET /health → health check
curl
curl -F 'file=@"/path/Resume.pdf"' http://localhost:8000/screen
curl -F 'file=@"/path/Resume.pdf"' http://localhost:8000/screen_with_report
curl -F 'file=@"/path/Resume.docx"' http://localhost:8000/index
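The exact-duplicate check hinges on hashing a normalized form of the text. A minimal sketch (the real `normalized_hash` in `parsing/` may normalize more aggressively, e.g. after PII masking):

```python
import hashlib
import re

def normalized_hash(text: str) -> str:
    # Lowercase and collapse all whitespace, then hash the canonical form,
    # so trivial formatting differences don't defeat exact-duplicate detection.
    canon = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()
```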
1. Exact hash → instant 100% plagiarized on a match.
2. Candidates = semantic neighbors ∪ LSH neighbors.
3. Score each candidate: total = w_sem*semantic + w_lex*lexical + w_lay*layout
4. Near-dup hard gate: sem ≥ 0.985 or lex ≥ 0.95 → treat as plagiarized.
5. Decision:
   - ≥ 0.85 → plagiarized_likely
   - 0.30–0.85 → needs_review (component overrides also apply)
   - < 0.30 → unique
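The scoring, gating, and decision steps can be sketched as one function. Weights and thresholds default to the env examples earlier in this README; the function name is hypothetical and the real `scorer.py` may differ:

```python
def decide(sem: float, lex: float, lay: float,
           w_sem: float = 0.60, w_lex: float = 0.40, w_lay: float = 0.0,
           near_sem: float = 0.985, near_lex: float = 0.95,
           t_plag: float = 0.85, t_review: float = 0.30) -> tuple[float, str]:
    # Hard gate: one overwhelming signal alone marks the pair as plagiarized.
    if sem >= near_sem or lex >= near_lex:
        return 1.0, "plagiarized_likely"
    # Otherwise fuse the signals and apply the tiered thresholds.
    total = w_sem * sem + w_lex * lex + w_lay * lay
    if total >= t_plag:
        return total, "plagiarized_likely"
    if total >= t_review:
        return total, "needs_review"
    return total, "unique"
```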
For the deep dive, see LOGIC.md.
- Exact duplicate not 100%? Reindex after enabling PII masking and ensure CLI & API masking settings match. Call /reload_index or restart the API.
- Different score when “Generate evidence report” is toggled? Both endpoints now share the same candidate path, so scores should match. If you still see drift, reindex and ensure the environment is identical.
- FAISS / dim mismatch assertion? You switched to a model with a different embedding dimension. rm -rf ./data/index and re-index.
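On the masking-consistency point above: a deterministic masking pass guarantees CLI and API produce identical text before hashing. A sketch with hypothetical regexes (the real `utils/` masker likely covers more PII classes):

```python
import re

# Hypothetical patterns; real masking would cover more PII classes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    # Fixed placeholders make the output identical wherever this runs,
    # so exact-duplicate hashes stay stable across CLI and API.
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)
```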
This project is not open-licensed. All rights reserved.
Use, hosting, modification, distribution, or monetization require written permission from the copyright holder. Revenue sharing applies to commercial use. See LICENSE for full terms.
Contact: Utkarsh Singh – utkarshsingh795@icloud.com