A stateless FastAPI service that extracts structured data from UWI unofficial transcript PDFs. It accepts a PDF file upload, filters out watermark overlays, parses course records, GPAs, programme info, and current term/year, then returns structured JSON.
```bash
# Local development (with hot reload)
cd apps/transcripts
python -m venv .venv && .venv/bin/pip install -r requirements-dev.txt
.venv/bin/uvicorn app.main:app --reload --port 8001

# Or via Docker (from project root)
docker compose -f docker-compose.local.yml up transcript-extraction

# Run tests
.venv/bin/python -m pytest tests/ -v
```

Base URL: `http://localhost:8001/api/v1`
The health check endpoint returns `{"status": "healthy"}`.
`POST /transcripts/extract` extracts structured data from a transcript PDF and returns 200 on success.
The request body must be sent with Content-Type: `multipart/form-data`.
| Field | Type | Description |
|---|---|---|
| `file` | file | The PDF file to extract (must be application/pdf) |
```bash
curl -X POST http://localhost:8001/api/v1/transcripts/extract \
  -F "file=@transcript.pdf"
```

| Field | Type | Description |
|---|---|---|
| `current_programme` | string | Degree type (e.g. "BSc", "MSc"). Empty string if not found. |
| `major` | string | Major field of study. Empty string if not found. |
| `current_term` | string | Last semester in the document (e.g. "2023/2024 Semester II", "2024/2025 Summer"). Empty string if not found. |
| `current_year` | int | Study year derived from highest course code level (COMP 2xxx = year 2). 0 if no courses found. |
| `degree_gpa` | float \| null | Degree GPA from TRANSCRIPT TOTALS section. null if not found. |
| `overall_gpa` | float \| null | Overall GPA from TRANSCRIPT TOTALS section. null if not found. |
| `courses` | array | All course records found. Empty array if none found. |
| `courses[].code` | string | Subject code and number (e.g. "COMP 1601"). |
| `courses[].title` | string | Course title. Multi-line titles are joined with a space. |
| `courses[].grade` | string \| null | Letter grade (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, F1, F2, F3, HD, P, W, MC, AB, DEF, NC, EX, NP, INC). null for in-progress courses with no grade yet. |
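As a rough illustration of the response shape and its documented fallback values, the fields above could be modelled as follows. The service itself uses Pydantic models; this sketch swaps in stdlib dataclasses so it runs without dependencies, and the class names `Course` and `TranscriptResponse` are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Course:
    code: str                    # subject code and number, e.g. "COMP 1601"
    title: str                   # multi-line titles already joined with a space
    grade: Optional[str] = None  # None for in-progress courses


@dataclass
class TranscriptResponse:
    # Defaults mirror the "if not found" values documented in the table.
    current_programme: str = ""
    major: str = ""
    current_term: str = ""
    current_year: int = 0
    degree_gpa: Optional[float] = None
    overall_gpa: Optional[float] = None
    courses: List[Course] = field(default_factory=list)


# A response in which nothing could be extracted.
empty = asdict(TranscriptResponse())
```

Serialising `empty` to JSON would yield empty strings, `0`, `null`, and `[]` for the respective fields.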
| Rule | HTTP Status | Error |
|---|---|---|
| File is not application/pdf | 422 | file must be a PDF (application/pdf) |
| File is empty (0 bytes) | 422 | uploaded file is empty |
| No text could be extracted from PDF | 422 | could not extract text from PDF — file may be corrupt or not a transcript |
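The rules in the table could be expressed as two small checks, sketched below. The function names `check_upload` and `check_extracted_text` are hypothetical, and the order of the checks is an assumption taken from the table order; the actual route handler may be structured differently.

```python
from typing import Optional, Tuple

# (HTTP status, detail message), or None when the input passes.
ValidationError = Optional[Tuple[int, str]]


def check_upload(content_type: Optional[str], data: bytes) -> ValidationError:
    """Validate the raw upload before any PDF parsing happens."""
    if content_type != "application/pdf":
        return 422, "file must be a PDF (application/pdf)"
    if len(data) == 0:
        return 422, "uploaded file is empty"
    return None


def check_extracted_text(text: str) -> ValidationError:
    """Validate the text pulled out of the PDF (assumed: blank text counts as none)."""
    if not text.strip():
        return 422, "could not extract text from PDF — file may be corrupt or not a transcript"
    return None
```

In a FastAPI route, a non-`None` result would typically be turned into an `HTTPException` with the given status code and detail.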
All errors return JSON with a detail field.
```json
{ "detail": "file must be a PDF (application/pdf)" }
```

```json
{ "detail": "could not extract text from PDF — file may be corrupt or not a transcript" }
```

```json
{ "detail": "internal server error" }
```

Stack traces are logged server-side but never exposed in the response.
| Variable | Default | Description |
|---|---|---|
| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARNING, ERROR) |
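One way to read this variable is sketched below. The helper name `resolve_log_level` is hypothetical, and falling back to `INFO` for unrecognised values is an assumption rather than documented behaviour.

```python
import logging
import os
from typing import Mapping, Optional


def resolve_log_level(env: Optional[Mapping[str, str]] = None) -> int:
    """Map LOG_LEVEL (default "INFO") to a numeric logging level."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, name, None)
    # Guard against names that exist on the module but are not levels.
    return level if isinstance(level, int) else logging.INFO


logging.basicConfig(level=resolve_log_level())
```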
- PDF opened with pdfplumber. Each page is filtered to remove watermark characters (font size > 20), which removes the "Unofficial Transcript" overlay.
- Courses parsed via regex matching lines like `COMP 1601 S UG Computer Programming I A+ 3.00 12.00 4.00`. Multi-line titles (next line is a continuation) are joined automatically.
- GPAs extracted from the `TRANSCRIPT TOTALS` section, specifically the `Overall:` and `Degree:` rows.
- Programme info parsed from the `CURRENT PROGRAMME` block (Degree, Programme, Faculty, Major, Department, Campus).
- Current year derived from the highest course code level number (e.g. COMP 2601 = year 2).
- Current term is the last semester reference found (e.g. "2024/2025 Semester II").
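Several of the steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `extractor.py`: the regexes are guesses built from the single sample line and the term examples, all function names are hypothetical, and `keep_char` assumes character objects are dicts with a numeric `size` key (as pdfplumber's char objects are), so it could be used as a page filter predicate.

```python
import re

GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-",
          "F", "F1", "F2", "F3", "HD", "P", "W", "MC", "AB", "DEF", "NC",
          "EX", "NP", "INC"]
# Longest alternatives first so "A+" is tried before "A".
GRADE_ALT = "|".join(re.escape(g) for g in sorted(GRADES, key=len, reverse=True))

# Assumed layout: CODE, two short columns ("S UG" in the sample), title,
# optional grade, then zero or more decimal columns.
COURSE_RE = re.compile(
    r"^(?P<code>[A-Z]{4}\s+\d{4})\s+\S+\s+\S+\s+"
    r"(?P<title>.+?)"
    r"(?:\s+(?P<grade>" + GRADE_ALT + r"))?"
    r"(?:\s+\d+\.\d{2})*\s*$"
)
TERM_RE = re.compile(r"\d{4}/\d{4}\s+(?:Semester\s+I{1,3}|Summer)")


def keep_char(obj: dict) -> bool:
    """Keep everything except watermark glyphs (font size above 20)."""
    return obj.get("size", 0) <= 20


def parse_course(line: str):
    """Return a course dict for a matching line, else None."""
    m = COURSE_RE.match(line.strip())
    if m is None:
        return None
    return {"code": " ".join(m["code"].split()), "title": m["title"], "grade": m["grade"]}


def derive_year(codes) -> int:
    """Highest leading digit of the course numbers; 0 if there are no courses."""
    return max((int(code.split()[1][0]) for code in codes), default=0)


def last_term(text: str) -> str:
    """Last semester reference in the text; empty string if none is found."""
    matches = TERM_RE.findall(text)
    return matches[-1] if matches else ""


course = parse_course("COMP 1601 S UG Computer Programming I A+ 3.00 12.00 4.00")
```

A title that happens to end in a token that is also a grade (for example a trailing "A") would be misread by this pattern, which is one reason a real parser needs more context than a single regex.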
Each extraction step is isolated — a failure in one section does not prevent the others from being extracted. Failed sections fall back to their default values (empty string, null, 0, or []).
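The isolation described here can be pictured as a small wrapper around each step, sketched below. The names `run_step` and `broken_step` and the logger name are hypothetical; the actual extractor may organise its error handling differently.

```python
import logging

logger = logging.getLogger("transcripts.extractor")  # hypothetical logger name


def run_step(name, step, text, default):
    """Run one extraction step; on any error, log it and return the default."""
    try:
        return step(text)
    except Exception:
        logger.exception("extraction step %s failed; using default", name)
        return default


def broken_step(text):
    # Stand-in for a step that fails, e.g. a missing TRANSCRIPT TOTALS section.
    raise ValueError("section not found")


text = " 2024/2025 Summer "
result = {
    "current_term": run_step("current_term", str.strip, text, ""),
    "degree_gpa": run_step("degree_gpa", broken_step, text, None),
    "courses": run_step("courses", broken_step, text, []),
}
```

Here the two failing steps fall back to `None` and `[]` while `current_term` is still populated.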
```
transcripts/
├── app/
│   ├── main.py                         # FastAPI app, routes, error handling
│   ├── extractor.py                    # PDF extraction + transcript parsing logic
│   └── models/
│       ├── __init__.py
│       └── extract_transcript_dtos.py  # Pydantic response models
├── tests/
│   ├── conftest.py                     # sys.path setup
│   ├── test_main.py                    # Unit tests for extraction logic
│   └── test_api.py                     # API endpoint tests
├── Dockerfile                          # Production image
├── Dockerfile.dev                      # Dev image with hot reload
├── requirements.txt                    # Production dependencies
└── requirements-dev.txt                # + test dependencies
```
```json
{
  "current_programme": "BSc",               // degree from CURRENT PROGRAMME block
  "major": "Computer Science",              // major field of study
  "current_term": "2023/2024 Semester II",  // last semester found in document
  "current_year": 2,                        // derived from highest course level (e.g. COMP 2xxx = year 2)
  "degree_gpa": 3.44,                       // from TRANSCRIPT TOTALS (null if not found)
  "overall_gpa": 3.44,                      // from TRANSCRIPT TOTALS (null if not found)
  "courses": [
    {
      "code": "COMP 1601",                  // subject code + number
      "title": "Computer Programming I",    // course title (multi-line titles are joined)
      "grade": "A+"                         // letter grade, or null if in-progress
    }
  ]
}
```