# Transcript Extraction Service

A stateless FastAPI service that extracts structured data from UWI unofficial transcript PDFs. It accepts a PDF file upload, filters out watermark overlays, parses course records, GPAs, programme info, and current term/year, then returns structured JSON.

## Quick Start

```bash
# Local development (with hot reload)
cd apps/transcripts
python -m venv .venv && .venv/bin/pip install -r requirements-dev.txt
.venv/bin/uvicorn app.main:app --reload --port 8001

# Or via Docker (from project root)
docker compose -f docker-compose.local.yml up transcript-extraction

# Run tests
.venv/bin/python -m pytest tests/ -v
```

## API

Base URL: `http://localhost:8001/api/v1`

### `GET /healthy`

Returns `{"status": "healthy"}`.

### `POST /transcripts/extract`

Extracts structured data from a transcript PDF. Returns `200` on success.

**Content-Type:** `multipart/form-data`

| Field | Type | Description |
|-------|------|-------------|
| `file` | file | The PDF file to extract (must be `application/pdf`) |

#### Example

```bash
curl -X POST http://localhost:8001/api/v1/transcripts/extract \
  -F "file=@transcript.pdf"
```

---

## Response Body

```jsonc
{
  "current_programme": "BSc",              // degree from CURRENT PROGRAMME block
  "major": "Computer Science",             // major field of study
  "current_term": "2023/2024 Semester II", // last semester found in document
  "current_year": 2,                       // derived from highest course level (e.g. COMP 2xxx = year 2)
  "degree_gpa": 3.44,                      // from TRANSCRIPT TOTALS (null if not found)
  "overall_gpa": 3.44,                     // from TRANSCRIPT TOTALS (null if not found)
  "courses": [
    {
      "code": "COMP 1601",                 // subject code + number
      "title": "Computer Programming I",   // course title (multi-line titles are joined)
      "grade": "A+"                        // letter grade, or null if in-progress
    }
  ]
}
```

### Response Fields

| Field | Type | Description |
|-------|------|-------------|
| `current_programme` | string | Degree type (e.g. "BSc", "MSc"). Empty string if not found. |
| `major` | string | Major field of study. Empty string if not found. |
| `current_term` | string | Last semester in the document (e.g. "2023/2024 Semester II", "2024/2025 Summer"). Empty string if not found. |
| `current_year` | int | Study year derived from highest course code level (COMP **2**xxx = year 2). `0` if no courses found. |
| `degree_gpa` | float \| null | Degree GPA from TRANSCRIPT TOTALS section. `null` if not found. |
| `overall_gpa` | float \| null | Overall GPA from TRANSCRIPT TOTALS section. `null` if not found. |
| `courses` | array | All course records found. Empty array if none found. |
| `courses[].code` | string | Subject code and number (e.g. "COMP 1601"). |
| `courses[].title` | string | Course title. Multi-line titles are joined with a space. |
| `courses[].grade` | string \| null | Letter grade (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, F1, F2, F3, HD, P, W, MC, AB, DEF, NC, EX, NP, INC). `null` for in-progress courses with no grade yet. |

---

## Validation Rules

| Rule | HTTP Status | Error |
|------|-------------|-------|
| File is not `application/pdf` | 422 | `file must be a PDF (application/pdf)` |
| File is empty (0 bytes) | 422 | `uploaded file is empty` |
| No text could be extracted from PDF | 422 | `could not extract text from PDF — file may be corrupt or not a transcript` |

---

## Error Responses

All errors return JSON with a `detail` field.

### 422 - Validation Error

```json
{ "detail": "file must be a PDF (application/pdf)" }
```

```json
{ "detail": "could not extract text from PDF — file may be corrupt or not a transcript" }
```

### 500 - Internal Server Error

```json
{ "detail": "internal server error" }
```

Stack traces are logged server-side but never exposed in the response.

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LOG_LEVEL` | `INFO` | Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |

---

## How Extraction Works

1. **PDF opened** with pdfplumber. Each page is filtered to remove watermark characters (font size > 20), which removes the "Unofficial Transcript" overlay.
2. **Courses parsed** via regex matching lines like `COMP 1601 S UG Computer Programming I A+ 3.00 12.00 4.00`. Multi-line titles (next line is a continuation) are joined automatically.
3. **GPAs extracted** from the `TRANSCRIPT TOTALS` section, specifically the `Overall:` and `Degree:` rows.
4. **Programme info** parsed from the `CURRENT PROGRAMME` block (Degree, Programme, Faculty, Major, Department, Campus).
5. **Current year** derived from the highest course code level number (e.g. COMP **2**601 = year 2).
6. **Current term** is the last semester reference found (e.g. "2024/2025 Semester II").

Each extraction step is isolated: a failure in one section does not prevent the others from being extracted. Failed sections fall back to their default values (empty string, `null`, `0`, or `[]`).

---

## Project Structure

```
transcripts/
├── app/
│   ├── main.py                        # FastAPI app, routes, error handling
│   ├── extractor.py                   # PDF extraction + transcript parsing logic
│   └── models/
│       ├── __init__.py
│       └── extract_transcript_dtos.py # Pydantic response models
├── tests/
│   ├── conftest.py                    # sys.path setup
│   ├── test_main.py                   # Unit tests for extraction logic
│   └── test_api.py                    # API endpoint tests
├── Dockerfile                         # Production image
├── Dockerfile.dev                     # Dev image with hot reload
├── requirements.txt                   # Production dependencies
└── requirements-dev.txt               # + test dependencies
```
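
---

## Illustrative Parsing Sketch

The course, current-year, and current-term steps described under "How Extraction Works" can be sketched roughly as follows. This is a minimal illustration and not the code in `app/extractor.py`. The names `COURSE_ROW`, `TERM`, `parse_courses`, `derive_current_year`, and `find_current_term` are hypothetical, and so are the regexes themselves. The sketch assumes that two short tokens (`S UG` in the example row) follow the course code, and that an in-progress row simply omits the grade while keeping at least one numeric column. Multi-line title joining is left out.

```python
import re

# Grades listed in the Response Fields table; longer tokens are tried first
# so that "A+" is not matched as a bare "A".
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-",
          "F", "F1", "F2", "F3", "HD", "P", "W", "MC", "AB", "DEF", "NC",
          "EX", "NP", "INC"]
_GRADE_ALT = "|".join(re.escape(g) for g in sorted(GRADES, key=len, reverse=True))

# Hypothetical row layout (an assumption, not the service's actual regex):
#   <SUBJ> <NNNN> <token> <token> <title> [grade] <number> [<number> ...]
COURSE_ROW = re.compile(
    r"^(?P<subject>[A-Z]{4})\s+(?P<number>\d{4})"
    r"\s+\S+\s+\S+\s+"
    r"(?P<title>.+?)"
    rf"(?:\s+(?P<grade>{_GRADE_ALT}))?"
    r"(?:\s+\d+\.\d{2})+\s*$"
)

# Semester references such as "2023/2024 Semester II" or "2024/2025 Summer".
TERM = re.compile(r"\d{4}/\d{4}\s+(?:Semester\s+I{1,3}|Summer)")


def parse_courses(text: str) -> list[dict]:
    """Return one record per line that looks like a course row."""
    courses = []
    for line in text.splitlines():
        m = COURSE_ROW.match(line.strip())
        if m:
            courses.append({
                "code": f"{m['subject']} {m['number']}",
                "title": m["title"],
                "grade": m["grade"],  # None when no grade token is present
            })
    return courses


def derive_current_year(courses: list[dict]) -> int:
    """Highest leading digit of any course number; 0 if there are no courses."""
    return max((int(c["code"].split()[1][0]) for c in courses), default=0)


def find_current_term(text: str) -> str:
    """Last semester reference in the text; empty string if none is found."""
    matches = TERM.findall(text)
    return matches[-1] if matches else ""


SAMPLE = """\
2022/2023 Semester I
COMP 1601 S UG Computer Programming I A+ 3.00 12.00 4.00
2023/2024 Semester II
COMP 2601 S UG Computer Architecture 3.00
"""

if __name__ == "__main__":
    found = parse_courses(SAMPLE)
    print(found)
    print(derive_current_year(found), find_current_term(SAMPLE))
```

The fallbacks match the defaults documented above: `0` when no courses are found and an empty string when no term is found. A production parser would also need the watermark filtering and title-continuation handling that this sketch skips.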