Skip to content

Latest commit

 

History

History
162 lines (115 loc) · 6.02 KB

File metadata and controls

162 lines (115 loc) · 6.02 KB

Transcript Extraction Service

A stateless FastAPI service that extracts structured data from UWI unofficial transcript PDFs. It accepts a PDF file upload, filters out watermark overlays, parses course records, GPAs, programme info, and current term/year, then returns structured JSON.

Quick Start

# Local development (with hot reload)
cd apps/transcripts
python -m venv .venv && .venv/bin/pip install -r requirements-dev.txt
.venv/bin/uvicorn app.main:app --reload --port 8001

# Or via Docker (from project root)
docker compose -f docker-compose.local.yml up transcript-extraction

# Run tests
.venv/bin/python -m pytest tests/ -v

API

Base URL: http://localhost:8001/api/v1

GET /healthy

Returns {"status": "healthy"}.

POST /transcripts/extract

Extracts structured data from a transcript PDF. Returns 200 on success.

Content-Type: multipart/form-data

Field Type Description
file file The PDF file to extract (must be application/pdf)

Example

curl -X POST http://localhost:8001/api/v1/transcripts/extract \
  -F "file=@transcript.pdf"

Response Body

{
  "current_programme": "BSc",              // degree from CURRENT PROGRAMME block
  "major": "Computer Science",             // major field of study
  "current_term": "2023/2024 Semester II", // last semester found in document
  "current_year": 2,                       // derived from highest course level (e.g. COMP 2xxx = year 2)
  "degree_gpa": 3.44,                      // from TRANSCRIPT TOTALS (null if not found)
  "overall_gpa": 3.44,                     // from TRANSCRIPT TOTALS (null if not found)
  "courses": [
    {
      "code": "COMP 1601",                 // subject code + number
      "title": "Computer Programming I",   // course title (multi-line titles are joined)
      "grade": "A+"                         // letter grade, or null if in-progress
    }
  ]
}

Response Fields

Field Type Description
current_programme string Degree type (e.g. "BSc", "MSc"). Empty string if not found.
major string Major field of study. Empty string if not found.
current_term string Last semester in the document (e.g. "2023/2024 Semester II", "2024/2025 Summer"). Empty string if not found.
current_year int Study year derived from highest course code level (COMP 2xxx = year 2). 0 if no courses found.
degree_gpa float | null Degree GPA from TRANSCRIPT TOTALS section. null if not found.
overall_gpa float | null Overall GPA from TRANSCRIPT TOTALS section. null if not found.
courses array All course records found. Empty array if none found.
courses[].code string Subject code and number (e.g. "COMP 1601").
courses[].title string Course title. Multi-line titles are joined with a space.
courses[].grade string | null Letter grade (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, F1, F2, F3, HD, P, W, MC, AB, DEF, NC, EX, NP, INC). null for in-progress courses with no grade yet.

Validation Rules

Rule HTTP Status Error
File is not application/pdf 422 file must be a PDF (application/pdf)
File is empty (0 bytes) 422 uploaded file is empty
No text could be extracted from PDF 422 could not extract text from PDF — file may be corrupt or not a transcript

Error Responses

All errors return JSON with a detail field.

422 - Validation Error

{ "detail": "file must be a PDF (application/pdf)" }
{ "detail": "could not extract text from PDF — file may be corrupt or not a transcript" }

500 - Internal Server Error

{ "detail": "internal server error" }

Stack traces are logged server-side but never exposed in the response.


Environment Variables

Variable Default Description
LOG_LEVEL INFO Logging level (DEBUG, INFO, WARNING, ERROR)

How Extraction Works

  1. PDF opened with pdfplumber. Each page is filtered to remove watermark characters (font size > 20), which removes the "Unofficial Transcript" overlay.

  2. Courses parsed via regex matching lines like COMP 1601 S UG Computer Programming I A+ 3.00 12.00 4.00. Multi-line titles (next line is a continuation) are joined automatically.

  3. GPAs extracted from the TRANSCRIPT TOTALS section — specifically the Overall: and Degree: rows.

  4. Programme info parsed from the CURRENT PROGRAMME block (Degree, Programme, Faculty, Major, Department, Campus).

  5. Current year derived from the highest course code level number (e.g. COMP 2601 = year 2).

  6. Current term is the last semester reference found (e.g. "2024/2025 Semester II").

Each extraction step is isolated — a failure in one section does not prevent the others from being extracted. Failed sections fall back to their default values (empty string, null, 0, or []).


Project Structure

transcripts/
├── app/
│   ├── main.py                         # FastAPI app, routes, error handling
│   ├── extractor.py                    # PDF extraction + transcript parsing logic
│   └── models/
│       ├── __init__.py
│       └── extract_transcript_dtos.py  # Pydantic response models
├── tests/
│   ├── conftest.py                     # sys.path setup
│   ├── test_main.py                    # Unit tests for extraction logic
│   └── test_api.py                     # API endpoint tests
├── Dockerfile                          # Production image
├── Dockerfile.dev                      # Dev image with hot reload
├── requirements.txt                    # Production dependencies
└── requirements-dev.txt                # + test dependencies