README.md

Transcript Extraction Service

A stateless FastAPI service that extracts structured data from UWI unofficial transcript PDFs. It accepts a PDF file upload, filters out watermark overlays, parses course records, GPAs, programme info, and current term/year, then returns structured JSON.

Quick Start

# Local development (with hot reload)
cd apps/transcripts
python -m venv .venv && .venv/bin/pip install -r requirements-dev.txt
.venv/bin/uvicorn app.main:app --reload --port 8001

# Or via Docker (from project root)
docker compose -f docker-compose.local.yml up transcript-extraction

# Run tests
.venv/bin/python -m pytest tests/ -v

API

Base URL: http://localhost:8001/api/v1

`GET /healthy`

Returns {"status": "healthy"}.

`POST /transcripts/extract`

Extracts structured data from a transcript PDF. Returns 200 on success.

Content-Type: multipart/form-data

Field	Type	Description
`file`	file	The PDF file to extract (must be `application/pdf`)

Example

curl -X POST http://localhost:8001/api/v1/transcripts/extract \
  -F "file=@transcript.pdf"

Response Body

{
  "current_programme": "BSc",              // degree from CURRENT PROGRAMME block
  "major": "Computer Science",             // major field of study
  "current_term": "2023/2024 Semester II", // last semester found in document
  "current_year": 2,                       // derived from highest course level (e.g. COMP 2xxx = year 2)
  "degree_gpa": 3.44,                      // from TRANSCRIPT TOTALS (null if not found)
  "overall_gpa": 3.44,                     // from TRANSCRIPT TOTALS (null if not found)
  "courses": [
    {
      "code": "COMP 1601",                 // subject code + number
      "title": "Computer Programming I",   // course title (multi-line titles are joined)
      "grade": "A+"                         // letter grade, or null if in-progress
    }
  ]
}

Response Fields

Field	Type	Description
`current_programme`	string	Degree type (e.g. "BSc", "MSc"). Empty string if not found.
`major`	string	Major field of study. Empty string if not found.
`current_term`	string	Last semester in the document (e.g. "2023/2024 Semester II", "2024/2025 Summer"). Empty string if not found.
`current_year`	int	Study year derived from highest course code level (COMP 2xxx = year 2). `0` if no courses found.
`degree_gpa`	float \| null	Degree GPA from TRANSCRIPT TOTALS section. `null` if not found.
`overall_gpa`	float \| null	Overall GPA from TRANSCRIPT TOTALS section. `null` if not found.
`courses`	array	All course records found. Empty array if none found.
`courses[].code`	string	Subject code and number (e.g. "COMP 1601").
`courses[].title`	string	Course title. Multi-line titles are joined with a space.
`courses[].grade`	string \| null	Letter grade (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, F1, F2, F3, HD, P, W, MC, AB, DEF, NC, EX, NP, INC). `null` for in-progress courses with no grade yet.

Validation Rules

Rule	HTTP Status	Error
File is not `application/pdf`	422	`file must be a PDF (application/pdf)`
File is empty (0 bytes)	422	`uploaded file is empty`
No text could be extracted from PDF	422	`could not extract text from PDF — file may be corrupt or not a transcript`

Error Responses

All errors return JSON with a detail field.

422 - Validation Error

{ "detail": "file must be a PDF (application/pdf)" }

{ "detail": "could not extract text from PDF — file may be corrupt or not a transcript" }

500 - Internal Server Error

{ "detail": "internal server error" }

Stack traces are logged server-side but never exposed in the response.

Environment Variables

Variable	Default	Description
`LOG_LEVEL`	`INFO`	Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`)

How Extraction Works

PDF opened with pdfplumber. Each page is filtered to remove watermark characters (font size > 20), which removes the "Unofficial Transcript" overlay.
Courses parsed via regex matching lines like COMP 1601 S UG Computer Programming I A+ 3.00 12.00 4.00. Multi-line titles (next line is a continuation) are joined automatically.
GPAs extracted from the TRANSCRIPT TOTALS section — specifically the Overall: and Degree: rows.
Programme info parsed from the CURRENT PROGRAMME block (Degree, Programme, Faculty, Major, Department, Campus).
Current year derived from the highest course code level number (e.g. COMP 2601 = year 2).
Current term is the last semester reference found (e.g. "2024/2025 Semester II").

Each extraction step is isolated — a failure in one section does not prevent the others from being extracted. Failed sections fall back to their default values (empty string, null, 0, or []).

Project Structure

transcripts/
├── app/
│   ├── main.py                         # FastAPI app, routes, error handling
│   ├── extractor.py                    # PDF extraction + transcript parsing logic
│   └── models/
│       ├── __init__.py
│       └── extract_transcript_dtos.py  # Pydantic response models
├── tests/
│   ├── conftest.py                     # sys.path setup
│   ├── test_main.py                    # Unit tests for extraction logic
│   └── test_api.py                     # API endpoint tests
├── Dockerfile                          # Production image
├── Dockerfile.dev                      # Dev image with hot reload
├── requirements.txt                    # Production dependencies
└── requirements-dev.txt                # + test dependencies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transcript Extraction Service

Quick Start

API

`GET /healthy`

`POST /transcripts/extract`

Example

Response Body

Response Fields

Validation Rules

Error Responses

422 - Validation Error

500 - Internal Server Error

Environment Variables

How Extraction Works

Project Structure

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Transcript Extraction Service

Quick Start

API

GET /healthy

POST /transcripts/extract

Example

Response Body

Response Fields

Validation Rules

Error Responses

422 - Validation Error

500 - Internal Server Error

Environment Variables

How Extraction Works

Project Structure

`GET /healthy`

`POST /transcripts/extract`