Transcript Extraction Service

A stateless FastAPI service that extracts structured data from UWI unofficial transcript PDFs. It accepts a PDF file upload, filters out watermark overlays, parses course records, GPAs, programme info, and current term/year, then returns structured JSON.

Quick Start

# Local development (with hot reload)
cd apps/transcripts
python -m venv .venv && .venv/bin/pip install -r requirements-dev.txt
.venv/bin/uvicorn app.main:app --reload --port 8001

# Or via Docker (from project root)
docker compose -f docker-compose.local.yml up transcript-extraction

# Run tests
.venv/bin/python -m pytest tests/ -v

API

Base URL: http://localhost:8001/api/v1

`GET /healthy`

Returns `{"status": "healthy"}`.

`POST /transcripts/extract`

Extracts structured data from a transcript PDF. Returns 200 on success.

Content-Type: multipart/form-data

| Field | Type | Description |
| --- | --- | --- |
| `file` | file | The PDF file to extract (must be `application/pdf`) |

Example

curl -X POST http://localhost:8001/api/v1/transcripts/extract \
  -F "file=@transcript.pdf"
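
The same request can be made from Python. The following is a minimal sketch using only the standard library; `build_multipart` and `extract_transcript` are hypothetical helper names, not part of the service, and the URL assumes the local Base URL given above.

```python
import json
import urllib.request
import uuid

# Assumes the service is running locally on the Base URL shown above.
EXTRACT_URL = "http://localhost:8001/api/v1/transcripts/extract"


def build_multipart(filename: str, pdf_bytes: bytes, boundary: str) -> bytes:
    """Encode a single `file` form field as a multipart/form-data body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + pdf_bytes + tail


def extract_transcript(path: str, url: str = EXTRACT_URL) -> dict:
    """POST the PDF at `path` and return the decoded JSON response."""
    boundary = uuid.uuid4().hex
    with open(path, "rb") as fh:
        body = build_multipart(path.rsplit("/", 1)[-1], fh.read(), boundary)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a caller that wants to read the error body would need to catch that exception.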

Response Body

{
  "current_programme": "BSc",              // degree from CURRENT PROGRAMME block
  "major": "Computer Science",             // major field of study
  "current_term": "2023/2024 Semester II", // last semester found in document
  "current_year": 2,                       // derived from highest course level (e.g. COMP 2xxx = year 2)
  "degree_gpa": 3.44,                      // from TRANSCRIPT TOTALS (null if not found)
  "overall_gpa": 3.44,                     // from TRANSCRIPT TOTALS (null if not found)
  "courses": [
    {
      "code": "COMP 1601",                 // subject code + number
      "title": "Computer Programming I",   // course title (multi-line titles are joined)
      "grade": "A+"                         // letter grade, or null if in-progress
    }
  ]
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| `current_programme` | string | Degree type (e.g. "BSc", "MSc"). Empty string if not found. |
| `major` | string | Major field of study. Empty string if not found. |
| `current_term` | string | Last semester in the document (e.g. "2023/2024 Semester II", "2024/2025 Summer"). Empty string if not found. |
| `current_year` | int | Study year derived from highest course code level (COMP 2xxx = year 2). `0` if no courses found. |
| `degree_gpa` | float \| null | Degree GPA from TRANSCRIPT TOTALS section. `null` if not found. |
| `overall_gpa` | float \| null | Overall GPA from TRANSCRIPT TOTALS section. `null` if not found. |
| `courses` | array | All course records found. Empty array if none found. |
| `courses[].code` | string | Subject code and number (e.g. "COMP 1601"). |
| `courses[].title` | string | Course title. Multi-line titles are joined with a space. |
| `courses[].grade` | string \| null | Letter grade (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F, F1, F2, F3, HD, P, W, MC, AB, DEF, NC, EX, NP, INC). `null` for in-progress courses with no grade yet. |
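
A client can mirror these fields with plain dataclasses. The sketch below is illustrative only: `Course`, `TranscriptResult`, and `parse_result` are hypothetical client-side names and are not the service's own Pydantic models. The defaults follow the "not found" values listed in the table.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Course:
    code: str
    title: str
    grade: Optional[str] = None  # None while the course is in progress


@dataclass
class TranscriptResult:
    current_programme: str = ""
    major: str = ""
    current_term: str = ""
    current_year: int = 0
    degree_gpa: Optional[float] = None
    overall_gpa: Optional[float] = None
    courses: List[Course] = field(default_factory=list)


def parse_result(payload: str) -> TranscriptResult:
    """Build a TranscriptResult from the JSON text of a 200 response."""
    data = json.loads(payload)
    courses = [Course(**c) for c in data.get("courses", [])]
    scalars = {k: v for k, v in data.items() if k != "courses"}
    return TranscriptResult(courses=courses, **scalars)


def in_progress(result: TranscriptResult) -> List[str]:
    """Codes of courses that have no grade yet."""
    return [c.code for c in result.courses if c.grade is None]
```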

Validation Rules

| Rule | HTTP Status | Error |
| --- | --- | --- |
| File is not `application/pdf` | 422 | `file must be a PDF (application/pdf)` |
| File is empty (0 bytes) | 422 | `uploaded file is empty` |
| No text could be extracted from PDF | 422 | `could not extract text from PDF — file may be corrupt or not a transcript` |
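
The rules above could be expressed as two small checks, one before extraction and one after. This is a sketch, not the service's code: the function names are hypothetical, the order of the first two checks is an assumption, and so is treating whitespace-only output as "no text". A caller would map any returned message to a 422 response.

```python
from typing import Optional

# Error strings copied verbatim from the table above.
NOT_PDF = "file must be a PDF (application/pdf)"
EMPTY_FILE = "uploaded file is empty"
NO_TEXT = "could not extract text from PDF — file may be corrupt or not a transcript"


def check_upload(content_type: Optional[str], data: bytes) -> Optional[str]:
    """Return the 422 error message for a bad upload, or None if it passes."""
    if content_type != "application/pdf":
        return NOT_PDF
    if len(data) == 0:
        return EMPTY_FILE
    return None


def check_extracted_text(text: str) -> Optional[str]:
    """Return the 422 error message when extraction produced no usable text."""
    # Assumption: whitespace-only output counts as "no text".
    return None if text.strip() else NO_TEXT
```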

Error Responses

All errors return JSON with a `detail` field.

422 - Validation Error

{ "detail": "file must be a PDF (application/pdf)" }

{ "detail": "could not extract text from PDF — file may be corrupt or not a transcript" }

500 - Internal Server Error

{ "detail": "internal server error" }

Stack traces are logged server-side but never exposed in the response.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `LOG_LEVEL` | `INFO` | Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
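
One way to read this variable at startup is sketched below. `resolve_log_level` and `configure_logging` are hypothetical names, and falling back to `INFO` for an unrecognised value is an assumption; the service may handle that case differently.

```python
import logging
import os
from typing import Mapping

# Only the four levels documented above are accepted.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def resolve_log_level(env: Mapping[str, str] = os.environ) -> int:
    """Map LOG_LEVEL to a logging constant, defaulting to INFO."""
    name = env.get("LOG_LEVEL", "INFO").strip().upper()
    # Assumption: unknown values fall back to INFO instead of raising.
    return _LEVELS.get(name, logging.INFO)


def configure_logging() -> None:
    """Apply the resolved level to the root logger."""
    logging.basicConfig(level=resolve_log_level())
```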

How Extraction Works

1. The PDF is opened with pdfplumber. Each page is filtered to drop watermark characters (font size > 20), which removes the "Unofficial Transcript" overlay.
2. Courses are parsed with a regex that matches lines like `COMP 1601 S UG Computer Programming I A+ 3.00 12.00 4.00`. When a title wraps onto the next line, the continuation is joined to it automatically.
3. GPAs are extracted from the TRANSCRIPT TOTALS section, specifically the `Overall:` and `Degree:` rows.
4. Programme info is parsed from the CURRENT PROGRAMME block (Degree, Programme, Faculty, Major, Department, Campus).
5. The current year is derived from the highest course code level (e.g. COMP 2601 = year 2).
6. The current term is the last semester reference found (e.g. "2024/2025 Semester II").
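
Several of the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `extractor.py`: the regexes, function names, and the assumed column layout (four-letter subject, four-digit number, two short tokens, title, optional grade, trailing decimals) are inferred from the single example line. Joining of wrapped titles is omitted because the continuation layout is not described here. `keep_char` is written as a predicate that could be passed to pdfplumber's `Page.filter`, but the sketch itself does not import pdfplumber.

```python
import re
from typing import Dict, List, Optional

WATERMARK_MIN_SIZE = 20  # font-size threshold from the description above

GRADES = ("A+ A A- B+ B B- C+ C C- D+ D D- F F1 F2 F3 "
          "HD P W MC AB DEF NC EX NP INC").split()
# Longest alternatives first so "A+" is tried before "A".
_GRADE_ALT = "|".join(re.escape(g) for g in sorted(GRADES, key=len, reverse=True))

# Assumed layout: SUBJ NNNN <token> <token> TITLE [GRADE] NUMBER...
COURSE_RE = re.compile(
    r"^(?P<subject>[A-Z]{4}) (?P<number>\d{4}) \S+ \S+ (?P<title>.+?)"
    rf"(?: (?P<grade>{_GRADE_ALT}))?(?: \d+\.\d{{2}})+$"
)
TERM_RE = re.compile(r"\d{4}/\d{4} (?:Semester I{1,3}|Summer)")


def keep_char(obj: dict) -> bool:
    """Keep every object except characters larger than the watermark threshold."""
    is_char = obj.get("object_type") == "char"
    return not (is_char and obj.get("size", 0) > WATERMARK_MIN_SIZE)


def parse_courses(text: str) -> List[Dict[str, Optional[str]]]:
    """Collect one record per line that matches the assumed course layout."""
    courses = []
    for line in text.splitlines():
        match = COURSE_RE.match(line.strip())
        if match:
            courses.append({
                "code": f"{match['subject']} {match['number']}",
                "title": match["title"],
                "grade": match["grade"],  # None when no grade is present
            })
    return courses


def current_year(courses: List[Dict[str, Optional[str]]]) -> int:
    """Highest leading digit of any course number, or 0 with no courses."""
    return max((int(c["code"].split()[1][0]) for c in courses), default=0)


def current_term(text: str) -> str:
    """Last semester reference in the text, or an empty string."""
    terms = TERM_RE.findall(text)
    return terms[-1] if terms else ""
```

With pdfplumber installed, the predicate would typically be applied per page as `page.filter(keep_char).extract_text()` before the text is handed to the parsing functions.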

Each extraction step is isolated: a failure in one section does not prevent the others from being extracted. A failed section falls back to its default value (empty string, null, 0, or []).
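
The isolation described above can be pictured as a small wrapper around each step. This is a hedged sketch: `safe_step` and the logger name are hypothetical, and catching every `Exception` is an assumption about how broad the real guard is.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("transcripts.sketch")  # hypothetical logger name


def safe_step(name: str, step: Callable[[], T], default: T) -> T:
    """Run one extraction step; on any error, log it and return the default."""
    try:
        return step()
    except Exception:
        # The traceback is logged, and the remaining steps still run.
        logger.exception("extraction step %r failed; using default", name)
        return default
```

A caller would wrap each section separately, for example `safe_step("courses", lambda: parse(text), [])`, where `parse` stands in for whatever function handles that section, so that a failure in one section leaves the other fields intact.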

Project Structure

transcripts/
├── app/
│   ├── main.py                         # FastAPI app, routes, error handling
│   ├── extractor.py                    # PDF extraction + transcript parsing logic
│   └── models/
│       ├── __init__.py
│       └── extract_transcript_dtos.py  # Pydantic response models
├── tests/
│   ├── conftest.py                     # sys.path setup
│   ├── test_main.py                    # Unit tests for extraction logic
│   └── test_api.py                     # API endpoint tests
├── Dockerfile                          # Production image
├── Dockerfile.dev                      # Dev image with hot reload
├── requirements.txt                    # Production dependencies
└── requirements-dev.txt                # + test dependencies
