Gill A.

Remote

Extract blood test data from PDF documents that have been OCR'd

Posted 3 days ago

Project-Based

Description

The objective is to build a structured blood test database that allows pathology results to be viewed, edited, filtered, and exported to Excel via a web-based HTML interface. The system stores results in a clean, standardised format so trends can be analysed accurately over time.

Using AI-assisted OCR, I have built a local Python extraction pipeline that converts PDF pathology reports into machine-readable text and inserts structured data into a SQLite database. The majority of blood tests extract correctly, including canonical test name, result value, unit, and reference range.

However, I have reached a specific technical issue with three markers:

• CRP (C-reactive protein)
• ESR (Erythrocyte sedimentation rate)
• GLU (Glucose)

The OCR output clearly contains the correct lines, and debug logs confirm they are processed. Yet no rows are inserted for these markers.

The failure appears to occur somewhere in canonical matching, numeric extraction, or validation logic.

Current System Architecture

The system runs locally and consists of:

• extraction_core_2.py (main engine)
• Supporting modules for OCR preprocessing, lab dictionary building, regex matching, and validation
• Schema-driven canonical lab dictionary
• HTML viewer for results display and Excel export

Pipeline flow:

1. Convert PDF to image (pdf2image)
2. Preprocess
3. Run Tesseract OCR
4. Clean and normalise text
5. Match against canonical lab dictionary
6. Extract:
   • canonical test name
   • numeric result
   • unit
   • reference range
7. Validate
8. Insert into SQLite

The engine is deterministic and rule-based.

The Specific Problem

Example OCR line:

CRP H 5.2 mg/L 0-5

OCR text is correct. NUMBER_PATTERN matches. The canonical dictionary contains the test.

Yet:

Inserted 0 rows from XXX-XXX-XXXXOrderReport_23B00006604_CRP.pdf

Likely failure points include:

• Canonical containment match failing due to normalisation
• Flag tokens (“H”, “L”) interfering with numeric capture
• Validation rejecting due to strict range formatting
• Unit pattern mismatch (e.g. mmol/L)
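To illustrate the flag-token point, here is a minimal sketch of a line pattern that treats the "H"/"L" flag as an optional token between the test name and the result. LINE_PATTERN and parse_line are hypothetical names for illustration only; the real patterns in extraction_core_2.py may be structured differently.

```python
import re

# Hypothetical pattern (not the engine's own): test name, optional H/L flag,
# numeric result, unit, then a "low-high" reference range.
LINE_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z0-9 \-]*?)\s+"
    r"(?:(?P<flag>[HL])\s+)?"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>\S+)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)$"
)


def parse_line(line):
    """Split one OCR line into fields, or return None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "flag": match.group("flag"),  # None when the line carries no flag
        "value": float(match.group("value")),
        "unit": match.group("unit"),
        "ref_low": float(match.group("low")),
        "ref_high": float(match.group("high")),
    }


flagged = parse_line("CRP H 5.2 mg/L 0-5")
unflagged = parse_line("GLU 5.4 mmol/L 3.9-5.5")
```

Because the flag group is optional and must be followed by whitespace, the same pattern handles flagged and unflagged lines without changing how the numeric result is captured.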

If validation fails, the row is rejected silently.
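One way to make silent rejection visible is for validation to return its reasons and log them. The sketch below assumes a row is a plain dict with the keys shown; validate_row, validate_and_log, the key names, and the logger name are all hypothetical and not taken from the existing engine.

```python
import logging

logger = logging.getLogger("extraction.validation")  # hypothetical logger name


def validate_row(row):
    """Return a list of rejection reasons; an empty list means the row passes."""
    reasons = []
    if not row.get("canonical_name"):
        reasons.append("missing canonical name")
    if not isinstance(row.get("value"), (int, float)):
        reasons.append("result is not numeric")
    if not row.get("unit"):
        reasons.append("missing unit")
    low, high = row.get("ref_low"), row.get("ref_high")
    if low is None or high is None:
        reasons.append("missing reference range")
    elif low > high:
        reasons.append("reference range is inverted")
    return reasons


def validate_and_log(row, source):
    """Log every rejection reason and return True only if the row is valid."""
    reasons = validate_row(row)
    for reason in reasons:
        logger.warning("Rejected %r from %s: %s",
                       row.get("canonical_name"), source, reason)
    return not reasons


good = {"canonical_name": "CRP", "value": 5.2, "unit": "mg/L",
        "ref_low": 0.0, "ref_high": 5.0}
bad = dict(good, unit=None)
```

The checks stay fully deterministic; the only change in behaviour is that each rejected row leaves a log line naming the rule that rejected it.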

All other panels extract correctly. The issue appears isolated.

What Is Required

This is not a rebuild.

We do not want:

• Re-architecture
• Large-scale changes

We need:

Precise Diagnosis

Identify exactly where CRP, ESR, and GLU are failing insertion and which rule is causing rejection.
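A diagnosis of this kind could start from a small tracer that runs each stage on a single line and reports the first stage that yields nothing. Everything below is a stand-in: CANONICAL, the three patterns, trace_line, and first_failure are hypothetical and would need to be swapped for the engine's real dictionary and regexes.

```python
import re

# Hypothetical stand-ins for the engine's dictionary and patterns.
CANONICAL = {"crp": "CRP", "esr": "ESR", "glu": "GLU", "glucose": "GLU"}
NUMBER_PATTERN = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])")
UNIT_PATTERN = re.compile(r"\b(?:mg/L|mmol/L|mm/hr?)(?!\w)")
RANGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def trace_line(line):
    """Run every stage on one OCR line and record what each stage produced."""
    tokens = line.lower().split()
    canonical = next((CANONICAL[t] for t in tokens if t in CANONICAL), None)
    number = NUMBER_PATTERN.search(line)
    unit = UNIT_PATTERN.search(line)
    ref_range = RANGE_PATTERN.search(line)
    return [
        ("canonical", canonical),
        ("number", number.group(0) if number else None),
        ("unit", unit.group(0) if unit else None),
        ("range", ref_range.groups() if ref_range else None),
    ]


def first_failure(line):
    """Return the name of the first stage that produced nothing, else None."""
    for stage, value in trace_line(line):
        if value is None:
            return stage
    return None


ok = first_failure("CRP H 5.2 mg/L 0-5")
unit_case = first_failure("CRP H 5.2 mg/l 0-5")   # lowercase "l" in the unit
range_case = first_failure("CRP H 5.2 mg/L <5")   # range not in low-high form
```

Running the tracer over the exact OCR lines for CRP, ESR, and GLU would show whether they share one blocking rule or fail at different stages.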

Minimal Safe Fix

Implement a targeted correction that:

Zero Regression
Modular Implementation

If appropriate:

The existing architecture should remain intact.

Constraints

The system is designed to be:

• Schema-driven
• Forensic-grade

We cannot introduce probabilistic or unpredictable behaviour.

Longer-Term Goal

After stabilising extraction:

• Later incorporate AI-assisted interpretation

Immediate priority:

Stabilise deterministic extraction for CRP, ESR, and GLU without breaking the existing engine.
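"Without breaking the existing engine" can be checked mechanically by comparing the database contents before and after a fix. The sketch below assumes a table named results with the columns shown; the table name, column names, and helper functions are hypothetical, and the sample rows are made up for the demonstration.

```python
import sqlite3

# Hypothetical schema; the real table and column names may differ.
COLUMNS = "source_file, canonical_name, value, unit, ref_range"


def snapshot(conn):
    """Return every row of the results table as a set of tuples."""
    return set(conn.execute(f"SELECT {COLUMNS} FROM results").fetchall())


def compare(before, after):
    """Report rows that disappeared and rows that appeared between two runs."""
    return {"lost": sorted(before - after), "gained": sorted(after - before)}


def make_demo_db(rows):
    """Build an in-memory database holding the given rows (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE results ({COLUMNS})")
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)
    return conn


# Made-up rows: one panel that already extracts, plus the newly fixed marker.
existing = [("fbc.pdf", "HGB", 141.0, "g/L", "130-170")]
fixed = existing + [("crp.pdf", "CRP", 5.2, "mg/L", "0-5")]

report = compare(snapshot(make_demo_db(existing)), snapshot(make_demo_db(fixed)))
```

A fix with no regressions should produce an empty "lost" list, with "gained" containing only the rows for the three previously missing markers.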

Materials Provided

Uploaded:

• Full extraction_core_2.py (text format)

Additional materials available on request:

This is a focused debugging and refinement request. I have spent many hours attempting to isolate the issue and now require an experienced developer to identify the blocking condition and implement a practical fix.

I have been advised this should take 1–2 hours for a senior developer.

Looking for a swift turnaround.

Budget: GBP 210 (Fixed Price)

Proposals: 17 freelancers have applied

Skills

Python, SQLite, HTML, Swift