Available for work

I build and test AI systems
in the same loop.

Leo Mari Cuizon · AI Systems Operator · QA · LLM Evaluation · Workflow Design

Mobile apps · Web · LLM output evaluation · AI workflow systems · Edge cases · iOS & Android

I validate AI systems in the conditions they'll actually run in.

I started in customer success, moved into QA and system validation, and have spent the last year building AI-assisted products to understand how they break.

I work across QA, AI output evaluation, and structured execution — using AI tools throughout my process for testing, debugging, analysis, and workflow design. I've shipped multiple product experiments not to launch startups, but to stay close to the systems I test.

I don't specialize in one stack. I specialize in understanding how a system is supposed to behave, finding where it doesn't, and documenting it clearly enough that someone else can act on it.

Location
Cebu City, Philippines
Focus
QA · LLM Evaluation · AI Workflow Design
Availability
Open to work · Remote
Approach
Systems thinking · Execution-first

What I actually do.

01

Mobile & Web App QA Testing

  • User flows, auth/session bugs, mobile behavior, UI state failures
  • Form validation, broken links, offline behavior, service worker edge cases
  • Bug reports with steps to reproduce, severity, and fix path

02

AI & LLM Workflow Testing

  • Multi-model evaluation — scoring grounding, reasoning, completeness, and hallucination behavior
  • Prompt evaluation across varied inputs, tones, edge cases, and failure conditions
  • API response validation (OpenAI, Supabase, Groq) and fallback testing

03

Research & Structured Data Execution

  • Primary-source research collected, verified, and organized systematically
  • Raw data cleaned and formatted into spreadsheets, docs, or structured JSON
  • Built for AI ingestion — consistent schema, no junk rows, source-attributed (row shape sketched below)
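
A minimal sketch of what a delivered row tends to look like, in TypeScript (field names here are illustrative, not a client schema):

// Illustrative row shape: field names are assumptions, not a real client schema.
interface ResearchRow {
  id: string;
  claim: string;           // one extracted fact per row, no mixed formats
  sourceUrl: string;       // every row carries its source
  sourceAccessed: string;  // ISO date the source was last verified
  confidence: "verified" | "single-source";
}

// A populated row: a consistent schema means AI ingestion never hits surprises.
const example: ResearchRow = {
  id: "r-0001",
  claim: "Example fact, stated in one sentence",
  sourceUrl: "https://example.com/report",
  sourceAccessed: "2025-01-01",
  confidence: "verified",
};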

I also build what I test

  • Building PWAs from scratch gives me a closer view of where systems actually break
  • Shipping and testing in the same loop means I understand failure modes from both sides
  • Not a separate service — context that makes the testing sharper

AI as infrastructure,
not a shortcut.

I design systems around AI tools — routing logic, source-of-truth rules, feedback loops — rather than using them ad hoc.

Structured routing

  • Each tool has a defined role — what it decides and what it doesn't
  • Source-of-truth hub keeps decisions stable across the workflow
  • Specialist tools handle focused outputs: architecture, strategy, execution (role map sketched below)
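
In rough shape, that role map looks something like this (a TypeScript sketch; tool names and fields are illustrative, not a real config):

// Hypothetical role map: each tool gets explicit decision boundaries.
type Role = "source-of-truth" | "architecture" | "implementation";

interface ToolRole {
  tool: string;           // which assistant fills the role
  decides: string[];      // decisions this tool is allowed to make
  defersTo: Role | null;  // where conflicts escalate
}

const workflow: Record<Role, ToolRole> = {
  "source-of-truth": { tool: "Hub GPT", decides: ["scope", "stable decisions"], defersTo: null },
  "architecture":    { tool: "Specialist GPT", decides: ["structure", "tradeoffs"], defersTo: "source-of-truth" },
  "implementation":  { tool: "Codex", decides: ["code changes within a brief"], defersTo: "architecture" },
};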

Documented outputs

  • Async by default — work from a brief, not a call
  • Every engagement returns structured artifacts, not activity summaries
  • Bug reports, evaluation sheets, and research delivered in your preferred format

Feedback loops

  • Real-world results feed back into the system to update stable decisions
  • Explicit rules for what gets classified as stable, experimental, or rejected
  • QA and iteration run in the same loop — not as separate phases

Narrow, precise execution

  • Implementation tasks are scoped tightly — no broad refactors without cause
  • Constraints are documented before work begins, not discovered after
  • Changes are verified against expected behavior before closing the loop

Things I built to understand
how systems break.

Personal experiments — not polished products. Each one was a reason to get closer to a real failure mode.

Personal experiment · PWA

Job Intel MVP

Deterministic job evaluation system that scores remote listings against a candidate profile — without AI ranking. Transparent, rule-based logic produces explainable outputs. The build itself ran through a multi-GPT workflow: a source-of-truth Hub, a specialist GPT for architecture decisions, and Codex for narrow implementation tasks, with explicit constraints on what each tool could decide.
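
The scoring core is small enough to sketch. Simplified TypeScript with hypothetical rules, not the actual Job Intel rule set:

// Each rule returns a score delta plus a reason, so every result is explainable.
interface Listing { remote: boolean; skills: string[]; seniority: string }
interface Profile { skills: string[]; seniority: string }

type Rule = (l: Listing, p: Profile) => { delta: number; reason: string };

const rules: Rule[] = [
  (l) => ({ delta: l.remote ? 20 : -50, reason: l.remote ? "remote role" : "not remote" }),
  (l, p) => {
    const overlap = l.skills.filter((s) => p.skills.includes(s)).length;
    return { delta: overlap * 10, reason: `${overlap} matching skills` };
  },
  (l, p) => ({
    delta: l.seniority === p.seniority ? 15 : 0,
    reason: l.seniority === p.seniority ? "seniority match" : "seniority mismatch",
  }),
];

// Deterministic: the same listing and profile always produce the same score and reasons.
function scoreListing(l: Listing, p: Profile) {
  const results = rules.map((rule) => rule(l, p));
  return {
    score: results.reduce((sum, r) => sum + r.delta, 0),
    reasons: results.map((r) => r.reason),
  };
}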

Visit project

Personal experiment · v47+

Stackr

Offline-first AI notes PWA iterated across 47+ versions. Used as a regression testing ground every time a feature was added or removed. Caught a critical auth failure caused by iOS Safari's Intelligent Tracking Prevention blocking Supabase session persistence on PWA reinstall — isolated the caching conflict and documented the fix path.
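
The regression check behind that catch is easy to sketch. Assuming supabase-js v2, with placeholder project values:

import { createClient } from "@supabase/supabase-js";

// Placeholders, not real credentials.
const supabase = createClient("https://project.supabase.co", "anon-key", {
  auth: {
    persistSession: true,
    storage: window.localStorage, // the storage ITP can evict in standalone PWAs
  },
});

// Run after every reload/reinstall: did the session survive?
async function checkSessionPersistence(): Promise<void> {
  const { data, error } = await supabase.auth.getSession();
  if (error || !data.session) {
    // Log enough context to reproduce: display mode matters on iOS.
    console.warn("Session missing after reload", {
      standalone: window.matchMedia("(display-mode: standalone)").matches,
      error: error?.message,
    });
  }
}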

Visit project

Personal experiment · Three.js

Jungle Dash

2.5D endless runner PWA built as a testing ground for continuous-state systems: collision detection, mobile control behavior, obstacle generation edge cases, state resets on game death, and performance under sustained loops. Object pooling, garbage collection pressure, and service worker behavior under offline conditions — all testable in a way most apps don't expose.
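
Object pooling is the piece most apps never exercise. A minimal pool in TypeScript, illustrative rather than the actual game code:

// Recycle obstacles instead of allocating new ones, keeping GC pauses out of the run loop.
class Pool<T> {
  private free: T[] = [];
  constructor(private create: () => T, private reset: (item: T) => void) {}

  acquire(): T {
    return this.free.pop() ?? this.create(); // reuse if possible, allocate only when empty
  }

  release(item: T): void {
    this.reset(item);      // return the object to a known state
    this.free.push(item);  // park it for the next spawn
  }
}

// Obstacles get released as they scroll off-screen and reacquired on spawn.
const obstacles = new Pool(
  () => ({ x: 0, y: 0, active: false }),
  (o) => { o.active = false; },
);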

Visit project

Every engagement ends with
something you can act on.

Documented outputs, not activity summaries. Here's what that looks like in practice.

📋

Bug Reports

  • Steps to reproduce, exact conditions, expected vs actual behavior
  • Severity classification and suggested fix path
  • Delivered in your preferred format — doc, sheet, or Notion

🧪

Test Case Checklists

  • Edge cases mapped from your product logic and user flows
  • AI/LLM response evaluation notes with pass/fail criteria
  • Workflow failure summaries with reproduction paths

📁

Structured Research Sheets

  • Source-attributed, consistently formatted, ready for AI ingestion
  • Clean schema — no junk rows, no mixed formats
  • Delivered as CSV, XLSX, or JSON depending on your pipeline

🗂️

Dataset Curation

  • Structured text datasets extracted and cleaned for LLM training pipelines
  • Consistent labeling, formatting, and deduplication across large collections
  • Source-verified, schema-consistent, delivered in your required format

Sample Bug Report

Real report, sanitized client details
P1 — Blocker

System fails to generate downstream outputs after successful data processing

Context

Web application · Staging environment · Workflow: Data ingestion → Processing → Output generation

Steps to Reproduce

  1. Create a new workspace/entity
  2. Connect a data source and initiate processing
  3. Allow the processing stage to complete successfully
  4. Trigger output generation step A, then step B
  5. Observe output status

Expected

Both output generation steps complete, producing valid output artifacts.

Actual

Processing completes. Both output steps fail. No artifacts created.

System Logs (Sanitized)

processing completed successfully (items_processed: 16, blocked: false)
output_a.asset_id = null
output_b.asset_id = null
last_failed_runs.output_a.status = failed
last_failed_runs.output_b.status = failed

Analysis

Processing layer executes correctly. Failure occurs in the downstream output generation pipeline. Likely issue: a missing or invalid mapping between processed data and output generation inputs — a breakdown in the data handoff between modules.

LLM Evaluation

Real evaluation, sanitized client details

Multi-model document summarization — 6 models, 4 document lengths

Evaluated Claude Sonnet 4.6, Haiku 4.5, GPT-4.1 mini, GPT-4.1 nano, Gemini 2.5 Flash, and Gemini 3.1 Flash Lite across short (~1K words), medium (~5K), long (~15K), and very long (~40K) documents. Scored each output across six dimensions, with a 15-point hallucination penalty for minor flags and a critical penalty for fabricated structures.

Rubric (6 dimensions)

Grounding · Reasoning · Completeness · Actionability · Clarity · Task Overlay — plus hallucination flags: none / minor / critical
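
The scoring arithmetic, roughly, in TypeScript. The 15-point minor penalty comes from the rubric above; the per-dimension 0–100 scale and the size of the critical penalty are assumptions:

type Flag = "none" | "minor" | "critical";

const DIMENSIONS = [
  "grounding", "reasoning", "completeness", "actionability", "clarity", "taskOverlay",
] as const;

const MINOR_PENALTY = 15;    // from the rubric
const CRITICAL_PENALTY = 50; // assumed: the rubric only says "critical penalty"

function finalScore(
  scores: Record<(typeof DIMENSIONS)[number], number>, // each dimension 0–100 (assumed scale)
  flag: Flag,
): number {
  const base = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0) / DIMENSIONS.length;
  const penalty = flag === "minor" ? MINOR_PENALTY : flag === "critical" ? CRITICAL_PENALTY : 0;
  return Math.max(0, Math.round(base - penalty));
}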

Key findings

  • Gemini 3.1 Flash Lite fabricated phantom citation brackets [1][2][3] throughout a 40K-word task — classified hallucination_critical, scored 33/100
  • GPT-4.1 nano fabricated a numerical statistic that was off by an order of magnitude in a medium document — hard to catch and high-risk in production
  • Haiku 4.5 fabricated an institutional affiliation in a very long document — unacceptable in clinical or high-stakes contexts
  • GPT-4.1 mini was the strongest summarizer for short and very long documents; Sonnet 4.6 was most reliable for medium and long

Output — routing recommendation

Task | Primary | Fallback
Short (~1K words) | GPT-4.1 mini | Haiku 4.5
Medium (~5K words) | Sonnet 4.6 | Haiku 4.5
Long (~15K words) | Sonnet 4.6 | Gemini 2.5 Flash
Very long (~40K words) | GPT-4.1 mini | Sonnet 4.6
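
Expressed as routing logic, the table reduces to a few thresholds. The cutoffs between the four sampled lengths are assumptions, since only ~1K, ~5K, ~15K, and ~40K words were tested:

function routeSummarizer(wordCount: number): { primary: string; fallback: string } {
  if (wordCount <= 2_000) return { primary: "GPT-4.1 mini", fallback: "Haiku 4.5" };
  if (wordCount <= 8_000) return { primary: "Sonnet 4.6", fallback: "Haiku 4.5" };
  if (wordCount <= 25_000) return { primary: "Sonnet 4.6", fallback: "Gemini 2.5 Flash" };
  return { primary: "GPT-4.1 mini", fallback: "Sonnet 4.6" };
}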

AI-assisted debugging & documentation

PWA login sessions disappearing on refresh

Problem

Users were logged out every time the PWA was refreshed or reinstalled on iOS Safari.

Tested

Service worker caching strategy, Supabase auth token storage, ITP cookie behavior across iOS versions.

Output

Surfaced a caching conflict blocking session persistence. Documented the issue and fix path with AI assistance.

Need structured QA, LLM evaluation,
or AI workflow support?

Send me what you're building, what's breaking, or what you need validated. I'll return a clear issue list, evaluation notes, or workflow documentation — depending on what's needed. Available for 1–2 projects at a time, async-first, remote only.