Available for work

I build and test AI systems
in the same loop.

Leo Mari Cuizon · AI Systems Operator · QA · LLM Evaluation · Workflow Design

Mobile apps · Web · LLM output evaluation · AI workflow systems · Edge cases · iOS & Android

I validate AI systems in the conditions they'll actually run in.

I started in customer success, moved into QA and system validation, and have spent the last year building AI-assisted products to understand how they break.

I work across QA, AI output evaluation, and structured execution — using AI tools throughout my process for testing, debugging, analysis, and workflow design. I've shipped multiple product experiments not to launch startups, but to stay close to the systems I test.

I don't specialize in one stack. I specialize in understanding how a system is supposed to behave, finding where it doesn't, and documenting it clearly enough that someone else can act on it.

Location
Cebu City, Philippines
Focus
QA · LLM Evaluation · AI Workflow Design
Availability
Open to work · Remote
Approach
Systems thinking · Execution-first

What I actually do.

01

Mobile & Web App QA Testing

  • User flows, auth/session bugs, mobile behavior, UI state failures
  • Form validation, broken links, offline behavior, service worker edge cases
  • Bug reports with steps to reproduce, severity, and fix path

02

AI & LLM Workflow Testing

  • Multi-model evaluation — scoring grounding, reasoning, completeness, and hallucination behavior
  • Prompt evaluation across varied inputs, tones, edge cases, and failure conditions
  • API response validation (OpenAI, Supabase, Groq) and fallback testing

03

Research & Structured Data Execution

  • Primary-source research collected, verified, and organized systematically
  • Raw data cleaned and formatted into spreadsheets, docs, or structured JSON
  • Built for AI ingestion — consistent schema, no junk rows, source-attributed (row shape sketched below)
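
A minimal sketch of what a delivered row tends to look like, in TypeScript (field names here are illustrative, not a client schema):

// Illustrative row shape: field names are assumptions, not a real client schema.
interface ResearchRow {
  id: string;
  claim: string;           // one extracted fact per row, no mixed formats
  sourceUrl: string;       // every row carries its source
  sourceAccessed: string;  // ISO date the source was last verified
  confidence: "verified" | "single-source";
}

// A populated row: a consistent schema means AI ingestion never hits surprises.
const example: ResearchRow = {
  id: "r-0001",
  claim: "Example fact, stated in one sentence",
  sourceUrl: "https://example.com/report",
  sourceAccessed: "2025-01-01",
  confidence: "verified",
};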

I also build what I test

  • Building PWAs from scratch gives me a closer view of where systems actually break
  • Shipping and testing in the same loop means I understand failure modes from both sides
  • Not a separate service — context that makes the testing sharper

AI as infrastructure,
not a shortcut.

I design systems around AI tools — routing logic, source-of-truth rules, feedback loops — rather than using them ad hoc.

Structured routing

  • Each tool has a defined role — what it decides and what it doesn't
  • Source-of-truth hub keeps decisions stable across the workflow
  • Specialist tools handle focused outputs: architecture, strategy, execution (role map sketched below)
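
In rough shape, that role map looks something like this (a TypeScript sketch; tool names and fields are illustrative, not a real config):

// Hypothetical role map: each tool gets explicit decision boundaries.
type Role = "source-of-truth" | "architecture" | "implementation";

interface ToolRole {
  tool: string;           // which assistant fills the role
  decides: string[];      // decisions this tool is allowed to make
  defersTo: Role | null;  // where conflicts escalate
}

const workflow: Record<Role, ToolRole> = {
  "source-of-truth": { tool: "Hub GPT", decides: ["scope", "stable decisions"], defersTo: null },
  "architecture":    { tool: "Specialist GPT", decides: ["structure", "tradeoffs"], defersTo: "source-of-truth" },
  "implementation":  { tool: "Codex", decides: ["code changes within a brief"], defersTo: "architecture" },
};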

Documented outputs

  • Async by default — work from a brief, not a call
  • Every engagement returns structured artifacts, not activity summaries
  • Bug reports, evaluation sheets, and research delivered in your preferred format

Feedback loops

  • Real-world results feed back into the system to update stable decisions
  • Explicit rules for what gets classified as stable, experimental, or rejected
  • QA and iteration run in the same loop — not as separate phases

Narrow, precise execution

  • Implementation tasks are scoped tightly — no broad refactors without cause
  • Constraints are documented before work begins, not discovered after
  • Changes are verified against expected behavior before closing the loop

Things I built to understand
how systems break.

Personal experiments — not polished products. Each one was a reason to get closer to a real failure mode.

Personal experiment · PWA

Job Intel MVP

Deterministic job evaluation system that scores remote listings against a candidate profile — without AI ranking. Transparent, rule-based logic produces explainable outputs. The build itself ran through a multi-GPT workflow: a source-of-truth Hub, a specialist GPT for architecture decisions, and Codex for narrow implementation tasks, with explicit constraints on what each tool could decide.
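
The scoring core is small enough to sketch. Simplified TypeScript with hypothetical rules, not the actual Job Intel rule set:

// Each rule returns a score delta plus a reason, so every result is explainable.
interface Listing { remote: boolean; skills: string[]; seniority: string }
interface Profile { skills: string[]; seniority: string }

type Rule = (l: Listing, p: Profile) => { delta: number; reason: string };

const rules: Rule[] = [
  (l) => ({ delta: l.remote ? 20 : -50, reason: l.remote ? "remote role" : "not remote" }),
  (l, p) => {
    const overlap = l.skills.filter((s) => p.skills.includes(s)).length;
    return { delta: overlap * 10, reason: `${overlap} matching skills` };
  },
  (l, p) => ({
    delta: l.seniority === p.seniority ? 15 : 0,
    reason: l.seniority === p.seniority ? "seniority match" : "seniority mismatch",
  }),
];

// Deterministic: the same listing and profile always produce the same score and reasons.
function scoreListing(l: Listing, p: Profile) {
  const results = rules.map((rule) => rule(l, p));
  return {
    score: results.reduce((sum, r) => sum + r.delta, 0),
    reasons: results.map((r) => r.reason),
  };
}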

Visit project

Personal experiment · v47+

Stackr

Offline-first AI notes PWA iterated across 47+ versions. Used as a regression testing ground every time a feature was added or removed. Caught a critical auth failure caused by iOS Safari's Intelligent Tracking Prevention blocking Supabase session persistence on PWA reinstall — isolated the caching conflict and documented the fix path.
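
The regression check behind that catch is easy to sketch. Assuming supabase-js v2, with placeholder project values:

import { createClient } from "@supabase/supabase-js";

// Placeholders, not real credentials.
const supabase = createClient("https://project.supabase.co", "anon-key", {
  auth: {
    persistSession: true,
    storage: window.localStorage, // the storage ITP can evict in standalone PWAs
  },
});

// Run after every reload/reinstall: did the session survive?
async function checkSessionPersistence(): Promise<void> {
  const { data, error } = await supabase.auth.getSession();
  if (error || !data.session) {
    // Log enough context to reproduce: display mode matters on iOS.
    console.warn("Session missing after reload", {
      standalone: window.matchMedia("(display-mode: standalone)").matches,
      error: error?.message,
    });
  }
}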

Visit project

Personal experiment · Three.js

Jungle Dash

2.5D endless runner PWA built as a testing ground for continuous-state systems: collision detection, mobile control behavior, obstacle generation edge cases, state resets on game death, and performance under sustained loops. Object pooling, garbage collection pressure, and service worker behavior under offline conditions — all testable in a way most apps don't expose.
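
Object pooling is the piece most apps never exercise. A minimal pool in TypeScript, illustrative rather than the actual game code:

// Recycle obstacles instead of allocating new ones, keeping GC pauses out of the run loop.
class Pool<T> {
  private free: T[] = [];
  constructor(private create: () => T, private reset: (item: T) => void) {}

  acquire(): T {
    return this.free.pop() ?? this.create(); // reuse if possible, allocate only when empty
  }

  release(item: T): void {
    this.reset(item);      // return the object to a known state
    this.free.push(item);  // park it for the next spawn
  }
}

// Obstacles get released as they scroll off-screen and reacquired on spawn.
const obstacles = new Pool(
  () => ({ x: 0, y: 0, active: false }),
  (o) => { o.active = false; },
);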

Visit project

Every engagement ends with
something you can act on.

Documented outputs, not activity summaries. Here's what that looks like in practice.

📋

Bug Reports

  • Steps to reproduce, exact conditions, expected vs actual behavior
  • Severity classification and suggested fix path
  • Delivered in your preferred format — doc, sheet, or Notion

🧪

Test Case Checklists

  • Edge cases mapped from your product logic and user flows
  • AI/LLM response evaluation notes with pass/fail criteria
  • Workflow failure summaries with reproduction paths

📁

Structured Research Sheets

  • Source-attributed, consistently formatted, ready for AI ingestion
  • Clean schema — no junk rows, no mixed formats
  • Delivered as CSV, XLSX, or JSON depending on your pipeline

🗂️

Dataset Curation

  • Structured text datasets extracted and cleaned for LLM training pipelines
  • Consistent labeling, formatting, and deduplication across large collections
  • Source-verified, schema-consistent, delivered in your required format

Sample Bug Report

Real report, sanitized client details
P1 — Blocker

System fails to generate downstream outputs after successful data processing

Context

Web application · Staging environment · Workflow: Data ingestion → Processing → Output generation

Steps to Reproduce

  1. Create a new workspace/entity
  2. Connect a data source and initiate processing
  3. Allow the processing stage to complete successfully
  4. Trigger output generation step A, then step B
  5. Observe output status

Expected

Both output generation steps complete, producing valid output artifacts.

Actual

Processing completes. Both output steps fail. No artifacts created.

System Logs (Sanitized)

processing completed successfully (items_processed: 16, blocked: false)
output_a.asset_id = null
output_b.asset_id = null
last_failed_runs.output_a.status = failed
last_failed_runs.output_b.status = failed

Analysis

Processing layer executes correctly. Failure occurs in the downstream output generation pipeline. Likely issue: a missing or invalid mapping between processed data and output generation inputs — a breakdown in the data handoff between modules.

LLM Evaluation

Real evaluation, sanitized client details

Multi-model document summarization — 6 models, 4 document lengths

Evaluated Claude Sonnet 4.6, Haiku 4.5, GPT-4.1 mini, GPT-4.1 nano, Gemini 2.5 Flash, and Gemini 3.1 Flash Lite across short (~1K words), medium (~5K), long (~15K), and very long (~40K) documents. Scored each output across six dimensions, with a 15-point hallucination penalty for minor flags and a critical penalty for fabricated structures.

Rubric (6 dimensions)

Grounding · Reasoning · Completeness · Actionability · Clarity · Task Overlay — plus hallucination flags: none / minor / critical
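
The scoring arithmetic, roughly, in TypeScript. The 15-point minor penalty comes from the rubric above; the per-dimension 0–100 scale and the size of the critical penalty are assumptions:

type Flag = "none" | "minor" | "critical";

const DIMENSIONS = [
  "grounding", "reasoning", "completeness", "actionability", "clarity", "taskOverlay",
] as const;

const MINOR_PENALTY = 15;    // from the rubric
const CRITICAL_PENALTY = 50; // assumed: the rubric only says "critical penalty"

function finalScore(
  scores: Record<(typeof DIMENSIONS)[number], number>, // each dimension 0–100 (assumed scale)
  flag: Flag,
): number {
  const base = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0) / DIMENSIONS.length;
  const penalty = flag === "minor" ? MINOR_PENALTY : flag === "critical" ? CRITICAL_PENALTY : 0;
  return Math.max(0, Math.round(base - penalty));
}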

Key findings

  • Gemini 3.1 Flash Lite fabricated phantom citation brackets [1][2][3] throughout a 40K-word task — classified hallucination_critical, scored 33/100
  • GPT-4.1 nano fabricated a numerical statistic that was off by an order of magnitude in a medium document — hard to catch and high-risk in production
  • Haiku 4.5 fabricated an institutional affiliation in a very long document — unacceptable in clinical or high-stakes contexts
  • GPT-4.1 mini was the strongest summarizer for short and very long documents; Sonnet 4.6 was most reliable for medium and long

Output — routing recommendation

Task | Primary | Fallback
Short (~1K words) | GPT-4.1 mini | Haiku 4.5
Medium (~5K words) | Sonnet 4.6 | Haiku 4.5
Long (~15K words) | Sonnet 4.6 | Gemini 2.5 Flash
Very long (~40K words) | GPT-4.1 mini | Sonnet 4.6
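
Expressed as routing logic, the table reduces to a few thresholds. The cutoffs between the four sampled lengths are assumptions, since only ~1K, ~5K, ~15K, and ~40K words were tested:

function routeSummarizer(wordCount: number): { primary: string; fallback: string } {
  if (wordCount <= 2_000) return { primary: "GPT-4.1 mini", fallback: "Haiku 4.5" };
  if (wordCount <= 8_000) return { primary: "Sonnet 4.6", fallback: "Haiku 4.5" };
  if (wordCount <= 25_000) return { primary: "Sonnet 4.6", fallback: "Gemini 2.5 Flash" };
  return { primary: "GPT-4.1 mini", fallback: "Sonnet 4.6" };
}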

AI-assisted debugging & documentation

PWA login sessions disappearing on refresh

Problem

Users were logged out every time the PWA was refreshed or reinstalled on iOS Safari.

Tested

Service worker caching strategy, Supabase auth token storage, ITP cookie behavior across iOS versions.

Output

Surfaced a caching conflict blocking session persistence. Documented the issue and fix path with AI assistance.

Need structured QA, LLM evaluation,
or AI workflow support?

Send me what you're building, what's breaking, or what you need validated. I'll return a clear issue list, evaluation notes, or workflow documentation — depending on what's needed. Available for 1–2 projects at a time, async-first, remote only.