Available for work

I build and test AI systems
in the same loop.

Leo Mari Cuizon · AI-assisted QA · Workflow Testing · Lightweight PWA & Web Projects

Mobile apps · Web · LLM output evaluation · AI workflow systems · Edge cases · iOS & Android

I validate AI systems in the conditions they'll actually run in.

I started with data entry, light research, and simple graphic design work in Canva, then moved into QA and system validation, and more recently into building AI-assisted products to understand how they break.

I work across QA, LLM testing, and structured execution — using AI tools throughout my process for testing, debugging, analysis, and workflow design. I've shipped multiple product experiments not to launch startups, but to stay close to the systems I test.

Location: Cebu City, Philippines
Focus: QA · LLM Evaluation · AI Workflow Design · AI
Availability: Open to work · Remote
Approach: Systems thinking · Execution-first

What I actually do.

01

Mobile & Web App QA Testing

  • User flows, auth/session bugs, mobile behavior, UI state failures
  • Form validation, broken links, offline behavior, service worker edge cases
  • Bug reports with steps to reproduce, severity, and fix path

02

AI + PWA Workflow & System Validation

  • Edge cases mapped across PWA flows, AI features, and API-driven interactions
  • End-to-end workflow analysis covering offline/online sync behavior, UI logic, and system integration points
  • API response validation (OpenAI, Supabase, Groq) and fallback testing
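
For illustration only, here is a minimal sketch of what response validation with fallback can look like. It is not the project code: the provider callables, the `text` key, and the function names are all assumptions, and no real OpenAI, Supabase, or Groq client is called.

```python
from typing import Callable, Optional

# Hypothetical provider type: takes a prompt, returns an already-parsed dict.
Provider = Callable[[str], dict]


def is_valid(response: object) -> bool:
    """Assumed contract: a dict with a non-empty string under 'text'."""
    text = response.get("text") if isinstance(response, dict) else None
    return isinstance(text, str) and text.strip() != ""


def call_with_fallback(
    prompt: str, providers: list[tuple[str, Provider]]
) -> tuple[Optional[str], list[str]]:
    """Try each provider in order and return (text, log of each attempt)."""
    log: list[str] = []
    for name, provider in providers:
        try:
            response = provider(prompt)
        except Exception as exc:  # timeouts, network errors, bad JSON, etc.
            log.append(f"{name}: error ({exc})")
            continue
        if is_valid(response):
            log.append(f"{name}: ok")
            return response["text"], log
        log.append(f"{name}: invalid response")
    # Every provider failed; the caller decides what the UI should show.
    return None, log
```

In a fallback test, the providers would be stubs that raise, return an empty payload, or succeed, so each branch can be exercised without network access.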

03

Research & Structured Data Execution

  • Primary-source research collected, verified, and organized systematically
  • Raw data cleaned and formatted into spreadsheets, docs, or structured JSON
  • Built for AI ingestion — consistent schema, no junk rows, source-attributed
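
As a rough sketch of the cleaning step described above, the snippet below keeps only rows that satisfy a fixed schema and records why the others were rejected. The field names (`title`, `value`, `source_url`) are hypothetical; a real engagement would use whatever schema the pipeline requires.

```python
# Hypothetical schema: every kept row must have all of these, non-empty.
REQUIRED_FIELDS = ("title", "value", "source_url")


def _text(value):
    """Normalise a cell to a stripped string; None becomes empty."""
    return "" if value is None else str(value).strip()


def clean_rows(rows):
    """Return (kept, rejected).

    kept: rows reduced to the schema, all values as stripped strings.
    rejected: (row index, list of missing fields) for each junk row.
    """
    kept, rejected = [], []
    for index, row in enumerate(rows):
        cleaned = {field: _text(row.get(field)) for field in REQUIRED_FIELDS}
        missing = [field for field, value in cleaned.items() if not value]
        if missing:
            rejected.append((index, missing))
        else:
            kept.append(cleaned)
    return kept, rejected
```

Returning the rejected rows alongside the kept ones means nothing is dropped silently, which keeps the cleaning step auditable.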

I also build what I test

  • Building PWAs from scratch gives me a closer view of where systems actually break
  • Shipping and testing in the same loop means I understand failure modes from both sides
  • Not a separate service — context that makes the testing sharper

AI as part of structured workflows to improve testing, analysis, and execution.

I use AI tools within structured workflows for testing, analysis, and execution, rather than using them in isolation.

Structured prompts

  • Each tool has a defined role — what it decides and what it doesn't
  • Source-of-truth hub keeps decisions stable across the workflow
  • Specialist tools handle focused outputs: architecture, strategy, execution

Documented outputs

  • Async by default — work from a brief, not a call
  • Every engagement returns structured artifacts, not activity summaries
  • Bug reports, evaluation sheets, and research delivered in your preferred format

Feedback loops

  • Real-world results feed back into the system to update stable decisions
  • Explicit rules for what gets classified as stable, experimental, or rejected
  • QA and iteration run in the same loop — not as separate phases

Narrow, precise execution

  • Implementation tasks are scoped tightly — no broad refactors without cause
  • Constraints are documented before work begins, not discovered after
  • Changes are verified against expected behavior before closing the loop

Things I built to understand
how systems break.

Personal experiments — not polished products. Each one was a reason to get closer to a real failure mode.

Personal experiment · v47+

Job Intel MVP

Rule-based job evaluation workflow that scores remote listings against a candidate profile — without AI ranking. Transparent, rule-based logic produces explainable outputs. Built a multi-GPT workflow to manage the build: a source-of-truth Hub, a specialist GPT for architecture decisions, and Codex for narrow implementation tasks with explicit constraints on what each tool could decide.
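
To show the general idea of explainable, rule-based scoring, here is a small sketch. It is not the Job Intel implementation: the rules, point values, and field names (`remote`, `skills`, `hourly_rate`, `min_rate`) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    points: int
    matches: Callable[[dict, dict], bool]  # (listing, profile) -> bool


# Hypothetical rules; a real rule set would come from the candidate profile.
RULES = [
    Rule("remote", 40, lambda listing, profile: listing.get("remote") is True),
    Rule(
        "skill_overlap",
        40,
        lambda listing, profile: bool(
            set(listing.get("skills", [])) & set(profile.get("skills", []))
        ),
    ),
    Rule(
        "meets_min_rate",
        20,
        lambda listing, profile: listing.get("hourly_rate", 0)
        >= profile.get("min_rate", 0),
    ),
]


def score_listing(listing: dict, profile: dict) -> tuple[int, list[str]]:
    """Return (total score, one reason line per rule, passed or not)."""
    total = 0
    reasons = []
    for rule in RULES:
        awarded = rule.points if rule.matches(listing, profile) else 0
        total += awarded
        reasons.append(f"+{awarded} {rule.name}")
    return total, reasons
```

Because every rule contributes a reason line whether it passes or fails, the final score can always be traced back to the rules that produced it.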

Personal experiment · Three.js

Stackr

Offline-first AI notes PWA iterated across 47+ versions. I used my own apps to observe and debug real failure cases during development. Caught a critical auth failure caused by iOS Safari's Intelligent Tracking Prevention blocking Supabase session persistence on PWA reinstall — isolated the caching conflict and documented the fix path.

Personal experiment · PWA

Jungle Dash

2.5D endless runner PWA built as a testing ground for continuous-state systems: collision detection, mobile control behavior, obstacle generation edge cases, state resets on game death, and performance under sustained loops. Object pooling, garbage collection pressure, and service worker behavior under offline conditions — all testable in a way most apps don't expose.

AI Bug Triage Automation

Paste any raw bug report. GPT-4o-mini classifies severity, scores priority, and extracts structured fields via a Make.com automation pipeline.

Every engagement ends with
something you can act on.

Documented outputs, not activity summaries. Here's what that looks like in practice.

📋

Bug Reports

  • Steps to reproduce, exact conditions, expected vs actual behavior
  • Severity classification and suggested fix path
  • Delivered in your preferred format — doc, sheet, or Notion

🧪

Lightweight Web & PWA Builds

  • Simple HTML websites, lightweight PWAs, landing pages, and AI-assisted product experiments
  • Mobile responsiveness, offline behavior, auth/session handling, and workflow validation
  • Structured iteration using AI-assisted tooling for implementation, debugging, refinement, and testing

📁

Structured Research Sheets

  • Source-attributed, consistently formatted, ready for AI ingestion
  • Clean schema — no junk rows, no mixed formats
  • Delivered as CSV, XLSX, or JSON depending on your pipeline

🗂️

Dataset Curation

  • Structured text datasets extracted and cleaned for LLM training pipelines
  • Consistent labeling, formatting, and deduplication across large collections
  • Source-verified, schema-consistent, delivered in your required format

Sample Bug Report

Real report, sanitized client details
P1 — Blocker

System fails to generate downstream outputs after successful data processing

Context

Web application · Staging environment · Workflow: Data ingestion → Processing → Output generation

Steps to Reproduce

  1. Create a new workspace/entity
  2. Connect a data source and initiate processing
  3. Allow processing stage to complete successfully
  4. Trigger output generation step A, then step B
  5. Observe output status

Expected

Both output generation steps complete, producing valid output artifacts.

Actual

Processing completes. Both output steps fail. No artifacts created.

System Logs (Sanitized)

processing completed successfully (items_processed: 16, blocked: false)
output_a.asset_id = null
output_b.asset_id = null
last_failed_runs.output_a.status = failed
last_failed_runs.output_b.status = failed

Analysis (LLM-assisted QA review)

Processing completes successfully, but output generation fails. This suggests a possible issue in how processed data is passed into the output generation step, or missing required input mapping between stages.
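
The mismatch in the sanitized log (processing succeeded, outputs missing) is the kind of condition that can be turned into a repeatable check. The sketch below assumes the log fields have been loaded into a flat dictionary; the `processing_completed` key and the function name are hypothetical, while the other keys mirror the log lines above.

```python
def find_output_failures(state: dict) -> list[str]:
    """List output-stage problems for a run whose processing succeeded."""
    if not state.get("processing_completed"):
        return ["processing did not complete; output checks skipped"]
    issues = []
    for step in ("output_a", "output_b"):
        # A missing key is treated the same as an explicit null asset id.
        if state.get(f"{step}.asset_id") is None:
            issues.append(f"{step}: no asset created")
        if state.get(f"last_failed_runs.{step}.status") == "failed":
            issues.append(f"{step}: last run failed")
    return issues
```

Run against the state in this report, the check would flag both output steps twice: once for the missing asset and once for the failed run.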

LLM Evaluation

Real evaluation, sanitized client details

Multi-model document summarization — 6 models, 4 document lengths

I conducted a structured evaluation of multiple LLM outputs across different document lengths, assessing accuracy, reasoning, completeness, and hallucination behavior. Claude Opus was used as the primary reference model for consistency and comparison.

Rubric (6 dimensions)

Grounding · Reasoning · Completeness · Actionability · Clarity · Task Overlay — plus hallucination flags: none / minor / critical

Key findings

  • Gemini 3.1 Flash Lite fabricated phantom citation brackets [1][2][3] throughout a 40K-word task — classified hallucination_critical, scored 33/100
  • GPT-4.1 nano misreported a numerical statistic by an order of magnitude in a medium document, an error that is hard to catch and high-risk in production
  • Haiku 4.5 fabricated an institutional affiliation in a very long document — unacceptable in clinical or high-stakes contexts
  • GPT-4.1 mini was the top summarizer for short and very long documents; Sonnet 4.6 was the most reliable for medium and long documents

Output — routing recommendation

Task                   | Primary      | Fallback
Short (~1K words)      | GPT-4.1 mini | Haiku 4.5
Medium (~5K words)     | Sonnet 4.6   | Haiku 4.5
Long (~15K words)      | Sonnet 4.6   | Gemini 2.5 Flash
Very long (~40K words) | GPT-4.1 mini | Sonnet 4.6
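
One way a routing recommendation like this could be applied in code is sketched below. The model names and nominal lengths come from the table; picking the bucket with the nearest nominal word count is an assumption of this sketch, since the evaluation does not define exact boundaries between buckets.

```python
# (nominal word count, primary model, fallback model), taken from the table.
ROUTES = [
    (1_000, "GPT-4.1 mini", "Haiku 4.5"),
    (5_000, "Sonnet 4.6", "Haiku 4.5"),
    (15_000, "Sonnet 4.6", "Gemini 2.5 Flash"),
    (40_000, "GPT-4.1 mini", "Sonnet 4.6"),
]


def route(word_count: int, unavailable=frozenset()) -> str:
    """Pick a model name for a document of the given length.

    Assumption: the bucket is the one whose nominal length is closest
    to word_count. The fallback is used when the primary is unavailable.
    """
    _, primary, fallback = min(ROUTES, key=lambda r: abs(r[0] - word_count))
    return fallback if primary in unavailable else primary
```

For example, `route(6000)` selects the medium bucket, and passing `unavailable={"Sonnet 4.6"}` switches that choice to the bucket's fallback.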

AI-assisted debugging & documentation

PWA login sessions disappearing on refresh

Problem

Users were logged out every time the PWA was refreshed or reinstalled on iOS Safari.

Tested

Service worker caching strategy, Supabase auth token storage, ITP cookie behavior across iOS versions.

Output

Surfaced a caching conflict blocking session persistence. Documented the issue and fix path with AI assistance.

Need structured QA, LLM evaluation,
or AI workflow support?

Send me what you're building, what's breaking, or what you need validated. I'll return a clear issue list, evaluation notes, or workflow documentation — depending on what's needed. Available for 1–2 projects at a time, async-first, remote only.
