Driftline
Independent research lab building evaluation tools for structural reliability in generative images.
Driftline studies structural correctness in generative image systems and builds tools to make visual failures easier to measure, compare, and understand.
Current work focuses on the gap between visual plausibility and structural correctness in diffusion-based image models, using controlled probes, scored image subsets, and evaluation workflows designed to support repeatable review.
Driftline Evaluator
Driftline Evaluator measures structural correctness in generative images, classifying recurring failure patterns and comparing how prompts and workflows perform.
Driftline Evaluator is a visual evaluation system for generative images. Rather than judging only style or realism, it asks whether an image holds together as a believable structure, scoring prompts and workflows under repeatable conditions so their failures can be compared directly.
- Measures structural correctness in generated images
- Classifies recurring visual failure patterns
- Compares how different prompts or workflows perform
- Organizes image review with a repeatable scoring rubric (sketched below)
- Helps separate believable outputs from subtle or obvious failures
- Supports research, benchmarking, and workflow evaluation
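To make the rubric idea concrete, here is a minimal sketch of what a rubric-driven scoring record and prompt-level comparison could look like. All names here (FailureClass, StructuralResult, summarize) and the failure taxonomy are illustrative assumptions, not the Driftline Evaluator’s actual API or rubric.

```python
# Minimal sketch of a rubric-driven scoring record and a prompt-level
# comparison. All names and the failure taxonomy below are illustrative
# assumptions, not the Driftline Evaluator's actual API or rubric.
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    NONE = "none"                # structurally acceptable
    GEOMETRY = "geometry"        # impossible joins, floating parts
    TOPOLOGY = "topology"        # wrong part counts or connectivity
    PERSPECTIVE = "perspective"  # inconsistent vanishing structure


@dataclass
class StructuralResult:
    image_id: str
    prompt: str
    score: float                 # rubric score in [0, 1]
    failure: FailureClass


def summarize(results: list[StructuralResult], pass_threshold: float = 0.5) -> dict:
    """Group scored images by prompt; report pass rate and failure mix."""
    by_prompt: dict[str, list[StructuralResult]] = {}
    for r in results:
        by_prompt.setdefault(r.prompt, []).append(r)
    return {
        prompt: {
            "n": len(rs),
            "pass_rate": sum(r.score >= pass_threshold for r in rs) / len(rs),
            "failures": Counter(
                r.failure.name for r in rs if r.failure is not FailureClass.NONE
            ),
        }
        for prompt, rs in by_prompt.items()
    }
```

Keeping reviews as structured records rather than free-form notes is what makes comparison repeatable: the same threshold and taxonomy can be re-applied to any new batch of outputs.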
Who it’s for
- Researchers studying failure patterns in generative images
- Builders comparing prompts, models, or workflows
- Creative teams who need a clearer way to review outputs
- Anyone trying to measure whether generated images are structurally believable
Research
A scored observational note examining chair-generation failures under minimal prompt conditions.
In a broader baseline run of approximately 7,000 generated images, 280 chair-class outputs were identified. A manually reviewed illustrative subset of 58 chairs was then scored using Driftline’s Structural Validity Score (SVS), yielding an even split between structurally acceptable and structurally failed outputs.
Key finding: even in a curated subset, half of generated chairs fail a basic structural plausibility test.
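As a rough illustration of how such a split is computed, the sketch below assumes each reviewed image carries a numeric SVS and that scores at or above a pass threshold count as acceptable; the 0.5 threshold and record format are assumptions, not the published rubric.

```python
# Illustrative only: recovers the reported split, assuming each reviewed
# chair carries a numeric SVS and that scores at or above a threshold
# count as acceptable. The 0.5 threshold is an assumption, not the
# published rubric.
def svs_split(scores: list[float], threshold: float = 0.5) -> tuple[int, int]:
    """Return (acceptable, failed) counts for an SVS-scored subset."""
    acceptable = sum(s >= threshold for s in scores)
    return acceptable, len(scores) - acceptable

# For the 58-chair subset with an even split, this yields (29, 29):
# a 50% structural failure rate even after curation.
```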
A structured hand-family comparison testing whether constrained prompt variations improve structural correctness in diffusion-generated hands. Prompt branches including pose cues, numeric wording, and semantic-styling language were reviewed under a locked rubric designed to separate visual plausibility from anatomical correctness.
Key finding: constrained prompting improved outcomes unevenly; pose cues reduced outright failures most effectively, but no tested condition reliably solved hand anatomy.
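A minimal sketch of how branch-level outcomes might be compared under a locked rubric is shown below. The branch labels and sample outcomes are hypothetical placeholders for the prompt conditions named above; they do not reproduce the study’s data.

```python
# Hypothetical sketch of a branch-level comparison under a locked rubric.
# Branch labels and the sample outcomes below are placeholders for the
# prompt conditions named above; they do not reproduce the study's data.
SAMPLE_REVIEWS = [
    # (branch, rubric outcome) pairs from manual review
    ("pose_cue", "acceptable"),
    ("pose_cue", "subtle_failure"),
    ("numeric_wording", "outright_failure"),
    ("semantic_styling", "subtle_failure"),
]


def outright_failure_rate(reviews: list[tuple[str, str]], branch: str) -> float:
    """Share of a branch's reviewed images scored as outright failures."""
    outcomes = [o for b, o in reviews if b == branch]
    return outcomes.count("outright_failure") / len(outcomes) if outcomes else float("nan")


for branch in sorted({b for b, _ in SAMPLE_REVIEWS}):
    print(branch, f"{outright_failure_rate(SAMPLE_REVIEWS, branch):.2f}")
```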
Contact
research@driftline-us.ai