Thinkers360

Agentic Workflows and Clinical Accuracy: Qwen3-VL-8B-Thinking in Multimodal Medical Diagnosis

Oct



Introduction

The aspiration to integrate intelligent systems into medicine is as old as the digital age itself, dating back to early expert systems such as MYCIN and Internist. While such systems were rule-based and brittle, the emergence of Large Multimodal Models (LMMs) marks a paradigm shift, offering the potential to process the complexity inherent in real-world clinical practice. Today, AI must move beyond simple image classification to synthesize diverse data streams—clinical history, laboratory results, and complex imaging—to offer verifiable diagnostic and management strategies. This endeavour is not merely academic; it is motivational, driven by the need to support clinicians in high-stakes scenarios where fragmented data can lead to missed diagnoses or treatment delays. This paper evaluates the capabilities of the Qwen3-VL-8B-Thinking model in performing a complex, multimodal medical diagnosis, specifically examining the trade-offs between instantaneous accuracy and the robust, verifiable precision achieved through an iterative agentic workflow.

The development of LMMs capable of synthesizing visual evidence (e.g., imaging) with extensive text data (e.g., clinical history) is foundational to future clinical informatics. The Qwen3-VL-8B-Thinking model was tested in a high-stakes diagnostic scenario—a complex case of stercoral colitis—to evaluate its consistency and accuracy under both single-pass and iterative agentic workflows. The results demonstrate the model’s robust reasoning capabilities, highlighting its proficiency in handling nuanced medical data and its capacity to be systematically guided toward precise, verifiable clinical outputs.

The Ground Truth: Inspiration from a Clinical Case Study

This experiment was meticulously structured around a specific, published clinical case study: "Stercoral Colitis," authored by Aleksandra Bajer, B.S., and Erica Levine, M.D., and published in the New England Journal of Medicine (N Engl J Med 2025; 393: e23) on October 15, 2025 (DOI: 10.1056/NEJMicm2502616). This authoritative paper provided the ground truth necessary to design a high-fidelity benchmark for the Qwen3-VL model.

The case involves a 23-year-old man with autism spectrum disorder and chronic constipation. This unique combination of risk factors elevates the case's complexity beyond routine impaction. The paper detailed:

  1. Specific Imaging Findings: Computed Tomography (CT) scans revealing colonic distention, mural thickening, and perirectal fat stranding—the visual evidence provided to the model.

  2. Required Acute Management: Fecal disimpaction via flexible sigmoidoscopy.

  3. Comprehensive Long-Term Management: The finding of puborectalis muscular dysfunction required follow-up with anorectal manometry and pelvic-floor physical therapy.

These five critical elements (Diagnosis, Imaging Findings, Acute Procedure, Long-Term Assessment, and Long-Term Therapy) formed the non-negotiable checklist for the Validation Agent in the iterative workflow. The difficulty of the task lies not just in diagnosis, but in producing this comprehensive, multi-stage management plan that integrates acute care with chronic neurological causes.

Code Structure and Experimental Methodology

The experiment employed two distinct methodologies, each implemented in Python code to interact with the Qwen-VL-8B-Thinking model via the OpenRouter API.

1. The Non-Agentic (Single-Pass) Version

This workflow serves as the efficiency benchmark. It is direct, simulating a human clinician providing a single, comprehensive request to the model:

  • Structure: A single function call containing all inputs: the CT images (encoded as Base64 data), the clinical vignette, and an exhaustive prompt detailing the required diagnostic elements (e.g., rationale, differential diagnoses, acute intervention, and long-term management).

  • Result: The model delivers one, unassisted output. The success of this approach hinges entirely on the clarity of the initial prompt and the model’s immediate reasoning capacity.

2. The Agentic (Iterative) Version

This workflow serves as the robustness benchmark, simulating a multi-stage review process designed to enforce specific clinical precision. It is built around three specialized, interacting Python classes (agents):

  • Image Analysis Agent: This initial agent's sole task is to describe the raw, observable findings from the CT images (e.g., "Colon distention," "Increased colon wall thickness," "Pericolonic fat stranding") without drawing clinical conclusions. This ensures the primary model grounds its subsequent output in concrete visual evidence.

  • Prompt Engineer Agent: This agent manages the iterative flow. For each loop, it updates the prompt by incorporating the image findings and, critically, integrates the specific negative feedback received from the Validation Agent. This targets the model's refinement (e.g., forcing the use of the termrequiringoral Colitis" instead of a generalized term).

  • Validation Agent: This is the gatekeeper. It contains a fixed set of five non-negotiable clinical criteria (Diagnosis, Acute Procedure, Long-Term Assessment, Long-Term Therapy, and Complications). To overcome the rigidity issues of the initial runs, this agent uses Regular Expressions for flexible but specific semantic checking (e.g., accepting flexible sigmoidoscopy or endoscopic removal). If any criterion is not met, the loop continues; only perfect compliance achieves convergence.

This modular, iterative design was essential for proving that the Qwen3-VL model could be systematically steered to align with the precise, detailed requirements of the authoritative medical literature.

Qwen3-VL-8B-Thinking's Core Performance

The model's ability to interpret the three-part CT scan (coronal, sagittal, and axial views) alongside the critical clinical vignette (23-year-old male, autism spectrum disorder, chronic constipation) was highly reliable across all experimental runs:

  • Multimodal Synthesis: Qwen3-VL-8B-Thinking consistently linked the visual findings (colonic distention, soft tissue density of impacted stool, wall thickening, and perirectal fat stranding) to the clinical context. It correctly deduced that the patient's history of chronic constipation, exacerbated by ASD-related behavioural factors, was the root cause of the acute condition.

  • Diagnostic Accuracy: The model maintained a high level of diagnostic correctness throughout the experiment, rapidly identifying the condition as Stercoral Colitis or its direct mechanism, "Fecal Impaction with Secondary Ischemic Colitis."

  • Management Comprehensiveness: Crucially, the model consistently included the complete three-part management plan derived from the medical ground truth: endoscopic disimpaction (e.g., flexible sigmoidoscopy), necessary diagnostic follow-up via anorectal manometry, and the long-term therapeutic strategy of pelvic-floor physical therapy.

The Model Under Different Workflows

1. Non-Agentic (Efficiency Test)

In the single-prompt test, Qwen3-VL-8B-Thinking demonstrated exceptional efficiency, producing a structured, correct, and comprehensive result instantly. This showed that, given a high-quality, fully contextualized prompt, the model can synthesize a complex clinical delivery in a single step. This workflow prioritizes speed, relying entirely on the model's innate ability to interpret and follow complex, layered instructions.

2. Agentic (Verifiability and Precision Test)

The agentic workflow, comprising the Image Analysis Agent, Prompt Engineer Agent, and Validation Agent, was designed to test the model's capacity for verifiable precision.

  • Initial Response: Qwen3-VL often provided the clinically equivalent description ("Fecal Impaction with Secondary Ischemic Colitis"), which, while accurate, lacked the specific, formal term.

  • Refinement and Convergence: The model responded effectively to the targeted prompts issued by the Prompt Engineer Agent. When the Validation Agent enforced the strict requirement for "Stercoral Colitis" and the specific procedure "flexible sigmoidoscopy," Qwen3-VL successfully modified its subsequent output to meet these exact semantic criteria. This successful convergence (at Iteration 4 in the final execution) proves that the Qwen3-VL-8B demonstrates a model that is not only intelligent but also highly steerable and capable of meeting predefined external requirements for regulated clinical documentation.

Comparative Results and Validation

Both the Non-Agentic and the Final Agentic versions provided high-accuracy medical diagnoses and treatment plans compared to the paper's ground truth.

Final Comparative Analysis Matrix

Feature

Ground Truth (Paper)

Non-Agentic Version (Original)

Final Agentic (Converged, Iteration 4)

Final Diagnosis

Stercoral Colitis

Stercoral Colitis

Stercoral Colitis

Pathology Rationale

Feces distend the colon, causing inflammation (ischemia).

Massive fecal impaction leading to ischemic inflammation.

Fecal Impaction --> Ischemia -->  Colitis (Inflammation).

Acute Procedure

Fecal disimpaction by flexible sigmoidoscopy.

Colonoscopy (preferred) / Enemas for disimpaction.

Flexible sigmoidoscopy is the gold standard for immediate disimpaction.

Long-Term Assessment

Anorectal manometry (showed non-relaxation of the anorectal angle).

Anorectal Manometry (to diagnose dysfunctional defecation).

Anorectal Manometry (to evaluate dyssynergia).

Long-Term Therapy

Pelvic-floor physical therapy was initiated.

Pelvic-Floor Physical Therapy (targets hypertonic puborectalis with biofeedback).

Pelvic-Floor Physical Therapy (using biofeedback).

Workflow Efficiency

N/A

Most Efficient (Single Pass)

Robust, Self-Correcting (Converged at Iteration 4)

Evaluation Summary

Medical Accuracy: Both the Non-Agentic and Final Agentic methods successfully yielded the specific diagnosis of Stercoral Colitis and correctly identified all three critical management steps: endoscopic disimpaction, anorectal manometry, and pelvic-floor physical therapy.

Efficiency vs. Robustness:

  • The Non-Agentic method was faster, achieving the result in a single, well-primed step.

  • The Final Agentic method demonstrated that an autonomous system could be engineered to achieve the same high-specificity result by using iterative feedback and self-correction, making it a more robust framework for complex, sensitive tasks.

The Future of Open-Source Agentic AI in Clinical Medicine

The successful application of the Qwen3-VL-8B-Thinking model—an open-source Large Multimodal Model—within an agentic framework holds significant implications for the future of clinical AI. Unlike proprietary black-box systems, open-source models offer crucial advantages in medical settings:

  • Transparency and Auditability: Open access allows researchers and hospital IT teams to inspect the underlying model architecture and fine-tune it with local, specialized medical data. This level of transparency is essential for building trust among clinicians and for regulatory compliance, as medical decisions must be fully auditable.

  • Customization and Specialization: Open-source models can be specialized for specific clinical domains (e.g., pediatric radiology, neuro-oncology) by continuous training on unique institutional data, a flexibility that is severely limited in closed commercial models. This is particularly valuable for rare or complex conditions like stercoral colitis, which require integrating GI, behavioural, and logical knowledge.

  • Safety via Agentic Architecture: The use of the agentic framework for mitigating the inherent risks (e.g., hallucinations, nonspecific outputs) associated with general-purpose LLMs in medicine. By breaking the task down into verifiable steps and using a Validation Agent to enforce clinical protocols and terminology, the workflow acts as a safety guardrail. This demonstrated convergence of an open-source model confirms that safety and high accuracy can be achieved simultaneously through structural, code-based interventions, paving the way for the decentralized adoption of powerful LMMs globally.

Convergence of multimodal intelligence and open-source agentic design marks a pivotal moment for clinical AI. The Qwen3-VL-8B-Thinking model demonstrated the necessary core intelligence to diagnose and manage a complex, multifactorial condition. One of the most profound lessons is that efficiency must yield to verifiability in healthcare. The iterative agentic workflow, though slower, delivered a result that was not only accurate but provably compliant with strict clinical criteria, ensuring the use of the precise diagnostic and procedural language required by specialists. This robust, steerable architecture—leveraging the transparency of open-source LMMs—establishes a scalable blueprint for safely embedding advanced AI assistants into critical care settings worldwide. The future of medical diagnosis is not merely about powerful LLMs; it is about building reliable, auditable agentic scaffolding that guarantees clinical confidence and patient safety.

By FRANK MORALES

Keywords: Agentic AI, Generative AI, Open Source

Share this article
Search
How do I climb the Thinkers360 thought leadership leaderboards?
What enterprise services are offered by Thinkers360?
How can I run a B2B Influencer Marketing campaign on Thinkers360?