Progress Update

Agentic baselines

Date: March 25, 2026
Preview: Baselines for ADNI

Baseline

Vanilla zero-shot LLM

One LLM call with no tools or agent graph: instructions plus all tabular PET SUVRs, MRI regional volumes, and demographics are concatenated into one prompt string (implementation may still map slices to API roles internally). Model: GPT-4o mini. Test accuracy 0.3007, F1 0.2549.

Example prompt (abridged)
You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. You integrate amyloid PET (SUVR), structural MRI volumetrics, and demographic context to reason about cognitive stage. Cognitive stages are: CN (cognitively normal), MCI (mild cognitive impairment), and AD (Alzheimer's dementia).

Task: Diagnose cognitive status for patient 2c2b87ca69d76b46

PET Amyloid (SUVR):
  BRAINSTEM SUVR: 1.271
  …  [full SUVR atlas follows]

MRI Structural Volumes:
  BRAINSTEM VOLUME: 19617.0
  …  [full volume atlas follows]

Demographics:
  Age at visit (years): 69.50
  …  [gender, education, ethnicity, race]

Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD.
Format: 'The diagnosis is: [CN/MCI/AD]'
Production prompts include every regional line from the input table.
Example model output (same patient, GPT-4o mini)
To diagnose the cognitive status of the patient based on the provided data, we will analyze the amyloid PET SUVR values, structural MRI volumetrics, and demographic context.

1. **Amyloid PET SUVR Analysis**:
   - The patient's SUVR values are generally elevated, particularly in regions associated with Alzheimer's pathology (e.g., frontal lobe, cingulate, and temporal lobe).
   - Notably, the entorhinal cortex SUVR is 0.948, which is below the typical threshold for amyloid positivity, but many other regions show significant elevations (e.g., frontal lobe SUVRs around 1.3-1.5).
   - The overall pattern suggests a significant amyloid burden, which is often associated with Alzheimer's disease.

2. **MRI Structural Volumetrics**:
   - The volumes of critical regions such as the hippocampus (7255.0) and entorhinal cortex (5463.0) are important. While the hippocampal volume is somewhat preserved, the entorhinal cortex volume is on the lower side, which can indicate early neurodegeneration.
   - The overall brain structure appears to show some atrophy, particularly in areas typically affected by Alzheimer's disease.

3. **Demographics**:
   - The patient is a 69.5-year-old female with 20 years of education. Higher education levels can sometimes correlate with better cognitive reserve, potentially masking early cognitive decline.

4. **Cognitive Stage**:
   - Given the elevated amyloid burden and the structural changes observed, the patient likely falls into the category of Mild Cognitive Impairment (MCI) rather than being cognitively normal (CN) or having full-blown Alzheimer's dementia (AD). The presence of amyloid pathology combined with some structural changes suggests that the patient is experiencing cognitive decline that is not yet severe enough to meet the criteria for Alzheimer's dementia.

Based on this analysis, the final diagnosis is:

The diagnosis is: MCI
Free-form reasoning; final line matches the requested format for parsing.

Baseline

Chain-of-thought (CoT)

Same tabular inputs as zero-shot, plus explicit section headings for reasoning (PET → MRI → demographics → integration) and a single parseable final line. One prompt string to the model. single_LLM/cot.py · GPT-4o mini. Test accuracy 0.2941, F1 0.2599.

Example prompt
You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. The user message requires structured chain-of-thought under fixed section headings, then exactly one final line with the diagnosis. Follow their headings and the final-line format exactly. Stages: CN (cognitively normal), MCI (mild cognitive impairment), AD (Alzheimer's dementia).

Task: Diagnose cognitive status for patient 2c2b87ca69d76b46

Expert opinions:

PET Amyloid (SUVR):
  BRAINSTEM SUVR: 1.271
  …  [full regional SUVR tables]

MRI Structural Volumes:
  BRAINSTEM VOLUME: 19617.0
  …  [full regional volume tables]

Demographics:
  Age at visit (years): 69.50
  Gender: Female
  …

Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD.
Format: 'The diagnosis is: [CN/MCI/AD]'
---
How to answer (required structure):

1) Write chain-of-thought under these Markdown headings:
### 1. PET amyloid (SUVR)
Interpret regional SUVR patterns and what they suggest about amyloid-related burden relative to typical CN vs neurodegenerative patterns.

### 2. MRI structural volumes
Interpret atrophy or preservation (e.g., medial temporal, ventricular, cortical) and how that aligns with CN vs MCI vs AD.

### 3. Demographics
Briefly contextualize age, education, sex, and other given fields (do not invent facts not in the data).

### 4. Integration
Weigh PET, MRI, and demographics together and argue which single stage (CN, MCI, or AD) is best supported.

2) After those sections, output exactly one final line with no other text on that line:
The diagnosis is: [CN/MCI/AD]
Example model output
### 1. PET amyloid (SUVR)
The PET amyloid results show elevated SUVR values in several regions, particularly in the brainstem (1.397) and various areas of the cerebral cortex, including the cingulate and frontal lobe. These values suggest a significant amyloid burden, which is typically associated with neurodegenerative processes such as Alzheimer's disease (AD). In contrast, cognitively normal (CN) individuals usually exhibit lower SUVR values across these regions.

### 2. MRI structural volumes
The MRI structural volumes indicate some degree of atrophy, particularly in the frontal lobe and temporal lobe regions, which are critical for memory and cognitive function. The ventricular volumes are also notably enlarged, which is often indicative of neurodegeneration. In CN individuals, we would expect to see preserved volumes in these areas, while MCI and AD patients typically show some level of atrophy.

### 3. Demographics
The patient is a 78-year-old female with 18 years of education. This demographic information is relevant as older age is a significant risk factor for cognitive decline, and higher education levels can sometimes provide a protective effect against dementia. However, the combination of age and the observed neuroimaging findings suggests a higher likelihood of cognitive impairment.

### 4. Integration
Considering the elevated amyloid burden indicated by the PET scan, the structural atrophy observed in the MRI, and the patient's age and educational background, the evidence strongly supports a diagnosis of Alzheimer's disease (AD). The combination of these factors points towards a neurodegenerative process rather than a state of mild cognitive impairment (MCI) or cognitive normality (CN).

The diagnosis is: AD
Headings match the template; last line is parsed as the prediction.

Baseline

Few-shot

Labeled demos share the same tabular layout as the query patient; each ends with The diagnosis is: …, then one new patient without a label. We report 3-shot (1 per class), 9-shot (3 per class), and 15-shot (5 per class). GPT-4o mini

Example prompt
You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. The user message first shows one or more example patients with the same data layout and their gold-standard diagnosis (CN, MCI, or AD), then a new patient without a label. Diagnose only the new patient using the same reasoning style as for the examples. Output format for your answer: a single line exactly: The diagnosis is: CN or MCI or AD.

The following are example patients with multimodal data and their correct cognitive stage (CN, MCI, or AD).
Use them as reference for how data patterns relate to the label.

— Example 1 (CN) —
PETAmyloid (SUVR):
Brainstem:
  BRAINSTEM SUVR: 1.271

Cerebellum:
  CEREBELLUM CORTEX SUVR: 0.901
  CEREBELLUM WHITE MATTER SUVR: 1.346

Cerebral Cortex:
  CEREBRAL WHITE MATTER SUVR: 1.687

Cingulate:
  CTX CAUDALANTERIORCINGULATE SUVR: 1.458
  CTX ISTHMUSCINGULATE SUVR: 1.389
MRI …
Demographics …
The diagnosis is: CN

— Examples 2–8 —
…
The diagnosis is: MCI   (or AD, etc.)

Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD.
Format: 'The diagnosis is: [CN/MCI/AD]'
Test results
SettingAccuracyF1
3-shot (1/class)0.38240.3231
9-shot (3/class)0.33990.2859
15-shot (5/class)0.33990.2863

ADAgent

Collaborative Multi-Modal Diagnostic Agent

ADAgent composes multiple imaging tools (e.g., ResNet, MedicalNet) with LangChain and LangGraph frameworks on GPT-4o that aggregates evidence and issues a human-readable reasoning trace and final decision.

  • Routing: selects trained SOTA set conditional on input features (MRI only, PET only, or MRI + PET)
  • Tool execution: runs up to 5 SOTA model ensemble per case, collects per-tool predicted probability for each class
  • Coordinator: synthesizes outputs with confidence & justification
ADAgent schematic
#file:prompt.png
Hou, W. et al. (2026). ADAgent: LLM Agent for Alzheimer’s Disease Analysis with Collaborative Coordinator. In: Qiu, J., et al. AI for Clinical Applications. Agentic AI CMLLMs CREATE 2025 2025 2025. Lecture Notes in Computer Science, vol 16147. Springer, Cham. https://doi.org/10.1007/978-3-032-06004-4_3

Published Results

Diagnosis & Prognosis (ADNI)

Reported results from the paper (these are literature numbers, not our runs). Tasks shown:
  Diagnosis: three-way classification predicting Cognitively Normal (CN), Mild Cognitive Impairment (MCI), or Alzheimer's Disease (AD).
  Prognosis: binary prediction whether an MCI patient will convert to AD within 36 months (converter vs stable).

Multi-modal diagnosis task
MethodACCSPESENF1
MedicalNet0.571±0.0090.768±0.0090.591±0.0300.586±0.015
nnMamba0.527±0.0160.753±0.0080.572±0.0210.541±0.028
ResNet500.567±0.0390.771±0.0180.599±0.0360.581±0.043
MCAD0.548±0.0290.766±0.0140.591±0.0270.556±0.026
CMViM0.617±0.0140.659±0.0030.630±0.0130.633±0.010
ADAgent0.644±0.0140.794±0.0100.644±0.0210.647±0.018
Multi-modal prognosis task
MethodACCSPESENF1
MedicalNet0.763±0.0260.902±0.0170.333±0.0520.407±0.064
nnMamba0.733±0.0380.843±0.0680.394±0.1890.405±0.147
ResNet500.726±0.0340.853±0.0290.333±0.0520.374±0.065
MCAD0.719±0.0260.892±0.0610.182±0.0910.231±0.080
CMViM0.815±0.0460.912±0.0290.515±0.1050.575±0.109
ADAgent0.822±0.0100.941±0.0190.545±0.0430.561±0.030

Tabular Baseline Adaptation

Turning ADAgent into a Baseline for Tabular Datasets

Key modifications to make ADAgent comparable to tabular SOTA methods.

  1. 1. Replace CNN SOTAs for images with tabular SOTAs for MRI, PET, and demographics.
  2. 2. Provide strong tabular baselines for comparison.
  3. 3. We hyperparameter-tune each SOTA method using Optuna to obtain the best and most stable settings.
Tabular SOTA Methods
  • FT-Transformer — transformer architecture adapted for tabular features.
  • TabNet — attentive decision-step model providing feature selection and interpretability.
  • TabPFN — prior-data fitted network for fast probabilistic / few-shot tabular predictions.
  • NODE — Neural Oblivious Decision Ensembles (differentiable tree-like model for tabular data).
  • TabM — modern MLP-based tabular model with enhanced feature-interaction handling.
ML Baselines (backup)
  • XGBoost — gradient-boosted decision trees; fast, strong baseline for tabular data.
  • MLP — multi-layer perceptron baseline for sanity checks and simple comparisons.

ADAgent in Action

LangGraph & LangChain Coordinator

How the LLM Coordinator orchestrates the tabular SOTAs based on the execution logs:

LangGraph (State Machine)

Manages the cyclic control flow. It defines nodes (process, execute) and conditional edges (has_tool_calls) to loop between reasoning and tool execution until a final answer is reached.

LangChain (Primitives)

Provides the core components: the ChatOpenAI wrapper (GPT-4.1-mini), prompt templates, message history, and the standard interface for binding our tabular tools (e.g., pet_diagnosis).

Execution Trace

[LangGraph] Enter `process` node
[LangChain] `ChatOpenAI` generates tool call
[LangGraph] `has_tool_calls` = True
[LangGraph] Enter `execute` node
[LangChain] Run `pet_diagnosis` tool
[LangGraph] Loop back to `process` node
[LangChain] `ChatOpenAI` synthesizes final diagnosis
[LangGraph] `has_tool_calls` = False -> Exit
User Prompt + Patient CSV
LangGraph Cyclic Workflow
Node: process
LangChain ChatOpenAI analyzes state & generates response or tool calls
Edge: has_tool_calls?
Yes
Node: execute
Run Tabular SOTAs (XGBoost, TabNet, NODE)
↺ Return results to process
No
Exit Graph
Return final synthesized diagnosis
Final Output: "MCI"

Under the Hood

Prompts & Tool Execution Workflow

The LLM is guided by a strict system prompt that defines its persona, rules of engagement, and the expected output format.

System Prompt:
"You are an expert medical AI assistant for diagnosing Alzheimer's Disease.
Make multiple tool calls in parallel or sequence as needed for comprehensive answers. Critically think about and criticize the tool outputs.
AD processes can be divided into three stages: CN (Cognitively Normal), MCI (Mild Cognitive Impairment), AD (Alzheimer's Disease).
Your final decision must select one category from the above three stages."
Human Input:
"pet_path: temp/pet_002_S_0295.csv
Please diagnose this patient based on the provided PET CSV data. Conclude with one of [CN,MCI,AD]."

How the LLM interprets the tabular tool outputs to form a final diagnosis:

1. Tool Output (pet_diagnosis)
{
  'XGBoost': [0.29, 0.47, 0.24], // [CN, MCI, AD]
  'TabNet':  [0.37, 0.28, 0.35],
  'NODE':    [0.48, 0.40, 0.12]
}
2. LLM Synthesis & Reasoning
"The PET data diagnosis results are from three models:
  • XGBoost model probabilities: CN 29.17%, MCI 47.27%, AD 23.56%
  • TabNet model probabilities: CN 36.77%, MCI 27.92%, AD 35.31%
  • NODE model probabilities: CN 48.08%, MCI 40.38%, AD 11.54%
The XGBoost model suggests MCI is most likely, TabNet suggests AD slightly more than CN or MCI, and NODE suggests CN as most likely.

Considering the mixed results but with a tendency towards MCI and CN stages, the overall conclusion leans towards MCI as the most probable stage.

Final diagnosis: MCI (Mild Cognitive Impairment)."

Validation & Test

Tabular Baselines — ACC & F1 (All modalities - PET, MRI, Demographics)

Results are aggregated over 5 random seeds (mean ± std reported). All models were trained for up to 100 epochs with early stopping (patience = 10) to avoid under- or over-fitting.
Validation Results
MethodACCF1
XGBoost0.6314±0.05740.6301±0.0587
FT-Transformer0.6461±0.04030.6491±0.0420
TabNet0.6497±0.03110.6382±0.0359
TabPFN0.6461±0.04270.6486±0.0427
NODE0.6392±0.04920.6415±0.0588
TabM0.6379±0.04540.6258±0.0573
Test Results
MethodACCF1
XGBoost0.6261±0.01340.6242±0.0096
FT-Transformer0.6313±0.01780.6271±0.0144
TabNet0.5991±0.03670.5872±0.0404
TabPFN0.6565±0.02660.6564±0.0231
NODE0.6461±0.03500.6467±0.0318
TabM0.6078±0.02390.6083±0.0107
Mean (5 tabular SOTAs)0.6282±0.02450.6251±0.0281
ADAgent (gpt-4-1-mini)0.6435±0.02150.6404±0.0199
Mean (5 tabular SOTAs): average of test ACC/F1 means for FT-Transformer, TabNet, TabPFN, NODE, and TabM (XGBoost excluded as baseline); ± is the standard deviation across those five method-level means (spread between models, not seed uncertainty).

Validation & Test

Tabular Baselines — ACC & F1 (MRI only)

Validation Results (MRI only)
MethodACCF1
XGBoost0.6052±0.05430.5960±0.0611
FT-Transformer0.5870±0.04370.5711±0.0574
TabNet0.5922±0.05430.5541±0.0771
TabPFN0.5922±0.04740.5881±0.0501
NODE0.6000±0.04200.5860±0.0520
TabM0.5895±0.05630.5871±0.0605
Test Results (MRI only)
MethodACCF1
XGBoost0.5817±0.03200.5773±0.0333
FT-Transformer0.6174±0.02300.6080±0.0204
TabNet0.5243±0.03020.5044±0.0280
TabPFN0.6174±0.01770.6085±0.0183
NODE0.5991±0.02630.5913±0.0280
TabM0.5974±0.03380.5985±0.0296
Mean (5 tabular SOTAs)0.5911±0.03860.5821±0.0440
ADAgent (gpt-4-1-mini)0.6061±0.01560.6070±0.0157
Mean (5 tabular SOTAs): average of test ACC/F1 means for FT-Transformer, TabNet, TabPFN, NODE, and TabM (XGBoost excluded as baseline); ± is the standard deviation across those five method-level means (spread between models, not seed uncertainty).

Validation & Test

Tabular Baselines — ACC & F1 (PET only)

Validation Results (PET only)
MethodACCF1
XGBoost0.6065±0.05430.5940±0.0611
FT-Transformer0.6043±0.02350.5918±0.0227
TabNet0.6039±0.03910.5802±0.0565
TabPFN0.5983±0.02700.5857±0.0246
NODE0.6013±0.03810.5878±0.0424
TabM0.5882±0.03070.5727±0.0391
Test Results (PET only)
MethodACCF1
XGBoost0.5983±0.03320.5955±0.0347
FT-Transformer0.6243±0.01760.6140±0.0158
TabNet0.5417±0.01460.5197±0.0202
TabPFN0.6304±0.02770.6219±0.0258
NODE0.5948±0.03320.5889±0.0315
TabM0.6183±0.02290.6236±0.0175
Mean (5 tabular SOTAs)0.6019±0.03630.5936±0.0436
ADAgent (gpt-4-1-mini)0.6226±0.03130.6178±0.0369
Mean (5 tabular SOTAs): average of test ACC/F1 means for FT-Transformer, TabNet, TabPFN, NODE, and TabM (XGBoost excluded as baseline); ± is the standard deviation across those five method-level means (spread between models, not seed uncertainty).

Error Analysis

Why did GPT-4-1-mini underperform?

Core Issue: LLMs (especially smaller models like gpt-4-1-mini) struggle with raw numerical reasoning. They frequently misinterpret arrays of probabilities, leading to flawed aggregation.

Tool Output (Probabilities)

Patient: pet_002_S_0295.csv | Classes: [CN, MCI, AD]

  • XGBoost: [0.29, 0.47, 0.23] → Predicts MCI
  • TabNet: [0.36, 0.27, 0.35] → Predicts CN (barely)
  • NODE: [0.48, 0.40, 0.11] → Predicts CN
True Majority: 2 votes for CN, 1 vote for MCI

LLM's Flawed Reasoning

The model fails to correctly tally the votes and hallucinates a majority.

"The XGBoost model indicates the highest probability for MCI stage. The TabNet model is somewhat ambiguous but leans slightly towards AD. The NODE model indicates the highest probability for CN stage.

Considering these results, the overall highest probability category from the majority of models is MCI.

Final diagnosis: MCI"
Result: Incorrect Final Diagnosis (MCI instead of CN)

Other Baselines

Run other baselines

Run other baselines:

  1. Autogen