Progress Update

Agentic baselines

Date: March 25, 2026

Preview: Baselines for ADNI

Baseline

Vanilla zero-shot LLM

One LLM call with no tools or agent graph: instructions plus all tabular PET SUVRs, MRI regional volumes, and demographics are concatenated into one prompt string (implementation may still map slices to API roles internally). Model: GPT-4o mini. Test accuracy 0.3007, F1 0.2549.

Example prompt (abridged)

You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. You integrate amyloid PET (SUVR), structural MRI volumetrics, and demographic context to reason about cognitive stage. Cognitive stages are: CN (cognitively normal), MCI (mild cognitive impairment), and AD (Alzheimer's dementia).

Task: Diagnose cognitive status for patient 2c2b87ca69d76b46

PET Amyloid (SUVR):
  BRAINSTEM SUVR: 1.271
  …  [full SUVR atlas follows]

MRI Structural Volumes:
  BRAINSTEM VOLUME: 19617.0
  …  [full volume atlas follows]

Demographics:
  Age at visit (years): 69.50
  …  [gender, education, ethnicity, race]

Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD.
Format: 'The diagnosis is: [CN/MCI/AD]'

Production prompts include every regional line from the input table.

Example model output (same patient, GPT-4o mini)

To diagnose the cognitive status of the patient based on the provided data, we will analyze the amyloid PET SUVR values, structural MRI volumetrics, and demographic context.

1. **Amyloid PET SUVR Analysis**:
- The patient's SUVR values are generally elevated, particularly in regions associated with Alzheimer's pathology (e.g., frontal lobe, cingulate, and temporal lobe).
- Notably, the entorhinal cortex SUVR is 0.948, which is below the typical threshold for amyloid positivity, but many other regions show significant elevations (e.g., frontal lobe SUVRs around 1.3-1.5).
- The overall pattern suggests a significant amyloid burden, which is often associated with Alzheimer's disease.

2. **MRI Structural Volumetrics**:
- The volumes of critical regions such as the hippocampus (7255.0) and entorhinal cortex (5463.0) are important. While the hippocampal volume is somewhat preserved, the entorhinal cortex volume is on the lower side, which can indicate early neurodegeneration.
- The overall brain structure appears to show some atrophy, particularly in areas typically affected by Alzheimer's disease.

3. **Demographics**:
- The patient is a 69.5-year-old female with 20 years of education. Higher education levels can sometimes correlate with better cognitive reserve, potentially masking early cognitive decline.

4. **Cognitive Stage**:
- Given the elevated amyloid burden and the structural changes observed, the patient likely falls into the category of Mild Cognitive Impairment (MCI) rather than being cognitively normal (CN) or having full-blown Alzheimer's dementia (AD). The presence of amyloid pathology combined with some structural changes suggests that the patient is experiencing cognitive decline that is not yet severe enough to meet the criteria for Alzheimer's dementia.

Based on this analysis, the final diagnosis is:

The diagnosis is: MCI

Free-form reasoning; final line matches the requested format for parsing.

Baseline

Chain-of-thought (CoT)

Same tabular inputs as zero-shot, plus explicit section headings for reasoning (PET → MRI → demographics → integration) and a single parseable final line. One prompt string to the model. single_LLM/cot.py · GPT-4o mini. Test accuracy 0.2941, F1 0.2599.

Example prompt

You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. The user message requires structured chain-of-thought under fixed section headings, then exactly one final line with the diagnosis. Follow their headings and the final-line format exactly. Stages: CN (cognitively normal), MCI (mild cognitive impairment), AD (Alzheimer's dementia).

Task: Diagnose cognitive status for patient 2c2b87ca69d76b46

Expert opinions:

PET Amyloid (SUVR):
  BRAINSTEM SUVR: 1.271
  …  [full regional SUVR tables]

MRI Structural Volumes:
  BRAINSTEM VOLUME: 19617.0
  …  [full regional volume tables]

Demographics:
  Age at visit (years): 69.50
  Gender: Female
  …

Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD.
Format: 'The diagnosis is: [CN/MCI/AD]'
---
How to answer (required structure):

1) Write chain-of-thought under these Markdown headings:
### 1. PET amyloid (SUVR)
Interpret regional SUVR patterns and what they suggest about amyloid-related burden relative to typical CN vs neurodegenerative patterns.

### 2. MRI structural volumes
Interpret atrophy or preservation (e.g., medial temporal, ventricular, cortical) and how that aligns with CN vs MCI vs AD.

### 3. Demographics
Briefly contextualize age, education, sex, and other given fields (do not invent facts not in the data).

### 4. Integration
Weigh PET, MRI, and demographics together and argue which single stage (CN, MCI, or AD) is best supported.

2) After those sections, output exactly one final line with no other text on that line:
The diagnosis is: [CN/MCI/AD]

Example model output

### 1. PET amyloid (SUVR)
The PET amyloid results show elevated SUVR values in several regions, particularly in the brainstem (1.397) and various areas of the cerebral cortex, including the cingulate and frontal lobe. These values suggest a significant amyloid burden, which is typically associated with neurodegenerative processes such as Alzheimer's disease (AD). In contrast, cognitively normal (CN) individuals usually exhibit lower SUVR values across these regions.

### 2. MRI structural volumes
The MRI structural volumes indicate some degree of atrophy, particularly in the frontal lobe and temporal lobe regions, which are critical for memory and cognitive function. The ventricular volumes are also notably enlarged, which is often indicative of neurodegeneration. In CN individuals, we would expect to see preserved volumes in these areas, while MCI and AD patients typically show some level of atrophy.

### 3. Demographics
The patient is a 78-year-old female with 18 years of education. This demographic information is relevant as older age is a significant risk factor for cognitive decline, and higher education levels can sometimes provide a protective effect against dementia. However, the combination of age and the observed neuroimaging findings suggests a higher likelihood of cognitive impairment.

### 4. Integration
Considering the elevated amyloid burden indicated by the PET scan, the structural atrophy observed in the MRI, and the patient's age and educational background, the evidence strongly supports a diagnosis of Alzheimer's disease (AD). The combination of these factors points towards a neurodegenerative process rather than a state of mild cognitive impairment (MCI) or cognitive normality (CN).

The diagnosis is: AD

Headings match the template; last line is parsed as the prediction.

Baseline

Few-shot

Labeled demos share the same tabular layout as the query patient; each ends with The diagnosis is: …, then one new patient without a label. We report 3-shot (1 per class), 9-shot (3 per class), and 15-shot (5 per class). GPT-4o mini

Example prompt

You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. The user message first shows one or more example patients with the same data layout and their gold-standard diagnosis (CN, MCI, or AD), then a new patient without a label. Diagnose only the new patient using the same reasoning style as for the examples. Output format for your answer: a single line exactly: The diagnosis is: CN or MCI or AD.

The following are example patients with multimodal data and their correct cognitive stage (CN, MCI, or AD).
Use them as reference for how data patterns relate to the label.

— Example 1 (CN) —
PETAmyloid (SUVR):
Brainstem:
  BRAINSTEM SUVR: 1.271

Cerebellum:
  CEREBELLUM CORTEX SUVR: 0.901
  CEREBELLUM WHITE MATTER SUVR: 1.346

Cerebral Cortex:
  CEREBRAL WHITE MATTER SUVR: 1.687

Cingulate:
  CTX CAUDALANTERIORCINGULATE SUVR: 1.458
  CTX ISTHMUSCINGULATE SUVR: 1.389
MRI …
Demographics …
The diagnosis is: CN

— Examples 2–8 —
…
The diagnosis is: MCI   (or AD, etc.)

Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD.
Format: 'The diagnosis is: [CN/MCI/AD]'

Test results

Setting	Accuracy	F1
3-shot (1/class)	0.3824	0.3231
9-shot (3/class)	0.3399	0.2859
15-shot (5/class)	0.3399	0.2863

ADAgent

Collaborative Multi-Modal Diagnostic Agent

ADAgent composes multiple imaging tools (e.g., ResNet, MedicalNet) with LangChain and LangGraph frameworks on GPT-4o that aggregates evidence and issues a human-readable reasoning trace and final decision.

Routing: selects trained SOTA set conditional on input features (MRI only, PET only, or MRI + PET)
Tool execution: runs up to 5 SOTA model ensemble per case, collects per-tool predicted probability for each class
Coordinator: synthesizes outputs with confidence & justification

Hou, W. et al. (2026). ADAgent: LLM Agent for Alzheimer’s Disease Analysis with Collaborative Coordinator. In: Qiu, J., et al. AI for Clinical Applications. Agentic AI CMLLMs CREATE 2025 2025 2025. Lecture Notes in Computer Science, vol 16147. Springer, Cham. https://doi.org/10.1007/978-3-032-06004-4_3

Published Results

Diagnosis & Prognosis (ADNI)

Reported results from the paper (these are literature numbers, not our runs). Tasks shown:
Diagnosis: three-way classification predicting Cognitively Normal (CN), Mild Cognitive Impairment (MCI), or Alzheimer's Disease (AD).
Prognosis: binary prediction whether an MCI patient will convert to AD within 36 months (converter vs stable).

Multi-modal diagnosis task
Method	ACC	SPE	SEN	F1
MedicalNet	0.571±0.009	0.768±0.009	0.591±0.030	0.586±0.015
nnMamba	0.527±0.016	0.753±0.008	0.572±0.021	0.541±0.028
ResNet50	0.567±0.039	0.771±0.018	0.599±0.036	0.581±0.043
MCAD	0.548±0.029	0.766±0.014	0.591±0.027	0.556±0.026
CMViM	0.617±0.014	0.659±0.003	0.630±0.013	0.633±0.010
ADAgent	0.644±0.014	0.794±0.010	0.644±0.021	0.647±0.018

Multi-modal prognosis task
Method	ACC	SPE	SEN	F1
MedicalNet	0.763±0.026	0.902±0.017	0.333±0.052	0.407±0.064
nnMamba	0.733±0.038	0.843±0.068	0.394±0.189	0.405±0.147
ResNet50	0.726±0.034	0.853±0.029	0.333±0.052	0.374±0.065
MCAD	0.719±0.026	0.892±0.061	0.182±0.091	0.231±0.080
CMViM	0.815±0.046	0.912±0.029	0.515±0.105	0.575±0.109
ADAgent	0.822±0.010	0.941±0.019	0.545±0.043	0.561±0.030

Tabular Baseline Adaptation

Turning ADAgent into a Baseline for Tabular Datasets

Key modifications to make ADAgent comparable to tabular SOTA methods.

1. Replace CNN SOTAs for images with tabular SOTAs for MRI, PET, and demographics.
2. Provide strong tabular baselines for comparison.
3. We hyperparameter-tune each SOTA method using Optuna to obtain the best and most stable settings.

Tabular SOTA Methods

FT-Transformer — transformer architecture adapted for tabular features.
TabNet — attentive decision-step model providing feature selection and interpretability.
TabPFN — prior-data fitted network for fast probabilistic / few-shot tabular predictions.
NODE — Neural Oblivious Decision Ensembles (differentiable tree-like model for tabular data).
TabM — modern MLP-based tabular model with enhanced feature-interaction handling.

ML Baselines (backup)

XGBoost — gradient-boosted decision trees; fast, strong baseline for tabular data.
MLP — multi-layer perceptron baseline for sanity checks and simple comparisons.

ADAgent in Action

LangGraph & LangChain Coordinator

How the LLM Coordinator orchestrates the tabular SOTAs based on the execution logs:

LangGraph (State Machine)

Manages the cyclic control flow. It defines nodes (process, execute) and conditional edges (has_tool_calls) to loop between reasoning and tool execution until a final answer is reached.

LangChain (Primitives)

Provides the core components: the ChatOpenAI wrapper (GPT-4.1-mini), prompt templates, message history, and the standard interface for binding our tabular tools (e.g., pet_diagnosis).

Execution Trace

[LangGraph] Enter `process` node
[LangChain] `ChatOpenAI` generates tool call
[LangGraph] `has_tool_calls` = True
[LangGraph] Enter `execute` node
[LangChain] Run `pet_diagnosis` tool
[LangGraph] Loop back to `process` node
[LangChain] `ChatOpenAI` synthesizes final diagnosis
[LangGraph] `has_tool_calls` = False -> Exit

User Prompt + Patient CSV

↓

LangGraph Cyclic Workflow

Node: process

LangChain ChatOpenAI analyzes state & generates response or tool calls

↓

Edge: has_tool_calls?

Yes

Node: execute

Run Tabular SOTAs (XGBoost, TabNet, NODE)

↺ Return results to process

No

Exit Graph

Return final synthesized diagnosis

↓

Final Output: "MCI"

Under the Hood

Prompts & Tool Execution Workflow

The LLM is guided by a strict system prompt that defines its persona, rules of engagement, and the expected output format.

System Prompt:
"You are an expert medical AI assistant for diagnosing Alzheimer's Disease.
Make multiple tool calls in parallel or sequence as needed for comprehensive answers. Critically think about and criticize the tool outputs.
AD processes can be divided into three stages: CN (Cognitively Normal), MCI (Mild Cognitive Impairment), AD (Alzheimer's Disease).
Your final decision must select one category from the above three stages."
Human Input:
"pet_path: temp/pet_002_S_0295.csv
Please diagnose this patient based on the provided PET CSV data. Conclude with one of [CN,MCI,AD]."

How the LLM interprets the tabular tool outputs to form a final diagnosis:

1. Tool Output (pet_diagnosis)

{
  'XGBoost': [0.29, 0.47, 0.24], // [CN, MCI, AD]
  'TabNet':  [0.37, 0.28, 0.35],
  'NODE':    [0.48, 0.40, 0.12]
}

↓

2. LLM Synthesis & Reasoning

"The PET data diagnosis results are from three models:

XGBoost model probabilities: CN 29.17%, MCI 47.27%, AD 23.56%
TabNet model probabilities: CN 36.77%, MCI 27.92%, AD 35.31%
NODE model probabilities: CN 48.08%, MCI 40.38%, AD 11.54%

The XGBoost model suggests MCI is most likely, TabNet suggests AD slightly more than CN or MCI, and NODE suggests CN as most likely.

Considering the mixed results but with a tendency towards MCI and CN stages, the overall conclusion leans towards MCI as the most probable stage.

Final diagnosis: MCI (Mild Cognitive Impairment)."

Validation & Test

Tabular Baselines — ACC & F1 (All modalities - PET, MRI, Demographics)

Results are aggregated over 5 random seeds (mean ± std reported). All models were trained for up to 100 epochs with early stopping (patience = 10) to avoid under- or over-fitting.

Validation Results
Method	ACC	F1
XGBoost	0.6314±0.0574	0.6301±0.0587
FT-Transformer	0.6461±0.0403	0.6491±0.0420
TabNet	0.6497±0.0311	0.6382±0.0359
TabPFN	0.6461±0.0427	0.6486±0.0427
NODE	0.6392±0.0492	0.6415±0.0588
TabM	0.6379±0.0454	0.6258±0.0573

Test Results
Method	ACC	F1
XGBoost	0.6261±0.0134	0.6242±0.0096
FT-Transformer	0.6313±0.0178	0.6271±0.0144
TabNet	0.5991±0.0367	0.5872±0.0404
TabPFN	0.6565±0.0266	0.6564±0.0231
NODE	0.6461±0.0350	0.6467±0.0318
TabM	0.6078±0.0239	0.6083±0.0107
Mean (5 tabular SOTAs)	0.6282±0.0245	0.6251±0.0281
ADAgent (gpt-4-1-mini)	0.6435±0.0215	0.6404±0.0199

Mean (5 tabular SOTAs): average of test ACC/F1 means for FT-Transformer, TabNet, TabPFN, NODE, and TabM (XGBoost excluded as baseline); ± is the standard deviation across those five method-level means (spread between models, not seed uncertainty).

Validation & Test

Tabular Baselines — ACC & F1 (MRI only)

Validation Results (MRI only)
Method	ACC	F1
XGBoost	0.6052±0.0543	0.5960±0.0611
FT-Transformer	0.5870±0.0437	0.5711±0.0574
TabNet	0.5922±0.0543	0.5541±0.0771
TabPFN	0.5922±0.0474	0.5881±0.0501
NODE	0.6000±0.0420	0.5860±0.0520
TabM	0.5895±0.0563	0.5871±0.0605

Test Results (MRI only)
Method	ACC	F1
XGBoost	0.5817±0.0320	0.5773±0.0333
FT-Transformer	0.6174±0.0230	0.6080±0.0204
TabNet	0.5243±0.0302	0.5044±0.0280
TabPFN	0.6174±0.0177	0.6085±0.0183
NODE	0.5991±0.0263	0.5913±0.0280
TabM	0.5974±0.0338	0.5985±0.0296
Mean (5 tabular SOTAs)	0.5911±0.0386	0.5821±0.0440
ADAgent (gpt-4-1-mini)	0.6061±0.0156	0.6070±0.0157

Mean (5 tabular SOTAs): average of test ACC/F1 means for FT-Transformer, TabNet, TabPFN, NODE, and TabM (XGBoost excluded as baseline); ± is the standard deviation across those five method-level means (spread between models, not seed uncertainty).

Validation & Test

Tabular Baselines — ACC & F1 (PET only)

Validation Results (PET only)
Method	ACC	F1
XGBoost	0.6065±0.0543	0.5940±0.0611
FT-Transformer	0.6043±0.0235	0.5918±0.0227
TabNet	0.6039±0.0391	0.5802±0.0565
TabPFN	0.5983±0.0270	0.5857±0.0246
NODE	0.6013±0.0381	0.5878±0.0424
TabM	0.5882±0.0307	0.5727±0.0391

Test Results (PET only)
Method	ACC	F1
XGBoost	0.5983±0.0332	0.5955±0.0347
FT-Transformer	0.6243±0.0176	0.6140±0.0158
TabNet	0.5417±0.0146	0.5197±0.0202
TabPFN	0.6304±0.0277	0.6219±0.0258
NODE	0.5948±0.0332	0.5889±0.0315
TabM	0.6183±0.0229	0.6236±0.0175
Mean (5 tabular SOTAs)	0.6019±0.0363	0.5936±0.0436
ADAgent (gpt-4-1-mini)	0.6226±0.0313	0.6178±0.0369

Mean (5 tabular SOTAs): average of test ACC/F1 means for FT-Transformer, TabNet, TabPFN, NODE, and TabM (XGBoost excluded as baseline); ± is the standard deviation across those five method-level means (spread between models, not seed uncertainty).

Error Analysis

Why did GPT-4-1-mini underperform?

Core Issue: LLMs (especially smaller models like gpt-4-1-mini) struggle with raw numerical reasoning. They frequently misinterpret arrays of probabilities, leading to flawed aggregation.

Tool Output (Probabilities)

Patient: pet_002_S_0295.csv | Classes: [CN, MCI, AD]

XGBoost: [0.29, 0.47, 0.23] → Predicts MCI
TabNet: [0.36, 0.27, 0.35] → Predicts CN (barely)
NODE: [0.48, 0.40, 0.11] → Predicts CN

True Majority: 2 votes for CN, 1 vote for MCI

LLM's Flawed Reasoning

The model fails to correctly tally the votes and hallucinates a majority.

"The XGBoost model indicates the highest probability for MCI stage. The TabNet model is somewhat ambiguous but leans slightly towards AD. The NODE model indicates the highest probability for CN stage.

Considering these results, the overall highest probability category from the majority of models is MCI.

Final diagnosis: MCI"

Result: Incorrect Final Diagnosis (MCI instead of CN)

Other Baselines

Run other baselines

Run other baselines:

Autogen