One LLM call with no tools or agent graph: instructions plus all tabular PET SUVRs, MRI regional volumes, and demographics are concatenated into one prompt string (implementation may still map slices to API roles internally). Model: GPT-4o mini. Test accuracy 0.3007, F1 0.2549.
You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. You integrate amyloid PET (SUVR), structural MRI volumetrics, and demographic context to reason about cognitive stage. Cognitive stages are: CN (cognitively normal), MCI (mild cognitive impairment), and AD (Alzheimer's dementia). Task: Diagnose cognitive status for patient 2c2b87ca69d76b46 PET Amyloid (SUVR): BRAINSTEM SUVR: 1.271 … [full SUVR atlas follows] MRI Structural Volumes: BRAINSTEM VOLUME: 19617.0 … [full volume atlas follows] Demographics: Age at visit (years): 69.50 … [gender, education, ethnicity, race] Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD. Format: 'The diagnosis is: [CN/MCI/AD]'
To diagnose the cognitive status of the patient based on the provided data, we will analyze the amyloid PET SUVR values, structural MRI volumetrics, and demographic context. 1. **Amyloid PET SUVR Analysis**: - The patient's SUVR values are generally elevated, particularly in regions associated with Alzheimer's pathology (e.g., frontal lobe, cingulate, and temporal lobe). - Notably, the entorhinal cortex SUVR is 0.948, which is below the typical threshold for amyloid positivity, but many other regions show significant elevations (e.g., frontal lobe SUVRs around 1.3-1.5). - The overall pattern suggests a significant amyloid burden, which is often associated with Alzheimer's disease. 2. **MRI Structural Volumetrics**: - The volumes of critical regions such as the hippocampus (7255.0) and entorhinal cortex (5463.0) are important. While the hippocampal volume is somewhat preserved, the entorhinal cortex volume is on the lower side, which can indicate early neurodegeneration. - The overall brain structure appears to show some atrophy, particularly in areas typically affected by Alzheimer's disease. 3. **Demographics**: - The patient is a 69.5-year-old female with 20 years of education. Higher education levels can sometimes correlate with better cognitive reserve, potentially masking early cognitive decline. 4. **Cognitive Stage**: - Given the elevated amyloid burden and the structural changes observed, the patient likely falls into the category of Mild Cognitive Impairment (MCI) rather than being cognitively normal (CN) or having full-blown Alzheimer's dementia (AD). The presence of amyloid pathology combined with some structural changes suggests that the patient is experiencing cognitive decline that is not yet severe enough to meet the criteria for Alzheimer's dementia. Based on this analysis, the final diagnosis is: The diagnosis is: MCI
Same tabular inputs as zero-shot, plus explicit section headings for reasoning (PET → MRI → demographics → integration) and a single parseable final line. One prompt string to the model. single_LLM/cot.py · GPT-4o mini. Test accuracy 0.2941, F1 0.2599.
You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. The user message requires structured chain-of-thought under fixed section headings, then exactly one final line with the diagnosis. Follow their headings and the final-line format exactly. Stages: CN (cognitively normal), MCI (mild cognitive impairment), AD (Alzheimer's dementia). Task: Diagnose cognitive status for patient 2c2b87ca69d76b46 Expert opinions: PET Amyloid (SUVR): BRAINSTEM SUVR: 1.271 … [full regional SUVR tables] MRI Structural Volumes: BRAINSTEM VOLUME: 19617.0 … [full regional volume tables] Demographics: Age at visit (years): 69.50 Gender: Female … Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD. Format: 'The diagnosis is: [CN/MCI/AD]' --- How to answer (required structure): 1) Write chain-of-thought under these Markdown headings: ### 1. PET amyloid (SUVR) Interpret regional SUVR patterns and what they suggest about amyloid-related burden relative to typical CN vs neurodegenerative patterns. ### 2. MRI structural volumes Interpret atrophy or preservation (e.g., medial temporal, ventricular, cortical) and how that aligns with CN vs MCI vs AD. ### 3. Demographics Briefly contextualize age, education, sex, and other given fields (do not invent facts not in the data). ### 4. Integration Weigh PET, MRI, and demographics together and argue which single stage (CN, MCI, or AD) is best supported. 2) After those sections, output exactly one final line with no other text on that line: The diagnosis is: [CN/MCI/AD]
### 1. PET amyloid (SUVR) The PET amyloid results show elevated SUVR values in several regions, particularly in the brainstem (1.397) and various areas of the cerebral cortex, including the cingulate and frontal lobe. These values suggest a significant amyloid burden, which is typically associated with neurodegenerative processes such as Alzheimer's disease (AD). In contrast, cognitively normal (CN) individuals usually exhibit lower SUVR values across these regions. ### 2. MRI structural volumes The MRI structural volumes indicate some degree of atrophy, particularly in the frontal lobe and temporal lobe regions, which are critical for memory and cognitive function. The ventricular volumes are also notably enlarged, which is often indicative of neurodegeneration. In CN individuals, we would expect to see preserved volumes in these areas, while MCI and AD patients typically show some level of atrophy. ### 3. Demographics The patient is a 78-year-old female with 18 years of education. This demographic information is relevant as older age is a significant risk factor for cognitive decline, and higher education levels can sometimes provide a protective effect against dementia. However, the combination of age and the observed neuroimaging findings suggests a higher likelihood of cognitive impairment. ### 4. Integration Considering the elevated amyloid burden indicated by the PET scan, the structural atrophy observed in the MRI, and the patient's age and educational background, the evidence strongly supports a diagnosis of Alzheimer's disease (AD). The combination of these factors points towards a neurodegenerative process rather than a state of mild cognitive impairment (MCI) or cognitive normality (CN). The diagnosis is: AD
Labeled demos share the same tabular layout as the query patient; each ends with The diagnosis is: …, then one new patient without a label. We report 3-shot (1 per class), 9-shot (3 per class), and 15-shot (5 per class). GPT-4o mini
You are an expert in multi-modal assessment for Alzheimer's disease and related cognitive decline. The user message first shows one or more example patients with the same data layout and their gold-standard diagnosis (CN, MCI, or AD), then a new patient without a label. Diagnose only the new patient using the same reasoning style as for the examples. Output format for your answer: a single line exactly: The diagnosis is: CN or MCI or AD. The following are example patients with multimodal data and their correct cognitive stage (CN, MCI, or AD). Use them as reference for how data patterns relate to the label. — Example 1 (CN) — PETAmyloid (SUVR): Brainstem: BRAINSTEM SUVR: 1.271 Cerebellum: CEREBELLUM CORTEX SUVR: 0.901 CEREBELLUM WHITE MATTER SUVR: 1.346 Cerebral Cortex: CEREBRAL WHITE MATTER SUVR: 1.687 Cingulate: CTX CAUDALANTERIORCINGULATE SUVR: 1.458 CTX ISTHMUSCINGULATE SUVR: 1.389 MRI … Demographics … The diagnosis is: CN — Examples 2–8 — … The diagnosis is: MCI (or AD, etc.) Based on the patient data above, provide your final diagnosis as one of: CN, MCI, or AD. Format: 'The diagnosis is: [CN/MCI/AD]'
| Setting | Accuracy | F1 |
|---|---|---|
| 3-shot (1/class) | 0.3824 | 0.3231 |
| 9-shot (3/class) | 0.3399 | 0.2859 |
| 15-shot (5/class) | 0.3399 | 0.2863 |
ADAgent composes multiple imaging tools (e.g., ResNet, MedicalNet) with LangChain and LangGraph frameworks on GPT-4o that aggregates evidence and issues a human-readable reasoning trace and final decision.
Reported results from the paper (these are literature numbers, not our runs). Tasks shown:
Diagnosis: three-way classification predicting Cognitively Normal (CN), Mild Cognitive Impairment (MCI), or Alzheimer's Disease (AD).
Prognosis: binary prediction whether an MCI patient will convert to AD within 36 months (converter vs stable).
| Multi-modal diagnosis task | ||||
|---|---|---|---|---|
| Method | ACC | SPE | SEN | F1 |
| MedicalNet | 0.571±0.009 | 0.768±0.009 | 0.591±0.030 | 0.586±0.015 |
| nnMamba | 0.527±0.016 | 0.753±0.008 | 0.572±0.021 | 0.541±0.028 |
| ResNet50 | 0.567±0.039 | 0.771±0.018 | 0.599±0.036 | 0.581±0.043 |
| MCAD | 0.548±0.029 | 0.766±0.014 | 0.591±0.027 | 0.556±0.026 |
| CMViM | 0.617±0.014 | 0.659±0.003 | 0.630±0.013 | 0.633±0.010 |
| ADAgent | 0.644±0.014 | 0.794±0.010 | 0.644±0.021 | 0.647±0.018 |
| Multi-modal prognosis task | ||||
|---|---|---|---|---|
| Method | ACC | SPE | SEN | F1 |
| MedicalNet | 0.763±0.026 | 0.902±0.017 | 0.333±0.052 | 0.407±0.064 |
| nnMamba | 0.733±0.038 | 0.843±0.068 | 0.394±0.189 | 0.405±0.147 |
| ResNet50 | 0.726±0.034 | 0.853±0.029 | 0.333±0.052 | 0.374±0.065 |
| MCAD | 0.719±0.026 | 0.892±0.061 | 0.182±0.091 | 0.231±0.080 |
| CMViM | 0.815±0.046 | 0.912±0.029 | 0.515±0.105 | 0.575±0.109 |
| ADAgent | 0.822±0.010 | 0.941±0.019 | 0.545±0.043 | 0.561±0.030 |
Key modifications to make ADAgent comparable to tabular SOTA methods.
How the LLM Coordinator orchestrates the tabular SOTAs based on the execution logs:
Manages the cyclic control flow. It defines nodes (process, execute) and conditional edges (has_tool_calls) to loop between reasoning and tool execution until a final answer is reached.
Provides the core components: the ChatOpenAI wrapper (GPT-4.1-mini), prompt templates, message history, and the standard interface for binding our tabular tools (e.g., pet_diagnosis).
[LangGraph] Enter `process` node [LangChain] `ChatOpenAI` generates tool call [LangGraph] `has_tool_calls` = True [LangGraph] Enter `execute` node [LangChain] Run `pet_diagnosis` tool [LangGraph] Loop back to `process` node [LangChain] `ChatOpenAI` synthesizes final diagnosis [LangGraph] `has_tool_calls` = False -> Exit
processChatOpenAI analyzes state & generates response or tool callshas_tool_calls?executeprocessThe LLM is guided by a strict system prompt that defines its persona, rules of engagement, and the expected output format.
How the LLM interprets the tabular tool outputs to form a final diagnosis:
{
'XGBoost': [0.29, 0.47, 0.24], // [CN, MCI, AD]
'TabNet': [0.37, 0.28, 0.35],
'NODE': [0.48, 0.40, 0.12]
}
| Validation Results | ||
|---|---|---|
| Method | ACC | F1 |
| XGBoost | 0.6314±0.0574 | 0.6301±0.0587 |
| FT-Transformer | 0.6461±0.0403 | 0.6491±0.0420 |
| TabNet | 0.6497±0.0311 | 0.6382±0.0359 |
| TabPFN | 0.6461±0.0427 | 0.6486±0.0427 |
| NODE | 0.6392±0.0492 | 0.6415±0.0588 |
| TabM | 0.6379±0.0454 | 0.6258±0.0573 |
| Test Results | ||
|---|---|---|
| Method | ACC | F1 |
| XGBoost | 0.6261±0.0134 | 0.6242±0.0096 |
| FT-Transformer | 0.6313±0.0178 | 0.6271±0.0144 |
| TabNet | 0.5991±0.0367 | 0.5872±0.0404 |
| TabPFN | 0.6565±0.0266 | 0.6564±0.0231 |
| NODE | 0.6461±0.0350 | 0.6467±0.0318 |
| TabM | 0.6078±0.0239 | 0.6083±0.0107 |
| Mean (5 tabular SOTAs) | 0.6282±0.0245 | 0.6251±0.0281 |
| ADAgent (gpt-4-1-mini) | 0.6435±0.0215 | 0.6404±0.0199 |
| Validation Results (MRI only) | ||
|---|---|---|
| Method | ACC | F1 |
| XGBoost | 0.6052±0.0543 | 0.5960±0.0611 |
| FT-Transformer | 0.5870±0.0437 | 0.5711±0.0574 |
| TabNet | 0.5922±0.0543 | 0.5541±0.0771 |
| TabPFN | 0.5922±0.0474 | 0.5881±0.0501 |
| NODE | 0.6000±0.0420 | 0.5860±0.0520 |
| TabM | 0.5895±0.0563 | 0.5871±0.0605 |
| Test Results (MRI only) | ||
|---|---|---|
| Method | ACC | F1 |
| XGBoost | 0.5817±0.0320 | 0.5773±0.0333 |
| FT-Transformer | 0.6174±0.0230 | 0.6080±0.0204 |
| TabNet | 0.5243±0.0302 | 0.5044±0.0280 |
| TabPFN | 0.6174±0.0177 | 0.6085±0.0183 |
| NODE | 0.5991±0.0263 | 0.5913±0.0280 |
| TabM | 0.5974±0.0338 | 0.5985±0.0296 |
| Mean (5 tabular SOTAs) | 0.5911±0.0386 | 0.5821±0.0440 |
| ADAgent (gpt-4-1-mini) | 0.6061±0.0156 | 0.6070±0.0157 |
| Validation Results (PET only) | ||
|---|---|---|
| Method | ACC | F1 |
| XGBoost | 0.6065±0.0543 | 0.5940±0.0611 |
| FT-Transformer | 0.6043±0.0235 | 0.5918±0.0227 |
| TabNet | 0.6039±0.0391 | 0.5802±0.0565 |
| TabPFN | 0.5983±0.0270 | 0.5857±0.0246 |
| NODE | 0.6013±0.0381 | 0.5878±0.0424 |
| TabM | 0.5882±0.0307 | 0.5727±0.0391 |
| Test Results (PET only) | ||
|---|---|---|
| Method | ACC | F1 |
| XGBoost | 0.5983±0.0332 | 0.5955±0.0347 |
| FT-Transformer | 0.6243±0.0176 | 0.6140±0.0158 |
| TabNet | 0.5417±0.0146 | 0.5197±0.0202 |
| TabPFN | 0.6304±0.0277 | 0.6219±0.0258 |
| NODE | 0.5948±0.0332 | 0.5889±0.0315 |
| TabM | 0.6183±0.0229 | 0.6236±0.0175 |
| Mean (5 tabular SOTAs) | 0.6019±0.0363 | 0.5936±0.0436 |
| ADAgent (gpt-4-1-mini) | 0.6226±0.0313 | 0.6178±0.0369 |
Core Issue: LLMs (especially smaller models like gpt-4-1-mini) struggle with raw numerical reasoning. They frequently misinterpret arrays of probabilities, leading to flawed aggregation.
Patient: pet_002_S_0295.csv | Classes: [CN, MCI, AD]
The model fails to correctly tally the votes and hallucinates a majority.
Run other baselines: