﻿<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://medrxiv.org">
<admin:errorReportsTo rdf:resource="mailto:medrxiv@cshlpress.edu"/>
<title>medrxiv Subject Collection: Health Informatics</title>
<link>http://medrxiv.org</link>
<description>
This feed contains articles for medRxiv Subject Collection "Health Informatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.17.26355900v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.18.26355962v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.17.26355907v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.11.26354463v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.17.26355819v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.10.26355351v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.09.26353388v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.16.26355804v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.16.26355782v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.16.26355792v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.15.26355735v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.15.26355670v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.16.26355686v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.10.26355413v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.13.26355565v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.08.26355187v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.08.26355139v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.12.26355166v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.11.26355453v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.13.26355589v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.11.26355471v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.06.26354746v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.11.26355494v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.10.26355390v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.10.26355372v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.10.26355348v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.09.26355176v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.05.26354271v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.09.26355266v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.06.08.26355217v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>medrxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>medrxiv</title>
<url>https://www.medrxiv.org/sites/default/files/medrxiv_internal_logo.png</url>
<link>http://medrxiv.org</link>
</image>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.17.26355900v1?rss=1">
<title>
<![CDATA[
Demographic Calibration Gaps in Breast Cancer Risk Prediction: Introducing the Demographic Calibration Gap Score 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.17.26355900v1?rss=1
</link>
<description><![CDATA[
Background: Most breast cancer prediction studies skip calibration reporting entirely. Fewer still examine calibration by demographic subgroup. Predicted probabilities that are systematically off for specific racial or gender groups produce biased clinical decisions, and aggregate statistics will not catch that.

Objective: To introduce the Demographic Calibration Gap Score (DCGS), a metric that measures how much calibration error varies across demographic subgroups, and to show how it performs across five classifiers, four calibration conditions, and two datasets.

Methods: Five classifiers were trained on the Wisconsin Diagnostic Breast Cancer dataset (n=569) and evaluated on a breast cancer cohort from MIMIC-IV (n=1,316). Three global calibration conditions were applied: no calibration, Platt scaling, and isotonic regression. A fourth condition, subgroup-targeted Platt scaling, was applied to the MIMIC cohort. DCGS was computed as DCGS_range = max_k(ECE_k) - min_k(ECE_k) across racial and gender subgroups, with 95% bootstrap confidence intervals. Conformal prediction coverage and Demographic Coverage Gap (DCG) were reported.

Results: On Wisconsin, all five models achieved AUROC above 0.98 and ECE below 0.12. Performance fell sharply on the MIMIC external cohort: AUROC dropped to 0.45-0.57 for base and globally calibrated variants, confirming distributional shift. DCGS exceeded the 0.05 clinical significance threshold in 28 of 40 model-calibration combinations on the race axis. Neither global Platt nor isotonic calibration reliably reduced DCGS below that threshold. Conformal coverage collapsed to roughly 25% on MIMIC, and racial DCG exceeded 0.15 for all 20 model-variant combinations.

Conclusions: Reducing population-level ECE through global recalibration does not reliably close demographic calibration gaps. DCGS gives researchers a direct, standardized way to detect and report those disparities. Code and the DCGS computation library are released as open-source Python under the MIT License.
]]></description>
<dc:creator><![CDATA[ Eniolade, M. ]]></dc:creator>
<dc:date>2026-06-22</dc:date>
<dc:identifier>doi:10.64898/2026.06.17.26355900</dc:identifier>
<dc:title><![CDATA[Demographic Calibration Gaps in Breast Cancer Risk Prediction: Introducing the Demographic Calibration Gap Score]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.18.26355962v1?rss=1">
<title>
<![CDATA[
Generative Artificial Intelligence in Psychotherapy Practice: A Global Online Survey of Mental Health Professionals' Adoption 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.18.26355962v1?rss=1
</link>
<description><![CDATA[
Background: Generative artificial intelligence (GenAI) tools, including large language model (LLM)-based platforms such as ChatGPT, Google Gemini, and Microsoft Copilot, are being adopted across healthcare settings with increasing speed. Despite the increasing popularity of GenAI, empirical data on the extent and nature of adoption by mental health clinicians in routine psychotherapy practice globally remain scarce. Objective: This study aimed to characterize current use patterns of GenAI tools among a global sample of practicing mental health professionals, including prevalence of use, specific tools employed, clinical and administrative purposes served, perceived effect on workload, and the institutional context shaping adoption (e.g., encouragement, prohibition, and training). Methods: We administered a cross-sectional online survey to a global convenience sample of licensed mental health professionals who provide psychotherapy as part of the scope of their practice (i.e., psychotherapists, psychologists, counsellors, nurses, and psychiatrists). Participants were recruited via professional networks, purposely avoiding the use of social media platforms. Within the survey, we captured GenAI use behaviors in psychotherapy contexts, and demographic and professional background data. Descriptive statistics were analyzed for all variables. Multivariate logistic regression was used to examine demographic and professional predictors of GenAI use. Results: A total of 766 mental health professionals who provide psychotherapy from 30 countries completed the survey. Of these, 54.6% (n=418) reported having purposely used at least one GenAI tool in psychotherapy clinical practice. ChatGPT was the most frequently used tool (354/418, 84.7%). The most commonly reported clinical purpose was assisting with treatment planning (175/418, 41.9%), followed by managing administrative tasks (173/418, 41.4%) and generating psychoeducational materials for clients (166/418, 39.7%). 
82.8% of AI users reported that these tools reduced their overall work burden. Only 18.1% (139/766) of respondents reported institutional encouragement to use AI tools, while 81.1% (621/766) reported not having received any professional training on AI use. Predictors of AI adoption included younger age and rural practice setting. Conclusions: In this global convenience sample survey, GenAI use among mental health professionals in psychotherapy settings is widespread and spans a wide variety of clinical and administrative tasks. Formal training and institutional guidance substantially lag behind current adoption patterns. These findings highlight an urgent need for evidence-based competency frameworks, regulatory clarity, and professional education to support safe and ethically informed integration of AI into clinical mental health practice.
]]></description>
<dc:creator><![CDATA[ Blease, C., Hagström, J., Gaab, J., Carey, A., Cipriani, F., Gorman, C., Nascimento, A. F., Fitzgerald, A., Holtz, L., Schwarz, J., Strudwick, G., Tibbs, M., Torous, J., Cornelius-White, J. H. D. ]]></dc:creator>
<dc:date>2026-06-22</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.26355962</dc:identifier>
<dc:title><![CDATA[Generative Artificial Intelligence in Psychotherapy Practice: A Global Online Survey of Mental Health Professionals' Adoption]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.17.26355907v1?rss=1">
<title>
<![CDATA[
AI-driven Multimodal Representation Learning for Latent Mediation Structure Discovery of Socioeconomic Disadvantage, Psychosocial Factors, and Cardiometabolic Multimorbidity 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.17.26355907v1?rss=1
</link>
<description><![CDATA[
Social disadvantage is associated with multimorbidity, but the pathways linking social conditions to disease burden remain poorly understood. We developed an AI-driven multimodal mediation framework that integrates socioeconomic, psychosocial, clinical, laboratory, behavioral, and genomic data from the All of Us Research Program. Modality-specific variational autoencoders were used to derive latent representations of each data domain, and mediation analyses were subsequently performed in latent space to evaluate indirect associations between socioeconomic disadvantage, psychosocial factors, and multimorbidity. The final analytic cohort included 20,804 participants with complete multimodal data. Across 800 exposure-mediator-outcome combinations, mediation signals were concentrated within a small number of latent dimensions. The strongest indirect association linked a socioeconomic disadvantage dimension, a psychosocial vulnerability dimension, and a cardiometabolic multimorbidity dimension (NIE = 0.002517). The psychosocial dimension was characterized by poorer mental health, greater loneliness, lower social well-being, and lower health literacy, whereas the outcome dimension was associated with hypertension, diabetes, hyperlipidemia, obesity, chronic kidney disease, and heart disease. Bootstrap analyses supported the stability of the leading pathway. These findings suggest that psychosocial vulnerability may contribute to the association between socioeconomic disadvantage and cardiometabolic multimorbidity. More broadly, the proposed framework illustrates how AI-based representation learning can be used to investigate complex relationships across high-dimensional multimodal health data.
]]></description>
<dc:creator><![CDATA[ Ma, S., Cao, C. ]]></dc:creator>
<dc:date>2026-06-22</dc:date>
<dc:identifier>doi:10.64898/2026.06.17.26355907</dc:identifier>
<dc:title><![CDATA[AI-driven Multimodal Representation Learning for Latent Mediation Structure Discovery of Socioeconomic Disadvantage, Psychosocial Factors, and Cardiometabolic Multimorbidity]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.11.26354463v1?rss=1">
<title>
<![CDATA[
A Drug-Specific, Half-Life-Adjusted Framework for Classifying CNS-Active Systemic Therapy Exposure During and After Radiotherapy 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.11.26354463v1?rss=1
</link>
<description><![CDATA[
Clinical oncology datasets often store systemic therapy as a regimen label with a start date and an end date. Those records are clinically recognizable but can be analytically incomplete when the research question concerns whether a patient was exposed to a concurrent CNS-active drug (cCNS-aD) or an adjuvant CNS-active drug (aCNS-aD) around radiotherapy. Contemporary CNS-oncology studies usually define CNS activity by empiric drug lists and define concurrency by fixed calendar windows, although the literature shows substantial heterogeneity across both concepts. This paper proposes a generalizable framework for converting raw systemic therapy records into reproducible cCNS-aD and aCNS-aD variables, useful in subgrouping for clinical studies. The framework uses a transparent CNS scoring model based on three clinical evidence components: intracranial objective response rate, consensus CNS endorsement, and intrathecal route of administration. It then defines a pharmacokinetic exposure proxy as the recorded end date plus five half-lives. Concurrent exposure is classified by overlap with the radiotherapy interval, while post-radiotherapy exposure is classified by overlap with a prespecified post-RT attribution window. The framework separately identifies post-RT pharmacokinetic persistence and post-RT treatment initiation, allowing investigators to distinguish continued exposure from true adjuvant initiation. This is a methodological framework and reference implementation. Implementation audits and endpoint-specific sensitivity analyses remain necessary before use as a definitive exposure classifier.
]]></description>
<dc:creator><![CDATA[ Pari Mitre, L., Drapkin, B., Dohopolski, M. ]]></dc:creator>
<dc:date>2026-06-22</dc:date>
<dc:identifier>doi:10.64898/2026.06.11.26354463</dc:identifier>
<dc:title><![CDATA[A Drug-Specific, Half-Life-Adjusted Framework for Classifying CNS-Active Systemic Therapy Exposure During and After Radiotherapy]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.17.26355819v1?rss=1">
<title>
<![CDATA[
Multisite Real-World Validation of an Electronic Health Record-Integrated Generative Artificial Intelligence Tool for Venous Thromboembolism Risk Stratification 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.17.26355819v1?rss=1
</link>
<description><![CDATA[
Background: Guiding risk-appropriate inpatient thromboprophylaxis requires venous thromboembolism (VTE) risk stratification; however, reliable risk determination remains inconsistent in routine care. Health systems increasingly pilot artificial intelligence (AI) tools, yet few studies demonstrate rigorous evaluation in the context of a learning health system (LHS). We evaluated the performance of a pilot electronic health record (EHR)-integrated generative AI (GenAI) system, inHealth General Reasoner (iHGR), for VTE risk stratification versus clinician order set classifications and physician-adjudicated chart review. Methods: This multisite retrospective validation study included adult inpatient admissions at Johns Hopkins Medicine between June 21, 2025, and Dec 18, 2025 (checklist-based order set from June 21, 2025 - November 19, 2025, and clinician judgement-based order set from November 29 - December 18, 2025). From 758 eligible admissions, we randomly sampled 500 balanced by site and order set periods. iHGR and clinician-selected order set classifications were compared with the reference standard (RS). Primary outcomes were iHGR sensitivity and specificity. Secondary analyses compared the order sets with the same RS to evaluate workflow comparators and error patterns. Results: iHGR achieved 81.8% sensitivity (95% CI 77.3-85.6) and 70.9% specificity (63.6-77.3). The checklist-based order set had 61.3% sensitivity (53.7-68.5) and 86.2% specificity (77.4-91.9). The clinician judgement-based order set had 78.1% sensitivity (71.3-83.7) and 65.4% specificity (54.3-75.0). False-negative iHGR classifications were associated with missed narrative risk factors. Conclusion: iHGR showed higher sensitivity for VTE risk than checklist-based order sets and clinician judgement without introducing systematic bias. In silico evaluation of pilot AI systems within LHSs can identify clinically important performance trade-offs and implementation targets before operational scale-up. 
Narrative clinical data abstraction remained a key limitation, supporting the use of GenAI to assist rather than supplant clinician judgement.
]]></description>
<dc:creator><![CDATA[ Baughman, D. J., Liu, S., Jee, S., Young, C., Knight, A. M., Davis, A., Yegnasubramanian, S., Najjar, P., Whitbread, J. J., Ahumada, L., Chused, A., Haut, E. R., Lau, B. D., Sridharan, A., Streiff, M., Aziz, K. B. ]]></dc:creator>
<dc:date>2026-06-22</dc:date>
<dc:identifier>doi:10.64898/2026.06.17.26355819</dc:identifier>
<dc:title><![CDATA[Multisite Real-World Validation of an Electronic Health Record-Integrated Generative Artificial Intelligence Tool for Venous Thromboembolism Risk Stratification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.10.26355351v1?rss=1">
<title>
<![CDATA[
Can Vision-Language Models See the Vital Signs? Benchmarking and Fine-Tuning for Intraoperative Monitor Reading 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.10.26355351v1?rss=1
</link>
<description><![CDATA[
Background: Vital-sign deterioration is a leading contributor to preventable perioperative death, yet manual monitor reading is intermittent, error-prone, and subject to alarm fatigue. Automating this perceptual step could enable continuous surveillance, but existing solutions depend on device-specific hardware integration or cloud-hosted vision-language models (VLMs), which raise privacy, cost, and connectivity barriers in resource-limited healthcare facilities. Methods: We constructed a benchmark of 200 in-the-wild intraoperative monitor photographs (spanning multiple vendors, angles, and illumination conditions) annotated for eight vital-sign parameters: heart rate, SpO2, ETCO2, respiratory rate, systolic/diastolic/mean blood pressure, and temperature. We evaluated an optical character recognition (OCR)-based pipeline, nine instruction-tuned VLMs (four commercial, five open-weight ranging from ≤4B to 31B parameters) under two prompting regimes, and a compact open model (Qwen3.5-9B) adapted via low-rank fine-tuning (LoRA, 0.46% of parameters updated). Results: Under a domain-aware prompt, frontier VLMs reached 0.98-0.997 exact-match accuracy zero-shot, whereas the OCR pipeline and ≤4B model scored approximately 0.20 lower, defining a 9B-class usable floor. LoRA fine-tuning Qwen3.5-9B on 80-120 images raised accuracy from 0.953 to 0.994 (statistically indistinguishable from the best commercial model) and reduced the critical-error rate fivefold (0.0313 to 0.0063). Ablations showed that performance saturated at 80 training images and rank-8 adapters. Conclusion: Monitor reading is a solved perception problem for VLMs above the 9B scale. A lightweight fine-tuned open model achieves frontier accuracy while running entirely on local hardware, preserving data privacy, offline capability, and near-zero marginal cost. Residual errors stem from blood-pressure source ambiguity and are addressable with explicit disambiguation logic.
]]></description>
<dc:creator><![CDATA[ Gao, S., Wang, P., Zhao, X., Yang, B., Zhang, Z., Feng, X., He, Q. ]]></dc:creator>
<dc:date>2026-06-18</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.26355351</dc:identifier>
<dc:title><![CDATA[Can Vision-Language Models See the Vital Signs? Benchmarking and Fine-Tuning for Intraoperative Monitor Reading]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.09.26353388v1?rss=1">
<title>
<![CDATA[
Digital self-efficacy as a potential intermediary between vision impairment and daily internet use among older adults: A cross-sectional analysis of HINTS 2024 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.09.26353388v1?rss=1
</link>
<description><![CDATA[
Background: Older adults with vision impairment often experience barriers to using digital technology. The indirect associations between vision impairment and digital access and skills via digital self-efficacy and frustration among older adults remain largely unknown.

Objective: This study aimed to 1) explore factors associated with digital access, skills, self-efficacy, and frustration among older adults with vision impairment; 2) examine associations between vision impairment and digital access, skills, self-efficacy, and frustration among older adults; and 3) examine whether digital self-efficacy and frustration may help explain associations between vision impairment and digital access and skills among older adults.

Methods: This was a cross-sectional study using nationally representative data from the Health Information National Trends Survey (HINTS) 2024. Respondents aged 60 and older were included. Vision impairment was assessed using a self-reported item. Outcomes included self-reported digital access, skills, self-efficacy, and frustration. Survey-weighted multivariable logistic regression and generalized structural equation modeling were conducted, adjusting for age, sex, race/ethnicity, education, and the number of comorbidities.

Results: Among 3,149 older adults (mean [SD] age, 70.7 [10.0] years; 45.6% female), 7.1% (n=223) reported vision impairment. Among older adults with vision impairment, 65.6% (95% CI, 53.5% to 75.9%) used the internet daily, and 79.5% (95% CI, 66.8% to 88.2%) used a smartphone in the past 12 months. In multivariable logistic regression analyses among older adults with vision impairment, older age was associated with lower odds of daily internet use (OR, 0.84; 95% CI, 0.79 to 0.90), smartphone use (OR, 0.85; 95% CI, 0.75 to 0.97), wearable device use (OR, 0.88; 95% CI, 0.79 to 0.97), and using the internet to send a message to a healthcare provider (OR, 0.87; 95% CI, 0.80 to 0.93). Older adults who self-identified as racial and ethnic minority groups (e.g., Black/African American, Hispanic) had lower odds of daily internet use (OR, 0.15; 95% CI, 0.05 to 0.50) and using the internet to send a message to a healthcare provider (OR, 0.17; 95% CI, 0.04 to 0.73) compared with Non-Hispanic White older adults. Vision impairment was associated with lower odds of daily internet use (OR, 0.60; 95% CI, 0.37 to 0.99) and digital self-efficacy (OR, 0.53; 95% CI, 0.32 to 0.86). Digital self-efficacy was associated with higher odds of daily internet use (OR, 2.95; 95% CI, 2.04 to 4.26). Generalized structural equation modeling identified an indirect association between vision impairment and daily internet use via digital self-efficacy (coefficient, -0.68; 95% CI, -1.24 to -0.12).

Conclusions: Findings suggest that reduced digital self-efficacy may help explain the observed association between vision impairment and daily internet use among older adults. Interventions targeting digital self-efficacy, including accessible interface designs, personalized coaching, and peer support, may help bridge the digital divide among older adults with vision impairment.
]]></description>
<dc:creator><![CDATA[ Suzuki, H., Hoffmann, T., Leutwyler, H., Wallhagen, M. ]]></dc:creator>
<dc:date>2026-06-18</dc:date>
<dc:identifier>doi:10.64898/2026.06.09.26353388</dc:identifier>
<dc:title><![CDATA[Digital self-efficacy as a potential intermediary between vision impairment and daily internet use among older adults: A cross-sectional analysis of HINTS 2024]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.16.26355804v1?rss=1">
<title>
<![CDATA[
Comparative Evaluation of Pretrained Large Language Models for Suicide Risk Prediction from Clinical Notes in U.S. Veterans 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.16.26355804v1?rss=1
</link>
<description><![CDATA[
Background: Suicide remains a significant and potentially preventable cause of death among United States veterans. Predictive models based on structured electronic health record (EHR) data, including the U.S. Department of Veterans Affairs Recovery Engagement and Coordination for Health-Veterans Enhanced Treatment (REACH-VET) program, aim to identify individuals at elevated risk for enhanced monitoring and follow-up. Increasing evidence suggests that unstructured clinical narratives contain additional psychosocial information that may enhance risk prediction when analyzed using natural language processing (NLP). However, optimal approaches for representing clinical text remain uncertain. Recent advances in large language models (LLMs) enable contextual text representations that capture complex semantic relationships beyond traditional lexical methods.

Methods: We compared the predictive performance of pretrained LLMs with classical bag-of-words (BoW) representations for suicide risk prediction using clinical notes from 27,241 veterans receiving care in the Veterans Health Administration. Patients were stratified by REACH-VET risk tier (low, moderate, high), and models were evaluated across prediction windows defined by note look-back periods (<30, <90, and <270 days).

Results: LLM-based representations outperformed BoW approaches in seven of nine risk tier-time window combinations, achieving a maximum AUROC of 0.644 when solely considering text. Incorporating structured clinical variables further improved performance (AUROC=0.748). Model interpretation identified suicide-related language, especially in notes documented within 30 days of the outcome among patients classified as high risk.

Conclusions: Pretrained LLMs can extract clinically meaningful information from narrative documentation, providing a foundation for future work adapting to additional clinical contexts and nuanced temporal associations to improve suicide risk prediction.
]]></description>
<dc:creator><![CDATA[ Levy, J., Levis, M., Dimambro, M., Rozema, L., Ayandeh, S., Diallo, A., Zhou, Y., Li, S., Wu, W., Shiner, B., Gui, J. ]]></dc:creator>
<dc:date>2026-06-18</dc:date>
<dc:identifier>doi:10.64898/2026.06.16.26355804</dc:identifier>
<dc:title><![CDATA[Comparative Evaluation of Pretrained Large Language Models for Suicide Risk Prediction from Clinical Notes in U.S. Veterans]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.16.26355782v1?rss=1">
<title>
<![CDATA[
Hard to Halt: Automation Bias in Agent-Driven Sequencing Prior Authorization Workflows 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.16.26355782v1?rss=1
</link>
<description><![CDATA[
Purpose: Prior authorization (PA) for exome or genome sequencing is a time-consuming process that impedes timely rare disease diagnosis. Large language model-based browser agents offer potential for automating these workflows, but their clinical reliability remains uncharacterized.

Methods: We developed a sandbox comprising a simulated ES/GS PA submission payer portal and a synthetic EHR containing 836 patient records spanning compliant profiles and deficient profiles with different types of issues. Gemini 3 Pro, Gemini 3 Flash, and Claude Opus 4.5 were evaluated on task completion rate, form completion accuracy, and appropriate withholding for deficient profiles.

Results: Larger models achieved much higher task completion rates (Gemini 3 Pro 95.45%, Claude Opus 4.5 93.67%) compared to Gemini 3 Flash (56.05%), but nearly universally failed to withhold submission for deficient profiles, whereas Gemini 3 Flash ironically demonstrated superior withholding performance (17.33%). In a non-agentic setting, Gemini 3 Pro correctly identified 91% of the issues in deficient profiles, indicating that withholding failure is attributable to the browser interaction rather than the models' reasoning limitations.

Conclusion: Current LLM-based browser agents exhibit a systematic bias towards form submission that poses risks in PA workflows. A modular, multi-agent architecture with human supervision is necessary for safe clinical deployment.
]]></description>
<dc:creator><![CDATA[ Nie, M., Chung, W., Waxler, J., Lee, M., Weng, C., Lewis, R., Ahimaz, P., Wang, K., Liu, C. ]]></dc:creator>
<dc:date>2026-06-18</dc:date>
<dc:identifier>doi:10.64898/2026.06.16.26355782</dc:identifier>
<dc:title><![CDATA[Hard to Halt: Automation Bias in Agent-Driven Sequencing Prior Authorization Workflows]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.16.26355792v1?rss=1">
<title>
<![CDATA[
Looked but didn't see: inattentional blindness and yes-bias confabulation in vision-language models 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.16.26355792v1?rss=1
</link>
<description><![CDATA[
Previous work showed that many participants fail to notice a gorilla in a video of people playing basketball. Another study found that 83% of trained radiologists failed to report a gorilla figure inserted into a chest CT nodule-search task, even though eye-tracking revealed that most observers had foveated the figure. We ask whether a similar phenomenon exists in contemporary vision-language models (VLMs). We find that (i) VLMs are capable of spotting the gorilla in both still-frame images and videos of lung CT scans; (ii) models display inattentional blindness, which varies according to model generation and type of stimulus presented; (iii) Gemini-3.1-Pro outperforms most other flagship and open-weight VLMs at identifying the presence or absence of the gorilla. We additionally ran a segmentation experiment utilizing two different model classes: a generalist (SAM 3), which found the gorilla but produced little to no results for anatomy-based prompts; a medical specialist (BiomedParse), which produced more promising anatomy-based results but flagged "gorilla" on gorilla-free control videos on 82% of frames. The behavioral signature of inattentional blindness reproduces in VLMs, but a unique confabulation failure mode means that any "did the model see X" claim requires signal-detection analysis with a matched-control false-alarm baseline.
]]></description>
<dc:creator><![CDATA[ Raymond, J. D., Hu, P., Solomon, B. D., Duong, D. ]]></dc:creator>
<dc:date>2026-06-18</dc:date>
<dc:identifier>doi:10.64898/2026.06.16.26355792</dc:identifier>
<dc:title><![CDATA[Looked but didn't see: inattentional blindness and yes-bias confabulation in vision-language models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.15.26355735v1?rss=1">
<title>
<![CDATA[
MedAgent: A Retrieval-Augmented Clinical Decision Support Agent with Verifiable Evidence Grounding for Evidence-Based Medicine 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.15.26355735v1?rss=1
</link>
<description><![CDATA[
Evidence-based medicine demands clinical answers that are not only fluent and medically plausible, but also anchored in traceable evidence, tailored to patient-specific clinical questions, sensitive to the hierarchy of evidence, and respectful of clinical safety boundaries. While general-purpose large language models (LLMs) exhibit strong medical language generation ability, they tend to lean on parametric memory, underuse retrieved evidence, hallucinate citations, conflate evidence levels, and draw conclusions that are not fully supported by the underlying literature. Such limitations pose particular risks in clinical decision support, where answer reliability, evidence traceability, and reasoning consistency are paramount.

To address these issues, we present MedAgent, an evidence-based medical agent trained through an end-to-end pipeline that integrates supervised fine-tuning (SFT) cold start, reward modeling, and Group Relative Policy Optimization (GRPO). The agent is designed to execute a structured workflow encompassing clinical question understanding, PICO extraction, evidence retrieval, evidence stratification, citation-grounded answer generation, and quality evaluation. Specifically, a Qwen2.5-14B-Instruct backbone is first cold-started on 200 human-verified agent trajectories, equipping it with tool invocation, PICO parsing, structured response generation, and citation faithfulness. Next, a Qwen2.5-7B reward model is trained on 2,099 pairwise preference samples to provide semantic-level quality signals for evidence-based responses. Finally, GRPO reinforcement learning is conducted in a retrieval-augmented agent environment, where every rollout involves real evidence retrieval and is scored jointly by rule-based rewards and reward-model signals.

To avoid over-reliance on training rewards, we further construct an independent evidence-based medical evaluation benchmark, MedTrustBench, which contains 200 clinical questions spanning 10 specialties and four difficulty levels. Each question is annotated with standardized PICO elements and rubric-based scoring criteria. The benchmark includes 1,187 rubrics across seven dimensions: question relevance, evidence hierarchy, evidence quality and timeliness, evidence-answer consistency, completeness and depth, logical rigor, and medical terminology. Under an identical RAG pipeline, retrieval tool, retrieval configuration, and evaluation protocol, MedAgentv17 attains 78.6 points, outperforming GPT-4.1 (75.3) and approaching GPT-5.4 (80.3). These results show that a 14B domain-aligned model can surpass strong general-purpose baselines on specialized evidence-based medical reasoning, while delivering practical advantages in cost, privacy, controllability, and hospital-oriented private deployment. The model and associated datasets are publicly released at https://www.modelscope.cn/profile/InfoxmedModel.
]]></description>
<dc:creator><![CDATA[ Wang, F., Guo, Z., Ye, Z. ]]></dc:creator>
<dc:date>2026-06-17</dc:date>
<dc:identifier>doi:10.64898/2026.06.15.26355735</dc:identifier>
<dc:title><![CDATA[MedAgent: A Retrieval-Augmented Clinical Decision Support Agent with Verifiable Evidence Grounding for Evidence-Based Medicine]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.15.26355670v1?rss=1">
<title>
<![CDATA[
The Unreliable Judges: Assessing Reproducibility and Self-Preference Bias of LLMs as Free-Text Evaluators 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.15.26355670v1?rss=1
</link>
<description><![CDATA[
Large Language Models (LLMs) are transforming clinical practice and research, but their adoption requires rigorous evaluation. While human assessment is ideal, its cost has driven the widespread use of LLMs as evaluators. We introduce an open-source reciprocal framework comparing 71 human experts against six LLMs. AI evaluators show a strong self-preference bias, yet neither group reliably identified whether a response was human- or AI-generated. AI scores correlated with surface features such as length and lexical diversity, whereas human scores did not. By probing the evaluators' hidden states and applying targeted steering, we show that verbosity is a major causal driver of the bias. Moreover, shuffling question-response pairings shows that long responses keep high scores even when they no longer answer the question, whereas short ones do not, demonstrating that AI judges reward verbosity largely independently of content alignment. Finally, API-based and batch inference inflate stochasticity, underscoring the need for controlled deployment.
]]></description>
<dc:creator><![CDATA[ Alvarez-Arenas, J. I., Mananes, D., Jimenez-Carretero, D., Sanchez-Cabo, F. ]]></dc:creator>
<dc:date>2026-06-17</dc:date>
<dc:identifier>doi:10.64898/2026.06.15.26355670</dc:identifier>
<dc:title><![CDATA[The Unreliable Judges: Assessing Reproducibility and Self-Preference Bias of LLMs as Free-Text Evaluators]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.16.26355686v1?rss=1">
<title>
<![CDATA[
Silent Manipulation of Mental Health Treatment Recommendations from a Large Language Model 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.16.26355686v1?rss=1
</link>
<description><![CDATA[
Importance: Large language models (LLMs) increasingly inform mental health decisions by patients and clinicians. Inference-time activation steering can shift model behavior on a target dimension without altering weights or prompts and without disclosure to users, allowing treatment recommendations to be silently changed for commercial or ideological reasons.

Objective: To determine whether directional activation steering can shift an open-weights LLM's depression treatment recommendations.

Design, Setting, and Participants: This non-human subjects study applied directional activation steering to an open-weights LLM (DeepSeek V4 Flash) responding to 12 depression-advice scenarios (4 favoring medication, 4 favoring avoidance, 4 neutral), generated at 30 amplitudes from -1.5 to +1.5 in 0.1 increments plus an unsteered baseline.

Exposures: A single steering direction contrasting antidepressant medication with self-directed approaches (diet, exercise, meditation, dietary supplements), constructed from 16 paired training prompts and applied at the attention output of every transformer block; weights and system prompt were held constant.

Main Outcomes and Measures: The extent to which medication and four self-care categories were addressed, scored 0 to 3 by a human-validated LLM rater (Claude Opus 4.7), the medication-versus-self-care balance, and clinician referral, estimated per unit of amplitude using mixed-effects models with a scenario random intercept.

Results: Across 372 generations, steering produced a graded, dose-dependent shift in the medication-versus-self-care balance, which declined by 0.32 per unit of amplitude (β = -0.32; 95% CI, -0.39 to -0.25; P < .001); medication extent fell and self-care extent rose. The shift was largest for scenarios with no stated treatment preference (β = -0.44; 95% CI, -0.54 to -0.34; P < .001). A clinician referral appeared in 322 of 372 responses (87%) and did not vary with steering amplitude (P = .63).

Conclusions and Relevance: In this open-weights LLM providing depression treatment information, inference-time activation steering shifted treatment recommendations without altering weights, prompt structure, or safety outputs, with the largest effect among users expressing no treatment preference. These findings suggest a need for LLM disclosure standards and independent auditing as such models inform clinical decisions.
]]></description>
<dc:creator><![CDATA[ Perlis, R. H. ]]></dc:creator>
<dc:date>2026-06-17</dc:date>
<dc:identifier>doi:10.64898/2026.06.16.26355686</dc:identifier>
<dc:title><![CDATA[Silent Manipulation of Mental Health Treatment Recommendations from a Large Language Model]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.10.26355413v1?rss=1">
<title>
<![CDATA[
Low-Density Lipoprotein Cholesterol and Dementia Risk: Integrating Mendelian Randomization and Target Trial Emulation Within the Heart-Brain Axis 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.10.26355413v1?rss=1
</link>
<description><![CDATA[
Background: The heart-brain axis links cardiovascular and neurodegenerative disease through shared vascular and inflammatory mechanisms. Although low-density lipoprotein cholesterol (LDL-C) is an established causal factor in atherosclerotic cardiovascular disease (ASCVD), its relationship with dementia remains uncertain, with midlife elevations associated with increased risk but late-life associations often appearing null or inverse. To address this cholesterol paradox, we integrated Mendelian randomization (MR) with an active-comparator new-user target trial emulation.

Methods: We applied a triangulated causal inference framework integrating two-sample MR with observational target trial emulation. Genetic variants associated with LDL-C were used as instrumental variables to evaluate Alzheimer's disease (AD), dementia with Lewy bodies (DLB), frontotemporal dementia (FTD), and any dementia (AnyDem), with causal estimates derived using inverse-variance weighted models and sensitivity analyses for heterogeneity and pleiotropy. In parallel, an active-comparator new-user design compared statin versus ezetimibe initiation among adults aged ≥60 years using propensity score (PS) overlap weighting and Cox proportional hazards models to evaluate cardiovascular and dementia outcomes.

Results: Genetically predicted LDL-C was associated with increased risk of DLB (OR 1.65, 95% CI 1.30-2.10; p<0.001), but not AD or AnyDem; FTD estimates were inconsistent. Sensitivity analyses suggested heterogeneity and possible pleiotropy for DLB. In the observational analysis (n=6,977), statin initiation was associated with higher risks of ASCVD (HR 1.26, 95% CI 1.11-1.45) and AnyDem (HR 1.66, 95% CI 1.16-2.38), although estimates attenuated after lipid adjustment and lagged analyses, suggesting residual confounding, treatment selection, and reverse causation in late-life observational associations.

Conclusions: These findings suggest that LDL-C reflects accumulated vascular and metabolic risk rather than a direct causal driver of AD or overall dementia, although a subtype-specific association was observed for DLB. Late-life associations appeared influenced by timing, reverse causation, and treatment selection, warranting cautious interpretation.
]]></description>
<dc:creator><![CDATA[ Mukumbi, K., Liu, Y., Shi, Z., Liu, E., Toyli, A., Hung, G.-U., Chen, Q.-H., Sha, Q., Chiu, P.-Y., Zhou, W. ]]></dc:creator>
<dc:date>2026-06-17</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.26355413</dc:identifier>
<dc:title><![CDATA[Low-Density Lipoprotein Cholesterol and Dementia Risk: Integrating Mendelian Randomization and Target Trial Emulation Within the Heart-Brain Axis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.13.26355565v1?rss=1">
<title>
<![CDATA[
Ranking-optimized survival models can underperform fixed-horizon clinical prediction: a SUPPORT2 reanalysis of machine learning, attending-physician judgment, and the original SUPPORT model at 60- and 180-day mortality 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.13.26355565v1?rss=1
</link>
<description><![CDATA[
Machine-learning survival models are increasingly proposed for intensive-care mortality prediction and are usually judged by the concordance index, a ranking metric averaged over follow-up. Yet many bedside decisions require a probability at a specific time, such as 60- or 180-day mortality. We asked whether ranking-optimized models perform competitively at fixed clinical horizons when compared with attending-physician judgment and the original 1995 SUPPORT logistic model. Reanalyzing the SUPPORT2 cohort (9,105 critically ill adults; five United States centers; 1989-1994) with a stratified 70/15/15 split, we compared a gradient-boosted survival model, the physicians' recorded prognostic estimate, and the 1995 model at 60 and 180 days, and tested several alternative learners. The survival model achieved a competitive ranking concordance (0.705) but underperformed both comparators at fixed horizons: at 60 days its area under the ROC curve was 0.750, versus 0.808 for physicians (on the matched sample) and 0.827 for the 1995 model, a gap reproduced across eight independent splits and statistically reliable after multiplicity correction. Discrimination was equitable across sex, race, and age. Post-hoc recalibration did not change discrimination, so the deficit is not miscalibration. Replacing the ranking objective with timepoint-matched binary training recovered roughly half the gap; neural networks, a deep ranking model, and two timepoint-aware discrete-time models did not close it, indicating an objective-horizon mismatch rather than limited model capacity. Leave-one-disease-out validation revealed severe generalization failure in disease groups absent from training. The physician advantage was conditional on a physician electing to give an estimate; many gave uninformative or no estimate.
We recommend reporting timepoint-specific discrimination alongside the concordance index, timepoint-matched training when fixed-horizon predictions drive care, leave-one-subgroup validation, and distribution-free prediction intervals to support selective deployment.
]]></description>
<dc:creator><![CDATA[ Truong, Q. H., Hoang, D. C., Luu, D. T. ]]></dc:creator>
<dc:date>2026-06-16</dc:date>
<dc:identifier>doi:10.64898/2026.06.13.26355565</dc:identifier>
<dc:title><![CDATA[Ranking-optimized survival models can underperform fixed-horizon clinical prediction: a SUPPORT2 reanalysis of machine learning, attending-physician judgment, and the original SUPPORT model at 60- and 180-day mortality]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.08.26355187v1?rss=1">
<title>
<![CDATA[
Development of an automated, imaging-based preoperative screening model for early identification of malnutrition in an abdominal surgery cohort 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.08.26355187v1?rss=1
</link>
<description><![CDATA[
Background: Clinical malnutrition affects one in five abdominal surgery patients and increases postoperative complications and mortality. Current screening occurs after admission, closing the window for preoperative nutritional intervention. No objective, scalable preoperative screening tool exists.

Objective: To determine whether automated volumetric CT-based body composition analysis improves preoperative identification of surgical patients at risk for clinical malnutrition compared to clinical variables or single-slice imaging alone.

Methods: Retrospective cohort study of adults undergoing elective abdominal surgery at a quaternary academic medical center (2018-2021) with a preoperative CT scan within 90 days and complete nutrition assessment. Clinical malnutrition was diagnosed by a registered dietitian using ASPEN/AND criteria. Three sex-stratified Elastic Net models were compared: (1) base clinical variables; (2) base plus L3 single-slice skeletal muscle index and attenuation; and (3) base plus comprehensive 3D volumetric quantification of five muscle groups and two fat depots. Discrimination (AUROC), calibration (Brier score), and clinical utility (decision curve analysis) were assessed via 10-fold cross-validation.

Results: Among 1,143 patients (52.4% female; mean age 60.5 years), 231 (20.2%) were diagnosed with malnutrition. Malnourished patients had significantly higher complication rates (36.4% vs. 15.4%, p<0.001) and prolonged length of stay (45.9% vs. 16.4%, p<0.001). Critically, 27.2% of malnourished patients were not flagged as at-risk by the standard Malnutrition Screening Tool. The volumetric model (Model 3) achieved the highest discrimination (males: AUROC 0.808; females: 0.794) and best calibration (males: Brier 0.129; females: 0.124), significantly outperforming both the base model (males: p=0.004; females: p<0.001) and L3 model (males: p=0.019; females: p<0.001). L3 features modestly improved discrimination but paradoxically worsened calibration, an effect corrected by volumetric features. Sex-specific risk profiles differed markedly, with ASA classification dominating female models and demographic factors dominating male models.

Conclusions: Automated volumetric CT body composition analysis significantly improves preoperative malnutrition risk identification, with sex-stratified models revealing distinct risk profiles. Leveraging imaging already obtained for surgical planning, this approach opens a preoperative window for nutritional intervention that current practice fails to utilize.
]]></description>
<dc:creator><![CDATA[ Gershuni, V. M., Damani, R. A., Vasisht, S., Sharma, R., Rowe, J., Compher, C., Duda, J., Sagreiya, H., Kelz, R., Lee, H., Tasian, G., Damrauer, S. M., Wu, G. D., Witschey, W. R. ]]></dc:creator>
<dc:date>2026-06-16</dc:date>
<dc:identifier>doi:10.64898/2026.06.08.26355187</dc:identifier>
<dc:title><![CDATA[Development of an automated, imaging-based preoperative screening model for early identification of malnutrition in an abdominal surgery cohort]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.08.26355139v1?rss=1">
<title>
<![CDATA[
Fidelity-Derived Quantum Dissimilarity-Enhanced k-Nearest Neighbor Algorithm for Arterial Hypertension Prediction 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.08.26355139v1?rss=1
</link>
<description><![CDATA[
We present a quantum-enhanced version of the classic k-Nearest Neighbors (kNN) classification algorithm, applied to the prediction of arterial hypertension. The traditional Euclidean distance metric of the kNN algorithm is replaced with a Fidelity-derived quantum dissimilarity measure to evaluate the similarity between data samples. We map classical real-world clinical and ECG-derived data features into quantum states via the Dense-Angle Encoding, which efficiently utilizes parameterized rotation gates to pack multiple features into minimal qubits while maintaining pure states. We evaluate the performance of the dissimilarity measure using both the noiseless state vector Simulator and the IBM Qiskit Estimator primitives. The quantum circuit demonstrates robust predictive capabilities comparable to the classical model. While it does not claim computational supremacy over the classical baseline, the framework proves that fidelity-based similarity is a physically meaningful and efficient approach for hybrid quantum-classical classification.
]]></description>
<dc:creator><![CDATA[ Tampakaki, A. E., Barmparis, G. D., Angelaki, E., Marketou, M. E., Tsironis, G. P. ]]></dc:creator>
<dc:date>2026-06-16</dc:date>
<dc:identifier>doi:10.64898/2026.06.08.26355139</dc:identifier>
<dc:title><![CDATA[Fidelity-Derived Quantum Dissimilarity-Enhanced k-Nearest Neighbor Algorithm for Arterial Hypertension Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.12.26355166v1?rss=1">
<title>
<![CDATA[
Entity-Aware Generation of Synthetic Clinical Progress Notes for Prostate Cancer using Large Language Model 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.12.26355166v1?rss=1
</link>
<description><![CDATA[
Objectives: This study investigates large language models (LLMs) for clinical entity projection across substantial textual transformation. Specifically, we evaluate whether entities annotated in Spanish prostate cancer case reports can be preserved and explicitly projected when the source narratives are transformed into hospital-style clinical progress notes. Entity projection is treated as a generation-driven task, allowing paraphrase, condensation and narrative reorganisation, provided that clinically relevant entities remain recoverable as structured annotations.

Methods: A corpus of 109 Spanish prostate cancer case reports was annotated using a silver-standard pipeline combining Spanish biomedical named-entity recognition with rule-based prostate-specific antigen (PSA) and Gleason extractors. The resulting silver-standard annotations were validated on a subset of generated notes against a gold-standard consensus produced by medical experts in prostate cancer. Four LLMs were evaluated for note generation and entity projection: GPT-5.4 Nano, Qwen 3.5:35B-A3B, GLM5 and Claude Sonnet 4.6. Entity-to-Entity (E2E) generation used XML-annotated cases as RAG-supported input, whereas Text-to-Entity (T2E) generation required models to generate and annotate notes directly from plain text cases. Zero-shot and few-shot prompting were tested. Projection quality was measured using precision, recall and F1-score, and complemented by LLM-as-a-judge evaluation using Kimi K2.6.

Results: E2E consistently outperformed T2E, indicating that explicit entity-enriched input substantially facilitates entity preservation and localisation. GLM5 achieved the best E2E zero-shot result (F1 = 0.915), followed by Claude Sonnet 4.6 (F1 = 0.896). In T2E, few-shot prompting improved performance, with Claude Sonnet 4.6 reaching the highest score (F1 = 0.718). Age, Gleason, Disease, Procedure, Duration and negation-related entities were robustly projected, whereas PSA and Dose showed less stable behaviour.

Conclusion: LLMs can generate clinically plausible synthetic prostate cancer evolution notes while preserving a substantial proportion of source entities, particularly when explicit semantic annotations are provided as input. However, the lower and more variable performance observed in T2E highlights the difficulty of jointly generating clinical narratives and projecting entities without source-side information, especially for numerical and measure-related entities.
]]></description>
<dc:creator><![CDATA[ Rey-Blanes, A., Veredas-Morente, J., Moreno-Barea, F. J., Veredas, F. J. ]]></dc:creator>
<dc:date>2026-06-15</dc:date>
<dc:identifier>doi:10.64898/2026.06.12.26355166</dc:identifier>
<dc:title><![CDATA[Entity-Aware Generation of Synthetic Clinical Progress Notes for Prostate Cancer using Large Language Model]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.11.26355453v1?rss=1">
<title>
<![CDATA[
Semantic Embeddings and the Peripheral Transcriptome in Ischemic Stroke: Connecting Molecular Signatures to NANDA-I Diagnoses 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.11.26355453v1?rss=1
</link>
<description><![CDATA[
Objective: To construct and evaluate, in an exploratory manner, a pathophysiologic rationale linking biological pathways derived from the peripheral transcriptome in ischemic stroke (IS) to nursing diagnoses in the NANDA-I 2024-2026 taxonomy, while emphasizing that this association is not direct, deterministic, or automatically inferable from textual similarity with large language models (LLMs).

Methods: A computational study was conducted using public secondary data from the Gene Expression Omnibus series GSE16561, which includes 63 peripheral blood samples: 39 from individuals with IS and 24 from healthy controls. The pipeline integrated transcriptomic analysis and functional enrichment, semantic mapping through ClinicalBERT embeddings, and mechanistic and clinical-conceptual judgment using Claude Sonnet 4.6 as a judge. The judgment stage was treated as the central interpretive layer, designed to mediate between the transcriptome, pathophysiology, functional manifestation, and NANDA-I diagnosis.

Results: The analysis identified a bimodal transcriptomic pattern, with activation of pathways related to innate immunity and suppression of pathways related to adaptive immunity. Semantic mapping generated 158 pathway-diagnosis pairs. The Spearman correlation between cosine similarity and the mechanistic score was negative and statistically significant (rho = -0.243; p = 2.09e-03), but weak in magnitude. This effect size indicates that semantic similarity explained less than 6% of the variance in mechanistic plausibility, reinforcing the insufficiency of embeddings as a stand-alone criterion. Of the 158 pairs, 14 were classified as high concordance, 8 as moderate, and 136 as divergent.

Conclusion: The main value of this study lies in demonstrating that translating biological pathways into nursing diagnoses requires pathophysiologic, functional, and clinical-conceptual mediation. The prioritized pairs represent mechanistically plausible hypotheses for future research, without implying causality, direct clinical confirmation, or immediate care recommendations.
]]></description>
<dc:creator><![CDATA[ Santos, R. d. P., Tinoco Patricio, A. d. O., Gama, P. H., Freitas, L. M. D., Ribeiro, K. R. ]]></dc:creator>
<dc:date>2026-06-15</dc:date>
<dc:identifier>doi:10.64898/2026.06.11.26355453</dc:identifier>
<dc:title><![CDATA[Semantic Embeddings and the Peripheral Transcriptome in Ischemic Stroke: Connecting Molecular Signatures to NANDA-I Diagnoses]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.13.26355589v1?rss=1">
<title>
<![CDATA[
VarEx: A Large Language Model Pipeline for Automated Extraction of Exposures, Outcomes, and Covariates from Epidemiologic Studies 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.13.26355589v1?rss=1
</link>
<description><![CDATA[
Objective: Observational studies are essential for investigating risk factors for Alzheimer's disease and related dementias (ADRD), but inconsistent reporting and selection of covariates can contribute to residual confounding, omitted-variable bias, and reduced reproducibility. We developed and evaluated VAREX (Variable Extraction), a large language model (LLM)-based information extraction framework designed to automatically identify exposures, outcomes, and covariates from epidemiologic studies and populate structured evidence repositories.

Materials and Methods: VAREX combines retrieval-augmented generation, biomedical language-model embeddings, semantic chunking, cross-encoder reranking, and prompt-engineered LLM workflows to extract epidemiologic variables from full-text biomedical articles. The framework was evaluated using a reference-standard corpus of observational studies examining blood pressure variability (BPV) and Alzheimer's disease-related dementias (ADRD), together with external validation datasets involving other exposure-outcome relationships. Extracted variables were compared with independently curated human reference standards using semantic matching and one-to-one assignment procedures. Covariates were additionally classified into ten epidemiologically relevant semantic categories.

Results: In the primary BPV→ADRD corpus (n = 10 studies), VAREX achieved a precision of 0.91, recall of 0.84, and F1-score of 0.87 for variable extraction. Covariate classification accuracy was 0.90, yielding a strict extraction-and-classification F1-score of 0.78. External validation datasets demonstrated comparable performance across diverse epidemiologic domains, with extraction F1-scores ranging from 0.73 to 0.85. Category-level performance was strongest for health behaviors (F1 = 0.96), sociodemographic variables (F1 = 0.90), and medication exposures (F1 = 0.89). Compared with published estimates of manual systematic-review effort, VAREX reduced processing time from approximately 61 minutes to 9 minutes per article, representing an 85.7

Discussion: These findings demonstrate that LLM-based information extraction can accurately identify and classify epidemiologic variables across heterogeneous observational-study designs. Automated extraction enables scalable construction of structured repositories of exposures, outcomes, and covariates while substantially reducing the labor required for evidence synthesis and systematic reviews.

Conclusion: VAREX provides an effective framework for automated extraction and classification of epidemiologic variables from the biomedical literature. By supporting large-scale evidence synthesis and structured knowledge resource development, VAREX may facilitate more rigorous observational research, improved confounder identification, and enhanced reproducibility in epidemiology.
]]></description>
<dc:creator><![CDATA[ Malec, S. A., Pradhan, M., Upadhayaya, R., Metzger, V. ]]></dc:creator>
<dc:date>2026-06-15</dc:date>
<dc:identifier>doi:10.64898/2026.06.13.26355589</dc:identifier>
<dc:title><![CDATA[VarEx: A Large Language Model Pipeline for Automated Extraction of Exposures, Outcomes, and Covariates from Epidemiologic Studies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.11.26355471v1?rss=1">
<title>
<![CDATA[
Unveiling the Awareness of Private Health Insurance Coverage among Healthcare Professionals in Freetown, Sierra Leone: Insights Extracted from Their Perspectives. 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.11.26355471v1?rss=1
</link>
<description><![CDATA[
Our study is an assessment of the knowledge, personal coverage, and related determinants of private health insurance as revealed by healthcare professionals in Freetown, the urban capital of Sierra Leone. This study stands as a precursor for Low- and Middle-Income Countries (LMICs), like Sierra Leone, seeking to establish Universal Health Coverage (UHC) to provide healthcare access and coverage through publicly arranged risk pooling, designed to help protect against unmanageable medical costs. In parallel, such countries face significant challenges with achieving sustainable universal coverage due to limited public resources, inefficient allocation systems, uneasy reliance on out-of-pocket payments, and large struggling populations. Our research sheds particular light on how healthcare professionals view their own participation with private healthcare options. A cross-sectional, analytical study was conducted, openly recruiting individuals from various facilities in Freetown. Using the Yamane Formula, a sample size of 109 participants was calculated. STATA 14.0 was used for data analysis. Our findings revealed that 96 (88.9%) participants did not have private health insurance, while 12 (11.1%) did have private coverage. However, 105 (97.2%) reported other modes of health insurance, with only 3 (2.8%) uninsured. Notably, 97.2% expressed willingness to join a private health insurance scheme. Our study found no statistically significant associations between selected indicators (demographic or socioeconomic factors) and current insurance coverage among study participants. These results highlight a low prevalence and understanding of private health insurance among healthcare professionals in a representative urban center in Sub-Saharan Africa (SSA), while acknowledging high willingness to enroll.
The lack of anysignificant determinants suggests other unexamined factors, such as cost, accessibility, or awareness, capable of influencing the adoption and implementation of a universal health program.
]]></description>
<dc:creator><![CDATA[ Gary, L. P., Kamara, A. N., Jimmy, A. I., Lebbie, A. P. ]]></dc:creator>
<dc:date>2026-06-15</dc:date>
<dc:identifier>doi:10.64898/2026.06.11.26355471</dc:identifier>
<dc:title><![CDATA[Unveiling the Awareness of Private Health Insurance Coverage among Healthcare Professionals in Freetown, Sierra Leone: Insights Extracted from Their Perspectives.]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.06.26354746v1?rss=1">
<title>
<![CDATA[
SPIRIT-CONSORT-ELM: Element-Level Assessment of Randomized Controlled Trial Reporting Using Large Language Models 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.06.26354746v1?rss=1
</link>
<description><![CDATA[
Randomized controlled trials (RCTs) play a central role in assessing the benefits and harms of interventions. Incomplete reporting in RCT publications can compromise the verifiability and usefulness of RCTs. SPIRIT and CONSORT reporting guidelines aim to improve the completeness of RCT protocols and results publications, respectively. However, many RCTs are not reported completely. Checking manuscripts automatically could help authors improve the completeness of reports prior to publication. We previously annotated SPIRIT-CONSORT-TM, a corpus of 200 articles (comprising 100 protocol-results publication pairs) using 83 checklist items drawn from SPIRIT 2013 and CONSORT 2010. We also trained machine learning models to automatically assess reporting at the item level. Each checklist item can include multiple constituent elements (i.e., specific details required for that item), and an item might be considered fully reported when all of its elements are present. However, prior work does not explicitly capture or evaluate reporting at the element level. To address this gap, we extended SPIRIT-CONSORT-TM by incorporating element-level annotations and using them to assess reporting completeness (SPIRIT-CONSORT-ELM). We formulated element-level assessment as a machine reading comprehension task, operationalized through 119 questions, where each question targets a specific reporting element within a checklist item. Using the 200 articles included in SPIRIT-CONSORT-TM, two annotators independently answered 119 questions for 50 articles (25 protocol-results pairs) and resolved any discrepancies through discussion; the remaining 150 articles (75 protocol-results pairs) were assessed by a single annotator. We then developed an automated pipeline for element-level assessment using SPIRIT-CONSORT-ELM. 
The pipeline first applies a PubMedBERT-based model to identify sentences containing item-level reporting information, then uses a generative large language model (LLM; GPT-5) with chain-of-thought reasoning to answer element-level questions based on the retrieved evidence. Agreement between the two annotators was high (Gwet's AC1: 0.782) and our pipeline achieved high accuracy in identifying element-level reporting evidence (F1: 0.822, Gwet's AC1: 0.796). Ablation studies indicate that chain-of-thought reasoning and the inclusion of illustrative in-context examples modestly improve LLM performance on the machine reading comprehension task. SPIRIT-CONSORT-ELM provides a benchmark for evaluating reporting guideline completeness at the element level, enabling assessment of RCT transparency beyond the simple presence or absence of checklist items; it is publicly available at https://osf.io/kznx4/. The automated pipeline establishes a robust baseline for assessing RCT reporting and demonstrates potential as a practical aid for authors, reviewers, and editors to identify and address gaps in completeness and transparency of RCT reports.
]]></description>
<dc:creator><![CDATA[ Jiang, L., Ying, X., Brown, A. W., Lan, M., Song, W., Menke, J., Vorland, C., Mayo-Wilson, E., Kilicoglu, H. ]]></dc:creator>
<dc:date>2026-06-15</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.26354746</dc:identifier>
<dc:title><![CDATA[SPIRIT-CONSORT-ELM: Element-Level Assessment of Randomized Controlled Trial Reporting Using Large Language Models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.11.26355494v1?rss=1">
<title>
<![CDATA[
Cost-Performance Evaluation of Large Language Models for Aspect-Based Sentiment Analysis of HCAHPS Patient Comments: A Validation Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.11.26355494v1?rss=1
</link>
<description><![CDATA[
Background: Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) free-text comments contain actionable feedback, but timely, scalable, and affordable sentiment analysis remains challenging for health systems that rely on third-party vendors.

Objectives: To evaluate cost-performance tradeoffs between a cost-optimized and a flagship large language model (LLM) for aspect-based sentiment analysis of HCAHPS comments, using human inter-rater agreement as a reproducibility benchmark.

Methods: We analyzed 512 free-text HCAHPS comments collected from two community hospitals in calendar year 2023. Six trained reviewers (medical students, recent medical graduates, and practicing internists) independently assigned positive, negative, or neutral labels to each comment-aspect pair; the majority label among three reviewers formed the consensus reference standard. Two OpenAI models -- GPT-5-nano (cost-optimized) and GPT-5 (flagship) -- were prompted in a zero-shot setting via the OpenAI API. We calculated pairwise Cohen's κ to establish a human inter-rater baseline, then compared each model's labels to the consensus using Cohen's κ, accuracy, weighted F1, and per-call cost and latency.

Results: Mean human inter-rater agreement was κ = 0.79 (substantial). Both LLMs exceeded this baseline (cost-optimized κ = 0.85; flagship κ = 0.85) with nearly identical accuracy (0.92) and weighted F1 (0.93 vs. 0.93). Performance was strong on positive (F1 ≈ 0.97) and negative (F1 ≈ 0.90) classes but poor on the underrepresented neutral class (F1 ≤ 0.19). The cost-optimized model processed all 512 comments for $0.04 versus $0.18 for the flagship -- a 4.2-fold cost difference without measurable performance gain.

Conclusions: Commercially available LLMs can perform aspect-based sentiment analysis on HCAHPS comments at human-level reproducibility, with the cost-optimized tier sufficient for routine classification. This offers health systems a rapid, scalable, low-cost alternative to vendor-based patient-experience analytics.
]]></description>
<dc:creator><![CDATA[ Nawab, K., Ramsey, G., Asfandiyar, S., Atreya, S., Hijjawi, S., Rokkam, S., Ghayur, U., Rajesh, A., Yousuf, I., Shah, Z. A., Misra, A. K., Ponnala, M., Hamid, T., Schreiber, R. ]]></dc:creator>
<dc:date>2026-06-15</dc:date>
<dc:identifier>doi:10.64898/2026.06.11.26355494</dc:identifier>
<dc:title><![CDATA[Cost-Performance Evaluation of Large Language Models for Aspect-Based Sentiment Analysis of HCAHPS Patient Comments: A Validation Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.10.26355390v1?rss=1">
<title>
<![CDATA[
Room-Specialized Mixture-of-Experts for In-Home ADL Recognition with Ambient Sensors 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.10.26355390v1?rss=1
</link>
<description><![CDATA[
Monitoring activities of daily living (ADLs) in the home is a promising approach for tracking dementia progression in older adults. While ambient sensor-based ADL systems are well-studied, most existing ADL recognition systems rely on globally trained models that ignore the spatial organization of in-home activities. In real deployments, where training data are sparse and highly home-specific, global transformer models may fail to capture room-dependent behavioral structure. We propose a deterministic Mixture of Experts (MoE) architecture for in-home ADL recognition, in which each expert is a compact transformer specialized to one room of the home (bedroom, kitchen, bathroom, living area). Input segments are routed using a deterministic gating strategy based on room-level motion activity and time-of-day priors for sleep-related behaviors. Unlike learned routing networks, the proposed gate encodes domain knowledge about where ADLs are likely to occur, reducing model complexity under limited per-home training data. By decomposing ADL recognition into room-specific activity spaces, the proposed architecture reduces competition between dominant and low-frequency activities under highly imbalanced residential data. We evaluated the system on data collected via low-cost ambient sensors (motion, light, temperature, humidity) and Raspberry Pi edge devices across five homes, with ground-truth ADL labels provided by participants and caregivers. Across the five homes, the proposed MoE consistently outperformed global transformer, 1D CNN, and Random Forest baselines, achieving macro-F1 scores ranging from 0.60 to 0.88, highlighting the importance of home-specific modeling in real-world deployments. These findings suggest that room-aware expert specialization may provide a practical and interpretable strategy for low-data ADL recognition in real-world residential environments.
]]></description>
<dc:creator><![CDATA[ Addepalli, V. r., Rao, P., Kiselica, A., Kummerfeld, E., Abdalnabi, N., Lee, K. ]]></dc:creator>
<dc:date>2026-06-12</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.26355390</dc:identifier>
<dc:title><![CDATA[Room-Specialized Mixture-of-Experts for In-Home ADL Recognition with Ambient Sensors]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-12</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.10.26355372v1?rss=1">
<title>
<![CDATA[
PCRAgent: A Multi-Agent Framework for Transforming Noisy clinical conversations into Structured Pre-Consultation Medical Records and Reusable Clinical Data Resources 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.10.26355372v1?rss=1
</link>
<description><![CDATA[
In primary care and outpatient settings, clinically important patient information is often embedded in fragmented, ambiguous, repetitive, and noisy communication between physicians and patients. This limits physicians' ability to obtain a clear pre-consultation overview of symptoms, history of present illness, and visit intent, while also preventing real-world clinical dialogues from being reused in hospital information systems and medical artificial intelligence applications. To address this challenge, we developed PCRAgent, a centrally coordinated multi-agent framework for pre-consultation clinical information organization, shifting information processing upstream from the consultation. Guided by physician inquiry logic, PCRAgent identifies, extracts, corrects, and standardizes patient-reported information from noisy consultations. Its coordinated modules, including error detection, semantic editing, output control, contextual memory, and intent recognition, enable robust parallel handling of spelling errors, repetitions, grammatical inconsistencies, medical ambiguities, and non-medical interference. A traceable edit list records intermediate corrections and context, allowing iterative refinement without redundant modifications. PCRAgent generates two complementary outputs. One is a Pre-Consultation Clinical Report for rapid physician review. The other is a Structured Clinical Conversation Dataset for hospital data construction and downstream AI applications. In evaluations using 220,000 strongly perturbed consultations, PCRAgent maintained high robustness, achieving a clinical information accuracy of 4.99 out of 5 and key element completeness of 5 out of 5, outperforming GPT-4o. Expert review of Chinese and English dialogues confirmed high clinical accuracy of 4.85 out of 5 and high security of 4.79 out of 5. Multicenter validation in real-world outpatient workflows further demonstrated practical utility. 
These results indicate that PCRAgent improves outpatient workflow efficiency, reduces physicians' cognitive burden, ensures completeness of pre-consultation clinical information, supports more focused and accurate clinical decision-making, and enables high-quality reuse of clinical data for downstream medical artificial intelligence applications.
]]></description>
<dc:creator><![CDATA[ Zhang, M., Zhao, J., Tang, W., Xing, J., Li, J., Zhang, H., Qiu, J., Zhang, Y. ]]></dc:creator>
<dc:date>2026-06-11</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.26355372</dc:identifier>
<dc:title><![CDATA[PCRAgent: A Multi-Agent Framework for Transforming Noisy clinical conversations into Structured Pre-Consultation Medical Records and Reusable Clinical Data Resources]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.10.26355348v1?rss=1">
<title>
<![CDATA[
Validity and Limitations of the Empatica E4 Wristband for Autonomic and Thermoregulatory Sleep Monitoring Against Concurrent Polysomnography: A Wearanize+ Dataset Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.10.26355348v1?rss=1
</link>
<description><![CDATA[
The Empatica E4 wristband provides continuous multi-modal physiological monitoring, including blood volume pulse (BVP), electrodermal activity (EDA), and skin temperature (TEMP), but its validity for sleep-stage-specific autonomic and thermoregulatory monitoring has not been systematically evaluated against concurrent polysomnography (PSG). Using the Wearanize+ dataset, which provides synchronised PSG, Empatica E4, and Zmax EEG recordings from 100 home-recorded participants, a systematic validation of Empatica E4 physiological signals against PSG ground truth across five sleep stages was conducted. Of 100 participants, 92 had Empatica data; 69 met Zmax EEG signal quality criteria and formed the analysis sample. Heart rate (HR) from the pre-computed Empatica HR channel showed valid stage-specific patterns (Wake: 70.9 bpm, N3: 61.2 bpm) and moderate inter-device MeanNN correspondence with PSG ECG (Spearman r = 0.35-0.42 across stages). Skin temperature showed the expected thermoregulatory pattern (Wake: 33.92 °C, N3: 35.48 °C) and is recommended for downstream analyses. Tonic EDA showed an inverted stage pattern attributable to wrist sweat accumulation during deep sleep, representing a known confound for wrist-worn EDA during sleep. Phasic EDA showed plausible patterns and may be used with caution. These findings establish a validated feature set for Empatica E4 sleep research and directly inform multimodal psychiatric biomarker studies using the Wearanize+ dataset.
]]></description>
<dc:creator><![CDATA[ Parry, Y. D., Briganti, G. ]]></dc:creator>
<dc:date>2026-06-11</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.26355348</dc:identifier>
<dc:title><![CDATA[Validity and Limitations of the Empatica E4 Wristband for Autonomic and Thermoregulatory Sleep Monitoring Against Concurrent Polysomnography: A Wearanize+ Dataset Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.09.26355176v1?rss=1">
<title>
<![CDATA[
A Heterogeneous Graph Neural Network Framework for Multi-Horizon Stroke Mortality Prediction 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.09.26355176v1?rss=1
</link>
<description><![CDATA[
Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons.

Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph.

Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) survived their stroke after one year. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction. StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity.

Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.
]]></description>
<dc:creator><![CDATA[ Tharzeen, A., Vafaei Sadr, A., Radfar, N., Hwang, W., Abedi, V., Zand, R. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.09.26355176</dc:identifier>
<dc:title><![CDATA[A Heterogeneous Graph Neural Network Framework for Multi-Horizon Stroke Mortality Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.05.26354271v1?rss=1">
<title>
<![CDATA[
A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.05.26354271v1?rss=1
</link>
<description><![CDATA[
Objective: To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, and a non-overlapping three-tier evaluation framework separating highest-stakes discrimination, the operational CDS regime, and category-correct alerting.

Materials and Methods: PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios non-overlappingly: Discrimination (98 scenarios, 50 fatal vs 48 deception, near-balanced 51%/49%); Operational (394 scenarios, 261 serious unsafe plus 133 safe including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations conducted April 21-23, 2026.

Results: Headline binary performance on the full benchmark spans a wide range across the 40 models: F1 78.5%-92.3%, accuracy 65.4%-89.8%, sensitivity 81.4%-100.0%, specificity 6.1%-81.8%. Leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper.

Discussion and Conclusion: PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.
]]></description>
<dc:creator><![CDATA[ Proulx, J., Daines, B., Barton, M., Leonard, M. E., Garcia, J. A., Young, B., Snell, Q., West, T. W., Watson, S. R., AlQaseer, M., Louiset, M., Maqsood, M. B., Voutt-Goos, M. J., Douma, C., Kasbekar, N., Jeffries, J., Abu-Rahmeh, W., Frush, K., Grewal, D. K., Bahsoun, M., Leonard, M., Frankel, A., Classen, D. C., Pestotnik, S. L. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.05.26354271</dc:identifier>
<dc:title><![CDATA[A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.09.26355266v1?rss=1">
<title>
<![CDATA[
Beyond event-rate enrichment: proteomic risk scores for mechanism-aware prevention trial design 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.09.26355266v1?rss=1
</link>
<description><![CDATA[
Background: Blood-based biomarkers are increasingly proposed for identifying high-risk individuals before clinical disease and for making prevention-oriented trials more efficient. Prognostic enrichment can increase event rates, but trial efficiency also depends on whether the intervention effect is preserved in the enriched population.

Methods: Using the UK Biobank Pharma Proteomics Project, we trained disease-specific proteomic risk scores (ProRS) from 2,916 plasma proteins with elastic-net Cox models. We compared ProRS, polygenic risk scores (PRS), and combined PRS-ProRS scores across ten incident diseases. We estimated cumulative incidence and theoretical two-arm time-to-event trial sample sizes across risk strata. To evaluate effect preservation, we examined six intervention-analogue exposure-outcome pairs spanning genetic (PCSK9/coronary artery disease, APOE/Alzheimer's disease, PPARG/type 2 diabetes, IL23R/Crohn's disease), behavioural (physical activity/all-cause mortality), and pharmacological (RAAS inhibitors versus calcium channel blockers/coronary artery disease) examples.

Results: ProRS outperformed PRS for 9 of 10 diseases (median C-index 0.75 versus 0.61). ProRS and PRS were weakly correlated (median Pearson |r| = 0.04), and joint PRS-ProRS stratification identified groups with higher observed incidence than either score alone for several endpoints. In the top risk quartile, combined-score enrichment reduced theoretical required sample sizes by 32-74% under a fixed 20% relative hazard reduction. These gains were not always preserved when stratum-specific intervention-analogue effects were used. Effects were broadly preserved for APOE/Alzheimer's disease and physical activity/mortality. The PPARG/type 2 diabetes effect attenuated toward the null under all three score types, showing that event-rate enrichment does not guarantee effect preservation. For IL23R/Crohn's disease and the antihypertensive comparison, point estimates differed across score types (preserved under polygenic but attenuated under proteomic enrichment), but confidence intervals were wide and overlapping.

Conclusions: Proteomic risk scores can identify high-event-rate populations for prevention-oriented trials, but event-rate enrichment alone is insufficient for trial design. Biomarker-guided enrichment should evaluate mechanism-specific effect preservation and may be preferable as a stratification or adaptive-design variable rather than as a restrictive eligibility criterion.
]]></description>
<dc:creator><![CDATA[ Fieggen, J., Simond, G., Segal, B. M., Noori, A., Thakurta, A., Butler, C. C., Clifton, D. A., Clifton, L. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.09.26355266</dc:identifier>
<dc:title><![CDATA[Beyond event-rate enrichment: proteomic risk scores for mechanism-aware prevention trial design]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.06.08.26355217v1?rss=1">
<title>
<![CDATA[
An Explainable Multimodal AI Framework with Reinforcement Learning for Post-Surgical Clinical Decision Support 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.06.08.26355217v1?rss=1
</link>
<description><![CDATA[
Post-surgical adverse outcomes, including mortality, intensive care readmission, and complications, remain major challenges for clinical decision-making. Existing machine learning approaches focus on outcome prediction while operating as opaque systems, limiting clinical trust and the translation of predictions into treatment decisions. In addition, many clinical studies rely on synthetic data in which shared intermediate variables create circular dependencies between inputs and targets that compromise reported performance. We aimed to develop an explainable multimodal architecture and a rigorous evaluation methodology that address these gaps. We designed a two-stage architecture integrating supervised deep learning for risk prediction with conservative Q-learning for action recommendation. The first stage uses five modality-specific encoders for structured records, physiological time-series, chest radiographs, clinical notes, and surgical metadata, unified through cross-modal attention into a shared patient-state representation. The second stage applies offline reinforcement learning to recommend clinical actions while preventing value overestimation. We formally characterized a target-leakage flaw in synthetic pipelines and proposed a real-data methodology using a verified clinical database, with event-censored temporal separation and uncertainty-weighted per-task training. Component-level behavior was validated on a controlled synthetic benchmark, demonstrating that the architecture functions as designed without claiming clinical validity. The cross-modal attention and risk-prediction components behaved as expected, whereas the offline reinforcement learning stage did not converge on the benchmark, indicating that value estimation requires further investigation on real clinical data. 
The architecture provides dual-level explainability through attention visualization and value decomposition, contributing a deployable design, a formal methodological critique of synthetic-data practices, and a complete framework for clinically valid evaluation.
]]></description>
<dc:creator><![CDATA[ Ahmed, M., Ahmed, F., Mow, S. M., Taha, P. A., Barua, S., Rahman, M. M., Rafy, A., Mondol, S. M., Faisal, M. I. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.08.26355217</dc:identifier>
<dc:title><![CDATA[An Explainable Multimodal AI Framework with Reinforcement Learning for Post-Surgical Clinical Decision Support]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
