﻿<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://medrxiv.org">
<admin:errorReportsTo rdf:resource="mailto:medrxiv@cshlpress.edu"/>
<title>medrxiv Subject Collection: Health Informatics</title>
<link>http://medrxiv.org</link>
<description>
This feed contains articles for medRxiv Subject Collection "Health Informatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.07.26352456v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.06.26352506v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.06.26352616v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.01.26349658v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.06.26352613v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.06.26352516v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.06.26352564v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.06.26352548v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.05.26352366v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.29.26351965v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.05.26352445v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.01.26352171v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.04.26352393v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.28.26351782v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.03.26352339v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.04.26352350v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.03.26352335v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.03.26352340v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.03.26352241v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.03.26352326v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.29.26352082v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.02.26352261v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.30.26352196v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.01.26352213v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.05.01.26352193v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.23.26351510v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.24.26351503v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.30.26352142v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.29.26352110v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.26.26351798v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>medrxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>medrxiv</title>
<url>https://www.medrxiv.org/sites/default/files/medrxiv_internal_logo.png</url>
<link>http://medrxiv.org</link>
</image>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.07.26352456v1?rss=1">
<title>
<![CDATA[
Transforming Semi-structured Variant Assessments into Computable Clinical Assertions: A Pilot Study for AI-Assisted Curation 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.07.26352456v1?rss=1
</link>
<description><![CDATA[
Genomic medicine relies on expert evaluation of genomic variants, but this process is dramatically slowed by a lack of readily accessible genomic knowledge. Although genomic knowledge resources such as ClinVar and CIViC support structured data sharing and provide interfaces for adding structure, much of the variant interpretation data generated upstream of these resources is not readily interoperable with them, limiting the ability of clinical labs to share data and creating knowledge silos. Here we evaluate a strategy for breaking down these knowledge silos in a pilot study to transform semi-structured variant classification knowledge into computable clinical assertions leveraging the Global Alliance for Genomics and Health (GA4GH) Genomic Knowledge Standards specifications. We programmatically mapped previously captured somatic cancer clinical significance classifications from spreadsheets to the GA4GH Variant Annotation specification. For diagnostic classification data, this approach enabled reuse of standards-aware submission tooling to share 1,499 records to ClinVar. We then studied how AI-assisted curation approaches could overcome gaps in unstructured text and enable scalable curation of prior classifications. Using this approach, we were able to accurately classify clinical significance for 71.8% (117/163) of randomly sampled prognostic evidence statements. We conclude with an overview of how this work may be generalized to make computationally inaccessible variant evidence from other clinical laboratories broadly reusable in downstream knowledgebases such as CIViC and ClinVar.
]]></description>
<dc:creator><![CDATA[ Cannon, M. J., Bratulin, A., Kuzma, K., Puthawala, D., Corsmeier, D., Schieffer, K., Kelly, B., Cottrell, C., Wagner, A. H. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.07.26352456</dc:identifier>
<dc:title><![CDATA[Transforming Semi-structured Variant Assessments into Computable Clinical Assertions: A Pilot Study for AI-Assisted Curation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.06.26352506v1?rss=1">
<title>
<![CDATA[
Positive Registration Rate as a Key Determinant of COCOA Effectiveness: Empirical Evidence from Individual-Level Key-Match Data during the Sixth and Seventh COVID-19 Waves in Japan 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.06.26352506v1?rss=1
</link>
<description><![CDATA[
Background: COCOA, Japan's Bluetooth-based COVID-19 contact tracing app, was widely regarded as ineffective due to persistently low key-match counts. However, this assessment may have conflated two distinct phenomena: (1) a structurally suppressed positive registration rate caused by administrative friction in the HER-SYS linkage, and (2) genuine epidemiological inefficacy. Objective: To empirically examine whether the correlation between individual COCOA key-match counts and regional COVID-19 case numbers depended on positive registration rate, using a unique longitudinal dataset from a single observer with a rigorously controlled behavioral pattern. Methods: The corresponding author (S.N.) recorded daily key-match counts from his personal iPhone from January 10 to October 8, 2022, encompassing the Sixth Wave (January 10-April 20, 2022) and Seventh Wave (July 9-September 2, 2022). Daily reported COVID-19 cases in Tokyo were obtained from publicly available NHK data. Pearson correlation coefficients were calculated for each wave period separately. Results: During the Sixth Wave, no meaningful correlation was observed (r² = 0.018, p = 0.059, n = 194). During the Seventh Wave, a strong positive correlation emerged (r² = 0.530, p < 0.001, n = 56). This correlation disappeared abruptly after September 12, 2022, coinciding with Japan's revision of the mandatory full case reporting policy. Conclusions: COCOA's utility as a real-time signal of neighborhood-level COVID-19 prevalence was critically dependent on positive registration rate rather than app installation rate. These findings provide the first real-world empirical evidence supporting the threshold effect predicted by prior simulation studies, and offer important lessons for future pandemic preparedness.
]]></description>
<dc:creator><![CDATA[ Nakagawa, S., Kumagai, S., Yamamoto, A. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.26352506</dc:identifier>
<dc:title><![CDATA[Positive Registration Rate as a Key Determinant of COCOA Effectiveness: Empirical Evidence from Individual-Level Key-Match Data during the Sixth and Seventh COVID-19 Waves in Japan]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.06.26352616v1?rss=1">
<title>
<![CDATA[
Enhanced Adverse-Event Detection and Drug-Event Relation Extraction from Clinical Notes 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.06.26352616v1?rss=1
</link>
<description><![CDATA[
Adverse drug events (ADEs) are a significant source of preventable patient harm, yet many ADE signals remain buried in free-text clinical notes. Clinical notes often describe adverse events (AEs) in relation to drugs in two ways: whether a drug causes the AE (the AE is an ADE) or a drug is given to treat an AE (it is considered the Reason for drug treatment). In the N2C2 2018 benchmark, ADEs and Reasons are annotated as separate entity types, despite often being similar in both wording and clinical meaning. This shared similarity makes them difficult to distinguish during entity extraction, leading to errors in relation classification. Therefore, we propose a two-stage framework that first detects AEs as a unified event category and then classifies drug-event pairs into Drug-ADE, Drug-Reason, or No-Relation. In the end-to-end evaluation on the N2C2 2018 benchmark, our system achieves F1 scores of 0.93 for Drug-ADE and 0.94 for Drug-Reason, improving over previously reported end-to-end benchmarks of 0.48 for Drug-ADE and 0.59 for Drug-Reason. Overall, these results support a more precise task formulation in which AEs are detected broadly first, and the ADE vs Reason distinction is resolved at the relation layer. Furthermore, they motivate the development of AE-focused datasets annotated independently of drug linkage to enable more reliable end-to-end pharmacovigilance systems.
]]></description>
<dc:creator><![CDATA[ Alharbi, O., Wu, C. H., Chen, C., Shanker, V. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.26352616</dc:identifier>
<dc:title><![CDATA[Enhanced Adverse-Event Detection and Drug-Event Relation Extraction from Clinical Notes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.01.26349658v1?rss=1">
<title>
<![CDATA[
UPhAIR: A Hybrid Pipeline for Generating Understandable Post-hoc AI Reports in Glioma IDH Mutation Status Prediction 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.01.26349658v1?rss=1
</link>
<description><![CDATA[
Clinical adoption of machine learning (ML) in medical imaging is limited by the lack of interpretability. To address this, we present understandable post-hoc artificial intelligence reports (UPhAIR), a pipeline designed to generate transparent, evidence-based explanations by combining Shapley additive explanation (SHAP) analysis with retrieval-augmented generation (RAG) and large language models (LLMs). We trained 12 classifiers to predict isocitrate dehydrogenase (IDH) mutation status in glioma using radiomics and clinical features. SHAP values were used to identify key contributors to each prediction. Domain literature was collected from three sources and indexed within a RAG framework. Relevant papers were retrieved using Facebook AI Similarity Search (FAISS) vector search and provided to Google Gemini 2.5 Pro to generate concise, reference-supported explanations for each feature. The best model, an extreme gradient boosting (XGBoost) classifier, achieved an AUC of 0.90 in 5-fold cross-validation and a hold-out test AUC of 0.86. In a case study of a single patient excluded from training, the model correctly predicted IDH-wildtype glioma, and SHAP identified MGMT status, age, and three radiomic features as the most influential features. UPhAIR produced a structured report combining SHAP visualizations with LLM-generated summaries grounded in scientific evidence. UPhAIR provides a practical, model-agnostic framework that enhances ML interpretability in clinical settings, helping bridge the gap between black-box AI and real-world medical decision-making.
]]></description>
<dc:creator><![CDATA[ Gorji, A., Shahverdi, H., Saberi, A., Gheiji, B., Farahani, S., Azemi, G., Di Ieva, A. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.01.26349658</dc:identifier>
<dc:title><![CDATA[UPhAIR: A Hybrid Pipeline for Generating Understandable Post-hoc AI Reports in Glioma IDH Mutation Status Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.06.26352613v1?rss=1">
<title>
<![CDATA[
Accelerating Mental Health Precision Trial : An Effective Visualization-Driven Tool for Power and Sample Size Estimation in Biomarker-based Study Designs 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.06.26352613v1?rss=1
</link>
<description><![CDATA[
Precision medicine has given rise to a spectrum of biomarker-guided trial designs, from simple enrichment and strategy designs to more complex adaptive frameworks. To address the need for user-friendly tools that span this spectrum, we developed a unified R Shiny platform that first implements three standard designs: the randomize-all design, the enrichment design, and the biomarker-strategy design, allowing researchers to perform power and sample size calculations under each framework with intuitive inputs and visual outputs. Building on this foundation, the platform further extends to support two-stage general randomized basket trial designs with interim analysis, which can be viewed as a generalization of the standard designs to multiple biomarker-defined subgroups. The tool was rigorously validated by comparison with established R pipelines and published formulas, and user testing confirmed its intuitive interface. By providing seamless integration from standard to advanced designs under a common input-output framework, our platform enables researchers to directly compare power and sample size requirements across different design choices using the same underlying assumptions. The result is a freely accessible tool offering effective visualizations for the full spectrum of biomarker-guided trial designs, available at https://ampt.obicloud.ca/. Future improvements may further expand the tool's capabilities to accommodate the increasing complexity of trial designs needed by the research community.
]]></description>
<dc:creator><![CDATA[ Chen, D. Z., Xie, A., Ma, C. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.26352613</dc:identifier>
<dc:title><![CDATA[Accelerating Mental Health Precision Trial : An Effective Visualization-Driven Tool for Power and Sample Size Estimation in Biomarker-based Study Designs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.06.26352516v1?rss=1">
<title>
<![CDATA[
Artificial Intelligence Driven Support and Self Care Competence as Determinants of Medication Adherence in Diabetes Care, A Cross-sectional Nigerian Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.06.26352516v1?rss=1
</link>
<description><![CDATA[
Medication adherence among patients with diabetes remains suboptimal in low- and middle-income countries, including Nigeria. Emerging digital health interventions such as AI-powered virtual support may be associated with improved adherence behaviours. This study examined self-care competence and perceived AI-powered virtual support as predictors of medication adherence among patients with diabetes. A cross-sectional survey was conducted among 450 patients recruited through multistage sampling across hospitals in Benue State, Nigeria. Standardised measures of self-care competence, perceived AI support, and medication adherence were analysed using correlation and regression analyses. Self-care competence significantly predicted medication adherence, although some components (glucose management, physical activity, healthcare use) showed negative associations. Perceived AI-powered support demonstrated stronger predictive power, with social presence and social interactivity emerging as key predictors. The combined model explained 36.3% of variance. In conclusion, perceived AI-powered virtual support, particularly socially interactive features, plays a significant role in enhancing medication adherence and may complement traditional self-care strategies. Clinicians should therefore adopt a hybrid care model that integrates traditional patient education with AI-assisted interventions. This approach can help bridge gaps caused by high patient loads and limited consultation time, while also enhancing personalised care.
]]></description>
<dc:creator><![CDATA[ Onah, C., Ajonye, A. A. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.26352516</dc:identifier>
<dc:title><![CDATA[Artificial Intelligence Driven Support and Self Care Competence as Determinants of Medication Adherence in Diabetes Care, A Cross-sectional Nigerian Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.06.26352564v1?rss=1">
<title>
<![CDATA[
Comparative Evaluation of Wearable Sensor Form Factors for Physiological Monitoring in Youth with Autism Spectrum Disorder 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.06.26352564v1?rss=1
</link>
<description><![CDATA[
Sudden behavioral outbursts in youth with autism spectrum disorder (ASD) are difficult to predict and create substantial caregiving burdens. Wearable physiological monitoring might enable prediction, but sustained use may be limited by tolerability. We evaluated adherence and data completeness in 40 youth with ASD over a two-week period across four device types (wristband, headband, adhesive chest patch, and finger ring) alongside caregiver-reported usability and comfort. Data completeness varied markedly by device, with the patch achieving the highest completeness (~80%), followed by the wristband (~60%), headband (~50%), and ring (~20%). In multivariate analyses, adherence was driven by the device form factor rather than participant-level clinical characteristics. Devices rated as more comfortable did not yield higher completeness, revealing a divergence between reported preference and actual use. These findings suggest that device choice is a key consideration for studies in youth with ASD, highlighting the need for research into model stability across sensor types in neurodivergent populations.
]]></description>
<dc:creator><![CDATA[ Stewart, C., Albertazzi, A., Tasarz, J., Kim, K., Gandara, V., Blucher, C., Reyes-Martinez, C. C., Smarr, B., Besterman, A. D. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.26352564</dc:identifier>
<dc:title><![CDATA[Comparative Evaluation of Wearable Sensor Form Factors for Physiological Monitoring in Youth with Autism Spectrum Disorder]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.06.26352548v1?rss=1">
<title>
<![CDATA[
Single-cell splicing analysis with ISSAC links cell type specific and cell state-dependent sQTLs to neurological disorders 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.06.26352548v1?rss=1
</link>
<description><![CDATA[
Single-cell RNA sequencing enables comprehensive profiling of gene expression and splicing at cellular resolution, revealing cell type-specific and cell state-dependent regulation (variation within cell types based on their functional states). While genetic studies of expression (eQTLs) in single cells are well established, the genetic regulation of alternative splicing in single cells remains challenging. Existing single-cell splicing QTL (sQTL) studies perform pseudobulk aggregation using bulk analysis methods, which reduces power to detect cell type-specific sQTLs and cannot capture cell state-dependent splicing regulation. Here, we introduce ISSAC to directly quantify metacell-level splice site usage and map cell type- and cell state-specific sQTLs through generalized linear mixed models. In real-world benchmarking on peripheral blood single-cell data, ISSAC identified 1.4- to 2.5-fold more cell type-specific sQTLs than pseudobulk sQTL analysis, and uniquely enabled cell state-dependent sQTL discovery. We applied ISSAC to a harmonized aging brain resource consisting of approximately 3 million dorsolateral prefrontal cortex (DLPFC) single nuclei from 722 donors. ISSAC identified 31,318 independent cis-sQTLs across seven major cell types and 16,861 independent cis-sQTLs across 67 subcell types, with ~67% of sGenes showing no overlap with eGenes. We identified 369 independent sQTLs whose genetic effects were mediated by various cell states such as dendrite development and synaptic signaling. Additionally, we uncovered 194 Alzheimer's-biased and 207 sex-biased sGenes, as well as 142 risk genes that colocalized with neurological disorders including Alzheimer's disease, Neuroticism, Amyotrophic lateral sclerosis, Parkinson's disease, Lewy body dementia and Schizophrenia. Specifically, we functionally validated a causal variant rs11549690 modulating TRPT1 exon 7 skipping to influence neuroticism risk.
]]></description>
<dc:creator><![CDATA[ ZHANG, Y., Liu, B. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.26352548</dc:identifier>
<dc:title><![CDATA[Single-cell splicing analysis with ISSAC links cell type specific and cell state-dependent sQTLs to neurological disorders]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.05.26352366v1?rss=1">
<title>
<![CDATA[
Less is More: last observations of vital signs can outperform time series for hospital mortality prediction 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.05.26352366v1?rss=1
</link>
<description><![CDATA[
Timely identification of hospital inpatients at risk of deterioration facilitates interventions to support their recovery. Many hospitals implement early warning scores to detect abnormal patient vital signs, such as the National Early Warning Score 2 (NEWS2). However, these are typically based on a snapshot of the most recent vital signs, rather than exploiting trends over time that clinical intuition suggests may also be informative. Multiple approaches, including recently described methods, have been developed to predict patient deterioration from time series. We therefore compared the effectiveness of different mortality prediction models, including clinical scoring systems, classical machine learning models and state-of-the-art deep learning models using both snapshot and time series vital sign data. No significant improvement in model performance was observed using predictions from time series compared to using the last observation of the time series and non-temporal features such as demographics. Our study comprehensively compares different model types, and provides recommendations for developing predictive models and guidance for what evaluation is needed before considering deploying such models in inpatient care.
]]></description>
<dc:creator><![CDATA[ Zhang, Z., Wei, J., Xu, J., Li, Y., Luk, A., Bhalla, S., Cui, H., Clifton, D. A., Walker, A. S., Eyre, D. W. ]]></dc:creator>
<dc:date>2026-05-06</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.26352366</dc:identifier>
<dc:title><![CDATA[Less is More: last observations of vital signs can outperform time series for hospital mortality prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.29.26351965v1?rss=1">
<title>
<![CDATA[
SmartAlert: Integrating Machine Learning and Alert Triggers into Live Electronic Medical Record Systems, Targeting Low-Yield Inpatient Lab Tests 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.29.26351965v1?rss=1
</link>
<description><![CDATA[
This study explores integrating machine learning into electronic medical record systems to predict stability of inpatient lab tests. A smart alert system was developed and tested at Stanford Hospital. The system identifies stable lab results, advising clinicians on test ordering. Live deployment showed desired precision at good recall in predicting test result stability, with suggestions for system optimization identified. This approach may significantly decrease low-yield testing and enhance personalized clinical decision-making.
]]></description>
<dc:creator><![CDATA[ Jiang, Y., Ma, S., Liang, A., Kim, G., Acharya, A., Mony, S., Punnathanam, S., Makeown, J., Jose, J., Shieh, L., Pham, T., Ng, A. Y., Chen, J. H. ]]></dc:creator>
<dc:date>2026-05-06</dc:date>
<dc:identifier>doi:10.64898/2026.04.29.26351965</dc:identifier>
<dc:title><![CDATA[SmartAlert: Integrating Machine Learning and Alert Triggers into Live Electronic Medical Record Systems, Targeting Low-Yield Inpatient Lab Tests]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.05.26352445v1?rss=1">
<title>
<![CDATA[
Enhancing dengue diagnosis and surveillance by integrating machine learning technologies with the NS1 rapid test kit 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.05.26352445v1?rss=1
</link>
<description><![CDATA[
Background: Dengue has been a major health threat globally in recent years. In particular, dengue incidence continues to increase annually and the epidemic area has expanded primarily due to global warming. Therefore, effective case detection and surveillance strategies are crucial to tackle this global health challenge. In clinical practice, the rapid test kit that detects the dengue non-structural protein 1 antigen, commonly referred to as NS1, is widely employed for early diagnosis. However, real-world studies revealed that the sensitivity of the NS1 test kit ranged from approximately 61% to 95%. Since early diagnosis is critical for disease surveillance in the early stage of a dengue epidemic, scientists have been working hard to develop novel diagnosis methods that can provide higher sensitivity levels.

Methodology/Principal Findings: In response to this challenge, we developed a novel diagnosis procedure that integrates machine learning technologies with the NS1 test kit. Our experimental results revealed that the sensitivity of the dengue diagnosis procedure could be raised to higher than 99% by incorporating machine learning based prediction models to screen the suspected patients with a negative NS1 result. Furthermore, the relative risks between the suspected patients who were predicted to be positive and those who were predicted to be negative exceeded 4.8.

Conclusions/Significance: These results illustrate that the proposed approach provides an effective and efficient diagnosis procedure to address the global health challenge caused by the spread of dengue.

Author Summary: This study aimed to enhance surveillance of dengue by integrating machine learning technologies with the rapid test kit commonly employed in early diagnosis. In clinical practice, the NS1 rapid test kit is widely employed for early diagnosis. However, real-world studies revealed that a certain percentage of the patients with a negative NS1 test result, ranging from 5% to 39%, were actually infected by dengue. Since early diagnosis is critical for disease control in the early stage of a dengue epidemic, scientists have been working hard to tackle this challenge. Based on this observation, this study was launched to investigate the effects of incorporating machine learning based prediction models to further screen those patients with a negative NS1 test result. The experimental results revealed that the proposed approach was able to identify over 99% of the patients who were infected by dengue. Furthermore, the risk of the suspected patients who were predicted to be positive was 4.8 times higher than the risk of those who were predicted to be negative. The experimental results illustrate that the proposed approach provides an effective and efficient diagnosis procedure to enhance surveillance of dengue.
]]></description>
<dc:creator><![CDATA[ Hwang, C.-K., Chen, Y.-W., WANG, Y.-T., Ho, T.-S., Oyang, Y.-J. ]]></dc:creator>
<dc:date>2026-05-06</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.26352445</dc:identifier>
<dc:title><![CDATA[Enhancing dengue diagnosis and surveillance by integrating machine learning technologies with the NS1 rapid test kit]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.01.26352171v1?rss=1">
<title>
<![CDATA[
Patterns and Predictors of Artificial Intelligence Use Among Healthcare Professionals in the United States and United Kingdom: A Cross-National Survey 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.01.26352171v1?rss=1
</link>
<description><![CDATA[
Objective: We surveyed 524 healthcare professionals (HCPs) in the United States and United Kingdom to examine workplace generative AI use, access, and barriers in two high-maturity health settings.

Methods: This cross-sectional survey compared AI usage breadth, access modes, and barriers among HCPs, stratified by country and professional role.

Results: Overall, 75.8% of HCPs reported recent AI use, mainly for documentation, literature search, and clinical decision support. Usage breadth was similar by country, but role differences were pronounced. Physicians reported broader use and were significantly more likely to access AI via personal, non-employer-provided tools (60.4% vs. 31.0% for nurses; P<.01). Personal tools were the most common access mode overall (40.1%).

Conclusion: AI use is common, but institutional access lags adoption. Shifting use from personal accounts toward governed, approved systems is a key priority.
]]></description>
<dc:creator><![CDATA[ Sezgin, E., Lee, J. A., Jadczyk, T., Taxter, A. J. ]]></dc:creator>
<dc:date>2026-05-06</dc:date>
<dc:identifier>doi:10.64898/2026.05.01.26352171</dc:identifier>
<dc:title><![CDATA[Patterns and Predictors of Artificial Intelligence Use Among Healthcare Professionals in the United States and United Kingdom: A Cross-National Survey]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.04.26352393v1?rss=1">
<title>
<![CDATA[
Early Detection of Rare Disease Using Hierarchical Set-to-Sequence Modeling of Structured Electronic Health Records 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.04.26352393v1?rss=1
</link>
<description><![CDATA[
Rare diseases are characterized by heterogeneous, weak, and sparse phenotypic signals that emerge gradually across longitudinal clinical visits, making early detection a persistent challenge. In this study, we propose a hierarchical set-to-sequence (HSS) framework for prospective rare disease detection using structured EHR data. HSS decomposes the problem into two levels: (1) intra-visit encoding via Multi-Query Attention (MQA), which treats heterogeneous clinical events within a single clinical visit as an unordered set to generate unified visit-level representations, and (2) inter-visit temporal modeling with transformer encoders conditioned on patient visit age and inter-visit time gaps to capture the disease progression and the irregular intervals between clinical visits. We construct a real-world cohort of 40,223 patients comprising 708,422 visits from a single academic medical center (2005-2025), with 3,032 rare disease cases identified by curated rule-based phenotyping, including severe neuro-developmental, congenital, or genetic conditions. We formulate the task as multi-horizon prospective binary classification with five prediction horizons of 7, 30, 90, 180, and 365 days prior to first diagnosis. Experimental results show that the proposed HSS model consistently outperforms linear logistic regression, tree-based XGBoost, and Transformer-based baselines at every prediction horizon, ranging from AUROC = 0.893 and AUPRC = 0.601 at 7 days with 5.17% prevalence to AUROC = 0.829 and AUPRC = 0.228 at 365 days with 3.98% prevalence. Notably, the performance gap between HSS and the strongest competing baseline is largest at the 365-day horizon, indicating stronger advantages for long-horizon prediction where phenotypic signals for rare diseases are weak and sparse. Additional analyses further clarify the contribution of the hierarchical components and confirm the importance of hierarchical modeling.
This work contributes to the ongoing development of AI methodologies tailored to rare diseases by introducing a hierarchical framework for early detection using structured longitudinal clinical data.
]]></description>
<dc:creator><![CDATA[ Ma, Y., Chinthala, L., Mohammed, A., Davis, R. L., Colonna, V. ]]></dc:creator>
<dc:date>2026-05-06</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.26352393</dc:identifier>
<dc:title><![CDATA[Early Detection of Rare Disease Using Hierarchical Set-to-Sequence Modeling of Structured Electronic Health Records]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.28.26351782v1?rss=1">
<title>
<![CDATA[
Extracting adverse event nature, severity, timelines and resulting interventions from clinical notes of patients receiving CAR-T therapy using large language models. 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.28.26351782v1?rss=1
</link>
<description><![CDATA[
Chimeric Antigen Receptor T-cell (CAR-T) therapy, where genetically engineered patient T cells target tumor antigens, has transformed care for hematologic malignancies but requires careful tracking of adverse events (AEs) often documented only in unstructured EHR notes.

We evaluated a Large Language Model (LLM)-based approach in UCSF's secure environment to extract AEs, dates, grades, and interventions within 30 days post-infusion for six commercial CAR-T products (2012-2023), benchmarking against two evaluators. Using GPT-4-0314 in a zero-shot setting with four prompts (prespecified AEs, non-prespecified AEs, CRS, ICANS), we compared outputs against dual annotations on a random sample of 50 notes using accuracy, precision, recall, F1, and Cohen's kappa. From 4,762 progress notes for 293 patients (median age 65.6), CRS occurred in 80.2% (median onset 4 days); neutropenia in 70.0% (16 days); neutropenic fever in 64.8% (4 days); ICANS in 34.8%. Interventions included tocilizumab and corticosteroids. Grades were frequently undocumented (CRS 62.3%, ICANS 56.1%); documented cases were mainly CRS grade 1 (59.4%) and ICANS grade 2 (28.0%). Performance was high on CRS and ICANS grading (accuracy of 0.97 and 0.91, respectively). Performance was moderate for prespecified AE extraction (accuracies 0.62-0.76) and non-prespecified AEs (accuracies 0.76-0.84). Inter-rater reliability was strong to near-perfect for CRS/ICANS presence and grade (kappa 0.86-0.96), moderate for dates and interventions, and weaker for broader AE attributes.

LLM-derived insights can augment AE monitoring and real-world evidence generation by unlocking unstructured clinical detail and characteristic timelines after CAR T. However, performance varied for broader AE attributes, warranting cautious use. Performance was highest for detecting the presence and grade of CRS and ICANS, with strong to near-perfect inter-rater reliability. While cautious use of LLMs for broad AE extraction is warranted due to the variable performance observed in this study, these results support integrating high-performing CRS/ICANS extraction into EHR workflows.

Author summary: Chimeric Antigen Receptor T-cell (CAR-T) therapy has transformed care for blood cancer but requires careful tracking of adverse events (AEs). We asked whether a large language model could read routine clinical notes and extract AEs after CAR T-cell therapy. We analyzed de-identified notes from the first month after infusion. The model identified when two key side effects occurred--cytokine release syndrome (a whole-body inflammatory reaction) and neurotoxicity (brain and nerve symptoms)--and how severe they were, with accuracy similar to human reviewers. It also captured when side effects started and what treatments were given, though performance was more variable for the wider range of side effects beyond these two. In our data, these reactions often arose within the first week; blood count problems and infections were also common. Because many notes did not state severity explicitly, the model sometimes could not assign a grade. Our findings suggest that language models can help unlock important details hidden in clinical notes and could be incorporated into electronic records to support faster, more reliable side-effect monitoring and research. We recommend careful, supervised use and continued validation, especially for broader side-effect categories.
]]></description>
<dc:creator><![CDATA[ Guillot, J., Miao, B., Suresh, A., Sushil, M., Williams, C. Y., Vashisht, R., Oskotsky, T. T., Sirota, M., Butte, A. J. ]]></dc:creator>
<dc:date>2026-05-05</dc:date>
<dc:identifier>doi:10.64898/2026.04.28.26351782</dc:identifier>
<dc:title><![CDATA[Extracting adverse event nature, severity, timelines and resulting interventions from clinical notes of patients receiving CAR-T therapy using large language models.]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.03.26352339v1?rss=1">
<title>
<![CDATA[
Free-text MAUDE narratives provide a source-robust representation layer for biomaterial-device surveillance 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.03.26352339v1?rss=1
</link>
<description><![CDATA[
Implantable biomaterial devices require effective post-market surveillance because clinically important failure patterns often emerge only after widespread use. However, surveillance workflows often rely on structured coded summaries that compress heterogeneous adverse-event narratives into coarse categories. This study compares coded and free-text narrative representations across 1,500 FDA MAUDE reports from three biomaterial device classes (coronary stents, bone cement, and surgical mesh) to test whether narratives preserve a more source-robust surveillance representation. Under manufacturer-held-out evaluation, narrative TF-IDF features outperformed structured code-only features (macro F1 0.925 versus 0.827), while delexicalized narratives retained strong grouped performance after masking device-class, manufacturer, brand, and legal-template tokens (F1 0.897). Narrative topics resolved reported events into procedural, anatomical, host-response, and reporting-context patterns, and an interpretable classifier recovered code-derived complication phenotypes from narrative text alone (mean F1 0.902, AUC 0.967). These findings support free-text adverse-event narratives as a complementary representation layer for post-market device surveillance, while remaining bounded by passive adverse-event reporting limitations and requiring validation across additional years, device classes, and independently adjudicated outcomes.

Author Summary: When an implanted medical device fails inside a patient, the event is reported to the FDA's MAUDE database. Each report includes both a standardized code and a written narrative describing what happened. We asked whether these two representations carry the same information. Using 1,500 reports covering coronary stents, bone cement, and surgical mesh, we found that coded fields lose much of the clinical detail present in narratives. Importantly, narrative-based classifiers remained accurate even when tested on reports from manufacturers not seen during training, while code-based classifiers dropped substantially. This matters because real-world surveillance must generalize across different reporting sources. We also found that narrative text can recover clinically meaningful complication patterns that are defined by codes, and that most reports never name the specific biomaterial involved. These findings suggest that narrative text deserves a more central role in post-market device monitoring, complementing the coded fields that current surveillance pipelines rely on.
]]></description>
<dc:creator><![CDATA[ Chen, H. ]]></dc:creator>
<dc:date>2026-05-05</dc:date>
<dc:identifier>doi:10.64898/2026.05.03.26352339</dc:identifier>
<dc:title><![CDATA[Free-text MAUDE narratives provide a source-robust representation layer for biomaterial-device surveillance]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.04.26352350v1?rss=1">
<title>
<![CDATA[
Syrona: Visual Analytics for Systematic Comparison of Health Datasets on the OMOP Common Data Model 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.04.26352350v1?rss=1
</link>
<description><![CDATA[
Health research increasingly relies on observational datasets that capture systematically different patient populations, affecting the representativeness, comparability, and transportability of study findings. Yet these differences are rarely quantified across clinical domains or explored interactively, and the workflow for pairwise dataset comparison remains fragmented. We present Syrona, a visual analytics workflow for systematic pairwise comparison of datasets and subcohorts on the OMOP Common Data Model. Syrona extracts annual prevalence for conditions, procedures, and drugs from any OMOP CDM database, computes prevalence ratios across demographic strata, and synthesizes them via multilevel meta-analysis. A coordinated dashboard - distributional overviews, stratified heatmaps, forest plots, and absolute prevalence comparisons - enables interactive exploration driven by domain-adaptive SNOMED CT and ATC filtering. Three case studies on Estonian national health data (495,000 persons, 2012-2024) demonstrate how the workflow supports representativeness assessment, institutional practice comparison, and coding artifact detection. Syrona is open-source and applicable to any OMOP CDM database. Explore the demo at: http://omop-apps.cloud.ut.ee/ShinyApps/Syrona/
]]></description>
<dc:creator><![CDATA[ Pajusalu, M., Oja, M., Mooses, K., Heinsar, S., Laisk, T., Tillmann, T., Laur, S., Reisberg, S., Vilo, J., Kolde, R. ]]></dc:creator>
<dc:date>2026-05-05</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.26352350</dc:identifier>
<dc:title><![CDATA[Syrona: Visual Analytics for Systematic Comparison of Health Datasets on the OMOP Common Data Model]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.03.26352335v1?rss=1">
<title>
<![CDATA[
Calibration Drift Under Cross-Institutional Deployment: An External Validation Framework for ICU Mortality Prediction Across MIMIC-IV and eICU 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.03.26352335v1?rss=1
</link>
<description><![CDATA[
Background: Machine learning models for intensive care unit (ICU) mortality prediction achieve strong internal discrimination yet rarely undergo external validation with calibration assessment, a gap that undermines clinical deployment. Calibration, the agreement between predicted probabilities and observed event rates, is a prerequisite for threshold-based decisions yet remains underreported.

Methods: We conducted a retrospective cohort study using MIMIC-IV (v2.2; n = 52,028 ICU stays) for model development and eICU (n = 114,060) for independent external validation. Logistic regression, random forest, and gradient boosting (XGBoost) were evaluated on first-24-hour clinical variables. Discrimination was assessed via receiver operating characteristic area (AUROC) and precision-recall area (AUPRC); calibration via slope, intercept, and expected calibration error (ECE). Post-hoc logistic recalibration was applied externally. Clinical utility was evaluated by decision curve analysis benchmarked against Acute Physiology and Chronic Health Evaluation (APACHE) scores. Subgroup analyses examined sex and race/ethnicity; SHapley Additive exPlanations (SHAP) assessed feature importance. Uncertainty was estimated via bootstrap resampling; the study adheres to TRIPOD guidelines.

Results: The recalibrated XGBoost model achieved internal AUROC 0.847 (95% CI: 0.832-0.860) and external AUROC 0.819 (95% CI: 0.815-0.823). Internal calibration was near-ideal (slope 0.982; intercept 0.001), whereas external validation revealed systematic risk overestimation (intercept -0.678) attributable to prevalence-driven label shift. An intercept-only adjustment reduced ECE by 26%. The model outperformed APACHE (AUROC 0.817 vs. 0.795; p < 0.001).

Conclusions: ICU mortality models exhibit transportable discrimination but clinically significant calibration drift under cross-institutional deployment. Calibration evaluation and targeted recalibration should be mandatory in any clinical machine learning validation framework.
]]></description>
<dc:creator><![CDATA[ Patel, K., Beedala, P. ]]></dc:creator>
<dc:date>2026-05-05</dc:date>
<dc:identifier>doi:10.64898/2026.05.03.26352335</dc:identifier>
<dc:title><![CDATA[Calibration Drift Under Cross-Institutional Deployment: An External Validation Framework for ICU Mortality Prediction Across MIMIC-IV and eICU]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.03.26352340v1?rss=1">
<title>
<![CDATA[
Women's Experiences of Accessing and Using Patient Portals Across Health Settings and Implications for Mental Health Care: A Qualitative Descriptive Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.03.26352340v1?rss=1
</link>
<description><![CDATA[
Patient portals are online tools that enhance patients' access to various aspects of their health care, including provider communication, medication information, and lab results. As portals continue to be integrated into health systems, it is imperative to understand the experiences of various groups who utilize their functions. Women's experiences of using patient portals have been scarcely explored in the literature, including their perceptions about use for mental health care. The purpose of this study was to explore women's experiences of accessing and using a variety of patient portals, including their perceptions of usefulness for mental health care. A qualitative descriptive methodology was used to explore women's experiences of accessing and using patient portals across Canada. Purposive sampling was used to recruit ten women, who completed semi-structured, one-to-one interviews between April and June 2025. Conventional qualitative content analysis was used to analyze the data. Each woman had used at least one patient portal for their health care at the time of their interview. Four main themes emerged from the data: (1) the health care lived experience, (2) individual autonomy, (3) provider partnership, and (4) portal improvement. The interrelated themes contain narrative descriptions of individual experiences of accessing and using patient portals, and implications for using portals for women's mental health care. These results demonstrate a variety of women's experiences. Patient portals were found to impact their lived experiences with health care, enhance individual autonomy, and foster partnerships with their health care providers. The women also suggested various areas of improvement in portal design elements, features, and privacy functions. Future research should focus on evaluating the design of new portals to ensure they meet the needs of the population they serve.

Author Summary: A patient portal is an example of a digital tool that is being integrated into various health organizations to supplement in-person care. Depending on the design and the complexity of the portal, patients may be able to complete online prescription renewals, access medication schedules, virtually communicate with their providers, and review their clinical notes. However, as digital tools continue to be produced and adapted within health settings, it is crucial to understand how they can best serve different populations. In this study, we explored women's experiences with using patient portals for their health care in Canada. We also aimed to understand women's perspectives on how patient portal use can be optimized for mental health care. We performed virtual interviews with 10 women who had used at least one patient portal for their health care, and gained their perspectives on accessibility, useful features, and how using a patient portal impacted their experiences of receiving health care. The women discussed how portal use improved their health care experiences and they suggested a variety of features to support mental health care as patient portal designs continue to be adapted to different settings.
]]></description>
<dc:creator><![CDATA[ Durocher, K., Kemp, J., Shin, H. D., Jackson, K. T., Strudwick, G. ]]></dc:creator>
<dc:date>2026-05-05</dc:date>
<dc:identifier>doi:10.64898/2026.05.03.26352340</dc:identifier>
<dc:title><![CDATA[Women's Experiences of Accessing and Using Patient Portals Across Health Settings and Implications for Mental Health Care: A Qualitative Descriptive Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.03.26352241v1?rss=1">
<title>
<![CDATA[
Multilingual Evaluation of a Large Language Model-Based Primary Care Chatbot 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.03.26352241v1?rss=1
</link>
<description><![CDATA[
Pre-visit planning has the potential to reduce EHR documentation burden while improving workflow efficiency, care quality, and patient-provider engagement. Large language model (LLM) chatbots show promise for supporting this task, but while their English-centric development suggests a potential for disparity, the extent to which these concerns translate into performance degradation in multilingual clinical settings remains unclear. In this mixed-methods study, we systematically evaluate the multilingual capabilities of PCP-Bot, an English-developed LLM-based (GPT-4o) clinical chatbot that collects patient concerns and generates structured, physician-ready summaries (~200 words) under structured output constraints. We enrolled 31 bilingual individuals (11 Mandarin, 10 Spanish, 10 Hindi) to role-play as patients to evaluate the PCP-Bot, interacting with it across five synthetic clinical cases in both English and a second language. Participants completed a structured survey comprising baseline language proficiency screening, standardized interactions with PCP-Bot in each language, and post-interaction evaluations. Case order was randomized, with each scenario completed first in English and subsequently in the participant's second language. All summaries were generated in English, regardless of the interaction language. Our results show that Hindi achieved usability and conversation quality parity with English across all measured dimensions. Mandarin achieved usability parity but showed a significant conversation quality gap relative to English. Spanish demonstrated significant deficits in both conversation quality and summary quality. Trust and workload remained consistent across languages. Qualitatively, participants found PCP-Bot natural, smooth, and accurate overall, but noted repetition, transcription errors, missed follow-ups, and more frequent usability issues in non-English interactions.
Overall, our findings demonstrate that LLM translation capabilities can enable effective deployment beyond English following appropriate performance validation.
]]></description>
<dc:creator><![CDATA[ Chen, P.-L., Rao, A. A., Pugh, S. F., Johnson, K. B. ]]></dc:creator>
<dc:date>2026-05-05</dc:date>
<dc:identifier>doi:10.64898/2026.05.03.26352241</dc:identifier>
<dc:title><![CDATA[Multilingual Evaluation of a Large Language Model-Based Primary Care Chatbot]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.03.26352326v1?rss=1">
<title>
<![CDATA[
Screening for Rheumatic Heart Disease in Asymptomatic Children using Machine Learning from Electrocardiograms 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.03.26352326v1?rss=1
</link>
<description><![CDATA[
Early detection of Rheumatic Heart Disease (RHD) is essential in reducing its associated mortality and late complications. In resource-limited settings, automated detection using low-cost electrocardiogram (ECG) sensors can enhance prevention efforts. However, its effectiveness as a potential RHD screening tool in at-risk populations remains unexplored. This study aimed to investigate the utility of machine learning for classifying RHD in a cohort screened for RHD using low-cost ECG devices. The ECGs were collected from 611 at-risk schoolchildren using KardiaMobile, of whom 47 had confirmed RHD and 564 were healthy. First, the ECG fiducial points were annotated using a publicly available prominence-based delineator. Then, temporal, frequency, wavelet, and visibility graph-based features were extracted from six leads and fed to the XGBoost classifier. A 10-fold cross-validation was used at different prediction score thresholds to obtain target sensitivity (Se) for screening RHD. Single-lead evaluation on Lead-II showed an F1-score of 60.9%, a Se of 59.6% and a positive-predictive-value (PPV) of 62.2%. However, using multiple leads improved the results, with an F1-score of 62.8%, a Se of 59.6% and a PPV of 66.7%. The best model performance was achieved by adjusting the threshold to 0.6 with Se and PPV of 66% and 51%, respectively. Error analysis revealed that T-wave and ST-T changes, as well as non-rheumatic mitral valve cases, were among the false positive cases. Machine learning can enhance early detection by leveraging relevant ECG features and adjustable target sensitivity based on screening priorities and resource capacity. Measurements can be obtained without chest contact, using only the fingers and knees, thereby enabling use by non-clinical staff. This approach provides a scalable and cost-effective solution for RHD screening in high-prevalence regions.
]]></description>
<dc:creator><![CDATA[ Chuma, A. T., Wang, C., Asmare, M. H., Varon, C., Voigt, J.-U., Kassie, D. M., Zuhlke, L., Vanrumste, B. ]]></dc:creator>
<dc:date>2026-05-05</dc:date>
<dc:identifier>doi:10.64898/2026.05.03.26352326</dc:identifier>
<dc:title><![CDATA[Screening for Rheumatic Heart Disease in Asymptomatic Children using Machine Learning from Electrocardiograms]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.29.26352082v1?rss=1">
<title>
<![CDATA[
Performance of Large Language Models as a Tool for Primary Care Consultations: Evaluation Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.29.26352082v1?rss=1
</link>
<description><![CDATA[
Since the release of the first ChatGPT model in 2022, large language models (LLMs) have evolved significantly, and an increasing number of users now turn to these generative information systems for inquiries as sensitive and consequential as those related to health. The primary objective is to identify the main strengths and weaknesses of generative AI systems when responding to information needs as critical as those arising in the health domain. The study was structured using a question-answer format, in which each question corresponded to a user query and each answer represented the output generated by a model in response. The study employed a human evaluation framework involving two distinct panels of clinical experts from different specialties. The evaluation criteria encompassed three dimensions: adherence to medical consensus; presence or absence of inappropriate or incorrect information; and the potential to cause harm to users. GPT-4o mini, Llama 3, and MedLlama 3 were selected as three representative systems for the experiments. This study presents a detailed analysis of the performance of widely used contemporary large language models in addressing common health-related queries posed by online users. The results reinforce the potential of LLMs as tools for online health information seeking among non-expert users. However, the performance limitations identified underscore the need for further studies to monitor the future development of these models. Among them, performance issues have been identified in areas where users may be more vulnerable, leading to the retrieval of clinically incorrect information, particularly in matters relating to rare diseases. Furthermore, it has been noted that these models can become trapped in obsolete medical knowledge due to continuous scientific progress.
]]></description>
<dc:creator><![CDATA[ Pascual, N., Fernandez-Pichel, M., Losada, D. E., Garcia-Orosa, B., Gude, F., Costa Lathan, C., Sueiro Justel, J., Gomez Fontenla, A., Lastra Perez, M., Alonso Garcia, F. ]]></dc:creator>
<dc:date>2026-05-04</dc:date>
<dc:identifier>doi:10.64898/2026.04.29.26352082</dc:identifier>
<dc:title><![CDATA[Performance of Large Language Models as a Tool for Primary Care Consultations: Evaluation Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.02.26352261v1?rss=1">
<title>
<![CDATA[
Can large language models approximate human perceptions of disease severity? An evaluation using Global Burden of Disease 2010 disability weights 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.02.26352261v1?rss=1
</link>
<description><![CDATA[
Background: Disability weights (DWs) quantify the severity of health loss and are essential for estimating disability-adjusted life years in the Global Burden of Disease (GBD) framework. Conventional DW estimation relies on resource-intensive population surveys that are difficult to update or adapt to emerging health states. Large language models (LLMs) may offer a scalable alternative by approximating human perceptions of disease severity through structured judgment tasks.

Methods: This exploratory study evaluated the alignment between LLM-derived and human-derived DW rankings using 222 health states from GBD 2010. All possible pairwise comparisons (24,531 pairs, each repeated three times) were conducted across four LLMs (GPT-5 mini, GPT-5, Claude Haiku 4.5, and Claude Sonnet 4.5). DWs were estimated via probit regression and evaluated using Spearman's rank correlation and Steiger's z test. The effects of prompt language (English vs. Korean), cultural role prompting, and medical specialist role prompting on alignment were examined. Additionally, the Binomial-Logit Indifference-Point (BLIP) estimator was proposed and validated through leave-one-out cross-validation for estimating DWs for health states without established values.

Results: All four LLMs showed high rank correlation with GBD 2010 DWs (Spearman's ρ = 0.893 to 0.909), with no significant inter-model differences. Korean-language prompting significantly improved alignment with Korean DWs (ρ = 0.756 vs. 0.715, p = 0.011), and Korean cultural role prompting improved alignment with both GBD 2010 DWs (ρ = 0.922 vs. 0.909, p = 0.002) and Korean DWs (ρ = 0.738 vs. 0.715, p = 0.001). Medical specialist role prompting significantly reduced alignment with GBD 2010 DWs (ρ = 0.895 vs. 0.909, p = 0.001). BLIP demonstrated strong agreement with GBD 2010 DWs (Pearson's r = 0.862, MAE = 0.066) and produced plausible estimates for Long COVID (mild: 0.020, moderate: 0.298, severe: 0.529).

Conclusions: LLMs can approximate human perceptions of disease severity with high rank-order consistency. Prompt language and role framing significantly influenced alignment, with culturally grounded lay prompting enhancing and specialist prompting reducing correspondence with population-based DWs. BLIP provides a practical framework for generating provisional DW estimates for emerging or underrepresented health states when conventional surveys are infeasible.
]]></description>
<dc:creator><![CDATA[ Ha, Y., Park, H., Lee, Y., Kim, S., Ahn, S. ]]></dc:creator>
<dc:date>2026-05-04</dc:date>
<dc:identifier>doi:10.64898/2026.05.02.26352261</dc:identifier>
<dc:title><![CDATA[Can large language models approximate human perceptions of disease severity? An evaluation using Global Burden of Disease 2010 disability weights]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.30.26352196v1?rss=1">
<title>
<![CDATA[
An Efficient and Interpretable Learning Approach for Large-Scale Histopathology Data 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.30.26352196v1?rss=1
</link>
<description><![CDATA[
Prostate cancer (PCa) remains one of the leading causes of cancer-related mortality among men, and histopathological analysis of prostate biopsy specimens is central to diagnosis and risk stratification. Whole-slide images (WSIs) capture rich morphological information, but their gigapixel scale and the large number of extracted tissue patches make exhaustive annotation and model training computationally expensive. Attention-based Multiple Instance Learning (MIL) has emerged as an effective weakly supervised framework for WSI analysis, enabling slide-level prediction without requiring patch-level annotations. However, training MIL models on large histopathology cohorts remains resource intensive because many extracted patches are non-informative, and some patches are often processed repeatedly during training. To address these challenges, we propose an efficient and interpretable learning framework for large-scale histopathology analysis. Our method combines a pathology-pretrained UNI encoder, a Clustering-constrained Attention Multiple instance learning-Single Branch (CLAM-SB) attention-based MIL model, and a window-based training strategy that reduces computational overhead while preserving predictive performance. We illustrate the approach with experiments on TCGA-PRAD WSIs from PCa patients. Processing 189,600 sampled patches across 79 WSIs with our proposed approach reduced total training time by 57.5% for 5 epochs (20 to 8.5 hours) and by 41.4% for 10 epochs (27 to 16 hours), underscoring its potential as a practical and resource-efficient strategy for scalable prostate histopathology analysis.
]]></description>
<dc:creator><![CDATA[ Moore, C., Gupta, V., Neupane, S., Tripathi, H. ]]></dc:creator>
<dc:date>2026-05-03</dc:date>
<dc:identifier>doi:10.64898/2026.04.30.26352196</dc:identifier>
<dc:title><![CDATA[An Efficient and Interpretable Learning Approach for Large-Scale Histopathology Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.01.26352213v1?rss=1">
<title>
<![CDATA[
Inpatient diagnostic odysseys in rare diseases: a nationwide audited Orphanet ICD-10 DRG/GRD-IR analysis in Chile, 2019-2024 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.01.26352213v1?rss=1
</link>
<description><![CDATA[
Background: Rare diseases (RD; enfermedades poco frecuentes, EPoF, in Chilean policy terminology) collectively affect 3.5-5.9% of the population and are associated with long diagnostic trajectories. Chile lacks a reproducible national operational definition for identifying RD in administrative hospital data.

Methods: We conducted a retrospective observational analysis of Chilean GRD-IR events (IR-29301 version) for 2019-2024 released through FONASA Datos Abiertos, covering hospital discharges and major ambulatory surgery reported by 72 public establishments for FONASA-covered persons. The canonical analytical cohort contained 5,779,482 DRG events in 4,027,921 linked patients. We constructed a Chilean Orphanet-ICD-10 homologation and audited it through an agentic human-in-the-loop pipeline, yielding a conservative RD operational catalogue (434 final ICD-10 codes in the KEEP + MAP_TO_SPECIFIC_ORPHA scenario). RD-coded DRG events were labeled as observed inpatient odysseys when at least one prior DRG event existed for the same patient. We quantified prior events, DRG-observed inpatient trajectory time, nonspecific prior diagnoses, DRG weight, and bridge-code associations. Bridge-code enrichment was estimated using patient-level Fisher exact tests with Benjamini-Hochberg false-discovery correction; event-level estimates were retained as sensitivity analyses.

Results: The audited conservative catalogue identified 55,284 primary-diagnosis RD-coded DRG events in 45,784 patients and 374,866 RD-coded events in any diagnostic field. We characterized 63,685 observed inpatient odyssey cases in 25,648 unique patients across 371 audited RD ICD-10 codes. Median DRG-observed inpatient trajectory time to RD-coded diagnosis was 241 days, and mean prior events per odyssey was 8.1. Bridge-code analysis identified 616 associations with support ≥ 10 patients and 390 with q < 0.05; 350 significant associations were no-same-code administrative trajectory signals. These signals varied in interpretation, including clinically plausible precursors, diagnostic refinement, and care-process bridges. The Odyssey Index reordered conditions relative to raw prior-event counts, separating high-volume entities from stronger trajectory signatures.

Conclusions: To our knowledge, we provide the first nationwide audited and reproducible characterization of inpatient RD diagnostic odysseys in Latin America using administrative hospital data. The framework supports trajectory surveillance, registry design, quality-control analyses, and prioritization of candidate signals for prospective clinical validation under Chile's Law 21,743. Bridge-code associations should be interpreted as statistically enriched administrative signals, not as validated causal or clinical pathways.



Figure 1: Graphical abstract. Updated canonical FONASA DRG/GRD-IR 2019-2024 cohort, audited RD catalogue, odyssey cohort, and bridge-code signal summary.
]]></description>
<dc:creator><![CDATA[ Gomez-Vargas, G. A., Repetto, G. M., Bravo, L., Castillo-Laborde, C., Delgado, I., Matute, I. ]]></dc:creator>
<dc:date>2026-05-03</dc:date>
<dc:identifier>doi:10.64898/2026.05.01.26352213</dc:identifier>
<dc:title><![CDATA[Inpatient diagnostic odysseys in rare diseases: a nationwide audited Orphanet ICD-10 DRG/GRD-IR analysis in Chile, 2019-2024]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.05.01.26352193v1?rss=1">
<title>
<![CDATA[
Unmeasured but Not Unbiased: The Missingness Demographic Leakage Audit (MDLA) for Calibration-Aware Fairness Evaluation in Critical Care Mortality Prediction 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.05.01.26352193v1?rss=1
</link>
<description><![CDATA[
Objective: Clinical prediction models trained on electronic health records are routinely evaluated for fairness on observed feature values, but the informativeness of which measurements are absent remains unaudited. We developed the Missingness Demographic Leakage Audit (MDLA), a reproducible four-step informatics framework that tests whether patterns of clinical measurement absence function as latent demographic proxies, constituting a bias pathway invisible to standard fairness audits.

Materials and Methods: We applied MDLA across development (MIMIC-IV v2.2; n=50,827; mortality 10.2%) and external validation (eICU-CRD v2.0; n=137,773; mortality 9.5%) cohorts following TRIPOD+AI standards. XGBoost, random forest, and logistic regression were trained on 43 clinical features and 44 binary missingness indicators. MDLA quantified demographic predictability from missingness alone, tested feature-level associations with Bonferroni correction, and verified model reliance via ablation. A calibration-aware fairness audit evaluated five criteria across four demographic axes; six post-hoc recalibration strategies were compared on a fairness-utility Pareto frontier.

Results: Missingness indicators alone predicted racial group membership above chance (AUROC=0.543; 95% CI, 0.540-0.546), with 18 of 43 features showing Bonferroni-significant race-missingness associations (all Cramér's V<0.10). Ablation confirmed model reliance: adding missingness indicators increased racial AUROC disparity by 10.7% (0.063 to 0.069) without improving global performance. XGBoost achieved AUROC=0.910 internally (AUROC=0.799 on external validation). Global Platt recalibration reduced overall calibration error by 94% and maximum racial calibration error by 51%, with zero AUROC loss and successful parameter transfer to external validation without retraining.

Conclusion: MDLA provides a structured, reproducible protocol for detecting missingness-encoded demographic signals prior to model deployment. Applied across 188,600 ICU patient-stays from two institutionally diverse databases, it identified a statistically confirmed but subtle bias pathway undetectable by standard fairness audits. Missingness-aware auditing and calibration-aware evaluation should be integrated into clinical AI validation pipelines.
]]></description>
<dc:creator><![CDATA[ Patel, K., Beedala, P. ]]></dc:creator>
<dc:date>2026-05-03</dc:date>
<dc:identifier>doi:10.64898/2026.05.01.26352193</dc:identifier>
<dc:title><![CDATA[Unmeasured but Not Unbiased: The Missingness Demographic Leakage Audit (MDLA) for Calibration-Aware Fairness Evaluation in Critical Care Mortality Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.23.26351510v1?rss=1">
<title>
<![CDATA[
ALEX: Automatic Language EXplanations for Interpreting Treatment Effects via Multi-Agents 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.23.26351510v1?rss=1
</link>
<description><![CDATA[
Precision medicine requires understanding the underlying drivers of heterogeneous treatment responses. Although machine learning methods have shown promise for estimating patient-specific treatment effects, their clinical utility remains limited because they often function as "black box" predictors that fail to explain why responses vary across individuals. Here we present ALEX, an explainable AI (XAI)-driven, multi-agent framework that addresses this interpretability gap by translating the patient variables driving these predictions into data-grounded, natural-language clinical explanations. ALEX first performs XAI analysis on treatment effect estimation and couples the intermediate results with large language model (LLM) agents to produce contextualized clinical insights. Across five landmark randomized controlled trials, ALEX outperformed existing agentic methods on explanation quality metrics and alignment with the biomedical literature. In empirical case studies, ALEX identified baseline glucose level as a potential explanation for the divergent findings between the ACCORD-BP and SPRINT trials, and proposed age as a key effect modifier for pre-hospital tranexamic acid efficacy. These findings suggest that ALEX can help translate treatment effect heterogeneity into clinically grounded explanations for further investigation.
]]></description>
<dc:creator><![CDATA[ Lu, M., Kim, C., White, N. J., Lee, S.-I. ]]></dc:creator>
<dc:date>2026-05-01</dc:date>
<dc:identifier>doi:10.64898/2026.04.23.26351510</dc:identifier>
<dc:title><![CDATA[ALEX: Automatic Language EXplanations for Interpreting Treatment Effects via Multi-Agents]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.24.26351503v1?rss=1">
<title>
<![CDATA[
Disease Risk Prediction Using Structured EHR Data: Can Generalist Large Language Models Match Specialized Clinical Foundation Models? A Comparative Evaluation with Fine-Tuning 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.24.26351503v1?rss=1
</link>
<description><![CDATA[
Background: Electronic health records (EHRs) with clinical decision support tools are now ubiquitous in healthcare organizations. Clinical foundation models (CFMs) pretrained on large-scale, heterogeneous structured EHR data have emerged as a powerful approach to improve predictive performance and generalizability. Meanwhile, large language models (LLMs) pretrained on broad data sources are being applied to an expanding range of healthcare tasks. However, it remains unclear whether generalist LLMs can match specialized CFMs for disease risk prediction using structured clinical data.

Methods: We compared CFMs (Med-BERT, CLMBR) against fine-tuned generalist LLMs (Mistral, LLaMA-2/3/3.1), a clinical LLM (Me-LLaMA), and LLM-generated embeddings paired with simple classifiers (using DeepSeek, Qwen3, and GPT-OSS) on two disease risk prediction tasks: heart failure risk among diabetic patients (DHF) and pancreatic cancer diagnosis (PaCa). Evaluations spanned multi-site EHR data, claims data, and an open-source single-institution benchmark (EHRSHOT). Performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).

Results: On larger EHR and claims cohorts (>30,000 patients), fine-tuned CFMs outperformed fine-tuned LLMs by a small but statistically significant margin (<1% AUROC). The clinical LLM performed comparably to generalist LLMs despite being smaller. On the open-source PaCa cohort (3,810 patients, 199 cases), LLMs achieved slightly higher AUROCs that were not statistically significant (LLaMA-3.1-70B 86.1% vs. Med-BERT 85.3%, p=0.27), but CFMs achieved significantly higher AUPRC (Med-BERT 55.9% vs. LLaMA-3.1-70B 41.1%, p=0.001). Notably, LLM-generated trajectory embeddings paired with logistic regression or a simple MLP, without any LLM fine-tuning, achieved the best overall performance, with AUROC exceeding 90% (Qwen3) and AUPRC reaching 66% (GPT-OSS 20B).

Conclusion: LLM-generated embeddings with lightweight classifiers outperformed both fine-tuned CFMs and fine-tuned LLMs on AUROC and AUPRC. While these results demonstrate the potential of generalist models to match or surpass specialized CFMs, their substantially greater computational cost and variable AUPRC performance in the fine-tuning setting warrant caution. We provide a reproducible evaluation framework and codebase to support continued benchmarking.
]]></description>
<dc:creator><![CDATA[ Mao, B., Prasadha, M. K., Xie, Z., He, J., Ghebranious, M., Xu, H., Zhi, D., Rasmy, L. ]]></dc:creator>
<dc:date>2026-05-01</dc:date>
<dc:identifier>doi:10.64898/2026.04.24.26351503</dc:identifier>
<dc:title><![CDATA[Disease Risk Prediction Using Structured EHR Data: Can Generalist Large Language Models Match Specialized Clinical Foundation Models? A Comparative Evaluation with Fine-Tuning]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.30.26352142v1?rss=1">
<title>
<![CDATA[
AERO: An AI Agent for Adaptive Eligibility Refinement and Optimization of Clinical Trial Criteria in Real-World Trial Emulation 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.30.26352142v1?rss=1
</link>
<description><![CDATA[
Randomized controlled trials (RCTs) provide high internal validity but often rely on restrictive eligibility criteria that limit generalizability and complicate real-world trial emulation. We propose AERO (AI Agent for Adaptive Eligibility Refinement and Optimization), an agentic framework that systematically adapts clinical trial eligibility criteria for application to electronic health record data. AERO integrates external clinical knowledge sources and large language model-based reasoning to classify criteria as strict inclusion, safety exclusion, confounder, or operational artifact. We evaluated AERO by emulating the WARCEF trial using Mayo Clinic Platform data restricted to the pre-trial completion period. Emulation with optimized criteria yielded a hazard ratio of 1.561 (p = 0.0605), consistent with the original neutral trial finding (HR = 1.01, p = 0.91). An ablation analysis demonstrated that eligibility handling decisions materially influence observed treatment effects. These results highlight the importance of systematic, knowledge-informed eligibility refinement in real-world evidence generation.
]]></description>
<dc:creator><![CDATA[ Li, X., James, J., Pellikka, P. A., Zong, N. ]]></dc:creator>
<dc:date>2026-05-01</dc:date>
<dc:identifier>doi:10.64898/2026.04.30.26352142</dc:identifier>
<dc:title><![CDATA[AERO: An AI Agent for Adaptive Eligibility Refinement and Optimization of Clinical Trial Criteria in Real-World Trial Emulation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.29.26352110v1?rss=1">
<title>
<![CDATA[
Protocol for the REVELIO test-track pilot study: a randomised, controlled, single-centre trial in healthy recreational cannabis users investigating real-time in-vehicle detection of cannabis-impaired driving 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.29.26352110v1?rss=1
</link>
<description><![CDATA[
Background: Driving under the influence of cannabis is associated with impaired cognitive and psychomotor performance and an increased risk of traffic accidents. Reliable real-time in-vehicle systems for detecting cannabis-related driving impairment are currently lacking but hold great potential for improving road safety.

Methods: This protocol describes the REVELIO test-track pilot study: a randomised, controlled, open-label, interventional, single-centre trial. The study assesses the feasibility and methodological requirements for developing and evaluating a multimodal in-vehicle detection approach using vehicle and driver state data. A total of 45 healthy recreational cannabis users will be enrolled and randomly allocated to an intervention or a reference control group. During the main study day, all participants will undergo biological sampling for tetrahydrocannabinol (THC) and related metabolites, as well as pre-driving assessments, followed by a sober baseline driving session on a closed test track using a dual-pedal vehicle with a certified driving instructor onboard. Participants in the intervention group will then receive a single controlled inhalative cannabis dose (target 0.67 mg THC per kg body weight), while the reference group will receive no cannabis. All participants will subsequently complete three additional standardized 50-minute driving sessions at predefined time points up to approximately six hours after administration, following identical schedules to enable within- and between-group comparisons. Between driving sessions, structured breaks will include recovery periods, repeated biological sampling, and traffic-medical, traffic-psychological, and pre-driving performance assessments, to characterise the temporal dynamics of cannabis-related impairment.

Discussion: Multimodal data will be collected, including vehicle controller area network (CAN) data, driver monitoring camera (DMC) data, physiological signals using wearables, and biological samples (capillary blood, breath, oral fluid).

Machine-learning-based models will be developed and evaluated to distinguish sober from cannabis-influenced driving states under controlled conditions. Secondary analyses will examine changes in driving performance over time and associations between functional measures and biological THC concentrations. As an exploratory pilot study conducted on a secured test track, the protocol aims to generate standardized reference data and quantitative performance metrics to inform both feasibility and system design considerations.

Ethics and trial registration: The study was approved by the Cantonal Ethics Committee Bern, Switzerland (BASEC ID: 2025-01590) and is registered at ClinicalTrials.gov (NCT07401628).
]]></description>
<dc:creator><![CDATA[ Bechny, M., Deuber, R., Heck, C., Brügger, J., Pfäffli, M., Jovanova, M., Fleisch, E., Wortmann, F., Weinmann, W. ]]></dc:creator>
<dc:date>2026-05-01</dc:date>
<dc:identifier>doi:10.64898/2026.04.29.26352110</dc:identifier>
<dc:title><![CDATA[Protocol for the REVELIO test-track pilot study: a randomised, controlled, single-centre trial in healthy recreational cannabis users investigating real-time in-vehicle detection of cannabis-impaired driving]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.26.26351798v1?rss=1">
<title>
<![CDATA[
Clinician Discourse on Ambient AI Scribes: A Reddit-based Topic Modelling and Sentiment Analysis 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.26.26351798v1?rss=1
</link>
<description><![CDATA[
Background: Ambient AI scribes are rapidly entering clinical workflows, yet end-user perspectives remain underrepresented in the peer-reviewed literature. Online clinician communities offer an unfiltered window into adoption barriers, perceived benefits, and product-level concerns.

Objective: To characterise themes and sentiment in clinician discourse on ambient AI scribes across professional Reddit communities.

Methods: We scraped posts from ten clinically oriented subreddits using twelve AI-scribe-related queries via the public Reddit JSON API. A two-tier keyword filter retained posts mentioning at least one AI scribe term and one clinical or workflow term. Texts were embedded with all-MiniLM-L6-v2, reduced via UMAP, clustered with HDBSCAN, and labelled using BERTopic with c-TF-IDF keyword extraction. Noise topics matching predefined off-topic patterns (for example, residency match, finance) were removed. Themes were assigned concise labels via Claude Sonnet 4. Sentiment was classified per post using cardiffnlp/twitter-roberta-base-sentiment-latest.

Results: After filtering, 176 unique relevant posts from seven active subreddits were retained, with r/FamilyMedicine (n = 64) and r/healthIT (n = 34) dominating. BERTopic produced 12 coherent themes spanning workflow integration, vendor comparison (DAX, Heidi, Freed, Abridge), HIPAA and privacy, mobile and device use, templates and formatting, and research versus clinical use. Overall sentiment was 61.4% neutral, 21.6% positive, and 17.0% negative. The most net-positive theme was DAX/Nuance/AI tools (about 55% positive); the most net-negative were charting fatigue and the freed-AI-scribes discussion thread (about 37 to 40% negative). Engagement (median upvotes and comments) was highest for tool-comparison and pricing themes, indicating salience of practical adoption questions.

Conclusions: Clinician sentiment toward ambient AI scribes is cautiously favourable but dominated by neutral, problem-solving discourse. Vendor selection, cost, HIPAA compliance, and EHR integration are the most actively debated issues. These insights can inform implementation strategy, vendor benchmarking, and policy guidance for ambient documentation tools.
]]></description>
<dc:creator><![CDATA[ Shankar, R., Xu, Q. ]]></dc:creator>
<dc:date>2026-04-30</dc:date>
<dc:identifier>doi:10.64898/2026.04.26.26351798</dc:identifier>
<dc:title><![CDATA[Clinician Discourse on Ambient AI Scribes: A Reddit-based Topic Modelling and Sentiment Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-30</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
