﻿<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://medrxiv.org">
<admin:errorReportsTo rdf:resource="mailto:medrxiv@cshlpress.edu"/>
<title>medrxiv Subject Collection: Health Informatics</title>
<link>http://medrxiv.org</link>
<description>
This feed contains articles for medRxiv Subject Collection "Health Informatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.07.26350297v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.07.26350263v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.06.26350257v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.07.26350300v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.02.26349884v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.04.26350180v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.05.26350190v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.03.26350114v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.03.26350116v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.03.26350117v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.03.26350138v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.03.26350102v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.02.26350080v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.02.26350065v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.03.26350034v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.02.26350091v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.31.26349906v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.04.01.26349920v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.31.26349842v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.27.26349538v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.31.26349817v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.31.26349827v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.31.26349861v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.31.26349766v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.31.26349853v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.30.26349782v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.30.26349756v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.30.26349388v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.30.26349749v1?rss=1"/>
<rdf:li rdf:resource="https://www.medrxiv.org/content/10.64898/2026.03.28.26349522v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>medrxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>medrxiv</title>
<url>https://www.medrxiv.org/sites/default/files/medrxiv_internal_logo.png</url>
<link>http://medrxiv.org</link>
</image>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.07.26350297v1?rss=1">
<title>
<![CDATA[
An End-to-End Synthetic Oncology Clinical Trial Framework Integrating Radiographic Response, Circulating Tumor DNA, Safety, and Survival for Decision-Oriented Clinical Data Science 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.07.26350297v1?rss=1
</link>
<description><![CDATA[
Background: Modern oncology development depends on integrating radiographic response, molecular biomarkers, treatment exposure, safety, and survival endpoints, yet access to well-structured patient-level trial data is often limited. Methods: We developed a synthetic, literature-informed phase II randomized oncology trial framework that followed the sequence Patient → Data → Dataset → Analysis → Tables/Figures → Decision. A cohort of randomized patients was simulated with baseline demographic and disease features, longitudinal tumor measurements, circulating tumor DNA, inflammatory and exploratory biomarkers, adverse events, treatment exposure, and survival outcomes. Raw source datasets were transformed into SDTM-like domains and ADaM-like analysis datasets, then analyzed for baseline characteristics, exposure, best overall response, survival, subgroup hazard ratios, longitudinal tumor and biomarker changes, exposure-response, and safety. Results: The treatment arm showed a coherent efficacy signal across multiple analytical layers. Treatment increased objective response and clinical benefit, reduced tumor burden over time, and prolonged survival. Median overall survival increased from 135 days in the control arm to 288 days in the treatment arm, with an approximate hazard ratio of 0.661 (95% CI, 0.480-0.911; p = 0.011). Median progression-free survival increased from 116 to 208 days, with an approximate hazard ratio of 0.601 (95% CI, 0.418-0.864; p = 0.006). Circulating tumor DNA showed a more favorable trajectory in treated patients and aligned directionally with radiographic and survival benefit. Safety analyses showed increased treatment-related toxicity, but the overall safety profile remained interpretable and compatible with continued development.
Conclusions: This study demonstrates that a synthetic, literature-informed oncology trial can reproduce a biologically plausible and analytically coherent efficacy-safety signal architecture across radiographic, molecular, and time-to-event endpoints, providing a decision-oriented prototype for translational oncology clinical data science. Keywords: synthetic clinical trial, oncology, ctDNA, Kaplan-Meier, biomarker, survival analysis, translational data science, ADaM, SDTM
]]></description>
<dc:creator><![CDATA[ Petalcorin, M. I. R. ]]></dc:creator>
<dc:date>2026-04-08</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.26350297</dc:identifier>
<dc:title><![CDATA[An End-to-End Synthetic Oncology Clinical Trial Framework Integrating Radiographic Response, Circulating Tumor DNA, Safety, and Survival for Decision-Oriented Clinical Data Science]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.07.26350263v1?rss=1">
<title>
<![CDATA[
Attitudes and Perceptions Toward the Use of Artificial Intelligence Chatbots for Peer Review in Medical Journals: A Large-Scale, International Cross-Sectional Survey 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.07.26350263v1?rss=1
</link>
<description><![CDATA[
Background: Artificial intelligence chatbots (AICs), as a form of generative artificial intelligence (AI), are increasingly being considered for use in scholarly peer review to assist with tasks such as identifying methodological issues, verifying references, and improving language clarity. Despite these potential benefits, concerns remain regarding their reliability, ethical implications, and transparency. Evidence on how medical journal peer reviewers perceive the role and impact of AICs is limited. This study explored reviewers' familiarity with AICs, perceived benefits and challenges, ethical concerns, and anticipated future roles in peer review.

Methods: We conducted a cross-sectional online survey of medical journal peer reviewers. Corresponding author information was extracted from MEDLINE-indexed articles added to PubMed within a two-month period using an R-based approach. A total of 72,851 authors were invited via email to participate; those who self-identified as peer reviewers were eligible. The 29-item survey assessed familiarity with AICs and perceptions of their benefits and limitations in peer review. The survey was administered via SurveyMonkey from April 28 to June 16, 2025, with two reminder emails sent during the data collection period.

Results: A total of 1,260 respondents completed the survey. Most participants were familiar with AICs (86.2%) and had used tools such as ChatGPT for general purposes (87.7%), but the majority had not used AICs for peer review (70.3%). Most respondents reported that their institutions do not provide training on AIC use in peer review (69.5%), although many expressed interest in such training (60.7%). Perceptions of AIC benefits were mixed, while concerns were widely shared, particularly regarding potential algorithmic bias (80.3%) and issues related to trust and user acceptance (73.3%).

Conclusions: While familiarity with AICs is high among medical journal peer reviewers, their use in peer review remains limited. There is clear interest in training and guidance; however, concerns related to ethics, data privacy, and research integrity persist and should be addressed before broader implementation.
]]></description>
<dc:creator><![CDATA[ Ng, J. Y., Bhavsar, D., Dhanvanthry, N., Bouter, L., Chan, T., Cramer, H., Flanagin, A., Iorio, A., Lokker, C., Maisonneuve, H., Marusic, A., Moher, D. ]]></dc:creator>
<dc:date>2026-04-07</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.26350263</dc:identifier>
<dc:title><![CDATA[Attitudes and Perceptions Toward the Use of Artificial Intelligence Chatbots for Peer Review in Medical Journals: A Large-Scale, International Cross-Sectional Survey]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.06.26350257v1?rss=1">
<title>
<![CDATA[
Perception of Safety in Behavioral Health Crisis Units among Patients and Care Partners versus Artificial Intelligence (AI): A Multimethod Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.06.26350257v1?rss=1
</link>
<description><![CDATA[
Background: Safety is a critical concern in behavioral health crisis units (BHCUs), where environmental risks (e.g., ligature points) can lead to injury to self or others. However, limited research has examined how perceived safety influences facility selection among patients and care partners, or how these perceptions align with AI-driven safety risk assessments in such environments.

Method: To address these gaps, a nationwide discrete choice online survey was conducted using image-based scenarios of BHCU environments, in which participants selected preferred facilities based on a range of attributes, including environmental safety risks (e.g., ligature points). Additionally, participants identified safety risks in survey images, which were compared with outputs from an AI-driven tool developed and trained by experts to detect environmental risks. Quantitative analysis using conditional logit models examined the influence of attributes on facility choice, while spatial comparisons of annotated images and heatmaps assessed the alignment between participant-identified and AI-identified risks.

Results: Findings revealed that a higher frequency of safety risks in images significantly reduced the likelihood of facility selection (p < .001, OR ≈ 1.28), highlighting the importance of perceived safety in user decision-making. While there was notable alignment between heatmaps generated by participants and AI, key differences emerged, suggesting that participants' safety perception was influenced by features not fully captured by AI, such as the type of materials or unknown, out-of-label safety risks in facility images.

Conclusions: Despite these limitations, results highlighted the value of integrating AI-driven assistive tools for non-expert user safety risk assessment to support decision-making for safer BHCU environments.
]]></description>
<dc:creator><![CDATA[ Jafarifiroozabadi, R. ]]></dc:creator>
<dc:date>2026-04-07</dc:date>
<dc:identifier>doi:10.64898/2026.04.06.26350257</dc:identifier>
<dc:title><![CDATA[Perception of Safety in Behavioral Health Crisis Units among Patients and Care Partners versus Artificial Intelligence (AI): A Multimethod Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.07.26350300v1?rss=1">
<title>
<![CDATA[
High-Throughput Observational Evidence Generation Using Linked Electronic Health Record and Claims Data 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.07.26350300v1?rss=1
</link>
<description><![CDATA[
Background: The observational literature on comparative effectiveness is expanding rapidly but remains difficult to synthesize. Discordant findings often stem from structural differences in cohort definitions, inclusion criteria, and follow-up windows, leaving stakeholders without a cohesive evidence base. Furthermore, studies typically focus on a narrow subset of outcomes, neglecting the broader needs of diverse healthcare stakeholders [1-4]. Methods: We developed a high-throughput evidence generation workflow using linked EHR and administrative claims data. The cornerstone is a prespecified measurement architecture applied uniformly across clinical scenarios: six post-index windows (acute to two-year follow-up); 28 Elixhauser comorbidities; 14 healthcare resource utilization (HCRU) categories; 29 laboratory measures with 52 binary thresholds; and 42 adverse event categories. We generated unadjusted treatment comparisons across ~1,038 outcomes per scenario, including effect-measure modification (EMM) assessments across 130 baseline features. Results: Across 40 clinical domains, the workflow produced approximately 32,982,552 outcome evaluations. Each evaluation comprised an effect estimate for one treatment comparison, outcome, and population, with uncertainty bounds and supporting diagnostics. Approximately 5,000 narrative summaries underwent structured clinical and statistical quality control before dissemination. Conclusions: Standardized, high-throughput workflows can shift evidence generation away from fragmented studies toward comprehensive evidence packages. This shared evidence base supports precision medicine by making treatment effect heterogeneity visible across clinically meaningful subpopulations, reducing the need for redundant, stakeholder-specific studies.
]]></description>
<dc:creator><![CDATA[ Gombar, S., Shah, N., Sanghavi, N., Coyle, J., Mukerji, A., Chappelka, M. ]]></dc:creator>
<dc:date>2026-04-07</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.26350300</dc:identifier>
<dc:title><![CDATA[High-Throughput Observational Evidence Generation Using Linked Electronic Health Record and Claims Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.02.26349884v1?rss=1">
<title>
<![CDATA[
Who is leading medical AI? A systematic review and scientometric analysis of chest x-ray research 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.02.26349884v1?rss=1
</link>
<description><![CDATA[
Computer vision models for chest X-ray interpretation hold significant promise for global healthcare, but their clinical value depends on equitable development across diverse populations. We conducted a scientometric analysis to examine authorship patterns, geographic distribution, and dataset origins to assess potential disparities that could affect clinical applicability. We systematically reviewed literature on computer vision applications for chest X-rays published between 2017 and 2025 across multiple databases, including PubMed, Embase, and SciELO. Using the Dimensions API and manual extraction, we analyzed 928 eligible studies, examining first and senior author affiliations, institutional contributions, dataset provenance, and collaboration patterns across income classifications based on World Bank categories. High-income countries dominated research leadership, representing 55.6% of first authors and 59.7% of senior authors; no first authors were affiliated with low-income countries. China (16.93%) and the United States (16.72%) led in first authorship positions. Most datasets (73.6%) originated from high-income settings, with the United States being the largest contributor (40.45%). Private datasets were most frequently used (20.52%). Cross-income collaborations were rare, with only 3.9% of publications involving partnerships between high-income and lower-middle-income countries. The findings reveal substantial disparities in who shapes computer vision research on chest X-rays and which populations are represented in training data. These imbalances risk producing AI systems that perform inconsistently across diverse healthcare settings, potentially exacerbating healthcare inequities. Addressing these disparities requires coordinated efforts to develop globally representative datasets, establish equitable international collaborations, and implement policies that promote inclusive research practices.
]]></description>
<dc:creator><![CDATA[ Vasquez-Venegas, C., Chewcharat, A., Kimera, R., Kurtzman, N., Leite, M., Woite, N. L., Muppidi, I. J., Muppidi, R. J., Liu, X., Ong, E. P., Pal, R., Myers, C., Salzman, S., Patscheider, J. S., John, T. R., Rogers, M., Samuel, M., Santana-Guerrero, J. L., Yaacob, S., Gameiro, R. R., Celi, L. A. ]]></dc:creator>
<dc:date>2026-04-07</dc:date>
<dc:identifier>doi:10.64898/2026.04.02.26349884</dc:identifier>
<dc:title><![CDATA[Who is leading medical AI? A systematic review and scientometric analysis of chest x-ray research]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.04.26350180v1?rss=1">
<title>
<![CDATA[
TELF: An End-to-End Temporal Encoder with Late Fusion for Interpretable Disease Risk Prediction from Longitudinal Real-World Data 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.04.26350180v1?rss=1
</link>
<description><![CDATA[
Deep learning models utilizing longitudinal healthcare data have significantly advanced epidemiological research. However, contemporary transformer-based models increasingly rely on computationally intensive pre-training steps that entail processing massive real-world datasets with cost-prohibitive hardware. We introduce the Temporal Encoder with Late Fusion (TELF), a lightweight end-to-end predictive model featuring an encoder-only architecture for processing medical codes, followed by post-encoder concatenation with demographic variables. TELF learns code embeddings on the fly, thereby bypassing the resource-intensive pre-training bottleneck. Furthermore, its late-fusion design preserves the integrity of the temporal attention mechanism before integrating static demographic predictors. We evaluated TELF using an administrative claims database across three distinct cohorts: pancreatic cancer (n=53,661), type 2 diabetes (n=78,756), and heart failure (n=72,540). TELF consistently outperformed traditional machine learning baselines, including XGBoost, LightGBM, and logistic regression. Specifically, TELF achieved AUCs of 0.9150, 0.8199, and 0.8721 for pancreatic cancer, type 2 diabetes, and heart failure, respectively, compared with 0.9044, 0.7908, and 0.8535 for XGBoost and 0.9014, 0.7800, and 0.8466 for logistic regression. Beyond predictive superiority, TELF's isolated temporal attention mechanism enables population-level motif mining. By extracting high-attention temporal sequences, we mapped aggregated patient journey pathways, revealing interpretable clinical trajectories preceding disease onset. Collectively, these results demonstrate that TELF provides a resource-efficient and accessible framework for advanced temporal modeling in clinical and epidemiological research.
]]></description>
<dc:creator><![CDATA[ Liu, Y., Zhang, Z. ]]></dc:creator>
<dc:date>2026-04-06</dc:date>
<dc:identifier>doi:10.64898/2026.04.04.26350180</dc:identifier>
<dc:title><![CDATA[TELF: An End-to-End Temporal Encoder with Late Fusion for Interpretable Disease Risk Prediction from Longitudinal Real-World Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.05.26350190v1?rss=1">
<title>
<![CDATA[
Sound of Aging: Large-Scale Evidence for a Voice-Based Biological Clock 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.05.26350190v1?rss=1
</link>
<description><![CDATA[
Using 30-second voice recordings from 7,081 adults aged 40-70, we trained gender-specific models to estimate voice-predicted age (Voice Age). Voice Age correlated with chronological age comparably to established omic and physiological aging clocks, while capturing an independent dimension of biological aging. Accelerated vocal aging was associated with higher adiposity, impaired sleep physiology, and cardiometabolic risk markers, supporting voice as a scalable, non-invasive functional aging biomarker.
]]></description>
<dc:creator><![CDATA[ Krongauz, D., Marmor, Y., Zulti, A., Godneva, A., Weinberger, A., Segal, E. ]]></dc:creator>
<dc:date>2026-04-06</dc:date>
<dc:identifier>doi:10.64898/2026.04.05.26350190</dc:identifier>
<dc:title><![CDATA[Sound of Aging: Large-Scale Evidence for a Voice-Based Biological Clock]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.03.26350114v1?rss=1">
<title>
<![CDATA[
Perioperative Mortality Prediction Using a Bayesian Ensemble with Prevalence-Adaptive Gating 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.03.26350114v1?rss=1
</link>
<description><![CDATA[
Background: Perioperative mortality prediction in resource-limited surgical settings remains challenging due to class imbalance, missing data, and the heterogeneity of postoperative complications. Existing risk scores such as POSSUM depend on intraoperative variables and do not quantify prediction uncertainty.

Methods: We developed a prevalence-adaptive Bayesian ensemble comprising three stochastic models -- a classifier Variational Autoencoder (VAE, AUC=0.95), a Flipout Last Layer network (AUC=0.84), and a Monte Carlo Dropout network (AUC=0.80) -- trained on 697 patients (39 deaths, prevalence 5.59%) with 67 preoperative and postoperative features. Class imbalance (16.9:1) was addressed through Variational Autoencoder augmentation -- two class-conditional generative VAEs produced 619 synthetic survivor and 619 synthetic death records, yielding a balanced training corpus of 1,935 samples. VAE augmentation was selected over SMOTE and random oversampling after a comparative study (F1: random oversampling 0.61 vs VAE augmentation 0.77). Validation used a held-out set of 233 patients (13 deaths, 220 survivors). A six-stage prediction pipeline incorporated weighted base risk, a three-path prevalence-adaptive gate, Shannon entropy uncertainty quantification, and rank-transform calibration. Sensitivity analysis was conducted across all six empirically derived hyperparameters. A whole-cohort death audit evaluated all 52 deaths from the complete 930-patient dataset through the deployed clinical decision support system. Statistical analysis included Kruskal-Wallis testing of entropy across triage groups, Wilson score confidence intervals for performance metrics, and Spearman rank correlation for LIME-SHAP interpretability concordance.

Results: On the validation cohort the ensemble achieved complete separation (sensitivity 100%, specificity 100%, Youden J=1.000; TP=13, FP=0, TN=220, FN=0). The whole-cohort death audit identified 36 of 52 deaths (sensitivity 69.2%, 95% CI 55.7%-80.1%; precision 100%, 95% CI 90.4%-100.0%; F1=0.818, bootstrap 95% CI 0.732-0.894). Shannon entropy differed significantly across triage levels (Kruskal-Wallis H(2)=24.212, p<0.001, ε²=0.453), confirming a monotone gradient SAFE < CRITICAL < GRAY ZONE. All six hyperparameters were invariant across their tested ranges (J=1.000 throughout; Supplementary Tables S1-S2). LIME and SHAP rankings showed statistically significant concordance (Spearman ρ=0.440, p=0.024; Kendall τ=0.357, p=0.011), with 4 of 6 principal mortality determinants shared across both methods.

Table S1. Sensitivity analysis of the VAE gate multiplier k. VAE_RECOVERY_THR = 0.2808 + k × 0.3799. Bold row = nominal value. Validation cohort n=233 (13 deaths, 220 survivors). Death audit n=52 deaths only.

Table S2. Sensitivity analysis of all six hyperparameters. All values in stable zone produce J=1.000 (TP=13, FP=0, TN=220, FN=0) on the validation cohort.

Conclusions: A prevalence-adaptive Bayesian ensemble with entropy-based uncertainty triage achieves zero false positive alerts and clinically meaningful audit sensitivity in perioperative mortality prediction. Complete hyperparameter invariance confirms that reported performance reflects structural properties of the calibration architecture. The 16 missed deaths represent feature-invisible cases beyond current observational feature capacity.
]]></description>
<dc:creator><![CDATA[ Pandey, A. K. ]]></dc:creator>
<dc:date>2026-04-06</dc:date>
<dc:identifier>doi:10.64898/2026.04.03.26350114</dc:identifier>
<dc:title><![CDATA[Perioperative Mortality Prediction Using a Bayesian Ensemble with Prevalence-Adaptive Gating]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.03.26350116v1?rss=1">
<title>
<![CDATA[
CD276 in Meningioma Transcriptomic Classification: Internal Development, External Validation, and Stability-Informed Interpretation 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.03.26350116v1?rss=1
</link>
<description><![CDATA[
Background: CD276 has been proposed as a candidate gene associated with the biological characteristics of meningioma, but its predictive position and interpretive significance within a transcriptomic classifier have not yet been clearly established. Accordingly, this study aimed to evaluate CD276 stepwise across internal model development, external validation, calibration, decision-analytic assessment, feature stability, and robustness analyses using public transcriptomic cohorts.

Methods: The analyses in this study were organized into two interconnected notebooks. In Notebook A, we reconstructed the internal training cohort (GSE183653), evaluated the CD276 single-gene signal, and then developed a transcriptome-wide multigene classifier. We also performed permutation importance, bootstrap confidence interval, label permutation test, repeated cross-validation, CD276 ablation, and internal calibration analyses. In Notebook B, we reproduced the external validation cohort (GSE136661) in a fixed common-gene space, applied train-only recalibration and train-only threshold transfer, and extended the interpretation through decision curve analysis, stability analysis, enrichment analysis, and one-factor-at-a-time robustness analysis.

Results: The internal training cohort consisted of 185 samples (25 of them WHO grade III cases) and 58,830 genes. CD276 expression showed a significant association with WHO grade, but the internal discrimination of the CD276-only baseline was limited (ROC-AUC 0.628, average precision 0.323, balanced accuracy 0.540). In contrast, the initial transcriptome-wide model showed ROC-AUC 0.834 and PR-AUC 0.509, and under 5-fold cross-validation, the canonical full-transcriptome model and the CD276-forced 5,001-feature branch showed mean ROC-AUC/PR-AUC of 0.854/0.564 and 0.855/0.606, respectively, outperforming the CD276-only baseline at 0.644/0.391. CD276 was not included in the initial 5,000-feature filtered set and ranked 900th among 5,001 features even in the forcibly included 5,001-feature branch. In paired ablation analysis, the performance difference attributable to inclusion of CD276 was effectively zero (delta ROC-AUC 0.000062, delta PR-AUC 0.000056). Internal calibration analysis showed an overconfident probability pattern (Brier score 0.10501, intercept -1.421392, slope 0.413241). In external validation, the fixed multigene pipeline achieved ROC-AUC 0.928 and PR-AUC 0.335. Train-only recalibration improved calibration metrics while preserving discrimination, and decision curve analysis showed threshold-dependent but limited external utility. Stability analysis showed overlap between core-stable genes and high-impact genes, but CD276 was not supported as a dominant stable core feature and remained in the target-of-interest tier. In robustness analysis, some perturbations preserved the primary interpretation, whereas others revealed transform sensitivity or an alternative high-performing feature-space solution.

Conclusions: CD276 is a gene of interest associated with meningioma grade, but it could not readily be interpreted as a strong standalone predictor or a dominant stable classifier feature. In this study, the main basis of predictive performance lay not in CD276 alone but in a broader multigene transcriptomic structure, and probability output needed to be interpreted conservatively with calibration taken into account. These findings position CD276 not as a direct single-gene classifier but as a biology-motivated target-of-interest that should be interpreted within a broader transcriptomic program.
]]></description>
<dc:creator><![CDATA[ Lee, H., Kim, H. ]]></dc:creator>
<dc:date>2026-04-05</dc:date>
<dc:identifier>doi:10.64898/2026.04.03.26350116</dc:identifier>
<dc:title><![CDATA[CD276 in Meningioma Transcriptomic Classification: Internal Development, External Validation, and Stability-Informed Interpretation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.03.26350117v1?rss=1">
<title>
<![CDATA[
Reproducibility and Robustness of Large Language Models for Mobility Functional Status Extraction 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.03.26350117v1?rss=1
</link>
<description><![CDATA[
Clinical narrative text contains crucial patient information, yet reliable extraction remains challenging due to linguistic variability, documentation habits, and differences across care settings. Large language models (LLMs) have shown strong accuracy on clinical information extraction (IE), but their reproducibility (stability under repeated runs) and robustness (stability under small, natural prompt variations) are less consistently quantified, despite being central to clinical deployment. In this study, we evaluate three open-weight LLMs representing distinct modeling choices: a dense general-purpose model (Llama 3.3), a mixture-of-experts (MoE) general-purpose model (Llama 4), and a domain-tuned medical model (MedGemma). We focus on binary clinical IE aligned with four mobility classes from the International Classification of Functioning, Disability and Health (ICF) framework. Using a controlled experimental design, we quantify (1) intra-prompt reproducibility across repeated sampling and (2) inter-prompt robustness across paraphrased prompts. We jointly report predictive performance (F1-score) and stability (Fleiss' kappa), and we test factor effects using three-way ANOVA with post-hoc comparisons. Results show that increasing temperature generally degrades agreement, but the magnitude depends on model and task; furthermore, prompt paraphrasing can substantially reduce stability, with particularly large drops for the MoE model. Finally, we evaluate a practical mitigation, self-consistency via majority voting, which improves κ substantially and often improves or preserves F1-score, at the cost of additional inference. Together, these findings provide a reproducible framework and concrete recommendations for evaluating and improving LLM reliability in clinical IE.
]]></description>
<dc:creator><![CDATA[ Liu, X., Garg, M., Jeon, E., Jia, H., Sauver, J. S., Pagali, S. R., Sohn, S. ]]></dc:creator>
<dc:date>2026-04-05</dc:date>
<dc:identifier>doi:10.64898/2026.04.03.26350117</dc:identifier>
<dc:title><![CDATA[Reproducibility and Robustness of Large Language Models for Mobility Functional Status Extraction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.03.26350138v1?rss=1">
<title>
<![CDATA[
Electronic Health Record-Based Estimation of Kansas City Cardiomyopathy Questionnaire Scores in Heart Failure 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.03.26350138v1?rss=1
</link>
<description><![CDATA[
Background: The Kansas City Cardiomyopathy Questionnaire (KCCQ) is a validated patient-reported outcome measure for heart failure. However, its clinical utility is limited by incomplete and inconsistent data collection. We aimed to develop and validate machine learning models to estimate KCCQ overall summary scores from electronic health record (EHR) data.

Methods: We assembled a retrospective cohort of 10,889 heart failure patients with recorded KCCQ scores from the Truveta database. Predictor features were derived from structured EHR variables across 13 historical time windows (15-360 days). Multiple regression algorithms were evaluated, followed by SHapley Additive exPlanations (SHAP)-based feature reduction and nested cross-validation for hyperparameter optimization. Model performance was assessed using the coefficient of determination (R²), mean absolute error (MAE), and ordinal discrimination and calibration for categorical severity classification.

Results: Histogram-based gradient boosting (HGB) with HGB-SHAP feature selection achieved the strongest performance, reducing feature dimensionality by more than 94% while maintaining estimation accuracy. The 240-day window performed best (R²=0.522, MAE=12.485). For categorical severity classification, the model demonstrated strong ordinal discrimination (mean ordinal AUROC=0.850). Quantile-based calibration improved classification balance, increasing the F1-score for the most severe category (KCCQ<25) from 0.180 to 0.428 and the quadratic weighted kappa from 0.601 to 0.640. Longer EHR observation windows were associated with improved prediction performance.

Conclusion: Machine learning models can estimate KCCQ scores from routine EHR data with clinically meaningful accuracy and strong discriminatory performance. This approach may help extend assessment of patient-reported health status to populations in which survey-based data are incompletely captured, supporting population-level cardiovascular outcomes assessment and risk stratification in heart failure care.

What is Known:
- Patient-reported outcomes, such as the Kansas City Cardiomyopathy Questionnaire (KCCQ), are essential for assessing health status in patients with heart failure but are inconsistently collected in routine clinical practice, limiting their use for population-level monitoring and outcomes assessment.
- Despite their clinical importance, prior studies have primarily used KCCQ scores as predictors of downstream clinical outcomes, and few approaches have been developed to estimate KCCQ scores directly from electronic health record (EHR) data.

What the Study Adds:
- This study presents a rigorously validated machine learning framework for estimating KCCQ scores from routinely collected EHR data, achieving clinically meaningful accuracy across multiple temporal windows with substantial feature reduction.
- The proposed approach incorporates post-hoc calibration to improve identification of patients with severe functional impairment, supporting scalable assessment of patient-reported health status in settings where survey data are incomplete or unavailable.
]]></description>
<dc:creator><![CDATA[ Kim, Y. W., Lau, W., Patel, N., Kendrick, K., Wu, A., Feldman, T., Ahern, R., Oka, A. ]]></dc:creator>
<dc:date>2026-04-05</dc:date>
<dc:identifier>doi:10.64898/2026.04.03.26350138</dc:identifier>
<dc:title><![CDATA[Electronic Health Record-Based Estimation of Kansas City Cardiomyopathy Questionnaire Scores in Heart Failure]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.03.26350102v1?rss=1">
<title>
<![CDATA[
Automated detection of adult autism from vowel acoustics using machine learning 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.03.26350102v1?rss=1
</link>
<description><![CDATA[
Autism spectrum disorder (ASD) is a neurodevelopmental condition for which timely and accurate detection remains a major clinical priority. Early and reliable identification is important because it can facilitate access to assessment, diagnosis, and appropriate support; however, current diagnostic pathways still rely largely on behavioural evaluation and clinical judgement. In this context, machine-learning (ML) approaches have attracted growing interest because they can identify subtle and complex patterns in speech data that may not be easily captured through conventional methods. The current study capitalizes on this potential by developing and evaluating ML models for distinguishing autistic individuals from neurotypical individuals based on speech features. More specifically, acoustic features of vowels, including fundamental frequency (F0), first three formants (F1, F2, F3), duration, jitter, shimmer, harmonics-to-noise ratio (HNR), and intensity, were elicited from 18 autistic adults and 18 neurotypical adults through a controlled production task. Then, four supervised ML models were trained and evaluated on these features: LightGBM, Random Forest, Support Vector Machine, and XGBoost. All models demonstrated good classification performance, with the best-performing model achieving a strong discriminability of 89%. The explainability analysis identified F0 as the most influential predictor by a substantial margin, followed by intensity, F3, and F1, while duration, shimmer, HNR, jitter, and F2 contributed more modestly. These findings demonstrate that vowel acoustics contain clinically relevant information for distinguishing autistic from neurotypical adult speech and highlight the potential of interpretable, speech-based ML as a transparent and scalable aid for ASD screening and assessment.
]]></description>
<dc:creator><![CDATA[ Georgiou, G. P., Paphiti, M. ]]></dc:creator>
<dc:date>2026-04-04</dc:date>
<dc:identifier>doi:10.64898/2026.04.03.26350102</dc:identifier>
<dc:title><![CDATA[Automated detection of adult autism from vowel acoustics using machine learning]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.02.26350080v1?rss=1">
<title>
<![CDATA[
Med-ICE: Enhancing Factual Accuracy in Medical AI through Autonomous Multi-Agent Consensus 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.02.26350080v1?rss=1
</link>
<description><![CDATA[
The integration of Large Language Models into high-stakes clinical workflows is critically hampered by their lack of verifiable reliability and tendency to generate hallucinations. This paper introduces Med-ICE, an autonomous framework designed to enhance the reliability of LLMs for medical applications. Med-ICE adapts the Iterative Consensus Ensemble paradigm, enabling a group of peer LLM agents to collaboratively converge on a final answer through iterative rounds of generation and peer review, thereby eliminating the need for an external arbiter and its associated scalability bottleneck. Our work makes three key contributions: (1) a novel semantic consensus mechanism that determines agreement based on semantic similarity, crucial for nuanced clinical language; (2) demonstration of state-of-the-art performance, where Med-ICE significantly outperforms both direct single-LLM generation and the Self-Refinement technique on challenging medical benchmarks; and (3) a highly efficient and scalable architecture, as our Semantic Consensus Monitor is computationally lightweight. This research establishes a new standard for developing safer, more trustworthy LLM systems, paving the way for their responsible integration into medicine.
]]></description>
<dc:creator><![CDATA[ Chen, Z., Wu, R., Liu, Y., Li, R., Duprey, A. ]]></dc:creator>
<dc:date>2026-04-04</dc:date>
<dc:identifier>doi:10.64898/2026.04.02.26350080</dc:identifier>
<dc:title><![CDATA[Med-ICE: Enhancing Factual Accuracy in Medical AI through Autonomous Multi-Agent Consensus]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.02.26350065v1?rss=1">
<title>
<![CDATA[
Enhancing Medical Knowledge in Large Language Models via Supervised Continued Pretraining on Clinical Notes 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.02.26350065v1?rss=1
</link>
<description><![CDATA[
Background: Large language models (LLMs) contain limited professional medical knowledge, as large-scale training on clinical text has not yet been possible due to restricted access.

Objectives: To continue pre-training an open-access instruct LLM on de-identified medical notes and evaluate the resulting impact on real-world clinical decision-making tasks and standard benchmarks.

Methods: Using 500K de-identified clinical notes from Cedars-Sinai Health System, we fine-tuned a Qwen3-4B Instruct model with supervised learning to generate medical decision-making (MDM) paragraphs from patient presentations, and evaluated it on assigned-diagnosis prediction, in-hospital cardiac-arrest mention detection, and a suite of general and biomedical benchmarks.

Results: The fine-tuned model produced MDMs that closely resembled those written by physicians and outperformed the base-instruct model and larger clinically untrained models (Qwen3-32B and Llama-3.1-405B Instruct) on assigned-diagnosis prediction, the task most aligned with its training objective. On the task of detecting in-hospital cardiac arrest mentions, the model initially exhibited mild label collapse, but a brief task-specific fine-tuning stage resolved this issue and allowed it to surpass all competitors. The model also retained its general knowledge overall on biomedical and general-domain evaluation benchmarks relative to the baseline.

Conclusion: Supervised full fine-tuning on clinical notes allowed the model to incorporate medical knowledge and transfer it to unseen biomedical tasks without wholesale loss of general-domain abilities, while revealing collapse-related failure modes that motivate more principled strategies for clinical specialization.
]]></description>
<dc:creator><![CDATA[ Weissenbacher, D., Shabbir, M., Campbell, I. M., Berdahl, C. T., Gonzalez-Hernandez, G. ]]></dc:creator>
<dc:date>2026-04-04</dc:date>
<dc:identifier>doi:10.64898/2026.04.02.26350065</dc:identifier>
<dc:title><![CDATA[Enhancing Medical Knowledge in Large Language Models via Supervised Continued Pretraining on Clinical Notes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.03.26350034v1?rss=1">
<title>
<![CDATA[
Multi-Task Learning and Soft-Label Supervision for Psychosocial Burden Profiling in Cancer Peer-Support Text 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.03.26350034v1?rss=1
</link>
<description><![CDATA[
Objective: Online cancer peer-support text contains signals of psychosocial burden beyond emotional tone, including treatment burden, financial strain, uncertainty, and unmet support needs. We evaluated 2 modeling extensions: multi-task learning (MTL) for joint prediction of health economics and outcomes research (HEOR) burden dimensions, and soft-label supervision using large language model (LLM)-derived probability distributions.

Materials and Methods: We analyzed 10,392 cancer peer-support posts. GPT-4o-mini generated proxy annotations for HEOR burden subscales, composite burden, high-need status, speaker role, cancer type, and emotion probabilities. Study 1 trained a shared ALBERT encoder under 4 MTL conditions: composite and subscale burden targets, each with and without auxiliary heads, using Kendall uncertainty weighting. Study 2 compared soft-label training on LLM emotion distributions with hard-label baselines under regular and token-augmented inputs, evaluating performance against both human labels and AI distributions.

Results: Composite-only MTL achieved R²=0.446 for burden regression and weighted F1=0.810 for high-need screening; subscale classification achieved mean weighted F1=0.646. Adding auxiliary role and cancer-type heads reduced regression performance (ΔR² = -0.209). Soft-label training reduced weighted F1 by 0.16 versus hard-label baselines (0.68 vs. 0.86), and token augmentation did not improve performance under soft supervision.

Discussion: Composite-only MTL supported modeling of multidimensional burden-related signals from forum text, whereas auxiliary prediction heads appeared to compete with primary tasks. Soft-label training aligned poorly with human-labeled emotion categories, suggesting that uncalibrated LLM distributions may propagate bias rather than improve supervision.

Conclusion: Composite-only MTL was the strongest burden-modeling approach, and hard-label supervision remained preferable for emotion classification.
]]></description>
<dc:creator><![CDATA[ Wang, Z., Cao, Y., Shen, X., Ding, Z., Liu, Y., Zhang, Y. ]]></dc:creator>
<dc:date>2026-04-04</dc:date>
<dc:identifier>doi:10.64898/2026.04.03.26350034</dc:identifier>
<dc:title><![CDATA[Multi-Task Learning and Soft-Label Supervision for Psychosocial Burden Profiling in Cancer Peer-Support Text]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.02.26350091v1?rss=1">
<title>
<![CDATA[
Citation Hallucination Determines Success: An Empirical Comparison of Six Medical AI Research Systems 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.02.26350091v1?rss=1
</link>
<description><![CDATA[
Large language model (LLM) systems can now generate complete research manuscripts, yet their reliability in clinical medicine -- where citation accuracy and reporting standards carry direct consequences -- has not been systematically assessed. We introduce MedResearchBench, a benchmark of three clinical epidemiology tasks built on NHANES data, and use it to evaluate six AI research systems across six quality dimensions. Evaluation combines programmatic citation verification, rule-based reporting compliance checks, and multi-model LLM judging, providing a more discriminative assessment than conventional single-judge approaches.

Citation integrity emerged as the decisive quality dimension. Hallucination rates ranged from 2.9% to 36.8% across systems, and a hard-rule threshold on per-task citation scores capped four of six systems' total scores at the penalty ceiling. Adding a multi-agent citation verification and repair pipeline to the best-performing system improved its citation integrity score from 40.0 to 90.9 and raised the weighted total from 68.9 to 81.8. Strikingly, a single-model evaluation ranked this system last (55.5), while our three-tier framework ranked it first (81.8) -- a complete reversal that exposes the limitations of subjective LLM-only evaluation.

These results suggest that programmatic citation verification should be a core metric in future evaluations of AI scientific writing systems, and that multi-agent quality assurance can bridge the gap between fluent text generation and trustworthy scholarship.
]]></description>
<dc:creator><![CDATA[ Shi, X., Tian, Z., Tan, S., Wang, X. ]]></dc:creator>
<dc:date>2026-04-04</dc:date>
<dc:identifier>doi:10.64898/2026.04.02.26350091</dc:identifier>
<dc:title><![CDATA[Citation Hallucination Determines Success: An Empirical Comparison of Six Medical AI Research Systems]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.31.26349906v1?rss=1">
<title>
<![CDATA[
Corpus for Benchmarking Clinical Speech De-identification 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.31.26349906v1?rss=1
</link>
<description><![CDATA[
Objectives: Publicly available datasets dedicated to clinical speech de-identification tasks remain scarce due to privacy constraints and the complexity of speech-level annotation. To address this gap, we compiled the SREDH-AICup sensitive health information (SHI) speech corpus, a time-aligned clinical speech dataset annotated across 38 SHI categories.

Methods: Two publicly available English medical-domain datasets were adapted to support speech-level de-identification, including script reformulation and controlled re-recording by 25 participants. Additional Mandarin Chinese clinical-style materials were incorporated to extend linguistic coverage. All audio data were annotated with millisecond-level, time-aligned SHI spans using Label Studio. Inter-annotator agreement was evaluated using Cohen's kappa, following iterative calibration rounds. The resulting corpus supports both automatic speech recognition (ASR) and speech-level recognition of SHIs.

Results: The final dataset comprises 20 hours of annotated audio, divided into training (10 hours, 1,539 files), validation (5 hours, 775 files), and test (5 hours, 710 files) subsets, totalling 7,830 SHI entities. The language distribution reflects the composition of the selected source materials, with 19.36 hours of English and 0.89 hours of Mandarin Chinese speech.

Discussion: The corpus exhibits a long-tail distribution consistent with clinical documentation patterns and highlights the limited availability of Chinese medical speech resources. These characteristics underscore both the realism of the dataset and structural challenges associated with multilingual speech de-identification.

Conclusion: The SREDH-AICup SHI speech corpus provides a clinically grounded, time-aligned speech dataset supporting automated medical speech de-identification research and facilitating future development of multilingual speech-based privacy protection systems.

Key message

What is already known on this topic:
- There is a scarcity of publicly available clinical speech datasets containing time-aligned sensitive health information (SHI) annotations for de-identification research.


What this study adds:
- The SREDH-AICup SHI speech corpus introduces a clinically grounded speech dataset with millisecond-level, time-aligned SHI annotations across 38 categories.
- The corpus integrates structured annotation protocols and standardised data processing to support reproducible benchmarking of speech-based de-identification models.


How this study might affect research, practice or policy:
- The availability of time-aligned SHI annotations may facilitate research on real-time or streaming de-identification systems beyond conventional transcription-focused approaches.
- The dataset may support the development of multilingual privacy-preserving technologies in clinical speech environments.
]]></description>
<dc:creator><![CDATA[ Dai, H.-J., Fang, L.-C., Mir, T. H., Chen, C.-T., Feng, H.-H., Lai, J.-R., Hsu, H.-C., Nandy, P., Panchal, O., Liao, W.-H., Tien, Y.-Z., Chen, P.-Z., Lin, Y.-R., Jonnagaddala, J. ]]></dc:creator>
<dc:date>2026-04-03</dc:date>
<dc:identifier>doi:10.64898/2026.03.31.26349906</dc:identifier>
<dc:title><![CDATA[Corpus for Benchmarking Clinical Speech De-identification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.04.01.26349920v1?rss=1">
<title>
<![CDATA[
Counterfactual prediction of treatment effects on irregular clinical data using Time-Aware G-Transformers 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.04.01.26349920v1?rss=1
</link>
<description><![CDATA[
Selecting an effective treatment relies on accurately anticipating patients' responses to alternative interventions. However, forecasting longitudinal clinical trajectories remains difficult because electronic health records contain heterogeneous, irregularly sampled data over extended time periods. These issues are especially relevant for laboratory measurements, which are central for diagnostics, assessment of therapeutic responses, and tracking disease progression in routine clinical practice. Yet existing deep learning methods for counterfactual prediction usually assume regularly sampled data, an assumption incompatible with the irregular, heterogeneous data-generation processes of real-world clinical practice. Here we present the Time-Aware G-Transformer, which integrates causal G-computation with time-aware attention to predict counterfactual outcomes on irregular data. By explicitly conditioning on the timing of future observations and encoding measurement patterns, the model captures temporal dynamics that previous methods overlook. Evaluated on synthetic tumor growth data and on 90,753 cancer patient trajectories from an academic medical center, our approach demonstrates superior long-horizon (>1 day) prediction accuracy and uncertainty calibration compared to state-of-the-art baselines. These results demonstrate that embedding temporal relations directly into the attention mechanism enables robust integration of patient history data for evaluating potential treatment strategies in personalized medicine.
]]></description>
<dc:creator><![CDATA[ Hornak, G., Heinolainen, A., Solyomvari, K., Silen, S., Renkonen, R., Koskinen, M. ]]></dc:creator>
<dc:date>2026-04-02</dc:date>
<dc:identifier>doi:10.64898/2026.04.01.26349920</dc:identifier>
<dc:title><![CDATA[Counterfactual prediction of treatment effects on irregular clinical data using Time-Aware G-Transformers]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.31.26349842v1?rss=1">
<title>
<![CDATA[
Development and Temporal Evaluation of Multimodal Machine Learning Models to Predict High Inpatient Opioid Exposure 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.31.26349842v1?rss=1
</link>
<description><![CDATA[
High inpatient opioid exposure is associated with increased risk of persistent opioid use. Early identification of high-risk patients may improve opioid stewardship. We developed machine learning models to predict high opioid exposure during hospitalization using electronic health record data from MIMIC-IV. We conducted a retrospective study of 223,452 unique first hospital admissions in MIMIC-IV. The outcome was high opioid exposure, defined as the top decile among opioid-exposed admissions (MME/day ≥ 225), representing 2.65% of all admissions. Structured early-admission features included demographics, admission characteristics, laboratory utilization and abnormality summaries, and 24-hour procedural indicators. Discharge-note data were incorporated using ClinicalBERT embeddings and interpretable bigram features. Models were trained using an 80/10/10 split and evaluated with temporal validation on the most recent 10% of admissions. Performance was assessed using ROC-AUC and PR-AUC with 95% confidence intervals. Among structured-only models, XGBoost achieved the best test performance (ROC-AUC 0.932 [0.924-0.940]; PR-AUC 0.223 [0.193-0.262]). The combined structured and notes model improved precision-recall performance (ROC-AUC 0.932 [0.920-0.943]; PR-AUC 0.276 [0.229-0.331]). Temporal evaluation showed similar discrimination (ROC-AUC 0.929; PR-AUC 0.223). High-risk bigrams included procedural terms such as "external fixation" and "cervical discectomy." Integration of structured and text-derived features improved risk stratification compared to structured data alone. Interpretable bigram signals reflected procedural complexity and orthopedic pathology, reinforcing the clinical plausibility of model predictions. Multimodal EHR-based models accurately predict high inpatient opioid exposure and may support targeted opioid stewardship during hospitalization.

AUTHOR SUMMARY

Opioid medications are commonly used in hospitals to treat pain, but some patients receive very high doses, which may increase their risk of long-term opioid use and of harms such as opioid addiction. Identifying patients at risk early during their hospital stay could help physicians make safer prescribing decisions. In this study, we used electronic health record data from over 220,000 hospital admissions to develop computer models that estimate the likelihood that a patient will receive high levels of opioids. We focused on information available within the first 24 hours of admission, including basic patient characteristics, laboratory testing patterns, and procedures. We also used information from clinical notes to capture additional context about patient care.

We found that these models were able to accurately identify patients at higher risk, and that combining structured data with information from clinical notes improved performance. Importantly, the patterns identified by the models, such as certain surgical procedures, were consistent with clinical expectations. Our findings suggest that routinely collected hospital data can be used to support earlier identification of patients at risk for high opioid exposure. This approach could help guide more targeted and cautious opioid prescribing practices in inpatient settings.
]]></description>
<dc:creator><![CDATA[ Kale, S., Singh, D., Truumees, E., Geck, M., Stokes, J. ]]></dc:creator>
<dc:date>2026-04-02</dc:date>
<dc:identifier>doi:10.64898/2026.03.31.26349842</dc:identifier>
<dc:title><![CDATA[Development and Temporal Evaluation of Multimodal Machine Learning Models to Predict High Inpatient Opioid Exposure]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.27.26349538v1?rss=1">
<title>
<![CDATA[
A Reproducible Health Informatics Pipeline for Simulating and Integrating Early-Phase Oncology Clinical, Biomarker, and Pharmacokinetic Data for Exploratory Decision-Support Analytics 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.27.26349538v1?rss=1
</link>
<description><![CDATA[
Background: Early-phase oncology development increasingly depends on integrated interpretation of clinical outcomes, translational biomarkers, and pharmacokinetic exposure rather than toxicity alone. This shift has created a need for reproducible analytical workflows that can combine heterogeneous trial data into traceable, analysis-ready outputs suitable for exploratory review and early decision support.

Objective: To develop a reproducible Python-based workflow that simulates a plausible early-phase oncology study, integrates clinical, biomarker, and pharmacokinetic data, and generates analysis-ready datasets, visual summaries, and exploratory predictive models relevant to early development analytics.

Methods: A workflow was constructed to simulate an early-phase oncology cohort of 120 patients distributed across multiple dose levels. Three synthetic raw data sources were generated, including patient-level clinical data, baseline biomarker data, and longitudinal pharmacokinetic profiles. These sources were merged into a single analysis-ready dataset containing derived variables such as tumor percent change from baseline, clinical-benefit status, exposure summaries, adverse-event indicators, and survival outcomes. The workflow produced structured tables, patient listings, waterfall plots, Kaplan-Meier-style survival curves, biomarker-response visualizations, pharmacokinetic profile plots, and exploratory machine-learning outputs.

Results: The final integrated dataset contained 120 patients and 30 variables. Median survival across the simulated cohort was 243.8 days, and higher dose groups showed improved median survival and greater clinical benefit relative to the low-dose group. Clinical benefit increased from 8.6% in the low-dose group to 29.0% in the medium-dose group and 45.2% in the high-dose group. Higher baseline LDH, CRP, and ctDNA fraction tracked with less favorable tumor-response trajectories, whereas higher exposure, reflected by AUC and Cmax, was associated with improved disease control. Pharmacokinetic profiles showed clear dose-dependent separation. Grade 3 or higher adverse-event rates remained within a plausible exploratory range across dose groups. A random-forest model for clinical benefit achieved an exploratory ROC AUC of 0.845, while a logistic-regression model for strict responder status could not be fit because no simulated patient met the prespecified objective response threshold.

Conclusions: This proof-of-concept demonstrates that a transparent Python workflow can generate a coherent early-phase oncology analytical ecosystem from synthetic inputs. The workflow supports integration of heterogeneous data streams, derivation of analysis-ready variables, production of interpretable outputs, and exploratory modeling in a reproducible framework. Although the simulated responder prevalence was too low to support objective response modeling, this limitation itself highlights the importance of simulation calibration for downstream analytical validity. The framework provides a practical Health Informatics demonstration of how early oncology trial data can be structured and analyzed for exploratory translational decision support.
]]></description>
<dc:creator><![CDATA[ Petalcorin, M. I. R. ]]></dc:creator>
<dc:date>2026-04-02</dc:date>
<dc:identifier>doi:10.64898/2026.03.27.26349538</dc:identifier>
<dc:title><![CDATA[A Reproducible Health Informatics Pipeline for Simulating and Integrating Early-Phase Oncology Clinical, Biomarker, and Pharmacokinetic Data for Exploratory Decision-Support Analytics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.31.26349817v1?rss=1">
<title>
<![CDATA[
DR. INFO at the Point of Care: A Prospective Pilot Study of an Agentic AI Clinical Assistant 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.31.26349817v1?rss=1
</link>
<description><![CDATA[
Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care.

Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice.

Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score (NPS). Non-parametric methods were used throughout.

Results: Clinicians reported high perceived time saving (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (mean 4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors.

Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted.
]]></description>
<dc:creator><![CDATA[ Corga Da Silva, R., Romano, M., Mendes, T., Isidoro, M., Ravichandran, S., Kumar, S., van der Heijden, M., Fail, O., Gnanapragasam, V. E. ]]></dc:creator>
<dc:date>2026-04-01</dc:date>
<dc:identifier>doi:10.64898/2026.03.31.26349817</dc:identifier>
<dc:title><![CDATA[DR. INFO at the Point of Care: A Prospective Pilot Study of an Agentic AI Clinical Assistant]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.31.26349827v1?rss=1">
<title>
<![CDATA[
MedScope: A Lightweight Benchmark of Open-Source Large Language Models for Medical Question Answering 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.31.26349827v1?rss=1
</link>
<description><![CDATA[
The rapid development of large language models (LLMs) has stimulated growing interest in their use for medical question answering and clinical decision support. However, compared with frontier proprietary systems, the empirical understanding of lightweight open-source LLMs in medical settings remains limited, particularly under resource-constrained experimental conditions. To address this gap, we introduce MedScope, a lightweight benchmarking framework for systematically evaluating open-source LLMs on medical multiple-choice question answering.

Using 1,000 sampled questions from MedMCQA, we benchmark six lightweight open-source models spanning three representative model families: LLaMA, Qwen, and Gemma. Beyond standard predictive metrics such as accuracy and macro-F1, our framework additionally considers inference time, prediction consistency, subject-wise variability, and model-specific error patterns. We further develop a set of multi-perspective visual analyses, including clustered heatmaps, agreement matrices, Pareto-style trade-off plots, radar charts, and multi-panel summary figures, in order to characterize model behavior in a more interpretable and comprehensive manner.

Our results reveal substantial heterogeneity across models in predictive performance, efficiency, and subject-level robustness. While larger lightweight models generally achieve better overall results, the gain is neither uniform across subject categories nor always aligned with efficiency. These findings suggest that lightweight open-source LLMs remain valuable as transparent and reproducible medical AI baselines, but their current capabilities are still insufficient for unsupervised deployment in high-risk healthcare scenarios. MedScope provides an accessible benchmark for evaluating lightweight medical LLMs and emphasizes the need for multi-dimensional assessment beyond accuracy alone. The code is open-sourced at: https://github.com/VhoCheng/MedScope.
]]></description>
<dc:creator><![CDATA[ Bian, R., Cheng, W. ]]></dc:creator>
<dc:date>2026-04-01</dc:date>
<dc:identifier>doi:10.64898/2026.03.31.26349827</dc:identifier>
<dc:title><![CDATA[MedScope: A Lightweight Benchmark of Open-Source Large Language Models for Medical Question Answering]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.31.26349861v1?rss=1">
<title>
<![CDATA[
Combining Token Classification With Large Language Model Revision for Age-Friendly 4M Entity Recognition From Nursing Home Text Messages: Development and Evaluation Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.31.26349861v1?rss=1
</link>
<description><![CDATA[
Background: Secure text messages (TMs) exchanged among interdisciplinary care teams in nursing homes (NHs) contain clinical information that aligns with the Age-Friendly Health Systems 4Ms: What Matters, Medication, Mentation, and Mobility. Yet this information is not captured in any structured form, making it unavailable for systematic monitoring or quality reporting. Automatically extracting 4M information accurately and efficiently from these messages could enable several downstream applications within long-term care settings. This task, however, is challenging because of the fragmented syntax, brevity, abbreviations, and informality of TMs.

Objective: This study aimed to develop and evaluate a multi-stage 4M Entity Recognition (4M-ER) pipeline that combines a fine-tuned token classifier with large language model (LLM) revision, using only locally deployed open-source models, to improve 4M information extraction from clinical TMs.

Methods: We used an expert-annotated dataset of 1,169 TMs collected from interdisciplinary teams across 16 Midwest NHs. The pipeline first identifies candidate text spans using a fine-tuned Bio-ClinicalBERT token classifier. A semantic similarity retriever then selects in-context exemplars to guide an LLM revision in which the LLM (Gemma, Phi, Qwen, or Mistral) performs boundary correction, label evaluation, and selective acceptance or rejection of candidate spans. Baselines for comparison included single-stage zero-shot LLMs, single-stage fine-tuned Bio-ClinicalBERT, and a fine-tuned LLM (Gemma) from a prior study. Ablation studies assessed the contribution of each pipeline stage and the effect of message filtering. Robustness was evaluated across 5 repeated runs.

Results: The 4M-ER pipeline outperformed the previously fine-tuned Gemma LLM across all 4M domains, achieving F1 (entity type) improvements of +2 to +11 percentage points without any additional fine-tuning and at roughly half the GPU memory (12 vs 24 GB). It also improved upon single-stage fine-tuned Bio-ClinicalBERT in Mobility, Mentation, and What Matters (+0.02 to +0.05 F1). Error analysis showed that LLM revision reduced false positives by 25% to 35% by correcting misclassifications caused by conversational ambiguity, while the fine-tuned Bio-ClinicalBERT's high recall captured subtle entities that the fine-tuned Gemma missed. Silver data augmentation further improved the hardest domains, raising What Matters F1 from 0.59 to 0.67 and Mobility F1 from 0.64 to 0.67. Ablation studies confirmed that restricting LLMs to revision only yielded optimal accuracy and efficiency.

Conclusions: The 4M-ER pipeline enables accurate and scalable extraction of 4M entities from clinical TMs by combining fine-tuned Bio-ClinicalBERT with LLM revision using only locally deployed open-source models. The structured 4M data produced by the pipeline can support 4M taxonomy and ontology construction, as demonstrated in prior work, and provides a foundation for downstream applications including real-time clinical surveillance, compliance with emerging age-friendly quality measures, and predictive modeling in long-term care settings.
]]></description>
<dc:creator><![CDATA[ Amewudah, P., Popescu, M., Farmer, M. S., Powell, K. R. ]]></dc:creator>
<dc:date>2026-04-01</dc:date>
<dc:identifier>doi:10.64898/2026.03.31.26349861</dc:identifier>
<dc:title><![CDATA[Combining Token Classification With Large Language Model Revision for Age-Friendly 4M Entity Recognition From Nursing Home Text Messages: Development and Evaluation Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.31.26349766v1?rss=1">
<title>
<![CDATA[
Self-Reported Symptoms Enable Four-Phase Menstrual Cycle Classification with Hormonally Validated Labels 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.31.26349766v1?rss=1
</link>
<description><![CDATA[
Accurate inference of physiological state across the menstrual cycle has important applications in reproductive health and in understanding symptom dynamics, yet most non-hormonal approaches rely on wearable sensors or calendar-based tracking. Whether self-reported symptoms alone can support prospective, cross-subject phase classification remains unresolved. Here, we introduce a hybrid modelling framework that combines a gradient-boosted classifier with a Hidden Semi-Markov Model to infer four menstrual cycle phases (menstrual, follicular, fertile, and luteal) from self-reported data. The classifier captures non-linear symptom patterns, while the temporal model imposes biologically grounded constraints, including cyclic ordering and realistic phase durations. In a leave-one-subject-out evaluation using hormonally annotated data from 41 participants, the model achieved 67.6% accuracy and a macro F1 score of 0.662. Features reflecting short-term symptom variability were more informative than absolute symptom levels, indicating that within-person fluctuation provides a more generalisable signal of cycle phase than symptom intensity alone. These findings demonstrate the feasibility of low-burden, device-free menstrual health monitoring, establish symptom dynamics as a basis for scalable digital biomarkers, and expand access to tracking in resource-constrained settings and populations underserved by wearable-based approaches.
]]></description>
<dc:creator><![CDATA[ Specht, B., Tayeb, Z. Z., Garbaya, S., Khadraoui, D., EL-Khozondar, M., Schneider, R. ]]></dc:creator>
<dc:date>2026-04-01</dc:date>
<dc:identifier>doi:10.64898/2026.03.31.26349766</dc:identifier>
<dc:title><![CDATA[Self-Reported Symptoms Enable Four-Phase Menstrual Cycle Classification with Hormonally Validated Labels]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.31.26349853v1?rss=1">
<title>
<![CDATA[
Data sharing policies, requirements, and support from public and private clinical trial sponsors: a survey on top sponsors of clinical trials in Europe 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.31.26349853v1?rss=1
</link>
<description><![CDATA[
Objective: To map the presence, public availability, and content of clinical trial data sharing policies (DSP), data management and sharing plans (DMSP), and data use agreements (DUA) among the most prolific public and private clinical trial sponsors operating in the European Union, and to identify key areas of convergence, divergence, and constraint in the context of the General Data Protection Regulation (GDPR).

Eligibility criteria: We included organisation-level documents describing approaches to clinical trial data sharing or data management from the top 20 public and top 20 private sponsors ranked by the number of trials registered in the EU Clinical Trials Information System (CTIS). Eligible materials comprised publicly available or sponsor-shared policies, guidelines, statements, templates, and agreements relevant to clinical trial data sharing or management.

Sources of evidence: Evidence was identified through systematic searches of sponsors' public websites, structured Google searches, and major data management plan platforms (DMPTool, DMPonline, DMP Assistant), complemented by direct contact with sponsors to verify findings and request missing documentation. All sources were archived and catalogued.

Charting methods: Two reviewers independently extracted data using a structured form, capturing the existence, accessibility, and content of data sharing policies, data management and sharing plans, and data use agreements. Quantitative data were summarised descriptively, and a non-interpretive descriptive content analysis was conducted to characterise recurring policy elements and areas of heterogeneity.

Results: Among 40 sponsors, private sponsors were substantially more likely than public sponsors to make trial-specific data sharing policies and data use agreements publicly accessible, often via established data sharing platforms. Public sponsors more frequently referenced data management and sharing plans, but these were heterogeneous in scope and often embedded within broader institutional governance documents rather than tailored to clinical trials. Across sectors, GDPR compliance, data protection, and legal safeguards were emphasised, while operational aspects such as dataset readiness, review criteria, and downstream responsibilities varied widely. The overall response rate to sponsor verification was 37.5%.

Conclusion: Clinical trial data sharing governance in the EU shows a marked sectoral imbalance among the top sponsors. Private sponsors tend to provide more detailed and operationally explicit documentation, whereas public sponsors often articulate high-level commitments without trial-specific guidance. Greater clarity and standardisation, particularly among public sponsors, could improve transparency and facilitate responsible data reuse, while remaining compatible with GDPR requirements.
]]></description>
<dc:creator><![CDATA[ Tai, K. H., Varvara, G., Escoffier, E., Mansmann, U., DeVito, N. J., Vieira Armond, A. C., Naudet, F. ]]></dc:creator>
<dc:date>2026-04-01</dc:date>
<dc:identifier>doi:10.64898/2026.03.31.26349853</dc:identifier>
<dc:title><![CDATA[Data sharing policies, requirements, and support from public and private clinical trial sponsors: a survey on top sponsors of clinical trials in Europe]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.30.26349782v1?rss=1">
<title>
<![CDATA[
Governance, Accountability and Post-Deployment Monitoring Preferences for AI Integration in West African Clinical Practice: A Mixed-Methods Study 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.30.26349782v1?rss=1
</link>
<description><![CDATA[
Background: The integration of artificial intelligence (AI) into clinical practice holds transformative potential for healthcare in West Africa, but safe deployment requires context-appropriate governance, accountability, and post-deployment monitoring frameworks. This cross-sectional mixed-methods study examined preferences and concerns of West African clinicians and technical experts regarding AI governance structures, post-deployment surveillance mechanisms, and accountability allocation.

Methods: A structured questionnaire was administered to 136 physicians affiliated with the West African College of Physicians (February 22-28, 2026), complemented by 72 key informant interviews with technical leads, AI developers, data scientists, policymakers, and healthcare leaders. Data were analyzed using descriptive statistics, inferential tests, and thematic analysis.

Results: Clinicians strongly preferred independent regulatory bodies (40.4%) for overseeing AI tool performance, with high trust ratings (mean: 4.3/5), while vendor self-monitoring received minimal support (3.7%; mean: 2.4/5). Real-time dashboards were the most favored monitoring approach (41.9%). Clear accountability pathways (94.1%), algorithm transparency (91.9%), and real-time performance data (89.7%) were rated essential by majorities. Major concerns included clinicians being unfairly blamed for AI errors (76.5%), excessive vendor control (72.8%), and absence of clear reporting pathways (69.9%). Qualitative findings emphasized continuous performance tracking for accuracy, fairness, and bias; structured incident reporting; protocols for model drift and failure; and multi-layered governance combining independent oversight, institutional AI committees, and explicit liability frameworks.

Conclusion: This study provides the first empirical evidence from West Africa on clinician preferences for AI governance. Findings offer actionable guidance for policymakers to build trustworthy, equitable, and safe AI integration frameworks that prioritize transparency, independent oversight, and clinician protection.
]]></description>
<dc:creator><![CDATA[ Uzochukwu, B. S. C., Cherima, Y. J., Enebeli, U. U., Okeke, C. C., Uzochukwu, A. C., Omoha, A., Hassan, B., Eronu, E. M., Yusuf, S. M., Uzochukwu, K. A., Kalu, E. I. ]]></dc:creator>
<dc:date>2026-04-01</dc:date>
<dc:identifier>doi:10.64898/2026.03.30.26349782</dc:identifier>
<dc:title><![CDATA[Governance, Accountability and Post-Deployment Monitoring Preferences for AI Integration in West African Clinical Practice: A Mixed-Methods Study]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.30.26349756v1?rss=1">
<title>
<![CDATA[
BSO-AD: An Ontology for Representing and Harmonizing Behavioral Social Knowledge in ADRD 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.30.26349756v1?rss=1
</link>
<description><![CDATA[
Objective: Behavioral and social factors (BSFs) substantially influence the risk, onset, and progression of Alzheimer's disease and related dementias (ADRD). A systematic representation of their interplay is essential for advancing prevention and targeted interventions. However, BSF-related knowledge is scattered across heterogeneous sources, limiting scalable evidence synthesis and computational analysis. To address this, we created a Behavioral Social Data and Knowledge Ontology for ADRD (BSO-AD) to represent and integrate BSFs with respect to ADRD.

Material and Methods: BSO-AD was developed following established ontology design principles, prioritizing reuse of existing ontology elements to ensure semantic interoperability. It was built upon the Social Determinants of Health Ontology (SDoHO) and the Drug-Repurposing Oriented Alzheimer's Disease Ontology (DROADO). BSF-related classes were enriched with ICD-10-CM Z55-Z65 codes and ADRD-related classes with AD-Onto. Relationships between BSFs and ADRD were derived through literature mining. Ontology quality was evaluated through Hootation-based expert review and an LLM-assisted framework assessing structural coverage and semantic coherence.

Results: BSO-AD contains 2,275 classes, 153 object properties, and 49 data properties. Expert review demonstrated strong rational agreement (0.95), with disagreements resolved through discussion. LLM-based evaluation showed high category coverage rates (≥ 0.97) and robust semantic alignment with the relevant literature (average completeness = 0.79; conciseness = 0.94).

Discussion and Conclusion: BSO-AD is, to our knowledge, the first ontology to systematically represent BSFs and hierarchically model their interrelationships in ADRD. It establishes a semantic backbone for computational analysis and knowledge integration. The LLM-assisted evaluation framework demonstrates the feasibility of scalable, automated ontology assessment.
]]></description>
<dc:creator><![CDATA[ Li, H., Yu, Y., Bhandarkar, A., Kumar, R., Clark, I. H., Hu, Y., Cao, W., Zhao, N., LI, F., Tao, C. ]]></dc:creator>
<dc:date>2026-03-31</dc:date>
<dc:identifier>doi:10.64898/2026.03.30.26349756</dc:identifier>
<dc:title><![CDATA[BSO-AD: An Ontology for Representing and Harmonizing Behavioral Social Knowledge in ADRD]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-03-31</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.30.26349388v1?rss=1">
<title>
<![CDATA[
Longitudinal information extraction from clinical notes in rare diseases: an efficient approach with small language models 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.30.26349388v1?rss=1
</link>
<description><![CDATA[
Objectives: Rare diseases often require longitudinal monitoring to characterise progression, yet much clinical information remains locked in unstructured electronic health records (EHRs). Efficient recovery of such data is critical for accurate prognostic modelling and clinical trial preparation. We aimed to develop and evaluate a small language model (SLM)-based pipeline for extracting longitudinal information from French clinical notes of patients with rare kidney diseases.

Methods: As a use case, we focused on serum creatinine, a key biomarker of kidney function. We analyzed 81 clinical notes comprising 200 measurements (each a triplet of date, value, and unit). Four open-source SLMs (Mistral-7B, Llama-3.2-3B, Qwen3-4B, Qwen3-8B) were systematically tested with different prompting strategies in French and English. Outputs were post-processed to standardize formats and resolve inconsistencies, and performance was assessed across model size, prompting, language, and robustness to text duplication.

Results: All SLMs extracted structured triplets, with F1-scores ranging from 0.519 to 0.928 (Qwen3-8B), outperforming the rule-based baseline. Larger models generally performed better, while prompting strategy and language had modest effects across models. SLMs also showed variable robustness to duplicated content common in real-world EHR notes.

Discussion: Lightweight, locally deployable language models can accurately extract longitudinal biomarkers from unstructured clinical notes. Our findings highlight their practicality for rare diseases where data scarcity often limits task-specific model training.

Conclusion: SLMs provide a privacy-preserving and resource-efficient solution for recovering longitudinal biomarker trajectories from unstructured notes, offering potential to advance real-world research and patient care in rare kidney diseases.

1) What is already known?
- Longitudinal monitoring is essential in rare kidney diseases, yet key biomarker data are often locked in unstructured clinical notes.
- Large language models (LLMs) have shown strong performance in clinical text processing tasks but face major challenges related to privacy, computational cost, and implementation feasibility in healthcare settings.
- Small language models (SLMs) are emerging as lightweight, locally deployable alternatives whose potential for clinical applications is increasingly recognized.

2) What does this paper add?
- This study provides the first real-world evaluation of SLMs for extracting longitudinal biomarker measurements in rare kidney disease cohorts.
- It introduces and validates an efficient extraction pipeline that combines document preselection, SLM prompting, and post-processing to accurately retrieve biomarker measurements from French clinical notes.
- The findings show that SLM-based extraction can help mitigate data scarcity in rare diseases, thereby improving prognosis modeling and supporting clinical research.
]]></description>
<dc:creator><![CDATA[ Wang, X., Faviez, C., Vincent, M., Andrew, J. J., Le Priol, E., Saunier, S., Knebelmann, B., Zhang, R., Garcelon, N., Burgun, A., Chen, X. ]]></dc:creator>
<dc:date>2026-03-31</dc:date>
<dc:identifier>doi:10.64898/2026.03.30.26349388</dc:identifier>
<dc:title><![CDATA[Longitudinal information extraction from clinical notes in rare diseases: an efficient approach with small language models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-03-31</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.30.26349749v1?rss=1">
<title>
<![CDATA[
MedResearchBench: A Multi-Domain Benchmark for Evaluating AI Research Agents on Clinical Medical Research 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.30.26349749v1?rss=1
</link>
<description><![CDATA[
The rapid advancement of AI research automation systems, including AI Scientist, data-to-paper, and Agent Laboratory, has demonstrated the potential for autonomous scientific discovery. However, existing benchmarks for evaluating these systems focus predominantly on fundamental sciences (machine learning, physics, chemistry), overlooking the unique challenges of medical clinical research: complex survey designs, inferential statistics with confounding control, adherence to reporting standards (STROBE, CONSORT), and the requirement for clinically actionable interpretation. We present MedResearchBench, the first benchmark specifically designed to evaluate AI systems on medical clinical research tasks. MedResearchBench comprises 16 tasks spanning 7 clinical domains (cardiovascular, oncology, mental health, metabolic, respiratory, neurology, infectious disease), built on publicly available datasets (the National Health and Nutrition Examination Survey [NHANES] and the Surveillance, Epidemiology, and End Results [SEER] program) with ground truth from 16 high-quality published papers (IF range: 2.3-51.0). Each task is evaluated along 6 medical-specific dimensions: statistical methodology, results accuracy, visualization quality, clinical interpretation, confounding sensitivity, and reporting compliance. We describe the benchmark design rationale, task construction methodology, paper selection criteria with anti-paper-mill filtering, and a detailed analysis of task characteristics including methodological diversity, evaluation dimension coverage, and difficulty stratification. To demonstrate benchmark executability, we evaluate an agentic data2paper pipeline on 3 pilot tasks spanning all three difficulty tiers, achieving scores of 72/100 (Tier 1, Cardio 000), 69/100 (Tier 2, Mental 000), and 75/100 (Tier 3, Metabolic 002), with a mean score of 72/100 (B-level). Survey-weighted methodology was correctly implemented across all tasks; primary limitations included covariate incompleteness and reference group misspecification. MedResearchBench addresses a critical gap in AI research evaluation and provides a standardized, community-extensible platform for assessing whether AI systems can conduct clinically sound, publication-quality medical research. All task materials are publicly available at https://github.com/TerryFYL/MedResearchBench.
]]></description>
<dc:creator><![CDATA[ Tan, S., Tian, Z. ]]></dc:creator>
<dc:date>2026-03-31</dc:date>
<dc:identifier>doi:10.64898/2026.03.30.26349749</dc:identifier>
<dc:title><![CDATA[MedResearchBench: A Multi-Domain Benchmark for Evaluating AI Research Agents on Clinical Medical Research]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-03-31</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.medrxiv.org/content/10.64898/2026.03.28.26349522v1?rss=1">
<title>
<![CDATA[
MOE-ECG: Multi-Objective Ensemble Fusion for Robust Atrial Fibrillation Detection Using Electrocardiograms 
]]>
</title>
<link>
https://www.medrxiv.org/content/10.64898/2026.03.28.26349522v1?rss=1
</link>
<description><![CDATA[
Background: Atrial fibrillation (AFib) is the most common sustained arrhythmia in the world, imposing a heavy clinical and economic burden on global healthcare systems. Early detection of AFib can reduce mortality and morbidity, while helping to alleviate the growing economic burden of cardiovascular diseases. With the increasing availability of digital health technologies, computational solutions have great potential to support the timely diagnosis of cardiac abnormalities.

Objectives: With the increasing availability of electrocardiogram (ECG) data from clinical and wearable devices, manual interpretation has become impractical due to its time-consuming and subjective nature. Existing automated approaches often rely on single classifiers or fixed ensembles that primarily optimize predictive accuracy while neglecting model diversity, which leads to limited robustness and generalization across heterogeneous datasets. Therefore, this study aims to develop a robust and diversity-aware framework for automatic AFib detection that simultaneously improves classification performance and model generalizability. To this end, we propose MOE-ECG, a multi-objective ensemble selection and fusion framework that explicitly optimizes both predictive performance and inter-model diversity for reliable AFib detection from ECG recordings.

Methods: The proposed multi-objective ensemble (MOE) framework formulates ensemble selection as a bi-objective optimization problem and employs multi-objective particle swarm optimization to identify complementary classifiers from a heterogeneous model pool. Unlike conventional ensembles, it explicitly optimizes both predictive performance and diversity and integrates Dempster-Shafer theory for uncertainty-aware decision fusion. After filtering to remove baseline wander and noise, the ECG signals were segmented into windows of 20, 60, and 120 heartbeats with 50% overlap. Fifteen statistical and nonlinear features were obtained from the RR-intervals of the pre-processed ECG signals, of which eight features were selected with correlation analysis to capture subtle information from the ECG data. We trained and evaluated the proposed model on three open-source databases: the MIT-BIH Atrial Fibrillation Database, Saitama Heart Database Atrial Fibrillation, and Long-Term AF Database. The approach was evaluated over five independent runs to assess its stability and generalization.

Results: The proposed approach achieved the best overall performance on 60-beat segments, with an average accuracy of 89.85%, precision of 91.14%, recall of 94.19%, an F1-score of 92.64%, and area under the curve (AUC) of around 0.95. Statistical analysis using Holm-adjusted Wilcoxon tests confirmed significant improvements (p < 0.05) compared to both the best individual classifier and the unoptimized average ensemble of all classifiers. These findings show that the proposed selection and evaluation methodology, rather than group aggregation alone, is the key driver of performance improvements.

Conclusion: The results demonstrate that the MOE-ECG model offers a robust, accurate, and reliable solution for the detection of AFib from short ECG segments. The empirical findings confirm that multi-objective ensemble fusion enhances diagnostic performance and offers robust predictions, opening up possibilities for real-time AFib detection in clinical and tele-health settings.
]]></description>
<dc:creator><![CDATA[ Peimankar, A., Hossein Motlagh, N., K. Khare, S., Spicher, N., Dominguez, H., Abolghasemi, V., Fujiwara, K., Teichmann, D., Rahmani, R., Puthusserypady, S. ]]></dc:creator>
<dc:date>2026-03-30</dc:date>
<dc:identifier>doi:10.64898/2026.03.28.26349522</dc:identifier>
<dc:title><![CDATA[MOE-ECG: Multi-Objective Ensemble Fusion for Robust Atrial Fibrillation Detection Using Electrocardiograms]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-03-30</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
