Comparative performance of four large language models in generating evidence-based exercise prescriptions using FITT-VP framework.

Huan Feng, Xiaojun Wang

BACKGROUND: Exercise prescription plays a critical role in health management, but effective implementation is limited by practitioner expertise and time constraints. Large language models (LLMs) offer potential for generating personalized prescriptions, yet their comparative performance within established frameworks remains unexplored. METHODS: This study evaluated four advanced LLMs (GPT-4o, Claude 3.7, DeepSeek R1, and Grok-3) in generating exercise prescriptions based on the FITT-VP framework (Frequency, Intensity, Time, Type, Volume, Progression). Thirty synthetic patient profiles were designed from epidemiological data and clinical guidelines, and validated through a three-stage process involving sports medicine students and expert review. Three certified exercise specialists independently rated each prescription across the six FITT-VP dimensions using a 0-10 scale. One-way repeated measures ANOVA with Bonferroni correction and effect size calculations were applied. RESULTS: Across six FITT-VP dimensions (maximum total: 60), Claude 3.7 showed the highest total score (50.23 ± 1.75), followed by Grok-3 (47.42 ± 1.50), GPT-4o (44.02 ± 1.68), and DeepSeek R1 (40.30 ± 1.46). ANOVA revealed significant differences among models (F = 250.58, p < 0.001, η² = 0.896). CONCLUSION: This study establishes important benchmarks for AI in exercise medicine, with Claude 3.7 showing promise as a drafting tool for individualized exercise prescriptions and supporting a collaborative human-AI framework. These results, however, are based on a single-run evaluation of static synthetic profiles and expert-rated prescription quality rather than real-world clinical outcomes; multi-run variability assessment and clinical validation with human-AI interaction and patient follow-up are required before implementation.
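
As an illustration of the statistical approach named in METHODS, the following is a minimal sketch of a one-way repeated measures ANOVA in plain Python. It assumes the 30 profiles act as the repeated "subjects" and the four models as the within-subject conditions. The function names (`rm_anova`, `partial_eta_from_f`, `bonferroni_alpha`) and the toy scores are hypothetical and are not taken from the study. The abstract does not say which variant of η² was reported; the sketch computes partial η², which for F = 250.58 with degrees of freedom (3, 87) works out to about 0.896, consistent with the reported value.

```python
from math import comb


def rm_anova(scores):
    """One-way repeated measures ANOVA.

    scores: list of rows, one row per subject (here: patient profile),
            one value per condition (here: model) in each row.
    Returns (F, df_condition, df_error, partial_eta_squared).
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of conditions
    grand = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    # Partition the total sum of squares into condition, subject, and error.
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    partial_eta_sq = ss_cond / (ss_cond + ss_error)
    return f_stat, df_cond, df_error, partial_eta_sq


def partial_eta_from_f(f_stat, df_cond, df_error):
    """Recover partial eta squared from a reported F and its degrees of freedom."""
    return (f_stat * df_cond) / (f_stat * df_cond + df_error)


def bonferroni_alpha(k, alpha=0.05):
    """Per-comparison alpha when all pairwise contrasts among k conditions are tested."""
    return alpha / comb(k, 2)


# Hypothetical toy data: 3 profiles rated under 2 models (not study data).
toy = [[1, 2], [2, 4], [3, 6]]
f_stat, df1, df2, eta = rm_anova(toy)
print(f_stat, df1, df2, eta)

# Consistency check against the abstract, assuming df = (4 - 1, (4 - 1) * (30 - 1)).
print(round(partial_eta_from_f(250.58, 3, 87), 3))

# Four models give 6 pairwise comparisons under Bonferroni correction.
print(bonferroni_alpha(4))
```

With four models there are six pairwise comparisons, so a Bonferroni-adjusted threshold at a nominal α of 0.05 would be 0.05 / 6 per comparison. The nominal α and the use of all pairwise contrasts are assumptions here, since the abstract does not specify them.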
