Introduction
The increasing use of AI/ML to analyze large, complex nutrition datasets (e.g., the PREDICT study) calls for rigorous modeling practices. While commercial software makes AI/ML algorithms accessible, that ease of use often masks their inherent complexity, which can lead to ethical problems and biased conclusions. The authors define ML as algorithms that improve automatically through experience, and AI as algorithms capable of learning, adapting, and problem-solving with minimal human intervention. The terms are often used interchangeably, and this paper addresses ethical concerns applicable to both. Examples from other fields (e.g., Google Photos’ misidentification of people of color as gorillas) highlight the detrimental consequences of poorly implemented AI/ML models. The authors aim to provide guidance to nutrition researchers who are familiar with statistical methods but new to AI/ML model development, building upon existing literature on FAIR data principles and discipline-specific checklists. The tutorial focuses on extending established statistical methods to AI/ML and on addressing challenges unique to AI/ML modeling, such as sample size determination and data balancing, with the goal of improving both the ethical soundness and the quality of results.
Literature Review
The paper reviews existing literature on best practices for AI/ML modeling, integrating FAIR (Findable, Accessible, Interoperable, and Reusable) data principles [13] and drawing upon articles providing overviews of machine learning and discipline-specific checklists (e.g., CLAIM for medical imaging, checklists for NeurIPS and CVPR conferences) [14-21]. It emphasizes that the guidelines are tailored for nutrition researchers with a background in statistics looking to incorporate AI/ML models.
Methodology
The authors based their recommendations on their experience as reviewers, journal editors, and AI/ML modelers in nutrition research. They first outline well-established statistical modeling practices relevant to AI/ML, focusing on bias and error mitigation. Specific AI/ML challenges are then addressed:

1. **Sample Size Calculations:** The authors acknowledge the lack of a one-size-fits-all approach for AI/ML sample size calculation and emphasize the iterative process required, tailored to each model's complexity.
2. **Missing Data:** The paper discusses the three types of missing data (MCAR, MAR, MNAR) and their impact on model predictions. It recommends various imputation techniques (mean/mode imputation, k-nearest neighbor, multiple imputation) and explores the possibility of using missingness as a model feature.
3. **Data Imbalance:** The authors discuss the challenges of imbalanced datasets (where one class dominates) and suggest cautious use of up-sampling and down-sampling techniques.
4. **Explainable AI (XAI):** The authors stress the importance of using explainable AI methods (e.g., saliency maps, variable importance in random forests) in conjunction with more complex, less interpretable models (e.g., neural networks) to ensure transparency and avoid artifacts in model predictions.
5. **Data Literacy:** The authors highlight the crucial role of data literacy for AI/ML users to ensure appropriate question formulation and analysis choices, proper utilization of AI/ML models (descriptive, diagnostic, predictive), and the combination of diverse techniques.

A detailed checklist summarizing these recommendations and guiding principles is provided.
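The missing-data recommendation above (impute, and optionally treat missingness itself as a feature) can be illustrated with a minimal sketch. This is not the authors' code: the function name `impute_mean_with_indicator` and the `fiber_g` values are hypothetical, invented purely for illustration. Mean imputation is used only because it is the simplest of the techniques listed; k-nearest neighbor or multiple imputation would replace the fill step.

```python
from statistics import mean


def impute_mean_with_indicator(values):
    """Fill missing entries (None) with the mean of the observed entries.

    Returns two lists of the same length as `values`:
    the imputed column, and a 0/1 indicator marking which entries were missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    # The indicator can be passed to a model as an extra feature, so that
    # "this value was missing" stays visible after imputation.
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


# Hypothetical self-reported fiber intake (g/day) with two missing answers.
fiber_g = [22.0, None, 18.0, 30.0, None]
fiber_imputed, fiber_missing = impute_mean_with_indicator(fiber_g)
print(fiber_imputed)
print(fiber_missing)
```

Keeping the indicator column alongside the imputed values lets a downstream model learn from the pattern of missingness, which may matter when data are not missing completely at random.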
Key Findings
The paper's key findings center around the importance of applying best practices from statistical modeling to the field of AI/ML in nutrition research. Specific key aspects highlighted include:

1. **Addressing Measurement Error:** The authors emphasize the need for controlled data with minimal measurement error, especially considering the limitations of subjective measures like self-reported dietary intake, suggesting the need for warning labels in data repositories. Explainable AI/ML models are crucial for understanding error propagation.
2. **Mitigating Selection Bias:** The authors discuss the importance of recruiting representative populations to avoid selection bias. They suggest techniques like data weighting and up/down sampling to adjust for imbalances in existing datasets, emphasizing the need for transparency regarding dataset limitations.
3. **Appropriate Sample Size:** The paper underscores the difficulty of calculating sample sizes for AI/ML models due to their complexity. It suggests rules of thumb and iterative processes to determine appropriate sample sizes while noting that this depends greatly on the particular model used. Justification for sample size choices must be clearly articulated.
4. **Handling Missing Data:** The importance of appropriate strategies for missing data is stressed. The authors describe the types of missing data (MCAR, MAR, MNAR) and the appropriate methods for handling them, including imputation and treating missingness as a feature. The authors further highlight the danger of using only complete cases.
5. **Balancing Datasets:** The tutorial stresses the importance of balanced datasets for accurate AI/ML model training. The authors discuss methods to address data imbalances, such as up-sampling and down-sampling, while cautioning about potential issues like learning artifacts.
6. **Utilizing Explainable AI (XAI):** The authors strongly recommend the use of XAI methods alongside complex, less interpretable AI/ML models to improve model transparency and detect potential artifacts. Examples of XAI methods, such as saliency maps and variable importance in random forests, are discussed.
7. **Data Literacy:** The authors emphasize the importance of data literacy for researchers using AI/ML, covering areas like question formulation, analytic strategy and the need for transparency in methodology. A checklist is provided to promote best practices.
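The up-sampling approach mentioned under dataset balancing can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' procedure: the function name `upsample_minority`, the toy `rows`, and the 8-to-2 class split are all hypothetical. It performs plain random up-sampling with replacement; more elaborate schemes exist and are not shown.

```python
import random
from collections import Counter


def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest class.

    Returns new lists; the inputs are left unmodified.
    """
    rng = random.Random(seed)  # fixed seed so the resampling is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement from this class's original rows.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical training data: 8 participants in class 0, 2 in class 1.
rows = [{"id": i} for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = upsample_minority(rows, labels)
print(Counter(balanced_labels))
```

Because the added rows are exact duplicates, a model can memorize them, which is one way the learning artifacts cautioned about above can arise. A common precaution, not stated in this summary, is to resample only the training split so that duplicated rows never appear in the evaluation data.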
Discussion
The paper addresses a critical gap in nutrition research, providing practical guidance on ethically sound and effective AI/ML model development. The iterative, tailored approach the authors advocate helps mitigate bias and produce reproducible results. The checklist offers a practical tool for researchers, aiding in the transparent reporting of methods and facilitating critical evaluation. The emphasis on XAI is particularly valuable, as it addresses the “black box” problem inherent in some complex models. The discussion of data literacy highlights the importance of user responsibility in ensuring the appropriate application of these powerful tools, acknowledging that not all nutrition researchers will have expertise in advanced AI/ML techniques.
Conclusion
This paper offers a valuable contribution to the field of nutrition research by providing a practical tutorial and checklist for the ethical and effective implementation of AI/ML models. The emphasis on iterative processes, transparency, and the use of explainable AI methods supports the development of robust and reliable models. Future research should focus on developing more sophisticated methods for sample size calculation specific to various AI/ML models and refining XAI techniques tailored for nutrition applications. Increased collaboration between statisticians, AI/ML experts, and nutrition researchers is essential for advancing the use of these powerful tools.
Limitations
The recommendations are based on the authors' experience and may not cover every possible scenario in AI/ML model development. The checklist, while comprehensive, might require adaptation based on the specifics of individual research questions and model choices. The reliance on examples from other fields highlights the need for more nutrition-specific research on AI/ML bias and error.