Machine learning highlights the importance of primary and secondary production in determining habitat for marine fish and macroinvertebrates

Species distribution models for marine organisms are increasingly used for a range of applications, including spatial planning, conservation, and fisheries management. These models have been constructed using a variety of mathematical forms and draw on both physical and biological independent variables; however, what might be called first-generation models have mainly taken the form of linear models or smoothing splines, informed by data collected during fish surveys. The performance of different classes of variables was tested in a series of species occurrence models built with machine learning methods, specifically to evaluate the potential contribution of lower trophic level data. Random forest models were fitted to classify the presence or absence of fish and macroinvertebrates surveyed on the US Northeast Continental Shelf. The candidate variables included physical, primary production, secondary production, and terrain variables. For accepted model fits, six variable importance measures were computed, which collectively showed that physical and secondary production variables made the greatest contribution across all models. In contrast, terrain variables made the least contribution. Multivariable analyses that account for all performance measures reinforce the role of water depth and temperature in defining species presence and absence; however, chlorophyll concentration and some specific zooplankton taxa, such as Metridia lucens and Paracalanus parvus, also make important contributions, with strong seasonal variations. Our results suggest that lower trophic level variables, if available, are valuable in the creation of species distribution models for marine organisms.
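
The general workflow described in the abstract (fit a random forest presence/absence classifier, then compare variable classes by importance) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the variable names and the presence rule are hypothetical, and permutation importance from scikit-learn stands in for the six importance measures of the study, which the abstract does not name. The ranking it prints reflects only the synthetic rule, not the study's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_stations = 2000

# Hypothetical predictors, one column each, drawn as standardized synthetic values.
names = ["depth", "bottom_temp", "chlorophyll", "zooplankton", "rugosity"]
classes = ["physical", "physical", "primary", "secondary", "terrain"]
X = rng.normal(size=(n_stations, len(names)))

# Assumed presence rule for the synthetic species: depth, temperature and
# zooplankton matter; chlorophyll and rugosity are unrelated by construction.
logit = 2.0 * X[:, 0] + 1.5 * X[:, 1] + 1.0 * X[:, 3]
present = (rng.random(n_stations) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Hold out 30% of stations so importance is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, present, test_size=0.3, random_state=0, stratify=present
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Permutation importance: mean drop in held-out accuracy when one column is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Sum the per-variable importances within each variable class.
class_importance = {}
for cls, imp in zip(classes, result.importances_mean):
    class_importance[cls] = class_importance.get(cls, 0.0) + float(imp)

for cls, imp in sorted(class_importance.items(), key=lambda kv: -kv[1]):
    print(f"{cls:>10s}: {imp:.3f}")
print(f"held-out accuracy: {accuracy:.3f}")
```

Summing importances by class is one simple way to compare groups of predictors; with real survey data, correlated predictors can share or mask importance, so several measures are usually compared rather than relying on one.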

Publication Title

Aquatic Conservation: Marine and Freshwater Ecosystems