A flexible and efficient knowledge-guided machine learning data assimilation (KGML-DA) framework for agroecosystem prediction in the US Midwest

Qi Yang; Licheng Liu; Junxiong Zhou; Rahul Ghosh; Bin Peng; Kaiyu Guan; Jinyun Tang; Wang Zhou; Vipin Kumar; Zhenong Jin

doi:10.1016/j.rse.2023.113880

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Process-based models are widely used to predict agroecosystem dynamics, but such modeled results often contain considerable uncertainty due to imperfect model structure, biased model parameters, and inaccurate or inaccessible model inputs. Data assimilation (DA) techniques are widely adopted to reduce prediction uncertainty by calibrating model parameters or dynamically updating the model state variables using observations. However, high computational cost, difficulties in mitigating model structural error, and low flexibility in framework development hinder their application in large-scale agroecosystem predictions. In this study, we addressed these challenges by proposing a novel DA framework that integrates a Knowledge-Guided Machine Learning (KGML)-based surrogate with tensorized ensemble Kalman filter (EnKF) and parallelized particle swarm optimization (PSO) to effectively assimilate historical and in-season multi-source remote sensing data. Specifically, we incorporate knowledge from a process-based model, ecosys, into a Gated Recurrent Unit (GRU)-based hierarchical neural network. The hierarchical architecture of KGML-DA mimics key processes of ecosys and builds a causal relationship between target variables. Using carbon budget quantification in the US Corn-Belt as a context, we evaluated KGML-DA's performance in predicting key processes of the carbon cycle at three agricultural sites (US-Ne1, US-Ne2, US-Ne3), along with county-level (627 counties) and 30-m pixel-level (Champaign County, IL) grain yield. The site experiments show that updating the upstream variable, e.g., gross primary production (GPP), improved the prediction of downstream variables such as ecosystem respiration, net ecosystem exchange, biomass, and leaf area index (LAI), with RMSE reductions ranging from 9.2% to 30.5% for corn and 4.8% to 24.6% for soybean. Uncertainty in downstream variables was automatically constrained after correcting the upstream variables, demonstrating the effectiveness of the causality linkages in the hierarchical surrogate. We found that joint use of in-season GPP and evapotranspiration (ET) products along with historical GPP and surveyed yields achieved the best prediction for county-level yields, while assimilating in-season LAI observations benefitted the prediction in extreme years. Uncertainty and error analysis of regional yield estimation demonstrated that KGML-DA could reduce prediction error by 26.5% for corn and 36.2% for soybean. Remarkably, the GPU-based tensor operation design makes this DA framework more than 7000 times faster than the process-based (PB) model with a High-Performance Computing system, indicating the high potential of the proposed framework for in-season, high-resolution agroecosystem predictions.
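
The abstract's observation that correcting an upstream variable also constrains downstream ones can be illustrated with a generic, textbook stochastic EnKF analysis step. The sketch below is not the authors' tensorized GPU implementation and does not use their surrogate: the function name `enkf_update`, the two-variable state (a GPP-like and a respiration-like quantity), and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def enkf_update(ensemble, obs, obs_var, H, rng):
    """Generic stochastic EnKF analysis step (illustrative sketch only).

    ensemble: (n_members, n_state) prior ensemble
    obs:      (n_obs,) observation vector
    obs_var:  scalar observation-error variance (assumed uncorrelated errors)
    H:        (n_obs, n_state) linear observation operator
    """
    n_members = ensemble.shape[0]
    anomalies = ensemble - ensemble.mean(axis=0)
    P = anomalies.T @ anomalies / (n_members - 1)   # sample state covariance
    R = obs_var * np.eye(len(obs))                  # observation-error covariance
    S = H @ P @ H.T + R                             # innovation covariance
    # Kalman gain K = P H^T S^{-1}; P and S are symmetric, so solve and transpose.
    K = np.linalg.solve(S, H @ P).T
    # Each member is compared against its own perturbed copy of the observation.
    perturbed = obs + rng.normal(0.0, np.sqrt(obs_var), size=(n_members, len(obs)))
    innovation = perturbed - ensemble @ H.T
    return ensemble + innovation @ K.T


# Toy prior: column 0 is an upstream GPP-like variable, column 1 a correlated
# downstream respiration-like variable (hypothetical values).
rng = np.random.default_rng(0)
gpp = rng.normal(10.0, 2.0, size=500)
reco = 0.5 * gpp + rng.normal(0.0, 0.3, size=500)
prior = np.column_stack([gpp, reco])

H = np.array([[1.0, 0.0]])          # only the upstream variable is observed
obs = np.array([12.0])
posterior = enkf_update(prior, obs, obs_var=0.25, H=H, rng=rng)

print("prior variance:    ", prior.var(axis=0, ddof=1))
print("posterior variance:", posterior.var(axis=0, ddof=1))
```

Only the first state variable is observed here, yet the spread of the second shrinks as well because the update acts through the ensemble covariance between the two. In the paper's framework that linkage is provided by the hierarchical surrogate rather than by a hand-built linear relation as in this toy setup.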

Original language	English (US)
Article number	113880
Journal	Remote Sensing of Environment
Volume	299
DOIs	https://doi.org/10.1016/j.rse.2023.113880
State	Published - Dec 15 2023

Bibliographical note

Publisher Copyright:
© 2023 Elsevier Inc.

Keywords

Agroecosystem
Carbon fluxes
Crop yield
Data assimilation
Knowledge-guided machine learning
U.S. Midwest

Cite this

@article{ba0aa53acf0942ae9459307a6ad2633f,

title = "A flexible and efficient knowledge-guided machine learning data assimilation (KGML-DA) framework for agroecosystem prediction in the US Midwest",

abstract = "Process-based models are widely used to predict the agroecosystem dynamics, but such modeled results often contain considerable uncertainty due to the imperfect model structure, biased model parameters, and inaccurate or inaccessible model inputs. Data assimilation (DA) techniques are widely adopted to reduce prediction uncertainty by calibrating model parameters or dynamically updating the model state variables using observations. However, high computational cost, difficulties in mitigating model structural error, and low flexibility in framework development hinder its applications in large-scale agroecosystem predictions. In this study, we addressed these challenges by proposing a novel DA framework that integrates a Knowledge-Guided Machine Learning (KGML)-based surrogate with tensorized ensemble Kalman filter (EnKF) and parallelized particle swarm optimization (PSO) to effectively assimilate historical and in-season multi-source remote sensing data. Specifically, we incorporate knowledge from a process-based model, ecosys, into a Gated Recurrent Unit (GRU)-based hierarchical neural network. The hierarchical architecture of KGML-DA mimics key processes of ecosys and builds a causal relationship between target variables. Using carbon budget quantification in the US Corn-Belt as a context, we evaluated KGML-DA's performance in predicting key processes of the carbon cycle at three agricultural sites (US-Ne1, US-Ne2, US-Ne3), along with county-level (627 counties) and 30-m pixel-level (Champaign County, IL) grain yield. The site experiments show that updating the upstream variable, e.g., gross primary production (GPP), improved the prediction of downstream variables such as ecosystem respiration, net ecosystem exchange, biomass, and leaf area index (LAI), with RMSE reductions ranging from 9.2% to 30.5% for corn and 4.8% to 24.6% for soybean. 
Uncertainty in downstream variables was automatically constrained after correcting the upstream variables, demonstrating the effectiveness of the causality linkages in the hierarchical surrogate. We found joint use of in-season GPP and evapotranspiration (ET) products along with historical GPP and surveyed yields achieved the best prediction for county-level yields, while assimilating in-season LAI observations benefitted the prediction in extreme years. Uncertainty and error analysis of regional yield estimation demonstrated that KGML-DA could reduce prediction error by 26.5% for corn and 36.2% for soybean. Remarkably, the GPU-based tensor operation design makes this DA framework more than 7000 times faster than the PB model with a High-Performance Computing system, indicating the high potential of the proposed framework for in-season, high-resolution agroecosystem predictions.",

keywords = "Agroecosystem, Carbon fluxes, Crop yield, Data assimilation, Knowledge-guided machine learning, U.S. Midwest",

author = "Qi Yang and Licheng Liu and Junxiong Zhou and Rahul Ghosh and Bin Peng and Kaiyu Guan and Jinyun Tang and Wang Zhou and Vipin Kumar and Zhenong Jin",

note = "Publisher Copyright: {\textcopyright} 2023 Elsevier Inc.",

year = "2023",

month = dec,

day = "15",

doi = "10.1016/j.rse.2023.113880",

language = "English (US)",

volume = "299",

journal = "Remote Sensing of Environment",

issn = "0034-4257",

publisher = "Elsevier",

}

TY - JOUR

T1 - A flexible and efficient knowledge-guided machine learning data assimilation (KGML-DA) framework for agroecosystem prediction in the US Midwest

AU - Yang, Qi

AU - Liu, Licheng

AU - Zhou, Junxiong

AU - Ghosh, Rahul

AU - Peng, Bin

AU - Guan, Kaiyu

AU - Tang, Jinyun

AU - Zhou, Wang

AU - Kumar, Vipin

AU - Jin, Zhenong

PY - 2023/12/15

Y1 - 2023/12/15

AB - Process-based models are widely used to predict the agroecosystem dynamics, but such modeled results often contain considerable uncertainty due to the imperfect model structure, biased model parameters, and inaccurate or inaccessible model inputs. Data assimilation (DA) techniques are widely adopted to reduce prediction uncertainty by calibrating model parameters or dynamically updating the model state variables using observations. However, high computational cost, difficulties in mitigating model structural error, and low flexibility in framework development hinder its applications in large-scale agroecosystem predictions. In this study, we addressed these challenges by proposing a novel DA framework that integrates a Knowledge-Guided Machine Learning (KGML)-based surrogate with tensorized ensemble Kalman filter (EnKF) and parallelized particle swarm optimization (PSO) to effectively assimilate historical and in-season multi-source remote sensing data. Specifically, we incorporate knowledge from a process-based model, ecosys, into a Gated Recurrent Unit (GRU)-based hierarchical neural network. The hierarchical architecture of KGML-DA mimics key processes of ecosys and builds a causal relationship between target variables. Using carbon budget quantification in the US Corn-Belt as a context, we evaluated KGML-DA's performance in predicting key processes of the carbon cycle at three agricultural sites (US-Ne1, US-Ne2, US-Ne3), along with county-level (627 counties) and 30-m pixel-level (Champaign County, IL) grain yield. The site experiments show that updating the upstream variable, e.g., gross primary production (GPP), improved the prediction of downstream variables such as ecosystem respiration, net ecosystem exchange, biomass, and leaf area index (LAI), with RMSE reductions ranging from 9.2% to 30.5% for corn and 4.8% to 24.6% for soybean. 
Uncertainty in downstream variables was automatically constrained after correcting the upstream variables, demonstrating the effectiveness of the causality linkages in the hierarchical surrogate. We found joint use of in-season GPP and evapotranspiration (ET) products along with historical GPP and surveyed yields achieved the best prediction for county-level yields, while assimilating in-season LAI observations benefitted the prediction in extreme years. Uncertainty and error analysis of regional yield estimation demonstrated that KGML-DA could reduce prediction error by 26.5% for corn and 36.2% for soybean. Remarkably, the GPU-based tensor operation design makes this DA framework more than 7000 times faster than the PB model with a High-Performance Computing system, indicating the high potential of the proposed framework for in-season, high-resolution agroecosystem predictions.

KW - Agroecosystem

KW - Carbon fluxes

KW - Crop yield

KW - Data assimilation

KW - Knowledge-guided machine learning

KW - U.S. Midwest

UR - http://www.scopus.com/inward/record.url?scp=85174587502&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85174587502&partnerID=8YFLogxK

U2 - 10.1016/j.rse.2023.113880

DO - 10.1016/j.rse.2023.113880

M3 - Article

AN - SCOPUS:85174587502

SN - 0034-4257

VL - 299

JO - Remote Sensing of Environment

JF - Remote Sensing of Environment

M1 - 113880

ER -
