KGML-ag: a modeling framework of knowledge-guided machine learning to simulate agroecosystems: a case study of estimating N2O emission using data from mesocosm experiments

Licheng Liu; Shaoming Xu; Jinyun Tang; Kaiyu Guan; Timothy J. Griffis; Matthew D. Erickson; Alexander L. Frie; Xiaowei Jia; Taegon Kim; Lee T. Miller; Bin Peng; Shaowei Wu; Yufeng Yang; Wang Zhou; Vipin Kumar; Zhenong Jin

doi:10.5194/gmd-15-2839-2022

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

Agricultural nitrous oxide (N2O) emission accounts for a non-trivial fraction of the global greenhouse gas (GHG) budget. To date, estimating N2O fluxes from cropland remains a challenging task because the related microbial processes (e.g., nitrification and denitrification) are controlled by complex interactions among climate, soil, plant and human activities. Existing approaches such as process-based (PB) models have well-known limitations due to insufficient representations of the processes or uncertainties of model parameters, and leveraging recent advances in machine learning (ML) requires a new method that unlocks the "black box" and overcomes ML limitations such as low interpretability, out-of-sample failure and massive data demand. In this study, we developed a first-of-its-kind knowledge-guided machine learning model for agroecosystems (KGML-ag) by incorporating biogeophysical and chemical domain knowledge from an advanced PB model, ecosys, and tested it by comparing simulated daily N2O fluxes with observed data from mesocosm experiments. The gated recurrent unit (GRU) was used as the basis to build the model structure. To optimize the model performance, we investigated a range of ideas, including (1) using initial values of intermediate variables (IMVs) instead of time series as model input to reduce data demand; (2) building hierarchical structures to explicitly estimate IMVs for further N2O prediction; (3) using multi-task learning to balance the simultaneous training on multiple variables; and (4) pre-training with millions of synthetic data generated from ecosys and fine-tuning with mesocosm observations. Six other pure ML models were developed using the same mesocosm data to serve as the benchmark for the KGML-ag model. Results show that KGML-ag did an excellent job in reproducing the mesocosm N2O fluxes (overall r² = 0.81 and RMSE = 3.6 mg N m⁻² d⁻¹ from cross validation). Importantly, KGML-ag always outperforms the PB model and ML models in predicting N2O fluxes, especially for complex temporal dynamics and emission peaks. Besides, KGML-ag goes beyond the pure ML models by providing more interpretable predictions as well as pinpointing desired new knowledge and data to further empower the current KGML-ag. We believe the KGML-ag development in this study will stimulate a new body of research on interpretable ML for biogeochemistry and other related geoscience processes.
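
As a rough illustration of ideas (2) and (3) in the abstract, the following is a minimal sketch, not the authors' implementation. It chains two small GRUs so that the first estimates IMVs from daily drivers and the second predicts N2O from the drivers plus the estimated IMVs, then combines both errors in one weighted loss. All names (TinyGRU, hierarchical_forward, multitask_loss, imv_weight), the layer sizes, and the random untrained weights are assumptions made for this example only; the paper itself should be consulted for the actual architecture and loss weighting.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRU:
    """Single-layer GRU with a linear read-out (hypothetical, untrained)."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        n_cat = n_in + n_hidden
        self.n_hidden = n_hidden
        self.w_z = mat(n_hidden, n_cat)   # update gate weights
        self.w_r = mat(n_hidden, n_cat)   # reset gate weights
        self.w_h = mat(n_hidden, n_cat)   # candidate state weights
        self.w_o = mat(n_out, n_hidden)   # linear read-out weights

    @staticmethod
    def _matvec(m, v):
        return [sum(w * x for w, x in zip(row, v)) for row in m]

    def step(self, x, h):
        xh = x + h  # list concatenation of input and previous state
        z = [sigmoid(a) for a in self._matvec(self.w_z, xh)]
        r = [sigmoid(a) for a in self._matvec(self.w_r, xh)]
        xrh = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in self._matvec(self.w_h, xrh)]
        # One common convention: z blends the old state with the candidate.
        h_new = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        return h_new, self._matvec(self.w_o, h_new)

    def run(self, sequence):
        h = [0.0] * self.n_hidden
        outputs = []
        for x in sequence:
            h, y = self.step(x, h)
            outputs.append(y)
        return outputs


def hierarchical_forward(imv_net, n2o_net, drivers):
    """Stage 1 estimates IMVs; stage 2 predicts N2O from drivers + IMVs."""
    imvs = imv_net.run(drivers)
    stacked = [d + v for d, v in zip(drivers, imvs)]
    n2o = [y[0] for y in n2o_net.run(stacked)]
    return imvs, n2o


def mse(pred, obs):
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def multitask_loss(imv_pred, imv_obs, n2o_pred, n2o_obs, imv_weight=0.5):
    """N2O error plus a weighted mean of per-IMV errors (weight is assumed)."""
    n_imv = len(imv_obs[0])
    imv_loss = sum(
        mse([row[k] for row in imv_pred], [row[k] for row in imv_obs])
        for k in range(n_imv)
    ) / n_imv
    return mse(n2o_pred, n2o_obs) + imv_weight * imv_loss


# Toy run with made-up sizes: 4 daily drivers, 3 IMVs, 10 days.
rng = random.Random(0)
n_drivers, n_imvs, n_days = 4, 3, 10
imv_net = TinyGRU(n_drivers, 8, n_imvs, rng)
n2o_net = TinyGRU(n_drivers + n_imvs, 8, 1, rng)
drivers = [[rng.random() for _ in range(n_drivers)] for _ in range(n_days)]
imvs, n2o = hierarchical_forward(imv_net, n2o_net, drivers)
```

In a real training setup the weights would be fitted by gradient descent, for example first against synthetic PB-model output and then against observations as in idea (4); the sketch above only shows how the two stages and the combined loss fit together.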

Original language	English (US)
Pages (from-to)	2839-2858
Number of pages	20
Journal	Geoscientific Model Development
Volume	15
Issue number	7
DOIs	https://doi.org/10.5194/gmd-15-2839-2022
State	Published - Apr 7 2022

Bibliographical note

Funding Information:
Financial support. This research was funded in part by the National Science Foundation SitS program (award no. 2034385) and the Advanced Research Projects Agency–Energy (ARPA-E), US Department of Energy, under award number DE-AR0001382.

Publisher Copyright:
© 2022 Licheng Liu et al.

Cite this

Liu, L., Xu, S., Tang, J., Guan, K., Griffis, T. J., Erickson, M. D., Frie, A. L., Jia, X., Kim, T., Miller, L. T., Peng, B., Wu, S., Yang, Y., Zhou, W., Kumar, V., & Jin, Z. (2022). KGML-ag: a modeling framework of knowledge-guided machine learning to simulate agroecosystems: a case study of estimating N2O emission using data from mesocosm experiments. Geoscientific Model Development, 15(7), 2839-2858. https://doi.org/10.5194/gmd-15-2839-2022

Liu, L, Xu, S, Tang, J, Guan, K, Griffis, TJ, Erickson, MD, Frie, AL, Jia, X, Kim, T, Miller, LT, Peng, B, Wu, S, Yang, Y, Zhou, W, Kumar, V & Jin, Z 2022, 'KGML-ag: a modeling framework of knowledge-guided machine learning to simulate agroecosystems: a case study of estimating N2O emission using data from mesocosm experiments', Geoscientific Model Development, vol. 15, no. 7, pp. 2839-2858. https://doi.org/10.5194/gmd-15-2839-2022

@article{2d99738fc7aa4dd3a1d56343d50e1876,
title = "KGML-ag: a modeling framework of knowledge-guided machine learning to simulate agroecosystems: a case study of estimating N2O emission using data from mesocosm experiments",
abstract = "Agricultural nitrous oxide (N2O) emission accounts for a non-trivial fraction of the global greenhouse gas (GHG) budget. To date, estimating N2O fluxes from cropland remains a challenging task because the related microbial processes (e.g., nitrification and denitrification) are controlled by complex interactions among climate, soil, plant and human activities. Existing approaches such as process-based (PB) models have well-known limitations due to insufficient representations of the processes or uncertainties of model parameters, and leveraging recent advances in machine learning (ML) requires a new method that unlocks the {"}black box{"} and overcomes ML limitations such as low interpretability, out-of-sample failure and massive data demand. In this study, we developed a first-of-its-kind knowledge-guided machine learning model for agroecosystems (KGML-ag) by incorporating biogeophysical and chemical domain knowledge from an advanced PB model, ecosys, and tested it by comparing simulated daily N2O fluxes with observed data from mesocosm experiments. The gated recurrent unit (GRU) was used as the basis to build the model structure. To optimize the model performance, we investigated a range of ideas, including (1) using initial values of intermediate variables (IMVs) instead of time series as model input to reduce data demand; (2) building hierarchical structures to explicitly estimate IMVs for further N2O prediction; (3) using multi-task learning to balance the simultaneous training on multiple variables; and (4) pre-training with millions of synthetic data generated from ecosys and fine-tuning with mesocosm observations. Six other pure ML models were developed using the same mesocosm data to serve as the benchmark for the KGML-ag model. Results show that KGML-ag did an excellent job in reproducing the mesocosm N2O fluxes (overall $r^2 = 0.81$ and RMSE = 3.6 mg N m$^{-2}$ d$^{-1}$ from cross validation). Importantly, KGML-ag always outperforms the PB model and ML models in predicting N2O fluxes, especially for complex temporal dynamics and emission peaks. Besides, KGML-ag goes beyond the pure ML models by providing more interpretable predictions as well as pinpointing desired new knowledge and data to further empower the current KGML-ag. We believe the KGML-ag development in this study will stimulate a new body of research on interpretable ML for biogeochemistry and other related geoscience processes.",
author = "Licheng Liu and Shaoming Xu and Jinyun Tang and Kaiyu Guan and Griffis, {Timothy J.} and Erickson, {Matthew D.} and Frie, {Alexander L.} and Xiaowei Jia and Taegon Kim and Miller, {Lee T.} and Bin Peng and Shaowei Wu and Yufeng Yang and Wang Zhou and Vipin Kumar and Zhenong Jin",
note = "Funding Information: Financial support. This research was funded in part by the National Science Foundation SitS program (award no. 2034385) and the Advanced Research Projects Agency–Energy (ARPA-E), US Department of Energy, under award number DE-AR0001382. Publisher Copyright: {\textcopyright} 2022 Licheng Liu et al.",
year = "2022",
month = apr,
day = "7",
doi = "10.5194/gmd-15-2839-2022",
language = "English (US)",
volume = "15",
pages = "2839--2858",
journal = "Geoscientific Model Development",
issn = "1991-959X",
publisher = "Copernicus Gesellschaft mbH",
number = "7",
}

TY  - JOUR
T1  - KGML-ag
T2  - a modeling framework of knowledge-guided machine learning to simulate agroecosystems: a case study of estimating N2O emission using data from mesocosm experiments
AU  - Liu, Licheng
AU  - Xu, Shaoming
AU  - Tang, Jinyun
AU  - Guan, Kaiyu
AU  - Griffis, Timothy J.
AU  - Erickson, Matthew D.
AU  - Frie, Alexander L.
AU  - Jia, Xiaowei
AU  - Kim, Taegon
AU  - Miller, Lee T.
AU  - Peng, Bin
AU  - Wu, Shaowei
AU  - Yang, Yufeng
AU  - Zhou, Wang
AU  - Kumar, Vipin
AU  - Jin, Zhenong
N1  - Funding Information: Financial support. This research was funded in part by the National Science Foundation SitS program (award no. 2034385) and the Advanced Research Projects Agency–Energy (ARPA-E), US Department of Energy, under award number DE-AR0001382. Publisher Copyright: © 2022 Licheng Liu et al.
PY  - 2022/4/7
Y1  - 2022/4/7
AB  - Agricultural nitrous oxide (N2O) emission accounts for a non-trivial fraction of the global greenhouse gas (GHG) budget. To date, estimating N2O fluxes from cropland remains a challenging task because the related microbial processes (e.g., nitrification and denitrification) are controlled by complex interactions among climate, soil, plant and human activities. Existing approaches such as process-based (PB) models have well-known limitations due to insufficient representations of the processes or uncertainties of model parameters, and leveraging recent advances in machine learning (ML) requires a new method that unlocks the "black box" and overcomes ML limitations such as low interpretability, out-of-sample failure and massive data demand. In this study, we developed a first-of-its-kind knowledge-guided machine learning model for agroecosystems (KGML-ag) by incorporating biogeophysical and chemical domain knowledge from an advanced PB model, ecosys, and tested it by comparing simulated daily N2O fluxes with observed data from mesocosm experiments. The gated recurrent unit (GRU) was used as the basis to build the model structure. To optimize the model performance, we investigated a range of ideas, including (1) using initial values of intermediate variables (IMVs) instead of time series as model input to reduce data demand; (2) building hierarchical structures to explicitly estimate IMVs for further N2O prediction; (3) using multi-task learning to balance the simultaneous training on multiple variables; and (4) pre-training with millions of synthetic data generated from ecosys and fine-tuning with mesocosm observations. Six other pure ML models were developed using the same mesocosm data to serve as the benchmark for the KGML-ag model. Results show that KGML-ag did an excellent job in reproducing the mesocosm N2O fluxes (overall r² = 0.81 and RMSE = 3.6 mg N m⁻² d⁻¹ from cross validation). Importantly, KGML-ag always outperforms the PB model and ML models in predicting N2O fluxes, especially for complex temporal dynamics and emission peaks. Besides, KGML-ag goes beyond the pure ML models by providing more interpretable predictions as well as pinpointing desired new knowledge and data to further empower the current KGML-ag. We believe the KGML-ag development in this study will stimulate a new body of research on interpretable ML for biogeochemistry and other related geoscience processes.
UR  - http://www.scopus.com/inward/record.url?scp=85128750717&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85128750717&partnerID=8YFLogxK
U2  - 10.5194/gmd-15-2839-2022
DO  - 10.5194/gmd-15-2839-2022
M3  - Article
AN  - SCOPUS:85128750717
SN  - 1991-959X
VL  - 15
SP  - 2839
EP  - 2858
JO  - Geoscientific Model Development
JF  - Geoscientific Model Development
IS  - 7
ER  -
