Abstract
Purpose
Long-term forecasts of commodity market indicators play an important role in informing policy and investment decisions by governments and market participants. Our study examines whether the accuracy of multi-step forecasts can be improved using deep learning methods.
Design/methodology/approach
We first formulate a supervised learning problem and set benchmarks for forecast accuracy using traditional econometric models. We then train a set of deep neural networks and measure their performance against the benchmark.
Findings
We find that while the United States Department of Agriculture (USDA) baseline projections perform better for shorter forecast horizons, the performance of the deep neural networks improves for longer horizons. The findings may inform future revisions of the forecasting process.
Originality/value
This study demonstrates an application of deep learning methods to multi-horizon forecasts of agricultural commodities, which is a departure from the current methods used in producing these types of forecasts.
Keywords
Citation
Bora, S.S. and Katchova, A.L. (2024), "Multi-step commodity forecasts using deep learning", Agricultural Finance Review, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/AFR-08-2023-0105
Publisher
Emerald Publishing Limited
Copyright © 2024, Siddhartha S. Bora and Ani L. Katchova
License
Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode
1. Introduction
The availability of long-term information about commodity markets plays a vital role in policy and investment decisions by market participants. Forecasts of season-average farm prices of major field crops such as corn, soybeans and wheat are widely used to inform decisions by farmers, agricultural businesses and the government. Similarly, forecasts of harvested area and yield provide information about production of the commodities for the marketing year and help anticipate ending stocks. The USDA's World Agricultural Supply and Demand Estimates (WASDE) provide forecasts about commodities for the current marketing year. However, market participants may need information about market trends beyond the current marketing year to inform their decisions. For example, forecasts for the next few years can facilitate comparisons of policy alternatives by government agencies. Similarly, long-term forecasts can help estimate the outlays of various farm program costs under the federal budget. The Farm Bill programs are typically implemented in five-year cycles, and having information for the next five years helps immensely in planning the budget. Long-term price and crop yield forecasts may also inform farmers' long-term decisions about planting, crop choice and land use. For example, the decision to enroll farmland in federal programs like the Conservation Reserve Program (CRP) may be informed by crop price and yield forecasts for multiple years into the future. The importance of reliable long-term forecasts became evident when the pandemic hit the economy and policymakers required information far into the future to plan the recovery process.
The USDA’s baseline projections, published every year in February, are one of the principal sources of long-term information about the USA farm sector. The baselines are produced by a team from ten USDA agencies, including the Economic Research Service (ERS), and contain annual projections of key measures of agricultural market conditions for the next decade. These projections facilitate comparisons of policy alternatives by providing a conditional “baseline” scenario based on specific macroeconomic, weather, policy and trade assumptions. Over the years, the baseline projections have been used for a variety of purposes, including estimating farm program costs and preparing the president’s budget. In addition to the USDA, the Food and Agricultural Policy Research Institute (FAPRI), University of Missouri, produces similar ten-year projections of key agricultural variables. The baseline projections are produced through a mixture of the output of quantitative models and expert opinions. Previous studies show that many variables in the USDA baseline projections are biased and that the predictive content of the baselines diminishes after a few years (Bora et al., 2023; Katchova, 2024; Fang and Katchova, 2023; Chandio and Katchova, 2024). As the evaluation of the baselines has shown their limited predictive content, an investigation of alternative methods to improve the long-term projections becomes essential.
This study aims to forecast the harvested area, yield and farm price of three major field crops in the USA for the next five years using deep learning models. Our investigation is performed in three steps. First, we formulate a supervised learning problem for the forecasting process and develop a test harness to compare the performance of various methods based on a train-test split of the sample. The last ten years were used as a test sample using a walk-forward validation approach. Second, we benchmark the performance of traditional methods such as a naïve no-change forecast, exponential smoothing and USDA baseline reports. Finally, we implement a suite of deep learning models to predict commodity market indicators, with particular emphasis on long short-term memory (LSTM) recurrent neural networks (RNN), convolutional neural networks (CNN) and their hybrids. We train the deep learning models using a large number of input features reflecting macroeconomic indicators, demographic trends, weather variability, global trade and demand and supply of key commodities.
Previous studies have looked at the potential of using machine learning to improve commodity forecasts. Using satellite and weather data, Roznik et al. (2023) show that XGBoost-based machine learning models can produce reasonably accurate crop yield forecasts comparable to those produced by WASDE reports. We investigate the potential of employing deep learning techniques to produce multi-year forecasts of crop harvested acres, yield and farm price. The closest available information to such forecasts are the USDA baselines, which are the result of a complex process involving econometric analysis and expert inputs from various agencies within the USDA. Compared to traditional time-series forecasting models, deep neural networks, particularly LSTMs, excel at capturing nonlinear dependencies within sequential data (Panigrahi and Behera, 2017) and can model complex relationships between factors such as macroeconomic conditions, weather and crop variables. With careful feature selection for economic interpretation, deep neural networks can also process diverse types of inputs. Another benefit is that the forecasts can be generated from publicly accessible data by any interested market participant.
Our study contributes to the literature in several ways. We use state-of-the-art deep learning methods to improve the long-term forecasts of commodity market indicators. While deep learning methods have shown great promise in forecasting in other fields (Kim and Won, 2018; Huang et al., 2020; Wang et al., 2019; Borovykh et al., 2019; Wan et al., 2019; Lara-Benítez et al., 2020), their application in predicting long-term agricultural statistics such as the USDA baselines has been limited. This study aims to bridge this gap. Our results suggest that deep learning networks may perform better than the official USDA baselines at longer forecast horizons. In particular, when the USDA baselines perform well, deep learning models match the accuracy, but if the USDA baselines do not perform well, deep learning models perform better. These findings may have important implications for future revisions of the USDA baseline models and processes. Deep learning models with improved accuracy may offer insights for the existing USDA baseline models by determining where improvements to the accuracy and performance of these models can occur. The existing process of producing the baseline reports involves many agencies, which work on specific components of the report and create inputs for the composite model. Deep learning methods have the potential to contribute to their work by identifying where the original USDA baseline models can be improved so they become more accurate and perform better.
The remainder of this article is organized as follows: The next section describes the various datasets used in this study. The third section describes the methodology, followed by results and discussion. The final section contains concluding remarks.
2. Data
Our dataset of the target variables consists of historical values of harvested area, yield and farm price of corn, soybeans and wheat in the USA since 1961. Together, these three field crops constitute a significant share of the area under cultivation in the USA. The values are averages for the marketing years, which differ by crop. The marketing year for corn and soybeans begins on September 1 and comprises four quarters. For example, the marketing year 2021–2022 for corn and soybeans starts on September 1, 2021, and ends on August 31, 2022. The 2021–2022 marketing year for wheat begins on June 1, 2021, and ends on May 31, 2022. All this information was obtained using the National Agricultural Statistics Service (NASS) Quickstats API (USDA National Agricultural Statistics Service, 2024). Figure 1 shows the plots of harvested area, yield and farm price of the three crops for the period 1961–2021. The figures suggest that many of these indicators are highly correlated, and they may be related to each other or to other macroeconomic, weather or trade indicators. For example, the loss of wheat harvested area over the years is accompanied by a contemporaneous increase in soybean harvested area.
An archive of the USDA agricultural baseline projections since 1997 is available at the Albert R. Mann Library at Cornell University (USDA ERS, 2024). The baseline reports typically include estimates for the previous year(s) and projections for the next ten years. For example, the February 2022 USDA report contains realized estimates for 2020, provisional estimates for 2021 and projections for 2022–2031 (USDA Office of Chief Economist, 2022). The exact information set available to the committee producing the projections in the early years is difficult to retrieve, both because the variables in the committee's information set are not documented and because the projections and realized estimates are often revised long after they are first published. As the organizations involved in the projection process go through personnel and information technology infrastructure changes over the years, the exact information used to produce the baselines is challenging to ascertain. For example, there is no way to recover the exact data the committee used when preparing the baseline projections for 1997; we can only assume that the committee made the best use of the information it had at the time. To mimic the forecasting process of the committee, we provide a rich set of input features, including macroeconomic, population, trade and weather information, to train our deep learning models. The committee may have had a different set of variables and/or different values for these variables that were later revised to what is available today. Our goal is to use deep learning methods to produce forecasts from a similar information set and examine whether these forecasts have superior performance over the USDA baselines.
We use data from several sources as input features to train the deep learning models. First, we use the lagged values of commodity indicators to forecast their future values. We also include several macroeconomic, population, trade, and weather variables for the world and the USA as input features to our models. These include growth rates for gross domestic product (GDP) and population. For the USA economy, we also include inflation, unemployment, labor market participation and interest rates. We also include features that represent changes in weather in the world and the USA over time. To account for temperature changes all over the world, we include global annual average temperature anomalies, measured as deviations from the 20th century average. The macroeconomic data are taken from the World Bank Open Data Catalog. Similarly, we include the USA's annual average temperature, maximum temperature, minimum temperature, precipitation and heating and cooling degree days. All weather information was obtained from the National Oceanic and Atmospheric Administration (NOAA) (National Centers for Environmental Information, 2022). Finally, we add commodity balance sheet variables representing domestic use, imports, exports and ending stocks of corn, soybeans and wheat as input features. The commodity balance sheet information is extracted from the Production, Supply and Distribution (PSD) Database published by the USDA Foreign Agricultural Service (USDA Foreign Agricultural Service, 2024). We have provided the descriptive statistics of the input features in Table 1 and their correlation plot in Figure 2.
3. Methodology
In this section, we define our prediction problem and proceed to develop a test harness for comparing the performance of the methods used in this study. We then describe the different traditional and deep learning methods used in this study.
3.1 The prediction problem
We denote the realized or actual values of the commodity indicators (harvested acres, yield and farm price for corn, soybeans and wheat) in year t by yt. At year t, the forecaster makes a forecast ŷt+h|t of the value h years ahead, for horizons h = 0, 1, …, 4, where ŷt+h|t denotes the forecast of yt+h made with the information available at year t.
We first transform the prediction problem into a supervised learning problem where a set of input features X is mapped to an output variable y. For year t, our input Xt consists of vectors of all input features up to lag five and yt consists of vectors of the next five years of values of the target variables (harvested acres, yield and farm price of corn, soybeans and wheat). From our dataset for the time period 1961–2021, we construct {Xt, yt} pairs for 52 years between 1966 and 2017. This yields a three-dimensional array of input features X with dimensions (52, 5, n_features), where n_features is the total number of input features. This is important since the deep learning models used in this study accept three-dimensional input. We use a total of 44 input features in this study; however, this number can be augmented by including additional features.
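The windowing step described above can be sketched in a few lines of NumPy. The function name and toy dimensions are illustrative, not taken from the authors' code; with 61 annual observations (1961–2021), five lags and five leads, it yields exactly the 52 samples described in the text.

```python
import numpy as np

def make_supervised(data, n_lags=5, n_ahead=5):
    """Convert a (years, features) matrix into {X_t, y_t} pairs with
    X of shape (samples, n_lags, n_features) and y of shape
    (samples, n_ahead, n_features) -- the 3-D layout that recurrent
    layers expect."""
    X, y = [], []
    # sample t uses years t-n_lags .. t-1 as input, t .. t+n_ahead-1 as output
    for t in range(n_lags, data.shape[0] - n_ahead + 1):
        X.append(data[t - n_lags:t])
        y.append(data[t:t + n_ahead])
    return np.asarray(X), np.asarray(y)

# toy data: 61 "years" (1961-2021) of 3 features
rng = np.random.default_rng(0)
data = rng.normal(size=(61, 3))
X, y = make_supervised(data)
print(X.shape, y.shape)  # (52, 5, 3) (52, 5, 3)
```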
3.2 Developing a test harness
A test harness ensures that all deep learning methods used in this study are evaluated using a consistent approach for comparability. The important components of our test harness are the train-test split validation strategy and the evaluation criteria.
3.2.1 Train-test split
Our dataset contains commodity market variables of harvested area, yield and farm price for corn, soybeans and wheat between 1961 and 2021. Since we use up to five-year-lagged features in our deep learning algorithms to produce five-year-ahead forecasts, this results in a complete dataset of features (X) and outputs (y) between 1966 and 2017, for a total of 52 years. We use the last ten years of the data, 2008 to 2017, as our test sample, representing close to 20% of the entire sample. As preferred in time-series applications, we use a walk-forward validation strategy, allowing updated information to train the model as we progress through the years in the test sample. Under an expanding training window, the training sample increases as we walk through the test sample. For example, we train a model using the 42 samples between 1966 and 2007 to produce forecasts for 2008, then add the 2008 sample back to the training set to produce forecasts for 2009, and so on. This validation strategy closely follows how the USDA produces the baseline reports, as forecasters make use of new information as it becomes available. The alternative is a sliding window, where the oldest training sample is dropped as each new sample is added, keeping the length of the training sample constant. The sliding window keeps the training data closer to the recent past, while the expanding window makes use of all the information available, which is attractive given our small sample size; we report results for both strategies.
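The walk-forward scheme can be sketched as a plain-Python split generator (the function name and interface are illustrative, not the authors' code); it supports both the expanding and the sliding window variants.

```python
def walk_forward_splits(n_samples, test_start, expanding=True, window=None):
    """Yield (train_indices, test_index) pairs for walk-forward validation.

    With expanding=True the training set grows by one sample per step;
    otherwise a fixed-length sliding window of size `window` is used.
    """
    for t in range(test_start, n_samples):
        lo = 0 if expanding else max(0, t - window)
        yield list(range(lo, t)), t

# 52 samples (1966-2017); the last 10 (2008-2017) form the test set
splits = list(walk_forward_splits(52, 42))
print(len(splits))         # 10 walk-forward steps
print(len(splits[0][0]))   # 42 training samples for the 2008 forecast
print(len(splits[-1][0]))  # 51 training samples for the 2017 forecast
```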
3.2.2 Evaluation criteria
We use two widely adopted error metrics to measure the performance of the proposed methods: root mean squared error (RMSE) and mean absolute percent error (MAPE). The RMSE is calculated at the level of the variables, while the MAPE is calculated relative to the actual level of the variables according to the following formulas:

$$\mathrm{RMSE}_h = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(y_{t+h} - \hat{y}_{t+h|t}\right)^2}, \qquad \mathrm{MAPE}_h = \frac{100}{N}\sum_{t=1}^{N}\left|\frac{y_{t+h} - \hat{y}_{t+h|t}}{y_{t+h}}\right|,$$

where $y_{t+h}$ is the realized value, $\hat{y}_{t+h|t}$ is the forecast made in year $t$ for horizon $h$ and $N$ is the number of test years.
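A minimal NumPy implementation of the two metrics (the function names are ours, for illustration):

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error, in the units of the variable."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((a - f) ** 2)))

def mape(actual, forecast):
    """Mean absolute percent error, relative to the actual level."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(100.0 * np.mean(np.abs((a - f) / a)))

print(rmse([100, 110], [90, 120]))   # 10.0
print(mape([100, 200], [110, 180]))  # 10.0
```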
3.3 Benchmarking with traditional methods
3.3.1 Naïve benchmark
We first develop a benchmark that the other methods must improve upon. A natural choice is a naïve no-change forecast, where we take the most recent year's value as the forecast for each of the next five years. This is a fairly naïve benchmark that would result in high forecast errors. Any econometric or deep learning method is expected to perform better than this naïve benchmark, as these methods are supposed to add some skill to the forecast.
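The no-change benchmark amounts to repeating the latest observation (a toy sketch, assuming NumPy; the function name is ours):

```python
import numpy as np

def naive_forecast(history, n_ahead=5):
    """No-change benchmark: repeat the most recent observation."""
    return np.repeat(np.asarray(history, float)[-1], n_ahead)

print(naive_forecast([3.2, 3.5, 3.8]))  # [3.8 3.8 3.8 3.8 3.8]
```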
3.3.2 Simple exponential smoothing (ETS)
We also use the simple exponential smoothing (ETS) method, which is useful for forecasting when a time series has no clear trend or seasonal pattern. The ETS forecast is a weighted average of past observations, where the weights decay exponentially for older observations. The ETS method can be expressed in terms of the following equations (Hyndman and Athanasopoulos, 2021):

$$\hat{y}_{t+h|t} = \ell_t, \qquad \ell_t = \alpha y_t + (1-\alpha)\ell_{t-1},$$

where $\ell_t$ is the smoothed level of the series and $0 \le \alpha \le 1$ is the smoothing parameter.
3.3.3 Exponential smoothing (ETS) with trend
We then use an extension of the simple ETS method that allows for a trend (Holt, 2004). Some of our data series, such as crop yield, show a clear time trend, and farm price may also be trending upward over the years. The ETS method with trend can be expressed as (Hyndman and Athanasopoulos, 2021):

$$\hat{y}_{t+h|t} = \ell_t + h b_t, \qquad \ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1}), \qquad b_t = \beta^{*}(\ell_t - \ell_{t-1}) + (1-\beta^{*})b_{t-1},$$

where $b_t$ is the trend (slope) component and $0 \le \beta^{*} \le 1$ is its smoothing parameter.
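Holt's recursions can be implemented directly; in practice a library such as statsmodels would fit the smoothing parameters, but the following plain-Python sketch (with illustrative, not fitted, parameter values) makes the mechanics concrete. On a perfectly linear series the method extrapolates the trend exactly.

```python
def holt_forecast(y, alpha=0.8, beta=0.2, n_ahead=5):
    """Exponential smoothing with trend (Holt's linear method).

    alpha smooths the level and beta the trend; both are illustrative
    values here rather than fitted parameters.
    """
    level, trend = y[0], y[1] - y[0]  # a common initialization
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # the h-step-ahead forecast extrapolates the final level and trend
    return [level + h * trend for h in range(1, n_ahead + 1)]

# a perfectly linear series is extrapolated exactly
print(holt_forecast([10, 12, 14, 16, 18], n_ahead=3))  # approx [20.0, 22.0, 24.0]
```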
3.3.4 Auto-regressive integrated moving average (ARIMA)
Auto-regressive integrated moving average (ARIMA) models are useful in forecasting when the time series can be made stationary by differencing. An ARIMA(p, d, q) model consists of p autoregressive terms and q lagged forecast errors in the prediction equation and requires differencing d times to achieve stationarity. While ARIMA models are traditionally univariate, ARIMA with exogenous features (ARIMAX) is also available. We employ the implementation from the pmdarima library in Python, whose auto_arima function automates the calibration of the ARIMA models (Smith, 2017).
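To illustrate the idea, the sketch below fits an AR(p) model to a once-differenced series by ordinary least squares and forecasts recursively, i.e. an ARIMA(p, 1, 0) without the moving-average terms. The function name is ours; the study itself relies on pmdarima's auto_arima rather than this simplified version.

```python
import numpy as np

def ar_on_differences(y, p=2, n_ahead=5):
    """ARIMA(p, 1, 0) sketch: fit AR(p) to the first differences by
    least squares, forecast recursively, then undo the differencing
    to return to the original scale."""
    y = np.asarray(y, float)
    z = np.diff(y)                            # d = 1 differencing
    rows = [z[i - p:i][::-1] for i in range(p, len(z))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coefs, *_ = np.linalg.lstsq(X, z[p:], rcond=None)
    hist = list(z)
    for _ in range(n_ahead):                  # recursive multi-step forecast
        lags = np.array(hist[-p:][::-1])
        hist.append(float(coefs[0] + coefs[1:] @ lags))
    # integrate the forecast differences back onto the last observed level
    return y[-1] + np.cumsum(hist[len(z):])

# a linear series (constant differences) is continued exactly
print(ar_on_differences([1, 3, 5, 7, 9, 11, 13, 15]))
```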
3.3.5 USDA baseline report
Our final choice for comparison is the projections produced by the USDA in their baseline report. These projections are produced using a mixture of economic/econometric models, survey information and expert opinions. We calculate the error metrics for baseline projections up to five years for the test period 2008–2017 for comparison with the other methods used in our study. As mentioned earlier, the exact information set used to produce these projections is challenging to ascertain. Therefore, the comparison with deep learning methods using the current training set may not be entirely justifiable.
3.4 Deep learning methods
The methods discussed in the previous section are traditional time-series forecasting models. In recent years, however, deep neural networks have become popular for forecasting time series (Schmidhuber, 2015). Neural networks are a family of algorithms used in pattern recognition; deep learning refers to the subset of neural networks with more than three layers.
The most basic deep learning networks are feed-forward neural networks (FNNs) that do not allow recursive feedback, such as the multi-layer perceptron (MLP). The computational architecture of FNNs consists of three kinds of layers: an input layer, hidden layer(s) and an output layer. Since two consecutive layers have only direct forward connections, FNNs ignore the temporal nature of the data and treat each input independently. Therefore, they are of limited use for our data, which are inherently temporal and sequential. We consider two main families of deep learning methods that account for temporal dependence in sequences, namely RNNs and CNNs. We also explore hybrid deep learning models, which have become more popular in recent years.
3.4.1 Recurrent neural networks
RNNs are popular in time series prediction applications. An RNN allows recursive feedback, and each RNN unit can take the current and previous input simultaneously. They are widely used for prediction in different fields, including stock price forecasting (Kim and Won, 2018), wind speed forecasting (Huang et al., 2020) and solar radiation forecasting (Wang et al., 2019). Moreover, RNNs have done remarkably well at forecasting competitions, such as the recent M4 forecasting competition (Makridakis et al., 2018). In a recent study, Medvedev and Wang (2022) used RNNs to predict the volatility of the S&P 500 Index (SPX) for pricing options, with good success. However, we are not aware of any studies applying RNNs to forecast long-term information about agricultural markets.
Elman (1990) proposed an early RNN, which generalizes FNN by using recurrent links in order to provide networks with dynamic memory. This type of network is more suitable for handling ordered data sequences like financial time series. While Elman’s RNN model is simple, training these models is difficult due to inefficient gradient propagation. In particular, the problem of vanishing and exploding gradients makes it challenging to learn long-term dependencies. Due to vanishing gradients, it may take a long time to train the model, while the exploding gradients may cause the model’s weights to oscillate (Lara-Benítez et al., 2021).
LSTM networks were proposed to address the vanishing and exploding gradient problems faced by standard RNNs (Hochreiter and Schmidhuber, 1997). LSTMs can model long-term temporal dependencies without compromising short-term patterns. LSTM networks have a similar structure to Elman's RNN but differ in the composition of the hidden layer, known as the LSTM memory cell. Each LSTM cell has three gates: a multiplicative input gate that controls what enters the memory units, a multiplicative output gate that protects other cells from noise and a forget gate. Gated recurrent units (GRUs) are simplified versions of LSTMs that replace the forget and input gates with a single update gate to reduce the number of trainable parameters. An RNN can also have stacked recurrent layers to form a deep RNN.
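The gating logic of a single LSTM cell can be written out in NumPy as follows; this is a didactic sketch with random weights, not the Keras implementation used in the study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack the input (i), forget (f),
    output (o) and candidate (g) transforms, each of hidden size n."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: how much new content to write
    f = sigmoid(z[n:2 * n])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much memory to expose
    g = np.tanh(z[3 * n:])       # candidate cell update
    c = f * c_prev + i * g       # long-term memory (cell state)
    h = o * np.tanh(c)           # short-term memory (hidden state)
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # process a length-5 input sequence
    h, c = lstm_cell_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

Because the hidden state is the product of a sigmoid output gate and a tanh of the cell state, its entries are always strictly inside (-1, 1).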
3.4.2 Convolutional neural networks
CNNs are mainly used in classification applications such as speech recognition, object recognition and natural language processing (NLP). However, with some adjustments, they can be used for time-series predictions as well. A CNN uses the convolutional operation to extract meaningful features from raw data and create feature maps (Lara-Benítez et al., 2021). A CNN consists of convolution layers, pooling layers and fully connected layers. The pooling layers lower the spatial dimension of the feature maps, while the fully connected layers combine the local features to form global features. As CNNs have a smaller number of trainable parameters, the learning process is more time-efficient than RNNs (Borovykh et al., 2019). In addition, different convolutional layers can be stacked together to allow the transformation of raw data (Chen et al., 2020).
Hybrid models are a recent trend in time-series forecasting using deep learning. For example, depending on the application, LSTMs can be used with RNNs or CNNs. Also, deep learning models can be used with traditional econometric methods to achieve superior results. The winning entry in the M4 forecasting competition in 2018 used a hybrid ETS-LSTM model (Smyl, 2020). While the ETS component captures seasonality, the LSTM focuses on non-linear trends and cross-learning from related series.
3.4.3 Training the deep neural networks
In this study, we use three deep learning architectures to forecast commodity market indicators.
Vanilla LSTM: The first architecture that we use is a simple LSTM model with one LSTM layer.
Encoder-decoder LSTM (ED-LSTM): The second architecture that we use is an ED-LSTM with two layers. The first layer reads the input sequence and encodes it into a fixed-length vector, and the second layer decodes the fixed-length vector and outputs the predicted sequence.
CNN-LSTM: The last architecture that we use consists of an LSTM preceded by a convolution layer at the input.
Each architecture has a multiple-input multiple-output configuration, with three-dimensional tensors as inputs and outputs. The input consists of lagged values of the commodity variables and additional features representing the macroeconomic environment, weather and commodity balance sheet information. The outputs are five future time steps of the nine target variables: acres, yields and farm prices of corn, soybeans and wheat. As the input features are correlated, we apply a principal component analysis (PCA) to the scaled input features and retain 11 orthogonal components, representing 95% of the variance in the features. This allows us to reduce the dimensionality of the feature vector.
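The PCA step can be sketched with an SVD in NumPy (the function name and toy data are illustrative; in practice a library implementation such as scikit-learn's PCA would typically be used):

```python
import numpy as np

def pca_reduce(X, var_target=0.95):
    """Standardize the columns of X, then keep the fewest principal
    components whose cumulative explained variance reaches var_target."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_target)) + 1
    return Xs @ Vt[:k].T, k

# 12 correlated features driven by 4 latent factors, over 52 "years"
rng = np.random.default_rng(2)
base = rng.normal(size=(52, 4))
X = np.column_stack([base, base @ rng.normal(size=(4, 8))])
Z, k = pca_reduce(X)
print(Z.shape, k)  # at most 4 components are needed for this toy data
```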
The forecasting process is outlined in Figure 3. We take a number of steps to ensure that we do not overfit the LSTM models. We train the deep learning networks using the Keras (Chollet, 2015) and TensorFlow (Abadi et al., 2015) libraries in Python. To reduce the chances of overfitting, we choose the model hyperparameters using a grid search over a validation set. In particular, we search over all combinations of selected values for the number of LSTM units, the batch size, the number of training epochs and the dropout rate. The number of LSTM units determines the capacity of the LSTM layers. The number of epochs indicates the number of passes through the training dataset, while the batch size indicates the number of training samples processed before the model parameters are updated again.
In Figure 4, we plot the training and validation losses against training epochs for the tuned configuration of each of our models using a 20% validation split. While there is an expected gap between the learning curves of the training and validation sets, there is no sign of severe overfitting. The literature shows that sample size contributes to the predictive performance of LSTM models, but also that LSTMs can remain effective with relatively small samples. Boulmaiz et al. (2020) show this for rainfall-runoff modeling and streamflow prediction: training LSTM models with records ranging from 3 to 15 years, they find that an LSTM trained on only three years of data matches a benchmark model that requires nine years of data to yield similar results. The study also tested varying amounts of data from hundreds of watersheds to train an optimal LSTM model. While it is traditionally expected that more data would result in higher model accuracy, the study finds that, beyond an optimal value, model accuracy does not improve significantly with the addition of data.
Due to the stochastic nature of the deep learning models, we average over 100 model runs and report the mean and standard deviation of the errors across these models. As is standard practice, we normalize the input features using a MinMaxScaler, which transforms all feature values to lie in the range [0, 1]. We also train our models using a standard scaler, and the results are comparable. We compile the models using the Adam optimizer (Kingma and Ba, 2014) and a Huber loss function, which is less susceptible to outliers (Huber, 1964). We present our main results for a sliding training window, where older training data are replaced with newer data during walk-forward validation, and include results for an expanding training window as additional results.
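The Huber loss combines a quadratic penalty for small residuals with a linear penalty for large ones, which is why it is less sensitive to outliers than squared error. A NumPy sketch of the mean-reduced loss, following the standard definition with threshold delta (the function name is ours; Keras provides an equivalent built-in loss):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5*r^2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quadratic, linear)))

print(huber_loss([0.0], [0.5]))  # 0.125 (quadratic region)
print(huber_loss([0.0], [3.0]))  # 2.5 (linear region)
```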
4. Results and discussions
We present the forecast accuracy metrics for the harvested area, yield and farm price of the three commodities using traditional econometric models in Tables 2–4. The naïve benchmark is a low bar, and any model that yields smaller errors than the naïve benchmark is considered skillful. The USDA baselines have smaller RMSE and MAPE than the naïve benchmark for harvested area, yield and farm price for all three crops across all horizons. Any candidate algorithm to improve the baselines would need two desirable properties. At a minimum, it must perform better than the naïve benchmark. Second, it should improve on the performance of the USDA baselines, at least for some horizons. In particular, smaller forecast errors at longer horizons would be a valuable contribution, as the USDA baselines tend to be less informative at longer horizons (Bora et al., 2023). Figures 5 and 6 compare the forecast errors of all methods for the harvested area, yield and farm price of the three commodities.
As expected, the RMSEs and MAPEs of the naïve benchmark are very high for most indicators across all horizons. The ETS methods, with or without trend, do not result in a considerable improvement in accuracy and have errors that are comparable to the naïve benchmark for forecasts of harvested area and farm price. For crop yield forecasts, the ETS with trend model performs well. The USDA baseline and the deep neural networks generally show superior skill compared to the naïve benchmarks. We focus the rest of our discussion on the performance of the USDA baseline and the three deep learning models in Tables 5–7.
The USDA predicts the harvested area of crops for the current year more accurately than the other methods (Tables 2 and 5). At h = 0, the MAPEs of the USDA baselines for corn, soybean and wheat harvested areas are 1.70, 3.79 and 3.69%, respectively. The MAPEs of the USDA baselines for corn harvested area remain low at longer horizons, with LSTM forecasts matching their performance closely for h = {1, 2, 3, 4}. The vanilla LSTM shows better accuracy than all other models for h = {1, 2, 3}, but its MAPEs are high for h = {0, 4}. The USDA baselines do not perform well in predicting the harvested area of soybeans and wheat at longer horizons, with large increases in MAPEs between h = 0 and h = 4 for both crops. For h = {2, 3, 4}, the ED-LSTM and CNN-LSTM forecasts have comparable accuracy for soybean and wheat harvested areas, with the CNN-LSTM model performing slightly better.
The USDA projections of crop yields are fairly accurate across horizons, with MAPEs around 5% (Tables 3 and 6). As observed in Figure 1, crop yield has a strong time trend for all crops, making it easier to predict if the trend is correctly identified. The ETS with trend model closely matches the performance of the USDA model for all three crops, suggesting the USDA might be using a similar trend-based model to predict crop yield. The LSTM model has lower MAPEs than the USDA baselines at horizons h = {1, 2, 3}, but its accuracy drops sharply at h = 4.
The deep learning models show a noticeable improvement in accuracy when predicting farm prices, an indicator for which the USDA baselines have much lower accuracy at longer horizons (Tables 4 and 7). At h = 0, the MAPEs of the USDA baselines are the lowest among all models; however, they increase at longer horizons. Between h = 0 and h = 4, the MAPEs of the USDA corn price baselines increase from 13.25 to 20.35%. Over the same horizons, the MAPEs of the soybean and wheat price baselines increase from 10.34 to 18.55% and from 13.58 to 23.17%, respectively. The price forecasts from the LSTM model show very low MAPEs at horizons h = {1, 2, 3}, but their performance degrades drastically at h = 4. Given that farmers frequently choose between crops when planting, and that program payments depend on prices, reliable long-term commodity price forecasts have implications for planting decisions and for estimating outlays for federal programs.
The USDA baselines generally perform better than all other methods for the current-year forecasts (h = 0). For example, the current-year USDA baselines for harvested areas of corn, soybeans and wheat have MAPEs of 1.70, 3.79 and 3.69%, respectively, which are among the lowest of all models. The current-year USDA crop yield baselines have low MAPEs as well, though the deep learning methods show comparable performance at h = 0. Similarly, the MAPEs of the USDA baselines are the lowest for the current-year forecasts of farm prices of the three crops (13.25% for corn, 10.34% for soybeans and 13.58% for wheat). These findings show that for indicators like yield, for which the USDA baselines are relatively accurate, the deep learning methods do not offer much improvement. However, for indicators like farm prices and, to some extent, harvested areas, which are more difficult to predict and carry higher errors, the deep learning methods can improve on the accuracy of the USDA baselines. This is not surprising, since the USDA draws on rich market and survey information and expert judgment when making predictions for the current year, whereas all other methods rely solely on past patterns. The value of survey information and expert judgment diminishes at longer horizons, where the LSTM model performs at least as well as, and often better than, the USDA baselines. Among the three deep learning methods, the LSTM models show the most accurate performance across indicators and horizons.
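The accuracy comparisons above rest on two standard metrics, RMSE and MAPE. A minimal sketch of how they are computed (the function names are our own; the paper does not publish its evaluation code):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the units of the indicator."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Undefined when an actual value is zero."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)
```

Because MAPE is scale-free, it allows the accuracy of corn, soybean and wheat forecasts to be compared directly even though the indicators are measured in different units.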
Our study provides a working example to demonstrate that deep learning methods may produce more accurate multi-step commodity forecasts. One way to improve the predictions may be to add more input features to the problem, such as variables for additional crops. As in many high-dimensional, small-sample applications of deep learning (Vabalas et al., 2019; Shen et al., 2022), incorporating additional features may help overcome challenges posed by limited training samples and facilitate better forecast performance. Such high-dimensional networks might need a more complex architecture than the ones used in this study. The three deep learning models used in this study are still relatively simple compared to what a production-ready model with more input features and additional target variables to cover the entire baseline report would entail. For robustness purposes, we provide additional results using an expanding training window in Tables 8–10, with similar forecast errors to the sliding window. In addition, we provide the results using a standard scaler in Tables 11–13, with similar forecast errors to the MinMaxScaler.
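The sliding versus expanding window comparison reported in the robustness tables can be sketched as follows. This is a hypothetical helper, assuming annual observations indexed 0 to n-1, and is not the authors' actual code:

```python
def window_splits(n_years, train_size, scheme="sliding"):
    """Yield (train_indices, test_index) pairs for year-by-year evaluation.

    scheme="sliding": a fixed-length window that moves forward each year.
    scheme="expanding": the window start stays at year 0, so the
    training set grows as more years become available.
    """
    for test_idx in range(train_size, n_years):
        start = test_idx - train_size if scheme == "sliding" else 0
        yield list(range(start, test_idx)), test_idx
```

A sliding window discards the oldest years as new ones arrive, while an expanding window keeps them; the similar forecast errors reported in Tables 8-10 suggest the results are not sensitive to this choice.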
5. Conclusions
In this study, we developed three deep learning models for predicting the harvested area, yield and farm price of three major field crops five years into the future and compared their performance against a naïve benchmark, ETS with and without trend, ARIMA and the USDA baselines. Except for the ETS with trend model applied to crop yields, the ETS methods do not significantly improve forecast accuracy over the naïve benchmark. The USDA baselines perform well in forecasting crop yield but not as well in forecasting harvested area and farm price, especially at longer horizons. The deep learning models are more accurate than the USDA baselines at longer horizons, most notably for farm prices, where the USDA baselines show poor accuracy. The results suggest that deep learning methods can, at the very least, match the accuracy of the USDA baselines for most indicators while offering significant improvements for indicators that the USDA baselines do not predict well.
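The five-horizon setup summarized above (h = 0 to 4) corresponds to a standard supervised framing of a time series, in which past values become inputs and the next several values become a multi-step target. A minimal sketch, with the lag length an illustrative assumption:

```python
import numpy as np

def make_supervised(series, n_lags, n_steps_ahead):
    """Turn a univariate series into (X, y) pairs: the previous n_lags
    values are the input, the next n_steps_ahead values are the
    multi-step target (one column per forecast horizon)."""
    X, y = [], []
    for t in range(n_lags, len(series) - n_steps_ahead + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t:t + n_steps_ahead])
    return np.array(X), np.array(y)
```

With `n_steps_ahead=5`, each target row contains the values for horizons h = 0 through h = 4, matching the five-year baseline horizon in the study.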
Deep learning methods have shown great promise in forecasting in other fields, but their use in predicting long-term agricultural statistics such as the USDA baselines has been limited. Our study shows that efficient deep learning methods can have important implications for the baseline models and processes. Since different commodity and country experts work on specific parts of the baseline report and produce inputs for the composite model, deep learning methods can indicate which specific parts of the baselines can be improved. The USDA can use the findings and insights from deep learning models to improve the baseline models that produce the initial baselines, or the expert adjustment process that produces the final baselines, with the goal of improving the overall accuracy and performance of the projections.
One limitation of this study is that the training sample covers relatively few years. Deep neural networks generally perform better with large training samples, and on small samples they are prone to overfitting. While our time period is limited, the number of features could be made much larger than in the current study. The USDA baseline report publishes hundreds of indicators, representing a high-dimensional prediction problem in which the sample size is much smaller than the number of features. Future research may incorporate more input features and produce forecasts for additional target variables, though doing so may require careful feature selection and dimensionality reduction to address the high dimensionality. Another consequence of the small sample is that we were able to produce forecasts for only five years. A further limitation of deep learning methods is their "black box" nature, which makes them harder to interpret than economic models, although advances in explainable deep learning may address this issue in the future.
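One common dimensionality reduction strategy for such high-dimensional, small-sample settings is principal component analysis. A minimal SVD-based sketch, offered only as an illustration and not part of the study's pipeline:

```python
import numpy as np

def pca_reduce(X, k):
    """Project a feature matrix X (samples x features) onto its top-k
    principal components via SVD of the mean-centered data."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

Reducing hundreds of baseline indicators to a handful of components keeps the effective number of inputs small relative to the limited number of training years.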
Descriptive statistics of input features
| Feature | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
US CPI inflation | 3.73 | 3.02 | 2.76 | −0.36 | 13.55 |
US population growth | 1.00 | 0.96 | 0.26 | 0.12 | 1.66 |
US GDP growth | 2.97 | 3.10 | 2.20 | −3.40 | 7.24 |
US unemployment rate | 6.00 | 5.65 | 1.60 | 3.50 | 9.70 |
US labor force participation rate | 63.64 | 63.90 | 2.70 | 58.70 | 67.20 |
US merchandise trade | 15.49 | 15.57 | 5.19 | 6.40 | 24.03 |
US CPI index 2010 | 61.63 | 62.46 | 35.25 | 13.71 | 124.27 |
US Real interest rate | 3.64 | 3.07 | 2.29 | −1.28 | 8.59 |
US GDP per capita growth | 1.95 | 2.03 | 2.16 | −4.33 | 6.31 |
US lending interest rate | 7.07 | 6.82 | 3.28 | 3.25 | 18.87 |
World annual population growth | 1.57 | 1.57 | 0.34 | 0.94 | 2.13 |
World GDP growth | 3.47 | 3.62 | 1.74 | −3.27 | 6.56 |
World GDP per capita growth | 1.87 | 2.06 | 1.62 | −4.27 | 4.82 |
World merchandise trade | 32.84 | 30.60 | 10.50 | 16.04 | 51.07 |
World urban growth | 2.44 | 2.44 | 0.32 | 1.69 | 3.23 |
Global temperature anomalies | 0.38 | 0.36 | 0.31 | −0.14 | 0.99 |
Average temperature | 52.65 | 52.54 | 1.07 | 50.88 | 55.28 |
Maximum temperature | 64.52 | 64.34 | 1.13 | 62.68 | 67.69 |
Minimum temperature | 40.78 | 40.70 | 1.08 | 38.87 | 43.13 |
Heating degree days | 4485.52 | 4516.00 | 290.17 | 3780.00 | 5030.00 |
Cooling degree days | 1258.90 | 1235.00 | 125.40 | 1016.00 | 1547.00 |
Precipitation | 30.81 | 30.62 | 2.16 | 25.70 | 34.96 |
Corn ending stocks | 1571.36 | 1435.91 | 855.04 | 425.92 | 4881.68 |
Corn exports | 1628.84 | 1821.32 | 622.53 | 396.36 | 2752.62 |
Corn use, feed and residual | 4628.84 | 4659.45 | 889.28 | 2966.15 | 6131.62 |
Corn use, food seed industrial | 2428.12 | 1533.50 | 2354.49 | 314.94 | 7056.77 |
Corn imports | 14.23 | 7.09 | 23.74 | 0.47 | 159.95 |
Soybean crush | 1257.64 | 1253.99 | 546.07 | 0.00 | 2190.00 |
Soybean ending stocks | 237.39 | 208.08 | 155.67 | 0.00 | 909.04 |
Soybean exports | 908.27 | 804.65 | 554.86 | 0.00 | 2265.43 |
Soybean imports | 7.12 | 3.38 | 12.12 | 0.00 | 71.76 |
Wheat domestic | 1018.68 | 1113.70 | 231.45 | 581.10 | 1389.28 |
Wheat ending stocks | 852.60 | 845.25 | 347.27 | 305.82 | 1904.98 |
Wheat exports | 1055.19 | 1040.40 | 255.55 | 543.99 | 1770.72 |
Wheat imports | 56.55 | 40.71 | 53.61 | 0.99 | 172.47 |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat harvested area using traditional/econometric methods
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
Naive | 4.03 | 4.06 | 4.65 | 4.13 | 3.86 | 7.31 | |
USDA | 1.84 | 1.70 | 3.40 | 3.79 | 2.02 | 3.69 | |
ETS | 3.33 | 3.61 | 4.60 | 4.17 | 3.86 | 7.31 | |
ETS Trend | 3.16 | 3.40 | 3.93 | 3.58 | 3.85 | 7.31 | |
ARIMA | 3.12 | 3.31 | 4.63 | 4.30 | 4.67 | 8.67 | |
h = 1 | |||||||
Naive | 4.84 | 5.51 | 5.68 | 5.49 | 4.85 | 8.74 | |
USDA | 2.77 | 2.61 | 5.60 | 5.86 | 2.55 | 4.34 | |
ETS | 4.18 | 4.22 | 5.65 | 5.63 | 4.85 | 8.74 | |
ETS Trend | 3.99 | 4.17 | 4.31 | 4.39 | 4.83 | 8.61 | |
ARIMA | 3.95 | 4.09 | 4.64 | 4.62 | 7.70 | 16.13 | |
h = 2 | |||||||
Naive | 4.76 | 4.61 | 6.71 | 6.84 | 5.58 | 10.52 | |
USDA | 3.03 | 2.87 | 7.20 | 7.93 | 4.01 | 8.39 | |
ETS | 4.63 | 4.58 | 6.66 | 6.76 | 5.58 | 10.52 | |
ETS Trend | 4.50 | 4.76 | 5.45 | 5.97 | 5.49 | 10.20 | |
ARIMA | 4.45 | 4.71 | 5.65 | 5.94 | 10.13 | 22.74 | |
h = 3 | |||||||
Naive | 4.55 | 4.31 | 6.97 | 6.99 | 6.19 | 13.58 | |
USDA | 3.21 | 2.95 | 7.10 | 7.62 | 4.08 | 8.44 | |
ETS | 4.99 | 4.84 | 7.00 | 7.18 | 6.19 | 13.58 | |
ETS Trend | 4.95 | 5.31 | 5.42 | 5.53 | 6.00 | 13.11 | |
ARIMA | 4.89 | 5.24 | 5.63 | 5.68 | 11.99 | 27.98 | |
h = 4 | |||||||
Naive | 4.11 | 3.86 | 8.20 | 8.58 | 7.28 | 15.57 | |
USDA | 2.90 | 2.73 | 7.55 | 7.81 | 4.72 | 10.05 | |
ETS | 5.14 | 4.72 | 8.14 | 8.63 | 7.28 | 15.57 | |
ETS Trend | 5.10 | 5.44 | 5.80 | 5.67 | 7.07 | 14.94 | |
ARIMA | 5.05 | 5.40 | 5.98 | 5.69 | 13.54 | 32.27 |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans and wheat yield using traditional/econometric methods
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
Naive | 15.18 | 7.53 | 2.82 | 5.48 | 4.15 | 6.74 | |
USDA | 14.69 | 6.88 | 2.55 | 4.72 | 2.82 | 4.70 | |
ETS | 15.52 | 8.80 | 3.20 | 5.90 | 3.55 | 6.08 | |
ETS Trend | 13.73 | 6.96 | 3.16 | 5.69 | 2.60 | 4.41 | |
ARIMA | 17.03 | 9.01 | 3.18 | 5.72 | 3.52 | 5.95 | |
h = 1 | |||||||
Naive | 19.95 | 9.37 | 3.67 | 6.98 | 4.06 | 6.89 | |
USDA | 14.79 | 6.99 | 2.65 | 4.77 | 2.67 | 4.20 | |
ETS | 18.54 | 10.24 | 4.02 | 7.13 | 3.41 | 5.72 | |
ETS Trend | 14.31 | 7.44 | 3.48 | 6.24 | 2.49 | 3.99 | |
ARIMA | 17.63 | 9.07 | 3.58 | 6.35 | 3.51 | 6.25 | |
h = 2 | |||||||
Naive | 21.99 | 10.51 | 4.60 | 8.25 | 3.21 | 5.48 | |
USDA | 14.66 | 6.77 | 2.82 | 5.07 | 2.83 | 4.65 | |
ETS | 19.22 | 9.85 | 4.67 | 8.05 | 3.37 | 5.85 | |
ETS Trend | 13.75 | 6.80 | 3.66 | 6.21 | 2.77 | 4.67 | |
ARIMA | 16.23 | 7.61 | 3.69 | 6.06 | 2.50 | 4.25 | |
h = 3 | |||||||
Naive | 21.92 | 10.02 | 4.82 | 6.52 | 4.00 | 6.60 | |
USDA | 14.20 | 6.50 | 3.00 | 5.29 | 2.88 | 4.47 | |
ETS | 19.53 | 9.88 | 5.12 | 8.34 | 3.98 | 6.41 | |
ETS Trend | 13.38 | 6.48 | 3.75 | 6.48 | 2.72 | 4.31 | |
ARIMA | 14.82 | 7.08 | 3.93 | 6.53 | 3.57 | 5.58 | |
h = 4 | |||||||
Naive | 22.53 | 10.05 | 5.40 | 8.88 | 5.50 | 9.04 | |
USDA | 13.42 | 5.39 | 3.18 | 5.66 | 3.28 | 5.48 | |
ETS | 20.39 | 11.14 | 5.76 | 10.55 | 4.78 | 8.12 | |
ETS Trend | 12.81 | 5.73 | 4.01 | 7.32 | 3.16 | 5.52 | |
ARIMA | 13.35 | 5.69 | 4.00 | 7.29 | 5.06 | 9.38 |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat farm price using traditional/econometric methods
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
Naive | 1.05 | 16.07 | 1.42 | 10.25 | 1.08 | 17.99 | |
USDA | 0.92 | 13.25 | 1.55 | 10.34 | 0.93 | 13.58 | |
ETS | 1.05 | 16.07 | 1.55 | 11.72 | 1.08 | 17.99 | |
ETS Trend | 1.07 | 16.86 | 1.62 | 12.32 | 1.10 | 18.25 | |
ARIMA | 1.15 | 20.84 | 1.44 | 10.43 | 1.02 | 16.86 | |
h = 1 | |||||||
Naive | 1.61 | 28.14 | 2.38 | 17.60 | 1.64 | 26.92 | |
USDA | 1.22 | 15.20 | 2.07 | 13.43 | 1.11 | 13.89 | |
ETS | 1.61 | 28.14 | 2.44 | 18.52 | 1.64 | 26.92 | |
ETS Trend | 1.66 | 29.64 | 2.53 | 18.93 | 1.71 | 28.92 | |
ARIMA | 1.82 | 37.41 | 2.47 | 18.07 | 1.72 | 28.06 | |
h = 2 | |||||||
Naive | 1.89 | 32.57 | 2.89 | 22.36 | 1.77 | 27.80 | |
USDA | 1.40 | 17.47 | 2.44 | 16.02 | 1.29 | 14.43 | |
ETS | 1.89 | 32.57 | 2.96 | 23.76 | 1.77 | 27.80 | |
ETS Trend | 1.97 | 35.04 | 3.10 | 26.10 | 1.89 | 29.23 | |
ARIMA | 2.14 | 42.23 | 3.03 | 23.15 | 1.81 | 28.55 | |
h = 3 | |||||||
Naive | 1.92 | 37.38 | 3.06 | 25.50 | 1.82 | 29.64 | |
USDA | 1.46 | 20.56 | 2.60 | 16.03 | 1.52 | 19.82 | |
ETS | 1.92 | 37.38 | 3.15 | 26.19 | 1.82 | 29.64 | |
ETS Trend | 2.04 | 40.98 | 3.38 | 29.47 | 1.98 | 31.44 | |
ARIMA | 2.04 | 41.17 | 3.22 | 26.50 | 1.76 | 27.92 | |
h = 4 | |||||||
Naive | 1.91 | 37.19 | 3.27 | 28.63 | 2.02 | 31.77 | |
USDA | 1.40 | 20.35 | 2.71 | 18.55 | 1.66 | 23.17 | |
ETS | 1.91 | 37.19 | 3.42 | 29.14 | 2.02 | 31.77 | |
ETS Trend | 2.08 | 40.93 | 3.73 | 32.59 | 2.23 | 36.00 | |
ARIMA | 1.97 | 38.38 | 3.47 | 29.90 | 1.86 | 31.47 |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat harvested area using LSTM
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 3.176 | 2.543 | 3.282 | 3.011 | 2.952 | 5.164 | |
(0.014) | (0.019) | (0.021) | (0.022) | (0.022) | (0.044) | ||
CNN-LSTM | 4.239 | 3.997 | 6.618 | 5.142 | 3.962 | 7.131 | |
(0.238) | (0.133) | (0.571) | (0.3) | (0.072) | (0.111) | ||
ED-LSTM | 4.776 | 5.027 | 5.416 | 5.207 | 4.051 | 7.306 | |
(0.029) | (0.033) | (0.046) | (0.055) | (0.03) | (0.064) | ||
h = 1 | |||||||
LSTM | 2.358 | 2.17 | 3.137 | 2.963 | 3.023 | 5.595 | |
(0.016) | (0.019) | (0.021) | (0.023) | (0.019) | (0.04) | ||
CNN-LSTM | 4.494 | 4.18 | 7.085 | 5.781 | 3.412 | 6.23 | |
(0.231) | (0.129) | (0.583) | (0.299) | (0.072) | (0.117) | ||
ED-LSTM | 4.396 | 4.434 | 6.418 | 6.275 | 4.292 | 7.929 | |
(0.034) | (0.032) | (0.046) | (0.052) | (0.03) | (0.059) | ||
h = 2 | |||||||
LSTM | 1.98 | 1.887 | 3.939 | 3.615 | 2.898 | 5.548 | |
(0.018) | (0.019) | (0.02) | (0.027) | (0.022) | (0.046) | ||
CNN-LSTM | 3.972 | 3.551 | 7.114 | 5.996 | 3.694 | 7.114 | |
(0.25) | (0.144) | (0.573) | (0.3) | (0.068) | (0.13) | ||
ED-LSTM | 4.115 | 3.902 | 6.14 | 5.928 | 4.859 | 9.395 | |
(0.041) | (0.044) | (0.051) | (0.056) | (0.032) | (0.065) | ||
h = 3 | |||||||
LSTM | 2.011 | 1.881 | 4.611 | 4.39 | 3.339 | 7.058 | |
(0.018) | (0.018) | (0.024) | (0.03) | (0.023) | (0.06) | ||
CNN-LSTM | 3.779 | 3.347 | 7.676 | 6.451 | 4.06 | 8.168 | |
(0.242) | (0.131) | (0.57) | (0.301) | (0.064) | (0.143) | ||
ED-LSTM | 4.317 | 3.889 | 6.176 | 6.03 | 5.48 | 11.16 | |
(0.037) | (0.034) | (0.052) | (0.058) | (0.037) | (0.083) | ||
h = 4 | |||||||
LSTM | 3.204 | 2.842 | 6.645 | 6.508 | 5.239 | 10.355 | |
(0.025) | (0.026) | (0.033) | (0.036) | (0.039) | (0.081) | ||
CNN-LSTM | 4.161 | 3.703 | 8.651 | 7.691 | 4.552 | 9.326 | |
(0.247) | (0.133) | (0.555) | (0.296) | (0.071) | (0.169) | ||
ED-LSTM | 4.501 | 3.992 | 6.769 | 6.791 | 5.923 | 12.462 | |
(0.042) | (0.042) | (0.056) | (0.064) | (0.035) | (0.08) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat yield using LSTM
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 11.243 | 6.118 | 2.482 | 4.077 | 2.742 | 4.907 | |
(0.055) | (0.035) | (0.012) | (0.024) | (0.009) | (0.021) | ||
CNN-LSTM | 18.808 | 8.708 | 3.922 | 6.339 | 3.646 | 5.334 | |
(0.951) | (0.248) | (0.179) | (0.176) | (0.181) | (0.18) | ||
ED-LSTM | 15.911 | 8.896 | 3.803 | 6.58 | 3.335 | 5.343 | |
(0.1) | (0.065) | (0.022) | (0.053) | (0.021) | (0.044) | ||
h = 1 | |||||||
LSTM | 9.296 | 4.679 | 2.33 | 4.128 | 2.37 | 3.971 | |
(0.051) | (0.031) | (0.012) | (0.025) | (0.01) | (0.022) | ||
CNN-LSTM | 19.137 | 9.127 | 3.985 | 6.359 | 3.739 | 5.433 | |
(0.961) | (0.262) | (0.185) | (0.185) | (0.194) | (0.178) | ||
ED-LSTM | 16.769 | 9.221 | 4.271 | 7.662 | 3.321 | 5.373 | |
(0.108) | (0.062) | (0.027) | (0.051) | (0.022) | (0.044) | ||
h = 2 | |||||||
LSTM | 8.371 | 4.021 | 1.839 | 3.397 | 2.419 | 3.993 | |
(0.055) | (0.031) | (0.011) | (0.023) | (0.01) | (0.021) | ||
CNN-LSTM | 18.413 | 8.462 | 3.982 | 6.098 | 3.872 | 5.71 | |
(0.96) | (0.248) | (0.2) | (0.187) | (0.199) | (0.195) | ||
ED-LSTM | 15.726 | 8.554 | 4.073 | 7.095 | 3.662 | 5.898 | |
(0.113) | (0.064) | (0.028) | (0.059) | (0.024) | (0.043) | ||
h = 3 | |||||||
LSTM | 11.982 | 6.236 | 2.86 | 5.182 | 2.559 | 4.003 | |
(0.067) | (0.035) | (0.012) | (0.025) | (0.012) | (0.025) | ||
CNN-LSTM | 19.263 | 9.127 | 4.426 | 7.034 | 3.833 | 5.427 | |
(0.931) | (0.247) | (0.194) | (0.181) | (0.197) | (0.194) | ||
ED-LSTM | 16.948 | 9.359 | 4.493 | 8.14 | 3.748 | 6.076 | |
(0.112) | (0.074) | (0.031) | (0.061) | (0.025) | (0.054) | ||
h = 4 | |||||||
LSTM | 17.211 | 9.059 | 4.303 | 7.902 | 3.227 | 5.525 | |
(0.071) | (0.046) | (0.017) | (0.037) | (0.015) | (0.033) | ||
CNN-LSTM | 21.896 | 10.359 | 5.197 | 8.335 | 4.331 | 6.41 | |
(0.909) | (0.26) | (0.192) | (0.183) | (0.187) | (0.173) | ||
ED-LSTM | 17.677 | 9.585 | 4.934 | 9.105 | 3.78 | 6.27 | |
(0.147) | (0.084) | (0.035) | (0.068) | (0.03) | (0.052) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat farm price using LSTM
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 1.012 | 17.954 | 2.068 | 15.131 | 1.309 | 18.934 | |
(0.003) | (0.073) | (0.006) | (0.054) | (0.004) | (0.069) | ||
CNN-LSTM | 1.559 | 30.1 | 2.689 | 20.972 | 1.517 | 22.52 | |
(0.019) | (0.335) | (0.062) | (0.346) | (0.029) | (0.344) | ||
ED-LSTM | 1.445 | 22.78 | 2.656 | 18.534 | 1.503 | 21.185 | |
(0.005) | (0.108) | (0.01) | (0.081) | (0.005) | (0.101) | ||
h = 1 | |||||||
LSTM | 0.679 | 11.708 | 1.316 | 10.665 | 0.852 | 12.812 | |
(0.003) | (0.077) | (0.007) | (0.065) | (0.005) | (0.077) | ||
CNN-LSTM | 1.457 | 26.762 | 2.622 | 20.312 | 1.342 | 19.767 | |
(0.017) | (0.35) | (0.055) | (0.327) | (0.028) | (0.336) | ||
ED-LSTM | 1.358 | 22.244 | 2.409 | 17.739 | 1.213 | 17.935 | |
(0.006) | (0.137) | (0.012) | (0.104) | (0.006) | (0.106) | ||
h = 2 | |||||||
LSTM | 0.628 | 10.888 | 1.012 | 8.334 | 0.653 | 9.946 | |
(0.003) | (0.079) | (0.007) | (0.065) | (0.004) | (0.079) | ||
CNN-LSTM | 1.275 | 24.114 | 2.288 | 17.627 | 1.24 | 18.616 | |
(0.017) | (0.379) | (0.06) | (0.341) | (0.029) | (0.364) | ||
ED-LSTM | 1.521 | 26.802 | 2.607 | 20.219 | 1.379 | 21.889 | |
(0.007) | (0.149) | (0.015) | (0.117) | (0.007) | (0.13) | ||
h = 3 | |||||||
LSTM | 0.742 | 13.138 | 1.026 | 7.541 | 0.708 | 10.222 | |
(0.004) | (0.081) | (0.008) | (0.064) | (0.004) | (0.072) | ||
CNN-LSTM | 1.273 | 24.583 | 2.142 | 16.089 | 1.165 | 17.584 | |
(0.019) | (0.374) | (0.061) | (0.34) | (0.03) | (0.366) | ||
ED-LSTM | 1.621 | 28.608 | 2.826 | 21.857 | 1.594 | 25.364 | |
(0.007) | (0.165) | (0.016) | (0.152) | (0.008) | (0.146) | ||
h = 4 | |||||||
LSTM | 1.407 | 28.528 | 2.412 | 17.393 | 1.586 | 23.735 | |
(0.005) | (0.13) | (0.011) | (0.097) | (0.006) | (0.107) | ||
CNN-LSTM | 1.523 | 30.664 | 2.677 | 20.688 | 1.5 | 22.561 | |
(0.018) | (0.366) | (0.052) | (0.336) | (0.027) | (0.37) | ||
ED-LSTM | 1.591 | 30.059 | 3.058 | 23.567 | 1.739 | 28.479 | |
(0.006) | (0.181) | (0.014) | (0.141) | (0.008) | (0.155) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat harvested area using LSTM (using expanding training window)
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 2.972 | 2.404 | 3.239 | 2.939 | 3.1 | 5.566 | |
(0.015) | (0.018) | (0.026) | (0.029) | (0.024) | (0.043) | ||
CNN-LSTM | 3.473 | 3.339 | 4.996 | 4.278 | 3.755 | 6.795 | |
(0.173) | (0.095) | (0.35) | (0.17) | (0.065) | (0.11) | ||
ED-LSTM | 5.302 | 5.687 | 6.255 | 6.097 | 5.022 | 9.413 | |
(0.028) | (0.032) | (0.06) | (0.067) | (0.043) | (0.086) | ||
h = 1 | |||||||
LSTM | 2.348 | 2.213 | 3.019 | 2.818 | 2.977 | 5.348 | |
(0.014) | (0.016) | (0.021) | (0.023) | (0.019) | (0.037) | ||
CNN-LSTM | 4.083 | 3.947 | 5.705 | 5.224 | 3.526 | 6.388 | |
(0.156) | (0.086) | (0.339) | (0.175) | (0.068) | (0.127) | ||
ED-LSTM | 4.849 | 4.921 | 7.297 | 7.318 | 4.996 | 9.234 | |
(0.036) | (0.038) | (0.066) | (0.074) | (0.046) | (0.086) | ||
h = 2 | |||||||
LSTM | 1.916 | 1.774 | 3.809 | 3.561 | 2.945 | 5.559 | |
(0.015) | (0.018) | (0.021) | (0.024) | (0.02) | (0.049) | ||
CNN-LSTM | 3.578 | 3.308 | 5.705 | 5.366 | 3.655 | 7.101 | |
(0.166) | (0.092) | (0.335) | (0.17) | (0.061) | (0.124) | ||
ED-LSTM | 4.417 | 4.208 | 6.842 | 6.633 | 5.459 | 10.415 | |
(0.04) | (0.04) | (0.069) | (0.072) | (0.051) | (0.099) | ||
h = 3 | |||||||
LSTM | 1.931 | 1.74 | 4.426 | 4.271 | 3.378 | 7.057 | |
(0.018) | (0.018) | (0.026) | (0.031) | (0.025) | (0.058) | ||
CNN-LSTM | 3.43 | 3.168 | 6.196 | 5.819 | 3.881 | 7.844 | |
(0.184) | (0.098) | (0.347) | (0.179) | (0.066) | (0.146) | ||
ED-LSTM | 4.368 | 3.925 | 6.964 | 6.773 | 6.045 | 12.201 | |
(0.041) | (0.042) | (0.087) | (0.085) | (0.055) | (0.11) | ||
h = 4 | |||||||
LSTM | 3.119 | 2.842 | 6.502 | 6.492 | 5.575 | 10.82 | |
(0.026) | (0.031) | (0.034) | (0.038) | (0.047) | (0.107) | ||
CNN-LSTM | 3.84 | 3.571 | 7.097 | 6.927 | 4.248 | 8.614 | |
(0.163) | (0.087) | (0.32) | (0.171) | (0.071) | (0.166) | ||
ED-LSTM | 4.516 | 3.95 | 7.559 | 7.586 | 6.488 | 13.639 | |
(0.041) | (0.043) | (0.084) | (0.08) | (0.053) | (0.115) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat yield using LSTM (using expanding training window)
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 10.894 | 5.817 | 2.481 | 3.966 | 2.701 | 4.774 | |
(0.049) | (0.032) | (0.011) | (0.023) | (0.011) | (0.023) | ||
CNN-LSTM | 14.647 | 7.625 | 3.228 | 5.671 | 3.242 | 4.922 | |
(0.479) | (0.158) | (0.091) | (0.093) | (0.14) | (0.125) | ||
ED-LSTM | 17.412 | 9.672 | 4.305 | 7.38 | 3.654 | 5.903 | |
(0.136) | (0.083) | (0.03) | (0.059) | (0.023) | (0.045) | ||
h = 1 | |||||||
LSTM | 8.769 | 4.217 | 2.251 | 3.887 | 2.343 | 3.986 | |
(0.047) | (0.029) | (0.013) | (0.024) | (0.01) | (0.02) | ||
CNN-LSTM | 16.05 | 8.385 | 3.518 | 6.039 | 3.302 | 5.082 | |
(0.439) | (0.143) | (0.093) | (0.102) | (0.131) | (0.129) | ||
ED-LSTM | 18.145 | 10.047 | 4.733 | 8.447 | 3.677 | 6.035 | |
(0.155) | (0.086) | (0.039) | (0.069) | (0.03) | (0.052) | ||
h = 2 | |||||||
LSTM | 7.563 | 3.504 | 1.708 | 3.153 | 2.42 | 4.017 | |
(0.048) | (0.029) | (0.012) | (0.026) | (0.011) | (0.021) | ||
CNN-LSTM | 14.96 | 7.506 | 3.28 | 5.513 | 3.428 | 5.359 | |
(0.467) | (0.144) | (0.095) | (0.108) | (0.14) | (0.128) | ||
ED-LSTM | 17.023 | 9.278 | 4.493 | 7.773 | 4.01 | 6.455 | |
(0.173) | (0.091) | (0.041) | (0.072) | (0.034) | (0.056) | ||
h = 3 | |||||||
LSTM | 11.159 | 5.664 | 2.704 | 4.88 | 2.58 | 4.183 | |
(0.06) | (0.037) | (0.014) | (0.029) | (0.013) | (0.027) | ||
CNN-LSTM | 15.536 | 7.903 | 3.604 | 6.173 | 3.419 | 5.079 | |
(0.489) | (0.142) | (0.089) | (0.106) | (0.15) | (0.128) | ||
ED-LSTM | 18.254 | 10.236 | 4.899 | 8.795 | 4.048 | 6.519 | |
(0.177) | (0.098) | (0.047) | (0.084) | (0.033) | (0.066) | ||
h = 4 | |||||||
LSTM | 16.964 | 8.933 | 4.282 | 7.841 | 3.331 | 5.705 | |
(0.079) | (0.051) | (0.017) | (0.04) | (0.015) | (0.035) | ||
CNN-LSTM | 18.46 | 9.313 | 4.384 | 7.508 | 3.79 | 5.91 | |
(0.463) | (0.153) | (0.083) | (0.109) | (0.125) | (0.135) | ||
ED-LSTM | 19.389 | 10.666 | 5.356 | 9.854 | 4.047 | 6.599 | |
(0.203) | (0.108) | (0.049) | (0.087) | (0.034) | (0.064) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat farm price using LSTM (using expanding training window)
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 0.986 | 16.766 | 1.972 | 14.244 | 1.277 | 18.28 | |
(0.003) | (0.068) | (0.006) | (0.047) | (0.003) | (0.063) | ||
CNN-LSTM | 1.472 | 28.238 | 2.293 | 18.344 | 1.457 | 21.326 | |
(0.02) | (0.339) | (0.044) | (0.271) | (0.03) | (0.334) | ||
ED-LSTM | 1.453 | 21.602 | 2.708 | 18.514 | 1.513 | 20.741 | |
(0.005) | (0.106) | (0.01) | (0.084) | (0.005) | (0.095) | ||
h = 1 | |||||||
LSTM | 0.609 | 10.407 | 1.235 | 9.835 | 0.777 | 11.801 | |
(0.003) | (0.062) | (0.006) | (0.06) | (0.004) | (0.064) | ||
CNN-LSTM | 1.483 | 27.543 | 2.42 | 18.985 | 1.361 | 19.977 | |
(0.019) | (0.382) | (0.044) | (0.302) | (0.032) | (0.372) | ||
ED-LSTM | 1.365 | 21.094 | 2.45 | 16.944 | 1.218 | 17.251 | |
(0.006) | (0.132) | (0.012) | (0.102) | (0.007) | (0.123) | ||
h = 2 | |||||||
LSTM | 0.575 | 10.027 | 0.952 | 7.822 | 0.602 | 9.131 | |
(0.003) | (0.078) | (0.006) | (0.062) | (0.003) | (0.06) | ||
CNN-LSTM | 1.251 | 23.466 | 2.061 | 16.422 | 1.222 | 18.121 | |
(0.02) | (0.371) | (0.042) | (0.332) | (0.033) | (0.373) | ||
ED-LSTM | 1.494 | 24.961 | 2.564 | 18.769 | 1.338 | 20.601 | |
(0.007) | (0.152) | (0.013) | (0.118) | (0.007) | (0.127) | ||
h = 3 | |||||||
LSTM | 0.703 | 11.865 | 0.957 | 6.982 | 0.659 | 9.615 | |
(0.003) | (0.074) | (0.007) | (0.055) | (0.004) | (0.059) | ||
CNN-LSTM | 1.186 | 22.463 | 1.823 | 14.072 | 1.08 | 16.304 | |
(0.02) | (0.333) | (0.047) | (0.335) | (0.034) | (0.401) | ||
ED-LSTM | 1.586 | 26.881 | 2.748 | 20.372 | 1.534 | 23.545 | |
(0.007) | (0.191) | (0.015) | (0.155) | (0.008) | (0.137) | ||
h = 4 | |||||||
LSTM | 1.315 | 25.846 | 2.257 | 15.853 | 1.479 | 22.097 | |
(0.005) | (0.133) | (0.011) | (0.106) | (0.006) | (0.115) | ||
CNN-LSTM | 1.433 | 28.119 | 2.414 | 18.307 | 1.372 | 20.285 | |
(0.018) | (0.373) | (0.048) | (0.349) | (0.031) | (0.383) | ||
ED-LSTM | 1.549 | 28.713 | 2.978 | 22.298 | 1.68 | 26.776 | |
(0.008) | (0.2) | (0.016) | (0.157) | (0.01) | (0.169) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans, and wheat harvested area using LSTM (using sliding training window and standard scaler)
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 2.602 | 2.537 | 3.036 | 2.785 | 3.485 | 6.343 | |
(0.039) | (0.036) | (0.031) | (0.031) | (0.029) | (0.061) | ||
CNN-LSTM | 4.222 | 3.816 | 5.636 | 5.687 | 8.806 | 17.944 | |
(0.139) | (0.1) | (0.139) | (0.116) | (0) | (0.001) | ||
ED-LSTM | 4.809 | 5.023 | 6.213 | 6.425 | 4.389 | 8.163 | |
(0.021) | (0.024) | (0.024) | (0.029) | (0.019) | (0.041) | ||
h = 1 | |||||||
LSTM | 1.827 | 1.721 | 2.895 | 2.786 | 2.022 | 3.912 | |
(0.022) | (0.023) | (0.022) | (0.028) | (0.019) | (0.042) | ||
CNN-LSTM | 4.261 | 3.832 | 5.568 | 5.346 | 9.952 | 21.373 | |
(0.134) | (0.089) | (0.165) | (0.123) | (0) | (0) | ||
ED-LSTM | 3.797 | 3.698 | 7.065 | 7.165 | 4.838 | 9.225 | |
(0.022) | (0.028) | (0.025) | (0.03) | (0.02) | (0.043) | ||
h = 2 | |||||||
LSTM | 1.435 | 1.41 | 3.13 | 3.082 | 1.672 | 3.29 | |
(0.019) | (0.02) | (0.026) | (0.032) | (0.021) | (0.044) | ||
CNN-LSTM | 4.218 | 3.641 | 5.527 | 5.334 | 11.217 | 25.004 | |
(0.143) | (0.1) | (0.161) | (0.12) | (0) | (0) | ||
ED-LSTM | 3.282 | 3.072 | 6.918 | 6.809 | 5.687 | 11.471 | |
(0.025) | (0.026) | (0.026) | (0.035) | (0.02) | (0.045) | ||
h = 3 | |||||||
LSTM | 1.802 | 1.786 | 3.479 | 3.315 | 2.209 | 4.398 | |
(0.022) | (0.025) | (0.029) | (0.038) | (0.025) | (0.051) | ||
CNN-LSTM | 4.266 | 3.501 | 5.789 | 5.501 | 12.294 | 28.186 | |
(0.153) | (0.1) | (0.174) | (0.13) | (0) | (0) | ||
ED-LSTM | 3.209 | 2.952 | 7.161 | 7.104 | 6.558 | 13.756 | |
(0.03) | (0.03) | (0.029) | (0.037) | (0.021) | (0.049) | ||
h = 4 | |||||||
LSTM | 3.484 | 3.464 | 5.147 | 4.827 | 5.403 | 10.092 | |
(0.057) | (0.057) | (0.051) | (0.057) | (0.066) | (0.112) | ||
CNN-LSTM | 4.688 | 4.025 | 6.702 | 6.529 | 13.161 | 30.916 | |
(0.149) | (0.1) | (0.168) | (0.112) | (0) | (0) | ||
ED-LSTM | 3.471 | 3.158 | 7.978 | 8.304 | 7.305 | 15.896 | |
(0.029) | (0.031) | (0.031) | (0.039) | (0.023) | (0.054) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans and wheat yield using LSTM (using sliding training window and standard scaler)
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 13.099 | 6.968 | 3.026 | 5.225 | 2.331 | 3.518 | |
(0.094) | (0.049) | (0.021) | (0.042) | (0.019) | (0.041) | ||
CNN-LSTM | 17.209 | 9.08 | 4.031 | 6.84 | 3.724 | 5.922 | |
(0.321) | (0.146) | (0.103) | (0.14) | (0.085) | (0.108) | ||
ED-LSTM | 16.524 | 9.313 | 4.077 | 7.142 | 3.569 | 5.841 | |
(0.06) | (0.039) | (0.014) | (0.03) | (0.012) | (0.026) | ||
h = 1 | |||||||
LSTM | 7.122 | 3.792 | 2.148 | 3.695 | 1.764 | 2.639 | |
(0.06) | (0.034) | (0.017) | (0.033) | (0.014) | (0.024) | ||
CNN-LSTM | 17.096 | 8.663 | 3.959 | 6.579 | 3.509 | 5.413 | |
(0.409) | (0.176) | (0.132) | (0.164) | (0.086) | (0.132) | ||
ED-LSTM | 17.601 | 9.831 | 4.576 | 8.164 | 3.446 | 5.48 | |
(0.067) | (0.038) | (0.016) | (0.032) | (0.012) | (0.024) | ||
h = 2 | |||||||
LSTM | 5.181 | 2.499 | 1.538 | 2.573 | 1.936 | 3.002 | |
(0.054) | (0.031) | (0.013) | (0.023) | (0.015) | (0.026) | ||
CNN-LSTM | 14.954 | 7.068 | 3.703 | 5.826 | 3.7 | 5.636 | |
(0.426) | (0.171) | (0.143) | (0.171) | (0.106) | (0.144) | ||
ED-LSTM | 17.466 | 9.591 | 4.62 | 8.113 | 3.93 | 6.187 | |
(0.067) | (0.042) | (0.017) | (0.034) | (0.013) | (0.026) | ||
h = 3 | |||||||
LSTM | 6.208 | 3.057 | 1.999 | 3.232 | 2.139 | 3.298 | |
(0.065) | (0.039) | (0.018) | (0.033) | (0.017) | (0.034) | ||
CNN-LSTM | 16.024 | 7.697 | 4.018 | 6.445 | 3.743 | 5.458 | |
(0.417) | (0.167) | (0.142) | (0.17) | (0.104) | (0.138) | ||
ED-LSTM | 18.795 | 10.804 | 5.179 | 9.265 | 4.09 | 6.552 | |
(0.082) | (0.051) | (0.019) | (0.037) | (0.015) | (0.031) | ||
h = 4 | |||||||
LSTM | 18.227 | 9.144 | 4.727 | 8.376 | 3.731 | 6.384 | |
(0.146) | (0.076) | (0.031) | (0.061) | (0.026) | (0.057) | ||
CNN-LSTM | 18.972 | 9.11 | 4.781 | 7.953 | 4.298 | 6.726 | |
(0.393) | (0.17) | (0.122) | (0.162) | (0.094) | (0.133) | ||
ED-LSTM | 20.401 | 11.785 | 5.789 | 10.828 | 4.237 | 6.958 | |
(0.083) | (0.048) | (0.018) | (0.039) | (0.015) | (0.033) |
Source(s): Created by the authors
Forecast accuracy for corn, soybeans and wheat farm price using LSTM (using sliding training window and standard scaler)
| Horizon | Method | Corn RMSE | Corn MAPE(%) | Soybeans RMSE | Soybeans MAPE(%) | Wheat RMSE | Wheat MAPE(%) |
|---|---|---|---|---|---|---|---|
h = 0 | |||||||
LSTM | 0.783 | 14.415 | 1.339 | 9.689 | 0.944 | 12.883 | |
(0.004) | (0.088) | (0.007) | (0.063) | (0.005) | (0.096) | ||
CNN-LSTM | 1.497 | 28.687 | 2.579 | 19.695 | 1.515 | 22.957 | |
(0.023) | (0.361) | (0.056) | (0.356) | (0.034) | (0.43) | ||
ED-LSTM | 1.37 | 21.555 | 2.488 | 17.772 | 1.45 | 20.43 | |
(0.003) | (0.093) | (0.008) | (0.081) | (0.004) | (0.08) | ||
h = 1 | |||||||
LSTM | 0.413 | 6.336 | 0.806 | 5.917 | 0.408 | 6.386 | |
(0.002) | (0.051) | (0.006) | (0.05) | (0.003) | (0.06) | ||
CNN-LSTM | 1.383 | 24.689 | 2.459 | 17.355 | 1.35 | 19.81 | |
(0.025) | (0.353) | (0.061) | (0.318) | (0.031) | (0.376) | ||
ED-LSTM | 1.155 | 19.077 | 1.956 | 14.863 | 1.077 | 15.615 | |
(0.004) | (0.09) | (0.009) | (0.08) | (0.005) | (0.09) | ||
h = 2 | |||||||
LSTM | 0.364 | 6.166 | 0.627 | 5.04 | 0.338 | 5.169 | |
(0.003) | (0.062) | (0.005) | (0.044) | (0.003) | (0.053) | ||
CNN-LSTM | 1.26 | 20.179 | 2.334 | 15.648 | 1.237 | 16.685 | |
(0.028) | (0.376) | (0.06) | (0.325) | (0.034) | (0.387) | ||
ED-LSTM | 1.216 | 23.728 | 1.971 | 16.764 | 1.136 | 18.52 | |
(0.005) | (0.115) | (0.01) | (0.096) | (0.005) | (0.094) | ||
h = 3 | |||||||
LSTM | 0.387 | 6.845 | 0.689 | 5.677 | 0.41 | 6.09 | |
(0.003) | (0.074) | (0.007) | (0.061) | (0.004) | (0.061) | ||
CNN-LSTM | 1.329 | 21.854 | 2.301 | 15.553 | 1.243 | 16.584 | |
(0.03) | (0.373) | (0.061) | (0.322) | (0.033) | (0.326) | ||
ED-LSTM | 1.329 | 26.649 | 2.129 | 18.316 | 1.279 | 21.685 | |
(0.005) | (0.126) | (0.011) | (0.115) | (0.005) | (0.103) | ||
h = 4 | |||||||
LSTM | 1.182 | 23.034 | 1.97 | 15.253 | 1.241 | 19.306 | |
(0.006) | (0.152) | (0.014) | (0.102) | (0.008) | (0.14) | ||
CNN-LSTM | 1.572 | 28.339 | 2.777 | 19.75 | 1.548 | 21.183 | |
(0.022) | (0.379) | (0.048) | (0.29) | (0.025) | (0.35) | ||
ED-LSTM | 1.424 | 29.435 | 2.494 | 21.182 | 1.492 | 25.403 | |
(0.006) | (0.136) | (0.011) | (0.112) | (0.005) | (0.109) |
Source(s): Created by the authors
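The caption above refers to a sliding training window and a standard scaler. As a hedged illustration of what such a setup typically involves (plain-Python sketch with hypothetical function names; the authors' actual pipeline uses Keras/TensorFlow and is not reproduced here), the supervised samples and scaling can be built as:

```python
def make_supervised(series, n_lags, n_ahead):
    """Slice a series into (lag window, multi-step target) training pairs."""
    X, y = [], []
    for t in range(len(series) - n_lags - n_ahead + 1):
        X.append(series[t : t + n_lags])                      # inputs: last n_lags values
        y.append(series[t + n_lags : t + n_lags + n_ahead])   # targets: next n_ahead steps
    return X, y

def standard_scale(train_values):
    """Z-score ("standard scaler") fit on the training window only,
    so no information from the evaluation period leaks into the scaling."""
    mean = sum(train_values) / len(train_values)
    std = (sum((v - mean) ** 2 for v in train_values) / len(train_values)) ** 0.5
    return [(v - mean) / (std or 1.0) for v in train_values], mean, std

series = list(range(10))  # toy stand-in for an annual price series
X, y = make_supervised(series, n_lags=3, n_ahead=2)
print(X[0], y[0])  # → [0, 1, 2] [3, 4]
```

Sliding the training window forward one step at a time, and refitting the scaler on each window, mirrors how multi-horizon forecasts (h = 0 through h = 4 in the tables) are evaluated out of sample.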
Funding: Scientific Article No. 3473 of the West Virginia Agricultural and Forestry Experiment Station. This study is funded in part by Cooperative Agreement #58-3000-1-0068 from the USDA, ERS. This material is based upon work that is partially supported by the National Institute of Food and Agriculture, U.S. Department of Agriculture, Hatch Project, under WVA00779.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y. and Zheng, X. (2015), “TensorFlow: large-scale machine learning on heterogeneous systems”, Software, available at: https://www.tensorflow.org/
Bora, S.S., Katchova, A.L. and Kuethe, T.H. (2023), “The accuracy and informativeness of agricultural baselines”, American Journal of Agricultural Economics, Vol. 105 No. 4, pp. 1116-1148, doi: 10.1111/ajae.12350.
Borovykh, A., Bohte, S. and Oosterlee, C.W. (2019), “Dilated convolutional neural networks for time series forecasting”, Journal of Computational Finance, Vol. 22, pp. 73-101, doi: 10.21314/jcf.2019.358.
Boulmaiz, T., Guermoui, M. and Hamouda, B. (2020), “Impact of training data size on the LSTM performances for rainfall–runoff modeling”, Modeling Earth Systems and Environment, Vol. 6 No. 4, pp. 2153-2164, doi: 10.1007/s40808-020-00830-w.
Chandio, R. and Katchova, A.L. (2024), “Similarities in the USDA international baseline projections”, Journal of the Agricultural and Applied Economics Association, doi: 10.1002/jaa2.129.
Chen, Y., Kang, Y., Chen, Y. and Wang, Z. (2020), “Probabilistic forecasting with temporal convolutional neural network”, Neurocomputing, Vol. 399, pp. 491-501, doi: 10.1016/j.neucom.2020.03.011, available at: https://www.sciencedirect.com/science/article/pii/S0925231220303441
Chollet, F. (2015), “Keras”, available at: https://github.com/fchollet/keras
Elman, J.L. (1990), “Finding structure in time”, Cognitive Science, Vol. 14 No. 2, pp. 179-211, doi: 10.1016/0364-0213(90)90002-e, available at: https://www.sciencedirect.com/science/article/pii/036402139090002E
Fang, X. and Katchova, A.L. (2023), “Evaluating the OECD–FAO and USDA agricultural baseline projections”, Q Open, Vol. 3 No. 2, pp. 1-29, doi: 10.1093/qopen/qoad029.
Hochreiter, S. and Schmidhuber, J. (1997), “Long short-term memory”, Neural Computation, Vol. 9 No. 8, pp. 1735-1780, doi: 10.1162/neco.1997.9.8.1735.
Holt, C.C. (2004), “Forecasting seasonals and trends by exponentially weighted moving averages”, International Journal of Forecasting, Vol. 20 No. 1, pp. 5-10, doi: 10.1016/j.ijforecast.2003.09.015, available at: https://www.sciencedirect.com/science/article/pii/S0169207003001134
Huang, Y., Li, J., Hou, W., Zhang, B., Zhang, Y., Li, Y. and Sun, L. (2020), “Improved clustering and deep learning based short-term wind energy forecasting in large-scale wind farms”, Journal of Renewable and Sustainable Energy, Vol. 12 No. 6, 066101, doi: 10.1063/5.0016226.
Huber, P.J. (1964), “Robust estimation of a location parameter”, The Annals of Mathematical Statistics, Vol. 35 No. 1, pp. 73-101, doi: 10.1214/aoms/1177703732.
Hyndman, R.J. and Athanasopoulos, G. (2021), Forecasting: Principles and Practice, 3rd ed., available at: https://otexts.com/fpp3/ (accessed 13 May 2024).
Katchova, A.L. (2024), “Do revisions improve agricultural baselines?”, Journal of the Agricultural and Applied Economics Association, Vol. 3 No. 1, pp. 78-99, doi: 10.1002/jaa2.100.
Kim, H.Y. and Won, C.H. (2018), “Forecasting the volatility of stock price index: a hybrid model integrating LSTM with multiple GARCH-type models”, Expert Systems with Applications, Vol. 103, pp. 25-37, doi: 10.1016/j.eswa.2018.03.002.
Kingma, D.P. and Ba, J. (2014), “Adam: a method for stochastic optimization”, available at: https://arxiv.org/abs/1412.6980
Lara-Benítez, P., Carranza-García, M., Luna-Romera, J.M. and Riquelme, J.C. (2020), “Temporal convolutional networks applied to energy-related time series forecasting”, Applied Sciences, Vol. 10 No. 7, p. 2322, doi: 10.3390/app10072322, available at: https://www.mdpi.com/2076-3417/10/7/2322
Lara-Benítez, P., Carranza-García, M. and Riquelme, J.C. (2021), “An experimental review on deep learning architectures for time series forecasting”, International Journal of Neural Systems, Vol. 31 No. 03, 2130001, PMID: 33588711, doi: 10.1142/S0129065721300011.
Makridakis, S., Spiliotis, E. and Assimakopoulos, V. (2018), “The M4 competition: results, findings, conclusion and way forward”, International Journal of Forecasting, Vol. 34 No. 4, pp. 802-808, doi: 10.1016/j.ijforecast.2018.06.001.
Medvedev, N. and Wang, Z. (2022), “Multistep forecast of the implied volatility surface using deep learning”, Journal of Futures Markets, Vol. 42 No. 4, pp. 645-667, doi: 10.1002/fut.22302.
NOAA National Centers for Environmental Information (2022), “Climate at a glance: National time series”, available at: https://www.ncei.noaa.gov/cag/ (accessed 18 July 2022).
Panigrahi, S. and Behera, H. (2017), “A hybrid ETS–ANN model for time series forecasting”, Engineering Applications of Artificial Intelligence, Vol. 66, pp. 49-59, doi: 10.1016/j.engappai.2017.07.007, available at: https://www.sciencedirect.com/science/article/pii/S0952197617301550
Roznik, M., Mishra, A.K. and Boyd, M.S. (2023), “Using a machine learning approach and big data to augment WASDE forecasts: empirical evidence from US corn yield”, Journal of Forecasting, Vol. 42 No. 6, pp. 1370-1384, doi: 10.1002/for.2956.
Schmidhuber, J. (2015), “Deep learning in neural networks: an overview”, Neural Networks, Vol. 61, pp. 85-117, doi: 10.1016/j.neunet.2014.09.003, available at: https://www.sciencedirect.com/science/article/pii/S0893608014002135
Seabold, S. and Perktold, J. (2010), “statsmodels: econometric and statistical modeling with python”, 9th Python in Science Conference.
Shen, L., Er, M.J. and Yin, Q. (2022), “Classification for high-dimension low-sample size data”, Pattern Recognition, Vol. 130, 108828, doi: 10.1016/j.patcog.2022.108828, available at: https://www.sciencedirect.com/science/article/pii/S0031320322003090
Smith, T.G. (2017), “Pmdarima: ARIMA estimators for Python”, available at: http://www.alkaline-ml.com/pmdarima
Smyl, S. (2020), “A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting”, International Journal of Forecasting, Vol. 36 No. 1, pp. 75-85, doi: 10.1016/j.ijforecast.2019.03.017, available at: https://www.sciencedirect.com/science/article/pii/S0169207019301153
USDA ERS (2024), “USDA agricultural projections”, available at: https://usda.library.cornell.edu/concern/publications/qn59q396v?locale=en (accessed 13 May 2024).
USDA Foreign Agricultural Service (2024), “Production, supply and distribution database”, Foreign Agricultural Service, Department of Agriculture, available at: https://data.nal.usda.gov/dataset/production-supply-and-distribution-database
USDA National Agricultural Statistics Service (2024), “NASS – Quick Stats: Ag Data Commons”, available at: https://data.nal.usda.gov/dataset/nass-quick-stats (accessed 13 May 2024).
USDA Office of the Chief Economist (2022), “USDA Agricultural Projections to 2031”, Prepared by the Interagency Agricultural Projections Committee, Long-Term Projections Report OCE-2022-1, United States Department of Agriculture.
Vabalas, A., Gowen, E., Poliakoff, E. and Casson, A.J. (2019), “Machine learning algorithm validation with a limited sample size”, PLOS ONE, Vol. 14 No. 11, pp. 1-20, doi: 10.1371/journal.pone.0224365.
Wan, R., Mei, S., Wang, J., Liu, M. and Yang, F. (2019), “Multivariate temporal convolutional network: a deep neural networks approach for multivariate time series forecasting”, Electronics, Vol. 8 No. 8, p. 876, doi: 10.3390/electronics8080876, available at: https://www.mdpi.com/2079-9292/8/8/876
Wang, Y., Shen, Y., Mao, S., Chen, X. and Zou, H. (2019), “LASSO and LSTM integrated temporal model for short-term solar intensity forecasting”, IEEE Internet of Things Journal, Vol. 6 No. 2, pp. 2933-2944, doi: 10.1109/jiot.2018.2877510.
Acknowledgements
Work on this article was performed while Ani L. Katchova was on a sabbatical at USDA-ERS. We thank Utpal Vasavada and Saleem Shaik for their helpful comments and suggestions. This study was supported in part by Cooperative Agreement #58-3000-1-0068 from the USDA-ERS.