PDAF - the Parallel Data Assimilation Framework
PDAF is one of the most widely used data assimilation frameworks. It enables data assimilation, the combination of observations and numerical models, by simplifying the implementation and application of data assimilation systems based on existing numerical models. With PDAF, one reaches the point of actually applying data assimilation faster, even with complex model systems. Users of PDAF can draw on the long-term experience of the core developers, who have applied PDAF in a wide range of data assimilation applications for more than two decades.
PDAF provides solutions for the typical challenges that arise when applying data assimilation with complex models, e.g.
- providing parallelized and optimized data assimilation methods compatible with the complexity and size of the models
- handling of different model fields or variables in the state vector treated by the data assimilation
- handling of various observation types relating to different model fields
- computational complexity of data assimilation, in particular when integrating an ensemble of model states
- application of different diagnostics on the model state or ensemble and the related observations
PDAF is computationally highly efficient. It is designed to be compatible with numerical models in the Earth sciences, but its open design makes it straightforward to apply PDAF for data assimilation with essentially any model from any discipline. Because PDAF is built for easy coupling with numerical models, a data assimilation application can typically be set up within a few weeks. PDAF provides complete implementations of data assimilation methods, in particular ensemble Kalman filters and smoothers, nonlinear filters, and 3-dimensional variational methods. All methods are optimized for application on parallel computers. PDAF also provides an internal interface to the data assimilation methods, so that researchers can extend PDAF with their own developments.
PDAF is free open-source software and is continuously extended with new assimilation methods and tools for data assimilation. In addition to the PDAF framework itself, coupling codes to PDAF exist for various models, and many of these coupling codes are available as open source.
Further information and code access
Below we provide some more information about PDAF. For detailed information, we recommend visiting the dedicated PDAF web site pdaf.awi.de. The web site contains a tutorial teaching how to use PDAF, step-by-step implementation instructions, full documentation, a list of publications using PDAF, an overview of models that have already been coupled with PDAF, and links to the PDAF code releases. There are also a number of scientific publications about PDAF: The first overview of PDAF was provided in Nerger et al. (2005), doi:10.1142/9789812701831_0006. Updates to the concepts and features of PDAF were described in Nerger and Hiller (2013), doi:10.1016/j.cageo.2012.03.026, and Nerger et al. (2020), doi:10.5194/gmd-13-4305-2020. The latter publication also describes the strategy to implement coupled data assimilation for Earth system models.
Further Information
- Detailed information about PDAF can be found on the PDAF website pdaf.awi.de
- The PDAF code release can be downloaded at https://github.com/PDAF/PDAF
Efficient ensemble-based data assimilation with PDAF
Complex models in the Earth system usually utilize high-performance computers and parallelization. For compatibility, the data assimilation software has to support parallelization and the use of high-performance computers. For ensemble data assimilation with dynamic ensembles, one needs to integrate the full ensemble of model states, typically with at least 30 state realizations. These computations are very costly, but they can make nearly optimal use of today's high-performance computers.
PDAF is designed to facilitate the implementation of parallel data assimilation systems with such complex models. While existing numerical models are typically not prepared for use with data assimilation algorithms, the typical structure of such models can be leveraged to simplify the implementation of data assimilation. For this implementation, PDAF supports both the use of model output files with separate programs for model and data assimilation (offline coupling) and the option to augment a model code with function calls to PDAF to build a data-assimilative model with ensemble functionality (online coupling).
PDAF's design is based on the fact that the model-specific part of data assimilation can be clearly separated from the generic parts of the assimilation algorithms. This allows us to provide a generic framework for data assimilation with well-defined interfaces to the model-specific parts.
For the online coupling, the numerical model is augmented with ensemble functionality and with data assimilation methods, like ensemble-based filters, with minimal changes to the model code. The ensemble functionality supports the efficient use of parallel computers by creating a parallel data assimilation system. The most costly part of ensemble-based Kalman filters is the integration of an ensemble of model states. Converting a model into an ensemble model allows this integration to be computed in parallel with multiple concurrent model tasks. This parallelization makes ensemble-based data assimilation with PDAF highly scalable, as shown by scaling tests with up to 57600 processor cores.
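As a minimal sketch of this task parallelism (illustrative only, not PDAF's actual initialization code; the variable names and the simple task assignment are assumptions), the global MPI communicator can be split into several model-task communicators so that the ensemble members are integrated concurrently. In a PDAF application, this splitting is done by an adapted parallelization routine provided with the template codes.

```fortran
! Schematic sketch (not PDAF code): split MPI_COMM_WORLD into
! n_modeltasks communicators so that several model instances
! (ensemble members) can be integrated concurrently.
program task_parallelism_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  integer :: n_modeltasks, task_id, comm_model

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  n_modeltasks = 4                         ! number of concurrent model tasks (assumed)
  task_id = rank * n_modeltasks / nprocs   ! assign each process to one task
                                           ! (assumes nprocs divisible by n_modeltasks)

  ! Processes with the same task_id form one model communicator
  call MPI_Comm_split(MPI_COMM_WORLD, task_id, rank, comm_model, ierr)

  ! Each model task would now run its own (parallel) model integration
  ! using comm_model instead of MPI_COMM_WORLD.

  call MPI_Finalize(ierr)
end program task_parallelism_sketch
```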
In order to ensure compatibility with models in the Earth system, PDAF is implemented in modern Fortran and parallelized using the MPI standard. For efficiency, the numerical libraries BLAS and LAPACK are used. PDAF has been tested on different platforms, from notebooks to high-performance computers, with various compilers. PDAF can also be used from other programming languages. For applications in Python, e.g. for models based on machine learning, pyPDAF provides the possibility to use the Fortran-coded PDAF parts as a library and to implement the application functions for the data assimilation in Python.
Structure of a data assimilation system
PDAF considers three components of the assimilation system as shown in the figure for the online coupled mode:
- The Model, on the left side, provides the initialization and time integration of all fields considered in the model.
- The Observations component, on the right side, provides the observational information. It consists of the information on the variables that are observed, the observation values stored as an observation vector, and the error estimates of the observations. The structured handling of the observations is provided by PDAF-OMI, which is described further below.
- The DA method, between the model and the observations, combines the information from the model and the observations. The DA method is provided by the core of PDAF.
PDAF has well-defined interfaces that enable the information exchange between the three components, as indicated by the arrows. The PDAF core and PDAF-OMI are generic and model-agnostic. All parts of the assimilation application that are model-dependent or refer to the observations are organized as separate functions which are called by PDAF as call-back functions. These functions are implemented by the PDAF user, like subroutines or functions of the model, while the core functions of PDAF remain unchanged.
A data-assimilative model built using PDAF's online coupling is executed in the same way as the pure model, but with additional options for the data assimilation and, possibly, additional processes to enable parallel ensemble integrations. The data assimilation in the program is controlled by the user-supplied functions, so that the driver functionality remains in the model part of the program. In addition, the user-supplied functions can be implemented in the context of the model code. Commonly, models in the Earth system are coded in Fortran, and the user-supplied functions for PDAF can also be implemented in Fortran so that they are compatible with the model code. These features simplify the implementation of the user-supplied routines because one implements them as an extension of the model code.
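To make the call-back concept concrete, the following sketch shows what a user-supplied routine in the style of PDAF's collect_state call-back can look like; the model module and field names (mod_model, field_2d, nx, ny) are hypothetical placeholders, and the exact interfaces are documented with the PDAF templates.

```fortran
! Hypothetical model module holding a single 2D field (placeholder only)
module mod_model
  implicit none
  integer, parameter :: nx = 36, ny = 18
  real :: field_2d(nx, ny)
end module mod_model

! Sketch of a user-supplied call-back in the style of PDAF's
! collect_state routine: it copies the model field into the state
! vector that PDAF operates on.
subroutine collect_state_pdaf(dim_p, state_p)
  use mod_model, only: field_2d, nx, ny
  implicit none
  integer, intent(in)    :: dim_p          ! dimension of process-local state vector
  real,    intent(inout) :: state_p(dim_p) ! process-local state vector

  ! Store the 2D model field column-wise in the 1D state vector
  state_p(1:nx*ny) = reshape(field_2d, (/ nx*ny /))
end subroutine collect_state_pdaf
```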
The structure of the data assimilation system in PDAF's offline coupled mode is shown in the figure on the right. For the offline coupling the model and the assimilation are two separate programs.
- The model is started separately for each ensemble state. Each model execution is initialized from a separate restart file and the model generates new restart files.
- The observations component and the DA method are combined into the assimilation program on the right. Their functionality is analogous to the online coupling, with the difference that the assimilation program reads the model states from the restart files and also needs to read the model grid information from files. After the assimilation has updated the model states, the program writes updated restart files for the model ensemble.
Because the offline coupling relies on writing and reading disk files, it is usually less efficient than the online coupling, which exchanges information in memory. In addition, the offline mode requires restarting the models each time observations have been assimilated, which also yields overhead compared to the online coupling. How relevant the overheads due to the file handling and model restarts are depends on the data assimilation application.
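One offline assimilation cycle can be sketched as follows; the file names and the stub routines are hypothetical placeholders, and in a real setup the analysis step is performed through PDAF's offline interface rather than the placeholder shown here.

```fortran
! Schematic offline assimilation cycle (hypothetical file names and stubs).
program offline_assimilation_sketch
  implicit none
  integer, parameter :: dim_state = 1000, dim_ens = 30
  real :: ens(dim_state, dim_ens)
  integer :: i
  character(len=64) :: fname

  ! 1) Read the forecast ensemble from the model restart files
  do i = 1, dim_ens
     write (fname, '(A,I3.3,A)') 'restart_member', i, '.dat'
     call read_restart(fname, ens(:, i))
  end do

  ! 2) Analysis step combining ensemble and observations
  !    (in practice performed by the assimilation method provided by PDAF)
  call analysis_step(ens)

  ! 3) Write updated restart files for the next forecast phase
  do i = 1, dim_ens
     write (fname, '(A,I3.3,A)') 'restart_member', i, '.dat'
     call write_restart(fname, ens(:, i))
  end do

contains
  subroutine read_restart(fname, state)     ! stub: read one ensemble member
    character(len=*), intent(in) :: fname
    real, intent(out) :: state(:)
    state = 0.0                             ! placeholder
  end subroutine read_restart
  subroutine analysis_step(ens)             ! stub: ensemble analysis update
    real, intent(inout) :: ens(:,:)
  end subroutine analysis_step
  subroutine write_restart(fname, state)    ! stub: write one ensemble member
    character(len=*), intent(in) :: fname
    real, intent(in) :: state(:)
  end subroutine write_restart
end program offline_assimilation_sketch
```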
Optimized and parallelized assimilation methods
PDAF provides complete implementations of data assimilation algorithms, in particular ensemble Kalman filters, nonlinear (particle) filters, ensemble smoothers, and 3-dimensional variational methods. The algorithms are optimized for their application with complex models on high-performance computers. However, one can also use them efficiently for small applications, for example on a desktop PC or notebook computer. This compatibility also simplifies the testing of data assimilation as one can test identical assimilation codes with small toy models and complex realistic models.
Optimized and parallelized ensemble filter algorithms included in PDAF are:
- EnKF / LEnKF (Ensemble Kalman Filter / local EnKF)
- ESTKF / LESTKF (Error Subspace Transform Kalman filter / local ESTKF)
- ETKF / LETKF (Ensemble Transform Kalman filter / local ETKF)
- SEIK / LSEIK (Singular "Evolutive" Interpolated Kalman filter / local SEIK)
- NETF / LNETF (Nonlinear Ensemble Transform filter / local NETF)
- PF (Particle filter with importance resampling)
- LKNETF (Local Kalman-nonlinear ensemble transform filter)
- EnSRF (Ensemble square-root filter using serial observation processing and covariance localization)
- EAKF (Ensemble Adjustment Kalman Filter using serial observation processing and covariance localization)
All filters, except the PF and the LKNETF, are provided both with and without localization for optimal compute performance in global and localized applications; the LKNETF is only available as a local filter.
The ETKF and SEIK filters have been examined in Nerger et al. (2012b), which also introduced the ESTKF. The EnKF, SEEK, and SEIK filters are described and compared in Nerger et al. (2005a), while the local SEIK filter is described in Nerger et al. (2006). The NETF has been described in Tödter et al. (2016).
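As background for the ensemble-transform filters listed above, the analysis step of the ETKF can be written in its standard textbook form (this is generic notation, not PDAF-specific code): N is the ensemble size, X'^f the matrix of forecast ensemble perturbations, Y' the perturbations of the observed ensemble, R the observation error covariance, and y^o the observation vector.

```latex
% Schematic ETKF analysis step in standard notation.
\begin{align}
  \tilde{\mathbf{A}} &= \left[(N-1)\,\mathbf{I} + \mathbf{Y}'^{\mathsf T}\mathbf{R}^{-1}\mathbf{Y}'\right]^{-1} \\
  \bar{\mathbf{w}}   &= \tilde{\mathbf{A}}\,\mathbf{Y}'^{\mathsf T}\mathbf{R}^{-1}\left(\mathbf{y}^{o} - H(\bar{\mathbf{x}}^{f})\right) \\
  \mathbf{W}         &= \left[(N-1)\,\tilde{\mathbf{A}}\right]^{1/2} \\
  \mathbf{x}_i^{a}   &= \bar{\mathbf{x}}^{f} + \mathbf{X}'^{f}\left(\bar{\mathbf{w}} + \mathbf{W}_{:,i}\right)
\end{align}
```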
In addition to the filter algorithms, the following smoothers are available:
- EnKS (Ensemble Kalman Smoother)
- ETKS (Ensemble Transform Kalman Smoother)
- LETKS (Local Ensemble Transform Kalman Smoother)
- ESTKS (Error Subspace Transform Kalman Smoother)
- LESTKS (Local Error Subspace Transform Kalman Smoother)
- LNETS (Local Nonlinear Ensemble Transform Smoother)
The smoother extension was described in Nerger et al. (2014), which also studied the influence of nonlinearity on the smoothing. The LNETS was studied in Kirchgessner et al. (2017).
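Schematically, these ensemble smoothers reuse the analysis weights: the mean weight vector and transform matrix computed by the filter at the observation time are also applied to the stored ensemble at an earlier time t_{k-l} (this is the standard ensemble-smoother construction in the notation used above; see Nerger et al. (2014) for the exact formulation).

```latex
% Schematic smoother update at an earlier time t_{k-l} using the
% weights computed by the filter at the observation time t_k.
\begin{equation}
  \mathbf{x}_{i,\,t_{k-l}}^{s} \;=\; \bar{\mathbf{x}}_{t_{k-l}} \;+\; \mathbf{X}'_{t_{k-l}}\left(\bar{\mathbf{w}} + \mathbf{W}_{:,i}\right)
\end{equation}
```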
PDAF also provides the following 3D variational data assimilation methods:
- 3D-Var (incremental 3D variational assimilation with parameterized covariances)
- 3D-EnVar (3D ensemble-variational assimilation)
- hyb3DVar (hybrid parameterized-ensemble variational assimilation)
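For reference, the incremental 3D-Var variants minimize a cost function of the standard form below (schematic textbook notation, not PDAF-specific); in the ensemble and hybrid variants, the background error covariance B is represented fully or partly by the ensemble perturbations.

```latex
% Incremental 3D-Var cost function: delta x is the increment, B the
% background error covariance, H the linearized observation operator,
% d = y^o - H(x^b) the innovation, R the observation error covariance.
\begin{equation}
  J(\delta\mathbf{x}) \;=\; \tfrac{1}{2}\,\delta\mathbf{x}^{\mathsf T}\mathbf{B}^{-1}\delta\mathbf{x}
  \;+\; \tfrac{1}{2}\,\left(\mathbf{d} - \mathbf{H}\,\delta\mathbf{x}\right)^{\mathsf T}\mathbf{R}^{-1}\left(\mathbf{d} - \mathbf{H}\,\delta\mathbf{x}\right)
\end{equation}
```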
A general overview of PDAF is provided in Nerger et al. (2005b) and the implementation strategy used in PDAF as well as its parallel performance have been discussed in Nerger and Hiller (2013). The strategy to implement coupled data assimilation for Earth system models is described in Nerger et al. (2020).
Advanced observation handling
Assimilating different types of observations is common. For the ocean, these are typically satellite observations as well as in situ measurements from ships, Argo profilers, or other measurement devices. The observations have to be related to the model fields through observation operators, which map the model fields to their observation equivalents. The handling of observations can be complicated, e.g. due to quality checks, different observation operators, and the spatial and temporal availability of the observations.
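In the usual notation, an observation operator H maps a model state x to its observation equivalent; the assimilation then works with the innovation d, the difference between the observations y^o and this equivalent, weighted by the observation error covariance R.

```latex
% Observation equivalent, innovation, and observation error covariance
% (standard notation).
\begin{equation}
  \mathbf{y}^{\mathrm{model}} = H(\mathbf{x}), \qquad
  \mathbf{d} = \mathbf{y}^{o} - H(\mathbf{x}), \qquad
  \mathrm{Cov}\!\left(\boldsymbol{\varepsilon}^{o}\right) = \mathbf{R}
\end{equation}
```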
To simplify the handling of observations, PDAF provides the Observation Module Infrastructure - OMI. PDAF-OMI provides a structured approach to implementing the support for different observation types. PDAF-OMI encapsulates the code for each observation type to ensure that the code for different observations cannot interfere. This also allows different developers to work on the same data assimilation code without conflicts. PDAF-OMI provides a selection of different observation operators and allows users to extend this selection. Further, PDAF-OMI provides observation-related diagnostics that allow users to easily compare the assimilation outcome with the assimilated observations to assess how well the assimilation process performs.
PDAF Tutorial
PDAF provides an extensive tutorial with both example codes and slide sets explaining the structure of the user codes for assimilating with PDAF in detail.
The tutorial codes, in combination with template codes, also serve as the basis for implementing one's own assimilation applications in cases where a model coupling code is not yet available. The template codes use extensive inline documentation to explain the required implementations.
The tutorial examples can also be used in lectures to study how different assimilation methods work. The tutorial uses a very simple two-dimensional model domain to allow for easy plotting and clearly visible effects of the assimilation on the model field.
Small Models for Assessing Data Assimilation
The PDAF releases contain different variants of the chaotic Lorenz models (namely the Lorenz-63, Lorenz-96, and Lorenz-2005 models II and III) with a full implementation with PDAF. These implementations make it possible to study data assimilation methods with varying degrees of nonlinearity. We also recommend them as an easy introduction to the application of PDAF.
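For reference, the two simplest of these test models are governed by the following standard equations, where σ, ρ, β, and the forcing F are model parameters and the Lorenz-96 index j is cyclic.

```latex
% Lorenz-63 system
\begin{equation}
  \frac{dx}{dt} = \sigma (y - x), \qquad
  \frac{dy}{dt} = x(\rho - z) - y, \qquad
  \frac{dz}{dt} = x y - \beta z
\end{equation}
% Lorenz-96 system with cyclic index j = 1, ..., J
\begin{equation}
  \frac{dX_j}{dt} = \left(X_{j+1} - X_{j-2}\right) X_{j-1} - X_j + F
\end{equation}
```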
Diagnostics, Ensembles, and Observation Generation
PDAF provides a selection of further functionality that is needed, or just useful, for data assimilation.
Diagnostics
PDAF provides functionality for different statistical diagnostics for ensemble data assimilation. These include, e.g., rank histograms and the computation of statistical moments. Diagnostic functions are provided to assess the state ensemble as well as to compare the state estimates with the assimilated observations.
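Two of the most basic diagnostic quantities are the ensemble mean and the ensemble spread, shown here in their standard form for a single state element x and ensemble size N.

```latex
% Ensemble mean and spread (sample standard deviation) for one state element.
\begin{equation}
  \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
  \sigma  = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^{2}}
\end{equation}
```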
Ensemble generation
Ensemble data assimilation requires the initialization of an ensemble of model state realizations. PDAF provides tools for generating such ensembles which represent the uncertainty of the model state.
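One common way to construct such an ensemble, shown here only as a schematic and not as PDAF's specific algorithm, is to perturb a central state x_c along the leading EOFs v_j of model variability, scaled by their variances λ_j, with sampling coefficients ω_ij (for example from second-order exact sampling).

```latex
% Schematic EOF-based ensemble generation around a central state x_c.
\begin{equation}
  \mathbf{x}_i \;=\; \mathbf{x}_c \;+\; \sum_{j=1}^{m} \omega_{ij}\,\sqrt{\lambda_j}\;\mathbf{v}_j,
  \qquad i = 1,\dots,N
\end{equation}
```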
Generating synthetic observations
For assessing data assimilation methods, twin experiments are often performed. These need synthetic observations which are generated from a model run representing the 'truth'. PDAF provides functionality to easily generate synthetic observations which mimic the distribution of real observations.
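In the standard twin-experiment setup, a synthetic observation is obtained by applying the observation operator to the 'true' model state and adding random noise drawn from the assumed observation error distribution.

```latex
% Synthetic observation from a truth run: observation operator applied
% to the true state plus random observation error with covariance R.
\begin{equation}
  \mathbf{y}^{o} = H\!\left(\mathbf{x}^{\mathrm{true}}\right) + \boldsymbol{\varepsilon},
  \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})
\end{equation}
```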
Model couplings
PDAF has already been coupled to a wide range of models, and many of the coupling codes are publicly available, as shown in the overview page on models coupled with PDAF.