Data Analytics in Earth Science
Continuous improvements in the sensor resolution of Earth observation platforms generate large quantities of hyperspectral data for the mapping and monitoring of natural and man-made land covers.
Current Synthetic Aperture Radar missions, with high spatial resolution and frequent repeat passes, place heavy demands on the analysis of satellite time series data. Such analysis enables the observation of dynamic processes involving natural landscapes and built-up sites with significant socio-economic, environmental, and geopolitical impact. Similarly, multispectral point cloud datasets in the Earth sciences, created by light detection and ranging (LiDAR) scanners, drive data growth, up to scans of whole countries.
The University of Iceland team provides applications based on three data analytics methods for extracting knowledge from data:
• Density-based clustering (NextDBSCAN),
• Support Vector Machines (NextSVM),
• Deep learning (TensorFlow with Horovod).
NextDBSCAN performs approximate and exact density-based clustering, producing outputs that comply with the original DBSCAN algorithm. Its underlying algorithm uses orthtrees, i.e. octrees generalized to arbitrary dimension, to speed up range searches, and a shallow tree labeling method that is reminiscent of union-find algorithms. NextDBSCAN combines a very fast average performance baseline with good scaling properties on distributed systems, so the application is suitable for large-scale HPC systems and simple laptops alike, and for any dataset. The application is publicly available at https://github.com/ernire/nextdbscan-exa.
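To make the idea of density-based clustering with spatial binning and union-find labeling concrete, here is a minimal serial sketch. It is not the NextDBSCAN algorithm: it assumes a flat uniform grid with cell side eps as a simple stand-in for the orthtree, and a textbook union-find instead of the shallow tree labeling. The function name `grid_dbscan` and its parameters are hypothetical.

```python
from itertools import product
from math import dist, floor

def grid_dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    dim = len(points[0])

    def cell_of(p):
        return tuple(floor(c / eps) for c in p)

    # Bin points into cells of side eps, so that every eps-neighbour of a
    # point lies in the point's own cell or in an adjacent one.
    cells = {}
    for i, p in enumerate(points):
        cells.setdefault(cell_of(p), []).append(i)
    offsets = list(product((-1, 0, 1), repeat=dim))

    def neighbours(i):
        key = cell_of(points[i])
        found = []
        for off in offsets:
            cell = tuple(k + o for k, o in zip(key, off))
            for j in cells.get(cell, ()):
                if dist(points[i], points[j]) <= eps:
                    found.append(j)
        return found  # includes i itself

    neigh = [neighbours(i) for i in range(len(points))]
    core = [len(n) >= min_pts for n in neigh]

    # Union-find over core points: cores within eps share a cluster.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, ns in enumerate(neigh):
        if core[i]:
            for j in ns:
                if core[j]:
                    parent[find(i)] = find(j)

    labels = [-1] * len(points)
    ids = {}
    for i in range(len(points)):
        if core[i]:
            labels[i] = ids.setdefault(find(i), len(ids))
    # Border points join the cluster of the first core neighbour found.
    for i, ns in enumerate(neigh):
        if not core[i]:
            for j in ns:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(grid_dbscan(pts, eps=1.5, min_pts=3))
```

With these illustrative points, the two squares form two clusters and the isolated point is reported as noise. The number of cells inspected per query grows as 3^dim in this sketch, which is one reason a real implementation would prefer a tree structure for higher dimensions.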
NextDBSCAN supports GPU accelerators, which generally provide a much faster average performance baseline than CPUs alone while using much less energy, as the two graphs below illustrate. The graphs show a typical result from our experiments comparing the CPUs of the Cluster Module (CM) with the GPUs of the Extreme Scale Booster (ESB). When GPUs are available, NextDBSCAN runs entirely on them, using the CPUs only for the initial offloading.
NextDBSCAN exploits heterogeneity when performing grid searches across the parameter space and/or Level-Of-Detail (LOD) studies. The application mapping used in the DEEP-EST project is depicted below. The module executions are quasi-independent: a heuristic estimates the computational complexity of each task and assigns it to the best-fitting module. As the second figure below illustrates, this heterogeneous approach yielded a slight improvement in aggregated runtime (for epsilon values of 80 and 160).
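The assignment of quasi-independent tasks to modules can be sketched as a greedy scheduler. Everything specific here is an assumption made for illustration: the cost model in `estimate_cost`, the relative module speeds, and the greedy earliest-finish rule are hypothetical and are not the heuristic used in the DEEP-EST project.

```python
def estimate_cost(n_points, eps, dim):
    # Hypothetical cost model: work grows with the number of points and
    # with the volume of the eps-neighbourhood.
    return n_points * eps ** dim

def assign_tasks(tasks, module_speed):
    """Greedy assignment, largest estimated task first.

    tasks:        dict of task name -> estimated cost
    module_speed: dict of module name -> relative throughput (assumed)
    Returns (plan, finish): the tasks per module and the estimated
    finish time of each module.
    """
    plan = {m: [] for m in module_speed}
    finish = {m: 0.0 for m in module_speed}
    for name, cost in sorted(tasks.items(), key=lambda kv: -kv[1]):
        # Pick the module that would complete this task earliest.
        best = min(module_speed,
                   key=lambda m: finish[m] + cost / module_speed[m])
        plan[best].append(name)
        finish[best] += cost / module_speed[best]
    return plan, finish

# Illustrative grid search over epsilon; the speeds are made-up numbers.
tasks = {f"eps={e}": estimate_cost(1000, e, 2) for e in (20, 40, 80, 160)}
plan, finish = assign_tasks(tasks, {"CM": 1.0, "ESB": 4.0})
print(plan)
print(finish)
```

Under these assumed numbers the most expensive task goes to the faster module and the remaining tasks are spread so that both modules finish at a similar estimated time. The quality of such a plan depends entirely on how well the cost estimate matches real runtimes.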
NextSVM trains Support Vector Machine (SVM) models using a parallel implementation of the Sequential Minimal Optimization (SMO) algorithm, designed to perform well on both shared and distributed memory systems. In a typical experiment, NextSVM performs best when using both CPUs and GPUs on a single module, but it can also run exclusively on either one. The figure below shows a typical result from the experiments carried out in the DEEP-EST project.
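For readers unfamiliar with SMO, the following is a minimal serial sketch of the simplified variant of the algorithm (random choice of the second multiplier, linear kernel). It is not the parallel NextSVM implementation; the function names, default parameters, and the `max_sweeps` safety cap are assumptions made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=10, max_sweeps=1000, seed=0):
    """Simplified serial SMO, linear kernel. Labels must be +1 or -1 and
    at least two samples are required. Returns (alpha, b)."""
    rng = random.Random(seed)
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, b = [0.0] * n, 0.0

    def error(i):  # decision value minus label for sample i
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b - y[i]

    passes = sweeps = 0
    while passes < max_passes and sweeps < max_sweeps:
        sweeps += 1
        changed = 0
        for i in range(n):
            Ei = error(i)
            violates = ((y[i] * Ei < -tol and alpha[i] < C) or
                        (y[i] * Ei > tol and alpha[i] > 0))
            if not violates:
                continue
            j = rng.choice([k for k in range(n) if k != i])
            Ej = error(j)
            ai, aj = alpha[i], alpha[j]
            # Box bounds that keep both multipliers inside [0, C].
            if y[i] != y[j]:
                lo, hi = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                lo, hi = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * K[i][j] - K[i][i] - K[j][j]
            if lo >= hi or eta >= 0:
                continue
            new_aj = min(hi, max(lo, aj - y[j] * (Ei - Ej) / eta))
            if abs(new_aj - aj) < 1e-5:
                continue
            # Move alpha_i so that sum(alpha * y) stays unchanged.
            new_ai = ai + y[i] * y[j] * (aj - new_aj)
            di, dj = new_ai - ai, new_aj - aj
            b1 = b - Ei - y[i] * di * K[i][i] - y[j] * dj * K[i][j]
            b2 = b - Ej - y[i] * di * K[i][j] - y[j] * dj * K[j][j]
            alpha[i], alpha[j] = new_ai, new_aj
            if 0 < new_ai < C:
                b = b1
            elif 0 < new_aj < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

def weights(X, y, alpha):
    """Recover the weight vector of the linear model from the multipliers."""
    return [sum(a * t * x[d] for a, t, x in zip(alpha, y, X))
            for d in range(len(X[0]))]

X = [(-1.0, -1.0), (1.0, 1.0)]
y = [-1, 1]
alpha, b = smo_train(X, y)
print(weights(X, y, alpha), b)
```

Each step optimizes two multipliers analytically while preserving the equality constraint, which is what makes SMO attractive as a building block: the per-step work is small and dominated by kernel evaluations.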
The deep learning application consists of a TensorFlow/Keras script that uses Horovod to speed up the distributed communication between nodes. Additionally, two different application mappings examine the possible benefits of intermediary storage hardware for heterogeneous deep learning workflows, namely the DCPMM available in the Data Analytics Module (DAM) and the NAM in the ESB. These mappings are illustrated in the figure below. They can decrease the overall runtime in some cases, such as when using a Transfer Learning (TL) strategy, where models are trained for only a short duration at a time but over a very high number of possible iterations.
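The communication pattern behind data-parallel training can be illustrated without any framework. The sketch below simulates, in plain Python, workers that each compute a gradient on their own data shard and then average it with an allreduce, so that all replicas apply the same update. It does not use the Horovod or TensorFlow APIs; the one-parameter model, the function names, and the equal shard sizes are assumptions made for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def allreduce_mean(values):
    """Stand-in for an averaging allreduce: every worker gets the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(w, shards, lr):
    """One synchronous SGD step; returns the new weight on each worker."""
    local = [grad_mse(w, xs, ys) for xs, ys in shards]  # per-worker work
    averaged = allreduce_mean(local)                    # communication
    return [w - lr * g for g in averaged]

# Two simulated workers, each holding half of the data for y = 2 * x.
shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
print(data_parallel_step(0.0, shards, lr=0.05))
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, so every replica stays in sync after each step. In a real distributed run the averaging step is where the network traffic occurs, which is why the efficiency of that collective matters for scaling.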