The exponential accumulation of environmental and ecological data, together with the adoption of open data initiatives, brings opportunities and challenges for integrating and synthesising relevant knowledge that need to be addressed, given the ongoing environmental crises. Here we present Biospytial, a modular open source knowledge engine designed to import, organise, analyse and visualise big spatial ecological datasets using the power of graph theory. The engine uses a hybrid graph-relational approach to store and access information.
A graph data structure uses linkage relationships to build semantic structures represented as complex data structures stored in a graph database, while tabular and geospatial data are stored in an efficient spatial relational database system. We provide an application using information on species occurrences, their taxonomic classification and climatic datasets.
We built a knowledge graph of the Tree of Life embedded in an environmental and geographical grid to perform an analysis on threatened species co-occurring with jaguars Panthera onca. The Biospytial approach reduces the complexity of joining datasets using multiple tabular relations, while its scalable design eases the problem of merging datasets from different sources.
Its modular design makes it possible to distribute several instances simultaneously, allowing fast and efficient handling of big ecological datasets.
The example shows potential avenues for performing novel ecological analyses, biodiversity syntheses and species distribution models aided by a network of taxonomic and spatial relationships.
The IT revolution has created the opportunity to compute, store, and transfer massive amounts of information. In addition, data volume is growing exponentially, doubling every 2 years [ 2–4 ].
Moreover, this expansion in data production has occurred in all human activities, including the environmental sciences. This IT era is opening new opportunities for greater understanding of nature. For example, pervasive Internet connectivity has made possible the transfer of data across large distances in a short time, and the multifunctional capabilities of mobile and smart devices have enabled the management and deployment of collaborative surveys at low marginal costs.
Geospatial sciences have benefited in particular. Some iconic examples of crowd-based platforms are OpenStreetMap [ 12 ] for geographic maps and the Global Biodiversity Information Facility (GBIF), an international consortium of research and governmental institutions that gathers and publishes information on all types of biodiversity occurrences [ 13 ].
The exponential growth of data imposes new challenges for storage, access, integration, and analysis. New theoretical methods and technologies have been developed in recent years to tackle these problems.
See [ 14 , 15 ] for a review of the field and [ 16 ] for theoretical and practical challenges involving big geospatial data. A fundamental goal in ecology is the understanding of the relationships between living beings and the environment. A requirement to achieve this goal is the integration of independent studies and measurements to validate hypotheses on potential causal relations.
To test the existence of these causalities, a substantial number of inputs in terms of theory, methods, and data is needed. Moreover, reliable, reproducible, and easy-to-access methods are especially important given the urgency in addressing ongoing environmental crises e. Ecology is thus adapting rapidly to these critical challenges and is starting to adopt and develop novel theoretical and computational methods to solve a central problem: how to synthesize and integrate ecological theory with big ecological data.
Answering this question requires an interdisciplinary approach that touches many fields, including theoretical ecology, mathematical modelling, statistics, computer science, and information sciences.
For example, Loreau [ 19 ] proposed a conceptual framework for integrating ecological theory by centering evolution as the link to unify ecology; and Pavoine and Bonsall [ 20 ] proposed a semantic and mathematical formalization for unifying traits, species, and phylogenetic diversity.
The 2 approaches exemplify how evolutionary ancestry relationships between biological objects constitute a solid base to unify distant branches of ecology. From a statistical perspective, meta-analysis has been effective in synthesizing research evidence across independent studies, including unveiling general relations through a statistically sound framework [ 21 ].
Geospatial data constitute a crucial component for data fusion and harmonization; see [ 22 ] for a review of methods for heterogeneous spatial Big Data fusion, and [ 23 ] for removing bias with spatial data stratification methods.
A clear example of geospatial data fusion is the building of essential biodiversity variables (EBVs) to identify biodiversity and ecosystem change [ 24 ]. EBVs constitute a minimal set of critical variables intended to standardize and harmonize global biodiversity variables. EBVs integrate data in a standardized framework that describes spatial, temporal, and biological organization [ 27 ].
Recently, methodologies for building EBVs have been drawing the attention of interdisciplinary research for reliability and data quality [ 28 ]. System designs and infrastructures for integrating heterogeneous big ecological data are emerging. Despite the data heterogeneity and biased information against real absences (a consequence of opportunistic sampling), these types of infrastructures are able to collect sufficient quantities of data to perform statistical inference [ 33 , 34 ].
The use of high-performance computational technologies with novel statistical methods for representing and modelling big ecological data can provide deeper understanding of biodiversity evolution and its dynamics in a changing world [ 25 , 27 , 35 ].
Moreover, its implications can be extended to other branches of ecology and earth sciences. For example, a process-based approach [ 36 ] showed how community assemblages can be integrated into dynamic vegetation models to increase the precision of climatic and earth system models.
From a technical perspective, environmental and ecological data often come in matrix form such that they can be stored and analysed efficiently with a relational database management system (RDBMS) or other tabular data structure. RDBMSs are reliable and sophisticated tools. An important feature is the possibility to extend their functionality with programming languages such as C, Java, Python, or R.
This allows the combined use of an efficient data management system with a broad range of statistical libraries and programming methodologies. An example of this is the integration of spatial analysis tools into the RDBMS through the PostGIS project [ 37 ], a set of compiled functions written in the PostgreSQL Procedural Language (PostgresPL) that interfaces with high-level geospatial libraries e.
PostGIS adds GIS capabilities to the database engine, giving superior performance for querying information with geometric and topological features in space. Integrating large datasets using only relational methods is computationally intensive. For example, matching data by a common feature involves the definition of join clauses plus computing the joined lookup between the pair of tables.
The resulting product is often stored in volatile memory, a limiting factor when integrating large datasets. A query involving multiple joins from multiple data tables can involve reverse and recursive lookups, which can increase the load from O(n) to O(n^k), where k is the number of data tables to join. Although this issue can be addressed with database design techniques such as normalization [ 41 ] or caching [ 42 ], the solution likely obfuscates the comprehension of the relational schema by adding unintuitive tables and other auxiliary information.
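The growth in join cost can be illustrated with a deliberately naive sketch. It assumes a brute-force nested-loop join over k tables of n rows each, with no indexes; real RDBMS query planners use indexes and hash or merge joins and usually do far better, so this shows only the worst case. The function names, the `species_id` key, and the table contents are hypothetical and chosen for illustration.

```python
from itertools import product


def make_table(n, column):
    """Build a toy table of n rows sharing a hypothetical 'species_id' key."""
    return [{"species_id": i, column: f"{column}_{i}"} for i in range(n)]


def nested_loop_join(tables, key="species_id"):
    """Join k tables on a shared key by brute force.

    Returns the joined rows and the number of row combinations examined,
    which is n**k when every table has n rows.
    """
    examined = 0
    matches = []
    for combo in product(*tables):  # every combination of one row per table
        examined += 1
        first = combo[0][key]
        if all(row[key] == first for row in combo[1:]):
            merged = {}
            for row in combo:
                merged.update(row)
            matches.append(merged)
    return matches, examined


n = 20
tables = [make_table(n, c) for c in ("occurrence", "taxonomy", "climate")]
for k in (1, 2, 3):
    rows, examined = nested_loop_join(tables[:k])
    print(f"k={k}: {len(rows)} joined rows, {examined} combinations examined")
```

With n = 20 the number of joined rows stays at 20, while the number of combinations examined grows as 20, 400, and 8000.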
It also demands expertise to implement, and its complexity grows as more datasets are added.
Data structures based on directed acyclic graphs (DAGs) are advantageous in relation to the above approaches. Traversing a relationship in a graph database has constant cost O(1) [ 43 ] if the relations are defined explicitly for every node. Whenever a new dataset is added, a new link can be created to relate it with an existing record.
Graph databases, however, are not as efficient at processing geospatial queries or handling simultaneous queries [ 44 ]. In this sense, hybrid data management systems, capable of handling both paradigms (relational tables and DAGs), were proposed to overcome the limitations of both systems.
However, to the best of our knowledge, these proposals have not yet been implemented [ 45 ], their code is closed [ 46 ], or their scope is not suited for environmental and spatial datasets, as is the case of the Reactome Database [ 47 ]. In this article we propose an implementation of an open source knowledge engine i.
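A minimal sketch of this idea, assuming an in-memory adjacency structure in place of a real graph database: each node stores its outgoing links explicitly, so following a relationship is a dictionary lookup, and attaching a new dataset only adds nodes and links without touching existing records. The class, node identifiers, relation names, and attribute values are all hypothetical and are not taken from the Biospytial code.

```python
from collections import defaultdict


class LinkedGraph:
    """Toy property graph with explicit per-node relationship lists."""

    def __init__(self):
        self.properties = {}
        # node id -> relation name -> set of target node ids
        self.links = defaultdict(lambda: defaultdict(set))

    def add_node(self, node_id, **props):
        self.properties.setdefault(node_id, {}).update(props)

    def link(self, source, relation, target):
        self.links[source][relation].add(target)

    def neighbours(self, node_id, relation):
        # Two hash lookups on average, independent of the graph size.
        return set(self.links.get(node_id, {}).get(relation, set()))


graph = LinkedGraph()
graph.add_node("occ:1", species="Panthera onca")
graph.add_node("cell:42")
graph.link("occ:1", "IS_IN", "cell:42")

# A new dataset arrives: attach it with one extra node and one extra link.
graph.add_node("clim:42", mean_temp_c=24.5)  # made-up illustrative value
graph.link("cell:42", "HAS_CLIMATE", "clim:42")

cells = graph.neighbours("occ:1", "IS_IN")
climate = {c for cell in cells for c in graph.neighbours(cell, "HAS_CLIMATE")}
print(cells, climate)
```

Reaching the climate record from the occurrence takes two link traversals and no join clause.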
Biospytial can be considered a component of traditional spatial data infrastructure (SDI) because we simplify access and analysis of big datasets while satisfying the need of producing information for scientists and policy makers, among others [ 48 ]. Therefore, the developed engine is intended to serve SDI-based decision-making frameworks, such as, e.
The engine serves as a multi-purpose platform for modelling complex and heterogeneous data relationships using the power of graph theory. The current implementation uses the occurrence data from GBIF and their updated systematic classification [ 50 ] to build the acyclic graph of the Tree of Life (ToL).
To exemplify the geospatial capabilities, some EBVs such as mean monthly temperature, elevation, and mean monthly precipitation are also included in the engine. The article is structured as follows: the next section gives the specification and general description of the engine, followed by the methodology and software implementation for accessing biodiversity records arranged in a taxonomic tree. The knowledge graph of the ToL is explained with examples for traversing and extracting spatial and taxonomic sub-networks.
A tutorial explores the capabilities of the engine with a practical demonstration. This section shows the syntax and discusses ways to interpret and traverse the knowledge graph, ending with general conclusions and future research directions. The engine is able to import, organize, analyse, and visualize big ecological datasets using the power of graph theory. It performs geospatial and temporal computations to synthesize information in different forms. The software has been developed with object-relational and object-graph mappings (ORM and OGM, respectively) that use the object-oriented paradigm to abstract interrelated data into class instances [ 43 , 52 ].
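How occurrence records with a taxonomic classification give rise to an acyclic taxonomic graph can be sketched as follows. This is an illustration under simplifying assumptions, not the engine's ingestion code: each record is assumed to carry the seven main ranks as plain fields, nodes are keyed by (rank, name), and homonyms under different parents are ignored. The function names and the three sample records are hypothetical.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")
ROOT = ("root", "Life")


def build_tree(occurrences):
    """Fold occurrence records into a parent/child structure.

    Returns (children, parent, counts), where counts holds the number of
    occurrences found under each taxonomic node.
    """
    children = {ROOT: set()}
    parent = {}
    counts = {ROOT: 0}
    for occ in occurrences:
        node = ROOT
        counts[ROOT] += 1
        for rank in RANKS:
            child = (rank, occ[rank])
            children[node].add(child)
            children.setdefault(child, set())
            parent[child] = node
            counts[child] = counts.get(child, 0) + 1
            node = child
    return children, parent, counts


def lineage(node, parent):
    """Walk from a node up to the root, returning the names from root to node."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return [name for _, name in reversed(path)]


felids = {"kingdom": "Animalia", "phylum": "Chordata", "class": "Mammalia",
          "order": "Carnivora", "family": "Felidae"}
occurrences = [
    {**felids, "genus": "Panthera", "species": "Panthera onca"},
    {**felids, "genus": "Leopardus", "species": "Leopardus pardalis"},
    {**felids, "genus": "Panthera", "species": "Panthera onca"},
]

children, parent, counts = build_tree(occurrences)
print(lineage(("species", "Panthera onca"), parent))
print(counts[("family", "Felidae")], sorted(children[("family", "Felidae")]))
```

Because every node has a single parent, the resulting structure is a tree and therefore acyclic; aggregating occurrences at any rank is a matter of reading the count stored on that node.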
In this sense, every record is represented as an instance of a certain class with its attributes mapped one-to-one to entries in a particular table if it is stored in a relational database or in a key:value hash table if it is stored in a graph-based database. This approach allows the building of complex and persistent data structures that can represent different aspects of the knowledge base.
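The dual mapping can be sketched with a plain Python dataclass, assuming a hypothetical `Occurrence` class whose attributes map one-to-one either to the columns of a table row or to the key:value properties of a graph node. The class, its fields, and the coordinate values are illustrative only and do not reproduce the ORM or OGM layers used by the engine.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Occurrence:
    occurrence_id: int
    species: str
    longitude: float
    latitude: float

    # Relational side: attributes map one-to-one to table columns.
    def to_row(self):
        return tuple(getattr(self, f.name) for f in fields(self))

    @classmethod
    def from_row(cls, row):
        return cls(*row)

    # Graph side: attributes become the key:value properties of a node.
    def to_node(self):
        return {"label": type(self).__name__, "properties": asdict(self)}

    @classmethod
    def from_node(cls, node):
        return cls(**node["properties"])


record = Occurrence(1, "Panthera onca", -90.5, 17.2)  # made-up coordinates
row = record.to_row()
node = record.to_node()
print(row)
print(node)
```

Both representations round-trip to the same instance, which is what lets the calling code work with objects without knowing which backend stores them.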
It also allows the assembly of automatic methods for exploring, filtering, aggregating, and storing information. Each module is arranged in virtual containers isolated as stand-alone applications [ 53 ] running a common Linux image (Debian 8) as the base operating system. The virtual container technology creates a common environment for each module, enabling the user to disregard the complications of working with heterogeneous computer infrastructures [ 54 ].
Its design allows the replication of several instances of the same module in a single computer or in a distributed network.
Containerized applications are easier to replicate and migrate than large data volumes and databases, whose transfer is often resource-intensive in terms of energy, computing, network bandwidth, and management. The idea behind containerization is to move the processes, not the data, and, especially in the geospatial context, to perform spatial analysis where the data are located.
The Biospytial system with the 3 interconnected modules. It includes several libraries for performing exploratory analysis as well as Bayesian statistical inference and prediction using the probabilistic programming framework PyMC3.
The RGU module undertakes the storage and raster-vector processing. It relies on high-level abstractions that represent geospatial data stored in relational tables. The supported geometric features are multipoints, multilines, multipolygons, and multiple-band raster data. It features a fully operational PostgreSQL 9. The RGU image can be downloaded from [ 55 ].
This module hosts a graph database that stores data on nodes and their relations in a network structure called the knowledge base Fig. The graph database system is an instance of Neo4j 3. The GSPU image can be downloaded from [ 58 ].
This module provides the interface and processing toolbox for accessing, exploring, and analysing data structures through the Object Mapping design.
The container hosts a virtual environment and an Anaconda package manager [ 59 ] that includes all the dependencies required by the engine. The core code of the engine is contained in a new Python package called Biospytial [ 60 ] Fig.
The engine structure includes a drivers module to communicate with the graph database; the modules for accessing each dataset in the relational database; the module for graph traversals, data ingestion, gridding systems, vector sketching, and Jupyter notebooks; and external plugins such as spystats, a Python port of GeoR [ 61 ]. The image can be downloaded from [ 62 ].
This mode provides a granular configuration for the allocation of resources and services in a distributed manner. For example, the BCE module can be hosted in a computer with high-performance architectures or multiprocessing e. The engine includes a messaging service (Redis [ 63 ]) that delivers information between the different components. It also serves as an in-memory data structure storage and message broker.
The storage is useful for interchanging data between different platforms and languages. For example, it allows export of the results into intermediary files e. The software used in all the modules has been released with open source and free software licenses, which allow users to reproduce, modify, and publish their research source code.