Harvesting resources from remote services

GeoNode is able to harvest resource metadata from multiple remote services.

Harvesting is the process by which a metadata catalogue, such as GeoNode, connects to other remote catalogues and retrieves information about their resources. This process is usually performed periodically, in order to keep the local catalogue in sync with the remote one.

When appropriately configured, GeoNode will contact the remote service, extract a list of relevant resources that can be harvested and create local resources for each remote resource. It will also keep the resources synchronized with the remote service by periodically updating them.

Out of the box, GeoNode ships with support for harvesting from:

Other remote GeoNode instances;
OGC WMS servers;
ArcGIS REST services.

Adding support for additional harvesting sources is also possible.

GeoNode harvesting concepts

When a harvester is configured, GeoNode is able to use its corresponding harvester worker to contact the remote service and generate a list of harvestable resources. The user is then able to select which of those resources are of interest. Some time later, depending on the harvester's configured update frequency, the harvesting scheduler creates new harvesting sessions in order to create local GeoNode resources from the remote harvestable resources that the user has marked as relevant.

The above description uses the following key concepts:

harvester

This is the configuration object that is used to parametrize harvesting of a remote service. It is configurable at runtime and is preserved in the GeoNode database.

Harvesters and their properties can be managed by visiting the Harvesting -> Harvesters section of the GeoNode admin area, or by visiting the api/v2/harvesters/ API endpoint with an admin user.

Among other parameters, a harvester holds:

remote_url: Base URL of the remote service being harvested, e.g. https://stable.demo.geonode.org
harvester_type: Type of harvester worker that will be used to perform harvesting. See the Harvester worker concept and the standard harvester workers sections below for more detail. Example: geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker.
scheduling_enabled: Whether harvesting shall be performed periodically by the harvesting scheduler or not.
harvesting_session_update_frequency: How often (in minutes) should new harvesting sessions be automatically scheduled?
refresh_harvestable_resources_update_frequency: How often (in minutes) should new refresh sessions be automatically scheduled?
default_owner: Which GeoNode user shall be made the owner of harvested resources.
harvest_new_resources_by_default: Should new remote resources be harvested automatically? When this option is selected, the user does not need to specify which harvestable resources should be harvested, as all of them will be automatically marked for harvesting by GeoNode.
delete_orphan_resources_automatically: Orphan resources are those that have previously been created by means of a harvesting operation but that GeoNode can no longer find on the remote service being harvested. Should these resources be deleted from GeoNode automatically? This also applies to when a harvester configuration is deleted, in which case all of the resources that originated from that harvester are now considered to be orphan.
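
As an illustration, the parameters listed above could be gathered into a JSON document describing a harvester. This is only a minimal sketch: the field names come from the list above, whereas the helper name build_harvester_payload, the example values and the owner name are hypothetical, and the exact shape accepted by the api/v2/harvesters/ endpoint should be checked against your GeoNode version.

```python
import json


def build_harvester_payload(remote_url, harvester_type, default_owner):
    """Assemble harvester parameters into a plain dict (hypothetical helper)."""
    return {
        "remote_url": remote_url,
        "harvester_type": harvester_type,
        "scheduling_enabled": True,
        # Both frequencies are expressed in minutes; the values are examples only.
        "harvesting_session_update_frequency": 60,
        "refresh_harvestable_resources_update_frequency": 30,
        "default_owner": default_owner,
        "harvest_new_resources_by_default": False,
        "delete_orphan_resources_automatically": False,
    }


payload = build_harvester_payload(
    remote_url="https://stable.demo.geonode.org",
    harvester_type=(
        "geonode.harvesting.harvesters.geonodeharvester."
        "GeonodeUnifiedHarvesterWorker"
    ),
    default_owner="admin",  # assumed user name, for illustration only
)
print(json.dumps(payload, indent=2))
```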

harvester worker

Harvester workers implement retrieval for concrete remote service types. Each harvester uses a specific worker, depending on the type of remote service that it gets data from. Harvester workers may accept their own additional configuration parameters.

Harvester workers are set as the harvester_type attribute on a harvester. Their configuration is set as a JSON object on the harvester_type_specific_configuration attribute of the harvester.

GeoNode ships with the following harvester workers:

GeoNode - Enables harvesting from other GeoNode deployments
WMS - Enables harvesting from OGC WMS servers
ArcGIS REST services - Enables harvesting from ArcGIS REST services

Adding new harvester workers is also possible. This allows custom GeoNode deployments to add support for harvesting from other remote sources.

harvestable resource

A resource that is available on the remote server. Harvestable resources are persisted in the GeoNode DB. They are created during refresh operations, when the harvester worker interacts with the remote service in order to discover which remote resources can be harvested.

Harvestable resources can be managed by visiting the Harvesting -> Harvestable resources section of the GeoNode admin area, or by visiting the api/v2/harvesters/{harvester-id}/harvestable-resources API endpoint with an admin user.

In order to be harvested by the harvesting scheduler, a harvestable resource must have its should_be_harvested attribute set to True. This attribute can be set manually by the user, or it can be set automatically by the harvester worker when the corresponding harvester is configured with harvest_new_resources_by_default = True.
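
The interplay between should_be_harvested and harvest_new_resources_by_default can be sketched with a small in-memory model. This is not GeoNode's implementation: the HarvestableResource dataclass and the two helper functions are hypothetical and only mirror the behaviour described above.

```python
from dataclasses import dataclass


@dataclass
class HarvestableResource:
    """Simplified stand-in for a harvestable resource (hypothetical model)."""
    title: str
    should_be_harvested: bool = False


def register_discovered(titles, harvest_new_resources_by_default):
    """Create harvestable resources, pre-marking them if the harvester says so."""
    return [
        HarvestableResource(
            title=title,
            should_be_harvested=harvest_new_resources_by_default,
        )
        for title in titles
    ]


def select_for_harvesting(resources):
    """Return only the resources a harvesting session would pick up."""
    return [r for r in resources if r.should_be_harvested]


# With the default disabled, the user marks the relevant resources manually.
resources = register_discovered(["roads", "rivers"], False)
resources[0].should_be_harvested = True
print([r.title for r in select_for_harvesting(resources)])  # ['roads']
```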

harvesting session

In GeoNode, discovering remote resources and harvesting them is always done under the scope of a harvesting session. These sessions are stored in the GeoNode DB and can be inspected by visiting the Harvesting -> Asynchronous harvesting sessions section of the GeoNode admin area.

Harvesting sessions are used to keep track of the progress of execution of the relevant harvesting operations. They are updated while each operation is running. There are two types of sessions:

refresh session: This session is created during the update of harvestable resources operation. It has type=discover-harvestable-resources. During a refresh session, the harvester worker discovers remote resources and creates their respective harvestable resources in the GeoNode DB. After such a session finishes, the user can inspect the harvestable resources that were found and mark those that are relevant with should_be_harvested=True.

harvesting session: This session is created during the perform harvesting operation. It has type=harvesting. During a harvesting session, the harvester worker creates new GeoNode resources, or updates existing ones, based on the harvestable resources that have been configured with should_be_harvested=True.

In addition to the aforementioned type, harvesting sessions also carry the status attribute, which provides context on the current status of the session (and consequently of the underlying harvesting operation).

harvesting scheduler

The scheduler is responsible for initiating new harvesting operations in an automated fashion. Periodically, the scheduler goes through the list of existing harvesters, checking if it is time to dispatch one of the harvesting operations mentioned in the next section.

The scheduler’s operation frequency is configurable through the HARVESTER_SCHEDULER_FREQUENCY_MINUTES setting, which is expressed in minutes. By default the scheduler is triggered every 30 seconds, i.e. every half minute.

Note

Since the harvesting scheduler only checks for pending work at a fixed interval (every 30 seconds by default, as mentioned above), there will usually be a delay between the time a harvesting operation is due and the time it is actually scheduled. Moreover, the harvesting scheduler is implemented as a Celery task. If the Celery worker is busy, the scheduler’s task may not be triggered immediately, which can further delay the scheduling of harvesting operations.
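
The kind of check that the scheduler performs on each pass can be sketched as follows. This is a simplified illustration rather than GeoNode's actual code: the last_refresh and last_harvest keys and both function names are hypothetical, while the remaining keys are the harvester parameters described earlier.

```python
from datetime import datetime, timedelta


def is_due(last_run, frequency_minutes, now):
    """Return True when at least frequency_minutes have elapsed since last_run."""
    if last_run is None:  # the operation has never run before
        return True
    return now - last_run >= timedelta(minutes=frequency_minutes)


def operations_to_dispatch(harvester, now):
    """List the operations a scheduler pass would dispatch for one harvester."""
    if not harvester["scheduling_enabled"]:
        return []
    due = []
    if is_due(
        harvester["last_refresh"],  # hypothetical bookkeeping field
        harvester["refresh_harvestable_resources_update_frequency"],
        now,
    ):
        due.append("update list of harvestable resources")
    if is_due(
        harvester["last_harvest"],  # hypothetical bookkeeping field
        harvester["harvesting_session_update_frequency"],
        now,
    ):
        due.append("perform harvesting")
    return due


now = datetime(2024, 1, 1, 12, 0)
harvester = {
    "scheduling_enabled": True,
    "refresh_harvestable_resources_update_frequency": 30,
    "harvesting_session_update_frequency": 60,
    "last_refresh": now - timedelta(minutes=45),
    "last_harvest": now - timedelta(minutes=45),
}
print(operations_to_dispatch(harvester, now))
# ['update list of harvestable resources']
```

Because such a check only runs when the scheduler itself is triggered, an operation that becomes due just after a pass waits until the next one.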

Harvesting workflows

There are two main possible harvesting workflows:

Continuous harvesting
One-time harvesting

Continuous harvesting

This workflow relies on the harvesting scheduler in order to ensure harvested resources are continuously kept up to date with their remote counterparts.

User creates harvester and sets its scheduling_enabled attribute to True;

When the time comes, the harvesting scheduler calls the update list of harvestable resources operation. Alternatively, the user may call this operation manually the first time.

When the previous operation is done, user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True. Alternatively, if the harvester has its harvest_new_resources_by_default attribute set to True, the harvestable resources will already be marked for harvesting, without requiring manual user intervention;

When the time comes, the harvesting scheduler calls the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.

One-time harvesting

This workflow is mostly executed manually by the user.

User creates harvester and sets its scheduling_enabled attribute to False;

User calls the update list of harvestable resources operation;

When the previous operation is done, user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True;

User then proceeds to call the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.

Harvester operations

Each GeoNode harvester is able to perform a finite set of operations. These can be performed either:

In an automated fashion, being dispatched by the harvesting scheduler. Automated harvesting is only performed when the corresponding harvester has scheduling_enabled=True;
On-demand, by explicit request of the user. On-demand execution can be requested in one of two ways:
1. By selecting the relevant harvester(s) in the Harvesting -> Harvesters section of the GeoNode admin area and then selecting and running an action from the drop-down menu;
2. By interacting with the GeoNode REST API. Harvester actions are requested by issuing HTTP PATCH requests to the /api/v2/harvesters/{harvester-id}/ endpoint. The payload of such requests must specify the corresponding status. For example, by issuing a request like:
```
curl -X PATCH http://localhost/api/v2/harvesters/1/ \
  -H "Content-Type: application/json" \
  -u "myuser:mypass" \
  --data '{"status": "updating-harvestable-resources"}'
```
  We are asking that the harvester’s status be changed to updating-harvestable-resources. If the server accepts this request, then the update list of harvestable resources operation is triggered.
  
  Note
  
  The server will not accept the API request if the harvester’s current status is not ready.
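
The same request can also be prepared from Python using only the standard library. This is a minimal sketch that mirrors the curl call above: the helper name build_status_request is hypothetical, and the host, harvester id and credentials are the same placeholder values. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_status_request(base_url, harvester_id, status, username, password):
    """Build (but do not send) a PATCH request that changes a harvester's status."""
    url = f"{base_url.rstrip('/')}/api/v2/harvesters/{harvester_id}/"
    # HTTP basic authentication, equivalent to curl's -u option.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"status": status}).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_status_request(
    "http://localhost", 1, "updating-harvestable-resources", "myuser", "mypass"
)
print(request.get_method(), request.full_url)
# PATCH http://localhost/api/v2/harvesters/1/
```

Passing the resulting object to urllib.request.urlopen would send it. A rejected request (for instance because the harvester is not ready) would then surface as an urllib.error.HTTPError.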

While performing an action, the harvester’s status property transitions from ready to whatever action-related status is appropriate (as indicated below). As the operation finishes execution, the harvester’s status transitions back to ready. If the harvester has any status other than ready, then it is currently busy. A busy harvester cannot execute other operations, so you’ll need to wait until the current operation finishes.
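
The status life cycle described above can be sketched as a small state machine. This is an illustrative model only: the Harvester class and the BusyHarvesterError exception are hypothetical and do not correspond to GeoNode's own classes. The status strings are the ones named in this document.

```python
class BusyHarvesterError(Exception):
    """Raised when an operation is requested while another one is running."""


class Harvester:
    """Toy model of a harvester's status transitions (hypothetical)."""

    def __init__(self):
        self.status = "ready"

    def run(self, action_status, operation):
        # Only a harvester that is ready may start a new operation.
        if self.status != "ready":
            raise BusyHarvesterError(f"harvester is busy: {self.status}")
        self.status = action_status
        try:
            return operation()
        finally:
            # Whatever the outcome, the harvester becomes ready again.
            self.status = "ready"


harvester = Harvester()
observed = []
harvester.run(
    "updating-harvestable-resources",
    lambda: observed.append(harvester.status),
)
print(observed, harvester.status)
# ['updating-harvestable-resources'] ready
```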

Check if the remote service is available operation

This operation causes the harvester to perform a simple health check on the remote service, in order to check whether it responds successfully. The result is stored in the harvester’s remote_available property. This operation is performed in the same process as the main GeoNode (i.e. it runs synchronously).

When triggered, this operation causes the harvester’s status to transition to checking-availability. As the operation finishes, the harvester’s status transitions back to ready.

Invocation via the GeoNode admin is performed by selecting the Check availability of selected harvesters command.
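
A health check of this kind might look like the sketch below. How each harvester worker actually probes its remote service is not specified here, so this is an assumption: the function name check_remote_available, the plain GET request, the timeout and the injectable opener parameter are all hypothetical choices made for illustration.

```python
import urllib.error
import urllib.request


def check_remote_available(remote_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True when the remote service answers with a 2xx status.

    The opener parameter exists only so the network call can be replaced
    in tests. By default a real HTTP request is made.
    """
    try:
        with opener(remote_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection problems, HTTP errors and malformed URLs all count
        # as "not available" in this sketch.
        return False


class FakeResponse:
    """Stand-in response, used here to avoid contacting a real server."""
    status = 200

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


remote_available = check_remote_available(
    "https://stable.demo.geonode.org",
    opener=lambda url, timeout: FakeResponse(),
)
print(remote_available)  # True
```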