Data hub command line tools

There are four command line tools: the message loader, the monitor, query and message housekeeping.

Message loader

The data hub can be loaded from the command line using the MessageLoader component.

This can be run directly from the datahub jar file, using the following syntax.

java -jar datahub-dist.jar [options] filename

The message is read from the file.

Options are:

--properties propertyFile
Specify the location of the properties file. Defaults to datahub.properties in the current directory. On a standard-configured Linux system this should be:
--properties /etc/datahub/datahub.properties
--system system

This specifies a pipe-delimited list of valid values for system. If a single value for system is given, it is used as a default.

Use * to indicate the default system mapping (where incoming messages conform to the target entities).

--entity entity
Similar to --system, a pipe-delimited list of valid values for entity. If a single value for entity is given, it is used as a default. This option is required with the --allData option.
--timestamp timestamp
The effective timestamp to use in the message. It applies if the message does not have a timestamp, or if --format is set to data or file. Defaults to the current timestamp.
--user user
Similar to --system, a pipe-delimited list of valid values for user. If a single value for user is given, it is used as a default.
--options options
Set message options. Options should be a JSON-formatted string. Their use is implementation-dependent.
--refresh true

If set to true, indicates that the message contains all the data for the target entity, and that existing target system records should be deleted.

If multiple entities are targeted by the source entity, the refresh option only applies to the first.

--format format

Indicates how the file should be treated. Set to one of:

  • message – the file contains a JSON-formatted message. This is the default.
  • data – the file contains only data, not a formatted message.
  • file – the file should be stored as a file, and data about the file generated in the message. See the files topic for more details.
If this is data or file, then the --entity option must be set, and the --system option may well be required.
--process true
If set to true, indicates that the message should be processed. If set to false, the default, the message is loaded but not processed.
--reprocess true

If set to true, reprocess the message whose GUID is given in place of the file name. Options other than --properties are ignored.

For example:

--reprocess true 74cd9a05-e9c3-4a12-b1bf-7ab1e53ba9ee
--help
Print help text.
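
As a rough illustration of how these options combine, the sketch below assembles a loader command for a data-only file. The entity name product, the file name products.json and the run helper are hypothetical and not part of the data hub; whether --system is needed depends on your schema. By default the helper only prints the command; set DRY_RUN=0 to execute it.

```shell
#!/bin/sh
# Hypothetical locations; adjust for your installation.
JAR="${JAR:-datahub-dist.jar}"
PROPS="${PROPS:-/etc/datahub/datahub.properties}"

# Illustrative helper (not part of the data hub): print the command when
# DRY_RUN is 1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Load a data-only file for the (hypothetical) product entity and process it.
# The * is quoted so that the shell does not expand it to file names.
run java -jar "$JAR" \
  --properties "$PROPS" \
  --format data \
  --entity product \
  --system '*' \
  --process true \
  products.json
```

Printing first makes it easy to review the option list before a load that also triggers processing.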

Monitor

The data hub monitor can be run from the command line. There are two options: you can start the monitor and let it run indefinitely, or you can perform a single run of the monitor which will look for unprocessed messages.

To run the monitor and leave it running, use

java -cp datahub-dist.jar com.metrici.datahub.DataHubMonitor --properties datahub.properties

To perform a single run of the monitor, use

java -cp datahub-dist.jar com.metrici.datahub.DataHubMonitorRun --properties instance/datahub.properties

The monitor reads properties using the --properties parameter, and also has a --help option, like the loader.

Properties for the monitor are described in the web server section. If the multi=true property is set, the property file should be the base property file, with the individual instances in separate directories, as described in the web server topic. The multi=true property only applies when running the whole monitor. If submitting a single run, use the property file for the individual instance (as in the example above).

If you leave the monitor running, it will manage work identically to the monitor run from the web server, but has no controlled shutdown method. You will need to provide some method of controlling the running monitor. It may be more convenient to use something like supervisorctl to schedule single runs periodically than to run the monitor continuously.

The single run will complete all work it identifies (it ignores the monitor.shutdownTimeout). If you want a single run to perform all outstanding work, set monitor.queue to 0.
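
If neither supervisorctl nor another scheduler is available, periodic single runs could be approximated with a plain shell loop. The following is a minimal sketch under that assumption, not a substitute for a proper scheduler. The run_monitor function and the MONITOR_CMD variable are hypothetical names, and the script assumes that a non-zero exit status from DataHubMonitorRun signals a failed run.

```shell
#!/bin/sh
# Command for one monitor run, taken from the single-run example above.
# Override MONITOR_CMD to try the loop without Java (for example "echo run").
MONITOR_CMD="${MONITOR_CMD:-java -cp datahub-dist.jar com.metrici.datahub.DataHubMonitorRun --properties instance/datahub.properties}"

# Perform $1 single runs, sleeping $2 seconds between them.
# MONITOR_CMD is split on whitespace, so paths containing spaces will not work.
run_monitor() {
  runs=$1
  interval=$2
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Assumption: a non-zero exit status means the run failed.
    $MONITOR_CMD || echo "monitor run failed with status $?" >&2
    i=$((i + 1))
    if [ "$i" -lt "$runs" ]; then
      sleep "$interval"
    fi
  done
}

# Example: 12 runs, five minutes apart (uncomment to use).
# run_monitor 12 300
```

Because each run is a separate process, stopping the loop between runs does not interrupt work in progress.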

Query

The query component can be used to extract data from the data store. See Data hub query for details.

Query can be run from the command line, using the following syntax.

java -cp datahub-dist.jar com.metrici.datahub.Query [options]

Options are:

--properties propertyFile
Specify the location of the properties file. Defaults to datahub.properties in the current directory.
--user user
The user under whose authority the data is to be retrieved. No authentication check is made on the user (command line access is assumed to be authenticated), but a user may be required to navigate through the entity authorization.
--query query.json
Identifies the file that contains the query JSON. See Data hub query for syntax.
--entity entity
Entity for the query. Will overwrite any entity in the query JSON.
--timestamp timestamp
Timestamp for the query. Will overwrite timestamp in the query JSON.
--where.field value
Provide a where object value for field. This will overwrite any where clause for the field in the query JSON.
--out file

Write the output to the file. If not provided, the output is written to standard output.

At a minimum, you must pass either an entity or a query file with an entity property.

Example 1

This shows how to list all data from the product entity.

java -cp /var/datahub/lib/datahub-dist.jar com.metrici.datahub.Query \
  --properties /var/datahub/config/acme/datahub.properties \
  --entity product

Example 2

This more complex example shows what might be required to retrieve product data for a particular range, from a configuration where only the admin user has authority to read the entity.

java -cp /var/datahub/lib/datahub-dist.jar com.metrici.datahub.Query \
  --properties /var/datahub/config/acme/datahub.properties \
  --user admin \
  --query product_query.json \
  --where.range_reference ELEC1 \
  --out range_ELEC1.json

Message housekeeping

Message housekeeping scans the message store and file store and deletes messages and files that have passed their retention period.

Although files are deleted out of the file store by the housekeeping, a copy of them can be kept in a purge area. This can then be cleared out by external routines.

Housekeeping rules are controlled by two entity properties on the schema, read from the first entity mapping for the combination of source system and source entity reference:
messageRetentionPeriod
For how long messages for this system/entity should be retained, in days. The default value of 0 means "indefinitely". May be fractional. Use a negative value to mean that messages can be discarded as soon as they have been processed.
retainFiles
If set to true, files associated with this system/entity should be retained even if the associated messages are deleted. Default is false.
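
The three cases for messageRetentionPeriod (zero, negative, positive and possibly fractional) can be sketched as a small helper. The retention_rule function is purely illustrative and is not part of the data hub. It only interprets the value and does not reproduce the full deletion logic.

```shell
#!/bin/sh
# Illustrative helper (not part of the data hub): describe what a
# messageRetentionPeriod value means, following the rules listed above.
# $1 = retention period in days (may be fractional or negative)
retention_rule() {
  awk -v days="$1" 'BEGIN {
    if (days == 0) {
      print "retain indefinitely"
    } else if (days < 0) {
      print "discard once processed"
    } else {
      # Convert days to seconds: 86400 seconds per day.
      printf "retain for %.0f seconds\n", days * 86400
    }
  }'
}

retention_rule 0
retention_rule -1
retention_rule 0.5
retention_rule 30
```

A value of 0.5, for instance, corresponds to half a day, or 43200 seconds.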

To run the housekeeping, use:

java -cp datahub-dist.jar com.metrici.datahub.MessageHousekeeping [options]

Options are:

--properties propertyFile
Specify the location of the properties file. Defaults to datahub.properties in the current directory.
--check true
Show what would be deleted, but do not perform the delete.
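
One cautious way to use --check is to preview every housekeeping run before allowing it to delete anything. The sketch below assumes that a non-zero exit status means the preview failed. The housekeep function and the HOUSEKEEPING and PROPS variables are hypothetical names.

```shell
#!/bin/sh
# Housekeeping command from the syntax above. Override HOUSEKEEPING to try
# the script without Java (for example HOUSEKEEPING="echo housekeeping").
HOUSEKEEPING="${HOUSEKEEPING:-java -cp datahub-dist.jar com.metrici.datahub.MessageHousekeeping}"
PROPS="${PROPS:-datahub.properties}"

# Always preview with --check true; delete only when called with "confirm".
# Assumption: a non-zero exit status means the preview failed.
housekeep() {
  $HOUSEKEEPING --properties "$PROPS" --check true || return 1
  if [ "${1:-}" = "confirm" ]; then
    $HOUSEKEEPING --properties "$PROPS"
  fi
}

# housekeep            # preview only
# housekeep confirm    # preview, then delete
```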

Housekeeping reads the message store, the message process table and the schema definitions. A message and associated process and history records are deleted when:

Messages without an associated message process record are deleted when the message timestamp is older than the number of days identified by the invalid message retention period set in datahub.properties. These orphaned messages indicate errors, but are retained for a short period for diagnostic purposes.

The file store is also scanned. A file is deleted when:

Messages and files may have invalid systems or invalid entities. These are processed using an invalid message retention period, and with retain files set to false.

In addition to the standard messageStore, messageControl and fileStore properties, housekeeping uses the following properties, which can be preceded by "housekeeping.".

multi
Specifies that this is a multi-instance configuration. This runs housekeeping for each of the instances.
invalidMessageRetentionPeriod

For how many days messages without process entries or with unrecognised systems or entities should be retained. May be fractional. A value of 0 indicates that invalid messages should be retained indefinitely.

For semantic consistency, the default is 0. Production systems should set this to a value that reflects how long the organisation would take to detect and diagnose errors, for example to 10.

Setting this to below 0 or close to 0 is permitted. However, this could potentially impact messages as they are written and would not allow for diagnosis after errors. A value of at least 1 is recommended.

fileStorePurge
A directory into which files will be moved, rather than deleted. If not set, files are deleted and not moved to a purge area.

Set the verbosity property to 1 or more to list messages and files that are deleted, or 2 or more for more diagnostics.
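
To show how these settings might look together, the sketch below prints a properties fragment from a shell function. The purge directory is an assumed path and the housekeeping_properties function is a hypothetical name; the 10-day period follows the example value given above. The verbosity property is shown without a prefix, as it is named in this topic.

```shell
#!/bin/sh
# Print a hypothetical properties fragment for housekeeping.
# The purge directory is an assumed path; adjust it for your installation.
housekeeping_properties() {
  cat <<'EOF'
housekeeping.invalidMessageRetentionPeriod=10
housekeeping.fileStorePurge=/var/datahub/purge
verbosity=1
EOF
}

housekeeping_properties
```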

When running in multi mode, the invalidMessageRetentionPeriod on the top-level properties is used as a default for the value on instance properties.

When running in multi mode, the fileStorePurge area on the top-level properties is used to create a default for the value on instance properties, by adding the instance reference as a suffix.