Data hub web server configuration

The datahub uses the Tomcat web server to provide a web service interface and to run a background process that processes messages.

There are two ways to run the web server: single instance or multi-instance.

A single instance web server uses a single properties file and schema folder.

A multi instance web server uses multiple properties files and schema folders to access multiple "instances" of the datahub. These could be databases used by different applications, or, in a service offering, databases used by different organisations.

In most cases, it makes sense to use the multi-instance web server, even if you only have one instance within it, as it provides more functionality, such as a get option to retrieve data and the ability to store and retrieve files in the data hub. However, some solutions may work better with the single instance web server because it allows some items of configuration to be hard coded into the properties file, providing a more locked-down API. For example, the single instance web server allows the user to be defaulted on an incoming message, which might be useful for data feeds from different systems that do not otherwise self-identify.

web.xml configuration

The web.xml configuration differs between the single instance and multi-instance web servers.

Single instance:

<?xml version="1.0" encoding="UTF-8"?>
<web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
    http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd">
 
  <servlet>
    <display-name>Monitor</display-name>
    <servlet-name>Monitor</servlet-name>
    <servlet-class>com.metrici.datahub.MonitorServlet</servlet-class>
    <load-on-startup>1</load-on-startup>
    <init-param>
      <param-name>propertyFile</param-name>
      <param-value>/etc/datahub/datahub.properties</param-value>
    </init-param>           
  </servlet>

  <servlet>
    <display-name>Receive</display-name>
    <servlet-name>Receive</servlet-name>
    <servlet-class>com.metrici.datahub.ReceiveMessageServlet</servlet-class>
    <init-param>
      <param-name>system</param-name>
      <param-value>*</param-value>
    </init-param>           
    <init-param>
      <param-name>propertyFile</param-name>
      <param-value>/etc/datahub/datahub.properties</param-value>
    </init-param>           
    <load-on-startup>1</load-on-startup>
  </servlet>
 
  <servlet-mapping>
    <servlet-name>Receive</servlet-name>
    <url-pattern>/receive</url-pattern>
  </servlet-mapping>
 
</web-app>

Multi-instance:

<?xml version="1.0" encoding="UTF-8"?>
<web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd">

  <context-param>
    <param-name>propertyDir</param-name>
    <param-value>/var/datahub/config</param-value>
  </context-param>

  <servlet>
    <display-name>Monitor</display-name>
    <servlet-name>Monitor</servlet-name>    
    <servlet-class>com.metrici.datahub.MonitorServlet</servlet-class>
    <load-on-startup>1</load-on-startup>
  </servlet>  
  
  <servlet>
    <display-name>Instance</display-name>
    <servlet-name>Instance</servlet-name>
    <servlet-class>com.metrici.datahub.InstanceServlet</servlet-class>
  </servlet>
 
  <servlet-mapping>
    <servlet-name>Instance</servlet-name>
    <url-pattern>/*</url-pattern>
  </servlet-mapping>
 
</web-app>

Monitor

The MonitorServlet runs a background process that searches for and processes messages. It is parameterised by properties read from the datahub properties file.

The MonitorServlet can run in single instance or multi-instance mode. The default is single instance. Add the property multi=true to datahub.properties to switch on multi-instance mode.

The MonitorServlet searches for:

Messages with a status of 0 (not processed).
Messages with a status of 2 (requires reprocessing) where either the next attempt timestamp is before the current time, or the next attempt timestamp is null and the processed timestamp is at least monitor.minimumDelay seconds before the current time.
Messages with a status of 1 (processing) or 3 (queued) where the processed timestamp is more than monitor.processTimeout seconds ago. monitor.processTimeout defaults to 21,600, i.e. 6 hours.

Messages are queued in ascending processed_timestamp sequence, i.e. oldest first.
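The selection rules above can be sketched as a small predicate. This is only an illustration of the documented rules, not the MonitorServlet's actual code: the Message class and its field names are hypothetical, timestamps are assumed to be epoch seconds, and the sketch ignores monitor.minimumAge and the queue limit.

```python
from dataclasses import dataclass
from typing import List, Optional

# Status codes as listed above.
NOT_PROCESSED, PROCESSING, REQUIRES_REPROCESSING, QUEUED = 0, 1, 2, 3


@dataclass
class Message:
    # Hypothetical stand-in for a stored message; timestamps are epoch seconds.
    status: int
    processed_timestamp: float
    next_attempt_timestamp: Optional[float] = None


def is_due(msg: Message, now: float, minimum_delay: float = 60,
           process_timeout: float = 21600) -> bool:
    """Return True if the rules above would select this message."""
    if msg.status == NOT_PROCESSED:
        return True
    if msg.status == REQUIRES_REPROCESSING:
        if msg.next_attempt_timestamp is not None:
            return msg.next_attempt_timestamp < now
        # No next attempt recorded: fall back to monitor.minimumDelay.
        return msg.processed_timestamp <= now - minimum_delay
    if msg.status in (PROCESSING, QUEUED):
        # Selected again once monitor.processTimeout seconds have passed.
        return msg.processed_timestamp < now - process_timeout
    return False


def select_messages(messages: List[Message], now: float) -> List[Message]:
    """Due messages in ascending processed_timestamp order (oldest first)."""
    due = [m for m in messages if is_due(m, now)]
    return sorted(due, key=lambda m: m.processed_timestamp)
```

Keeping the eligibility test separate from the ordering makes each rule easy to check on its own.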

It uses a monitor.delay property, set to the number of seconds before it first checks for work, and a monitor.interval property, set to the number of seconds between checks for new messages. These default to 0 and 60 respectively.

It requires a monitor.queue property set to the maximum number of messages that will be queued at any one time. This is useful in systems that can receive large numbers of messages in a small period of time, to make the system more manageable. This can be omitted or set to 0 for no limit. The default is 10.

It requires a monitor.threads property to set the maximum number of threads, which defaults to 1. This controls the number of messages that can be processed at once. (monitor.threads can be set to 0 to run all processing on the main thread and wait for it to continue. This is intended for use from some command-line utilities, and should not be used for the servlet.)
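As a rough illustration of how the documented defaults combine with a properties file, the following sketch overlays monitor.* values on those defaults. The parser is a deliberate simplification that handles only plain key=value lines and # comments, not the full Java properties syntax, and the function names are made up for this example.

```python
from typing import Dict

# Defaults as documented in this section.
MONITOR_DEFAULTS = {
    "monitor.delay": 0,
    "monitor.interval": 60,
    "monitor.queue": 10,
    "monitor.threads": 1,
    "monitor.processTimeout": 21600,
}


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def monitor_settings(text: str) -> Dict[str, int]:
    """Overlay monitor.* values from the file on the documented defaults."""
    props = parse_properties(text)
    return {name: int(props.get(name, default))
            for name, default in MONITOR_DEFAULTS.items()}


settings = monitor_settings("multi=true\nmonitor.threads=5\n")
print(settings["monitor.threads"], settings["monitor.interval"])  # prints: 5 60
```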

In multi-instance mode, if you don't want to run the monitor for one instance, set monitor.run=false in just that instance. The monitor will skip over the instance. This can be useful for instances that don't use the normal datahub processing, for example ones containing only executed services.

If a message returns a requires reprocessing status, the processor will update the next attempt timestamp. The delay before the next run depends on the duration of the last run and the age of the message.

It calculates this using the most recent runtime (in seconds) and the age of the message (processed_timestamp minus message_timestamp). It uses the following properties.

Property	Description	Default
monitor.minimumAge	The minimum age of a message before it can be processed. Setting a minimum age is required in a configuration in which some messages are submitted for immediate processing, to prevent the monitor from attempting to process those messages in the short time after the message is written and before immediate processing starts.	15
monitor.minimumDelayMultiplier	A number which is multiplied by the most recent runtime to give the minimum delay for the next run. For example, if this is set to 10, and a run took 10 seconds it will not be scheduled to run again for a minimum of 100 seconds.	10
monitor.minimumDelay	The minimum delay between runs. Typically set to 60 seconds.	60
monitor.maximumDelay	The maximum number of seconds before the message should be reprocessed. This would typically be a number of hours, such as 14400 for 4 hours.	62400
monitor.messageTimeout	The number of seconds after the message timestamp when the message should be considered timed out. This is typically set to a reasonably long period, e.g. 3456000 for 40 days. Can be set to 0 to have no timeout.	3456000
monitor.ageExponent	Used to calculate reprocessing delay.	1
monitor.ageMultiplier	Used to calculate reprocessing delay.	1

The next attempt timestamp is calculated by adding a delay to the processed timestamp. The delay is calculated using:

min(
  max(
    minimumDelay + ageMultiplier x age ^ ageExponent,
    duration x minimumDelayMultiplier
  ),
  maximumDelay
)

If the resulting next attempt timestamp is later than the message timestamp plus the message timeout period (monitor.messageTimeout), the message is marked as in error and not processed.
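To make the arithmetic concrete, here is a minimal sketch of the delay calculation using the default values from the table. It assumes the usual operator precedence (the exponent is applied before the multiplication, and both before the addition) and that all values are in seconds. The function names are illustrative and are not part of the datahub API.

```python
def reprocessing_delay(age, duration, minimum_delay=60,
                       minimum_delay_multiplier=10, maximum_delay=62400,
                       age_multiplier=1, age_exponent=1):
    """Delay in seconds, following the min/max formula above."""
    age_based = minimum_delay + age_multiplier * age ** age_exponent
    duration_based = duration * minimum_delay_multiplier
    return min(max(age_based, duration_based), maximum_delay)


def next_attempt(message_ts, processed_ts, duration,
                 message_timeout=3456000, **delay_options):
    """Next attempt timestamp, or None if the message would be timed out."""
    age = processed_ts - message_ts
    attempt = processed_ts + reprocessing_delay(age, duration, **delay_options)
    # A message timeout of 0 means no timeout.
    if message_timeout and attempt > message_ts + message_timeout:
        return None
    return attempt


print(reprocessing_delay(age=300, duration=10))     # 360: the age term wins
print(reprocessing_delay(age=300, duration=100))    # 1000: the duration term wins
print(reprocessing_delay(age=100000, duration=10))  # 62400: capped at maximumDelay
```

With the defaults, a message that is 300 seconds old and took 10 seconds to run waits 60 + 300 = 360 seconds, because that exceeds 10 x 10 = 100.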

The monitor is shut down when the servlet ends. It will wait for current work to complete, up to a maximum of monitor.shutdownTimeout seconds.

The monitor can run in one of two modes: single instance or multi-instance. To turn on multi-instance, set multi=true, and set propertyDir to a directory that contains directories, each of which represents a single instance and has a datahub.properties file within it. propertyDir defaults to the folder containing the main properties file.

For debugging, there is a verbosity parameter to control the amount of logging. The default is 0. Set it to 1 or 2 for progressively more logging.

Multi-instance configuration

When running as a multi-instance server, there is one datahub.properties file for overall processing and one datahub.properties file for each instance.

The main datahub.properties file would typically be in

/var/datahub/config/datahub.properties

The properties file for each instance needs to be in a directory named the same as the instance name, relative to the config directory. So the sales properties file would be

/var/datahub/config/sales/datahub.properties
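The directory layout described above can be explored with a short script. This sketch only mirrors the documented layout (one subdirectory per instance, each holding a datahub.properties file). It is not how the datahub itself discovers instances, and the "support" instance name is invented for the demonstration.

```python
import os
import tempfile
from typing import List


def find_instances(property_dir: str) -> List[str]:
    """Names of subdirectories of property_dir that contain datahub.properties."""
    names = []
    for entry in sorted(os.listdir(property_dir)):
        candidate = os.path.join(property_dir, entry, "datahub.properties")
        if os.path.isfile(candidate):
            names.append(entry)
    return names


# Build a throwaway layout to try it out.
with tempfile.TemporaryDirectory() as config:
    open(os.path.join(config, "datahub.properties"), "w").close()  # main file
    for name in ("sales", "support"):  # "support" is a made-up second instance
        os.makedirs(os.path.join(config, name))
        open(os.path.join(config, name, "datahub.properties"), "w").close()
    os.makedirs(os.path.join(config, "empty"))  # no properties file: ignored
    instances = find_instances(config)

print(instances)  # ['sales', 'support']
```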

The main datahub.properties file contains properties that pertain to the overall server (or the overall monitor, for running the monitor in multi-mode).

This would typically include monitor properties, and the root directory for the filestore.

multi=true
monitor.threads=5
fileStore=/var/datahub/filestore

The individual instance properties would include the database connection properties, as well as the specific fileStore value for that instance.

dbDriver=com.mysql.jdbc.Driver
dbURL=jdbc:mysql://127.0.0.1/sales
dbUser=sales
dbPassword=hare93
verbosity=1
timestampPrecision=second

elastic=com.metrici.datahub.ElasticSearchDataStore
elastic.host=127.0.0.1
elastic.port=9200
elastic.scheme=http
elastic.timestampPrecision=second
elastic.prefix=sales

authenticator=com.metrici.datahub.AllowAllAuthenticator
authorizer=com.metrici.datahub.AllowAllAuthorizer
fileStore=/var/datahub/filestore/sales
fileURLPrefix=http://localhost:8080/datahub/sales/

Receive message

The receive message servlet provides a simple web service through which source systems can pass data into the data hub. The source systems POST messages to the web service. See message format for details of the format.

The servlet takes init parameters that follow the same pattern as those for the command line program. Properties are specified separately using the load properties servlet, and reprocess and help are not supported.

system	This specifies a pipe-delimited list of valid values for system. If a single value for system is given, it is used as a default. Use a value of * to mean the default mapping (i.e. to send in data that conforms to the target entities).
entity	Similar to system, a pipe-delimited list of valid values for entity. If a single value for entity is given, it is used as a default. This option is required with the allData option.
user	Similar to system, a pipe-delimited list of valid values for user. If a single value for user is given, it is used as a default.
refresh	If set to true, indicates that the message contains all the data for the target entity, and that existing target system records should be deleted.
format	If set to "data" indicates that the incoming request contains only data, not a formatted message. If set to "file" indicates that the incoming request should be stored as a file, and data about this file passed in the message. Default value is "message". If this is "data" or "file", then the entity option must be set, and the system option may well be required.
process	If set to true, indicates that the message should be processed straight away. If set to false, the default, the message is loaded and can be processed later.
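The pipe-delimited convention used by system, entity and user can be illustrated with a small helper. This is a sketch of the rule as described in the table, under several assumptions: the function name is hypothetical, "*" is treated as an ordinary single value that becomes the default, and a missing value with more than one allowed value is rejected (the documentation does not say what the servlet does in that case). The system names "crm" and "erp" are invented.

```python
from typing import Optional


def resolve(param: str, requested: Optional[str]) -> str:
    """Apply a pipe-delimited init parameter to a value from the incoming request."""
    allowed = [v.strip() for v in param.split("|") if v.strip()]
    if requested is None:
        if len(allowed) == 1:
            return allowed[0]  # a single configured value acts as the default
        raise ValueError("no value supplied and no single default configured")
    if requested in allowed:
        return requested
    raise ValueError(f"{requested!r} is not one of {allowed}")


print(resolve("crm|erp", "erp"))  # erp
print(resolve("crm", None))       # crm
print(resolve("*", None))         # *
```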

The receive message servlet can then be mapped to a URL, in this case /receive.

You can define multiple servlets based on the com.metrici.datahub.ReceiveMessageServlet class, each with different parameters (for example, different systems and entities). These can be mapped to different URLs, and you can set different security constraints on them to secure them separately.

Instance servlet

The multi-instance web server uses the instance servlet in place of the receive message servlet. The servlet to use is identified by an instance reference that appears as the first part of the path (after the context path in Tomcat).

Unlike the receive message servlet, it does not allow for default properties (this would not make sense in a multi-tenant environment). However, different actions can be selected using different URL patterns.

The instance servlet supports a get interface, backed by the Query class, and the storage and retrieval of files.
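The idea of taking the instance reference from the first part of the path can be sketched as follows. The function is hypothetical and only shows the splitting step. The remainder of the path ("/x/y" here) is a placeholder and does not represent a documented URL pattern.

```python
from typing import Tuple


def split_instance(path_after_context: str) -> Tuple[str, str]:
    """Split '/sales/x/y' into ('sales', '/x/y')."""
    trimmed = path_after_context.lstrip("/")
    instance, _, rest = trimmed.partition("/")
    if not instance:
        raise ValueError("no instance reference in path")
    return instance, "/" + rest


print(split_instance("/sales/x/y"))  # ('sales', '/x/y')
print(split_instance("/sales"))      # ('sales', '/')
```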

Sending data to the web server

From Linux, you can use the curl command-line tool to send data to the servlet.

The general format of this is

curl -d @file -X POST url

Where file contains the data to be sent to the datahub, and url is the URL to which the receive message servlet has been mapped.

If you want to upload a file to be stored (when the format is set to "file"):

Use the --data-binary option to pass the file without translation.
Set the Content-Type header to the content type.
Set the File-Name custom header to the file name. This allows you to set a "friendly" file name to be used when downloading the file.

For example, to upload an image called mypicture.jpg, you would do something like

curl --data-binary @mypicture.jpg -H "Content-Type: image/jpeg" -H "File-Name: mypicture.jpg" -X POST url
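If curl is not available, the same upload can be built with Python's standard library. The sketch below constructs the request without sending it. The URL is an assumption for illustration only and should be replaced with the URL to which your servlet is mapped, and the three bytes stand in for real file content.

```python
import urllib.request


def build_upload(url: str, content: bytes, content_type: str,
                 file_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST equivalent to the curl upload above."""
    return urllib.request.Request(
        url,
        data=content,
        headers={"Content-Type": content_type, "File-Name": file_name},
        method="POST",
    )


# Hypothetical URL; substitute the one your servlet is mapped to.
request = build_upload("http://localhost:8080/datahub/receive",
                       b"\xff\xd8\xff", "image/jpeg", "mypicture.jpg")
print(request.get_method())  # POST
# urllib.request.urlopen(request) would send it.
```

Passing the content as bytes sends it unmodified, which corresponds to curl's --data-binary option.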