Data Storage / Transport

Data stores can be used to collect the results over a longer period of time so they can be analyzed later. Each data store is different and offers different possibilities. A store can also be any system that gathers the data, even if the data is not actually persisted. Multiple alternatives are available; they differ in the storage medium:

  • Database Store (as long term store to base analysis on)
  • Graphite Store (not implemented yet, but also as long term store)
  • File Logger (easy recording)
  • Email Alert (not implemented yet)

As soon as a test is done its result will be sent to the configured logger.

Define which one to use in the suite setup (under store in the base configuration file when using the CLI, or in the options when creating a Suite instance).

File Logger

This is one of the simplest stores. It supports four types of predefined logs for which you may specify how long to keep them. It is only a good choice if you need the values seldom or if you want to use the log files as interim storage to later import them into another analysis tool like Logstash + Kibana.

Example

store:
  file:
    status: true
    error: true
    data: true
    action: true

Configuration options are:

  • dirname - base directory for all loggers (optional)
  • status - set to true or give a specification object to log the status to a text file
  • error - set to true or give a specification object to log error statuses to a text file
  • data - set to true or give a specification object to log result data to a JSON file
  • action - set to true or give a specification object to log repair actions to a text file

The detailed settings for each logger may be an object with the following properties (see the example after the list):

  • filename - Filename to be used to log to. This filename can include the %DATE% placeholder which will include the formatted datePattern at that point in the filename. (default: '.log.%DATE%')
  • dirname - The directory name to save log files to. (default: 'log')
  • datePattern - A string representing the moment.js date format to be used for rotating. The meta characters used in this string will dictate the frequency of the file rotation. For example, if your datePattern is simply 'HH' you will end up with 24 log files that are picked up and appended to every day. (default: 'YYYY-MM-DD' or 'YYYY-MM')
  • frequency - A string representing the frequency of rotation. This is useful if you want to have timed rotations, as opposed to rotations that happen at specific moments in time. Valid values are '#m' or '#h' (e.g., '5m' or '3h'). Leaving this null relies on datePattern for the rotation times. (default: null)
  • maxSize - Maximum size of the file after which it will rotate. This can be a number of bytes, or units of kb, mb, and gb. If using the units, add 'k', 'm', or 'g' as the suffix. The units need to directly follow the number. (default: null)
  • maxFiles - Maximum number of logs to keep. If not set, no logs will be removed. This can be a number of files or number of days. If using days, add 'd' as the suffix. (default: 30 or 12)
  • zippedArchive - A boolean to define whether or not to gzip archived log files. (default: true)
  • utc - Use UTC time for date in filename. (default: false)
  • extension - File extension to be appended to the filename. (default: '')
  • createSymlink - Create a tailable symlink to the current active log file. (default: true)
  • symlinkName - The name of the tailable symlink. (default: filename)
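
For example, a more detailed configuration using specification objects might look like this (the directory name and rotation values shown here are only illustrative, not the defaults):

store:
  file:
    dirname: /var/log/checkup
    status: true
    error:
      filename: error.%DATE%
      datePattern: YYYY-MM-DD
      maxFiles: 14d
      zippedArchive: true
    data: true
    action: true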

Warning

If you run checkup within the alinex server, use a separate folder for the log files because at least the error.log file will conflict with the server's error.log. Use the dirname setting with a specific folder or change the path of the individual files.

File Types

There are four types of log files:

  1. action.log

    If enabled, each test call is logged as one line with a short summary:

    2020-09-01 17:15:49 INFO     local.uptime
    2020-09-01 17:15:49 ERROR    local.load Error: The load per CPU over 5 minutes of 1.9 is too high (per CPU 0.475>=0.1).
    
  2. error.log

    If something is not OK, all the details will be logged in the error log. This includes test errors, warnings and also the possible fixes.

    2020-09-01 17:25:10 ERROR    local.load Error: The load per CPU over 5 minutes of 2.12 is too high (per CPU 0.53>=0.1).
                        REQUEST  cat /proc/loadavg && grep -c processor /proc/cpuinfo
                        RESPONSE code: 0
                        RETURN   '2.11 2.12 1.98 2/1648 20183\n4'
                        VALIDATE Host load/cpus
                                 A list of values. A simple text is split into single values at /[\s\n]+/. The values have the following format:
                                 -   0: load 1m
                                     A numeric value.
                                 -   1: load 5m
                                     A numeric value.
                                 -   2: load 15m
                                     A numeric value.
                                 -   5: cpus
                                     A numeric value.

                        RESULT   {
                                   load1m: 2.11,
                                   load5m: 2.12,
                                   load15m: 1.98,
                                   cpus: 4,
                                   load1m_per_cpu: 0.5275,
                                   load5m_per_cpu: 0.53,
                                   load15m_per_cpu: 0.495
                                 }
                        VALIDATE The load per CPU over 5 minutes must not be higher than 2 (warning) or 0.1 (error).
    2020-09-01 17:25:14 DEBUG    local.load: Fix high load
                        This is a physical machine, so CPU and memory cannot simply be extended. Therefore the currently running applications should be checked first. Maybe an application can be stopped or its load reduced:

                        CPU RAM     PPID            Application
                        9.3 18.3    1891            /opt/google/chrome/chrome
                        4.3 1.2     1163            /usr/bin/ssh-agent
                        3.1 1.0     1670            /usr/bin/kwin_x11
                        0.9 2.0     1727            /usr/bin/plasmashell
                        0.7 0.0     968             /usr/sbin/iio-sensor-proxy
                        0.4 0.4     1071            /usr/bin/mongod
                        0.3 6.1     1695            /usr/lib/firefox/firefox
                        0.1 4.1     1952            /usr/bin/akonadi_unifiedmailbox_agent
                        0.1 3.2     1719            /usr/bin/baloo_file
                        0.1 0.4     1281            /usr/sbin/mysqld

                        All processes of an application can be shown with: pstree -ap <PID>. Additionally iotop can be used to assess the disk IO. If processes are stopped, the load should slowly improve.
    
  3. data.log

    The data log is not meant for human reading but contains all data from the tests as JSON, to be evaluated with any kind of analysis tool:

    {"message":{"path":"local.uptime","status":"OK","result":{"boot":"2020-09-01T13:25:16.000Z","age":119}},"level":"info","timestamp":"2020-09-01 17:25:10"}
    {"message":{"path":"local.load","status":"ERROR","error":{},"result":{"load1m":2.11,"load5m":2.12,"load15m":1.98,"cpus":4,"load1m_per_cpu":0.5275,"load5m_per_cpu":0.53,"load15m_per_cpu":0.495}},"level":"error","timestamp":"2020-09-01 17:25:10"}
    
  4. status.log

    Only the status information (single line) from each run.

    2020-09-01 17:15:49 INFO     local.uptime
    2020-09-01 17:15:49 ERROR    local.load Error: The load per CPU over 5 minutes of 1.9 is too high (per CPU 0.
    

Database Store

This will store the results in compressed form in time series tables within a database to be analyzed later.

Example

store:
  database:
    client: pg
    connection:
      user: alex
      password: alex
      database: postgres
    schema: checkup
    prefix: checkup_
    exclude:
      - ./linux/process:pid

Possible clients are: pg, mysql, mysql2, mssql, oracledb, sqlite3

All the databases are defined using the following settings (a combined example follows the list):

  • client: pg
  • version: 11.0 - optional; for pg, mysql, oracledb
  • connection
    • user: admin
    • password: admin
    • host: localhost
    • port: 5432
    • database: alinex
  • schema: checkup - possible in pg, mssql, oracledb (use prefix in mysql)
  • prefix: checkup_ - optional
  • exclude: - optional list
    • 'local.process:pid' - exclude specific value
    • 'local.process' - exclude case
    • './linux/process:pid' - exclude specific value from test
    • './linux/process' - exclude test
    • ':pid' - exclude this values from any case
  • cleanup:
    • log - default 10000 - 10k messages
    • 5minute - default 12 h - 144 values
    • hour - default 7 days - 168 values
    • day - default 3 months - 60 values
    • month - default 5 years - 60 values
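
Combining these settings, a configuration might look like the following sketch (the concrete values are only illustrative; the notation of the cleanup values is an assumption derived from the defaults listed above):

store:
  database:
    client: pg
    connection:
      user: admin
      password: admin
      host: localhost
      port: 5432
      database: alinex
    schema: checkup
    prefix: checkup_
    exclude:
      - 'local.process:pid'
      - ':pid'
    cleanup:            # value notation assumed, see the defaults above
      hour: 14 days
      day: 6 months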

Info

For PostgreSQL a special implementation is used which uses a connection pool with a maximum number of connections equal to the concurrency configured for the test cases. But keep in mind that overlapping scheduler runs may overload the pool completely.

Native postgres support can be enabled by installing libpq-dev on the server and adding the pg-native npm module:

sudo apt install libpq-dev
npm install pg-native

The prefix is optional and will be added before all table names. This is necessary if you want to store different suites within the same database but separated from each other.

The exclude list can be used to restrict storage. Generally everything will be written to the database, but some values need a lot of space and are not used in the analysis, so you can exclude them from being stored. See the examples above for the possible patterns.

And the cleanup settings define the time frame after which an automatic cleanup may remove older values. This is not calculated by calendar math, meaning the time after which the deletion takes place may vary slightly (because internally the time frames are converted to average seconds).

Warning

Some result values can be large and consume a lot of storage over time. Therefore some information from the result is blocked by default in the test. The default can be overridden for each test by giving a custom setting like ./log/event:- which will store everything (because no result element named '-' exists).
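
As a sketch, such an override would be given like any other exclude pattern (assuming it goes into the exclude list; ./log/event is just the example path from above):

store:
  database:
    exclude:
      - './log/event:-'   # assumed placement: overrides the test's default exclusion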

Tables

The results will be stored in different granularities: minute5, hour, day, month. For each interval a separate table will be created. The cleanup will also be done by the checkup system and can be configured.

Schema

All values are stored in tables, with numeric and other values kept separate (a rough DDL sketch follows the list):

  • <prefix><granularity>_num
    • timeslot - from the interval the current time belongs to
    • testcase - path of case
    • element - name within result or status
    • num - number of checks
    • min - minimum of values
    • max - maximum of values
    • mean - arithmetic mean value
    • std - standard deviation
  • <prefix><granularity>_data
    • timeslot - from the interval the current time belongs to
    • testcase - path of case
    • element - name within result or status (WARN/ERROR)
    • num - number of checks
    • md5 - hash of json value for indexing
    • json - all other values as JSON
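
As an illustration, the numeric table for the minute5 granularity with the prefix checkup_ could roughly correspond to the following DDL (a sketch only; the real tables are created by checkup itself and may use different types and indexes):

CREATE TABLE checkup_minute5_num (
    timeslot timestamp NOT NULL, -- the interval the measurement belongs to
    testcase varchar NOT NULL,   -- path of the test case
    element  varchar NOT NULL,   -- name within result or status
    num      integer,            -- number of checks
    min      double precision,   -- minimum of the values
    max      double precision,   -- maximum of the values
    mean     double precision,   -- arithmetic mean value
    std      double precision    -- standard deviation
);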

To keep the store as small as possible the numerical values are reduced to statistical values for each time interval. The statistics are therefore updated continuously using Welford's online algorithm.

Note

The arithmetic mean and standard deviation are normally calculated as:

\[ \overline x_n = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n} \]
\[ \sigma_n = \sqrt{ \frac{\sum_{i=1}^{n} (x_i-\overline x_n)^2 }{n-1} } \]

But since we don't store all values, we have to calculate the new values from the previous ones using:

\[ \overline x_n = \overline x_{n-1} + \frac{x_n - \overline x_{n-1}}{n} \]
\[ \sigma_n = \sqrt{ \sigma_{n-1}^2 + \frac{(x_n - \overline x_{n-1} )^2}{n} - \frac{\sigma_{n-1}^2}{n - 1} } \]
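
For example, with purely hypothetical numbers: if the mean over the first three measurements is 2.0 and the fourth measurement is 3.0, the new mean follows directly from the previous one without re-reading any stored values:

\[ \overline x_4 = 2.0 + \frac{3.0 - 2.0}{4} = 2.25 \]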

The status is stored as a numerical value, too, using -1 = NODATA, 0 = OK, 1 = WARN and 2 = ERROR.

Additionally an event log table will store the individual problems with details:

  • <prefix>log
    • time - time of event
    • testcase - path of case
    • level - numeric status (-1 = NODATA, 0 = OK, 1 = WARN, 2 = ERROR)
    • status - status name
    • message - error message
    • verbose - verbose log

Cleanups run every half hour for all tables together.

Analysis

As described above, numerical data will be stored with min, max, mean and standard deviation to help with analysis. The min and max values show the value range, the mean shows the progression over time and the standard deviation tells you how much the measurements are spread around the mean. A low value means most measurements are near the mean.

SQL Queries

Show the stored test cases:

SELECT distinct testcase FROM dvb_checkup.minute5_num;

Show all elements stored for a specific test case:

SELECT 'num' as store, element FROM dvb_checkup.minute5_num WHERE testcase='local.os'
UNION
SELECT 'data' as store, element FROM dvb_checkup.minute5_data WHERE testcase='local.os';

Get numerical data for a specific element over time:

SELECT timeslot, min, max, mean, std FROM minute5_num WHERE testcase='local.os' AND element='security';

Get non-numerical data for a specific element over time:

SELECT * FROM minute5_data WHERE testcase='local.os' AND element='dist';

To get the errors per time period:

SELECT * FROM minute5_data WHERE element='error' ORDER BY timeslot DESC;

And to get only the last errors with details:

SELECT * FROM log ORDER BY time DESC;
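
You may also aggregate the stored data, for example to count how many error results were recorded per test case (a sketch using the tables described above):

SELECT testcase, sum(num) AS problems
FROM minute5_data
WHERE element='error'
GROUP BY testcase
ORDER BY problems DESC;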

Visualization

DbVisualizer / DBeaver

Both tools support chart generation based on SQL results in their paid enterprise versions. See the respective manual on how to do it; it's really easy.

Grafana

Grafana is a really good web application to visualize data graphically from different data sources. It is easy to install and configure. It can:

  • display charts over time
  • allow switching of time ranges
  • show annotation for log events with details
  • show tables of last errors

Setup Data Source

Download and install it from grafana.com and read the docs.

To get all that, here is how to do a basic configuration of Grafana with a PostgreSQL database:

  1. Set up the database to be used under Configuration > Data Sources:

    Add a new PostgreSQL data source and set it up with:

    • Host: <domain or ip> - you may add :<port> if it is not the default port 5432
    • Database
    • User
    • Password
  2. Create a dashboard for a specific area.

  3. Define a range variable (it is used as ${range:raw} in the queries below):

    • Type: Custom
    • Values: 5 minutes : minute5, hours : hour, days : day, months : month
  4. Set annotations to show the details:

    The first annotation will show the error messages stored within the selected ranges. Here multiple problems within the same range will be concatenated together.

    Annotation: Problems

    SELECT
        extract(epoch from timeslot) AS time, -- start of timeslot
        extract(epoch from timeslot + case -- calculate end based on range
            when '${range:raw}' = 'minute5' then interval '5 minutes' 
            when '${range:raw}' = 'hour' then interval '1 hour' 
            when '${range:raw}' = 'day' then interval '1 day' 
            when '${range:raw}' = 'month' then interval '1 month' 
            end) AS timeend, -- end of timeslot
        string_agg('<h3>' || testcase || '</h3><p>' || (json::json#>>'{}'), '<p>') as text, -- formatted text
        'problem' as tags
    FROM
        dvb_checkup.${range:raw}_data
    WHERE
        $__timeFilter(timeslot) and testcase like 'path.group.%' and element ='error'
    GROUP BY timeslot
    

    You may add multiple annotations for detailed log entries (better create multiple annotations instead of one combined entry so they can be switched on and off individually). This will show each individual message together with a part (here max 300 characters) of the verbose log.

    Annotation: Log

    SELECT
        extract(epoch from time) AS time,
        '<h3>' || testcase || '</h3><p>' || message || '<pre>' || substring("verbose"  for 300) || '...</pre>' as text,
        status as tags
    FROM
        dvb_checkup.log
    WHERE
        $__timeFilter(time) and level != 0 and testcase like 'path.group.test' 
    
  5. Now you can add multiple panels to the dashboard and define the data (a raw SQL sketch of such a query follows the examples):

    Panel with multiple elements

    • FROM: ${range:raw}_num - use the selected range here
    • Time column: timeslot
    • Metric column: element
    • WHERE:
      • Macro: $__timeFilter
      • Expr: testcase = 'path.group.test'
      • Expr: element IN ('conn', 'conn_writing', 'conn_keepalive', 'conn_losing')

    Panel with same element from multiple testcases

    • FROM: ${range:raw}_num - use the selected range here
    • Time column: timeslot
    • Metric column: element
    • WHERE:
      • Macro: $__timeFilter
      • Expr: testcase LIKE 'path.group.%'
      • Expr: element = 'time'
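
    In raw SQL mode the first panel roughly corresponds to a query like the following (a sketch only; the choice of mean as the value column and the $__time macro are assumptions, not taken from the panel editor settings above):

    SELECT
        $__time(timeslot),  -- timeslot as the time column
        element AS metric,  -- one series per element
        mean AS value       -- plot the mean of each interval
    FROM
        dvb_checkup.${range:raw}_num
    WHERE
        $__timeFilter(timeslot)
        AND testcase = 'path.group.test'
        AND element IN ('conn', 'conn_writing', 'conn_keepalive', 'conn_losing')
    ORDER BY timeslot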

Housekeeping

At last, please also keep an eye on the tables, which may grow and grow. You may collect a lot of data here which needs housekeeping by removing old entries after some time.
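
If the configured cleanup is not enough, old entries can also be removed manually, for example like this (a sketch assuming the prefix checkup_ and a retention of one year):

DELETE FROM checkup_log WHERE time < now() - interval '1 year';
DELETE FROM checkup_minute5_num WHERE timeslot < now() - interval '1 year';
DELETE FROM checkup_minute5_data WHERE timeslot < now() - interval '1 year';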