Data Storage / Transport¶
Data stores collect the results over a longer period so they can be analyzed later. Each store is different and offers different possibilities. A store can also be any system that gathers the data, even if it does not really persist it. Several alternatives are available, differing in the storage media:
- Database Store (as long term store to base analysis on)
- Graphite Store (not implemented yet, but also as long term store)
- File Logger (easy recording)
- Email Alert (not implemented yet)
As soon as a test is done, its result will be sent to the configured logger.
Define which one to use in the suite setup (base configuration file under `store` in the CLI, or via options when creating the Suite instance).
File Logger¶
This is one of the simplest stores. It supports four types of predefined logs, and for each you may specify how long to keep it. This is only a good choice if you need the values seldom or if you want to use the log files as interim storage to later import them into another analysis tool like Logstash + Kibana.
Example
```yaml
store:
  file:
    status: true
    error: true
    data: true
    action: true
```
Configuration options are:
- `dirname` - base directory for all loggers (optional)
- `status` - set to `true` or give a specification object to log the status to a text file
- `error` - set to `true` or give a specification object to log the error status to a text file
- `data` - set to `true` or give a specification object to log the result data to a JSON file
- `action` - set to `true` or give a specification object to log repair actions to a text file
The detailed settings for each logger may be an object with:
- `filename` - Filename to be used to log to. This filename can include the `%DATE%` placeholder which will include the formatted `datePattern` at that point in the filename. (default: `'.log.%DATE%'`)
- `dirname` - The directory name to save log files to. (default: `'log'`)
- `datePattern` - A string representing the moment.js date format to be used for rotating. The meta characters used in this string will dictate the frequency of the file rotation. For example, if your datePattern is simply `'HH'` you will end up with 24 log files that are picked up and appended to every day. (default: `'YYYY-MM-DD'` or `'YYYY-MM'`)
- `frequency` - A string representing the frequency of rotation. This is useful if you want to have timed rotations, as opposed to rotations that happen at specific moments in time. Valid values are `'#m'` or `'#h'` (e.g., `'5m'` or `'3h'`). Leaving this `null` relies on `datePattern` for the rotation times. (default: `null`)
- `maxSize` - Maximum size of the file after which it will rotate. This can be a number of bytes, or units of kb, mb, and gb. If using the units, add `'k'`, `'m'`, or `'g'` as the suffix. The units need to directly follow the number. (default: `null`)
- `maxFiles` - Maximum number of logs to keep. If not set, no logs will be removed. This can be a number of files or number of days. If using days, add `'d'` as the suffix. (default: 30 or 12)
- `zippedArchive` - A boolean to define whether or not to gzip archived log files. (default: `true`)
- `utc` - Use UTC time for the date in the filename. (default: `false`)
- `extension` - File extension to be appended to the filename. (default: `''`)
- `createSymlink` - Create a tailable symlink to the current active log file. (default: `true`)
- `symlinkName` - The name of the tailable symlink. (default: filename)
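For illustration, the settings above could be combined like this, using a specification object for the error log instead of `true`. The directory path and rotation values here are only example assumptions, not defaults:

```yaml
store:
  file:
    dirname: /var/log/checkup       # assumed base directory, not a default
    status: true
    error:
      filename: error.log.%DATE%    # %DATE% is replaced using datePattern
      datePattern: YYYY-MM-DD       # rotate daily
      maxFiles: 30d                 # keep rotated files for 30 days
      zippedArchive: true           # gzip rotated files
    data: true
    action: true
```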
Warning
If you run checkup within the alinex server, use a specific folder for the log files because at least the `error.log` file will conflict with the server's `error.log`. Use the `dirname` setting with a specific folder or change the path of the files.
File Types¶
There are four types of log files:
-   `action.log`

    If enabled you will find all calls of tests, each as one line with a short summary:

    ```
    2020-09-01 17:15:49 INFO local.uptime
    2020-09-01 17:15:49 ERROR local.load Error: Die Last pro CPU über 5 Minuten von 1.9 ist zu hoch (pro CPU 0.475>=0.1).
    ```
-   `error.log`

    If something is not OK, all the details will be logged in the error log. This includes test errors, warnings and also the possible fixes.

    ```
    2020-09-01 17:25:10 ERROR local.load Error: Die Last pro CPU über 5 Minuten von 2.12 ist zu hoch (pro CPU 0.53>=0.1).
    REQUEST
    cat /proc/loadavg && grep -c processor /proc/cpuinfo
    RESPONSE
    code: 0
    RETURN
    '2.11 2.12 1.98 2/1648 20183\n4'
    VALIDATE
    Host load/cpus
    Eine Liste mit Werten. Ein einfacher text wird bei /[\s\n]+/ in Einzelwerte getrennt. Die Werte haben das folgende Format:
    - 0: load 1m Ein nummerischer Wert.
    - 1: load 5m Ein nummerischer Wert.
    - 2: load 15m Ein nummerischer Wert.
    - 5: cpus Ein nummerischer Wert.
    RESULT
    { load1m: 2.11, load5m: 2.12, load15m: 1.98, cpus: 4, load1m_per_cpu: 0.5275, load5m_per_cpu: 0.53, load15m_per_cpu: 0.495 }
    VALIDATE
    Die Last pro CPU über 5 Minuten darf nicht höher als 2 (Warnung) oder 0.1 (Fehler) sein.
    2020-09-01 17:25:14 DEBUG local.load: Fix high load
    Es handelt sich hier um einen Hardware Rechner, daher können CPU und Arbeitsspeicher nicht einfach erweitert werden. Deshalb sollte zunächst ein Blick auf die aktuell laufenden Anwendungen geworfen werden. Eventuell kann eine Anwendung gestoppt werden oder deren Last reduziert werden:
    CPU  RAM  PPID  Applikation
    9.3 18.3  1891  /opt/google/chrome/chrome
    4.3  1.2  1163  /usr/bin/ssh-agent
    3.1  1.0  1670  /usr/bin/kwin_x11
    0.9  2.0  1727  /usr/bin/plasmashell
    0.7  0.0   968  /usr/sbin/iio-sensor-proxy
    0.4  0.4  1071  /usr/bin/mongod
    0.3  6.1  1695  /usr/lib/firefox/firefox
    0.1  4.1  1952  /usr/bin/akonadi_unifiedmailbox_agent
    0.1  3.2  1719  /usr/bin/baloo_file
    0.1  0.4  1281  /usr/sbin/mysqld
    Alle Prozesse einer Anwendung werden angezeigt mit: pstree -ap <PID>. Weiterhin kann iotop zur Beurteilung der Disk IO herangezogen werden. Werden Prozesse gestoppt, so sollte sich die Last langsam bessern.
    ```
-   `data.log`

    The data log is not meant for human reading but contains all data from the tests as JSON, to be evaluated with any type of analysis tool:

    ```
    {"message":{"path":"local.uptime","status":"OK","result":{"boot":"2020-09-01T13:25:16.000Z","age":119}},"level":"info","timestamp":"2020-09-01 17:25:10"}
    {"message":{"path":"local.load","status":"ERROR","error":{},"result":{"load1m":2.11,"load5m":2.12,"load15m":1.98,"cpus":4,"load1m_per_cpu":0.5275,"load5m_per_cpu":0.53,"load15m_per_cpu":0.495}},"level":"error","timestamp":"2020-09-01 17:25:10"}
    ```
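Because every line of `data.log` is one self-contained JSON document, it can be processed line by line with any tool. A minimal Node.js sketch (illustrative only, not part of checkup):

```javascript
// Sketch: split a data.log dump into records, one JSON document per line.
function parseDataLog(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')   // skip empty lines
    .map((line) => JSON.parse(line))
    .map((entry) => ({
      path: entry.message.path,             // e.g. 'local.load'
      status: entry.message.status,         // OK / WARN / ERROR / NODATA
      result: entry.message.result,         // the raw test results
      time: entry.timestamp,
    }));
}

// Shortened versions of the two sample lines above:
const sample =
  '{"message":{"path":"local.uptime","status":"OK","result":{"age":119}},"level":"info","timestamp":"2020-09-01 17:25:10"}\n' +
  '{"message":{"path":"local.load","status":"ERROR","result":{"load5m_per_cpu":0.53}},"level":"error","timestamp":"2020-09-01 17:25:10"}\n';

const records = parseDataLog(sample);
console.log(records.map((r) => `${r.path}=${r.status}`).join(' '));
```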
-   `status.log`

    Only the status information (single line) from each run.

    ```
    2020-09-01 17:15:49 INFO local.uptime
    2020-09-01 17:15:49 ERROR local.load Error: Die Last pro CPU über 5 Minuten von 1.9 ist zu hoch (pro CPU 0.
    ```
Database Store¶
This will store the results in compressed form in time series tables within a database to be analyzed later.
Example
```yaml
store:
  database:
    client: pg
    connection:
      user: alex
      password: alex
      database: postgres
      schema: checkup
    prefix: checkup_
    exclude:
      - ./linux/process:pid
```
Possible clients are: `pg`, `mysql`, `mysql2`, `mssql`, `oracledb`, `sqlite3`
All the databases are defined using:
- `client: pg`
- `version: 11.0` - optional; for pg, mysql, oracledb
- `connection:`
    - `user: admin`
    - `password: admin`
    - `host: localhost`
    - `port: 5432`
    - `database: alinex`
    - `schema: checkup` - possible in pg, mssql, oracledb (use prefix in mysql)
- `prefix: checkup_` - optional
- `exclude:` - optional list
    - `'local.process:pid'` - exclude specific value
    - `'local.process'` - exclude case
    - `'./linux/process:pid'` - exclude specific value from test
    - `'./linux/process'` - exclude test
    - `':pid'` - exclude this value from any case
- `cleanup:`
    - `log` - default 10000 - 10k messages
    - `5minute` - default 12 h - 144 values
    - `hour` - default 7 days - 168 values
    - `day` - default 3 months - 60 values
    - `month` - default 5 years - 60 values
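The exclude patterns above can be read as an optional test-case part before the colon and an optional value name after it. The following sketch shows my interpretation of that matching for the case-name forms only; it is not the actual checkup implementation, and the `./linux/process` file-path forms are left out:

```javascript
// Sketch of exclude matching (interpretation only, not checkup's code):
// a pattern is '<testcase>', '<testcase>:<element>' or ':<element>'.
function isExcluded(excludes, testcase, element) {
  return excludes.some((pattern) => {
    const colon = pattern.indexOf(':');
    const casePart = colon === -1 ? pattern : pattern.slice(0, colon);
    const elementPart = colon === -1 ? null : pattern.slice(colon + 1);
    const caseMatches = casePart === '' || casePart === testcase;      // '' means any case
    const elementMatches = elementPart === null || elementPart === element;
    return caseMatches && elementMatches;
  });
}

console.log(isExcluded(['local.process:pid'], 'local.process', 'pid'));  // specific value
console.log(isExcluded([':pid'], 'remote.process', 'pid'));              // value from any case
console.log(isExcluded(['local.process'], 'local.process', 'cmd'));      // whole case
```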
Info
For postgres a special implementation is used which works with a connection pool whose maximum size equals the concurrency setting for test cases. But keep in mind that overlapping scheduler runs may overload the pool completely.
Native postgres support can be enabled by installing `libpq-dev` on the server and adding the `pg-native` npm module:

```shell
sudo apt install libpq-dev
npm install pg-native
```
The `prefix` is optional and will be added before all table names. This is necessary if you want to store different suites within the same database but separated from each other.
The `exclude` list can be used to restrict storage. Generally everything will be written to the database, but some values need a lot of space and are not used in analysis, so you can exclude them from storing. See the examples above for the possible types.
And the `cleanup` setting defines the time frame after which an automatic cleanup can remove older values. This is not calculated by calendar math, meaning the time after which the deletion takes place may vary slightly (because we work with average seconds for the time frames internally).
Warning
Some result values can be large and consume a lot of storage over time. Therefore some information from the result is blocked by default in the test.
The default can be overridden for each test by giving a custom setting like `./log/event:-` which will store everything (because no result element named `-` will exist).
Tables¶
The results will be stored in different granularities: `minute5`, `hour`, `day`, `month`.
For each interval a separate table will be created.
The cleanup will also be done by the checkup system and can be configured.
All values are stored in tables, with numeric and other values separated:

`<prefix><granularity>_num`
- `timeslot` - from the interval the current time belongs to
- `testcase` - path of case
- `element` - name within result or status
- `num` - number of checks
- `min` - minimum of values
- `max` - maximum of values
- `mean` - arithmetic mean value
- `std` - standard deviation
`<prefix><granularity>_data`
- `timeslot` - from the interval the current time belongs to
- `testcase` - path of case
- `element` - name within result or status (WARN/ERROR)
- `num` - number of checks
- `md5` - hash of json value for indexing
- `json` - all other values as JSON
To keep the store as small as possible, the numerical values are reduced to statistical values for each time interval. The values are calculated continually using Welford's online algorithm.
Note
Arithmetic mean and standard deviation are normally calculated over all values using:

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad s_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}_n\right)^2}$$

But since we don't store all values, we have to calculate the new value from the previous one using:

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n-\bar{x}_{n-1}}{n} \qquad M_{2,n} = M_{2,n-1} + \left(x_n-\bar{x}_{n-1}\right)\left(x_n-\bar{x}_n\right) \qquad s_n = \sqrt{\frac{M_{2,n}}{n-1}}$$

The status is stored as a numerical value, too, using -1 = NODATA, 0 = OK, 1 = WARN, 2 = ERROR.
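The incremental update can be sketched in a few lines of JavaScript. This is a generic illustration of Welford's algorithm keeping exactly the columns of the `_num` tables, not the actual checkup implementation:

```javascript
// Welford's online algorithm: maintain num/min/max/mean and the running
// sum of squared deviations (m2) without keeping the single values.
class RunningStats {
  constructor() {
    this.num = 0;
    this.min = Infinity;
    this.max = -Infinity;
    this.mean = 0;
    this.m2 = 0; // sum of squared distances from the current mean
  }

  push(x) {
    this.num += 1;
    this.min = Math.min(this.min, x);
    this.max = Math.max(this.max, x);
    const delta = x - this.mean;
    this.mean += delta / this.num;       // new mean from the previous mean
    this.m2 += delta * (x - this.mean);  // uses old and new mean
  }

  get std() {
    // sample standard deviation; undefined for fewer than two values
    return this.num > 1 ? Math.sqrt(this.m2 / (this.num - 1)) : 0;
  }
}

// Example: the three load5m values 2.11, 2.12 and 1.98 from the logs above
const stats = new RunningStats();
for (const x of [2.11, 2.12, 1.98]) stats.push(x);
console.log(stats.num, stats.min, stats.max, stats.mean.toFixed(2));
```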
Additionally an event log table will store the individual problems with details:
`<prefix>log`
- `time` - time of event
- `testcase` - path of case
- `level` - numeric status (-1 = NODATA, 0 = OK, 1 = WARN, 2 = ERROR)
- `status` - status name
- `message` - error message
- `verbose` - verbose log
Cleanups will run every half hour together for all tables.
Analysis¶
As described above, numerical data will be stored with min, max, mean and standard deviation to help analysis. Min and max show the value range, the mean shows the trend over time, and the standard deviation tells you how much the measurements are spread around the mean. A low value means most measurements are near the mean.
SQL Queries
Show the stored test cases:

```sql
SELECT DISTINCT testcase FROM dvb_checkup.minute5_num;
```

Show all elements stored for a specific test case:

```sql
SELECT 'num' AS store, element FROM minute5_num WHERE testcase = 'local.os'
UNION
SELECT 'data' AS store, element FROM dvb_checkup.minute5_data WHERE testcase = 'local.os';
```

Get numerical data for a specific element over time:

```sql
SELECT timeslot, min, max, mean, std FROM minute5_num
WHERE testcase = 'local.os' AND element = 'security';
```

Get non-numerical data for a specific element over time:

```sql
SELECT * FROM minute5_data WHERE testcase = 'local.os' AND element = 'dist';
```

To get the errors per time period:

```sql
SELECT * FROM minute5_data WHERE element = 'error' ORDER BY timeslot DESC;
```

And to get only the last errors with details:

```sql
SELECT * FROM log ORDER BY time DESC;
```
Visualization¶
DB Visualizer/DBeaver
Both tools support chart generation based on SQL results in their paid enterprise versions. See the respective manual on how to do it; it's really easy.
Grafana
That is a really good web application to graphically visualize data from different data sources. It is easy to install and configure. It can:
- display charts over time
- allow switching of time ranges
- show annotations for log events with details
- show tables of last errors
Download and install it from grafana.com and read the docs.
To get all that, I will show you here how to basically configure Grafana with a PostgreSQL database:

-   Set up a database under Configuration > Data Source to be used: add a new PostgreSQL data source and set it up with:
    - Host: `<domain or ip>` - you may add `:<port>` if it is not the default port 5432
    - Database
    - User
    - Password

-   Create a dashboard for a specific area.

-   Set variables:
    - Type: Custom
    - Values: `5 minutes : minute5, hours : hour, days : day, months : month`

-   Set annotations to show the details:

    The first annotation will show the error messages stored within the selected range. Here multiple problems within the same range will be concatenated together.
Annotation: Problems
```sql
SELECT
  extract(epoch from timeslot) AS time,       -- start of timeslot
  extract(epoch from timeslot + case          -- calculate end based on range
      when '${range:raw}' = 'minute5' then interval '5 minutes'
      when '${range:raw}' = 'hour' then interval '1 hour'
      when '${range:raw}' = 'day' then interval '1 day'
      when '${range:raw}' = 'month' then interval '1 month'
    end) AS timeend,                          -- end of timeslot
  string_agg('<h3>' || testcase || '</h3><p>' || (json::json#>>'{}'), '<p>') AS text,  -- formatted text
  'problem' AS tags
FROM dvb_checkup.${range:raw}_data
WHERE $__timeFilter(timeslot)
  AND testcase LIKE 'path.group.%'
  AND element = 'error'
GROUP BY timeslot
```
You may add multiple annotations for detailed log entries (better make multiple instead of one combined entry, so they can be switched on and off individually). This will show each individual message together with a part (here at most 300 characters) of the verbose log.
Annotation: Log
```sql
SELECT
  extract(epoch from time) AS time,
  '<h3>' || testcase || '</h3><p>' || message ||
    '<pre>' || substring("verbose" for 300) || '...</pre>' AS text,
  status AS tags
FROM dvb_checkup.log
WHERE $__timeFilter(time)
  AND level != 0
  AND testcase LIKE 'path.group.test'
```
-   Now you can add multiple panels to the dashboard and define the data:

    Panel with multiple elements

    - FROM: `${range:raw}_num` - use the selected range here
    - Time column: `timeslot`
    - Metric column: `element`
    - WHERE:
        - Macro: `$__timeFilter`
        - Expr: `testcase = 'path.group.test'`
        - Expr: `element IN ('conn', 'conn_writing', 'conn_keepalive', 'conn_losing')`

    Panel with same element from multiple testcases

    - FROM: `${range:raw}_num` - use the selected range here
    - Time column: `timeslot`
    - Metric column: `element`
    - WHERE:
        - Macro: `$__timeFilter`
        - Expr: `testcase LIKE 'path.group.%'`
        - Expr: `element = 'time'`
Housekeeping¶
At last, please also keep an eye on the tables, which may grow and grow. You may collect some big data here which needs housekeeping, with removal after some time.