Tuesday, 13 March 2018

A simple example

https://reaction-engine.bitbucket.io/

Sorry, I promised in the first blog entry that I would write some thoughts about the installation, but I didn't keep my word in the second entry and I won't in this one either... :)

What I am planning to talk about instead is what to do when a new application has to be added so that it is monitored and its incidents are remedied.

Let's assume there is a recurring memory-leak problem in the Hermes application (it is not a real application) that causes an OutOfMemoryError, after which the application server (Tomcat 8) has to be restarted. The application has 2 running instances on 2 different host machines. The instances differ only in the host name; all other properties (e.g. the path of the log file) are the same. Due to a server synchronization issue, the application servers on both hosts have to be restarted.
The incident has 2 symptoms:
  • the application slows down as only one host can serve the requests of the business users
  • the OutOfMemoryError appears in the log file of the application
Let's assume that the Reaction Engine and the administration web application are already installed and working, but no workers run on the hosts yet.

Let's set up our Reaction system to fix the incident automatically!

I. create the reference data
What, how, when: all that has to be done is to answer these 3 questions. What to observe? How to fix the problem? When to start the remedy process?

What to observe?
It is simple: the log file has to be observed, so more specifically the question refers to the location of the log file. Since it is the business application (the system) that is modelled in the management web application, this type of reference data is called a system.
The 2 running instances of the Hermes application are almost the same, so I recommend creating a common parent (the management web application can handle a hierarchy of systems) where the shared properties can be specified.
Hermes [APPLICATION] - specify the log file location, maintenance window
    Hermes Host 0 [LOG FILE] - specify the host name
    Hermes Host 1 [LOG FILE] - specify the host name
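To make the idea of the common parent more concrete, here is a minimal Python sketch of how a child system could fall back to the properties of its parent. It is only an illustration: the System class, the resolve method, the property names, the log file path and the maintenance window are all made up for this example and are not taken from the Reaction data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class System:
    """Hypothetical stand-in for a system record in the reference data."""
    name: str
    kind: str                                  # e.g. "APPLICATION" or "LOG FILE"
    parent: Optional["System"] = None
    properties: dict = field(default_factory=dict)

    def resolve(self, key):
        """Return the property of this system or of its nearest ancestor."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return None


# The parent holds what the two instances share (values are invented).
hermes = System("Hermes", "APPLICATION", properties={
    "log_file": "/opt/tomcat/logs/hermes.log",
    "maintenance_window": "Sat 22:00-23:00",
})
# The children only add what differs: the host name.
host0 = System("Hermes Host 0", "LOG FILE", parent=hermes, properties={"host": "host0"})
host1 = System("Hermes Host 1", "LOG FILE", parent=hermes, properties={"host": "host1"})

for system in (host0, host1):
    print(system.name, system.resolve("host"), system.resolve("log_file"))
```

The point of the design is that a change of the log file location has to be made in one place only, at the parent.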

How to fix the problem?
A chain of OS commands, called the execution flow, can be specified. The flow in our case can be as follows (it is just one solution, yours depends on your requirements):
  • Send mail to users that Hermes is being restarted: the business users are notified by email that the Hermes application will be unavailable
  • stop Tomcat on host 0: the command /opt/tomcat/bin/shutdown.sh is executed on host 0 as tomcat user
    The executor worker has to run on host 0. It polls the Reaction Engine every X seconds (the interval can be set in the worker's config file). If the 1st task of the flow (Send mail to users that Hermes is being restarted) has already been executed, so that this task is the current one, then the command is sent back in the REST response to the executor worker (only to the worker that runs on host 0) and it is executed on host 0.
  • start Tomcat on host 0: the command /opt/tomcat/bin/startup.sh is executed on host 0 as tomcat user
  • check if Tomcat is running on host 0: the command ps -ef | grep /opt/tomcat is executed on host 0
    The next task is an IF-ELSE, so a value has to be provided for it. The IF-ELSE contains only the condition; the preceding task (which has to be an OS command execution task) provides the value.
    So in this command the output pattern has to be specified (.*-Dcatalina\.base=(?<VALUETOBEEXTRACTED>[/a-z]+) -Dcatalina\.home=.*) and the executor worker will look for this pattern in the output of the command. If the executor worker finds the pattern, it extracts the value (in this case /opt/tomcat) and the extracted value is passed to the IF-ELSE task.
    Sample output of the ps -ef | grep /opt/tomcat command:
    root       343     1  2 11:54 ?        00:02:58 /usr/bin/java -Djava.util.logging.config.file=/opt/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dspring.profiles.active=threadPool -Dspring.config.location=/local/reaction/reaction-engine/reaction-engine-application.yml -Dreaction.logback.config=/local/reaction/reaction-engine/logback-include.xml -Dignore.endorsed.dirs= -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat/temp org.apache.catalina.startup.Bootstrap start
    If Tomcat is not running, the executor won't find the pattern, so the value will be an empty string and the false branch of the IF-ELSE will be executed.
  • if Tomcat is running on host 0: IF-ELSE task with the following condition: is the value received from the preceding task /opt/tomcat?
  • Send mail to admin that Tomcat didn't start on host 0: an email is sent to the middleware administrator(s) that there was a problem with the start of Tomcat
  • Fail the flow: the flow will be interrupted and marked as failed
  • the same tasks are executed for host 1
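The value extraction and the IF-ELSE condition described above can be tried out with a few lines of Python. This is just a sketch under some assumptions: the pattern is rewritten with Python's (?P<name>...) named-group syntax, the function names are invented, the running output is an abridged version of the sample above, and the stopped output is made up. How the executor worker really applies the pattern is not shown here.

```python
import re

# The output pattern from the flow, rewritten with Python's (?P<...>) syntax.
OUTPUT_PATTERN = re.compile(
    r".*-Dcatalina\.base=(?P<VALUETOBEEXTRACTED>[/a-z]+) -Dcatalina\.home=.*"
)


def extract_value(command_output: str) -> str:
    """Return the extracted value, or an empty string if the pattern is not found."""
    match = OUTPUT_PATTERN.search(command_output)
    return match.group("VALUETOBEEXTRACTED") if match else ""


def tomcat_is_running(extracted_value: str) -> bool:
    """The condition of the IF-ELSE task: is the received value /opt/tomcat?"""
    return extracted_value == "/opt/tomcat"


# Abridged version of the sample ps output (Tomcat is running).
running_output = (
    "root 343 1 2 11:54 ? 00:02:58 /usr/bin/java "
    "-classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar "
    "-Dcatalina.base=/opt/tomcat -Dcatalina.home=/opt/tomcat "
    "-Djava.io.tmpdir=/opt/tomcat/temp org.apache.catalina.startup.Bootstrap start"
)
# Invented output for a stopped Tomcat: only the grep process itself shows up.
stopped_output = "tomcat 4711 4700 0 12:01 pts/0 00:00:00 grep /opt/tomcat"

for output in (running_output, stopped_output):
    value = extract_value(output)
    print(repr(value), "-> true branch" if tomcat_is_running(value) else "-> false branch")
```

Note that the grep process itself also contains /opt/tomcat, which is why the pattern looks for the -Dcatalina.base argument instead of the plain path.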

When to start the remedy process?
The simple answer to this question is: when the text java.lang.OutOfMemoryError: Java heap space appears in the observed log.
A so-called error detector has to be created that binds the system reference data to the execution flow; the message pattern (a regular expression, like .*java\.lang\.OutOfMemoryError: Java heap space.*) has to be specified in it too. A hierarchy of systems was created in the first step of the reference data creation, so only the parent system has to be set here; this means that all its child systems will be observed too, so the log files of host 0 and host 1 will be monitored.
So when the text java.lang.OutOfMemoryError: Java heap space appears in the log file on host 0 or host 1, the flow that was created above is executed.
You can also set in the error detector whether manual confirmation by the administrator is needed (otherwise the flow is started immediately). Finally, the error detector has to be activated.
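As a small illustration of what the error detector holds, here is a Python sketch that checks log lines against the message pattern. The ErrorDetector class, its field names, the flow name and the two log lines are assumptions made for this example; only the regular expression is the one mentioned above.

```python
import re
from dataclasses import dataclass


@dataclass
class ErrorDetector:
    """Hypothetical record binding a system, a message pattern and a flow."""
    name: str
    message_pattern: str
    system: str                       # the parent system; its children are covered too
    flow: str
    confirmation_needed: bool = False
    activated: bool = False

    def matches(self, log_line: str) -> bool:
        """An inactive detector never matches; otherwise the whole line must match."""
        if not self.activated:
            return False
        return re.fullmatch(self.message_pattern, log_line.rstrip("\n")) is not None


detector = ErrorDetector(
    name="Hermes out of memory",
    message_pattern=r".*java\.lang\.OutOfMemoryError: Java heap space.*",
    system="Hermes",
    flow="Restart Hermes",            # invented flow name
    confirmation_needed=True,
    activated=True,
)

# Invented log lines, only the second one should be reported as an incident.
log_lines = [
    "2018-03-13 10:15:02 INFO Request served in 120 ms\n",
    'Exception in thread "http-nio-8080-exec-3" java.lang.OutOfMemoryError: Java heap space\n',
]
incidents = [line.strip() for line in log_lines if detector.matches(line)]
print(incidents)
```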


II. install and start the worker
The installation and the startup are pretty well described here.
Both the reader and the executor worker have to be started.


So what will happen when the dreadful OutOfMemoryError occurs?
The reader worker will notice that new log entries were written to the log file; it will read them and check whether they match the pattern defined in the error detector. They will match, so the reader worker will report the incident via REST to the Engine. The Engine will find the error detector record, based on the pattern (OutOfMemoryError) and on the system record (the REST request arrived from host 1, which has a system record whose parent is assigned to the error detector). The Reaction Engine will then start the flow that is assigned to the error detector.
However, the 'confirmation needed' flag is on at the error detector, so the flow needs a confirmation from the administrator first (it can be given in the management web application). After the confirmation, the tasks of the flow will be executed one by one.

Thursday, 8 March 2018

Architecture

Yes, I promised at the end of the last (which was the first) blog entry that I would give some details on how to install the worker, but it is worth spending some time on the architecture first. (The main reason is that I have a picture of the architecture that I want to share; also, the previous blog entry didn't contain any image...)

So first let's see the diagram:


Nice, isn't it? Well, please don't answer... :)

HOST 0, HOST 1 and HOST 2 are the machines where the observed business applications are (more precisely, where the log files of the applications are). As you can see, the reader worker (the one that observes the log file(s)) has to be started only on those machines where the logs reside. Likewise, the executor worker (the one that executes the OS commands one by one) has to be started on those machines where operating system commands have to be executed. The workers are Java applications (JDK 8 is needed). There is only one worker installation pack, but there are different start/stop commands for the executor and the reader.
The reader worker calls 2 REST services:
  • getting the log file locations to be monitored (how often the REST service is called can be configured -> the more often the service is called, the faster a change is propagated to the reader worker)
  • reporting an incident
The executor uses the following 2 REST services:
  • getting the commands to be executed (the more often the service is called, the less delay there will be in the execution flow)
  • sending back the result/output of the OS command
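The polling behaviour of the executor worker can be sketched in a few lines of Python. Please take it as a rough illustration only: the real workers are Java applications, and the function names, the parameters and the fake engine below are invented for this example. The REST calls are replaced by plain functions so that the sketch runs without any server.

```python
import time


def run_executor(fetch_command, run_command, send_result,
                 poll_interval_sec, max_polls, sleep=time.sleep):
    """Poll for a command, execute it if there is one, report the output, wait."""
    for _ in range(max_polls):
        command = fetch_command()            # 1st REST service: get the command
        if command is not None:
            output = run_command(command)
            send_result(command, output)     # 2nd REST service: send back the output
        sleep(poll_interval_sec)


# A fake engine: one command is waiting for the worker of this host.
pending = ["/opt/tomcat/bin/shutdown.sh"]
reported = []


def fake_fetch():
    return pending.pop(0) if pending else None


def fake_run(command):
    # Nothing is really executed here, only a placeholder output is returned.
    return f"pretend output of {command}"


run_executor(fake_fetch, fake_run,
             lambda command, output: reported.append((command, output)),
             poll_interval_sec=5, max_polls=3,
             sleep=lambda seconds: None)     # no real waiting in the demo
print(reported)
```

Because the worker is always the client, the delay between two tasks of a flow is at most about one polling interval plus the execution time.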
If a new log file has to be monitored on a host where the reader worker is already running, then all that has to be done is to define the data in the management web application; the log file location will be synchronized automatically and the log file will be monitored out of the box.
HMAC authentication is used when the call is made. It means that the hash is not static (it is made of the password, current date, used HTTP verb, endpoint, etc.), so a hash captured from one request cannot simply be reused for a different one, and the password cannot be read back from it. The username and password must exist both in the worker and in the Reaction Engine. The authentication is mandatory.
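For those who have not met HMAC yet, here is a minimal Python sketch of the idea using the standard hmac and hashlib modules. It is not the format that Reaction uses: the fields that are signed, their order, the separator, SHA-256, the endpoint and the credentials are all assumptions for this example.

```python
import hashlib
import hmac


def sign_request(username, password, verb, endpoint, date, body=""):
    """Build a keyed hash over the request details; the password is the key."""
    message = "\n".join([username, verb.upper(), endpoint, date, body])
    return hmac.new(password.encode("utf-8"),
                    message.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def verify_request(signature, username, password, verb, endpoint, date, body=""):
    """The server recomputes the hash with its own copy of the password."""
    expected = sign_request(username, password, verb, endpoint, date, body)
    return hmac.compare_digest(expected, signature)


# Invented credentials and endpoint, used only to show the behaviour.
signature = sign_request("worker0", "secret", "GET", "/worker/command", "2018-03-08T10:00:00Z")

print(verify_request(signature, "worker0", "secret", "GET", "/worker/command", "2018-03-08T10:00:00Z"))
print(verify_request(signature, "worker0", "secret", "POST", "/worker/command", "2018-03-08T10:00:00Z"))
```

The same signature is accepted only for the same verb, endpoint and date, which is what "not static" means in practice.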
The HTTP message can also be encrypted, based either on the username and password (which are used in the HMAC authentication) or on the public / private keys of a certificate. The advantage of this encryption over HTTPS is that the secure channel does not have to be rebuilt when the message goes through network devices.

The administration web application is a Python Django application (Python 3 is needed) and its main tasks are
  • to maintain the reference data
  • to monitor the run of the execution flows
  • to start / to schedule the execution
  • to give statistics
  • to provide a user management module
The web application has access to the Reaction database and can call the REST services of the Reaction Engine. The following operations can be called via REST:
  • approving to start a flow
  • starting / scheduling a flow
  • restarting / skipping a task of an execution flow
HMAC authentication is also used when calling the REST services of the Reaction Engine.

The Reaction Engine is a Java web application (JDK 8 is needed) that can be deployed on Tomcat 8, on WildFly 10 or on WebLogic 12c (separate WAR files are provided in the download section).
What it does is as follows:
  • it provides a REST interface for the workers and the management web application (see above)
  • it decides whether an event reported by the reader worker is a real incident for which a flow has to be started
  • it performs the execution flow
    i.e. it gets the first task in the flow, executes it, then gets the second one, etc.
    based on the type of the task (OS command, if-else operation, mail sending) the engine will execute it differently:
        - if it is an OS command then it will provide it to the worker (i.e. it will just wait until the specific executor worker calls the REST service to get the command to be executed) and after the execution it will save the output (if the command wasn't executed successfully then the flow will fail) and jump to the next task in the flow
        - if it is an if-else operation then it will get the output of the preceding task (which must be an OS command and must have an output) and use this output value to evaluate the condition of the if-else; if the condition is true then it jumps to the true branch (if it exists), if it is false then it jumps to the false branch (if it exists)
        - if it is a mail sending task then it sends the mail and jumps to the next task in the flow
Sunday, 4 March 2018

Introduction to Reaction Engine

The first question that should be clarified is what this Reaction Engine (https://reaction-engine.bitbucket.io/) application is for.

In short it is an automatic incident detector and resolver.

Yes, it sounds pretty fancy and it would look pretty cool on a flyer or something. However, the truth is that this is exactly what it does.
It doesn't have an artificial intelligence or machine learning module; it can only do what is specified in it (so John Connor can still feel safe ...). What it has is as follows:
  • a background application (reader worker) that monitors the log files of business applications and sends an alert to the Reaction Engine if an incident occurred
  • a server application (Reaction Engine) that listens to the background application (reader worker) and starts an execution flow (which is basically a chain of operating system commands)
  • another background application (executor worker) that gets the OS command to be executed from the Reaction Engine (it polls the engine) and executes it on the specified host
  • a management web application where all the data (e.g. the execution flow) can be maintained and where the running of the flow can be monitored and controlled
So how can it be made to work? 

First let's imagine there is an application (called Hermes) which suffers from a memory leak so 2 - 3 times a week the dreadful OutOfMemoryError appears in the log file of the application and the Hermes application hangs. The solution is to restart the application server that hosts Hermes.

How is the memory problem fixed in the normal way? The business users realize that they cannot use their beloved system so they call the service desk, and the service desk calls the middleware administrators, who log in to the server machine and check the log file. They realize that the memory leak struck again, restart the application server and let the service desk know when the restart has finished; the service desk then notifies the business users that they can work until ... they can.
The problem here is that there are many human interactions, which usually take lots of time and cost the firm lots of money.

How should the incident resolving work? The log file of the Hermes application should be monitored. If the OutOfMemoryError appears in the log file then the application server should be restarted automatically (perhaps including a confirmation step by the middleware administrator), a mail should be sent to the business users before and after the restart.
In order to do that
  • the reader worker has to be installed on the machine where the log file of the application resides
  • the executor worker has to be installed on the machine where the application server is
  • the database schema of Reaction has to be created and the Reaction management web application has to be installed
  • the reference data of the Hermes application (e.g. where its log file is located, what commands should be executed to remedy the memory error, etc.) has to be created with the Reaction management web application
  • the Reaction Engine has to be installed on WebLogic 12c, Tomcat 8 or WildFly 10
  • the workers have to be configured (e.g. specifying where the Reaction Engine is, etc.) and started
It is important to note that if a new application has to be monitored by the Reaction Engine then all that has to be done is
  • to create the reference data of this new application with the Reaction management web application
  • to install, to configure and to start the workers
The reader worker will notice the error in the log file of the Hermes application and will send an alert to the Reaction Engine. The engine will select the execution flow that can remedy the problem (restarting the application server) and will notify the executor worker to execute the commands (the workers are always the clients, so the engine cannot send a message to a worker directly; instead the workers poll the engine).

That is the basic idea behind the Reaction system. However, it can do much more; I will show it in the subsequent blog entries.
In the next blog entry I will show how to install the worker.
