Sorry, I promised in the first blog entry that I would write some thoughts about the installation, but I couldn't keep my word in the second entry either... :)
What I am planning to talk about instead is what to do when a new application has to be added to the monitoring so that an incident can be remedied.
Let's assume there is a recurring memory-leak problem in the Hermes application (it is not a real application) that causes an OutOfMemoryError, after which the application server (Tomcat 8) has to be restarted. The application has 2 running instances on 2 different host machines (they differ only in the host name; all other properties, e.g. the path of the log file, are the same), and due to a server synchronization issue the application servers on both hosts have to be restarted.
The incident has 2 symptoms:
- the application slows down as only one host can serve the requests of the business users
- an OutOfMemoryError appears in the log file of the application
Let's set up our Reaction system to fix the incident automatically!
I. create the reference data
What, how, when: all that has to be done is to answer these 3 questions. What to observe? How to fix the problem? When to start the remedy process?
What to observe?
It is simple: the log file has to be observed, or more specifically, the question refers to the location of the log file. Since the business application (the system) is basically represented in the management web application, this type of reference data is called system.
The 2 running instances of the Hermes application are almost the same, so I recommend creating a common parent (the management web application can handle a hierarchy of systems) where the shared properties can be specified.
Hermes [APPLICATION] - specify the log file location, maintenance window
Hermes Host 0 [LOG FILE] - specify the host name
Hermes Host 1 [LOG FILE] - specify the host name
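The parent/child idea can be pictured with a small Python sketch. It is not the data model of the management web application: the class name SystemRecord, its fields and the sample values (log path, host names) are all hypothetical and only show how a child could fall back to its parent for properties it does not define itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SystemRecord:
    """Hypothetical stand-in for a 'system' reference data record."""
    name: str
    kind: str  # e.g. "APPLICATION" or "LOG FILE"
    parent: Optional["SystemRecord"] = None
    properties: dict = field(default_factory=dict)

    def resolve(self, key: str) -> Optional[str]:
        """Look up a property on this record first, then on its ancestors."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return None


# Shared settings live on the parent (the path below is made up).
hermes = SystemRecord("Hermes", "APPLICATION",
                      properties={"log_path": "/var/log/hermes/hermes.log"})
# Each child only adds what differs: the host name (made up as well).
host0 = SystemRecord("Hermes Host 0", "LOG FILE", parent=hermes,
                     properties={"host": "hermes-host-0"})
host1 = SystemRecord("Hermes Host 1", "LOG FILE", parent=hermes,
                     properties={"host": "hermes-host-1"})
```

With this layout, changing the log file location once on the parent is enough for both hosts.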
How to fix the problem?
A chain of OS commands, called the execution flow, can be specified. The flow in our case can be as follows (it is just one solution; yours depends on your requirements):
- Send mail to users that Hermes is being restarted: the business users are notified by email that the Hermes application will be unavailable
- stop Tomcat on host 0: the command /opt/tomcat/bin/shutdown.sh is executed on host 0 as tomcat user
The executor worker has to run on host 0. It polls the Reaction Engine every X sec (the interval can be set in the worker's config file). If the 1st task of the flow (Send mail to users that Hermes is being restarted) has already been executed, so that this task is the current one, the command is sent back in the REST response to the executor worker (only to the worker that runs on host 0) and it is executed on host 0.
- start Tomcat on host 0: the command /opt/tomcat/bin/startup.sh is executed on host 0 as tomcat user
- check if Tomcat is running on host 0: the command ps -ef | grep /opt/tomcat is executed on host 0
The next task is an IF-ELSE, so a value has to be provided to it. The IF-ELSE contains only the condition; the preceding task (which has to be an OS command execution task) provides the value.
So in this command the output pattern has to be specified (.*-Dcatalina\.base=(?<VALUETOBEEXTRACTED>[/a-z]+) -Dcatalina\.home=.*) and the executor worker will look for this pattern in the output of the command. If the executor worker finds the pattern then it will extract the value (in this case it will be /opt/tomcat). Then this extracted value will be passed to the IF-ELSE task.
Sample output of the ps -ef | grep /opt/tomcat command:
root 343 1 2 11:54 ? 00:02:58 /usr/bin/java -Djava.util.logging.config.file=/opt/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dspring.profiles.active=threadPool -Dspring.config.location=/local/reaction/reaction-engine/reaction-engine-application.yml -Dreaction.logback.config=/local/reaction/reaction-engine/logback-include.xml -Dignore.endorsed.dirs= -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat/temp org.apache.catalina.startup.Bootstrap start
If Tomcat is not running, the executor won't find the pattern, so the value will be an empty string and the false branch of the IF-ELSE will be executed.
- if Tomcat is running on host 0: an IF-ELSE task with the following condition: is the value received from the preceding task equal to /opt/tomcat?
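The extraction step can be tried out in Python. This is only a sketch and not the executor worker's code: the pattern in this post uses the (?<NAME>...) named-group syntax, which Python's re module spells (?P<NAME>...), so the pattern is adapted accordingly. The function name extract_value is hypothetical, the sample output is a shortened version of the ps line above, and searching the whole output at once is my assumption.

```python
import re

# Same pattern as in the post, with the named group written in Python syntax.
OUTPUT_PATTERN = re.compile(
    r".*-Dcatalina\.base=(?P<VALUETOBEEXTRACTED>[/a-z]+) -Dcatalina\.home=.*"
)

# Shortened version of the sample ps output shown above.
SAMPLE_OUTPUT = (
    "root 343 1 2 11:54 ? 00:02:58 /usr/bin/java "
    "-classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar "
    "-Dcatalina.base=/opt/tomcat -Dcatalina.home=/opt/tomcat "
    "-Djava.io.tmpdir=/opt/tomcat/temp org.apache.catalina.startup.Bootstrap start"
)


def extract_value(command_output: str) -> str:
    """Return the extracted value, or an empty string if the pattern is absent."""
    match = OUTPUT_PATTERN.search(command_output)
    return match.group("VALUETOBEEXTRACTED") if match else ""


running = extract_value(SAMPLE_OUTPUT)  # Tomcat process is listed
stopped = extract_value("")             # nothing in the output
```

Note that the line of the grep process itself does not contain -Dcatalina.base, so it does not produce a false match.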
- Send mail to admin that Tomcat didn't start on host 0: an email will be sent to the middleware administrator(s) that there was a problem with the start of Tomcat
- Fail the flow: the flow will be interrupted and marked as failed
- the same tasks are executed for host 1
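The whole flow can be summarised as a tiny Python sketch. It does not show how the Reaction Engine executes a flow (in the real system the commands are handed out to the polling executor workers); the names run_flow, restart_tomcat, FlowFailed and the two callbacks are hypothetical and only illustrate the order of the tasks and the role of the IF-ELSE.

```python
from typing import Callable, List

RunCommand = Callable[[str, str], str]   # (host, command) -> extracted value
SendMail = Callable[[str, str], None]    # (recipient, subject)


class FlowFailed(Exception):
    """Raised when the flow is interrupted and marked as failed."""


def restart_tomcat(host: str, run_command: RunCommand, send_mail: SendMail) -> None:
    """Stop, start and verify Tomcat on one host."""
    run_command(host, "/opt/tomcat/bin/shutdown.sh")
    run_command(host, "/opt/tomcat/bin/startup.sh")
    extracted = run_command(host, "ps -ef | grep /opt/tomcat")
    # IF-ELSE task: it only compares the value handed over by the previous task.
    if extracted != "/opt/tomcat":
        send_mail("admin", f"Tomcat didn't start on {host}")
        raise FlowFailed(host)


def run_flow(hosts: List[str], run_command: RunCommand, send_mail: SendMail) -> None:
    send_mail("users", "Hermes is being restarted")
    for host in hosts:
        restart_tomcat(host, run_command, send_mail)


# Dry run with fake callbacks that only record what would happen.
log = []


def fake_run(host: str, command: str) -> str:
    log.append((host, command))
    return "/opt/tomcat" if command.startswith("ps") else ""


def fake_mail(recipient: str, subject: str) -> None:
    log.append((recipient, subject))


run_flow(["host 0", "host 1"], fake_run, fake_mail)
```

In the dry run the check always succeeds, so the mail to the admin is never sent and both hosts are restarted one after the other.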
When to start the remedy process?
The simple answer to this question is: when the text java.lang.OutOfMemoryError: Java heap space appears in the observed log.
A so-called error detector has to be created that binds the system reference data to the execution flow; the message pattern (a regular expression like .*java\.lang\.OutOfMemoryError: Java heap space.*) has to be specified too. A hierarchy of systems was created in the first step of the reference data creation, so only the parent system has to be set here; all its child systems will be observed too, so the log files of host 0 and host 1 will be monitored.
So when the text java.lang.OutOfMemoryError: Java heap space appears in the log file on host 0 or host 1, the flow that was created above is executed.
You can also set in the error detector whether manual confirmation by the administrator is needed (otherwise the flow is started immediately). Finally, the error detector has to be activated.
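What an error detector ties together can be sketched in Python as follows. The class ErrorDetector, its field names, the flow name and the sample log line are hypothetical; the sketch only illustrates that a detector set on the parent system also covers its children and that it does nothing until it is activated.

```python
import re
from dataclasses import dataclass

# child system -> parent system (names from the hierarchy defined earlier)
PARENT = {"Hermes Host 0": "Hermes", "Hermes Host 1": "Hermes"}


@dataclass
class ErrorDetector:
    system: str                 # the (parent) system to observe
    flow: str                   # name of the execution flow to start
    message_pattern: str        # regular expression for the log entry
    confirmation_needed: bool = False
    activated: bool = False

    def covers(self, system: str) -> bool:
        """True if `system` is the detector's system or one of its descendants."""
        node = system
        while node is not None:
            if node == self.system:
                return True
            node = PARENT.get(node)
        return False

    def matches(self, system: str, log_line: str) -> bool:
        return (self.activated
                and self.covers(system)
                and re.fullmatch(self.message_pattern, log_line) is not None)


detector = ErrorDetector(
    system="Hermes",
    flow="Restart Hermes",  # made-up flow name
    message_pattern=r".*java\.lang\.OutOfMemoryError: Java heap space.*",
    confirmation_needed=True,
    activated=True,
)
# Made-up log line for illustration.
line = "ERROR [http-nio-8080-exec-1] java.lang.OutOfMemoryError: Java heap space"
hit_on_child = detector.matches("Hermes Host 1", line)
```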
II. install and start the worker
The installation and the startup are pretty well described here.
Both the reader and the executor worker have to be started.
So what will happen when the dreadful OutOfMemoryError occurs?
The reader worker will notice that new log entries were written to the log file; it will read them and check whether they match the pattern defined in the error detector. They will match, so the reader worker will report the incident via REST to the Engine. The Engine will find the error detector record based on the pattern (OutOfMemoryError) and on the system record: the REST request arrived from host 1, which has a system record whose parent is assigned to the error detector. The Reaction Engine will then start the flow that is assigned to the error detector.
However, the 'confirmation needed' flag is set on the error detector, so the flow needs a confirmation from the administrator first (it can be given in the management web application). After the confirmation, the tasks of the flow will be executed one by one.
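To illustrate what "noticing new log entries" can mean, here is a minimal Python sketch of reading only the part of a file that was appended since the last check. It is not the reader worker's implementation; the function name read_new_matches, the offset bookkeeping and the log lines in the dry run are my assumptions, and a real reader would also have to deal with log rotation and half-written lines.

```python
import os
import re
import tempfile
from typing import List, Tuple

MESSAGE_PATTERN = r".*java\.lang\.OutOfMemoryError: Java heap space.*"


def read_new_matches(path: str, offset: int, pattern: str) -> Tuple[List[str], int]:
    """Return the lines appended after `offset` that match, plus the new offset."""
    regex = re.compile(pattern)
    with open(path, "rb") as log:
        log.seek(offset)          # skip everything that was already processed
        data = log.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    matching = [entry for entry in lines if regex.fullmatch(entry)]
    return matching, offset + len(data)


# Dry run against a temporary file (the log lines are made up).
fd, path = tempfile.mkstemp(suffix=".log")
os.close(fd)
with open(path, "a", encoding="utf-8") as log:
    log.write("INFO Hermes started\n")
first, offset = read_new_matches(path, 0, MESSAGE_PATTERN)
with open(path, "a", encoding="utf-8") as log:
    log.write("ERROR java.lang.OutOfMemoryError: Java heap space\n")
second, offset = read_new_matches(path, offset, MESSAGE_PATTERN)
os.remove(path)
```

The first call finds nothing to report, while the second one returns only the newly appended OutOfMemoryError line, which is the moment when the incident would be reported to the Engine.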