Readme of ClusterProbe V1.1

 

This is version 1.1 of the ClusterProbe software. Compared with the previous version, it adds a High Availability (HA) feature. ClusterProbe now runs on at least two nodes, so it stays available: when one node goes down, the other node takes over its services without users noticing.

 

In the 1990s, HA systems were developed in response to growing demands for system availability. More and more services need high availability and must avoid downtime. In our project, we construct HA systems in a cluster and use ClusterProbe to manage them. The goal is to develop software and hardware infrastructure such that users need not know which computer or node they are actually using.

 

The new version of ClusterProbe adopts efficient techniques and allows you to:

·         Run other applications in the cluster as HA applications without any change.

·         Customize applications as monitored applications, according to users’ requests.

·         Configure or manipulate remote nodes.

·         Configure an HA system on two nodes easily.

 

The new version of ClusterProbe contains:

·         Core classes (clusterprobe.jar)

·         Source files (ClusterProbe_new_src.zip)

·         Web-based cluster management tool (WCMT-web-new.tar.gz)

·         Servlet files (WCMT-servlet-new.tar.gz)

·         Agent classes (Agent_new.tar.gz, including clusterprobe.jar and runa)

 

You can get them from srgdell-15:/usr/local/ClusterProbe/new_CP/, or srgdell-15:~linwang/ClusterProbe_src/.

After installing these packages, edit two JavaScript files:

/home/httpd/html/srg/code/show_status.js

/home/httpd/html/srg/code/show_status1.js

In each file, find this line:

statusWin.document.writeln("<font size=-1 color=blue>REAL-TIME STATUS OF SRG CLUSTERS(srgdell-15)</font>");

Change srgdell-15 to the name of your node. You can find the node name in the dynamic status form of ClusterProbe.
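
The same change can be scripted. The following is a minimal sketch and not part of the ClusterProbe distribution: set_status_node is a hypothetical helper, mynode is a placeholder for your node name, and the sketch assumes the banner line looks exactly like the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: replace the node name in the status banner line.
# Usage: set_status_node OLD_NAME NEW_NAME FILE...
set_status_node() {
    old=$1
    new=$2
    shift 2
    for f in "$@"; do
        # Rewrite only the "(OLD_NAME)" token on the banner line.
        sed "/REAL-TIME STATUS OF SRG CLUSTERS/s/($old)/($new)/" "$f" > "$f.tmp" || return 1
        # Copy back over the original so its owner and mode are kept.
        cat "$f.tmp" > "$f" && rm -f "$f.tmp"
    done
}

# Example, with "mynode" as a placeholder for your node name:
# set_status_node srgdell-15 mynode \
#     /home/httpd/html/srg/code/show_status.js \
#     /home/httpd/html/srg/code/show_status1.js
```

Node names that contain "/" or regular-expression characters would need escaping before being passed to this helper.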

You also need to add the file HAapps.conf under /etc/. This file is the application list. It is one of the configuration files, like node_resolv and queue.conf.

Example of an HA application

To construct an HA system on two nodes in a cluster, we need the following resources:

·         Two nodes

·         At least two NICs for each node

·         A serial line between the nodes

·         Two IP addresses, one for each node

·         If the HA application needs an IP address as one of its resources, another IP address (the working IP address) is needed.

 

With ClusterProbe, it is easy to configure applications as HA applications.

Step1: Configure nodes as HA nodes

            Step1.1   Install HA software on the nodes

            Step1.2   Configure the two nodes as HA nodes

            Step1.3   Configure the two HA nodes as a node pair

Step2: Configure the application as an HA application

            Step2.1   Configure the HA application’s parameters

Step3: Restart HA software on both nodes

            Step3.1   Restart HA heartbeat on both nodes

            Step3.2   Restart Mon on the primary node

            Step3.3   Start Mon on the secondary node if you want the application to run in active-active mode

            Step3.4   Check the dynamic status of the application

 

Example: configure an HA application named exam1.

Exam1 is a Java applet program. This application does not need an IP address in its resource group.

Following the steps above, we first configure the nodes as HA nodes and then install the application on both nodes. After that, we configure the application. Before configuring it, make sure the script used to start this Java applet program is ready.

This script is run_jsp, in the directory ~linwang/tools/jspringies1.0.1/.

You can try starting exam1 by running this script directly. You then configure your application with this script.

Now you can configure your application by following the steps under “config HA application” in ClusterProbe.

·         Copy files from the primary node and the secondary node. All you need to do here is fill in the names of both nodes in the form.

·         Configure application

 

     Attention:

·         The HA app keyword is: appletviewer | grep jdk. Only part of this string is visible in the form because the text field is short.

·         We set DISPLAY in the HA application’s run command because we use VNC to show the result. VNC uses display number 3.

·         This application does not need an IP address, so we use an IP address that is not used in our network, such as 1.1.1.1.

·         The alert email address is root@localhost. The system sends email to this address if something happens to the application.

 

Once you supply the information ClusterProbe asks for, the following configuration files are created under /etc/HAscript/tmp/ on the ClusterProbe server.

·         haresources

This file is copied to both nodes, into the directory /etc/ha.d/.

·         mon.cf

This file is copied to both nodes, into the directory /etc/mon/.

·         jspringies.run and jspringies

jspringies.run is copied to the nodes, into the directory /usr/lib/mon/alert.d/.

jspringies is copied to the nodes, into the directory /etc/rc.d/init.d/.

·         jspringies.monitor

This file is copied to both nodes, into the directory /usr/lib/mon/mon.d/.
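
To confirm on a node that the copy succeeded, you can check for the files at the destinations listed above. The following is a minimal sketch: check_ha_files is a hypothetical helper, not part of ClusterProbe, and the optional second argument (a directory prefix) exists only so the check can be tried against a test directory tree.

```shell
#!/bin/sh
# Hypothetical helper: report which generated HA files are missing.
# Usage: check_ha_files APP [ROOT]
#   APP  is the application's file prefix, e.g. jspringies
#   ROOT is an optional directory prefix (empty means the real root)
check_ha_files() {
    app=$1
    root=${2:-}
    missing=0
    for f in \
        /etc/ha.d/haresources \
        /etc/mon/mon.cf \
        "/usr/lib/mon/alert.d/$app.run" \
        "/etc/rc.d/init.d/$app" \
        "/usr/lib/mon/mon.d/$app.monitor"
    do
        if [ ! -f "$root$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Exit status is 0 only when every file is present.
    return $missing
}

# Example, run on each node:
# check_ha_files jspringies
```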

 

After you copy these files to the primary and secondary nodes of the application with ClusterProbe, the application appears in the application list, /etc/HAapps.conf. You can then see its status in the dynamic status form of HA applications. It is down at this point, because you have not yet restarted the HA software or started the application.

 

Because configuring an HA application changes the HA software configuration, you should restart the HA software. You can do this with ClusterProbe: from ClusterProbe/HA node/Manipulate HA node/, first stop HA heartbeat and mon, and then start them.

 

As an HA application on both nodes, this application is started by Heartbeat. If you also start mon on a node, mon monitors the application there according to the mon.cf file. (You should start mon at least on the primary node of the application.)

 

 

Mon helps keep the application available. It detects application errors by running the application’s monitor repeatedly. When the monitor echoes “no”, meaning the application is no longer running on this node, mon restarts the application.
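
The jspringies.monitor file that ClusterProbe generates is not reproduced in this document. As an illustration only, a check of this kind could look like the sketch below. The function name app_status is hypothetical, and the two keywords are taken from the HA app keyword of the exam1 example (appletviewer | grep jdk).

```shell
#!/bin/sh
# Hypothetical monitor sketch: reads a process listing on stdin and
# prints "yes" if some line matches both keywords, otherwise "no".
# Usage: ps ax | app_status KEYWORD1 KEYWORD2
app_status() {
    if grep -e "$1" | grep -q -e "$2"; then
        echo yes
    else
        echo no
        return 1   # a non-zero status also signals the failure
    fi
}

# Example, using the keywords from the exam1 configuration:
# ps ax | app_status appletviewer jdk
```

The sketch returns a non-zero status along with “no” on the assumption that mon, as far as we know, treats a non-zero exit status from a monitor as a failure.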

When this node is down, or heartbeat on it is down, the node cannot send heartbeat messages to its partner node. The partner node (the secondary node of the application) then takes over the application, because it concludes that the primary node of the application is down.

The nice_failback setting in the configuration file ha.cf controls what happens next. If nice_failback is off, the application fails back: when the primary node recovers, the application starts on the primary node again and serves its clients or users from there. If nice_failback is on, the application does not start on the primary node after its recovery and stays on the secondary node. (“Off” is the recommended setting for nice_failback in ha.cf.)
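
To check which failback behaviour a node is configured for, you can read the setting from ha.cf. The following is a minimal sketch: failback_mode is a hypothetical helper, and the default path /etc/ha.d/ha.cf is an assumption based on haresources being copied to /etc/ha.d/. It expects the setting to be written as a line such as "nice_failback off".

```shell
#!/bin/sh
# Hypothetical helper: print the nice_failback value set in an ha.cf file.
# Usage: failback_mode [FILE]   (FILE defaults to /etc/ha.d/ha.cf, an assumption)
failback_mode() {
    # Commented-out lines are skipped because their first field starts
    # with "#". If the directive appears more than once, this sketch
    # reports the last one; "unset" means no active line was found.
    awk '$1 == "nice_failback" { v = $2 }
         END { print (v == "" ? "unset" : v) }' "${1:-/etc/ha.d/ha.cf}"
}

# Example:
# failback_mode /etc/ha.d/ha.cf
```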

 

How to run ClusterProbe itself as an HA application:

Once you know how to configure an HA application, it is easy to make ClusterProbe an HA application as well.

First, make sure you have two nodes with the HA software installed on both of them. Then you can configure your application as follows:

The IP address here is the working IP address for the ClusterProbe application. It is used on both the primary node and the secondary node. With this IP address, users can use ClusterProbe without knowing which node it is running on.

 

You can configure ClusterProbe in one of two modes.

1.      Active-Active mode

 

              Primary node    Secondary node

Heartbeat     Y               Y

Mon           Y               Y

You should start heartbeat and mon on both the primary node and the secondary node of ClusterProbe.

Because mon keeps the application running on each node where it is started, ClusterProbe runs on both nodes at the same time. However, only one node provides the services when you access ClusterProbe through the working IP address.

The advantage of this mode is that ClusterProbe is already running on both nodes: if an error triggers failover, ClusterProbe on the secondary node can provide the services immediately. The disadvantage is that ClusterProbe on the secondary node consumes some CPU resources, which is unnecessary while the primary node is up.

 

2.      Active-Passive mode

 

              Primary node    Secondary node

Heartbeat     Y               Y

Mon           Y               N

Heartbeat runs on both the primary node and the secondary node, but mon runs only on the primary node. In the normal state, ClusterProbe therefore runs on the primary node but not on the secondary node. If the primary node goes down, the secondary node starts ClusterProbe and provides the service.

The advantage of this mode, compared with active-active mode, is that only one ClusterProbe runs at a time and no resources are spent on unnecessary work. The disadvantage is that users must wait for ClusterProbe to start on the secondary node when failover happens. In addition, ClusterProbe is not monitored by mon on the secondary node.

 

No matter which mode you choose, ClusterProbe fails over to the secondary node if the primary node is down or if the secondary node cannot get heartbeat messages from the primary node.

 

We configured ClusterProbe with srgdell-15 as the primary node and srgdell-16 as the secondary node. In the normal state, the primary node is up, owns the resources (including the IP address), and provides the services. The figure above shows that we get the ClusterProbe service from the primary node. If the primary node goes down and can no longer provide service, the secondary node, srgdell-16, takes ownership of the resources, such as the IP address, and provides the service.

 

 

Users of ClusterProbe need not be concerned with which node ClusterProbe is running on. ClusterProbe continues to provide its cluster management services when the primary node is down.

When the primary node recovers, it may or may not take the ClusterProbe services back, depending on the configuration. The key parameter is nice_failback in ha.cf. As discussed above, if nice_failback is off, the application fails back to the primary node when the primary node recovers.