
Oracle 11.2.0.2 New Feature: Rebootless Restart

October 4, 2019 | Database

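Starting with version 11.2.0.2, Oracle introduced a new feature called rebootless restart. In the following situations, Grid Infrastructure (GI) restarts the clusterware stack instead of rebooting the node:
1. A node keeps missing network heartbeats for longer than misscount.
2. A node cannot access the majority of the voting files (VF).
3. A member kill is escalated to a node kill.
In earlier versions, the clusterware (CRS) rebooted the node directly in all of these situations.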

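From 11.2.0.2 onward, these situations no longer reboot the node; GI only restarts the clusterware stack.
Before restarting the stack, GI first performs a graceful shutdown. The basic steps are:
1. Stop all heartbeats on the local node (network heartbeat, disk heartbeat, and local heartbeat).
2. Notify the cssd agent that ocssd.bin is about to stop.
3. Stop all I/O-capable processes registered with CSS, for example lmon.
4. cssd tells crsd to stop all resources. If crsd cannot stop all of them, the node is still rebooted.
5. cssd waits for all I/O-capable processes to exit. If they have not all exited within the short I/O timeout, the node is still rebooted.
6. Notify the cssd agent that all I/O-capable processes have exited.
7. ohasd restarts the clusterware stack.
8. The local node notifies the other nodes to perform a cluster reconfiguration.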
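
The sequence above has two points where a node reboot can still happen (steps 4 and 5). The following Python sketch models only that decision. The function name, its parameters, and the Outcome type are hypothetical and chosen for illustration; they do not correspond to any Oracle API.

```python
from enum import Enum


class Outcome(Enum):
    STACK_RESTART = "restart clusterware stack"
    NODE_REBOOT = "reboot node"


def rebootless_outcome(resources_stopped: bool,
                       io_exit_seconds: float,
                       short_io_timeout: float) -> Outcome:
    """Model the two fallback checks of the graceful shutdown (illustrative only)."""
    # Step 4: crsd must stop all resources, otherwise the node is rebooted.
    if not resources_stopped:
        return Outcome.NODE_REBOOT
    # Step 5: all I/O-capable processes must exit within the short I/O timeout.
    if io_exit_seconds > short_io_timeout:
        return Outcome.NODE_REBOOT
    # Steps 6-8: ohasd restarts the stack and the cluster is reconfigured.
    return Outcome.STACK_RESTART
```

The timeout is passed in as a parameter rather than hard-coded, because the post does not state the actual short I/O timeout value.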

[grid@rac01 rac01]$ crsctl get css misscount   # network heartbeat timeout, in seconds
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
[grid@rac01 rac01]$ crsctl get css disktimeout   # disk heartbeat timeout, in seconds
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
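
With misscount at 30 seconds, the warnings logged later in this post (CRS-1612, CRS-1611, CRS-1610) appear once 50%, 75%, and 90% of the interval has passed without a network heartbeat. As a rough sketch of the arithmetic, the helper below (a hypothetical name, not part of any Oracle tool) computes the nominal time left at each warning. The values in the real log (14.480, 7.480, and 2.460 seconds) are slightly smaller than these nominal figures.

```python
def seconds_until_eviction(misscount: float, fraction_elapsed: float) -> float:
    """Nominal time left before eviction once a fraction of misscount has passed."""
    if not 0.0 <= fraction_elapsed <= 1.0:
        raise ValueError("fraction_elapsed must be between 0 and 1")
    return misscount * (1.0 - fraction_elapsed)


MISSCOUNT = 30  # seconds, as reported by `crsctl get css misscount`

for code, fraction in (("CRS-1612", 0.50), ("CRS-1611", 0.75), ("CRS-1610", 0.90)):
    remaining = seconds_until_eviction(MISSCOUNT, fraction)
    print(f"{code}: about {remaining:.1f} s left")
```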

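Identifying the Oracle RAC master node (the node that performs the automatic OCR backups, as listed by the following command):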

ocrconfig -showbackup

Simulating a heartbeat failure:
The heartbeat is disconnected on node 1.
Node 1 log:
2018-06-29 07:30:42.770:
[cssd(2849)]CRS-1612:Network communication with node rac02 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.480 seconds
2018-06-29 07:30:49.775:
[cssd(2849)]CRS-1611:Network communication with node rac02 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.480 seconds
2018-06-29 07:30:54.790:
[cssd(2849)]CRS-1610:Network communication with node rac02 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.460 seconds
2018-06-29 07:30:57.256:
[cssd(2849)]CRS-1607:Node rac02 is being evicted in cluster incarnation 425722320; details at (:CSSNM00007:) in /oracle/app/grid/product/11.2.0/log/rac01/cssd/ocssd.log.
2018-06-29 07:30:59.798:
[cssd(2849)]CRS-1625:Node rac02, number 2, was manually shut down
2018-06-29 07:30:59.805:
[cssd(2849)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac01 .
2018-06-29 07:30:59.837:
[crsd(3446)]CRS-5504:Node down event reported for node 'rac02'.
The log above shows that node 2 was evicted.
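
The node 1 log uses a two-line layout: a timestamp line followed by a message line of the form [daemon(pid)]CRS-nnnn:text. As a minimal sketch, assuming that layout holds throughout the file, the Python below pairs each message with its timestamp and measures the time between the 50% warning and the start of the eviction. The parser and its regular expressions are illustrative, not an Oracle-provided tool; the sample is two entries copied from the log above.

```python
import re
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):$")
MSG_RE = re.compile(r"^\[(\w+)\((\d+)\)\](CRS-\d+):(.*)$")


def parse_alert_log(text):
    """Yield (timestamp, daemon, pid, code, message) tuples from alert-log text."""
    stamp = None
    for line in text.splitlines():
        line = line.strip()
        ts = TS_RE.match(line)
        if ts:
            # Remember the timestamp; it applies to the next message line.
            stamp = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S.%f")
            continue
        msg = MSG_RE.match(line)
        if msg:
            daemon, pid, code, body = msg.groups()
            yield stamp, daemon, int(pid), code, body.strip()


SAMPLE = """\
2018-06-29 07:30:42.770:
[cssd(2849)]CRS-1612:Network communication with node rac02 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.480 seconds
2018-06-29 07:30:57.256:
[cssd(2849)]CRS-1607:Node rac02 is being evicted in cluster incarnation 425722320; details at (:CSSNM00007:) in /oracle/app/grid/product/11.2.0/log/rac01/cssd/ocssd.log.
"""

events = list(parse_alert_log(SAMPLE))
first_warning = next(e for e in events if e[3] == "CRS-1612")
eviction = next(e for e in events if e[3] == "CRS-1607")
elapsed = (eviction[0] - first_warning[0]).total_seconds()
print(f"eviction started {elapsed:.3f} s after the 50% warning")
```

The measured gap of about 14.5 seconds is consistent with the "14.480 seconds" announced in the CRS-1612 message.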

Node 2 log:
[cssd(2835)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity;
[cssd(2835)]CRS-1656:The CSS daemon is terminating due to a fatal error;
[cssd(2835)]CRS-1656:The CSS daemon is terminating due to a fatal error;
[cssd(2835)]CRS-1652:Starting clean up of CRSD resources.
[cssd(2835)]CRS-1608:This node was evicted by node 1, rac01;
[cssd(2835)]CRS-1654:Clean up of CRSD resources finished successfully.
[cssd(2835)]CRS-1655:CSSD on node rac02 detected a problem and started to shutdown.
[ohasd(2344)]CRS-2765:Resource 'ora.crsd' has failed on server 'rac02'
Resources are stopped next. Some errors are reported during this stage and can be ignored.
// Startup
[cssd(6543)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac01 rac02 .
[ctssd(6995)]CRS-2403:The Cluster Time Synchronization Service on host rac02 is in observer mode.
// CRS fails to start because the heartbeat has not recovered. It keeps checking, and starts successfully once the heartbeat is restored.
[ohasd(2344)]CRS-2878:Failed to restart resource 'ora.crsd'
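
In the node 2 log, CRS-1652 marks the start of the CRSD resource cleanup and CRS-1654 reports that it finished successfully, which is the path that avoids a node reboot. The sketch below checks a log for that pair, in that order. It is a hypothetical helper that relies only on the message codes shown in this post, not a complete diagnosis of an eviction.

```python
import re

CODE_RE = re.compile(r"\](CRS-\d+):")


def cleanup_succeeded(log_text: str) -> bool:
    """True if CRS-1652 (cleanup start) is later followed by CRS-1654 (cleanup done)."""
    codes = CODE_RE.findall(log_text)
    if "CRS-1652" not in codes:
        return False
    start = codes.index("CRS-1652")
    return "CRS-1654" in codes[start + 1:]


# Lines copied from the node 2 log in this post.
NODE2_LOG = """\
[cssd(2835)]CRS-1656:The CSS daemon is terminating due to a fatal error;
[cssd(2835)]CRS-1652:Starting clean up of CRSD resources.
[cssd(2835)]CRS-1608:This node was evicted by node 1, rac01;
[cssd(2835)]CRS-1654:Clean up of CRSD resources finished successfully.
"""

print(cleanup_succeeded(NODE2_LOG))
```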
