Urgent: oracm cannot find hangcheck-timer on AS4 U2

2007-9-26 18:42:10

Database: Oracle 9i 9.2.0.4, patched to 9.2.0.7
OS: Red Hat AS 4 U2
Two nodes plus one disk array
Building 9i RAC on raw devices
cm.log now shows:
oracm, version[ 9.2.0.2.0.47 ] started {Thu Oct 26 01:21:11 2006 }
KernelModuleName is hangcheck-timer {Thu Oct 26 01:21:11 2006 }
OemNodeConfig(): Network Address of node0: 10.1.2.236 (port 9998)
{Thu Oct 26 01:21:11 2006 }
OemNodeConfig(): Network Address of node1: 10.1.2.237 (port 9998)
{Thu Oct 26 01:21:11 2006 }
>WARNING:OemInit2: Opened file(/u01/app/oracle/oradata/shdb/CMQuorumFile 6), tid = main:2341088 file = oem.c, line = 491 {Thu Oct 26 01:21:11 2006 }
InitializeCM: ModuleName = hangcheck-timer{Thu Oct 26 01:21:11 2006 }
>ERROR:InitializeCM: query_module() failed, tid = main:2341088 file = cmstartup.c, line = 327 {Thu Oct 26 01:21:11 2006 }
Debug Hang :StartNMMon (PID=8355) Registered with watchdog daemon. {Thu Oct 26 01:21:11 2006 }
CreateLocalEndpoint(): Network Address: 10.1.2.237
{Thu Oct 26 01:21:11 2006 }
Debug Hang : ClusterListener (PID=8355) Registered withwatchdog daemon. {Thu Oct 26 01:21:11 2006 }
Debug Hang : CmConnectListener (PID=8355):Registered with watchdog daemon. {Thu Oct 26 01:21:11 2006 }
Debug Hang :SendingThread (PID=135159169): Registered with{Thu Oct 26 01:21:11 2006 }
Debug Hang :PollingThread (PID=135159169): Registered with{Thu Oct 26 01:21:11 2006 }
Debug Hang : DiskPingThread (PID=135159169): Registered with{Thu Oct 26 01:21:11 2006 }
UpdateNodeState(): node(1) added udpated {Thu Oct 26 01:21:13 2006 }
HandleUpdate(): SYNC(1) from node(0) completed {Thu Oct 26 01:21:13 2006 }
HandleUpdate(): NODE(0) IS ACTIVE MEMBER OF CLUSTER, INCARNATION(1) {Thu Oct 26 01:21:13 2006 }
HandleUpdate(): NODE(1) IS ACTIVE MEMBER OF CLUSTER, INCARNATION(2) {Thu Oct 26 01:21:13 2006 }
NMEVENT_RECONFIG [00][00][00][00][00][00][00][03] {Thu Oct 26 01:21:13 2006 }
Debug Hang : CMNodeListener(PID=8355) Registered with watchdog daemon. {Thu Oct 26 01:21:13 2006 }
Successful reconfiguration,2 active node(s) node 0 is the master, my node num is 1 (reconfig 2) {Thu Oct 26 01:21:13 2006 }
Debug Hang :ClientProcListener (PID=8355):Registered with watchdog daemon. {Thu Oct 26 10:08:16 2006 }
>WARNING:ReadCommPort:socket closed by peer on recv()., tid = ClientProcListener:124308400 file = unixinc.c, line = 767 {Thu Oct 26 10:08:16 2006 }
Debug Hang :ClientProcListener (PID=8355)UnRegistered with watchdog daemon. {Thu Oct 26 10:08:16 2006 }
Debug Hang :ClientProcListener (PID=8355):Registered with watchdog daemon. {Thu Oct 26 10:08:16 2006 }
Debug Hang :ClientProcListener (PID=8355):Registered with watchdog daemon. {Thu Oct 26 10:08:16 2006 }
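
With a log this long, the flagged entries are easy to miss. The following is a minimal sketch, assuming the `>ERROR:` / `>WARNING:` line layout seen in the excerpt above, that pulls out only those entries. The function name `flagged_entries` and the regular expression are illustrative and not part of any Oracle tool.

```python
import re

# Matches lines shaped like the flagged entries in the cm.log excerpt, e.g.
# >ERROR:InitializeCM: query_module() failed, tid = main:2341088 file = cmstartup.c, line = 327 {...}
ENTRY = re.compile(
    r"^>(?P<level>ERROR|WARNING):(?P<message>.*?), tid = (?P<tid>\S+) "
    r"file = (?P<file>\S+), line = (?P<line>\d+)"
)


def flagged_entries(log_text):
    """Return one dict per >ERROR / >WARNING line found in cm.log text."""
    found = []
    for raw in log_text.splitlines():
        match = ENTRY.match(raw.strip())
        if match:
            entry = match.groupdict()
            entry["line"] = int(entry["line"])  # source line number in the C file
            found.append(entry)
    return found


# Three lines copied from the log above
sample = """\
InitializeCM: ModuleName = hangcheck-timer{Thu Oct 26 01:21:11 2006 }
>ERROR:InitializeCM: query_module() failed, tid = main:2341088 file = cmstartup.c, line = 327 {Thu Oct 26 01:21:11 2006 }
>WARNING:ReadCommPort:socket closed by peer on recv()., tid = ClientProcListener:124308400 file = unixinc.c, line = 767 {Thu Oct 26 10:08:16 2006 }
"""

for entry in flagged_entries(sample):
    print(entry["level"], entry["file"], entry["line"], entry["message"])
```

To run it against a real log, pass the contents of cm.log as a string, for example `flagged_entries(open("cm.log").read())`.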

I have definitely loaded hangcheck-timer,
and lsmod also shows hangcheck-timer as loaded.
But cm.log stubbornly keeps showing these lines:
InitializeCM: ModuleName = hangcheck-timer{Thu Oct 26 01:21:11 2006 }
>ERROR:InitializeCM: query_module() failed, tid = main:2341088 file = cmstartup.c, line = 327 {Thu Oct 26 01:21:11 2006 }
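
The lsmod check mentioned above can be scripted. This is a rough sketch, assuming the module list is read from `/proc/modules` (the file lsmod reports from) and that the kernel may list the module as `hangcheck_timer` with an underscore, so both spellings are treated as equal. The helper name `module_loaded` and the sample listing are made up for illustration. A positive result only confirms what lsmod already shows; it does not explain the query_module() failure.

```python
def module_loaded(proc_modules_text, name):
    """Return True if `name` is listed in /proc/modules-style text.

    Hyphens and underscores are treated as the same character, on the
    assumption that the kernel may report hangcheck-timer as hangcheck_timer.
    """
    wanted = name.replace("-", "_")
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name
        if fields and fields[0].replace("-", "_") == wanted:
            return True
    return False


# Hypothetical listing, invented for this example
sample = (
    "hangcheck_timer 7897 0 - Live 0xf8a1b000\n"
    "ext3 116809 2 - Live 0xf88a9000\n"
)

print(module_loaded(sample, "hangcheck-timer"))
print(module_loaded(sample, "softdog"))
```

On a live node the text would come from `open("/proc/modules").read()`.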

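Both machines currently start up normally, but after seven or eight days they go down one after another, and I am at my wits' end. Could an expert point me in the right direction? I have searched for a long time and found many people raising this same problem, but no solution. I also found a report of this bug on metalink, but saw no patch.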

千问 · 2007-9-26 18:42:10

This is urgent. Could some experts please take a look and advise?

千问 · 2007-9-26 18:42:10

This is urgent. Can anyone take a look?

千问 · 2007-9-26 18:42:10

Please help, experts.

千问 · 2007-9-26 18:42:10

I don't know whether your cmcfg.ora and ocmargs.ora are configured correctly. Here is a document for your reference:
Subject: RAC Linux 9.2: Configuration of cmcfg.ora and ocmargs.ora
Doc ID: Note:222746.1   Type: REFERENCE
Last Revision Date: 26-AUG-2004   Status: PUBLISHED

PURPOSE
-------
This article will document the parameters in the Oracle Cluster
Manager (oracm) configuration files cmcfg.ora and ocmargs.ora.

SCOPE & APPLICATION
-------------------
This article is intended for Linux System Administrators and DBAs
who are configuring Real Application Clusters on Linux.
To better understand Oracle9i Release2 (9.2.0.2 and above) changes and new
features for the Oracle Cluster Manager (oracm), a brief description of the
architecture in Oracle9i Release 2 (9.2.0.1) is necessary. Oracle release
9.0.1 included these main components:

o NM service (only in 9.0.1)
o CM services (in 9.2.0.1), the NM and CM services were merged into
  a single service, called oracm
o Watchdog Daemon (watchdogd)

NOTE: There is a software implementation of Watchdog in the Linux kernel,
called softdog, which when in use in conjunction with the Watchdog
Daemon causes a hardware reset of the node if the Watchdog Daemon
does not send a notification (ping) to the softdog within a
specified amount of time (soft_margin).

The Watchdog daemon is an Oracle supplied process which monitors the
oracm and sends pings to softdog through the Watchdog device, /dev/watchdog,
at defined intervals. It also monitors each oracm thread by receiving ping
messages from them, which have registered with watchdogd.

Watchdog Daemon detects the following cases:

o CM thread hang/death
o User mode delay (scheduler / VM problem)
o Kernel hang

In certain cases, systems with high loads resulted in unnecessary reboots.
As such, in Oracle9i Release 2 (9.2.0.2 and above) the Watchdog is detached from
the Cluster Manager.

New Watchdog Implementation in 9.2.0.2 and above
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In place of the Watchdog Daemon (watchdogd), the 9.2.0.2 and above version of
the oracm for Linux now includes the use of a Linux kernel module called
hangcheck-timer. The hangcheck-timer module monitors the Linux kernel
for long operating system hangs, and reboots the node if this occurs,
thereby preventing the database from potential corruptions. This is the
new I/O fencing mechanism for RAC on Linux.
This approach offers three advantages over the Watchdog implementation,
including:

o Node resets are triggered from within the Linux kernel making them
  much less affected by the system load.
o The oracm service on a node can easily be stopped and reconfigured,
  as its operation is completely independent of the kernel module.
o The features provided by the hangcheck-timer module closely resemble
  features found in the implementation of the Cluster Manager for RAC
  on the Windows platform.

The removal of the watchdogd means that the following parameters included
in the cmcfg.ora file are no longer valid:

o WatchdogTimerMargin
o WatchdogSafetyMargin

Please remove these watchdog parameters from your cmcfg.ora file.
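
A quick way to confirm that no obsolete watchdog setting is left behind is to scan cmcfg.ora for the two parameter names. The sketch below assumes simple `Name=value` lines; the function name `obsolete_watchdog_lines` and the values in the sample text are invented for illustration.

```python
# The two parameters the note says are no longer valid
OBSOLETE = ("WatchdogTimerMargin", "WatchdogSafetyMargin")


def obsolete_watchdog_lines(cmcfg_text):
    """Return (line_number, line) pairs that set a removed watchdog parameter."""
    hits = []
    for number, line in enumerate(cmcfg_text.splitlines(), start=1):
        # Everything before the first '=' is taken as the parameter name
        key = line.split("=", 1)[0].strip()
        if key in OBSOLETE:
            hits.append((number, line.strip()))
    return hits


# Hypothetical file content; the watchdog values are made up
sample = """\
KernelModuleName=hangcheck-timer
WatchdogTimerMargin=1000
MissCount=215
WatchdogSafetyMargin=2000
"""

for number, line in obsolete_watchdog_lines(sample):
    print("line %d should be removed: %s" % (number, line))
```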

Configuration of $ORACLE_HOME/oracm/admin/cmcfg.ora:
----------------------------------------------------
Parameters for cmcfg.ora, from a working Redhat Advanced Server 2.1
installation:
ClusterName=Oracle Cluster Manager, version 9i
KernelModuleName=hangcheck-timer
HeartBeat=15000
PollInterval=1000
MissCount=215
PrivateNodeNames=int-opcbr1 int-opcbr2
PublicNodeNames=opcbr1 opcbr2
ServicePort=9998
CmDiskFile=/u03/RAC/quorum.dbf
HostName=int-opcbr1
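
For checking a file against the listing above, the parameters can be read into a dictionary. This is a minimal sketch, assuming one `Name=value` pair per line; skipping lines that start with `#` is an assumption about comment syntax, not something the note states. The function name `parse_cmcfg` is illustrative.

```python
def parse_cmcfg(text):
    """Parse cmcfg.ora-style 'Name=value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, assumed '#' comments, and anything without '='
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


# The working example from the note
sample = """\
ClusterName=Oracle Cluster Manager, version 9i
KernelModuleName=hangcheck-timer
HeartBeat=15000
PollInterval=1000
MissCount=215
PrivateNodeNames=int-opcbr1 int-opcbr2
PublicNodeNames=opcbr1 opcbr2
ServicePort=9998
CmDiskFile=/u03/RAC/quorum.dbf
HostName=int-opcbr1
"""

config = parse_cmcfg(sample)
print(config["KernelModuleName"])
print(config["PrivateNodeNames"].split())  # node names are space separated
print(int(config["MissCount"]))
```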

Detailed explanation of each parameter:
---------------------------------------
ClusterName:
  This is a fixed parameter, and needs to remain at the above, default value.
KernelModuleName:
  This is a new parameter for 9.2.0.2, and is required in order to use the
  new Oracle-supplied hangcheck-timer module.
HeartBeat:
  Leave at default setting of HeartBeat=15000.
PollInterval:
  Leave at default, PollInterval=1000.
MissCount:
  The MissCount must be set to a large value (at least 60) and must be greater
  than the sum of hangcheck_tick + hangcheck_margin. We recommend 215 seconds.
  The hangcheck_tick and hangcheck_margin parameters are set when you load
  the hangcheck-timer module, for example:
    /sbin/insmod hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
  Note that this value may need to be lowered for Transparent Application
  Failover (TAF) environments.
PrivateNodeNames:
  These are the /etc/hosts names for the private network used for RAC traffic.
PublicNodeNames:
  These are the names returned by the hostname() system call. This list
  is used by the Oracle Universal Installer (OUI) to determine what
  available cluster members are present at software installation time.
ServicePort:
  Default UDP port that the oracm service will open at startup for the
  CM communications with other cluster nodes.
HostName:
  The hostname (interface) where you want oracm to open its UDP port for
  cluster communications.
CmDiskFile:
  This mandatory parameter points to a raw device or cluster file system
  (OCFS) file that will be used by all cluster nodes.
-- End of cmcfg.ora parameters --
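
The MissCount rule above is simple arithmetic: with hangcheck_tick=30 and hangcheck_margin=180 the sum is 210, so the recommended 215 clears it by 5. Here is a small sketch of that check, assuming all three values are in seconds as the note implies. The function name `check_misscount` is illustrative.

```python
def check_misscount(miss_count, hangcheck_tick, hangcheck_margin):
    """Return a list of problems with MissCount; an empty list means it passes.

    Rules taken from the note: MissCount should be at least 60 and must be
    greater than hangcheck_tick + hangcheck_margin.
    """
    problems = []
    if miss_count < 60:
        problems.append("MissCount %d is below the minimum of 60" % miss_count)
    hang_window = hangcheck_tick + hangcheck_margin
    if miss_count <= hang_window:
        problems.append(
            "MissCount %d is not greater than hangcheck_tick + hangcheck_margin (%d)"
            % (miss_count, hang_window)
        )
    return problems


# Values from the note: 215 > 30 + 180
print(check_misscount(215, 30, 180))
# A MissCount of 200 with the same module settings fails the second rule
print(check_misscount(200, 30, 180))
```

The tick and margin passed in should be the ones actually given to insmod when the module was loaded, since the comparison is only meaningful against the live module settings.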

Configuration of ocmargs.ora
----------------------------
Parameters for ocmargs.ora, from a working Redhat Advanced Server 2.1
installation:
oracm
norestart 1800

Detailed explanation of each parameter:
---------------------------------------
oracm:
  The name of the binary executable used to launch the Oracle Cluster
  Manager daemon.
norestart:
  The value used by the ocmstart.sh startup script to prevent too frequent
  oracm restarts.
-- End of ocmargs.ora parameters --

NOTE: When adding a node to an existing cluster:
------------------------------------------------
You may want to make a note of the values contained within these two files if they
are different from the default values. When adding a node to an existing cluster with
OUI, the default values may get reinserted. You will want to verify the correct values
are present (or reinserted) after the addition is completed.

RELATED DOCUMENTS
-----------------
For more information on how to install and configure the Oracle
Cluster Manager, please read the Oracle Linux 9.2.0.2 README/Patchset

千问 · 2007-9-26 18:42:10

Can anyone bump this?

千问 · 2007-9-26 18:42:10

Can anyone bump this?

千问 · 2007-9-26 18:42:10

Bumping once a day.

千问 · 2007-9-26 18:42:10

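One of our customers hit the same problem.
Sigh. After some tinkering I thought it was solved, but it still fails. I am also looking for a solution.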

千问 · 2007-9-26 18:42:10

No need to keep puzzling over this. I checked on metalink: Red Hat AS 4 is not certified for the 9.2 OCM.