CRS-1610 - 90% heartbeat fatal, eviction in 0.102 seconds

显示全部楼层 · 2015-3-6 11:57:31

Linux AS 5.3 64 bit
Oracle 10.2.0.4 2 nodes
GFS file system

Node2rebootabnormally .

node2 Linux Log:

Feb4 16:14:46 --- reboot
Feb4 16:18:57 --- ok

Feb4 16:14:14 hou249bbodb3112 snmpd[5979]: Received SNMP packet(s) from UDP: [127.0.0.1]:38732
Feb4 16:14:29 hou249bbodb3112 snmpd[5979]: Connection from UDP: [127.0.0.1]:51532
Feb4 16:14:29 hou249bbodb3112 snmpd[5979]: Received SNMP packet(s) from UDP: [127.0.0.1]:51532
Feb4 16:14:30 hou249bbodb3112 snmpd[5979]: Connection from UDP: [127.0.0.1]:51532
Feb4 16:14:46 hou249bbodb3112 snmpd[5979]: Connection from UDP: [127.0.0.1]:34969
Feb4 16:14:46 hou249bbodb3112 snmpd[5979]: Received SNMP packet(s) from UDP: [127.0.0.1]:34969
Feb4 16:14:46 hou249bbodb3112 snmpd[5979]: Connection from UDP: [10.13.8.110]:1048
Feb4 16:14:46 hou249bbodb3112 snmpd[5979]: Received SNMP packet(s) from UDP: [10.13.8.110]:1048
Feb4 16:18:57 hou249bbodb3112 syslogd 1.4.1: restart.
Feb4 16:18:57 hou249bbodb3112 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb4 16:18:57 hou249bbodb3112 kernel: Bootdata ok (command line is ro root=/dev/VolGroup00/LogVol00 rhgb quiet)
Feb4 16:18:57 hou249bbodb3112 kernel: Linux version 2.6.18-128.1.16.el5xen ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Fri Jun 26 11:10:46 EDT 2009
Feb4 16:18:57 hou249bbodb3112 kernel: BIOS-provided physical RAM map:
Feb4 16:18:57 hou249bbodb3112 kernel:Xen: 0000000000000000 - 00000003d7724000 (usable)
Feb4 16:18:57 hou249bbodb3112 kernel: DMI 2.5 present.
Feb4 16:18:57 hou249bbodb3112 kernel: ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Feb4 16:18:57 hou249bbodb3112 kernel: ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] enabled)
Feb4 16:18:57 hou249bbodb3112 kernel: ACPI: LAPIC (acpi_id[0x10] lapic_id[0x10] enabled)
Feb4 16:18:57 hou249bbodb3112 kernel: ACPI: LAPIC (acpi_id[0x18] lapic_id[0x18] enabled)
Feb4 16:18:57 hou249bbodb3112 kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Feb4 16:18:57 hou249bbodb3112 kernel: ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] enabled)
Feb4 16:18:57 hou249bbodb3112 kernel: ACPI: LAPIC (acpi_id[0x11] lapic_id[0x11] enabled)

Node2 CRS Log :

2010-01-25 13:12:27.741
[crsd(10609)]CRS-1012:The OCR service started on node hou249bbodb3111.
2010-01-25 13:12:28.464
[evmd(10607)]CRS-1401:EVMD started on node hou249bbodb3111.
2010-01-25 13:12:29.902
[crsd(10609)]CRS-1201:CRSD started on node hou249bbodb3111.
2010-01-25 14:10:32.055
[cssd(11203)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 14.078 seconds
2010-01-25 14:10:33.051
[cssd(11203)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 13.088 seconds
2010-01-25 14:15:01.135
[cssd(10708)]CRS-1605:CSSD voting file is online: /dev/sdc. Details in /u01/app/oracle/product/crs/log/hou249bbodb3111/cssd/ocssd.log.
2010-01-25 14:15:01.137
[cssd(10708)]CRS-1605:CSSD voting file is online: /dev/sdd. Details in /u01/app/oracle/product/crs/log/hou249bbodb3111/cssd/ocssd.log.
2010-01-25 14:15:01.169
[cssd(10708)]CRS-1605:CSSD voting file is online: /dev/sdg. Details in /u01/app/oracle/product/crs/log/hou249bbodb3111/cssd/ocssd.log.
[cssd(10708)]CRS-1601:CSSD Reconfiguration complete. Active nodes are hou249bbodb3111 hou249bbodb3112 .
2010-01-25 14:15:07.842
[crsd(10106)]CRS-1005:The OCR upgrade was completed. Version has changed from 185599488 to 185599488. Details in /u01/app/oracle/product/crs/log/hou249bbodb3111/crsd/crsd.log.
2010-01-25 14:15:07.843
[crsd(10106)]CRS-1012:The OCR service started on node hou249bbodb3111.
2010-01-25 14:15:08.430
[evmd(10057)]CRS-1401:EVMD started on node hou249bbodb3111.
2010-01-25 14:15:12.687
[crsd(10106)]CRS-1201:CRSD started on node hou249bbodb3111.
2010-02-04 16:15:11.137
[cssd(10708)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 14.102 seconds
2010-02-04 16:15:12.234
[cssd(10708)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 13.102 seconds
2010-02-04 16:15:19.129
[cssd(10708)]CRS-1611:node hou249bbodb3112 (2) at 75% heartbeat fatal, eviction in 6.102 seconds
2010-02-04 16:15:23.129
[cssd(10708)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 2.102 seconds
2010-02-04 16:15:24.125
[cssd(10708)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 1.112 seconds
2010-02-04 16:15:25.129
[cssd(10708)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 0.102 seconds
2010-02-04 16:15:26.006
[cssd(10708)]CRS-1607:CSSD evicting node hou249bbodb3112. Details in /u01/app/oracle/product/crs/log/hou249bbodb3111/cssd/ocssd.log.
[cssd(10708)]CRS-1601:CSSD Reconfiguration complete. Active nodes are hou249bbodb3111 .
2010-02-04 16:15:30.531
[crsd(10106)]CRS-1204:Recovering CRS resources for node hou249bbodb3112.
[cssd(10708)]CRS-1601:CSSD Reconfiguration complete. Active nodes are hou249bbodb3111 hou249bbodb3112 .

node1 crsd log:

hou249bbodb3111$vi crsd.log

2010-01-25 14:16:35.867: [CRSRES][1504274752] startRunnable: setting CLI values
2010-01-25 14:16:36.108: [CRSRES][1504274752] Attempting to start `ora.hou249bbodb3111.gsd` on member `hou249bbodb3111`
2010-01-25 14:16:36.473: [CRSRES][1537845568] Attempting to start `ora.wmb2bprd.db` on member `hou249bbodb3112`
2010-01-25 14:16:37.146: [CRSRES][1504274752] Start of `ora.hou249bbodb3111.gsd` on member `hou249bbodb3111` succeeded.
2010-01-25 14:16:37.420: [CRSRES][1537845568] Start of `ora.wmb2bprd.db` on member `hou249bbodb3112` succeeded.
2010-02-04 16:15:26.098: [ CRSCOMM][1537845568] CLEANUP: Searching for connections to failed node hou249bbodb3112
2010-02-04 16:15:26.098: [CRSEVT][1537845568] Processing member leave for hou249bbodb3112, incarnation: 145375564
2010-02-04 16:15:26.099: [CRSD][1537845568] SM: recovery in process: 8
2010-02-04 16:15:26.099: [CRSEVT][1537845568] Do failover for: hou249bbodb3112
2010-02-04 16:15:26.857: [CRSRES][1537845568]startup = 0
2010-02-04 16:15:26.881: [CRSRES][1537845568]startup = 0
2010-02-04 16:15:26.896: [CRSRES][1537845568]startup = 0
2010-02-04 16:15:26.914: [CRSRES][1537845568]startup = 0
2010-02-04 16:15:26.926: [CRSRES][1537845568]startup = 0
2010-02-04 16:15:26.946: [CRSRES][1537845568]startup = 0
2010-02-04 16:15:27.029: [CRSRES][1087633728] startRunnable: setting CLI values
2010-02-04 16:15:27.045: [CRSRES][1087633728] Attempting to start `ora.hou249bbodb3112.vip` on member `hou249bbodb3111`
2010-02-04 16:15:27.071: [CRSRES][1504274752] startRunnable: setting CLI values
2010-02-04 16:15:27.123: [CRSRES][1504274752] Attempting to start `ora.wmb2bprd.db` on member `hou249bbodb3111`
2010-02-04 16:15:27.276: [CRSRES][1504274752] Start of `ora.wmb2bprd.db` on member `hou249bbodb3111` succeeded.
2010-02-04 16:15:30.518: [CRSRES][1087633728] Start of `ora.hou249bbodb3112.vip` on member `hou249bbodb3111` succeeded.
2010-02-04 16:15:30.531: [CRSEVT][1537845568] Post recovery done evmd event for: hou249bbodb3112
2010-02-04 16:15:30.532: [CRSD][1537845568] SM: recoveryDone: 0
2010-02-04 16:15:30.537: [CRSEVT][1537845568] Processing RecoveryDone
2010-02-04 16:19:52.049: [OCRUTL][1283971392]u_freem: mem passed is null
2010-02-04 16:19:54.405: [CRSD][1094658368] SM: rE2Ec: 4
2010-02-04 16:19:54.406: [CRSRES][1537845568] StopResource: setting CLI values
2010-02-04 16:19:54.869: [CRSD][1537845568] SM:dE2Ec: all E2E cmds done. 0
"crsd.log" 8463L, 600108C

Node1 Linux Log :

Feb4 16:14:46 hou249bbodb3111 snmpd[5985]: Connection from UDP: [10.13.8.110]:1048
Feb4 16:14:46 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [10.13.8.110]:1048
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP reset occured (f7f7).
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP occured (f7f7).
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP reset occured (f7f7).
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP occured (f7f7).
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] entering GATHER state from 2.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] entering GATHER state from 0.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Creating commit token because I am the rep.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Saving state aru 16aa35 high seq received 16aa35
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Storing new sequence id for ring ac
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] entering COMMIT state.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] entering RECOVERY state.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] position [0] member 172.16.223.111:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] previous ring seq 168 rep 172.16.223.111
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] aru 16aa35 high delivered 16aa35 received flag 1
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Did not need to originate any messages in recovery.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Sending initial ORF token
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] CLM CONFIGURATION CHANGE
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] New Configuration:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM]r(0) ip(172.16.223.111)
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] Members Left:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM]r(0) ip(172.16.223.112)
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] Members Joined:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] CLM CONFIGURATION CHANGE
Feb4 16:15:11 hou249bbodb3111 kernel: dlm: closing connection to node 2
Feb4 16:15:12 hou249bbodb3111 openais[5501]: [CLM] New Configuration:
Feb4 16:15:12 hou249bbodb3111 openais[5501]: [CLM]r(0) ip(172.16.223.111)
Feb4 16:15:13 hou249bbodb3111 openais[5501]: [CLM] Members Left:
Feb4 16:15:13 hou249bbodb3111 fenced[5520]: hou249bbodb3112priv not a cluster member after 1 sec post_fail_delay
Feb4 16:15:14 hou249bbodb3111 fenced[5520]: fencing node "hou249bbodb3112priv"
Feb4 16:15:14 hou249bbodb3111 openais[5501]: [CLM] Members Joined:
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [SYNC ] This node is within the primary component and will provide service.
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [TOTEM] entering OPERATIONAL state.
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [CLM] got nodejoin message 172.16.223.111
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [CPG] got joinlist message from node 1
Feb4 16:15:28 hou249bbodb3111 fenced[5520]: fence "hou249bbodb3112priv" success
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Trying to acquire journal lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Trying to acquire journal lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Trying to acquire journal lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Looking at journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Looking at journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Looking at journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Acquiring the transaction lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Acquiring the transaction lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Acquiring the transaction lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Replaying journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Replayed 0 of 2 blocks
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: replays = 0, skips = 0, sames = 2
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Replaying journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Journal replayed in 1s
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Done
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Replaying journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Replayed 0 of 1 blocks
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: replays = 0, skips = 1, sames = 0
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Replayed 0 of 38 blocks
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: replays = 0, skips = 12, sames = 26
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Journal replayed in 1s
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Done
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Journal replayed in 1s
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Done
Feb4 16:15:30 hou249bbodb3111 avahi-daemon[8110]: Registering new address record for 10.18.223.117 on bond1.
Feb4 16:15:30 hou249bbodb3111 avahi-daemon[8110]: Withdrawing address record for 10.18.223.117 on bond1.
Feb4 16:15:30 hou249bbodb3111 avahi-daemon[8110]: Registering new address record for 10.18.223.117 on bond1.
[ 本帖最后由 tolywang 于 2010-2-5 13:59 编辑 ]

千问 · 2015-3-6 11:57:31

不過經過時間對比，我們發現應該是節點 2 先重新啟動，然后才報的這個心跳線的信息。

2010-02-04 16:15:24.125
[cssd(10708)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal,
eviction in 1.112 seconds
2010-02-04 16:15:25.129
[cssd(10708)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal,
eviction in 0.102 seconds
2010-02-04 16:15:26.006
[cssd(10708)]CRS-1607:CSSD evicting node hou249bbodb3112. Details in
/u01/app/oracle/product/crs/log/hou249bbodb3111/cssd/ocssd.log.
[cssd(10708)]CRS-1601:CSSD Reconfiguration complete. Active nodes are
hou249bbodb3111 .
2010-02-04 16:15:30.531
[crsd(10106)]CRS-1204:Recovering CRS resources for node
hou249bbodb3112.
[cssd(10708)]CRS-1601:CSSD Reconfiguration complete. Active nodes are
hou249bbodb3111 hou249bbodb3112 .

千问 · 2015-3-6 11:57:31

I think your message is a report of an incident and the cause is already identified. Can you tell us what the moral of the story is? That is, any take-home message, or lessons learned, from this incident. I guess we can say the network connection between RAC nodes is very important, and bonded interface may be needed to improve up-time. Anything else?
Yong Huang

千问 · 2015-3-6 11:57:31

原帖由 Yong Huang 于 2010-2-6 06:05 发表
I think your message is a report of an incident and the cause is already identified. Can you tell us what the moral of the story is? That is, any take-home message, or lessons learned, from this incident. I guess we can say the network connection between RAC nodes is very important, and bonded interface may be needed to improve up-time. Anything else?
Yong Huang

yes, butI am not sure that heart beat network issue is root cause ,because if the node2 reboot at first,i think"heart beat fatal" message will also appear ,we also doubt there exists hardware ortime syncissue ,so we ask system administrator at Houston datacenter to check the time server andheart beat switch log message .
may be we will get some information from him in 2 days .

About this issue,Do you have any bright ideas ?

-----------------------------------------------------------
node1linuxlog message :

Feb4 16:12:22 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:12399
Feb4 16:12:23 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:12399
Feb4 16:12:38 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:3191
Feb4 16:12:38 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:3191
Feb4 16:12:54 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:45687
Feb4 16:12:54 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:45687
Feb4 16:12:55 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:45687
Feb4 16:13:10 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:29610
Feb4 16:13:10 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:29610
Feb4 16:13:25 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:42750
Feb4 16:13:25 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:42750
Feb4 16:13:27 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:42750
Feb4 16:13:42 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:20446
Feb4 16:13:42 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:20446
Feb4 16:13:57 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:6224
Feb4 16:13:57 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:6224
Feb4 16:13:59 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:6224
Feb4 16:14:14 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:5948
Feb4 16:14:14 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:5948
Feb4 16:14:29 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:57916
Feb4 16:14:29 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:57916
Feb4 16:14:31 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:57916
Feb4 16:14:46 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:45238
Feb4 16:14:46 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:45238
Feb4 16:14:46 hou249bbodb3111 snmpd[5985]: Connection from UDP: [10.13.8.110]:1048
Feb4 16:14:46 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [10.13.8.110]:1048
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP reset occured (f7f7).
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP occured (f7f7).
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP reset occured (f7f7).
Feb4 16:14:57 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP occured (f7f7).
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Feb4 16:15:06 hou249bbodb3111 openais[5501]: [TOTEM] entering GATHER state from 2.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] entering GATHER state from 0.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Creating commit token because I am the rep.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Saving state aru 16aa35 high seq received 16aa35
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Storing new sequence id for ring ac
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] entering COMMIT state.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] entering RECOVERY state.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] position [0] member 172.16.223.111:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] previous ring seq 168 rep 172.16.223.111
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] aru 16aa35 high delivered 16aa35 received flag 1
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Did not need to originate any messages in recovery.
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [TOTEM] Sending initial ORF token
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] CLM CONFIGURATION CHANGE
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] New Configuration:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM]r(0) ip(172.16.223.111)
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] Members Left:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM]r(0) ip(172.16.223.112)
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] Members Joined:
Feb4 16:15:11 hou249bbodb3111 openais[5501]: [CLM] CLM CONFIGURATION CHANGE
Feb4 16:15:11 hou249bbodb3111 kernel: dlm: closing connection to node 2
Feb4 16:15:12 hou249bbodb3111 openais[5501]: [CLM] New Configuration:
Feb4 16:15:12 hou249bbodb3111 openais[5501]: [CLM]r(0) ip(172.16.223.111)
Feb4 16:15:13 hou249bbodb3111 openais[5501]: [CLM] Members Left:
Feb4 16:15:13 hou249bbodb3111 fenced[5520]: hou249bbodb3112priv not a cluster member after 1 sec post_fail_delay
Feb4 16:15:14 hou249bbodb3111 fenced[5520]: fencing node "hou249bbodb3112priv"
Feb4 16:15:14 hou249bbodb3111 openais[5501]: [CLM] Members Joined:
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [SYNC ] This node is within the primary component and will provide service.
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [TOTEM] entering OPERATIONAL state.
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [CLM] got nodejoin message 172.16.223.111
Feb4 16:15:15 hou249bbodb3111 openais[5501]: [CPG] got joinlist message from node 1
Feb4 16:15:28 hou249bbodb3111 fenced[5520]: fence "hou249bbodb3112priv" success
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Trying to acquire journal lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Trying to acquire journal lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Trying to acquire journal lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Looking at journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Looking at journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Looking at journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Acquiring the transaction lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Acquiring the transaction lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Acquiring the transaction lock...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Replaying journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Replayed 0 of 2 blocks
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: replays = 0, skips = 0, sames = 2
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Replaying journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Journal replayed in 1s
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata1.1: jid=0: Done
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Replaying journal...
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Replayed 0 of 1 blocks
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: replays = 0, skips = 1, sames = 0
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Replayed 0 of 38 blocks
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: replays = 0, skips = 12, sames = 26
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Journal replayed in 1s
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata.1: jid=0: Done
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Journal replayed in 1s
Feb4 16:15:28 hou249bbodb3111 kernel: GFS: fsid=b2bgfs_cluster:gfs-b2bdata2.1: jid=0: Done
Feb4 16:15:30 hou249bbodb3111 avahi-daemon[8110]: Registering new address record for 10.18.223.117 on bond1.
Feb4 16:15:30 hou249bbodb3111 avahi-daemon[8110]: Withdrawing address record for 10.18.223.117 on bond1.
Feb4 16:15:30 hou249bbodb3111 avahi-daemon[8110]: Registering new address record for 10.18.223.117 on bond1.
Feb4 16:15:37 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:15502
Feb4 16:15:37 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:15502
Feb4 16:15:38 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:15502
Feb4 16:15:53 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:19598
Feb4 16:15:53 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:19598
Feb4 16:16:08 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:43684
Feb4 16:16:08 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:43684
Feb4 16:16:10 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:43684
Feb4 16:16:25 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:45465
Feb4 16:16:25 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:45465
Feb4 16:16:35 hou249bbodb3111 snmpd[5985]: Connection from UDP: [10.13.8.110]:1048
Feb4 16:16:40 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:55588
Feb4 16:16:40 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:55588
Feb4 16:16:42 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:55588
Feb4 16:16:57 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:48539
Feb4 16:16:57 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:48539
Feb4 16:17:12 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:63016
Feb4 16:17:12 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:63016
Feb4 16:17:13 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:63016
Feb4 16:17:28 hou249bbodb3111 snmpd[5985]: Connection from UDP: [127.0.0.1]:62670
Feb4 16:17:28 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [127.0.0.1]:62670
Feb4 16:17:29 hou249bbodb3111 snmpd[5985]: Connection from UDP: [10.17.48.15]:4296
Feb4 16:17:29 hou249bbodb3111 snmpd[5985]: Received SNMP packet(s) from UDP: [10.17.48.15]:4296

千问 · 2015-3-6 11:57:31

> if the node2 reboot at first,i think"heart beat fatal" message will also appear
Isn't that what you got, and also you used that message in the subject of this posting?
[cssd(11203)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 14.078 seconds 2010-01-25 14:10:33.051
...
> we ask system administrator at Houston datacenter to check the time server
> andheart beat switch log message
What a coincident! I'm in Houston, Texas. What company is it, if you don't mind telling me? Of course I won't contact them.
You can always check the time server's time and other info with ntpq. But obviously you have to ask them to check the server log, and also the switch log. If you find anything new, let us know.
On your end, you have to make sure ntpd always runs fine on both nodes.
Yong Huang

千问 · 2015-3-6 11:57:31

原帖由 Yong Huang 于 2010-2-9 10:59 发表
> if the node2 reboot at first,i think"heart beat fatal" message will also appear
Isn't that what you got, and also you used that message in the subject of this posting?
[cssd(11203)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 14.078 seconds 2010-01-25 14:10:33.051
...
> we ask system administrator at Houston datacenter to check the time server
> andheart beat switch log message
What a coincident! I'm in Houston, Texas. What company is it, if you don't mind telling me? Of course I won't contact them.
You can always check the time server's time and other info with ntpq. But obviously you have to ask them to check the server log, and also the switch log. If you find anything new, let us know.
On your end, you have to make sure ntpd always runs fine on both nodes.
Yong Huang

Sorry for late reply ,the company is Foxconn, they tell me there is not log message on the heart beat switch , they will reconfigure it .
ntpd runs find on the both nodes . They also told us the two servers status is healthy.
I will update if find anything new .