联系:手机/微信(+86 17813235971) QQ(107644445)
标题:ORA-00600[kjhn_post_ha_alert0-862]原因分析
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
数据库版本和平台信息
数据库版本为10.2.0.1版本,而且是32位的win 2003 sp2之上
ORACLE V10.2.0.1.0 - Production vsnsta=0 vsnsql=14 vsnxtr=3 Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production With the Partitioning, OLAP and Data Mining options Windows Server 2003 Version V5.2 Service Pack 2 CPU : 2 - type 586, 1 Physical Cores Process Affinity : 0x00000000 Memory (Avail/Total): Ph:2608M/3990M, Ph+PgF:4511M/5871M, VA:1242M/2047M Instance name: orcl
数据库报大量ORA-600[kjhn_post_ha_alert0-862]错误
数据库的mmon进程报大量ORA-00600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], []错误
Wed Jun 03 21:50:40 2015 Restarting dead background process MMON MMON started with pid=11, OS id=3804 Wed Jun 03 21:50:43 2015 Errors in file e:\oracle\product\10.2.0\admin\orcl\bdump\orcl_mmon_3804.trc: ORA-00600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Wed Jun 03 21:50:49 2015 Errors in file e:\oracle\product\10.2.0\admin\orcl\bdump\orcl_mmon_3804.trc: ORA-00600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Wed Jun 03 21:55:44 2015 Errors in file e:\oracle\product\10.2.0\admin\orcl\bdump\orcl_mmon_3804.trc: ORA-00600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Wed Jun 03 21:55:49 2015 Errors in file e:\oracle\product\10.2.0\admin\orcl\bdump\orcl_mmon_3804.trc: ORA-00600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Wed Jun 03 22:00:40 2015 Thread 1 advanced to log sequence 476 Current log# 1 seq# 476 mem# 0: E:\ORACLE\PRODUCT\10.2.0\ORADATA\ORCL\REDO01.LOG Wed Jun 03 22:00:44 2015 Errors in file e:\oracle\product\10.2.0\admin\orcl\bdump\orcl_mmon_3804.trc: ORA-00600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], []
查询对应trace文件发现
ORA-00600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [] , [], [], [], [] Current SQL statement for this session: BEGIN :success := dbms_ha_alerts_prvt.check_ha_resources; END;
人工执行该过程
SQL> var success varchar2 SQL> begin 2 :success := sys.dbms_ha_alerts_prvt.check_ha_resources; 3 end; 4 / PL/SQL procedure successfully completed. SQL> print success SUCCESS -------------------------------- N
通过查询相关资料得到如下说明
@ This check is triggered with FAN enabled at this instance and it seems to be @ associated with a startup action. From the procedure itself which is called @ this is a run-once MMON (startup) action which supports instance down @ notification reliability. It does the folowing a) registers the current @ instance incarnation in recent_resource_incarnations$ if it's not already @ there b) deletes recent_resource_incarnations$ records that don't apply to @ this database. They may, e.g., have been copied from seed db or from a former @ DataGuard primary c) scans recent_resource_incarnations$ for instance @ incarnations that are no longer alive, and submits instance down alerts for @ them . If all is good then return 'Y' else 'N' (or error) if there is a @ failure. That failure is to get back to MMON, so that it may retry this @ action later. In the local instance I get a 'Y' but in the customer's system @ it fails with a 'N' which seems related to the ORA-600 assert. @ This function is kjhn_post_ha_alert0() which is internal and does the real work of @ posting HA alerts. It is used by both kjhn_post_ha_alert and @ kjn_post_ha_alert_plsql. Its parameters are basically the same as those of @ kjhn_post_ha_alert,other than the fact that it uses individual parameters @ rather than the more easily extensible structure. Also the parameters passed @ to it are the instance_name and the host_name which is the kernelized @ implementation for posting HA alerts. Without actually having the arguments @ the guess is that either the host_name or the instance_name raised in the @ assert is null which triggered it.
mmon进程尝试调用相关程序,然后无法得出正确值,返回N,然后会一直尝试,如果不能得到返回Y,就会一直报ORA-600,错误.通过上述的三种情况来说,都和recent_resource_incarnations$表有关系.
该故障原因是由于:mmon在调用kjhn_post_ha_alert0函数在执行的时候,如果发现参数host_name或者instance_name为null,就会报该错误出来.
处理方法
This problem has been documented as Bug 5173066 REPEATED ORA-600 [KJHN_POST_HA_ALERT0-862] FROM MMON PROCESS.
The bug is fixed in 11.1.0.6. A workaround is available for the problem.
该bug在11.1.0.6中得以修复
To implement the workaround, please execute the following steps as the SYS user:
1. Collect the following information and spool it to a file for your records.
a. output of select * from v$instance
b. show parameter instance_name
c. set pages 1000
d. select * from recent_resource_incarnations$
2. Create a backup table of recent_resource_incarnations$.
SQL> create table recent_resource_inc$bk as select * from recent_resource_incarnations$;
3. Truncate recent_resource_incarnations$. Be sure to do this while the instance is up and running.
    Do not issue this statement if a shutdown is pending.
SQL> truncate table recent_resource_incarnations$;
4. Perform a clean shutdown, followed by a startup.
具体参考:
ORA-600 [kjhn_post_ha_alert0-862] Continuously Repeated in the Alert Log (Doc ID 401640.1)
Bug 5173066 : REPEATED ORA-600 [KJHN_POST_HA_ALERT0-862] FROM MMON PROCESS
Bug 5173066 : REPEATED ORA-600 [KJHN_POST_HA_ALERT0-862] FROM MMON PROCESS
Hdr: 5173066 10.2.0.1.0 RDBMS 10.2.0.1.0 RAC PRODID-5 PORTID-207 ORA-600 Abstract: REPEATED ORA-600 [KJHN_POST_HA_ALERT0-862] FROM MMON PROCESS *** RSERNA 04/19/06 12:06 pm *** TAR: ---- @SR:5322693.992 PROBLEM: -------- Tue Apr 18 18:17:58 2006 Completed: alter database open Tue Apr 18 18:18:33 2006 Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc: ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Tue Apr 18 18:18:59 2006 Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc: ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Tue Apr 18 18:20:31 2006 Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc: ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Tue Apr 18 18:20:35 2006 Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc: ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Tue Apr 18 18:25:32 2006 Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc: ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Tue Apr 18 18:25:35 2006 Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc: ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] generating large trace files. DIAGNOSTIC ANALYSIS: -------------------- Customer already has several laptops with the same scenario. Personal Oracle Database 10g Release 10.2.0.1.0 - Production With the Partitioning, OLAP and Data Mining options Windows XP Version V5.1 Service Pack 2 CPU : 1 - type 586 Process Affinity : 0x00000000 Memory (Avail/Total): Ph:728M/1271M, Ph+PgF:2550M/3033M, VA:1746M/2047M Instance name: sfa *** SERVICE NAME:(SYS$BACKGROUND) 2006-04-18 18:18:33.736 *** SESSION ID:(106.1) 2006-04-18 18:18:33.736 *** 2006-04-18 18:18:33.736 ksedmp: internal or fatal error ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] Current SQL statement for this session: BEGIN :success := dbms_ha_alerts_prvt.check_ha_resources; END; ----- PL/SQL Call Stack ----- object line object handle number name 6959CF70 418 package body SYS.DBMS_HA_ALERTS_PRVT 6959CF70 552 package body SYS.DBMS_HA_ALERTS_PRVT 6959CF70 305 package body SYS.DBMS_HA_ALERTS_PRVT 692627B4 1 anonymous block Executing the procedure manually gives: SQL> var success varchar2 SQL> begin 2 :success := sys.dbms_ha_alerts_prvt.check_ha_resources; 3 end; 4 / PL/SQL procedure successfully completed. SQL> print success SUCCESS -------------------------------- N WORKAROUND: ----------- None. No documentation found on what the procedure does, but assuming the success='N' is why the errors are occuring. RELATED BUGS: ------------- REPRODUCIBILITY: ---------------- As soon as database starts. TEST CASE: ---------- None. STACK TRACE: ------------ ksedst ksedmp ksfdmp kgerinv kgeasnmierr kjhn_post_ha_alert spefcmpa spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal pefcal pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt kkxexe opiexe kpoal8 opiodr kpoodr xupirtrc upirtrc kpurcsc kpuexecv8 kpuexec OCIStmtExecute kjhn_mmon_action 459 kjhn_check_ha_reso urces kebm_ronce_dispatc her kebm_ronce_execute ksb_run_timeout_an d_intr_action kebm_open_action ksbabs kebm_mmon_main ksbrdp opirip VInfreq__opidrv sou2o opimai_real opimai SUPPORTING INFORMATION: ----------------------- 24 HOUR CONTACT INFORMATION FOR P1 BUGS: ---------------------------------------- DIAL-IN INFORMATION: -------------------- IMPACT DATE: ------------ *** RSERNA 04/19/06 12:44 pm *** (CHG: Sta->16) *** RSERNA 04/19/06 12:44 pm *** @ RDA and alert_traces have been uploaded. *** SSRAO 04/19/06 09:03 pm *** (CHG: Asg->SSRAO) *** SSRAO 05/09/06 01:49 am *** @ BDE Screening in progress. Thanks *** SSRAO 05/09/06 01:51 am *** @ Reviewing code for the assert. Thanks *** SSRAO 05/09/06 01:54 am *** @ The assert condition is @ -- @ KSEORAASSERTNM(instance_name && *instance_name && @ host_name && *host_name, @ OERINM("kjhn_post_ha_alert0-862")); @ -- which is the instance_name and the host_name are set within the code. This @ function is kjhn_post_ha_alert0() which is internal and does the real work of @ posting HA alerts. It is used by both kjhn_post_ha_alert and @ kjn_post_ha_alert_plsql. Its parameters are basically the same as those of @ kjhn_post_ha_alert,other than the fact that it uses individual parameters @ rather than the more easily extensible structure. Also the parameters passed @ to it are the instance_name and the host_name which is the kernelized @ implementation for posting HA alerts. Without actually having the arguments @ the guess is that either the host_name or the instance_name raised in the @ assert is null which triggered it. Still reviewing. Thanks *** SSRAO 05/09/06 01:54 am *** (CHG: SubComp->RAC) *** SSRAO 05/09/06 02:21 am *** @ From he stack it seems we have @ -- @ kjhn_mmon_action kjhn_check_ha_resources kebm_ronce_dispatcher @ kebm_ronce_execute ksb_run_timeout_and_intr_action @ -- @ which is called from within the MMON routine for checking the HA resources. @ Statement executed through PLSQL is @ -- @ BEGIN :success := dbms_ha_alerts_prvt.check_ha_resources; END; @ .. @ 04/13/06 13:52:27 >ERROR: exception at @ dbms_ha_alerts_prvt.check_ha_resources637: SQLCODE -600,ORA-600: internal @ error @ code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], [] @ 04/13/06 13:52:27 >parameter dump for dbms_ha_alerts_prvt.check_ha_resources @ 04/13/06 13:52:27 > - local_db_unique_name (SFA) @ 04/13/06 13:52:27 > - local_db_domain (==N/A==) @ 04/13/06 13:52:27 > - rows deleted (0) @ -- @ . @ This check is triggered with FAN enabled at this instance and it seems to be @ associated with a startup action. From the procedure itself which is called @ this is a run-once MMON (startup) action which supports instance down @ notification reliability. It does the folowing a) registers the current @ instance incarnation in recent_resource_incarnations$ if it's not already @ there b) deletes recent_resource_incarnations$ records that don't apply to @ this database. They may, e.g., have been copied from seed db or from a former @ DataGuard primary c) scans recent_resource_incarnations$ for instance @ incarnations that are no longer alive, and submits instance down alerts for @ them . If all is good then return 'Y' else 'N' (or error) if there is a @ failure. That failure is to get back to MMON, so that it may retry this @ action later. In the local instance I get a 'Y' but in the customer's system @ it fails with a 'N' which seems related to the ORA-600 assert. Still @ reviewing. Thanks *** SSRAO 05/09/06 02:31 am *** @ Tracking this trace output @ -- @ 04/13/06 13:52:27 >parameter dump for @ dbms_ha_alerts_prvt.check_ha_resources @ 04/13/06 13:52:27 > - local_db_unique_name (SFA) @ 04/13/06 13:52:27 > - local_db_domain (==N/A==) @ 04/13/06 13:52:27 > - rows deleted (0) @ -- we have @ - the db_unique_name and domain printed @ - The SQL @ -- @ DELETE FROM recent_resource_incarnations$ @ WHERE db_unique_name <> local_db_unique_name @ OR db_domain <> local_db_domain; @ -- executed to delete rri$ records for other databases (e.g. seed db, former @ DataGuard primary). which ptints (0) rows so nothing was deleted here . For @ all suspected dead instances we have post_ha_alert() called with the @ arguments @ -- @ post_ha_alert(reason_id => dbms_server_alert. @ RSN_FAN_INSTANCE_DOWN, same_transaction => @ TRUE, @ clear_old_alert => TRUE, @ database_unique_name => local_db_unique_name, @ database_domain => alert_db_domain, @ instance_name => instance.instance_name, @ host_name => instance.host_name, @ event_reason => 'unknown', @ event_time => instance.shutdown_time); @ -- @ - which calls the assert. So far we know the dbunique_name is SFA and the @ domain is not set or N/A and hence '' by default. Thanks *** SSRAO 05/09/06 02:40 am *** @ The PLSQL code calls kjhn_post_ha_alert_plsql() which in turn calls @ kjhn_post_ha_alert0() which causes the assert so the parameters are being @ passed and called from the PLSQL function , we need to track these parameters @ for the failure to be able to isolate what is causing the assert. Still @ reviewing the code for tracing capabilities. Thanks *** SSRAO 05/09/06 05:33 pm *** @ It seems that v$instance is returning a null hostname and which also @ indicates why this package is returning a "F" which indicates a failure from @ their side. Also from the traces this might be owned by the HA side but these @ alerts are irrespective of whether RAC is present or not. @ Reference from kqfv.h we have the hostname come from KSUXSHST from @ x$ksuxsinst which in turn is populated from the callback of ksuxsrow(). @ This calls slkmnm() which in turn calls gethostbynam() to populate this @ information so the things we need to verify here are @ a) output of select * from v$instance @ b) show parameter instance_name @ . @ Thanks *** SSRAO 05/09/06 05:34 pm *** (CHG: Sta->10) *** SSRAO 05/09/06 05:35 pm *** @ Note that there might be two issues at play here @ a) The hostname is not being set or the instance_name is not being set @ b) The assert is not documented and should it depend on this condition @ Depending on the outcome of the provided data we might fork this into two @ separate bugs as they deal with different areas of the code. Thanks *** RSERNA 05/10/06 06:53 am *** (CHG: Sta->16) *** RSERNA 05/10/06 06:53 am *** @ SQL> select * from v$instance; @ . @ INSTANCE_NUMBER INSTANCE_NAME @ @ --------------- ---------------- @ @ . @ HOST_NAME @ @ ---------------------------------------------------------------- @ @ . @ VERSION STARTUP_T STATUS PAR THREAD# ARCHIVE @ LOG_SWITCH_WAIT @ ----------------- --------- ------------ --- ---------- ------- @ --------------- @ . @ LOGINS SHU DATABASE_STATUS INSTANCE_ROLE ACTIVE_ST BLO @ @ ---------- --- ----------------- ------------------ --------- --- @ @ . @ 1 sfa @ @ W8KKWR81 @ @ 10.2.0.1.0 10-MAY-06 OPEN NO 1 STOPPED @ @ ALLOWED NO ACTIVE PRIMARY_INSTANCE NORMAL NO @ @ . @ @ @ . @ . @ SQL> show parameter instance_name @ NAME TYPE VALUE @ @ ------------------------------------ ----------- @ ------------------------------ @ instance_name string sfa @ *** SSRAO 05/10/06 09:41 am *** @ I do not seem to understand this as both the hostname and the instance_name @ is populated correctly with v$instance and hence the callback ksuxsrow() is @ executing correctly to provide this information. Going to back into the @ package to see if I can find additional clues to capture some of the @ parameters passed. @ Is this a test setup ? Is the customer willing to take some diagnostic .plb @ scripts to execute and provide us with additional data ? Thanks *** SSRAO 05/10/06 09:41 am *** (CHG: Sta->10) *** RSERNA 05/10/06 10:34 am *** (CHG: Sta->16) *** RSERNA 05/10/06 10:34 am *** @ Yes customer is willing to take a diagnostic script or patch. *** SSRAO 05/15/06 11:41 am *** @ Working on a diagnostic. Thanks *** SSRAO 05/15/06 01:19 pm *** @ Reviewing the procedure check_ha_resources also @ - recent_resource_incarnations$ table is populated from v$instance @ - This is another table which contains the instance and the host_name @ - This could be another source from where the error is being reported @ - Please provide the output of @ -- @ set pages 1000 @ select * from recent_resource_incarnations$ @ -- @ - check_ha_resources() does @ - a) registers the current instance incarnation in @ recent_resource_incarnations$ if it's not already there b) deletes @ recent_resource_incarnations$ records that don't apply to this database. They @ may, e.g. have been copied from seed db or from a former DataGuard primary @ c) Scans recent_resource_incarnations$ for instance incarnations that are no @ longer alive, and submits instance down alerts for them. Breaking this down @ into function code we have @ - INSERT INTO recent_resource_incarnations$ as select * from v$instance @ - DELETE FROM recent_resource_incarnations$ @ WHERE db_unique_name <> local_db_unique_name @ OR db_domain <> local_db_domain; @ - Cursor dead_instances defines this as join from @ recent_resource_incarnations$ so it appears that there is already something @ out there which is causing these problems. Thanks *** SSRAO 05/15/06 01:19 pm *** (CHG: Sta->10) *** RSERNA 05/16/06 06:44 am *** (CHG: Sta->16) *** RSERNA 05/16/06 06:44 am *** @ File: recent_resource_incarnations.out uploaded. @ . @ Had customer execute: @ c:\hostname @ W8KKWR81 *** SSRAO 05/16/06 07:41 am *** @ From the provide traces this seems to be more clearer of why we are seeing @ this. The output reveals entries like @ -- @ INSTANCE 1 @ sfa @ SFA @ ==N/A== @ sfa @ . @ . @ ==N/A== @ -- where a normal entry is @ -- @ INSTANCE 1 @ sfa @ SFA @ ==N/A== @ sfa @ W8KKWR81 @ W8KKWR81 @ ==N/A== @ 09-JAN-06 04.09.36.000000000 AM @ -- so the hostname is missing for the above entries which is causing the @ problem. For this table itself it is supposed to insert an entry when an @ instance comes up and the cleanup is supposed to happen when it goes down or @ another instance detects this instance going down (if RAC). Since this is a @ single instance configuration the instance inserts this entry during startup @ and then deletes the same when the instance is being shutdown or on @ subsequent startup notes the state change. There are 57 rows in this table @ which in itself does not look right . All the entries except the one with the @ instance startup time around "10-JAN-06 08.17.00.000000000 PM" ar normal. @ This table is populated using v$instance so there was some problem around @ that time when the hostname was not listed which has caused this to manifest @ . @ Summary so far @ - hostname for entry "10-JAN-06 08.17.00.000000000 PM" is missing @ - Multiple entries in "recent_resource_incarnations$" which need cleanup @ - Dead instances are constructed using this list which is why the assert @ - post_instance_up() in MMON registers the current instance incarnation in @ recent_resource_incarnations$ if it's not already there and scans the same @ for instance incarnations that are no longer alive, and submits instance down @ alerts for them @ - check_ha_resources() - registers the db in recent_resource_incarnations @ (RRI) or deletes recent_resource_incarnations$ records that don't apply @ to this database. They may, e.g., have been copied from seed db or from a @ former DataGuard primary and scans RRI for instance incarnations that @ are no longer alive, and submits instance down alerts for them @ - clear_instance_resources() - It clears all incarnations for the given @ instance that started before the down event. @ - The only deletes I found for this table are @ -- @ DELETE FROM recent_resource_incarnations$ @ WHERE resource_type = 'INSTANCE' @ AND resource_name = clear_instance_resources.instance_name @ AND db_unique_name = database_unique_name @ AND db_domain = NVL(database_domain, '==N/A==') @ AND startup_time < SYS_EXTRACT_UTC(event_time); @ ... @ DELETE FROM recent_resource_incarnations$ @ WHERE db_unique_name <> local_db_unique_name @ OR db_domain <> local_db_domain; @ -- @ - Entries in this table are from "09-JAN-06 04.09.36.000000000 AM" to @ "16-MAY-06 01.13.51.000000000 PM" with no fixed pattern in between. @ - clear_instance_resources() is supposed to clean the entries out from the @ description but that is not happening. This is called from @ kjhn_clear_instance_resources() in kjhn.c which executes the following using @ OCI @ -- @ "BEGIN" @ " dbms_ha_alerts_prvt.clear_instance_resources(" @ " :dbdomain, :dbuniquename, :instance_name, :event_time);" @ "END;" @ -- @ - I did not spot any OCI errors but that is what might be happening. Since @ this in information for an alert this table can be purged (backup please) but @ this seems to be a re-curring case. @ - This is called from kjhn_post_ha_alert0() @ -- @ if ( reason_id == KELT_FAN_INSTANCE_DOWN ) @ { @ kjhn_clear_instance_resources(database_domain, database_unique_name, @ instance_name, event_time); @ } @ -- *** SSRAO 05/16/06 07:45 am *** @ The interesting thing is that the block which executes the code for ha_alerts @ and which returns "kjhn_post_ha_alert0-862" ORA-600 be much before the @ location where the assert is being processed. Since the code is in a KSETRY @ block when the ORA-600 is reported is does not process @ kjhn_clear_instance_resource() as that is the last funcction in the KSETRY @ block which explains why the superfluos entries for all recent days as its @ unable to clean it up when it gets posted by FAN or the reason_id of @ KELT_FAN_INSTANCE_DOWN. The root cause of this still looks like the entry of @ "09-JAN-06 04.09.36.000000000 AM" where the hostname is null. This is @ generated from gv$instance at that time and populated @ . @ Questions @ - Do we have the alert log dating that timeframe ? @ - Is there something specific about 9th Jan which the customer can tell us as @ the problems with the hostname have been fixed since then. Thanks *** SSRAO 05/16/06 07:45 am *** (CHG: Sta->10) *** SSRAO 05/16/06 07:47 am *** @ Also if the customer wishes to get rid of the ORA-600 the following steps @ as "sys" would help. @ -- @ - create table recent_resource_incarnation$_backup as select * from @ recent_resource_incarnation$; @ - truncate table recent_resource_incarnation$; @ - Please perform the truncate while this instance is up and running and not @ going down and perform a clean shutdown after this @ - This should stop the ORA-600's and subsequently the instance should have @ clean shutdowns with no errors. Please provide the infomation requested above @ too . Thanks *** RSERNA 05/16/06 08:51 am *** (CHG: Sta->16) *** RSERNA 05/16/06 08:51 am *** @ THANKS. I will give customer workaround because they definitely want traces @ to stop since they are large. @ i have uploaded alert log: alert_sfa_latest.log @ which covers Nov 2005 to April 2006. @ Something definitely happened in January. @ From customer @ "We build laptops from an image. The database on the image was created on @ November. But the laptop was imaged only in January." @ . @ I'm not sure how they do the image. *** SSRAO 05/16/06 01:07 pm *** @ Its hard to decipher this from the alert as the timeline is as follows @ - Sun Jan 08 23:11:02 2006 - shutdown @ - Tue Jan 10 15:17:00 2006 - startup @ - Nothing in the timeframe listed @ - Tue Jan 10 15:18:31 2006 - last startup @ - Wed Jan 11 14:27:06 2006 - startup so unclean shutdown @ - Wed Jan 11 14:29:36 2006 - ORA-600 reported @ - We need to know what happened between Jan 8th and the 11th to be able to @ conclude why v$instance did not return the valid hostname. If they imaged @ their laptop then is this imaged with no hostname initially set when the @ Oracle service came up. Also note the source is v$instance which pulls it @ from x$ksuxsinst which calls slkmnm()->gethostbyname() which is a common @ routine. The solution is to find out why the hostname was null on the @ "09-JAN-06 04.09.36.000000000 AM" which is not in the alert log too. Thanks *** SSRAO 05/16/06 01:07 pm *** (CHG: Sta->10) *** RSERNA 05/16/06 01:42 pm *** @ So far so good. They tried the suggested workarounds and no more errors. @ Customer stated "We used ghost image from microsoft to image all the laptops. @ We installed oracle on the host image and then slapped it on to other @ laptops from the host image." @ . @ All the laptops they have imaged in this way are experiencing the problem. @ So the problem may be on the original host. I will find out more. *** RSERNA 05/19/06 11:47 am *** (CHG: Sta->16) *** RSERNA 05/19/06 11:47 am *** @ Customer's latest update @ "I just got a new laptop that was created with the original image. @ I checked the table recent_resource_incarnations$ and it has @ the correct value for the host_name column. " @ So it appears that the original image used to image all the laptops doesn't @ always lead to the errors. Some laptops are fine, others are not. @ . @ They have also said "I checked with our desktop administrator and he told we @ can't query the database on our image directly" @ I'm not sure why. @ . @ How else can we diagnose this? *** RSERNA 05/19/06 12:11 pm *** @ I asked customer why they could not query the db directly on the image. Their @ reply was "Our helpdesk administrator is saying that, taking a look @ at copy of a brand new laptop is similar to seeing the image copy, which @ makes sense. Image copy copies everything, that is how we have built our 300 @ laptops for our sales force. ". Doesn't really answer the question, but @ thought I'd pass it along. Anyway, the same Image Copy is used on all the @ laptops, some had the problems, some did not. *** SSRAO 05/19/06 12:35 pm *** @ We might be at a deadend for this. Summarizing the problem so we know what we @ are looking for @ . @ - The routines called are for checking any instances are dead and for sending @ the appropriate notification @ - This scans the recent_resource_incarnations$ table for the details @ - recent_resource_incarnations$ is populated from gv$instance @ - The first instance (if RAC) which sees this clears the entry @ - The notification is sent by this instance as well and this is cleared @ - If the hostname is not populated we see this ORA-600 @ - The point which is not clear is how did gv$instance not have the hostname @ - gv$instance -> x$ksuxsinst -> slkmnm()-> gethostbyname() @ - gethostbyname() is generic call which gets the hostname from the OS @ - I do not see much from gethostbyname() or failure codes from it @ - The database alert log does not have enough information at that point @ - The imaging activity is external to the database @ - Clearing this table would prevent the assert @ - Root cause is why the null hostname came into the table and this cannot be @ deciphered with the information provided. @ . @ The reason I'm passing this bug is when there is an unclean shutdown the @ cursor dead_instances is scanned and picked up from where a stale entry with @ a bad hostname will cause every subsequent startup to assert with no cleanup. @ I believe that this aberrational condition could be handled with cleanup of @ the record and raising the ORA-600 , nonetheless not being able to send a @ notification is not probbaly as serious to raise so many asserts. @ The hostname issue cannot be debugged further without further details of what @ happened on that day. BDE Screening complete *** SSRAO 05/19/06 12:35 pm *** (CHG: Sta->11) *** BUGPATCH 05/19/06 12:35 pm *** (CHG: Asg->RPARK) *** BUGPATCH 05/19/06 12:35 pm *** @ Assigned to RDBMS/RAC area queue owner @ by SSRAO via the ST Bug Assignment Tool *** TAKIBA 05/21/06 08:52 pm *** @ Overflow queue screening. @ . @ I think it's better to investigate from viewpoint of PL/SQL. @ The host_name was already corrupted on the table @ recent_resource_incarnations$. Therefore, there are 2 possibilities. @ . @ 1. v$instance gave the empty host_name @ . @ 2. v$instance gave a correct host_name, but Pl/SQL function @ rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources @ corrupted the value. @ . @ [quoted from rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources @ L562] @ ============================================================ @ INSERT @ INTO recent_resource_incarnations$ @ ( resource_type, resource_id, resource_name, db_unique_name, @ db_domain, instance_name, host_name, startup_time, location, @ incarnation ) @ SELECT 'INSTANCE', instance_number, instance_name, @ local_db_unique_name, @ local_db_domain, @ instance_name, host_name, @ SYS_EXTRACT_UTC( @ instance_startup_timestamp_tz(vi.startup_time)), @ host_name, '==N/A==' @ FROM v$instance vi; @ ============================================================ *** TAKIBA 05/21/06 08:55 pm *** (CHG: Asg->PLSREP Prod->11 Comp->PLSQL SubComp->) *** TAKIBA 05/21/06 08:55 pm *** @ I guess they can easily route the bug to port-specific (v$instance @ issue probably in slkmnm()) or SQL execution. *** AMANGAL 05/22/06 10:53 am *** (CHG: Asg->RDBMSREP Prod->5 Comp->RDBMS) *** AMANGAL 05/22/06 10:53 am *** (CHG: Asg->RPARK) *** AMANGAL 05/22/06 10:55 am *** @ This needs to be investigated by the owners of check_ha_resources . *** TAKIBA 05/28/06 04:34 am *** (CHG: SubComp->RAC) *** TAKIBA 05/28/06 04:34 am *** @ Put the subcomponent RAC for now. *** BUGPATCH 06/05/06 10:21 pm *** (CHG: Asg->TAKIBA) *** BUGPATCH 06/05/06 10:21 pm *** @ Assigned from the area queue to TAKIBA via the ST Bug Assignment Tool *** TAKIBA 06/07/06 02:46 am *** @ I describe the 2 possibilities again. @ . @ The host_name was already corrupted on the table @ recent_resource_incarnations$. Therefore, there are 2 possibilities. @ . @ 1. v$instance gave the empty host_name @ . @ If GetComputerName() returned an error, slkmnm() can return a @ NULL string. @ . @ [quoted from rdbms/src/common/osds/slkmnm.c@@/RDBMS_10.2.0.1.0_NT_RELEASE @ slkmnm() L150] @ ============================================================ @ len = mnm_l - 1; @ if (GetComputerName(mnm, &len) == FALSE) @ { @ *mnm = '\0'; @ return(0); @ } @ else @ return(len); @ ============================================================ @ . @ 2. v$instance gave a correct host_name, but Pl/SQL function @ rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources @ corrupted the value. @ . @ [quoted from rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources @ L562] @ ============================================================ @ INSERT @ INTO recent_resource_incarnations$ @ ( resource_type, resource_id, resource_name, db_unique_name, @ db_domain, instance_name, host_name, startup_time, location, @ incarnation ) @ SELECT 'INSTANCE', instance_number, instance_name, @ local_db_unique_name, @ local_db_domain, @ instance_name, host_name, @ SYS_EXTRACT_UTC( @ instance_startup_timestamp_tz(vi.startup_time)), @ host_name, '==N/A==' @ FROM v$instance vi; @ ============================================================ *** TAKIBA 06/07/06 10:04 pm *** (CHG: Asg->NKOREHIS) *** TAKIBA 06/07/06 10:04 pm *** @ Assigning to Korehisa-san, to investigate from viewpoint of @ Windows RAC. *** NKOREHIS 06/13/06 11:20 pm *** (CHG: DevPri->2) *** NKOREHIS 06/13/06 11:20 pm *** (CHG: Confirmed Flag->Y) *** KEOCHI 07/30/06 02:39 pm *** @ Sorry for rushing you.. @ My customer hits similar problem, and then BUG:5403213 for my customer is @ marked as duplicate of this bug. @ How the work is going? Thanks. *** NKOREHIS 07/31/06 11:24 pm *** @ slkmnm calls GetComputerName WIN32API. GetComputerName could fail if @ buffer is too small to store the computer name. @ . @ on my local env, it is called with the following stack when select @ from v$instance. @ . @ > oracommon10.dll!slkmnm(slerc * se=0x0b58d5e4, @ unsigned char * mnm=0x0abd7ebc, @ unsigned int mnm_l=0x00000040) @ oracle.exe!_ksuxsrow() + 0x5a @ oracle.exe!_qerfxFetch() + 0x3da @ oracle.exe!_rwsfcd() + 0x5f @ oracle.exe!_qerjotFetch() + 0xf3 @ oracle.exe!_rwsfcd() + 0x5f @ oracle.exe!_qerjotFetch() + 0xf3 @ oracle.exe!_opifch2() + 0xc20 @ oracle.exe!_kpoal8() + 0xcf0 @ oracle.exe!_opiodr() + 0x44e @ oracommon10.dll!_ttcpip() + 0x4fc @ oracle.exe!_opitsk() + 0x3f9 @ oracle.exe!_opiino() + 0x444 @ oracle.exe!_opiodr() + 0x44e @ oracle.exe!_opidrv() + 0x338 @ oracle.exe!_sou2o() + 0x32 @ oracle.exe!_opimai() + 0x171 @ oracle.exe!_opimai() + 0x61 @ oracle.exe!_OracleThreadStart@4() + 0x2c9 @ kernel32.dll!_BaseThreadStart@8() + 0x37 @ . @ in ksuxsrow, call to slkmnm is made as follows. @ . @ DISCARD slkmnm(&se, xsc->ksuxshst, sizeof(xsc->ksuxshst)); /* @ host name */ @ . @ ksuxshst is defined as follows. @ . @ text ksuxshst[SKGP_HOSTNAME_LEN]; /* host @ machine name */ @ . @ in skgp0.h, SKGP_HOSTNAME_LEN is defined as follows. @ . @ #define SKGP_HOSTNAME_LEN SKGP_HOSTNAME_DEFAULT @ /* operating system name @ buffer size */ @ . @ SKGP_HOSTNAME_DEFAULT is 64 as defined in skgp.h @ . @ #define SKGP_HOSTNAME_DEFAULT 64 /* host name @ buffer size */ @ . @ So, GetComputerName should not be failed due to the buffer size in this @ case. *** NKOREHIS 07/31/06 11:29 pm *** (CHG: Sta->30) *** NKOREHIS 07/31/06 11:29 pm *** @ As the slkmnm does not return any errors though GetComputerName fails, @ we might need figure out what fails the function using debug patch. @ Please let me know if the ct would agree to apply the debug patch. *** RSERNA 08/01/06 06:19 am *** (CHG: Sta->11) *** RSERNA 08/01/06 06:19 am *** @ Yes, customer is willing to apply a debug patch. *** NKOREHIS 08/10/06 04:55 am *** @ working on the debug patch. *** NKOREHIS 08/10/06 09:16 pm *** (CHG: Sta->30) *** NKOREHIS 08/10/06 09:16 pm *** @ the debug patch is available at bugftp:/upload/bug5173066/5173066.zip. @ . @ 1. How is the diagnostic information activated and de-activated: @ Once installed the patch, it is automatically activated. @ 2. What new information is being traced and when is it triggered? @ Error information from GetComputerName will be traced in the @ alert log. @ 3. What is the performance impact of the new diagnostic tracing? @ No performance impact is expected. @ 4. What is expected from the generated diagnostic information and @ where will this lead. @ The patch is provided to confirm if GetComputerName is failed @ in slkmnm. If so, it would show what is the OS error from the @ call. Please provide alert log once the error reproduced. *** RSERNA 08/18/06 07:15 am *** (CHG: Sta->11) *** RSERNA 08/18/06 07:15 am *** @ From customer: @ "When i tried to start the Oracle database service, OracleDBService, it did @ not start. It failed with a generic windows error,Couldn't start the service. @ " @ . @ They followed the README that came with the patch and could not even start @ services. Backing out the patch, everything was fine again. *** NKOREHIS 08/22/06 04:06 am *** (CHG: Sta->30) *** NKOREHIS 08/22/06 04:06 am *** @ The patch worked fine on my development environment. Would you please @ describe what error exactly the ct got? *** RSERNA 08/28/06 08:12 am *** (CHG: Sta->11) *** RSERNA 08/28/06 08:12 am *** @ Customer attempted patch install again and same errors. @ I've uploaded screenshots from the customer. File:patch_error_screenshot.doc *** NKOREHIS 08/30/06 09:08 pm *** (CHG: Sta->30) *** NKOREHIS 08/30/06 09:08 pm *** @ rebuilt the diagnostic patch changing compiler options. @ Would you please verify if it works on the customer's env? @ It is available on bugftp:/upload/bug5173066/5173066_2.zip *** NKOREHIS 08/31/06 06:16 pm *** @ basically, why the patch failed to startup is still unknown. the @ major difference between the debug version and the release version is @ the compiler options for the patched file. The newer version is to @ verify if it could lead to the failure. diagnostic code for the @ original problem is present as well. *** RSERNA 09/05/06 01:36 pm *** (CHG: Sta->11) *** RSERNA 09/05/06 01:36 pm *** @ The customer still had problems applying this Diag patch. @ See file:screen_shot_of_second_patch_application.doc @ that was uploaded which shows the customer's output following step-by-step @ instructions from the readme. *** NKOREHIS 09/13/06 01:38 am *** @ built the patch on another development environment. regression @ test is running. *** NKOREHIS 09/13/06 07:18 am *** (CHG: Sta->30) *** NKOREHIS 09/13/06 07:18 am *** @ put the new diagnostic patch to bugftp:/upload/bug5173066/5173066_3.zip @ This time, I zipped the patch with orageneric10.dll and @ oraclient10.dll which exist on the development environment for @ RDBMS_10.2.0.1.0_NT_RELEASE. I confirmed the patch worked on my local @ testing env. Please try the new patch. *** RSERNA 09/14/06 10:49 am *** (CHG: Sta->11) *** RSERNA 09/14/06 10:49 am *** @ Customer applied diag patch and started DB. @ Alert log and mmon trace uploaded: alert_trace_14Sep06.ZIP *** NKOREHIS 09/14/06 10:20 pm *** @ alert log does not show any output from the diagnostic code. this @ means that GetComputerName is not the culprit. we might need @ another diagnostic. *** NKOREHIS 09/26/06 04:33 am *** (CHG: Sta->30) *** NKOREHIS 09/26/06 04:33 am *** @ I added some more diagnostic code to ksuxsrow to see if xsc->ksuxshst @ has a valid value after returning from slkmnm. the patch is in @ bugftp:/upload/bug5173066/5173066_4.zip. Would you please have the @ ct apply the patch? please provide alert log once the symptom reproduced. *** RSERNA 09/26/06 05:35 am *** (CHG: Sta->11) *** RSERNA 09/26/06 05:35 am *** @ Hi, there is no file in: bugftp:/upload/bug5173066/5173066_4.zip @ . @ ftp> pwd @ 257 "/upload/bug5173066" @ ftp> get 5173066_4.zip @ 200 PORT command successful. Consider using PASV. @ 550 Failed to open file. @ ftp> ls 5173066*.zip @ 200 PORT command successful. Consider using PASV. @ 150 Here comes the directory listing. @ 5173066.zip @ 5173066_2.zip @ 5173066_3.zip @ 226 Directory send OK. *** NKOREHIS 09/26/06 06:13 pm *** (CHG: Sta->30) *** NKOREHIS 09/26/06 06:13 pm *** @ Sorry I re-uploaded the patch. @ . @ ftp> ls 5173066*.zip @ 200 PORT command successful. Consider using PASV. @ 150 Here comes the directory listing. @ 5173066.zip @ 5173066_2.zip @ 5173066_3.zip @ 5173066_4.zip @ 226 Directory send OK. *** RSERNA 09/28/06 10:10 am *** (CHG: Sta->11) *** RSERNA 09/28/06 10:10 am *** @ Diag Patch has been applied, latest alert log: alert_Sep28.log @ has been uploaded. *** NKOREHIS 09/28/06 06:10 pm *** @ The following diagnostic trace indicates that MMON tid=1540 did not @ fail to get hostname in ksuxsrow. the hostname was vaild until the @ function finished. However the internal error singalled in the @ same thread. @ . @ MMON:1540:ksuxsrow: ksuxshst = W8KKWR81 @ MMON:1540:ksuxsrow: ksuxshst = W8KKWR81 @ MMON:1540:ksuxsrow: ksuxshst = W8KKWR81 @ Thu Sep 28 08:05:04 2006 @ Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1540.trc: @ ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [], @ [], [], [], [], [] *** NKOREHIS 09/28/06 06:13 pm *** (CHG: Asg->RDBMSREP) *** NKOREHIS 09/28/06 06:13 pm *** @ back to generic RAC to investigate check_ha_resources. *** BUGPATCH 09/28/06 06:18 pm *** (CHG: Asg->EMERITT) *** BUGPATCH 09/28/06 06:18 pm *** @ Assigned to RDBMS/RAC area queue owner @ by NKOREHIS via the ST Bug Assignment Tool @ . @ No problem found in GetComputerName, slknmn, ksuxsrow. However the internal @ error signalled. We should verify if check_ha_resources works as expected. @ . *** MMPANDEY 10/03/06 11:46 pm *** (CHG: Asg->MMPANDEY) *** MMPANDEY 10/03/06 11:46 pm *** @ This not exactly related to RAC, but as I have worked on a similar @ problem, I would review. *** RSERNA 10/25/06 06:19 am *** @ Please provide an update. Thanks. *** MMPANDEY 11/02/06 10:05 am *** @ I am just back from a 2-week vacation and would be reviewing the bug @ soon, alongwith others. Thanks. *** RSERNA 11/28/06 05:56 am *** @ Please provide an update. Thanks. *** MMPANDEY 12/03/06 08:17 pm *** @ I would do so, shortly. Thanks. *** MMPANDEY 12/13/06 07:04 am *** @ I am looking at the traces now. A similar problem is seen when the @ oracle_sid name exceeds 17 characters, which does not seem to be @ the case here. Looking at the possibilities of internal error being @ raised in ksuxsrow(). *** RSERNA 01/03/07 08:41 am *** @ Please provide an update. Thanks. *** MMPANDEY 01/19/07 02:17 pm *** @ The code to raise the error appears to have changed a bit in the later @ labels, i.e. we are no longer checking the condition (*host_name) @ . @ KSEORAASSERTNM(instance_name && *instance_name && host_name, @ OERINM("kjhn_post_ha_alert0-862")); @ . @ I am verifying this fact. *** KNEEL 01/19/07 05:47 pm *** (CHG: Asg->RSERNA) *** KNEEL 01/19/07 05:47 pm *** @ The new row being inserted by check_ha_resources should not be the one @ causing the assert; the assert is caused when attempting to delete an old row @ with null hostname. Are new rows in recent_resource_incarnations$ still @ getting null host name; if not, when is/are the bad row(s) from? Is this @ still the row from Jan 2006? @ . @ Also, this is XP, and not an earlier version, right? Was this db ever run on @ a pre-XP version, or on a system with a long host or network name? @ . @ We may never be able to solve where that original bad row came from, and it's @ not clear that we should bother customer with more diagnostic patches. If, on @ the other hand, the problem is repeating -- i.e. new rows with null hostname @ -- it would be good to figure out why. @ . @ If necessary, we can certainly backport the change to ignore null hostname. @ But it would be nice to know if the null hostname is still being generated; @ i.e. if another problem exists. @ . @ That change would include the line Mukul (MMPANDEY) suggests, but it also @ needs to include the following change: @ . @ reasonargs[KJHN_REASON_ARG_HOST] = host_name; @ => @ if ( *host_name ) @ reasonargs[KJHN_REASON_ARG_HOST] = host_name; *** RSERNA 01/23/07 06:27 am *** (CHG: Sta->33) *** RSERNA 01/23/07 06:27 am *** @ Customer no longer wishes to pursue. @ They always clone their machines so its something specific to the original @ one which they can not reproduce. *** MMPANDEY 01/23/07 09:37 am *** (CHG: Sta->30 Asg->MMPANDEY) *** MMPANDEY 01/23/07 09:37 am *** @ Assigning it back, looks like this bug got reassigned to Rowena when @ we needed the information. @ . @ Hi Rowena, @ . @ One question. I think they were able to reproduce this problem, as we @ were able to apply the diag patch, cf. [NKOREHIS 09/28/06 06:10 pm], @ isn't it? If the customer still has the backup, I was wondering if @ they could retry? @ . @ I think that we should give the patch of the changes to the customer, @ and I will get this verified by Kevin. Thanks. *** MMPANDEY 01/23/07 12:14 pm *** @ Hi Rowena, @ . @ Kevin has the following comments, about the customer's situation and @ the workaround (to truncate recent_resource_incarnations$): @ . @ So was this a permanent fix -- i.e. they have fixed the clone, so that @ they no longer need to do this? Or are they still doing these workarounds? @ I.e. ould we be creating work for them to apply the patch, or saving @ them work/making their processes more efficient? To some degree, that's @ up to hem. @ . @ Could you please request the customer to reply? Thanks. *** MMPANDEY 01/24/07 02:46 pm *** (CHG: Fixed->11.1.0) *** MMPANDEY 01/24/07 02:46 pm *** (CHG: Sta->35) *** MMPANDEY 01/24/07 02:46 pm *** @ Closing the bug, following Kevin's reply on my email: @ . @ "I don't think the "bad laptop" is the source of the "infection" of other @ boxes; the source is the image made last January. The "bad laptop" is simply @ one which has that bad image, without the workaround. It is a victim of the @ illness -- the only one who has not taken the cure -- but is not a carrier, @ infecting others. They just left it infected in case we wanted to look at the @ living illness. Unfortunately, we can't learn much from looking at the @ illness, since the entry path of the original virus didn't leave any trace. @ We can't see how the immune system was weakened; we just know that it was @ weakened at some point, and has returned to normal. IF they wanted a @ workaround patch [see below], it could be a good machine on which to test it. @ . @ I'm not surprised that they don't want any more diagnostic patches. Even if @ we succeed in figuring out what condition/event caused the original bad @ row(s), that is not useful for them, since that condition/event in itself is @ not recurring. What recurs is the interaction between the bad row(s) and our @ assert. So they have essentially three choices: @ . @ 1. continue with the workaround @ 2. create a new image without bad rows (seems they don't want to do this) @ 3. we can give them a workaround patch -- which presumably would have to @ be patched into their image -- to ignore the bad rows rather than assert @ . @ So if they want a workaround patch (#3) -- not a diagnostic one -- we can do @ that. But I suspect that it may be more work for them to get it into their @ image (and possibly existing boxes) than to keep their workaround. @ . @ It sounds like they have somehow automated the workaround as part of their @ processes (e.g. part of a script run after reimaging), and have little need @ to give themselves more work to install a different workaround (i.e. patch). @ . @ As for the bug, you could mark it fixed in 11.1." @ . @ Thanks.