Monday, January 3, 2022

RDBMS -- TRIGGER -- ORA-06502: PL/SQL: numeric or value error: character string buffer too small / a WEIRD CASE

This is about an interesting problem that we encountered after migrating a critical database from HP-UX to Exadata.. I must admit it was very challenging; the solution touched the foundations of our problem-solving mechanism and probably improved that mechanism, so that we can solve these types of problems faster in the future.

Before telling the story, I just want to remind you that:

We are just trying to be Bayesians (holding our subjective beliefs) and to have priors on things..

In addition to that, there is something I can consider a weakness. That is: we can find ourselves in a desperate situation if we get false evidence (or rather, if we happen to misinterpret the evidence) that touches the foundation of the decision-making mechanism we use for finding the cause of the problem we are dealing with. This type of evidence can be an illusion caused by an insufficiently descriptive error message, by lack of information, or by lack of knowledge..

Two more words before we start:

-What you see instantly is not the whole nature of the mechanism.

-Make sure you didn't come across a non-representative part of the distribution.

:)

Okay, now let's get down to business and look at our problem;

We migrated a mission-critical database (running in a critical environment) to a newly deployed Exadata X8 using XTTS (Cross-Platform Transportable Tablespaces, accelerated with RMAN incremental backups)..

Actually, we had already done maybe 10 Exadata migrations in this customer environment, but in this case we had ORA-06502 errors all over the place.. I mean, once we enabled the 6502 trace (alter session set events '6502 trace name errorstack forever, level 3') at the database level, we saw that more than one PL/SQL unit was getting these types of errors.. (probably overlooked by the testing team..)
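For reference, here is a sketch of enabling that errorstack dump. The session-level form is the one mentioned above; the system-wide form is useful when the failing code runs from background jobs rather than your own session:

```sql
-- Dump an errorstack trace whenever ORA-06502 is raised.
-- Session level (as used in this case):
ALTER SESSION SET EVENTS '6502 trace name errorstack forever, level 3';

-- Instance-wide (affects new sessions) -- handy when the failing code
-- is executed by DBMS_SCHEDULER jobs rather than your own session:
ALTER SYSTEM SET EVENTS '6502 trace name errorstack forever, level 3';

-- Turn the event off again when done:
ALTER SYSTEM SET EVENTS '6502 trace name errorstack off';
```

The resulting trace files (containing the current SQL statement and the PL/SQL call stack, as shown below) land in the diagnostic trace directory.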

Anyway, the error stack was:

ORA-06502: PL/SQL: numeric or value error: character string buffer too small

----- Current SQL Statement for this session (sql_id=1mspsdarsts3r) -----

UPDATE DENEME20 SET TEST_KODU =:B3 , DENEM_KODU =:B2 WHERE SOZLESME_NO =:B1 

----- PL/SQL Stack -----

----- PL/SQL Call Stack -----

  object      line  object

  handle    number  name

0x574be81d0        66  TEST.COMM_BEFORE_UPDATE

0x3d0285ac0       152  package body TEST.UA_DO_IT 

0x46a4c0dc0         2  anonymous block

Okay, here we go.. Let's see what we did to find the solution, the evidence we collected, and the actual cause and the solution :)

1)

It was clear that the Before Update trigger TEST.COMM_BEFORE_UPDATE was failing at line 66.

We checked that line 66 by using tools like Toad and views like DBA_SOURCE.. (This was the actual mistake, actually.. We will come back to this topic later.) Line 66 appeared to be a "select into" statement.. So the trigger was written to query a table and put the value it gets into a variable, which had exactly the same length as the relevant table column.
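The check itself was a simple lookup like the following (a sketch; the owner and trigger names are the ones from this case):

```sql
-- Look at what appears around line 66 of the trigger source.
-- Caution: for triggers, these line numbers cover the full source,
-- including the trigger definition lines -- which, as it turns out
-- later in this post, is exactly the trap.
SELECT line, text
  FROM dba_source
 WHERE owner = 'TEST'
   AND name  = 'COMM_BEFORE_UPDATE'
   AND type  = 'TRIGGER'
   AND line BETWEEN 60 AND 70
 ORDER BY line;
```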

2)

The target database was created with different NLS language and territory settings than the source, so those parameters were changed afterwards. On the other hand, as you already know, NLS_DATABASE_PARAMETERS cannot be changed after database creation.. But! This should not be a problem, because those parameters were changed at the instance level..
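To make the distinction concrete, Oracle tracks NLS settings at three levels, and they can be compared side by side (a sketch, limited to the parameters relevant here):

```sql
-- Database level: fixed at creation time, cannot be changed afterwards.
-- Instance level: from the init parameters, changeable.
-- Session level: what the current session actually uses.
SELECT 'DATABASE' AS scope, parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_LANGUAGE', 'NLS_TERRITORY', 'NLS_CHARACTERSET')
UNION ALL
SELECT 'INSTANCE', parameter, value
  FROM nls_instance_parameters
 WHERE parameter IN ('NLS_LANGUAGE', 'NLS_TERRITORY')
UNION ALL
SELECT 'SESSION', parameter, value
  FROM nls_session_parameters
 WHERE parameter IN ('NLS_LANGUAGE', 'NLS_TERRITORY')
 ORDER BY 1, 2;
```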

Still, this situation created a big question mark for us.. Because the "select into" that we had identified by checking line 66 of that trigger could fail because of a character conversion problem.. Besides, we had some Oracle Support documents to support this idea.. That is, in some Oracle Support notes, we saw that a wrong NLS_LANG could indirectly trigger some bugs in the code..

3)

We also suspected the NLS length semantics settings. We checked them both at the database level and at the PL/SQL object level (DBA_PLSQL_OBJECT_SETTINGS, which records the value set at compile time), but they were set to BYTE everywhere.. Both in the source and in the target..
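Those two checks look roughly like this (a sketch; the TEST owner is the one from this case):

```sql
-- Length semantics at the instance level:
SELECT value
  FROM nls_instance_parameters
 WHERE parameter = 'NLS_LENGTH_SEMANTICS';

-- ...and as compiled into each PL/SQL unit (captured at compile time,
-- so a unit can differ from the current instance setting):
SELECT owner, name, type, nls_length_semantics
  FROM dba_plsql_object_settings
 WHERE owner = 'TEST'
 ORDER BY name, type;
```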

4)

Most of the failing PL/SQL code was triggered by DBMS_SCHEDULER.. So there were database jobs. They were not DBMS_JOB jobs, but they were still scheduled using the database engine, and it was the database itself that was executing this failing code. Although the NLS settings of these scheduler jobs were correct, we suspected that maybe the database was setting those NLS parameters wrongly at execution time.. Maybe DBMS_SCHEDULER was still using NLS_DATABASE_PARAMETERS.. (There is no documentation about that, by the way, so it was just an educated guess at that moment :)
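In fact, DBMS_SCHEDULER stores the NLS environment that was in effect in the session that created each job, and it can be inspected directly (a sketch; the TEST owner is from this case):

```sql
-- Each scheduler job carries its own NLS environment, captured from
-- the creating session; this is what the job runs with.
SELECT owner, job_name, nls_env
  FROM dba_scheduler_jobs
 WHERE owner = 'TEST'
 ORDER BY job_name;
```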

5)

A SQL Developer tool, installed on another client machine (different from our client machine, where we had our other database tools installed), could run those PL/SQL units without any problems! I mean direct executions.. (not using DBMS_SCHEDULER).. However, DBMS_SCHEDULER was still failing, and we were still failing too -- even when we ran those PL/SQL units directly, we were getting ORA-06502.. But SQL Developer on that client could run them without any problems!

We also checked the NLS preferences of that SQL Developer and saw that they were set correctly. We also saw that even when we changed the NLS settings of the session to a wrong NLS language and territory, SQL Developer could still run those problematic PL/SQL units without getting any errors.. (Note that this was a flawed test.. SQL Developer actually gets its NLS environment from its preferences, so it was no surprise :)

6)

We had similar cases in other database environments (core banking environment, reporting environments) and we had a memory of them -- a memory telling us these kinds of errors had already been solved earlier. It was telling us -> just check the NLS settings :) -- what an intuition :)

Well, you see how all these things brought us closer and closer to the NLS settings, but they also led us to ineffectiveness and a dead end.

The actual solution came at this moment; we suddenly got the following info (we recognized something):

In the call stack, we had actually gotten the correct line number, but our interpretation was wrong;

I mean, our error stack told us that line 66 of that Before Update trigger was failing, and that was correct.

But! Line 66 was not that "select into" statement.

The line info in DBA_SOURCE or Toad didn't give us the correct picture..

In the case of triggers, the source includes both the trigger definition and the PL/SQL block, so it is in the form of "trigger definition + PL/SQL block".

ORA-06502 is a PL/SQL error, so line 66 must be counted within the PL/SQL block.. So we need to discard those definition lines and start counting from the word DECLARE...
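To illustrate with a simplified, hypothetical trigger (not the customer's actual code): DBA_SOURCE and Toad number the lines from the top of the trigger source, while the PL/SQL call stack starts counting at the PL/SQL block.

```sql
CREATE OR REPLACE TRIGGER test.comm_before_update
BEFORE UPDATE ON test.deneme20          -- definition lines: counted by
FOR EACH ROW                            -- DBA_SOURCE / Toad, but NOT by
                                        -- the PL/SQL call stack
DECLARE                                 -- PL/SQL line 1: the error stack
  v_value VARCHAR2(30);                 -- PL/SQL line 2  starts counting here
BEGIN
  v_value := some_custom_function();    -- PL/SQL line 4 (hypothetical helper)
END;
/
```

So "line 66" in the error stack means line 66 counted from DECLARE, not line 66 of the stored source.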

In that case, we found that the failing line was a variable assignment.. An assignment done using the return value of a custom function, and that custom function was the real cause, because it had a fixed-length string variable for storing the hostname. After the Exadata migration, the hostname (FQDN) got longer, so it simply couldn't fit, and that's what caused the error. A SUBSTR modification or a variable-size modification let us work around this bug in the custom code, but at the end of the day, it made us learn something important and made us update our priors.
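The mechanism can be reproduced in a few lines. Note that the variable name and size here are hypothetical, and SYS_CONTEXT is just one common way custom code reads the hostname; only the failure mode matches the case:

```sql
DECLARE
  -- SYS_CONTEXT('USERENV','HOST') returns the host name of the machine
  -- the session connected from; for a DBMS_SCHEDULER job, that is the
  -- database server itself -- i.e., the new, longer Exadata FQDN.
  v_host VARCHAR2(10);  -- fixed length, sized for the old short hostname
BEGIN
  v_host := SYS_CONTEXT('USERENV', 'HOST');
  -- raises ORA-06502 as soon as the hostname exceeds 10 characters
END;
/

-- The two workarounds mentioned above:
--   1) enlarge the variable:   v_host VARCHAR2(64);
--   2) or truncate the value:  v_host := SUBSTR(SYS_CONTEXT('USERENV','HOST'), 1, 10);
```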

--Note: That SQL Developer was running on a client machine with a short hostname; that's why the value could fit into the relevant variable, and that's why that SQL Developer didn't get any errors during our test runs.

Okay, that's it! But still, look back at the first paragraphs and think before you finish reading :)



If you have a question, please don't ask it in the comments here..

For your questions, please create an issue in my forum.

Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html

Register and create an issue in the related category.
I will support you from there.