Tuesday, December 31, 2013

Database, Applications, OS and General IT experiences from the Journal(year 2013)

  1. 22 incidents seen and recorded in 2013 :

    1. EBS Server Migration --- dependencies..
      Yesterday night I migrated an EBS environment (app+db) to a new server.
      Both old and the new servers were using the same Operating Systems (Sun os 10). The directory structure was the same. Even the operating system usernames and the target paths were the same.
      After copying datafiles and application files , configuring the environment and opening the EBS , there was no problem at all. Everyting was normal, System was healthy.

      On the other hand, I was expecting a problem, related to dependency between software and OS and maybe the hardware. There must be a dependency( there shouldnt actually but always there is:) ) . Finally that little problem showed itself. There was a jar file, used to upload sqlloader datafiles to the server in order to prepare data for the sqlloader . This sqlloader process was triggered from Concurrent Program for loading data to EBS, and the Concurrent Set that contains this Concurrent Program was completing with errors.

      The problem was operating system account passwords.. Passwords was not the same in the new server. This java program was using operating system account to login to Solaris and upload the text files.
      I changed the passwords on New Server to be the same as they are in Old Server and problem disapeared.
    2. Linux Redirection on Backup scripts

      Yesterday night I realized that some bash scripts used in Production systems to backup the databases, have logging lines such as

      cp * location >> logfile;

      When checking Logfiles of these backups, as expected; no errors can be seen.
      But it does not mean that any errors are produced during these backups..
      Because using ">>", script just redirects stdout appending to file.
      For logging errors+output generated during the runtime of the script, and appending " cp * location >>logfile 2>&1 " must be used.

      starting with bash 4 "cp - location" &>>logfile can also be used for this operation. But this method will not be backward compatible.
    3. ORA-00600 [kewrose_1] ORA-00600 [ktsplbfmb-dblfree] on

      Today, ORa-0600 kewrose_1 and ktsplbfmb-dblfree erros are produced on a customer using on Exadata X2-2.
      After investigating the incident, I found the following sql was the cause the error produced.

      {INSERT INTO wrh$_sql_plan sp (snap_id, dbid, sql_id, plan_hash_value, id, operation, options, ....

      I made a search on Oracle Support and found a few documents that addresses this bug. One of the document actually addresses our situation as it s written for and does not include any fix or workaround.

      As a workaround I suggest my collegue(who ise responsible for this site) rebuild the AWR repository, because the wrh$.. table is an awr table.

      We will see if rebuilding the AWR will fix the error.
    4. Yes, Rebuilding the AWR repository fixed the issue..
    5. Enterprise Manager Tablespace Usage Threshold
      If Warning Threshold value is bigger than the Critical Threshold, Enterprise Manager updates the default tablespace thresholds as "not defined". This is because the threshould are logically wrong. Warning Threshold value cannot be bigger than Critical.

      This behaviour of Enterprise Manager occupied our time, as we thought that the mail infrastructure of Enterprise Manager is not working properly.
      We then noticed that the "not defined" threshold in the tablespaces, and fixed the issue by setting lower value to Warning Threshold...
    6. Oracle EBS "I" and "l" character conversion

      OM Debug File Retrieval diagnostic was failing with the error cat: "0652-050 Cannot open /usr/tmp/I0001064.dbg.", altough the file was in /usr/tmp..
      Actually there was a big difference in filename that is requested and actually exist. Actual file name was starting with "l", but requested file was starting with "I"..

      The debug note was already mentioning about this..(How to generate a debug file in OM [ID 121054.1])

      To retrieve the debug file, navigate to Order Management > Run Requests. Then choose:
      Diagnostics: OM Debug File Retrieval.
      Parameters: Give the name of the debug file noted in Step 3 above. Only give the file name, not the directory path. Remember the first letter in the file name is lower case 'L'

      The filename was entered by the application user wrongly (I and l characters..) That is, why the error was produced..
    7. Something like Concurrent Manager Internals

      Today , I faced a strange behaviour in Concurrent Manager shutdown process.
      In one of my customers, Concurrent Managers wer not shutting down, using adcmctl.sh.
      I tried with adcmctl.sh even with abort but all the concurrent managers and their OS process were just remain running. (even in OS).
      After spending some time, I realized that Concurrent Managers shutdown was implented thorugh a concurrent request (name Shutdown).
      This Shutdown concurrent request was sitting in the queue of the Internal Concurrent Manager ( in Pending Standby -> Phase and Status)

      After some investigation, I saw that concurrent managers actually were shutdown, but after 10-15 minutes..Right, Shutdown concurrent request was processed, but the question and the problem was obvious.. What did take this much time?

      At this point, I decided to look into the database.. I investigated sessions and saw that when shutdown request was in standby, the ICM was waiting enq-tx , which is a lock wait.
      The blocker was FNDSCH. I killed FNDSCH just to be sure. (It was a LAB env.)
      After FNDSCH was killed, shutdown request was processed in 10 seconds..

      So FNDSCH was doing nothing, but its transaction was open, and locking the ICM (fnd_concurrent_request/or queues)..

      I ll keep it short.. The problem was in FNDSCH concurrent manager. Its sleep time was high (5 mins)
      So this make ICM be unable to process shutdown request..

      Solution was decreasing the sleep time of FNDSCH.. I set it to 30 secs..
      After this setting, problem solved.. All the subsequent concurrent manager shutdown processes finished in 1 or 2 minutes...
    8. While patching from EBS 12.0.6 to 12.1.3 rup5, I could not find an nls translated version of a patch. This was my second iteration. I decided to check from server that I used for the first iteration. The weird thing was the nls translated patch was already applied on the first iteration. But the patch could not be found in anywhere ( servers filesystem, Oracle support and etc..).

      I analyzed the ad_bugs table and saw that creation_date of this nls translation patch was almost the same (just after) with creation_date of another patch.

      So, this patch was bundled in another patch. That s why I couldnt find it in Oracle Support. That was the story..

      Patch number was 13626800 .. There is Turkish version of it.
      On the other hand; patch 14586882 includes the Turkish version of 13626800..
    9. I was installing Discoverer Desktop and Admin 11g on a Windows 7 64 bit Laptop.
      Unzipped to installation files and found a subdirectory named Win64 in Discoverer Setup folder.. Normally, I used the setup.exe in Win64 folder. Installer invoked but installation program closed automatically without any trace.

      As a workaround, I installed the Discoverer Desktop&Admin from the subdirectory named Win32 in Discoverer Setup Directory, and the installer completed successfully.
    10. Because of a power cut on a linux system, a disk became physically corrupted.
      This disk was mounted to /home directory. So after the power up, home directory 's of all the users became lost.
      This server was hosting Apache services..
      But after the power cut, Apache couldnt start and gave up with the following error
      05/29/13-11:47:23 Starting Apache Web Server Listener (dedicated PLSQL)
      Syntax error on line 1362 of httpd_pls.conf
      Invalid command 'include', perhaps mis-spelled or defined by a module not included in the server configuration
      apachectl start: httpd could not be started
      So the invalid command was 'include', is it possible ?Of course not.. incude is a key statement , part of the main syntax..
      After analyzing the situation, the problem was solved by unsetting LANG environment variable for the Os user starting Apache services..
      (unset LANG in user's .bash_profile)
      The main cause of the problem was the power cut.
      Because it directly made the system lose the user homes, and made the user indirectly lose the unset LANG setting.
      In addition, the server's default setting was Turkish(TR_tr.UTF8) in /etc/sysconfig/i18n, which is the default setting file for the system language.
      So Os user starting Apache was using LANG=TR_tr..
      This made Apache to not recognize the "i" character.. (i converted to I) and invalid command error was produced.
    11. One customer was encountering Rep-3000 errors, in some E-Business Suite concurrent programs --Oracle Reports..
      Rep-3000 error was a common error for EBS systems in Linux/Unix, and the solution was disabling the access control for X sessions using xhost + command.. The command should be run from the Server itself, from a local connection.. But before running the xhost + command, DISPLAY environment variable should be set to the DISPLAY used in EBS concurrent managers..
      So the display used in EBS concurrent managers of the customer's EBS system was hostname:0.0..
      The problem was, after setting display variable to hostname:0.0 , xhost + command encountered errors.

      The comand was;

      export DISPLAY=exatest:0.0
      xhost +

      The error was;

      xhost: unable to open display "exatest:0.0"

      So, after some analysis, I found that a X server was actually listening on port :0, but its authorization was be sourced before the xhost + command .
      I found the authorization of the relevant X server process, using proc filesystem.. and sourced it before running the xhost + command.. And the issue is fixed..

      The solution was;

      root@exatest ~]# export XAUTHORITY=/var/gdm/:0.Xauth
      [root@exatest ~]# export DISPLAY=exatest:0.0
      [root@exatest ~]# xhost +
      access control disabled, clients can connect from any host
    12. One of my cusmomers was questioning the premature archivelogs generated by its production oracle database system..

      After analysis, the issue was rejected as it s a designed behaviour for this issue.. It was explained in the following blog post..

    13. The following performance issue was encountered in one of the customer site.. It was encountered in an EBS production system..

      In certain time intervals(especially at evenings..), the production system was suffering from slow query performance..
      After some analysis, I found that parallel query servers were running and generating a high load..These parallel servers were fetching data for a full table scan operation..

      The source of this parallel query operation was the following plsql;

      BEGIN WF_EVENT.LISTEN ( p_agent_name => :1, p_wait => :2, p_correlation => :3, p_deq_condition => null, p_message_count => :4, p_max_error_count => :5 ); END;

      This plsql was using the below query to scan the workflow queue (wf_notification_in)

      select tab.rowid, tab.msgid, tab.corrid, tab.priority, tab.delay, tab.expiration, tab.retry_count, tab.exception_qschema, tab.exception_queue, tab.chain_no, tab.local_order_no, tab.enq_time, tab.time_manager_info, tab.state, tab.enq_tid, tab.step_no, tab.sender_name, tab.sender_address, tab.sender_protocol, tab.dequeue_msgid, tab.user_prop, tab.user_data from "APPLSYS"."WF_NOTIFICATION_IN" tab where msgid = :1

      Plan -> full table scan ve high cost..
      SELECT STATEMENT CHOOSECost: 146.456 Bytes: 909 Cardinality: 1
      1 TABLE ACCESS FULL TABLE APPLSYS.WF_NOTIFICATION_IN Cost: 146.456 Bytes: 909 Cardinality: 1

      So the WF_NOTIFICATION_IN table was over sized..
      As a solution, I recommended to recreate the queue by using the following note: Workflow Queues Creation Scripts [ID 398412.1]

      But as the subject is worklow, it s a dangerous thing to do.. So we didnt rebuild the queue yet.. I will update this post with the final actions..
    14. Analyzing unexpected eof in bash script
      A backup script was producing unexpected eof error , when I try to run it..
      Here is the approach I followed to analyze and correct the error..

      with sh -x , I checked the bash script to find the exact line that was producing error.

      After that I used cat -vet to see the nonprinting characters..

      From man of cat ->
      -e equivalent to -vE
      -E, --show-ends display $ at end of each line
      -v, --show-nonprinting use ^ and M- notation, except for LFD and TAB
      -t equivalent to -vT
      -T, --show-tabs
      display TAB characters as ^I

      In my case, the source of the problem was a missing ";" while calling a php script..

    15. HPUX printer Turkish character problem
      One of my customer was not able to print turkish characters like "ç,ş,i" etc.. through a new configured printer..
      "This problematic printer was configured just like the other printers in the system" said my customer..
      After taking a look to the printer configurations using SAM, it seemed the printer was configured just like the other remote printers indeed..
      Then I decided to take a deep look through HPUX printer configuration files..

      key files were under /etcl/lp/interface directory..

      Every printer has a configuration file in that directory.

      So after searching through the configuration files of the printers that are able to print turkish character, following lines attracted my attention..
      #Following line added to this model script by HP Support
      #to Convert TR.iso88599 to ISO
      /usr/bin/turkceyap <"$1">/tmp/myprn.$$
      mv /tmp/myprn.$$ $1
      :) /usr/bin/turkceyap .. It was a binary .. turkceyap means make it turkish actually...
      So this was a post processing action, and it was missing in the configuration file of the problematic printer.. I added the relevant lines to the configuration file of the problematic printer and restarted the spooler..

      Problem disappeared, as expected..
    16. EBS Datafile block corrupted, actually not corrupted..
      One of my clients declared a corruption issue last week.
      The EBS version is 11i , and a concurrent programs was encountering the following error; ORA- 01578: ORACLE data block corrupted (file # 28, block # 33523) ORA-01110: data file 28:

      I checked the datafile with dbv and dbv found the corruption.
      After that I found the object that the corrupted block belongs to.. It was an index block..
      So for solution, we recreated the index..

      Next day, the client reported the issue again. The corruption reported was exactly the same.. Same block, same datafile..
      I checked the datafile with dbv and with dbms_repair.check_object, the tools didnt report any corruption for this time..
      It was strange but whatever.. to fix; I dropped the index, no luck..
      I created a table in the problematic tablespace, and extend ed it till the problematic block ( to reformat the block properly) .. Still no luck..
      So in the end, it is understood that, the error was not reported at all..
      The error numbers are written to the output of the concurrent program, because the query of the concurrent program was selecting the error code columns of all the records.. As a result, the corruption errors that we already fixed, are displayed in the output..
      Looking to the output, you can think that these errors are encountered during query... That was the story..
      So as a solution( to not display these error lines..),I put concurrent request in to debug mode , and found the sqls used in the concurrent program.. Found the table, and deleted the corruption error strings in the error_message column..

      After that, we run the concurrent program again, and the output produced as expected..

      So it was an interesting case, because it made me think that, there should be a bug and the corruption will never be dissapeared..

    17. EBS-OID authentication error
      In a customer site, where we positioned OID 11g and SSO to authenticate EBS users, we faced a strange problem..
      The problem was strange because it appeared occasionally..
      More specificaly, when the problem appeared, user could not login to EBS in their first try. Most of the time, users could login in their third or fourth try..

      So after analyzing the situation with ldap admins, it s found that the customer has 8 ldap servers. These ldap servers were working load balanced , and the connections were distributed by the round robin.. Unluckily, the user accounts were expired in some of these ldap servers.. So when the login requests came across to these problematic ldap servers, the users could not login.

      Solution: ldap guys corrected the accounts in question and problem dissapeared.
    18. AIX slibclean..
      While applying PSU to the Oracle Database on AIX platform, opatch encounter errors like "cannot copy file to the destination.." We checked the permission, file paths and environment, and we did not see any problems..
      After analyzing the situation, this error is produced while copying libraries, and related files. So for the solution, we run a for loop, which execute the slibclean command of AIX in every 2 seconds during the patch application.. By this method, we successfully applied the PSU patch..
    19. Oracle Application Server/Single Sign on installation on AIX -- port conflict..

      Using Oracle Installer on Vnc , my customer was trying to install SSO 10g on AIX server.
      In the Opmn start phase of the installation, port conflict error (unable to bind, port already in use) encountered..
      After analysis, found that, Vncvserver on port 3 was blocking the opmn.. Because, vncserver was using the port 6003, and that was the opmn port actually. Killed the vncserver, completed the installation on Vnc2 and the problem dissapeared..
      So it seems, vnc on Aix used the ports starting from 6000.. In linux the vnc ports start from 5900, and that's why I think, the same error will not be produced on linux.
    20. To be able to open a wallet using Owm does not mean that the wallet can be used for the ssl connection.. The content in the wallet is important actually..
      And remember, the cwallet.sso is used for file based credential store. This file is created by the Owm, to supply the auto login mechanism..
    21. If you define a printer usng SAMBA in Linux, you need to supply a username and password defined in Windows environment(ActiveDirectory) to access the printer via network.

      while supplying username, you need to supply the username with domain prefix.
      For example: ERMAN.ORG/ermanuser

      Also you can check the access to the printer using smbclient
      For ex:

      smbclient -L \\dns_name_of_the_server -I ip_of_the_server -U ERMAN.ORG/ermanuser
    22. During an Exadata POC, we had to create an EBS clone environment. We used a full compressed rman backup for this. Rman duplicate commands were executed, but there was a problem.. The duplicate (restore) was running so slow on the Linux machine.. The Linux machine had 2 Intel cpu 's each having 4 cores. The server's physical memory was 48 GB, and there was nothing running but rman restore job on the server.
      When I had made analysis, I suspect from the network latency, because the rman backup files were on a nfs filesystem.. But after some digging, I found the server was swapping, altough it had 48 gb memory... kswapd processes were active, and rman could run very slow.. This was increasing the latency in backup restore operation.

      Then I placed my Linux admin hat, made the memory analysis on the Linux server and found that the HugeTLB 's are configured on the server... The Hugepage count was above 30000. So there was almost no 4k pages available on the server.. Also note that : These hugepages are never swapped out..
      This made rman to face difficulties on writing in to the normal memory pages(4k), rman needed swap out operation to write in to the memory.

      As a solution, I reduced the hugepage count, dynamically activated this change and everything went back to normal. Kswapd became idle as expected and rman restore job accelareted significantly.

Monday, December 30, 2013

RDBMS-- Migrating Oracle Database from disk to disk

This post was prepared just to put down the migration operation on paper.
Sometimes, you need to migrate(move) you database file in to a new storage...
This operation is often required when you buy a new SAN with fast disks (mounted to local, using ext3/ext4 filesystem).. Usually, we move our whole database in to these kind devices, and expect an increase in the I/O performance..
Note that, you may need to migrate some of your datafiles in to a new storage, too..
Altough there are several options to make such a migration; this post will explain how to migrate some of the datafiles in to a new storage using Linux copy and Oracle 's datafile rename commands.. The method that we will use does not include creation of controlfile operation, because rename datafile option is easier to implement in this kind of scenario;


Scenario: Your datafiles are on /u1 mount, and you need to move some of your datafiles(names starting with ERMAN%) to  new storage mounted on /u2

The method;

1) Identifiy the datafiles to be moved and prepare the copy script.
You can use the following sql to build the script for you.(name this script as copy.sh, we will use it on Step 4)

select 'cp '|| name||' /u2/newmount/'from v$datafile where name like '%ERMAN%'

This sql will create the cp commands like below;

cp /u1/prod/ERMAN.dbf /u2/prod/
cp /u1/prod/ERMAN_INDEX.dbf /u2/prod/
cp /u1/prod/ERMAN2.dbf /u2/prod/
cp /u1/prod/ERMAN3.dbf /u2/prod/
cp /u1/prod/ERMAN4.dbf /u2/prod/

2) Create the sql script for rename operations; You can use the following sql for this. (name this script as rename.sql, we will use it on Step 5)

select 'alter database rename datafile '''||name||''' to '''||replace(name,'/u1','/u2')||'''' from v$datafile where name like '%ERMAN%';

This sql will create the rename commands like below;

alter database rename file '/u1/prod/ERMAN.dbf' to '/u2/prod/ERMAN.dbf';
alter database rename file '/u1/prod/ERMAN_INDEX.dbf' to '/u2/prod/ERMAN.dbf';
alter database rename file '/u1/prod/ERMAN2.dbf' to '/u2/prod/ERMAN2.dbf';
alter database rename file '/u1/prod/ERMAN3.dbf' to '/u2/prod/ERMAN3.dbf';
alter database rename file '/u1/prod/ERMAN4.dbf' to '/u2/prod/ERMAN4.dbf';

3) Shutdown your application and database.

4) Connect you your database server (supposing it s Linux or Unix) and run the copy script that we created in Step 1;
as database owner> sh copy_script

5)When Step4 is finished, connect to sqlplus using sysdba, startup mount your database and execute the rename script that we have created in step 2.
as database owner > sqlplus "/as sysdba"
SQL> startup mount;

6)Once Step 5 is finished, open your database..
SQL> alter database open;

7)Check your datafiles are actually migrated.. check using lsof and v$datafile to ensure the database is using the datafiles from the new storage..

8)If everything is expected, delete the old files using the output script that the dynamic sql below will produce..
select 'rm -f '||replace(name,'/u2','/u1')||';' from v$datafile where name like '%ERMAN%';

That's it.. The datafiles named with ERMAN prefix, have been migrated..

Friday, December 27, 2013

Linux/Oracle -- Find Client IP connected to a Shadow process directly from OS

As known; when you use Oracle Database on Unix/Linux OS, clients are connected to server processes called shadow processes. These shadow processes work on behalf of the clients and make database operations.. (there are shared servers, too. Finding the client ip in a shared server is another story)

The connection between the client computers and these shadow processes are  based on the Tcp sockets.
So if we trace a shadow process id all the way down from the pid to related socket, we can gather the remote ip , actually the ip address that belongs to the client.

By working on this info, we can understand who is connected to our database without even connecting to Oracle Instance..
This is also very helpful for Linux admins in diagnosing the performance of a Linux System that runs a critical Oracle Database.

Lets see,
First we identify the process that we want to find the associated client's remote ip address..

We can use top or ps for this...

top ->
2758 oraprod   18   0 55.2g  46m  34m R 100.0  0.0   6:03.32 oraclePROD (LOCAL=NO)

so it seems like the process with pid(Linux process id) 2758, consumes Cpu resources aggresively..

Lets go to /proc file system, and see the file descriptors of the process.. Note that the sockets have file descriptors, too..

cd /proc/2758/fd
ls -al |grep socket
lrwx------ 1 oraerman oinstall 64 Dec 27 11:31 12 -> socket:[1764624]

So we have found the socket number , it is 1764624..

Now, we ll go to /proc/net to find out the socket information..
We use the file tcp as it s a tcp socket;

cd /proc/net;
cat tcp | grep 1764624
 498: 9000050A:0623 8A00050A:3780 01 00000000:00000000 00:00000000 00000000   500        0 1764624 3 ffff81128e130c80 203 40 1 8 100

So we have gathered the socket information.
The information gathered here is in the format below..

46: 010310AC:9C4C 030310AC:1770 01 
   |      |      |      |      |   |--> connection state
   |      |      |      |      |------> remote TCP port number
   |      |      |      |-------------> remote IPv4 address
   |      |      |--------------------> local TCP port number
   |      |---------------------------> local IPv4 address
   |----------------------------------> number of entry

The remote IPV4 address in here is the ip address of the connected client. This is what we want to find. Also, there are port state and etc. information , but we dont need those...

498: 9000050A:0623 8A00050A:3780 01 00000000:00000000 00:00000000 00000000   500        0 1764624 3 ffff81128e130c80 203 40 1 8 100   

So the Ip adress in Hex is : 8A00050A 

This is Hex in Little endian form.. So we need to reverse it before converting to the real ip address.

8E00050A ---reverse----> 0A05008A 

So if we convert 0A05008A to ip address, we find, which is the related client's real ip address!


Note that; "Hex to Ip" conversion is  as follows:
0A . 05 . 00 . 8A -> 10 . 5. 0 . 138

I wil try to extend this OS information gathering method and try to implement these kind of methods programatically.

Thursday, December 26, 2013

Oracle Linux -- Memory Optimization -> NUMA, HugeTLB/HugePages, kswapd, TLB and UEK

I decided to write this post as the memory amounts that are used in Production systems are increasing day by day. Today, we can see an 2 TB sized Oracle Applications/EBS database , uses 80 GB sga. Mixed workload environments require both large amount of avaliable system resources and fast response times.

Most of the time, tuning the code meets this kind of requirements..
On the other hand, there are times that system resources need to be tuned in order to meet the performance requirements and prevent the hang situations..
So, I will write this post regarding the Linux Memory optimization and performance ..
Actually, I have got inspired from the Greg Marsden's , Kernel and Sustatining Engineer in Oracle Linux Team presentations and decided to write these facts down..

In Large systems(128GB and above) with multiple cpu's, the memory layout is different. Numa(Non Uniform Memory Allocation) is used to allocate the memory efficiently. NUMA is automatically configured by Oracle Linux, and used by the Oracle Databases running on the Linux server.. Oracle ships two kernels with its distribution.. One of the is UEK(Unbreakable Linux Kernel) and the other one is base kernel.. Both of these kernels supports NUMA.

Lets take a look at NUMA concept and architecture;

Non-Uniform Memory Access (NUMA):

Often made by physically linking two or more SMPs
One SMP can directly access memory of another SMP
Not all processors have equal access time to all memories
Memory access across link is slower
If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA

NUMA is a memory design used in multiprocessing systems.. With NUMA, the memory access time depends on the memory allocation relative to a processor. In NUMA the memory is splitted in to multiple regions.. These regions are assigned to the processors.. That is, every process has its own local memory. Ofcourse, a processor can access the non-local memory regions, which are the local memory regions of other processor in the system.. Naturally, a processor can access its own local memory faster than a non-local memory, and this is the main concept in NUMA.
The NUMA concept have brought a solution to the problem, that only one processor can access the computer's memory at a time. With NUMA, the memory is seperated in to multiple regions and these regions are assigned to the processors in the system..

Here is the graphical presentation of the architecture;

In Linux with NUMA, the memory is divided and one or more processor are attached to this divided memory zones.. The attached CPU and memory region can be considered a cell. A matter of course, the entire memory is visible and accessible from all the CPUs in the system. The coherency between these memory regions are also supplied. It is handled in hardware by the processor caches and/or the system interconnect.

In detail, Linux divides the hardware into multiple nodes.. Actually, these nodes are the software representations of the hardware portions.. As hardware supplies the physical cells, Linux maps these nodes onto the physical cells..

As a result, an access to a memory location that is in a closer node that maps to a closer cells will be faster than a remote cell..

Okay, we described NUMA in general.. Let get back to our subject..

So, Oracle Linux uses NUMA with big memory, but there are things that you need to consider when using NUMA with big memory..

NUMA systems can swap even if there is free ram.. These systems make their kernel allocations only from one NUMA Zone that is closer to the particular core... This is an issue in Linux today. Because, even if you are working with one NUMA zone and even there are free memory in other zones at that time, you can get memory allocation fail messages ( dmesg -> order N allocation Failed), and as a result the system can start swapping for these pages..

Min_free_kbytes parameter can handle this issue. (if the allocation failed messages are less than 5 or so.)

By default, Min_free_kbytes is 15 MB.. If you have several NUMA zones, than this 15 MB will be divided by the # of NUMA zones.. For example, for 5 numa zones -> min_free_kbytes is 15/5 = 3MB..

The solution is increasing the min_free_kbytes and it is addressed in MOS document 1546861.1..
Linux needs to be tuned to preserve more memory for the kernel in order to avoid this memory depletion event.

Increase vm.min_free_kbytes to 524288 (from default of 51200) by editing /etc/sysctl.conf and editing the vm.min_free_kbytes line to read:

vm.min_free_kbytes = 524288
Note, on NUMA systems vm.min_free_kbytes should be set to 524288 * <# of NUMA nodes>. Use the following command to see the number of NUMA nodes:
# numactl --hardware

WARNING: Changing vm.min_free_kbytes when the system is already under memory pressure can cause this system to panic or hang. Before trying to change this setting dynamically be sure the system has more free memory that what vm.min_free_kbytes will be set to. This can be checked using the free -k command, which will display the amount of memory used and free in the same format as vm.min_free_kbytes.

min_free_kbytes defitinition: This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a watermark[WMARK_MIN] value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size. Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will become subtly broken, and prone to deadlock under high loads

You can see the numa layout with numactl tool.
"numactl --hardware" will give you the layout of NUMA in the system.

Example output(144GB Ram two Cpu sockets):

available: 2 nodes (0-1)
node 0 size: 72697 MB
node 0 free: 318 MB
node 1 size: 72720 MB
node 1 free: 29 MB
node distances:
node 0 1
0: 10 20
1: 20 10

While working with big memory system, crash analysis are important, too..

If the memory is so big, then dumping the contents, I mean taking crash dump, becomes harder.. That's why, we need to take the critical contents that can be enough and used to make OS analysis..Otherwise, taking a crash dump can take several minutes... So in order to achieve this, the system and kdump should be configured in that way.

Also, keep in mind that, Kexec can be used for making fast reboot. It bypasses the BIOS instructions and reduces the downtime.. (15 minutes - > 2 minutes) On the other hand; some PCI devices does not like to be rebooted in this way..

Swap is another important thing for memory optimization.. Oracle 's memory recommendation for servers running Oracle Databases is as follows;

RAM                                    Swap Space
Between 1 GB and 2 GB     1.5 times the size of RAM
Between 2 GB and 16 GB   Equal to the size of RAM
More than 16 GB                16 GB

As you see above, after 16GB , regardless of memory size, the recommended swap space is 16GB. But, we do not want to use this space actually... Because, if the database code or data pages are swapped back and forth, we will face a significant decrease in performance .. We just set it, dont wanna use it..

If we dont wanna use it, we need to monitor it... We need to monitor swap usage to be sure it s not used.. So, we can use free tool to monitor it..

Remember in Linux free pages are used actually, they are used as page caches, but they are overwritable, so technically they are free.

That is why when you calculate your free memory , you will add cached to the free pages..

Cached column in free command output + Free column in free command output returns actual Total Free Memory..

Example Free command output:

total used free shared buffers cached
Mem: 144967 144625 341 0 413 71252
-/+ buffers/cache: 72959 72007
Swap: 32765 474 32290

The calculation is actually as follows Free + Used - (Buffer+cached)

You can see the swap usage in here, too.. Also, you can use the following command to see the percantage of used swap space : free -m | grep -i Swap | awk '{print ($3 / $2)*100}'

It is okay to see some of the Swap space used.. The important thing is that, a system should not do swap operations so often. You can monitor the swap in and out operations using Sar tool.

sar -W will report swap statistics in two columns

pswpin/s :Total number of swap pages the system brought in per second.
pswpout/s:Total number of swap pages the system brought out per second.

Example output:

pswpin/s pswpout/s
12:30:01 PM 8.02 11.87
12:40:01 PM 7.13 52.70
12:50:01 PM 6.61 15.07
01:00:01 PM 13.05 65.30
01:10:01 PM 29.10 4.72
01:20:01 PM 82.65 20.68
01:30:01 PM 44.91 17.46
01:40:01 PM 30.79 11.03
01:50:01 PM 10.10 19.78
02:00:01 PM 8.07 4.17
02:10:01 PM 18.40 9.05
02:20:01 PM 7.10 4.46
02:30:01 PM 11.90 27.23
02:40:01 PM 3.95 17.48
Average: 33.11 30.06

As I have mentioned above, seeing some swap space used by OS is not a bad thing.. Because Linux by design, swaps out the unnecessary/unused pages of an application in to swap space.. So a system using swap space can have a lot of free memory at the same time.. So pages swapped in and out (especially "in", in my opinion ) are important... These activities can decrease performance.. If you see a high count in this metrics, you need to tune your system... This behaviour can be controlled using /proc/sys/vm/swappiness...But, care must be taken on this parameter.. The default value is 60, the value 0 ->“never use swap if free ram avaiable“ and value 100->Swap out pages aggressively.

There are things to consider for utilizing the memory.. For example, batch process like updatedb or slocater can force the pages that belong to the applications in to swap.. So after this kind of process have worked, we can experience a performance decrease in our Applications, as their pages will need to be swapped in..

The general solution to that is to disable these kind of batch process.. An alternative solution can be using cgroups.. Using cgroups, you can isolate the memory usages of the applications.. You can configure a group of applications like updatedb to be in one group and you can set the maximum memory available for them..

So in general, swap space should only be used for Operating System services and that 's why, SWAP=RAM is not recommended.. On the other hand; swap space is important for small systems, as it supplies room to grow ,but when it is used so often, it decreases the general performance... Shortly, if a system has a lot of swap activity, then that system is either misconfigured or missized..

You can also use vmstat to monitor swap in and out operations..

Example output of vmstat:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st
1 9 10758504 22830608 80848 45653220 13 12 759 137 1 3 7 1 84 8 0

Look at si and so columns .. They show blocks per seconds..

One other thing that I want to mention, is the kswapd .. It s the daemon that makes the swap in and out operations. It is a beast that consumes CPU in some cases.. It can hang a Linux system, while trying to swap out and in the pages.. I personally saw it make a Linux 5 64 bit system hang for 30 minutes, altough there were 40000 free pages in the System.. I didnt check renicing kswapd process priority, it may help on this.(renice 0 kswapd_pid)

Following picture describes the kswapd 's default behaviour;

As you see above, kswapd wakes up when Page count is low.. It start to reclaim memory. So what are the Pages High and Pages Low values then? When do we declare them? Where do we see them? I will try to answer this question in my next post.
Again, min_free_kbytes is important on this.
Also , following should be considered if you see aggresive swap I/O in Oracle Linux.

In Linux, /proc/buddyinfo can also be used in such situation to extend the analysis and diagnose memory fragmentation..
buddyinfo displays
Node 0, zone DMA 3 5 5 4 2 2 2 0 1 0 2
Node 0, zone DMA32 135 12 8 0 1 1 1 0 1 1 67
Node 0, zone Normal 1 176 80 3 0 0 0 2 1 1 6
Node 1, zone Normal 1 1 8 6 1 7 4 1 1 1 6

The columns in the output can be considered as follows;

First column -> 2^0*PAGE_SIZE
Second colum -> 2^1*PAGE_SIZE
Third colum -> 2^2*PAGE_SIZE
and so on..

For example, by looking the output above, we can say that there are 176 of (2^1*PAGESIZE) sized pages in Zone 0.. In English , there are 1 single page could be allocated immediately, 176 contigous pairs exist, 80 groups of 4 pages can be found in the normal zone of Node 0.. If some process try to allocate some high order pages above these limits, allocations will start to fail.. This can trigger kswapd to work aggresively.

Okay , lets mention some important things about Linux pages, too.

There are two kind of pages in Linux. 4K Pages, and 2MB Pages.. Note that there are significant performance differences between them..

Linux uses paging to translate virtual addresses to physical addresses..

Processes work with virtual addresses that are mapped to physical addresses. This mapping is defined by page tables...

So consider you have 250 Gb memory installed in your Oracle Database Server and you are using 4K pages. Suppose you have 2000 connections in your database environment.. So every oracle process on Linux will its own Page tables to map the Sga ,and you will end up a giant total page table size .. (maybe 100gig or more..)

Note that, when calculating Sga, we need to consider the size of page tables and number of processes, because we need to leave some space for OS, too..

The solution for problem above is using Hugepages. Hugepages provide much more smaller page tables in terms of size, because there will be less pages to handle, as Hugepages are 2MB(or more , it depends on the system) sized. In addition the hugepages are never swapped out, they are locked in memory.. Kernel does less work for bookkeeping of virtual memory, because of the larger page sizes.. Note that: Hugepages is not compatible with automatic memory management that Oracle does if configured to do..
Also sharing of pagetables seems also supported in Linux 5 and afterwards, as declared by Redhat below..

Shared Page Tables
Shared page tables are now supported for hugetlb memory. This enables page table entries to be shared among multiple processes.
Sharing page table entries among multiple processes consumes less cache space. This improves application cache hit ratio, resulting in better application performance.

So for Oracle, it is recommended to use Hugepages with Shared page tables..

TLB hit ratio will also increase if Hugepages are in use. Lets see what TLB's are...
Linux uses TLB(Transaction Lookaside Buffers) in the CPU. TLB store the mappings of virtual memory to actual physical memory addresses for performance reasons.. It is used like a cache, by using them Linux can access the mappings directly without going to the page tables.. So as with hugepages, the page sizes will be higher, page count will be lower.. That's why , TLB consumption will decrease, also decreasing the processing overhead..

Following is a graphical representation decribing the TLB in use.. Ref: University of New Mexico.

As you see above, CPU reads the virtual address from the TLB, firstly. If it cannot find the page there, then it goes to page table to find it..
In addition; I want to mention about the allocation of these Hugepages briefly.

I guess (didnt trace it); Oracle Database allocates the shared memory from Hugepages using something like the following;

shmget(2, MB_8, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);

You can see the size of and the count of available and allocated Hugepages by looking to the /proc/meminfo..

Example output:

cat /proc/meminfo
MemTotal: 148446596 kB
MemFree: 363292 kB
Buffers: 857808 kB
Cached: 77299976 kB
SwapCached: 0 kB
Active: 15177984 kB
Inactive: 70000152 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 148446596 kB
LowFree: 363292 kB
SwapTotal: 33551744 kB
SwapFree: 33551508 kB
Dirty: 256 kB
Writeback: 0 kB
AnonPages: 7122052 kB
Mapped: 111868 kB
Slab: 1135976 kB
PageTables: 228652 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 77055040 kB
Committed_AS: 21068916 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 280780 kB
VmallocChunk: 34359457067 kB
HugePages_Total: 30000
HugePages_Free: 4399
HugePages_Rsvd: 2560
Hugepagesize: 2048 kB

So 30000-4399+2560= is or will be used..

So in the example above we allocated nearly 28161 hugepages, this makes approx. 55 GB..
Note that the system that the above output is gathered from, has 50 gb sga.. 55 Gb sga target.. That s why we see these reserverd hugepages count (2560)..
So it s a good configuration, almost all of the sga is and will be in hugepages, then we can say that hupages configuration for this system is done properly..

If you want to use Hugepages with Oracle Database, please see the folowing Oracle Support documents.
  • MOS 361323.1 – Hugepages on Linux
  • MOS 361468.1 HugePages on 64-bit Linux
  • MOS 401749.1 script to calculate number of huge pages
  • MOS 361468.1 Troubleshooting Huge Pages

Moreover; there is also another type of Hugepages called THP(Transparent Hugepages).

These type of Hugepages are different than normal Hugepages, as Oracle Database cannot use them.
Note that , Java does not use Hugepages at all , unless +UseLargePages parameter is specified.. Ex: java -XX :+UseLargePages..

THP 's are swappable. That's why not good for RDBMS, as RDBMS work can go to disk.. Also this brings a paging/memory overhead, too...

It s nice to know that, if you compile the linux kernel with THP support and even if you dont use THP, there can be a %10 perfomance gain, especially in Applications that do a lot of I/O......
Note that : THP support is disabled by default in the latest UEK kernel.

The last paragraph will be about UEK kernel ..

UEK kernel is based on Linux 2.6 and later.. With Oracle Linux, Oracle have done enhancements for supporting OLTP, infiniand, SSD disk access, RDS , asnyc I/O, OCFS2 and networking. UEK is compatible with RHEL, so you can install and the applications running on RHEL servers into Oracle Linux running with UEK. It is a kernel that is fast, modern and reliable..
UEK is modern because it supports and supplies PV Hugepages, Data Integrity(to prevent data corruption), Dtrace,OCFS2, btrfs, OFED , ksplice and etc..

UEK kernel has a World Record in TPCC Benchmark made on March 12 2012. You can see the full disclosure report with this link -> http://c970058.r58.cf2.rackcdn.com/fdr/tpcc/Oracle_X4800-M2_TPCC_OL-UEK-FDR_Rev2_071012.pdf

In my opinion, the machine that is usd in Benchmark is a high end machine.. Utilizing this machine and getting such a high transaction processing rate is a big success for Oracle Linux.

That 's all for now. Feel free to write comments.. Please help me make corrections, if necessary..