Well, I know Exadata has been in the market for a long time, and as an Apps DBA I didn't get a chance to work on it yet :(
Maybe in the future I will get my hands on this beast. As the quote rightly says, "It does not matter how slowly you go so long as you do not stop."
Recently my current employer asked me to appear for an internal Exadata exam. So, without any hands-on practice, I started learning Exadata (this happened in between my marriage and other personal things on the cards :). The details below are not for Exadata experts; they are meant for novice users who are just about to start with Exadata.
Exadata: an engineered machine composed of database servers, storage servers, an internal InfiniBand network with switches, and storage devices (disks).
360-degree overview: Smart Scans, offloading, intelligent storage, Hybrid Columnar Compression, next-generation InfiniBand, tons of flash cache, and database cloud computing.
How to know whether a database is running on an Exadata machine, if you are not a DMA:
SQL>select count(*) from (select distinct cell_name from gv$cell_state);
If the count returns 0, it is a non-Exadata machine. The count represents the number of cells present in the database machine.
Architecture:
Storage cells sit at the top and bottom of the Exadata rack. Then come the database servers (compute nodes). Between two compute nodes lies the Cisco switch, surrounded by the InfiniBand spine and leaf switches.
The format of the Exadata model number is Xn-s where:
– n is the generation number
– s is the number of CPU sockets on the database servers
X5-2, for example, means a fifth-generation machine with 2 CPU sockets per database server. Lots of memory, flash cache, and SSDs have been added to Exadata, which makes it a powerful database machine.
Database node also known as-Compute node
Shared storage also known as-Storage cells
Version 1 of Exadata was built on HP hardware. Later, when Oracle acquired Sun, the hardware moved to Sun in version 2.
                Storage Cells    Database Servers    InfiniBand Switches
Full Rack       14               8                   3
Half Rack       7                4                   3
Quarter Rack    3                2                   2
ILOM (Integrated Lights Out Manager) is used for monitoring the Exadata hardware. It mainly uses the Ethernet network. Clients also use Ethernet to connect to the Exadata machine.
Key services on an Exadata storage cell:
cellsrv, MS, and RS (IORM runs inside cellsrv).
cellsrv is the most important service; requests from the database are handled first by cellsrv.
MS (Management Server) provides Exadata cell management and configuration. MS is responsible for sending alerts and collecting statistics, and it serves admin tools such as CellCLI and dcli.
RS (Restart Server) is used to start up/shut down cellsrv and MS, and it monitors these services to restart them automatically if they fail.
libcell resides on the Exadata database server side; it is not present on a non-Exadata machine. It converts database I/O requests into the iDB protocol format that cellsrv understands.
The database sees each storage cell as a separate ASM failure group.
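A quick way to see this from ASM (a minimal sketch; each failure group corresponds to one storage cell):
SQL> select failgroup, count(*) as disks
     from v$asm_disk
     group by failgroup;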
Log in to the Exadata storage server as root.
uname -a
Linux uaexacell1 2.6.39-300.26.1.el5uek #1 SMP Thu Jan 3 18:31:38 PST 2013 x86_64 x86_64 x86_64 GNU/Linux
List the cell Restart Server processes:
ps -ef | grep cellrs
List the MS process
ps -ef | grep ms.err
The MS (Management Server) process's parent process ID belongs to RS (Restart Server). RS restarts MS when it crashes or terminates abnormally.
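A minimal sketch to confirm this on a cell (the match patterns ms.err and cellrssrm are assumptions based on a typical cell; PIDs vary): the PPID of the MS process should equal the PID of the RS main process.
# ps -o pid,ppid,cmd -p $(pgrep -f ms.err)
# pgrep -lf cellrssrm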
cellsrv establishes the communication between the storage servers and the compute nodes (database servers).
cellsrv can also be started using the CellCLI utility as the celladmin user.
cellip.ora and cellinit.ora are two important files. Without them, the database will not be able to connect to the Exadata storage servers, and ASM will not be able to access the disks that reside on them.
Below are the two important files that reside on the database server, under /etc/oracle/cell/network-config:
cellinit.ora contains the InfiniBand IP address of the database server.
cellip.ora contains the InfiniBand IP addresses of all the storage servers (this is how the InfiniBand switches are used).
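For illustration, the two files look roughly like this (the InfiniBand IP addresses below are made up):
cellinit.ora:
ipaddress1=192.168.10.1/24
cellip.ora:
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"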
To manage the Exadata storage server, we use CellCLI or dcli, whose requests are served by MS.
Storage server:
Each disk in a cell is presented as a physical disk and a LUN, and the resulting cell disks are partitioned into what are known as grid disks. The main ingredient is the Exadata Storage Server software. Each cell has 12 physical disks, and from one physical disk we get one LUN, so 12 disks means 12 LUNs. During configuration, the physical disk and LUN are formatted into a cell disk: 12 physical disks = 12 LUNs = 12 cell disks. A cell disk also holds metadata. We can then partition a cell disk into grid disks; from one cell disk we can get multiple grid disks. The grid disks are presented to ASM, so we create ASM disk groups on top of grid disks. The storage server also runs a Linux operating system, which is installed in an area of the physical disks known as the system area. The system area occupies a slice of the first two disks; the remaining space, along with the other ten disks, is used to store data.
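The whole chain can be walked from CellCLI on a cell, one object type per layer (a sketch; output omitted):
CellCLI> list physicaldisk
CellCLI> list lun
CellCLI> list celldisk
CellCLI> list griddisk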
There are 4 flash cards in a cell (storage server), and each flash card has 4 FDOMs (flash modules). One FDOM = 1 LUN, and from 1 FDOM we can create 1 cell disk (so 16 flash cell disks per cell).
All the flash space is normally given to the flash cache; alternatively, grid disks can be created from the flash disks and exposed to ASM (but this is not recommended).
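The flash side can be inspected the same way (a sketch; the flashcache object exists only after it has been created):
CellCLI> list physicaldisk where diskType=FlashDisk
CellCLI> list flashcache detail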
The compute nodes send an iDB request to the storage cells, which also contains additional predictive information: the nature of the I/O, the probability that this data will be requested again in the near term, and so on. The storage server uses this information to decide whether or not to cache the data blocks.
Storage index
===================
-These are not normal database indexes that get stored in a tablespace. They are constructed online as I/O occurs and are maintained in memory on the storage cells.
Three data elements are maintained at the storage-region level: the MIN value, the MAX value, and whether NULLs exist.
When a query calling for a full table scan on an object is received by the Exadata storage server, it looks for the existence of storage indexes on the predicate columns. The storage server scans only the storage regions where the predicate value falls within the region's MIN/MAX; physical I/O is targeted at only the identified regions.
The smaller the set of storage regions identified as candidates, the more I/O is saved.
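There is no dictionary view of the storage indexes themselves, but their benefit shows up in a session statistic; a minimal sketch:
SQL> select n.name, s.value
     from v$statname n, v$mystat s
     where n.statistic# = s.statistic#
     and n.name = 'cell physical IO bytes saved by storage index';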
Cell Offloading:
=========
The storage in the Exadata Database Machine is not just dumb storage. The storage cells are intelligent enough to process some workload inside them, saving the database nodes from that work. This process is referred to as cell offloading.
Traditional storage systems do not have these features.
Smart Scan:
=========
In a traditional Oracle database, when a user selects a row or even a single column in a row, the entire block containing that row is fetched from the disk to the buffer cache, and the selected row is then extracted from the block and presented to the user’s session.
The Exadata Database Machine can pull the specific rows from the disks directly and send them to the database nodes. This functionality is known as Smart Scan. It results in huge savings in I/O.
Basically, it reduces the I/O between the compute nodes (i.e., database nodes) and the storage servers, as the sketch below illustrates.
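To check whether a session actually benefited from Smart Scan, compare the offload statistics after running the full-scan query (a sketch):
SQL> select n.name, s.value
     from v$statname n, v$mystat s
     where n.statistic# = s.statistic#
     and n.name in ('cell physical IO bytes eligible for predicate offload',
                    'cell physical IO interconnect bytes returned by smart scan');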
Hybrid Columnar Compression-
========
Instead of storing the rows together, HCC allows the Columns to be stored together where the data has similar characteristics.
Traditionally, database rows are stored in blocks with the columns of a row next to each other. When a row becomes too large to fit into a block, it overflows into the next block, which is known as row chaining. In the real world, data in a column often repeats, and thus it becomes a good candidate for compression.
example-
First_name last_name
larry sinha
wasim roy
neha sinha
Antora sinha
We can see from the above that sinha gets repeated and thus it can be compressed.
With Hybrid Columnar Compression, Oracle Exadata Storage Server in Oracle Exadata V2 creates a column vector for each column, compresses the column vectors, and stores the column vectors in data blocks. The collection of blocks is called a compression unit
Types of Hybrid Columnar Compression
Hybrid Columnar Compression comes in two basic flavors: warehouse compression and archive compression.
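For reference, the two flavors map to standard table-level syntax; a sketch with invented table names:
SQL> create table sales_wh compress for query high as select * from sales;
SQL> alter table sales_2010 move compress for archive high;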
Powering Off Oracle Exadata Rack
1. Stop Oracle Clusterware on all database servers
# $GRID_HOME/bin/crsctl stop cluster -all
2. Shut down all remote database servers
# dcli -l root -g remote_dbs_group shutdown -h -y now
remote_dbs_group is a file that contains a list of all database servers except the server you are using to shut down the remote ones (that local server is shut down in step 4).
3. Shut down all Exadata Storage Servers
# dcli -l root -g cell_group shutdown -h -y now
cell_group is the file that contains a list of all Exadata Storage Servers.
4. Shut down the local database server using the following command:
shutdown -h -y now
5. Remove power from the rack.
There are two types of disks
High performance and high capacity disks
References:
http://www.dbas-oracle.com/2013/06/What-is-Exadata-Architecture-Components-and-Main-Features-of-Exadata-Storage-Server.html
https://flashdba.com/history-of-exadata/
https://kevinclosson.net/kevin-closson-index/exadata-posts/critical-thinking-exadata-and-engineered-systems-in-general/
https://docs.oracle.com/cd/E80920_01/SAGUG/exadata-storage-server-software-introduction.htm#SAGUG20312
********************************************
Patching in Exadata:
Database servers, storage servers, and InfiniBand switches are patched with the patchmgr utility.
Grid home and database homes are patched with OPatch.
Exadata storage server patching: an Exadata storage server patch may also contain firmware and operating system updates to apply on the database servers, known as a database server minimal pack.
Installation can be done in 2 ways
1. Rolling fashion, where patches are applied one cell at a time: the patch is applied to the cell that is down while the rest of the cells stay up and running, so there is no downtime.
For a full rack with 14 storage servers, if patching each storage server takes around 1-2 hours, that comes to up to 28 hours for a full rack and 14 hours for a half rack. Also, as per https://uhesse.com/2014/12/20/exadata-patching-introduction/, rolling patches are recommended only if you have ASM disk groups with high redundancy or if you have a standby site to fail over to in case of trouble. In other words: if you have a quarter rack without a standby site, don't use rolling patches! That is because the DBFS_DG disk group that contains the voting disks cannot have high redundancy in a quarter rack with just three storage servers.
2. Non-rolling fashion, where all the cells are brought down, including the RAC, and then the patch is applied. This method requires downtime.
Each Exadata storage server patch has a fixed compatibility with the patches of the other components; i.e., an Exadata storage server patch requires a matching database server patch.
For example-Patch 21339377: QUARTERLY FULL STACK DOWNLOAD PATCH FOR EXADATA (JUL 2015 - 11.2.0.3.0)
The read me gives more illustration-https://updates.oracle.com/Orion/Services/download?type=readme&aru=19138657
a) Database server patches: these can contain a patch to apply to the database cluster (via OPatch) plus operating system and firmware updates. The OS/firmware updates require a compatible Exadata storage server software version and are applied with dbnodeupdate.sh, a tool that accesses the Exadata YUM repository.
b) InfiniBand switch patches: an InfiniBand switch patch contains updates to the software and/or firmware of the InfiniBand switches.
c) Additional component patches: these patches apply to the Ethernet switch, PDUs, KVM, etc.
Exadata Database Machine Components
Compute nodes (Database Server Grid)
  - Exadata (Linux operating system, firmware, Exadata software)
  - Oracle RDBMS and Grid Infrastructure software
Exadata Storage Servers (Storage Server Grid)
  - Exadata (Linux operating system, firmware, Exadata software)
Network (InfiniBand Network Grid)
  - Exadata IB switch software
Other Components
  - Cisco switch, PDUs
Oracle Platinum Services
If you are an Oracle Platinum Customer, then Oracle can patch your Exadata Database Machine free of cost. Here are some details around Oracle Platinum services:
What is Oracle Platinum Services?
Oracle Platinum Services maximizes the availability and performance of your Oracle engineered systems with 24x7 remote fault monitoring, faster response times, and patch deployment services at no additional cost.
Oracle Platinum Services is a special entitlement under Oracle Premier Support that provides customers with additional services at NO Extra cost.
Oracle Platinum services offers Patch deployment performed by Oracle up to four times per year.
To remain eligible for Platinum support, you should not lag behind by 3 or more patch levels; this should be avoided.
Oracle platinum can patch up to 40 databases per Exadata cluster.
Oracle Platinum Engineer connects to the Customer's Exadata using Oracle Advanced Support Gateway for patching Exadata Database Machines.
Oracle Platinum can upgrade database from 11.2.0.1, 11.2.0.2 or 11.2.0.3 to 11.2.0.4.
At this moment, database upgrades from 11g to 12c are not included as part of Oracle Platinum patching.
If you are NOT an Oracle Platinum Customer, you can choose to Patch your Exadata Database Machines on your own.
For more details read the document at:
http://www.oracle.com/us/support/library/platinum-services-faq-1653259.pdf
http://www.oracle.com/us/support/library/oracle-platinum-services-ds-1653256.pdf
Patching Terminology
The list below explains the terminology used in Exadata patching.

Storage server patches: Oracle releases storage server patches every quarter. These patches are applied using the patchmgr utility. Storage server patches apply operating system, firmware, and driver updates and update both the storage cells and the compute nodes.

dbnodeupdate.sh: Compute nodes are patched using the dbnodeupdate.sh utility. The DB Node Update Utility (dbnodeupdate.sh) automates all the steps and checks to upgrade Oracle Exadata database servers to a new Exadata release and replaces the manual steps. dbnodeupdate.sh replaced the "YUM channels". The latest dbnodeupdate.sh can be downloaded from MOS via patch 16486998.

QDPE: "Quarterly Database Patch for Exadata", previously known as "Bundle Patches". These patches are applied using the OPatch utility. QDPE patches are released at different times for different Oracle database versions, for example quarterly for 11.2.0.4 and monthly for 12.1.0.2. A QDPE patch includes patches for the GI and RDBMS binaries and is cumulative in nature.

InfiniBand patches: IB switch patches are released semi-annually to annually. These patches come with the Exadata storage server patch, update the InfiniBand switches, and are applied using the patchmgr utility.

QFSDP: "Quarterly Full Stack Download Patch". These patches are released quarterly and contain all of the patches for each stack in the Exadata Database Machine, including Grid Infrastructure, RDBMS, and OEM. QFSDP patches are applied individually to each stack but are downloaded as one single patch.

Patch set upgrades: full-upgrade patches on the Exadata compute nodes, for example from 11.2.0.3 to 11.2.0.4 or from 12.1.0.1 to 12.1.0.2. These patches are released every 1-2 years and are installed using the OUI.
Exadata Patching Frequency
As of today, Exadata patches are released at the following frequency:

Bundle Patch (QDPE): 12.1.0.2 - monthly; 12.1.0.1 - no further BPs; 11.2.0.4 - quarterly; 11.2.0.3 - no further BPs
Exadata Storage Software (ESS): quarterly
Quarterly Full Stack Download (QFSDP): quarterly
Major upgrade (patch set): once every 1-2 years
InfiniBand switch (IB): semi-annually to annually

For more details refer to the Oracle MOS note: 888828.1
Tools for Patching
The following tools are used for patching Exadata Database Machines:

patchmgr (storage cells, compute nodes, and InfiniBand switches): The patchmgr utility is used for applying patches to the storage cells and InfiniBand switches. Starting with Exadata release 12.1.2.2.0, the Exadata database servers can also be updated, rolled back, and backed up using patchmgr. You can still run dbnodeupdate.sh in standalone mode, but using patchmgr lets you run a single command to update multiple nodes; you do not need to run dbnodeupdate.sh separately on each node. patchmgr can update the nodes in a rolling or non-rolling fashion. The updated patchmgr and dbnodeupdate.sh are available in the new dbserver.patch.zip file, which can be downloaded via patch 21634633.

dbnodeupdate.sh (compute nodes): The DB Node Update Utility (dbnodeupdate.sh) automates all the steps and checks to upgrade Oracle Exadata database servers to a new Exadata release and replaces the manual steps.

OPatch (Oracle GI and RDBMS homes): Interim and one-off patches are applied using the OPatch tool.

OPlan (Oracle GI and RDBMS homes): The OPlan utility provides step-by-step patching instructions for your environment. It collects the configuration information of the target Oracle home and then generates patching instructions specific to that home. Details of OPlan are discussed in MOS note 1306814.1, and it can be downloaded via patch 11846294.

DBUA (database): DBUA is the graphical tool that automates the database upgrade process.

ASMCA (ASM instance): ASMCA supersedes DBCA for ASM management and is a graphical tool for managing ACFS, ADVM, and ASM disk groups.

OUI (Oracle GI and RDBMS homes): Oracle Universal Installer is used for installing and upgrading the Oracle GI/RDBMS software; this is the graphical method of installing or upgrading Oracle software.
MOS Notes
Before applying Exadata patches, read Oracle Support note 888828.1. This note provides every detail required for Exadata patching. The section "Latest Releases and Patching News" is very useful if you are looking for the most recent patch updates.

888828.1  - Master note for all patches pertaining to Exadata
1070954.1 - Exachk or HealthCheck (run before and after patching)
1553103.1 - Patching an Exadata database server using dbnodeupdate.sh
1473002.1 - Exadata YUM Repository Population and Linux Database Server Updating (useful if you are planning to use a YUM repository server)
1405320.1 - Exadata Security Scan Findings
1270094.1 - Critical Issues
Software Download
Refer to MOS note 888828.1 for details on the patches to be downloaded for a specific release.
Example: here is an example for patching Exadata Database Machines to 12.1.2.2.0 (the patch numbers differ per release):

20131726 - Storage server and InfiniBand switch software (12.1.2.2.0)
21825906 - Database server bare metal / domU ULN exadata_dbserver_12.1.2.2.0_x86_64_base OL6 channel ISO image (EXADATA COMPUTE NODE 12.1.2.2.0 OL6 BASE REPOSITORY ISO)
21634633 - The updated patchmgr and dbnodeupdate.sh in the new dbserver.patch.zip file
Patching Sequence
You should patch the Exadata Database Machines in the following sequence
Oracle GI/RDBMS Homes
Exadata Storage Cells
Compute nodes
Infiniband Switches
Exadata Health Check using “exachk” utility
You must run the Exachk utility before and after patching Exadata Database Machines.
Here are the steps on how you can run Exachk utility and review the output file.
Download the latest exachk utility from MOS note ID 1070954.1
Transfer the zip file to one of your compute nodes and unzip it.
The readme.txt file contains the information on how to use exachk.
Login as root user and run exachk.
[root@dm01db01 ~]#cd exachk
[root@dm01db01 ~]#pwd
/root/exachk
[root@dm01db01 ~]#./exachk -a
After exachk completes, the names of the zip and HTML output files are displayed on the screen.
Copy the files to the desktop/laptop and open the exachk HTML file in a web browser.
Review the Exachk report for hardware/software issues and resolve them before proceeding for the patching.
Procedure
Here are some basic commands for patching different Exadata stacks.
Exadata Compute nodes
Pre-requisites check
# patchmgr -dbnode -dbnode_precheck
Upgrade Compute nodes
Non-Rolling
# patchmgr -dbnode -dbnode_upgrade
Rolling
# patchmgr -dbnode -dbnode_upgrade -rolling
Exadata Storage Cells
Pre-requisites check
# patchmgr -cells -patch_check_prereq
Upgrade Storage Cells
Non-Rolling
# patchmgr -cells -patch
Rolling
# patchmgr -cells -patch -rolling
Exadata Infiniband Switches
Pre-requisites check
# patchmgr -ibswitch -ibswitch_precheck
Upgrade Infiniband Switches
# patchmgr -ibswitch -upgrade
Well, we know there are different components in an Exadata machine, and thus there are different ways to patch it:
Database server(firmware+OS)-dbnodeupdate.sh
Storage server-patchmgr
Database Home/Grid Infra-opatch
The patchmgr utility is a tool used to apply the latest patches, or roll them back, on the Exadata storage servers in a rolling or non-rolling fashion. The utility automates the patching operations and also provides an option to send email notifications on the patch/rollback operation status (success, failure, waiting, not attempted, etc.).
./patchmgr -cells cell1_group -patch -rolling \
-smtp_from dba@domain.com \
-smtp_to dbma@domain.com
To update/patch the infiniband switches, the following syntax can be used:
./patchmgr -ibswitches [ibswitch_list_file] \
-upgrade | -downgrade [-ibswitch_precheck] [-force]
The following tasks are performed during the course of storage server patching (a post-patch verification sketch follows the list):
· The new OS image is pushed to the inactive partition
· Multiple cell reboots are performed at various stages
· The USB recovery media is recreated so a good backup is kept
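A minimal post-patch check on a cell: imageinfo shows the now-active image version, and imagehistory lists the prior images kept on the inactive partition for rollback.
# imageinfo
# imagehistory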
For dbnodeupdate.sh, download dbnodeupdate.zip via patch 16486998; it contains the dbnodeupdate.sh utility. The utility replaces the manual steps and automates all the necessary steps and checks required to successfully patch the Exadata database servers.
The following tasks are performed during the course of Database server patching:
· Stop/start/disable CRS
· Performs filesystem backup
· Applies OS updates
· Relink all Oracle homes
· Enable CRS for auto-restart
./dbnodeupdate.sh [ -u | -r | -c ] [ -l <baseurl|zip file> ] [-p] <phase> [-n] [-s] [-i] [-q] [-v] [-t] [-a] <alert.sh> [-b] [-m] | [-V] | [-h]
-u Upgrade
-r Rollback
-c Complete post actions (verify image status, cleanup, apply fixes, relink all homes, enable GI to start/start all domU's)
-l <baseurl|zip file> Baseurl (http or zipped iso file for the repository)
-s Shutdown stack (domU's for VM) before upgrading/rolling back
-p Bootstrap phase (1 or 2) only to be used when instructed by dbnodeupdate.sh
-q Quiet mode (no prompting) only be used in combination with -t
-n No backup will be created (Option disabled for systems being updated from Oracle Linux 5 to Oracle Linux 6)
-t 'to release' - used when in quiet mode or used when updating to one-offs/releases via 'latest' channel (requires 11.2.3.2.1)
-v Verify prereqs only. Only to be used with -u and -l option
-b Perform backup only
-a <alert.sh> Full path to shell script used for alert trapping
-m Install / update-to exadata-sun/hp-computenode-minimum only (11.2.3.3.0 and later)
-i Relinking of the stack will be disabled and the stack will not be started. Only possible in combination with -c.
-V Print version
-h Print usage
Opatch utility is used to patch the GI and RDBMS homes.
The OPlan utility provides step-by-step patching instructions specific to your environment. OPlan automatically analyzes and collects the required configuration information, then generates a set of instructions and commands customized to your environment.
$ORACLE_HOME/oplan/oplan generateApplySteps <patch location>
Patching order in an Exadata machine:
· InifiniBand Switch
o Spine/Leaf
· Exadata Storage Server
· Exadata Database Server
· Database Bundle Patches
o Grid Home
o Oracle Homes
Patching example
The following example demonstrates patching the storage servers and database servers using the different utilities:
Pre-patching tasks (all actions performed as root user):
Step 1) # imageinfo
Step 2) On a DB server, connect to the ASM instance to capture the current disk_repair_time values before you change them to sustain the patching downtime:
SELECT dg.name AS diskgroup, SUBSTR(a.name,1,18) AS name,
SUBSTR(a.value,1,24) AS value, read_only FROM V$ASM_DISKGROUP dg,
V$ASM_ATTRIBUTE a WHERE dg.group_number = a.group_number
and a.name ='disk_repair_time';
To be on the safe side, change the value to 24 hours, as demonstrated below:
SQL> alter diskgroup DG_DBFS SET ATTRIBUTE 'disk_repair_time' = '24h';
SQL> alter diskgroup DG_DATA SET ATTRIBUTE 'disk_repair_time' = '24h';
SQL> alter diskgroup DG_FRA SET ATTRIBUTE 'disk_repair_time' = '24h';
Step 3) Create a file on the OS containing the hostname of cell 1; maintain a separate file for each cell:
vi /root/cell1_group – put the cell hostname in the file
Step 4) Inactivate all grid disks. Log in as the celladmin user and, with the CellCLI utility, do the following:
CellCLI> alter griddisk all inactive;
Verify all grid disks are in the inactive state:
CellCLI> list griddisk attributes name, status;
Cell Patching:
Step 5) Patch the cell nodes (start from the first cell node and patch the remaining cells from the first one).
Go to the patch location and run the following commands:
# ./patchmgr -cells /root/cell1_group -reset_force
# ./patchmgr -cells /root/cell1_group -cleanup
# ./patchmgr -cells /root/cell1_group -patch_check_prereq
# ./patchmgr -cells /root/cell1_group -patch
Once the patching completes successfully on the cell, as celladmin, run the following commands through CellCLI:
CellCLI> list griddisk attributes name, status;
CellCLI> alter griddisk all active;
Continue with the other cells using the same set of commands. However, ensure you have successfully configured SSH between the cell nodes and have a separate file with each hostname, as mentioned in step 3.
DB Server patching
After completing the cell server patching, move on to the database servers and apply the patches with the dbnodeupdate.sh utility. Beforehand, you must copy the patch (zip file) to all database servers.
Step 1) Stop/Disable cluster on the database server
./crsctl disable crs
./crsctl stop crs -f
Step 2)
./dbnodeupdate.sh -v -u -l /u01/app/oracle/patch/p18876946_112331_Linux-x86-64.zip
./dbnodeupdate.sh -b -u -l /u01/app/oracle/patch/p18876946_112331_Linux-x86-64.zip
./dbnodeupdate.sh -u -l /u01/app/oracle/patch/p18876946_112331_Linux-x86-64.zip
Wait for multiple reboots; once the system is up and running, do the following:
./dbnodeupdate.sh -c
The -c option performs the post-patch tasks, such as enabling/starting the CRS, relinking the binaries, etc.
Continue the same set of actions on the rest of the nodes. Once you have successfully completed patching the remaining database servers, revert the disk_repair_time values through ASM.
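For example, assuming the value captured in step 2 was the default of 3.6 hours, the revert would look like this:
SQL> alter diskgroup DG_DBFS SET ATTRIBUTE 'disk_repair_time' = '3.6h';
SQL> alter diskgroup DG_DATA SET ATTRIBUTE 'disk_repair_time' = '3.6h';
SQL> alter diskgroup DG_FRA SET ATTRIBUTE 'disk_repair_time' = '3.6h';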
Tips
Here is a list of patching best-practice tips that a DMA should follow:
o Patch non-production systems first
o Use the standby-first apply approach, if a standby system exists
o Read the documents and have an appropriate plan in place
o Run the Exadata health check before/after patching
o Ensure you patch when the workload is low
o Always refer to MOS note 888828.1
o Ensure the disk_repair_time value for existing disk groups is increased from its default, preferably to 24 hours
References:
1. https://www.toadworld.com/platforms/oracle/b/weblog/archive/2015/10/06/managing-amp-troubleshooting-exadata-upgrading-amp-patching-exadata
2. Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)
3. dbnodeupdate.sh: Exadata Database Server Patching using the DB Node Update Utility (Doc ID 1553103.1)
4. Exadata Patching Overview and Patch Testing Guidelines (Doc ID 1262380.1)
5. Oracle Exadata Database Machine exachk or HealthCheck (Doc ID 1070954.1)
*************************
Hierarchy of logs and traces
Oracle keeps track of useful information in various log files and dumps critical information into trace or dump files. Reviewing these files from time to time is strongly recommended, as they provide a glimpse of the current state of the cell, database, RAC, etc. This part of the segment takes you through the hierarchy of logs on an Exadata cell server and explains the importance of the files.
Every cell has a /var/log/oracle file system.
You will find the following sub-directories underneath /var/log/oracle:
diag
cellos
crashfiles
deploy
Cell alert.log
Like the database and Oracle Clusterware, each cell maintains its own alert.log file, where it keeps track of cell start/stop events, service information, and other important details. Whenever there is an issue with the Exadata services, this is the first file to review for useful information.
Location: /opt/oracle/cell/log/diag/asm/cell/{cellname}/trace
Name : alert.log
MS logfile
Review the below log whenever you encounter issues with Management Server (MS) service:
Location: /opt/oracle/cell/log/diag/asm/cell/{cellname}/trace
Name : ms-odl.log
Crash and Core files
By default the crash core files are dumped at the following location on Exadata cell:
/var/log/oracle/crashfiles
To modify the crash core file location, edit the following configuration file on the cell:
/etc/kdump.conf – change the path to the new location.
Cell patching log files
For any cell patching related log files, you should review files under the following location:
/var/log/oracle/cellos
OS log file
All OS related messages can be reviewed in the following:
/var/log/messages
Disk controller Firmware logs
Battery capacity and feature properties of the disk controller firmware can be viewed with the following command:
/opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -a0
alerthistory & cell details
alerthistory is another powerful command, giving significantly useful information about the cell. It is strongly recommended to run through the alert history on each cell from time to time.
Another powerful check to determine the health state of a cell is to list the cell details, as sketched below.
To ensure the stability of the cell, verify its health status: the fan status, power status, cell status, and the CellSrv/MS/RS services should all be up and running.
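Both checks run from CellCLI on each cell (a sketch; output omitted):
CellCLI> list alerthistory
CellCLI> list cell attributes name, status, fanStatus, powerStatus, cellsrvStatus, msStatus, rsStatus detail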
Proper tools to verify the health of Exadata components
It is essential to know the proper tools on Exadata to verify the health status of the cell components. Following are a few important tools that can be used to verify the status of different components, such as the cell boot location/files, InfiniBand status, etc.
imageinfo
imageinfo provides crucial information about the cell software, whether rollback to the previous image is possible, and the location/file of the cell boot USB partition. It is especially useful before/after patching on the cell servers:
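For illustration, these are the fields to look at (the values below are placeholders, not real output):
# imageinfo
Active image version: 12.1.2.2.0.xxxxxx
Active system partition on device: /dev/md5
Cell boot usb partition: /dev/sdm1
Inactive image version: 12.1.2.1.x.xxxxxx
Rollback to the inactive partitions: Possible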
Verifying network topology:
To verify spine/Leaf switch status, topology and errors, use the following command:
/opt/oracle.SupportTools/ibdiagtools/verify-topology
InfiniBand Link details
Run the iblinkinfo command to review the InfiniBand Link details on the cell:
CA: uso17 S 192.168.2.112,192.168.2.113 HCA-1:
0x0010e00001495101 5 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 10[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
0x0010e00001495102 6 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 1 10[ ] "SUN DCS 36P QDR uso28 10.0.9.92" ( )
Switch: 0x0010e04071e5a0a0 SUN DCS 36P QDR uso28 10.0.9.92:
1 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 12 2[ ] "uso19 C 192.168.2.116,192.168.2.117 HCA-1" ( )
1 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 4 2[ ] "uso18 C 192.168.2.114,192.168.2.115 HCA-1" ( )
1 3[ ] ==( Down/ Polling)==> [ ] "" ( )
1 4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 10 2[ ] "uso20 C 192.168.2.118,192.168.2.119 HCA-1" ( )
1 5[ ] ==( Down/ Polling)==> [ ] "" ( )
1 6[ ] ==( Down/ Polling)==> [ ] "" ( )
1 7[ ] ==( Down/ Polling)==> [ ] "" ( )
1 8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 8 2[ ] "uso26 S 192.168.2.110,192.168.2.111 HCA-1" ( )
1 9[ ] ==( Down/ Polling)==> [ ] "" ( )
1 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 6 2[ ] "uso17 S 192.168.2.112,192.168.2.113 HCA-1" ( )
1 11[ ] ==( Down/Disabled)==> [ ] "" ( )
1 12[ ] ==( Down/ Polling)==> [ ] "" ( )
1 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 14[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
1 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 13[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
1 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 16[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
1 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 15[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
1 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 18[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
1 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 17[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
ibstatus
Review the IB status and speed details using the ibstatus command:
Diagnostic collection
Collecting the right information is always important for troubleshooting or diagnosing any issue. However, when the information needs to be gathered from dozens of different files on different servers, such as the cells and DB nodes, it becomes time consuming. Oracle provides a couple of utilities/tools to gather diagnostic information from logs/traces across all cell/DB servers at one time. Below are the tools that can do the job:
sundiag.sh
sundiag.sh is available under /opt/oracle.SupportTools on each cell. The tool is used to collect information from the cell server and DB server; run the script as the root user.
root> ./sundiag.sh
Oracle Exadata Database Machine - Diagnostics Collection Tool
Gathering Linux information
Skipping ILOM collection. Use the ilom or snapshot options, or login to ILOM
over the network and run Snapshot separately if necessary.
/tmp/sundiag_usdwilo11_1418NML055_2016_02_07_13_53
Generating diagnostics tarball and removing temp directory
==============================================================================
Done. The report files are bzip2 compressed in /tmp/sundiag_usdwilo11_1418NML055_2016_02_07_13_53.tar.bz2
==============================================================================
The *.tar.bz2 file contains several files, including the alert.log, cell disk details, etc.
Automated Cell File Management
Like the automated cluster file management deletion policy, there is automated cell maintenance that performs file deletion based on date. The feature has the following characteristics:
Management Server (MS) is responsible for running the file deletion policy.
The retention for the ADR is 7 days.
Metric history files older than 7 days are deleted.
The alert.log file is renamed once it reaches 10 MB.
MS also triggers the deletion policy when file system utilization becomes high:
If /root or /var/log/oracle utilization reaches 80%, the automatic deletion policy is applied.
The automatic deletion policy is applied to /opt/oracle when utilization reaches 90%.
Files over 5 MB or more than one day old under the / file system, /tmp, /var/crash, and /var/spool are deleted.
References: https://www.toadworld.com/platforms/oracle/w/wiki/11501.managing-troubleshooting-exadata-part-2
***************************
Migrating Databases to Exadata database machine best practices
https://www.toadworld.com/platforms/oracle/w/wiki/11551.managing-troubleshooting-exadata-part-3-migrating-databases-to-exadata-database-machine-best-practices