Sunday, September 13, 2015

Cannot visit the YARN ResourceManager web UI at http://<hostname>:8088


$ netstat -tunalp | grep LISTEN

tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      10532/java     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -              
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      10694/java     
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      10694/java     
tcp        0      0 0.0.0.0:10020           0.0.0.0:*               LISTEN      11498/java     
tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      10694/java     
tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      10886/java     
tcp        0      0 0.0.0.0:19888           0.0.0.0:*               LISTEN      11498/java     
tcp        0      0 0.0.0.0:10033           0.0.0.0:*               LISTEN      11498/java     
tcp        0      0 127.0.0.1:8020          0.0.0.0:*               LISTEN      10532/java     
tcp6       0      0 :::22                   :::*                    LISTEN      -              
tcp6       0      0 127.0.0.1:8088          :::*                    LISTEN      11027/java     
tcp6       0      0 :::13562                :::*                    LISTEN      11166/java     
tcp6       0      0 127.0.0.1:8030          :::*                    LISTEN      11027/java     
tcp6       0      0 127.0.0.1:8031          :::*                    LISTEN      11027/java     
tcp6       0      0 127.0.0.1:8032          :::*                    LISTEN      11027/java     
tcp6       0      0 127.0.0.1:8033          :::*                    LISTEN      11027/java     
tcp6       0      0 :::36580                :::*                    LISTEN      11166/java     
tcp6       0      0 :::8040                 :::*                    LISTEN      11166/java     
tcp6       0      0 :::8042                 :::*                    LISTEN      11166/java    

  

Notice that port 8088 is listening on tcp6 and is bound to 127.0.0.1 instead of listening on tcp on all interfaces, so the ResourceManager web UI is not reachable from other hosts.

Implementing the following resolved the issue.

Modify yarn-site.xml on the master node only, as follows. Do not modify yarn-site.xml on the slave nodes:


    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <!-- the middle segment of this property name must match the aux-service name above -->
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
 
Specifying the hostname causes the ports to be started as tcp6.

This is related to a known bug. Adding the following to $HADOOP_HOME/bin/yarn will also force the ports to start on tcp:

YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"
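After restarting YARN, a quick way to re-check is to filter the netstat output for listeners bound to the loopback address. A minimal sketch (the function name is made up here; feed it the output of `netstat -tunalp`):

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -tunalp` output on stdin and print
# the ports of java processes bound to 127.0.0.1 only -- those are the
# listeners unreachable from other hosts.
flag_loopback_listeners() {
  awk '$NF ~ /java/ && $4 ~ /^127\.0\.0\.1:/ { split($4, a, ":"); print a[2] }'
}

# usage: netstat -tunalp | flag_loopback_listeners
```

After the fix, 8088 should no longer appear in the output.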

Saturday, September 12, 2015

IBM IoT Foundation Introduced to Big Data Analytics Class at MU.


In Tuesday's Big Data Analytics class, I invited Gayathri Srinivasan to kick off the Internet of Things (IoT) module. Gaya is a Business Development Executive for IoT at IBM. Gaya introduced the benefits of the Internet of Things and the IBM IoT Foundation, and showed how students could leverage IBM Bluemix to build IoT applications. Gaya also presented a Hackathon opportunity for students to win cool prizes for their semester projects.

During the class, I presented a demo of the little temperature/moisture sensor I built, which connects to the cloud, along with the IBM Bluemix application that visualizes the temperature and moisture data streamed from the sensor in real time. Below is a picture of the major devices used, mainly the Arduino Uno R3 board and an Arduino Ethernet shield. The little blue thingy is the temperature and moisture sensor. Students went through a lab exercise of building a Twitter bot using the IBM Watson personality module and a temperature sensor app using a simulated sensor.


Incident Management

I was asked to write a checklist for dealing with incidents.
Below are some very important items for resolving Oracle or any other incident; they apply to databases and equally to all other platforms. These are the fundamentals.

1. The very first thing to find out and note down is the business impact. It needs to be on the subject line of all email communications.

2. Remember, incident management is about triaging problems quickly and restoring service as soon as possible. Often people try to dig for the root cause on the incident call, which can delay service restoration and lengthen the outage. Root cause analysis should be conducted after service is restored. (On the incident call, we need to capture all logs and trace files before any reboot.)

3. Get all related stakeholders on the call. Ask the SA what other teams need to be involved.

4. What is the error message? Gather data (logs, trace files, and parameter settings) and work to understand what the data is telling us. Ask the SA to send the error message from the log. Ask them: did you Google it, did you search the vendor knowledge base, have you found a similar message in the knowledge base?

5. Check whether there were recent changes (that is frequently the cause and needs to be checked every time). Search Remedy for the server name, database name, or a relevant keyword to see if there were recent changes, and capture the data.

6. Capture the server/SAN/network health checklists. We sometimes call these "meters"; they show utilization, counts, durations, and special events, such as CPU, memory, swap space, processes, I/O, disks, network paths, cables, routes, kernel parameters, long-running jobs, and number of connections. Add more capacity if needed (for example, add more memory, add more space, or enable a path).

7. Open an SR with the vendor (Oracle or another vendor) if no action plan can be determined within 30 minutes to an hour, depending on severity level. If it is severity 1 or 2, open the SR immediately regardless.

8. Find out what processes/jobs are running (including database jobs and server jobs, and the number of connections), and whether any special transactions are going on. This captures what the end users are asking the system (database, servers) to do, which could have caused the problem.

9. Are there any known issues?

10. Compare with a similar server or database that is working, to understand what is normal and what is not.

11. A reboot can fix a lot of problems because it serves as a kind of reset. When there is no other workaround, try a reboot. But a reboot often destroys evidence, which makes root cause analysis very difficult, and the issue may reoccur if we never learn the root cause.
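The health-meters capture step above is easier to do consistently if it is scripted, so the snapshot is taken the same way on every incident. A minimal sketch using only portable commands; the output path and the command list are illustrative, and a real capture would add iostat, vmstat, netstat, and platform-specific tools:

```shell
#!/bin/sh
# Illustrative capture of basic health "meters" into a timestamped file
# that can be attached to the incident ticket.
capture_health() {
  outfile="${1:-/tmp/health_$(date +%Y%m%d%H%M%S).log}"
  {
    echo "=== uptime/load ===";  uptime
    echo "=== disk usage ===";   df -h
    echo "=== top processes ==="; ps aux | head -15
  } > "$outfile" 2>&1
  echo "$outfile"   # print the path so the caller can attach the file
}
```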

Friday, September 16, 2011

Clean Standby archive

filename standby_archive_clean.sh

#!/bin/bash



#Begin environment setup
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/dbhome_1
export ORACLE_SID=TESTDB
export PATH=$PATH:/usr/sbin:$ORACLE_HOME/bin:$HOME/bin
CURRENT_DATE=`date +%Y%m%d`
logfile="/u01/app/oracle/admin/TESTDB/dba_scripts/standby_archive_clean$CURRENT_DATE.log"

# Begin RMAN script
$ORACLE_HOME/bin/rman log $logfile << EOF
CONNECT TARGET /
SET COMMAND ID TO 'DELETE STANDBY ARCHIVE';
CROSSCHECK ARCHIVELOG ALL;
DELETE NOPROMPT ARCHIVELOG ALL;
EXIT;
EOF

# Validation and e-mail status
status=$?
if [ $status -gt 0 ]; then
mailx -s "STANDBY ARCHIVE CLEAN FAILED: TESTDB" DBA@TESTDB.COM << !
`cat $logfile`
!
else
 mailx -s "SUCCESSFUL STANDBY ARCHIVE CLEAN: TESTDB" DBA@TESTDB.COM << !
`cat $logfile`
!
fi

# remove empty directories left by rman
# (GNU find can also do this in one step: find <dir> -depth -type d -empty -delete)
for folder in $(find /u01/app/oracle/oradata/TESTDB/archivelog/TESTDB_STANDBY/archivelog -type d); do
if [ "$(ls "$folder" | wc -l)" -eq 0 ]; then
rmdir "$folder"
fi
done


Monitor Invalid Packages in SYS to Debug an Issue

file name: sys_monitor.sh

#!/bin/bash


#Begin environment setup
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/dbhome_1
export ORACLE_SID=TESTDB
export PATH=$PATH:/usr/sbin:$ORACLE_HOME/bin:$HOME/bin
CURRENT_DATE=`date +%Y%m%d`
logfile="/home/oracle/bin/dba_scripts/sys_monitor$CURRENT_DATE.log"

# Begin sql script
sqlplus /nolog @/home/oracle/bin/dba_scripts/sys_monitor.sql > $logfile

# Validation and e-mail status
status=$?
if [ $status -gt 0 ]; then
mailx -s "sys monitor info FAILED" dba@testdb.com << !
`cat $logfile`
!
else
mailx -s "sys monitor info SUCCESSFUL" dba@testdb.com << !
`cat $logfile`
!
fi

rm $logfile


--------------------
filename: sys_monitor.sql

connect / as sysdba
select 'ALTER PACKAGE SYS.' || OBJECT_NAME || ' COMPILE;' from dba_objects
where owner='SYS' and object_type='PACKAGE' AND STATUS='INVALID';


select 'ALTER PACKAGE SYS.' || OBJECT_NAME || ' COMPILE BODY;' from dba_objects
where owner='SYS' and object_type='PACKAGE BODY' AND STATUS='INVALID';

exit;
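Note that the queries above only generate the ALTER ... COMPILE statements; they still have to be reviewed and executed. A hedged sketch of a helper that pulls the generated statements out of the spooled log (the function name is made up, and the log format assumed is the plain SELECT output shown above):

```shell
#!/bin/sh
# Hypothetical follow-up to sys_monitor.sql: extract the generated
# ALTER PACKAGE statements from the spooled log so they can be
# reviewed and fed back to SQL*Plus.
extract_compile_statements() {
  grep '^ALTER PACKAGE SYS\.' "$1"
}

# usage: extract_compile_statements "$logfile" > fix_invalid_sys.sql
```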







Primary DB Backup Script

#!/bin/bash
#filename: primarydb_backup.sh

#Begin environment setup
ORACLE_BASE=/u01/app/oracle
export ORACLE_BASE
export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/dbhome_1
export ORACLE_SID=TESTDB
export PATH=$PATH:/usr/sbin:$ORACLE_HOME/bin:$HOME/bin
CURRENT_DATE=`date +%Y%m%d`

#define variables
logfile="/home/oracle/bin/dba_scripts/primarydb_backup$CURRENT_DATE.log"

# Begin RMAN script
$ORACLE_HOME/bin/rman @/home/oracle/bin/dba_scripts/primarydb_backup.rman USING $CURRENT_DATE log $logfile

# Validation and send email
status=$?
if [ $status -gt 0 ]; then
mailx -s "testdb backup FAILED: testdb" dba@testdb.com << !
`cat $logfile`
!
else
 mailx -s "testdb backup SUCCESSFUL: testdb" dba@testdb.com << !
`cat $logfile`
!
fi

rm $logfile

#Remove empty directories: after "delete obsolete", rman leaves some empty dated directories
for folder in $(find /u01/app/oracle/oradata/TESTDB/archivelog/TESTDB_PRIMARY/backupset -type d); do
if [ "$(ls "$folder" | wc -l)" -eq 0 ]; then
rmdir "$folder"
fi
done

----------------------
FILENAME: primarydb_backup.rman

CONNECT TARGET /
SET COMMAND ID TO 'cron_rman_full_bk_&1';
backup device type disk tag 'primary_full_daily%s' database;
backup device type disk tag 'primary_full_daily%s' archivelog all not backed up delete all input;
allocate channel for maintenance type disk;
crosscheck backup;
crosscheck archivelog all;
delete noprompt expired backup;
delete noprompt obsolete device type disk;
release channel;
EXIT;
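These scripts are meant to run unattended; a sketch of a crontab entry that could drive the backup nightly (the schedule is illustrative and is not from the original post; it assumes primarydb_backup.sh is executable and, as above, sets up its own Oracle environment):

```shell
# Run the full backup at 02:30 every night as the oracle user.
30 2 * * * /home/oracle/bin/dba_scripts/primarydb_backup.sh > /tmp/primarydb_backup_cron.log 2>&1
```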

-------------------------

Primary DB Scripts

#!/bin/bash
# collecting information from primary db, filename primary_db-info.sh

#Begin environment setup

export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/grid
export ORACLE_SID=+ASM
export PATH=$PATH:/usr/sbin:$ORACLE_HOME/bin:$HOME/bin:.

CURRENT_DATE=`date +%Y%m%d`
logfile="/home/oracle/bin/dba_scripts/prim_db_info$CURRENT_DATE.log"
# Begin sql script, -S turns off the banner
cd /home/oracle/bin/dba_scripts/
sqlplus -S /nolog @/home/oracle/bin/dba_scripts/primarydb_asm_info.sql


export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/dbhome_1
export ORACLE_SID=TESTDB


# Begin sql script
sqlplus -S /nolog @/home/oracle/bin/dba_scripts/primary_db_info.sql

# Validation and e-mail status
status=$?
if [ $status -gt 0 ]; then
mailx  -s "testdb_primary info gathering FAILED" dba@testdb.com << !
`df -h`
`cat $logfile`
!
else
 mailx -s "testdb_primary info gathering SUCCESSFUL" dba@testdb.com << !
`df -h`
 `cat $logfile`
!
fi

rm $logfile

---------------------------
connect / as sysdba

-------------------------------------------------------------------------
-- This script collects Data Guard related info for the primary db    --
-- It references a note on My Oracle Support (Metalink)               --
-- file name: primary_db_info.sql                                     --
-------------------------------------------------------------------------
column timecol new_value timestamp
column spool_extension new_value suffix

--this date format has to match the one used in the calling shell script so the spooled log can be emailed.
select to_char(sysdate,'YYYYmmdd') timecol,
'.log' spool_extension from sys.dual;
spool prim_db_info&&timestamp&&suffix APPEND

set echo off
set feedback off
set lines 200
set pagesize 35
set trim on
set trims on

alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';

COLUMN time heading "PRIMARY DATABASE INFO GENERATED ON" FORMAT a35
select to_char(sysdate) time from dual;

column pid Format 9999999999
column status  heading STATUS JUS L Format a13 tru
column thread# JUS R Format  99
column sequence# Format 9999999999
column block# Format 9999999999
column delay_mins Format 99999999999

-- Query v$managed_standby to see the status of processes involved in
-- the shipping redo on this system.  Does not include processes needed to
-- apply redo.

select process,status,client_process,sequence#
from v$managed_standby
where PROCESS='LNS' OR (PROCESS='ARCH' AND STATUS='CONNECTED');

-- The following query will determine the current sequence number
-- and the last sequence archived.  If you are remotely archiving
-- using the LGWR process then the archived sequence should be one
-- higher than the current sequence.  If remotely archiving using the
-- ARCH process then the archived sequence should be equal to the
-- current sequence.  The applied sequence information is updated at
-- log switch time.

select ads.dest_id,max(sequence#) "Current Sequence",
           max(log_sequence) "Last Archived"
   from v$archived_log al, v$archive_dest ad, v$archive_dest_status ads
   where ad.dest_id=al.dest_id
   and al.dest_id=ads.dest_id
   and al.resetlogs_change#=(select max(resetlogs_change#) from v$archived_log )
   group by ads.dest_id;

-- In the following the database_role should be primary. If protection_level is different
-- than protection_mode then for some reason the mode listed in
-- protection_mode experienced a need to downgrade.  Once the error
-- condition has been corrected the protection_level should match the
-- protection_mode after the next log switch.

column role format a7 tru
column name format a10 wrap

select name,platform_id,database_role role,log_mode,
       flashback_on flashback,protection_mode,protection_level 
from v$database;

-- ARCHIVER can be (STOPPED | STARTED | FAILED). FAILED means that the
-- archiver failed to archive a log last time, but will try again within 5
-- minutes. LOG_SWITCH_WAIT is the ARCHIVE LOG/CLEAR LOG/CHECKPOINT event log
-- switching is waiting for.  Note that if ALTER SYSTEM SWITCH LOGFILE is
-- hung, but there is room in the current online redo log, then value is
-- NULL

column host_name format a20 tru
column version format a9 tru

select instance_name,host_name,version,archiver,log_switch_wait
from v$instance;
 


-- Force logging is not mandatory but is recommended.
-- During normal operations it is acceptable for
-- SWITCHOVER_STATUS to be SESSIONS ACTIVE or TO STANDBY.

column force_logging format a13 tru
column remote_archive format a14 tru
column dataguard_broker format a16 tru

select force_logging,remote_archive,
       supplemental_log_data_pk,supplemental_log_data_ui,
       switchover_status,dataguard_broker
from v$database; 

-- This query produces a list of all archive destinations.  It shows if
-- they are enabled, what process is servicing that destination, if the
-- destination is local or remote, and if remote what the current mount ID
-- is.

column destination format a35 wrap
column process format a7
column archiver format a8
column ID format 99
column mid format 99

select dest_id "ID",destination,status,target,
       schedule,process,mountid  mid
from v$archive_dest where dest_id=1 or dest_id=2;

-- This select will give further detail on the destinations as to what
-- options have been set.  Register indicates whether or not the archived
-- redo log is registered in the remote destination control file.

set numwidth 8
column ID format 99

select dest_id "ID",archiver,transmit_mode,affirm,async_blocks async,
       net_timeout net_time,delay_mins delay,reopen_secs reopen,
       register,binding
from v$archive_dest where dest_id=1 or dest_id=2;

-- The following select will show any errors that occurred the last time
-- an archive to the destination was attempted.  If ERROR is blank and
-- status is VALID then the archive completed correctly.

column error format a55 wrap

select dest_id,status,error from v$archive_dest
where dest_id=1 or dest_id=2;
-- The query below will determine if any error conditions have been
-- reached by querying the v$dataguard_status view (view only available in
-- 9.2.0 and above):

column message format a80

select message, timestamp
from v$dataguard_status
where severity in ('Error','Fatal')
order by timestamp;



-- The following select will attempt to gather as much information as
-- possible from the standby.  SRLs are not supported with Logical Standby
-- until Version 10.1.

set numwidth 8
column ID format 99
column "SRLs" format 99
column Active format 99

select dest_id id,database_mode db_mode,recovery_mode,
       protection_mode,standby_logfile_count "SRLs",
       standby_logfile_active ACTIVE,
       archived_seq#
from v$archive_dest_status where dest_id=1 or dest_id=2;



-- The following query is run on the primary to see if SRL's have been
-- created in preparation for switchover.

SELECT 'STANDBY LOGS (SRL)' AS "The Following Are:" from dual;
select group#,sequence#,bytes from v$standby_log;

-- The SRLs above should be one more in number than, and match in size with,
-- the ORLs returned below:

SELECT 'ONLINE REDO LOGS (ORL)' AS "The Following Are:" from dual;
select group#,thread#,sequence#,bytes,archived,status from v$log;


spool off

exit;

----------------------------
--file name primarydb_asm_info.sql

connect / as sysdba
column timecol new_value timestamp
column spool_extension new_value suffix

--this date format has to match the one used in the calling shell script so the spooled log can be emailed.
select to_char(sysdate,'YYYYmmdd') timecol,
'.log' spool_extension from sys.dual;
spool prim_db_info&&timestamp&&suffix

set echo OFF
set feedback off

SET LINESIZE 600
set pagesize 35
set trim on
set trims on

alter session set nls_date_format='DD-MON-YYYY HH24:MI:SS';

COLUMN time heading "ASM DISK GROUP INFO GENERATED ON" FORMAT a35
select  sysdate time from dual;
select 'HOSTNAME:' AS HOSTNAME, MACHINE " " from v$session where program like '%SMON%';


SELECT GROUP_NUMBER, NAME, TOTAL_MB/1024 AS TOTAL_GB, FREE_MB/1024 AS FREE_GB FROM V$ASM_DISKGROUP ORDER BY GROUP_NUMBER;

SELECT GROUP_NUMBER, DISK_NUMBER,BYTES_READ/1024/1024/1024 AS GB_READ,BYTES_WRITTEN/1024/1024/1024 GB_WRITTEN, READ_ERRS,WRITE_ERRS,HOT_READS,HOT_WRITES, COLD_READS, COLD_WRITES FROM v$ASM_DISK ORDER BY GROUP_NUMBER,DISK_NUMBER;


spool off

exit;