Saturday, September 12, 2015

Incident Management

I was asked to write a checklist for dealing with incidents.
Below are the items I consider most important when resolving an Oracle incident, or any other incident. They apply to databases and to all other platforms alike. They are the fundamentals.

1. The very first thing is to find out and note down the business impact. It needs to be in the subject line of email communications.

2. Remember that the goal of incident management is to triage problems quickly and restore service as soon as possible. Often, people dig for the root cause on the incident call, which can delay service restoration and lengthen the outage. Root cause analysis should be conducted after service is restored. (On the incident call, we do need to capture all logs and trace files before any reboot.)
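The reminder about capturing logs and trace files before a reboot can be scripted ahead of time, so nobody has to improvise during the call. Here is a minimal sketch, assuming the files to keep are already known; the function name `archive_evidence` and the paths in the usage note are hypothetical.

```python
import tarfile
import time
from pathlib import Path


def archive_evidence(paths, dest_dir, label="incident"):
    """Bundle log/trace files or directories into one timestamped .tar.gz.

    Returns the path of the archive that was written.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{label}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for p in map(Path, paths):
            # Skip missing paths rather than abort in the middle of an incident.
            if p.exists():
                tar.add(str(p), arcname=p.name)
    return archive
```

For example, `archive_evidence(["/var/log/messages", "/tmp/trace_dir"], "/backup/incidents")` would write one archive under `/backup/incidents`. Entries are stored under their base name only, so two inputs with the same file name would collide; give them distinct names first if that matters.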

3. Get all related stakeholders on the call. Ask the system administrator (SA) which other teams need to be involved.

4. What is the error message? Gather data (logs, trace files, and parameter settings) and work to understand what the data is telling us. Ask the SA to send the error message from the log. Ask them: Did you Google it? Did you search the vendor knowledge base? Did you find a similar message there?
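To get a first look at what the logs are saying, it helps to count which error codes appear and how often. The sketch below assumes Oracle-style codes of the form `ORA-` followed by five digits; the pattern and the function name `summarize_errors` are illustrative and would need adjusting for other platforms.

```python
import re
from collections import Counter

# Assumed pattern: Oracle-style codes such as ORA-00600.
ERROR_PATTERN = re.compile(r"\b(ORA-\d{5})\b")


def summarize_errors(log_lines):
    """Count each distinct error code found in an iterable of log lines.

    Returns (code, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in log_lines:
        counts.update(ERROR_PATTERN.findall(line))
    return counts.most_common()
```

Passing an open file works because a file object iterates line by line, for example `summarize_errors(open(path, errors="replace"))`, where `path` is whichever log the SA sent. The most frequent codes are usually the ones worth searching in the vendor knowledge base first.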

5. Check whether there were recent changes. They are frequently the cause, so this needs to be checked every time. Search Remedy for the server name, the database name, or a relevant keyword to see if there were recent changes, and capture the data.
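Once the change records are in hand, narrowing them to a window before the incident is simple. The sketch below does not call Remedy; it assumes the records were already exported into a list of dictionaries with a `completed` timestamp, which is a hypothetical shape chosen for illustration, as is the seven-day default.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, lookback_days=7):
    """Return changes completed in the lookback window before the incident.

    `changes` is a list of dicts with a datetime under the key "completed"
    (an assumed export format). Results are sorted newest first.
    """
    window_start = incident_start - timedelta(days=lookback_days)
    hits = [c for c in changes
            if window_start <= c["completed"] <= incident_start]
    return sorted(hits, key=lambda c: c["completed"], reverse=True)
```

Sorting newest first puts the change closest to the incident at the top, which is usually the first suspect.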

11. Capture the server/SAN/network health checklists. We sometimes call these "meters"; they show utilization, counts, durations, and special events for things such as CPU, memory, swap space, processes, I/O, disks, network paths, cables, routes, kernel parameters, long-running jobs, and number of connections. Add more capacity if needed (for example, add more memory, add more space, or enable a path).
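A few of these meters can be captured with nothing more than the Python standard library. This is only a sketch covering load average, CPU count, and disk usage on a Unix-like host; memory, swap, I/O, and network meters would need platform tools that are not shown. The name `health_snapshot` is hypothetical.

```python
import os
import shutil
import time


def health_snapshot(mount_points=("/",)):
    """Collect a few basic meters: CPU count, load average, disk usage."""
    snapshot = {
        "taken_at": time.strftime("%Y-%m-%d %H:%M:%S"),
        "cpu_count": os.cpu_count(),
        # os.getloadavg() exists on Unix only; record None elsewhere.
        "load_avg": os.getloadavg() if hasattr(os, "getloadavg") else None,
        "disk_used_pct": {},
    }
    for mount in mount_points:
        usage = shutil.disk_usage(mount)
        pct = 100 * usage.used / usage.total if usage.total else 0.0
        snapshot["disk_used_pct"][mount] = round(pct, 1)
    return snapshot
```

Taking one snapshot at the start of the call and another a few minutes later gives two timestamped points to compare, which is often enough to tell a steady state from something that is getting worse.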

6. Open an SR with the vendor (Oracle or another vendor) if no action plan can be determined within 30 minutes to an hour, depending on the severity level. If it is Sev 1 or 2, open the SR immediately regardless.
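The escalation rule is easy to forget under pressure, so it can help to write it down as a function. In this sketch the exact deadlines per severity are my assumption (30 minutes for Sev 3, 60 minutes for anything lower); the checklist only says "30 minutes to an hour".

```python
# Assumed deadlines in minutes for lower severities; tune to local policy.
SR_DEADLINE_MINUTES = {3: 30, 4: 60}


def should_open_sr(severity, minutes_elapsed, have_action_plan):
    """Decide whether it is time to open a vendor SR."""
    if severity in (1, 2):
        return True  # Sev 1 or 2: open immediately, regardless of progress
    if have_action_plan:
        return False
    deadline = SR_DEADLINE_MINUTES.get(severity, 60)
    return minutes_elapsed >= deadline
```

For instance, `should_open_sr(3, 45, False)` is true under these assumed thresholds, while `should_open_sr(4, 45, False)` is not yet.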

7. Find out what processes and jobs (including database jobs and server jobs) are running, how many connections there are, and whether any special transactions are in progress.
This captures what the end users are asking the system (database, servers) to do, which could have caused the problem.
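On the server side, one quick way to see what is running is to look at `ps` output sorted by elapsed time. The sketch below parses the text produced by `ps -eo pid,etime,args` and keeps processes older than a threshold; the helper names and the one-hour default are assumptions, and database jobs and sessions would need to be queried from the database itself.

```python
def etime_to_seconds(etime):
    """Convert a ps elapsed time of the form [[dd-]hh:]mm:ss to seconds."""
    days, _, clock = etime.rpartition("-")
    parts = [int(x) for x in clock.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def long_running(ps_output, min_seconds=3600):
    """Return (pid, elapsed_seconds, command) tuples, longest running first."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, etime, args = fields
        secs = etime_to_seconds(etime)
        if secs >= min_seconds:
            rows.append((int(pid), secs, args))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

The input can come from `subprocess.run(["ps", "-eo", "pid,etime,args"], capture_output=True, text=True).stdout`. Keeping the parsing separate from the command means the same function also works on `ps` output that someone pasted into the incident ticket.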

8. Are there any known issues?

9. Compare with a similar server or database that is working, to understand what is normal and what is not.
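The comparison is easier when both sides are reduced to name/value pairs, such as kernel parameters or database parameters. This is a rough sketch that assumes the settings were already loaded into two dictionaries; the function name `diff_settings` and the parameter names in the example are hypothetical.

```python
def diff_settings(working, broken):
    """Return {name: (working_value, broken_value)} for settings that differ.

    A setting that is missing on one side is reported as None on that side.
    """
    names = sorted(set(working) | set(broken))
    return {
        n: (working.get(n), broken.get(n))
        for n in names
        if working.get(n) != broken.get(n)
    }
```

For example, `diff_settings({"processes": 300, "sga_target": "4G"}, {"processes": 150, "sga_target": "4G"})` reports only `processes`, with the working value first.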

10. A reboot can fix a lot of problems because it serves as a kind of reset. When there is no other workaround, try a reboot. But a reboot often destroys evidence, which makes root cause analysis very difficult, and the issue may recur if we don't know the root cause.
