Sunday, September 13, 2015

Cannot visit YARN resource manager Web URI at http://:8088


$ netstat -tunalp | grep LISTEN

tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      10532/java     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -              
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      10694/java     
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      10694/java     
tcp        0      0 0.0.0.0:10020           0.0.0.0:*               LISTEN      11498/java     
tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      10694/java     
tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      10886/java     
tcp        0      0 0.0.0.0:19888           0.0.0.0:*               LISTEN      11498/java     
tcp        0      0 0.0.0.0:10033           0.0.0.0:*               LISTEN      11498/java     
tcp        0      0 127.0.0.1:8020          0.0.0.0:*               LISTEN      10532/java     
tcp6       0      0 :::22                   :::*                    LISTEN      -              
tcp6       0      0 127.0.0.1:8088          :::*                    LISTEN      11027/java     
tcp6       0      0 :::13562                :::*                    LISTEN      11166/java     
tcp6       0      0 127.0.0.1:8030          :::*                    LISTEN      11027/java     
tcp6       0      0 127.0.0.1:8031          :::*                    LISTEN      11027/java     
tcp6       0      0 127.0.0.1:8032          :::*                    LISTEN      11027/java     
tcp6       0      0 127.0.0.1:8033          :::*                    LISTEN      11027/java     
tcp6       0      0 :::36580                :::*                    LISTEN      11166/java     
tcp6       0      0 :::8040                 :::*                    LISTEN      11166/java     
tcp6       0      0 :::8042                 :::*                    LISTEN      11166/java    

  

Notice that port 8088 is listening on tcp6 instead of tcp.
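
Besides the protocol column, the local address column is worth checking: a socket bound to 127.0.0.1 only accepts connections from the same machine. The following is a minimal Python sketch, not part of Hadoop, that scans netstat output like the listing above and reports listeners bound to loopback. The function name `loopback_listeners` is made up for illustration.

```python
def loopback_listeners(netstat_output):
    """Return (proto, port, process) for LISTEN sockets bound to loopback.

    Expects lines in the `netstat -tunalp` layout:
    proto recv-q send-q local-address foreign-address state pid/program
    """
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue  # skip the command line, blanks, and non-listening sockets
        proto, local, process = fields[0], fields[3], fields[6]
        # Split on the last colon so ":::22" yields address "::" and port "22".
        address, _, port = local.rpartition(":")
        if address in ("127.0.0.1", "::1"):
            found.append((proto, int(port), process))
    return found


if __name__ == "__main__":
    # Three lines taken from the listing above.
    sample = """\
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      10532/java
tcp6       0      0 127.0.0.1:8088          :::*                    LISTEN      11027/java
tcp6       0      0 :::8042                 :::*                    LISTEN      11166/java
"""
    for proto, port, process in loopback_listeners(sample):
        print(proto, port, process)
```

Run against the full listing, this would pick out port 8020 and the resource manager ports 8088 and 8030 to 8033, which are the ones that other machines cannot reach.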

The following change resolved the issue.

Modify yarn-site.xml on the master node only, as follows. Do not modify yarn-site.xml on the slave nodes:


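    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>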

 
Specifying the hostname in yarn-site.xml causes the ports to be opened as tcp6.

This is related to a bug. Adding the following to $HADOOP_HOME/bin/yarn also forces the ports to be opened as tcp:

YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"
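
After restarting the resource manager, one way to confirm that port 8088 can be reached over IPv4 from another machine is a plain TCP connection test. This is a minimal Python sketch under that assumption. The function name `can_connect_ipv4` is made up for illustration and is not part of Hadoop.

```python
import socket


def can_connect_ipv4(host, port, timeout=3.0):
    """Return True if a TCP connection over IPv4 to (host, port) succeeds."""
    try:
        # AF_INET restricts both name resolution and the connection to IPv4.
        candidates = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # the name did not resolve to an IPv4 address
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # refused, timed out, or unreachable; try the next address
    return False
```

For example, `can_connect_ipv4("master-node", 8088)` run from a slave node should return True once the web port accepts outside connections. Here "master-node" is a placeholder for the real hostname of the master.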

Saturday, September 12, 2015

IBM IoT Foundation Introduced to Big Data Analytics Class at MU.


In Tuesday's Big Data Analytics class, I invited Gayathri Srinivasan to kick off the Internet of Things (IoT) module. Gaya is a Business Development Executive of IoT at IBM. Gaya introduced the benefits of the Internet of Things, the IBM IoT Foundation, and how students could leverage IBM Bluemix to build IoT applications. Gaya also presented a Hackathon opportunity for students to win cool prizes for their semester projects.

During the class, I presented a demo of the little temperature/moisture sensor I built, which connects to the cloud, and the IBM Bluemix application that visualizes in real time the temperature and moisture data sent from the sensor. Below is a picture of the major devices used, mainly the Arduino Uno R3 board and an Arduino Ethernet shield. The little blue thingy is the temperature and moisture sensor. Students went through a lab exercise of building a Twitter bot using the IBM Watson personality module and a temperature sensor app using a simulated sensor.


Incident Management

I have been asked to write a checklist for dealing with incidents.
I feel the items below are very important for resolving an Oracle incident, or any incident; they apply to databases and to all other platforms as well. They are the fundamentals.

1. The very first thing to find out and note down is the business impact. It needs to be on the subject line of email communications.

2. Remember that the goal of incident management is to triage problems quickly and restore service as soon as possible. People often try to dig for the root cause on the incident call, which can delay service restoration and lengthen the outage. Root cause analysis should be conducted after service is restored. (On the incident call, we still need to capture all logs and trace files before any reboot.)

3. Get all related stakeholders on the call. Ask the SA what other teams need to be involved.

4. What is the error message? Gather data (logs, trace files, and parameter settings) and work to understand what the data is telling us. Ask the SA to send the error message from the log. Ask them: did you Google it, did you search the vendor knowledge base, and have you found a similar message there?

5. Check whether there were recent changes (they are frequently the cause and need to be checked every time). Search Remedy for the server name, database name, or a relevant keyword to see if there were recent changes, and capture the data.

6. Capture the server/SAN/network health checklists. We sometimes call these "meters": they show utilization, counts, durations, and special events for items such as CPU, memory, swap space, processes, I/O, disks, network paths, cables, routes, kernel parameters, long-running jobs, and number of connections. Add more capacity if needed (for example, add more memory, add more space, or enable a path).

7. Open an SR with the vendor (Oracle or other vendors) if no action plan can be determined within 30 minutes to an hour, depending on the severity level. If it is Severity 1 or 2, open an SR immediately regardless.

8. Find out what processes and jobs (including database jobs, server jobs, and number of connections) are running, and whether any special transactions are going on.
This captures what the end users are asking the system (database, servers) to do that could have caused the problem.

9. Are there any known issues?

10. Compare with a similar server or database that is working, to understand what is normal and what is not.

11. A reboot can fix a lot of problems because it serves as a kind of reset. When there is no other workaround, try a reboot. But a reboot often destroys evidence, which makes root cause analysis very difficult, and the issue may recur if we don't know the root cause.
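
Since logs and trace files need to be captured before a reboot, it can help to have a small script ready that bundles them into a timestamped archive. This is a minimal Python sketch of that idea, not an established tool. The function name `capture_evidence` and its parameters are made up for illustration, and the paths to collect depend entirely on the platform.

```python
import os
import tarfile
import time


def capture_evidence(paths, dest_dir, label="incident"):
    """Bundle existing log/trace files or directories into a timestamped .tar.gz.

    Paths that do not exist are reported and skipped. Returns the archive path.
    """
    os.makedirs(dest_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(dest_dir, f"{label}-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):
                # Keep the original location visible inside the archive.
                tar.add(path, arcname=os.path.abspath(path).lstrip(os.sep))
            else:
                print(f"skipped (not found): {path}")
    return archive
```

The destination directory should sit outside the directories being collected and on storage that survives the reboot, so the evidence is still there for root cause analysis afterwards.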