Thursday 14 March 2013

BIG DATA HADOOP Testing with MapReduce Examples Part 1

hadoop-mapreduce-examples-2.0.0-cdh4.2.0.jar is the MapReduce examples jar (CDH 4.2.0 build), used here to test the Hadoop installation.

The wordcount example reads text files and counts how often each word occurs. Here I pass it name.txt, which was already copied to HDFS.


hadoop@bigdataserver1:~/hadoop> hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.0.jar wordcount /bigdata1/name.txt /bigdata1/output
13/03/13 14:58:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/03/13 14:58:06 INFO mapreduce.Cluster: Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid "mapreduce.jobtracker.address" configuration value for LocalJobRunner : "localhost:9001"
13/03/13 14:58:06 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1188)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1184)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapreduce.Job.connect(Job.java:1183)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1212)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1236)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
hadoop@bigdataserver1:~/hadoop>

The fix for the above error was to set (export) HADOOP_MAPRED_HOME in the hadoop-env.sh file.
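A minimal sketch of what that line in etc/hadoop/hadoop-env.sh might look like. The value is an assumption based on the install directory that `pwd` reports in this post (/home/hadoop/hadoop). Judging from the `hadoop classpath` output further down, the Hadoop scripts appear to append share/hadoop/mapreduce/* to this variable themselves, so it should point at the install root rather than at the mapreduce directory.

```shell
# Assumption: Hadoop is unpacked at /home/hadoop/hadoop (the directory shown by pwd).
# Point HADOOP_MAPRED_HOME at the install root, not at share/hadoop/mapreduce.
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop

# Print the directory the MapReduce jars are then expected to be found in.
echo "MapReduce jars expected under: $HADOOP_MAPRED_HOME/share/hadoop/mapreduce"
```

If the variable is set to the mapreduce directory itself, the resulting classpath entry ends in mapreduce/share/hadoop/mapreduce/*, which does not exist.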


Ran the job again and it failed with a different error:


hadoop@bigdataserver1:~/hadoop> hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.0.jar wordcount /bigdata1/name.txt /bigdata1/output
java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/partition/InputSampler$Sampler
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
        at java.lang.Class.getMethod0(Class.java:2670)
        at java.lang.Class.getMethod(Class.java:1603)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.<init>(ProgramDriver.java:60)
        at org.apache.hadoop.util.ProgramDriver.addClass(ProgramDriver.java:103)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:51)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.partition.InputSampler$Sampler
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        ... 12 more
hadoop@bigdataserver1:~/hadoop>

The fix is to add the MapReduce jars to the classpath (HADOOP_CLASSPATH) in the hadoop-env.sh file.



# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=/home/hadoop/hadoop/share/hadoop/mapreduce/*:$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done
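In the loop above, the MapReduce directory is only added when HADOOP_CLASSPATH is already non-empty, and when HADOOP_HOME is unset the glob becomes the literal /contrib/capacity-scheduler/*.jar. Below is a hedged sketch of a more defensive variant. It is not the stock hadoop-env.sh; the default install path and the MAPRED_JARS variable name are assumptions made for illustration.

```shell
# Assumption: Hadoop lives in /home/hadoop/hadoop unless HADOOP_HOME says otherwise.
HADOOP_HOME="${HADOOP_HOME:-/home/hadoop/hadoop}"

# MAPRED_JARS is a hypothetical helper variable. The * is kept literal on purpose:
# Java expands a trailing /* classpath entry to all jars in that directory.
MAPRED_JARS="$HADOOP_HOME/share/hadoop/mapreduce/*"

# Always add the MapReduce jars, whether or not HADOOP_CLASSPATH was set before.
if [ -n "$HADOOP_CLASSPATH" ]; then
  HADOOP_CLASSPATH="$MAPRED_JARS:$HADOOP_CLASSPATH"
else
  HADOOP_CLASSPATH="$MAPRED_JARS"
fi

# Append capacity-scheduler jars only if they actually exist.
for f in "$HADOOP_HOME"/contrib/capacity-scheduler/*.jar; do
  [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
  HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$f"
done

export HADOOP_CLASSPATH
echo "$HADOOP_CLASSPATH"
```

Running `hadoop classpath` afterwards should show the share/hadoop/mapreduce/* entry exactly once.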


hadoop@bigdataserver1:~/hadoop> hadoop classpath
/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/share/hadoop/common/lib/*:/home/hadoop/hadoop/share/hadoop/common/*:/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop/share/hadoop/hdfs/*:/home/hadoop/hadoop/share/hadoop/yarn/lib/*:/home/hadoop/hadoop/share/hadoop/yarn/*:/home/hadoop/hadoop/share/hadoop/mapreduce/share/hadoop/mapreduce/*
hadoop@bigdataserver1:~/hadoop> ls /home/hadoop/hadoop/share/hadoop/mapreduce/share/hadoop/mapreduce/*
/bin/ls: /home/hadoop/hadoop/share/hadoop/mapreduce/share/hadoop/mapreduce/*: No such file or directory
hadoop@bigdataserver1:~/hadoop> pwd
/home/hadoop/hadoop
hadoop@bigdataserver1:~/hadoop> echo $CLASSPATH

hadoop@bigdataserver1:~/hadoop> vi etc/hadoop/hadoop-env.sh
hadoop@bigdataserver1:~/hadoop> echo $HADOOP_HOME

hadoop@bigdataserver1:~/hadoop> export HADOOP_HOME=/home/hadoop/hadoop
hadoop@bigdataserver1:~/hadoop> $HADOOP_HOME/contrib/capacity-scheduler/*.jar
hadoop@bigdataserver1:~/hadoop> ls $HADOOP_HOME/contrib/capacity-scheduler/*.jar
/bin/ls: /home/hadoop/hadoop/contrib/capacity-scheduler/*.jar: No such file or directory
hadoop@bigdataserver1:~/hadoop> echo $HADOOP_CLASSPATH

hadoop@bigdataserver1:~/hadoop> ls /home/hadoop/hadoop/share/hadoop/mapreduce
hadoop-mapreduce-client-app-2.0.0-cdh4.2.0.jar     hadoop-mapreduce-client-jobclient-2.0.0-cdh4.2.0.jar        lib
hadoop-mapreduce-client-common-2.0.0-cdh4.2.0.jar  hadoop-mapreduce-client-jobclient-2.0.0-cdh4.2.0-tests.jar  lib-examples
hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar    hadoop-mapreduce-client-shuffle-2.0.0-cdh4.2.0.jar
hadoop-mapreduce-client-hs-2.0.0-cdh4.2.0.jar      hadoop-mapreduce-examples-2.0.0-cdh4.2.0.jar
hadoop@bigdataserver1:~/hadoop>
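The manual `ls` checks above can be automated. The following is a small sketch, not part of Hadoop: check_classpath is a hypothetical helper that splits a classpath string on ":" and reports every entry whose file or directory does not exist.

```shell
# check_classpath: hypothetical helper, prints classpath entries that do not exist.
# Usage on a live install: check_classpath "$(hadoop classpath)"
check_classpath() {
  printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
    [ -n "$entry" ] || continue      # ignore empty entries
    path="${entry%/\*}"              # strip a trailing /* wildcard
    path="${path%/\*.jar}"           # strip a trailing /*.jar wildcard
    [ -e "$path" ] || printf 'missing: %s\n' "$entry"
  done
}

# Example with one entry that exists (/) and one made-up entry that does not.
check_classpath "/:/nonexistent-gr-dir/*"
```

Against the classpath shown above, this would flag both the /contrib/capacity-scheduler/*.jar entry and the doubled mapreduce/share/hadoop/mapreduce/* entry.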


hadoop@bigdataserver1:~/hadoop> vi etc/hadoop/hadoop-env.sh
Updated the classpath setting in hadoop-env.sh, then ran the job again:

hadoop@bigdataserver1:~/hadoop> hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.0.jar WordCount /bigdata1/name.txt /bigdata1/output
Unknown program 'WordCount' chosen.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
hadoop@bigdataserver1:~/hadoop>



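This looks promising: the examples driver now loads and lists its programs, so the MapReduce classes are on the classpath. The remaining failure is only the program name, which is case-sensitive. It must be wordcount, not WordCount.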
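For reference, a sketch of the corrected invocation. The jar path and HDFS paths are the ones used earlier in this post; the EXAMPLES_JAR and PROGRAM variables and the PATH guard are only illustrative additions.

```shell
# Hypothetical helper variables; paths are copied from the commands above.
EXAMPLES_JAR=./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.0.jar

# Program names in the examples driver are lowercase, so normalise the name.
PROGRAM=$(printf '%s' 'WordCount' | tr '[:upper:]' '[:lower:]')

if command -v hadoop >/dev/null 2>&1; then
  hadoop jar "$EXAMPLES_JAR" "$PROGRAM" /bigdata1/name.txt /bigdata1/output
else
  # Only reached on machines without Hadoop installed.
  echo "hadoop not on PATH; would run: hadoop jar $EXAMPLES_JAR $PROGRAM /bigdata1/name.txt /bigdata1/output"
fi
```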