Hi
I am a newbie in the Hadoop ecosystem.
I have some code that uses Python with Hadoop's MapReduce Streaming implementation.
It works as I expected, but I could not get it to work with the Avro format.
I would appreciate it if someone could pinpoint where I have gone wrong.
Please see below for my objective, challenges/issues, questions, code, and testing.
Objective:
Use Python with Hadoop's MapReduce Streaming to extract each line of data in CSV format, with a comma as the delimiter. The mapper/combiner/reducer then transforms the data, serializes it against an Avro schema, and writes it into a file under HDFS.
NOTE: I have successfully implemented Python with Hadoop's MapReduce Streaming without the Avro schema.
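For reference, the mapper's read_csv_file() below expects five comma-separated fields per input line (oui, sn, key, value, date); it joins oui and sn into a device id and keeps the key/value pair as a one-entry map. A hypothetical example (the raw line is made up, based on the sample output further down):
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# A hypothetical input line; the trailing date field is parsed but unused
line = "A4B1E9,CP1239RA8NQ,InternetGatewayDevice.X_000E50_Firewall.Chain.9.Rule.7.DestinationPort,67,20140318"
(oui, sn, key, value, datestr) = line.strip().split(",")
record = {"id": oui + "-" + sn, "parameter": {key: value}}
print record
# {'id': 'A4B1E9-CP1239RA8NQ', 'parameter': {'InternetGatewayDevice...DestinationPort': '67'}}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++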
Challenges:
1) Without Hadoop's MapReduce Streaming, Avro's DataFileWriter writes data into my "custom" filenames. With Hadoop's MapReduce Streaming, however, only Hadoop's default filenames (part-0000*) appear in HDFS.
2) Under streaming, the files that Avro's DataFileWriter produces in HDFS (part-00000*) are empty.
Questions:
1) Can I control the filename and location of the Avro files written to HDFS?
2) Hadoop's MapReduce Streaming has MapDebug (-mapdebug) and ReduceDebug (-reducedebug) options, but I cannot get any debug messages out of my map debug script.
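For background on both challenges: as far as I understand, streaming only ships what a mapper/combiner/reducer prints to stdout (interpreted as key<TAB>value lines) into the part-* files under the job's HDFS output directory; anything written with open() lands in the task's local working directory instead. A minimal identity mapper illustrating that stdout contract (a sketch, not my real job):
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/usr/bin/env python
import sys
# Streaming feeds input records on stdin and collects stdout lines as
# key<TAB>value pairs; this is the only channel that reaches the part-* files
for line in sys.stdin:
    print line.rstrip("\n")
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++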
Source code:
1) Wrapper shell script to run Python with Hadoop's MapReduce Streaming
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ cat devices-hdfs-mr-avro-PyIterGen-v2.sh
#!/bin/sh
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.2.0-mr1-cdh5.0.0-beta-1.jar
# Clean up the previous runs
sudo -u hdfs hadoop fs -rm -f -R /data/db/bdms1p/output/avro
sudo -u hdfs hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
-D mapreduce.job.tracker="local" \
-files ./avro-PyIterGen-mapper-v1.py,map-debug.py,avro-PyIterGen-combiner-v1.py \
-files ./avro-1.7.6.jar,./avro-mapred-1.7.6-hadoop1.jar,./avro-tools-1.7.6.jar \
-libjars avro-1.7.6.jar,avro-mapred-1.7.6-hadoop1.jar,./avro-tools-1.7.6.jar \
-mapper ./avro-PyIterGen-mapper-v1.py \
-combiner ./avro-PyIterGen-combiner-v1.py \
-mapdebug ./map-debug.py \
-input /data/db/bdms1p/input/*.txt \
-output /data/db/bdms1p/output/avro \
-outputformat org.apache.avro.mapred.AvroTextOutputFormat \
-inputformat org.apache.avro.mapred.AvroAsTextInputFormat
sudo -u hdfs hadoop fs -ls -R -h /data/db/bdms1p
sudo -u hdfs hadoop jar ./avro-tools-1.7.6.jar totext /data/db/bdms1p/output/avro/part-00001.avro -
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2) Map Debugger
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ cat map-debug.py
#!/usr/bin/env python
"""Debug script for failed Hadoop streaming map tasks."""
import sys

def main(out_msg, err_msg, log_msg):
    # As I read the streaming docs, Hadoop invokes this script when a map task
    # fails, passing the task's stdout, stderr, syslog, and jobconf file paths
    # as arguments (so argv[1], the stdout path, is skipped below)
    print "[DEBUG-Output]: %s" % out_msg
    print "[DEBUG-ERROR]: %s" % err_msg
    print "[DEBUG-INFO]: %s" % log_msg

if __name__ == "__main__":
    main(sys.argv[2], sys.argv[3], sys.argv[4])
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3) Python Mapper
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ cat avro-PyIterGen-mapper-v1.py
#!/usr/bin/env python26
"""A more advanced Mapper, using Python iterators and generators."""
import sys
#import json
from avro import schema, datafile, io
import os
import fnmatch
import gzip, bz2

INFILE_NAME = "devices.avro"
OUTFILE_NAME = "devices.avro"
OUTFILE_NAME_PREFIX = "devices"

DEVICE_AVSC = """{
    "type": "record",
    "name": "device",
    "namespace": "au.com.telstra.in.bdms",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "parameter", "type": {
            "type": "map", "values": "string"}
        }
    ]
}"""
DEVICE_SCHEMA = schema.parse(DEVICE_AVSC)

# Keep the list of devices which have been processed so far
DEVICES_LIST = []

# Yield the files which match the file pattern under the given path
def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            #print "Path: %s/%s\n" % (path, name)
            yield os.path.join(path, name)

# Yield a file handle for each of the given file names
def gen_open(filenames):
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name)
        elif name.endswith(".bz2"):
            yield bz2.BZ2File(name)
        else:
            yield open(name, 'r')

# Yield the Avro file handles
def open_avro_files(file_pattern):
    file_names = gen_find(file_pattern, "./")
    avro_files = gen_open(file_names)
    for file in avro_files:
        yield file

# Read Avro records from the list of Avro files
def read_avro_file(file_pattern):
    avro_files = open_avro_files(file_pattern)
    for file in avro_files:
        #print "File: %s" % file
        # Create a 'record' (datum) reader
        #rec_reader = io.DatumReader(DEVICE_AVSC)
        rec_reader = io.DatumReader()
        # Create a 'data file' (avro file) reader
        df_reader = datafile.DataFileReader(file, rec_reader)
        # Read all records stored inside the Avro file
        for record in df_reader:
            print record['id']
            #print record
            for key in record['parameter'].keys():
                print key, record['parameter'][key]
            #print record['address'], record['value']
            # Do whatever read-processing you wanna do
            # for each record here ...
        # Close to ensure reading is completed
        df_reader.close()

# Parse each CSV line from the given file into a device record
def read_csv_file(file, separator):
    for line in file:
        record = {}
        # Split the line into fields
        (oui, sn, key, value, datestr) = line.strip().split(separator)
        # Combine the oui and sn fields to get a unique id
        record['id'] = oui + '-' + sn
        record['parameter'] = {key: value}
        yield record

def write_avro_file(comma_separator):
    # Parse the CSV records arriving on stdin
    data = read_csv_file(sys.stdin, comma_separator)
    # Use the device ID as the file name
    current_file_name = ""
    for record in data:
        new_file_name = record['id'] + ".avro"
        if current_file_name != new_file_name:
            if current_file_name:
                df_writer.close()
            current_file_name = new_file_name
            # Create a 'record' (datum) writer
            #rec_writer = io.DatumWriter(DEVICE_AVSC)
            rec_writer = io.DatumWriter()
            if record['id'] in DEVICES_LIST:
                # To append to an existing datafile, do not initialize the
                # writer object with a writers_schema again
                df_writer = datafile.DataFileWriter(open(current_file_name, 'a+'), rec_writer)
            else:
                #df_writer = datafile.DataFileWriter(open(current_file_name, 'wb'), rec_writer,
                #                                    writers_schema=DEVICE_SCHEMA, codec='deflate')
                df_writer = datafile.DataFileWriter(open(current_file_name, 'w+'), rec_writer,
                                                    writers_schema=DEVICE_SCHEMA)
                DEVICES_LIST.append(record['id'])
        # Write our data
        df_writer.append(record)
    # Close to ensure writing is complete
    if current_file_name:
        df_writer.close()

def main(comma_separator=","):
    # Input comes from STDIN and is written out in the Avro format
    write_avro_file(comma_separator)
    # Now, read it back
    #read_avro_file("*.avro")

if __name__ == "__main__":
    main()
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
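To exercise write_avro_file() outside Hadoop, I can use a throwaway harness like the one below (assuming the mapper's functions are pasted into the test script, since the dashed filename is not importable; the sample line is fabricated):
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import StringIO
import sys
# Feed one fabricated CSV line through the same code path the mapper uses
sys.stdin = StringIO.StringIO(
    "A4B1E9,CP1239RA8NQ,Some.Parameter.Name,67,20140318\n")
write_avro_file(",")
# Expect A4B1E9-CP1239RA8NQ.avro in the current working directory afterwards
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++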
4) Python Combiner
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ cat avro-PyIterGen-combiner-v1.py
#!/usr/bin/env python26
"""A more advanced Combiner, using Python iterators and generators."""
import sys
#import json
from avro import schema, datafile, io
import os
import fnmatch
import gzip, bz2

INFILE_NAME = "devices.avro"
OUTFILE_NAME = "devices.avro"
OUTFILE_NAME_PREFIX = "devices"

DEVICE_AVSC = """{
    "type": "record",
    "name": "device",
    "namespace": "au.com.telstra.in.bdms",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "parameter", "type": {
            "type": "map", "values": "string"}
        }
    ]
}"""
DEVICE_SCHEMA = schema.parse(DEVICE_AVSC)

# Keep the list of devices which have been processed so far
DEVICES_LIST = []

# Yield the files which match the file pattern under the given path
def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            #print "Path: %s/%s\n" % (path, name)
            yield os.path.join(path, name)

# Yield a file handle for each of the given file names
def gen_open(filenames):
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name)
        elif name.endswith(".bz2"):
            yield bz2.BZ2File(name)
        else:
            yield open(name, 'r')

# Yield the Avro file handles
def open_avro_files(file_pattern):
    file_names = gen_find(file_pattern, "./")
    avro_files = gen_open(file_names)
    for file in avro_files:
        yield file

# Read Avro records from the list of Avro files
def read_avro_file(file_pattern):
    avro_files = open_avro_files(file_pattern)
    for file in avro_files:
        #print "File: %s" % file
        # Create a 'record' (datum) reader
        #rec_reader = io.DatumReader(DEVICE_AVSC)
        rec_reader = io.DatumReader()
        # Create a 'data file' (avro file) reader
        df_reader = datafile.DataFileReader(file, rec_reader)
        # Read all records stored inside the Avro file
        for record in df_reader:
            print record['id']
            #print record
            for key in record['parameter'].keys():
                print key, record['parameter'][key]
            #print record['address'], record['value']
            # Do whatever read-processing you wanna do
            # for each record here ...
        # Close to ensure reading is completed
        df_reader.close()
        # Sample of an exception handler (placeholder)
        try:
            continue
        except ValueError:
            pass

def main(comma_separator=","):
    # Input comes from STDIN (written by the mapper in the Avro format)
    #write_avro_file(comma_separator)
    # Now, read it back
    read_avro_file("*.avro")

if __name__ == "__main__":
    main()
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
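Since the combiner scans for *.avro under its current directory, a quick way to check what a streaming task can actually see is to log the working directory to stderr (stderr goes to the task logs, not the job output); a small snippet along these lines could be dropped into the combiner:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import os
import sys
# Log the task's working directory and its contents to stderr;
# these lines show up in the streaming task logs, not in the part-* output
sys.stderr.write("cwd=%s\n" % os.getcwd())
sys.stderr.write("files=%s\n" % os.listdir("."))
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++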
Testing:
1) Run the Python map and combine functions from the OS
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ cat data.txt | python avro-PyIterGen-mapper-v1.py | python avro-PyIterGen-combiner-v1.py
A4B1E9-CP1242UA6MN
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.12.WorstE2eDelay 3
A4B1E9-CP1246RA7CJ
InternetGatewayDevice.X_000E50_Firewall.Chain.9.Rule.9.SourceMACExclude 0
A4B1E9-CP1238VASE3
InternetGatewayDevice.WANDevice.2.WANDSLInterfaceConfig.DownstreamPower 0
A4B1E9-CP1243TABPW
InternetGatewayDevice.WANDevice.2.WANConnectionDevice.1.WANIPConnection.1.PortMapping.147.PortMappingProtocol UDP
A4B1E9-CP1238VAR1D
InternetGatewayDevice.X_000E50_Connection.BindBlackList.3.Protocol UDP
A4B1E9-CP1244SA10Q
InternetGatewayDevice.WANDevice.2.WANConnectionDevice.1.WANIPConnection.1.PortMapping.344.PortMappingLeaseDuration 0
A4B1E9-CP1246RA9ER
InternetGatewayDevice.Services.VoiceService.1.VoiceProfile.1.Line.1.Session.1305.FarEndUDPPort 1382
A4B1E9-CP1301SAENP
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.117.WorstE2eDelay 72
A4B1E9-CP1244RA1UP
InternetGatewayDevice.QueueManagement.Classification.13.IPLengthExclude 0
A4B1E9-CP1242UA46P
InternetGatewayDevice.Services.VoiceService.1.VoiceProfile.1.Line.1.Session.1051.SessionDuration 1388863841
A4B1E9-CP1238VAB8W
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.112.Underruns 0
A4B1E9-CP1247SA8UE
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.128.FarEndPackLostRatio 0
A4B1E9-CP1239RA5T1
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.194.PacketsReceived 636
A4B1E9-CP1246RA4G7
InternetGatewayDevice.X_000E50_Firewall.Chain.9.Rule.35.Target Accept
A4B1E9-CP1239RA8R5
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.81.BytesReceived 22265
A4B1E9-CP1244RAZJZ
InternetGatewayDevice.WANDevice.2.WANConnectionDevice.1.WANDSLLinkConfig.ATMHECErrors 0
A4B1E9-CP1244RA6YS
InternetGatewayDevice.X_000E50_Connection.BindBlackList.26.Flags Dynamic
A4B1E9-CP1238VAEMS
InternetGatewayDevice.Services.X_000E50_NATApplicationList.Application.2.Category
A4B1E9-CP1239RAG79
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.188.FarEndPackLostCum 0
A4B1E9-CP1247SAYG9
InternetGatewayDevice.X_000E50_Firewall.Chain.16.Rule.1.DestinationIPMask
A4B1E9-CP1238VACA0
InternetGatewayDevice.Services.VoiceService.1.VoiceProfile.1.Line.1.Session.199.SessionStartTime 1389193782
A4B1E9-CP1242TAMJF
InternetGatewayDevice.QueueManagement.Classification.22.EthernetPriorityCheck -1
A4B1E9-CP1239RA8NQ
InternetGatewayDevice.X_000E50_Firewall.Chain.9.Rule.7.DestinationPort 67
A4B1E9-CP1239RA8NQ
InternetGatewayDevice.X_000E50_Firewall.Chain.9.Rule.8.DestinationPort 68
A4B1E9-CP1244SA10W
InternetGatewayDevice.X_000E50_AccessRights.Group.16.Parent
589835-CP1224SAKM7
InternetGatewayDevice.X_000E50_Firewall.Chain.17.Rule.4.TOS 0
A4B1E9-CP1246RAK7W
InternetGatewayDevice.Services.VoiceService.1.VoiceProfile.1.Line.2.Session.1369.FarEndUDPPort 1218
A4B1E9-CP1244RAVKV
InternetGatewayDevice.Services.VoiceService.1.VoiceProfile.1.NumberingPlan.PrefixInfo.94.FacilityActionArgument
A4B1E9-CP1244RAYNA
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.87.FarEndIPAddress
A4B1E9-CP1252SAHDG
InternetGatewayDevice.QueueManagement.Classification.28.SourcePort -1
A4B1E9-CP1243TA58G
InternetGatewayDevice.Services.VoiceService.1.X_000E50_SessionLog.204.PacketsSent 4801
A4B1E9-CP1246RALTZ
InternetGatewayDevice.QueueManagement.Classification.15.SourceIPExclude 0
oracle@bpdevdmsdbs01:BDMSSI1D1 ---- /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/py -----
$ ls *.avro
589835-CP1224SAKM7.avro A4B1E9-CP1239RA8NQ.avro A4B1E9-CP1243TABPW.avro A4B1E9-CP1244SA10W.avro A4B1E9-CP1247SAYG9.avro
A4B1E9-CP1238VAB8W.avro A4B1E9-CP1239RA8R5.avro A4B1E9-CP1244RA1UP.avro A4B1E9-CP1246RA4G7.avro A4B1E9-CP1252SAHDG.avro
A4B1E9-CP1238VACA0.avro A4B1E9-CP1239RAG79.avro A4B1E9-CP1244RA6YS.avro A4B1E9-CP1246RA7CJ.avro A4B1E9-CP1301SAENP.avro
A4B1E9-CP1238VAEMS.avro A4B1E9-CP1242TAMJF.avro A4B1E9-CP1244RAVKV.avro A4B1E9-CP1246RA9ER.avro
A4B1E9-CP1238VAR1D.avro A4B1E9-CP1242UA46P.avro A4B1E9-CP1244RAYNA.avro A4B1E9-CP1246RAK7W.avro
A4B1E9-CP1238VASE3.avro A4B1E9-CP1242UA6MN.avro A4B1E9-CP1244RAZJZ.avro A4B1E9-CP1246RALTZ.avro
A4B1E9-CP1239RA5T1.avro A4B1E9-CP1243TA58G.avro A4B1E9-CP1244SA10Q.avro A4B1E9-CP1247SA8UE.avro
$ ls -al A4B1E9-CP1239RA8NQ.avro
-rw-r--r-- 1 oracle oinstall 471 Mar 18 12:03 A4B1E9-CP1239RA8NQ.avro
$ strings A4B1E9-CP1239RA8NQ.avro
avro.schema
{"type": "record", "namespace": "au.com.telstra.in.bdms", "name": "device", "fields": [{"type": "string", "name": "id"}, {"type": {"type": "map", "values": "string"}, "name": "parameter"}]}
avro.codec
null
$A4B1E9-CP1239RA8NQ
InternetGatewayDevice.X_000E50_Firewall.Chain.9.Rule.7.DestinationPort
$A4B1E9-CP1239RA8NQ
InternetGatewayDevice.X_000E50_Firewall.Chain.9.Rule.8.DestinationPort
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2) Run Python with Hadoop's MapReduce Streaming
$ ./devices-hdfs-mr-avro-PyIterGen-v2.sh
14/03/18 12:06:53 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 86400000 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://nsda3dmsrpt02.internal.bigpond.com:8020/data/db/bdms1p/output/avro' to trash at: hdfs://nsda3dmsrpt02.internal.bigpond.com:8020/user/hdfs/.Trash/Current
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-1.jar] /tmp/streamjob6036271345882146838.jar tmpDir=null
14/03/18 12:06:55 INFO client.RMProxy: Connecting to ResourceManager at bpdevdmsdbs01/172.18.127.245:8032
14/03/18 12:06:55 INFO client.RMProxy: Connecting to ResourceManager at bpdevdmsdbs01/172.18.127.245:8032
14/03/18 12:06:56 INFO mapred.FileInputFormat: Total input paths to process : 1
14/03/18 12:06:56 INFO mapreduce.JobSubmitter: number of splits:0
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
14/03/18 12:06:56 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.map.task.debug.script is deprecated. Instead, use mapreduce.map.debug.script
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/03/18 12:06:56 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/03/18 12:06:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1394755763143_0020
14/03/18 12:06:57 INFO impl.YarnClientImpl: Submitted application application_1394755763143_0020 to ResourceManager at bpdevdmsdbs01/172.18.127.245:8032
14/03/18 12:06:57 INFO mapreduce.Job: The url to track the job: http://bpdevdmsdbs01:8088/proxy/application_1394755763143_0020/
14/03/18 12:06:57 INFO mapreduce.Job: Running job: job_1394755763143_0020
14/03/18 12:07:03 INFO mapreduce.Job: Job job_1394755763143_0020 running in uber mode : false
14/03/18 12:07:03 INFO mapreduce.Job: map 0% reduce 0%
14/03/18 12:07:08 INFO mapreduce.Job: map 0% reduce 4%
14/03/18 12:07:09 INFO mapreduce.Job: map 0% reduce 21%
14/03/18 12:07:10 INFO mapreduce.Job: map 0% reduce 29%
14/03/18 12:07:11 INFO mapreduce.Job: map 0% reduce 42%
14/03/18 12:07:12 INFO mapreduce.Job: map 0% reduce 58%
14/03/18 12:07:13 INFO mapreduce.Job: map 0% reduce 71%
14/03/18 12:07:14 INFO mapreduce.Job: map 0% reduce 83%
14/03/18 12:07:15 INFO mapreduce.Job: map 0% reduce 100%
14/03/18 12:07:16 INFO mapreduce.Job: Job job_1394755763143_0020 completed successfully
14/03/18 12:07:17 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=2183380
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=0
HDFS: Number of bytes written=1008
HDFS: Number of read operations=72
HDFS: Number of large read operations=0
HDFS: Number of write operations=48
Job Counters
Launched reduce tasks=24
Total time spent by all maps in occupied slots (ms)=0
Total time spent by all reduces in occupied slots (ms)=64414
Map-Reduce Framework
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=0
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=614
CPU time spent (ms)=30220
Physical memory (bytes) snapshot=6838841344
Virtual memory (bytes) snapshot=34129403904
Total committed heap usage (bytes)=18998624256
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Output Format Counters
Bytes Written=1008
14/03/18 12:07:17 INFO streaming.StreamJob: Output directory: /data/db/bdms1p/output/avro
drwxr-xr-x - hdfs supergroup 0 2014-03-14 10:15 /data/db/bdms1p/input
-rw-r--r-- 3 hdfs supergroup 4.0 K 2014-03-14 10:15 /data/db/bdms1p/input/data.txt
drwxr-xr-x - hdfs supergroup 0 2014-03-18 12:07 /data/db/bdms1p/output
-rw-r--r-- 3 hdfs supergroup 0 2014-03-13 15:55 /data/db/bdms1p/output/_SUCCESS
drwxr-xr-x - hdfs supergroup 0 2014-03-18 12:07 /data/db/bdms1p/output/avro
-rw-r--r-- 3 hdfs supergroup 0 2014-03-18 12:07 /data/db/bdms1p/output/avro/_SUCCESS
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00000.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00001.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00002.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00003.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00004.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00005.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00006.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00007.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00008.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00009.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00010.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00011.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00012.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00013.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00014.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00015.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00016.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00017.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00018.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00019.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00020.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00021.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00022.avro
-rw-r--r-- 3 hdfs supergroup 42 2014-03-18 12:07 /data/db/bdms1p/output/avro/part-00023.avro
Thanks and Regards,