Hello, I'm having a hard time using Pig to extract data from Cassandra.
Cassandra: [cqlsh 4.1.0 | Cassandra 2.0.4 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Hadoop (Cloudera): 2.0.0+1518
Map Reduce: v2 (Yarn)
Pig: Apache Pig version 0.11.0-cdh4.5.0
I can use Pig to run plain MapReduce jobs without problems.
The test schema is very simple:
cqlsh:main> create table a (id int, name varchar, primary key (id));
cqlsh:main> insert into a (id, name) values (1, 'blah');
cqlsh:main> select * from a;
id | name
----+------
1 | blah
(1 rows)
The problem I run into is when I'm trying to extract data from Cassandra:
bash-4.2$ ./apache-cassandra-2.0.4-src/examples/pig/bin/pig_cassandra -x local
Using /home/hdfs/pig-0.12.0-src/pig-withouthadoop.jar.
2014-02-07 17:09:18,948 [main] INFO org.apache.pig.Main - Apache Pig
version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25
2014-02-07 17:09:18,949 [main] INFO org.apache.pig.Main - Logging error
messages to: /home/hdfs/pig_1391810958945.log
2014-02-07 17:09:19,373 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to hadoop file system at: file:///
2014-02-07 17:09:19,377 [main] WARN org.apache.hadoop.conf.Configuration -
mapred.used.genericoptionsparser is deprecated. Instead, use
mapreduce.client.genericoptionsparser.used
2014-02-07 17:09:19,394 [main] WARN org.apache.hadoop.conf.Configuration -
fs.default.name is deprecated. Instead, use fs.defaultFS
2014-02-07 17:09:19,395 [main] WARN org.apache.hadoop.conf.Configuration -
mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/home/hdfs/apache-cassandra-2.0.4-src/lib/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
2014-02-07 17:09:20,026 [main] WARN org.apache.hadoop.conf.Configuration -
io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-02-07 17:09:20,030 [main] WARN org.apache.hadoop.conf.Configuration -
fs.default.name is deprecated. Instead, use fs.defaultFS
2014-02-07 17:09:20,030 [main] WARN org.apache.hadoop.conf.Configuration -
mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
grunt> rows = LOAD 'cql://main/a' USING CqlStorage();
grunt> describe rows;
rows: {id: int,name: chararray}
Pig can get the schema out of the table.
However, trying to dump the data is where it all goes south:
grunt> data = foreach rows generate $1;
grunt> dump data;
2014-02-07 17:09:47,347 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: UNKNOWN
2014-02-07 17:09:47,416 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
for rows: $0
2014-02-07 17:09:47,548 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
File concatenation threshold: 100 optimistic? false
2014-02-07 17:09:47,589 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2014-02-07 17:09:47,589 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2014-02-07 17:09:47,960 [main] WARN org.apache.hadoop.conf.Configuration -
session.id is deprecated. Instead, use dfs.metrics.session-id
2014-02-07 17:09:47,968 [main] INFO
org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with
processName=JobTracker, sessionId=
2014-02-07 17:09:48,055 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added
to the job
2014-02-07 17:09:48,075 [main] WARN org.apache.hadoop.conf.Configuration -
mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use
mapreduce.reduce.markreset.buffer.percent
2014-02-07 17:09:48,075 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-02-07 17:09:48,075 [main] WARN org.apache.hadoop.conf.Configuration -
mapred.output.compress is deprecated. Instead, use
mapreduce.output.fileoutputformat.compress
2014-02-07 17:09:48,206 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2014-02-07 17:09:48,330 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2998: Unhandled internal error.
org.apache.hadoop.mapred.jobcontrol.JobControl.addJob(Lorg/apache/hadoop/mapred/jobcontrol/Job;)Ljava/lang/String;
Details at logfile: /home/hdfs/pig_1391810958945.log
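A note on reading that error: the string after the method name is a JVM method descriptor, where (L...;) lists the argument types and the part after the closing parenthesis is the return type. The following is only a minimal sketch for decoding such descriptors into a Java-like signature; the helper names (decode_method, _read_type) are made up for illustration and are not part of Pig or Hadoop.

```python
# Single-character JVM primitive type codes used in descriptors.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Decode one type starting at desc[pos]; return (name, next_pos)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    code = desc[pos]
    if code == "L":  # object type: L<internal/class/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:
        name = PRIMITIVES[code]
        pos += 1
    return name + "[]" * dims, pos


def decode_method(ref):
    """Turn 'pkg.Class.method(args)ret' into a Java-like signature."""
    owner_and_name, rest = ref.split("(", 1)
    args_desc, ret_desc = rest.split(")", 1)
    args, pos = [], 0
    while pos < len(args_desc):
        arg, pos = _read_type(args_desc, pos)
        args.append(arg)
    ret, _ = _read_type(ret_desc, 0)
    return f"{ret} {owner_and_name}({', '.join(args)})"


# The method reference copied from the error message above.
ERROR_REF = ("org.apache.hadoop.mapred.jobcontrol.JobControl.addJob"
             "(Lorg/apache/hadoop/mapred/jobcontrol/Job;)Ljava/lang/String;")
print(decode_method(ERROR_REF))
```

So the method being looked up is addJob, taking an org.apache.hadoop.mapred.jobcontrol.Job and returning a java.lang.String.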
The log says:
Pig Stack Trace
ERROR 2998: Unhandled internal error. org.apache.hadoop.mapred.jobcontrol.JobControl.addJob(Lorg/apache/hadoop/mapred/jobcontrol/Job;)Ljava/lang/String;
java.lang.NoSuchMethodError: org.apache.hadoop.mapred.jobcontrol.JobControl.addJob(Lorg/apache/hadoop/mapred/jobcontrol/Job;)Ljava/lang/String;
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:261)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:180)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
at org.apache.pig.PigServer.storeEx(PigServer.java:952)
at org.apache.pig.PigServer.store(PigServer.java:919)
at org.apache.pig.PigServer.openIterator(PigServer.java:832)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:490)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
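A NoSuchMethodError generally means the class found at runtime does not have a method that the calling code was compiled against. The output above also mentions three different Pig versions (0.11.0-cdh4.5.0 installed, a pig-0.12.0-src jar picked up by the wrapper script, and 0.10.0 in the startup banner), which may or may not be related. One way to narrow this down is to list every jar that ships the JobControl class. The script below is only a sketch: the function name jars_containing is hypothetical, and which directories actually end up on the classpath is an assumption to verify against the pig_cassandra script.

```python
import sys
import zipfile
from pathlib import Path

# Class named in the NoSuchMethodError, written as a path inside a jar.
TARGET = "org/apache/hadoop/mapred/jobcontrol/JobControl.class"


def jars_containing(entry, roots):
    """Return paths of .jar files under roots that contain the given entry."""
    hits = []
    for root in roots:
        root = Path(root)
        # A root may be a single jar or a directory to search recursively.
        candidates = [root] if root.is_file() else sorted(root.rglob("*.jar"))
        for jar in candidates:
            try:
                with zipfile.ZipFile(jar) as zf:
                    if entry in zf.namelist():
                        hits.append(str(jar))
            except (zipfile.BadZipFile, OSError):
                continue  # skip unreadable or non-zip files
    return hits


if __name__ == "__main__":
    # Usage: python find_jobcontrol.py <jar-or-directory> [...]
    for path in jars_containing(TARGET, sys.argv[1:]):
        print(path)
```

It could be pointed at locations that appear in the output above, for example /home/hdfs/pig-0.12.0-src/pig-withouthadoop.jar and /home/hdfs/apache-cassandra-2.0.4-src/lib, plus wherever the Hadoop jars live on the machine (not shown in the output). More than one hit would suggest that two different Hadoop versions are visible to Pig.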
I'd appreciate any help on this.