Hi all,
I have users writing Avro files on a different server, and I want to use Flume
to move all those files into HDFS so I can later use Hive or Pig to
query/analyse the data.
On the client I installed Flume and configured a spooldir source and an Avro
sink like this:
a1.sources = src1
a1.sinks = sink1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.src1.type = spooldir
a1.sources.src1.channels = c1
a1.sources.src1.spoolDir = {directory}
a1.sources.src1.fileHeader = true
a1.sources.src1.deserializer = avro
a1.sinks.sink1.type = avro
a1.sinks.sink1.channel = c1
a1.sinks.sink1.hostname = {IP}
a1.sinks.sink1.port = 41414
On the Hadoop cluster I have this Avro source and HDFS sink:
a1.sources = avro1
a1.sinks = sink1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.avro1.type = avro
a1.sources.avro1.channels = c1
a1.sources.avro1.bind = 0.0.0.0
a1.sources.avro1.port = 41414
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.channel = c1
a1.sinks.sink1.hdfs.path = {hdfs dir}
a1.sinks.sink1.hdfs.fileSuffix = .avro
a1.sinks.sink1.hdfs.rollSize = 67108864
a1.sinks.sink1.hdfs.fileType = DataStream
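(For reference, I start each agent with the usual flume-ng command, something
like: flume-ng agent --conf conf --conf-file agent.conf --name a1, with the
matching config file on each machine; the file name here is just a placeholder.)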
The problem is that the files on HDFS are not valid Avro files! I am using
the Hue UI to check whether a file is a valid Avro file or not. If I upload
an Avro file that I generate on my PC to the cluster, I can see its contents
perfectly and even create a Hive table and query it, but the files I send
via Flume are not valid Avro files.
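For reference, this is roughly how a file can also be sanity-checked outside
Hue, with a tiny program using the Avro Java API (just a sketch; the path
argument is a placeholder, and DataFileReader throws immediately if the file
is not a real Avro container):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class AvroCheck {
    public static void main(String[] args) throws Exception {
        // e.g. a file copied back from HDFS with "hadoop fs -get"
        File file = new File(args[0]);
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
        // Fails right here if the file is not a valid Avro container
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(file, datumReader);
        System.out.println("Schema: " + reader.getSchema());
        long count = 0;
        while (reader.hasNext()) {
            reader.next();
            count++;
        }
        System.out.println("Records: " + count);
        reader.close();
    }
}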
I tried the avro-client that is included with Flume (example command below),
but it didn't work because it sends one Flume event per line, which breaks the
Avro files; I fixed that by using the spooldir source with deserializer = avro.
So I think the problem is in the HDFS sink when it is writing the files.
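(The built-in client I am referring to is the bundled avro-client, run roughly
as: flume-ng avro-client --conf conf -H {IP} -p 41414 -F some-file.avro, where
the file name is just a placeholder.)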
With hdfs.fileType = DataStream it writes only the values of the Avro fields,
not the whole Avro file, so all the schema information is lost. If I use
hdfs.fileType = SequenceFile the files are not valid either, for some reason.
I appreciate any help.
Thanks,
Daniel