I have a use case for Flume and I'm wondering which of its many options I should use to make this work.
I have a data source that produces log data in UDP packets containing
JSON (a bit like syslog, but the data is already structured). I want to
get this into Hadoop somehow (either HBase or HDFS+Hive, not sure yet).
My first attempt was to write a source (based on the syslog UDP source) that
receives UDP packets, parses the JSON, stuffs the fields into the
headers of the internal Flume event object, and sends it off. (The body
is left empty.) On the receiving end, I wrote a serializer for the
HBase sink that writes each header field into a separate column. That works, but I was confused that the HBase serializers supplied by default ignore all event headers, so I was wondering whether I'm abusing the header mechanism.
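
For concreteness, the serializer I have now looks roughly like this. This is a simplified sketch rather than the exact code: it assumes Flume's HbaseEventSerializer interface and the older HBase client API where Put.add() takes family/qualifier/value, and the timestamp row key is just a placeholder.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.conf.ComponentConfiguration;
    import org.apache.flume.sink.hbase.HbaseEventSerializer;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Row;
    import org.apache.hadoop.hbase.util.Bytes;

    // Writes every Flume event header as its own column in one column family.
    public class HeadersToColumnsSerializer implements HbaseEventSerializer {
      private byte[] columnFamily;
      private Event currentEvent;

      @Override
      public void configure(Context context) { }

      @Override
      public void configure(ComponentConfiguration conf) { }

      @Override
      public void initialize(Event event, byte[] columnFamily) {
        // Called by the HBase sink for each event before getActions().
        this.currentEvent = event;
        this.columnFamily = columnFamily;
      }

      @Override
      public List<Row> getActions() {
        // Placeholder row key; a real key would come from the data itself.
        Put put = new Put(Bytes.toBytes(Long.toString(System.currentTimeMillis())));
        for (Map.Entry<String, String> header : currentEvent.getHeaders().entrySet()) {
          put.add(columnFamily,
                  Bytes.toBytes(header.getKey()),
                  Bytes.toBytes(header.getValue()));
        }
        List<Row> actions = new ArrayList<Row>();
        actions.add(put);
        return actions;
      }

      @Override
      public List<Increment> getIncrements() {
        return new ArrayList<Increment>();  // no counter columns
      }

      @Override
      public void close() { }
    }
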
An alternative approach I was thinking about was writing a generic UDP source that stuffs the entire UDP packet into the event body, and then writing a serializer for the HBase sink that parses the JSON and puts the fields into the columns. Or, alternatively, I could write the JSON straight into HDFS and have Hive do the JSON parsing later.
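
To illustrate the difference, the body-based variant would just move the parsing into the serializer, something along these lines (again only a sketch, with Jackson used purely as an example JSON parser; the rest of the HbaseEventSerializer boilerplate would look like the sketch above):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.flume.Event;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Only the part that differs from the header-based serializer is shown:
    // the raw JSON arrives in the event body and is parsed here instead of
    // in the source.
    public class JsonBodyToColumns {
      private static final ObjectMapper MAPPER = new ObjectMapper();

      public void addJsonFields(Event event, Put put, byte[] columnFamily) {
        JsonNode root;
        try {
          root = MAPPER.readTree(event.getBody());
        } catch (IOException e) {
          throw new RuntimeException("Event body is not valid JSON", e);
        }
        // One column per top-level JSON field.
        Iterator<Map.Entry<String, JsonNode>> fields = root.fields();
        while (fields.hasNext()) {
          Map.Entry<String, JsonNode> field = fields.next();
          put.add(columnFamily,
                  Bytes.toBytes(field.getKey()),
                  Bytes.toBytes(field.getValue().asText()));
        }
      }
    }

(For the HDFS route, I assume something like Hive's built-in get_json_object() or a JSON SerDe would do the parsing at query time instead.)
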
Which one of these would be more idiomatic and/or generally useful?