Hi all,
Whether you’re using Hive or MapReduce, the Avro input/output formats require you to specify a schema at the start of the job, or in the table definition, before you can work with them. Is there any way to configure a job so that the input/output formats determine the schema dynamically from the data itself?
Think about a job like this: I have a set of CSV files that I want to serialize into Avro files. The CSV files are self-describing, and each file has a unique schema. As far as I know, today’s tools don’t let me write a single job that scans over all of this data and serializes it into Avro, because I can’t specify one schema up front. So what can I do? Am I forced to write my own Avro input/output formats?
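To make the “self-describing” part concrete, here is a minimal sketch of what I mean by deriving a per-file Avro record schema from a CSV header. This is stdlib-only illustration code, not any existing tool’s API; the function name and the all-strings typing are my own simplifying assumptions.

```python
# Hypothetical sketch: build an Avro record schema (as a JSON-style dict)
# from the header row of a self-describing CSV file. For simplicity every
# column is typed as "string"; a real converter would infer richer types.
import csv
import io
import json

def schema_for_csv(text, name="CsvRecord"):
    # The first CSV row is treated as the field names for this file's schema.
    header = next(csv.reader(io.StringIO(text)))
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": col, "type": "string"} for col in header],
    }

demo = "id,city\n1,Austin\n2,Boston\n"
print(json.dumps(schema_for_csv(demo)))
```

The point is that each CSV file yields a different schema, so a single job-wide schema set at configuration time can’t cover them all.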
The Avro schema is stored within the Avro data file itself, so why can’t these input/output formats be smart enough to figure it out on their own? Am I fundamentally going against the principles of the Avro format? I’d be surprised if no one has run into this before.
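Since the schema really does live in the file, an input format could in principle read it before planning the job. As a demonstration, here is a stdlib-only sketch that pulls the schema out of an Avro object container file’s header (magic bytes, then a metadata map whose `avro.schema` entry holds the schema JSON). This is my own toy parser of the container format, not Avro’s official reader API.

```python
# Sketch: extract the embedded schema from the header of an Avro object
# container file using only the standard library.
import io
import json

def read_long(buf):
    # Avro longs are zigzag-encoded variable-length integers.
    n, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        n |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zigzag encoding

def read_bytes(buf):
    # Avro bytes/strings: a long length followed by that many raw bytes.
    return buf.read(read_long(buf))

def schema_from_header(raw):
    buf = io.BytesIO(raw)
    assert buf.read(4) == b"Obj\x01", "not an Avro container file"
    meta = {}
    while True:
        count = read_long(buf)  # metadata map comes in counted blocks
        if count == 0:
            break
        if count < 0:
            read_long(buf)  # negative count: a block byte-size follows
            count = -count
        for _ in range(count):
            key = read_bytes(buf).decode()
            meta[key] = read_bytes(buf)
    return json.loads(meta["avro.schema"])

# Hand-built header for a file whose schema is just "string".
demo = b'Obj\x01\x02\x16avro.schema\x10"string"\x00'
print(schema_from_header(demo))
```

So the information is sitting right there in the first few bytes of every file, which is exactly why I’d expect the formats to be able to discover it.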
Regards,
Ryan Tabora