Hi,
I am new to Pig scripting and need help or suggestions to resolve the
problem below.
I have 1000 XML files in a folder. My Pig script has to take them one by
one, parse out some values, and store those values in a single file.
I tried the script below, but it is not working as expected:
REGISTER piggybank.jar;

A = LOAD 'XML/NCT{00000611,00000768}.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('org_study_id')
    AS (x: chararray);
A2 = FOREACH A GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x, '<org_study_id>(.*)</org_study_id>'))
    AS (org_study_id: chararray);
A3 = FOREACH A2 GENERATE CONCAT('#$', CONCAT(org_study_id, '$'));
STORE A3 INTO 'piglab/result1';
data = LOAD 'piglab/result1' USING PigStorage('$')
    AS (a1: chararray, a2: chararray);

C = LOAD 'XML/NCT{00000611,00000768}.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('nct_id')
    AS (x1: chararray);
C2 = FOREACH C GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x1, '<nct_id>(.*)</nct_id>'))
    AS (nct_id: chararray);
C3 = FOREACH C2 GENERATE CONCAT('#$', CONCAT(nct_id, '$'));
STORE C3 INTO 'piglab/result11';
data11 = LOAD 'piglab/result11' USING PigStorage('$')
    AS (c1: chararray, c2: chararray);

I = LOAD 'piglab/NCT{00000611,00000768}.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('minimum_age')
    AS (x5: chararray);
I2 = FOREACH I GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x5, '<minimum_age>(.*)</minimum_age>'))
    AS (minimum_age: chararray);
I3 = FOREACH I2 GENERATE CONCAT('#$', CONCAT(minimum_age, '$'));
STORE I3 INTO 'piglab/result9';
data8 = LOAD 'piglab/result9' USING PigStorage('$')
    AS (i1: chararray, i2: chararray);

result3 = JOIN data BY a1, data11 BY c1, data8 BY i1;
STORE result3 INTO 'piglab/result';
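For comparison, here is a rough single-pass sketch of what I am trying to
achieve: load every file once, pull the three values out of each record, and
store them together without the intermediate STORE/LOAD/JOIN steps. It rests
on assumptions I have not verified: that each file wraps one study in a
<clinical_study> element (that tag name is a guess), that XMLLoader returns
the whole element as one chararray, that the glob 'XML/*.xml' matches all the
files, and that each of the three tags sits on a single line. The relation
names studies and fields and the output path piglab/result_all are made up
for illustration.

```pig
REGISTER piggybank.jar;

-- Assumption: one <clinical_study> element per file; the glob loads
-- every XML file in the folder with a single LOAD.
studies = LOAD 'XML/*.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('clinical_study')
    AS (doc: chararray);

-- REGEX_EXTRACT(string, regex, index) returns the requested capture group;
-- a record where the tag is not found yields null for that column.
fields = FOREACH studies GENERATE
    REGEX_EXTRACT(doc, '<org_study_id>(.*?)</org_study_id>', 1) AS org_study_id,
    REGEX_EXTRACT(doc, '<nct_id>(.*?)</nct_id>', 1) AS nct_id,
    REGEX_EXTRACT(doc, '<minimum_age>(.*?)</minimum_age>', 1) AS minimum_age;

-- Default PigStorage output is tab-delimited; the result is a directory
-- of part files rather than one literal file.
STORE fields INTO 'piglab/result_all';
```

Because all three values come from the same input record, they stay on the
same output row, so no join key is needed to line them up afterwards.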
The XML looks like this, and each XML file has a different clinical_study_rank
record.</link_text>
Skin
Any help on this will be highly appreciated.
Thanks