Push tuple/bag schema to UDF function automatically

Hi, I have rather complex data processing routines and I'm using many
Jython UDFs.

Here is my code example:

--project and get all visited route pivots
routePivots = FOREACH withNoIntersection GENERATE
    withIntersection::msisdn as msisdn: long,        --0
    withIntersection::ts as ts: long,                --1
    withIntersection::lac as lac,                    --2
    withIntersection::cid as cid,                    --3
    withIntersection::cell_type as cell_type,        --4
    withIntersection::branch_id as branch_id,        --5
    (withIntersection::center_lon is null ? tiledGsm::center_lon :
        withIntersection::center_lon) as center_lon: double,   --6
    (withIntersection::center_lat is null ? tiledGsm::center_lat :
        withIntersection::center_lat) as center_lat: double,   --7
    (withIntersection::tile_id is null ? tiledGsm::tile_id :
        withIntersection::tile_id) as tile_id: int,            --8
    (withIntersection::zone_col is null ? tiledGsm::zone_col :
        withIntersection::zone_col) as zone_col: int,          --9
    (withIntersection::zone_row is null ? tiledGsm::zone_row :
        withIntersection::zone_row) as zone_row: int,          --10
    withIntersection::is_active as is_active;        --11

--filter non valid route pivots and store them
routePivotsGroupedByMsisdn = GROUP routePivots BY msisdn;
markedPivots = FOREACH routePivotsGroupedByMsisdn {
    ordered = ORDER routePivots BY ts;
    GENERATE FLATTEN(udf.filter_route_pivots(ordered, 55, 10000,
                                             'ts:1, lon:6, lat:7'))
        as (msisdn: long,          --0
            ts: long,              --1
            lac: int,              --2
            cid: int,              --3
            cell_type: chararray,  --4
            branch_id: int,        --5
            center_lon: double,    --6
            center_lat: double,    --7
            tile_id: int,          --8
            zone_col: int,         --9
            zone_row: int,         --10
            is_active: boolean,    --11
            avg_speed: double,     --12
            distance: int,         --13
            not_valid: int);       --14
};

As you can see, the relation *routePivots* has an explicit schema.
I invoke the UDF *filter_route_pivots* on a bag whose tuples have the
*routePivots* schema, and I also pass a partial schema description
('ts:1, lon:6, lat:7') to the UDF.
I do this because I don't want to work with positional tuple indexes. They
are a great way to introduce silly bugs that are hard to catch, as I know
from experience.
Inside the UDF I access tuple values by name instead:

ts_diff = next_pivot[pivot_schema['ts']] - pivot[pivot_schema['ts']]
distance = RouteCalculator.getDistanceInMeters(
    pivot[pivot_schema['lon']], pivot[pivot_schema['lat']],
    next_pivot[pivot_schema['lon']], next_pivot[pivot_schema['lat']])
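
For readers who want to see how a string such as 'ts:1, lon:6, lat:7' could become the pivot_schema lookup used in the snippet, here is a minimal sketch. The helper name parse_partial_schema and the two sample tuples are hypothetical and made up for illustration; this is not the actual body of filter_route_pivots.

```python
def parse_partial_schema(spec):
    """Turn 'ts:1, lon:6, lat:7' into {'ts': 1, 'lon': 6, 'lat': 7}.

    Hypothetical helper: maps a field name to its position in the tuple.
    """
    schema = {}
    for item in spec.split(','):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        name, index = item.split(':')
        schema[name.strip()] = int(index)
    return schema


pivot_schema = parse_partial_schema('ts:1, lon:6, lat:7')

# Two made-up tuples laid out like the routePivots schema (positions 0..11).
pivot = (79001112233, 1000, 1, 2, 'gsm', 5, 37.61, 55.75, 8, 9, 10, True)
next_pivot = (79001112233, 1060, 1, 3, 'gsm', 5, 37.62, 55.76, 8, 9, 10, True)

# Named access: the positions live in one place, not scattered through the code.
ts_diff = next_pivot[pivot_schema['ts']] - pivot[pivot_schema['ts']]
```

With this approach, a change in field order only requires updating the string passed from the Pig script, but that string still has to be kept in sync with the declared schema by hand.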

This way I avoid magic numbers, my code is easier to read and maintain,
and I'm not tightly coupled to the schema and its field order.

Does Pig provide a way to push the schema down to a UDF? I've already
declared it, so why can't I use it inside my UDF?
I think such a feature would be very useful.
