Achieving reasonably performing federated queries

Hi all,

This is partly a summary of my recent experiences with federated queries
and partly a request for your feedback on making /reasonably/ performing
federated queries.

The query in question is here [1]. Essentially there are two endpoints
(which may or may not be the same), and they return the same pattern.
There are millions of triples to get through, so throwing out false
negatives (early on) is quite important. We assume that graph names are
not known and that everything is accessible from the default graph. The
endpoint which dispatches the two queries needs to filter out what's
remaining. There are no common variables. This means that both endpoints
need to do their own thing and then the patterns are joined.

Needless to say, the OPTIONALs in there are expensive, but they help a
great deal in making sure that only what's necessary is used, i.e.,
either a refArea doesn't have an exactMatch, or, if there is an
exactMatch, it contains the domain of the refArea that's at the other
endpoint. Without OPTIONALs, the outer endpoint ends up with more
possibilities to join. Using MINUS is more or less the same.

By default, ARQ uses an optimizer to do a whole bunch of good stuff
that's mostly foreign to me. What I am aware of, however, is how it
behaves when it comes to SERVICE calls. When the first SERVICE call
comes back with n results, the second SERVICE is called n times.
Undoubtedly, this doesn't scale at all.
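
To put rough numbers on the behaviour described above, here is a minimal
sketch in plain Java. It does not use ARQ at all: FakeEndpoint,
perRowRequests and independentRequests are hypothetical names made up for
illustration, and the row counts are arbitrary. It only counts how many
remote requests each strategy would send.

```java
import java.util.ArrayList;
import java.util.List;

public class ServiceCallCount {
    /** Hypothetical stand-in for a remote endpoint; it only counts the requests it receives. */
    static final class FakeEndpoint {
        private final int rows;
        int requests = 0;

        FakeEndpoint(int rows) {
            this.rows = rows;
        }

        /** Simulates one remote request that returns `rows` results. */
        List<Integer> query() {
            requests++;
            List<Integer> out = new ArrayList<>();
            for (int i = 0; i < rows; i++) {
                out.add(i);
            }
            return out;
        }
    }

    /** The second SERVICE is re-sent once for every row returned by the first. */
    static int perRowRequests(int firstRows, int secondRows) {
        FakeEndpoint first = new FakeEndpoint(firstRows);
        FakeEndpoint second = new FakeEndpoint(secondRows);
        for (Integer ignored : first.query()) {
            second.query();
        }
        return first.requests + second.requests;
    }

    /** Each SERVICE is sent exactly once; the join would then happen locally. */
    static int independentRequests(int firstRows, int secondRows) {
        FakeEndpoint first = new FakeEndpoint(firstRows);
        FakeEndpoint second = new FakeEndpoint(secondRows);
        first.query();
        second.query();
        return first.requests + second.requests;
    }

    public static void main(String[] args) {
        System.out.println("per-row requests:     " + perRowRequests(150, 150));
        System.out.println("independent requests: " + independentRequests(150, 150));
    }
}
```

With 150 rows from the first endpoint, the per-row strategy sends 151
requests in total, while the independent strategy sends 2 regardless of
how many rows come back.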

To work around this, I've turned off the optimizer by calling
Optimize.noOptimizer() [2] from a simple class, which is loaded from the
parent endpoint's TDB assembler file. As expected, that allows the
parent to make only two SERVICE calls.

This is the current state of things. I'd like to take it further to get
more out of this, but at this point, I need a different set of eyes.

[I will prepare a chart for this, but this rough explanation might do
for now] As there are different endpoints with different amounts of
data, what I've experienced is that some of the fastest queries take
around 3 seconds. Those are typically queries with a low number of
joins; ~150x150=22500 possibilities before the last filter kicks in. It
gets heavy quite fast, as I've seen some queries take 30 seconds or more.
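
The arithmetic above can be sketched as follows. This is only an
illustration in plain Java, not the actual query: the class and method
names are made up, and the equality test on made-up area codes is an
assumed stand-in for the real final filter. With no common variables,
every row from one endpoint is paired with every row from the other
before the filter runs.

```java
import java.util.ArrayList;
import java.util.List;

public class CrossJoinSketch {
    /** Number of pairs the dispatching endpoint must examine when there are no common variables. */
    static long candidatePairs(int leftRows, int rightRows) {
        return (long) leftRows * rightRows;
    }

    /** Generates n hypothetical area codes: area0, area1, ... */
    static List<String> codes(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add("area" + i);
        }
        return out;
    }

    /** Pairs every left row with every right row, then keeps pairs passing the (assumed) filter. */
    static List<String[]> joinThenFilter(List<String> left, List<String> right) {
        List<String[]> kept = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                if (l.equals(r)) { // assumption: stand-in for the real final FILTER
                    kept.add(new String[] {l, r});
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("pairs examined: " + candidatePairs(150, 150));
        System.out.println("pairs kept:     " + joinThenFilter(codes(150), codes(150)).size());
    }
}
```

In this toy setup 22500 pairs are examined and 150 survive, and the
number of pairs examined grows with the product of the two result sizes.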

The TDB optimizer stats file is up to date on all endpoints.

I am completely open to suggestions on how this query can be
restructured, and would also like to hear about your own experiences
with federated queries.

[1]
http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
[2]
http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()

-Sarven
http://csarven.ca/#i
