I want to use Spark SQL (1.6) to perform "filtered equi-joins" of the form

A inner join B where A.group_id = B.group_id and pair_filter_udf(A[cols], B[cols])
The group id is coarse here: a single group_id value may be associated with, say, 10,000 records in both A and B.

If the equi-join were performed without the pair filter UDF, the coarseness of the group id would cause computational problems. For example, a group id with 10,000 records in both A and B would contribute 100 million rows to the join. With thousands of such large groups, the intermediate table would be enormous and we would quickly run out of memory.
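To make the arithmetic concrete, here is a back-of-the-envelope sketch. The 10,000 figure comes from the text above; the group count of 1,000 is only an assumed stand-in for "thousands of such groups".

```python
# Size of an unfiltered equi-join on a coarse key.
rows_per_group_a = 10000  # records per group_id in A (figure from the text)
rows_per_group_b = 10000  # records per group_id in B (figure from the text)
num_groups = 1000         # assumption: a stand-in for "thousands" of large groups

# Every A row in a group pairs with every B row in the same group.
pairs_per_group = rows_per_group_a * rows_per_group_b
total_pairs = pairs_per_group * num_groups

print(pairs_per_group)  # 100000000 (100 million)
print(total_pairs)      # 100000000000 (100 billion)
```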

Therefore, rather than waiting until all pairs have been formed, the pair filter UDF must be pushed into the join so that pairs are filtered as they are generated. My question is whether Spark SQL does this.
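As a plain-Python illustration of what "filtering pairs as they are generated" means (this is not Spark code and says nothing about what Spark actually does), a generator can test each candidate pair before it is kept. The names `filtered_pairs` and `close`, and the toy data, are hypothetical.

```python
def filtered_pairs(rows_a, rows_b, pair_filter):
    """Yield matching (a, b) pairs one at a time.

    Rejected pairs are discarded immediately, so the full
    cross product is never held in memory.
    """
    for a in rows_a:
        for b in rows_b:
            if pair_filter(a, b):
                yield (a, b)

# Hypothetical rows belonging to a single group in A and in B.
rows_a = range(5)
rows_b = range(5)

def close(a, b):
    # Hypothetical pair filter: keep pairs whose values differ by less than 2.
    return abs(a - b) < 2

result = list(filtered_pairs(rows_a, rows_b, close))
print(len(result))  # 13 of the 25 candidate pairs survive
```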

I set up a simple filtered equi-join and asked Spark for its query plan:

# run in PySpark Shell
import pyspark.sql.functions as F
sq = sqlContext
n = 100
g = 10
a = sq.range(n)
a = a.withColumn('grp', F.floor(a['id'] / g) * g)
a = a.withColumnRenamed('id', 'id_a')
b = sq.range(n)
b = b.withColumn('grp', F.floor(b['id'] / g) * g)
b = b.withColumnRenamed('id', 'id_b')
c = a.join(b, (a.grp == b.grp) & (F.abs(a['id_a'] - b['id_b']) < 2)).drop(b['grp'])
c = c.sort('id_a')
c = c[['grp', 'id_a', 'id_b']]
c.explain()
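For reference, the row counts this toy query should produce can be cross-checked in plain Python, without Spark. This sketch only reproduces the expected result; it reveals nothing about the plan Spark chooses. The variable names are mine.

```python
n, g = 100, 10

# Mirror the two DataFrames as (id, grp) tuples; grp is the id rounded down to a multiple of g.
rows_a = [(i, (i // g) * g) for i in range(n)]
rows_b = [(i, (i // g) * g) for i in range(n)]

# Rows an equi-join on grp alone would produce.
unfiltered = sum(1 for ia, ga in rows_a for ib, gb in rows_b if ga == gb)

# Rows that also satisfy the pair filter |id_a - id_b| < 2, sorted by id_a.
filtered = sorted(
    ((ga, ia, ib)
     for ia, ga in rows_a
     for ib, gb in rows_b
     if ga == gb and abs(ia - ib) < 2),
    key=lambda row: row[1],
)

print(unfiltered)     # 1000
print(len(filtered))  # 280
```

Here the filter keeps 280 of the 1,000 equi-joined rows; at the scale described earlier, the gap between the two numbers is what decides whether the join fits in memory.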
