Apache Spark - Modify an existing RDD without replicating memory
I am trying to implement a distributed algorithm using Spark: a computer vision algorithm over tens of thousands of images. The images are divided into "partitions" that are processed in a distributed fashion, with the help of a master process. The pseudocode goes like this:
    # Iterate t = 1 ... T

        # On each partition p = 1 ... P
        d[p] = f1(b[p], z[p], u[p])

        # On the master
        y = f2(d)

        # On each partition p = 1 ... P
        u[p] = f3(u[p], y)

        # On each partition p = 1 ... P
        #   Iterate t = 1 ... T
            z[p] = f4(b[p], y, v[p])
            v[p] = f5(z[p])

where b[p] contains the p-th partition of images as a numpy ndarray, z[p] contains some function of b[p] (also a numpy ndarray), y is computed on the master from all the partitions of d, and u[p] is updated on each partition once y is known. In my attempted implementation, b, z, and u are separate RDDs with corresponding keys (e.g. (1, b[1]), (1, z[1]) and (1, u[1]) correspond to the first partition, and so on).
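For concreteness, here is a minimal, self-contained sketch of that attempted implementation in PySpark. The functions f1 ... f5 are trivial placeholders for the real computer-vision steps, and P, T, and the array sizes are made up for illustration; only the RDD structure matters here.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="keyed-rdd-sketch")

    P, T = 4, 3                                   # partitions / iterations (illustrative)

    f1 = lambda b, z, u: b + z + u                # placeholder per-partition step
    f2 = lambda ds: sum(ds) / len(ds)             # placeholder master step over all d[p]
    f3 = lambda u, y: u + y                       # placeholder per-partition update
    f4 = lambda b, y, v: b - y + v                # placeholder inner-loop step
    f5 = lambda z: 0.5 * z                        # placeholder inner-loop step

    # Separate keyed RDDs (p, b[p]), (p, z[p]), (p, u[p]), (p, v[p])
    b = sc.parallelize([(p, np.random.rand(100, 100)) for p in range(P)], P)
    z = b.mapValues(np.zeros_like)
    u = b.mapValues(np.zeros_like)
    v = b.mapValues(np.zeros_like)

    for t in range(T):
        # each partition: d[p] = f1(b[p], z[p], u[p]) -- the joins copy the arrays
        d = b.join(z).join(u).mapValues(lambda x: f1(x[0][0], x[0][1], x[1]))
        # master: y = f2(d) computed from all partitions of d
        y = f2(d.values().collect())
        # each partition: u[p] = f3(u[p], y) (in practice y would be broadcast)
        u = u.mapValues(lambda up, y=y: f3(up, y))
        # each partition, inner loop: z[p] = f4(b[p], y, v[p]); v[p] = f5(z[p])
        for _ in range(T):
            z = b.join(v).mapValues(lambda x, y=y: f4(x[0], y, x[1]))
            v = z.mapValues(f5)

(Caching and checkpointing are omitted from the sketch.)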
The problem with using Spark is that b and z are extremely large, on the order of GBs. Since RDDs are immutable, whenever I want to "join" them (e.g. bring z[1] and b[1] onto the same machine for processing) the data is replicated, i.e. new copies of the numpy arrays are returned. This multiplies the amount of memory needed and limits the number of images that can be processed.
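To make the replication concrete, here is roughly what one of those joins looks like, continuing the sketch above. The join shuffles the data, and in PySpark the arrays are pickled and unpickled along the way, so the joined RDD holds new copies of the arrays rather than references to the ones held in b and z.

    # Bring b[p] and z[p] onto the same executor for processing.
    bz = b.join(z)                       # (p, (b[p], z[p])) -- values are new copies
    d = bz.mapValues(lambda x: f1(x[0], x[1], np.zeros_like(x[0])))
    d.count()                            # forcing evaluation materializes the copies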
I thought one way to avoid the joins would be to have a single RDD that combines all of the variables, e.g. (p, (z[p], b[p], u[p], v[p])), but the immutability problem is still there.
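For example, the combined RDD might look like the sketch below (same placeholder names as above). Even in this layout, every "update" of z[p] has to go through a transformation that builds a new RDD with new tuples; the records of the old RDD are never modified.

    # One combined RDD (p, (b[p], z[p], u[p], v[p])), illustrative sizes.
    state = sc.parallelize(
        [(p, (np.random.rand(100, 100),   # b[p]
              np.zeros((100, 100)),       # z[p]
              np.zeros((100, 100)),       # u[p]
              np.zeros((100, 100))))      # v[p]
         for p in range(P)], P)

    def update_z(rec, y):
        bp, zp, up, vp = rec
        return (bp, f4(bp, y, vp), up, vp)    # builds a new tuple with a new z[p]

    y = np.zeros((100, 100))
    # mapValues returns a *new* RDD; the old records of `state` are left untouched.
    state = state.mapValues(lambda rec, y=y: update_z(rec, y))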
So my question is: is there a workaround to update an RDD in place? For example, if I have an RDD of (p, (z[p], b[p], u[p], v[p])), can I update z[p] in memory?