Apache Spark - Modify an existing RDD without replicating memory
I am trying to implement a distributed algorithm using Spark: a computer vision algorithm over tens of thousands of images. The images are divided into "partitions" that are processed in a distributed fashion, with the help of a master process. The pseudocode goes like this:
    # Iterate t = 1 ... T

        # On each partition p = 1 ... P
        d[p] = f1(b[p], z[p], u[p])

        # On the master
        y = f2(d)

        # On each partition p = 1 ... P
        u[p] = f3(u[p], y)

        # On each partition p = 1 ... P
        #   Iterate t = 1 ... T
            z[p] = f4(b[p], y, v[p])
            v[p] = f5(z[p])

where b[p] contains the p-th partition of images as a numpy ndarray, z[p] contains some function of b[p] (also a numpy ndarray), y is computed on the master from all the partitions of d, and u[p] is updated on each partition once y is known. In my attempted implementation, b, z, and u are separate RDDs with corresponding keys (e.g. (1, b[1]), (1, z[1]) and (1, u[1]) correspond to the first partition, and so on).
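For concreteness, here is a minimal, self-contained sketch of that attempted implementation in PySpark. The functions f1 ... f5 are trivial placeholders for the real computer-vision steps, and P, T, and the array sizes are made up for illustration; only the RDD structure matters here.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="keyed-rdd-sketch")

    P, T = 4, 3                                   # partitions / iterations (illustrative)

    f1 = lambda b, z, u: b + z + u                # placeholder per-partition step
    f2 = lambda ds: sum(ds) / len(ds)             # placeholder master step over all d[p]
    f3 = lambda u, y: u + y                       # placeholder per-partition update
    f4 = lambda b, y, v: b - y + v                # placeholder inner-loop step
    f5 = lambda z: 0.5 * z                        # placeholder inner-loop step

    # Separate keyed RDDs (p, b[p]), (p, z[p]), (p, u[p]), (p, v[p])
    b = sc.parallelize([(p, np.random.rand(100, 100)) for p in range(P)], P)
    z = b.mapValues(np.zeros_like)
    u = b.mapValues(np.zeros_like)
    v = b.mapValues(np.zeros_like)

    for t in range(T):
        # each partition: d[p] = f1(b[p], z[p], u[p]) -- the joins copy the arrays
        d = b.join(z).join(u).mapValues(lambda x: f1(x[0][0], x[0][1], x[1]))
        # master: y = f2(d) computed from all partitions of d
        y = f2(d.values().collect())
        # each partition: u[p] = f3(u[p], y) (in practice y would be broadcast)
        u = u.mapValues(lambda up, y=y: f3(up, y))
        # each partition, inner loop: z[p] = f4(b[p], y, v[p]); v[p] = f5(z[p])
        for _ in range(T):
            z = b.join(v).mapValues(lambda x, y=y: f4(x[0], y, x[1]))
            v = z.mapValues(f5)

(Caching and checkpointing are omitted from the sketch.)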
The problem with using Spark is that b and z are extremely large, on the order of GBs. Since RDDs are immutable, whenever I want to "join" them (e.g. bring z[1] and b[1] onto the same machine for processing) the data is replicated, i.e. new copies of the numpy arrays are returned. This multiplies the amount of memory needed and limits the number of images that can be processed.
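To make the replication concrete, here is roughly what one of those joins looks like, continuing the sketch above. The join shuffles the data, and in PySpark the arrays are pickled and unpickled along the way, so the joined RDD holds new copies of the arrays rather than references to the ones held in b and z.

    # Bring b[p] and z[p] onto the same executor for processing.
    bz = b.join(z)                       # (p, (b[p], z[p])) -- values are new copies
    d = bz.mapValues(lambda x: f1(x[0], x[1], np.zeros_like(x[0])))
    d.count()                            # forcing evaluation materializes the copies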
I thought one way to avoid the joins would be to have a single RDD that combines all of the variables, e.g. (p, (z[p], b[p], u[p], v[p])), but the immutability problem is still there.
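For example, the combined RDD might look like the sketch below (same placeholder names as above). Even in this layout, every "update" of z[p] has to go through a transformation that builds a new RDD with new tuples; the records of the old RDD are never modified.

    # One combined RDD (p, (b[p], z[p], u[p], v[p])), illustrative sizes.
    state = sc.parallelize(
        [(p, (np.random.rand(100, 100),   # b[p]
              np.zeros((100, 100)),       # z[p]
              np.zeros((100, 100)),       # u[p]
              np.zeros((100, 100))))      # v[p]
         for p in range(P)], P)

    def update_z(rec, y):
        bp, zp, up, vp = rec
        return (bp, f4(bp, y, vp), up, vp)    # builds a new tuple with a new z[p]

    y = np.zeros((100, 100))
    # mapValues returns a *new* RDD; the old records of `state` are left untouched.
    state = state.mapValues(lambda rec, y=y: update_z(rec, y))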
So my question is: is there a workaround to update an RDD in place? For example, if I have an RDD of (p, (z[p], b[p], u[p], v[p])), can I update z[p] in memory?