apache spark - Modify an existing RDD without replicating memory
I am trying to implement a distributed algorithm using Spark. It is a computer vision algorithm that runs over tens of thousands of images. The images are divided into "partitions" that are processed in a distributed fashion, with some coordination by a master process. The pseudocode goes something like this:
# iterate t = 1 ... T
    # each partition p = 1 ... P
    d[p] = f1(b[p], z[p], u[p])

    # master
    y = f2(d)

    # each partition p = 1 ... P
    u[p] = f3(u[p], y)

    # each partition p = 1 ... P
        # iterate t = 1 ... T
        z[p] = f4(b[p], y, v[p])
        v[p] = f5(z[p])
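To make the data flow concrete, here is a rough PySpark sketch of one outer iteration. The names state, local_update, make_updater and the loop bound T are placeholders I made up for this sketch, not my actual code; state is assumed to be a keyed RDD of (p, (z, b, u, v)) tuples and f1 ... f5 are assumed to be ordinary Python/numpy functions:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def local_update(s, y, T):
        # per-partition steps of one outer iteration; s = (z, b, u, v)
        z, b, u, v = s
        u = f3(u, y)
        for _ in range(T):                 # inner iterations over z and v
            z = f4(b, y, v)
            v = f5(z)
        return (z, b, u, v)

    def make_updater(y_bc, T):
        # bind the broadcast in a closure so lazy evaluation uses the right y
        return lambda s: local_update(s, y_bc.value, T)

    for t in range(T):
        # each partition: d[p] = f1(b[p], z[p], u[p])
        d = state.mapValues(lambda s: f1(s[1], s[0], s[2]))
        # master: y = f2(d); collect() pulls the per-partition d[p] to the driver
        y = f2(dict(d.collect()))
        y_bc = sc.broadcast(y)             # ship y back to every worker once
        # each partition: update u, z and v -- this necessarily creates a new RDD
        state = state.mapValues(make_updater(y_bc, T)).cache()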
where b[p] contains the p-th partition of the images as a numpy ndarray, z[p] contains some function of b[p], also a numpy ndarray, y is computed on the master from all the partitions of d, and u[p] is updated on each partition using y. In my attempted implementation, each of b, z, and u is a separate RDD with corresponding keys (e.g. (1, b[1]), (1, z[1]), (1, u[1]) correspond to the first partition, etc.).
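In code, the current layout is roughly the following (toy sizes and names; the real b[p] are GB-sized image arrays):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    P = 4                                  # toy number of partitions
    shape = (100, 100)                     # toy array shape; the real ones are huge
    # one RDD per variable, keyed by the partition index p
    b_rdd = sc.parallelize([(p, np.random.rand(*shape)) for p in range(P)])
    z_rdd = sc.parallelize([(p, np.zeros(shape)) for p in range(P)])
    u_rdd = sc.parallelize([(p, np.zeros(shape)) for p in range(P)])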
The problem with using Spark is that b and z are extremely large, on the order of GBs. Since RDDs are immutable, whenever I want to "join" them (e.g. bring z[1] and b[1] onto the same machine for processing), the data is replicated, i.e. new copies of the numpy arrays are returned. This multiplies the amount of memory needed and limits the number of images that can be processed.
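Concretely, the join that forces the copying looks like this (continuing the toy RDDs above, with f1 the same placeholder as in the pseudocode):

    # bring b[p] and z[p] (and u[p]) onto the same worker so f1 can run
    joined = b_rdd.join(z_rdd).join(u_rdd)          # (p, ((b[p], z[p]), u[p]))
    d_rdd = joined.mapValues(lambda t: f1(t[0][0], t[0][1], t[1]))
    # the shuffle behind join() serialises the arrays and rebuilds them on the
    # receiving worker, so the records in `joined` are fresh copies, not views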
I thought one way to avoid the joins would be to have a single RDD that combines all of the variables, e.g. (p, (z[p], b[p], u[p], v[p])), but the immutability problem would still be there.
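That variant would look something like this (placeholder names again, and assuming a v_rdd built like the ones above); the last line shows that any update of z[p] still has to materialise a brand-new RDD:

    # build the combined state once: (p, (z[p], b[p], u[p], v[p]))
    state = (z_rdd.join(b_rdd).join(u_rdd).join(v_rdd)
                   .mapValues(lambda t: (t[0][0][0], t[0][0][1], t[0][1], t[1])))

    # "updating" z[p] still means building a whole new RDD of tuples
    new_state = state.mapValues(lambda s: (f4(s[1], y, s[3]), s[1], s[2], s[3]))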
So my question is: is there a workaround to update an RDD in place? For example, if I have an RDD of (p, (z[p], b[p], u[p], v[p])), can I update z[p] in-memory?