Scala - Reading multiple files compressed in a tar.gz archive into Spark
I'm trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files
file1.json file2.json file3.json
and these are contained in archive.tar.gz.
I want to create a DataFrame from the JSON files. The problem is that Spark is not reading in the JSON files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz")
or sc.textFile("archive.tar.gz")
results in garbled/extra output.
Is there a way to handle gzipped archives containing multiple files in Spark?
Update
Using the method given in the answer to "Read whole text files from a compression in Spark", I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Some of the archives I'm dealing with reach sizes of up to 2 GB after compression, so I'm wondering if there is an efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files, as this would be time consuming.
A solution is given in "Read whole text files from a compression in Spark". Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:
    // extractFiles and decode are helper functions from the linked answer, not part of Spark
    val jsonRDD = sc.binaryFiles("gzarchive/*").
      flatMapValues(x => extractFiles(x).toOption).
      mapValues(_.map(decode()))

    val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
This method works fine for tar archives of relatively small size, but it is not suitable for large archives.
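For context on why sc.textFile produces garbled output: the .gz layer is decompressed, but what comes out is still a raw tar stream, in which every file is preceded by a 512-byte header and padded to a multiple of 512 bytes. The sketch below shows roughly what an extraction helper has to do, using only the JDK. It is not the code from the linked answer; the names TarGzSketch and extractFiles are made up for illustration, and it assumes plain ustar-style headers (entry names under 100 characters, sizes stored in octal).

```scala
import java.io.{DataInputStream, EOFException, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not the implementation from the linked answer.
object TarGzSketch {
  private val BlockSize = 512

  // Read a NUL-terminated ASCII field out of a tar header block.
  private def field(header: Array[Byte], offset: Int, length: Int): String =
    new String(header, offset, length, StandardCharsets.US_ASCII)
      .takeWhile(_ != '\u0000').trim

  /** Returns (entry name, contents) for every regular file in a .tar.gz stream. */
  def extractFiles(in: InputStream): List[(String, Array[Byte])] = {
    val tar = new DataInputStream(new GZIPInputStream(in))
    val files = ArrayBuffer.empty[(String, Array[Byte])]
    val header = new Array[Byte](BlockSize)

    // Returns false once the stream is exhausted.
    def readHeader(): Boolean =
      try { tar.readFully(header); true } catch { case _: EOFException => false }

    // An all-zero header block marks the end of the archive.
    while (readHeader() && !header.forall(_ == 0)) {
      val name = field(header, 0, 100)                                    // name: bytes 0-99
      val size = java.lang.Long.parseLong(field(header, 124, 12), 8).toInt // size: octal, bytes 124-135
      val typeFlag = header(156).toChar                                   // '0' or NUL = regular file

      val data = new Array[Byte](size)
      tar.readFully(data)
      // Entry data is padded up to the next 512-byte boundary.
      tar.readFully(new Array[Byte]((BlockSize - size % BlockSize) % BlockSize))

      if (typeFlag == '0' || typeFlag == '\u0000') files += ((name, data))
    }
    tar.close()
    files.toList
  }
}
```

With sc.binaryFiles, each archive arrives as a PortableDataStream, so something like TarGzSketch.extractFiles(stream.open()) could be called per archive. Note that this still unpacks a whole archive inside a single task and holds its files in memory, which is consistent with the trouble seen on large archives.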
A better solution to the problem seems to be converting the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
See: stuartsierra.com/2008/04/24/a-million-little-files
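A rough sketch of such a conversion, done once outside Spark by streaming the archive entry by entry, might look like the following. It assumes Hadoop (2.x or later) and Apache Commons Compress are on the classpath; the object name TarToSequenceFile, the convert function, and the choice of Text for both key (file name) and value (file contents) are my own assumptions, not taken from the linked post.

```scala
import java.io.{BufferedInputStream, ByteArrayOutputStream, FileInputStream}
import java.util.zip.GZIPInputStream

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{SequenceFile, Text}

// Hypothetical converter: one SequenceFile record per file in the archive.
object TarToSequenceFile {

  /** Streams tarGzPath into a SequenceFile at seqPath; returns the number of records written. */
  def convert(tarGzPath: String, seqPath: String): Int = {
    val tar = new TarArchiveInputStream(
      new GZIPInputStream(new BufferedInputStream(new FileInputStream(tarGzPath))))
    val writer = SequenceFile.createWriter(
      new Configuration(),
      SequenceFile.Writer.file(new Path(seqPath)),
      SequenceFile.Writer.keyClass(classOf[Text]),
      SequenceFile.Writer.valueClass(classOf[Text]))

    var count = 0
    try {
      val buf = new Array[Byte](8192)
      var entry = tar.getNextTarEntry
      while (entry != null) {
        if (entry.isFile) {
          // read() returns -1 at the end of the current entry, not of the whole archive.
          val out = new ByteArrayOutputStream()
          var n = tar.read(buf)
          while (n != -1) { out.write(buf, 0, n); n = tar.read(buf) }
          writer.append(new Text(entry.getName), new Text(out.toByteArray))
          count += 1
        }
        entry = tar.getNextTarEntry
      }
    } finally {
      writer.close()
      tar.close()
    }
    count
  }
}
```

Only one archive entry is held in memory at a time, so the archive size matters much less than the size of the largest file inside it. The resulting file should then be readable in Spark with sc.sequenceFile[String, String](path), whose values can be passed to sqlContext.read.json much like the RDD in the snippet above.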