scala - Reading in multiple files compressed in a tar.gz archive into Spark



I'm trying to create a Spark RDD from several JSON files compressed into a tar. For example, I have 3 files

file1.json
file2.json
file3.json

and these are contained in archive.tar.gz.

I want to create a DataFrame from the JSON files. The problem is that Spark is not reading in the JSON files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
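For reference, the two reads mentioned above look roughly like this (a sketch using the example archive name). Spark's gzip codec transparently decompresses the .gz layer, but the payload is still a tar stream, so tar headers and block padding end up interleaved with the JSON text, which is where the garbled/extra output comes from:

// Both calls decompress the gzip layer, but then treat the raw tar stream as text/JSON.
val rdd = sc.textFile("archive.tar.gz")
val df = sqlContext.read.json("archive.tar.gz")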

Is there a way to handle gzipped archives containing multiple files in Spark?

Update:

Using the method given in the answer to Read whole text files from a compression in Spark, I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Some of the archives I'm dealing with reach sizes of up to 2 GB after compression, so I'm wondering if there is an efficient way to deal with the problem.

I'm trying to avoid extracting the archives and merging the files beforehand, as this would be time consuming.

A solution is given in Read whole text files from a compression in Spark. Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:

val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
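For completeness, here is a rough reconstruction of the extractFiles and decode helpers used above (my own sketch in the spirit of the linked answer, assuming Apache Commons Compress is on the classpath; only the two names come from the snippet, everything else is an assumption). extractFiles unpacks every entry of a gzipped tar into an in-memory byte array, and decode turns one such byte array into a String:

import java.nio.charset.StandardCharsets
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try

// Unpack all entries of a gzipped tar into in-memory byte arrays.
def extractFiles(ps: PortableDataStream, bufferSize: Int = 1024): Try[Seq[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open()))
  Iterator.continually(tar.getNextTarEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .map { _ =>
      // Read the current tar entry fully into a byte array.
      val out = new java.io.ByteArrayOutputStream()
      val buffer = new Array[Byte](bufferSize)
      Iterator.continually(tar.read(buffer))
        .takeWhile(_ != -1)
        .foreach(n => out.write(buffer, 0, n))
      out.toByteArray
    }
    .toSeq
}

// Decode the raw bytes of one extracted file into a String (UTF-8 by default).
def decode(charset: java.nio.charset.Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)

Note that each archive is still read and unpacked by a single task, so the full contents of one archive have to fit in that task's memory, which is what makes this approach struggle with the larger archives.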

This method works fine for tar archives of relatively small size, but is not suitable for large archives.

A better solution to this problem seems to be converting the tar archives to Hadoop SequenceFiles, which are splittable and can therefore be read and processed in parallel in Spark (as opposed to tar archives). See the sketch below the link.

See: stuartsierra.com/2008/04/24/a-million-little-files
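As a rough illustration of that idea (the paths and the jsonseq name here are my own placeholders, not from the post above): the archives are unpacked once, each member file is written out as a (source path, file contents) record in a SequenceFile, and later jobs read the SequenceFile in parallel. The conversion step still processes one archive per task, but everything downstream of it is splittable:

// One-time conversion: unpack each archive (reusing extractFiles/decode from above)
// and save the member files as (archive path -> JSON string) records.
sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .flatMapValues(identity)
  .mapValues(decode())
  .saveAsSequenceFile("jsonseq")

// Later jobs read the splittable SequenceFile, so the JSON strings are spread
// across partitions instead of one task per original archive.
val jsonStrings = sc.sequenceFile[String, String]("jsonseq").values
val df = sqlContext.read.json(jsonStrings)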

