scala - Reading in multiple files compressed in a tar.gz archive into Spark



I'm trying to create a Spark RDD from several JSON files compressed into a tar. For example, I have 3 files

file1.json
file2.json
file3.json

and these are contained in archive.tar.gz.

I want to create a DataFrame from the JSON files. The problem is that Spark is not reading in the JSON files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
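For reference, the two reads mentioned above look roughly like this (a sketch using the example archive name). Spark's gzip codec transparently decompresses the .gz layer, but the payload is still a tar stream, so tar headers and block padding end up interleaved with the JSON text, which is where the garbled/extra output comes from:

// Both calls decompress the gzip layer, but then treat the raw tar stream as text/JSON.
val rdd = sc.textFile("archive.tar.gz")
val df = sqlContext.read.json("archive.tar.gz")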

Is there a way to handle gzipped archives containing multiple files in Spark?

Update:

Using the method given in the answer to Read whole text files from a compression in Spark, I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Some of the archives I'm dealing with reach sizes of up to 2 GB after compression, so I'm wondering if there is an efficient way to deal with the problem.

I'm trying to avoid extracting the archives and merging the files beforehand, as this would be time consuming.

A solution is given in Read whole text files from a compression in Spark. Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:

val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
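For completeness, here is a rough reconstruction of the extractFiles and decode helpers used above (my own sketch in the spirit of the linked answer, assuming Apache Commons Compress is on the classpath; only the two names come from the snippet, everything else is an assumption). extractFiles unpacks every entry of a gzipped tar into an in-memory byte array, and decode turns one such byte array into a String:

import java.nio.charset.StandardCharsets
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try

// Unpack all entries of a gzipped tar into in-memory byte arrays.
def extractFiles(ps: PortableDataStream, bufferSize: Int = 1024): Try[Seq[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open()))
  Iterator.continually(tar.getNextTarEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .map { _ =>
      // Read the current tar entry fully into a byte array.
      val out = new java.io.ByteArrayOutputStream()
      val buffer = new Array[Byte](bufferSize)
      Iterator.continually(tar.read(buffer))
        .takeWhile(_ != -1)
        .foreach(n => out.write(buffer, 0, n))
      out.toByteArray
    }
    .toSeq
}

// Decode the raw bytes of one extracted file into a String (UTF-8 by default).
def decode(charset: java.nio.charset.Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)

Note that each archive is still read and unpacked by a single task, so the full contents of one archive have to fit in that task's memory, which is what makes this approach struggle with the larger archives.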

This method works fine for tar archives of relatively small size, but is not suitable for large archives.

A better solution to this problem seems to be converting the tar archives to Hadoop SequenceFiles, which are splittable and can therefore be read and processed in parallel in Spark (as opposed to tar archives). See the sketch below the link.

See: stuartsierra.com/2008/04/24/a-million-little-files
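As a rough illustration of that idea (the paths and the jsonseq name here are my own placeholders, not from the post above): the archives are unpacked once, each member file is written out as a (source path, file contents) record in a SequenceFile, and later jobs read the SequenceFile in parallel. The conversion step still processes one archive per task, but everything downstream of it is splittable:

// One-time conversion: unpack each archive (reusing extractFiles/decode from above)
// and save the member files as (archive path -> JSON string) records.
sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .flatMapValues(identity)
  .mapValues(decode())
  .saveAsSequenceFile("jsonseq")

// Later jobs read the splittable SequenceFile, so the JSON strings are spread
// across partitions instead of one task per original archive.
val jsonStrings = sc.sequenceFile[String, String]("jsonseq").values
val df = sqlContext.read.json(jsonStrings)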

