Big Data File Format Zoo
Josh Reuben

Big Data has a plethora of data file formats - it's important to understand their strengths and weaknesses. Most explorers start out with some NoSQL-exported JSON data. However, specialized data structures are required, because putting each blob of binary data into its own file just doesn't scale across a distributed filesystem.

TL;DR: Choose Parquet!

Row-oriented File Formats

(SequenceFile, Avro) – best when a large number of columns of a single row are needed for processing at the same time. General-purpose, splittable, compressible formats.

Rows are stored contiguously in the file.

Suitable for streaming - after a writer failure, the file can still be read up to the last sync point offset.

SequenceFile

A persistent data structure for binary key-value pairs - e.g., each record could be a line of text from a logfile.

Key - e.g., a timestamp represented by a LongWritable; value == a Writable.

HDFS / MapReduce are optimized for large files: packing many small files into a SequenceFile gives efficient storage / processing of the smaller files.
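
A minimal sketch of packing LongWritable / Text records into a SequenceFile (the file name and record contents are illustrative assumptions; it writes the kind of numbers.seq file used below):

// Sketch - write LongWritable/Text pairs into a SequenceFile.
// Path and record contents are illustrative assumptions.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("numbers.seq");
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(LongWritable.class),
          SequenceFile.Writer.valueClass(Text.class),
          // optional: block compression (discussed below)
          SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
      for (long i = 0; i < 100; i++) {
        // key = timestamp-like long, value = arbitrary text payload
        writer.append(new LongWritable(i), new Text("record-" + i));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}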

The positions of the sync points can be used to resynchronize with a record boundary if the reader is “lost” - e.g., after seeking to an arbitrary position in the stream.
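
A sketch of resynchronizing a reader (reusing the numbers.seq file from the sketch above; the byte offset is an arbitrary illustration):

// Sketch - seek past an arbitrary offset, then resynchronize on a sync point.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSyncDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(new Path("numbers.seq")));
    LongWritable key = new LongWritable();
    Text value = new Text();
    reader.sync(4042);                 // jump to the next sync point after byte 4042
    while (reader.next(key, value)) {  // reading resumes on a clean record boundary
      System.out.printf("[%d]\t%s\t%s%n", reader.getPosition(), key, value);
    }
    reader.close();
  }
}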

The hadoop fs command's -text option displays sequence files in textual form; it can recognize gzipped, sequence, and Avro files:

% hadoop fs -text numbers.seq | head

A sequence file consists of a header followed by 1 or more records. The first 3 bytes are the SEQ magic number, followed by 1 byte for the version number. The header contains other fields: key / value class names, compression details, user-defined metadata, and the sync marker.

Each file has a randomly generated sync marker stored in the header.

If no compression is enabled (the default), each record is made up of the record length (in bytes) + key length + key + value. Block compression compresses multiple records at once; it is more compact and preferred. Records are added to a block until it reaches a minimum size in bytes, defined by the io.seqfile.compress.blocksize property (default 1 MB). A sync marker is written before the start of every block.

MapFile

A sorted SequenceFile with an index to permit lookups by key. The index is itself a SequenceFile that contains every 128th key and is loaded into memory for fast lookups.

Similar interface to SequenceFile for reading and writing - but when writing using MapFile.Writer, map entries must be added in order, otherwise an IOException is thrown.
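
A minimal sketch, assuming IntWritable keys and Text values (names and paths are illustrative) - note the keys are appended in increasing order:

// Sketch - write a MapFile (keys must be added in sorted order), then look one up by key.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "numbers.map";   // a MapFile is a directory containing "data" + "index"

    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir,
        IntWritable.class, Text.class);
    for (int i = 1; i <= 1000; i++) {
      writer.append(new IntWritable(i), new Text("entry-" + i));  // in order, else IOException
    }
    writer.close();

    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    reader.get(new IntWritable(496), value);   // index lookup + scan from nearest indexed key
    System.out.println(value);
    reader.close();
  }
}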

MapFile variants:

  • SetFile - stores a set of Writable keys.

  • ArrayFile - key == integer index of the array element, value == a Writable.

  • BloomMapFile - fast get() for sparse files via a dynamic in-memory Bloom filter over the keys.

Avro

A language-neutral data serialization system that specifies an object container format for sequences of objects, for portability.

Like sequence files (compact, splittable), but has a schema → makes it language-agnostic.

A self-describing metadata section holds the JSON schema (.avsc) --> compact binary-encoded values do not need to be tagged with a field identifier. The read schema need not be identical to the write schema --> schema evolution.

File format - similar in design to Hadoop’s SequenceFile format, but designed to be A) portable across languages, B) splittable, C) sortable

Avro primitive types: null, boolean, int, long, float, double, bytes, string

Avro complex types: array, record, map, enum, fixed, union

The schema supports generic mapping, reflect mapping, schema evolution, and multiple language bindings.
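
A minimal sketch of writing and reading an Avro container file via the generic mapping (the StringPair schema, field names, and file path are illustrative assumptions):

// Sketch - write a GenericRecord to an Avro data file, then read it back.
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"StringPair\",\"fields\":["
      + "{\"name\":\"left\",\"type\":\"string\"},"
      + "{\"name\":\"right\",\"type\":\"string\"}]}");

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("left", "L");
    rec.put("right", "R");

    File file = new File("pairs.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);   // the schema is embedded in the file's metadata section
    writer.append(rec);
    writer.close();

    // A no-arg GenericDatumReader uses the writer schema stored in the file; a different
    // (compatible) reader schema could be passed to its constructor - schema evolution.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<>(file, new GenericDatumReader<GenericRecord>());
    for (GenericRecord r : reader) {
      System.out.println(r.get("left") + "," + r.get("right"));
    }
    reader.close();
  }
}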

Column-Oriented File Formats

(Parquet, RCFile, ORCFile) - best when queries access only a small number of columns.

Rows in a file (or Hive table) are broken up into row splits, then each split is stored in column-oriented fashion: the values for each row in the first column are stored first, followed by the values for each row in the second column, and so on.

columns that are not accessed in a query are skipped.

Disadvantages:

  • Need more memory for reading / writing, since they have to buffer a whole row split in memory, rather than just a single row.

  • Not possible to control when writes occur (via flush or sync operations) - not suited to streaming writes, as the current file cannot be recovered if the writer process fails.

ORCFile

Hive's ORC (Optimized Row Columnar) file format.

Parquet

Flat columnar storage file format - stores deeply nested data efficiently (file size, query performance).

Values from a column are stored contiguously --> efficient encoding; the query engine skips over unneeded columns.

Compare to Hive's ORCFile (Optimized Row Columnar file).

From Google's Dremel paper; developed by Twitter and Cloudera.

Language-neutral - can use the in-memory data models of Avro, Thrift, or Protocol Buffers to read / write Parquet files.

Dremel nested encoding: each value is encoded as two ints - a definition level and a repetition level --> nested values can be read independently.
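
A tiny worked example (the schema is illustrative; levels follow the Dremel encoding just described):

message m { repeated int32 a; }

record 1: a = [1, 2]  -->  value=1 (repetition=0, definition=1), value=2 (repetition=1, definition=1)
record 2: a = []      -->  no value, encoded as (repetition=0, definition=0)

A repetition level of 0 marks the start of a new record; a definition level below the maximum (here 1) means the value is not defined at that depth.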

Primitive types: boolean, int32, int64, int96, float, double, binary, fixed_len_byte_array

Schema - the root, a message, contains fields. Each field has a repetition (required, optional, or repeated), a type, and a name.

Group - complex type

Logical types: UTF8, ENUM, DECIMAL(precision,scale), LIST, DATE, MAP

message m {
  required int32 field1;
  required binary field2 (UTF8);
  required group field3 (LIST) {
    repeated group list {
      required int32 element;
    }
  }
  required int32 field4 (DATE);
  required group field5 (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional int32 value;
    }
  }
}

File Format: Header, blocks, footer.

  • Header: 4-byte magic number, PAR1

  • Footer – stores all file metadata: format version, schema, block metadata, followed by the footer length and PAR1 → 2 seeks to read the metadata (the length at the end of the file, then the footer itself), but no sync markers are needed for block boundaries --> splittable files for parallel processing

  • Each block stores a row group of column chunks (a slice of the rows for each column), written as pages per column – compression can be applied per column.

Compression: automatic per-column-type encodings (delta encoding, run-length encoding, dictionary encoding, bit packing) + page compression (Snappy, gzip, LZO).
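
A minimal sketch of writing a Parquet file from Java via the parquet-avro module (the Avro schema, path, and codec choice are illustrative assumptions):

// Sketch - write GenericRecords to a Snappy-compressed Parquet file via AvroParquetWriter.
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteDemo {
  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
      + "{\"name\":\"field1\",\"type\":\"int\"},"
      + "{\"name\":\"field2\",\"type\":\"string\"}]}");

    ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("pairs.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)  // per-page compression
        .build();

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("field1", 1);
    rec.put("field2", "one");
    writer.write(rec);
    writer.close();
  }
}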

ParquetOutputFormat properties - set at write time:

  • parquet.block.size (default 128 MB) - trades off scanning efficiency vs. memory usage. If jobs fail due to out-of-memory errors, adjust this down.

  • parquet.page.size (default 1 MB)

  • parquet.dictionary.page.size

  • parquet.enable.dictionary

  • parquet.compression (Snappy, gzip, LZO)
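
A sketch of setting these on a MapReduce Job at write time (assuming the parquet-hadoop ParquetOutputFormat; the values shown mirror the defaults above and are illustrative):

// Sketch - configure ParquetOutputFormat properties on a Job before submission.
// In practice a concrete subclass such as AvroParquetOutputFormat supplies the write support.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetJobConfigDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "parquet-write");
    job.setOutputFormatClass(ParquetOutputFormat.class);
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);        // parquet.block.size
    ParquetOutputFormat.setPageSize(job, 1024 * 1024);               // parquet.page.size
    ParquetOutputFormat.setDictionaryPageSize(job, 1024 * 1024);     // parquet.dictionary.page.size
    ParquetOutputFormat.setEnableDictionary(job, true);              // parquet.enable.dictionary
    ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY); // parquet.compression
  }
}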

The Parquet command-line tools can dump a Parquet file for inspection:

% parquet-tools dump output/part-m-00000.parquet
