Geeks With Blogs
Josh Reuben

Big Data keeps evolving. Stone Age Hadoop was a lot of Java boilerplate for defining HDFS access, Mappers, and Reducers. It was superseded by Bronze Age Spark, which provided a succinct Scala unification of:

  • ML pipelines
  • in-memory structured DataSets over RDDs via a SparkSession SQL API
  • Distributed Streams

(Note: you can easily run such jobs in a dynamically scalable manner on Google Cloud Dataproc.) Technology keeps evolving - the Big-Iron Age has arrived in the form of Google Cloud Platform's SPARK KILLER - a next-gen Big Data stack consisting of:

  • BigQuery - massively parallel, blazing-fast data analytics (see https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query). It fetches from a column store over a 1 Pb/s backbone and computes aggregates across hundreds of thousands of VMs - you can't afford to scale Spark to such a distributed cluster!
  • TensorFlow - Deep Learning framework (the biggest Machine Learning breakthrough of the decade), distributed and accelerated on custom 'TensorChips' (Tensor Processing Units, TPUs)
  • DataFlow - (Apache Beam) a structured streaming / windowing paradigm with a clean, rich API that matches this metaphor - see https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison. You build up a transform graph connecting sources and sinks, many of which come out of the box - e.g. Pub/Sub to BigQuery
  • DataLab - Jupyter Notebooks with some extra IPython integration-convenience magics: %%sql, %%bigquery, %%storage, %%chart, %%mlalpha, %tensorboard, %%monitoring

BigQuery has 4 components which you can read about here: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

  • Borg - Resource Manager
  • Colossus - Distributed FS
  • Jupiter - 1 Pb/s network
  • Dremel - query engine; the Dremel paper inspired the open-source Apache Drill project

Anyhow, IMHO Spark cannot compete - but here I want to focus on the BigQuery API. Here's my QuickRef:

bq CLI

  • cancel - Requests cancellation of a job and optionally waits for it to complete
bq --nosync cancel job_id
  • cp - Copies a table
bq cp dataset.old_table dataset2.new_table
  • extract - Extracts source_table into Cloud Storage destination_uris.
bq extract ds.summary gs://mybucket/summary.csv
  • head - Displays rows in a table. params: --[no]job, --max_rows, --start_row, --[no]table
bq head -s 5 -n 10 dataset.table
  • init - Authenticates and creates a default .bigqueryrc file.
  • insert - Insert JSON rows (from file or string) into a table
bq insert dataset.table /tmp/mydata.json
echo '{"a":1, "b":2}' | bq insert dataset.table
  • load - Loads a source file / URIs into destination_table, with an optional JSON schema file / string

bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
  • ls - Lists objects contained in a project or dataset. Flags: -j shows jobs, -p shows all projects
bq ls mydataset
bq ls --filter labels.color:red -a -j -p -n 1000
  • mk - Create a dataset, table or view
bq mk -d --data_location=EU new_dataset
bq --dataset_id=new_dataset mk -t new_dataset.newtable name:integer,value:string
bq mk --view='select 1 as num' new_dataset.newview
  • mkdef - Emits a JSON definition for a GCS-backed table.
bq mkdef 'gs://bucket/file.csv' field1:integer,field2:string
  • partition - Copies date-sharded source tables (e.g. sharded_YYYYMMDD) into a partitioned table, with the date suffix of each source table becoming the partition date of the corresponding destination partition.
bq partition dataset1.sharded_ dataset2.partitioned_table
  • query - Execute a query
bq query 'select count(*) from publicdata:samples.shakespeare'
  • rm - Delete dataset / table (-d / -t flags signify target type, -f force, -r remove all tables of a dataset)
bq rm ds.table
bq rm -r -f old_dataset
  • shell - Start an interactive bq session.
  • show - Show all information about an object.
  • update - Updates a dataset, table or view.
bq update --description "My table" existing_dataset.existing_table
bq update -t existing_dataset.existing_table name:integer,value:string
bq update --view='select 1 as num' existing_dataset.existing_view --view_udf_resource=path/to/file.js
  • version
  • wait - Waits the given number of seconds for a job to finish
bq wait --fail_on_error job_id 100

SQL Syntax

query_statement:

[ WITH with_query_name AS ( select ) [, ...] ]
query_expr

query_expr:

{ select | ( query_expr ) | query_expr set_op query_expr }
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ LIMIT count [ OFFSET skip_rows ] ]

select:

SELECT  [{ ALL | DISTINCT }]
  { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
  [ REPLACE ( expression [ AS ] column_name [, ...] ) ]
  | expression [ [ AS ] alias ] } [, ...]
[ FROM from_item  [, ...] ]
[ WHERE bool_expression ]
[ GROUP BY expression [, ...] ]
[ HAVING bool_expression ]
[ WINDOW window_name AS ( window_definition ) [, ...] ]

  • ALL - returns all rows (the default), including duplicates; opt out with DISTINCT
  • EXCEPT - specifies columns to exclude from the result
  • REPLACE - replaces a column's value with an expression
  • HAVING - similar to WHERE, but evaluated over GROUP BY aggregates

set_op:

UNION { ALL | DISTINCT }

from_item:

{
    table_name [ [ AS ] alias ] |
    join |
    ( select ) [ [ AS ] alias ] |
    field_path |
    { UNNEST( array_expression ) | UNNEST( array_path ) | array_path }
    [ [ AS ] alias ] [ WITH OFFSET [ [ AS ] alias ] ] |
    with_query_name [ [ AS ] alias ]
}

UNNEST iterates through an ARRAY and produces one row for each ARRAY element. WITH OFFSET adds an index column to the UNNEST rows.
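To make the UNNEST semantics concrete, here is a rough Python sketch - the helper `unnest_with_offset` is hypothetical, purely for illustration, not part of any BigQuery client:

```python
# Hypothetical Python analogue of BigQuery's UNNEST(arr) WITH OFFSET:
# flatten an ARRAY into one row per element, paired with its 0-based index.
def unnest_with_offset(array):
    """Yield (element, offset) pairs, like UNNEST(arr) WITH OFFSET."""
    return [(elem, idx) for idx, elem in enumerate(array)]

rows = unnest_with_offset(["a", "b", "c"])
print(rows)  # [('a', 0), ('b', 1), ('c', 2)]
```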

join:

from_item join_type JOIN from_item
[ { ON bool_expression | USING ( join_column [, ...] ) } ]

The **USING** clause specifies a list of one or more columns that occur in both input tables, and performs an equality comparison on them.

join_type:

{ [INNER] | CROSS | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }

Data types

  • INT64
  • FLOAT64
  • BOOL
  • STRING
  • BYTES
  • DATE
  • TIME
  • TIMESTAMP
  • ARRAY
  • STRUCT - e.g. ARRAY<STRUCT<ARRAY<STRUCT<a int64, b string>>>> - a STRUCT sits between the two ARRAYs because ARRAYs cannot directly contain other ARRAYs!
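A rough Python sketch of that nesting rule, with dicts standing in for STRUCTs (field name `inner` is made up for illustration):

```python
# Python model of ARRAY<STRUCT<inner ARRAY<INT64>>>: BigQuery forbids
# ARRAY<ARRAY<...>>, so each inner array is wrapped in a STRUCT (here, a dict).
matrix = [
    {"inner": [1, 2, 3]},   # STRUCT wrapping the first inner ARRAY
    {"inner": [4, 5, 6]},   # STRUCT wrapping the second inner ARRAY
]

# Unwrapping both levels recovers the flat element stream.
flat = [x for row in matrix for x in row["inner"]]
print(flat)  # [1, 2, 3, 4, 5, 6]
```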

Functions

Converters

  • CAST

Aggregate functions

  • ANY_VALUE - returns a value from an arbitrary input row; NULL if there are no input rows
  • APPROX_COUNT_DISTINCT - statistical estimate for big data
  • APPROX_QUANTILES - gets approximate boundaries for n quantiles
  • APPROX_TOP_COUNT - returns the approximate top n elements, with their counts
  • APPROX_TOP_SUM - gets the approximate top n elements, based on the sum of an assigned weight
  • ARRAY_AGG - returns an ARRAY of all expression values
  • ARRAY_CONCAT_AGG - concats ARRAYs into a single ARRAY
  • AVG, MIN, MAX, SUM
  • BIT_AND, BIT_OR, BIT_XOR
  • COUNT, COUNT_IF
  • LOGICAL_AND, LOGICAL_OR
  • STRING_AGG - concatenates strings
SELECT ARRAY_AGG(x*2) as array_agg
FROM UNNEST([1, 2, 3, 4, 5]) as x;
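The ARRAY_AGG query above collects the doubled elements into one array; the same result in plain Python, as a sanity check on the semantics:

```python
# Equivalent of: SELECT ARRAY_AGG(x*2) AS array_agg FROM UNNEST([1,2,3,4,5]) AS x
source = [1, 2, 3, 4, 5]
array_agg = [x * 2 for x in source]  # aggregate every expression value into an ARRAY
print(array_agg)  # [2, 4, 6, 8, 10]
```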

Statistical Aggregate Functions

  • CORR - Pearson coefficient of correlation of (x,y) pair set: -1 <= r <= 1
  • COVAR_POP - population covariance of (x,y) pair set: -Inf <= r <= +Inf
  • COVAR_SAMP - sample covariance of (x,y) pair set: -Inf <= r <= +Inf
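A small Python sketch of these three statistical aggregates over a sample (x, y) pair set - computed from the textbook definitions, not via any BigQuery library:

```python
import math

# CORR / COVAR_POP / COVAR_SAMP over a set of (x, y) pairs.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
n = len(pairs)
mx, my = sum(xs) / n, sum(ys) / n

covar_pop = sum((x - mx) * (y - my) for x, y in pairs) / n         # COVAR_POP
covar_samp = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)  # COVAR_SAMP
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
corr = covar_pop / (sx * sy)                                       # CORR

print(round(corr, 9))  # 1.0 -- y is a perfect linear function of x
```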

Window-Frame Analytic Functions

OLAP time-slice aggregations

Analytic Function Syntax

analytic_function_name ( [ argument_list ] )
  OVER { window_name | ( [ window_specification ] ) }

window_specification:

  [ window_name ]
  [ PARTITION BY partition_expression_list ]
  [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
  [ window_frame_clause ]

window_frame_clause:

{ ROWS | RANGE }
{
  { UNBOUNDED PRECEDING | numeric_expression PRECEDING | CURRENT ROW }
  |
  { BETWEEN window_frame_boundary_start AND window_frame_boundary_end }
}

e.g. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

window_frame_boundary_start:

{ UNBOUNDED PRECEDING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }

window_frame_boundary_end:

{ UNBOUNDED FOLLOWING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }
  • PARTITION BY - similar to GROUP BY
  • RANK - computed ordinal column
  • DENSE_RANK - like RANK, but with no gaps in rank values (each peer group consumes a single rank)
  • PERCENT_RANK - percentile rank of a row defined as (RK-1)/(NR-1)
  • CUME_DIST - relative rank of a row defined as NP/NR
  • NTILE - bucket number for constant bucket size
  • ROW_NUMBER
  • LEAD - value from a later row in the window (forward offset)
  • LAG - value from an earlier row in the window (backward offset)
  • FIRST_VALUE
  • LAST_VALUE
  • NTH_VALUE
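To ground the frame clause and the rank formulas above, here is a rough Python sketch - `moving_sum` and `percent_rank` are made-up helper names mimicking a ROWS frame and the (RK-1)/(NR-1) formula:

```python
# Sketch of SUM(v) OVER (ORDER BY pos ROWS BETWEEN n PRECEDING AND m FOLLOWING).
def moving_sum(values, preceding=1, following=1):
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)          # frame start, clipped at the partition edge
        hi = min(len(values), i + following + 1)  # frame end (exclusive)
        out.append(sum(values[lo:hi]))
    return out

# Sketch of PERCENT_RANK = (RK - 1) / (NR - 1) over already-sorted values;
# tied rows share the rank of their first peer, as with RANK.
def percent_rank(sorted_values):
    nr = len(sorted_values)
    ranks = [sorted_values.index(v) + 1 for v in sorted_values]
    return [(rk - 1) / (nr - 1) for rk in ranks]

print(moving_sum([1, 2, 3, 4]))        # [3, 6, 9, 7]
print(percent_rank([10, 20, 20, 30]))
```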

Mathematical functions

  • ABS
  • SIGN
  • IS_INF, IS_NAN
  • IEEE_DIVIDE(X, Y) - if the result overflows, returns +/-inf; if Y=0 and X=0, returns NaN
  • RAND
  • SQRT
  • POW, POWER
  • EXP
  • LN, LOG, LOG10
  • GREATEST, LEAST
  • DIV
  • SAFE_DIVIDE(X, Y) - Returns NULL on error
  • MOD
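The two division variants differ only in how they handle errors; a small Python sketch of the contrast (helper names are mine, not an API):

```python
import math

# IEEE_DIVIDE returns +/-inf or NaN instead of raising a division error;
# SAFE_DIVIDE returns NULL (None here) instead.
def ieee_divide(x, y):
    if y == 0:
        if x == 0:
            return math.nan                     # 0/0 -> NaN
        return math.inf if x > 0 else -math.inf  # x/0 -> +/-inf
    return x / y

def safe_divide(x, y):
    return None if y == 0 else x / y

print(ieee_divide(1, 0))  # inf
print(safe_divide(1, 0))  # None
```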

Rounding Functions

  • ROUND
  • TRUNC
  • CEIL
  • CEILING
  • FLOOR

Trig Functions

  • COS, COSH, ACOS, ACOSH
  • SIN, SINH, ASIN, ASINH
  • TAN, TANH, ATAN, ATANH, ATAN2

String functions

  • BYTE_LENGTH
  • CHAR_LENGTH
  • CONCAT
  • ENDS_WITH
  • FORMAT
  • LENGTH
  • LOWER
  • LTRIM
  • REGEXP_CONTAINS
  • REGEXP_EXTRACT - returns the first substring of value that matches the regex (REGEXP_EXTRACT_ALL returns an array of all matches)
  • REGEXP_REPLACE
  • REPLACE
  • RTRIM
  • SAFE_CONVERT_BYTES_TO_STRING - Any invalid UTF-8 chars are replaced with Unicode replacement char, U+FFFD.
  • SPLIT
  • STARTS_WITH
  • STRPOS
  • SUBSTR
  • TRIM
  • UPPER
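A rough Python sketch of the REGEXP_* semantics using the stdlib `re` module (BigQuery actually uses re2 syntax, which agrees for simple patterns; the helper names mirror the SQL functions but are made up):

```python
import re

def regexp_contains(value, pattern):
    return re.search(pattern, value) is not None

def regexp_extract(value, pattern):
    m = re.search(pattern, value)
    return m.group(0) if m else None   # first match, or NULL

def regexp_extract_all(value, pattern):
    return re.findall(pattern, value)  # every match, as an ARRAY

print(regexp_contains("foo123bar", r"\d+"))        # True
print(regexp_extract("foo123bar456", r"\d+"))      # 123
print(regexp_extract_all("foo123bar456", r"\d+"))  # ['123', '456']
```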

Array functions

  • ARRAY_CONCAT
  • ARRAY_LENGTH
  • ARRAY_TO_STRING
  • GENERATE_ARRAY - range
  • GENERATE_DATE_ARRAY
  • OFFSET, SAFE_OFFSET - 0-based array element access; SAFE_OFFSET returns NULL for an out-of-range index
  • ORDINAL, SAFE_ORDINAL - 1-based array element access
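The 0-based vs 1-based distinction in a few lines of Python (hypothetical helpers mirroring the SQL accessors):

```python
# arr[OFFSET(i)] is 0-based, arr[ORDINAL(i)] is 1-based, and the SAFE_
# variants return NULL (None) instead of raising on out-of-range access.
def offset(arr, i):
    return arr[i]

def ordinal(arr, i):
    return arr[i - 1]

def safe_offset(arr, i):
    return arr[i] if 0 <= i < len(arr) else None

arr = ["a", "b", "c"]
print(offset(arr, 0), ordinal(arr, 1))  # a a -- the same element
print(safe_offset(arr, 9))              # None
```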

Date / Time functions

  • CURRENT_DATE
  • EXTRACT(part FROM date_expression)
  • DATE
  • DATE_ADD, DATE_SUB, DATE_DIFF
  • DATE_TRUNC
  • DATE_FROM_UNIX_DATE
  • FORMAT_DATE
  • PARSE_DATE
  • UNIX_DATE
  • DATETIME
  • DATETIME_ADD, DATETIME_SUB, DATETIME_DIFF
  • TIME
  • TIME_ADD, TIME_SUB, TIME_DIFF
  • TIME_TRUNC
  • CURRENT_TIMESTAMP
  • STRING
  • TIMESTAMP_ADD, TIMESTAMP_SUB, TIMESTAMP_DIFF
  • TIMESTAMP_TRUNC
  • FORMAT_TIMESTAMP
  • PARSE_TIMESTAMP
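A quick Python sketch of what a few of the date functions compute, using the stdlib `datetime` module as a stand-in:

```python
from datetime import date, timedelta

d = date(2016, 12, 15)
date_add = d + timedelta(days=7)         # DATE_ADD(d, INTERVAL 7 DAY)
date_diff = (date(2017, 1, 1) - d).days  # DATE_DIFF('2017-01-01', d, DAY)
formatted = d.strftime("%Y-%m-%d")       # FORMAT_DATE('%Y-%m-%d', d)

print(date_add)    # 2016-12-22
print(date_diff)   # 17
print(formatted)   # 2016-12-15
```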

Security functions

  • SESSION_USER

Misc DSL features

  • CASE ... WHEN ... THEN
  • COALESCE
  • IF
  • IFNULL
  • BETWEEN
  • LIKE
  • IN
  • IS
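COALESCE and IFNULL are easy to confuse; a tiny Python sketch of their behavior, with None standing in for SQL NULL (helper names are illustrative only):

```python
# COALESCE returns the first non-NULL argument (NULL if all are NULL);
# IFNULL is the two-argument special case.
def coalesce(*args):
    for a in args:
        if a is not None:
            return a
    return None

def ifnull(expr, default):
    return default if expr is None else expr

print(coalesce(None, None, "x", "y"))  # x
print(ifnull(None, 0))                 # 0
```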
Posted on Thursday, December 15, 2016 5:33 AM | OLAP / Data Mining, Cloud, Spark


Copyright © JoshReuben | Powered by: GeeksWithBlogs.net