Geeks With Blogs
Josh Reuben

Spark.ml Pipelines QuickRef

When constructing Spark Machine Learning Pipelines, I find it really helpful to maintain a bird's-eye view of the various transformers and estimators available.

In a nutshell: fit on trainingData (train a model), then transform testData (predict with the model).

  • Transformer: DataFrame => DataFrame
  • Estimator: DataFrame => Transformer
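
To make the fit / transform contract concrete, here is a rough sketch in plain Python rather than Spark itself. A list of dicts stands in for a DataFrame, and `MeanCenterer` / `MeanCentererModel` are made-up names for illustration, not Spark classes.

```python
# Toy illustration of the Estimator / Transformer contract.
# Assumption: a list of dicts stands in for a Spark DataFrame;
# MeanCenterer and MeanCentererModel are hypothetical names, not Spark classes.


class MeanCentererModel:
    """Transformer: rows => rows, using a mean learned earlier."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with an extra column; the input rows are not modified.
        return [dict(row, **{self.output_col: row[self.input_col] - self.mean})
                for row in rows]


class MeanCenterer:
    """Estimator: rows => Transformer (the fitted model)."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        # Learn a statistic from the training rows and bake it into a model.
        values = [row[self.input_col] for row in rows]
        mean = sum(values) / len(values)
        return MeanCentererModel(self.input_col, self.output_col, mean)


training_data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
test_data = [{"x": 10.0}]

model = MeanCenterer("x", "x_centered").fit(training_data)  # fit on training data
predictions = model.transform(test_data)                    # transform unseen data
```

The point of the split is that whatever is learned in `fit` (here just a mean) is frozen inside the model, so the test data is transformed with statistics from the training data only.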

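Transformers:

Strictly speaking, PCA, StringIndexer, VectorIndexer, StandardScaler, MinMaxScaler, QuantileDiscretizer, RFormula and ChiSqSelector are Estimators (fit first, then transform with the resulting model); they sit in this list because their job is feature transformation.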

  • Tokenizer: sentence => words
  • RegexTokenizer: sentence => words - setPattern
  • HashingTF: terms => fixed-length term-frequency vectors via feature hashing - setNumFeatures
  • StopWordsRemover: filter stop words out of a token sequence - setStopWords
  • NGram: tokens => sequences of n consecutive tokens - setN
  • Binarizer: number => 0/1 by threshold - setThreshold
  • PCA: statistical dimensionality reduction (projects features onto the top k principal components, which are uncorrelated) - setK
  • PolynomialExpansion: feature set dimensionality expansion (~ Taylor Series) - setDegree
  • DCT: time-domain sequence => frequency-domain coefficients (discrete cosine transform) - setInverse
  • StringIndexer: strings => indices ordered by frequency (most frequent label gets 0)
  • IndexToString: inverse of StringIndexer (indices => original strings)
  • OneHotEncoder: category index => one-hot binary vector
  • VectorIndexer: automatically detect and index categorical features in a vector column - setMaxCategories
  • Normalizer: scale each feature vector to unit p-norm - setP
  • StandardScaler: features to z-scores - setWithStd, setWithMean
  • MinMaxScaler: rescale each feature to a range, [0, 1] by default - setMin, setMax
  • Bucketizer: continuous to discrete - setSplits
  • ElementwiseProduct: apply weights to vector features - setScalingVec
  • SQLTransformer: SQL over the feature set! - setStatement
  • VectorAssembler: combine multi-columns into a single vector column
  • QuantileDiscretizer: continuous to discrete, with bucket boundaries chosen from quantiles - setNumBuckets
  • VectorSlicer: select subset of featureset - setIndices, setNames
  • RFormula: specify the label (dependent) and feature (independent) columns with an R-style formula - setFormula("y ~ x1 + x2"), setFeaturesCol, setLabelCol
  • ChiSqSelector: select the categorical features most strongly associated with the label (chi-squared test) - setNumTopFeatures, setFeaturesCol, setLabelCol
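
As a rough illustration of how the text-oriented entries at the top of this list chain together, here is a plain-Python sketch (not Spark code) of Tokenizer => StopWordsRemover => NGram => HashingTF. The function names, the tiny stop-word list and the use of CRC32 as the hash are all my own assumptions for the example; Spark uses its own stop-word list and hash function.

```python
import zlib

# Assumption: a tiny hypothetical stop-word list, not the one Spark ships with.
STOP_WORDS = {"a", "an", "the", "is", "of"}


def tokenize(sentence):
    # Like Tokenizer: lowercase the sentence and split on whitespace.
    return sentence.lower().split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Like StopWordsRemover: drop tokens that appear in the stop list.
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n=2):
    # Like NGram: each output string is n consecutive tokens joined by a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hashing_tf(terms, num_features=16):
    # Like HashingTF: hash each term to a bucket and count occurrences.
    # Assumption: CRC32 is a stand-in hash chosen because it is deterministic.
    vec = [0] * num_features
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] += 1
    return vec


tokens = tokenize("The quick brown fox is a friend of the lazy dog")
filtered = remove_stop_words(tokens)
bigrams = ngrams(filtered, n=2)
features = hashing_tf(filtered, num_features=16)
```

Because terms are hashed into a fixed number of buckets, two different terms can collide in the same slot; a larger `num_features` makes that less likely at the cost of a wider vector.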

Estimators:

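  • IDF: down-weights terms that appear in many documents - setMinDocFreq
  • Word2Vec: words => embedding vectors (a document becomes the average of its word vectors) - setVectorSize, setMinCount
  • CountVectorizer: document => vector of token counts over a learned vocabulary - setVocabSize, setMinDF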
  • LogisticRegression - setMaxIter, setRegParam, setElasticNetParam, setTol, setFitIntercept
  • DecisionTreeClassifier
  • RandomForestClassifier - setNumTrees
  • GBTClassifier - setMaxIter
  • MultilayerPerceptronClassifier - setLayers, setBlockSize, setSeed, setMaxIter
  • OneVsRest - setClassifier
  • DecisionTreeRegressor
  • RandomForestRegressor
  • GBTRegressor
  • AFTSurvivalRegression - setQuantileProbabilities, setQuantilesCol
  • KMeans - setK
  • LDA - setK, setMaxIter
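
To show what "fitting" means for the first few entries, here is a plain-Python sketch (not Spark code) of a CountVectorizer-style vocabulary fit followed by an IDF fit. The helper names are hypothetical, ties in term frequency are broken alphabetically as my own assumption, and the IDF weight uses the smoothed form log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term.

```python
import math
from collections import Counter


def fit_count_vectorizer(docs, vocab_size):
    # "fit": the vocabulary is the vocab_size most frequent terms in the corpus.
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    totals = Counter(term for doc in docs for term in doc)
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    vocab = ranked[:vocab_size]
    index = {term: i for i, term in enumerate(vocab)}

    def transform(doc):
        # "transform": count occurrences of each vocabulary term in one document.
        vec = [0] * len(vocab)
        for term in doc:
            if term in index:
                vec[index[term]] += 1
        return vec

    return vocab, transform


def fit_idf(count_vectors):
    # "fit": compute one weight per term from its document frequency.
    m = len(count_vectors)
    num_terms = len(count_vectors[0])
    doc_freq = [sum(1 for vec in count_vectors if vec[j] > 0)
                for j in range(num_terms)]
    idf = [math.log((m + 1) / (df + 1)) for df in doc_freq]

    def transform(vec):
        # "transform": scale each term count by its learned weight.
        return [tf * w for tf, w in zip(vec, idf)]

    return idf, transform


docs = [["spark", "ml", "pipeline"],
        ["spark", "sql"],
        ["spark", "ml"]]

vocab, to_counts = fit_count_vectorizer(docs, vocab_size=3)
count_vectors = [to_counts(doc) for doc in docs]
idf, to_tfidf = fit_idf(count_vectors)
tfidf_vectors = [to_tfidf(vec) for vec in count_vectors]
```

In this toy corpus "spark" appears in every document, so its weight drops to zero, while the rarer "pipeline" gets the largest weight.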

Models:

  • CountVectorizerModel
  • LogisticRegressionModel - coefficients, intercept, setThreshold, summary
  • DecisionTreeClassificationModel
  • RandomForestClassificationModel
  • GBTClassificationModel
  • DecisionTreeRegressionModel
  • RandomForestRegressionModel
  • GBTRegressionModel
  • LDAModel - logLikelihood, logPerplexity
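
As a sketch of what the fitted parameters on a binary LogisticRegressionModel mean, the plain-Python snippet below (not Spark code) turns coefficients, an intercept and a threshold into a prediction. The function names and the numbers are made up for illustration.

```python
import math


def predict_probability(coefficients, intercept, features):
    # Linear margin w . x + b pushed through the logistic (sigmoid) function.
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))


def predict(coefficients, intercept, features, threshold=0.5):
    # Predict class 1 only when the probability is strictly above the threshold.
    probability = predict_probability(coefficients, intercept, features)
    return 1.0 if probability > threshold else 0.0


# Hypothetical fitted parameters, chosen only for the example.
coefficients = [2.0, -1.0]
intercept = 0.5

probability = predict_probability(coefficients, intercept, [1.0, 0.5])
default_prediction = predict(coefficients, intercept, [1.0, 0.5])
strict_prediction = predict(coefficients, intercept, [1.0, 0.5], threshold=0.9)
```

Raising the threshold trades recall for precision: the same row flips from class 1 to class 0 once the required probability exceeds what the model assigns to it.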

Evaluators:

  • BinaryLogisticRegressionSummary - fMeasureByThreshold, areaUnderROC, roc
  • BinaryClassificationEvaluator - default metric name: "areaUnderROC"
  • MulticlassClassificationEvaluator - default metric name: "f1"
  • MulticlassMetrics (from spark.mllib) - confusionMatrix, falsePositiveRate
  • RegressionEvaluator - default metric name: "rmse"
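
For reference, two of the default metrics above can be sketched in a few lines of plain Python (not Spark code). The function names are my own, and the area under ROC is computed here as the fraction of (positive, negative) pairs ranked correctly, which is a simple stand-in and may differ in detail from how Spark builds its ROC curve.

```python
import math


def rmse(labels, predictions):
    # Root mean squared error: sqrt of the average squared residual.
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def area_under_roc(labels, scores):
    # Probability that a random positive scores higher than a random negative;
    # ties count as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


regression_error = rmse([3.0, 5.0], [1.0, 5.0])
auc = area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Note the direction of each metric when comparing models: lower is better for rmse, higher is better for areaUnderROC.
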
Posted on Tuesday, March 22, 2016 7:05 AM

