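On further study, I found that the flink-ml-lib package is a higher-level wrapper built on top of the flink-ml-api package.
Below is a closer look at what this package contains.
The package is divided into four modules, which correspond to the four top-level headings of this article: common, operator, params and pipeline.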
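common
- linalg: mainly linear algebra
- mapper: some mappings; I do not yet know what they are used for
- model: some model sources; purpose unknown so far
- statistics: statistical algorithms; it currently contains only a multivariate Gaussian distribution
- utils: common utility classes
MLEnvironment
Purpose unknown so far.
MLEnvironmentFactory
Purpose unknown so far.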
linalg
- BLAS
  A utility class that provides BLAS routines over matrices and vectors.
  BLAS (Basic Linear Algebra Subprograms) is a library of basic linear algebra routines; it contains a large number of ready-made programs for linear algebra operations.
- DenseMatrix
  DenseMatrix stores dense matrix data and provides some methods to operate on the matrix it represents.
- DenseVector
  A dense vector represented by a values array.
- MatVecOp
  Utility class.
  A utility class that provides operations over DenseVector, SparseVector and DenseMatrix.
- SparseVector
  A sparse vector represented by an indices array and a values array.
- Vector
  Utility class containing the common methods of DenseVector and SparseVector.
  The Vector class defines some common methods for both DenseVector and SparseVector.
- VectorIterator
  Utility class for iterating over a Vector.
  An iterator over the elements of a vector.
- VectorUtil
  Utility class for Vector and its subclasses.
  Utility class for the operations on Vector and its subclasses.
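To make the two storage layouts described above concrete, here is a minimal sketch in plain Java. It deliberately does not use the flink-ml-lib classes: `VectorSketch` and its methods are hypothetical names chosen for illustration, and the only thing taken from the descriptions above is that a dense vector is a values array and a sparse vector is an indices array plus a values array.

```java
/** Hypothetical illustration of the two storage layouts; not the flink-ml-lib API. */
final class VectorSketch {
    private VectorSketch() {}

    /** Dot product of two dense vectors, each stored as a plain values array. */
    static double dot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vector sizes differ: " + x.length + " vs " + y.length);
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    /** Dot product of a sparse vector (indices + values) with a dense vector. */
    static double dot(int[] indices, double[] values, double[] dense) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        double sum = 0.0;
        // Only the stored (non-zero) entries contribute to the result.
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * dense[indices[k]];
        }
        return sum;
    }

    /** Expands a sparse vector into a dense values array of the given size. */
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] out = new double[size];
        for (int k = 0; k < indices.length; k++) {
            out[indices[k]] = values[k];
        }
        return out;
    }
}
```

The sparse dot product touches only the stored entries, which is the usual reason for keeping a separate sparse representation next to the dense one.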
mapper
- Mapper
  Abstract class for mappers. A mapper takes one row as input and transforms it into another row.
- MapperAdapter
  A class that helps adapt a Mapper to a MapFunction so that the mapper can run in Flink.
- ModelMapper
  An abstract class for Mappers with a model.
- ModelMapperAdapter
  A class that adapts a ModelMapper to a Flink RichMapFunction so the model can be loaded in a Flink job.
  This adapter class holds the target ModelMapper and its ModelSource.
  Upon open(), it will load model rows from the ModelSource into the ModelMapper.
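The adapter idea described above (hold a mapper and a model source, load the model rows on open(), then map row by row) can be sketched in plain Java. This is only an illustration under assumptions: `RowMapper`, `ModelRowMapper`, `Adapter` and `ScaleMapper` are hypothetical names, rows are modelled as `Object[]`, and the model source is modelled as a `Supplier`; none of this is the actual flink-ml-lib API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of the mapper/adapter pattern; not the flink-ml-lib API. */
final class MapperSketch {
    private MapperSketch() {}

    /** Stand-in for a mapper: one row in, one row out. */
    interface RowMapper {
        Object[] map(Object[] row);
    }

    /** Stand-in for a mapper that needs model rows before it can map. */
    interface ModelRowMapper extends RowMapper {
        void loadModel(List<Object[]> modelRows);
    }

    /** Holds the mapper and its model source, and loads the model on open(). */
    static final class Adapter {
        private final ModelRowMapper mapper;
        private final Supplier<List<Object[]>> modelSource;
        private boolean opened;

        Adapter(ModelRowMapper mapper, Supplier<List<Object[]>> modelSource) {
            this.mapper = mapper;
            this.modelSource = modelSource;
        }

        void open() {
            mapper.loadModel(modelSource.get());
            opened = true;
        }

        Object[] map(Object[] row) {
            if (!opened) {
                throw new IllegalStateException("open() must be called before map()");
            }
            return mapper.map(row);
        }
    }

    /** Example mapper: multiplies column 0 by a factor stored in the first model row. */
    static final class ScaleMapper implements ModelRowMapper {
        private double factor = 1.0;

        @Override
        public void loadModel(List<Object[]> modelRows) {
            factor = (Double) modelRows.get(0)[0];
        }

        @Override
        public Object[] map(Object[] row) {
            Object[] out = Arrays.copyOf(row, row.length);
            out[0] = ((Double) row[0]) * factor;
            return out;
        }
    }
}
```

Separating the mapper from the adapter keeps the row-level logic free of any runtime concerns, which is presumably why the package has both a Mapper and a MapperAdapter.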
model
- BroadcastVariableModelSource
  A ModelSource implementation that reads the model from the broadcast variable.
- ModelSource
  An interface that loads the model from different sources, e.g. broadcast variables, a list of rows, etc.
- RowsModelSource
  A ModelSource implementation that reads the model from the memory.
statistics - basicstatistic
MultivariateGaussian
Multivariate Gaussian distribution.
This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
utils
- DataSetConversionUtil
  Provide functions of conversions between DataSet and Table.
- DataStreamConversionUtil
  Provide functions of conversions between DataStream and Table.
- OutputColsHelper
/**
* Utils for merging input data with output data.
*
* <p>Input:
* 1) Schema of input data being predicted or transformed.
* 2) Output column names of the prediction/transformation operator.
* 3) Output column types of the prediction/transformation operator.
* 4) Reserved column names, which is a subset of input data's column names that we want to preserve.
*
* <p>Output:
* 1) The result data schema. The result data is a combination of the preserved columns and the operator's
* output columns.
*
* <p>Several rules are followed:
* <ul>
* <li>If reserved columns are not given, then all columns of input data are reserved.
* <li>The reserved columns are arranged ahead of the operator's output columns in the final output.
* <li>If some of the reserved column names overlap with those of operator's output columns, then the operator's
* output columns override the conflicting reserved columns.
* <li>The reserved columns in the result table preserve their orders as in the input table.
* </ul>
*
* <p>For example, if we have input data schema of ["id":INT, "f1":FLOAT, "f2":DOUBLE], and the operator outputs
* a column "label" with type STRING, and we want to preserve the column "id", then we get the result
* schema of ["id":INT, "label":STRING].
*
* <p>End users should not interact with this helper class directly; instead, it is used indirectly via concrete algorithms.
*/
- TableUtil
  Utility for operators to interact with Table contents, such as rows and columns.
- VectorTypes
  Built-in Vector types.
  Built-in vector types.
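The merging rules quoted in the OutputColsHelper comment above can be tried out with a small plain-Java sketch. `OutputColsSketch.resultSchema` is a hypothetical helper, not the real class: the schema is modelled as an insertion-ordered map from column name to a type name, and one point is my own reading of the rules, namely that a reserved column whose name conflicts with an output column is dropped from the reserved part and appears only in the output part.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the schema-merging rules; not the real OutputColsHelper. */
final class OutputColsSketch {
    private OutputColsSketch() {}

    /**
     * @param inputSchema   insertion-ordered map of input column name to type name
     * @param outputNames   output column names of the operator
     * @param outputTypes   output column types, aligned with outputNames
     * @param reservedNames input columns to keep; null means keep all input columns
     */
    static Map<String, String> resultSchema(
            Map<String, String> inputSchema,
            String[] outputNames,
            String[] outputTypes,
            String[] reservedNames) {
        Set<String> outputs = new LinkedHashSet<String>(Arrays.asList(outputNames));
        Set<String> reserved = reservedNames == null
                ? inputSchema.keySet()
                : new LinkedHashSet<String>(Arrays.asList(reservedNames));

        Map<String, String> result = new LinkedHashMap<String, String>();
        // Reserved columns come first and keep their order from the input table.
        for (Map.Entry<String, String> col : inputSchema.entrySet()) {
            boolean overridden = outputs.contains(col.getKey()); // assumption: conflicts are dropped here
            if (reserved.contains(col.getKey()) && !overridden) {
                result.put(col.getKey(), col.getValue());
            }
        }
        // The operator's output columns follow the reserved columns.
        for (int i = 0; i < outputNames.length; i++) {
            result.put(outputNames[i], outputTypes[i]);
        }
        return result;
    }
}
```

With the input schema from the comment (id:INT, f1:FLOAT, f2:DOUBLE), an output column label:STRING and the reserved column id, this sketch yields id:INT followed by label:STRING, which matches the example in the comment.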
MLEnvironment
The MLEnvironment stores the necessary context in Flink.
Each MLEnvironment will be associated with a unique ID.
The operations associated with the same MLEnvironment ID will share the same Flink job context.
MLEnvironmentFactory
Factory to get the MLEnvironment using an MLEnvironmentId.
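The relationship described above (each environment has a unique ID, and operations that carry the same ID share the same context) is essentially a registry lookup. The following plain-Java sketch shows that idea only; `EnvRegistrySketch`, `Env`, `register`, `get` and `remove` are hypothetical names and should not be read as the methods of the real MLEnvironmentFactory.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an ID-keyed environment registry; not the real factory. */
final class EnvRegistrySketch {
    private EnvRegistrySketch() {}

    /** Stand-in for the shared job context held by an environment. */
    static final class Env {
        final String name;

        Env(String name) {
            this.name = name;
        }
    }

    private static final Map<Long, Env> ENVS = new HashMap<Long, Env>();
    private static long nextId = 1L;

    /** Registers an environment and returns the unique ID assigned to it. */
    static synchronized long register(Env env) {
        long id = nextId++;
        ENVS.put(id, env);
        return id;
    }

    /** Looks up the environment for an ID; every caller with the same ID gets the same object. */
    static synchronized Env get(long id) {
        Env env = ENVS.get(id);
        if (env == null) {
            throw new IllegalArgumentException("No environment registered for id " + id);
        }
        return env;
    }

    /** Removes the environment for an ID and returns it, or null if none was registered. */
    static synchronized Env remove(long id) {
        return ENVS.remove(id);
    }
}
```

An operator that only stores the ID (see HasMLEnvironmentId in the params module) can then resolve its context lazily through such a lookup.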
operator
- TableSourceBatchOp
  Transform the Table to SourceBatchOp.
- BatchOperator
  Base class of batch algorithm operators.
- TableSourceStreamOp
  Transform the Table to SourceStreamOp.
- StreamOperator
  Base class of stream algorithm operators.
- AlgoOperator
  Base class for algorithm operators.
params
HasOutputCol
- An interface for classes with a parameter specifying the name of the output column.
HasOutputColDefaultAsNull
- An interface for classes with a parameter specifying name of the output column with a null default value.
HasOutputCols
- An interface for classes with a parameter specifying names of multiple output columns.
HasOutputColsDefaultAsNull
- An interface for classes with a parameter specifying names of multiple output columns. The default parameter value is null.
HasPredictionCol
- An interface for classes with a parameter specifying the column name of the prediction.
HasPredictionDetailCol
- An interface for classes with a parameter specifying the column name of prediction detail.
HasReservedCols
- An interface for classes with a parameter specifying the names of the columns to be retained in the output table.
HasSelectedCol
- An interface for classes with a parameter specifying the name of the table column.
HasSelectedColDefaultAsNull
- An interface for classes with a parameter specifying the name of the table column with null default value.
HasSelectedCols
- An interface for classes with a parameter specifying the names of multiple table columns.
HasSelectedColsDefaultAsNull
- An interface for classes with a parameter specifying the names of multiple table columns with null default value.
HasMLEnvironmentId
- An interface for classes with a parameter specifying the id of MLEnvironment.
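The Has* interfaces listed above all follow one pattern: each interface contributes a getter and a setter for a single parameter, and an operator picks up parameters by implementing the interfaces it needs. Below is a minimal plain-Java sketch of that mixin pattern. Everything in it is hypothetical (`WithParams`, the map-based storage, the key strings, the `Op` class), and it assumes that the plain variant treats the parameter as required while the DefaultAsNull variant returns null when nothing was set.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of parameter mixin interfaces; not the flink-ml-lib API. */
final class ParamsSketch {
    private ParamsSketch() {}

    /** Stand-in for the object that stores parameter values. */
    interface WithParams<T> {
        Map<String, Object> params();
    }

    /** Required parameter (assumption): reading it before it is set is an error. */
    interface HasSelectedCol<T> extends WithParams<T> {
        default String getSelectedCol() {
            Object value = params().get("selectedCol");
            if (value == null) {
                throw new IllegalStateException("selectedCol is required but not set");
            }
            return (String) value;
        }

        @SuppressWarnings("unchecked")
        default T setSelectedCol(String value) {
            params().put("selectedCol", value);
            return (T) this;
        }
    }

    /** Optional parameter: returns null when it was never set. */
    interface HasOutputColDefaultAsNull<T> extends WithParams<T> {
        default String getOutputCol() {
            return (String) params().get("outputCol");
        }

        @SuppressWarnings("unchecked")
        default T setOutputCol(String value) {
            params().put("outputCol", value);
            return (T) this;
        }
    }

    /** Example operator that mixes in both parameters. */
    static final class Op implements HasSelectedCol<Op>, HasOutputColDefaultAsNull<Op> {
        private final Map<String, Object> params = new HashMap<String, Object>();

        @Override
        public Map<String, Object> params() {
            return params;
        }
    }
}
```

Because each setter returns the operator itself, calls can be chained, for example `new ParamsSketch.Op().setSelectedCol("f1").setOutputCol("pred")`.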
pipeline
- EstimatorBase
  The base class for estimator implementations.
- ModelBase
  The base class for a machine learning model.
- PipelineStageBase
  The base class for a stage in a pipeline, either an EstimatorBase or a TransformerBase.
- TransformerBase
  The base class for transformer implementations.
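How the four base classes above relate is easier to see in a toy example. The sketch below is plain Java with hypothetical names (`Stage`, `Transformer`, `Estimator`, `MeanCenterer`, `Scaler`) and works on `double[]` instead of tables. It assumes the usual pipeline convention that fitting an estimator produces a model which is itself a transformer; it does not use the flink-ml-lib classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of estimator/transformer stages; not the flink-ml-lib API. */
final class PipelineSketch {
    private PipelineSketch() {}

    /** A pipeline stage is either an estimator or a transformer. */
    interface Stage {}

    interface Transformer extends Stage {
        double[] transform(double[] data);
    }

    interface Estimator extends Stage {
        /** Learns from the data and returns a fitted model, which is a transformer. */
        Transformer fit(double[] data);
    }

    /** Estimator: learns the mean; the fitted model subtracts it. */
    static final class MeanCenterer implements Estimator {
        @Override
        public Transformer fit(double[] data) {
            double sum = 0.0;
            for (double v : data) {
                sum += v;
            }
            final double mean = data.length == 0 ? 0.0 : sum / data.length;
            return input -> {
                double[] out = new double[input.length];
                for (int i = 0; i < input.length; i++) {
                    out[i] = input[i] - mean;
                }
                return out;
            };
        }
    }

    /** Plain transformer: nothing to learn, just scales by a fixed factor. */
    static final class Scaler implements Transformer {
        private final double factor;

        Scaler(double factor) {
            this.factor = factor;
        }

        @Override
        public double[] transform(double[] data) {
            double[] out = new double[data.length];
            for (int i = 0; i < data.length; i++) {
                out[i] = data[i] * factor;
            }
            return out;
        }
    }

    /** Fits the stages in order, feeding each stage the output of the previous one. */
    static List<Transformer> fit(List<Stage> stages, double[] data) {
        List<Transformer> fitted = new ArrayList<Transformer>();
        double[] current = data;
        for (Stage stage : stages) {
            Transformer t;
            if (stage instanceof Estimator) {
                t = ((Estimator) stage).fit(current);
            } else if (stage instanceof Transformer) {
                t = (Transformer) stage;
            } else {
                throw new IllegalArgumentException("Unknown stage type: " + stage);
            }
            current = t.transform(current);
            fitted.add(t);
        }
        return fitted;
    }

    /** Applies the fitted transformers in order. */
    static double[] transform(List<Transformer> fitted, double[] data) {
        double[] current = data;
        for (Transformer t : fitted) {
            current = t.transform(current);
        }
        return current;
    }
}
```

After fitting, only transformers remain, so the fitted pipeline can be applied to new data without access to the training data.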
The next post will analyze the source code and its usage in detail.