Graph Feature Preprocessor
The GraphFeaturePreprocessor is a scikit-learn compatible preprocessor that enables scalable and real-time feature extraction from graph-structured data. It provides utilities for creating and updating in-memory graphs as well as extracting new features from these graphs. The basic input and output data representation used by GraphFeaturePreprocessor is an edge list with edge features. As an example, in a financial transaction graph, the graph nodes represent accounts and the graph edges represent transactions. The date of the transactions as well as the amount transferred between the accounts may be provided as edge features. However, the GraphFeaturePreprocessor can generate a much richer set of edge features by analyzing the graph topology and extracting meaningful substructures. For instance, the GraphFeaturePreprocessor can extract the number of cycles each transaction is part of in the graph. Such additional features can significantly increase the predictive power of machine learning models in practical settings, e.g., when monitoring suspicious transactions and account activities in financial transaction graphs.
The current version of the GraphFeaturePreprocessor is specialised to operate on temporal graphs. More specifically, the graph edges must have timestamps. Furthermore, each edge must have a unique id. The edge list must be provided as a two dimensional numpy array of floating point numbers. The first column must be Edge id, the second Source vertex id, the third Target vertex id, and the fourth Timestamp. Additional columns can be used to store more edge features. An example edge list is shown in the below table.
Edge id |
Source vertex id |
Target vertex id |
Timestamp |
Other Feature |
Other Feature |
---|---|---|---|---|---|
1000 |
1 |
2 |
10 |
||
3000 |
3 |
4 |
20 |
||
2000 |
1 |
4 |
20 |
||
4000 |
3 |
2 |
30 |
The edge list must be sorted in the ascending order of the timestamps for best performance. An in-memory directed multi-graph representation will be created internally when an edge list is provided as input to the fit function. The transform function, on the other hand, receives an edge list as input, inserts these edges into the in-memory graph, and extracts new features from the graph, which are appended to the input edge list as additional columns to form the output of this function. The output edge list is again a two dimensional numpy array of floating point numbers, however, with additional columns to store the new features.
- class snapml.GraphFeaturePreprocessor
Graph Feature Preprocessor
This class implements a preprocessor for real-time extraction of graph-based features.
There are two types of graph-based features that this preprocessor computes:
1) graph patterns - for each edge in the input edge list, this preprocessor searches for graph patterns in which that edge participates. The graph patterns that can be used are: fan in/out, degree in/out, scatter-gather, temporal cycle, and length-constrained simple cycle. For each edge and each pattern type, this preprocessor computes a pattern histogram that contains the number of patterns of a given size.
2) graph vertex statistics - for each vertex in the input edge list, this preprocessor computes various statistical properties based on the selected raw features of the outgoing or incoming edges. The statistical properties that can be computed are: number of neighboring vertices (fan), number of incident edges (degree), fan/degree ratio, as well as average, sum, minimum, maximum, median, var, skew, and kurtosis of the selected raw features.
To generate graph-based features, this preprocessor maintains an in-memory graph. Whenever
fit
,partial_fit
, ortransform
is called with an edge list as input, the edges in this list are inserted into the in-memory graph in the order they are provided. Edges that already exist in the in-memory graph are ignored during the insertion. Each edge insertion can also lead to the removal of some existing edges of the graph. More specifically, if the timestamp of an edge being inserted is ts_now, the edges with timestamps smaller than or equal to ts_now -time_window
will be removed. Note thattime_window
can be set by callingset_params
. However, this parameter can be overwritten if a graph pattern requires an even larger time window, i.e., if<pattern-name>_tw
is larger thantime_window
for a pattern that is enabled. Lastly, the edge list does not have to be sorted. However, the pattern detection is most effective when the edges in the list are sorted in the increasing order of their timestamps.- Attributes
- paramsdict
Parameters of this graph preprocessor.
These parameters are used to define which graph pattern and graph vertex statistics features are going to be computed in
transform()
. Valid parameter keys and their default values can be listed withget_params()
and can be modified usingset_params()
. The parameters inparams
are the following:- num_threadsint, default=12
Number of threads used in the computation.
- time_windowint, default=10*3600
Default time window value used if no graph pattern has been enabled. The unit of time is the same as the one used for timestamps of edges.
- vertex_statsbool, default=True
Enable generation of features based on vertex statistics.
- vertex_stats_colsarray-like of int, default: [3]
Columns of the input numpy array used for generating vertex statistics features.
- vertex_stats_featsarray-like of int, default: [0, 1, 2, 3, 4, 8, 9, 10]
Array indicating which statistical properties are computed for each vertex. The mapping between the values of this array and the statistical properties is: 0:fan, 1:degree, 2:ratio, 3:average, 4:sum, 5:minimum, 6:maximum, 7:median, 8:var, 9:skew, 10:kurtosis.
In the following parameters, <pattern-name> denotes one of the graph pattern names: fan, degree, scatter-gather, temp-cycle, and lc-cycle. These graph pattern names correspond to fan in/out, degree in/out, scatter-gather, temporal cycle, and length-constrained simple cycle patterns, respectively.
- <pattern-name>bool
Enable generation of graph pattern features based on <pattern-name> pattern.
- <pattern-name>_twint
Time window used for computing the <pattern-name> patterns. The unit of time is the same as the one used for timestamps of edges.Increasing the time window enables finding more patters, but it also makes the problem more time consuming.
- <pattern-name>_binsarray-like of int
Array used for specifying the bins of the pattern histogram for <pattern-name> pattern. Bin i in that histogram contains the number of patterns <pattern-name> of size S, where bin[i] <= S < bin[i+1]. The last bin contains the patterns of size greater than or equal to bin[i].
- lc-cycle_lenint, default=10
Length constraint used when searching for length-constrained simple cycles. Increasing the value of this parameter enables finding longer cycles, but it also makes the problem more time consuming.
- fit(features)
Create the in-memory graph using the edges from the input edge list
features
.This function clears the existing in-memory graph before creating a new graph using the edges from the input edge list
features
.- Parameters
- featuresarray-like of float, shape = (n_edges, n_raw_features)
Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].
- fit_transform(features_in)
Generate graph-based features for each edge in the input edge list
features_in
.This function is equivalent to performing
fit
followed bytransform
using the same input edge listfeatures_in
.- Parameters
- features_inarray-like of float, shape = (n_edges, n_raw_features)
Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].
- Returns
- features_out: array-like of float, shape = (n_edges, n_raw_features + n_eng_features)
Input edge list with additional n_eng_features graph-based features per edge. Each edge of this edge list has the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>, <graph-based features>].
- get_params()
Get the parameters of this graph preprocessor.
- Returns
- paramsdict
- partial_fit(features)
Update the in-memory graph with the edges from the input edge list
features
.This function inserts the edges from the input edge list
features
into the in-memory graph.- Parameters
- featuresarray-like of float, shape = (n_edges, n_raw_features)
Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].
- set_params(params)
Set the parameters of this graph preprocessor.
Valid parameter keys can be listed with
get_params()
.Invoking this function clears the existing in-memory graph.
- Returns
- paramsdict
- transform(features_in)
Generate graph-based features for each edge in the input edge list
features_in
.This function inserts the edges from the input edge list
features_in
into the in-memory graph of this preprocessor and computes the graph-based features using the updated graph. The input edge list is updated with the graph-based features and is returned as the output.The computed graph-based features have the following format: <graph-based features> = <graph-pattern features> <vertex statistics features>
The graph-pattern features <graph-pattern features> are created by concatenating the calculated pattern histograms for the patterns in the following order: fan-in, fan-out, degree-in, degree-out, scatter-gather, temporal cycle, length-constrained simple cycles. Only the histograms of the graph patterns that are enabled in
params
are present in <graph-pattern features>.The graph vertex statistics features <vertex statistics features> contain the features for the source and the target vertex of an edge. These vertex features are calculated separately for outgoing and incoming edges of each vertex: <vertex statistics features> = <source vertex features - outgoing edges> <source vertex features - incoming edges>
<target vertex features - outgoing edges> <target vertex features - incoming edges>
Source/target vertex features for both outgoing and incoming edges have the following format: <fan> <degree> <ratio> <average,sum,…,kurtosis for c_0>…<average,sum,…,kurtosis for c_k> where c_0,…,c_k are raw feature columns specified in
vertex_stats_cols
ofparams
. The vertex statistics features other than fan, degree, and ratio are calculated for each raw feature column fromvertex_stats_cols
. Only the vertex statistics features that are selected invertex_stats_feats
ofparams
are present in <vertex statistics features>.- Parameters
- features_inarray-like of float, shape = (n_edges, n_raw_features)
Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].
- Returns
- features_out: array-like of float, shape = (n_edges, n_raw_features + n_eng_features)
Input edge list with additional n_eng_features graph-based features per edge. Each edge of this edge list has the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>, <graph-based features>].