Graph Feature Preprocessor

The GraphFeaturePreprocessor is a scikit-learn compatible preprocessor that enables scalable and real-time feature extraction from graph-structured data. It provides utilities for creating and updating in-memory graphs as well as extracting new features from these graphs. The basic input and output data representation used by GraphFeaturePreprocessor is an edge list with edge features. As an example, in a financial transaction graph, the graph nodes represent accounts and the graph edges represent transactions. The date of the transactions as well as the amount transferred between the accounts may be provided as edge features. However, the GraphFeaturePreprocessor can generate a much richer set of edge features by analyzing the graph topology and extracting meaningful substructures. For instance, the GraphFeaturePreprocessor can extract the number of cycles each transaction is part of in the graph. Such additional features can significantly increase the predictive power of machine learning models in practical settings, e.g., when monitoring suspicious transactions and account activities in financial transaction graphs.

The current version of the GraphFeaturePreprocessor is specialised to operate on temporal graphs. More specifically, the graph edges must have timestamps. Furthermore, each edge must have a unique id. The edge list must be provided as a two dimensional numpy array of floating point numbers. The first column must be Edge id, the second Source vertex id, the third Target vertex id, and the fourth Timestamp. Additional columns can be used to store more edge features. An example edge list is shown in the below table.

Edge id	Source vertex id	Target vertex id	Timestamp
1000	1	2	10
3000	3	4	20
2000	1	4	20
4000	3	2	30

The edge list must be sorted in the ascending order of the timestamps for best performance. An in-memory directed multi-graph representation will be created internally when an edge list is provided as input to the fit function. The transform function, on the other hand, receives an edge list as input, inserts these edges into the in-memory graph, and extracts new features from the graph, which are appended to the input edge list as additional columns to form the output of this function. The output edge list is again a two dimensional numpy array of floating point numbers, however, with additional columns to store the new features.

Note: The Graph Feature Preprocessor is currently not supported on Windows platforms.

class snapml.GraphFeaturePreprocessor

Graph Feature Preprocessor

This class implements a preprocessor for real-time extraction of graph-based features.

There are two types of graph-based features that this preprocessor computes:

1) graph patterns - for each edge in the input edge list, this preprocessor searches for graph patterns in which that edge participates. The graph patterns that can be used are: fan in/out, degree in/out, scatter-gather, temporal cycle, and length-constrained simple cycle. For each edge and each pattern type, this preprocessor computes a pattern histogram that contains the number of patterns of a given size.

2) graph vertex statistics - for each vertex in the input edge list, this preprocessor computes various statistical properties based on the selected raw features of the outgoing or incoming edges. The statistical properties that can be computed are: number of neighboring vertices (fan), number of incident edges (degree), fan/degree ratio, as well as average, sum, minimum, maximum, median, var, skew, and kurtosis of the selected raw features.

To generate graph-based features, this preprocessor maintains an in-memory graph. Whenever fit, partial_fit, or transform is called with an edge list as input, the edges in this list are inserted into the in-memory graph in the order they are provided. Edges that already exist in the in-memory graph are ignored during the insertion. Each edge insertion can also lead to the removal of some existing edges of the graph. More specifically, if the timestamp of an edge being inserted is ts_now, the edges with timestamps smaller than or equal to ts_now - time_window will be removed. Note that time_window can be set by calling set_params. In addition, if max_no_edges is specified and it is a positive value, only max_no_edges most recently added edges are kept in the dynamic graph. Lastly, the edge list does not have to be sorted. However, the pattern detection is most effective when the edges in the list are sorted in the increasing order of their timestamps.

Attributes:

paramsdict

Parameters of this graph preprocessor.

These parameters are used to define which graph pattern and graph vertex statistics features are going to be computed in transform(). Valid parameter keys and their default values can be listed with get_params() and can be modified using set_params(). The parameters in params are the following:

num_threadsint, default=12: Number of threads used in the computation.
time_windowint, default=-1: Time window size of the dynamic graph. If this parameter is set to a negative value, it is overwritten with the largest time window value <graph-feat>_tw defined for a graph feature <graph-feat>. The unit of time is the same as the one used for timestamps of edges.
max_no_edgesint, default=-1: Maximum number of edges that can exist in the dynamic graph. If this parameter is set to a negative value, the number of edges in the dynamic graph is defined only using the time window.
vertex_statsbool, default=True: Enable generation of features based on vertex statistics.
vertex_stats_twint: Time window used for computing the vertex statistics. The unit of time is the same as the one used for timestamps of edges.
vertex_stats_colsarray-like of int, default: [3]: Columns of the input numpy array used for generating vertex statistics features.
vertex_stats_featsarray-like of int, default: [0, 1, 2, 3, 4, 8, 9, 10]: Array indicating which statistical properties are computed for each vertex. The mapping between the values of this array and the statistical properties is: 0:fan, 1:degree, 2:ratio, 3:average, 4:sum, 5:minimum, 6:maximum, 7:median, 8:var, 9:skew, 10:kurtosis.

In the following parameters, <pattern-name> denotes one of the graph pattern names: fan, degree, scatter-gather, temp-cycle, and lc-cycle. These graph pattern names correspond to fan in/out, degree in/out, scatter-gather, temporal cycle, and length-constrained simple cycle patterns, respectively.

<pattern-name>bool: Enable generation of graph pattern features based on <pattern-name> pattern.
<pattern-name>_twint: Time window used for computing the <pattern-name> patterns. The unit of time is the same as the one used for timestamps of edges.Increasing the time window enables finding more patters, but it also makes the problem more time consuming.
<pattern-name>_binsarray-like of int: Array used for specifying the bins of the pattern histogram for <pattern-name> pattern. Bin i in that histogram contains the number of patterns <pattern-name> of size S, where bin[i] <= S < bin[i+1]. The last bin contains the patterns of size greater than or equal to bin[i].
lc-cycle_lenint, default=10: Length constraint used when searching for length-constrained simple cycles. Increasing the value of this parameter enables finding longer cycles, but it also makes the problem more time consuming.

fit(features)

Create the in-memory graph using the edges from the input edge list features.

This function clears the existing in-memory graph before creating a new graph using the edges from the input edge list features.

Parameters:

featuresarray-like of float, shape = (n_edges, n_raw_features): Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].

fit_transform(features_in)

Generate graph-based features for each edge in the input edge list features_in.

This function is equivalent to performing fit followed by transform using the same input edge list features_in.

Parameters:

features_inarray-like of float, shape = (n_edges, n_raw_features): Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].

Returns:

features_out: array-like of float, shape = (n_edges, n_raw_features + n_eng_features): Input edge list with additional n_eng_features graph-based features per edge. Each edge of this edge list has the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>, <graph-based features>].

get_params()

Get the parameters of this graph preprocessor.

Returns:

paramsdict

partial_fit(features)

Update the in-memory graph with the edges from the input edge list features.

This function inserts the edges from the input edge list features into the in-memory graph.

Parameters:

featuresarray-like of float, shape = (n_edges, n_raw_features): Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].

set_params(params)

Set the parameters of this graph preprocessor.

Valid parameter keys can be listed with get_params().

Invoking this function clears the existing in-memory graph.

Returns:

paramsdict

transform(features_in)

Generate graph-based features for each edge in the input edge list features_in.

This function inserts the edges from the input edge list features_in into the in-memory graph of this preprocessor and computes the graph-based features using the updated graph. The input edge list is updated with the graph-based features and is returned as the output.

The computed graph-based features have the following format: <graph-based features> = <graph-pattern features> <vertex statistics features>

The graph-pattern features <graph-pattern features> are created by concatenating the calculated pattern histograms for the patterns in the following order: fan-in, fan-out, degree-in, degree-out, scatter-gather, temporal cycle, length-constrained simple cycles. Only the histograms of the graph patterns that are enabled in params are present in <graph-pattern features>.

The graph vertex statistics features <vertex statistics features> contain the features for the source and the target vertex of an edge. These vertex features are calculated separately for outgoing and incoming edges of each vertex: <vertex statistics features> = <source vertex features - outgoing edges> <source vertex features - incoming edges>

<target vertex features - outgoing edges> <target vertex features - incoming edges>

Source/target vertex features for both outgoing and incoming edges have the following format: <fan> <degree> <ratio> <average,sum,…,kurtosis for c_0>…<average,sum,…,kurtosis for c_k> where c_0,…,c_k are raw feature columns specified in vertex_stats_cols of params. The vertex statistics features other than fan, degree, and ratio are calculated for each raw feature column from vertex_stats_cols. Only the vertex statistics features that are selected in vertex_stats_feats of params are present in <vertex statistics features>.

Parameters:

features_inarray-like of float, shape = (n_edges, n_raw_features): Input edge list. Each edge of this edge list should have the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>].

Returns:

features_out: array-like of float, shape = (n_edges, n_raw_features + n_eng_features): Input edge list with additional n_eng_features graph-based features per edge. Each edge of this edge list has the following format: [Edge ID, Source Vertex ID, Target Vertex ID, Timestamp, <other raw features>, <graph-based features>].