Evaluation Metrics¶
The Collie library supports evaluating both implicit and explicit models.
Three common implicit recommendation evaluation metrics come out-of-the-box with Collie: Area Under the ROC Curve (AUC), Mean Reciprocal Rank (MRR), and Mean Average Precision at K (MAP@K). Each metric is optimized for efficiency by performing all calculations in batched, tensor form on the GPU (if available). We provide a standard helper function, evaluate_in_batches, to evaluate a model on many metrics in a single pass.
Explicit evaluation of recommendation systems is more straightforward, allowing us to use the TorchMetrics library for flexible, optimized metric calculations on the GPU, accessed through a standard helper function, explicit_evaluate_in_batches, whose API closely mirrors its implicit counterpart.
Evaluate in Batches¶
Implicit Evaluate in Batches¶
- collie.metrics.evaluate_in_batches(metric_list: Iterable[Callable], test_interactions: collie.interactions.datasets.Interactions, model: collie.model.base.base_pipeline.BasePipeline, k: int = 10, batch_size: int = 20, logger: Optional[pytorch_lightning.loggers.base.LightningLoggerBase] = None, verbose: bool = True) → List[float][source]¶
Evaluate a model with potentially several different metrics.
Memory constraints require that most test sets will need to be evaluated in batches. This function handles the looping and batching boilerplate needed to properly evaluate the model without running out of memory.
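The batching pattern this helper abstracts away can be sketched as follows. This is a simplified illustration, not Collie's implementation; `score_fn` is a hypothetical stand-in for a model's scoring call.

```python
# Simplified sketch of batched evaluation: score users in fixed-size
# chunks so the full prediction matrix never has to fit in memory at once.
# `score_fn` is a hypothetical stand-in for a model's scoring call.
def score_in_batches(user_ids, score_fn, batch_size=20):
    preds = []
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        preds.extend(score_fn(batch))  # scores for this chunk of users
    return preds

# Example: a fake "model" that scores each user as 2 * user_id
print(score_in_batches(list(range(5)), lambda batch: [u * 2 for u in batch], batch_size=2))
# → [0, 2, 4, 6, 8]
```

A larger `batch_size` means fewer round trips, at the cost of more memory per chunk, which is why the docstrings below recommend setting it as high as memory allows.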
- Parameters
metric_list (list of functions) – List of evaluation functions to apply. Each function must accept keyword arguments:
- targets
- user_ids
- preds
- k
test_interactions (collie.interactions.Interactions) – Interactions to use as labels
model (collie.model.BasePipeline) – Model that can take a (user_id, item_id) pair as input and return a recommendation score
k (int) – Number of recommendations to consider per user. This is ignored by some metrics
batch_size (int) – Number of users to score in a single batch. For best efficiency, this number should be as high as possible without running out of memory
logger (pytorch_lightning.loggers.base.LightningLoggerBase) – If provided, the output metrics dictionary will be logged using the log_metrics method, with keys being the string representations of metric_list and values being evaluation_results. Additionally, if model.hparams.num_epochs_completed exists, it will be logged as well, making it possible to track metric progress over the course of model training
verbose (bool) – Display progress bar and print statements during function execution
- Returns
evaluation_results – List of floats, with each metric value corresponding to the respective function passed in metric_list
- Return type
list
Examples
from collie.metrics import auc, evaluate_in_batches, mapk, mrr

map_10_score, mrr_score, auc_score = evaluate_in_batches(
    metric_list=[mapk, mrr, auc],
    test_interactions=test,
    model=model,
)

print(map_10_score, mrr_score, auc_score)
Explicit Evaluate in Batches¶
- collie.metrics.explicit_evaluate_in_batches(metric_list: Iterable[torchmetrics.metric.Metric], test_interactions: collie.interactions.datasets.ExplicitInteractions, model: collie.model.base.base_pipeline.BasePipeline, logger: Optional[pytorch_lightning.loggers.base.LightningLoggerBase] = None, verbose: bool = True, **kwargs) → List[float][source]¶
Evaluate a model with potentially several different metrics.
Memory constraints require that most test sets will need to be evaluated in batches. This function handles the looping and batching boilerplate needed to properly evaluate the model without running out of memory.
- Parameters
metric_list (list of torchmetrics.Metric) – List of evaluation functions to apply. Each function must accept arguments for predictions and targets, in order
test_interactions (collie.interactions.ExplicitInteractions) – Interactions to use as labels
model (collie.model.BasePipeline) – Model that can take a (user_id, item_id) pair as input and return a recommendation score
batch_size (int) – Number of users to score in a single batch. For best efficiency, this number should be as high as possible without running out of memory
logger (pytorch_lightning.loggers.base.LightningLoggerBase) – If provided, the output metrics dictionary will be logged using the log_metrics method, with keys being the string representations of metric_list and values being evaluation_results. Additionally, if model.hparams.num_epochs_completed exists, it will be logged as well, making it possible to track metric progress over the course of model training
verbose (bool) – Display progress bar and print statements during function execution
kwargs (keyword arguments) – Additional arguments sent to the InteractionsDataLoader
- Returns
evaluation_results – List of floats, with each metric value corresponding to the respective function passed in metric_list
- Return type
list
Examples
import torchmetrics

from collie.metrics import explicit_evaluate_in_batches

mse_score, mae_score = explicit_evaluate_in_batches(
    metric_list=[torchmetrics.MeanSquaredError(), torchmetrics.MeanAbsoluteError()],
    test_interactions=test,
    model=model,
)

print(mse_score, mae_score)
Implicit Metrics¶
AUC¶
- collie.metrics.auc(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: Optional[Any] = None) → float[source]¶
Calculate the area under the ROC curve (AUC) for each user and average the results.
- Parameters
targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs
user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions
preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item
k (Any) – Ignored, included only for compatibility with mapk
- Returns
auc_score
- Return type
float
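As a reference for what this metric computes, here is a minimal pure-Python sketch of per-user AUC via pairwise comparison, averaged across users. This is illustrative only; Collie's implementation operates on batched score tensors, not Python lists.

```python
def user_auc(positive_items, scores):
    """AUC for one user: the fraction of (positive, negative) item pairs
    where the positive item is scored higher (ties count as half)."""
    pos = [s for i, s in enumerate(scores) if i in positive_items]
    neg = [s for i, s in enumerate(scores) if i not in positive_items]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

def mean_auc(targets, preds):
    """Average per-user AUC across all users."""
    return sum(user_auc(t, p) for t, p in zip(targets, preds)) / len(targets)

# User 0 ranks their positive item first (AUC 1.0); user 1 ranks theirs
# above one negative but below the other (AUC 0.5)
print(mean_auc([{0}, {2}], [[0.9, 0.1, 0.2], [0.9, 0.1, 0.2]]))  # → 0.75
```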
MAP@K¶
- collie.metrics.mapk(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: int = 10) → float[source]¶
Calculate the mean average precision at K (MAP@K) score for each user.
- Parameters
targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs
user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions
preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item
k (int) – Number of recommendations to consider per user
- Returns
mapk_score
- Return type
float
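The definition of MAP@K can be sketched in pure Python as follows. This simplified version takes ranked item-ID lists per user rather than the score matrices Collie's implementation works on.

```python
def apk(relevant_items, ranked_items, k=10):
    """Average precision at k for one user: precision at each relevant hit
    in the top-k list, averaged over min(#relevant, k)."""
    hits, running = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            hits += 1
            running += hits / rank  # precision at this cut-off
    return running / min(len(relevant_items), k)

def mapk(relevant_per_user, ranked_per_user, k=10):
    """MAP@K: mean of per-user average precision at k."""
    scores = [apk(rel, ranked, k)
              for rel, ranked in zip(relevant_per_user, ranked_per_user)]
    return sum(scores) / len(scores)

# User 0's single relevant item is ranked first (AP 1.0); user 1's is
# ranked third (AP 1/3), so MAP@10 is their mean
print(mapk([{7}, {9}], [[7, 3, 5], [3, 5, 9]], k=10))
```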
MRR¶
- collie.metrics.mrr(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: Optional[Any] = None) → float[source]¶
Calculate the mean reciprocal rank (MRR) of the input predictions.
- Parameters
targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs
user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions
preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item
k (Any) – Ignored, included only for compatibility with mapk
- Returns
mrr_score
- Return type
float
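For reference, MRR can be sketched in pure Python as follows. As with the MAP@K sketch, this simplified version takes ranked item-ID lists per user, not the score tensors Collie's implementation consumes.

```python
def reciprocal_rank(relevant_items, ranked_items):
    """1 / rank of the first relevant item in the list, 0.0 if none appears."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0

def mrr(relevant_per_user, ranked_per_user):
    """MRR: mean reciprocal rank across all users."""
    scores = [reciprocal_rank(rel, ranked)
              for rel, ranked in zip(relevant_per_user, ranked_per_user)]
    return sum(scores) / len(scores)

# User 0's first relevant item sits at rank 2 (RR 0.5); user 1's at rank 1
# (RR 1.0), so MRR is their mean
print(mrr([{5}, {3}], [[1, 5, 2], [3, 1, 2]]))  # → 0.75
```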