Target Label:
1 if clicked / viewed, otherwise 0
Training data:
We build the model on USER-CLP pairs, with label 1 if clicked / viewed and 0 otherwise.
We randomly pick 5% of users and all clps they interacted with on a given day to create training data.
We use four categories of features to build the model: user-static, user-clp, user-attribute, and clp-level features.
We look at user history for the past [7, 14, 30, 60] days to capture both short-term and long-term interest.
PI/Ni: 4.5% in training data.
We created training data using 1 day from each week, with at least a 7-day interval between label dates.
A row in training data looks like this: user_id, clp_id, date, features, y-value.
y-value is 1/0 (click / no click).
Validation data:
Generated the same way as the training data, with Validation set dates > Training set dates.
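The date and sampling rules for the training and validation sets can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the choice of the same weekday every week, the seed, the `clicks_{N}d` feature names, and the decision to exclude the label date from the history windows are all assumptions made for the example.

```python
import random
from datetime import date, timedelta

LOOKBACK_DAYS = (7, 14, 30, 60)  # history windows named in the text


def pick_label_dates(first_label_date, n_weeks):
    """One label date per week on the same weekday (assumption),
    so consecutive label dates are exactly 7 days apart."""
    return [first_label_date + timedelta(weeks=k) for k in range(n_weeks)]


def split_by_date(label_dates, n_validation):
    """Hold out the latest dates so that validation dates > training dates."""
    if not 0 < n_validation < len(label_dates):
        raise ValueError("n_validation must be between 1 and len(label_dates) - 1")
    ordered = sorted(label_dates)
    return ordered[:-n_validation], ordered[-n_validation:]


def sample_users(user_ids, fraction=0.05, seed=0):
    """Randomly keep a fixed fraction of users (5% in the text)."""
    rng = random.Random(seed)
    pool = sorted(user_ids)
    k = min(len(pool), max(1, round(len(pool) * fraction)))
    return set(rng.sample(pool, k))


def window_click_counts(click_dates, label_date):
    """Count clicks in each lookback window ending the day before label_date
    (excluding the label date itself is an assumption, to avoid leakage)."""
    counts = {}
    for days in LOOKBACK_DAYS:
        window_start = label_date - timedelta(days=days)
        counts[f"clicks_{days}d"] = sum(
            window_start <= d < label_date for d in click_dates
        )
    return counts


label_dates = pick_label_dates(date(2023, 1, 2), n_weeks=6)
train_dates, validation_dates = split_by_date(label_dates, n_validation=2)
sampled_users = sample_users(range(1000))
```

The same windowing function would be reused unchanged for validation and test rows, which keeps feature definitions identical across the three sets.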
Testing data:
We used a few newly created experiment Widget groups and their corresponding clps.
3M random signed-up users were used, along with a few internal users.
Feature creation is the same as in training data generation, with Test set dates > Validation set dates > Training set dates.
Num of rows in testing data = num of users (3M) * num of clps in experiment Widget groups.
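The row count above follows from scoring every test user against every clp in the experiment Widget groups. A small sketch of that cross product, where `iter_test_pairs` and the example ids are hypothetical names used only for illustration:

```python
from datetime import date
from itertools import product


def iter_test_pairs(user_ids, clp_ids, test_date):
    """Yield one (user, clp) row per combination.

    A generator is used because 3M users times the number of clps
    may be too large to hold in memory at once.
    """
    for user_id, clp_id in product(user_ids, clp_ids):
        yield {"user_id": user_id, "clp_id": clp_id, "date": test_date}


example_users = ["u1", "u2", "u3"]
example_clps = ["clp_a", "clp_b"]
example_rows = list(iter_test_pairs(example_users, example_clps, date(2023, 3, 1)))
```

With 3 users and 2 clps this yields 6 rows, matching num of users * num of clps.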
Model:
XGBoost classification model with "binary:logistic" loss. We build three variations of the model using the above features:
Using only user features [removing clp-level features]
Using user features without views [removing clp-level features and views feature]
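One way to express the variants listed above is as feature filters applied before training. This is a sketch under stated assumptions: the column naming scheme (a `clp_` prefix for clp-level features, `view` in the name of views features), the variant keys, and the example column names are all hypothetical. Only the `binary:logistic` objective comes from the text.

```python
# Objective named in the text; all other hyperparameters are left unspecified.
XGB_PARAMS = {"objective": "binary:logistic"}


def is_clp_level(name):
    # Assumption: clp-level feature columns share a "clp_" prefix.
    return name.startswith("clp_")


def is_view_feature(name):
    # Assumption: views features contain "view" in the column name.
    return "view" in name


VARIANTS = {
    "user_features_only": lambda f: not is_clp_level(f),
    "user_features_without_views": lambda f: not is_clp_level(f)
    and not is_view_feature(f),
}


def select_features(all_features, variant):
    """Return the feature columns kept for the given model variant."""
    keep = VARIANTS[variant]
    return [f for f in all_features if keep(f)]


example_features = [
    "user_age_days",
    "user_clp_clicks_7d",
    "user_clp_views_7d",
    "clp_ctr_30d",
]
user_only = select_features(example_features, "user_features_only")
no_views = select_features(example_features, "user_features_without_views")
```

The selected columns and `XGB_PARAMS` could then be passed to XGBoost, for example through `xgboost.DMatrix` and `xgboost.train`, so that each variant differs only in its input columns.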
MODEL EVALUATION:
Training and backtesting results can be found here.
Feature importance of each model can be found here.
We run prediction on the test set and rank each user's clps by predicted score.
Labeled data for each user is used as ground truth.
We calculated precision@k and recall@k from the sorted predicted ranking and the ground truth set for each user.
We calculated MRR (mean reciprocal rank) from the same rankings.
1. We looked at the top clp distribution for each model to analyse skewness and biases in the model. Top clp distribution for each model can be found here.
2. We manually looked at rankings for a few internal users.
OUTPUT PROB DISTRIBUTION
Training sample
Inference sample
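The ranking metrics described under MODEL EVALUATION can be sketched with their standard definitions. This is not necessarily the exact algorithm used for the reported results: the function names are hypothetical, scores are assumed to be ranked in descending order, and users with no relevant clp in their ranking are assumed to contribute a reciprocal rank of 0.

```python
def rank_clps(scores):
    """scores: dict of clp_id -> predicted probability.
    Returns clp_ids sorted by descending score (assumption)."""
    return sorted(scores, key=scores.get, reverse=True)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked clps that are in the ground truth set."""
    hits = sum(1 for clp in ranked[:k] if clp in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the ground truth set that appears in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for clp in ranked[:k] if clp in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant clp, or 0.0 if none is ranked."""
    for position, clp in enumerate(ranked, start=1):
        if clp in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, ground_truth):
    """Average reciprocal rank over all users present in rankings."""
    values = [
        reciprocal_rank(ranked, ground_truth.get(user, set()))
        for user, ranked in rankings.items()
    ]
    return sum(values) / len(values) if values else 0.0


rankings = {
    "u1": rank_clps({"a": 0.9, "b": 0.2, "c": 0.5}),
    "u2": rank_clps({"x": 0.8, "y": 0.1}),
}
ground_truth = {"u1": {"c"}, "u2": {"x"}}
mrr = mean_reciprocal_rank(rankings, ground_truth)
```

In this toy example the first relevant clp sits at position 2 for `u1` and position 1 for `u2`, so the per-user reciprocal ranks are 0.5 and 1.0.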