2022 U+ AI Ground (LG Uplus): a 3rd place solution that recommends content customers are expected to prefer, using data from the children's content service.
- CPU: Intel Core i7-11700K (8 cores)
- RAM: 32GB
- GPU: NVIDIA GeForce RTX 3090 Ti
The model was improved by combining various features and layers; a diagram of the model is shown below.
We designed the ensemble around the bagging idea behind random forests: train several models and combine their predictions.
Each model produces a ranked list of items, and the lists are blended with per-model weights, with higher-ranked items receiving a larger share of each model's weight. The implementation is as follows.
```python
from ast import literal_eval
from typing import Dict, List

import pandas as pd


def customize_blend(
    recommends: pd.Series, top_k: int, weighted: List[float]
) -> List[int]:
    """Blend each model's ranked list into a single top-k list.

    Args:
        recommends: row holding each model's predictions as a stringified
            list in the columns ``predicted_list{num}``
        top_k: number of items to return
        weighted: weight assigned to each model's list
    Returns:
        top k items by blended score
    """
    # literal_eval safely parses the stringified lists.
    recommended_items = [
        literal_eval(recommends[f"predicted_list{num}"]) for num in range(len(weighted))
    ]
    res: Dict[int, float] = {}
    for weight, items in zip(weighted, recommended_items):
        for n, v in enumerate(items):
            # An item at rank n receives weight / (n + 1) from that list.
            res[v] = res.get(v, 0.0) + weight / (n + 1)
    # Sort items by blended score, descending, and keep the top k.
    return [item for item, _ in sorted(res.items(), key=lambda kv: -kv[1])][:top_k]
```

After building the ensemble pipeline around the bagging idea of random forests, cross-validation training was performed with GroupKFold over users, under the assumption that similar users share common behavioral characteristics.
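As a quick sanity check of the rank-weighted blending in `customize_blend` above, here is a standalone re-implementation on plain lists with hypothetical inputs (the `pandas` row parsing is omitted):

```python
from typing import Dict, List


def blend(ranked_lists: List[List[int]], weights: List[float], top_k: int) -> List[int]:
    """Mirror customize_blend: rank n in a list contributes weight / (n + 1)."""
    scores: Dict[int, float] = {}
    for weight, items in zip(weights, ranked_lists):
        for rank, item in enumerate(items):
            scores[item] = scores.get(item, 0.0) + weight / (rank + 1)
    return [item for item, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_k]


# Item 20 appears in both lists, so its contributions accumulate:
# 1.0/2 from the first list plus 0.5/1 from the second.
print(blend([[10, 20, 30], [20, 40]], [1.0, 0.5], top_k=3))  # [10, 20, 30]
```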
- Train/test split seeds: 94, 95, 96, 99, 123, 317, 529, 705, 1234, 3407
- 5-fold GroupKFold, baseline seeds: 22, 94, 95, 96, 99, 317, 2020, 3407
- 5-fold GroupKFold, my model seeds: 22, 94, 95, 96, 99, 3407
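The per-user split can be sketched with scikit-learn's `GroupKFold`. This is a minimal illustration on synthetic data; the real pipeline's columns and features are not shown in this document:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic interaction log: 100 events from 20 users (hypothetical shapes).
rng = np.random.default_rng(94)
groups = rng.integers(0, 20, size=100)  # user id per interaction
X = rng.random((100, 3))                # dummy feature matrix
y = rng.integers(0, 2, size=100)        # dummy labels

folds = list(GroupKFold(n_splits=5).split(X, y, groups=groups))
for train_idx, valid_idx in folds:
    # GroupKFold guarantees no user appears on both sides of a split.
    assert set(groups[train_idx]).isdisjoint(groups[valid_idx])
print(len(folds))  # 5
```

Keeping all of a user's interactions on one side of the split avoids leaking a user's behavior from train into validation.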
By default, hydra-core==1.2.0 was added to the requirements provided by the competition.
For PyTorch, refer to https://pytorch.org/get-started/previous-versions/ and reinstall the version that matches your environment.
You can install the libraries needed to run the code by typing:

```shell
$ pip install -r requirements.txt
```

Code execution for the new model is as follows:
- Put the base data into the `input/upplus-recsys/` folder. Running the dataset-creation code stores the data for each `fold`, together with `item_features` and `user_features`, in the `input/upplus-recsys/` folder.

  ```shell
  $ python scripts/make_dataset.py models=neucf
  ```
- Running the training shell script trains a model for each `fold`.

  ```shell
  $ sh scripts/train.sh
  ```

  Modifying the script lets you train per `fold` and change the seed value. For example:

  ```shell
  for seed in 22 94 95 96 99 3407
  do
      for fold in 0 1 2 3 4
      do
          python src/train.py models.fold=$fold models.seed=$seed
      done
  done
  ```
- Running the prediction shell script saves the inferred values for each `fold` in the `output` folder.

  ```shell
  $ sh scripts/predict.sh
  ```

  Modifying the script lets you run inference per `fold`; set the seed value to that of the trained model. For example:

  ```shell
  for seed in 22 94 95 96 99 3407
  do
      for fold in 0 1 2 3 4
      do
          python src/predict.py models.fold=$fold models.seed=$seed
      done
  done
  ```
- To ensemble the per-`fold` predictions, edit `config/ensemble.yaml` to point at the desired files.

  ```yaml
  defaults:
    - _self_
    - data: dataset
    - features: features
    - models: neucf
    - hydra: default
    - override hydra/hydra_logging: disabled
    - override hydra/job_logging: disabled

  output:
    path: output
    name: neural-mf-layer3-seed94-group-5fold-ensemble.csv
    submit: sample_submission.csv

  features: features.yaml

  ensemble:
    preds1: neural-mf-layer3-seed94-group-fold0.csv
    preds2: neural-mf-layer3-seed94-group-fold1.csv
    preds3: neural-mf-layer3-seed94-group-fold2.csv
    preds4: neural-mf-layer3-seed94-group-fold3.csv
    preds5: neural-mf-layer3-seed94-group-fold4.csv
    weights:
      - 1
      - 1
      - 1
      - 1
      - 1
  ```
- The ensemble code saves the final result in the `output` folder.

  ```shell
  $ python src/ensemble.py output.name=neural-mf-layer3-seed94-group-5fold-ensemble.csv
  ```
The boosting models showed a significant performance gap compared to the NN models. Ensembling also appeared to have a greater impact on the score than any single model.
The files in the `submit` folder under `output` are the ones we finally submitted.
`best-lb-bootstrap-group-fold-enemble.csv` is an ensemble of the existing baseline models and the models trained with 5-fold GroupKFold. `rank-nural-enemble.csv` combines the existing baseline models, the 5-fold GroupKFold models, and the plain 5-fold models.
- Boosting ranker model: we failed to process the data into a form suitable for learning-to-rank.
- Boosting binary model: binary-classification training took a long time, and even then the model could not separate positives from negatives well; this seems to be caused by a lack of features.
- Boosting ranker after candidate generation: it also discriminated poorly, likely due to the same lack of features.
- With 4 layers, the gap between the CV and LB scores suggests overfitting.
- The graph model took too long to train.