This project implements a hierarchical user interest modeling system for news recommendation that combines historical user behavior with multimodal content representations.
- save_imgs_reduced(): Reduces image embeddings using PCA
- save_articles_reduced(): Reduces text embeddings using PCA
- prepare_reduced_embeddings(): Main function to prepare both image and text reduced embeddings
- compute_engagement_bins(): Creates bins for user engagement metrics (read times and scroll percentages)
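The PCA reduction step can be illustrated with a minimal NumPy sketch. The function name `pca_reduce`, the target dimension, and the random input are assumptions made for illustration only; the project's own reduction functions may rely on a different library or configuration.

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project row-vector embeddings onto their top principal components."""
    # Center each feature so the SVD captures variance, not the mean offset.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(100, 32))  # hypothetical: 100 items, 32 dims
reduced = pca_reduce(image_embeddings, n_components=8)
print(reduced.shape)  # (100, 8)
```

The same function would apply unchanged to text embeddings, since PCA only depends on the matrix of vectors and not on their modality.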
- Creates base article embeddings (e_i) using categorical features
- Handles article metadata including category, subcategory, article type, and sentiment
- Produces fixed-dimension embeddings for each article
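As a rough sketch of how categorical metadata can become a fixed-dimension vector e_i, the example below keeps one lookup table per feature and sums the looked-up rows. The class `ToyArticleEmbedder`, the summation, the sample values, and the random (untrained) tables are all assumptions for illustration; the actual embedder may concatenate features or learn the tables during training.

```python
import numpy as np

FEATURES = ("category", "subcategory", "article_type", "sentiment")

class ToyArticleEmbedder:
    """Hypothetical stand-in: one lookup table per categorical feature."""

    def __init__(self, embedding_dim: int, seed: int = 0):
        self.embedding_dim = embedding_dim
        self.rng = np.random.default_rng(seed)
        self.vocab = {f: {} for f in FEATURES}  # feature -> value -> row index
        self.tables = {}                        # feature -> (n_values + 1, dim)

    def fit(self, articles: list) -> None:
        for f in FEATURES:
            values = sorted({a[f] for a in articles})
            # Row 0 is reserved for values that were not seen during fitting.
            self.vocab[f] = {v: i + 1 for i, v in enumerate(values)}
            self.tables[f] = self.rng.normal(
                scale=0.1, size=(len(values) + 1, self.embedding_dim))

    def embed(self, article: dict) -> np.ndarray:
        # Summing per-feature vectors keeps the output dimension fixed.
        e_i = np.zeros(self.embedding_dim)
        for f in FEATURES:
            e_i += self.tables[f][self.vocab[f].get(article[f], 0)]
        return e_i

articles = [  # made-up metadata rows
    {"category": "sports", "subcategory": "football",
     "article_type": "default", "sentiment": "Positive"},
    {"category": "news", "subcategory": "politics",
     "article_type": "feature", "sentiment": "Negative"},
]
embedder = ToyArticleEmbedder(embedding_dim=64)
embedder.fit(articles)
e_i = embedder.embed(articles[0])
print(e_i.shape)  # (64,)
```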
- Manages precomputation and caching of article embeddings
- Provides efficient lookup of embeddings by article ID
- Saves embeddings in HDF5 format for efficient storage and retrieval
- Combines article embeddings with user engagement data
- Processes historical user interactions including read times and scroll percentages
- Transforms engagement metrics into learned embeddings
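The step from raw engagement metrics to embeddings can be sketched as binning followed by a table lookup. The quantile-based edges, the number of bins, the sample read times, and the random table below are assumptions; the real bins come from compute_engagement_bins() and the real table would be learned.

```python
import numpy as np

def quantile_bin_edges(values: np.ndarray, n_bins: int) -> np.ndarray:
    """Return the n_bins - 1 interior edges that split values into equal-mass bins."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

n_bins = 4
read_times = np.array([2.0, 5.0, 8.0, 13.0, 21.0, 34.0, 55.0, 89.0])  # made-up seconds
edges = quantile_bin_edges(read_times, n_bins)

# np.digitize maps each value to a bin index in [0, n_bins - 1].
bin_ids = np.digitize(read_times, edges)

# One embedding row per bin; random here, learned in a trained model.
rng = np.random.default_rng(0)
bin_table = rng.normal(scale=0.1, size=(n_bins, 16))
read_time_embeddings = bin_table[bin_ids]
print(bin_ids, read_time_embeddings.shape)
```

Scroll percentages could be handled the same way with their own edges and table, so that each historical interaction contributes one vector per metric.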
- MLP: Multi-layer perceptron for feature transformation
- PointwiseAttention: Implements attention mechanism without softmax normalization
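To make "attention without softmax normalization" concrete, here is a minimal NumPy sketch in which a small MLP scores each (target, history item) pair and the raw scores weight the history directly. The pairing by concatenation, the layer sizes, and the function names are assumptions; the project's PointwiseAttention may build its inputs differently.

```python
import numpy as np

def mlp_score(x, w1, b1, w2, b2):
    """Two-layer perceptron that maps each row of x to one scalar score."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU
    return (hidden @ w2 + b2).squeeze(-1)

def pointwise_attention(target, history, params):
    """Weight each history item by a score that is not softmax-normalized."""
    # Pair the target with every history item: shape (n_hist, 2 * dim).
    tiled = np.broadcast_to(target, history.shape)
    pairs = np.concatenate([tiled, history], axis=-1)
    scores = mlp_score(pairs, *params)  # shape (n_hist,), unbounded
    # Raw scores scale each item, so the weights need not sum to 1.
    return (scores[:, None] * history).sum(axis=0), scores

rng = np.random.default_rng(0)
dim, hidden_dim, n_hist = 8, 16, 5
params = (
    rng.normal(scale=0.1, size=(2 * dim, hidden_dim)), np.zeros(hidden_dim),
    rng.normal(scale=0.1, size=(hidden_dim, 1)), np.zeros(1),
)
target = rng.normal(size=dim)
history = rng.normal(size=(n_hist, dim))
pooled, scores = pointwise_attention(target, history, params)
print(pooled.shape, scores.shape)  # (8,) (5,)
```

Skipping the softmax lets the magnitude of the pooled vector reflect how strongly the history matches the target, instead of forcing the weights to compete for a fixed total.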
- UserSideInterest: Combines article embeddings (e_i) with user history (e_j) to create e_u
- MultiModalInterest: Processes reduced image and text embeddings to create e_u'
- HistoricUserInterest: Final model that combines both e_u and e_u' into a unified user interest representation
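One simple way to merge e_u and e_u' into a single vector is to concatenate them and apply a linear projection, sketched below. The fusion rule, the vector sizes, and the name `combine_interests` are assumptions for illustration; HistoricUserInterest may use a different combination.

```python
import numpy as np

def combine_interests(e_u, e_u_prime, weight, bias):
    """Concatenate both interest vectors and project to one representation."""
    joint = np.concatenate([e_u, e_u_prime], axis=-1)
    return joint @ weight + bias

rng = np.random.default_rng(0)
e_u = rng.normal(size=64)        # behavioral interest (hypothetical size)
e_u_prime = rng.normal(size=32)  # multimodal interest (hypothetical size)
weight = rng.normal(scale=0.1, size=(64 + 32, 64))
bias = np.zeros(64)
user_interest = combine_interests(e_u, e_u_prime, weight, bias)
print(user_interest.shape)  # (64,)
```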
- First, prepare the reduced embeddings:
prepare_reduced_embeddings(img_path='path/to/images.parquet',
                           word2vec_path='path/to/word2vec.parquet')
- Compute engagement bins:
compute_engagement_bins(history_path='path/to/history.parquet')
- Initialize the models:
# Initialize base components
article_embedder = ArticleEmbedder(embedding_dim=64)
article_embedder.fit_from_parquet('path/to/articles.parquet')
embedding_manager = ArticleEmbeddingManager(article_embedder)
user_history_embedder = UserHistoryEmbedder(embedding_manager, 'path/to/bins.pkl')
# Initialize interest models
user_side = UserSideInterest(user_history_embedder, article_embedder, embedding_dim=64)
# `device` must be defined beforehand (it is not set in this snippet)
multimodal = MultiModalInterest('path/to/imgs_reduced.pkl',
                                'path/to/word2vec_reduced.pkl',
                                device)
# Create final combined model
historic_model = HistoricUserInterest(user_side, multimodal)
The system follows a hierarchical architecture for modeling user interests:
- Base Level: Article representations using categorical features and sentiment
- Engagement Level: User interaction patterns through read times and scroll behavior
- Historical Level: Attention-based combination of user history and target articles
- Multimodal Level: Integration of reduced image and text representations
- Final Level: Unified representation combining behavioral and content-based interests
The model aims to capture both long-term user preferences through historical behaviors and short-term interests through multimodal content understanding.