BigDL Movie Recommendation System 🎬
Authors: Akshay Bahadur and Ayush Agarwal
The following blog post is part of the Machine Learning in Production (17634) coursework at Carnegie Mellon University. In this discussion, we will consider the scenario of movie recommendation with respect to the following two frameworks:
BigDL: A distributed deep learning framework developed and open-sourced by Intel. BigDL is the major point of our discussion for this blog.
Streamlit: A framework for web application development and deployment for data science applications. (only for exploration)
Now that we have mentioned the resources and formed a basic blog premise, let’s talk about some standard definitions.
Recommendation System: A recommendation system is a filtration program whose prime goal is to predict a user’s “rating” or “preference” for a domain-specific item or items. In our case, the domain-specific item is a movie; therefore, the main focus of our recommendation system is to filter and predict only those movies that a user would prefer, given some data about the user.
BigDL: BigDL is a distributed deep learning framework for big data platforms and workflows. It is implemented as a library on top of Apache Spark. It allows users to write their deep learning applications as standard Spark programs, running directly on existing big data clusters.
Streamlit: Streamlit is a tool that allows users to quickly build highly interactive web applications around their data and machine learning models. It can also deploy the web app to the cloud simply by pushing the changes to GitHub.
Problem Statement ✨
Movie recommendation is challenging because it can be approached in multiple ways, especially since we have numerous data points that can be modeled for the recommendation. Let’s try to analyze some of the problems.
Scale: Consider a company like Netflix, which offers 2,769 hours of original movies, 3,600+ movies, and more than 1,800 TV shows, for a total of 10,000+ viewable hours. In 2020, Netflix had over 200 million subscribers across 190+ countries. Building a recommendation engine for that many users is therefore a cumbersome task.
Traditional ML techniques: Traditional ML techniques become inefficient as the scale of the data increases, and manual feature engineering becomes cumbersome at scale.
Although the problem discussed above is tricky, there have been multiple approaches in the past to address the issue. Let’s briefly go through some of them.
Collaborative Filtering: The underlying intuition behind collaborative filtering is that if user A and B have similar tastes in a product, then A and B are likely to have similar tastes in other products as well.
Content-Based: Content-based systems generate recommendations from a user’s preferences and profile by matching the user to items based on similarity.
Hybrid Approach: As the name suggests, this approach merges the two methods above to combine their strengths and offset their weaknesses.
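To make the collaborative-filtering intuition concrete, here is a minimal user-based sketch in plain NumPy. The toy rating matrix and similarity weighting are illustrative choices only; they are not part of our BigDL pipeline.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, cols = movies, 0 = unrated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two users' rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_rating(user, item):
    """Predict a rating as the similarity-weighted average of other users' ratings."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine_sim(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den else 0.0

# Users 0 and 1 rate movies similarly, so user 1's low rating of movie 2
# pulls user 0's predicted rating for that movie down.
print(round(predict_rating(0, 2), 2))
```

Because user 0 is far more similar to user 1 than to user 2, the prediction lands much closer to user 1’s rating, which is exactly the “similar users, similar tastes” intuition.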
Now that we have discussed the problem and the approaches for the movie recommendation, it is apt to discuss the framework to implement the strategy.
BigDL is used for building large-scale AI applications on distributed big data. The BigDL framework consists of multiple modules, each handling a specific task:
DLlib: distributed deep learning library for Apache Spark
Orca: seamlessly scale out TensorFlow and PyTorch pipelines for distributed big data
RayOnSpark: run Ray programs directly on Big Data clusters
Chronos: scalable time series analysis using AutoML
PPML: privacy-preserving big data analysis and machine learning (experimental)
Serving: distributed and automated model inference on Big Data streaming frameworks
In this blog, we will be discussing BigDL in general and specifically about Orca (wherever required).
“More Data = More Compute Resources = More Cost”
Features of BigDL 🦄
It supports an API similar to Torch and Keras for constructing neural network models.
It supports large-scale distributed training and inference, leveraging the scale-out architecture of the underlying Spark framework.
BigDL provides an expressive, “data-analytics integrated” deep learning programming model.
Within a single, unified data analysis pipeline, users can efficiently process huge datasets using Spark APIs, feed the distributed dataset to the neural network model, and perform distributed training or inference on top of Spark.
Movie Recommendations with BigDL 🍿
BigDL is useful for processing large datasets and for large-scale distributed training and inference, which makes it a natural fit once we try to scale our movie recommendation system.
For more details on the source code, visit the GitHub Source.
Set up the environment for execution: We used ‘local’ execution for experimentation, but this can be changed to cluster mode through YARN.
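The setup step can be sketched with BigDL’s Orca context API as we understand it; treat the core counts and memory sizes below as placeholders rather than our exact configuration.

```python
from bigdl.orca import init_orca_context, stop_orca_context

# Local mode for experimentation: Spark runs inside the current process.
sc = init_orca_context(cluster_mode="local", cores=4, memory="4g")

# On a Hadoop cluster, switch to YARN without changing the pipeline code:
# sc = init_orca_context(cluster_mode="yarn-client", num_nodes=4,
#                        cores=8, memory="16g")

# ... build and run the Spark/BigDL pipeline here ...

stop_orca_context()
```

The key point is that the rest of the pipeline is unchanged; only the `cluster_mode` arguments differ between laptop and cluster.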
We have used the ‘MovieLens 1 million’ dataset for the assignment.
Number of ratings: 1000000
Number of users: 6040
Number of Movies: 4000
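MovieLens 1M stores each rating as a `UserID::MovieID::Rating::Timestamp` record, one per line. At scale this is loaded with Spark, but a stdlib sketch of the record format (with two sample records) looks like this:

```python
from collections import namedtuple

Rating = namedtuple("Rating", "user_id movie_id rating timestamp")

def parse_ratings(lines):
    """Parse MovieLens 1M 'UserID::MovieID::Rating::Timestamp' records."""
    for line in lines:
        user, movie, rating, ts = line.strip().split("::")
        yield Rating(int(user), int(movie), int(rating), int(ts))

sample = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
]
for r in parse_ratings(sample):
    print(r.user_id, r.movie_id, r.rating)
```

Each parsed record maps directly onto the (user, movie, rating) triples the model trains on; the timestamp is unused here.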
For the modeling, we used BigDL’s Keras-style abstraction.
GMF layers: We create embeddings of the users and movies and return a generalized matrix factorization (GMF) of the two.
MLP layers: The output of the GMF layers is used to compose the multi-layer perceptron model. Dense layers propagate the learned features down the network, and Dropout layers prevent overfitting.
Optimizer: Adam Optimizer
Loss: Sparse Categorical Crossentropy
Output: 5 classes (rating of a movie from 1 to 5)
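The forward pass of the architecture above can be sketched in plain NumPy. The real model is built with BigDL’s Keras-style layers and trained end to end; the embedding size, hidden width, and omission of Dropout here are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, emb_dim, n_classes = 6040, 4000, 16, 5

# Embedding tables for users and movies (randomly initialized here;
# in training these are learned).
user_emb = rng.normal(size=(n_users, emb_dim))
movie_emb = rng.normal(size=(n_movies, emb_dim))

# MLP weights: one hidden Dense layer, then a 5-way output layer.
w1 = rng.normal(size=(emb_dim, 32)); b1 = np.zeros(32)
w2 = rng.normal(size=(32, n_classes)); b2 = np.zeros(n_classes)

def forward(user_id, movie_id):
    """GMF: element-wise product of embeddings; MLP: Dense layers on top."""
    gmf = user_emb[user_id] * movie_emb[movie_id]  # generalized matrix factorization
    h = np.maximum(gmf @ w1 + b1, 0.0)             # Dense layer with ReLU
    logits = h @ w2 + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                     # softmax over ratings 1..5

p = forward(0, 0)
print(p.shape, round(float(p.sum()), 6))
```

The softmax output is the 5-way probability vector over ratings 1–5 that sparse categorical cross-entropy trains against.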
Train Loss: 0.8198
Train Accuracy: 0.71
Test Loss: 1.416
Test Accuracy: 0.58
Prediction and Caching
For each user–movie pair, the model predicts the probability of the user giving the movie a rating of n, where n ranges from 1 to 5. This can be seen in the image below as part of the DenseVector.
The system recommends movies where the predicted rating is greater than or equal to three (≥3). This threshold can be tuned as per the requirements of the system.
The recommendations are then cached per user.
For more details on the results, visit the BigDL Movie Recommendation Web App.
We encourage you to check out the result yourself in our BigDL Movie Recommendation Web App. The web application is built and deployed using Streamlit.
Once you visit the page, you can press the Generate Insights button to get a summary of the dataset.
Make the selection on the sidebar, choose the userID and number of recommendations. From the dropdown, select the movie genre (default genre is set to all).
Native Spark Support: BigDL allows neural network models to be applied directly in a standard distributed streaming architecture for big data (using Apache Kafka and Spark Streaming), which scales out to multiple nodes efficiently and transparently. As a result, it greatly improves developer productivity and the deployment efficiency of the end-to-end streaming workflow.
Distributed Training: By adopting the state of practice of big data systems (i.e., a coarse-grained functional compute model), BigDL provides a viable design alternative for distributed model training. This allows deep learning algorithms and big data analytics to be seamlessly integrated into a single unified data pipeline and eliminates the impedance mismatch between deep learning frameworks and big data systems. Furthermore, it also makes it easy to handle failures, resource changes, task preemptions, etc., which are expected to be the norm rather than the exception in large-scale systems.
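The coarse-grained, data-parallel pattern described here — stateless tasks each computing a local gradient that the driver aggregates and applies — can be simulated without Spark at all. A toy NumPy version on a linear-regression problem (y = 2x), with Python lists standing in for Spark partitions:

```python
import numpy as np

# Toy dataset for y = 2x, split across "partitions" as Spark would split an RDD.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x
partitions = np.array_split(np.stack([x, y], axis=1), 4)  # 4 stateless tasks

w = 0.0  # model weight, "broadcast" to every task each mini-batch
for step in range(100):
    # Each task computes a local gradient on its own partition (a "map" job)...
    local_grads = [
        np.mean(2 * (w * part[:, 0] - part[:, 1]) * part[:, 0])
        for part in partitions
    ]
    # ...and the driver aggregates them (a "reduce") and updates the weight.
    w -= 0.1 * np.mean(local_grads)

print(round(w, 3))  # converges to ~2.0
```

Because each task is stateless and depends only on the broadcast weight and its own partition, a failed task can simply be re-run — which is exactly the fine-grained failure-recovery property the Scaling point below relies on.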
Scaling: BigDL runs a series of short-lived Spark jobs (e.g., two jobs per mini-batch as described earlier), and each task in a job is stateless, non-blocking, and completely independent of the others; as a result, BigDL tasks can run without gang scheduling. In addition, it can efficiently support fine-grained failure recovery by simply re-running the failed job (which then regenerates the associated partition of the local gradient or updated weights in Spark’s in-memory storage); this allows the framework to automatically and efficiently address failures (e.g., cluster scale-down, task preemption, random bugs in the code) in a timely fashion.
Unification of ML and BigData: Contrary to the conventional wisdom of the machine learning community that fine-grained data access and in-place updates are critical for efficient distributed training, BigDL provides large-scale, data-parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. By unifying the execution model of neural network models and big data analytics, BigDL allows new deep learning algorithms to be seamlessly integrated into production data pipelines, which can be easily deployed, monitored, and managed in a single unified big data platform.
Dependency: BigDL requires specific versions of Java, TensorFlow, and PyTorch to be installed. The code can stop working even when a version differs only at the patch level.
Abstraction issues: BigDL provides an abstraction over popular machine learning and big data frameworks. However, the abstraction has not been implemented for many methods of the base libraries, so substituting BigDL for those libraries means losing some of their features.
Developer community: For an open-source project to succeed, the developer community should experiment with it, report bugs, and improve it by raising pull requests. However, the BigDL community has been mostly dormant, resulting in a buggy experience.
Third-party extensions: There are only a few third-party extensions of BigDL. The absence of third-party extensions has resulted in fewer functionalities for the product.
BigDL is a distributed deep learning framework developed and open-sourced by Intel that can be used easily for developing and deploying an AI system. In this blog, we have given a brief introduction and discussed its strengths and limitations. We have also discussed its usage in general and with respect to building a recommendation system.
Made with ❤️ and 🦙 by Akshay and Ayush