FastAI Movie Recommender Step-by-Step

An app where you can type in your favorite movie and get 30 similar movies, suggested by a collaborative filtering model trained on over 9 million movie reviews.

Make sure all your libraries and packages are installed and loaded.

Fastai is the library we use to train, load and run our model. NB: we already trained our model so we’ll only be loading it here.

# !pip install -Uqq fastai

Gradio lets you easily publish your model on the web for people to use.

# !pip install -Uqq gradio

nbdev is what we’re using to make this documentation easier and more presentable.

# !pip install nbdev

These are the libraries we need

# For modeling
from fastai.tabular.all import *
from fastai.collab import *
# For publishing the model
import gradio as gr

Load the model and the data loaders or movie title CSV

The data loaders we used to train the model on the 18 million reviews were 200MB, so we use a CSV of movie titles instead. To convert from data loaders to a pandas DataFrame, use the code commented out just below.

# titles = pd.DataFrame(dls.classes['title']).reset_index() #create df from data loaders(dls)
# titles = titles.rename(columns={'index':'midx',0:'title'}) # rename columns
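
The conversion above can be tried without a trained model. This is a minimal sketch, assuming a plain Python list stands in for dls.classes['title'] and that titles.csv is a hypothetical file name (the original does not name the file):

```python
import os
import tempfile

import pandas as pd

# Stand-in for dls.classes['title'] (assumption: a list-like of title strings).
classes = ['Jumanji (1995)', 'Magnolia (1999)', 'Wrestler, The (2008)']

# Same two steps as the commented-out code: build the frame, then rename columns.
titles = pd.DataFrame(classes).reset_index()
titles = titles.rename(columns={'index': 'midx', 0: 'title'})

# Save to a small CSV and load it back; 'titles.csv' is a hypothetical name.
csv_path = os.path.join(tempfile.mkdtemp(), 'titles.csv')
titles.to_csv(csv_path, index=False)
df_titles = pd.read_csv(csv_path)
```

The midx column keeps each title's position in the original class list, which is what lets a title be mapped back to its row of latent factors.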

Functions for Model Inference


get_movie_recs

 get_movie_recs (full_title:str, learn,
                 df_titles:pandas.core.frame.DataFrame)

This function will use the nn.CosineSimilarity on the 50 latent factors for each movie to find the 30 movies most similar to your favorite movie.

Parameter   Type       Details
full_title  str        String saved to favorite_movie variable
learn                  Trained model
df_titles   DataFrame  DataFrame with all movie titles from dls
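
To illustrate the similarity step, here is a minimal sketch in plain NumPy rather than the nn.CosineSimilarity call the function uses. The most_similar helper and the factor values are hypothetical, and only 3 factors per movie are used instead of 50:

```python
import numpy as np

def most_similar(factors, idx, n=3):
    """Return indices of the n rows of `factors` most similar to row `idx`."""
    target = factors[idx]
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(factors, axis=1) * np.linalg.norm(target)
    sims = factors @ target / np.maximum(norms, 1e-8)
    # Sort from most to least similar and keep the first n.
    return np.argsort(-sims)[:n]

# Made-up latent factors for 5 movies.
factors = np.array([
    [0.9, 0.1, -0.3],
    [0.8, 0.2, -0.2],
    [-0.7, 0.9, 0.4],
    [0.1, -0.8, 0.9],
    [-0.6, 0.8, 0.5],
])
top = most_similar(factors, idx=0, n=2)
```

A movie is always most similar to itself, so in this sketch the queried movie comes back first and its closest neighbor second.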

search_movie_titles

 search_movie_titles (favorite_movie:str,
                      df_titles:pandas.core.frame.DataFrame)

This function searches the df_titles['title'] column for titles that contain the words saved in the favorite_movie variable, after making all words lowercase and removing "the" from the start of the title.

Parameter       Type       Details
favorite_movie  str        String saved to favorite_movie variable
df_titles       DataFrame  DataFrame with all movie titles located in the column title from dls
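
The search described above could look roughly like the sketch below. It is not the actual search_movie_titles implementation: find_titles is a hypothetical helper, and it assumes a simple substring match on the lowercased title column.

```python
import pandas as pd

def find_titles(query, df_titles):
    """Case-insensitive substring search that ignores a leading 'the '."""
    q = query.lower().strip()
    if q.startswith('the '):
        q = q[4:]
    # regex=False so parentheses in titles such as '(2008)' are matched literally.
    mask = df_titles['title'].str.lower().str.contains(q, regex=False)
    return df_titles.loc[mask, 'title'].tolist()

df_titles = pd.DataFrame({'title': [
    'Wrestler, The (2008)', 'Hunt, The (Jagten) (2012)', 'Magnolia (1999)',
]})
matches = find_titles('The Wrestler', df_titles)
```

Dropping the leading "the" matters because titles in this dataset are stored with the article at the end, as in 'Wrestler, The (2008)'.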

search_movies_n_recommend

 search_movies_n_recommend (favorite_movie:str, learn,
                            df_titles:pandas.core.frame.DataFrame)

This function returns 30 recommendations using search_movie_titles and get_movie_recs.

Parameter       Type       Details
favorite_movie  str        The movie title typed into the gr.Textbox() that the user will see on the gradio app
learn                      Trained model
df_titles       DataFrame  Movie titles df

Gradio For Publishing the Model

Now we use gr.Interface() to create an interface around our search_movies_n_recommend function, along with the settings that control what is shown once we publish it with Gradio.

intf = gr.Interface(fn=search_movies_n_recommend,
                    inputs=gr.Textbox(lines=1,placeholder="Put Your Favorite Movie Here To See Similar Movie Recommendations..."),
                    outputs=["text","text"],
                    examples=['Child\'s Play (1988)','Rushmore','brothers Bloom',
                              'call me by your name', 'Jumanji (1995)','The Three Amigos'],
                    title="Dave's Fast.AI Movie Recommender",
                    article='This recommender uses a collaborative filtering model fashioned from the [Fast.AI](https://github.com/fastai/fastbook/blob/master/08_collab.ipynb) library and trained on the 9 million movie reviews from the [MovieLens dataset](https://grouplens.org/datasets/movielens/25m/). It trains latent factors to give movies and users descriptive weights that help predict their reviews based on these factors. The model was trained completely WITHOUT titles, descriptions, or categories, and amazingly these things can be added to improve this model. Enjoy!')

Actually Try the Model HERE

Now we launch it so it can be used right here in the Jupyter notebook.

intf.launch(inline=True,  # when True, renders the app inside the notebook
            share=False,  # when True, automatically publishes the app to the gradio website for 72 hours
            quiet=True, width=600, height=700)
Running on local URL:  http://127.0.0.1:7862
(<gradio.routes.App>, 'http://127.0.0.1:7862/', None)

Example Without Gradio:

Here is an example of the model being used straight from the get_movie_recs function, recommending movies similar to one of my favorites.

favorite_movie = 'Beasts of the Southern Wild (2012)'
get_movie_recs(full_title=favorite_movie, learn=learn, df_titles=df_titles)
['Beasts of the Southern Wild (2012)',
 'No Country for Old Men (2007)',
 'Wrestler, The (2008)',
 'Leaving Las Vegas (1995)',
 'City of God (Cidade de Deus) (2002)',
 'Talk to Her (Hable con Ella) (2002)',
 'Antichrist (2009)',
 'Nocturnal Animals',
 'Requiem for a Dream (2000)',
 'Magnolia (1999)',
 'In the Company of Men (1997)',
 'Old Boy (2003)',
 'Enemy (2013)',
 '12 Angry Men (1957)',
 'Samouraï, Le (Godson, The) (1967)',
 'Bringing Out the Dead (1999)',
 'Eyes Wide Shut (1999)',
 'Mulholland Drive (2001)',
 'Punch-Drunk Love (2002)',
 'Shallow Grave (1994)',
 'Birdman: Or (The Unexpected Virtue of Ignorance) (2014)',
 'Taxi Driver (1976)',
 'Black Swan (2010)',
 'American Beauty (1999)',
 'Broken Flowers (2005)',
 "Winter's Bone (2010)",
 'Little Miss Sunshine (2006)',
 'Hunt, The (Jagten) (2012)',
 'Control (2007)',
 'The Machinist (2004)']

How does this work?

Click HERE for the full fast.ai lesson explanation.

The model is trained on 3 pieces of data: Movie_Id, User_Id, and Movie_Rating. So how can we discern which movies are similar to each other? The answer is “Latent Factors.”

How do latent factors work?

Let’s say John loves movies that score high in 3 categories: action, science fiction, and old (though he doesn’t care about old as much). We can represent this as an array with scores between -1 and 1: [action, scifi, old]

john = np.array([1,1,.8]) # [action, scifi, old]

Now we look at the movie The Last Skywalker and score it with the same system. It has a lot of action, but not craaazy amounts of action, so we’ll score that .9. It’s definitely scifi so we score that 1, but it isn’t old so we’ll score that -.9.

last_skywalker = np.array([.9,1,-.9]) # [action, scifi, old]

Now we multiply those together, category by category, to see how well each one matches

john*last_skywalker
array([ 0.9 ,  1.  , -0.72])

Then we can sum it together to get a single score:

(john*last_skywalker).sum()
1.1799999999999997

This is called the dot product.

Let’s try another movie

casablanca = np.array([-0.99,-0.3,0.8])

Casablanca is a much worse match for John

(john*casablanca).sum()
-0.6499999999999999

This score can be used to help us predict how someone would rate a movie and that system can be optimized with machine learning.
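
The multiply-then-sum steps above can be done for many movies at once with a matrix product. This is a small sketch using the scores described in the text; the movies array and the best variable are names introduced here for illustration:

```python
import numpy as np

john = np.array([1, 1, .8])    # [action, scifi, old]

# One row per movie, same category order as john.
movies = np.array([
    [.9, 1, -.9],              # The Last Skywalker, scores from the text
    [-0.99, -0.3, 0.8],        # Casablanca
])

# Matrix-vector product: one dot product per row.
scores = movies @ john

# Index of the movie with the highest matching score.
best = int(np.argmax(scores))
```

Each entry of scores equals (john*movie).sum() for the corresponding row, so the ranking matches the one worked out by hand: The Last Skywalker scores higher for John than Casablanca.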

Not enough data?

Our problem is that we don’t actually know the users’ preferences and we don’t have categories for the movies, but we DO have the users’ ratings of the movies.

Solution:

Instead of specific categories, create space for categories and fill it with random numbers. We then use the movie ratings we have to train the random numbers with Stochastic Gradient Descent (SGD) until they become useful for predicting all the users’ ratings.

Now we have useful categories that were created from randomness and can be used to show similar users and similar movies, as well as predict a user’s likely rating.
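
Here is a minimal sketch of that idea in plain NumPy, assuming a tiny made-up ratings table and hand-written SGD updates. The real model is trained with fastai; every name and number below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (user_id, movie_id, rating) triples; the values are made up.
ratings = np.array([
    [0, 0, 5.0], [0, 1, 4.0], [0, 2, 1.0],
    [1, 0, 4.0], [1, 2, 1.0], [1, 3, 2.0],
    [2, 1, 1.0], [2, 2, 5.0], [2, 3, 4.0],
    [3, 0, 1.0], [3, 2, 4.0], [3, 3, 5.0],
])
n_users, n_movies, n_factors = 4, 4, 3

# "Create space for categories and fill it with random numbers."
user_factors = rng.normal(0, 0.1, (n_users, n_factors))
movie_factors = rng.normal(0, 0.1, (n_movies, n_factors))

def mse(uf, mf):
    """Mean squared error of dot-product predictions on the toy ratings."""
    u = ratings[:, 0].astype(int)
    m = ratings[:, 1].astype(int)
    preds = (uf[u] * mf[m]).sum(axis=1)
    return float(((preds - ratings[:, 2]) ** 2).mean())

loss_before = mse(user_factors, movie_factors)

lr = 0.05  # learning rate (assumed value)
for epoch in range(500):
    # Visit the ratings one at a time in random order: the "stochastic" part.
    for row in rng.permutation(len(ratings)):
        u, m, r = int(ratings[row, 0]), int(ratings[row, 1]), ratings[row, 2]
        err = user_factors[u] @ movie_factors[m] - r
        # Gradients of 0.5 * err**2 with respect to each factor vector.
        grad_u = err * movie_factors[m]
        grad_m = err * user_factors[u]
        user_factors[u] -= lr * grad_u
        movie_factors[m] -= lr * grad_m

loss_after = mse(user_factors, movie_factors)
```

Comparing loss_before with loss_after shows how much closer the dot products get to the known ratings once the random numbers have been trained.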

Can we add more categories and Info?

Yes.

That is the next step in optimizing the model. It can be done by changing the model from a dot product model to a neural network, where you can add categories.
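
As a rough illustration of why a neural network makes this possible, the sketch below runs one untrained forward pass in NumPy. It assumes a single hidden layer, a hypothetical one-hot genre input, and an assumed rating range of 0.5 to 5; it is not the architecture of the model described here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_hidden = 3, 8

user_vec = rng.normal(size=n_factors)    # latent factors for one user
movie_vec = rng.normal(size=n_factors)   # latent factors for one movie
genre = np.array([1.0, 0.0])             # hypothetical extra input (one-hot genre)

# Concatenate instead of multiplying, so extra inputs can simply be appended.
x = np.concatenate([user_vec, movie_vec, genre])

# Untrained weights for one hidden layer and one output unit.
w1 = rng.normal(scale=0.1, size=(n_hidden, x.size))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=n_hidden)
b2 = 0.0

hidden = np.maximum(w1 @ x + b1, 0)      # ReLU activation
raw = w2 @ hidden + b2

# Squash the output into the assumed rating range with a sigmoid.
low, high = 0.5, 5.0
pred = low + (high - low) / (1 + np.exp(-raw))
```

A dot product needs the user and movie vectors to have the same length, while the concatenated input here can grow by any number of extra columns.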