Zero-Shot Classification

This notebook shows how to use zero-shot classification on data from a LinkedIn dataset.
#!pip install transformers

Here we’ll load everything we need into the notebook.

!pip install plotly
from transformers import pipeline
from fastai.tabular.all import *
import plotly.express as px

Now we’ll load a classifier via the Hugging Face transformers pipeline.

classifier = pipeline("zero-shot-classification")
No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)

Here we’ll load our top bio terms and all of the comments from CSV files.

path = Path('./text_sample')
df_terms = pd.read_csv(path/'top_bio_terms.csv')
dftext = pd.read_csv(path/'found_my_fitness_UCWF8SqJVNlx-ctXbLswcTcA_youtube_comments_only_121rows.csv')

We’ll join all the comments into one text block and put it in a single-row DataFrame, which is the input the “analyze_one” function works on.

def create_one_large_text_block(df):
    text_block = ' '.join(df.comment.tolist())
    return text_block
#create the textblock
textblock = create_one_large_text_block(dftext)

#create df with an input column 
df = pd.DataFrame({'input':[textblock]})
df.input = df.input.astype(str)
df.input = df.input.str.replace('\\n',' ',regex=False) # replace literal backslash-n sequences with a space
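
The `replace` call above targets the two-character sequence backslash + `n` (which is what `'\\n'` means in Python), not a real newline character. A minimal sketch of the difference, using a made-up sample string; the extra step that flattens real newlines is an assumption about what may be wanted and is not part of the original notebook:

```python
# Made-up sample: contains a literal backslash-n AND a real newline.
sample = "great video\\nthanks\nsecond line"

# '\\n' is two characters (backslash, n); '\n' is a single newline character.
assert len("\\n") == 2 and len("\n") == 1

# Same replacement as in the notebook: only the literal backslash-n is removed.
literal_fixed = sample.replace("\\n", " ")

# Optional extra step (an assumption, not in the original): flatten real newlines too.
fully_flat = literal_fixed.replace("\n", " ")

print(repr(literal_fixed))
print(repr(fully_flat))
```

If the comments were scraped with escaped newlines, the first replacement is probably the one that matters; the second only helps when real line breaks are also present.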

Here we’ll create a list from the bio terms we pulled in the “Found My Fitness Example”.

candidate_labels = df_terms.bio_term.tolist()

And now we’ll add a few more labels, some of them unrelated to the comments, to see how the model scores them against the whole text.

candidate_labels.extend(['rhonda','Mickey Mouse','addiction','Howard Taft']); candidate_labels
['alcohol (drug)',
 'fgf21 (gene)',
 'alcoholism (disease)',
 'humans (species)',
 'naltrexone (drug)',
 'rhonda',
 'Mickey Mouse',
 'addiction',
 'Howard Taft']
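
Adding labels can change the scores of the labels that were already there. As far as I understand the pipeline's default single-label mode, the per-label scores are normalized across all candidate labels so that they sum to 1. The sketch below illustrates that effect with a plain softmax; the logit values are invented for illustration and do not come from the model.

```python
import math

def softmax(logits):
    """Normalize raw scores so they are positive and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-label logits, invented for illustration only.
base = {"alcohol (drug)": 2.0, "naltrexone (drug)": 1.0}
extended = {**base, "Mickey Mouse": -3.0, "addiction": 2.5}

base_scores = dict(zip(base, softmax(list(base.values()))))
ext_scores = dict(zip(extended, softmax(list(extended.values()))))

# The same label gets a smaller share once more labels compete with it.
print(base_scores["alcohol (drug)"], ext_scores["alcohol (drug)"])
```

Under this assumption, an implausible label such as 'Mickey Mouse' takes very little of the total, while a plausible one such as 'addiction' pulls score away from the original terms.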

This is the function that will run the classifier on one row and plot the scores.

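def analyze_one(df, candidate_labels, index):
    """Classify one row of df.input against candidate_labels and plot the scores."""
    sequence = df.input[index]
    answer = classifier(sequence, candidate_labels)

    # answer is a dict with 'sequence', 'labels' and 'scores'
    dfo = pd.DataFrame(answer)
    dfo.sort_values('scores', inplace=True)  # ascending order

    fig = px.bar(dfo, x="scores", y="labels", orientation='h')

    # print a short preview of the classified text
    txtblck = dfo.sequence[0]
    print(str(txtblck[:400]) + '...')

    fig.show()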
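
To make the data handling inside `analyze_one` easier to follow without downloading a model, here is a small stand-alone sketch. The `answer` dict is a stand-in with invented values; only its key layout (`sequence`, `labels`, `scores`) mirrors what the zero-shot pipeline returns for a single input. The helper names `rows_for_plot` and `preview` are hypothetical and exist only for this illustration.

```python
# Stand-in for the dict returned by the zero-shot pipeline for one input.
# The values are invented; only the key layout mirrors the real output.
answer = {
    "sequence": "I stopped drinking after listening to this episode. " * 20,
    "labels": ["addiction", "alcohol (drug)", "Mickey Mouse"],
    "scores": [0.62, 0.35, 0.03],
}

def rows_for_plot(answer):
    """Pair each label with its score and sort ascending by score."""
    pairs = list(zip(answer["labels"], answer["scores"]))
    return sorted(pairs, key=lambda pair: pair[1])

def preview(text, limit=400):
    """Return the first `limit` characters followed by an ellipsis."""
    return text[:limit] + "..."

rows = rows_for_plot(answer)
print(rows)                         # lowest score first, highest score last
print(preview(answer["sequence"]))  # same 400-character preview as analyze_one
```

Sorting ascending means the highest-scoring label is the last row, which a horizontal bar chart will usually draw at the top.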