biobertApi

Example Texts

Create a Single Text Block That Is Easily Re-separated

create_com_with_idx

 create_com_with_idx (df)

This function takes a pandas DataFrame with a column titled “comments” that is filled with text strings, e.g. user comments. It adds an index number to each text string and appends ‘::’ at the end for later parsing. Finally, it returns all text elements combined into a single text block.

	Details
df	pandas DataFrame containing a column, titled “comments”, of text elements. In this case these are YouTube comments.
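The implementation is not shown on this page, so the following is only a minimal sketch of the behaviour described above. The name `sketch_create_com_with_idx` is hypothetical, and the exact `N:: text ::` layout is inferred from the example text block shown later in this document.

```python
import pandas as pd


def sketch_create_com_with_idx(df: pd.DataFrame) -> str:
    """Hypothetical sketch: index each comment, close it with '::', join into one block."""
    # Assumed layout per comment: "<index>:: <comment text> ::"
    parts = [f"{i}:: {text} ::" for i, text in enumerate(df["comments"])]
    # A single space between comments keeps the block easy to split again later
    return " ".join(parts)


example_df = pd.DataFrame({"comments": ["a b", "c"]})
example_block = sketch_create_com_with_idx(example_df)
```

Because every comment carries its own index and a closing `::`, the block can later be cut back into the original comments with a regular expression.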

create_text_block example

make_5k_sections

 make_5k_sections (df)

This function assigns a section number to each row so that rows can be grouped later and each resulting text block stays under 5k characters.
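One way to assign such section numbers is a greedy running total, sketched below. This is an assumption about the approach, not the library's code: the function name, the `section` column, and the `limit` parameter are all hypothetical.

```python
import pandas as pd


def sketch_make_5k_sections(df: pd.DataFrame, limit: int = 5000) -> pd.DataFrame:
    """Hypothetical sketch: give rows a 'section' number so joined sections stay under `limit`."""
    out = df.copy()
    sections, current, used = [], 0, 0
    for text in out["comments"]:
        size = len(text) + 1  # +1 for the space used when comments are joined
        # Start a new section when adding this comment would reach the limit
        if used and used + size > limit:
            current += 1
            used = 0
        sections.append(current)
        used += size
    out["section"] = sections
    return out
```

A single comment longer than `limit` still gets a section of its own in this sketch, so very long comments would need to be truncated or split beforehand.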

create_one_large_text_block

 create_one_large_text_block (df)

create_all_text_blocks

 create_all_text_blocks (df)

This function creates multiple text blocks, each shorter than 5k characters.
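As a rough illustration, rows that already carry a section number can be joined per section. The sketch below assumes a hypothetical `section` column and comments that are already in the indexed `N:: text ::` form; it is not the library's implementation.

```python
import pandas as pd


def sketch_create_all_text_blocks(df: pd.DataFrame,
                                  text_col: str = "comments",
                                  section_col: str = "section") -> list:
    """Hypothetical sketch: join the comments of each section into one text block."""
    # groupby keeps the rows of each section in their original order
    return df.groupby(section_col, sort=True)[text_col].agg(" ".join).tolist()


toy = pd.DataFrame({
    "comments": ["0:: a ::", "1:: b ::", "2:: c ::"],
    "section": [0, 0, 1],
})
blocks = sketch_create_all_text_blocks(toy)
```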

text_block = create_one_large_text_block(df=coms); text_block[:250]

# show_doc(create_text_block)

Get Start and End Doc Indices

Here we’ll use regex to find the character indices of each doc or comment within the text block. This tells us which doc or comment is the parent of each biomedical term returned from BERN2.

get_comment_spans_textblock

 get_comment_spans_textblock (text_block:str)

This function returns a DataFrame with the start, end, and span of each comment/doc in text_block.

	Type	Details
text_block	str	single block of text in this structure: `'07 text text text. ::'`

dfi = get_comment_spans_textblock(text_block); dfi

	text	start	end	span
0	0:: autophagy maintains tumour growth through ...	0.0	80.0	(0, 80)
1	1:: x-rays were negative and physical assessme...	81.0	305.0	(81, 305)
2	2:: it is a skin disease causing much itchines...	306.0	450.0	(306, 450)
3	3:: maybe its a tumour. maybe take some tyleno...	451.0	541.0	(451, 541)
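The regex step can be sketched as follows. The pattern and the function name are assumptions chosen to match the `N:: text ::` layout of the example; the library's actual pattern may differ.

```python
import re

import pandas as pd


def sketch_comment_spans(text_block: str) -> pd.DataFrame:
    """Hypothetical sketch: locate each 'N:: text ::' comment inside the text block."""
    # Lazy match from an index marker ("12:: ") up to the first closing " ::"
    pattern = re.compile(r"\d+:: .*? ::", flags=re.DOTALL)
    rows = [
        {"text": m.group(0), "start": m.start(), "end": m.end(), "span": m.span()}
        for m in pattern.finditer(text_block)
    ]
    return pd.DataFrame(rows, columns=["text", "start", "end", "span"])


spans = sketch_comment_spans("0:: a b :: 1:: c ::")
```

A comment that itself contains ` ::` would end early under this pattern, so the separator should be something that does not occur in the raw comments.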

Send Combined Text to BERN2

Query Function:

This function sends our text to BERN2 through its API and gets the text back labeled.

text_block

"0:: autophagy maintains tumour growth through circulating the great arginine. :: 1:: x-rays were negative and physical assessment determined soft tissue damage to the lateral aspect of her ankle. she was initially treated with ice, an ace wrap, crutches and mild pain medications ,tylenol with codeine, :: 2:: it is a skin disease causing much itchiness. scratching leads to redness, swelling, cracking, weeping clear fluid, crusting, and scaling. :: 3:: maybe its a tumour. maybe take some tylenol. don't worry i'm not a doctor. i'm dave ::"

query_plain

 query_plain (text:str, url='http://bern2.korea.ac.kr/plain')

This function sends your text_block to the BERN2 API and returns a JSON object of labeled biomedical terms from text_block with their indices.

	Type	Default	Details
text	str		single block of biomedical text
url	str	http://bern2.korea.ac.kr/plain	the api address
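The body of `query_plain` is not reproduced here. A minimal sketch of such a request using only the standard library is given below; it assumes the endpoint accepts a JSON body of the form `{"text": ...}` via POST, and the names `build_plain_request` and `sketch_query_plain` are hypothetical.

```python
import json
import urllib.request

BERN2_URL = "http://bern2.korea.ac.kr/plain"  # default address from the table above


def build_plain_request(text: str, url: str = BERN2_URL) -> urllib.request.Request:
    """Hypothetical helper: wrap the text in a JSON POST request (assumed payload shape)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sketch_query_plain(text: str, url: str = BERN2_URL, timeout: float = 30.0) -> dict:
    """Hypothetical sketch: send the request and decode the JSON reply."""
    with urllib.request.urlopen(build_plain_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not contact the server
req = build_plain_request("maybe take some tylenol")
```

Keeping request construction separate from sending makes it possible to inspect or test the payload without calling the public server.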

# show_doc(query_plain)

Example of query_plain Output

Send it all to Bern2

# # Send it all to Bern2
# output = query_plain(text=text_block);

An example of the BERN2 output

output = {'annotations': [{'id': ['mesh:D009369'],
   'is_neural_normalized': False,
   'mention': 'tumour',
   'obj': 'disease',
   'prob': 0.9999957084655762,
   'span': {'begin': 23, 'end': 29}},
  {'id': ['mesh:D001120'],
   'is_neural_normalized': False,
   'mention': 'arginine',
   'obj': 'drug',
   'prob': 0.9939362406730652,
   'span': {'begin': 67, 'end': 75}},
  {'id': ['mesh:D000082'],
   'is_neural_normalized': False,
   'mention': 'tylenol',
   'obj': 'drug',
   'prob': 0.9972689747810364,
   'span': {'begin': 278, 'end': 285}},
  {'id': ['mesh:D003061'],
   'is_neural_normalized': False,
   'mention': 'codeine',
   'obj': 'drug',
   'prob': 0.947392463684082,
   'span': {'begin': 291, 'end': 298}},
  {'id': ['mesh:D012871'],
   'is_neural_normalized': False,
   'mention': 'skin disease',
   'obj': 'disease',
   'prob': 0.9998037815093994,
   'span': {'begin': 313, 'end': 325}},
  {'id': ['mesh:D011537'],
   'is_neural_normalized': False,
   'mention': 'itchiness',
   'obj': 'disease',
   'prob': 0.9898108243942261,
   'span': {'begin': 339, 'end': 348}},
  {'id': ['mesh:D000080822'],
   'is_neural_normalized': False,
   'mention': 'redness',
   'obj': 'disease',
   'prob': 0.9481215476989746,
   'span': {'begin': 370, 'end': 377}},
  {'id': ['mesh:D004487'],
   'is_neural_normalized': True,
   'mention': 'swelling',
   'obj': 'disease',
   'prob': 0.9774566292762756,
   'span': {'begin': 379, 'end': 387}},
  {'id': ['mesh:D012135'],
   'is_neural_normalized': True,
   'mention': 'cracking',
   'obj': 'disease',
   'prob': 0.8271865248680115,
   'span': {'begin': 389, 'end': 397}},
  {'id': ['mesh:D002862'],
   'is_neural_normalized': True,
   'mention': 'crusting',
   'obj': 'disease',
   'prob': 0.9943530559539795,
   'span': {'begin': 420, 'end': 428}},
  {'id': ['mesh:D012871'],
   'is_neural_normalized': True,
   'mention': 'scaling',
   'obj': 'disease',
   'prob': 0.9980024695396423,
   'span': {'begin': 434, 'end': 441}},
  {'id': ['mesh:D009369'],
   'is_neural_normalized': False,
   'mention': 'tumour',
   'obj': 'disease',
   'prob': 0.9999805688858032,
   'span': {'begin': 460, 'end': 466}},
  {'id': ['mesh:D000082'],
   'is_neural_normalized': False,
   'mention': 'tylenol',
   'obj': 'drug',
   'prob': 0.9799597263336182,
   'span': {'begin': 484, 'end': 491}}],
 'text': "0: autophagy maintains tumour growth through circulating the great arginine.:: 1: x-rays were negative and physical assessment determined soft tissue damage to the lateral aspect of her ankle. she was initially treated with ice, an ace wrap, crutches and mild pain medications ,tylenol with codeine,:: 2: it is a skin disease causing much itchiness. scratching leads to redness, swelling, cracking, weeping clear fluid, crusting, and scaling.:: 3: maybe its a tumour. maybe take some tylenol. don't worry i'm not a doctor. i'm dave::",
 'timestamp': 'Mon Nov 14 18:00:04 +0000 2022'}

Example of overall df

pd.DataFrame(output)[-3:-1]

	annotations	text	timestamp
10	{'id': ['mesh:D012871'], 'is_neural_normalized...	0: autophagy maintains tumour growth through c...	Mon Nov 14 18:00:04 +0000 2022
11	{'id': ['mesh:D009369'], 'is_neural_normalized...	0: autophagy maintains tumour growth through c...	Mon Nov 14 18:00:04 +0000 2022

Example of df from just the annotations column

pd.DataFrame(output['annotations'])[-3:-1]

	id	is_neural_normalized	mention	obj	prob	span
10	[mesh:D012871]	True	scaling	disease	0.998002	{'begin': 434, 'end': 441}
11	[mesh:D009369]	False	tumour	disease	0.999981	{'begin': 460, 'end': 466}

Add BERN2 Labels to Docs

We sent all the separate text documents to BERN2 as one big text document. Now we’ll re-separate the labeled text to show which biomedical terms appeared in which documents.

dfa = pd.DataFrame(output['annotations'])  # dfa: DataFrame of annotations

# pull each annotation's end offset out of its span into a str_end column
dfa.span = dfa.span.astype(str)
dfa['str_end'] = dfa.span.str.replace(r".*'end': (\d+)}", r"\1", regex=True)

dfa.str_end = dfa.str_end.astype(int)  # cast str_end from str to int

# expose the dfi row index as an explicit dfi_idx column
dfi.reset_index(inplace=True)
dfi.rename(columns={'index': 'dfi_idx'}, inplace=True)

for o, m in zip(dfi.index, dfi.span):  # add dfi_idx col to dfa
    x, y = m  # unpack the (start, end) span tuple
    # select annotations whose end offset falls inside this comment's span
    conds = (dfa.str_end > x) & (dfa.str_end < y)
    dfa.loc[conds, 'dfi_idx'] = o  # record the index of the matching dfi row

# dfa.merge(dfi, left_on='dfi_idx',right_index=True)
# join each annotation to its parent comment
df = dfa.merge(dfi, left_on='dfi_idx', right_on='dfi_idx');  df.head(2)

	id	is_neural_normalized	mention	obj	prob	span_x	str_end	dfi_idx	text	start	end	span_y
0	[mesh:D009369]	False	tumour	disease	0.999996	{'begin': 23, 'end': 29}	29	0.0	0:: autophagy maintains tumour growth through ...	0.0	80.0	(0, 80)
1	[mesh:D001120]	False	arginine	drug	0.993936	{'begin': 67, 'end': 75}	75	0.0	0:: autophagy maintains tumour growth through ...	0.0	80.0	(0, 80)
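For larger inputs, the span lookup could also be done without a Python loop. The sketch below is an alternative, not the method used above: it assumes the comment spans do not overlap and uses `pandas.IntervalIndex` to find the span containing each annotation end offset. The function name is hypothetical.

```python
import pandas as pd


def sketch_assign_parent(comment_spans, annotation_ends):
    """Hypothetical sketch: position of the span containing each end offset, -1 if none."""
    # closed="both" treats the start and end of each span as inside the comment
    intervals = pd.IntervalIndex.from_tuples(list(comment_spans), closed="both")
    return intervals.get_indexer(pd.Index(list(annotation_ends)))


parents = sketch_assign_parent([(0, 80), (81, 305)], [29, 75, 285, 500])
```

An offset that falls outside every span comes back as `-1`, which makes unmatched annotations easy to spot before merging.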

Output df

df = df[['dfi_idx','mention','obj','text']]; df[-4:-1]

	dfi_idx	mention	obj	text
9	2.0	crusting	disease	2:: it is a skin disease causing much itchines...
10	2.0	scaling	disease	2:: it is a skin disease causing much itchines...
11	3.0	tumour	disease	3:: maybe its a tumour. maybe take some tyleno...

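Show Number of Comments That Contain Each Term

# count distinct comments per term, so a term repeated within one comment is counted once
dfwords = df.groupby(['mention']).dfi_idx.nunique().reset_index().sort_values('dfi_idx',ascending=False)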
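When tallying terms per comment, `count()` and `nunique()` can give different answers: the first counts every mention, the second counts distinct comments. A small sketch on made-up data (the `toy` frame is purely illustrative):

```python
import pandas as pd

# Toy data: "tumour" is mentioned twice in comment 0, "tylenol" once in comment 1
toy = pd.DataFrame({
    "dfi_idx": [0, 0, 1],
    "mention": ["tumour", "tumour", "tylenol"],
})

mentions_per_term = toy.groupby("mention").dfi_idx.count()    # every mention counts
comments_per_term = toy.groupby("mention").dfi_idx.nunique()  # distinct comments only
```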

# !pip install seaborn

# import seaborn as sns
# import matplotlib.pyplot as plt

# # Set the width and height of the figure
# plt.figure(figsize=(8,6))

# ax = sns.barplot(x=dfwords.dfi_idx, y=dfwords.mention)

# #title
# ax.set_title(f'Biomedical Terms in Comments')

# # Add label for  axis
# ax.set(xlabel='Number of commenters mentioning the term')
# plt.show()