An introduction to natural language processing is a good starting point for students who wish to bridge the gap between human communication and machine understanding. Natural language processing is widely used in artificial intelligence and machine learning, and its adoption is expected to grow in the coming years, along with job opportunities. Students preparing for natural language processing (NLP) interviews should have a solid understanding of the types of questions that get asked.
1. Discuss real-life apps based on Natural Language Processing (NLP).
Chatbot: Businesses and companies have realized the importance of chatbots, as they help maintain good communication with customers; any query a chatbot fails to resolve gets forwarded to a human agent. Because chatbots are available 24/7, they keep the business moving. This feature is built on natural language processing.
Google Translate: Spoken words or written text can be converted into another language, and the proper pronunciation of words is also available. Google Translate makes use of advanced NLP to make all of this possible.
2. What is meant by NLTK?
The Natural Language Toolkit (NLTK) is a Python library for processing human language. It offers techniques such as tokenization, stemming, parsing, and lemmatization for understanding language, and it is also used for text classification and assessing documents. Some of its modules and resources include DefaultTagger, WordNet, and the treebank corpus, among others.
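For instance, here is a minimal sketch of tokenization and lemmatization with NLTK (it assumes the required resources, such as punkt and wordnet, have already been downloaded via nltk.download):
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
## Assumes nltk.download('punkt') and nltk.download('wordnet') have been run
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The children are playing with their toys.")
print(tokens)
## Lemmatization reduces words to dictionary form, e.g. 'children' -> 'child', 'toys' -> 'toy'
print([lemmatizer.lemmatize(t) for t in tokens])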
3. Explain parts of speech tagging (POS tagging).
POS tagging, also known as parts-of-speech tagging, is implemented to assign tags to words, such as verbs, nouns, or adjectives. It allows the software to understand the text and then recognize differences between words using algorithms. The purpose is to make the machine comprehend sentences correctly.
Example:-
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
## Requires NLTK resources: punkt, stopwords, averaged_perceptron_tagger
stop_words = set(stopwords.words('english'))
txt = "A, B, C are longtime classmates."
## Tokenize into sentences via sent_tokenize
tokenized_text = sent_tokenize(txt)
## Use word_tokenize to split each sentence into words, then drop stop words and punctuation
for sentence in tokenized_text:
    words_list = nltk.word_tokenize(sentence)
    words_list = [w for w in words_list if w.isalpha() and w not in stop_words]
    ## Apply the POS tagger
    tagged_words = nltk.pos_tag(words_list)
    print(tagged_words)
Output:-
[('A', 'NNP'), ('B', 'NNP'), ('C', 'NNP'), ('longtime', 'JJ'), ('classmates', 'NNS')]
4. Define pragmatic analysis
Human language data carries many different meanings, so pragmatic analysis is used to discover the different facets of a text or document. It is deployed so that systems can understand the actual meaning of words and sentences in context.
5. Elaborate on Natural language processing components
These are the major NLP components:-
1. Lexical/morphological analysis: the structure of words is made comprehensible through analysis and parsing.
2. Syntactic analysis: the grammatical arrangement of words in a sentence is assessed.
3. Entity extraction: information such as places, institutions, and individuals is retrieved by dissecting sentences; the entities present in a sentence get identified.
4. Pragmatic analysis: helps find the real meaning and relevance behind sentences (a short spaCy sketch after this list illustrates several of these components).
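As a rough illustration, a single spaCy pipeline (assuming the en_core_web_sm model is installed) exposes lexical, syntactic, and entity-level analyses:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Sundar Pichai leads Google in California.")
## Lexical/morphological analysis: each word reduced to its lemma
print([(token.text, token.lemma_) for token in doc])
## Syntactic analysis: dependency labels from the parser
print([(token.text, token.dep_) for token in doc])
## Entity extraction: named entities detected in the sentence
print([(ent.text, ent.label_) for ent in doc.ents])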
6. List the steps in NLP problem-solving
The steps in NLP problem-solving include:-
1. Web scraping or collecting the texts from the dataset.
2. Clean the text using stemming and lemmatization.
3. Apply feature engineering.
4. Embed the words using word2vec (see the sketch after this list).
5. Train models using machine learning techniques or neural networks.
6. Assess the model's performance.
7. Make the required model modifications and deploy.
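A minimal word2vec embedding sketch with gensim (assuming gensim 4.x is installed; the tiny corpus here is purely illustrative):
from gensim.models import Word2Vec
## Toy corpus: a list of tokenized sentences (illustrative only)
sentences = [
    ["students", "prepare", "for", "nlp", "interviews"],
    ["nlp", "models", "learn", "word", "embeddings"],
]
## Train a small word2vec model (gensim 4.x API)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
## Every word in the vocabulary now maps to a 50-dimensional vector
print(model.wv["nlp"].shape)
print(model.wv.most_similar("nlp"))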
7. Elaborate on stemming with examples
When a root word is obtained by removing the suffix or prefix attached to it, that process is known as stemming. For instance, the word ‘playing’ can be reduced to ‘play’ by removing the rest.
Different algorithms are deployed for implementing stemming, for example PorterStemmer, which can be imported from NLTK as follows:-
from nltk.stem import PorterStemmer
pst = PorterStemmer()
print((pst.stem("running"), pst.stem("cookies"), pst.stem("flying")))
Output:-
('run', 'cooki', 'fli')
8. Define and implement named entity recognition
Named entity recognition (NER) is used for retrieving information and identifying the entities present in data, for instance locations, times, figures, things, objects, and individuals. NER is used in AI, NLP, and machine learning, and is implemented to make the software understand what a text means. Chatbots are a real-life example of software that makes use of NER.
Implementing NER with the spacy package:-
import spacy
nlp = spacy.load('en_core_web_sm')
Text = "The head office of Tesla is in California"
document = nlp(text)
for ent in document.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output:-
Tesla 19 24 ORG
California 31 41 GPE
9. Explain checking word similarity with the spacy package
The spacy library allows the implementation of word similarity techniques for detecting similar words. The evaluation produces a number between 0 and 1, where values near 0 mean less similar and values near 1 mean highly similar.
import spacy
nlp = spacy.load('en_core_web_md')
print ("Enter the words:")
input_words = input()
tokens = nlp(input_words)
for i in tokens:
print(i.text, i.has_vector, i.vector_norm, i.is_oov)
token_1, token_2 = tokens[0], tokens[1]
print("Similarity between words:", token_1.similarity(token_2))
Output:-
hot True 5.6898586 False
cold True 6.5396233 False
Similarity between words: 0.597265
This implies that the similarity between the words hot and cold is roughly 60%.
10. Describe recall and precision. Also, explain TF-IDF.
Precision and recall
Precision, recall, F1 score, and accuracy are metrics used for testing NLP models. Accuracy is the ratio of correct predictions to the total number of predictions.
Precision: the ratio of true positive instances to all instances predicted as positive, i.e. TP / (TP + FP).
Recall: the ratio of true positive instances to all actual positive instances (true positives + false negatives), i.e. TP / (TP + FN).
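As a quick sketch (assuming scikit-learn is available; the labels below are toy values), these metrics can be computed from true and predicted labels:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
## Toy binary labels, purely illustrative: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))    ## correct / total
print("Precision:", precision_score(y_true, y_pred))  ## TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))        ## TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))                ## harmonic mean of precision and recall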
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a numerical statistic used in information retrieval. It helps identify the keywords present in a document; its main use is retrieving important terms from documents using statistical data. It is also useful for filtering out stop words, and for text summarization and classification. TF measures how frequently a term occurs in a document relative to the total number of terms in the document, whereas IDF measures how significant the term is across the collection of documents.
TF-IDF calculation formula:
TF = frequency of term ‘W’ in a document / total terms in the document
IDF = log( total documents / total documents with the term ‘W’)
A high TF*IDF score implies that a term is frequent in the given document but rare across the whole collection; terms that appear in many documents receive a low IDF and therefore a low score.
Google implements TF-IDF for deciding its search results index, which helps in ranking relevant, quality content higher.
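A minimal pure-Python sketch of these formulas (illustrative only; libraries such as scikit-learn's TfidfVectorizer use smoothed variants of IDF):
import math
## Toy corpus of tokenized documents (illustrative only)
documents = [
    ["nlp", "models", "process", "text"],
    ["text", "classification", "uses", "tfidf"],
    ["word", "embeddings", "represent", "text"],
]
def tf_idf(term, document, documents):
    ## TF = frequency of term 'W' in a document / total terms in the document
    tf = document.count(term) / len(document)
    ## IDF = log(total documents / total documents with the term 'W')
    docs_with_term = sum(1 for d in documents if term in d)
    idf = math.log(len(documents) / docs_with_term)
    return tf * idf
## 'text' appears in every document, so IDF = log(1) = 0 and the score is 0
print(tf_idf("text", documents[0], documents))
## 'nlp' appears in only one document, so it scores higher
print(tf_idf("nlp", documents[0], documents))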