The hidden Markov model, or HMM for short, is a probabilistic sequence model that assigns a label to each unit in a sequence of observations. The model computes a probability distribution over possible sequences of labels and chooses the label sequence that maximizes the probability of generating the observed sequence. The HMM is widely used in natural language processing since language consists of sequences at many levels, such as sentences, phrases, words, or even characters. This post presents the application of hidden Markov models to a classic problem in natural language processing called part-of-speech tagging, explains the key algorithm behind a trigram HMM tagger, and evaluates various trigram HMM-based taggers on a subset of a large real-world corpus.

Part-of-speech tagging, or POS tagging, is the process of assigning a part-of-speech marker to each word in an input text. Tags are applied not only to words but also to punctuation, so we often tokenize the input text as a preprocessing step, separating non-words like commas and quotation marks from words, and distinguishing end-of-sentence punctuation such as the period and exclamation point from part-of-word punctuation in abbreviations like i.e. POS tagging is extremely useful in text-to-speech; for example, the word "read" is pronounced in two different ways depending on its part of speech in a sentence.

A tagging algorithm receives as input a sequence of words and the set of all tags that a word can take, and outputs a sequence of tags. The algorithm works to resolve ambiguities by choosing the tag that best represents the syntax and the semantics of the sentence. When someone says "I just remembered that I forgot to bring my phone," the word "that" works grammatically as a complementizer that connects two sentences into one, whereas in the sentence "Does that make you feel sad," the same word works as a determiner, just like "the," "a," and "an." A highly accurate POS tagger is a must to avoid assigning the wrong tag to such potentially ambiguous words, since a wrong tag makes it difficult to solve the more sophisticated problems in natural language processing that build upon POS tagging, ranging from named-entity recognition to question answering.

In the following sections, we are going to build a trigram HMM POS tagger and evaluate it on a real-world text called the Brown corpus, a million-word sample from 500 texts in different genres published in 1961 in the United States. We train the trigram HMM POS tagger on a subset of the Brown corpus containing nearly 27500 tagged sentences and evaluate it on the development test set, or devset, Brown_dev.txt. The accuracy of the tagger is measured by comparing the predicted tags with the true tags in Brown_tagged_dev.txt. The baseline against which the performance of the various trigram HMM taggers is measured is the Most Frequent Tag algorithm, which tags each word token in the devset with the tag it occurred with most often in the training set.

It is useful to know as a reference how the part-of-speech tags are abbreviated, and the following table lists a few important part-of-speech tags and their corresponding descriptions. Notice how the Brown training corpus uses a slightly different notation than the standard part-of-speech notation in the table above. Each sentence is a string of space-separated WORD/TAG tokens, with a newline character at the end. Here is an example sentence from the Brown training corpus:

At/ADP that/DET time/NOUN highway/NOUN engineers/NOUN traveled/VERB rough/ADJ and/CONJ dirty/ADJ roads/NOUN to/PRT accomplish/VERB their/DET duties/NOUN.

Let's now discuss the method for building a trigram HMM POS tagger.
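Before moving on, the WORD/TAG format and the Most Frequent Tag baseline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in this post: the function names, the tiny training and dev sentences, and the choice to give unseen words the overall most frequent tag are all assumptions made for the example. In practice the lines would be read from the corpus files rather than hard-coded.

```python
from collections import Counter, defaultdict


def parse_tagged_line(line):
    """Split one 'WORD/TAG WORD/TAG ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so a word that itself contains '/' stays whole.
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
    return pairs


def train_most_frequent_tag(tagged_sentences):
    """Return (word -> most frequent tag, overall most frequent tag)."""
    tags_per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tags_per_word[word][tag] += 1
            all_tags[tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in tags_per_word.items()}
    # Assumption: unseen words fall back to the overall most frequent tag.
    default_tag = all_tags.most_common(1)[0][0]
    return model, default_tag


def tag_words(words, model, default_tag):
    """Tag each word independently of its context."""
    return [model.get(word, default_tag) for word in words]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the true tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Invented toy data in the WORD/TAG format (not taken from the Brown corpus).
train_lines = [
    "the/DET dog/NOUN runs/VERB ./.",
    "that/DET dog/NOUN barks/VERB ./.",
    "I/PRON know/VERB that/ADP dogs/NOUN bark/VERB ./.",
    "that/DET bark/NOUN ended/VERB ./.",
    "dogs/NOUN bark/VERB ./.",
]
dev_line = "I/PRON know/VERB that/ADP that/DET dog/NOUN barks/VERB ./."

model, default_tag = train_most_frequent_tag(
    [parse_tagged_line(line) for line in train_lines]
)
dev_pairs = parse_tagged_line(dev_line)
dev_words = [word for word, _ in dev_pairs]
gold_tags = [tag for _, tag in dev_pairs]

predicted_tags = tag_words(dev_words, model, default_tag)
acc = accuracy(predicted_tags, gold_tags)
print(predicted_tags)
print(acc)
```

On this toy data the baseline tags both occurrences of "that" as DET, because that is the tag it saw most often in training, so the complementizer use is mislabeled. Since the baseline ignores context, it cannot resolve this kind of ambiguity, which is what the trigram HMM tagger is meant to improve on.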