Course name: Natural Language Processing (自然語言處理)
Course type: Elective
Instructor: 陳信希 (Hsin-Hsi Chen)
College: 電資學院 (Electrical Engineering and Computer Science)
Departments/programs: 資工所, 網媒所, 知識管理學程
Exam date (Y/M/D): 2015.04.23
Time limit: 180 minutes
Exam:

Natural Language Processing Mid-Term Exam 2015/4/23

Problem 1
What is the prepositional phrase attachment problem in natural language processing? (5 points)

Answer: (See Lec.1 p.49; Lec.2 p.11-12.)
Natural language processing has to deal with many kinds of language ambiguity, such as lexical ambiguity and structural ambiguity. PP attachment is one source of structural ambiguity: a prepositional phrase can attach to a sentence, a verb phrase, a noun phrase, or an adjectival phrase, and more than one attachment may be grammatical. For example, in "I observed the boy with a telescope", [with a telescope] is a PP with two possible attachments:
(1) The PP attaches to [the boy] to form an NP first, which then combines with [observed] to form a VP (the boy has the telescope).
(2) [observed] and [the boy] form a VP first, and the PP attaches to that VP to form a larger VP (the telescope is the instrument used for observing).
Note that: if you mention "ambiguity" or "ambiguous", you already get one point; if you mention "belong to" or "attachment" but your explanation is incomplete, you may get an extra one to two points; finally, the completeness of your answer is worth two points. Giving a proper example usually indicates a good understanding of what the PP attachment problem is and can therefore earn you a higher score.

Problem 2
Non-local dependency is a challenge for the n-gram approach. Please explain why. (5 points)

Answer: (See Lec.2 p.18; Lec.4 p.16-18.)
A non-local dependency is an instance in which two words are syntactically dependent even though they occur far apart in a sentence (e.g., subject-verb agreement, or long-distance dependencies such as wh-extraction). An n-gram model, however, only conditions on the previous n-1 words because of the Markov assumption. If n is not large enough, the model cannot capture this linguistic phenomenon; if n is too large, the number of n-grams grows exponentially and the model becomes impractical to estimate.
Note that: explaining what a non-local dependency is is worth two points, explaining why it is a challenge is worth two points, and the completeness of the answer is worth one point.

Problem 3
Please use the sentence "the boys grill their catches on an open fire" to explain the semantic role labelling problem. (5 points)

Answer:
Semantic role labelling, sometimes also called shallow semantic parsing, is the natural language processing task of detecting the semantic arguments associated with the predicate (verb) of a sentence and classifying them into their specific roles. For example:
[Cook the boys] grill [Food their catches] [Heating_instrument on an open fire].

Problem 4
Please name 3 linguistic resources. (5 points)

Answer:
Corpora and lexical resources:
1. Academia Sinica Balanced Corpus
2. Penn Chinese Treebank
3. Sinica Treebank
4. Sinica Bilingual WordNet
5. WordNet
6. FrameNet
7. DBpedia
8. etc.
or tools:
1. NLTK
2. Jieba Chinese Segmenter
3. Stanford parser
4. CKIP parser
5. etc.

Problem 5
Pointwise mutual information (PMI) is defined as follows:
I(x, y) = log [ P(x, y) / (P(x) P(y)) ]
(a) Please describe the association between x and y by PMI. (5 points)
(b) In which case is PMI arguable? Please describe how to deal with this problem. (5 points)

Answer: (See Lec.3 p.52, 61, 63-66, 69-70.)
(a) If x and y are independent, PMI is 0. If x and y are positively correlated, PMI is greater than 0; if they are negatively correlated, PMI is less than 0.
Note that: if you only mention the perfectly independent and perfectly dependent cases, you can only get three to four points, depending on the completeness of your answer.
(b) When x and y are perfectly dependent, P(x, y) = P(x) = P(y), so
I(x, y) = log [ P(x, y) / (P(x) P(y)) ] = log [ P(x) / (P(x) P(y)) ] = log [ 1 / P(y) ].
As perfectly dependent bigrams get rarer, their mutual information increases, so raw PMI is a bad measure of the strength of dependence: the score depends on the frequency of the individual words. Solution: weight PMI by the word count (frequency), i.e., use frequency * PMI.
Note that: entropy, smoothing, and the log-likelihood ratio are also regarded as correct alternatives, but the χ^2 test gets a one-point deduction, since likelihood ratios are more appropriate for sparse data than the χ^2 test according to the slide (Lec.3 p.52).
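The behaviour described in (a), and the rare-bigram problem and frequency-weighting fix described in (b), can be checked with a small numerical sketch; the Python snippet below uses made-up probabilities and base-2 logarithms.

import math

def pmi(p_xy, p_x, p_y):
    # Pointwise mutual information: I(x, y) = log2( P(x, y) / (P(x) * P(y)) )
    return math.log2(p_xy / (p_x * p_y))

# (a) Independent pair: P(x, y) = P(x) * P(y)  ->  PMI = 0
print(pmi(0.01, 0.1, 0.1))          # 0.0
# (a) Positively correlated pair: P(x, y) > P(x) * P(y)  ->  PMI > 0
print(pmi(0.05, 0.1, 0.1))          # ~2.32
# (a) Negatively correlated pair: P(x, y) < P(x) * P(y)  ->  PMI < 0
print(pmi(0.002, 0.1, 0.1))         # ~-2.32

# (b) Two perfectly dependent bigrams, one frequent and one rare.
# Both always co-occur (P(x, y) = P(x) = P(y)), yet the rarer pair gets the
# higher score, so raw PMI rewards rarity instead of strength of association.
print(pmi(0.01, 0.01, 0.01))        # log2(1/0.01)   = ~6.64
print(pmi(0.0001, 0.0001, 0.0001))  # log2(1/0.0001) = ~13.29

# The fix mentioned above: weight PMI by the observed co-occurrence count.
def freq_weighted_pmi(count_xy, p_xy, p_x, p_y):
    return count_xy * pmi(p_xy, p_x, p_y)

print(freq_weighted_pmi(500, 0.01, 0.01, 0.01))      # ~3322: the frequent pair now wins
print(freq_weighted_pmi(2, 0.0001, 0.0001, 0.0001))  # ~26.6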
Problem 6
Log-likelihood ratio is one of the measures for relation discovery. Please define it first and then use it to determine the polarity of a potential opinionated word in a corpus for a specific domain such as hotel and restaurant. (10 points)

Answer: (See Lec.3 p.52-56 for the first part of the answer.)
To determine the polarity of a potential opinionated word, we first choose some seed words with known polarity, e.g., nice, bad. We then compute the log-likelihood ratio between each candidate word and the seed words. If a candidate is highly associated with seed words of one specific polarity, the candidate is likely to have the same polarity as those seed words.

Alternative answer: (See Lec.3 p.59-60 for the first part of the answer.)
To determine the polarity of a potential opinionated word, we first divide the hotel/restaurant corpus into two sets according to polarity: a positive set and a negative set. We then compute the likelihood ratio for the candidate word between the two sub-corpora. Words with extreme likelihood-ratio values are likely to share the polarity of the sub-corpus in which they occur more often.

Problem 7
In authorship verification, you have to answer whether a given document was written by a specific author. Please describe how language modelling is employed in this application. (5 points)

Answer: (See Lec.4 p.30, 31, 40-42; Lec.3 p.29, 51.)
To determine whether a document was written by a specific author, we can take various approaches:
(1) Train a language model from the previous works of that author and use it to score the given document, in the spirit of Shannon's method. If the probability is higher than a certain threshold, or the perplexity is low enough, the document can be attributed to that author.
(2) Train two language models, one from the previous works of that author and the other from the given document, and then compare the distance between the two distributions, e.g., with KL-divergence or with a statistical test such as the chi-square test or a t-test, to decide whether the two distributions are statistically different from each other.
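A minimal sketch of approach (1), using an add-one-smoothed unigram language model; the corpora, the unknown-word handling, and the threshold below are all made up for illustration, and a real system would use higher-order n-grams and tune the threshold on held-out documents.

import math
from collections import Counter

def train_unigram(tokens):
    # Unigram LM with add-one smoothing, estimated from the author's known works.
    counts = Counter(tokens)
    total = len(tokens)
    vocab = len(counts) + 1                       # +1 bucket for unseen words
    def prob(w):
        return (counts[w] + 1) / (total + vocab)  # add-one (Laplace) smoothing
    return prob

def perplexity(prob, tokens):
    # Perplexity of a document under the model: lower means a better fit.
    log_sum = sum(math.log2(prob(w)) for w in tokens)
    return 2 ** (-log_sum / len(tokens))

# Hypothetical data; in practice these would be full corpora of the author's works.
known_works = "the old man walked slowly along the quiet harbour".split()
disputed    = "the old harbour was quiet".split()

model = train_unigram(known_works)
pp = perplexity(model, disputed)

THRESHOLD = 50.0  # made-up cut-off; would be tuned on documents of known authorship
print("perplexity:", round(pp, 2), "-> attribute to the author?", pp < THRESHOLD)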
Problem 8
Please discuss why the maximum likelihood estimate is not suitable for language modelling. (5 points)

Answer:
MLE is usually unsuitable for NLP because of the sparseness of the data. For example, if we use a trigram model with a 60,000-word vocabulary, there are 60,000^3 = 2.16 * 10^14 possible trigrams, but a large training set may contain only 200 million words. Many trigrams will not be observed in the training data, so MLE gives them zero probability. However, many of these trigrams are reasonable sequences that have a small but non-zero probability of occurring in the test data. Also, when we estimate the probability of a given sentence, we multiply the probabilities of all its trigrams, so if one trigram has zero probability, the whole sentence gets zero probability as well. Thus, we usually use smoothing to redistribute probability mass to unseen events.

Problem 9
Assume you are fishing and you have caught 10 carp, 3 perch, 2 whitefish, 2 catfish, 1 trout, 1 salmon, and 1 eel.
(a) How likely is it that the next fish to be caught is a catfish, if a good tuning method is adopted? (3 points)
(b) How likely is it that the next fish caught will be a member of a newly seen species, if a good tuning method is adopted? (4 points)
(c) Assume bass is the only new species. How likely is it that the next fish to be caught is a bass, if Laplace smoothing is adopted? (3 points)

Answer: (a short numerical check of (a)-(c) appears after Problem 10's answer below)
(a) Good-Turing: c* = (2 + 1) * N_3/N_2 = 3 * 1/2 = 1.5, so Prob = 1.5/20. (See Lec.5 p.41-42.)
(b) N_1/N = 3/20. (See Lec.5 p.45, "To estimate total number of unseen species...")
(c) (0 + 1)/(20 + 8) = 1/28. (See Lec.5 p.20.)

Problem 10
Forward probability and backward probability for an HMM, λ = (A, B, π), are defined recursively as follows, respectively.
(a) Please specify how to compute the probability of a sequence w_1, w_2, ..., w_T without enumerating all the paths that generate the sequence. (5 points)
(b) Please specify how to select the most probable path without enumerating all the paths that generate the sequence. (5 points)
(c) Assume s_i and s_j are two states in an HMM. Given an observation sequence w_1, w_2, ..., w_T, please specify how to compute the count of transitions from s_i to s_j without enumerating all the paths that generate the sequence. (5 points)

Answer:
(a) Use the forward algorithm.
(1) Initialization: α_1(j) = a_(0j) * b_j(o_1), 1≦j≦N
(2) Recursion: α_t(j) = Σ_(i=1..N) α_(t-1)(i) * a_(ij) * b_j(o_t), 1≦j≦N, 1<t≦T
(3) Termination: P(O|λ) = α_T(q_F) = Σ_(i=1..N) α_T(i) * a_(iF)
The sequence probability is P(O|λ).

(b) Use the Viterbi algorithm.
(1) Initialization: v_1(j) = a_(0j) * b_j(o_1), bt_1(j) = 0, 1≦j≦N
(2) Recursion:
v_t(j) = max_(i=1..N) v_(t-1)(i) * a_(ij) * b_j(o_t), 1≦j≦N, 1<t≦T
bt_t(j) = argmax_(i=1..N) v_(t-1)(i) * a_(ij) * b_j(o_t), 1≦j≦N, 1<t≦T
(3) Termination:
The best score: P* = v_T(q_F) = max_(i=1..N) v_T(i) * a_(iF)
The start of the backtrace: q_T* = bt_T(q_F) = argmax_(i=1..N) v_T(i) * a_(iF)
To find the best path, follow the backtrace.

(c) Use the forward and backward algorithms to compute α and β. Then, as in the E-step of the forward-backward (Baum-Welch) algorithm,
ξ_t(i, j) = α_t(i) * a_(ij) * b_j(o_(t+1)) * β_(t+1)(j) / α_T(q_F)
and the expected count of transitions from s_i to s_j is Σ_(t=1..T-1) ξ_t(i, j).
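The three parts of Problem 10 can be made concrete on a toy HMM. The states, probabilities, and observations below are invented for illustration, and the sketch follows the recursions above except that it drops the explicit start/final states q_0 and q_F: initialization uses π directly, and termination simply sums (or maximizes) over the last column.

states = ["Rainy", "Sunny"]
start  = {"Rainy": 0.6, "Sunny": 0.4}                          # pi
trans  = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},               # A = a_ij
          "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit   = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},   # B = b_j(o)
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
obs = ["walk", "shop", "clean"]

def forward(obs):
    # alpha_t(j) = sum_i alpha_(t-1)(i) * a_ij * b_j(o_t)
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({j: sum(alpha[t-1][i] * trans[i][j] for i in states)
                         * emit[j][obs[t]] for j in states})
    return alpha

def backward(obs):
    # beta_t(i) = sum_j a_ij * b_j(o_(t+1)) * beta_(t+1)(j)
    T = len(obs)
    beta = [dict() for _ in range(T)]
    beta[T-1] = {s: 1.0 for s in states}
    for t in range(T - 2, -1, -1):
        beta[t] = {i: sum(trans[i][j] * emit[j][obs[t+1]] * beta[t+1][j]
                          for j in states) for i in states}
    return beta

def sequence_probability(obs):
    # (a) P(O|lambda): sum the last forward column instead of enumerating paths.
    return sum(forward(obs)[-1].values())

def viterbi(obs):
    # (b) Best path: the forward recursion with sum replaced by max, plus a backtrace.
    v  = [{s: start[s] * emit[s][obs[0]] for s in states}]
    bt = [{s: None for s in states}]
    for t in range(1, len(obs)):
        v.append({}); bt.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t-1][i] * trans[i][j])
            v[t][j]  = v[t-1][best_i] * trans[best_i][j] * emit[j][obs[t]]
            bt[t][j] = best_i
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, bt[t][path[0]])
    return path, v[-1][last]

def expected_transitions(obs, si, sj):
    # (c) Expected count of s_i -> s_j: sum_t xi_t(i, j), from alpha, beta and P(O|lambda).
    alpha, beta = forward(obs), backward(obs)
    total = sequence_probability(obs)
    return sum(alpha[t][si] * trans[si][sj] * emit[sj][obs[t+1]] * beta[t+1][sj] / total
               for t in range(len(obs) - 1))

print("P(O|lambda):", sequence_probability(obs))
print("best path  :", viterbi(obs))
print("E[count Rainy->Sunny]:", expected_transitions(obs, "Rainy", "Sunny"))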
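Returning to Problem 9, here is the short numerical check promised above, assuming the standard Good-Turing adjusted count c* = (c + 1) * N_(c+1) / N_c and the Laplace estimate (c + 1) / (N + V).

from collections import Counter

catches = Counter({"carp": 10, "perch": 3, "whitefish": 2, "catfish": 2,
                   "trout": 1, "salmon": 1, "eel": 1})
N  = sum(catches.values())       # 20 fish caught in total
Nc = Counter(catches.values())   # Nc[c] = number of species seen exactly c times

# (a) Good-Turing adjusted count for a species seen c = 2 times.
c = catches["catfish"]
c_star = (c + 1) * Nc[c + 1] / Nc[c]      # 3 * 1/2 = 1.5
print("P(catfish)     =", c_star / N)     # 1.5/20 = 0.075

# (b) Good-Turing probability mass reserved for unseen species: N_1 / N.
print("P(new species) =", Nc[1] / N)      # 3/20 = 0.15

# (c) Laplace smoothing with bass as the single unseen species: V = 7 + 1 = 8.
V = len(catches) + 1
print("P(bass)        =", (0 + 1) / (N + V))   # 1/28 ≈ 0.0357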
Problem 11
Part-of-speech (POS) tagging is often modelled as sequence classification. The Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM) can be adopted to deal with the sequence labelling problem.
(a) Please formulate POS tagging with an HMM. (5 points)
(b) Please formulate POS tagging with an MEMM. (5 points)
(c) Please describe how to get the best POS tag sequence with an MEMM without enumerating all the possible paths. (5 points)

Answer: (See Lec.7 p.27, 28, 40, 56, 57, 59-61.)
(a) POS tagging with an HMM. Each state in the model is the POS tag of a word, and out of all sequences of n tags t_1 ... t_n we want the single tag sequence that makes P(t_1 ... t_n | w_1 ... w_n) highest, that is
^t^n_1 = argmax_(t^n_1) P(t^n_1 | w^n_1).
PS (notation): t^n_1 denotes t with superscript n and subscript 1, i.e., the sequence t_1 ... t_n; ^t denotes t with a hat (the estimated tag sequence); ≈ denotes approximate equality.
By Bayes' rule this can be rewritten as
^t^n_1 = argmax_(t^n_1) P(w^n_1 | t^n_1) P(t^n_1).
Using the Markov assumption, the likelihood and the prior can be rewritten as
P(w^n_1 | t^n_1) ≈ Π_(i=1..n) P(w_i | t_i)
P(t^n_1) ≈ Π_(i=1..n) P(t_i | t_(i-1))
so the POS tagging result is
^t^n_1 ≈ argmax_(t^n_1) Π_(i=1..n) P(w_i | t_i) * P(t_i | t_(i-1)).
The HMM is a generative model that optimizes the likelihood P(W|T).

(b) POS tagging with an MEMM. Unlike the HMM, the MEMM computes the posterior P(T|W) directly:
^T = argmax_T P(T|W) = argmax_T Π_(i=1..n) P(t_i | w_i, t_(i-1))
with
P(t_i | w_i, t_(i-1)) = (1 / Z(w_i, t_(i-1))) * exp( Σ_j θ_j * f_j(w_i, t_i) ),
where the θ_j are trained feature weights and the f_j(w_i, t_i) are features extracted from the word sequence. The MEMM is a discriminative model, trained directly to discriminate among the possible tag sequences.

(c) Use Viterbi with the MEMM to tag POS. This is similar to the HMM case; we simply replace the Viterbi recursion
V_t(j) = max_(i=1..N) V_(t-1)(i) * P(t_j | t_i) * P(w_t | t_j), 1≦j≦N, 1<t≦T
with
V_t(j) = max_(i=1..N) V_(t-1)(i) * P(t_j | t_i, w_t), 1≦j≦N, 1<t≦T,
which computes the Viterbi value at word position t for tag j. We also use a matrix S to store the maximum-probability path (the backpointer) for each tag at each word position:
S_t(j) = argmax_(i=1..N) V_(t-1)(i) * P(t_j | t_i, w_t), 1≦j≦N, 1<t≦T.
In addition, for the corner case t = 1:
V_1(j) = P(t_j | w_1, <s>), 1≦j≦N, where <s> is the start symbol,
S_1(j) = 0, 1≦j≦N.

Problem 12
Most natural language processing is notation transformation. A sentence may be mapped into different forms on different levels with some information added. Given the sentence "Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony, we want to make Dubai a new trading center.", what are the outputs of (a) entity extraction and (b) co-reference resolution? (10 points)

Answer:
(a) Entity extraction: (See Lec.2 p.53.)
(1) Sheikh Mohammed
(2) who
(3) Defense Minister
(4) United Arab Emirates
(5) inauguration ceremony
(6) we
(7) Dubai
(8) trading center
Note that: you should write at least three of these to gain the full score.

(b) Co-reference resolution: (See Lec.2 p.54.)
(1) Sheikh Mohammed = who
(2) who = Defense Minister
(3) United Arab Emirates = we
(4) Dubai = trading center
Note that: you should write at least three of these to gain the full score.
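As a rough illustration of part (a), an off-the-shelf named-entity recognizer can be run over the sentence. The sketch below assumes spaCy with its en_core_web_sm model installed (python -m spacy download en_core_web_sm); note that a statistical NER component returns named entities only, so mentions such as "who", "we", or "trading center" from the slide answer, and the co-reference chains in (b), would need separate mention-detection and co-reference components.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sheikh Mohammed, who is also the Defense Minister of the United Arab "
          "Emirates, announced at the inauguration ceremony, we want to make "
          "Dubai a new trading center.")

# Print each recognized entity mention with its type.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical (model-dependent) output includes: Sheikh Mohammed -> PERSON,
# the United Arab Emirates -> GPE, Dubai -> GPE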
--
※ Posted from: PTT (批踢踢實業坊, ptt.cc), from: 111.250.48.19
※ Article URL: https://www.ptt.cc/bbs/NTU-Exam/M.1433732204.A.5E6.html
rod24574575 : Archived to the CSIE board digest!                06/08 11:00
t0444564 : What a long post!                                    06/10 23:07