~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Conditional Random Field (CRF)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A Conditional Random Field (CRF) is an undirected graphical model, as shown in the figure below. A CRF models the conditional probability of a label sequence *Y* given a sequence of observations *X*, *i.e.* P(Y|X). Because it models this conditional probability rather than the joint probability P(X,Y), a CRF can incorporate complex, overlapping features of the observation sequence without having to make independence assumptions about the observations. P(Y|X) is computed using cliques (fully connected subgraphs) and clique potentials. With one potential function per clique of the graph, the probability of a variable configuration is proportional to the product of a series of non-negative potential functions. Each potential function assigns a non-negative score to a particular configuration of the variables in its clique; these scores become probabilities only after normalization over all possible label sequences.

.. image:: docs/pic/CRF.png
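
As a sketch, the product of clique potentials described above is commonly written as follows. The symbols used here are notation introduced for illustration only: :math:`\mathcal{C}` is the set of cliques, :math:`\psi_c` is the non-negative potential of clique :math:`c`, :math:`Y_c` is the part of the label sequence that falls in that clique, and :math:`Z(X)` is the normalization term.

.. math::

   P(Y \mid X) = \frac{1}{Z(X)} \prod_{c \in \mathcal{C}} \psi_c(Y_c, X),
   \qquad
   Z(X) = \sum_{Y'} \prod_{c \in \mathcal{C}} \psi_c(Y'_c, X)

Dividing by :math:`Z(X)` is what turns the unnormalized product of scores into a probability distribution over label sequences.
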

The example below is adapted from the `sklearn-crfsuite tutorial <http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html>`__.

Let’s use the CoNLL 2002 data to build a NER system.

The CoNLL 2002 corpus is available in NLTK. We use the Spanish data.
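
A minimal sketch of loading the corpus, assuming the ``nltk`` package is installed and the corpus can be downloaded. The helper name ``load_conll2002_spanish`` is hypothetical; ``esp.train`` and ``esp.testb`` are the Spanish file IDs used in the linked tutorial.

```python
def load_conll2002_spanish():
    """Return (train_sents, test_sents); each sentence is a list of
    (token, postag, label) tuples."""
    # Imported lazily so that defining this helper does not require nltk.
    import nltk
    from nltk.corpus import conll2002

    # Fetch the corpus if it is not already available locally.
    nltk.download('conll2002', quiet=True)

    train_sents = list(conll2002.iob_sents('esp.train'))
    test_sents = list(conll2002.iob_sents('esp.testb'))
    return train_sents, test_sents
```

Calling ``train_sents, test_sents = load_conll2002_spanish()`` would provide the two sentence lists that the feature extraction code consumes.
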
sklearn-crfsuite (and python-crfsuite) supports several feature formats; here we use feature dicts.

.. code:: python

    def word2features(sent, i):
        # Each item of sent is a (token, postag, label) tuple.
        word = sent[i][0]
        postag = sent[i][1]

        # Features of the current token.
        features = {
            'bias': 1.0,
            'word.lower()': word.lower(),
            'word[-3:]': word[-3:],
            'word[-2:]': word[-2:],
            'word.isupper()': word.isupper(),
            'word.istitle()': word.istitle(),
            'word.isdigit()': word.isdigit(),
            'postag': postag,
            'postag[:2]': postag[:2],
        }
        if i > 0:
            # Features of the previous token.
            word1 = sent[i-1][0]
            postag1 = sent[i-1][1]
            features.update({
                '-1:word.lower()': word1.lower(),
                '-1:word.istitle()': word1.istitle(),
                '-1:word.isupper()': word1.isupper(),
                '-1:postag': postag1,
                '-1:postag[:2]': postag1[:2],
            })
        else:
            # Beginning of sentence.
            features['BOS'] = True

        if i < len(sent)-1:
            # Features of the next token.
            word1 = sent[i+1][0]
            postag1 = sent[i+1][1]
            features.update({
                '+1:word.lower()': word1.lower(),
                '+1:word.istitle()': word1.istitle(),
                '+1:word.isupper()': word1.isupper(),
                '+1:postag': postag1,
                '+1:postag[:2]': postag1[:2],
            })
        else:
            # End of sentence.
            features['EOS'] = True

        return features


    def sent2features(sent):
        return [word2features(sent, i) for i in range(len(sent))]

    def sent2labels(sent):
        return [label for token, postag, label in sent]

    def sent2tokens(sent):
        return [token for token, postag, label in sent]

    # train_sents and test_sents are lists of sentences from the corpus.
    X_train = [sent2features(s) for s in train_sents]
    y_train = [sent2labels(s) for s in train_sents]

    X_test = [sent2features(s) for s in test_sents]
    y_test = [sent2labels(s) for s in test_sents]
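
To make the feature dicts concrete, here is a small self-contained sketch on a made-up two-token sentence. The function ``toy_word2features`` is a hypothetical, shortened variant of ``word2features`` that keeps only a few keys, and the POS tags and labels in ``toy_sent`` are illustrative, not taken from the corpus.

```python
def toy_word2features(sent, i):
    """Shortened feature extractor: current word plus neighbour/boundary info."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i > 0:
        features['-1:word.lower()'] = sent[i - 1][0].lower()
    else:
        features['BOS'] = True  # first token: mark beginning of sentence
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i + 1][0].lower()
    else:
        features['EOS'] = True  # last token: mark end of sentence
    return features


# Illustrative sentence in (token, postag, label) form.
toy_sent = [('Melbourne', 'NP', 'B-LOC'), ('llueve', 'VMI', 'O')]

# One feature dict per token, and one label per token.
toy_X = [toy_word2features(toy_sent, i) for i in range(len(toy_sent))]
toy_y = [label for _token, _postag, label in toy_sent]
```

The first token gets a ``BOS`` key and a ``+1:`` neighbour feature, the last token gets ``EOS`` and a ``-1:`` neighbour feature, so each sentence becomes a list of dicts paired with a list of labels of the same length.
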

To see all possible CRF parameters, check its docstring. Here we use the L-BFGS training algorithm (the default) with Elastic Net (L1 + L2) regularization.
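
A minimal sketch of that training setup, assuming the ``sklearn-crfsuite`` package is installed. The helper name ``train_crf`` is hypothetical, and the ``c1``, ``c2`` and ``max_iterations`` values are illustrative choices rather than tuned settings.

```python
def train_crf(X_train, y_train, c1=0.1, c2=0.1, max_iterations=100):
    """Fit a CRF with L-BFGS and Elastic Net (L1 + L2) regularization."""
    # Imported lazily; requires the sklearn-crfsuite package.
    import sklearn_crfsuite

    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',              # L-BFGS training algorithm
        c1=c1,                          # L1 regularization coefficient
        c2=c2,                          # L2 regularization coefficient
        max_iterations=max_iterations,  # cap on optimizer iterations
        all_possible_transitions=True,  # also learn transitions unseen in training
    )
    crf.fit(X_train, y_train)
    return crf
```

With feature and label lists such as ``X_train`` and ``y_train``, ``crf = train_crf(X_train, y_train)`` returns the fitted model, and ``crf.predict(X_test)`` then yields one predicted label sequence per test sentence.
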