Commit 6ec6ef1

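Problem 0006: find the most important word in each diary entry. I simply counted word frequencies after removing stopwords, then found a script that generates topics from sentences and adapted its code. It is quite interesting: it makes its decisions from various sentence patterns. I still do not understand it very well, probably because I have not studied this area systematically.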
1 parent 214f050 commit 6ec6ef1

8 files changed

Lines changed: 189 additions & 0 deletions

File tree

burness/0006/01.txt

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
China's Nanjing Massacre museum received 103,500 visitors on Sunday, one day after the first national memorial day for victims of the Nanjing Massacre, marking a record high since establishment in 1985.
Zhu Chengshan, curator of the memorial hall, said the passion of the visitors showed the influence of the national memorial day.
The Memorial Hall of the Victims in Nanjing Massacre by Japanese Invaders, based in Nanjing, capital of east China's Jiangsu Province, was closed to the public from Nov. 18 to Dec. 13 in preparation for the first national memorial day.
On Dec. 13, 1937, Japanese troops began six weeks of destruction, pillage, rape and slaughter in Nanjing. Historical records show that more than 300,000 Chinese, including unarmed soldiers and innocent civilians, were murdered.
The memorial hall access was free to the public from 2004, which houses original remains, sculptures and historical records of the massacre.
Starting from 1994, Jiangsu Province and Nanjing City have given memorial assemblies on Dec. 13 every year to mourn the victims and promote peace.
The Standing Committee of the National People's Congress, China's top legislature, has set two national memorial days, July 7 and Dec. 13, early this year to mark victory in the anti-Japanese invasion war and mourn Nanjing Massacre victims.

burness/0006/02.txt

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
China's Hong Kong, the world's hub of finance, trade and commerce, will surely play a significant and irreplaceable role in building the 21st Century Maritime Silk Road, a Hong Kong business leader said here Saturday.
In his keynote speech to the ASEAN (the Association of Southeast Asian Nations) Development Forum 2014, Dr. Jonathan Choi Koon-shum, permanent honorary president of the Chinese General Chamber of Commerce and chairman of Sunwah Group, said building the maritime silk road, which was proposed by Chinese President Xi Jinping last year, will generate huge business opportunities for China's Hong Kong and the ASEAN region at large, improve the livelihoods of the people within the region and serve as the driving force for China and ASEAN nations to achieve greater economic miracles.
The 21st Century Maritime Silk Road runs through the Strait of Malacca to India, the Middle East and East Africa.
Choi said to promote the economic cooperation and development in East Asia and build the maritime silk road, Hong Kong will better play its unique role as the world's hub of finance, marine transportation, trade and commerce by providing the enterprises of the region with trans-boundary financing, financial expertise, accounting, arbitration, law and other related professional services.
Hong Kong has served as a "super contact person" in regional economic cooperation, said Choi, adding that businessmen from Hong Kong have not only invested in neighboring Guangdong Province but also in the ASEAN, thus having clear perceptual knowledge and rich experience of their investment environments, law systems and models of business running and managing money matters.
Hong Kong capitals have long been the important driving forces for the economic development in the ASEAN, with thousands of Hong Kong enterprises doing businesses there, and they have earned respect from ASEAN nations, the business leader said.
There is no doubt that Hong Kong will serve as the best bridge and partner for ASEAN nations to tap the market in mainland China and for the state-owned and private enterprises from mainland China to expand their businesses in ASEAN nations, he further said.
Choi said China and the 10-member ASEAN have labeled the past 10 years as a "golden decade" for the relationship since 2003 when the two sides forged a strategic partnership and have dubbed the next 10 years as a "diamond decade" which both sides hope will feature more political cooperation and regional economic integration.
With the support of the central government, Hong Kong started in July this year the negotiations with the ASEAN on a free trade agreement, and once the agreement is hammered out, Hong Kong's status as the hub of economic cooperation in East Asia will be further consolidated, he said.

burness/0006/03.txt

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
China and Kazakhstan pledged here Sunday to deepen collaboration in the area of confidence-building measures, to allow CICA, an inter-governmental forum for dialogues on regional security issues in Asia, to play a greater role in the future.
In a joint communique issued during Chinese Premier Li Keqiang's official visit to Kazakhstan, the two countries agreed that they will work with other members of CICA, or the Conference on Interaction and Confidence Building Measures in Asia, to intensify cooperation in related areas and improve the mechanism of this multi-national platform.
With China and Kazakhstan being two of its 26 members, CICA is aimed at enhancing cooperation toward promoting peace, security and stability in Asia.
The idea of CICA was first proposed by Kazakh President Nursultan Nazarbayev in October 1992 at the 47th Session of the UN General Assembly.
China assumes the chairmanship of CICA for the 2014-2016 period.
In the joint communique, China and Kazakhstan both spoke highly of the outcomes of the fourth CICA summit held in Shanghai, China in May 2014.
They vowed to abide by all the principles set in the declaration of the Shanghai meeting and strive to put into practice the agreements reached then, according to the document.
Li arrived in Astana earlier on Sunday for an official visit to the Central Asian country as well as a prime ministers' meeting of the Shanghai Cooperation Organization.

burness/0006/04.txt

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
The extremist Islamic State (IS) militants on Sunday executed 19 local policemen in Iraq's western province of Anbar, while seven IS militants were killed in clashes in the northern central province of Salahudin, a security source said.
The policemen were executed one day after the militants seized the al-Wafaa area, just southwest of the provincial capital city of Ramadi, some 110 km west of Baghdad, a provincial security source told TingVoa.com on condition of anonymity.
Separately, the IS militants surrounded a force from Iraqi police and government-allied militiamen in an area located some 35 km west of Ramadi, and fierce clashes were underway, the source said.
Sporadic clashes continued in the day after the IS militants took control of several villages in the area between the town of al-Baghdadi and the nearby town of Heet, some 160 km west of Baghdad, the source added.
In Salahudin province, heavy clashes erupted between the security forces backed by Shiite militias and the IS militants near the town of Balad, some 80 km north of Baghdad, leaving at least seven IS militants killed, the source said without giving further details about the casualties among the troops and the militiamen.
Also in the province, clashes broke out near the oil refinery town of Baiji, some 200 km north of Baghdad, the source said without elaboration.
Meanwhile, three IS militants were killed while they were preparing a booby-trapped house in al-Zuwiyah village, some 30 km east of Baiji, the source said.
The security situation in Iraq has drastically deteriorated since June 10, when bloody clashes broke out between the Iraqi security forces and the IS group.

burness/0006/05.txt

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
China's Three Gorges project celebrated its 20th anniversary Sunday with a record high throughput of its five-tier ship lock.
The throughput of the year 2014 is expected to reach a record 110 million tonnes and the total throughput since the lock started operation in 2003 reached 700 million tonnes.
In the first six months, the throughput through the project's ship lock reached 55.69 million tonnes, an increase of 11.35 percent year on year, said the Three Gorges Navigation Administration.
The throughput of passengers also grew 54 percent to reach 191,400, said the administration.
The top three categories of cargo were mining materials, ore and containers.
The Three Gorges project is a multi-functional water control system, consisting of a 2,309-meter-long and 185-meter-high dam, a five-tier ship lock and 26 hydropower turbo-generators.
The project generates electricity, controls floods by storing excess water and helps to regulate the river's shipping capacity.

burness/0006/Readme.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
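#### Notes: ####

I have no background in natural language processing, so for problem 0006 I only took a simple approach:

1. First, read in each txt file, remove the English stopwords (NLTK ships a ready-made list), and count word frequencies.
2. Then, pick the three most frequent words.

PS: when reading files, check the extension with endswith('.txt'); os.listdir() lists the files in the current directory.

I found a blog post by Shlomi Babluki online that explains how to extract the topics of a sentence, and I adapted his basic code; see [extract_key_word_Shlomi_Babluki.py](extract_key_word_Shlomi_Babluki.py). It seems far more effective than my simple word-frequency count, because it takes the grammatical structure of the sentence into account. I do not fully understand it, but it feels very useful.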
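
The PS in the Readme can be sketched as a small helper. This is a minimal sketch under stated assumptions, not code from this commit: the function name `read_txt_files` and its `directory` parameter are hypothetical, and the files are assumed to be UTF-8 text.

```python
import os


def read_txt_files(directory="."):
    """Return {filename: text} for every .txt file directly inside `directory`."""
    texts = {}
    for name in sorted(os.listdir(directory)):
        # endswith('.txt') is the extension check mentioned in the Readme's PS.
        if not name.endswith(".txt"):
            continue
        # Join with the directory so the helper also works outside the current directory.
        with open(os.path.join(directory, name), "rt", encoding="utf-8") as f:
            texts[name] = f.read()
    return texts
```

Returning a dictionary keeps the file reading separate from the counting, so either the word-frequency approach or the noun-phrase extractor could consume the same texts.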

burness/0006/extract_key_word_Shlomi_Babluki.py

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# coding=UTF-8
import os

import nltk
from nltk.corpus import brown

# This is a fast and simple noun phrase extractor (based on NLTK)
# Feel free to use it, just keep a link back to this post
# http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
# http://www.sharejs.com/codes/
# Created by Shlomi Babluki
# May, 2013
#
# Requires the NLTK "brown" corpus and the "punkt" tokenizer models, e.g. via
# nltk.download('brown') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).


# This is our fast Part of Speech tagger
#############################################################################
brown_train = brown.tagged_sents(categories='news')
# Fallback rules for words the Brown-trained taggers have never seen
regexp_tagger = nltk.RegexpTagger(
    [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
     (r'(-|:|;)$', ':'),
     (r'\'*$', 'MD'),
     (r'(The|the|A|a|An|an)$', 'AT'),
     (r'.*able$', 'JJ'),
     (r'^[A-Z].*$', 'NNP'),
     (r'.*ness$', 'NN'),
     (r'.*ly$', 'RB'),
     (r'.*s$', 'NNS'),
     (r'.*ing$', 'VBG'),
     (r'.*ed$', 'VBD'),
     (r'.*', 'NN')
     ])
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)
#############################################################################


# This is our semi-CFG; Extend it according to your own needs
#############################################################################
cfg = {}
cfg["NNP+NNP"] = "NNP"
cfg["NN+NN"] = "NNI"
cfg["NNI+NN"] = "NNI"
cfg["JJ+JJ"] = "JJ"
cfg["JJ+NN"] = "NNI"
#############################################################################


class NPExtractor(object):

    def __init__(self, sentence):
        self.sentence = sentence

    # Split the sentence into single words/tokens
    def tokenize_sentence(self, sentence):
        tokens = nltk.word_tokenize(sentence)
        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
    def normalize_tags(self, tagged):
        n_tagged = []
        for t in tagged:
            # Proper nouns, with or without the title suffix, become NNP
            if t[1] == "NP-TL" or t[1] == "NP":
                n_tagged.append((t[0], "NNP"))
                continue
            # Drop the "-TL" (title) suffix
            if t[1].endswith("-TL"):
                n_tagged.append((t[0], t[1][:-3]))
                continue
            # Drop a trailing "S" (e.g. plural "NNS" becomes "NN")
            if t[1].endswith("S"):
                n_tagged.append((t[0], t[1][:-1]))
                continue
            n_tagged.append((t[0], t[1]))
        return n_tagged

    # Extract the main topics from the sentence
    def extract(self):

        tokens = self.tokenize_sentence(self.sentence)
        tags = self.normalize_tags(bigram_tagger.tag(tokens))

        # Merge adjacent tokens whose tag pair matches a cfg rule,
        # restarting the scan after every merge until nothing changes
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = "%s+%s" % (t1[1], t2[1])
                value = cfg.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = "%s %s" % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break

        # Keep proper nouns and merged noun phrases as the topics
        matches = []
        for t in tags:
            if t[1] == "NNP" or t[1] == "NNI":
            # if t[1] == "NNP" or t[1] == "NNI" or t[1] == "NN":
                matches.append(t[0])
        return matches


# Main method: run this script from the directory that contains the .txt files
def main():
    path = '.'
    for filename in os.listdir(path):
        if not filename.endswith('.txt'):
            continue
        # Collapse the whole file into one whitespace-normalized string
        text = []
        with open(os.path.join(path, filename), 'rt', encoding='utf-8') as f:
            for line in f:
                words = line.split()
                text += words
        str_text = ' '.join(text)
        np_extractor = NPExtractor(str_text)
        result = np_extractor.extract()
        print("This file is about: %s" % ", ".join(result))


if __name__ == '__main__':
    main()
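
To see what the merge loop in `extract` does, here is a stand-alone sketch that applies the same five `cfg` rules to a hand-tagged token list, so it runs without NLTK or the Brown corpus. The names `merge_tags` and `topics` and the sample tags are assumptions for illustration; a real tagger may label these words differently.

```python
# Same pair-merging rules as the cfg dictionary in the script above.
CFG = {
    "NNP+NNP": "NNP",
    "NN+NN": "NNI",
    "NNI+NN": "NNI",
    "JJ+JJ": "JJ",
    "JJ+NN": "NNI",
}


def merge_tags(tagged):
    """Repeatedly merge the first adjacent pair whose tags match a CFG rule."""
    tags = list(tagged)
    merged = True
    while merged:
        merged = False
        for i in range(len(tags) - 1):
            (w1, t1), (w2, t2) = tags[i], tags[i + 1]
            new_tag = CFG.get(t1 + "+" + t2)
            if new_tag:
                # Replace the two tokens with one combined token.
                tags[i:i + 2] = [(w1 + " " + w2, new_tag)]
                merged = True
                break  # restart the scan from the left after every merge
    return tags


def topics(tagged):
    """Keep only proper nouns (NNP) and merged noun phrases (NNI)."""
    return [word for word, tag in merge_tags(tagged) if tag in ("NNP", "NNI")]


# Hand-assigned tags (an assumption for illustration, not real tagger output).
sample = [
    ("Three", "NNP"), ("Gorges", "NNP"), ("project", "NN"),
    ("is", "BEZ"), ("a", "AT"),
    ("water", "NN"), ("control", "NN"), ("system", "NN"),
]
print(topics(sample))  # prints ['Three Gorges', 'water control system']
```

Note that the lone noun `project` is dropped: only `NNP` and `NNI` tokens are kept, which is why the commented-out condition in `extract` that also accepts `NN` would return many more single words.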
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
import os
from collections import Counter

# Requires the NLTK "stopwords" corpus, e.g. via nltk.download('stopwords')
from nltk.corpus import stopwords


def main():
    # Build the stopword set once; calling stopwords.words() for every word is slow
    stop_words = set(stopwords.words('english'))
    for filename in os.listdir('.'):
        if not filename.endswith('.txt'):
            continue
        result = Counter()
        with open(filename, 'rt', encoding='utf-8') as f:
            for line in f:
                # Drop the stopwords; lowercase for the check so "The" is filtered too
                words = [w for w in line.split() if w.lower() not in stop_words]
                result.update(words)
        # Report the three most frequent words, as described in the Readme
        print('The most important words in %s are %s' % (filename, result.most_common(3)))


if __name__ == '__main__':
    main()
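
One limitation of counting `line.split()` tokens is that case and punctuation split the counts: `Nanjing,` and `Nanjing` are tallied separately. The sketch below shows one possible way to normalize tokens first; it is not part of this commit. The name `top_words` and the tiny `STOP_WORDS` set are assumptions for illustration, standing in for NLTK's much longer English stopword list so that the example runs with the standard library only.

```python
import re
from collections import Counter

# Tiny stand-in for nltk.corpus.stopwords.words('english') (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "for", "is", "was", "its"}


def top_words(text, n=3, stop_words=STOP_WORDS):
    """Return the n most frequent non-stopword tokens, ignoring case and punctuation."""
    # Lowercase first, then keep runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


sample = "The lock is a ship lock. The ship lock of the dam."
print(top_words(sample))  # prints [('lock', 3), ('ship', 2), ('dam', 1)]
```

Lowercasing merges `Lock` and `lock`, and the regular expression discards the trailing period and comma that a plain `split()` would keep attached to the word.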
