<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>8.3. Extracting data from HTML documents</title>
<link rel="stylesheet" href="/css/diveintopython.css" type="text/css" />
<link rev="made" href="josh@servercobra.com" />
<meta name="generator" content="DocBook XSL Stylesheets V1.52.2" />
<meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free" />
<meta name="description" content="Python from novice to pro" />
<link rel="home" href="http://www.diveintopython.net/" title="Dive Into Python" />
<link rel="up" href="http://www.diveintopython.net/" title="Chapter 8. HTML Processing" />
<link rel="previous" href="http://www.diveintopython.net/" title="8.2. Introducing sgmllib.py" />
<link rel="next" href="http://www.diveintopython.net/" title="8.4. Introducing BaseHTMLProcessor.py" />
</head>
<body>
<table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td id="navigation" align="right" valign="top"> <a href="http://www.diveintopython.net/html_processing/introducing_sgmllib.html" title="Prev: “Introducing sgmllib.py”">&lt;&lt;</a> <a href="http://www.diveintopython.net/html_processing/basehtmlprocessor.html" title="Next: “Introducing BaseHTMLProcessor.py”">&gt;&gt;</a></td>
</tr>
<tr>
<td colspan="3" id="logocontainer">
<h1 id="logo"><a href="http://www.diveintopython.net/index.html" accesskey="1">Dive Into Python</a></h1>
<p id="tagline">Python from novice to pro</p>
</td>
<td colspan="3" align="right">
<form id="search" method="GET" action="http://www.google.com/custom">
<p><label for="q" accesskey="4">Find: </label><input type="text" id="q" name="q" size="20" maxlength="255" value=" " /> <input type="submit" value="Search" /><input type="hidden" name="cof" value="LW:752;L:http://diveintopython.org/images/diveintopython.png;LH:42;AH:left;GL:0;AWFID:3ced2bb1f7f1b212;" /><input type="hidden" name="domains" value="diveintopython.org" /><input type="hidden" name="sitesearch" value="diveintopython.org" /></p>
</form>
</td>
</tr>
</table>
<div class="section" lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a name="dialect.extract"></a>8.3. Extracting data from <span class="acronym">HTML</span> documents
</h2>
</div>
</div>
<div></div>
</div>
<div class="abstract">
<p>To extract data from <span class="acronym">HTML</span> documents, subclass the <tt class="classname">SGMLParser</tt> class and define methods for each tag or entity you want to capture.
</p>
</div>
<p>The first step to extracting data from an <span class="acronym">HTML</span> document is getting some <span class="acronym">HTML</span>. If you have some <span class="acronym">HTML</span> lying around on your hard drive, you can use <a href="http://www.diveintopython.net/file_handling/file_objects.html" title="6.2. Working with File Objects">file functions</a> to read it, but the real fun begins when you get <span class="acronym">HTML</span> from live web pages.
</p>
<div class="example"><a name="dialect.extract.urllib"></a><h3 class="title">Example 8.5. Introducing <tt class="filename">urllib</tt></h3><pre class="screen">
<tt class="prompt">>>> </tt><span class="userinput"><span class="pykeyword">import</span> urllib</span> <a name="dialect.extract.1.1"></a><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" />
<tt class="prompt">>>> </tt><span class="userinput">sock = urllib.urlopen(<span class="pystring">"http://diveintopython.org/"</span>)</span> <a name="dialect.extract.1.2"></a><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" />
<tt class="prompt">>>> </tt><span class="userinput">htmlSource = sock.read()</span> <a name="dialect.extract.1.3"></a><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" />
<tt class="prompt">>>> </tt><span class="userinput">sock.close()</span> <a name="dialect.extract.1.4"></a><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" />
<tt class="prompt">>>> </tt><span class="userinput"><span class="pykeyword">print</span> htmlSource</span> <a name="dialect.extract.1.5"></a><img src="http://www.diveintopython.net/images/callouts/5.png" alt="5" border="0" width="12" height="12" />
<span class="computeroutput">&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;
&lt;meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'&gt;
&lt;title&gt;Dive Into Python&lt;/title&gt;
&lt;link rel='stylesheet' href='diveintopython.css' type='text/css'&gt;
&lt;link rev='made' href='mailto:mark@diveintopython.org'&gt;
&lt;meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'&gt;
&lt;meta name='description' content='a free Python tutorial for experienced programmers'&gt;
&lt;/head&gt;
&lt;body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'&gt;
&lt;table cellpadding='0' cellspacing='0' border='0' width='100%'&gt;
&lt;tr&gt;&lt;td class='header' width='1%' valign='top'&gt;diveintopython.org&lt;/td&gt;
&lt;td width='99%' align='right'&gt;&lt;hr size='1' noshade&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='tagline' colspan='2'&gt;Python&amp;nbsp;for&amp;nbsp;experienced&amp;nbsp;programmers&lt;/td&gt;&lt;/tr&gt;</span>
[...snip...]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.1"><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">The <tt class="filename">urllib</tt> module is part of the standard <span class="application">Python</span> library. It contains functions for getting information about Internet-based <span class="acronym">URL</span>s (mainly web pages) and for retrieving data from them.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.2"><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">The simplest use of <tt class="filename">urllib</tt> is to retrieve the entire text of a web page using the <tt class="function">urlopen</tt> function. Opening a <span class="acronym">URL</span> is similar to <a href="http://www.diveintopython.net/file_handling/file_objects.html" title="6.2. Working with File Objects">opening a file</a>. The return value of <tt class="function">urlopen</tt> is a file-like object, which has some of the same methods as a file object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.3"><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">The simplest thing to do with the file-like object returned by <tt class="function">urlopen</tt> is <tt class="function">read</tt>, which reads the entire <span class="acronym">HTML</span> of the web page into a single string. The object also supports <tt class="function">readlines</tt>, which reads the text line by line into a list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.4"><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">When you're done with the object, make sure to <tt class="function">close</tt> it, just like a normal file object.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.5"><img src="http://www.diveintopython.net/images/callouts/5.png" alt="5" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">You now have the complete <span class="acronym">HTML</span> of the home page of <tt class="systemitem">http://diveintopython.org/</tt> in a string, and you're ready to parse it.
</td>
</tr>
</table>
</div>
</div>
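<p>On Python 3 the same steps still apply, but the function lives at <tt class="function">urllib.request.urlopen</tt> and <tt class="function">read</tt> returns bytes rather than a string. The following is a minimal sketch under those assumptions, not part of the original example: <tt class="function">fetch_html</tt> is a hypothetical helper name, the UTF-8 fallback encoding is an assumption, and a <tt class="literal">data:</tt> URL stands in for a live web page so that the sketch runs without a network connection.</p>

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the body of url as text (hypothetical helper, not from the book)."""
    # The with block closes the file-like object, like calling sock.close().
    with urlopen(url) as sock:
        raw = sock.read()  # the entire response body, as bytes
        # Use the charset declared in the Content-Type header, if there is one.
        encoding = sock.headers.get_content_charset() or fallback_encoding
    return raw.decode(encoding)


# A data: URL carries the document inline, so no web server is needed.
sample = "<html><head><title>Dive Into Python</title></head></html>"
html_source = fetch_html("data:text/html;charset=utf-8," + quote(sample))
print(html_source)
```

<p>Passing an <tt class="literal">http://</tt> address to the same helper should retrieve a live page, subject to the usual network errors.</p>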
<div class="example"><a name="dialect.extract.links"></a><h3 class="title">Example 8.6. Introducing <tt class="filename">urllister.py</tt></h3>
<p>If you have not already done so, you can <a href="http://www.diveintopython.net/download/diveintopython-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
</p><pre class="programlisting"><span class="pykeyword">
from</span> sgmllib <span class="pykeyword">import</span> SGMLParser
<span class="pykeyword">class</span><span class="pyclass"> URLLister</span>(SGMLParser):
    <span class="pykeyword">def</span><span class="pyclass"> reset</span>(self): <a name="dialect.extract.2.1"></a><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" />
        SGMLParser.reset(self)
        self.urls = []

    <span class="pykeyword">def</span><span class="pyclass"> start_a</span>(self, attrs): <a name="dialect.extract.2.2"></a><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" />
        href = [v <span class="pykeyword">for</span> k, v <span class="pykeyword">in</span> attrs <span class="pykeyword">if</span> k==<span class="pystring">'href'</span>] <a name="dialect.extract.2.3"></a><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /> <a name="dialect.extract.2.4"></a><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" />
        <span class="pykeyword">if</span> href:
            self.urls.extend(href)</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.1"><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left"><tt class="function">reset</tt> is called by the <tt class="function">__init__</tt> method of <tt class="classname">SGMLParser</tt>, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization,
do it in <tt class="function">reset</tt>, not in <tt class="function">__init__</tt>, so that it will be re-initialized properly when someone re-uses a parser instance.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.2"><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left"><tt class="function">start_a</tt> is called by <tt class="classname">SGMLParser</tt> whenever it finds an <tt class="sgmltag-element">&lt;a&gt;</tt> tag. The <tt class="varname">attrs</tt> parameter is a list of tuples, <tt class="literal">[(<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), (<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), ...]</tt>. The tag may contain an <tt class="literal">href</tt> attribute, other attributes such as <tt class="literal">name</tt> or <tt class="literal">title</tt>, or both. It may also be just a bare <tt class="sgmltag-element">&lt;a&gt;</tt>, a valid (if useless) <span class="acronym">HTML</span> tag, in which case <tt class="varname">attrs</tt> would be an empty list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.3"><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">You can find out whether this <tt class="sgmltag-element">&lt;a&gt;</tt> tag has an <tt class="literal">href</tt> attribute with a simple <a href="http://www.diveintopython.net/native_data_types/declaring_variables.html#odbchelper.multiassign" title="3.4.2. Assigning Multiple Values at Once">multi-variable</a> <a href="http://www.diveintopython.net/native_data_types/mapping_lists.html" title="3.6. Mapping Lists">list comprehension</a>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.4"><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">String comparisons like <tt class="literal">k=='href'</tt> are always case-sensitive, but that's safe in this case, because <tt class="classname">SGMLParser</tt> converts attribute names to lowercase while building <tt class="varname">attrs</tt>.
</td>
</tr>
</table>
</div>
</div>
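<p>The <tt class="filename">sgmllib</tt> module was removed in Python 3. The sketch below shows the same idea with the standard library's <tt class="classname">html.parser.HTMLParser</tt>; it is one assumed way to port the class, not the book's own code. <tt class="classname">HTMLParser</tt> has no per-tag <tt class="function">start_a</tt> hook, so the sketch overrides <tt class="function">handle_starttag</tt> and checks the tag name itself. The sample markup is made up for illustration.</p>

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    def reset(self):
        # HTMLParser.__init__ calls reset, so per-parse state belongs here.
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) tuples; tag and attribute
        # names have already been converted to lowercase by the parser.
        if tag == "a":
            # A bare attribute such as <a href> has the value None; skip it.
            href = [v for k, v in attrs if k == "href" and v is not None]
            self.urls.extend(href)


# Made-up sample: an uppercase HREF, an anchor without href, a fragment link.
sample = (
    '<a HREF="toc/index.html">Contents</a> '
    '<a name="top"></a> '
    '<a href="#download">Download</a>'
)
parser = URLLister()
parser.feed(sample)
parser.close()
print(parser.urls)
```

<p>The anchor that has only a <tt class="literal">name</tt> attribute contributes nothing to the list, and the uppercase <tt class="literal">HREF</tt> is still matched because attribute names are lowercased before the comparison.</p>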
<div class="example"><a name="dialect.feed.example"></a><h3 class="title">Example 8.7. Using <tt class="filename">urllister.py</tt></h3><pre class="screen">
<tt class="prompt">>>> </tt><span class="userinput"><span class="pykeyword">import</span> urllib, urllister</span>
<tt class="prompt">>>> </tt><span class="userinput">usock = urllib.urlopen(<span class="pystring">"http://diveintopython.org/"</span>)</span>
<tt class="prompt">>>> </tt><span class="userinput">parser = urllister.URLLister()</span>
<tt class="prompt">>>> </tt><span class="userinput">parser.feed(usock.read())</span> <a name="dialect.extract.3.1"></a><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" />
<tt class="prompt">>>> </tt><span class="userinput">usock.close()</span> <a name="dialect.extract.3.2"></a><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" />
<tt class="prompt">>>> </tt><span class="userinput">parser.close()</span> <a name="dialect.extract.3.3"></a><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" />
<tt class="prompt">>>> </tt><span class="userinput"><span class="pykeyword">for</span> url <span class="pykeyword">in</span> parser.urls: <span class="pykeyword">print</span> url</span> <a name="dialect.extract.3.4"></a><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" />
<span class="computeroutput">toc/index.html
#download
#languages
toc/index.html
appendix/history.html
download/diveintopython-html-5.0.zip
download/diveintopython-pdf-5.0.zip
download/diveintopython-word-5.0.zip
download/diveintopython-text-5.0.zip
download/diveintopython-html-flat-5.0.zip
download/diveintopython-xml-5.0.zip
download/diveintopython-common-5.0.zip
</span>
... rest of output omitted for brevity ...</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.1"><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">Call the <tt class="function">feed</tt> method, defined in <tt class="classname">SGMLParser</tt>, to get <span class="acronym">HTML</span> into the parser.<sup>[<a name="d0e20503" href="http://www.diveintopython.net/html_processing/extracting_data.html#ftn.d0e20503">1</a>]</sup> It takes a string, which is what <tt class="function">usock.read()</tt> returns.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.2"><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">As with files, you should <tt class="function">close</tt> your <span class="acronym">URL</span> objects as soon as you're done with them.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.3"><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">You should <tt class="function">close</tt> your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the <tt class="function">feed</tt> method isn't guaranteed to have actually processed all the <span class="acronym">HTML</span> you give it; it may buffer it, waiting for more. Be sure to call <tt class="function">close</tt> to flush the buffer and force everything to be fully parsed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.4"><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" /></a>
</td>
<td valign="top" align="left">Once the parser is <tt class="function">close</tt>d, the parsing is complete, and <tt class="varname">parser.urls</tt> contains a list of all the linked <span class="acronym">URL</span>s in the <span class="acronym">HTML</span> document. (Your output may look different, if the download links have been updated by the time you read this.)
</td>
</tr>
</table>
</div>
</div>
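<p>The buffering described in the third callout can be observed by splitting a tag across two <tt class="function">feed</tt> calls. This sketch assumes Python 3's <tt class="classname">html.parser.HTMLParser</tt> in place of <tt class="classname">SGMLParser</tt>; the <tt class="classname">LinkCollector</tt> class and the sample markup are hypothetical.</p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    def reset(self):
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


parser = LinkCollector()

# The first chunk ends in the middle of a tag, so the parser holds it back.
parser.feed('<a href="toc/ind')
urls_after_first_feed = list(parser.urls)

# The rest of the tag arrives and the parser can now process it.
parser.feed('ex.html">Contents</a>')
urls_after_second_feed = list(parser.urls)

# close forces the parser to handle anything still sitting in its buffer.
parser.close()

print(urls_after_first_feed)
print(urls_after_second_feed)
```

<p>Nothing is collected until the tag is complete, which is why the data should be fed in full and the parser closed before reading its results.</p>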
<div class="footnotes">
<h3 class="footnotetitle">Footnotes</h3>
<div class="footnote">
<p><sup>[<a name="ftn.d0e20503" href="http://www.diveintopython.net/html_processing/extracting_data.html#d0e20503">1</a>] </sup>The technical term for a parser like <tt class="classname">SGMLParser</tt> is a <span class="emphasis"><em>consumer</em></span>: it consumes <span class="acronym">HTML</span> and breaks it down. Presumably, the name <tt class="function">feed</tt> was chosen to fit into the whole “<span class="quote">consumer</span>” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or
evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring
back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the
only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “<span class="quote">Do not feed the parser.</span>” But maybe that's just me. In any event, it's an interesting mental image.
</p>
</div>
</div>
</div>
<table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td width="35%" align="left"><br /><a class="NavigationArrow" href="http://www.diveintopython.net/html_processing/introducing_sgmllib.html">&lt;&lt; Introducing sgmllib.py</a></td>
<td width="30%" align="center"><br /> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/index.html#dialect.divein" title="8.1. Diving in">1</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/introducing_sgmllib.html" title="8.2. Introducing sgmllib.py">2</a> <span class="divider">|</span> <span class="thispage">3</span> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/basehtmlprocessor.html" title="8.4. Introducing BaseHTMLProcessor.py">4</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/locals_and_globals.html" title="8.5. locals and globals">5</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/dictionary_based_string_formatting.html" title="8.6. Dictionary-based string formatting">6</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/quoting_attribute_values.html" title="8.7. Quoting attribute values">7</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/dialect.html" title="8.8. Introducing dialect.py">8</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/all_together.html" title="8.9. Putting it all together">9</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/summary.html" title="8.10. Summary">10</a> <span class="divider">|</span>
</td>
<td width="35%" align="right"><br /><a class="NavigationArrow" href="http://www.diveintopython.net/html_processing/basehtmlprocessor.html">Introducing BaseHTMLProcessor.py &gt;&gt;</a></td>
</tr>
<tr>
<td colspan="3"><br /></td>
</tr>
</table>
<div class="Footer">
<p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:josh@servercobra.com">Mark Pilgrim</a></p>
</div>
</body>
</html>