<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>8.3.&nbsp;Extracting data from HTML documents</title>
<link rel="stylesheet" href="/css/diveintopython.css" type="text/css" />
<link rev="made" href="josh@servercobra.com" />
<meta name="generator" content="DocBook XSL Stylesheets V1.52.2" />
<meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free" />
<meta name="description" content="Python from novice to pro" />
<link rel="home" href="http://www.diveintopython.net/" title="Dive Into Python" />
<link rel="up" href="http://www.diveintopython.net/" title="Chapter&nbsp;8.&nbsp;HTML Processing" />
<link rel="previous" href="http://www.diveintopython.net/" title="8.2.&nbsp;Introducing sgmllib.py" />
<link rel="next" href="http://www.diveintopython.net/" title="8.4.&nbsp;Introducing BaseHTMLProcessor.py" />
</head>
<table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<td id="navigation" align="right" valign="top">&nbsp;&nbsp;&nbsp;<a href="http://www.diveintopython.net/html_processing/introducing_sgmllib.html" title="Prev: “Introducing sgmllib.py”">&lt;&lt;</a>&nbsp;&nbsp;&nbsp;<a href="http://www.diveintopython.net/html_processing/basehtmlprocessor.html" title="Next: “Introducing BaseHTMLProcessor.py”">&gt;&gt;</a></td>
<td colspan="3" id="logocontainer">
<h1 id="logo"><a href="http://www.diveintopython.net/index.html" accesskey="1">Dive Into Python</a></h1>
<p id="tagline">Python from novice to pro</p>
<td colspan="3" align="right">
<form id="search" method="GET" action="http://www.google.com/custom">
<p><label for="q" accesskey="4">Find:&nbsp;</label><input type="text" id="q" name="q" size="20" maxlength="255" value=" " /> <input type="submit" value="Search" /><input type="hidden" name="cof" value="LW:752;L:http://diveintopython.org/images/diveintopython.png;LH:42;AH:left;GL:0;AWFID:3ced2bb1f7f1b212;" /><input type="hidden" name="domains" value="diveintopython.org" /><input type="hidden" name="sitesearch" value="diveintopython.org" /></p>
<div class="section" lang="en">
<div class="titlepage">
<h2 class="title"><a name="dialect.extract"></a>8.3.&nbsp;Extracting data from <span class="acronym">HTML</span> documents</h2>
<div></div>
<div class="abstract">
<p>To extract data from <span class="acronym">HTML</span> documents, subclass the <tt class="classname">SGMLParser</tt> class and define methods for each tag or entity you want to capture.
            </p>
<p>The first step in extracting data from an <span class="acronym">HTML</span> document is getting some <span class="acronym">HTML</span>.  If you have some <span class="acronym">HTML</span> lying around on your hard drive, you can use <a href="http://www.diveintopython.net/file_handling/file_objects.html" title="6.2.&nbsp;Working with File Objects">file functions</a> to read it, but the real fun begins when you get <span class="acronym">HTML</span> from live web pages.
         </p>
<div class="example"><a name="dialect.extract.urllib"></a><h3 class="title">Example&nbsp;8.5.&nbsp;Introducing <tt class="filename">urllib</tt></h3><pre class="screen">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class="pykeyword">import</span> urllib</span>                                       <a name="dialect.extract.1.1"></a><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" />
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">sock = urllib.urlopen(<span class="pystring">"http://diveintopython.org/"</span>)</span> <a name="dialect.extract.1.2"></a><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" />
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">htmlSource = sock.read()</span>                            <a name="dialect.extract.1.3"></a><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" />
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">sock.close()</span>                                        <a name="dialect.extract.1.4"></a><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" />
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class="pykeyword">print</span> htmlSource</span>                                    <a name="dialect.extract.1.5"></a><img src="http://www.diveintopython.net/images/callouts/5.png" alt="5" border="0" width="12" height="12" />
<span class="computeroutput">&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;
      &lt;meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'&gt;
   &lt;title&gt;Dive Into Python&lt;/title&gt;
&lt;link rel='stylesheet' href='diveintopython.css' type='text/css'&gt;
&lt;link rev='made' href='mailto:mark@diveintopython.org'&gt;
&lt;meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'&gt;
&lt;meta name='description' content='a free Python tutorial for experienced programmers'&gt;
&lt;/head&gt;
&lt;body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'&gt;
&lt;table cellpadding='0' cellspacing='0' border='0' width='100%'&gt;
&lt;tr&gt;&lt;td class='header' width='1%' valign='top'&gt;diveintopython.org&lt;/td&gt;
&lt;td width='99%' align='right'&gt;&lt;hr size='1' noshade&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='tagline' colspan='2'&gt;Python&amp;nbsp;for&amp;nbsp;experienced&amp;nbsp;programmers&lt;/td&gt;&lt;/tr&gt;</span>
[...snip...]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.1"><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" /></a>
<td valign="top" align="left">The <tt class="filename">urllib</tt> module is part of the standard <span class="application">Python</span> library.  It contains functions for getting information about and actually retrieving data from Internet-based <span class="acronym">URL</span>s (mainly web pages).
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.2"><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" /></a>
<td valign="top" align="left">The simplest use of <tt class="filename">urllib</tt> is to retrieve the entire text of a web page using the <tt class="function">urlopen</tt> function.  Opening a <span class="acronym">URL</span> is similar to <a href="http://www.diveintopython.net/file_handling/file_objects.html" title="6.2.&nbsp;Working with File Objects">opening a file</a>.  The return value of <tt class="function">urlopen</tt> is a file-like object, which has some of the same methods as a file object.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.3"><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /></a>
<td valign="top" align="left">The simplest thing to do with the file-like object returned by <tt class="function">urlopen</tt> is <tt class="function">read</tt>, which reads the entire <span class="acronym">HTML</span> of the web page into a single string.  The object also supports <tt class="function">readlines</tt>, which reads the text line by line into a list.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.4"><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" /></a>
<td valign="top" align="left">When you're done with the object, make sure to <tt class="function">close</tt> it, just like a normal file object.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.1.5"><img src="http://www.diveintopython.net/images/callouts/5.png" alt="5" border="0" width="12" height="12" /></a>
<td valign="top" align="left">You now have the complete <span class="acronym">HTML</span> of the home page of <tt class="systemitem">http://diveintopython.org/</tt> in a string, and you're ready to parse it.
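<p>The session above uses Python 2's <tt class="function">urllib.urlopen</tt>.  As a minimal sketch, assuming Python 3 (where the same function lives in <tt class="filename">urllib.request</tt> and <tt class="function">read</tt> returns bytes rather than a string), the open, read, close sequence might look like this.  The helper name <tt class="function">fetch_html</tt> and its default encoding are hypothetical choices made for illustration; they are not part of the book's examples.</p>

```python
from urllib.request import urlopen


def fetch_html(url, encoding="utf-8"):
    """Return the decoded body of `url` as a single string.

    Hypothetical helper: the name and the default encoding are
    assumptions for this sketch, not taken from the original example.
    """
    # The `with` block closes the file-like object when the read is done,
    # which replaces the explicit sock.close() call in the session above.
    with urlopen(url) as sock:
        raw = sock.read()  # bytes in Python 3
    # Decode to text; undecodable bytes are replaced rather than raising.
    return raw.decode(encoding, errors="replace")
```

<p>Calling <tt class="literal">fetch_html("http://diveintopython.org/")</tt> should give you the same kind of string as <tt class="varname">htmlSource</tt>, provided the site is reachable and the page really is encoded as assumed.</p>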
<div class="example"><a name="dialect.extract.links"></a><h3 class="title">Example&nbsp;8.6.&nbsp;Introducing <tt class="filename">urllister.py</tt></h3>
<p>If you have not already done so, you can <a href="http://www.diveintopython.net/download/diveintopython-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
            </p><pre class="programlisting"><span class="pykeyword">
from</span> sgmllib <span class="pykeyword">import</span> SGMLParser
<span class="pykeyword">class</span><span class="pyclass"> URLLister</span>(SGMLParser):
    <span class="pykeyword">def</span><span class="pyclass"> reset</span>(self):                              <a name="dialect.extract.2.1"></a><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" />
        SGMLParser.reset(self)
        self.urls = []
    <span class="pykeyword">def</span><span class="pyclass"> start_a</span>(self, attrs):                     <a name="dialect.extract.2.2"></a><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" />
        href = [v <span class="pykeyword">for</span> k, v <span class="pykeyword">in</span> attrs <span class="pykeyword">if</span> k==<span class="pystring">'href'</span>] <a name="dialect.extract.2.3"></a><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /> <a name="dialect.extract.2.4"></a><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" />
        <span class="pykeyword">if</span> href:
            self.urls.extend(href)</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.1"><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" /></a>
<td valign="top" align="left"><tt class="function">reset</tt> is called by the <tt class="function">__init__</tt> method of <tt class="classname">SGMLParser</tt>, and it can also be called manually once an instance of the parser has been created.  So if you need to do any initialization,
                        do it in <tt class="function">reset</tt>, not in <tt class="function">__init__</tt>, so that it will be re-initialized properly when someone re-uses a parser instance.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.2"><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" /></a>
<td valign="top" align="left"><tt class="function">start_a</tt> is called by <tt class="classname">SGMLParser</tt> whenever it finds an <tt class="sgmltag-element">&lt;a&gt;</tt> tag.  The tag may contain an <tt class="literal">href</tt> attribute, other attributes such as <tt class="literal">name</tt> or <tt class="literal">title</tt>, or both.  The <tt class="varname">attrs</tt> parameter is a list of tuples, <tt class="literal">[(<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), (<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), ...]</tt>.  The tag may also be a bare <tt class="sgmltag-element">&lt;a&gt;</tt>, a valid (if useless) <span class="acronym">HTML</span> tag, in which case <tt class="varname">attrs</tt> is an empty list.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.3"><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /></a>
<td valign="top" align="left">You can find out whether this <tt class="sgmltag-element">&lt;a&gt;</tt> tag has an <tt class="literal">href</tt> attribute with a simple <a href="http://www.diveintopython.net/native_data_types/declaring_variables.html#odbchelper.multiassign" title="3.4.2.&nbsp;Assigning Multiple Values at Once">multi-variable</a> <a href="http://www.diveintopython.net/native_data_types/mapping_lists.html" title="3.6.&nbsp;Mapping Lists">list comprehension</a>.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.2.4"><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" /></a>
<td valign="top" align="left">String comparisons like <tt class="literal">k=='href'</tt> are always case-sensitive, but that's safe in this case, because <tt class="classname">SGMLParser</tt> converts attribute names to lowercase while building <tt class="varname">attrs</tt>.
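<p><tt class="filename">sgmllib</tt> was removed in Python 3, so <tt class="filename">urllister.py</tt> as written runs only on Python 2.  The following is a rough sketch of the same idea on Python 3, swapping in a different parser, <tt class="classname">HTMLParser</tt> from the standard <tt class="filename">html.parser</tt> module.  That class has no per-tag <tt class="function">start_a</tt> hook; it calls a single <tt class="function">handle_starttag</tt> method for every tag, so the tag name is checked by hand.  Like <tt class="classname">SGMLParser</tt>, it calls <tt class="function">reset</tt> from <tt class="function">__init__</tt> and lowercases tag and attribute names.</p>

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect href values from <a> tags (Python 3 sketch, not the book's code)."""

    def reset(self):
        # Called by HTMLParser.__init__ and again on manual reset,
        # so per-parse state belongs here rather than in __init__.
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples with lowercased names.
        if tag != "a":
            return
        # A valueless attribute such as <a href> arrives as ("href", None);
        # skip it so that only real URLs are collected.
        href = [v for k, v in attrs if k == "href" and v is not None]
        if href:
            self.urls.extend(href)
```

<p>The list comprehension is the same one used in <tt class="function">start_a</tt>, with an extra guard for attributes that have no value.</p>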
<div class="example"><a name="dialect.feed.example"></a><h3 class="title">Example&nbsp;8.7.&nbsp;Using <tt class="filename">urllister.py</tt></h3><pre class="screen">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class="pykeyword">import</span> urllib, urllister</span>
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">usock = urllib.urlopen(<span class="pystring">"http://diveintopython.org/"</span>)</span>
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">parser = urllister.URLLister()</span>
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">parser.feed(usock.read())</span>         <a name="dialect.extract.3.1"></a><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" />
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">usock.close()</span>                     <a name="dialect.extract.3.2"></a><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" />
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">parser.close()</span>                    <a name="dialect.extract.3.3"></a><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" />
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class="pykeyword">for</span> url <span class="pykeyword">in</span> parser.urls: <span class="pykeyword">print</span> url</span> <a name="dialect.extract.3.4"></a><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" />
<span class="computeroutput">toc/index.html
toc/index.html
appendix/history.html
download/diveintopython-html-5.0.zip
download/diveintopython-pdf-5.0.zip
download/diveintopython-word-5.0.zip
download/diveintopython-text-5.0.zip
download/diveintopython-html-flat-5.0.zip
download/diveintopython-xml-5.0.zip
download/diveintopython-common-5.0.zip
... rest of output omitted for brevity ...</span></pre><div class="calloutlist">
<table border="0" summary="Callout list">
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.1"><img src="http://www.diveintopython.net/images/callouts/1.png" alt="1" border="0" width="12" height="12" /></a>
<td valign="top" align="left">Call the <tt class="function">feed</tt> method, defined in <tt class="classname">SGMLParser</tt>, to get <span class="acronym">HTML</span> into the parser.<sup>[<a name="d0e20503" href="http://www.diveintopython.net/html_processing/extracting_data.html#ftn.d0e20503">1</a>]</sup>  It takes a string, which is what <tt class="function">usock.read()</tt> returns.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.2"><img src="http://www.diveintopython.net/images/callouts/2.png" alt="2" border="0" width="12" height="12" /></a>
<td valign="top" align="left">As with files, you should <tt class="function">close</tt> your <span class="acronym">URL</span> objects as soon as you're done with them.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.3"><img src="http://www.diveintopython.net/images/callouts/3.png" alt="3" border="0" width="12" height="12" /></a>
<td valign="top" align="left">You should <tt class="function">close</tt> your parser object, too, but for a different reason.  You've read all the data and fed it to the parser, but the <tt class="function">feed</tt> method isn't guaranteed to have actually processed all the <span class="acronym">HTML</span> you give it; it may buffer it, waiting for more.  Be sure to call <tt class="function">close</tt> to flush the buffer and force everything to be fully parsed.
<td width="12" valign="top" align="left"><a href="http://www.diveintopython.net/html_processing/extracting_data.html#dialect.extract.3.4"><img src="http://www.diveintopython.net/images/callouts/4.png" alt="4" border="0" width="12" height="12" /></a>
<td valign="top" align="left">Once the parser is <tt class="function">close</tt>d, the parsing is complete, and <tt class="varname">parser.urls</tt> contains a list of all the linked <span class="acronym">URL</span>s in the <span class="acronym">HTML</span> document.  (Your output may look different, if the download links have been updated by the time you read this.)
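<p>The buffering behavior described above can be observed directly.  The sketch below assumes Python 3 and uses <tt class="classname">HTMLParser</tt> from <tt class="filename">html.parser</tt> in place of <tt class="classname">SGMLParser</tt>; the names <tt class="classname">LinkCollector</tt> and <tt class="function">list_urls</tt> are hypothetical.  It feeds the document in pieces, so a tag that is split across two pieces waits in the parser's buffer until the rest arrives, and then calls <tt class="function">close</tt> to flush whatever is left.</p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for URLLister, built on html.parser."""

    def reset(self):
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Keep only href attributes that actually carry a value.
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def list_urls(chunks):
    """Feed an iterable of HTML string chunks and return the collected URLs."""
    parser = LinkCollector()
    for chunk in chunks:
        # feed() may hold back an incomplete tag until more data arrives.
        parser.feed(chunk)
    # close() forces any buffered data to be processed.
    parser.close()
    return parser.urls
```

<p>For example, <tt class="literal">list_urls(['&lt;a href="toc/ind', 'ex.html"&gt;TOC&lt;/a&gt;'])</tt> splits one tag across two chunks; the link is recorded only once the second chunk completes the tag.</p>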
<div class="footnotes">
<h3 class="footnotetitle">Footnotes</h3>
<div class="footnote">
<p><sup>[<a name="ftn.d0e20503" href="http://www.diveintopython.net/html_processing/extracting_data.html#d0e20503">1</a>] </sup>The technical term for a parser like <tt class="classname">SGMLParser</tt> is a <span class="emphasis"><em>consumer</em></span>: it consumes <span class="acronym">HTML</span> and breaks it down.  Presumably, the name <tt class="function">feed</tt> was chosen to fit into the whole &#8220;<span class="quote">consumer</span>&#8221; motif.  Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or
                  evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring
                  back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the
                  only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, &#8220;<span class="quote">Do not feed the parser.</span>&#8221;  But maybe that's just me.  In any event, it's an interesting mental image.
               </p>
<table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<td width="35%" align="left"><br /><a class="NavigationArrow" href="http://www.diveintopython.net/html_processing/introducing_sgmllib.html">&lt;&lt;&nbsp;Introducing sgmllib.py</a></td>
<td width="30%" align="center"><br />&nbsp;<span class="divider">|</span>&nbsp;<a href="http://www.diveintopython.net/html_processing/index.html#dialect.divein" title="8.1.&nbsp;Diving in">1</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/introducing_sgmllib.html" title="8.2.&nbsp;Introducing sgmllib.py">2</a> <span class="divider">|</span> <span class="thispage">3</span> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/basehtmlprocessor.html" title="8.4.&nbsp;Introducing BaseHTMLProcessor.py">4</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/locals_and_globals.html" title="8.5.&nbsp;locals and globals">5</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/dictionary_based_string_formatting.html" title="8.6.&nbsp;Dictionary-based string formatting">6</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/quoting_attribute_values.html" title="8.7.&nbsp;Quoting attribute values">7</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/dialect.html" title="8.8.&nbsp;Introducing dialect.py">8</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/all_together.html" title="8.9.&nbsp;Putting it all together">9</a> <span class="divider">|</span> <a href="http://www.diveintopython.net/html_processing/summary.html" title="8.10.&nbsp;Summary">10</a>&nbsp;<span class="divider">|</span>&nbsp;
            </td>
<td width="35%" align="right"><br /><a class="NavigationArrow" href="http://www.diveintopython.net/html_processing/basehtmlprocessor.html">Introducing BaseHTMLProcessor.py&nbsp;&gt;&gt;</a></td>
<td colspan="3"><br /></td>
<div class="Footer">
<p class="copyright">Copyright &copy; 2000, 2001, 2002, 2003, 2004 <a href="mailto:josh@servercobra.com">Mark Pilgrim</a></p>