@@ -1,6 +1,6 @@
-************************************************
-HOWTO Fetch Internet Resources Using urllib2
-************************************************
+*****************************************************
+HOWTO Fetch Internet Resources Using urllib package
+*****************************************************
 
 :Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
 
@@ -24,14 +24,14 @@ Introduction
 
     A tutorial on *Basic Authentication*, with examples in Python.
 
-**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
+**urllib.request** is a `Python <http://www.python.org>`_ module for fetching URLs
 (Uniform Resource Locators). It offers a very simple interface, in the form of
 the *urlopen* function. This is capable of fetching URLs using a variety of
 different protocols. It also offers a slightly more complex interface for
 handling common situations - like basic authentication, cookies, proxies and so
 on. These are provided by objects called handlers and openers.
 
-urllib2 supports fetching URLs for many "URL schemes" (identified by the string
+urllib.request supports fetching URLs for many "URL schemes" (identified by the string
 before the ":" in the URL - for example "ftp" is the URL scheme of
 "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
 This tutorial focuses on the most common case, HTTP.
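The scheme naming described above can be checked without any network access using the companion ``urllib.parse`` module; a quick illustrative sketch:

```python
from urllib.parse import urlsplit

# The URL scheme is everything before the first ":".
print(urlsplit('ftp://python.org/').scheme)   # ftp
print(urlsplit('http://python.org/').scheme)  # http
```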
@@ -40,43 +40,43 @@ For straightforward situations *urlopen* is very easy to use. But as soon as you
 encounter errors or non-trivial cases when opening HTTP URLs, you will need some
 understanding of the HyperText Transfer Protocol. The most comprehensive and
 authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
-not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
+not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
 with enough detail about HTTP to help you through. It is not intended to replace
-the :mod:`urllib2` docs, but is supplementary to them.
+the :mod:`urllib.request` docs, but is supplementary to them.
 
 
 Fetching URLs
 =============
 
-The simplest way to use urllib2 is as follows::
+The simplest way to use urllib.request is as follows::
 
-    import urllib2
-    response = urllib2.urlopen('http://python.org/')
+    import urllib.request
+    response = urllib.request.urlopen('http://python.org/')
     html = response.read()
 
-Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
+Many uses of urllib will be that simple (note that instead of an 'http:' URL we
 could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
 purpose of this tutorial to explain the more complicated cases, concentrating on
 HTTP.
 
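For trying ``urlopen`` without a live network connection, one option (assuming Python 3.4 or later, which is beyond what the text above promises) is a ``data:`` URL, where the resource is embedded in the URL itself:

```python
import urllib.request

# No network needed: the payload travels inside the URL.
response = urllib.request.urlopen('data:text/plain,Hello')
print(response.read())  # b'Hello'
```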
 HTTP is based on requests and responses - the client makes requests and servers
-send responses. urllib2 mirrors this with a ``Request`` object which represents
+send responses. urllib.request mirrors this with a ``Request`` object which represents
 the HTTP request you are making. In its simplest form you create a Request
 object that specifies the URL you want to fetch. Calling ``urlopen`` with this
 Request object returns a response object for the URL requested. This response is
 a file-like object, which means you can for example call ``.read()`` on the
 response::
 
-    import urllib2
+    import urllib.request
 
-    req = urllib2.Request('http://www.voidspace.org.uk')
-    response = urllib2.urlopen(req)
+    req = urllib.request.Request('http://www.voidspace.org.uk')
+    response = urllib.request.urlopen(req)
     the_page = response.read()
 
-Note that urllib2 makes use of the same Request interface to handle all URL
+Note that urllib.request makes use of the same Request interface to handle all URL
 schemes. For example, you can make an FTP request like so::
 
-    req = urllib2.Request('ftp://example.com/')
+    req = urllib.request.Request('ftp://example.com/')
 
 In the case of HTTP, there are two extra things that Request objects allow you
 to do: First, you can pass data to be sent to the server. Second, you can pass
@@ -94,28 +94,28 @@ your browser does when you submit an HTML form that you filled in on the web. Not
 all POSTs have to come from forms: you can use a POST to transmit arbitrary data
 to your own application. In the common case of HTML forms, the data needs to be
 encoded in a standard way, and then passed to the Request object as the ``data``
-argument. The encoding is done using a function from the ``urllib`` library
-*not* from ``urllib2``. ::
+argument. The encoding is done using a function from the ``urllib.parse`` module,
+*not* from ``urllib.request``. ::
 
-    import urllib
-    import urllib2
+    import urllib.parse
+    import urllib.request
 
     url = 'http://www.someserver.com/cgi-bin/register.cgi'
     values = {'name' : 'Michael Foord',
               'location' : 'Northampton',
               'language' : 'Python' }
 
-    data = urllib.urlencode(values)
-    req = urllib2.Request(url, data)
-    response = urllib2.urlopen(req)
+    data = urllib.parse.urlencode(values).encode('ascii')  # urlopen needs bytes
+    req = urllib.request.Request(url, data)
+    response = urllib.request.urlopen(req)
     the_page = response.read()
 
 Note that other encodings are sometimes required (e.g. for file upload from HTML
 forms - see `HTML Specification, Form Submission
 <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
 details).
 
-If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
+If you do not pass the ``data`` argument, urllib.request uses a **GET** request. One
 way in which GET and POST requests differ is that POST requests often have
 "side-effects": they change the state of the system in some way (for example by
 placing an order with the website for a hundredweight of tinned spam to be
@@ -127,18 +127,18 @@ GET request by encoding it in the URL itself.
 
 This is done as follows::
 
-    >>> import urllib2
-    >>> import urllib
+    >>> import urllib.request
+    >>> import urllib.parse
     >>> data = {}
     >>> data['name'] = 'Somebody Here'
     >>> data['location'] = 'Northampton'
     >>> data['language'] = 'Python'
-    >>> url_values = urllib.urlencode(data)
+    >>> url_values = urllib.parse.urlencode(data)
     >>> print(url_values)
     name=Somebody+Here&language=Python&location=Northampton
     >>> url = 'http://www.example.com/example.cgi'
     >>> full_url = url + '?' + url_values
-    >>> data = urllib2.open(full_url)
+    >>> data = urllib.request.urlopen(full_url)
 
 Notice that the full URL is created by adding a ``?`` to the URL, followed by
 the encoded values.
@@ -150,7 +150,7 @@ We'll discuss here one particular HTTP header, to illustrate how to add headers
 to your HTTP request.
 
 Some websites [#]_ dislike being browsed by programs, or send different versions
-to different browsers [#]_. By default urllib2 identifies itself as
+to different browsers [#]_. By default urllib identifies itself as
 ``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
 numbers of the Python release,
 e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
@@ -160,8 +160,8 @@ pass a dictionary of headers in. The following example makes the same
 request as above, but identifies itself as a version of Internet
 Explorer [#]_. ::
 
-    import urllib
-    import urllib2
+    import urllib.parse
+    import urllib.request
 
     url = 'http://www.someserver.com/cgi-bin/register.cgi'
     user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
@@ -170,9 +170,9 @@ Explorer [#]_. ::
               'language' : 'Python' }
     headers = { 'User-Agent' : user_agent }
 
-    data = urllib.urlencode(values)
-    req = urllib2.Request(url, data, headers)
-    response = urllib2.urlopen(req)
+    data = urllib.parse.urlencode(values).encode('ascii')  # urlopen needs bytes
+    req = urllib.request.Request(url, data, headers)
+    response = urllib.request.urlopen(req)
     the_page = response.read()
 
 The response also has two useful methods. See the section on `info and geturl`_
@@ -182,7 +182,7 @@ which comes after we have a look at what happens when things go wrong.
 Handling Exceptions
 ===================
 
-*urlopen* raises ``URLError`` when it cannot handle a response (though as usual
+*urlopen* raises ``urllib.error.URLError`` when it cannot handle a response (though as usual
 with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also
 be raised).
 
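The failure is described by the exception's ``reason`` attribute. Constructing a ``URLError`` by hand (purely illustrative; in real code ``urlopen`` raises it for you) shows the shape:

```python
import urllib.error

# Normally urlopen raises this; building one directly is just a demo.
err = urllib.error.URLError('connection refused')
print(err.reason)  # connection refused
```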
@@ -199,9 +199,9 @@ error code and a text error message.
 
 e.g. ::
 
-    >>> req = urllib2.Request('http://www.pretend_server.org')
-    >>> try: urllib2.urlopen(req)
-    >>> except URLError, e:
+    >>> req = urllib.request.Request('http://www.pretend_server.org')
+    >>> try: urllib.request.urlopen(req)
+    >>> except urllib.error.URLError as e:
     >>>    print(e.reason)
     >>>
     (4, 'getaddrinfo failed')
@@ -214,7 +214,7 @@ Every HTTP response from the server contains a numeric "status code". Sometimes
 the status code indicates that the server is unable to fulfil the request. The
 default handlers will handle some of these responses for you (for example, if
 the response is a "redirection" that requests the client fetch the document from
-a different URL, urllib2 will handle that for you). For those it can't handle,
+a different URL, urllib.request will handle that for you). For those it can't handle,
 urlopen will raise an ``HTTPError``. Typical errors include '404' (page not
 found), '403' (request forbidden), and '401' (authentication required).
@@ -305,12 +305,12 @@ dictionary is reproduced here for convenience ::
 
 When an error is raised the server responds by returning an HTTP error code
 *and* an error page. You can use the ``HTTPError`` instance as a response on the
 page returned. This means that as well as the code attribute, it also has read,
-geturl, and info, methods. ::
+geturl, and info methods, as provided by the ``urllib.response`` module. ::
 
-    >>> req = urllib2.Request('http://www.python.org/fish.html')
+    >>> req = urllib.request.Request('http://www.python.org/fish.html')
     >>> try:
-    >>>     urllib2.urlopen(req)
-    >>> except URLError, e:
+    >>>     urllib.request.urlopen(req)
+    >>> except urllib.error.URLError as e:
     >>>    print(e.code)
     >>>    print(e.read())
     >>>
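Because ``HTTPError`` doubles as a response object, its interface can be sketched offline by constructing an instance directly, something ``urlopen`` would normally do for you (the URL and body below are made up for illustration):

```python
import io
import urllib.error

# Hand-built HTTPError: url, code, msg, headers, and a file-like body.
err = urllib.error.HTTPError(
    'http://www.python.org/fish.html', 404, 'Not Found', {},
    io.BytesIO(b'<html>error page</html>'))

print(err.code)      # 404
print(err.geturl())  # http://www.python.org/fish.html
print(err.read())    # b'<html>error page</html>'
```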
@@ -334,7 +334,8 @@ Number 1
 ::
 
 
-    from urllib2 import Request, urlopen, URLError, HTTPError
+    from urllib.request import Request, urlopen
+    from urllib.error import URLError, HTTPError
     req = Request(someurl)
     try:
         response = urlopen(req)
@@ -358,7 +359,8 @@ Number 2
 
 ::
 
-    from urllib2 import Request, urlopen, URLError
+    from urllib.request import Request, urlopen
+    from urllib.error import URLError
     req = Request(someurl)
     try:
         response = urlopen(req)
@@ -377,7 +379,8 @@ info and geturl
 ===============
 
 The response returned by urlopen (or the ``HTTPError`` instance) has two useful
-methods ``info`` and ``geturl``.
+methods, ``info`` and ``geturl``, defined by the
+``urllib.response`` module.
 
 **geturl** - this returns the real URL of the page fetched. This is useful
 because ``urlopen`` (or the opener object used) may have followed a
@@ -397,7 +400,7 @@ Openers and Handlers
 ====================
 
 When you fetch a URL you use an opener (an instance of the perhaps
-confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
+confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
 the default opener - via ``urlopen`` - but you can create custom
 openers. Openers use handlers. All the "heavy lifting" is done by the
 handlers. Each handler knows how to open URLs for a particular URL scheme (http,
@@ -466,24 +469,24 @@ The top-level URL is the first URL that requires authentication. URLs "deeper"
 than the URL you pass to .add_password() will also match. ::
 
     # create a password manager
-    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
+    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
 
     # Add the username and password.
     # If we knew the realm, we could use it instead of ``None``.
     top_level_url = "http://example.com/foo/"
     password_mgr.add_password(None, top_level_url, username, password)
 
-    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
+    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
 
     # create "opener" (OpenerDirector instance)
-    opener = urllib2.build_opener(handler)
+    opener = urllib.request.build_opener(handler)
 
     # use the opener to fetch a URL
     opener.open(a_url)
 
     # Install the opener.
-    # Now all calls to urllib2.urlopen use our opener.
-    urllib2.install_opener(opener)
+    # Now all calls to urllib.request.urlopen use our opener.
+    urllib.request.install_opener(opener)
 
 .. note::
@@ -505,46 +508,46 @@ not correct.
 
 Proxies
 =======
 
-**urllib2** will auto-detect your proxy settings and use those. This is through
+**urllib.request** will auto-detect your proxy settings and use those. This is through
 the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
 a good thing, but there are occasions when it may not be helpful [#]_. One way
 to disable this is to set up our own ``ProxyHandler``, with no proxies defined.
 This is done using similar steps to setting up a `Basic Authentication`_ handler::
 
-    >>> proxy_support = urllib2.ProxyHandler({})
-    >>> opener = urllib2.build_opener(proxy_support)
-    >>> urllib2.install_opener(opener)
+    >>> proxy_support = urllib.request.ProxyHandler({})
+    >>> opener = urllib.request.build_opener(proxy_support)
+    >>> urllib.request.install_opener(opener)
 
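None of the handler and opener construction above touches the network, so the effect of the empty ``ProxyHandler`` can be inspected directly; a small sketch:

```python
import urllib.request

# An empty mapping means this handler will never route through a proxy.
proxy_support = urllib.request.ProxyHandler({})
print(proxy_support.proxies)  # {}

# build_opener returns an OpenerDirector wired with the given handler.
opener = urllib.request.build_opener(proxy_support)
print(isinstance(opener, urllib.request.OpenerDirector))  # True
```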
 .. note::
 
-    Currently ``urllib2`` *does not* support fetching of ``https`` locations
-    through a proxy. However, this can be enabled by extending urllib2 as
+    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
+    through a proxy. However, this can be enabled by extending urllib.request as
     shown in the recipe [#]_.
 
 
 Sockets and Layers
 ==================
 
-The Python support for fetching resources from the web is layered. urllib2 uses
-the http.client library, which in turn uses the socket library.
+The Python support for fetching resources from the web is layered.
+urllib.request uses the http.client library, which in turn uses the socket library.
 
 As of Python 2.3 you can specify how long a socket should wait for a response
 before timing out. This can be useful in applications which have to fetch web
 pages. By default the socket module has *no timeout* and can hang. Currently,
-the socket timeout is not exposed at the http.client or urllib2 levels.
+the socket timeout is not exposed at the http.client or urllib.request levels.
 However, you can set the default timeout globally for all sockets using ::
 
     import socket
-    import urllib2
+    import urllib.request
 
     # timeout in seconds
     timeout = 10
     socket.setdefaulttimeout(timeout)
 
-    # this call to urllib2.urlopen now uses the default timeout
+    # this call to urllib.request.urlopen now uses the default timeout
     # we have set in the socket module
-    req = urllib2.Request('http://www.voidspace.org.uk')
-    response = urllib2.urlopen(req)
+    req = urllib.request.Request('http://www.voidspace.org.uk')
+    response = urllib.request.urlopen(req)
 
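The global default set above can be read back with ``socket.getdefaulttimeout``, which makes the change easy to verify and to restore (a small sketch):

```python
import socket

# Remember the current default so it can be restored afterwards.
previous = socket.getdefaulttimeout()

socket.setdefaulttimeout(10)
print(socket.getdefaulttimeout())  # 10.0

# Put the old value back so other code is unaffected.
socket.setdefaulttimeout(previous)
```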
 
 -------