Skip to content

Commit 174f6bb

Browse files
author
tim_one
committed
Mostly in SequenceMatcher.{__chain_b, find_longest_match}:
This now does a dynamic analysis of which elements are so frequently repeated as to constitute noise. The primary benefit is an enormous speedup in find_longest_match, as the innermost loop can have factors of 100s less potential matches to worry about, in cases where the sequences have many duplicate elements. In effect, this zooms in on sequences of non-ubiquitous elements now. While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try! git-svn-id: http://svn.python.org/projects/python/trunk@26661 6015fed2-1504-0410-9fe1-9d1591cc4771
1 parent 55c981a commit 174f6bb

3 files changed

Lines changed: 84 additions & 34 deletions

File tree

Doc/lib/libdifflib.tex

Lines changed: 18 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -90,13 +90,19 @@ \section{\module{difflib} ---
9090
Optional keyword parameters \var{linejunk} and \var{charjunk} are
9191
for filter functions (or \code{None}):
9292

93-
\var{linejunk}: A function that should accept a single string
94-
argument, and return true if the string is junk (or false if it is
95-
not). The default is module-level function
93+
\var{linejunk}: A function that accepts a single string
94+
argument, and returns true if the string is junk, or false if not.
95+
The default is (\code{None}), starting with Python 2.3. Before then,
96+
the default was the module-level function
9697
\function{IS_LINE_JUNK()}, which filters out lines without visible
9798
characters, except for at most one pound character (\character{\#}).
99+
As of Python 2.3, the underlying \class{SequenceMatcher} class
100+
does a dynamic analysis of which lines are so frequent as to
101+
constitute noise, and this usually works better than the pre-2.3
102+
default.
98103

99-
\var{charjunk}: A function that should accept a string of length 1.
104+
\var{charjunk}: A function that accepts a character (a string of
105+
length 1), and returns if the character is junk, or false if not.
100106
The default is module-level function \function{IS_CHARACTER_JUNK()},
101107
which filters out whitespace characters (a blank or tab; note: bad
102108
idea to include newline in this!).
@@ -150,7 +156,7 @@ \section{\module{difflib} ---
150156
Return true for ignorable lines. The line \var{line} is ignorable
151157
if \var{line} is blank or contains a single \character{\#},
152158
otherwise it is not ignorable. Used as a default for parameter
153-
\var{linejunk} in \function{ndiff()}.
159+
\var{linejunk} in \function{ndiff()} before Python 2.3.
154160
\end{funcdesc}
155161

156162

@@ -443,16 +449,14 @@ \subsection{Differ Objects \label{differ-objects}}
443449
Optional keyword parameters \var{linejunk} and \var{charjunk} are
444450
for filter functions (or \code{None}):
445451

446-
\var{linejunk}: A function that should accept a single string
447-
argument, and return true if the string is junk. The default is
448-
module-level function \function{IS_LINE_JUNK()}, which filters out
449-
lines without visible characters, except for at most one pound
450-
character (\character{\#}).
452+
\var{linejunk}: A function that accepts a single string
453+
argument, and returns true if the string is junk. The default is
454+
\code{None}, meaning that no line is considered junk.
451455

452-
\var{charjunk}: A function that should accept a string of length 1.
453-
The default is module-level function \function{IS_CHARACTER_JUNK()},
454-
which filters out whitespace characters (a blank or tab; note: bad
455-
idea to include newline in this!).
456+
\var{charjunk}: A function that accepts a single character argument
457+
(a string of length 1), and returns true if the character is junk.
458+
The default is \code{None}, meaning that no character is
459+
considered junk.
456460
\end{classdesc}
457461

458462
\class{Differ} objects are used (deltas generated) via a single

Lib/difflib.py

Lines changed: 55 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -161,8 +161,6 @@ def __init__(self, isjunk=None, a='', b=''):
161161
# b2j
162162
# for x in b, b2j[x] is a list of the indices (into b)
163163
# at which x appears; junk elements do not appear
164-
# b2jhas
165-
# b2j.has_key
166164
# fullbcount
167165
# for x in b, fullbcount[x] == the number of times x
168166
# appears in b; only materialized if really needed (used
@@ -188,6 +186,10 @@ def __init__(self, isjunk=None, a='', b=''):
188186
# for x in b, isbjunk(x) == isjunk(x) but much faster;
189187
# it's really the has_key method of a hidden dict.
190188
# DOES NOT WORK for x in a!
189+
# isbpopular
190+
# for x in b, isbpopular(x) is true iff b is reasonably long
191+
# (at least 200 elements) and x accounts for more than 1% of
192+
# its elements. DOES NOT WORK for x in a!
191193

192194
self.isjunk = isjunk
193195
self.a = self.b = None
@@ -266,6 +268,12 @@ def set_seq2(self, b):
266268
# map at all, which stops the central find_longest_match method
267269
# from starting any matching block at a junk element ...
268270
# also creates the fast isbjunk function ...
271+
# b2j also does not contain entries for "popular" elements, meaning
272+
# elements that account for more than 1% of the total elements, and
273+
# when the sequence is reasonably large (>= 200 elements); this can
274+
# be viewed as an adaptive notion of semi-junk, and yields an enormous
275+
# speedup when, e.g., comparing program files with hundreds of
276+
# instances of "return NULL;" ...
269277
# note that this is only called when b changes; so for cross-product
270278
# kinds of matches, it's best to call set_seq2 once, then set_seq1
271279
# repeatedly
@@ -282,32 +290,44 @@ def __chain_b(self):
282290
# out the junk later is much cheaper than building b2j "right"
283291
# from the start.
284292
b = self.b
293+
n = len(b)
285294
self.b2j = b2j = {}
286-
self.b2jhas = b2jhas = b2j.has_key
287-
for i in xrange(len(b)):
288-
elt = b[i]
289-
if b2jhas(elt):
290-
b2j[elt].append(i)
295+
populardict = {}
296+
for i, elt in enumerate(b):
297+
if elt in b2j:
298+
indices = b2j[elt]
299+
if n >= 200 and len(indices) * 100 > n:
300+
populardict[elt] = 1
301+
del indices[:]
302+
else:
303+
indices.append(i)
291304
else:
292305
b2j[elt] = [i]
293306

307+
# Purge leftover indices for popular elements.
308+
for elt in populardict:
309+
del b2j[elt]
310+
294311
# Now b2j.keys() contains elements uniquely, and especially when
295312
# the sequence is a string, that's usually a good deal smaller
296313
# than len(string). The difference is the number of isjunk calls
297314
# saved.
298-
isjunk, junkdict = self.isjunk, {}
315+
isjunk = self.isjunk
316+
junkdict = {}
299317
if isjunk:
300-
for elt in b2j.keys():
301-
if isjunk(elt):
302-
junkdict[elt] = 1 # value irrelevant; it's a set
303-
del b2j[elt]
318+
for d in populardict, b2j:
319+
for elt in d.keys():
320+
if isjunk(elt):
321+
junkdict[elt] = 1
322+
del d[elt]
304323

305324
# Now for x in b, isjunk(x) == junkdict.has_key(x), but the
306325
# latter is much faster. Note too that while there may be a
307326
# lot of junk in the sequence, the number of *unique* junk
308327
# elements is probably small. So the memory burden of keeping
309328
# this dict alive is likely trivial compared to the size of b2j.
310329
self.isbjunk = junkdict.has_key
330+
self.isbpopular = populardict.has_key
311331

312332
def find_longest_match(self, alo, ahi, blo, bhi):
313333
"""Find longest matching block in a[alo:ahi] and b[blo:bhi].
@@ -388,6 +408,19 @@ def find_longest_match(self, alo, ahi, blo, bhi):
388408
besti, bestj, bestsize = i-k+1, j-k+1, k
389409
j2len = newj2len
390410

411+
# Extend the best by non-junk elements on each end. In particular,
412+
# "popular" non-junk elements aren't in b2j, which greatly speeds
413+
# the inner loop above, but also means "the best" match so far
414+
# doesn't contain any junk *or* popular non-junk elements.
415+
while besti > alo and bestj > blo and \
416+
not isbjunk(b[bestj-1]) and \
417+
a[besti-1] == b[bestj-1]:
418+
besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
419+
while besti+bestsize < ahi and bestj+bestsize < bhi and \
420+
not isbjunk(b[bestj+bestsize]) and \
421+
a[besti+bestsize] == b[bestj+bestsize]:
422+
bestsize += 1
423+
391424
# Now that we have a wholly interesting match (albeit possibly
392425
# empty!), we may as well suck up the matching junk on each
393426
# side of it too. Can't think of a good reason not to, and it
@@ -736,12 +769,16 @@ def __init__(self, linejunk=None, charjunk=None):
736769
- `linejunk`: A function that should accept a single string argument,
737770
and return true iff the string is junk. The module-level function
738771
`IS_LINE_JUNK` may be used to filter out lines without visible
739-
characters, except for at most one splat ('#').
772+
characters, except for at most one splat ('#'). It is recommended
773+
to leave linejunk None; as of Python 2.3, the underlying
774+
SequenceMatcher class has grown an adaptive notion of "noise" lines
775+
that's better than any static definition the author has ever been
776+
able to craft.
740777
741778
- `charjunk`: A function that should accept a string of length 1. The
742779
module-level function `IS_CHARACTER_JUNK` may be used to filter out
743780
whitespace characters (a blank or tab; **note**: bad idea to include
744-
newline in this!).
781+
newline in this!). Use of IS_CHARACTER_JUNK is recommended.
745782
"""
746783

747784
self.linejunk = linejunk
@@ -1005,17 +1042,17 @@ def IS_CHARACTER_JUNK(ch, ws=" \t"):
10051042

10061043
del re
10071044

1008-
def ndiff(a, b, linejunk=IS_LINE_JUNK, charjunk=IS_CHARACTER_JUNK):
1045+
def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
10091046
r"""
10101047
Compare `a` and `b` (lists of strings); return a `Differ`-style delta.
10111048
10121049
Optional keyword parameters `linejunk` and `charjunk` are for filter
10131050
functions (or None):
10141051
10151052
- linejunk: A function that should accept a single string argument, and
1016-
return true iff the string is junk. The default is module-level function
1017-
IS_LINE_JUNK, which filters out lines without visible characters, except
1018-
for at most one splat ('#').
1053+
return true iff the string is junk. The default is None, and is
1054+
recommended; as of Python 2.3, an adaptive notion of "noise" lines is
1055+
used that does a good job on its own.
10191056
10201057
- charjunk: A function that should accept a string of length 1. The
10211058
default is module-level function IS_CHARACTER_JUNK, which filters out

Misc/NEWS

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -72,9 +72,9 @@ Core and builtins
7272

7373
Extension modules
7474

75-
- The bsddb.*open functions can now take 'None' as a filename.
75+
- The bsddb.*open functions can now take 'None' as a filename.
7676
This will create a temporary in-memory bsddb that won't be
77-
written to disk.
77+
written to disk.
7878

7979
- posix.mknod was added.
8080

@@ -99,6 +99,15 @@ Extension modules
9999

100100
Library
101101

102+
- difflib's SequenceMatcher class now does a dynamic analysis of
103+
which elements are so frequent as to constitute noise. For
104+
comparing files as sequences of lines, this generally works better
105+
than the IS_LINE_JUNK function, and function ndiff's linejunk
106+
argument defaults to None now as a result. A happy benefit is
107+
that SequenceMatcher may run much faster now when applied
108+
to large files with many duplicate lines (for example, C program
109+
text with lots of repeated "}" and "return NULL;" lines).
110+
102111
- New Text.dump() method in Tkinter module.
103112

104113
- New distutils commands for building packagers were added to

0 commit comments

Comments
 (0)