Skip to content

Latest commit

 

History

History
executable file
·
407 lines (281 loc) · 21.1 KB

第十四章-可迭代之迭代器和生成器.md

File metadata and controls

executable file
·
407 lines (281 loc) · 21.1 KB

CHAPTER 14 Iterables, iterators and generators

When I see patterns in my programs, I consider it a sign of trouble. The shape of a program should reflect only the problem it needs to solve. Any other regularity in the code is a sign, to me at least, that I’m using abstractions that aren’t powerful enough — often that I’m generating by hand the expansions of some macro that I need to write [1]. — Paul Graham Lisp hacker and venture capitalist

Iteration is fundamental to data processing. And when scanning datasets that don’t fit in memory, we need a way to fetch the items lazily, that is, one at a time and on demand. This is what the Iterator pattern is about. This chapter shows how the Iterator pattern is built into the Python language so you never need to implement it by hand.

迭代是数据处理的基础。当扫描数据集时不要放进内存,我们需要一种惰性获取项的方法,即,

Python does not have macros like Lisp (Paul Graham’s favorite language), so abstracting away the Iterator pattern required changing the language: the yield keyword was added in Python 2.2 (2001)[2]. The yield keyword allows the construction of generators, which work as iterators.

不像Lisp,Python没有宏,所以抽象出迭代器模式就要求改变语言:在Python 2.2(2001)【注释2】中加入的yield关键子。

Note

Every generator is an iterator: generators fully implement the iterator interface. But an iterator — as defined in the GoF book — re‐trieves items from a collection, while a generator can produce items “out of thin air”. That’s why the Fibonacci sequence generator is a common example: an infinite series of numbers cannot be stored in a collection. However, be aware that the Python community treats iterator and generator as synonyms most of the time.

注释 每个生成器都是一个迭代器:生成器完全实现了迭代器接口。但迭代器在GOF这本书中定义为:从集合中重新取回项,而生成器则是“凭空”产生项。

  1. From Revenge of the Nerds, a blog post.

  2. Python 2.2 users could use yield with the directive from future import generators; yield became available by default in Python 2.3.

  3. 来自博文《呆瓜的复仇》。

  4. Python 2.2 用户能够利用命令from future import generators运用yield;在Python 2.3中yield默认可用。

Python 3 uses generators in many places. Even the range() built-in now returns a generator-like object instead of full-blown lists like before. If you must build a list from range, you have to be explicit, e.g. list(range(100)).

Python 3在很多地方使用了生成器。甚至内建的range()现在也返回一个类生成器对象,而不是之前的完整列表。如果你必须用range构建出一个列表,那么你必须明确的指明,比如,list(range(100))。

Every collection in Python is iterable, and iterators are used internally to support:

Python中的每一个集合都是可迭代的,迭代器用于集合的内部以支持以下动作:

  • for loops;

  • collection types construction and extension;

  • looping over text files line by line;

  • list, dict and set comprehensions;

  • tuple unpacking;

  • unpacking actual parameters with * in function calls.

  • 支持循环;

  • 集合类型构建以及扩展;

  • 一行接一行的循环文本文件;

  • 列表、字典和集合解析式;

  • 元组解包;

  • 在函数调用中使用 * 解包实参。

This chapter covers the following topics:

本章讨论以下话题:

  • How the iter(...) built-in function is used internally to handle iterable objects.

  • How to implement the classic Iterator pattern in Python.

  • How a generator function works in detail, with line by line descriptions.

  • How the classic Iterator can be replaced by a generator function or generator ex‐ pression.

  • Leveraging the general purpose generator functions in the standard library.

  • Using the new yield from statement to combine generators.

  • A case study: using generator functions in a database conversion utility designed to work with large data sets.

  • Why generators and coroutines look alike but are actually very different and should not be mixed.

  • 内建函数iter(...)是如何用在内部去处理可迭代对象的。

  • 在Python如何实现典型的迭代器模式

  • 生成器函数具体是如何工作的,我们一字一句得来说一说

  • 改进标准库中的普通用途生成器函数

  • 在语句中使用新的yield来合并生成器

  • 案例研究:在数据库中使用生成器函数转变实用设计以便用大数据集。

  • 为什么生成器和协程看起相似,但实际上有着很大不同,所以不要把它们给弄混淆了。

We’ll get started studying how the iter(...) function makes sequences iterable.

我们从研习iter(...)函数如何使序列可迭代开始。

Sentence take #1: a sequence of words

We’ll start our exploration of iterables by implementing a Sentence class: you give its constructor a string with some text, and then you can iterate word by word. The first version will implement the sequence protocol, and it’s iterable because all sequences are iterable, as we’ve seen before, but now we’ll see exactly why.

我们通过实现的Sentence类开始对可迭代的探究:你对类的构造器赋值一些文本,然后你就可以一个词接着一个词的迭代。第一个版本会将实现序列协议,

Example 14-1 shows a Sentence class that extracts words from a text by index.

例子14-1 展示了一个通过索引从文本提取单词的Sentence类。

Example 14-1. sentence.py: A Sentence as a sequence of words.

例子14-1.sentence.py:由单词序列组成的Sentence

import re
import reprlib


RE_WORD = re.compile('\w+')


class Sentence:
    def __init__(self, text): 
        self.text = text
	  	self.words = RE_WORD.findall(text) # 1

    def __getitem__(self, index):
        return self.words[index]  # 2

    def __len__(self):
        return len(self.words) 

    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
  1. re.findall returns a list with all non-overlapping matches of the regular expression, as a list of strings.

  2. self.words holds the result of .findall, so we simply return the word at the given index.

  3. To complete the sequence protocol, we implement __len__ — but it is not needed to make an iterable object.

  4. reprlib.repr is a utility function to generate abbreviated string representations of data structures that can be very large [3].

  5. re.findall返回一个正则表达式的所有不重叠匹配的列表。

  6. self.words包含了.findall的结果,所以我们就简单地返回指定索引的单词。

  7. 为了完成序列协议,我们实现了__len__,但是对于生成可迭代对象来所,这并不是必须的。

  8. reprlib.repr是一个用来生成大量数据的字符串显示缩写的实用函数【注释3】。

[注释3] Wefirstusedreprlibin“Vectortake#1:Vector2dcompatible”onpage278.`

By default, reprlib.repr limits the generated string to 30 characters. See the following console session to see how Sentence is used:

默认情况下,reprlib.repr限制生成的字符串为30个字符串。在下列终端中的会话你可以看到Sentence是如何使用的:

Example 14-2. Testing iteration on a Sentence instance.

>>> s = Sentence('"The time has come," the Walrus said,') # 1
>>> s
Sentence('"The time ha... Walrus said,') # 2
>>>for word in s: # 3
...   print(word) 
The
time
has
come
the
Walrus
said
>>> list(s) # 4
['The', 'time', 'has', 'come', 'the', 'Walrus', 'said']
  1. A sentence is created from a string.

  2. Note the output of __repr__ using ... generated by reprlib.repr.

  3. Sentence instances are iterable, we’ll see why in a moment.

  4. Being iterable, Sentence objects can be used as input to build lists and other iterable types.

  5. 创建了一个字符串句子。

  6. 注意__repr__的输出,它使用了通过由reprlib.repr生成的...

  7. Sentence实例是可迭代的,稍后我们来看看为什么。

  8. 因为可迭代,所以Sentence对象可以用于输入以构造列表和其他可迭代类型。

In the next pages, we’ll develop other Sentence classes that pass the tests in Example 14-2. However, the implementation in Example 14-1 is different from all the others because it’s also a sequence, so you can get words by index:

在下一页,我们会编写其他的Sentence在例子14-2中通过测试的类。不过,在例子14-1中实现的例子和其他的所有例子都不相同,因为结果就是一个序列,所以能通过索引获取单词:

>>> s[0]
  'The' 
>>> s[5]
  'Walrus'
>>> s[-1]
 'said'

Every Python programmer knows that sequences are iterable. Now we’ll see precisely why.

每个Python程序员都知道序列是可迭代的。现在我们来仔细的一探究竟。

Why sequences are iterable: the iter function

为什么序列可以迭代:iter函数

Whenever the interpreter needs to iterate over an object x, it automatically calls iter(x). The iter built-in function:

不论何时当解释器需要迭代对象x时,它会自动地调用iter(x)。iter为内建函数:

  1. Checks whether the object implements, __iter__, and calls that to obtain an iterator;

  2. If __iter__ is not implemented, but __getitem__ is implemented, Python creates an iterator that attempts to fetch items in order, starting from index 0 (zero);

  3. If that fails, Python raises TypeError, usually saying "'C' object is not itera ble", where C is the class of the target object.

  4. 检查对象是否实现__iter__,并调用它来获得一个迭代器;

  5. 如果没有实现__iter__,而是实现了__getitem__,Python会创建一个迭代器,从索引0开始,尝试按顺序去获取项;

  6. 如果以上方法调用都失败了,Python抛出TypeError,通常是“C object is not iterable”,这里C是目标对象的类。

That is why any Python sequence is iterable: they all implement __getitem__ . In fact, the standard sequences also implement __iter__, and yours should too, because the special handling of __getitem__ exists for backward compatibility reasons and may be gone in the future (although it is not deprecated as I write this).

这就是为什么Python序列可以迭代的原因:它们都实现了__getitem__。实际上,标准序列也实现了__iter__,而且你的自定义序列也应该如此,因为现有的为了向后兼容性的特殊处理 __getitem__,在未来或许会消失(尽管在我写本书时还没有移除)。

As mentioned in “Python digs sequences” on page 312, this is an extreme form of duck typing: an object is considered iterable not only when it implements the special method __iter__, but also when it implements __getitem__, as long as __getitem__ accepts int keys starting from 0.

就像在312页中提及的“深入Python序列”那样,这就是一个鸭子类型的一个极致表现:一个对象被认为是可迭代的不仅仅是在它实现了特殊方法__iter__之时,而且在当它实现了__getitem__,只要__getitem__能够接受从0开始的整数键。

In the goose-typing approach, the definition for an iterable is simpler but not as flexible: an object is considered iterable if it implements the __iter__ method. No subclassing or registration is required, because abc.Iterable implements the __subclasshook__, as seen in “Geese can behave as ducks” on page 340. Here is a demonstration:

在鹅类型方法中,可迭代的定义简单但是不那么灵活:如果一个对象实现了__iter__方法便认为是可迭代的。没有子类化或者注册上的要求,因为abc.Iterable实现了__subclasshook__,你可以在340页中的“鹅的行为可以像鸭子”中见到。这里对其说明:

>>> class Foo:
...     def __iter__(self):
...         pass
...
>>> from collections import abc 
>>> issubclass(Foo, abc.Iterable) True
>>> f = Foo()
>>> isinstance(f, abc.Iterable) True

However, note that our initial Sentence class does not pass the issubclass(Sentence, abc.Iterable) test, even though it is iterable in practice.

不过,值得注意的是第一版Sentence类并没有通过issubclass(Sentence, abc.Iterable)测试,即便实际上它是可迭代的。

Note

As of Python 3.4, the most accurate way to check whether an object x is iterable is to call iter(x) and handle a TypeError exception if it isn’t. This is more accurate than using isinstance(x, abc.Iterable), because iter(x) also considers the legacy __getitem__ method, while the Iterable ABC does not.

注释

从Python 3.4开始,检查对象x是否是可迭代的最精确方法为调用iter(x),并在对象不可迭代时处理TypeError异常。这比使用isinstance(x, abc.Iterable)更为精准,因为iter(x)还使用了早期的__getitem__方法,而Iterable ABC则不然。

Explicitly checking whether an object is iterable may not be worthwhile if right after the check you are going to iterate over the object. After all, when the iteration is attempted on a noniterable, the exception Python raises is clear enough: TypeError: 'C' object is not iterable . If you can do better than just raising TypeError, then do so in a try/except block instead of doing an explicit check. The explicit check may make sense if you are holding on to the object to iterate over it later; in this case, catching the error early may be useful.

明确地检查一个对象是否时可迭代

The next section makes explicit the relationship between iterables and iterators.

下一节将明确一下可迭代和迭代器之间的关系。

Iterables Versus Iterators

可迭代与迭代器

From the explanation in “Why Sequences Are Iterable: The iter Function” on page 404 we can extrapolate a definition:

从404页中的定义“为什么序列是可迭代:iter函数”中我们可以推断定义:

iterable
Any object from which the iter built-in function can obtain an iterator. Objects implementing an iter method returning an iterator are iterable. Sequences are always iterable; as are objects implementing a getitem method that takes 0-based indexes.

可迭代
任何

It’s important to be clear about the relationship between iterables and iterators: Python obtains iterators from iterables.

Here is a simpleforloop iterating over astr. Thestr'ABC'is the iterable here. You don’t see it, but there is an iterator behind the curtain:

>>> s = 'ABC'
>>> for char in s: 
...     print(char) 
...
A
B
C

If there was no for statement and we had to emulate the for machinery by hand with a while loop, this is what we’d have to write:

>>> s = 'ABC'
>>> it = iter(s) 
# >>> while True: 
... try:
 ... 
 ... ... 
 ... ... A B C
print(next(it)) 
# except StopIteration: #
delit # break #
  1. Build an iterator it from the iterable.

  2. Repeatedly call next on the iterator to obtain the next item.

  3. The iterator raises StopIteration when there are no further items.

  4. Release reference to it—the iterator object is discarded.

  5. Exit the loop.

  6. 从可迭代对象中构造一个迭代器

  7. 对迭代器重复地调用next以获取下一个项。

StopIteration signals that the iterator is exhausted. This exception is handled inter‐nally in for loops and other iteration contexts like list comprehensions, tuple unpacking, etc.

StopIteration指示迭代器迭代完了。

The standard interface for an iterator has two methods:

迭代器的标准接口有两个方法:

next Returns the next available item, raising StopIteration when there are no more items.

iter Returns self; this allows iterators to be used where an iterable is expected, for example, in a for loop.

next 返回下一个可用的项,在没有更多可用项时抛出StopIteration。
iter 返回self;允许将迭代器用在期望可迭代的地方,例如,在循环中。

This is formalized in the collections.abc.Iterator ABC, which defines the __next__ abstract method, and subclasses Iterable—where the abstract __iter__ method is defined. See Figure 14-1.

在collections.abc.Iterator ABC中这是标准化用法,它定义了 __next__ 抽象方法,以及可迭代的子类——即定义抽象__iter__方法的地方。参见图表14-1.

img

Figure 14-1. The Iterable and Iterator ABCs. Methods in italic are abstract. A concrete Iterable.iter should return a new Iterator instance. A concrete Iterator must implement next. The Iterator.iter method just returns the instance itself.

The Iterator ABC implements __iter__ by doing return self. This allows an iterator to be used wherever an iterable is required. The source code for abc.Iterator is in Example 14-3.

Example 14-3. abc.Iterator class; extracted from Lib/_collections_abc.py

class Iterator(Iterable):
    __slots__ = ()

    @abstractmethod
    def __next__(self):
        'Return the next item from the iterator. When exhausted, raise StopIteration' 
        raise StopIteration
    
    def __iter__(self): 
        return self
    
    @classmethod
    def __subclasshook__(cls, C): 
        if cls is Iterator:
            if (any("__next__" in B.__dict__ for B in C.__mro__) and any("__iter__" in B.__dict__ for in C.__mro__)):
                return True 
        return NotImplemented

Warning

The Iterator ABC abstract method is it.__next__() in Python 3 and it.next() in Python 2. As usual, you should avoid calling special methods directly. Just use the next(it): this built-in func‐ tion does the right thing in Python 2 and 3.

警告⚠️

The Lib/types.py module source code in Python 3.4 has a comment that says:

# Iterators in Python aren't a matter of type but of protocol.  A large
# and changing number of builtin types implement *some* flavor of
# iterator.  Don't check the type!  Use hasattr to check for both
# "__iter__" and "__next__" attributes instead.

In fact, that’s exactly what the __subclasshook__ method of the abc.Iterator ABC does (see Example 14-3).

Note

Taking into account the advice from Lib/types.py and the logic implemented in Lib/_collections_abc.py, the best way to check if an object x is an iterator is to call isinstance(x, abc.Iterator). Thanks to Iterator.__subclasshook__, this test works even if the class of x is not a real or virtual subclass of Iterator.

注意

Back to our Sentence class from Example 14-1, you can clearly see how the iterator is built by iter(...) and consumed by next(...) using the Python console:

>>> s3 = 'ABC' # 1
>>> it = iter(s) # 1
>>> while True:
...     try:
...         print(next(it)) # 2
...     except StopIteration: # 3
...         del it # 4
...         break # 5
... 
A
B
C
  1. Create a sentence s3 with three words.

  2. Obtain an iterator from s3.

  3. next(it) fetches the next word.

  4. There are no more words, so the iterator raises a StopIteration exception. Once exhausted, an iterator becomes useless.

  5. To go over the sentence again, a new iterator must be built.

  6. 穿件一个

Because the only methods required of an iterator are __next__ and __iter__, there is no way to check whether there are remaining items, other than to call next() and catch StopInteration. Also, it’s not possible to “reset” an iterator. If you need to start over, you need to call iter(...) on the iterable that built the iterator in the first place. Calling iter(...) on the iterator itself won’t help, because—as mentioned—Iterator.__iter__ is implemented by returning self, so this will not reset a depleted iter‐ ator.

因为

To wrap up this section, here is a definition for iterator:

为了结束本节,这里是迭代器等定义:

iterator Any object that implements the __next__ no-argument method that returns the next item in a series or raises StopIteration when there are no more items. Python iterators also implement the __iter__ method so they are iterable as well.

迭代器

This first version of Sentence was iterable thanks to the special treatment the iter(...) built-in gives to sequences. Now we’ll implement the standard iterable protocol.

Sentence Take #2: A Classic Iterator

The next Sentence class is built according to the classic Iterator design pattern following the blueprint in the GoF book. Note that this is not idiomatic Python, as the next re‐ factorings will make very clear. But it serves to make explicit the relationship between the iterable collection and the iterator object.

Example 14-4 shows an implementation of a Sentence that is iterable because it imple‐ ments the __iter__ special method, which builds and returns a SentenceIterator. This is how the Iterator design pattern is described in the original Design Patterns book.

例子14-4展示了

We are doing it this way here just to make clear the crucial distinction between an iterable and an iterator and how they are connected.

Example 14-4. sentence_iter.py: Sentence implemented using the Iterator pattern