Lesson 20: Iterators and Generators

"Just lazy chaps"

Content

Iterators
Building Your Own Iterators
Performance overview
Generators
Generators VS Lists
Best practices
Quiz
Homework

1. Iterators

In Python, iterators are fundamental constructs that allow for efficient looping through collections of items, (iterables), such as lists or tuples. They implement two special methods, __iter__() and __next__(). The __iter__() method returns the iterator object itself and is automatically called at the start of loops.

1.1 Why do we need them?

1.Memory Efficiency: By allowing for item-by-item processing, iterators enable handling large datasets and streams efficiently, without loading everything into memory.

2.Lazy Evaluation: This feature enhances performance by delaying the computation of values until the moment they are actually needed. It allows for the handling of potentially infinite data streams.

1.2 Syntax

To manually iterate over an iterable object, you can use the iter() function to convert it into an iterator and then repeatedly call next() to get each item.

Example

my_list = [1, 2, 3]
my_iter = iter(my_list)

print(next(my_iter))
print(next(my_iter))
print(next(my_iter))

print(type(my_iter))

try:
    print(next(my_iter))  # This will raise StopIteration
except StopIteration:
    print("Reached the end of the iterator")

Output

1
2
3
<class 'list_iterator'>
Reached the end of the iterator

As well as we can use iterators in for loop. It internally converts the iterable into an iterator, automatically calls __iter__() to initiate the iteration, and handles the StopIteration exception by terminating the loop when the end of the iterator is reached.

Example

my_list = [4, 7, 0, 3]

# Iterating over the list
for item in my_list:
    print(item)

Output

It's a built in way of working with iterators, but in reality somemtimes we want to have more controll over them, so that we can define custom iterators.

2. Building Your Own Iterators

Creating a custom iterator involves defining a class that implements the __iter__() and __next__() methods. Let's create a simple class that iterates from 1 up to a given number.

Example

class CountUpTo:
    def __init__(self, max):
        self.max = max
        self.num = 1

    def __iter__(self):
        return self

    def __next__(self):
        if self.num <= self.max:
            result = self.num
            self.num += 1
            return result
        else:
            raise StopIteration

counter = CountUpTo(3)
for num in counter:     # Python calls method ``__next__()`` during each iteration
    print(num)

Explanation

The __iter__() method must return the iterator object itself. This is used by Python to create an iterator from an iterable object.
The __next__() method must return the next item in the sequence. On reaching the end, and to avoid an infinite loop, it should raise a StopIteration exception.

Output

1
2
3

A few more examples which can be used for real world applications such as iterator for processing a large file or the iterator representing the tray.

Example

class LargeFileLineIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        self.file = open(self.filepath, 'r')
        return self

    def __next__(self):
        line = self.file.readline()
        if line:
            return line.strip()  # Remove the newline character from the end
        else:
            self.file.close()  # Close the file when done
            raise StopIteration

# for will call iter under the hood
filepath = 'path/to/large/file.txt'
for line in LargeFileLineIterator(filepath):
    print(line)

Use Python Visualiser to show how exactly iterators are being called and what happens under the hood.

Example

class ListContainer(object):
    def __init__(self, fruits):
        self.fruits = fruits
    def __iter__(self):
        return iter(self.fruits)
	
# Imagine we have a really big amount of fruits here, in this case we might consider using a custom storage instead of default `list` in Python
fruits = ListContainer(["orange", "mango", "banana"]) 

for fruit in fruits:
    print(fruit)

Output

orange
mango
banana

3. Performance overview

You could ask, why do we need iterator here if we can use the default with open(filepath) context manager and read all lines as we have learnt before.

Let's take a closer look on the perfomance. In the example below we will compare processing a really big file with and without usage of iterators and compare the output.

Example

import time

start_time = time.time()

with open('assets/m.txt', 'r') as f:
    lines = f.readlines()
    line_count = len(lines)

end_time = time.time()
non_iterator_time = end_time - start_time

print(f"Line count: {line_count}")
print(f"Processing time without iterator: {non_iterator_time} seconds")

start_time = time.time()

line_count = 0
with open('assets/m.txt', 'r') as f:
    for line in f:  # This uses an iterator internally
        line_count += 1

end_time = time.time()
iterator_time = end_time - start_time

print(f"Line count: {line_count}")
print(f"Processing time with iterator: {iterator_time} seconds")


print(f"Time taken without using iterator: {non_iterator_time:.4f} seconds.")
print(f"Time taken using iterator: {iterator_time:.4f} seconds.")

Note:: Time processing may different because of hardware used for calculations

Output

# Time taken without using iterator: 0.0340 seconds.
# Time taken using iterator: 0.0199 seconds.

As you can see, the diference is not very significant, but for the bigger files in production enviroment it can play a key role for performance.

4. Generators

Generators are a simple and powerful tool for creating iterators in Python. They allow you to declare a function that behaves like an iterator, i.e., it can be used in a for loop.

4.1 Why?

1.Highly memory-efficient - they yield one item at a time, only holding one item in memory.
2.Reduce the complexity of creating iterators: There’s no need to implement the __iter__() and __next__() methods; the generator function automatically creates these methods in the background.

But the best part of using them is that:

Generators can be composed together, allowing for the construction of pipelines that process data in a memory-efficient manner and can be used to filter, transform, or aggregate data efficiently.

It is an ideal chocice for processing streams of data, such as log files, sensor data, or large datasets that cannot fit into memory.

I opened this approach (look at section 4.3.3) recently and didn't know about for a long time, but for now it is being used on the daily basis, let's finally take a look and skip this boring theory.

4.2 Syntax

A generator is defined much like a normal function, but it uses the yield statement to return data. Each time the generator's function is called, it resumes execution right after the yield statement where it left off.

This behavior allows generators to produce a sequence of values over time, pausing after each yield and continuing from there on the next call.

Example

def count_up_to(max):
    count = 1
    while count <= max:
        yield count     # instead of return!
        count += 1

# Using the generator
for number in count_up_to(3):
    print(number)

Output

1
2
3

4.3 Advanced Generator Examples

Here are a few more examples to illustrate their power in real-world scenarios.

4.3.1 Generating Infinite Sequences

One of the fascinating uses of generators is creating infinite sequences. Unlike lists or tuples, generators can produce values indefinitely.

Example: Infinite Fibonacci Sequence

def infinite_fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Using the generator
fib = infinite_fibonacci()
for _ in range(7):
    print(next(fib))

Ouput

4.3.2 Generator Expressions

Python supports generator expressions, which offer a concise syntax for creating generators. They are similar to list comprehensions but use parentheses instead of square brackets. It's same as a function with yield we have seen in a previous example.

Example

squares = (x**2 for x in range(10))
for square in squares:
    print(square)

print()

Output

IMPORTANT: Do not confuse them with list/set/dict comprehansitions!

4.3.3 Chaining Generators

Generators can be chained together to create powerful data processing pipelines.

Example

Imagine you have a log file where each line contains a timestamp and a message. You want to filter specific messages and then format them for display.

def read_logs(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line.strip()

def filter_errors(log_lines):
    for line in log_lines:
        if "ERROR" in line:
            yield line

def format_errors(error_lines):
    for line in error_lines:
        yield f"Error found: {line}"

# Chaining generators
log_path = "assets/application.log"
formatted_errors = format_errors(filter_errors(read_logs(log_path)))

for error in formatted_errors:
    print(error)

Output

Error found: 2024-05-04 08:05:00 ERROR: Database connection failed
Error found: 2024-05-04 08:20:00 ERROR: Server timeout

With generators, you can solve a wide range of programming problems such as infinite sequences, large datasets, or building complex data processing pipelines.

It's incredibly powerful mechanism, don't hesitate to use them in your apps!

4.4 Memory Usage

Generators are designed to yield one item at a time, only holding that item in memory, which contrasts sharply with lists that store all their elements in memory. Again, this difference becomes especially significant when working with large data volumes.

Example

Consider calculating the sum of a large range of numbers. Using a list comprehension would require storing all numbers in memory, whereas a generator expression does not.

import sys

# Using a list comprehension
large_list = [x for x in range(1000000)]
print("List memory:", sys.getsizeof(large_list), "bytes")

# Using a generator expression
large_gen = (x for x in range(1000000))
print("Generator memory:", sys.getsizeof(large_gen), "bytes")

Output

List memory: 8448728 bytes
Generator memory: 104 bytes

5. Generators VS Lists

Generators are ideal for:
- Large datasets that do not fit into memory.
- Stream processing or pipelining tasks where data can be processed sequentially.
- Situations where only a part of the data is needed at any one time.
Lists are better suited for:
- Situations requiring random access to elements.
- When the size of the dataset is relatively small, and the overhead of storing it in memory is not a concern.
- Tasks involving list-specific operations like slicing or list comprehensions that benefit from having all data available at once.

Decide what exactly should be used by the needs of your application. My genuine advice would be not to overuse generators as this can lead to some bugs which are hard to track.

6. Best practices

Using Context Managers: Whenever possible, use context managers (with statement) within your generator to ensure that resources are automatically cleaned up when the generator is exhausted or if an exception occurs.

def read_file_lines(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line

Try-Finally Blocks: For more complex scenarios where context managers cannot be used directly within the generator, ensure cleanup code is run through a try-finally block.

def custom_generator():
    resource = allocate_resource()
    try:
        yield from process_resource(resource)
    finally:
        free_resource(resource)

Explicit Closure: In cases where a generator may not be entirely consumed, ensure that any external resources are explicitly released. This can be done by calling the generator's close() method, which triggers any finally blocks associated with the generator.

gen = custom_generator()
try:
    next(gen)
    # If not consuming the entire generator,
    # ensure resources are released
finally:
    gen.close()

Additionaly I would recommend to try out itertools collections to dive deeper into generators as a part of your further learning.

7. Quiz

Question 1:

What will be the output of the following code snippet?

def simple_gen():
    yield 'Python'
    yield 'Rocks'

gen = simple_gen()

print(next(gen))
print(next(gen))

A) Python Rocks
B) Python So
C) StopIteration error

Question 2:

What is an iterator in Python?

A) A data type that can store multiple items.
B) An object that can be iterated upon and returns data, one element at a time when next() is called on it.
C) A syntax for handling exceptions.
D) A module that provides a way to iterate over data structures.

Question 3:

Which of the following is true about generator functions?

A) They return a single value using the return statement.
B) They can yield multiple values, one at a time.
C) They cannot be used in a for loop.
D) They consume more memory than equivalent list comprehensions.

Question 4:

What advantage do generators have over list comprehensions when dealing with large datasets?

A) Generators process elements faster than list comprehensions.
B) Generators enhance readability and are preferred for simple data processing tasks.
C) Generators yield one item at a time and are more memory-efficient.
D) Generators have a more straightforward syntax compared to list comprehensions.

Question 5:

Which Python module provides a collection of tools for handling iterators?

A) collections
B) functools
C) itertools
D) operator

8. Homework

Task 1: Custom Range Generator

Objective: Create a generator function that mimics the behavior of Python's built-in range function.

Requirements:

The generator should be able to handle the same arguments as range(): start, stop, and step.
It should yield one number at a time in the specified range.
Include proper handling for negative steps and reverse iteration.

# Starter code
def custom_range(start, stop=None, step=1):
    # Your implementation here
    pass

# Example usage
for num in custom_range(3, 15, 3):
    print(num)

Task 2: Log File Parser

Objective: Develop a generator that parses a log file and yields dictionaries of log data.

Requirements:

Each yielded dictionary should contain the parts of a log line, such as timestamp, log level, and message.
The generator should handle large files efficiently.
Write a function to filter yielded log entries by log level (INFO, DEBUG, ERROR).

def log_parser(log_file_path):
    pass

def filter_logs(log_generator, log_level):
    pass

logs = log_parser('path/to/log/file')
error_logs = filter_logs(logs, 'ERROR')
for log in error_logs:
    print(log)

Task 3: Batch Processor

Objective: Write a generator function that processes items in batches of a specified size.

Requirements:

The generator should accept any iterable as input.
It should yield lists containing a batch of items.
If the number of items in the last batch is less than the batch size, it should still be yielded.

def batch_processor(iterable, batch_size):
    pass

for batch in batch_processor(range(10), 3):
    print(batch)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!