"Just lazy chaps"
- Iterators
- Building Your Own Iterators
- Performance overview
- Generators
- Generators vs Lists
- Best practices
- Quiz
- Homework
In Python, iterators are fundamental constructs that allow for efficient looping through collections of items (iterables), such as lists or tuples. They implement two special methods, `__iter__()` and `__next__()`. The `__iter__()` method returns the iterator object itself and is automatically called at the start of loops; the `__next__()` method returns the next item and raises `StopIteration` when there is nothing left.
1. Memory Efficiency: By allowing item-by-item processing, iterators make it possible to handle large datasets and streams efficiently, without loading everything into memory.
2. Lazy Evaluation: Computation of a value is delayed until the moment it is actually needed. This improves performance and allows handling potentially infinite data streams (see the sketch below).
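To make lazy evaluation concrete, here is a minimal sketch using the built-in `map()`, which returns a lazy iterator. The `expensive()` function is a made-up placeholder: nothing is computed until an item is actually requested with `next()`.
def expensive(x):
    print(f"computing {x}...")  # a side effect so we can see when work happens
    return x * x

lazy = map(expensive, [1, 2, 3])  # builds an iterator; no computation yet
print("nothing computed so far")
print(next(lazy))  # only now does expensive(1) run and print "computing 1..."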
To manually iterate over an iterable object, you can use the `iter()` function to convert it into an iterator and then repeatedly call `next()` to get each item.
my_list = [1, 2, 3]
my_iter = iter(my_list)
print(next(my_iter))
print(next(my_iter))
print(next(my_iter))
print(type(my_iter))
try:
    print(next(my_iter))  # This will raise StopIteration
except StopIteration:
    print("Reached the end of the iterator")
1
2
3
<class 'list_iterator'>
Reached the end of the iterator
We can also use iterators in a `for` loop. The loop internally converts the iterable into an iterator by calling `__iter__()`, calls `__next__()` on every step, and handles the `StopIteration` exception by terminating the loop when the end of the iterator is reached.
my_list = [4, 7, 0, 3]
# Iterating over the list
for item in my_list:
    print(item)
4
7
0
3
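Under the hood, the `for` loop above is roughly equivalent to the following `while` loop; a minimal sketch of what Python does for you:
my_iter = iter(my_list)  # for calls __iter__() first
while True:
    try:
        item = next(my_iter)  # __next__() is called on every step
    except StopIteration:  # the loop ends when the iterator is exhausted
        break
    print(item)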
A `for` loop is the built-in way of working with iterators, but in practice we sometimes want more control over the process, and for that we can define custom iterators.
Creating a custom iterator involves defining a class that implements the `__iter__()` and `__next__()` methods. Let's create a simple class that iterates from 1 up to a given number.
class CountUpTo:
    def __init__(self, max):
        self.max = max
        self.num = 1

    def __iter__(self):
        return self

    def __next__(self):
        if self.num <= self.max:
            result = self.num
            self.num += 1
            return result
        else:
            raise StopIteration
counter = CountUpTo(3)
for num in counter:  # Python calls `__next__()` during each iteration
    print(num)
- The `__iter__()` method must return the iterator object itself. Python uses it to create an iterator from an iterable object.
- The `__next__()` method must return the next item in the sequence. On reaching the end, and to avoid an infinite loop, it should raise a `StopIteration` exception.
1
2
3
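One subtlety worth knowing: because `__iter__()` returns `self`, a `CountUpTo` instance is a single-use iterator; once exhausted, it stays exhausted. A small sketch:
counter = CountUpTo(3)
print(list(counter))  # [1, 2, 3]
print(list(counter))  # [] - the iterator is already exhausted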
Here are a few more examples closer to real-world applications, such as an iterator for processing a large file line by line, or an iterable container wrapping a tray of fruits.
class LargeFileLineIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        self.file = open(self.filepath, 'r')
        return self

    def __next__(self):
        line = self.file.readline()
        if line:
            return line.strip()  # Remove the newline character from the end
        else:
            self.file.close()  # Close the file when done
            raise StopIteration
# for will call iter() under the hood
filepath = 'path/to/large/file.txt'
for line in LargeFileLineIterator(filepath):
    print(line)
Use the Python Visualiser to see exactly how iterators are called and what happens under the hood.
class ListContainer(object):
    def __init__(self, fruits):
        self.fruits = fruits

    def __iter__(self):
        return iter(self.fruits)

# Imagine we had a really large number of fruits here; in that case we might consider a custom storage instead of the default `list` in Python
fruits = ListContainer(["orange", "mango", "banana"])
for fruit in fruits:
    print(fruit)
orange
mango
banana
You might ask: why do we need an iterator here if we can use the familiar `with open(filepath)` context manager and read all the lines at once, as we learnt before?
Let's take a closer look at the performance. In the example below we will process a really big file with and without iterators and compare the results.
import time

start_time = time.time()
with open('assets/m.txt', 'r') as f:
    lines = f.readlines()
    line_count = len(lines)
end_time = time.time()
non_iterator_time = end_time - start_time
print(f"Line count: {line_count}")
print(f"Processing time without iterator: {non_iterator_time} seconds")

start_time = time.time()
line_count = 0
with open('assets/m.txt', 'r') as f:
    for line in f:  # This uses an iterator internally
        line_count += 1
end_time = time.time()
iterator_time = end_time - start_time
print(f"Line count: {line_count}")
print(f"Processing time with iterator: {iterator_time} seconds")

print(f"Time taken without using iterator: {non_iterator_time:.4f} seconds.")
print(f"Time taken using iterator: {iterator_time:.4f} seconds.")
Note: timings will vary depending on the hardware used.
# Time taken without using iterator: 0.0340 seconds.
# Time taken using iterator: 0.0199 seconds.
As you can see, the time difference here is not dramatic, but there is a more important point: `readlines()` loads the entire file into memory, while iterating reads one line at a time. For bigger files in a production environment this can play a key role for performance.
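If you want to see the memory side of this trade-off, here is a minimal sketch using the standard `tracemalloc` module (assuming the same `assets/m.txt` file). `readlines()` keeps every line alive at once, while the iterator holds only one line at a time.
import tracemalloc

tracemalloc.start()
with open('assets/m.txt', 'r') as f:
    lines = f.readlines()  # the whole file is held in memory
print("Peak memory with readlines():", tracemalloc.get_traced_memory()[1], "bytes")
tracemalloc.stop()

tracemalloc.start()
with open('assets/m.txt', 'r') as f:
    line_count = sum(1 for _ in f)  # one line in memory at a time
print("Peak memory with the iterator:", tracemalloc.get_traced_memory()[1], "bytes")
tracemalloc.stop()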
Generators are a simple and powerful tool for creating iterators in Python. They allow you to declare a function that behaves like an iterator, i.e., it can be used in a for loop.
1. Highly memory-efficient: they yield one item at a time, holding only one item in memory.
2. Reduce the complexity of creating iterators: there's no need to implement the `__iter__()` and `__next__()` methods; the generator function creates them automatically in the background.
But the best part of using them is that generators can be composed together, allowing for the construction of pipelines that process data in a memory-efficient manner and can be used to filter, transform, or aggregate data efficiently. This makes them an ideal choice for processing streams of data, such as log files, sensor data, or large datasets that cannot fit into memory.
I discovered this approach (see section 4.3.3) only recently, after not knowing about it for a long time, but now I use it on a daily basis. Let's finally take a look and skip the boring theory.
A generator is defined much like a normal function, but it uses the `yield` statement to return data. Each time `next()` is called on the generator, it resumes execution right after the `yield` statement where it left off. This behavior allows generators to produce a sequence of values over time, pausing after each `yield` and continuing from there on the next call.
def count_up_to(max):
    count = 1
    while count <= max:
        yield count  # instead of return!
        count += 1

# Using the generator
for number in count_up_to(3):
    print(number)
1
2
3
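You can also drive a generator manually with `next()` to see the pause-and-resume behavior for yourself; a quick sketch:
gen = count_up_to(2)
print(next(gen))  # 1 - runs until the first yield, then pauses
print(next(gen))  # 2 - resumes right after the yield and pauses again
# A further next(gen) would raise StopIteration, just like a custom iterator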
Here are a few more examples to illustrate their power in real-world scenarios.
One of the fascinating uses of generators is creating infinite sequences. Unlike lists or tuples, generators can produce values indefinitely.
def infinite_fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Using the generator
fib = infinite_fibonacci()
for _ in range(7):
    print(next(fib))
0
1
1
2
3
5
8
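When you need only a finite slice of an infinite generator, `itertools.islice()` saves you the manual loop; a small sketch:
from itertools import islice

fib = infinite_fibonacci()
print(list(islice(fib, 7)))  # [0, 1, 1, 2, 3, 5, 8]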
Python supports generator expressions, which offer a concise syntax for creating generators. They are similar to list comprehensions but use parentheses instead of square brackets. A generator expression behaves the same as a function with `yield`, like the one we saw in the previous example.
squares = (x**2 for x in range(10))
for square in squares:
    print(square)
print()
0
1
4
9
16
25
36
49
64
81
IMPORTANT: Do not confuse them with list/set/dict comprehensions! A comprehension builds the whole collection in memory at once, while a generator expression produces items lazily.
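One pleasant consequence of this laziness: a generator expression can be passed straight to functions like `sum()` without building an intermediate list; a tiny sketch:
total = sum(x**2 for x in range(10))  # no intermediate list is created
print(total)  # 285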
Generators can be chained together to create powerful data processing pipelines.
Imagine you have a log file where each line contains a timestamp and a message. You want to filter specific messages and then format them for display.
def read_logs(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line.strip()

def filter_errors(log_lines):
    for line in log_lines:
        if "ERROR" in line:
            yield line

def format_errors(error_lines):
    for line in error_lines:
        yield f"Error found: {line}"

# Chaining generators
log_path = "assets/application.log"
formatted_errors = format_errors(filter_errors(read_logs(log_path)))
for error in formatted_errors:
    print(error)
Error found: 2024-05-04 08:05:00 ERROR: Database connection failed
Error found: 2024-05-04 08:20:00 ERROR: Server timeout
With generators you can solve a wide range of programming problems: infinite sequences, large datasets, and complex data processing pipelines.
It's an incredibly powerful mechanism, so don't hesitate to use it in your apps!
Generators are designed to yield one item at a time, only holding that item in memory, which contrasts sharply with lists that store all their elements in memory. Again, this difference becomes especially significant when working with large data volumes.
Consider calculating the sum of a large range of numbers. Using a list comprehension would require storing all numbers in memory, whereas a generator expression does not.
import sys
# Using a list comprehension
large_list = [x for x in range(1000000)]
print("List memory:", sys.getsizeof(large_list), "bytes")
# Using a generator expression
large_gen = (x for x in range(1000000))
print("Generator memory:", sys.getsizeof(large_gen), "bytes")
List memory: 8448728 bytes
Generator memory: 104 bytes
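That memory saving comes with a trade-off worth remembering: a generator can be consumed only once, while a list can be iterated repeatedly. A quick sketch:
gen = (x for x in range(3))
print(sum(gen))  # 3
print(sum(gen))  # 0 - the generator is already exhausted

nums = [x for x in range(3)]
print(sum(nums))  # 3
print(sum(nums))  # 3 - lists can be re-read as many times as you like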
- Generators are ideal for:
  - Large datasets that do not fit into memory.
  - Stream processing or pipelining tasks where data can be processed sequentially.
  - Situations where only a part of the data is needed at any one time.
- Lists are better suited for:
  - Situations requiring random access to elements (see the sketch after this list).
  - Cases where the size of the dataset is relatively small, and the overhead of storing it in memory is not a concern.
  - Tasks involving list-specific operations like slicing or list comprehensions that benefit from having all data available at once.
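Here is the random-access sketch mentioned above; indexing works on lists but not on generators:
items = [x * 2 for x in range(10)]
print(items[5])  # 10 - lists support indexing and slicing

gen = (x * 2 for x in range(10))
# print(gen[5])  # TypeError: 'generator' object is not subscriptable
print(list(gen)[5])  # 10 - you would have to materialise the whole thing first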
Decide what to use based on the needs of your application. My genuine advice would be not to overuse generators, as this can lead to bugs that are hard to track down.
Using Context Managers: Whenever possible, use context managers (the `with` statement) within your generator to ensure that resources are automatically cleaned up when the generator is exhausted or if an exception occurs.
def read_file_lines(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line
Try-Finally Blocks: For more complex scenarios where context managers cannot be used directly within the generator, ensure cleanup code is run through a `try-finally` block.
# allocate_resource(), process_resource() and free_resource() are placeholders
def custom_generator():
    resource = allocate_resource()
    try:
        yield from process_resource(resource)
    finally:
        free_resource(resource)  # runs even if the generator is abandoned early
Explicit Closure: In cases where a generator may not be entirely consumed, ensure that any external resources are explicitly released. This can be done by calling the generator's `close()` method, which triggers any `finally` blocks associated with the generator.
gen = custom_generator()
try:
    next(gen)
    # If not consuming the entire generator,
    # ensure resources are released
finally:
    gen.close()
Additionally, I would recommend trying out the `itertools` module to dive deeper into iterators and generators as part of your further learning.
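To whet your appetite, a tiny taster of `itertools` (all of these are lazy and compose nicely with your own generators):
from itertools import chain, count, islice

print(list(chain([1, 2], [3, 4])))  # joins iterables lazily: [1, 2, 3, 4]
print(list(islice(count(10), 3)))   # first three values of an infinite counter: [10, 11, 12]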
What will be the output of the following code snippet?
def simple_gen():
    yield 'Python'
    yield 'Rocks'

gen = simple_gen()
print(next(gen))
print(next(gen))
A) Python Rocks
B) Python So
C) `StopIteration` error
What is an iterator in Python?
A) A data type that can store multiple items.
B) An object that can be iterated upon and returns data, one element at a time, when `next()` is called on it.
C) A syntax for handling exceptions.
D) A module that provides a way to iterate over data structures.
Which of the following is true about generator functions?
A) They return a single value using the `return` statement.
B) They can yield multiple values, one at a time.
C) They cannot be used in a `for` loop.
D) They consume more memory than equivalent list comprehensions.
What advantage do generators have over list comprehensions when dealing with large datasets?
A) Generators process elements faster than list comprehensions.
B) Generators enhance readability and are preferred for simple data processing tasks.
C) Generators yield one item at a time and are more memory-efficient.
D) Generators have a more straightforward syntax compared to list comprehensions.
Which Python module provides a collection of tools for handling iterators?
A) collections
B) functools
C) itertools
D) operator
Objective: Create a generator function that mimics the behavior of Python's built-in range function.
- The generator should be able to handle the same arguments as `range()`: start, stop, and step.
- It should yield one number at a time in the specified range.
- Include proper handling for negative steps and reverse iteration.
# Starter code
def custom_range(start, stop=None, step=1):
    # Your implementation here
    pass

# Example usage
for num in custom_range(3, 15, 3):
    print(num)
Objective: Develop a generator that parses a log file and yields dictionaries of log data.
- Each yielded dictionary should contain the parts of a log line, such as timestamp, log level, and message.
- The generator should handle large files efficiently.
- Write a function to filter yielded log entries by log level (INFO, DEBUG, ERROR).
def log_parser(log_file_path):
    pass

def filter_logs(log_generator, log_level):
    pass

logs = log_parser('path/to/log/file')
error_logs = filter_logs(logs, 'ERROR')
for log in error_logs:
    print(log)
Objective: Write a generator function that processes items in batches of a specified size.
- The generator should accept any iterable as input.
- It should yield lists containing a batch of items.
- If the number of items in the last batch is less than the batch size, it should still be yielded.
def batch_processor(iterable, batch_size):
    pass

for batch in batch_processor(range(10), 3):
    print(batch)