PEP 533 – Deterministic cleanup for iterators
PEP: 533
Title: Deterministic cleanup for iterators
Author: Nathaniel J. Smith
BDFL-Delegate: Yury Selivanov <yury at edgedb.com>
Status: Deferred
Type: Standards Track
Created: 18-Oct-2016
Post-History: 18-Oct-2016
Abstract
We propose to extend the iterator protocol with a new __(a)iterclose__ slot, which is called automatically on exit from (async) for loops, regardless of how they exit. This allows for convenient, deterministic cleanup of resources held by iterators without reliance on the garbage collector. This is especially valuable for asynchronous generators.
Note on timing
In practical terms, the proposal here is divided into two separate parts: the handling of async iterators, which should ideally be implemented ASAP, and the handling of regular iterators, which is a larger but more relaxed project that can’t start until 3.7 at the earliest. But since the changes are closely related, and we probably don’t want to end up with async iterators and regular iterators diverging in the long run, it seems useful to look at them together.
Background and motivation
Python iterables often hold resources which require cleanup. For
example: file
objects need to be closed; the WSGI spec adds a close
method
on top of the regular iterator protocol and demands that consumers
call it at the appropriate time (though forgetting to do so is a
frequent source of bugs);
and PEP 342 (based on PEP 325) extended generator objects to add a
close
method to allow generators to clean up after themselves.
Generally, objects that need to clean up after themselves also define
a __del__
method to ensure that this cleanup will happen
eventually, when the object is garbage collected. However, relying on
the garbage collector for cleanup like this causes serious problems in
several cases:
- In Python implementations that do not use reference counting (e.g. PyPy, Jython), calls to __del__ may be arbitrarily delayed – yet many situations require prompt cleanup of resources. Delayed cleanup produces problems like crashes due to file descriptor exhaustion, or WSGI timing middleware that collects bogus times.
- Async generators (PEP 525) can only perform cleanup under the supervision of the appropriate coroutine runner. __del__ doesn't have access to the coroutine runner; indeed, the coroutine runner might be garbage collected before the generator object. So relying on the garbage collector is effectively impossible without some kind of language extension. (PEP 525 does provide such an extension, but it has a number of limitations that this proposal fixes; see the "alternatives" section below for discussion.)
Fortunately, Python provides a standard tool for doing resource
cleanup in a more structured way: with
blocks. For example, this
code opens a file but relies on the garbage collector to close it:
def read_newline_separated_json(path):
for line in open(path):
yield json.loads(line)
for document in read_newline_separated_json(path):
...
and recent versions of CPython will point this out by issuing a
ResourceWarning
, nudging us to fix it by adding a with
block:
def read_newline_separated_json(path):
with open(path) as file_handle: # <-- with block
for line in file_handle:
yield json.loads(line)
for document in read_newline_separated_json(path): # <-- outer for loop
...
But there’s a subtlety here, caused by the interaction of with
blocks and generators. with
blocks are Python’s main tool for
managing cleanup, and they’re a powerful one, because they pin the
lifetime of a resource to the lifetime of a stack frame. But this
assumes that someone will take care of cleaning up the stack
frame… and for generators, this requires that someone close
them.
In this case, adding the with
block is enough to shut up the
ResourceWarning
, but this is misleading – the file object cleanup
here is still dependent on the garbage collector. The with
block
will only be unwound when the read_newline_separated_json
generator is closed. If the outer for
loop runs to completion then
the cleanup will happen immediately; but if this loop is terminated
early by a break
or an exception, then the with
block won’t
fire until the generator object is garbage collected.
The correct solution requires that all users of this API wrap every
for
loop in its own with
block:
with closing(read_newline_separated_json(path)) as genobj:
for document in genobj:
...
This gets even worse if we consider the idiom of decomposing a complex pipeline into multiple nested generators:
def read_users(path):
with closing(read_newline_separated_json(path)) as gen:
for document in gen:
yield User.from_json(document)
def users_in_group(path, group):
with closing(read_users(path)) as gen:
for user in gen:
if user.group == group:
yield user
In general if you have N nested generators then you need N+1 with
blocks to clean up 1 file. And good defensive programming would
suggest that any time we use a generator, we should assume the
possibility that there could be at least one with
block somewhere
in its (potentially transitive) call stack, either now or in the
future, and thus always wrap it in a with
. But in practice,
basically nobody does this, because programmers would rather write
buggy code than tiresome repetitive code. In simple cases like this
there are some workarounds that good Python developers know (e.g. in
this simple case it would be idiomatic to pass in a file handle
instead of a path and move the resource management to the top level),
but in general we cannot avoid the use of with
/finally
inside
of generators, and thus dealing with this problem one way or
another. When beauty and correctness fight then beauty tends to win,
so it’s important to make correct code beautiful.
Still, is this worth fixing? Until async generators came along I would have argued yes, but that it was a low priority, since everyone seems to be muddling along okay – but async generators make it much more urgent. Async generators cannot do cleanup at all without some mechanism for deterministic cleanup that people will actually use, and async generators are particularly likely to hold resources like file descriptors. (After all, if they weren’t doing I/O, they’d be generators, not async generators.) So we have to do something, and it might as well be a comprehensive fix to the underlying problem. And it’s much easier to fix this now when async generators are first rolling out, than it will be to fix it later.
The proposal itself is simple in concept: add a __(a)iterclose__
method to the iterator protocol, and have (async) for
loops call
it when the loop is exited, even if this occurs via break
or
exception unwinding. Effectively, we’re taking the current cumbersome
idiom (with
block + for
loop) and merging them together into a
fancier for
. This may seem non-orthogonal, but makes sense when
you consider that the existence of generators means that with
blocks actually depend on iterator cleanup to work reliably, plus
experience showing that iterator cleanup is often a desirable feature
in its own right.
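To make the intended behavior concrete, here is a minimal sketch (the file name and generator are illustrative only) of what a user would observe under the proposal:

def lines(path):
    f = open(path)
    try:
        for line in f:
            yield line
    finally:
        f.close()   # the cleanup we want to run promptly

for line in lines("data.txt"):
    if line.startswith("#"):
        break
# Today: the generator's finally block only runs whenever the GC happens to
# collect the suspended generator.
# Under this proposal: the for loop calls __iterclose__ (which for generators
# means close()) as it exits via the break, so the file is closed right here.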
Alternatives
PEP 525 asyncgen hooks
PEP 525 proposes a set of global thread-local hooks
managed by new sys.{get/set}_asyncgen_hooks()
functions, which
allow event loops to integrate with the garbage collector to run
cleanup for async generators. In principle, this proposal and PEP 525
are complementary, in the same way that with
blocks and
__del__
are complementary: this proposal takes care of ensuring
deterministic cleanup in most cases, while PEP 525’s GC hooks clean up
anything that gets missed. But __aiterclose__
provides a number of
advantages over GC hooks alone:
- The GC hook semantics aren't part of the abstract async iterator protocol, but are instead restricted specifically to the async generator concrete type. If you have an async iterator implemented using a class, like:

  class MyAsyncIterator:
      async def __anext__(self):
          ...

  then you can't refactor this into an async generator without changing its semantics, and vice-versa. This seems very unpythonic. (It also leaves open the question of what exactly class-based async iterators are supposed to do, given that they face exactly the same cleanup problems as async generators.) __aiterclose__, on the other hand, is defined at the protocol level, so it's duck-type friendly and works for all iterators, not just generators.
- Code that wants to work on non-CPython implementations like PyPy cannot in general rely on GC for cleanup. Without __aiterclose__, it's more or less guaranteed that developers who develop and test on CPython will produce libraries that leak resources when used on PyPy. Developers who do want to target alternative implementations will either have to take the defensive approach of wrapping every for loop in a with block, or else carefully audit their code to figure out which generators might possibly contain cleanup code and add with blocks around those only. With __aiterclose__, writing portable code becomes easy and natural.
- An important part of building robust software is making sure that exceptions always propagate correctly without being lost. One of the most exciting things about async/await compared to traditional callback-based systems is that instead of requiring manual chaining, the runtime can now do the heavy lifting of propagating errors, making it much easier to write robust code. But this beautiful new picture has one major gap: if we rely on the GC for generator cleanup, then exceptions raised during cleanup are lost. So, again, without __aiterclose__, developers who care about this kind of robustness will either have to take the defensive approach of wrapping every for loop in a with block, or else carefully audit their code to figure out which generators might possibly contain cleanup code. __aiterclose__ plugs this hole by performing cleanup in the caller's context, so writing more robust code becomes the path of least resistance.
- The WSGI experience suggests that there exist important iterator-based APIs that need prompt cleanup and cannot rely on the GC, even in CPython. For example, consider a hypothetical WSGI-like API based around async/await and async iterators, where a response handler is an async generator that takes request headers + an async iterator over the request body, and yields response headers + the response body. (This is actually the use case that got me interested in async generators in the first place, i.e. this isn't hypothetical.) If we follow WSGI in requiring that child iterators must be closed properly, then without __aiterclose__ the absolute most minimalistic middleware in our system looks something like:

  async def noop_middleware(handler, request_header, request_body):
      async with aclosing(handler(request_header, request_body)) as aiter:
          async for response_item in aiter:
              yield response_item

  Arguably in regular code one can get away with skipping the with block around for loops, depending on how confident one is that one understands the internal implementation of the generator. But here we have to cope with arbitrary response handlers, so without __aiterclose__, this with construction is a mandatory part of every middleware.

  __aiterclose__ allows us to eliminate the mandatory boilerplate and an extra level of indentation from every middleware:

  async def noop_middleware(handler, request_header, request_body):
      async for response_item in handler(request_header, request_body):
          yield response_item
So the __aiterclose__
approach provides substantial advantages
over GC hooks.
This leaves open the question of whether we want a combination of GC
hooks + __aiterclose__
, or just __aiterclose__
alone. Since
the vast majority of generators are iterated over using a for
loop
or equivalent, __aiterclose__
handles most situations before the
GC has a chance to get involved. The case where GC hooks provide
additional value is in code that does manual iteration, e.g.:
agen = fetch_newline_separated_json_from_url(...)
while True:
document = await type(agen).__anext__(agen)
if document["id"] == needle:
break
# doesn't do 'await agen.aclose()'
If we go with the GC-hooks + __aiterclose__
approach, this
generator will eventually be cleaned up by GC calling the generator
__del__
method, which then will use the hooks to call back into
the event loop to run the cleanup code.
If we go with the no-GC-hooks approach, this generator will eventually be garbage collected, with the following effects:
- Its __del__ method will issue a warning that the generator was not closed (similar to the existing "coroutine never awaited" warning).
- The underlying resources involved will still be cleaned up, because the generator frame will still be garbage collected, causing it to drop references to any file handles or sockets it holds, and then those objects' __del__ methods will release the actual operating system resources.
- But any cleanup code inside the generator itself (e.g. logging, buffer flushing) will not get a chance to run.
The solution here – as the warning would indicate – is to fix the
code so that it calls __aiterclose__
, e.g. by using a with
block:
async with aclosing(fetch_newline_separated_json_from_url(...)) as agen:
while True:
document = await type(agen).__anext__(agen)
if document["id"] == needle:
break
Basically in this approach, the rule would be that if you want to
manually implement the iterator protocol, then it’s your
responsibility to implement all of it, and that now includes
__(a)iterclose__
.
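For illustration, a manual implementation might look something like the sketch below; the wrapped stream object and its readline()/aclose() methods are assumptions made for the example:

import json

class JSONLinesStream:
    def __init__(self, stream):
        self._stream = stream          # assumed to provide readline() and aclose()
        self._closed = False

    def __aiter__(self):
        return self

    async def __anext__(self):
        line = await self._stream.readline()
        if not line:
            raise StopAsyncIteration
        return json.loads(line)

    async def __aiterclose__(self):
        # Idempotent, per the guiding principles in the Specification below.
        if not self._closed:
            self._closed = True
            await self._stream.aclose()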
GC hooks add non-trivial complexity in the form of (a) new global
interpreter state, (b) a somewhat complicated control flow (e.g.,
async generator GC always involves resurrection, so the details of PEP
442 are important), and (c) a new public API in asyncio (await
loop.shutdown_asyncgens()
) that users have to remember to call at
the appropriate time. (This last point in particular somewhat
undermines the argument that GC hooks provide a safe backup to
guarantee cleanup, since if shutdown_asyncgens()
isn’t called
correctly then I think it’s possible for generators to be silently
discarded without their cleanup code being called; compare this to the
__aiterclose__
-only approach where in the worst case we still at
least get a warning printed. This might be fixable.) All this
considered, GC hooks arguably aren’t worth it, given that the only
people they help are those who want to manually call __anext__
yet
don’t want to manually call __aiterclose__
. But Yury disagrees
with me on this :-). And both options are viable.
Always inject resources, and do all cleanup at the top level
Several commentators on python-dev and python-ideas have suggested
that a pattern to avoid these problems is to always pass resources in
from above, e.g. read_newline_separated_json
should take a file
object rather than a path, with cleanup handled at the top level:
def read_newline_separated_json(file_handle):
for line in file_handle:
yield json.loads(line)
def read_users(file_handle):
for document in read_newline_separated_json(file_handle):
yield User.from_json(document)
with open(path) as file_handle:
for user in read_users(file_handle):
...
This works well in simple cases; here it lets us avoid the “N+1
with
blocks problem”. But unfortunately, it breaks down quickly
when things get more complex. Consider if instead of reading from a
file, our generator was reading from a streaming HTTP GET request –
while handling redirects and authentication via OAUTH. Then we’d
really want the sockets to be managed down inside our HTTP client
library, not at the top level. Plus there are other cases where
finally
blocks embedded inside generators are important in their
own right: db transaction management, emitting logging information
during cleanup (one of the major motivating use cases for WSGI
close
), and so forth. So this is really a workaround for simple
cases, not a general solution.
More complex variants of __(a)iterclose__
The semantics of __(a)iterclose__
are somewhat inspired by
with
blocks, but context managers are more powerful:
__(a)exit__
can distinguish between a normal exit versus exception
unwinding, and in the case of an exception it can examine the
exception details and optionally suppress
propagation. __(a)iterclose__
as proposed here does not have these
powers, but one can imagine an alternative design where it did.
However, this seems like unwarranted complexity: experience suggests
that it’s common for iterables to have close
methods, and even to
have __exit__
methods that call self.close()
, but I’m not
aware of any common cases that make use of __exit__
’s full
power. I also can’t think of any examples where this would be
useful. And it seems unnecessarily confusing to allow iterators to
affect flow control by swallowing exceptions – if you’re in a
situation where you really want that, then you should probably use a
real with
block anyway.
Specification
This section describes where we want to eventually end up, though there are some backwards compatibility issues that mean we can’t jump directly here. A later section describes the transition plan.
Guiding principles
Generally, __(a)iterclose__
implementations should:
- be idempotent,
- perform any cleanup that is appropriate on the assumption that the iterator will not be used again after __(a)iterclose__ is called. In particular, once __(a)iterclose__ has been called then calling __(a)next__ produces undefined behavior.
And generally, any code which starts iterating through an iterable
with the intention of exhausting it, should arrange to make sure that
__(a)iterclose__
is eventually called, whether or not the iterator
is actually exhausted.
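A minimal sketch of an iterator that follows these principles (the wrapped file is just an example resource):

class NumberedLines:
    def __init__(self, path):
        self._file = open(path)
        self._lineno = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = self._file.readline()   # raises ValueError if already closed
        if not line:
            raise StopIteration
        self._lineno += 1
        return (self._lineno, line)

    def __iterclose__(self):
        # Idempotent: file.close() on an already-closed file is a no-op.
        self._file.close()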
Changes to iteration
The core proposal is the change in behavior of for
loops. Given
this Python code:
for VAR in ITERABLE:
LOOP-BODY
else:
ELSE-BODY
we desugar to the equivalent of:
_iter = iter(ITERABLE)
_iterclose = getattr(type(_iter), "__iterclose__", lambda it: None)
try:
traditional-for VAR in _iter:
LOOP-BODY
else:
ELSE-BODY
finally:
_iterclose(_iter)
where the “traditional-for statement” here is meant as a shorthand for
the classic 3.5-and-earlier for
loop semantics.
Besides the top-level for
statement, Python also contains several
other places where iterators are consumed. For consistency, these
should call __iterclose__
as well using semantics equivalent to
the above. This includes:
- for loops inside comprehensions
- * unpacking
- functions which accept and fully consume iterables, like list(it), tuple(it), itertools.product(it1, it2, ...), and others (see the sketch below)
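For example, a pure-Python sketch of the semantics list(it) would acquire (the real builtin would do this in C):

def list_like(iterable):
    it = iter(iterable)
    iterclose = getattr(type(it), "__iterclose__", lambda it: None)
    result = []
    try:
        while True:
            try:
                result.append(next(it))
            except StopIteration:
                return result
    finally:
        iterclose(it)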
In addition, a yield from
that successfully exhausts the called
generator should as a last step call its __iterclose__
method. (Rationale: yield from
already links the lifetime of the
calling generator to the called generator; if the calling generator is
closed when half-way through a yield from
, then this will already
automatically close the called generator.)
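This lifetime linkage already exists in today's Python; for example:

def inner():
    try:
        yield 1
        yield 2
    finally:
        print("inner cleanup")   # runs when inner() is closed

def outer():
    yield from inner()

gen = outer()
next(gen)     # outer is now suspended inside the yield from
gen.close()   # closing outer propagates into inner and prints "inner cleanup"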
Changes to async iteration
We also make the analogous changes to async iteration constructs,
except that the new slot is called __aiterclose__
, and it’s an
async method that gets await
ed.
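By analogy with the desugaring above, an async for loop would desugar to roughly the following (a sketch in the same pseudocode style):

_ait = ITERABLE.__aiter__()
_aiterclose = getattr(type(_ait), "__aiterclose__", None)
try:
    traditional-async-for VAR in _ait:
        LOOP-BODY
    else:
        ELSE-BODY
finally:
    if _aiterclose is not None:
        await _aiterclose(_ait)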
Modifications to basic iterator types
Generator objects (including those created by generator comprehensions):
- __iterclose__ calls self.close()
- __del__ calls self.close() (same as now), and additionally issues a ResourceWarning if the generator wasn't exhausted. This warning is hidden by default, but can be enabled for those who want to make sure they aren't inadvertently relying on CPython-specific GC semantics.
Async generator objects (including those created by async generator comprehensions):
- __aiterclose__ calls self.aclose()
- __del__ issues a RuntimeWarning if aclose has not been called, since this probably indicates a latent bug, similar to the "coroutine never awaited" warning.
QUESTION: should file objects implement __iterclose__
to close the
file? On the one hand this would make this change more disruptive; on
the other hand people really like writing for line in open(...):
...
, and if we get used to iterators taking care of their own
cleanup then it might become very weird if files don’t.
New convenience functions
The operator
module gains two new functions, with semantics
equivalent to the following:
def iterclose(it):
if not isinstance(it, collections.abc.Iterator):
raise TypeError("not an iterator")
if hasattr(type(it), "__iterclose__"):
type(it).__iterclose__(it)
async def aiterclose(ait):
    if not isinstance(ait, collections.abc.AsyncIterator):
        raise TypeError("not an async iterator")
if hasattr(type(ait), "__aiterclose__"):
await type(ait).__aiterclose__(ait)
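For example, code that iterates manually could use the helper like this (a sketch, using the proposed operator.iterclose):

import operator

def first_two(iterable):
    it = iter(iterable)
    try:
        return next(it), next(it)
    finally:
        operator.iterclose(it)   # runs cleanup even if next() raises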
The itertools
module gains a new iterator wrapper that can be used
to selectively disable the new __iterclose__
behavior:
# QUESTION: I feel like there might be a better name for this one?
class preserve:
def __init__(self, iterable):
self._it = iter(iterable)
def __iter__(self):
return self
def __next__(self):
return next(self._it)
def __iterclose__(self):
# Swallow __iterclose__ without passing it on
pass
Example usage (assuming that file objects implement __iterclose__):
with open(...) as handle:
# Iterate through the same file twice:
for line in itertools.preserve(handle):
...
handle.seek(0)
for line in itertools.preserve(handle):
...
A context manager with equivalent semantics could also be provided, for example:
@contextlib.contextmanager
def iterclosing(iterable):
it = iter(iterable)
try:
yield preserve(it)
finally:
iterclose(it)
__iterclose__ implementations for iterator wrappers
Python ships a number of iterator types that act as wrappers around
other iterators: map
, zip
, itertools.accumulate
,
csv.reader
, and others. These iterators should define a
__iterclose__
method which calls __iterclose__
in turn on
their underlying iterators. For example, map
could be implemented
as:
# Helper function
def map_chaining_exceptions(fn, items, last_exc=None):
for item in items:
try:
fn(item)
except BaseException as new_exc:
if new_exc.__context__ is None:
new_exc.__context__ = last_exc
last_exc = new_exc
if last_exc is not None:
raise last_exc
class map:
def __init__(self, fn, *iterables):
self._fn = fn
self._iters = [iter(iterable) for iterable in iterables]
def __iter__(self):
return self
def __next__(self):
return self._fn(*[next(it) for it in self._iters])
def __iterclose__(self):
map_chaining_exceptions(operator.iterclose, self._iters)
def chain(*iterables):
    iterables = list(iterables)
    try:
        while iterables:
            for element in iterables.pop(0):
                yield element
    except BaseException as e:
        def iterclose_iterable(iterable):
            operator.iterclose(iter(iterable))
        map_chaining_exceptions(iterclose_iterable, iterables, last_exc=e)
In some cases this requires some subtlety; for example, itertools.tee [1]
should not call __iterclose__
on the underlying iterator until it
has been called on all of the clone iterators.
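A sketch of the bookkeeping this implies (the buffering that tee clones share is omitted; only the close logic is shown, and the class names are illustrative):

import operator

class _TeeState:
    def __init__(self, source, n_clones):
        self.source = source
        self.open_clones = n_clones

    def clone_closed(self):
        self.open_clones -= 1
        if self.open_clones == 0:
            # Only close the shared source once the last clone is closed.
            operator.iterclose(self.source)

class _TeeClone:
    def __init__(self, state):
        self._state = state
        self._closed = False

    def __iter__(self):
        return self

    def __iterclose__(self):
        if not self._closed:
            self._closed = True
            self._state.clone_closed()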
Example / Rationale
The payoff for all this is that we can now write straightforward code like:
def read_newline_separated_json(path):
for line in open(path):
yield json.loads(line)
and be confident that the file will receive deterministic cleanup without the end-user having to take any special effort, even in complex cases. For example, consider this silly pipeline:
list(map(lambda key: key.upper(),
doc["key"] for doc in read_newline_separated_json(path)))
If our file contains a document where doc["key"]
turns out to be
an integer, then the following sequence of events will happen:
1. key.upper() raises an AttributeError, which propagates out of the map and triggers the implicit finally block inside list.
2. The finally block in list calls __iterclose__() on the map object.
3. map.__iterclose__() calls __iterclose__() on the generator comprehension object.
4. This injects a GeneratorExit exception into the generator comprehension body, which is currently suspended inside the comprehension's for loop body.
5. The exception propagates out of the for loop, triggering the for loop's implicit finally block, which calls __iterclose__ on the generator object representing the call to read_newline_separated_json.
6. This injects an inner GeneratorExit exception into the body of read_newline_separated_json, currently suspended at the yield.
7. The inner GeneratorExit propagates out of the for loop, triggering the for loop's implicit finally block, which calls __iterclose__() on the file object.
8. The file object is closed.
9. The inner GeneratorExit resumes propagating, hits the boundary of the generator function, and causes read_newline_separated_json's __iterclose__() method to return successfully.
10. Control returns to the generator comprehension body, and the outer GeneratorExit continues propagating, allowing the comprehension's __iterclose__() to return successfully.
11. The rest of the __iterclose__() calls unwind without incident, back into the body of list.
12. The original AttributeError resumes propagating.
(The details above assume that we implement file.__iterclose__
; if
not then add a with
block to read_newline_separated_json
and
essentially the same logic goes through.)
Of course, from the user’s point of view, this can be simplified down to just:
1. int.upper() raises an AttributeError
2. The file object is closed.
3. The AttributeError propagates out of list
So we’ve accomplished our goal of making this “just work” without the user having to think about it.
Transition plan
While the majority of existing for
loops will continue to produce
identical results, the proposed changes will produce
backwards-incompatible behavior in some cases. Example:
def read_csv_with_header(lines_iterable):
lines_iterator = iter(lines_iterable)
for line in lines_iterator:
column_names = line.strip().split("\t")
break
for line in lines_iterator:
values = line.strip().split("\t")
record = dict(zip(column_names, values))
yield record
This code used to be correct, but after this proposal is implemented
will require an itertools.preserve
call added to the first for
loop.
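Under the proposal, the fixed version would look something like this, wrapping the first loop in the itertools.preserve helper proposed above:

def read_csv_with_header(lines_iterable):
    lines_iterator = iter(lines_iterable)
    for line in itertools.preserve(lines_iterator):   # don't close it yet
        column_names = line.strip().split("\t")
        break
    for line in lines_iterator:
        values = line.strip().split("\t")
        record = dict(zip(column_names, values))
        yield record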
[QUESTION: currently, if you close a generator and then try to iterate
over it then it just raises Stop(Async)Iteration
, so code that
passes the same generator object to multiple for
loops but forgets
to use itertools.preserve
won’t see an obvious error – the second
for
loop will just exit immediately. Perhaps it would be better if
iterating a closed generator raised a RuntimeError
? Note that
files don’t have this problem – attempting to iterate a closed file
object already raises ValueError
.]
Specifically, the incompatibility happens when all of these factors come together:
- The automatic calling of __(a)iterclose__ is enabled
- The iterable did not previously define __(a)iterclose__
- The iterable does now define __(a)iterclose__
- The iterable is re-used after the for loop exits
So the problem is how to manage this transition, and those are the levers we have to work with.
First, observe that the only async iterables where we propose to add
__aiterclose__
are async generators, and there is currently no
existing code using async generators (though this will start changing
very soon), so the async changes do not produce any backwards
incompatibilities. (There is existing code using async iterators, but
using the new async for loop on an old async iterator is harmless,
because old async iterators don’t have __aiterclose__
.) In
addition, PEP 525 was accepted on a provisional basis, and async
generators are by far the biggest beneficiary of this PEP’s proposed
changes. Therefore, I think we should strongly consider enabling
__aiterclose__
for async for
loops and async generators ASAP,
ideally for 3.6.0 or 3.6.1.
For the non-async world, things are harder, but here’s a potential transition path:
In 3.7:
Our goal is that existing unsafe code will start emitting warnings, while those who want to opt-in to the future can do that immediately:
- We immediately add all the __iterclose__ methods described above.
- If from __future__ import iterclose is in effect, then for loops and * unpacking call __iterclose__ as specified above.
- If the future is not enabled, then for loops and * unpacking do not call __iterclose__. But they do call some other method instead, e.g. __iterclose_warning__.
- Similarly, functions like list use stack introspection (!!) to check whether their direct caller has __future__.iterclose enabled, and use this to decide whether to call __iterclose__ or __iterclose_warning__.
- For all the wrapper iterators, we also add __iterclose_warning__ methods that forward to the __iterclose_warning__ method of the underlying iterator or iterators.
- For generators (and files, if we decide to do that), __iterclose_warning__ is defined to set an internal flag, and other methods on the object are modified to check for this flag. If they find the flag set, they issue a PendingDeprecationWarning to inform the user that in the future this sequence would have led to a use-after-close situation and the user should use preserve().
In 3.8:
- Switch from PendingDeprecationWarning to DeprecationWarning.
In 3.9:
- Enable the __future__ unconditionally and remove all the __iterclose_warning__ stuff.
I believe that this satisfies the normal requirements for this kind of transition – opt-in initially, with warnings targeted precisely to the cases that will be affected, and a long deprecation cycle.
Probably the most controversial / risky part of this is the use of
stack introspection to make the iterable-consuming functions sensitive
to a __future__
setting, though I haven’t thought of any situation
where it would actually go wrong yet…
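For concreteness, the check might look something like the sketch below. The flag value here is made up for illustration, but real __future__ features do record a compiler flag in the calling code object's co_flags:

import sys

CO_FUTURE_ITERCLOSE = 0x2000000   # hypothetical flag for "from __future__ import iterclose"

def caller_wants_iterclose():
    # Frame 0 is this helper, frame 1 is (say) list() itself, frame 2 is list()'s caller.
    caller_code = sys._getframe(2).f_code
    return bool(caller_code.co_flags & CO_FUTURE_ITERCLOSE)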
Acknowledgements
Thanks to Yury Selivanov, Armin Rigo, and Carl Friedrich Bolz for helpful discussion on earlier versions of this idea.
References
[1] https://docs.python.org/3/library/itertools.html#itertools.tee
Copyright
This document has been placed in the public domain.