There has been a longstanding gap in the Python packaging ecosystem that has
somewhat annoyed me,
but not enough to do anything about it:
we haven't really had a good way to compose multiple layers of Python
virtual environments together,
allowing large dependencies (like AI and machine learning libraries)
to be shared across multiple different application environments without having
to install them directly into the base runtime environment.
Utilities for collecting up an entire Python runtime, an application, and all its
dependencies into a single deployable artifact have existed since before the turn
of the century.
We've had standardised virtual environments
(allowing multiple applications to share a base Python runtime and its directly
installed third party packages) for almost as long.
We've had zip applications
for a long time as well (and other utilities which build on that feature).
We've had tools like wagon
which allow us to ship a bundle of prebuilt Python wheel archives and
install them on a destination system without needing to download
anything else from the internet at installation time.
We've had tools like conda
(and more recently uv),
which make intelligent use of hard links on local systems to avoid
making duplicate copies of completely identical versions of packages.
We've technically had platform specific mechanisms like Linux container images,
where the contents of an environment can be built up across multiple
container image layers, with the lower layers being shared across multiple
image definitions, but have lacked a convenient way to handle the dependency
management complications involved in using these tools to share large Python
libraries.
But we've never had something which specifically took full advantage of the way
Python's import system works to enable robust structural decomposition of Python
applications into independently updatable subcomponents (with a granularity
larger than single packages).
All of this history meant that I was thoroughly intrigued when a mutual
acquaintance introduced me to the creators of the LM Studio
personal AI desktop application to discuss a Python packaging problem they
had looming on their technical road map: it was clear from user demand and the
rate of evolution in the Python AI/ML ecosystem that they needed a way to ship
Python AI/ML components directly to their users without having to wait for
those capabilities to be made available through native interfaces in other
languages (such as Swift, C++, or JavaScript), but it didn't seem obvious to them
how they could readily integrate that capability into LM Studio without
making the application installation process substantially more complicated
for their users.
What started as a consulting contract for a technical proof of concept,
and has since turned into a permanent position with the organisation,
proved fruitful,
and the result is the recently published
open source venvstacks utility,
which is specifically designed to enable the kind of portable deterministic
artifact publishing setup that LM Studio needed, including:
Framework layers (for shipping large dependencies, such as Apple MLX or PyTorch)
Application layers (including additional unpackaged "launch modules" for app execution)
There are certainly still some technical limitations to be addressed (the
dynamic linking problem
with layering virtual environments like this is notorious amongst Python packaging
experts for a reason), but even in its current form, venvstacks is already capable
enough to power the recent inclusion of
Apple MLX support
in LM Studio.
Guido van Rossum recently put together an
excellent post
talking about the value of infix binary operators in making certain kinds of
operations easier to reason about correctly.
The context inspiring that post is a python-ideas discussion regarding the
possibility of adding a shorthand spelling (x = a + b) to Python for the
operation:
x = a.copy()
x.update(b)
The PEP for that proposal is still in development, so I'm not going to link to
it directly [1], but the paragraph above gives the gist of the idea. Guido's
article came in response to the assertion that infix operators don't improve
readability, when we have plenty of empirical evidence to show that they do.
Where this article comes from is a key point that Guido's article mentions,
but doesn't emphasise: that those readability benefits rely heavily on
implicitly shared context between the author of an expression and the readers
of that expression.
Without a previous agreement on the semantics, the only possible general answer
to the question "What does x = a + b mean?" is "I need more information to
answer that".
If the additional information supplied is "This is an algebraic expression",
then x = a + b is expressing a constraint on the permitted values of x,
a, and b.
Specifying x = a - b as an additional constraint would then further allow
the reader to infer that x = a and b = 0.
The use case for + in Python that most closely corresponds with algebra is
using it with numbers - the key differences lie in the meaning of =, rather
than the meaning of +.
So if the additional information supplied is "This is a Python assignment
statement; a and b are both well-behaved finite numbers", then the
reader will be able to infer that x will be the sum of the two numbers.
Inferring the exact numeric type of x would require yet more information
about the types of a and b, as types implementing the numeric +
operator are expected to participate in a type coercion protocol that gives
both operands a chance to carry out the operation, and only raises TypeError
if neither type understands the other.
The original algebraic meaning then gets expressed in Python as
assert x == a + b, and successful execution of the assignment statement
ensures that assertion will pass.
In this context, types implementing the + operator are expected to provide
all the properties that would be expected of the corresponding mathematical
concepts (a + b == b + a, a + (b + c) == (a + b) + c, etc), subject
to the limitations of performing calculations on computers that actually exist.
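As a sketch of how that coercion protocol looks in practice (the Money class here is purely illustrative, not a real library type), an __add__ implementation returns NotImplemented when it doesn't recognise the other operand, giving the reflected __radd__ method a chance to run before TypeError is raised:
class Money:
    # Hypothetical type illustrating the numeric coercion protocol
    def __init__(self, cents):
        self.cents = cents
    def __add__(self, other):
        if isinstance(other, Money):
            return Money(self.cents + other.cents)
        if isinstance(other, int):
            return Money(self.cents + other)
        return NotImplemented   # let the other operand try to handle it
    def __radd__(self, other):
        return self.__add__(other)   # addition is symmetric for this type

# 5 + Money(10) works: int.__add__ returns NotImplemented, so Python
# falls back to Money.__radd__
# Money(10) + 3.5 raises TypeError: neither type understands the other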
If the given expression used uppercase letters, as in X = A + B, then the
additional information supplied may instead be "This is a matrix algebra
expression". (It's a notational convention in mathematics that matrices be
assigned uppercase letters, while lowercase letters indicate scalar values)
For matrices, addition and subtraction are defined as only being valid between
matrices of the same size and shape, so if X = A - B were to be supplied as
an additional constraint, then the implications would be:
The numpy.ndarray type, and other types implementing the same API, bring the
semantics of matrix algebra to Python programming, similar to the way that the
builtin numeric types bring the semantics of scalar algebra.
This means that if the additional information supplied is "This is a Python
assignment statement; A and B are both matrices of the same size and
shape containing well-behaved finite numbers", then the reader will be able to
infer that X will be a new matrix of the same shape and size as matrices
A and B, with each element in X being the sum of the corresponding
elements in A and B.
As with scalar algebra, inferring the exact numeric type of the elements of
X would require more information about the types of the elements in A
and B, the original algebraic meaning gets expressed in Python as
assert X == A + B, successful execution of the assignment statement
ensures that assertion will pass, and types implementing + in this context
are expected to provide the properties that would be expected of a matrix in
mathematics.
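For example, with numpy installed (the element values here are arbitrary):
>>> import numpy as np
>>> A = np.array([[1, 2], [3, 4]])
>>> B = np.array([[10, 20], [30, 40]])
>>> X = A + B
>>> X
array([[11, 22],
       [33, 44]])
>>> (X == A + B).all()
True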
Mathematics doesn't provide a convenient infix notation for concatenating two
strings together (aside from writing their names directly next to each other),
so programming language designers are forced to choose one.
While this does vary across languages, the most common choice is the one that
Python uses: the + operator.
This is formally a distinct operation from numeric addition, with different
semantic expectations, and CPython's C API somewhat coincidentally ended up
reflecting that distinction by offering two different ways of implementing
+ on a type: the tp_as_number->nb_add and tp_as_sequence->sq_concat slots.
(This distinction is absent at the Python level: only __add__, __radd__
and __iadd__ are exposed, and they always populate the relevant
tp_as_number slots in CPython.)
The key semantic difference between algebraic addition and string concatenation is
that in algebraic addition, the order of the operands doesn't matter
(a + b == b + a), while in the string concatenation case, the order of the
operands determines which items appear first in the result (e.g.
"Hello" + "World" == "HelloWorld" vs "World" + "Hello" == "WorldHello").
This means that a + b == b + a being true when concatenating strings
indicates that one or both strings are empty, that the two strings are
identical, or, more generally, that both strings are repetitions of some
common substring (for example, "ab" + "abab" == "abab" + "ab").
Another less obvious semantic difference is that strings don't participate in
the type coercion protocol that is defined for numbers: if the right hand
operand isn't a string (or string subclass) instance, they'll raise
TypeError immediately, rather than letting the other operand attempt the
operation.
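For example (the exact error message wording has varied across CPython releases; this is the modern form):
>>> "Hello" + 42
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str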
Python goes further than merely allowing + to be used for string
concatenation: it allows it to be used for arbitrary sequence concatenation.
For immutable container types like tuple, this closely parallels the way
that string concatenation works: a new immutable instance of the same type is
created containing references to the same items referenced by the original
operands:
>>> a = 1, 2, 3
>>> b = 4, 5, 6
>>> x = a + b
>>> a
(1, 2, 3)
>>> b
(4, 5, 6)
>>> x
(1, 2, 3, 4, 5, 6)
As with strings, immutable sequences will usually only interact with other
instances of the same type (or subclasses), even when the x += b notation
is used as an alternative to x = x + b. For example:
>>> x = 1, 2, 3
>>> x += [4, 5, 6]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate tuple (not "list") to tuple
>>> x += 4, 5, 6
>>> x
(1, 2, 3, 4, 5, 6)
In addition to str, the tuple and bytes types implement these
concatenation semantics. range and memoryview, while otherwise
implementing the Sequence API, don't support concatenation operations.
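For example:
>>> range(3) + range(3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'range' and 'range'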
Mutable sequence types add yet another variation to the possible meanings of
+ in Python. For the specific example of x = a + b, they're very similar
to immutable sequences, creating a fresh instance that references the same items
as the original operands:
>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> x = a + b
>>> a
[1, 2, 3]
>>> b
[4, 5, 6]
>>> x
[1, 2, 3, 4, 5, 6]
Where they diverge is that the x += b operation will modify the target
sequence directly, rather than creating a new container:
>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> x = a; x = x + b
>>> a
[1, 2, 3]
>>> x = a; x += b
>>> a
[1, 2, 3, 4, 5, 6]
The other difference is that where + remains restrictive as to the
container types it will work with, += is typically generalised to work
with arbitrary iterables on the right hand side, just like the
MutableSequence.extend() method:
>>> x = [1, 2, 3]
>>> x = x + (4, 5, 6)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "tuple") to list
>>> x += (4, 5, 6)
>>> x
[1, 2, 3, 4, 5, 6]
Amongst the builtins, list and bytearray implement these semantics
(although bytearray limits even in-place concatenation to bytes-like
types that support memoryview style access). Elsewhere in the standard
library, collections.deque and array.array are other mutable sequence
types that behave this way.
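A quick collections.deque example shows the same split between the strict binary form and the more permissive in-place form (the exact TypeError wording may differ slightly between CPython versions):
>>> from collections import deque
>>> x = deque([1, 2, 3])
>>> x + [4, 5, 6]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate deque (not "list") to deque
>>> x += [4, 5, 6]
>>> x
deque([1, 2, 3, 4, 5, 6])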
Multisets are a concept in mathematics that allow for values to occur in a set
more than once, with the multiset then being the mapping from the values
themselves to the count of how many times that value occurs in the multiset
(with a count of zero or less being the same as the value being omitted from
the set entirely).
While they don't natively use the x = a + b notation the way that scalar
algebra and matrix algebra do, the key point regarding multisets that's relevant
to this article is the fact that they do have a "Sum" operation defined, and the
semantics of that operation are very similar to those used for matrix addition:
element wise summation for each item in the multiset. If a particular value is
only present in one of the multisets, that's handled the same way as if it were
present with a count of zero.
Since Python 2.7 and 3.1, Python has included an implementation of the
mathematical multiset concept in the form of the collections.Counter class.
It uses x = a + b to denote multiset summation:
>>> a = collections.Counter(maths=2, python=2)
>>> b = collections.Counter(python=4, maths=1)
>>> x = a + b
>>> x
Counter({'python': 6, 'maths': 3})
As with sequences, counter instances define their own interoperability domain,
so they won't accept arbitrary mappings for a binary + operation:
>>> x = a + dict(python=4, maths=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'Counter' and 'dict'
But they're more permissive for in-place operations, accepting arbitrary
mapping objects:
>>> x += dict(python=4, maths=1)
>>> x
Counter({'python': 10, 'maths': 4})
Python's dictionaries are quite interesting mathematically, as in mathematical
terms, they're not actually a container. Instead, they're a function mapping
between a domain defined by the set of keys, and a range defined by a multiset
of values [2].
This means that the mathematical context that would most closely correspond to
defining addition on dictionaries is the algebraic combination of functions.
That's defined such that (f + g)(x) is equivalent to f(x) + g(x), so
the only binary in-fix operator support for dictionaries that could be grounded
in an existing mathematical shared context is one where d1 + d2 was
shorthand for:
x = d1.copy()
for k, rhs in d2.items():
    try:
        lhs = x[k]
    except KeyError:
        x[k] = rhs
    else:
        x[k] = lhs + rhs
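To make those hypothetical function-combination semantics concrete with some made up values:
d1 = {"python": 4, "maths": 1}
d2 = {"python": 2, "art": 3}
# Running the loop above with these inputs would give:
# x == {"python": 6, "maths": 1, "art": 3}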
That has the unfortunate implication that introducing a Python-specific binary
operator shorthand for dictionary copy-and-update semantics would represent a
hard conceptual break with mathematics, rather than a transfer of existing
mathematical concepts into the language. Contrast that with the introduction
of collections.Counter (which was grounded in the semantics of mathematical
multisets and borrowed its Python notation from element-wise addition on
matrices), or the matrix multiplication operator (which was grounded in the
semantics of matrix algebra, and only needed a text-editor-friendly symbol
assigned, similar to using * instead of × for scalar multiplication
and / instead of ÷ for division).
At least to me, that seems like a big leap to take for something where the
in-place form already has a perfectly acceptable spelling (d1.update(d2)),
and a more expression-friendly variant could be provided as a new dictionary
class method:
@classmethod
def from_merge(cls, *inputs):
    self = cls()
    for input in inputs:
        self.update(input)
    return self
With that defined, then the exact equivalent of the proposed d1 + d2 would
be type(d1).from_merge(d1, d2), and in practice, you would often give the
desired result type explicitly rather than inferring it from the inputs
(e.g. dict.from_merge(d1, d2)).
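A minimal sketch of how that would look in practice (using a hypothetical dict subclass, since from_merge isn't actually part of the builtin dict API):
class MergingDict(dict):
    @classmethod
    def from_merge(cls, *inputs):
        self = cls()
        for input in inputs:
            self.update(input)
        return self

d1 = {"python": 4, "maths": 1}
d2 = {"python": 2, "art": 3}
merged = MergingDict.from_merge(d1, d2)
# Later inputs win, matching dict.update():
# merged == {"python": 2, "maths": 1, "art": 3}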
However, the PEP is still in the very first stage of the discussion and review
process, so it's entirely possible that by the time it reaches python-dev
it will be making a more modest proposal like a new dict class method,
rather than the current proposal of operator syntax support.
Several years ago, I
highlighted
"CPython moves both too fast and too slowly" as one of the more common causes
of conflict both within the python-dev mailing list, as well as between the
active CPython core developers and folks that decide that participating in
that process wouldn't be an effective use of their personal time and energy.
I still consider that to be the case, but it's also a point I've spent a lot
of time reflecting on in the intervening years, as I wrote that original article
while I was still working for Boeing Defence Australia. The following month,
I left Boeing for Red Hat Asia-Pacific, and started gaining a redistributor
level perspective on
open source supply chain management
in large enterprises.
While it's a gross oversimplification, I tend to break down CPython's use cases
as follows (note that these categories aren't fully distinct, they're just
aimed at focusing my thinking on different factors influencing the rollout of
new software features and versions):
Education: educators' main interest is in teaching ways of modelling and
manipulating the world computationally, not writing or maintaining
production software. Examples:
Development, build & release management tooling for Linux distros
Set-and-forget infrastructure: software where, for sometimes debatable
reasons, in-life upgrades to the software itself are nigh impossible, but
upgrades to the underlying platform may be feasible. Examples:
most self-managed corporate and institutional infrastructure (where properly
funded sustaining engineering plans are disturbingly rare)
grant funded software (where maintenance typically ends when the initial
grant runs out)
software with strict certification requirements (where recertification is
too expensive for routine updates to be economically viable unless
absolutely essential)
Embedded software systems without auto-upgrade capabilities
Continuously upgraded infrastructure: software with a robust sustaining
engineering model, where dependency and platform upgrades are considered
routine, and no more concerning than any other code change. Examples:
Facebook's Python service infrastructure
Rolling release Linux distributions
most public PaaS and serverless environments (Heroku, OpenShift, AWS Lambda,
Google Cloud Functions, Azure Cloud Functions, etc)
Intermittently upgraded standard operating environments: environments that do
carry out routine upgrades to their core components, but those upgrades occur
on a cycle measured in years, rather than weeks or months. Examples:
Ephemeral software: software that tends to be used once and then discarded
or ignored, rather than being subsequently upgraded in place. Examples:
Ad hoc automation scripts
Single-player games with a defined "end" (once you've finished them, even
if you forget to uninstall them, you probably won't reinstall them on a new
device)
Single-player games with little or no persistent state (if you uninstall and
reinstall them, it doesn't change much about your play experience)
Event-specific applications (the application was tied to a specific physical
event, and once the event is over, that app doesn't matter any more)
Regular use applications: software that tends to be regularly upgraded after
deployment. Examples:
Business management software
Personal & professional productivity applications (e.g. Blender)
Multi-player games, and other games with significant persistent state, but
no real defined "end"
Embedded software systems with auto-upgrade capabilities
Shared abstraction layers: software components that are designed to make it
possible to work effectively in a particular problem domain even if you don't
personally grasp all the intricacies of that domain yet. Examples:
most runtime libraries and frameworks fall into this category (e.g. Django,
Flask, Pyramid, SQLAlchemy, NumPy, SciPy, requests)
many testing and type inference tools also fit here (e.g. pytest,
Hypothesis, vcrpy, behave, mypy)
plugins for other applications (e.g. Blender plugins, OpenStack hardware
adapters)
the standard library itself represents the baseline "world according to
Python" (and that's an
incredibly complex
world view)
Ultimately, the main audiences that CPython and the standard library specifically
serve are those that, for whatever reason, aren't adequately served by the
combination of a more limited standard library and the installation of
explicitly declared third party dependencies from PyPI.
To oversimplify the above review of different usage and deployment models even
further, it's possible to summarise the single largest split in Python's user
base as the one between those that are using Python as a scripting language
for some environment of interest, and those that are using it as an application
development language, where the eventual artifact that will be distributed is
something other than the script that they're working on.
Typical developer behaviours when using Python as a scripting language include:
the main working unit consists of a single Python file (or Jupyter notebook!),
rather than a directory of Python and metadata files
there's no separate build step of any kind - the script is distributed as a
script, similar to the way standalone shell scripts are distributed
there's no separate install step (other than downloading the file to an
appropriate location), as it is expected that the required runtime environment
will be preconfigured on the destination system
no explicit dependencies stated, except perhaps a minimum Python version,
or else a statement of the expected execution environment. If dependencies
outside the standard library are needed, they're expected to be provided by
the environment being scripted (whether that's an operating system,
a data analysis platform, or an application that embeds a Python runtime)
no separate test suite, with the main test of correctness being "Did the
script do what you wanted it to do with the input that you gave it?"
if testing prior to live execution is needed, it will be in the form of a
"dry run" or "preview" mode that conveys to the user what the software would
do if run that way
if static code analysis tools are used at all, it's via integration into the
user's software development environment, rather than being set up separately
for each individual script
By contrast, typical developer behaviours when using Python as an application
development language include:
the main working unit consists of a directory of Python and metadata files,
rather than a single Python file
there is a separate build step to prepare the application for publication,
even if it's just bundling the files together into a Python sdist, wheel
or zipapp archive
whether there's a separate install step to prepare the application for use
will depend on how the application is packaged, and what the supported target
environments are
external dependencies are expressed in a metadata file, either directly in
the project directory (e.g. pyproject.toml, requirements.txt,
Pipfile), or as part of the generated publication archive (e.g.
setup.py, flit.ini)
a separate test suite exists, either as unit tests for the Python API,
integration tests for the functional interfaces, or a combination of the two
usage of static analysis tools is configured at the project level as part of
its testing regime, rather than being dependent on the tooling choices of
individual developers
As a result of that split, the main purpose that CPython and the standard
library end up serving is to define the redistributor independent baseline
of assumed functionality for educational and ad hoc Python scripting
environments 3-5 years after the corresponding CPython feature release.
For ad hoc scripting use cases, that 3-5 year latency stems from a combination
of delays in redistributors making new releases available to their users, and
users of those redistributed versions taking time to revise their standard
operating environments.
In the case of educational environments, educators need that kind of time to
review the new features and decide whether or not to incorporate them into the
courses they offer their students.
This post was largely inspired by the Twitter discussion following on from
this comment of mine
citing the Provisional API status defined in
PEP 411 as an example of an
open source project issuing a de facto invitation to users to participate more
actively in the design & development process as co-creators, rather than only
passively consuming already final designs.
The responses included several expressions of frustration regarding the difficulty
of supporting provisional APIs in higher level libraries, without those libraries
making the provisional status transitive, and hence limiting support for any
related features to only the latest version of the provisional API, and not any
of the earlier iterations.
My main reaction
was to suggest that open source publishers should impose whatever support
limitations they need to impose to make their ongoing maintenance efforts
personally sustainable. That means that if supporting older iterations of
provisional APIs is a pain, then they should only be supported if the project
developers themselves need that, or if somebody is paying them for the
inconvenience. This is similar to my view on whether or not volunteer-driven
projects should support older commercial LTS Python releases for free when it's
a hassle for them to do so: I don't think they should,
as I expect most such demands to be stemming from poorly managed institutional
inertia, rather than from genuine need (and if the need is genuine, then it
should instead be possible to find some means of paying to have it addressed).
However, my second reaction
was to realise that even though I've touched on this topic over the years (e.g.
in the original 2011 article linked above, as well as in Python 3 Q & A answers
here,
here,
and here,
and to a lesser degree in last year's article on the
Python Packaging Ecosystem),
I've never really attempted to directly explain the impact it has on the standard
library design process.
And without that background, some aspects of the design process, such as the
introduction of provisional APIs, or the introduction of
inspired-by-but-not-the-same-as APIs, seem completely nonsensical, as they appear to be an attempt to standardise
APIs without actually standardising them.
The first hurdle that any proposal sent to python-ideas or python-dev has to
clear is answering the question "Why isn't a module on PyPI good enough?". The
vast majority of proposals fail at this step, but there are several common
themes for getting past it:
rather than downloading a suitable third party library, novices may be prone
to copying & pasting bad advice from the internet at large (e.g. this is why
the secrets library now exists: to make it less likely people will use the
random module, which is intended for games and statistical simulations,
for security-sensitive purposes; see the short sketch after this list)
the module is intended to provide a reference implementation and to enable
interoperability between otherwise competing implementations, rather than
necessarily being all things to all people (e.g. asyncio, wsgiref,
unittest, and logging all fall into this category)
the module is intended for use in other parts of the standard library (e.g.
enum falls into this category, as does unittest)
the module is designed to support a syntactic addition to the language (e.g.
the contextlib, asyncio and typing modules fall into this
category)
the module is just plain useful for ad hoc scripting purposes (e.g.
pathlib, and ipaddress fall into this category)
the module is useful in an educational context (e.g. the statistics
module allows for interactive exploration of statistical concepts, even if you
wouldn't necessarily want to use it for full-fledged statistical analysis)
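As a small illustration of the first point in the list above, the secrets module makes the safe choice the easy one for tasks like generating an account recovery token (the variable names here are just for illustration):
import secrets

# secrets draws from the operating system's CSPRNG, whereas the random
# module is designed for reproducible simulations rather than unpredictability
reset_token = secrets.token_urlsafe(16)
confirmation_code = "".join(secrets.choice("0123456789") for _ in range(6))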
Passing this initial "Is PyPI obviously good enough?" check isn't enough to
ensure that a module will be accepted for inclusion into the standard library,
but it's enough to shift the question to become "Would including the proposed
library result in a net improvement to the typical introductory Python software
developer experience over the next few years?"
The introduction of ensurepip and venv modules into the standard library
also makes it clear to redistributors that we expect Python level packaging
and installation tools to be supported in addition to any platform specific
distribution mechanisms.
While existing third party modules are sometimes adopted wholesale into the
standard library, in other cases, what actually gets added is a redesigned
and reimplemented API that draws on the user experience of the existing API,
but drops or revises some details based on the additional design considerations
and privileges that go with being part of the language's reference
implementation.
For example, unlike its popular third party predecessor path.py, pathlib
does not define string subclasses, but instead independent types. Solving
the resulting interoperability challenges led to the definition of the
filesystem path protocol, allowing a wider range of objects to be used with
interfaces that work with filesystem paths.
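For example, any object can opt in to that protocol by defining __fspath__ (the Workspace class here is purely illustrative; os.fspath and the protocol itself arrived in Python 3.6):
import os
import pathlib

class Workspace:
    """Illustrative class that participates in the filesystem path protocol."""
    def __init__(self, root):
        self._root = root
    def __fspath__(self):
        # Called by os.fspath(), open(), pathlib.Path(), and other path consumers
        return self._root

ws = Workspace("/tmp/demo-workspace")
print(os.fspath(ws))               # /tmp/demo-workspace
print(pathlib.PurePosixPath(ws))   # /tmp/demo-workspace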
The API design for the ipaddress module was adjusted to explicitly
separate host interface definitions (IP addresses associated with particular
IP networks) from the definitions of addresses and networks in order to serve
as a better tool for teaching IP addressing concepts, whereas the original
ipaddr module is less strict in the way it uses networking terminology.
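That distinction shows up directly in the module's three factory functions:
>>> import ipaddress
>>> ipaddress.ip_address("192.0.2.1")
IPv4Address('192.0.2.1')
>>> ipaddress.ip_network("192.0.2.0/24")
IPv4Network('192.0.2.0/24')
>>> ipaddress.ip_interface("192.0.2.1/24")   # a host address on a particular network
IPv4Interface('192.0.2.1/24')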
In other cases, standard library modules are constructed as a synthesis of
multiple existing approaches, and may also rely on syntactic features that
didn't exist when the APIs for pre-existing libraries were defined. Both of
these considerations apply for the asyncio and typing modules,
while the latter consideration applies for the dataclasses API being
considered in PEP 557 (which can be summarised as "like attrs, but using
variable annotations for field declarations").
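For reference, the declaration style proposed in PEP 557 (and since released as the dataclasses module in Python 3.7) looks like this; the InventoryItem class is just an illustrative example:
from dataclasses import dataclass

@dataclass
class InventoryItem:
    # Fields are declared with variable annotations rather than explicit descriptors
    name: str
    unit_price: float
    quantity: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity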
The working theory for these kinds of changes is that the existing libraries
aren't going away, and their maintainers often aren't all that interested
in putting up with the constraints associated with standard library maintenance
(in particular, the relatively slow release cadence). In such cases, it's
fairly common for the documentation of the standard library version to feature
a "See Also" link pointing to the original module, especially if the third
party version offers additional features and flexibility that were omitted
from the standard library module.
While CPython does maintain an API deprecation policy, we generally prefer not
to use it without a compelling justification (this is especially the case
while other projects are attempting to maintain compatibility with Python 2.7).
However, when adding new APIs that are inspired by existing third party ones
without being exact copies of them, there's a higher than usual risk that some
of the design decisions may turn out to be problematic in practice.
When we consider the risk of such changes to be higher than usual, we'll mark
the related APIs as provisional, indicating that conservative end users may
want to avoid relying on them at all, and that developers of shared abstraction
layers may want to consider imposing stricter than usual constraints on which
versions of the provisional API they're prepared to support.
The short answer here is that the main APIs that get upgraded are those where:
there isn't likely to be a lot of external churn driving additional updates
there are clear benefits for either ad hoc scripting use cases or else in
encouraging future interoperability between multiple third party solutions
a credible proposal is submitted by folks interested in doing the work
If the limitations of an existing module are mainly noticeable when using the
module for application development purposes (e.g. datetime), if
redistributors already tend to make an improved alternative third party option
readily available (e.g. requests), or if there's a genuine conflict between
the release cadence of the standard library and the needs of the package in
question (e.g. certifi), then the incentives to propose a change to the
standard library version tend to be significantly reduced.
This is essentially the inverse to the question about PyPI above: since PyPI
usually is a sufficiently good distribution mechanism for application
developer experience enhancements, it makes sense for such enhancements to be
distributed that way, allowing redistributors and platform providers to make
their own decisions about what they want to include as part of their default
offering.
Changing CPython and the standard library only comes into play when there is
perceived value in changing the capabilities that can be assumed to be present
by default in 3-5 years time.
Yes, it's likely the bundling model used for ensurepip (where CPython
releases bundle a recent version of pip without actually making it part
of the standard library) may be applied to other modules in the future.
The most probable first candidate for that treatment would be the distutils
build system, as switching to such a model would allow the build system to be
more readily kept consistent across multiple releases.
Other potential candidates for this kind of treatment would be the Tcl/Tk
graphics bindings, and the IDLE editor, which are already unbundled and turned
into optional addon installations by a number of redistributors.
By the very nature of things, the folks that tend to be most actively involved
in open source development are those folks working on open source applications
and shared abstraction layers.
The folks writing ad hoc scripts or designing educational exercises for their
students often won't even think of themselves as software developers - they're
teachers, system administrators, data analysts, quants, epidemiologists,
physicists, biologists, business analysts, market researchers, animators,
graphical designers, etc.
When all we have to worry about for a language is the application developer
experience, then we can make a lot of simplifying assumptions around what
people know, the kinds of tools they're using, the kinds of development
processes they're following, and the ways they're going to be building and
deploying their software.
Things get significantly more complicated when an application runtime also
enjoys broad popularity as a scripting engine. Doing either job well is
already difficult, and balancing the needs of both audiences as part of a single
project leads to frequent incomprehension and disbelief on both sides.
This post isn't intended to claim that we never make incorrect decisions as part
of the CPython development process - it's merely pointing out that the most
reasonable reaction to seemingly nonsensical feature additions to the Python
standard library is going to be "I'm not part of the intended target audience
for that addition" rather than "I have no interest in that, so it must be a
useless and pointless addition of no value to anyone, added purely to annoy me".
There have been a few recent articles reflecting on the current status of
the Python packaging ecosystem from an end user perspective, so it seems
worthwhile for me to write up my perspective as one of the lead architects
for that ecosystem on how I characterise the overall problem space of software
publication and distribution, where I think we are at the moment, and where I'd
like to see us go in the future.
For context, the specific articles I'm replying to are:
These are all excellent pieces considering the problem space from different
perspectives, so if you'd like to learn more about the topics I cover here,
I highly recommend reading them.
Since it heavily influences the way I think about packaging system design in
general, it's worth stating my core design philosophy explicitly:
As a software consumer, I should be able to consume libraries, frameworks,
and applications in the binary format of my choice, regardless of whether
or not the relevant software publishers directly publish in that format
As a software publisher working in the Python ecosystem, I should be able to
publish my software once, in a single source-based format, and have it be
automatically consumable in any binary format my users care to use
This is emphatically not the way many software packaging systems work - for a
great many systems, the publication format and the consumption format are
tightly coupled, and the folks managing the publication format or the
consumption format actively seek to use it as a lever of control over a
commercial market (think operating system vendor controlled application stores,
especially for mobile devices).
While we're unlikely to ever pursue the specific design documented in the
rest of the PEP (hence the "Deferred" status), the
"Development, Distribution, and Deployment of Python Software"
section of PEP 426 provides additional details on how this philosophy applies
in practice.
I'll also note that while I now work on software supply chain management
tooling at Red Hat, that wasn't the case when I first started actively
participating in the upstream Python packaging ecosystem
design process. Back then I was working
on Red Hat's main
hardware integration testing system, and
growing increasingly frustrated with the level of effort involved in
integrating new Python level dependencies into Beaker's RPM based development
and deployment model. Getting actively involved in tackling these problems on
the Python upstream side of things then led to also getting more actively
involved in addressing them on the
Red Hat downstream side.
When talking about the design of software packaging ecosystems, it's very easy
to fall into the trap of only considering the "direct to peer developers" use
case, where the software consumer we're attempting to reach is another developer
working in the same problem domain that we are, using a similar set of
development tools. Common examples of this include:
Linux distro developers publishing software for use by other contributors to
the same Linux distro ecosystem
Web service developers publishing software for use by other web service
developers
Data scientists publishing software for use by other data scientists
In these more constrained contexts, you can frequently get away with using a
single toolchain for both publication and consumption:
Linux: just use the system package manager for the relevant distro
Web services: just use the Python Packaging Authority's twine for publication
and pip for consumption
Data science: just use conda for everything
For newer languages that start in one particular domain with a preferred
package manager and expand outwards from there, the apparent simplicity arising
from this homogeneity of use cases may frequently be attributed as an essential
property of the design of the package manager, but that perception of inherent
simplicity will typically fade if the language is able to successfully expand
beyond the original niche its default package manager was designed to handle.
In the case of Python, for example, distutils was designed as a consistent
build interface for Linux distro package management, setuptools for plugin
management in the Open Source Applications Foundation's Chandler project, pip
for dependency management in web service development, and conda for local
language-independent environment management in data science.
distutils and setuptools haven't fared especially well from a usability
perspective when pushed beyond their original design parameters (hence the
current efforts to make it easier to use full-fledged build systems like
SCons and Meson as an alternative when publishing Python packages), while pip
and conda both seem to be doing a better job of accommodating increases in
their scope of application.
This history helps illustrate that where things really have the potential to
get complicated (even beyond the inherent challenges of domain-specific
software distribution) is when you start needing to cross domain boundaries.
For example, as the lead maintainer of contextlib in the Python
standard library, I'm also the maintainer of the contextlib2 backport
project on PyPI. That's not a domain specific utility - folks may need it
regardless of whether they're using a self-built Python runtime, a pre-built
Windows or Mac OS X binary they downloaded from python.org, a pre-built
binary from a Linux distribution, a CPython runtime from some other
redistributor (homebrew, pyenv, Enthought Canopy, ActiveState,
Continuum Analytics, AWS Lambda, Azure Machine Learning, etc), or perhaps even
a different Python runtime entirely (PyPy, PyPy.js, Jython, IronPython,
MicroPython, VOC, Batavia, etc).
Fortunately for me, I don't need to worry about all that complexity in the
wider ecosystem when I'm specifically wearing my contextlib2 maintainer
hat - I just publish an sdist and a universal wheel file to PyPI, and the rest
of the ecosystem has everything it needs to take care of redistribution
and end user consumption without any further input from me.
However, contextlib2 is a pure Python project that only depends on the
standard library, so it's pretty much the simplest possible case from a
tooling perspective (the only reason I needed to upgrade from distutils to
setuptools was so I could publish my own wheel files, and the only reason I
haven't switched to using the much simpler pure-Python-only flit instead of
either of them is that flit doesn't yet easily support publishing backwards
compatible setup.py based sdists).
This means that things get significantly more complex once we start wanting to
use and depend on components written in languages other than Python, so that's
the broader context I'll consider next.
When it comes to handling the software distribution problem in general, there
are two main ways of approaching it:
design a plugin management system that doesn't concern itself with the
management of the application framework that runs the plugins
design a platform component manager that not only manages the plugins
themselves, but also the application frameworks that run them
This "plugin manager or platform component manager?" question shows up over and
over again in software distribution architecture designs, but the case of most
relevance to Python developers is in the contrasting approaches that pip and
conda have adopted to handling the problem of external dependencies for Python
projects:
pip is a plugin manager for Python runtimes. Once you have a Python runtime
(any Python runtime), pip can help you add pieces to it. However, by design,
it won't help you manage the underlying Python runtime (just as it wouldn't
make any sense to try to install Mozilla Firefox as a Firefox Add-On, or
Google Chrome as a Chrome Extension)
conda, by contrast, is a component manager for a cross-platform platform
that provides its own Python runtimes (as well as runtimes for other
languages). This means that you can get pre-integrated components, rather
than having to do your own integration between plugins obtained via pip and
language runtimes obtained via other means
What this means is that pip, on its own, is not in any way a direct
alternative to conda. To get comparable capabilities to those offered by conda,
you have to add in a mechanism for obtaining the underlying language runtimes,
which means the alternatives are combinations like:
apt-get + pip
dnf + pip
yum + pip
pyenv + pip
homebrew (Mac OS X) + pip
python.org Windows installer + pip
Enthought Canopy
ActiveState's Python runtime + PyPM
This is the main reason why "just use conda" is excellent advice to any
prospective Pythonista that isn't already using one of the platform component
managers mentioned above: giving that answer replaces an otherwise operating
system dependent or Python specific answer to the runtime management problem
with a cross-platform and (at least somewhat) language neutral one.
It's an especially good answer for Windows users, as Chocolatey/OneGet/Windows
Package Management isn't remotely comparable to pyenv or homebrew at this point
in time, other runtime managers don't work on Windows, and getting folks
bootstrapped with MinGW, Cygwin or the new (still experimental) Windows
Subsystem for Linux is just another hurdle to place between them and whatever
goal they're learning Python for in the first place.
However, conda's pre-integration based approach to tackling the external
dependency problem is also why "just use conda for everything" isn't a
sufficient answer for the Python software ecosystem as a whole.
If you're working on an operating system component for Fedora, Debian, or any
other distro, you actually want to be using the system provided Python
runtime, and hence need to be able to readily convert your upstream Python
dependencies into policy compliant system dependencies.
Similarly, if you're wanting to support folks that deploy to a preconfigured
Python environment in services like AWS Lambda, Azure Cloud Functions, Heroku,
OpenShift or Cloud Foundry, or that use alternative Python runtimes like PyPy
or MicroPython, then you need a publication technology that doesn't tightly
couple your releases to a specific version of the underlying language runtime.
As a result, pip and conda end up existing at slightly different points in the
system integration pipeline:
Publishing and consuming Python software with pip is a matter of "bring your
own Python runtime". This has the benefit that you can readily bring your
own runtime (and manage it using whichever tools make sense for your use
case), but also has the downside that you must supply your own runtime
(which can sometimes prove to be a significant barrier to entry for new
Python users, as well as being a pain for cross-platform environment
management).
Like Linux system package managers before it, conda takes away the
requirement to supply your own Python runtime by providing one for you.
This is great if you don't have any particular preference as to which
runtime you want to use, but if you do need to use a different runtime
for some reason, you're likely to end up fighting against the tooling, rather
than having it help you. (If you're tempted to answer "Just add another
interpreter to the pre-integrated set!" here, keep in mind that doing so
without the aid of a runtime independent plugin manager like pip acts as a
multiplier on the platform level integration testing needed, which can be a
significant cost even when it's automated)
In case it isn't already clear from the above, I'm largely happy with the
respective niches that pip and conda are carving out for themselves as a
plugin manager for Python runtimes and as a cross-platform platform focused
on (but not limited to) data analysis use cases.
However, there's still plenty of scope to improve the effectiveness of the
collaboration between the upstream Python Packaging Authority and downstream
Python redistributors, as well as to reduce barriers to entry for participation
in the ecosystem in general, so I'll go over some of the key areas I see for
potential improvement.
It's not a secret that the core PyPA infrastructure (PyPI, pip, twine,
setuptools) is
nowhere near as well-funded
as you might expect given its criticality to the operations of some truly
enormous organisations.
The biggest impact of this is that even when volunteers show up ready and
willing to work, there may not be anybody in a position to effectively wrangle
those volunteers, and help keep them collaborating effectively and moving in a
productive direction.
To secure long term sustainability for the core Python packaging infrastructure,
we're only talking amounts on the order of a few hundred thousand dollars a
year - enough to cover some dedicated operations and publisher support staff for
PyPI (freeing up the volunteers currently handling those tasks to help work on
ecosystem improvements), as well as to fund targeted development directed at
some of the other problems described below.
However, rather than being a true
"tragedy of the commons",
I personally chalk this situation up to a different human cognitive bias: the
bystander effect.
The reason I think that is that we have so many potential sources of the
necessary funding that even folks that agree there's a problem that needs to be
solved are assuming that someone else will take care of it, without actually
checking whether or not that assumption is entirely valid.
The primary responsibility for correcting that oversight falls squarely on the
Python Software Foundation, which is why the Packaging Working Group was
formed in order to investigate possible sources of additional funding, as well
as to determine how any such funding can be spent most effectively.
However, a secondary responsibility also falls on customers and staff of
commercial Python redistributors, as this is exactly the kind of ecosystem
level risk that commercial redistributors are being paid to manage on behalf of
their customers, and they're currently not handling this particular situation
very well. Accordingly, anyone that's actually paying for CPython, pip, and
related tools (either directly or as a component of a larger offering), and
expecting them to be supported properly as a result, really needs to be asking
some very pointed questions of their suppliers right about now. (Here's a sample
question: "We pay you X dollars a year, and the upstream Python ecosystem is
one of the things we expect you to support with that revenue. How much of what
we pay you goes towards maintenance of the upstream Python packaging
infrastructure that we rely on every day?").
One key point to note about the current situation is that as a 501(c)(3) public
interest charity, any work the PSF funds will be directed towards better
fulfilling that public interest mission, and that means focusing primarily on
the needs of educators and non-profit organisations, rather than those of
private for-profit entities.
Commercial redistributors are thus far better positioned to properly
represent their customers' interests in areas where their priorities may
diverge from those of the wider community (closing the "insider threat"
loophole in PyPI's current security model is a particular case that comes to
mind - see Making PyPI security independent of SSL/TLS).
An instance of the new PyPI implementation (Warehouse) is up and running at
https://pypi.org/ and connected directly to the
production PyPI database, so folks can already explicitly opt-in to using it
over the legacy implementation if they prefer to do so.
However, there's still a non-trivial amount of design, development and QA work
needed on the new version before all existing traffic can be transparently
switched over to using it.
Getting at least this step appropriately funded and a clear project management
plan in place is the main current focus of the PSF's Packaging Working Group.
Between the wheel format and the manylinux1 usefully-distro-independent
ABI definition, this is largely handled now, with conda available as an
option to handle the relatively small number of cases that are still a problem
for pip.
The main unsolved problem is to allow projects to properly express the
constraints they place on target environments so that issues can be detected
at install time or repackaging time, rather than only being detected as
runtime failures. Such a feature will also greatly expand the ability to
correctly generate platform level dependencies when converting Python
projects to downstream package formats like those used by conda and Linux
system package managers.
With pip being bundled with recent versions of CPython (including CPython 2.7
maintenance releases), and pip (or a variant like upip) also being bundled with
most other Python runtimes, the ecosystem bootstrapping problem has largely
been addressed for new Python users.
There are still a few usability challenges to be addressed (like defaulting
to per-user installations when outside a virtual environment, interoperating
more effectively with platform component managers like conda, and providing
an officially supported installation interface that works at the Python prompt
rather than via the operating system command line), but those don't require
the same level of political coordination across multiple groups that was
needed to establish pip as the lowest common denominator approach to
dependency management for Python applications.
As mentioned above, distutils was designed ~18 years ago as a common interface
for Linux distributions to build Python projects, while setuptools was designed
~12 years ago as a plugin management system for an open source Microsoft
Exchange replacement. While both projects have given admirable service in
their original target niches, and quite a few more besides, their age and
original purpose means they're significantly more complex than what a user
needs if all they want to do is to publish their pure Python library or
framework to the Python Package index.
Their underlying complexity also makes it incredibly difficult to improve the
problematic state of their documentation, which is split between the legacy
distutils documentation in the CPython standard library and the additional
setuptools specific documentation in the setuptools project.
Accordingly, what we want to do is to change the way build toolchains for
Python projects are organised to have 3 clearly distinct tiers:
toolchains for pure Python projects
toolchains for Python projects with simple C extensions
toolchains for C/C++/other projects with Python bindings
This allows folks to be introduced to simpler tools like flit first, better
enables the development of potential alternatives to setuptools at the second
tier, and supports the use of full-fledged pip-installable build systems like
SCons and Meson at the third tier.
The first step in this project, defining the pyproject.toml format to allow
declarative specification of the dependencies needed to launch setup.py,
has been implemented, and Daniel Holth's enscons project demonstrates that
that is already sufficient to bootstrap an external build system even without
the later stages of the project.
Future steps include providing native support for pyproject.toml in pip
and easy_install, as well as defining a declarative approach to invoking
the build system rather than having to run setup.py with the relevant
distutils & setuptools flags.
PyPI currently relies entirely on SSL/TLS to protect the integrity of the link
between software publishers and PyPI, and between PyPI and software consumers.
The only protections against insider threats from within the PyPI
administration team are ad hoc usage of GPG artifact signing by some projects,
personal vetting of new team members by existing team members and 3rd party
checks against previously published artifact hashes unexpectedly changing.
However, implementing a more comprehensive solution (such as an end-to-end
artifact signing system for PyPI) has been gated not only on being able to
first retire the legacy infrastructure, but also on the PyPI administrators being
able to credibly commit to the key management obligations of operating the
signing system, as well as to ensuring that the system-as-implemented actually
provides the security guarantees of the system-as-designed.
Accordingly, this isn't a project that can realistically be pursued until the
underlying sustainability problems have been suitably addressed.
While redistributors will generally take care of converting upstream Python
packages into their own preferred formats, the Python-specific wheel format
is currently a case where it is left up to publishers to decide whether or
not to create them, and if they do decide to create them, how to automate that
process.
Having PyPI take care of this process automatically is an obviously desirable
feature, but it's also an incredibly expensive one to build and operate.
Thus, it currently makes sense to defer this cost to individual projects, as
there are quite a few commercial continuous integration and continuous
deployment service providers willing to offer free accounts to open source
projects, and these can also be used for the task of producing release
artifacts. Projects also remain free to only publish source artifacts, relying
on pip's implicit wheel creation and caching and the appropriate use of
private PyPI mirrors and caches to meet the needs of end users.
For downstream platform communities already offering shared build
infrastructure to their members (such as Linux distributions and conda-forge),
it may make sense to offer Python wheel generation as a supported output option
for cross-platform development use cases, in addition to the platform's native
binary packaging format.
One of the more puzzling aspects of Python for newcomers to the language is the
stark usability differences between the standard library's urllib module
and the popular (and well-recommended) third party module, requests, when
it comes to writing HTTP(S) protocol clients. When your problem is
"talk to a HTTP server", the difference in usability isn't immediately obvious,
but it becomes clear as soon as additional requirements like SSL/TLS,
authentication, redirect handling, session management, and JSON request and
response bodies enter the picture.
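To make that contrast concrete, here's a sketch of retrieving a JSON response body with each (the URL is a placeholder, and error handling is omitted):

import json
from urllib.request import Request, urlopen

import requests  # third party

URL = "https://api.example.com/items"  # placeholder endpoint

# Standard library: compose the pieces yourself
request = Request(URL, headers={"Accept": "application/json"})
with urlopen(request) as response:
    items = json.loads(response.read().decode("utf-8"))

# requests: the HTTP(S)-specific conveniences are built in
items = requests.get(URL, headers={"Accept": "application/json"}).json()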
It's tempting, and entirely understandable, to want to
chalk this difference
in ease of use up to requests being "Pythonic" (in 2016 terms), while urllib
has now become un-Pythonic (despite being included in the standard library).
While there are certainly a few elements of that (e.g. the property builtin
was only added in Python 2.2, while urllib2 was included in the original
Python 2.0 release and hence couldn't take that into account in its API design),
the vast majority of the usability difference relates to an entirely different
question we often forget to ask about the software we use:
What problem does it solve?
That is, many otherwise surprising discrepancies between urllib/urllib2
and requests are best explained by the fact that they solve different
problems, and the problems most HTTP client developers have today
are closer to those Kenneth Reitz designed requests to solve in 2010/2011,
than they are to the problems that Jeremy Hylton was aiming to solve more than
a decade earlier.
It's all in the name
To quote the current Python 3 urllib package documentation: "urllib is a
package that collects several modules for working with URLs".
And the docstring from Jeremy's
original commit message
adding urllib2 to CPython: "An extensible library for opening URLs using a
variety [of] protocols".
Wait, what? We're just trying to write a HTTP client, so why is the
documentation talking about working with URLs in general?
While it may seem strange to developers accustomed to the modern HTTPS+JSON
powered interactive web, it wasn't always clear that that was how things were
going to turn out.
At the turn of the century, the expectation was instead that we'd retain a
rich variety of data transfer protocols with different characteristics optimised
for different purposes, and that the most useful client to have in the standard
library would be one that could be used to talk to multiple different kinds
of servers (like HTTP, FTP, NFS, etc), without client developers needing to
worry too much about the specific protocol used (as indicated by the URL
schema).
In practice, things didn't work out that way (mostly due to restrictive
institutional firewalls meaning HTTP servers were the only remote services that
could be accessed reliably), so folks in 2016 are now regularly comparing the
usability of a dedicated HTTP(S)-only client library with a general purpose
URL handling library that needs to be configured to specifically be using
HTTP(S) before you gain access to most HTTP(S) features.
When it was written, urllib2 was a square peg that was designed to fit into
the square hole of "generic URL processing". By contrast, most modern client
developers are looking for a round peg to fit into the round hole that is
HTTPS+JSON processing - urllib/urllib2 will fit if you shave the corners
off first, but requests comes pre-rounded.
So why not add requests to the standard library?
Answering the not-so-obvious question of "What problem does it solve?" then
leads to a more obvious follow-up question: if the problems that urllib/
urllib2 were designed to solve are no longer common, while the problems that
requests solves are common, why not add requests to the standard library?
If I recall correctly, Guido gave in-principle approval to this idea at a
language summit back in 2013 or so (after the requests 1.0 release), and it's
a fairly common assumption amongst the core development team that either
requests itself (perhaps as a bundled snapshot of an independently upgradable
component) or a compatible subset of the API with a different implementation
will eventually end up in the standard library.
However, even putting aside the
misgivings of the requests developers
about the idea, there are still some non-trivial system integration problems
to solve in getting requests to a point where it would be acceptable as a
standard library component.
In particular, one of the things that requests does to more reliably handle
SSL/TLS certificates in a cross-platform way is to bundle the Mozilla
Certificate Bundle included in the certifi project. This is a sensible
thing to do by default (due to the difficulties of obtaining reliable access
to system security certificates in a cross-platform way), but it conflicts
with the security policy of the standard library, which specifically aims to
delegate certificate management to the underlying operating system. That policy
aims to address two needs: allowing Python applications access to custom
institutional certificates added to the system certificate store (most notably,
private CA certificates for large organisations), and avoiding adding an
additional certificate store to end user systems that needs to be updated when
the root certificate bundle changes for any other reason.
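The difference is easy to see from the interpreter (a sketch that assumes certifi is installed, as it will be wherever requests is):

import ssl
import certifi   # the Mozilla CA bundle that requests bundles and uses by default

# requests-style: a CA bundle file shipped as package data
print(certifi.where())           # path to the bundled cacert.pem

# Standard library policy: delegate to the operating system's certificate store
context = ssl.create_default_context()
print(len(context.get_ca_certs()))   # however many CA certs the OS provides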
These kinds of problems are technically solvable, but they're not fun to solve,
and the folks in a position to help solve them already have a great many other
demands on their time. This means we're not likely to see much in the way of
progress in this area as long as most of the CPython and requests developers
are pursuing their upstream contributions as a spare time activity, rather than
as something they're specifically employed to do.
Involved in Australian education, whether formally or informally?
Making use of Python in your classes, workshops or other activities?
Interested in sharing your efforts with other Australian educators, and with
the developers that create the tools you use? Able to get to the Melbourne
Convention & Exhibition Centre on Friday August 12th, 2016?
Then please consider submitting a proposal to speak at the Python in Australian
Education seminar at PyCon Australia 2016! More information about the seminar
can be found
here,
while details of the submission process are on the main
Call for Proposals
page.
Submissions close on Sunday May 8th, but may be edited further after submission
(including during the proposal review process based on feedback from reviewers).
PyCon Australia is a community-run conference, so everyone involved is a
volunteer (organisers, reviewers, and speakers alike), but accepted speakers
are eligible for discounted (or even free) registration, and assistance with
other costs is also available to help ensure the conference doesn't miss out
on excellent presentations due to financial need (for teachers needing to
persuade skeptical school administrators, this assistance may extend to
contributing towards the costs of engaging a substitute teacher for the day).
The background
At PyCon Australia 2014, James Curran presented an excellent keynote on
"Python for Every Child in Australia",
covering some of the history of the National Computer Science School, the
development of Australia's National Digital Curriculum (finally approved in
September 2015), and the opportunity this represented to introduce the next
generation of students to computational thinking in general, and Python in
particular.
Encouraged by both Dr Curran's keynote at PyCon Australia, and Professor Lorena
Barba's
"If There's Computational Thinking, There's Computational Learning" keynote at SciPy 2014, it was my honour and privilege
in 2015 not only to invite Carrie Anne Philbin, Education Pioneer at the
UK's Raspberry Pi Foundation, to speak at the main conference (on
"Designed for Education: a Python Solution"),
but also to invite her to keynote the inaugural Python in Australian Education
seminar. With the support of the Python Software Foundation and Code Club
Australia, Carrie Anne joined QSITE's Peter Whitehouse, Code Club Australia's
Kelly Tagalan, and several other local educators, authors and community workshop
organisers to present an informative, inspirational and sometimes challenging
series of talks.
For 2016, we have a new location in Melbourne (PyCon Australia has a two year
rotation in each city, and the Education seminar was launched during the
second year in Brisbane), a new co-organiser (Katie Bell of Grok Learning and
the National Computer Science School), and a Call for Proposals and financial
assistance program that are fully integrated with those for the main conference.
As with the main conference, however, the Python in Australian Education seminar
is designed around the idea of real world practitioners sharing information with
each other about their day to day experiences, what has worked well for them,
and what hasn't, and creating personal connections that can help facilitate
additional collaboration throughout the year.
So, in addition to encouraging people to submit their own proposals, I'd also
encourage folks to talk to their friends and peers that they'd like to see
presenting, and see if they're interested in participating.
As a co-designer of one of the world's most popular programming languages, one
of the more frustrating behaviours I regularly see (both in the Python community
and in others) is influential people trying to tap into fears of "losing" to
other open source communities as a motivating force for community contributions.
(I'm occasionally guilty of this misbehaviour myself, which makes it even
easier to spot when others are falling into the same trap).
While learning from the experiences of other programming language communities
is a good thing, fear based approaches to motivating action are seriously
problematic, as they encourage community members to see members of those
other communities as enemies in a competition for contributor attention, rather
than as potential allies in the larger challenge of advancing the state of the
art in software development. It also has the effect of telling folks that enjoy
those other languages that they're not welcome in a community that views them
and their peers as "hostile competitors".
In truth, we want there to be a rich smorgasbord of cross platform open
source programming languages to choose from, as programming languages are first
and foremost tools for thinking - they make it possible for us to convey our
ideas in terms so explicit that even a computer can understand them. If someone
has found a language to use that fits their brain and solves their immediate
problems, that's great, regardless of the specific language (or languages)
they choose.
So I have three specific requests for the Python community, and one broader
suggestion. First, the specific requests:
If we find it necessary to appeal to tribal instincts to motivate action, we
should avoid using tribal fear, and instead aim to use tribal pride.
When we use fear as a motivator, as in phrasings like "If we don't do X,
we're going to lose developer mindshare to language Y", we're deliberately
creating negative emotions in folks freely contributing the results of their
work to the world at large. Relying on tribal pride instead leads to
phrasings like "It's currently really unclear how to solve problem X in
Python. If we look to ecosystem Y, we can see they have a really nice
approach to solving problem X that we can potentially adapt to provide a
similarly nice user experience in Python". Actively emphasising taking pride
in our own efforts, rather than denigrating the efforts of others, helps
promote a culture of continuous learning within the Python community and
also encourages the development of ever improving collaborative
relationships with other communities.
Refrain from adopting attitudes of contempt towards other open source
programming language communities, especially if those communities have
empowered people to solve their own problems rather than having to wait for
commercial software vendors to deign to address them. Most of the important
problems in the world aren't profitable to solve (as the folks afflicted by
them aren't personally wealthy and don't control institutional funding
decisions), so we should be encouraging and applauding the folks stepping up
to try to solve them, regardless of what we may think of their technology
choices.
If someone we know is learning to program for the first time, and they
choose to learn a language we don't personally like, we should support them
in their choice anyway. They know what fits their brain better than we do,
so the right language for us may not be the right language for them. If
they start getting frustrated with their original choice, to the point where
it's demotivating them from learning to program at all, then it makes sense
to start recommending alternatives. This advice applies even for those of us
involved in improving the tragically bad state of network security: the way
we solve the problem with inherently insecure languages is by improving
operating system sandboxing capabilities, progressively knocking down
barriers to adoption for languages with better native security properties,
and improving the default behaviours of existing languages, not by confusing
beginners with arguments about why their chosen language is a poor choice
from an application security perspective. (If folks are deploying unaudited
software written by beginners to handle security sensitive tasks, it isn't
the folks writing the software that are the problem, it's the folks
deploying it without performing appropriate due diligence on the provenance
and security properties of that software)
My broader suggestion is aimed at folks that are starting to encounter the
limits of the core procedural subset of Python and would hence like to start
exploring more of Python's own available "tools for thinking".
One of the things we do as part of the Python core development process is to
look at features we appreciate having available in other languages we have
experience with, and see whether or not there is a way to adapt them to be
useful in making Python code easier to both read and write. This means that
learning another programming language that focuses more specifically on a
given style of software development can help improve anyone's understanding
of that style of programming in the context of Python.
To aid in such efforts, I've provided a list below of some possible areas for
exploration, and other languages which may provide additional insight into
those areas. Where possible, I've linked to Wikipedia pages rather than
directly to the relevant home pages, as Wikipedia often provides interesting
historical context that's worth exploring when picking up a new programming
language as an educational exercise rather than for immediate practical use.
While I do know many of these languages personally (and have used several of
them in developing production systems), the full list of recommendations
includes additional languages that I only know indirectly (usually by either
reading tutorials and design documentation, or by talking to folks that I trust
to provide good insight into a language's strengths and weaknesses).
There are a lot of other languages that could have gone on this list, so the
specific ones listed are a somewhat arbitrary subset based on my own interests
(for example, I'm mainly interested in the dominant Linux, Android and Windows
ecosystems, so I left out the niche-but-profitable Apple-centric Objective-C
and Swift programming languages, and I'm not familiar enough with art-focused
environments like Processing to even guess at what learning them might teach
a Python developer). For a more complete list that takes into account factors
beyond what a language might teach you as a developer, IEEE Spectrum's
annual ranking of programming language popularity and growth is well worth a
look.
Python's default execution model is procedural: we start at the top of the main
module and execute it statement by statement. All of Python's support for the
other approaches to data and computational modelling covered below is built
on this procedural foundation.
The C programming language is still the unchallenged ruler of low level
procedural programming. It's the core implementation language for the reference
Python interpreter, and also for the Linux operating system kernel. As a
software developer, learning C is one of the best ways to start learning more
about the underlying hardware that executes software applications - C is often
described as "portable assembly language", and one of the first applications
cross-compiled for any new CPU architecture will be a C compiler.
Rust, by contrast, is a relatively new programming language created by
Mozilla. The reason it makes this list is because Rust aims to take all of the
lessons we've learned as an industry regarding what not to do in C, and
design a new language that is interoperable with C libraries, offers the same
precise control over hardware usage that is needed in a low level systems
programming language, but uses a different compile time approach to data modelling
and memory management to structurally eliminate many of the common flaws
afflicting C programs (such as buffer overflows, double free errors, null
pointer access, and thread synchronisation problems). I'm an embedded systems
engineer by training and initial professional experience, and Rust is the first
new language I've seen that looks like it may have the potential to scale down
to all of the niches currently dominated by C and custom assembly code.
Cython is also a lower level procedural-by-default language, but unlike
general purpose languages like C and Rust, Cython is aimed specifically at
writing CPython extension modules. To support that goal, Cython is designed as
a Python superset, allowing the programmer to choose when to favour the pure
Python syntax for flexibility, and when to favour Cython's syntax extensions
that make it possible to generate code that is equivalent to native C code in
terms of speed and memory efficiency.
Learning one of these languages is likely to provide insight into memory
management, algorithmic efficiency, binary interface compatibility, software
portability, and other practical aspects of turning source code into running
systems.
One of the main things we need to do in programming is to model the state of
the real world, and offering native syntactic support for object-oriented
programming is one of the most popular approaches for doing that:
structurally grouping data structures, and methods for operating on those
data structures into classes.
Python itself is deliberately designed so that it is possible to use the
object-oriented features without first needing to learn to write your own
classes. Not every language adopts that approach - those listed in this section
are ones that consider learning object-oriented design to be a requirement for
using the language at all.
After a major marketing push by Sun Microsystems in the mid-to-late 1990's,
Java became the default language for teaching introductory computer science
in many tertiary institutions. While it is now being displaced by Python for
many educational use cases, it remains one of the most popular languages for
the development of business applications. There are a range of other languages
that target the common JVM (Java Virtual Machine) runtime, including the
Jython implementation of Python. The Dalvik and ART environments for Android
systems are based on a reimplementation of the Java programming APIs.
C# is similar in many ways to Java, and emerged as an alternative after Sun
and Microsoft failed to work out their business differences around Microsoft's
Java implementation, J++. Like Java, it's a popular language for the
development of business applications, and there are a range of other languages
that target the shared .NET CLR (Common Language Runtime), including
the IronPython implementation of Python (the core components of the original
IronPython 1.0 implementation were extracted to create the language neutral
.NET Dynamic Language Runtime). For a long time, .NET was a proprietary Windows
specific technology, with mono as a cross-platform open source
reimplementation, but Microsoft shifted to an open source ecosystem strategy
in early 2015.
Unlike most of the languages in this list, Eiffel isn't one I'd recommend
for practical day-to-day use. Rather, it's one I recommend because learning it
taught me an incredible amount about good object-oriented design where
"verifiably correct" is a design goal for the application. (Learning Eiffel also
taught me a lot about why "verifiably correct" isn't actually a design goal in
most software development, as verifiably correct software really doesn't cope
well with ambiguity and is entirely unsuitable for cases where you genuinely
don't know the relevant constraints yet and need to leave yourself enough
wiggle room to be able to figure out the finer details through iterative
development).
Learning one of these languages is likely to provide insight into inheritance
models, design-by-contract, class invariants, pre-conditions, post-conditions,
covariance, contravariance, method resolution order, generic programming, and
various other notions that also apply to Python's type system. There are also
a number of standard library modules and third party frameworks that use this
"visibly object-oriented" design style, such as the unittest and logging
modules, and class-based views in the Django web framework.
One way of using the CPython runtime is as a "C with objects" programming
environment - at its core, CPython is implemented using C's approach to
object-oriented programming, which is to define C structs to hold the data
of interest, and to pass in instances of the struct as the first argument to
functions that then manipulate that data (these are the omnipresent
PyObject* pointers in the CPython C API). This design pattern is
deliberately mirrored at the Python level in the form of the explicit self
and cls arguments to instance methods and class methods.
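A rough Python rendition of that C design pattern (with hypothetical names, just to illustrate the parallel with explicit self):

# C-style: a plain record plus functions that take it as their first argument
class CounterData:
    def __init__(self):
        self.count = 0

def counter_increment(data, amount=1):
    data.count += amount

# The same idea expressed as a method: 'self' is that first argument, spelled out
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self, amount=1):
        self.count += amount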
C++ is a programming language that aimed to retain full source compatibility
with C, while adding higher level features like native object-oriented
programming support and template based metaprogramming. It's notoriously verbose
and hard to program in (although the 2011 update to the language standard
addressed many of the worst problems), but it's also the language of choice in
many contexts, including 3D modelling graphics engines and cross-platform
application development frameworks like Qt.
The D programming language is also interesting, as it has a similar
relationship to C++ as Rust has to C: it aims to keep most of the desirable
characteristics of C++, while also avoiding many of its problems (like the lack
of memory safety). Unlike Rust, D was not a ground up design of a new
programming language from scratch - instead, D is a close derivative of C++,
and while it isn't a strict C superset as C++ is, it does follow the design
principle that any code that falls into the common subset of C and D must
behave the same way in both languages.
Learning one of these languages is likely to provide insight into the
complexities of combining higher level language features with the underlying
C runtime model. Learning C++ is also likely to be useful when using Python
to manipulate existing libraries and toolkits written in C++.
Array oriented programming is designed to support numerical programming models:
those based on matrix algebra and related numerical methods.
While Python's standard library doesn't support this directly, array oriented
programming is taken into account in the language design, with a range of
syntactic and semantic features being added specifically for the benefit of
the third party NumPy library and similarly array-oriented tools.
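For example, extended slicing, the Ellipsis literal, and the @ matrix multiplication operator added in Python 3.5 all exist primarily for the benefit of such libraries (a small sketch assuming NumPy is installed):

import numpy as np

grid = np.arange(12).reshape(3, 4)

print(grid[1:, ::2])     # multidimensional extended slicing
print(grid[..., 0])      # Ellipsis: "all leading dimensions"
print(grid @ grid.T)     # PEP 465 matrix multiplication operator (Python 3.5+)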
In many cases, the Scientific Python stack is adopted as an alternative to
the proprietary MATLAB programming environment, which is used extensively
for modelling, simulation and numerical data analysis in science and
engineering. GNU Octave is an open source alternative that aims to be
syntactically compatible with MATLAB code, allowing folks to compare and
contrast the two approaches to array-oriented programming.
Julia is another relatively new language, which focuses heavily on array
oriented programming and type-based function overloading.
Learning one of these languages is likely to provide insight into the
capabilities of the Scientific Python stack, as well as providing opportunities
to explore hardware level parallel execution through technologies like OpenCL
and Nvidia's CUDA, and distributed data processing through ecosystems like
Apache Spark and the Python-specific Blaze.
As access to large data sets has grown, so has demand for capable freely
available analytical tools for processing those data sets. One such tool is
the R programming language, which focuses specifically on statistical data
analysis and visualisation.
Learning R is likely to provide insight into the statistical analysis
capabilities of the Scientific Python stack, especially the pandas data
manipulation library and the seaborn statistical visualisation library.
Object-oriented data modelling and array-oriented data processing focus a lot
of attention on modelling data at rest, either in the form of collections of
named attributes or as arrays of structured data.
By contrast, functional programming languages emphasise the modelling of data
in motion, in the form of computational flows. Learning at least the basics
of functional programming can help greatly improve the structure of data
transformation operations even in otherwise procedural, object-oriented or
array-oriented applications.
Haskell is a functional programming language that has had a significant
influence on the design of Python, most notably through the introduction of
list comprehensions in Python 2.0.
Scala is an (arguably) functional programming language for the JVM that,
together with Java, Python and R, is one of the four primary programming
languages for the Apache Spark data analysis platform. While being designed to
encourage functional programming approaches, Scala's syntax, data model, and
execution model are also designed to minimise barriers to adoption for current
Java programmers (hence the "arguably" - the case can be made that Scala is
better categorised as an object-oriented programming language with strong
functional programming support).
Clojure is another functional programming language for the JVM that is
designed as a dialect of Lisp. It earns its place in this list by being
the inspiration for the toolz functional programming toolkit for Python.
F# isn't a language I'm particularly familiar with myself, but seems worth
noting as the preferred functional programming language for the .NET CLR.
Learning one of these languages is likely to provide insight into Python's own
computational pipeline modelling tools, including container comprehensions,
generators, generator expressions, the functools and itertools standard
library modules, and third party functional Python toolkits like toolz.
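As a small illustrative sketch of that pipeline style using only the standard library:

from functools import reduce
from itertools import count, islice

# A lazy computational pipeline: squares of odd numbers, consumed on demand
odds = (n for n in count(1) if n % 2)
squares = (n * n for n in odds)

first_five = list(islice(squares, 5))
print(first_five)                              # [1, 9, 25, 49, 81]
print(reduce(lambda a, b: a + b, first_five))  # 165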
Computational pipelines are an excellent way to handle data transformation and
analysis problems, but many problems require that an application run as a
persistent service that waits for events to occur, and then handles those
events. In these kinds of services, it is usually essential to be able to handle
multiple events concurrently in order to be able to accommodate multiple users
(or at least multiple actions) at the same time.
JavaScript was originally developed as an event handling language for web
browsers, permitting website developers to respond locally to client side
actions (such as mouse clicks and key presses) and events (such as the page
rendering being completed). It is supported in all modern browsers, and
together with the HTML5 Document Object Model (DOM), has become a de facto standard
for defining the appearance and behaviour of user interfaces.
Go was designed by Google as a purpose built language for creating highly
scalable web services, and has also proven to be a very capable language for
developing command line applications. The most interesting aspect of Go from
a programming language design perspective is its use of Communicating
Sequential Processes concepts in its core concurrency model.
Erlang was designed by Ericsson as a purpose built language for creating
highly reliable telephony switches and similar devices, and is the language
powering the popular RabbitMQ message broker. Erlang uses the Actor model
as its core concurrency primitive, passing messages between threads of
execution, rather than allowing them to share data directly. While I've never
programmed in Erlang myself, my first full-time job involved working with (and
on) an Actor-based concurrency framework for C++ developed by an ex-Ericsson
engineer, as well as developing such a framework myself based on the TSK (Task)
and MBX (Mailbox) primitives in Texas Instruments' lightweight DSP/BIOS
runtime (now known as TI-RTOS).
Elixir earns an entry on the list by being a language designed to run on the
Erlang VM that exposes the same concurrency semantics as Erlang, while also
providing a range of additional language level features to help provide a more
well-rounded environment that is more likely to appeal to developers migrating
from other languages like Python, Java, or Ruby.
Learning one of these languages is likely to provide insight into Python's own
concurrency and parallelism support, including native coroutines, generator
based coroutines, the concurrent.futures and asyncio standard
library modules, third party network service development frameworks like
Twisted and Tornado, the channels concept being introduced to Django,
and the event handling loops in GUI frameworks.
One of the more controversial features that landed in Python 3.5 was the new
typing module, which brings a standard lexicon for gradual typing support
to the Python ecosystem.
For folks whose primary exposure to static typing is in languages like C,
C++ and Java, this seems like an astoundingly terrible idea (hence the
controversy).
Microsoft's TypeScript, which brings gradual typing to JavaScript
applications, provides a better illustration of the concept. TypeScript code
compiles to JavaScript code (which then doesn't include any runtime type
checking), and TypeScript annotations for popular JavaScript libraries are
maintained in the dedicated DefinitelyTyped repository.
As Chris Neugebauer pointed out in his PyCon Australia presentation, this is
very similar to the proposed relationship between Python, the typeshed type
hint repository, and type inference and analysis tools like mypy.
In essence, both TypeScript and type hinting in Python are ways of writing
particular kinds of tests, either as separate files (just like normal tests),
or inline with the main body of the code (just like type declarations in
statically typed languages). In either case, you run a separate command to
actually check that the rest of the code is consistent with the available type
assertions (this occurs implicitly as part of the compilation to JavaScript for
TypeScript, and as an entirely optional static analysis task for Python's type
hinting).
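In Python terms, that workflow looks something like the following sketch (the function names here are purely illustrative), with the actual checking performed by a separate tool such as mypy rather than by the interpreter:

from typing import List

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

def questionable() -> float:
    # A type checker flags this call; the interpreter only objects if it's ever run
    return mean(["a", "b"])

print(mean([1.0, 2.0, 3.0]))   # the annotations impose no constraint at runtime

Running a checker like mypy over that file then reports the mismatch in questionable() without executing any of the code.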
A feature folks coming to Python from languages like C, C++, C# and Java often
find disconcerting is the notion that "code is data": the fact that things like
functions and classes are runtime objects that can be manipulated like any
other object.
Hy is a Lisp dialect that runs on both the CPython VM and the PyPy VM. Lisp
dialects take the "code as data" concept to extremes, as Lisp code consists of
nested lists describing the operations to be performed (the name of the language
itself stands for "LISt Processor"). The great strength of Lisp-style languages
is that they make it incredibly easy to write your own domain specific
languages. The great weakness of Lisp-style languages is that they make it
incredibly easy to write your own domain specific languages, which can sometimes
make it difficult to read other people's code.
Ruby is a language that is similar to Python in many respects, but as a
community is far more open to making use of dynamic metaprogramming features
that are "supported, but not encouraged" in Python. This includes things like
reopening class definitions to add additional methods, and using closures to
implement core language constructs like iteration.
Learning one of these languages is likely to provide insight into Python's own
dynamic metaprogramming support, including function and class decorators,
monkeypatching, the unittest.mock standard library module, and third
party object proxying modules like wrapt. (I'm not aware of any languages to
learn that are likely to provide insight into Python's metaclass system, so if
anyone has any suggestions on that front, please mention them in the comments.
Metaclasses power features like the core type system, abstract base classes,
enumeration types and runtime evaluation of gradual typing expressions)
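As a small taste of that dynamic metaprogramming support (purely illustrative):

import functools

def traced(func):
    """A decorator: functions are objects, so they can be wrapped and replaced"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper

@traced
def greet(name):
    return "Hello, " + name

print(greet("World"))

# Monkeypatching: rebinding an attribute on an existing class at runtime
class Greeter:
    def greet(self):
        return "Hi"

Greeter.greet = lambda self: "Hi there"
print(Greeter().greet())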
Popular programming languages don't exist in isolation - they exist as part of
larger ecosystems of redistributors (both commercial and community focused),
end users, framework developers, tool developers, educators and more.
Lua is a popular programming language for embedding in larger applications
as a scripting engine. Significant examples include it being the language
used to write add-ons for the World of Warcraft game client, and it's also
embedded in the RPM utility used by many Linux distributions. Compared to
CPython, a Lua runtime will generally be a tenth of the size, and its weaker
introspection capabilities generally make it easier to isolate from the rest of
the application and the host operating system. A notable contribution from the
Lua community to the Python ecosystem is the adoption of the LuaJIT FFI
(Foreign Function Interface) as the basis of the JIT-friendly cffi interface
library for CPython and PyPy.
PHP is another popular programming language that rose to prominence as the
original "P" in the Linux-Apache-MySQL-PHP LAMP stack, due to its focus on
producing HTML pages, and its broad availability on early Virtual Private
Server hosting providers. For all the handwringing about conceptual flaws in
various aspects of its design, it's now the basis of several widely popular
open source web services, including the Drupal content management system, the
WordPress blogging engine, and the MediaWiki engine that powers Wikipedia. PHP
also powers important services like the Ushahidi platform for crowdsourced
community reporting on distributed events.
Like PHP, Perl rose to popularity on the back of Linux. Unlike PHP, which
grew specifically as a web development platform, Perl rose to prominence as
a system administrator's tool, using regular expressions to string together
and manipulate the output of text-based Linux operating system commands. When
sh, awk and sed were no longer up to handling a task, Perl was there
to take over.
Learning one of these languages isn't likely to provide any great insight into
aesthetically beautiful or conceptually elegant programming language design.
What it is likely to do is to provide some insight into how programming
language distribution and adoption works in practice, and how much that has to
do with fortuitous opportunities, accidents of history and lowering barriers to
adoption by working with redistributors to be made available by default, rather
than the inherent capabilities of the languages themselves.
Finally, I fairly regularly get into discussions with functional and
object-oriented programming advocates claiming that those kinds of languages
are just as easy to learn as procedural ones.
I think the OOP folks have a point if we're talking about teaching through
embodied computing (e.g. robotics), where the objects being modelled in
software have direct real world counterparts the students can touch, like
sensors, motors, and relays.
For everyone else though, I now have a standard challenge: pick up a cookbook,
translate one of the recipes into the programming language you're claiming is
easy to learn, and then get a student that understands the language the
original cookbook was written in to follow the translated recipe. Most of the
time folks don't need to actually follow through on this - just running it
as a thought experiment is enough to help them realise how much prior knowledge
their claim of "it's easy to learn" is assuming. (I'd love to see academic
researchers perform this kind of study for real though - I'd be genuinely
fascinated to read the results)
Another way to tackle this problem though is to go learn the languages that
are actually being used to start teaching computational thinking to children.
One of the most popular of those is Scratch, which uses a drag-and-drop
programming interface to let students manipulate a self-contained graphical
environment, with sprites moving around and reacting to events in that
environment. Graphical environments like Scratch are the programming
equivalent of the picture books we use to help introduce children to reading
and writing.
This idea of using a special purpose educational language to manipulate a
graphical environment isn't new though, with one of the earliest incarnations
being the Logo environment created back in the 1960's. In Logo (and similar
environments like Python's own turtle module), the main thing you're
interacting with is a "turtle", which you can instruct to move around and
modify its environment by drawing lines. This way, concepts like command
sequences, repetition, and state (e.g. "pen up", "pen down") can be introduced
in a way that builds on people's natural intuitions ("imagine you're the turtle,
what's going to happen if you turn right 90 degrees?")
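Python's own turtle module makes it easy to experiment with that style directly (a minimal sketch that draws a square):

import turtle

pen = turtle.Turtle()
for _ in range(4):       # command sequences and repetition
    pen.forward(100)     # "imagine you're the turtle..."
    pen.right(90)
turtle.done()            # keep the window open until it's closed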
Going back and relearning one of these languages as an experienced programmer
is most useful as a tool for unlearning: the concepts they introduce help
remind us that these are concepts that we take for granted now, but needed to
learn at some point as beginners. When we do that, we're better able to work
effectively with students and other newcomers, as we're more likely to
remember to unpack our chains of logic, including the steps we'd otherwise take
for granted.
This is a follow-on from my
previous post
on Python 3.5's new async/await syntax. Rather than the simple background
timers used in the original post, this one will look at the impact native
coroutine support has on the TCP echo client and server examples from the
asyncio documentation.
First, we'll recreate the run_in_foreground helper defined in the previous
post. This helper function makes it easier to work with coroutines from
otherwise synchronous code (like the interactive prompt):
def run_in_foreground(task, *, loop=None):
    """Runs event loop in current thread until the given task completes

    Returns the result of the task.
    For more complex conditions, combine with asyncio.wait()
    To include a timeout, combine with asyncio.wait_for()
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    return loop.run_until_complete(asyncio.ensure_future(task, loop=loop))
Next we'll define the coroutine for our TCP echo server implementation,
which simply waits to receive up to 100 bytes on each new client connection,
and then sends that data back to the client:
async def handle_tcp_echo(reader, writer):
    data = await reader.read(100)
    message = data.decode()
    addr = writer.get_extra_info('peername')
    print("-> Server received %r from %r" % (message, addr))
    print("<- Server sending: %r" % message)
    writer.write(data)
    await writer.drain()
    print("-- Terminating connection on server")
    writer.close()
And then the client coroutine we'll use to send a message and wait for a
response:
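# A sketch of the client coroutine, modelled on the asyncio documentation's
# TCP echo client and on the client output shown in the transcripts below
async def tcp_echo_client(message, port):
    reader, writer = await asyncio.open_connection('127.0.0.1', port)
    print("-> Client sending: %r" % message)
    writer.write(message.encode())
    data = (await reader.read(100)).decode()
    print("<- Client received: %r" % data)
    print("-- Terminating connection on client")
    writer.close()
    return data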
Conveniently, since starting the server is itself a coroutine run in the
current thread, rather than in a different thread, we can retrieve the details
of the listening socket immediately, including the automatically assigned port
number:
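# A sketch of the server setup, using the helpers defined above: start the
# echo server on an OS-assigned port, then read that port back off the socket
make_echo_server = asyncio.start_server(handle_tcp_echo, '127.0.0.1', 0)
server = run_in_foreground(make_echo_server)
port = server.sockets[0].getsockname()[1]
# (port2, used below, comes from a second server created the same way)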
Now, both of these servers are configured to run directly in the main thread's
event loop, so trying to talk to them using a synchronous client wouldn't work.
The client would block the main thread, and the servers wouldn't be able to
process incoming connections. That's where our asynchronous client coroutine
comes in: if we use that to send messages to the server, then it doesn't
block the main thread either, and both the client and server coroutines can
process incoming events of interest. That gives the following results:
>>> print(run_in_foreground(tcp_echo_client('Hello World!', port)))
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44386)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
Hello World!
Note something important here: you will get exactly that sequence of
output messages, as this is all running in the interpreter's main thread, in
a deterministic order. If the servers were running in their own threads, we
wouldn't have that property (and reliably getting access to the port numbers
the server components were assigned by the underlying operating system would
also have been far more difficult).
And to demonstrate both servers are up and running:
>>> print(run_in_foreground(tcp_echo_client('Hello World!', port2)))
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44419)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
Hello World!
That then raises an interesting question: how would we send messages to the
two servers in parallel, while still only using a single thread to manage the
client and server coroutines? For that, we'll need another of our helper
functions from the previous post, schedule_coroutine:
def schedule_coroutine(target, *, loop=None):
    """Schedules target coroutine in the given event loop

    If not given, *loop* defaults to the current thread's event loop

    Returns the scheduled task.
    """
    if asyncio.iscoroutine(target):
        return asyncio.ensure_future(target, loop=loop)
    raise TypeError("target must be a coroutine, "
                    "not {!r}".format(type(target)))
Update: As with the previous post, this post originally suggested a
combined "run_in_background" helper function that handled both scheduling
coroutines and calling arbitrary callables in a background thread or process.
On further reflection, I decided that was unhelpfully conflating two different
concepts, so I replaced it with separate "schedule_coroutine" and
"call_in_background" helpers.
First, we set up the two client operations we want to run in parallel:
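# A sketch of this step: schedule both client coroutines without running them yet
echo1 = schedule_coroutine(tcp_echo_client('Hello World!', port))
echo2 = schedule_coroutine(tcp_echo_client('Hello World!', port2))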
Then we use the asyncio.wait function in combination with run_in_foreground
to run the event loop until both operations are complete:
>>> run_in_foreground(asyncio.wait([echo1, echo2]))
-> Client sending: 'Hello World!'
-> Client sending: 'Hello World!'
-> Server received 'Hello World!' from ('127.0.0.1', 44461)
<- Server sending: 'Hello World!'
-- Terminating connection on server
-> Server received 'Hello World!' from ('127.0.0.1', 44462)
<- Server sending: 'Hello World!'
-- Terminating connection on server
<- Client received: 'Hello World!'
-- Terminating connection on client
<- Client received: 'Hello World!'
-- Terminating connection on client
({<Task finished coro=<tcp_echo_client() done, defined at <stdin>:1> result='Hello World!'>, <Task finished coro=<tcp_echo_client() done, defined at <stdin>:1> result='Hello World!'>}, set())
And finally, we retrieve our results using the result method of the task
objects returned by schedule_coroutine:
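>>> echo1.result()   # a sketch consistent with the task results shown above
'Hello World!'
>>> echo2.result()
'Hello World!'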
We can set up as many concurrent background tasks as we like, and then use
asyncio.wait as the foreground task to wait for them all to complete.
But what if we had an existing blocking client function that we wanted or
needed to use (e.g. we're using an asyncio server to test a synchronous
client API). To handle that case, we use our third helper function from the
previous post:
def call_in_background(target, *, loop=None, executor=None):
    """Schedules and starts target callable as a background task

    If not given, *loop* defaults to the current thread's event loop
    If not given, *executor* defaults to the loop's default executor

    Returns the scheduled task.
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    if callable(target):
        return loop.run_in_executor(executor, target)
    raise TypeError("target must be a callable, "
                    "not {!r}".format(type(target)))
To explore this, we'll need a blocking client, which we can build based on
Python's existing
socket programming HOWTO guide:
import socket

def tcp_echo_client_sync(message, port):
    conn = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print('-> Client connecting to port: %r' % port)
    conn.connect(('127.0.0.1', port))
    print('-> Client sending: %r' % message)
    conn.send(message.encode())
    data = conn.recv(100).decode()
    print('<- Client received: %r' % data)
    print('-- Terminating connection on client')
    conn.close()
    return data
We can then use functools.partial in combination with call_in_background to
start client requests in multiple operating system level threads:
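# A sketch of this step: functools.partial binds the message and target port,
# while call_in_background starts each client immediately in a worker thread
import functools
bg_call = call_in_background(functools.partial(tcp_echo_client_sync, 'Hello World!', port))
bg_call2 = call_in_background(functools.partial(tcp_echo_client_sync, 'Hello World!', port2))
# The "-> Client connecting ..." and "-> Client sending ..." lines appear
# straight away, even though the event loop in the main thread isn't running yet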
Here we see that, unlike our coroutine clients, the synchronous clients have
started running immediately in a separate thread. However, because the event
loop isn't currently running in the main thread, they've blocked waiting for
a response from the TCP echo servers. As with the coroutine clients, we
address that by running the event loop in the main thread until our clients
have both received responses:
>>> run_in_foreground(asyncio.wait([bg_call, bg_call2]))
-> Server received 'Hello World!' from ('127.0.0.1', 52585)
<- Server sending: 'Hello World!'
-- Terminating connection on server
-> Server received 'Hello World!' from ('127.0.0.1', 34399)
<- Server sending: 'Hello World!'
<- Client received: 'Hello World!'
-- Terminating connection on server
-- Terminating connection on client
<- Client received: 'Hello World!'
-- Terminating connection on client
({<Future finished result='Hello World!'>, <Future finished result='Hello World!'>}, set())
>>> bg_call.result()
'Hello World!'
>>> bg_call2.result()
'Hello World!'
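The remaining examples assume a simple asynchronous ticker coroutine that prints a steadily incrementing count, roughly once per second (a sketch consistent with the output shown later):

import asyncio
import itertools

async def ticker():
    """Prints an incrementing count roughly once per second, forever"""
    for tick in itertools.count():
        print(tick)
        await asyncio.sleep(1)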
But how do I arrange for that ticker to start running in the background? What's
the coroutine equivalent of appending & to a shell command?
It turns out it looks something like this:
import asyncio

def schedule_coroutine(target, *, loop=None):
    """Schedules target coroutine in the given event loop

    If not given, *loop* defaults to the current thread's event loop

    Returns the scheduled task.
    """
    if asyncio.iscoroutine(target):
        return asyncio.ensure_future(target, loop=loop)
    raise TypeError("target must be a coroutine, "
                    "not {!r}".format(type(target)))
Update: This post originally suggested a combined "run_in_background"
helper function that handled both scheduling coroutines and calling arbitrary
callables in a background thread or process. On further reflection, I decided
that was unhelpfully conflating two different concepts, so I replaced it with
separate "schedule_coroutine" and "call_in_background" helpers.
But how do I run that for a while? The event loop won't run unless the current
thread starts it running and either stops when a particular event occurs, or
when explicitly stopped. Another helper function covers that:
def run_in_foreground(task, *, loop=None):
    """Runs event loop in current thread until the given task completes

    Returns the result of the task.
    For more complex conditions, combine with asyncio.wait()
    To include a timeout, combine with asyncio.wait_for()
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    return loop.run_until_complete(asyncio.ensure_future(task, loop=loop))
And then I can do:
>>> run_in_foreground(asyncio.sleep(5))
0
1
2
3
4
Here we can see the background task running while we wait for the foreground
task to complete. And if I do it again with a different timeout:
>>> run_in_foreground(asyncio.sleep(3))
5
6
7
We see that the background task picked up again right where it left off
the first time.
We can also single step the event loop with a zero second sleep (the ticks
reflect the fact there was more than a second delay between running each
command):
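>>> # Sketch: the exact values depend on timing, but each zero-second sleep
>>> # gives the background ticker one more chance to run
>>> run_in_foreground(asyncio.sleep(0))
8
>>> run_in_foreground(asyncio.sleep(0))
9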
And start a second ticker to run concurrently with the first one:
>>> ticker2 = schedule_coroutine(ticker())
>>> ticker2
<Task pending coro=<ticker() running at <stdin>:1>>
>>> run_in_foreground(asyncio.sleep(0))
0
10
The asynchronous tickers will happily hang around in the background, ready to
resume operation whenever I give them the opportunity. If I decide I want to
stop one of them, I can cancel the corresponding task:
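>>> ticker1.cancel()   # sketch: cancellation takes effect next time the loop runs
True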
But what about our original synchronous ticker? Can I run that as a
background task? It turns out I can, with the aid of another helper function:
def call_in_background(target, *, loop=None, executor=None):
    """Schedules and starts target callable as a background task

    If not given, *loop* defaults to the current thread's event loop
    If not given, *executor* defaults to the loop's default executor

    Returns the scheduled task.
    """
    if loop is None:
        loop = asyncio.get_event_loop()
    if callable(target):
        return loop.run_in_executor(executor, target)
    raise TypeError("target must be a callable, "
                    "not {!r}".format(type(target)))
However, I haven't figured out how to reliably cancel a task running in a
separate thread or process, so for demonstration purposes, we'll define a
variant of the synchronous version that stops automatically after 5 ticks
rather than ticking indefinitely:
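import time

def tick_5_sync():
    """A sketch of the 5-tick synchronous variant (the name here is hypothetical)"""
    for tick in range(5):
        print(tick)
        time.sleep(1)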
The key difference between scheduling a callable in a background thread and
scheduling a coroutine in the current thread, is that the callable will start
executing immediately, rather than waiting for the current thread
to run the event loop:
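>>> # Sketch (using the hypothetical tick_5_sync above): the first tick prints
>>> # immediately from the worker thread, before the event loop even starts
>>> background_timer = call_in_background(tick_5_sync)
0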
That's both a strength (as you can run multiple blocking IO operations in
parallel), but also a significant weakness - one of the benefits of explicit
coroutines is their predictability, as you know none of them will start
doing anything until you start running the event loop.
PyCon Australia launched its Call for Papers just over a month ago, and it closes in a little over a week on Friday the 8th of May.
A new addition to PyCon Australia this year, and one I'm particularly excited about co-organising following Dr James Curran's "Python for Every Child in Australia" keynote last year, is the inaugural Python in Education miniconf as a 4th specialist track on the Friday of the conference, before we move into the main program over the weekend.
From the CFP announcement: "The Python in Education Miniconf aims to bring together community workshop organisers, professional Python instructors and professional educators across primary, secondary and tertiary levels to share their experiences and requirements, and identify areas of potential collaboration with each other and also with the broader Python community."
If that sounds like you, then I'd love to invite you to head over to the conference website and make your submission to the Call for Papers!
This year, all 4 miniconfs (Education, Science & Data Analysis, OpenStack and DjangoCon AU) are running our calls for proposals as part of the main conference CFP - every proposal submitted will be considered for both the main conference and the miniconfs.
I'm also pleased to announce two pre-arranged sessions at the Education Miniconf:
Carrie Anne Philbin, author of "Adventures in Raspberry Pi" and Education Pioneer at the Raspberry Pi Foundation will be speaking on the Foundation's Picademy professional development program for primary and secondary teachers
I'm genuinely looking forward to chairing this event, as I see tremendous potential in forging stronger connections between Australian educators (both formal and informal) and the broader Python and open source communities.