"/Users/imad/Desktop/") sys.path.append(
Python’s simplicity and versatility are largely attributed to its extensive ecosystem of modules and packages. These essential components enable developers to write clean, reusable, and efficient code, whether for simple scripts or complex applications.
This article aims to deepen our understanding of Python’s modules and packages and the machinery involved, helping us become more effective Python programmers. We will explore their structure and functionality, covering everything from the basics of importing to creating custom packages and managing dependencies. By unpacking the underlying machinery of how modules/packages get imported and what they really are, we’ll gain insights that will enhance our coding practices and project organization.
Introduction
- Python has only one type of module object, regardless of the language the module was implemented in (C/Python/…)
- A package provides a naming hierarchy to organize modules (analogous to a directory in a Unix file system):
- All packages are modules but not all modules are packages
- A package is a module that has a `__path__` attribute
- A package can include subpackages (sub-directories)
- There are two types of packages:
- Regular packages are directories that have an `__init__.py` file
  - When importing a package/subpackage, Python implicitly executes every `__init__.py` file on the path and binds objects to names in the package’s namespace
  - When the import machinery is looking for a regular package, it stops once it finds it
- Namespace packages are directories that don’t have `__init__.py`
  - When the import machinery is looking for the package, it does not stop when it finds it, assuming there may be a regular package under some other path in `sys.path`, but keeps a record of all namespace packages it found during the search. If it finds a regular package with that name -> discard all namespace packages it found and import the regular package. If it doesn’t find any regular package with that name -> use all the namespace packages it found during the search and combine their paths in `namespace_path`, so when we try to import a subpackage or module, it checks all the paths in `namespace_path` (which is a list)
  - There can be multiple packages of the same name (under different directories) -> they are all combined together and the `namespace_path` list will have the paths for all of them. Therefore, the same package name can refer to completely different modules in different directories (see the sketch below)
  - Python first scans the whole `sys.path` before deciding that the package is a namespace package -> if any matching name with an `__init__.py` is found, it gets priority and the search doesn’t continue
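A rough sketch of the combining behavior, assuming two `pck/` directories without `__init__.py` (this mirrors the `pck` experiment at the end of the article):

```python
# Hypothetical layout, with neither pck/ directory containing __init__.py:
#   /Users/imad/Documents/python-materials/modules-and-packages/pck/mod.py
#   /Users/imad/Desktop/pck/test.py
import sys
sys.path.append("/Users/imad/Desktop/")   # the other directory is the current one

import pck
print(pck.__path__)      # _NamespacePath([...both pck directories...])
from pck.mod import X    # resolved from the first directory
from pck.test import X   # resolved from the second directory
```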
- When importing subpackages such as `foo.bar.baz`, Python first imports `foo`, then `foo.bar`, then `foo.bar.baz`
  - Each of these will be cached in `sys.modules`
- `__init__.py` makes a directory a Python package
  - We can use it to import useful things from different modules/subpackages so they are available to the package’s users
- When importing an object, it has a `__module__` attribute which determines the global environment (module) the object belongs to
- We can define `__all__` in `__init__.py` as the concatenation of the `__all__` lists of its modules
  - Example: `__all__ = foo.__all__ + bar.__all__`
  - BUT we need to either first import foo and bar, or import anything from them so those names are defined, e.g. `from .foo import *` or `from .foo import y` (see the sketch below)
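A minimal `__init__.py` sketch, assuming hypothetical submodules `foo.py` and `bar.py` that each define their own `__all__`:

```python
# package/__init__.py
from .foo import *    # also binds the submodule name `foo` in the package namespace
from .bar import *

# Re-export the union of the submodules' public names
__all__ = foo.__all__ + bar.__all__
```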
- `__init__.py` can also be used to initialize things and maybe monkeypatch some other modules
- Each package has a `__path__` attribute that helps when searching for subpackages. It is given to the path finder when loading subpackages
  - It is a list, similar to `sys.path`, so we can change it, but that is not recommended
  - Example: `import spam; spam.__path__.append("/Users/imad/Documents")`
- Relative imports are preferred to absolute imports inside packages, to avoid issues if the package name is changed
- Packages get loaded only once, even if we import them multiple times
- We can, in theory, upgrade a package in the cache like this: `sys.modules["package_name"] = new_package_module`
- If we use `python -m package.module`, it executes the module as the main program and relative imports work. Otherwise, relative imports won’t work. The `-m` stands for module
- The `__main__` module is a special case relative to Python’s import system. It is directly initialized at interpreter startup, much like `sys` and `builtins`. The manner in which `__main__` is initialized depends on the flags and other options with which the interpreter is invoked
- `__main__.py` designates the main program for a package/subpackage and also allows a package directory to be executable -> it explicitly marks the entry point (see the layout sketch below). Examples:
  - `python package` would look for `__main__.py` inside `package` to execute
  - `python -m package.subpackage` would look for `__main__.py` inside `package/subpackage` to execute. `__package__` is set so relative imports still work
  - A lot of programming tools use this to their own benefit: `python -m profile script.py` or `python -m pdb script.py`
- NOTE that `__init__.py` files on the path will still be executed
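A minimal layout sketch; the package name `spam` and its `foo.main()` function are hypothetical:

```python
# Hypothetical layout:
#   spam/
#       __init__.py
#       foo.py          # defines main()
#       __main__.py

# spam/__main__.py
from .foo import main   # relative import works: under `python -m spam`,
                         # __package__ is set to "spam"

main()
```

Running `python -m spam` executes `spam/__main__.py` as `__main__`, after executing `spam/__init__.py`.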
- Depending on how `__main__` is initialized, `__main__.__spec__` gets set appropriately or to None
  - When Python is started with the -m option, `__spec__` is set to the module spec of the corresponding module or package. `__spec__` is also populated when the `__main__` module is loaded as part of executing a directory, zipfile or other sys.path entry
  - Otherwise, it is set to None
sys.path
- `importlib` has a rich API to interact with the import system. It is preferred over `__import__()`
  - `__import__` only does module search and creation, without the name binding
  - `import` does everything: module search, creation, and name binding. It calls `__import__` under the hood
- `.egg` files are just directories or .zip files with extra metadata for package managers
- `sys.path` is where Python looks to search for a module/package that we try to import (it is the last place consulted). It is traversed from start to end
  - It contains the names of directories, .zip files, and .egg files
  - First match wins
  - If it can’t find it -> the module can not be imported
- `sys.prefix` is where Python’s library is stored (os.py is the landmark) and `sys.exec_prefix` is where compiled binaries are stored (lib-dynload is the landmark)
  - With virtual environments -> each one has its own `sys.prefix`
  - `sys.path` is constructed from `sys.prefix`, `PYTHONHOME`, and `site.py`. Setting `PYTHONHOME` would override `sys.prefix` and `sys.exec_prefix`
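For instance, we can inspect both locations directly (inside a virtual environment they point at the environment itself):

```python
import sys
print(sys.prefix)        # standard library location (os.py is the landmark)
print(sys.exec_prefix)   # compiled parts (lib-dynload is the landmark)
```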
- Python looks for its libraries starting from where the interpreter is located and keeps going up until the root of the file system. It looks for `os.py` and uses that location as a landmark
- `python -S` skips site.py
- `python -vv` shows what Python tries for every module it imports
- Setting PYTHONPATH to some directories will insert them at the beginning of sys.path. Example: `env PYTHONPATH="/Users/imad/Documents" python` runs Python with Documents inserted at the beginning of sys.path
- `site.py` appends the path to third-party libraries. This is where installed packages get stored. Example: `/usr/local/lib/python3.4/site-packages`
- Python now has builtin virtual environments that we can create using the `venv` module
  - `python -m venv env_name` will create a new environment called env_name
  - This environment will include a few directories such as include, lib, site-packages, bin, and a pyvenv.cfg file
  - This new environment has no third-party libraries or any system-wide libraries such as those in /usr/local
  - All third-party libraries will be installed in the site-packages directory
  - The Python binary will refer to the original Python installation from when the environment was created
  - We can use `source path_to_env_name/bin/activate` to activate the environment and `deactivate` to deactivate it. Finally, `rm -r path_to_env_name` removes it (or `poetry env remove` if we created it with Poetry)
- Files with a `.pth` extension in the site-packages directory get added to sys.path. We can list directories in those files, and they will be added to sys.path for any new instance of Python (see the sketch below)
  - Package managers and other third-party packages use this kind of hack to add paths to sys.path
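A sketch of the mechanism; the file name `my_paths.pth` and the listed directories are made up:

```python
# site-packages/my_paths.pth (plain text, one directory per line):
#
#   /Users/imad/Documents/python-materials
#   /Users/imad/Desktop
#
# Any new interpreter will have those directories appended to sys.path:
import sys
print([p for p in sys.path if p.startswith("/Users/imad")])
```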
- sitecustomize and usercustomize also can be used to add stuff to the sys.path
- Finally the current working directory will be added to the path (at the beginning)
Modules
- Modules are just objects of type ModuleType. A module acts like a dictionary that holds references to the objects it contains: `module.__dict__`
- When importing a module, it executes the module from top to bottom before returning to the caller
- A module can be viewed as a namespace, a .py file, an execution environment for statements, or a container of global variables
- We can set/delete attributes: `module.x = 10` is the same as `module.__dict__['x'] = 10`
- The dictionary has preset attributes such as `__path__`, `__loader__`, …
- Main attributes:
- `__name__` : # Module name
- `__file__` : # Associated source file (if any)
- `__doc__` : # Doc string
- `__path__` : # Package path. It is used to look for package subcomponents
- `__package__` : # The module’s `__package__` attribute must be set. Its value must be a string, but it can be the same value as its `__name__`. When the module is a package, its `__package__` value should be set to its `__name__`. When the module is not a package, `__package__` should be set to the empty string for top-level modules, or for submodules, to the parent package’s name
- `__spec__` : # Module spec
- The main difference between modules and packages is that packages have `__path__` and `__package__` defined (not None)
- `sys.modules` serves as a cache for all imported modules/packages
  - It is a dictionary, so we can delete/set keys
  - If we delete a module’s key, it forces Python to import it again when we reimport it
  - If we set a module’s key to None -> importing it results in `ModuleNotFoundError` (see the sketch below)
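A small sketch of manipulating the cache with the standard-library `json` module:

```python
import sys
import json                     # first import executes json and caches it
assert "json" in sys.modules

del sys.modules["json"]
import json                     # cache miss -> json is executed again

sys.modules["json"] = None
try:
    import json                 # a None entry makes the import fail
except ModuleNotFoundError as e:
    print(e)
del sys.modules["json"]         # clean up so json can be imported normally again
```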
- Even if we import only one object from a module/package, the module/package will be cached in `sys.modules` but is not made available in the global namespace
- The module created during loading and passed to exec_module() may not be the one returned at the end of the import
  - This can happen if the imported module sets `sys.modules[__name__]` to some other module
- The module’s attributes are set after creation and before execution
- Execution of the module is what populates the module’s `__dict__` (the namespace of the module). This is done by the loader
- When a submodule is loaded using any mechanism, a binding is placed in the parent module’s namespace to the submodule object. For example, if we have a package called spam that has a submodule foo and spam imports any of its objects, like `from .foo import x`, then after importing spam, spam will have an attribute foo which is bound to the submodule -> we can now use `spam.foo`
- Relative imports use leading dots. A single leading dot indicates a relative import, starting with the current package. Two or more leading dots indicate a relative import to the parent(s) of the current package, one level per dot after the first.
- Relative imports can only use this form of import: `from <> import <>`
  - They can’t use `import .<>` because this is not a valid expression
- Absolute imports have to start from the top-level package and go downward to refer to the module: `from package.subpackage import module`
  - Not recommended inside a package, because if we change the name of the package then we need to change all the import statements -> relative imports are more robust and don’t care about naming (see the sketch below)
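A quick sketch contrasting the two forms inside a hypothetical package `spam` whose submodule `foo` defines `x`:

```python
# spam/bar.py

# Relative import: keeps working even if the package is renamed
from .foo import x

# Absolute import: every such line must be updated if `spam` is renamed
from spam.foo import x
```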
- Process when importing a module/package (after locating it):
- First checks if it is cached. If not, continue
- It creates a ModuleType object with that name
- Cache the module in sys.modules
- Executes the source code inside the module (the source path is the module name suffixed with .py, which is assigned to `__file__`)
  - In the case of a package/subpackage, it is assigned the `__init__.py` file
  - It also executes all the `__init__.py` files on the path
- Assigns a variable (name) to the module object
import sys, types

def import_module(modname):
    # Check if it is in the cache first
    if modname in sys.modules:
        return sys.modules[modname]
    sourcepath = modname + '.py'
    with open(sourcepath, 'r') as f:
        sourcecode = f.read()
    mod = types.ModuleType(modname)
    mod.__file__ = sourcepath
    # Cache the module
    sys.modules[modname] = mod
    # Compile the source to Python bytecode
    code = compile(sourcecode, sourcepath, 'exec')
    # Execute the code in the module from top to bottom
    # and update the state (globals) in the module's dictionary
    exec(code, mod.__dict__)
    # Return the cached entry in case there is some patching inside the module
    return sys.modules[modname]
Module Compilation
- Python puts a lock on a module while importing it until the import is done, so that multiple threads don’t try to import the same module at the same time
- `__import__` is the machinery behind the `import` statement
- We can use `importlib.import_module(module)`, which is the same thing as `__import__`
  - `importlib.import_module('spam')` is the same as `import spam`
  - `importlib.import_module('.spam', __package__)` is the same as `from . import spam`
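A small sketch of the equivalences (the relative form only makes sense from code inside a package, so it is left commented out):

```python
import importlib

# Equivalent to `import json`, except we bind the name ourselves
json = importlib.import_module("json")
print(json.dumps({"a": 1}))

# Inside a package, equivalent to `from . import spam`,
# assuming a hypothetical sibling submodule named spam:
# spam = importlib.import_module(".spam", __package__)
```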
- We can track all imports as follows:
import builtins

def imp_mod(modname, *args, imp=__import__):
    print(f"Importing {modname}")
    return imp(modname, *args)

builtins.__import__ = imp_mod
- Module Reloading:
- It is not a good idea to reload a module because it creates zombies. Basically, Python doesn’t try to clean up the dictionary from the old module, but instead exec()s the new state of the module using the old `module.__dict__`. This means stuff from the previous load may still exist and we can end up with weird cases. This is how Python reloads a module:

  code = open(module.__file__, 'rb').read()
  exec(code, module.__dict__)

- Also, submodules that are loaded in the module/package don’t get reloaded. They still have their old version. Example: if the module has `import pandas as pd`, reloading the module doesn’t reload pandas
- Also, if we have instances that use the old version of the module and then we reload -> new instances of the same class will refer to a different code implementation than the instances created before the reload -> even though they refer to the same class name, instances will have different types
- `sys.path` is only a small part of the import machinery
- Imports are actually controlled by `sys.meta_path`
  - It is a list of importers:
    [_frozen_importlib.BuiltinImporter, _frozen_importlib.FrozenImporter, _frozen_importlib_external.PathFinder, <six._SixMetaPathImporter at 0x10c8769b0>, <pkg_resources.extern.VendorImporter at 0x10dbf9300>]
  - Python’s default sys.meta_path has three meta path finders: one that knows how to import built-in modules, one that knows how to import frozen modules, and one that knows how to import modules from an import path
  - For every import statement, Python goes through it from start to end to see whether any entry knows how to import the module
import sys
import importlib.util

def find_spec(modname):
    for imp in sys.meta_path:
        spec = imp.find_spec(modname)
        if spec:
            return spec
    return None
The ModuleSpec of a module is its metadata, which the loader uses to load it. We can also use `importlib.util.find_spec()` to get the module spec of any package/module. If the package/module is not found -> it returns None. Example of the pandas module spec:

ModuleSpec(name='pandas', loader=<_frozen_importlib_external.SourceFileLoader object at 0x10e609f90>, origin='/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py', submodule_search_locations=['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'])
- Module Spec main info:
- spec.name : # Full module name
- spec.parent : # Enclosing package
- spec.submodule_search_locations : # Package path
- spec.has_location : # Has external location
- spec.origin : # Source file location
- spec.cached : # Cached location
- spec.loader : # Loader object
- We can use the `loader` from the module spec to get the source code without importing it. Loaders are also what actually create the imported module:

  module = spec.loader.create_module(spec)
  if not module:
      module = types.ModuleType(spec.name)
      module.__file__ = spec.origin
      module.__loader__ = spec.loader
      module.__package__ = spec.parent
      module.__path__ = spec.submodule_search_locations
      module.__spec__ = spec
- We can create a module from a spec with `importlib.util.module_from_spec`. This DOES NOT LOAD THE MODULE, it only creates it. To load the module, it must be executed with `spec.loader.exec_module(module)` and then cached with `sys.modules[spec.name] = module`. `exec_module` will populate the `__dict__` of the module (see the sketch below)
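A sketch of doing the create/execute steps by hand, assuming a `mod.py` file is importable (as in the Experiments section below):

```python
import sys
import importlib.util

spec = importlib.util.find_spec("mod")
module = importlib.util.module_from_spec(spec)   # creates the module, does not execute it
sys.modules[spec.name] = module                  # cache it before executing
spec.loader.exec_module(module)                  # runs mod.py and populates module.__dict__
```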
We can execute modules lazily on first access. Implementation example:
import sys
import types
import importlib.util

class _Module(types.ModuleType):
    pass

class _LazyModule(_Module):
    def __init__(self, spec):
        super().__init__(spec.name)
        self.__file__ = spec.origin
        self.__package__ = spec.parent
        self.__loader__ = spec.loader
        self.__path__ = spec.submodule_search_locations
        self.__spec__ = spec

    def __getattr__(self, name):
        # First attribute access: swap the class and actually execute the module
        self.__class__ = _Module
        self.__spec__.loader.exec_module(self)
        assert sys.modules[self.__name__] == self
        return getattr(self, name)

def lazy_import(name):
    # If already loaded, return the module
    if name in sys.modules:
        return sys.modules[name]
    # Not loaded. Find the spec
    spec = importlib.util.find_spec(name)
    if not spec:
        raise ImportError(f'No module {name!r}')
    # Check for compatibility
    if not hasattr(spec.loader, 'exec_module'):
        raise ImportError('Not supported')
    # Perform the lazy import
    module = sys.modules[name] = _LazyModule(spec)
    return module
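A usage sketch of `lazy_import` above, assuming `json` has not already been imported in this session:

```python
json = lazy_import("json")
print(type(json))             # _LazyModule: nothing has been executed yet
print(json.dumps({"a": 1}))   # first attribute access triggers exec_module()
print(type(json))             # now a plain _Module
```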
- Therefore, module creation and loading have been decoupled in recent versions of Python
- We can insert an importer into `sys.meta_path` that changes the behavior of imports
  - If it is at the beginning, it supersedes all other finders and we can do crazy things:
import sys

class Watcher(object):
    @classmethod
    def find_spec(cls, name, path, target=None):
        print('Importing', name, path, target)
        return None

sys.meta_path.insert(0, Watcher)
- We can also use this idea to add logic such as auto-installing packages that are not found, using pip. We insert the installer at the end of `sys.meta_path`:
import sys
import subprocess
import importlib.util

class AutoInstall(object):
    _loaded = set()

    @classmethod
    def find_spec(cls, name, path, target=None):
        if path is None and name not in cls._loaded:
            cls._loaded.add(name)
            print("Installing", name)
            try:
                out = subprocess.check_output(
                    [sys.executable, '-m', 'pip', 'install', name])
                return importlib.util.find_spec(name)
            except Exception as e:
                print("Failed")
        return None

sys.meta_path.append(AutoInstall)
We can also import packages not found on the local system from some other system, such as Redis
- `sys.path_hooks` is responsible for the actual loading of the module/package, depending on the path entry
  - Each entry in `sys.path` is tested against the list of path hooks to associate a module finder with each path entry
  - Path finders are used to locate a module and return its module spec along with a loader
  - Path finders get cached in `sys.path_importer_cache`
Both meta path finders and path entry finders have a `find_spec()` method that returns the spec of the module if they know how to find/load it. Otherwise, they return `None`
What happens during import:
modname = 'somemodulename'

for entry in sys.path:
    finder = sys.path_importer_cache[entry]
    if finder:
        spec = finder.find_spec(modname)
        if spec:
            break
else:
    raise ImportError('No such module')

# Load module from the spec
...
Experiments
from pck.mod import X
pck.mod
X
100
from pck.test import X
pck.test
"pck"].__path__ sys.modules[
_NamespacePath(['/Users/imad/Documents/python-materials/modules-and-packages/pck', '/Users/imad/Documents/python-materials/modules-and-packages/pck', '/Users/imad/Desktop/pck'])
foo.__package__, foo.__path__
AttributeError: module 'package.foo' has no attribute '__path__'
globals()["foo"]
<module 'package.foo' from '/Users/imad/Documents/python-materials/modules-and-packages/package/foo.py'>
def f():
    pass
from pandas import read_csv
sys.path_hooks
[zipimport.zipimporter,
<function _frozen_importlib_external.FileFinder.path_hook.<locals>.path_hook_for_FileFinder(path)>]
list(sys.path_importer_cache.keys())[:10]
['/Users/imad/anaconda3/envs/python-exp/lib/python310.zip',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/encodings',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/importlib',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/lib-dynload',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/PyYAML-6.0-py3.10-macosx-10.9-x86_64.egg',
'/Users/imad/Documents/python-materials/modules-and-packages',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/ipykernel',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/json']
from importlib.util import find_spec
= find_spec("mod")
m "mod") m.loader.get_source(
'y = 200\nprint(y)\n\nclass A:\n print("A")\n'
import sys
sys.meta_path
[_frozen_importlib.BuiltinImporter,
_frozen_importlib.FrozenImporter,
_frozen_importlib_external.PathFinder,
<six._SixMetaPathImporter at 0x10c8769b0>,
<pkg_resources.extern.VendorImporter at 0x10dbf9300>]
import mod
200
A
a = mod.A()
from importlib import reload
reload(mod)
200
A
<module 'mod' from '/Users/imad/Documents/python-materials/modules-and-packages/mod.py'>
b = mod.A()
a.__class__, b.__class__, type(a) == type(b)
(mod.A, mod.A, True)
from importlib.util import find_spec
"sys") find_spec(
ModuleSpec(name='sys', loader=<class '_frozen_importlib.BuiltinImporter'>, origin='built-in')
"pandas") find_spec(
ModuleSpec(name='pandas', loader=<_frozen_importlib_external.SourceFileLoader object at 0x10e609f90>, origin='/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py', submodule_search_locations=['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'])
import importlib
pd = importlib.import_module("pandas")
pd.__path__, pd.__name__, pd.__package__, pd.__file__, pd.__doc__
(['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'],
'pandas',
'pandas',
'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py',
'\npandas - a powerful data analysis and manipulation library for Python\n=====================================================================\n\n**pandas** is a Python package providing fast, flexible, and expressive data\nstructures designed to make working with "relational" or "labeled" data both\neasy and intuitive. It aims to be the fundamental high-level building block for\ndoing practical, **real world** data analysis in Python. Additionally, it has\nthe broader goal of becoming **the most powerful and flexible open source data\nanalysis / manipulation tool available in any language**. It is already well on\nits way toward this goal.\n\nMain Features\n-------------\nHere are just a few of the things that pandas does well:\n\n - Easy handling of missing data in floating point as well as non-floating\n point data.\n - Size mutability: columns can be inserted and deleted from DataFrame and\n higher dimensional objects\n - Automatic and explicit data alignment: objects can be explicitly aligned\n to a set of labels, or the user can simply ignore the labels and let\n `Series`, `DataFrame`, etc. automatically align the data for you in\n computations.\n - Powerful, flexible group by functionality to perform split-apply-combine\n operations on data sets, for both aggregating and transforming data.\n - Make it easy to convert ragged, differently-indexed data in other Python\n and NumPy data structures into DataFrame objects.\n - Intelligent label-based slicing, fancy indexing, and subsetting of large\n data sets.\n - Intuitive merging and joining data sets.\n - Flexible reshaping and pivoting of data sets.\n - Hierarchical labeling of axes (possible to have multiple labels per tick).\n - Robust IO tools for loading data from flat files (CSV and delimited),\n Excel files, databases, and saving/loading data from the ultrafast HDF5\n format.\n - Time series-specific functionality: date range generation and frequency\n conversion, moving window statistics, date shifting and lagging.\n')