A Deep Dive into Python’s Modules and Packages

Python
SWE
Author

Imad Dabbura

Published

February 9, 2024

Modified

February 9, 2024

Python’s simplicity and versatility are largely attributed to its extensive ecosystem of modules and packages. These essential components enable developers to write clean, reusable, and efficient code, whether for simple scripts or complex applications.

This article aims to deepen our understanding of Python’s modules and packages and the machinery involved, helping us become more effective Python programmers. We will explore their structure and functionality, covering everything from the basics of importing to creating custom packages and managing dependencies. By unpacking the underlying machinery of how modules/packages get imported and what they really are, we’ll gain insights that will enhance our coding practices and project organization.

Introduction

  • Python has only one type of module object, regardless of the language the module was implemented in (C/Python/…)
  • A package provides a naming hierarchy to organize modules (analogous to a directory in a Unix file system):
    • All packages are modules, but not all modules are packages
    • A package is a module that has a __path__ attribute
    • A package can contain subpackages (sub-directories)
  • There are two types of packages:
    • Regular packages are directories that contain __init__.py. Importing a package/subpackage implicitly executes every __init__.py file on the path and binds objects to names in the package’s namespace
      • When the import machinery is looking for the package, it stops as soon as it finds it
    • Namespace packages are directories that don’t have __init__.py
      • When the import machinery is looking for the package, it does not stop when it finds one; it assumes there may be a regular package somewhere else on sys.path, and keeps a record of all the namespace packages it found during the search. If it finds a regular package with that name, it discards all the namespace packages it found and imports the regular package. If it doesn’t find any regular package with that name, it uses all the namespace packages it found during the search and combines their paths into a namespace path (a list), so that when we import a subpackage or module, every path in that list is checked
      • There can be multiple packages with the same name (under different directories); they are all combined, and the namespace path lists the paths of all of them. The same package name can therefore refer to completely different modules in different directories
      • Python scans the whole sys.path before deciding that the package is a namespace package; if any directory with that name contains __init__.py, it gets priority and the search stops
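The namespace-package merging described above can be observed directly. A minimal sketch using throwaway temp directories (the package name nspkg and its modules are invented for illustration):

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

# Two unrelated directories, each holding an "nspkg" directory with no
# __init__.py: both become portions of one namespace package.
d1, d2 = tempfile.mkdtemp(), tempfile.mkdtemp()
os.makedirs(os.path.join(d1, "nspkg"))
os.makedirs(os.path.join(d2, "nspkg"))
Path(d1, "nspkg", "a.py").write_text("x = 'a'\n")
Path(d2, "nspkg", "b.py").write_text("x = 'b'\n")

sys.path[:0] = [d1, d2]
ns = importlib.import_module("nspkg")

# __path__ lists a portion from each directory, so submodules living in
# completely different places import under the same package name
assert len(list(ns.__path__)) >= 2
assert importlib.import_module("nspkg.a").x == "a"
assert importlib.import_module("nspkg.b").x == "b"
```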
  • When importing subpackages such as foo.bar.baz, Python first imports foo, then foo.bar, then foo.bar.baz
    • Each of these will be cached in sys.modules
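The step-by-step import and per-level caching is easy to verify with a throwaway package tree; foo/bar/baz here is a made-up hierarchy built on the fly:

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

# Build foo/bar/baz.py with empty __init__.py files at each package level
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "foo", "bar"))
Path(root, "foo", "__init__.py").write_text("")
Path(root, "foo", "bar", "__init__.py").write_text("")
Path(root, "foo", "bar", "baz.py").write_text("x = 1\n")

sys.path.insert(0, root)
importlib.import_module("foo.bar.baz")

# Importing the deepest module cached every intermediate package too
assert "foo" in sys.modules
assert "foo.bar" in sys.modules
assert "foo.bar.baz" in sys.modules
```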
  • __init__.py makes a directory a Python package
    • We can use it to import useful objects from different modules/subpackages so they are available to users
    • Every imported object has a __module__ attribute, which determines the global environment for that object
    • We can define __all__ in __init__.py as the concatenation of the __all__ lists of the modules
      • Example: __all__ = foo.__all__ + bar.__all__ BUT we first need to either import foo and bar, or import something from them so the names get defined, e.g. from .foo import * OR from .foo import y
    • __init__ can also be used to initialize things and maybe monkeypatch some other modules
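The __all__-concatenation trick can be sketched with a throwaway package (the package and module names mypkg/foo/bar are invented). Note how the star-imports also bind the submodule names in the package namespace, which is what lets the last line of __init__.py reference them:

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

root = tempfile.mkdtemp()
pkg = os.path.join(root, "mypkg")
os.makedirs(pkg)
Path(pkg, "foo.py").write_text("__all__ = ['x']\nx = 1\n")
Path(pkg, "bar.py").write_text("__all__ = ['y']\ny = 2\n")
# Importing the submodules binds the names foo/bar in the package
# namespace, so the concatenation on the last line works
Path(pkg, "__init__.py").write_text(
    "from .foo import *\n"
    "from .bar import *\n"
    "__all__ = foo.__all__ + bar.__all__\n"
)

sys.path.insert(0, root)
mod = importlib.import_module("mypkg")
assert mod.__all__ == ["x", "y"]
assert mod.x == 1 and mod.y == 2
```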
  • Each package has a __path__ attribute that is used when searching for subpackages. It is handed to the path finder when loading subpackages
    • It is a list, similar to sys.path, so we can modify it, although that is not recommended
    • Example: import somepackage; somepackage.__path__.append("/Users/imad/Documents")
  • Relative imports are preferred to absolute imports inside packages, to avoid problems if the package name is changed
  • Packages get loaded only once, even if we import them multiple times
  • We can in theory upgrade a package in the cache like this:
    • sys.modules["name"] = new_module
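Because the cache is checked first, whatever object sits in sys.modules under a name is what a later import of that name returns. A tiny sketch (fake_math is a made-up name):

```python
import sys
import math

# Plant an existing module in the cache under a different name
sys.modules["fake_math"] = math

# The import statement finds the cached entry and never searches sys.path
import fake_math
assert fake_math is math
assert fake_math.sqrt(9) == 3.0
```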
  • If we use python -m package.module, it executes module as the main program and relative imports work. Otherwise, relative imports won’t work
    • -m stands for module
  • The __main__ module is a special case relative to Python’s import system. The __main__ module is directly initialized at interpreter startup, much like sys and builtins. The manner in which __main__ is initialized depends on the flags and other options with which the interpreter is invoked
  • __main__.py designates main for a package/subpackage and also allows package directory to be executable -> explicitly marks the entry point. Examples:
    • python package would look for __main__.py to execute it
    • python -m package.subpackage would look for __main__.py inside package/subpackage to execute
    • __package__ is set so the relative imports still work
    • A lot of programming tools utilize this to their own benefit: python -m profile script.py OR python -m pdb script.py
    • NOTE THAT __init__.py files on the path will still be executed
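A sketch of the __main__.py behavior: build a throwaway package (the name app is invented) and run it with -m in a subprocess. Note that __init__.py runs before __main__.py, as the NOTE above says:

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

root = tempfile.mkdtemp()
pkg = os.path.join(root, "app")
os.makedirs(pkg)
Path(pkg, "__init__.py").write_text("print('init ran')\n")
Path(pkg, "__main__.py").write_text("print('main ran')\n")

# python -m app: with -m, the current working directory lands at the
# front of sys.path, so the throwaway package is found
out = subprocess.run(
    [sys.executable, "-m", "app"],
    cwd=root, capture_output=True, text=True,
).stdout
assert out.splitlines() == ["init ran", "main ran"]
```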
  • Depending on how __main__ is initialized, __main__.__spec__ gets set appropriately or to None.
    • When Python is started with the -m option, __spec__ is set to the module spec of the corresponding module or package. __spec__ is also populated when the __main__ module is loaded as part of executing a directory, zipfile or other sys.path entry.
    • Otherwise, it will be set to None

sys.path

  • importlib has a rich API for interacting with the import system. It is preferred over __import__()
  • __import__ only performs module search and creation, without the name binding
  • import does everything: module search, creation, and name binding. It calls __import__ under the hood
  • .egg files are just directories or .zip files with extra metadata for package managers
  • sys.path is the list of places Python searches (as a last resort) for a module/package that we try to import. It traverses the list from start to end
    • It contains names of directories, .zip files, and .egg files
    • The first match wins
    • If the module can’t be found there, it cannot be imported
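First-match-wins is easy to demonstrate with two throwaway directories that both contain a module of the same (invented) name:

```python
import importlib
import sys
import tempfile
from pathlib import Path

first, second = tempfile.mkdtemp(), tempfile.mkdtemp()
Path(first, "winner.py").write_text("where = 'first'\n")
Path(second, "winner.py").write_text("where = 'second'\n")

# Both directories are on sys.path; the earlier entry shadows the later one
sys.path[:0] = [first, second]
mod = importlib.import_module("winner")
assert mod.where == "first"
```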
  • sys.prefix is where Python’s library is stored (os.py is the landmark) and sys.exec_prefix is where compiled binaries are stored (lib-dynload is the landmark)
    • With virtual environments, each one has its own sys.prefix
    • sys.path is constructed from sys.prefix, PYTHONHOME, and site.py. Setting PYTHONHOME overrides sys.prefix and sys.exec_prefix
    • Python looks for its libraries starting from where the binary is and keeps going up until the root of the file system. It looks for os.py and uses that location as a landmark
    • python -S skips site.py
    • python -vv to see what python tries to do with every statement
    • Setting PYTHONPATH to some directories will insert them at the beginning of sys.path. Example:
      • env PYTHONPATH="/Users/imad/Documents" python runs Python with Documents inserted at the beginning of sys.path
    • site.py appends the path to third-party libraries. This is where installed packages get stored. Example: /usr/local/lib/python3.4/site-packages
  • Python now has built-in virtual environments that can be created using the venv module
    • python -m venv env_name will create a new environment called env_name
    • This environment will include a few directories such as include, lib, site-packages, bin and pyvenv.cfg
    • This new environment has no third-party libraries or any system-wide libraries such as those in /usr/local
    • All third-party libraries will be installed in the site-packages directory
    • The Python binary will refer to the original Python installation from when the environment was created
    • We can use source path_to_env_name/bin/activate to activate the environment and deactivate to deactivate it. Finally, rm -r path_to_env_name removes it (or poetry env remove if it was created with Poetry)
  • Files with .pth extension in site-packages directory get added to the sys.path. We can list directories in those files that will be added to sys.path for any new instance of Python
    • Package managers and other third-party packages use this kind of hack to add paths to the sys.path
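site.addsitedir() processes a directory the same way site-packages is processed at startup, so the .pth mechanism can be sketched without touching the real site-packages (both paths below are throwaway temp directories):

```python
import site
import sys
import tempfile
from pathlib import Path

extra = tempfile.mkdtemp()    # directory we want on sys.path
sitedir = tempfile.mkdtemp()  # directory holding the .pth file

# Each line of a .pth file that names an existing directory gets
# appended to sys.path when the containing directory is processed
Path(sitedir, "extra.pth").write_text(extra + "\n")
site.addsitedir(sitedir)

assert extra in sys.path
```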
  • sitecustomize and usercustomize can also be used to add entries to sys.path
  • Finally, the current working directory is added to the path (at the beginning)

Modules

  • Modules are just objects of type ModuleType. They act like a dictionary that holds references to the objects they contain: module.__dict__
    • When importing a module, it executes the module from top to bottom before returning to the caller
    • A module can be a namespace, a py file, an execution environment for statements, or a container of global variables
    • We can set/delete attributes. module.x = 10 is the same as module.__dict__['x'] = 10
    • The dictionary has preset attributes such as __path__, __loader__
    • Main attributes:
      • __name__ : # Module name
      • __file__ : # Associated source file (if any)
      • __doc__ : # Doc string
      • __path__ : # Package path. It is used to look for package subcomponents
      • __package__ : # The module’s __package__ attribute must be set to a string; it can be the same value as __name__. When the module is a package, __package__ should be set to its __name__. When the module is not a package, __package__ should be set to the empty string for top-level modules, or for submodules, to the parent package’s name
      • __spec__ : # Module spec
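The module-as-dictionary view above is easy to check with a scratch ModuleType instance:

```python
import types

# The second argument becomes __doc__
mod = types.ModuleType("scratch", "A throwaway module.")

# Attribute access and __dict__ access are two views of the same storage
mod.x = 10
assert mod.__dict__["x"] == 10
mod.__dict__["y"] = 20
assert mod.y == 20

assert mod.__name__ == "scratch"
assert mod.__doc__ == "A throwaway module."
```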
  • The main difference between modules and packages is that packages have __path__ and __package__ defined (not None)
  • sys.modules serves as a cache for all imported modules/packages
    • It is a dictionary so we can delete/set keys
    • If we delete a module, it will force Python to import it when we reimport it
    • If we set a module’s key to None -> importing it raises ModuleNotFoundError
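A sketch of the None-entry behavior (the module name is made up):

```python
import sys

# A None entry in the cache makes the import fail immediately
sys.modules["definitely_missing"] = None

err = None
try:
    import definitely_missing
except ImportError as exc:   # ModuleNotFoundError is a subclass
    err = exc
assert err is not None

# Removing the entry lifts the block; a real search would now run
# (and fail normally, since the module doesn't exist)
del sys.modules["definitely_missing"]
```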
  • Even if we import a single object from a module/package, the whole module/package gets cached in sys.modules, but it is not available in the global namespace
  • The module created during loading and passed to exec_module() may not be the one returned at the end of the import
    • This can happen if the imported module set the sys.modules[__name__] to some other module
  • The module’s attributes are set after creation and before execution
  • Execution of the module is what populates the module’s __dict__ (namespace of the module). This is done by the loader
  • When a submodule is loaded using any mechanism, a binding is placed in the parent module’s namespace to the submodule object. For example, if we have a package called spam that has a submodule foo and it imports any of its objects like from .foo import x, after importing spam, spam will have an attribute foo which is bound to the submodule -> We can now use spam.foo
  • Relative imports use leading dots. A single leading dot indicates a relative import, starting with the current package. Two or more leading dots indicate a relative import to the parent(s) of the current package, one level per dot after the first.
    • Relative imports can only use this form of import: from <> import <>
    • We can’t write import .<> because that is not valid syntax
  • Absolute imports have to start from the top level package and go downward to refer to the module:
    • from package.subpackage import module
    • Not recommended, because if we change the name of the package we need to change all the import statements; relative imports are more robust and don’t depend on the package name
  • Process when importing a module/package (after locating it):
    1. First checks if it is cached. If not, continue
    2. It creates a ModuleType object with that name
    3. Cache the module in sys.modules
    4. Executes the source code inside the module (the source file is found by suffixing the name with .py, which is then assigned to __file__)
      • In the case of a package/subpackage, __file__ is assigned the __init__.py file
      • It also executes all the __init__.py files on the path
    5. Binds a variable to the module object
import sys, types

def import_module(modname):
    # Check if it is in the cache first
    if modname in sys.modules:
        return sys.modules[modname]
    
    sourcepath = modname + '.py'
    with open(sourcepath, 'r') as f:
        sourcecode = f.read()
    mod = types.ModuleType(modname)
    mod.__file__ = sourcepath
    
    # Cache the module
    sys.modules[modname] = mod
    
    # Convert it to Python ByteCode
    code = compile(sourcecode, sourcepath, 'exec')
    
    # Execute the code in the module from top to bottom
    # And update the state (globals) in the module's dictionary
    exec(code, mod.__dict__)
    
    # We return the cached one in case there is some patching inside the module
    return sys.modules[modname]

Module Compilation

  • Python puts a lock on a module while importing it until the import is done, so that multiple threads don’t try to import the same module at the same time
  • __import__ is the machinery behind import statement
  • We can use importlib.import_module(module), which is the same thing as __import__
    • importlib.import_module('spam') is the same as import spam
    • importlib.import_module('.spam', __package__) is the same as from . import spam
    • We can track all imports as follows:
import builtins

def imp_mod(modname, *args, imp=__import__):
    print(f"Importing {modname}")
    return imp(modname, *args)

builtins.__import__ = imp_mod
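Since importlib.import_module goes through the same machinery and the same cache, it returns the very object an import statement binds:

```python
import importlib
import json
import sys

# Both spellings resolve to the single cached module object
mod = importlib.import_module("json")
assert mod is json
assert mod is sys.modules["json"]
```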
  • Module Reloading:
    • It is not a good idea to reload a module because it creates zombies. Python doesn’t try to clean up the old module’s dictionary; instead it exec()s the new state of the module using the old module.__dict__. This means objects from the previous load may still exist, and we can end up with weird cases. This is how Python reloads a module:
    code = open(module.__file__, 'rb').read()
    exec(code, module.__dict__)
    • Also, submodules that are loaded in the module/package don’t get reloaded; they keep their old version. Example: if the module has import pandas as pd, reloading the module doesn’t reload pandas
    • Also, if instances were created from the old version of the module and we then reload, new instances of the same class will refer to a different code implementation than the instances created before the reload; even though they refer to the same class name, the instances will have different types
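The stale-class effect can be sketched with a throwaway module (zmod is an invented name):

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = tempfile.mkdtemp()
Path(root, "zmod.py").write_text("class A:\n    pass\n")

sys.path.insert(0, root)
zmod = importlib.import_module("zmod")
a = zmod.A()

# reload() re-executes the class statement, producing a brand-new class
# object; the old instance keeps pointing at the old one
importlib.reload(zmod)
b = zmod.A()
assert type(a) is not type(b)
assert type(a).__name__ == type(b).__name__ == "A"
```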
  • sys.path is only a small part of the import machinery
  • Imports are actually controlled by sys.meta_path
    • It is a list of importers
    [_frozen_importlib.BuiltinImporter,
    _frozen_importlib.FrozenImporter,
    _frozen_importlib_external.PathFinder,
    <six._SixMetaPathImporter at 0x10c8769b0>,
    <pkg_resources.extern.VendorImporter at 0x10dbf9300>]
    • Python’s default sys.meta_path has three meta path finders, one that knows how to import built-in modules, one that knows how to import frozen modules, and one that knows how to import modules from an import path
    • For every import statement, Python goes through sys.meta_path from start to end, asking each finder whether it knows how to import the module
import sys

def find_spec(modname):
    for finder in sys.meta_path:
        spec = finder.find_spec(modname)
        if spec:
            return spec
    return None
  • ModuleSpec of a module is its metadata that the loader uses to load it. We can also use importlib.util.find_spec() to get the module spec of any loaded package. If the package/module is not found -> returns None. Example of pandas module spec:

      ModuleSpec(name='pandas', loader=<_frozen_importlib_external.SourceFileLoader object at 0x10e609f90>, origin='/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py', submodule_search_locations=['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'])
    • Module Spec main info:
      • spec.name : # Full module name
      • spec.parent : # Enclosing package
      • spec.submodule_search_locations : # Package path
      • spec.has_location : # Has external location
      • spec.origin : # Source file location
      • spec.cached : # Cached location
      • spec.loader : # Loader object
    • We can use the loader from the module spec to get the source code without importing the module. Loaders are what actually create the imported module:
    module = spec.loader.create_module(spec)
    if not module:
        module = types.ModuleType(spec.name)
        module.__file__ = spec.origin
        module.__loader__ = spec.loader
        module.__package__ = spec.parent
        module.__path__ = spec.submodule_search_locations
        module.__spec__ = spec
    • We can create a module from a spec with importlib.util.module_from_spec. This DOES NOT LOAD THE MODULE, it only creates it. To load the module, it must be executed with spec.loader.exec_module(module) and then cached with sys.modules[spec.name] = module. exec_module populates the module’s __dict__
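The create-then-execute split is visible when driving the machinery by hand (using json as a convenient stdlib target):

```python
import importlib.util
import sys

spec = importlib.util.find_spec("json")

# module_from_spec only creates the (still empty) module object
module = importlib.util.module_from_spec(spec)
assert not hasattr(module, "loads")

# Register in the cache before executing, as the import system itself
# does; exec_module then populates the module's __dict__
sys.modules[spec.name] = module
spec.loader.exec_module(module)
assert callable(module.loads)
```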
  • We can execute modules lazily on first access. Implementation example:

import sys
import types


class _Module(types.ModuleType):
    pass


class _LazyModule(_Module):

    def __init__(self, spec):
        super().__init__(spec.name) 
        self.__file__ = spec.origin
        self.__package__ = spec.parent 
        self.__loader__ = spec.loader
        self.__path__ = spec.submodule_search_locations 
        self.__spec__ = spec

    def __getattr__(self, name):
        self.__class__ = _Module
        self.__spec__.loader.exec_module(self)
        assert sys.modules[self.__name__] == self
        return getattr(self, name)
import importlib.util, sys

def lazy_import(name):
    # If already loaded, return the module
    if name in sys.modules:
        return sys.modules[name]
    
    # Not loaded. Find the spec
    spec = importlib.util.find_spec(name)
    if not spec:
        raise ImportError(f'No module {name!r}')
    
    # Check for compatibility
    if not hasattr(spec.loader, 'exec_module'):
        raise ImportError('Not supported')

    # Perform the lazy import
    module = sys.modules[name] = _LazyModule(spec)
    return module
  • Therefore, module creation and loading have been decoupled in recent versions of Python
  • We can insert an importer to sys.meta_path that can change the behavior of imports
    • If it is at the beginning, it supersedes all other finders, and we can do crazy things
    import sys
    
    
    class Watcher(object):
    
        @classmethod
        def find_spec(cls, name, path, target=None):
            print('Importing', name, path, target)
            return None
    
    
    sys.meta_path.insert(0, Watcher)
  • We can also use this idea to add logic such as auto-installing packages that are not found, using pip. We insert the installer at the end of sys.meta_path
import sys
import subprocess
import importlib.util


class AutoInstall(object):
    _loaded = set()

    @classmethod
    def find_spec(cls, name, path, target=None):
        if path is None and name not in cls._loaded: 
            cls._loaded.add(name)
            print("Installing", name)
            try:
                out = subprocess.check_output(
                          [sys.executable, '-m', 'pip', 'install', name])
                return importlib.util.find_spec(name) 
            except Exception as e:
                print("Failed:", e)
        return None
sys.meta_path.append(AutoInstall)
  • We can also import packages not found on the system from some other systems such as Redis

  • sys.path_hooks is responsible for the actual loading of the module/package depending on the path

    • Each entry in sys.path is tested against a list of path hooks to associate a module finder with each path entry
    • Path finders are used to locate module and return module spec along with loader
    • Path finders get cached in sys.path_importer_cache
  • Both loaders and finders have find_spec() that returns spec of module if they know how to find/load it. Otherwise, they return None
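Asking the registered path hooks for a finder, the way PathFinder fills sys.path_importer_cache, can be sketched like this (the directory and module name are throwaway):

```python
import sys
import tempfile
from pathlib import Path

d = tempfile.mkdtemp()
Path(d, "hookmod.py").write_text("x = 1\n")

# Offer the directory to each hook until one accepts it; zipimporter
# rejects plain directories with an ImportError subclass
finder = None
for hook in sys.path_hooks:
    try:
        finder = hook(d)
        break
    except ImportError:
        continue

spec = finder.find_spec("hookmod")
assert spec is not None
assert spec.origin.endswith("hookmod.py")
```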

  • What happens during import:

modname = 'somemodulename'
for entry in sys.path:
    finder = sys.path_importer_cache.get(entry)
    if finder:
        spec = finder.find_spec(modname)
        if spec:
            break
else:
    raise ImportError('No such module')
...
# Load module from the spec
...

Experiments

>>> sys.path.append("/Users/imad/Desktop/")
>>> from pck.mod import X
pck.mod
>>> X
100
>>> from pck.test import X
pck.test
>>> sys.modules["pck"].__path__
_NamespacePath(['/Users/imad/Documents/python-materials/modules-and-packages/pck', '/Users/imad/Documents/python-materials/modules-and-packages/pck', '/Users/imad/Desktop/pck'])
>>> foo.__package__, foo.__path__
AttributeError: module 'package.foo' has no attribute '__path__'
>>> globals()["foo"]
<module 'package.foo' from '/Users/imad/Documents/python-materials/modules-and-packages/package/foo.py'>
>>> def f():
...     pass
>>> from pandas import read_csv
>>> sys.path_hooks
[zipimport.zipimporter,
 <function _frozen_importlib_external.FileFinder.path_hook.<locals>.path_hook_for_FileFinder(path)>]
>>> list(sys.path_importer_cache.keys())[:10]
['/Users/imad/anaconda3/envs/python-exp/lib/python310.zip',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/encodings',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/importlib',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/lib-dynload',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/PyYAML-6.0-py3.10-macosx-10.9-x86_64.egg',
 '/Users/imad/Documents/python-materials/modules-and-packages',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/ipykernel',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/json']
>>> from importlib.util import find_spec
>>> m = find_spec("mod")
>>> m.loader.get_source("mod")
'y = 200\nprint(y)\n\nclass A:\n    print("A")\n'
>>> import sys
>>> sys.meta_path
[_frozen_importlib.BuiltinImporter,
 _frozen_importlib.FrozenImporter,
 _frozen_importlib_external.PathFinder,
 <six._SixMetaPathImporter at 0x10c8769b0>,
 <pkg_resources.extern.VendorImporter at 0x10dbf9300>]
>>> import mod
200
A
>>> a = mod.A()
>>> from importlib import reload
>>> reload(mod)
200
A
<module 'mod' from '/Users/imad/Documents/python-materials/modules-and-packages/mod.py'>
>>> b = mod.A()
>>> a.__class__, b.__class__, type(a) == type(b)
(mod.A, mod.A, True)
>>> from importlib.util import find_spec
>>> find_spec("sys")
ModuleSpec(name='sys', loader=<class '_frozen_importlib.BuiltinImporter'>, origin='built-in')
>>> find_spec("pandas")
ModuleSpec(name='pandas', loader=<_frozen_importlib_external.SourceFileLoader object at 0x10e609f90>, origin='/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py', submodule_search_locations=['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'])
>>> import importlib
>>> pd = importlib.import_module("pandas")
>>> pd.__path__, pd.__name__, pd.__package__, pd.__file__, pd.__doc__
(['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'],
 'pandas',
 'pandas',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py',
 '\npandas - a powerful data analysis and manipulation library for Python\n=====================================================================\n\n**pandas** is a Python package providing fast, flexible, and expressive data\nstructures designed to make working with "relational" or "labeled" data both\neasy and intuitive. It aims to be the fundamental high-level building block for\ndoing practical, **real world** data analysis in Python. Additionally, it has\nthe broader goal of becoming **the most powerful and flexible open source data\nanalysis / manipulation tool available in any language**. It is already well on\nits way toward this goal.\n\nMain Features\n-------------\nHere are just a few of the things that pandas does well:\n\n  - Easy handling of missing data in floating point as well as non-floating\n    point data.\n  - Size mutability: columns can be inserted and deleted from DataFrame and\n    higher dimensional objects\n  - Automatic and explicit data alignment: objects can be explicitly aligned\n    to a set of labels, or the user can simply ignore the labels and let\n    `Series`, `DataFrame`, etc. 
automatically align the data for you in\n    computations.\n  - Powerful, flexible group by functionality to perform split-apply-combine\n    operations on data sets, for both aggregating and transforming data.\n  - Make it easy to convert ragged, differently-indexed data in other Python\n    and NumPy data structures into DataFrame objects.\n  - Intelligent label-based slicing, fancy indexing, and subsetting of large\n    data sets.\n  - Intuitive merging and joining data sets.\n  - Flexible reshaping and pivoting of data sets.\n  - Hierarchical labeling of axes (possible to have multiple labels per tick).\n  - Robust IO tools for loading data from flat files (CSV and delimited),\n    Excel files, databases, and saving/loading data from the ultrafast HDF5\n    format.\n  - Time series-specific functionality: date range generation and frequency\n    conversion, moving window statistics, date shifting and lagging.\n')