contrib/python-zstandard/README.rst
changeset 37495 b1fb341d8a61
parent 31796 e0dc40530c5a
child 40121 73fef626dae3
The primary goal of the project is to provide a rich interface to the
underlying C API through a Pythonic interface while not sacrificing
performance. This means exposing most of the features and flexibility
of the C API while not sacrificing the usability or safety that Python
provides.

The canonical home for this project lives in a Mercurial repository run by
the author. For convenience, that repository is frequently synchronized to
https://github.com/indygreg/python-zstandard.

|  |ci-status| |win-ci-status|
Requirements
============

This extension is designed to run with Python 2.7, 3.4, 3.5, and 3.6
on common platforms (Linux, Windows, and OS X). x86 and x86_64 are well-tested
on Windows. Only x86_64 is well-tested on Linux and macOS.

Installing
==========

This package is uploaded to PyPI at https://pypi.python.org/pypi/zstandard.
this package with ``conda``.

Performance
===========

zstandard is a highly tunable compression algorithm. In its default settings
(compression level 3), it will be faster at compression and decompression and
will have better compression ratios than zlib on most data sets. When tuned
for speed, it approaches lz4's speed and ratios. When tuned for compression
ratio, it approaches lzma ratios and compression speed, but decompression
speed is much faster. See the official zstandard documentation for more.

zstandard and this library support multi-threaded compression. There is a
mechanism to compress large inputs using multiple threads.

The performance of this library is usually very similar to what the zstandard
C API can deliver. Overhead in this library is due to general Python overhead
and can't easily be avoided by *any* zstandard Python binding. This library
exposes multiple APIs for performing compression and decompression so callers
can pick an API suitable for their needs. Contrast with the compression
modules in Python's standard library (like ``zlib``), which only offer limited
mechanisms for performing operations. The API flexibility means consumers can
choose to use APIs that facilitate zero copying or minimize Python object
creation and garbage collection overhead.

This library is capable of single-threaded throughputs well over 1 GB/s. For
exact numbers, measure yourself. The source code repository has a ``bench.py``
script that can be used to measure things.
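As an illustration of the multi-threaded mechanism, the following minimal
sketch (assuming the ``zstandard`` package is installed, along with its
``ZstdDecompressor`` API) compresses a large payload with one thread per
logical CPU::

   import zstandard as zstd

   # A compressible payload large enough that threading can matter.
   data = b'sample payload ' * 100000

   # threads=-1 requests one compression thread per logical CPU;
   # a positive integer sets an explicit thread count.
   cctx = zstd.ZstdCompressor(level=3, threads=-1)
   compressed = cctx.compress(data)

   # Multi-threaded frames decompress exactly like single-threaded ones.
   dctx = zstd.ZstdDecompressor()
   assert dctx.decompress(compressed) == data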
       
API
===

To interface with Zstandard, simply import the ``zstandard`` module::

   import zstandard

It is a popular convention to alias the module as a different name for
brevity::

   import zstandard as zstd

This module attempts to import and use either the C extension or CFFI
implementation. On Python platforms known to support C extensions (like
CPython), it raises an ImportError if the C extension cannot be imported.
On Python platforms known to not support C extensions (like PyPy), it only
attempts to import the CFFI implementation and raises an ImportError if
that fails. On other platforms, it first tries to import the C extension,
falls back to CFFI if that fails, and raises an ImportError if neither can
be imported.

To change the module import behavior, a ``PYTHON_ZSTANDARD_IMPORT_POLICY``
environment variable can be set. The following values are accepted:

default
   The behavior described above.
cffi_fallback
   Always try to import the C extension then fall back to CFFI if that
   fails.
cext
   Only attempt to import the C extension.
cffi
   Only attempt to import the CFFI implementation.

In addition, the ``zstandard`` module exports a ``backend`` attribute
containing the string name of the backend being used. It will be one
of ``cext`` or ``cffi`` (for *C extension* and *cffi*, respectively).

The types, functions, and attributes exposed by the ``zstandard`` module
are documented in the sections below.
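For example, a minimal sketch of inspecting the backend at runtime (the
environment variable must be set before ``zstandard`` is first imported to
have any effect)::

   import os

   # Request the default resolution order explicitly. This line is
   # illustrative; omitting it yields the same behavior.
   os.environ['PYTHON_ZSTANDARD_IMPORT_POLICY'] = 'default'

   import zstandard as zstd

   # ``backend`` names the implementation that was actually imported.
   assert zstd.backend in ('cext', 'cffi')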

.. note::

   The documentation in this section makes references to various zstd
   concepts and functionality. The source repository contains a
   ``docs/concepts.rst`` file explaining these in more detail.

ZstdCompressor
--------------

The ``ZstdCompressor`` class provides an interface for performing
compression operations. Each instance is essentially a wrapper around a
``ZSTD_CCtx`` from the C API.

Each instance is associated with parameters that control compression
behavior. These come from the following named arguments (all optional):

level
   Integer compression level. Valid values are between 1 and 22.
dict_data
   Compression dictionary to use.

   Note: When using dictionary data and ``compress()`` is called multiple
   times, the ``ZstdCompressionParameters`` derived from an integer
   compression ``level`` and the first compressed data's size will be reused
   for all subsequent operations. This may not be desirable if source data
   size varies significantly.
compression_params
   A ``ZstdCompressionParameters`` instance defining compression settings.
write_checksum
   Whether a 4 byte checksum should be written with the compressed data.
   Defaults to False. If True, the decompressor can verify that decompressed
   data matches the original input data.
write_content_size
   Whether the size of the uncompressed data will be written into the
   header of compressed data. Defaults to True. The data will only be
   written if the compressor knows the size of the input data. This is
   often not true for streaming compression.
write_dict_id
   Whether to write the dictionary ID into the compressed data.
   Defaults to True. The dictionary ID is only written if a dictionary
   is being used.
threads
   Read below for more info on multi-threaded compression. This argument only
   controls thread count for operations that operate on individual pieces of
   data. APIs that spawn multiple threads for working on multiple pieces of
   data have their own ``threads`` argument.
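The arguments above are all passed as keywords to the constructor. A small
sketch (``zstd.get_frame_parameters()``, part of the same module, is used
here to parse the emitted frame header)::

   import zstandard as zstd

   # write_checksum appends a 4 byte checksum to the frame;
   # write_content_size records the uncompressed length in its header.
   cctx = zstd.ZstdCompressor(level=10, write_checksum=True,
                              write_content_size=True)
   frame = cctx.compress(b'data' * 1000)

   # Confirm both settings took effect by parsing the frame header.
   params = zstd.get_frame_parameters(frame)
   assert params.has_checksum
   assert params.content_size == 4000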

``compression_params`` is mutually exclusive with ``level``, ``write_checksum``,
``write_content_size``, ``write_dict_id``, and ``threads``.

Unless specified otherwise, assume that no two methods of ``ZstdCompressor``
instances can be called from multiple Python threads simultaneously. In other
words, assume instances are not thread safe unless stated otherwise.

Utility Methods
^^^^^^^^^^^^^^^

``frame_progression()`` returns a 3-tuple containing the number of bytes
ingested, consumed, and produced by the current compression operation.

``memory_size()`` obtains the memory utilization of the underlying zstd
compression context, in bytes::

    cctx = zstd.ZstdCompressor()
    memory = cctx.memory_size()
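Both utility methods can be observed against a live operation. A sketch,
using ``compressobj()`` from the incremental API below to feed data through
the same context::

   import zstandard as zstd

   cctx = zstd.ZstdCompressor()

   # Feed data through an incremental compressor so the frame
   # progression counters have something to report.
   cobj = cctx.compressobj()
   cobj.compress(b'x' * 65536)

   ingested, consumed, produced = cctx.frame_progression()
   assert ingested == 65536

   # Memory used by the underlying compression context, in bytes.
   assert cctx.memory_size() > 0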

Simple API
^^^^^^^^^^

``compress(data)`` compresses and returns data as a one-shot operation::

   cctx = zstd.ZstdCompressor()
   compressed = cctx.compress(b'data to compress')

The ``data`` argument can be any object that implements the *buffer protocol*.
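For illustration, the one-shot output round-trips through the
``ZstdDecompressor`` API (a minimal sketch)::

   import zstandard as zstd

   cctx = zstd.ZstdCompressor()
   compressed = cctx.compress(b'data to compress')

   # One-shot frames record the content size by default, so the
   # decompressor can size its output buffer up front.
   dctx = zstd.ZstdDecompressor()
   assert dctx.decompress(compressed) == b'data to compress'

   # Any buffer protocol object is accepted, e.g. a memoryview.
   assert cctx.compress(memoryview(b'1234')) == cctx.compress(b'1234')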

Stream Reader API
^^^^^^^^^^^^^^^^^

``stream_reader(source)`` can be used to obtain an object conforming to the
``io.RawIOBase`` interface for reading compressed output as a stream::

   with open(path, 'rb') as fh:
       cctx = zstd.ZstdCompressor()
       with cctx.stream_reader(fh) as reader:
           while True:
               chunk = reader.read(16384)
               if not chunk:
                   break

               # Do something with compressed chunk.

The stream can only be read within a context manager. When the context
manager exits, the stream is closed, the underlying resource is released,
and future operations against the compression stream will fail.

The ``source`` argument to ``stream_reader()`` can be any object with a
``read(size)`` method or any object implementing the *buffer protocol*.

``stream_reader()`` accepts a ``size`` argument specifying how large the input
stream is. This is used to adjust compression parameters so they are
tailored to the source size::

   with open(path, 'rb') as fh:
       cctx = zstd.ZstdCompressor()
       with cctx.stream_reader(fh, size=os.stat(path).st_size) as reader:
           ...

If the ``source`` is a stream, you can specify how large ``read()`` requests
to that stream should be via the ``read_size`` argument. It defaults to
``zstandard.COMPRESSION_RECOMMENDED_INPUT_SIZE``::

   with open(path, 'rb') as fh:
       cctx = zstd.ZstdCompressor()
       # Will perform fh.read(8192) when obtaining data to feed into the
       # compressor.
       with cctx.stream_reader(fh, read_size=8192) as reader:
           ...

The stream returned by ``stream_reader()`` is neither writable nor seekable
(even if the underlying source is seekable). ``readline()`` and
``readlines()`` are not implemented because they don't make sense for
compressed data. ``tell()`` returns the number of compressed bytes
emitted so far.
   239 
   273 Streaming Input API
   240 Streaming Input API
   274 ^^^^^^^^^^^^^^^^^^^
   241 ^^^^^^^^^^^^^^^^^^^
   275 
   242 
   276 ``write_to(fh)`` (which behaves as a context manager) allows you to *stream*
   243 ``stream_writer(fh)`` (which behaves as a context manager) allows you to *stream*
   277 data into a compressor.::
   244 data into a compressor.::
   278 
   245 
   279    cctx = zstd.ZstdCompressor(level=10)
   246    cctx = zstd.ZstdCompressor(level=10)
   280    with cctx.write_to(fh) as compressor:
   247    with cctx.stream_writer(fh) as compressor:
   281        compressor.write(b'chunk 0')
   248        compressor.write(b'chunk 0')
   282        compressor.write(b'chunk 1')
   249        compressor.write(b'chunk 1')
   283        ...
   250        ...
   284 
   251 
   285 The argument to ``write_to()`` must have a ``write(data)`` method. As
   252 The argument to ``stream_writer()`` must have a ``write(data)`` method. As
   286 compressed data is available, ``write()`` will be called with the compressed
   253 compressed data is available, ``write()`` will be called with the compressed
   287 data as its argument. Many common Python types implement ``write()``, including
   254 data as its argument. Many common Python types implement ``write()``, including
   288 open file handles and ``io.BytesIO``.
   255 open file handles and ``io.BytesIO``.
   289 
   256 
   290 ``write_to()`` returns an object representing a streaming compressor instance.
   257 ``stream_writer()`` returns an object representing a streaming compressor
   291 It **must** be used as a context manager. That object's ``write(data)`` method
   258 instance. It **must** be used as a context manager. That object's
   292 is used to feed data into the compressor.
   259 ``write(data)`` method is used to feed data into the compressor.
   293 
   260 
   294 A ``flush()`` method can be called to evict whatever data remains within the
   261 A ``flush()`` method can be called to evict whatever data remains within the
   295 compressor's internal state into the output object. This may result in 0 or
   262 compressor's internal state into the output object. This may result in 0 or
   296 more ``write()`` calls to the output object.
   263 more ``write()`` calls to the output object.
   297 
   264 

If the size of the data being fed to this streaming compressor is known,
you can declare it before compression begins::

   cctx = zstd.ZstdCompressor()
   with cctx.stream_writer(fh, size=data_len) as compressor:
       compressor.write(chunk0)
       compressor.write(chunk1)
       ...

Declaring the size of the source data allows compression parameters to
be tuned, and it results in the content size being written into the frame
header of the output data.

The size of chunks passed to the destination's ``write()`` can be
specified::

   cctx = zstd.ZstdCompressor()
   with cctx.stream_writer(fh, write_size=32768) as compressor:
       ...

To see how much memory is being used by the streaming compressor::

   cctx = zstd.ZstdCompressor()
   with cctx.stream_writer(fh) as compressor:
       ...
       byte_size = compressor.memory_size()

The total number of bytes written so far is exposed via ``tell()``::

   cctx = zstd.ZstdCompressor()
   with cctx.stream_writer(fh) as compressor:
       ...
       total_written = compressor.tell()
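Putting the pieces together, a runnable sketch that writes into an
``io.BytesIO`` destination and verifies the emitted frame with an
incremental decompressor::

   import io
   import zstandard as zstd

   destination = io.BytesIO()
   cctx = zstd.ZstdCompressor(level=10)

   with cctx.stream_writer(destination) as compressor:
       compressor.write(b'chunk 0')
       compressor.write(b'chunk 1')

   compressed = destination.getvalue()

   # The input size was never declared, so decompress incrementally
   # rather than with the one-shot API.
   dctx = zstd.ZstdDecompressor()
   assert dctx.decompressobj().decompress(compressed) == b'chunk 0chunk 1'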
Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_to_iter(reader)`` provides a mechanism to stream data out of a
compressor as an iterator of data chunks::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_to_iter(fh):
       # Do something with emitted data.

``read_to_iter()`` accepts an object that has a ``read(size)`` method or
conforms to the buffer protocol.

Uncompressed data is fetched from the source either by calling ``read(size)``
or by fetching a slice of data from the object directly (in the case where
the buffer protocol is being used). The returned iterator consists of chunks
of compressed data.

If reading from the source via ``read()``, ``read()`` will be called until
it raises or returns an empty bytes (``b''``). It is perfectly valid for
the source to deliver fewer bytes than were requested by ``read(size)``.

Like ``stream_writer()``, ``read_to_iter()`` also accepts a ``size`` argument
declaring the size of the input stream::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_to_iter(fh, size=some_int):
       pass

You can also control the size that data is ``read()`` from the source and
the ideal size of output chunks::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_to_iter(fh, read_size=16384, write_size=8192):
       pass

Unlike ``stream_writer()``, ``read_to_iter()`` does not give direct control
over the sizes of chunks fed into the compressor. Instead, chunk sizes will
be whatever the object being read from delivers. These will often be of a
uniform size.
   369 Stream Copying API
   343 Stream Copying API
   370 ^^^^^^^^^^^^^^^^^^
   344 ^^^^^^^^^^^^^^^^^^
   371 
   345 
   372 ``copy_stream(ifh, ofh)`` can be used to copy data between 2 streams while
   346 ``copy_stream(ifh, ofh)`` can be used to copy data between 2 streams while

``compressobj()`` returns an object that exposes ``compress(data)`` and
``flush()`` methods. Each returns compressed data or an empty bytes.

The purpose of ``compressobj()`` is to provide an API-compatible interface
with ``zlib.compressobj``, ``bz2.BZ2Compressor``, etc. This allows callers to
swap in different compressor objects while using the same API.

``flush()`` accepts an optional argument indicating how to end the stream.
``zstd.COMPRESSOBJ_FLUSH_FINISH`` (the default) ends the compression stream.
Once this type of flush is performed, ``compress()`` and ``flush()`` can

ZstdDecompressor
----------------

The ``ZstdDecompressor`` class provides an interface for performing
decompression. It is effectively a wrapper around the ``ZSTD_DCtx`` type from
the C API.

Each instance is associated with parameters that control decompression. These
come from the following named arguments (all optional):

dict_data
   Compression dictionary to use.
max_window_size
   Sets an upper limit on the window size for decompression operations in
   kibibytes. This setting can be used to prevent large memory allocations
   for inputs using large compression windows.
format
   Set the format of data for the decoder. By default, this is
   ``zstd.FORMAT_ZSTD1``. It can be set to ``zstd.FORMAT_ZSTD1_MAGICLESS`` to
   allow decoding frames without the 4 byte magic header. Not all decompression
   APIs support this mode.

The interface of this class is very similar to ``ZstdCompressor`` (by design).

Unless specified otherwise, assume that no two methods of ``ZstdDecompressor``
instances can be called from multiple Python threads simultaneously. In other
words, assume instances are not thread safe unless stated otherwise.

Utility Methods
^^^^^^^^^^^^^^^

``memory_size()`` obtains the size of the underlying zstd decompression context,
in bytes.::

    dctx = zstd.ZstdDecompressor()
    size = dctx.memory_size()

Simple API
^^^^^^^^^^

``decompress(data)`` can be used to decompress an entire compressed zstd
frame in a single operation.::

    dctx = zstd.ZstdDecompressor()
    decompressed = dctx.decompress(data)

By default, ``decompress(data)`` will only work on data written with the content
size encoded in its header (this is the default behavior of
``ZstdCompressor().compress()`` but may not be true for streaming compression). If
compressed data without an embedded content size is seen, ``zstd.ZstdError`` will
be raised.

If the compressed data doesn't have its content size embedded within it,
decompression can be attempted by specifying the ``max_output_size``
argument.::

Please note that an allocation of the requested ``max_output_size`` will be
performed every time the method is called. Setting to a very large value could
result in a lot of work for the memory allocator and may result in
``MemoryError`` being raised if the allocation fails.

.. important::

   If the exact size of decompressed data is unknown (not passed in explicitly
   and not stored in the zstandard frame), for performance reasons it is
   encouraged to use a streaming API.

Stream Reader API
^^^^^^^^^^^^^^^^^

``stream_reader(source)`` can be used to obtain an object conforming to the
``io.RawIOBase`` interface for reading decompressed output as a stream::

   with open(path, 'rb') as fh:
       dctx = zstd.ZstdDecompressor()
       with dctx.stream_reader(fh) as reader:
           while True:
               chunk = reader.read(16384)
               if not chunk:
                   break

               # Do something with decompressed chunk.

The stream can only be read within a context manager. When the context
manager exits, the stream is closed, the underlying resource is released,
and future operations against the stream will fail.

The ``source`` argument to ``stream_reader()`` can be any object with a
``read(size)`` method or any object implementing the *buffer protocol*.

If the ``source`` is a stream, you can specify how large ``read()`` requests
to that stream should be via the ``read_size`` argument. It defaults to
``zstandard.DECOMPRESSION_RECOMMENDED_INPUT_SIZE``.::

   with open(path, 'rb') as fh:
       dctx = zstd.ZstdDecompressor()
       # Will perform fh.read(8192) when obtaining data for the decompressor.
       with dctx.stream_reader(fh, read_size=8192) as reader:
           ...

The stream returned by ``stream_reader()`` is not writable.

The stream returned by ``stream_reader()`` is *partially* seekable.
Absolute and relative positions (``SEEK_SET`` and ``SEEK_CUR``) forward
of the current position are allowed. Offsets behind the current read
position and offsets relative to the end of stream are not allowed and
will raise ``ValueError`` if attempted.

``tell()`` returns the number of decompressed bytes read so far.

Not all I/O methods are implemented. Notably missing is support for
``readline()``, ``readlines()``, and linewise iteration. Support for
these is planned for a future release.

Streaming Input API
^^^^^^^^^^^^^^^^^^^

``stream_writer(fh)`` can be used to incrementally send compressed data to a
decompressor.::

    dctx = zstd.ZstdDecompressor()
    with dctx.stream_writer(fh) as decompressor:
        decompressor.write(compressed_data)

This behaves similarly to ``zstd.ZstdCompressor``: compressed data is written to
the decompressor by calling ``write(data)`` and decompressed output is written
to the output object by calling its ``write(data)`` method.

of ``0`` are possible.

The size of chunks being ``write()`` to the destination can be specified::

    dctx = zstd.ZstdDecompressor()
    with dctx.stream_writer(fh, write_size=16384) as decompressor:
        pass

You can see how much memory is being used by the decompressor::

    dctx = zstd.ZstdDecompressor()
    with dctx.stream_writer(fh) as decompressor:
        byte_size = decompressor.memory_size()

Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_to_iter(fh)`` provides a mechanism to stream decompressed data out of a
compressed source as an iterator of data chunks.::

    dctx = zstd.ZstdDecompressor()
    for chunk in dctx.read_to_iter(fh):
        # Do something with original data.

``read_to_iter()`` accepts an object with a ``read(size)`` method that will
return compressed bytes or an object conforming to the buffer protocol that
can expose its data as a contiguous range of bytes.

``read_to_iter()`` returns an iterator whose elements are chunks of the
decompressed data.

The size of requested ``read()`` from the source can be specified::

    dctx = zstd.ZstdDecompressor()
    for chunk in dctx.read_to_iter(fh, read_size=16384):
        pass

It is also possible to skip leading bytes in the input data::

    dctx = zstd.ZstdDecompressor()
    for chunk in dctx.read_to_iter(fh, skip_bytes=1):
        pass

.. tip::

   Skipping leading bytes is useful if the source data contains extra
   *header* data. Traditionally, you would need to create a slice or
   ``memoryview`` of the data you want to decompress. This would create
   overhead. It is more efficient to pass the offset into this API.

Similarly to ``ZstdCompressor.read_to_iter()``, the consumer of the iterator
controls when data is decompressed. If the iterator isn't consumed,
decompression is put on hold.

When ``read_to_iter()`` is passed an object conforming to the buffer protocol,
the behavior may seem similar to what occurs when the simple decompression
API is used. However, this API works when the decompressed size is unknown.
Furthermore, if feeding large inputs, the decompressor will work in chunks
instead of performing a single operation.


Decompressor API
^^^^^^^^^^^^^^^^

``decompressobj()`` returns an object that exposes a ``decompress(data)``
method. Compressed data chunks are fed into ``decompress(data)`` and
uncompressed output (or an empty bytes) is returned. Output from subsequent
calls needs to be concatenated to reassemble the full decompressed byte
sequence.

The purpose of ``decompressobj()`` is to provide an API-compatible interface

Each object is single use: once an input frame is decoded, ``decompress()``
can no longer be called.

Here is how this API should be used::

   dctx = zstd.ZstdDecompressor()
   dobj = dctx.decompressobj()
   data = dobj.decompress(compressed_chunk_0)
   data = dobj.decompress(compressed_chunk_1)

By default, calls to ``decompress()`` write output data in chunks of size
``DECOMPRESSION_RECOMMENDED_OUTPUT_SIZE``. These chunks are concatenated
before being returned to the caller. It is possible to define the size of
these temporary chunks by passing ``write_size`` to ``decompressobj()``::

   dctx = zstd.ZstdDecompressor()
   dobj = dctx.decompressobj(write_size=1048576)

.. note::

   Because calls to ``decompress()`` may need to perform multiple
   memory (re)allocations, this streaming decompression API isn't as
   efficient as other APIs.

Batch Decompression API
^^^^^^^^^^^^^^^^^^^^^^^

(Experimental. Not yet supported in CFFI bindings.)

conform to the buffer protocol. For best performance, pass a
``BufferWithSegmentsCollection`` or a ``BufferWithSegments``, as
minimal input validation will be done for that type. If calling from
Python (as opposed to C), constructing one of these instances may add
overhead cancelling out the performance overhead of validation for list
inputs.::

    dctx = zstd.ZstdDecompressor()
    results = dctx.multi_decompress_to_buffer([b'...', b'...'])

The decompressed size of each frame MUST be discoverable. It can either be
embedded within the zstd frame (``write_content_size=True`` argument to
``ZstdCompressor``) or passed in via the ``decompressed_sizes`` argument.

The ``decompressed_sizes`` argument is an object conforming to the buffer
protocol which holds an array of 64-bit unsigned integers in the machine's
native format defining the decompressed sizes of each frame. If this argument
is passed, it avoids having to scan each frame for its decompressed size.
This frame scanning can add noticeable overhead in some scenarios.::

    frames = [...]
    sizes = struct.pack('=QQQQ', len0, len1, len2, len3)

    dctx = zstd.ZstdDecompressor()
    results = dctx.multi_decompress_to_buffer(frames, decompressed_sizes=sizes)

The ``threads`` argument controls the number of threads to use to perform
decompression operations. The default (``0``) or the value ``1`` means to
use a single thread. Negative values use the number of logical CPUs in the
machine.

This function exists to perform decompression on multiple frames as fast
as possible by having as little overhead as possible. Since decompression is
performed as a single operation and since the decompressed output is stored in
a single buffer, extra memory allocations, Python objects, and Python function
calls are avoided. This is ideal for scenarios where callers know up front that
they need to access data for multiple frames, such as when *delta chains* are
being used.

Currently, the implementation always spawns multiple threads when requested,
even if the amount of work to do is small. In the future, it will be smarter
about avoiding threads and their associated overhead when the amount of
work to do is small.

Prefix Dictionary Chain Decompression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``decompress_content_dict_chain(frames)`` performs decompression of a list of
zstd frames produced using chained *prefix* dictionary compression. Such
a list of frames is produced by compressing discrete inputs where each
non-initial input is compressed with a *prefix* dictionary consisting of the
content of the previous input.

For example, say you have the following inputs::

   inputs = [b'input 1', b'input 2', b'input 3']

The zstd frame chain consists of:

1. ``b'input 1'`` compressed in standalone/discrete mode
2. ``b'input 2'`` compressed using ``b'input 1'`` as a *prefix* dictionary
3. ``b'input 3'`` compressed using ``b'input 2'`` as a *prefix* dictionary

Each zstd frame **must** have the content size written.

The following Python code can be used to produce a *prefix dictionary chain*::

    def make_chain(inputs):
        frames = []

        # First frame is compressed in standalone/discrete mode.
        zctx = zstd.ZstdCompressor()
        frames.append(zctx.compress(inputs[0]))

        # Subsequent frames use the previous fulltext as a prefix dictionary.
        for i, raw in enumerate(inputs[1:]):
            dict_data = zstd.ZstdCompressionDict(
                inputs[i], dict_type=zstd.DICT_TYPE_RAWCONTENT)
            zctx = zstd.ZstdCompressor(dict_data=dict_data)
            frames.append(zctx.compress(raw))

        return frames

``decompress_content_dict_chain()`` returns the uncompressed data of the last
element in the input chain.

.. note::

   It is possible to implement *prefix dictionary chain* decompression
   on top of other APIs. However, this function will likely be faster -
   especially for long input chains - as it avoids the overhead of instantiating
   and passing around intermediate objects between C and Python.

Multi-Threaded Compression
--------------------------

``ZstdCompressor`` accepts a ``threads`` argument that controls the number
of threads to use for compression. The way this works is that input is split
into segments and each segment is fed into a worker pool for compression. Once
a segment is compressed, it is flushed/appended to the output.

.. note::

   These threads are created at the C layer and are not Python threads. So they
   work outside the GIL. It is therefore possible to CPU saturate multiple cores
   from Python.

The segment size for multi-threaded compression is chosen from the window size
of the compressor. This is derived from the ``window_log`` attribute of a
``ZstdCompressionParameters`` instance. By default, segment sizes are in the 1+MB
range.

If multi-threaded compression is requested and the input is smaller than the
configured segment size, only a single compression thread will be used. If the
input is smaller than the segment size multiplied by the thread pool size or
*states*, the output from multi-threaded compression will likely be larger
than non-multi-threaded compression. The difference is usually small. But
there is a CPU/wall time versus size trade off that may warrant investigation.

Output from multi-threaded compression does not require any special handling
on the decompression side. To the decompressor, data generated by a single
threaded compressor looks the same as data generated by a multi-threaded
compressor; no additional resources are required to consume it.
   790 
   871 
Dictionary Creation and Management
----------------------------------

Compression dictionaries are represented with the ``ZstdCompressionDict`` type.

Instances can be constructed from bytes::

   dict_data = zstd.ZstdCompressionDict(data)

It is possible to construct a dictionary from *any* data. If the data doesn't
begin with a magic header, it will be treated as a *prefix* dictionary.
*Prefix* dictionaries allow compression operations to reference raw data
within the dictionary.

It is possible to force the use of *prefix* dictionaries or to require a
dictionary header::

   dict_data = zstd.ZstdCompressionDict(data,
                                        dict_type=zstd.DICT_TYPE_RAWCONTENT)

   dict_data = zstd.ZstdCompressionDict(data,
                                        dict_type=zstd.DICT_TYPE_FULLDICT)

You can see how many bytes are in the dictionary by calling ``len()``::

   dict_data = zstd.train_dictionary(size, samples)
   dict_size = len(dict_data)  # will not be larger than ``size``

Once you have a dictionary, you can pass it to the objects performing
compression and decompression::

   dict_data = zstd.train_dictionary(131072, samples)

   cctx = zstd.ZstdCompressor(dict_data=dict_data)
   for source_data in input_data:
       compressed = cctx.compress(source_data)
       # Do something with compressed data.

   dctx = zstd.ZstdDecompressor(dict_data=dict_data)
   for compressed_data in input_data:
       buffer = io.BytesIO()
       with dctx.stream_writer(buffer) as decompressor:
           decompressor.write(compressed_data)
       # Do something with raw data in ``buffer``.

Dictionaries have unique integer IDs. You can retrieve this ID via::

a ``ZstdCompressionDict`` later) via ``as_bytes()``::

   dict_data = zstd.train_dictionary(size, samples)
   raw_data = dict_data.as_bytes()

By default, when a ``ZstdCompressionDict`` is *attached* to a
``ZstdCompressor``, each ``ZstdCompressor`` performs work to prepare the
dictionary for use. This is fine if only one compression operation is being
performed or if the ``ZstdCompressor`` is being reused for multiple operations.
But if multiple ``ZstdCompressor`` instances are being used with the dictionary,
this can add overhead.

It is possible to *precompute* the dictionary so it can readily be consumed
by multiple ``ZstdCompressor`` instances::

    d = zstd.ZstdCompressionDict(data)

    # Precompute for compression level 3.
    d.precompute_compress(level=3)

    # Precompute with specific compression parameters.
    params = zstd.ZstdCompressionParameters(...)
    d.precompute_compress(compression_params=params)

.. note::

   When a dictionary is precomputed, the compression parameters used to
   precompute the dictionary overwrite some of the compression parameters
   specified to ``ZstdCompressor.__init__``.

Training Dictionaries
^^^^^^^^^^^^^^^^^^^^^

Unless using *prefix* dictionaries, dictionary data is produced by *training*
on existing data::

   dict_data = zstd.train_dictionary(size, samples)

This takes a target dictionary size and a list of bytes instances and creates
and returns a ``ZstdCompressionDict``.

The dictionary training mechanism is known as *cover*. More details about it are
available in the paper *Effective Construction of Relative Lempel-Ziv
Dictionaries* (authors: Liao, Petri, Moffat, Wirth).

The cover algorithm takes parameters ``k`` and ``d``. These are the
*segment size* and *dmer size*, respectively. The returned dictionary
instance created by this function has ``k`` and ``d`` attributes
containing the values for these parameters. If a ``ZstdCompressionDict``
is constructed from raw bytes data (a content-only dictionary), the
``k`` and ``d`` attributes will be ``0``.

The segment and dmer size parameters to the cover algorithm can either be
specified manually or ``train_dictionary()`` can try multiple values
and pick the best one, where *best* means the smallest compressed data size.
This latter mode is called *optimization* mode.

If none of ``k``, ``d``, ``steps``, ``threads``, ``level``, ``notifications``,
or ``dict_id`` (basically anything from the underlying ``ZDICT_cover_params_t``
struct) are defined, *optimization* mode is used with default parameter
values.

If ``steps`` or ``threads`` are defined, then *optimization* mode is engaged
with explicit control over those parameters. Specifying ``threads=0`` or
``threads=1`` can be used to engage *optimization* mode if other parameters
are not defined.

Otherwise, non-*optimization* mode is used with the parameters specified.

This function takes the following arguments:

dict_size
   Target size in bytes of the dictionary to generate.
   Parameter to cover algorithm defining the dmer size. A reasonable range is
   [6, 16]. ``d`` must be less than or equal to ``k``.
dict_id
   Integer dictionary ID for the produced dictionary. Default is 0, which uses
   a random value.
steps
   Number of steps through ``k`` values to perform when trying parameter
   variations.
threads
   Number of threads to use when trying parameter variations. Default is 0,
   which means to use a single thread. A negative value can be specified to
   use as many threads as there are detected logical CPUs.
level
   Integer target compression level when trying parameter variations.
notifications
   Controls writing of informational messages to ``stderr``. ``0`` (the
   default) means to write nothing. ``1`` writes errors. ``2`` writes
   progression info. ``3`` writes more details. And ``4`` writes all info.

Explicit Compression Parameters
-------------------------------

Zstandard offers a high-level *compression level* that maps to lower-level
compression parameters. For many consumers, this numeric level is the only
compression setting you'll need to touch.

But for advanced use cases, it might be desirable to tweak these lower-level
settings.

The ``ZstdCompressionParameters`` type represents these low-level compression
settings.

Instances of this type can be constructed from a myriad of keyword arguments
(defined below) for complete low-level control over each adjustable
compression setting.

From a higher level, one can construct a ``ZstdCompressionParameters`` instance
given a desired compression level and target input and dictionary size
using ``ZstdCompressionParameters.from_level()``. e.g.::

    # Derive compression settings for compression level 7.
    params = zstd.ZstdCompressionParameters.from_level(7)

    # With an input size of 1MB.
    params = zstd.ZstdCompressionParameters.from_level(7, source_size=1048576)

Using ``from_level()``, it is also possible to override individual compression
parameters or to define additional settings that aren't automatically derived.
e.g.::

    params = zstd.ZstdCompressionParameters.from_level(4, window_log=10)
    params = zstd.ZstdCompressionParameters.from_level(5, threads=4)

Or you can define low-level compression settings directly::

    params = zstd.ZstdCompressionParameters(window_log=12, enable_ldm=True)

Once a ``ZstdCompressionParameters`` instance is obtained, it can be used to
configure a compressor::

    cctx = zstd.ZstdCompressor(compression_params=params)

The named arguments and attributes of ``ZstdCompressionParameters`` are as
follows:

* format
* compression_level
* window_log
* hash_log
* chain_log
* search_log
* min_match
* target_length
* compression_strategy
* write_content_size
* write_checksum
* write_dict_id
* job_size
* overlap_size_log
* compress_literals
* force_max_window
* enable_ldm
* ldm_hash_log
* ldm_min_match
* ldm_bucket_size_log
* ldm_hash_every_log
* threads

Some of these are very low-level settings. It may help to consult the official
zstandard documentation for their behavior. Look for the ``ZSTD_p_*`` constants
in ``zstd.h`` (https://github.com/facebook/zstd/blob/dev/lib/zstd.h).

Frame Inspection
----------------

Data emitted from zstd compression is encapsulated in a *frame*. This frame

has_checksum
   Bool indicating whether a 4 byte content checksum is stored at the end
   of the frame.

``zstd.frame_header_size(data)`` returns the size of the zstandard frame
header.

``zstd.frame_content_size(data)`` returns the content size as parsed from
the frame header. ``-1`` means the content size is unknown. ``0`` means
an empty frame. The reported content size is usually correct, but it is
taken from the frame header at face value and may not be accurate.

Misc Functionality
------------------

estimate_decompression_context_size()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Estimate the memory size requirements for a decompressor instance.

FRAME_HEADER
    bytes containing header of the Zstandard frame
MAGIC_NUMBER
    Frame header as an integer

CONTENTSIZE_UNKNOWN
    Value for content size when the content size is unknown.
CONTENTSIZE_ERROR
    Value for content size when the content size couldn't be determined.

WINDOWLOG_MIN
    Minimum value for compression parameter
WINDOWLOG_MAX
    Maximum value for compression parameter
    Minimum value for compression parameter
SEARCHLENGTH_MAX
    Maximum value for compression parameter
TARGETLENGTH_MIN
    Minimum value for compression parameter
STRATEGY_FAST
    Compression strategy
STRATEGY_DFAST
    Compression strategy
STRATEGY_GREEDY
    Compression strategy
STRATEGY_BTLAZY2
    Compression strategy
STRATEGY_BTOPT
    Compression strategy
STRATEGY_BTULTRA
    Compression strategy

FORMAT_ZSTD1
    Zstandard frame format
FORMAT_ZSTD1_MAGICLESS
    Zstandard frame format without magic header

Performance Considerations
--------------------------

The ``ZstdCompressor`` and ``ZstdDecompressor`` types maintain state to a
or ``ZstdDecompressor`` instance for multiple operations is faster than
instantiating a new ``ZstdCompressor`` or ``ZstdDecompressor`` for each
operation. The differences are magnified as the size of data decreases. For
example, the difference between *context* reuse and non-reuse for 100,000
100 byte inputs will be significant (possibly over 10x faster to reuse
contexts) whereas 10 100,000,000 byte inputs will be more similar in speed
(because the time spent doing compression dwarfs the time spent creating new
*contexts*).

Buffer Types
------------

There are multiple APIs for performing compression and decompression. This is
because different applications have different needs and the library wants to
facilitate optimal use in as many use cases as possible.

At a high level, APIs are divided into *one-shot* and *streaming*: either you
are operating on all data at once or you operate on it piecemeal.

The *one-shot* APIs are useful for small data, where the input or output
size is known. (The size can come from a buffer length, a file size, or
be stored in the zstd frame header.) A limitation of the *one-shot* APIs is
that input and output must fit in memory simultaneously. For say a 4 GB input,
it is important to consider what happens in that object when I/O is requested.
There is potential for long pauses as data is read or written from the
underlying stream (say from interacting with a filesystem or network). This
could add considerable overhead.

  1225 Concepts
  1358 Thread Safety
  1226 ========
  1359 =============
  1227 
  1360 
  1228 It is important to have a basic understanding of how Zstandard works in order
  1361 ``ZstdCompressor`` and ``ZstdDecompressor`` instances have no guarantees
  1229 to optimally use this library. In addition, there are some low-level Python
  1362 about thread safety. Do not operate on the same ``ZstdCompressor`` and
  1230 concepts that are worth explaining to aid understanding. This section aims to
  1363 ``ZstdDecompressor`` instance simultaneously from different threads. It is
  1231 provide that knowledge.
  1364 fine to have different threads call into a single instance, just not at the
  1232 
  1365 same time.
  1233 Zstandard Frames and Compression Format
  1366 
  1234 ---------------------------------------
  1367 Some operations require multiple function calls to complete. e.g. streaming
  1235 
  1368 operations. A single ``ZstdCompressor`` or ``ZstdDecompressor`` cannot be used
  1236 Compressed zstandard data almost always exists within a container called a
  1369 for simultaneously active operations. e.g. you must not start a streaming
  1237 *frame*. (For the technically curious, see the
  1370 operation when another streaming operation is already active.
  1238 `specification <https://github.com/facebook/zstd/blob/3bee41a70eaf343fbcae3637b3f6edbe52f35ed8/doc/zstd_compression_format.md>_.)
  1371 
  1239 
  1372 The C extension releases the GIL during non-trivial calls into the zstd C
  1240 The frame contains a header and optional trailer. The header contains a
  1373 API. Non-trivial calls are notably compression and decompression. Trivial
  1241 magic number to self-identify as a zstd frame and a description of the
  1374 calls are things like parsing frame parameters. Where the GIL is released
  1242 compressed data that follows.
  1375 is considered an implementation detail and can change in any release.
  1243 
  1376 
  1244 Among other things, the frame *optionally* contains the size of the
  1377 APIs that accept bytes-like objects don't enforce that the underlying object
  1245 decompressed data the frame represents, a 32-bit checksum of the
  1378 is read-only. However, it is assumed that the passed object is read-only for
  1246 decompressed data (to facilitate verification during decompression),
  1379 the duration of the function call. It is possible to pass a mutable object
  1247 and the ID of the dictionary used to compress the data.
  1380 (like a ``bytearray``) to e.g. ``ZstdCompressor.compress()``, have the GIL
  1248 
  1381 released, and mutate the object from another thread. Such a race condition
  1249 Storing the original content size in the frame (``write_content_size=True``
  1382 is a bug in the consumer of python-zstandard. Most Python data types are
  1250 to ``ZstdCompressor``) is important for performance in some scenarios. Having
  1383 immutable, so unless you are doing something fancy, you don't need to
  1251 the decompressed size stored there (or storing it elsewhere) allows
  1384 worry about this.
  1252 decompression to perform a single memory allocation that is exactly sized to
       
  1253 the output. This is faster than continuously growing a memory buffer to hold
       
  1254 output.
       
  1255 
       
  1256 Compression and Decompression Contexts
       
  1257 --------------------------------------
       
  1258 
       
  1259 In order to perform a compression or decompression operation with the zstd
       
  1260 C API, you need what's called a *context*. A context essentially holds
       
  1261 configuration and state for a compression or decompression operation. For
       
  1262 example, a compression context holds the configured compression level.
       
  1263 
       
  1264 Contexts can be reused for multiple operations. Since creating and
       
  1265 destroying contexts is not free, there are performance advantages to
       
  1266 reusing contexts.
       
  1267 
       
  1268 The ``ZstdCompressor`` and ``ZstdDecompressor`` types are essentially
       
  1269 wrappers around these contexts in the zstd C API.
       
  1270 
       
  1271 One-shot And Streaming Operations
       
  1272 ---------------------------------
       
  1273 
       
  1274 A compression or decompression operation can either be performed as a
       
  1275 single *one-shot* operation or as a continuous *streaming* operation.
       
  1276 
       
  1277 In one-shot mode (the *simple* APIs provided by the Python interface),
       
  1278 **all** input is handed to the compressor or decompressor as a single buffer
       
  1279 and **all** output is returned as a single buffer.
       
  1280 
       
  1281 In streaming mode, input is delivered to the compressor or decompressor as
       
  1282 a series of chunks via multiple function calls. Likewise, output is
       
  1283 obtained in chunks as well.
       
  1284 
       
  1285 Streaming operations require an additional *stream* object to be created
       
  1286 to track the operation. These are logical extensions of *context*
       
  1287 instances.
       
  1288 
       
  1289 There are advantages and disadvantages to each mode of operation. There
       
  1290 are scenarios where certain modes can't be used. See the
       
  1291 ``Choosing an API`` section for more.
       
  1292 
       
  1293 Dictionaries
       
  1294 ------------
       
  1295 
       
  1296 A compression *dictionary* is essentially data used to seed the compressor
       
  1297 state so it can achieve better compression. The idea is that if you are
       
  1298 compressing a lot of similar pieces of data (e.g. JSON documents or anything
       
  1299 sharing similar structure), then you can find common patterns across multiple
       
  1300 objects then leverage those common patterns during compression and
       
  1301 decompression operations to achieve better compression ratios.
       
  1302 
       
  1303 Dictionary compression is generally only useful for small inputs - data no
       
  1304 larger than a few kilobytes. The upper bound on this range is highly dependent
       
  1305 on the input data and the dictionary.
       
  1306 
       
  1307 Python Buffer Protocol
       
  1308 ----------------------
       
  1309 
       
  1310 Many functions in the library operate on objects that implement Python's
       
  1311 `buffer protocol <https://docs.python.org/3.6/c-api/buffer.html>`_.
       
  1312 
       
  1313 The *buffer protocol* is an internal implementation detail of a Python
       
  1314 type that allows instances of that type (objects) to be exposed as a raw
       
  1315 pointer (or buffer) in the C API. In other words, it allows objects to be
       
  1316 exposed as an array of bytes.
       
  1317 
       
  1318 From the perspective of the C API, objects implementing the *buffer protocol*
       
  1319 all look the same: they are just a pointer to a memory address of a defined
       
  1320 length. This allows the C API to be largely type agnostic when accessing their
       
  1321 data. This allows custom types to be passed in without first converting them
       
  1322 to a specific type.
       
  1323 
       
  1324 Many Python types implement the buffer protocol. These include ``bytes``
       
  1325 (``str`` on Python 2), ``bytearray``, ``array.array``, ``io.BytesIO``,
       
  1326 ``mmap.mmap``, and ``memoryview``.
       

``python-zstandard`` APIs that accept objects conforming to the buffer
protocol require that the buffer is *C contiguous* and has a single
dimension (``ndim==1``). This is usually the case. An example of where
it is not is a two-dimensional NumPy array.
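The requirement can be checked from Python itself: ``memoryview`` wraps any
buffer-protocol object and exposes its shape. A small stdlib-only sketch:

```python
import array

# These common types all expose a one-dimensional, C contiguous buffer,
# which is what python-zstandard's APIs require.
for obj in (b"raw bytes", bytearray(b"mutable"), array.array("B", [1, 2, 3])):
    view = memoryview(obj)
    assert view.ndim == 1 and view.c_contiguous

# A multi-dimensional buffer (such as a NumPy matrix) reports ndim > 1
# and would be rejected. memoryview.cast() can build one without NumPy:
matrix = memoryview(bytes(6)).cast("B", shape=[2, 3])
assert matrix.ndim == 2
```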
       

Requiring Output Sizes for Non-Streaming Decompression APIs
-----------------------------------------------------------

       
Non-streaming decompression APIs require that the output size be known
up front: either the size is recorded in the zstd frame header or passed
into the function, or a maximum output size must be specified. This
restriction is for your safety.
       

The *one-shot* decompression APIs store the decompressed result in a
single buffer. This means that a buffer needs to be pre-allocated to
hold the result. If the decompressed size is not known, there is no
universally good default size to use: any default will fail or be highly
sub-optimal in some scenarios, either because it is too small or because
it stresses the memory allocator by requesting an overly large block.
       

A *helpful* API might retry decompression with buffers of increasing
size. While convenient, this has obvious performance disadvantages,
namely redoing decompression N times until it succeeds. There is also a
security concern. Say the input came from highly compressible data, like
1 GB of the same byte value. The output could be several orders of
magnitude larger than the input: an input of <100KB could decompress to
>1GB. Without a bound on the decompressed size, certain inputs could
exhaust all system memory. That is not good, and it is why a maximum
output size is required.
       

Note on Zstandard's *Experimental* API
======================================

Many of the Zstandard APIs used by this module are marked as
*experimental* within the Zstandard project.

It is unclear how Zstandard's C API will evolve over time, especially
with regard to this *experimental* functionality. We will try to
maintain backwards compatibility at the Python API level. However, we
cannot guarantee this for things not under our control.

Since a copy of the Zstandard source code is distributed with this
module and since we compile against it, the behavior of a specific
version of this module should be constant for all of time. So if you
pin the version of this module used in your projects (which is a Python
best practice), you should be shielded from unwanted future changes.

Donate
======

A lot of time has been invested into this project by the author.