The primary goal of the project is to provide a rich, Pythonic interface to
the underlying C API without sacrificing performance. This means exposing
most of the features and flexibility of the C API while preserving the
usability and safety that Python provides.

The canonical home for this project lives in a Mercurial repository run by
the author. For convenience, that repository is frequently synchronized to
https://github.com/indygreg/python-zstandard.

| |ci-status| |win-ci-status|

Requirements
============

This extension is designed to run with Python 2.7, 3.4, 3.5, and 3.6
on common platforms (Linux, Windows, and OS X). x86 and x86_64 are
well-tested on Windows. Only x86_64 is well-tested on Linux and macOS.

Installing
==========

This package is uploaded to PyPI at https://pypi.python.org/pypi/zstandard.

this package with ``conda``.

Performance
===========

zstandard is a highly tunable compression algorithm. In its default settings
(compression level 3), it will be faster at compression and decompression and
will have better compression ratios than zlib on most data sets. When tuned
for speed, it approaches lz4's speed and ratios. When tuned for compression
ratio, it approaches lzma ratios and compression speed, but decompression
speed is much faster. See the official zstandard documentation for more.

zstandard and this library support multi-threaded compression. There is a
mechanism to compress large inputs using multiple threads.

The performance of this library is usually very similar to what the zstandard
C API can deliver. Overhead in this library is due to general Python overhead
and can't easily be avoided by *any* zstandard Python binding. This library
exposes multiple APIs for performing compression and decompression so callers
can pick an API suitable for their need. Contrast with the compression
modules in Python's standard library (like ``zlib``), which only offer limited
mechanisms for performing operations. The API flexibility means consumers can
choose to use APIs that facilitate zero copying or minimize Python object
creation and garbage collection overhead.

This library is capable of single-threaded throughputs well over 1 GB/s. For
exact numbers, measure yourself. The source code repository has a ``bench.py``
script that can be used to measure things.

API
===

To interface with Zstandard, simply import the ``zstandard`` module::

   import zstandard

It is a popular convention to alias the module as a different name for
brevity::

   import zstandard as zstd

This module attempts to import and use either the C extension or CFFI
implementation. On Python platforms known to support C extensions (like
CPython), it raises an ImportError if the C extension cannot be imported.
On Python platforms known to not support C extensions (like PyPy), it only
attempts to import the CFFI implementation and raises ImportError if that
can't be done. On other platforms, it first tries to import the C extension
then falls back to CFFI if that fails and raises ImportError if CFFI fails.

To change the module import behavior, a ``PYTHON_ZSTANDARD_IMPORT_POLICY``
environment variable can be set. The following values are accepted:

default
   The behavior described above.
cffi_fallback
   Always try to import the C extension then fall back to CFFI if that
   fails.
cext
   Only attempt to import the C extension.
cffi
   Only attempt to import the CFFI implementation.

In addition, the ``zstandard`` module exports a ``backend`` attribute
containing the string name of the backend being used. It will be one
of ``cext`` or ``cffi`` (for *C extension* and *cffi*, respectively).

The types, functions, and attributes exposed by the ``zstandard`` module
are documented in the sections below.

.. note::

   The documentation in this section makes references to various zstd
   concepts and functionality. The source repository contains a
   ``docs/concepts.rst`` file explaining these in more detail.

ZstdCompressor
--------------

The ``ZstdCompressor`` class provides an interface for performing
compression operations. Each instance is essentially a wrapper around a
``ZSTD_CCtx`` from the C API.

Each instance is associated with parameters that control compression
behavior. These come from the following named arguments (all optional):

level
   Integer compression level. Valid values are between 1 and 22.
dict_data
   Compression dictionary to use.

   Note: When using dictionary data and ``compress()`` is called multiple
   times, the ``ZstdCompressionParameters`` derived from an integer
   compression ``level`` and the first compressed data's size will be reused
   for all subsequent operations. This may not be desirable if source data
   size varies significantly.
compression_params
   A ``ZstdCompressionParameters`` instance defining compression settings.
write_checksum
   Whether a 4 byte checksum should be written with the compressed data.
   Defaults to False. If True, the decompressor can verify that decompressed
   data matches the original input data.
write_content_size
   Whether the size of the uncompressed data will be written into the
   header of compressed data. Defaults to True. The data will only be
   written if the compressor knows the size of the input data. This is
   often not true for streaming compression.
write_dict_id
   Whether to write the dictionary ID into the compressed data.
   Defaults to True. The dictionary ID is only written if a dictionary
   is being used.
threads
   Read below for more info on multi-threaded compression. This argument only
   controls thread count for operations that operate on individual pieces of
   data. APIs that spawn multiple threads for working on multiple pieces of
   data have their own ``threads`` argument.

``compression_params`` is mutually exclusive with ``level``, ``write_checksum``,
``write_content_size``, ``write_dict_id``, and ``threads``.

Unless specified otherwise, assume that no two methods of ``ZstdCompressor``
instances can be called from multiple Python threads simultaneously. In other
words, assume instances are not thread safe unless stated otherwise.

Utility Methods
^^^^^^^^^^^^^^^

``frame_progression()`` returns a 3-tuple containing the number of bytes
ingested, consumed, and produced by the current compression operation.

``memory_size()`` obtains the memory utilization of the underlying zstd
compression context, in bytes.::

   cctx = zstd.ZstdCompressor()
   memory = cctx.memory_size()

Simple API
^^^^^^^^^^

``compress(data)`` compresses and returns data as a one-shot operation.::

   cctx = zstd.ZstdCompressor()
   compressed = cctx.compress(b'data to compress')

The ``data`` argument can be any object that implements the *buffer protocol*.

Stream Reader API
^^^^^^^^^^^^^^^^^

``stream_reader(source)`` can be used to obtain an object conforming to the
``io.RawIOBase`` interface for reading compressed output as a stream::

   with open(path, 'rb') as fh:
       cctx = zstd.ZstdCompressor()
       with cctx.stream_reader(fh) as reader:
           while True:
               chunk = reader.read(16384)
               if not chunk:
                   break

               # Do something with compressed chunk.

The stream can only be read within a context manager. When the context
manager exits, the stream is closed, the underlying resource is released,
and future operations against the compression stream will fail.

The ``source`` argument to ``stream_reader()`` can be any object with a
``read(size)`` method or any object implementing the *buffer protocol*.

``stream_reader()`` accepts a ``size`` argument specifying how large the input
stream is. This is used to adjust compression parameters so they are
tailored to the source size.::

   with open(path, 'rb') as fh:
       cctx = zstd.ZstdCompressor()
       with cctx.stream_reader(fh, size=os.stat(path).st_size) as reader:
           ...

If the ``source`` is a stream, you can specify how large ``read()`` requests
to that stream should be via the ``read_size`` argument. It defaults to
``zstandard.COMPRESSION_RECOMMENDED_INPUT_SIZE``.::

   with open(path, 'rb') as fh:
       cctx = zstd.ZstdCompressor()
       # Will perform fh.read(8192) when obtaining data to feed into the
       # compressor.
       with cctx.stream_reader(fh, read_size=8192) as reader:
           ...

The stream returned by ``stream_reader()`` is neither writable nor seekable
(even if the underlying source is seekable). ``readline()`` and
``readlines()`` are not implemented because they don't make sense for
compressed data. ``tell()`` returns the number of compressed bytes
emitted so far.

Streaming Input API
^^^^^^^^^^^^^^^^^^^

``stream_writer(fh)`` (which behaves as a context manager) allows you to
*stream* data into a compressor.::

   cctx = zstd.ZstdCompressor(level=10)
   with cctx.stream_writer(fh) as compressor:
       compressor.write(b'chunk 0')
       compressor.write(b'chunk 1')
       ...

The argument to ``stream_writer()`` must have a ``write(data)`` method. As
compressed data is available, ``write()`` will be called with the compressed
data as its argument. Many common Python types implement ``write()``, including
open file handles and ``io.BytesIO``.

``stream_writer()`` returns an object representing a streaming compressor
instance. It **must** be used as a context manager. That object's
``write(data)`` method is used to feed data into the compressor.

A ``flush()`` method can be called to evict whatever data remains within the
compressor's internal state into the output object. This may result in 0 or
more ``write()`` calls to the output object.

content size being written into the frame header of the output data.

The size of chunks being passed to ``write()`` on the destination can be
specified::

   cctx = zstd.ZstdCompressor()
   with cctx.stream_writer(fh, write_size=32768) as compressor:
       ...

To see how much memory is being used by the streaming compressor::

   cctx = zstd.ZstdCompressor()
   with cctx.stream_writer(fh) as compressor:
       ...
       byte_size = compressor.memory_size()

The total number of bytes written so far is exposed via ``tell()``::

   cctx = zstd.ZstdCompressor()
   with cctx.stream_writer(fh) as compressor:
       ...
       total_written = compressor.tell()

Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_to_iter(reader)`` provides a mechanism to stream data out of a
compressor as an iterator of data chunks.::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_to_iter(fh):
       # Do something with emitted data.

``read_to_iter()`` accepts an object that has a ``read(size)`` method or
conforms to the buffer protocol.

Uncompressed data is fetched from the source either by calling ``read(size)``
or by fetching a slice of data from the object directly (in the case where
the buffer protocol is being used). The returned iterator consists of chunks
of compressed data.

If reading from the source via ``read()``, ``read()`` will be called until
it raises or returns an empty bytes (``b''``). It is perfectly valid for
the source to deliver fewer bytes than were requested by ``read(size)``.

Like ``stream_writer()``, ``read_to_iter()`` also accepts a ``size`` argument
declaring the size of the input stream::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_to_iter(fh, size=some_int):
       pass

You can also control the size that data is ``read()`` from the source and
the ideal size of output chunks::

   cctx = zstd.ZstdCompressor()
   for chunk in cctx.read_to_iter(fh, read_size=16384, write_size=8192):
       pass

Unlike ``stream_writer()``, ``read_to_iter()`` does not give direct control
over the sizes of chunks fed into the compressor. Instead, chunk sizes will
be whatever the object being read from delivers. These will often be of a
uniform size.

Stream Copying API
^^^^^^^^^^^^^^^^^^

``copy_stream(ifh, ofh)`` can be used to copy data between 2 streams while

ZstdDecompressor
----------------

The ``ZstdDecompressor`` class provides an interface for performing
decompression. It is effectively a wrapper around the ``ZSTD_DCtx`` type from
the C API.

Each instance is associated with parameters that control decompression. These
come from the following named arguments (all optional):

dict_data
   Compression dictionary to use.
max_window_size
   Sets an upper limit on the window size for decompression operations in
   kibibytes. This setting can be used to prevent large memory allocations
   for inputs using large compression windows.
format
   Set the format of data for the decoder. By default, this is
   ``zstd.FORMAT_ZSTD1``. It can be set to ``zstd.FORMAT_ZSTD1_MAGICLESS`` to
   allow decoding frames without the 4 byte magic header. Not all decompression
   APIs support this mode.

The interface of this class is very similar to ``ZstdCompressor`` (by design).

Unless specified otherwise, assume that no two methods of ``ZstdDecompressor``
instances can be called from multiple Python threads simultaneously. In other
words, assume instances are not thread safe unless stated otherwise.

Utility Methods
^^^^^^^^^^^^^^^

``memory_size()`` obtains the size of the underlying zstd decompression
context, in bytes.::

   dctx = zstd.ZstdDecompressor()
   size = dctx.memory_size()

Simple API
^^^^^^^^^^

``decompress(data)`` can be used to decompress an entire compressed zstd
frame in a single operation::

   dctx = zstd.ZstdDecompressor()
   decompressed = dctx.decompress(data)

By default, ``decompress(data)`` will only work on data written with the content
size encoded in its header (this is the default behavior of
``ZstdCompressor().compress()`` but may not be true for streaming compression).
If compressed data without an embedded content size is seen, ``zstd.ZstdError``
will be raised.

If the compressed data doesn't have its content size embedded within it,
decompression can be attempted by specifying the ``max_output_size``
argument.

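For illustration, a sketch of capping the output allocation (the ``2097152``
limit is an arbitrary caller-chosen value, not an API default)::

   dctx = zstd.ZstdDecompressor()
   # Fail rather than decompress to more than 2 MiB of output.
   uncompressed = dctx.decompress(data, max_output_size=2097152)
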
Please note that an allocation of the requested ``max_output_size`` will be
performed every time the method is called. Setting to a very large value could
result in a lot of work for the memory allocator and may result in
``MemoryError`` being raised if the allocation fails.

.. important::

   If the exact size of decompressed data is unknown (not passed in explicitly
   and not stored in the zstandard frame), for performance reasons it is
   encouraged to use a streaming API.

|
Stream Reader API
^^^^^^^^^^^^^^^^^

``stream_reader(source)`` can be used to obtain an object conforming to the
``io.RawIOBase`` interface for reading decompressed output as a stream::

   with open(path, 'rb') as fh:
       dctx = zstd.ZstdDecompressor()
       with dctx.stream_reader(fh) as reader:
           while True:
               chunk = reader.read(16384)
               if not chunk:
                   break

               # Do something with decompressed chunk.

The stream can only be read within a context manager. When the context
manager exits, the stream is closed, the underlying resource is released,
and future operations against the stream will fail.

The ``source`` argument to ``stream_reader()`` can be any object with a
``read(size)`` method or any object implementing the *buffer protocol*.

If the ``source`` is a stream, you can specify how large ``read()`` requests
to that stream should be via the ``read_size`` argument. It defaults to
``zstandard.DECOMPRESSION_RECOMMENDED_INPUT_SIZE``::

   with open(path, 'rb') as fh:
       dctx = zstd.ZstdDecompressor()
       # Will perform fh.read(8192) when obtaining data for the decompressor.
       with dctx.stream_reader(fh, read_size=8192) as reader:
           ...

The stream returned by ``stream_reader()`` is not writable.

The stream returned by ``stream_reader()`` is *partially* seekable.
Absolute and relative positions (``SEEK_SET`` and ``SEEK_CUR``) forward
of the current position are allowed. Offsets behind the current read
position and offsets relative to the end of stream are not allowed and
will raise ``ValueError`` if attempted.

``tell()`` returns the number of decompressed bytes read so far.

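A sketch of forward seeking (offsets are in *decompressed* bytes; the ``4096``
value here is arbitrary)::

   with open(path, 'rb') as fh:
       dctx = zstd.ZstdDecompressor()
       with dctx.stream_reader(fh) as reader:
           # Skip the first 4096 bytes of decompressed output.
           reader.seek(4096)
           assert reader.tell() == 4096
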
|
Not all I/O methods are implemented. Notably missing is support for
``readline()``, ``readlines()``, and linewise iteration. Support for
these is planned for a future release.

Streaming Input API
^^^^^^^^^^^^^^^^^^^

``stream_writer(fh)`` can be used to incrementally send compressed data to a
decompressor::

   dctx = zstd.ZstdDecompressor()
   with dctx.stream_writer(fh) as decompressor:
       decompressor.write(compressed_data)

This behaves similarly to ``zstd.ZstdCompressor``: compressed data is written to
the decompressor by calling ``write(data)`` and decompressed output is written
to the output object by calling its ``write(data)`` method.

Calls to ``write()`` will return the number of bytes written to the output
object. Not all inputs will result in bytes being written, so return values
of ``0`` are possible.

The size of chunks written to the destination can be specified via the
``write_size`` argument::

   dctx = zstd.ZstdDecompressor()
   with dctx.stream_writer(fh, write_size=16384) as decompressor:
       pass

You can see how much memory is being used by the decompressor::

   dctx = zstd.ZstdDecompressor()
   with dctx.stream_writer(fh) as decompressor:
       byte_size = decompressor.memory_size()

Streaming Output API
^^^^^^^^^^^^^^^^^^^^

``read_to_iter(fh)`` provides a mechanism to stream decompressed data out of a
compressed source as an iterator of data chunks::

   dctx = zstd.ZstdDecompressor()
   for chunk in dctx.read_to_iter(fh):
       # Do something with original data.
       pass

``read_to_iter()`` accepts an object with a ``read(size)`` method that will
return compressed bytes or an object conforming to the buffer protocol that
can expose its data as a contiguous range of bytes.

``read_to_iter()`` returns an iterator whose elements are chunks of the
decompressed data.

The size of requested ``read()`` from the source can be specified::

   dctx = zstd.ZstdDecompressor()
   for chunk in dctx.read_to_iter(fh, read_size=16384):
       pass

It is also possible to skip leading bytes in the input data::

   dctx = zstd.ZstdDecompressor()
   for chunk in dctx.read_to_iter(fh, skip_bytes=1):
       pass

.. tip::

   Skipping leading bytes is useful if the source data contains extra
   *header* data. Traditionally, you would need to create a slice or
   ``memoryview`` of the data you want to decompress. This would create
   overhead. It is more efficient to pass the offset into this API.

Similarly to ``ZstdCompressor.read_to_iter()``, the consumer of the iterator
controls when data is decompressed. If the iterator isn't consumed,
decompression is put on hold.

When ``read_to_iter()`` is passed an object conforming to the buffer protocol,
the behavior may seem similar to what occurs when the simple decompression
API is used. However, this API works when the decompressed size is unknown.
Furthermore, if feeding large inputs, the decompressor will work in chunks
instead of performing a single operation.

conform to the buffer protocol. For best performance, pass a
``BufferWithSegmentsCollection`` or a ``BufferWithSegments``, as
minimal input validation will be done for that type. If calling from
Python (as opposed to C), constructing one of these instances may add
overhead cancelling out the performance overhead of validation for list
inputs::

   dctx = zstd.ZstdDecompressor()
   results = dctx.multi_decompress_to_buffer([b'...', b'...'])

The decompressed size of each frame MUST be discoverable. It can either be
embedded within the zstd frame (``write_content_size=True`` argument to
``ZstdCompressor``) or passed in via the ``decompressed_sizes`` argument.

The ``decompressed_sizes`` argument is an object conforming to the buffer
protocol which holds an array of 64-bit unsigned integers in the machine's
native format defining the decompressed sizes of each frame. If this argument
is passed, it avoids having to scan each frame for its decompressed size.
This frame scanning can add noticeable overhead in some scenarios::

   frames = [...]
   sizes = struct.pack('=QQQQ', len0, len1, len2, len3)

   dctx = zstd.ZstdDecompressor()
   results = dctx.multi_decompress_to_buffer(frames, decompressed_sizes=sizes)

The ``threads`` argument controls the number of threads to use to perform
decompression operations. The default (``0``) or the value ``1`` means to
use a single thread. Negative values use the number of logical CPUs in the
machine.

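For instance, assuming ``frames`` holds a list of compressed frames, a sketch
of requesting one decompression thread per logical CPU::

   dctx = zstd.ZstdDecompressor()
   results = dctx.multi_decompress_to_buffer(frames, threads=-1)
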
This function exists to perform decompression on multiple frames as fast
as possible by having as little overhead as possible. Since decompression is
performed as a single operation and since the decompressed output is stored in
a single buffer, extra memory allocations, Python objects, and Python function
calls are avoided. This is ideal for scenarios where callers know up front that
they need to access data for multiple frames, such as when *delta chains* are
being used.

Currently, the implementation always spawns multiple threads when requested,
even if the amount of work to do is small. In the future, it will be smarter
about avoiding threads and their associated overhead when the amount of
work to do is small.

Prefix Dictionary Chain Decompression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``decompress_content_dict_chain(frames)`` performs decompression of a list of
zstd frames produced using chained *prefix* dictionary compression. Such
a list of frames is produced by compressing discrete inputs where each
non-initial input is compressed with a *prefix* dictionary consisting of the
content of the previous input.

For example, say you have the following inputs::

   inputs = [b'input 1', b'input 2', b'input 3']

The zstd frame chain consists of:

1. ``b'input 1'`` compressed in standalone/discrete mode
2. ``b'input 2'`` compressed using ``b'input 1'`` as a *prefix* dictionary
3. ``b'input 3'`` compressed using ``b'input 2'`` as a *prefix* dictionary

Each zstd frame **must** have the content size written.

The following Python code can be used to produce a *prefix dictionary chain*::

   def make_chain(inputs):
       frames = []

       # First frame is compressed in standalone/discrete mode.
       zctx = zstd.ZstdCompressor()
       frames.append(zctx.compress(inputs[0]))

       # Subsequent frames use the previous fulltext as a prefix dictionary.
       for i, raw in enumerate(inputs[1:]):
           dict_data = zstd.ZstdCompressionDict(
               inputs[i], dict_type=zstd.DICT_TYPE_RAWCONTENT)
           zctx = zstd.ZstdCompressor(dict_data=dict_data)
           frames.append(zctx.compress(raw))

       return frames

``decompress_content_dict_chain()`` returns the uncompressed data of the last
element in the input chain.

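For example, the frames produced by ``make_chain()`` above can be decoded with
a single call (``inputs`` as defined earlier)::

   frames = make_chain(inputs)

   dctx = zstd.ZstdDecompressor()
   data = dctx.decompress_content_dict_chain(frames)
   # ``data`` is the fulltext of the last input, b'input 3'.
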
.. note::

   It is possible to implement *prefix dictionary chain* decompression
   on top of other APIs. However, this function will likely be faster,
   especially for long input chains, as it avoids the overhead of
   instantiating and passing around intermediate objects between C and
   Python.

Multi-Threaded Compression
--------------------------

``ZstdCompressor`` accepts a ``threads`` argument that controls the number
of threads to use for compression. The way this works is that input is split
into segments and each segment is fed into a worker pool for compression. Once
a segment is compressed, it is flushed/appended to the output.

.. note::

   These threads are created at the C layer and are not Python threads. So they
   work outside the GIL. It is therefore possible to CPU saturate multiple cores
   from Python.

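As a sketch, multi-threaded compression is requested entirely through the
``threads`` argument (the ``data`` input is a placeholder)::

   cctx = zstd.ZstdCompressor(threads=-1)  # one thread per logical CPU
   compressed = cctx.compress(data)
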
The segment size for multi-threaded compression is chosen from the window size
of the compressor. This is derived from the ``window_log`` attribute of a
``ZstdCompressionParameters`` instance. By default, segment sizes are in the
1+MB range.

If multi-threaded compression is requested and the input is smaller than the
configured segment size, only a single compression thread will be used. If the
input is smaller than the segment size multiplied by the thread pool size or
if data cannot be delivered to the compressor fast enough, not all requested
threads can be active simultaneously.

Due to the nature of multi-threaded compression using *N* compression
*states*, the output from multi-threaded compression will likely be larger
than non-multi-threaded compression. The difference is usually small. But
there is a CPU/wall time versus size trade-off that may warrant investigation.

Output from multi-threaded compression does not require any special handling
on the decompression side. To the decompressor, data generated with a
single-threaded compressor looks the same as data generated by a
multi-threaded compressor and does not require any special handling or
additional resource requirements.

Dictionary Creation and Management
----------------------------------

Compression dictionaries are represented with the ``ZstdCompressionDict`` type.

Instances can be constructed from bytes::

   dict_data = zstd.ZstdCompressionDict(data)

It is possible to construct a dictionary from *any* data. If the data doesn't
begin with a magic header, it will be treated as a *prefix* dictionary.
*Prefix* dictionaries allow compression operations to reference raw data
within the dictionary.

It is possible to force the use of *prefix* dictionaries or to require a
dictionary header::

   dict_data = zstd.ZstdCompressionDict(data,
                                        dict_type=zstd.DICT_TYPE_RAWCONTENT)

   dict_data = zstd.ZstdCompressionDict(data,
                                        dict_type=zstd.DICT_TYPE_FULLDICT)

You can see how many bytes are in the dictionary by calling ``len()``::

   dict_data = zstd.train_dictionary(size, samples)
   dict_size = len(dict_data)  # will not be larger than ``size``

Once you have a dictionary, you can pass it to the objects performing
compression and decompression::

   dict_data = zstd.train_dictionary(131072, samples)

   cctx = zstd.ZstdCompressor(dict_data=dict_data)
   for source_data in input_data:
       compressed = cctx.compress(source_data)
       # Do something with compressed data.

   dctx = zstd.ZstdDecompressor(dict_data=dict_data)
   for compressed_data in input_data:
       buffer = io.BytesIO()
       with dctx.stream_writer(buffer) as decompressor:
           decompressor.write(compressed_data)
       # Do something with raw data in ``buffer``.

Dictionaries have unique integer IDs. You can retrieve this ID via::

   dict_id = zstd.dictionary_id(dict_data)

You can obtain the raw data in the dictionary (useful for constructing
a ``ZstdCompressionDict`` later) via ``as_bytes()``::

   dict_data = zstd.train_dictionary(size, samples)
   raw_data = dict_data.as_bytes()

By default, when a ``ZstdCompressionDict`` is *attached* to a
``ZstdCompressor``, each ``ZstdCompressor`` performs work to prepare the
dictionary for use. This is fine if only 1 compression operation is being
performed or if the ``ZstdCompressor`` is being reused for multiple operations.
But if multiple ``ZstdCompressor`` instances are being used with the dictionary,
this can add overhead.

It is possible to *precompute* the dictionary so it can readily be consumed
by multiple ``ZstdCompressor`` instances::

   d = zstd.ZstdCompressionDict(data)

   # Precompute for compression level 3.
   d.precompute_compress(level=3)

   # Precompute with specific compression parameters.
   params = zstd.ZstdCompressionParameters(...)
   d.precompute_compress(compression_params=params)

.. note::

   When a dictionary is precomputed, the compression parameters used to
   precompute the dictionary overwrite some of the compression parameters
   specified to ``ZstdCompressor.__init__``.

|
Training Dictionaries
^^^^^^^^^^^^^^^^^^^^^

Unless using *prefix* dictionaries, dictionary data is produced by *training*
on existing data::

   dict_data = zstd.train_dictionary(size, samples)

This takes a target dictionary size and list of bytes instances and creates and
returns a ``ZstdCompressionDict``.

The dictionary training mechanism is known as *cover*. More details about it are
available in the paper *Effective Construction of Relative Lempel-Ziv
Dictionaries* (authors: Liao, Petri, Moffat, Wirth).

The cover algorithm takes parameters ``k`` and ``d``. These are the
*segment size* and *dmer size*, respectively. The returned dictionary
instance created by this function has ``k`` and ``d`` attributes
containing the values for these parameters. If a ``ZstdCompressionDict``
is constructed from raw bytes data (a content-only dictionary), the
``k`` and ``d`` attributes will be ``0``.

The segment and dmer size parameters to the cover algorithm can either be
specified manually or ``train_dictionary()`` can try multiple values
and pick the best one, where *best* means the smallest compressed data size.
This latter mode is called *optimization* mode.

If none of ``k``, ``d``, ``steps``, ``threads``, ``level``, ``notifications``,
or ``dict_id`` (basically anything from the underlying ``ZDICT_cover_params_t``
struct) are defined, *optimization* mode is used with default parameter
values.

If ``steps`` or ``threads`` are defined, then *optimization* mode is engaged
with explicit control over those parameters. Specifying ``threads=0`` or
``threads=1`` can be used to engage *optimization* mode if other parameters
are not defined.

Otherwise, non-*optimization* mode is used with the parameters specified.

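A short sketch of both modes (the sample data, dictionary size, and
``k``/``d`` values here are illustrative placeholders, not recommendations)::

   samples = [b'sample %d ' % i + b'common content ' * 8 for i in range(128)]

   # Optimization mode: search for good cover parameters automatically.
   dict_data = zstd.train_dictionary(16384, samples, threads=-1)

   # Non-optimization mode: specify the cover parameters explicitly.
   dict_data = zstd.train_dictionary(16384, samples, k=64, d=8)
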
897 This function takes the following arguments: |
991 This function takes the following arguments: |
898 |
992 |
899 dict_size |
993 dict_size |
900 Target size in bytes of the dictionary to generate. |
994 Target size in bytes of the dictionary to generate. |
907 Parameter to cover algorithm defining the dmer size. A reasonable range is |
1001 Parameter to cover algorithm defining the dmer size. A reasonable range is |
908 [6, 16]. ``d`` must be less than or equal to ``k``. |
1002 [6, 16]. ``d`` must be less than or equal to ``k``. |
909 dict_id |
1003 dict_id |
910 Integer dictionary ID for the produced dictionary. Default is 0, which uses |
1004 Integer dictionary ID for the produced dictionary. Default is 0, which uses |
911 a random value. |
1005 a random value. |
steps
   Number of steps through ``k`` values to perform when trying parameter
   variations.
threads
   Number of threads to use when trying parameter variations. Default is 0,
   which means to use a single thread. A negative value can be specified to
   use as many threads as there are detected logical CPUs.
level
   Integer target compression level when trying parameter variations.
notifications
   Controls writing of informational messages to ``stderr``. ``0`` (the
   default) means to write nothing. ``1`` writes errors. ``2`` writes
   progression info. ``3`` writes more details. And ``4`` writes all info.

Explicit Compression Parameters
-------------------------------

Zstandard offers a high-level *compression level* that maps to lower-level
compression parameters. For many consumers, this numeric level is the only
compression setting you'll need to touch.

But for advanced use cases, it might be desirable to tweak these lower-level
settings.

The ``ZstdCompressionParameters`` type represents these low-level compression
settings.

Instances of this type can be constructed from a myriad of keyword arguments
(defined below) for complete low-level control over each adjustable
compression setting.

From a higher level, one can construct a ``ZstdCompressionParameters`` instance
given a desired compression level and target input and dictionary size
using ``ZstdCompressionParameters.from_level()``. e.g.::

   # Derive compression settings for compression level 7.
   params = zstd.ZstdCompressionParameters.from_level(7)

   # With an input size of 1MB
   params = zstd.ZstdCompressionParameters.from_level(7, source_size=1048576)

Using ``from_level()``, it is also possible to override individual compression
parameters or to define additional settings that aren't automatically derived.
e.g.::

   params = zstd.ZstdCompressionParameters.from_level(4, window_log=10)
   params = zstd.ZstdCompressionParameters.from_level(5, threads=4)

Or you can define low-level compression settings directly::

   params = zstd.ZstdCompressionParameters(window_log=12, enable_ldm=True)

Once a ``ZstdCompressionParameters`` instance is obtained, it can be used to
configure a compressor::

   cctx = zstd.ZstdCompressor(compression_params=params)

The named arguments and attributes of ``ZstdCompressionParameters`` are as
follows:

* format
* compression_level
* window_log
* hash_log
* chain_log
* search_log
* min_match
* target_length
* compression_strategy
* write_content_size
* write_checksum
* write_dict_id
* job_size
* overlap_size_log
* compress_literals
* force_max_window
* enable_ldm
* ldm_hash_log
* ldm_min_match
* ldm_bucket_size_log
* ldm_hash_every_log
* threads

Some of these are very low-level settings. It may help to consult the official
zstandard documentation for their behavior. Look for the ``ZSTD_p_*`` constants
in ``zstd.h`` (https://github.com/facebook/zstd/blob/dev/lib/zstd.h).

Frame Inspection
----------------

Data emitted from zstd compression is encapsulated in a *frame*. This frame
it is important to consider what happens in that object when I/O is requested.
There is potential for long pauses as data is read or written from the
underlying stream (say from interacting with a filesystem or network). This
could add considerable overhead.

Thread Safety
=============

``ZstdCompressor`` and ``ZstdDecompressor`` instances have no guarantees
about thread safety. Do not operate on the same ``ZstdCompressor`` or
``ZstdDecompressor`` instance simultaneously from different threads. It is
fine to have different threads call into a single instance, just not at the
same time.

Some operations require multiple function calls to complete. e.g. streaming
operations. A single ``ZstdCompressor`` or ``ZstdDecompressor`` cannot be used
for simultaneously active operations. e.g. you must not start a streaming
operation when another streaming operation is already active.

The C extension releases the GIL during non-trivial calls into the zstd C
API. Non-trivial calls are notably compression and decompression. Trivial
calls are things like parsing frame parameters. Where the GIL is released
is considered an implementation detail and can change in any release.

APIs that accept bytes-like objects don't enforce that the underlying object
is read-only. However, it is assumed that the passed object is read-only for
the duration of the function call. It is possible to pass a mutable object
(like a ``bytearray``) to e.g. ``ZstdCompressor.compress()``, have the GIL
released, and mutate the object from another thread. Such a race condition
is a bug in the consumer of python-zstandard. Most Python data types are
immutable, so unless you are doing something fancy, you don't need to
worry about this.

Concepts
========

It is important to have a basic understanding of how Zstandard works in order
to optimally use this library. In addition, there are some low-level Python
concepts that are worth explaining to aid understanding. This section aims to
provide that knowledge.

Zstandard Frames and Compression Format
---------------------------------------

Compressed zstandard data almost always exists within a container called a
*frame*. (For the technically curious, see the
`specification <https://github.com/facebook/zstd/blob/3bee41a70eaf343fbcae3637b3f6edbe52f35ed8/doc/zstd_compression_format.md>`_.)

The frame contains a header and optional trailer. The header contains a
magic number to self-identify as a zstd frame and a description of the
compressed data that follows.

Among other things, the frame *optionally* contains the size of the
decompressed data the frame represents, a 32-bit checksum of the
decompressed data (to facilitate verification during decompression),
and the ID of the dictionary used to compress the data.

Storing the original content size in the frame (``write_content_size=True``
to ``ZstdCompressor``) is important for performance in some scenarios. Having
the decompressed size stored there (or storing it elsewhere) allows
decompression to perform a single memory allocation that is exactly sized to
the output. This is faster than continuously growing a memory buffer to hold
output.
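As a small illustration of the header's self-identifying magic number, a
buffer can be sniffed for the zstd frame magic using only the standard
library. The magic value comes from the frame format specification linked
above; the helper function itself is purely illustrative:

```python
import struct

# Magic number that begins every zstd frame, per the frame format
# specification (bytes 28 B5 2F FD on the wire, read as little-endian).
ZSTD_FRAME_MAGIC = 0xFD2FB528

def looks_like_zstd_frame(data):
    """Return True if ``data`` begins with the zstd frame magic number."""
    if len(data) < 4:
        return False
    (magic,) = struct.unpack("<I", data[:4])
    return magic == ZSTD_FRAME_MAGIC

assert looks_like_zstd_frame(b"\x28\xb5\x2f\xfd" + b"\x00" * 8)
assert not looks_like_zstd_frame(b"not zstd data")
```

This only checks the first four bytes; it says nothing about whether the rest
of the frame is valid.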
|
Compression and Decompression Contexts
--------------------------------------

In order to perform a compression or decompression operation with the zstd
C API, you need what's called a *context*. A context essentially holds
configuration and state for a compression or decompression operation. For
example, a compression context holds the configured compression level.

Contexts can be reused for multiple operations. Since creating and
destroying contexts is not free, there are performance advantages to
reusing contexts.

The ``ZstdCompressor`` and ``ZstdDecompressor`` types are essentially
wrappers around these contexts in the zstd C API.
|
One-shot And Streaming Operations
---------------------------------

A compression or decompression operation can either be performed as a
single *one-shot* operation or as a continuous *streaming* operation.

In one-shot mode (the *simple* APIs provided by the Python interface),
**all** input is handed to the compressor or decompressor as a single buffer
and **all** output is returned as a single buffer.

In streaming mode, input is delivered to the compressor or decompressor as
a series of chunks via multiple function calls. Likewise, output is
obtained in chunks as well.

Streaming operations require an additional *stream* object to be created
to track the operation. These are logical extensions of *context*
instances.

There are advantages and disadvantages to each mode of operation. There
are scenarios where certain modes can't be used. See the
``Choosing an API`` section for more.
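The distinction between the two modes is not specific to zstd. As a sketch,
here is the same dichotomy expressed with the standard library's ``zlib``
module standing in for the zstd APIs:

```python
import zlib

data = b"example payload " * 1000

# One-shot: all input is passed as a single buffer and all output is
# returned as a single buffer.
one_shot = zlib.compress(data)

# Streaming: input is fed in chunks via multiple calls and output is
# collected in chunks as it becomes available.
compressor = zlib.compressobj()
chunks = []
for offset in range(0, len(data), 4096):
    chunks.append(compressor.compress(data[offset:offset + 4096]))
chunks.append(compressor.flush())

# Both modes round-trip to the original input.
assert zlib.decompress(one_shot) == data
assert zlib.decompress(b"".join(chunks)) == data
```

Streaming is the mode of choice when the input doesn't fit comfortably in
memory or arrives incrementally (e.g. from a network socket).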
|
Dictionaries
------------

A compression *dictionary* is essentially data used to seed the compressor
state so it can achieve better compression. The idea is that if you are
compressing a lot of similar pieces of data (e.g. JSON documents or anything
sharing similar structure), then you can find common patterns across multiple
objects and leverage those common patterns during compression and
decompression operations to achieve better compression ratios.

Dictionary compression is generally only useful for small inputs - data no
larger than a few kilobytes. The upper bound on this range is highly dependent
on the input data and the dictionary.
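The idea can be demonstrated with the standard library's ``zlib``, which
supports a preset dictionary via its ``zdict`` argument. The sample records
and the hand-built dictionary below are made up for illustration:

```python
import zlib

# Small, structurally similar records - the sweet spot for dictionaries.
records = [
    b'{"user": "alice", "action": "login", "ok": true}',
    b'{"user": "bob", "action": "logout", "ok": true}',
    b'{"user": "carol", "action": "login", "ok": false}',
]

# A crude hand-built "dictionary": patterns shared across the records.
zdict = b'{"user": "", "action": "login", "action": "logout", "ok": true, "ok": false}'

plain_total = 0
dict_total = 0
for record in records:
    plain_total += len(zlib.compress(record))
    cobj = zlib.compressobj(zdict=zdict)
    compressed = cobj.compress(record) + cobj.flush()
    dict_total += len(compressed)
    # Decompression requires the same dictionary to reconstruct the input.
    dobj = zlib.decompressobj(zdict=zdict)
    assert dobj.decompress(compressed) == record

# Seeding the compressor with shared patterns shrinks the total output.
assert dict_total < plain_total
```

Real dictionary training (as done by zstd's cover algorithm) discovers these
shared patterns automatically from sample inputs instead of relying on a
hand-built byte string.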
|
Python Buffer Protocol
----------------------

Many functions in the library operate on objects that implement Python's
`buffer protocol <https://docs.python.org/3.6/c-api/buffer.html>`_.

The *buffer protocol* is an internal implementation detail of a Python
type that allows instances of that type (objects) to be exposed as a raw
pointer (or buffer) in the C API. In other words, it allows objects to be
exposed as an array of bytes.

From the perspective of the C API, objects implementing the *buffer protocol*
all look the same: they are just a pointer to a memory address of a defined
length. This allows the C API to be largely type agnostic when accessing their
data. This allows custom types to be passed in without first converting them
to a specific type.

Many Python types implement the buffer protocol. These include ``bytes``
(``str`` on Python 2), ``bytearray``, ``array.array``, ``mmap.mmap``, and
``memoryview``.

``python-zstandard`` APIs that accept objects conforming to the buffer
protocol require that the buffer is *C contiguous* and has a single
dimension (``ndim==1``). This is usually the case. An example of where it
is not is a Numpy matrix type.
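These requirements can be checked from Python itself via ``memoryview``,
which exposes the relevant buffer metadata. A quick sketch:

```python
import array

# Several common types expose their storage through the buffer protocol.
for obj in (b"raw bytes", bytearray(b"raw bytes"), array.array("B", b"raw")):
    view = memoryview(obj)
    # python-zstandard requires C contiguous, one-dimensional buffers.
    assert view.c_contiguous
    assert view.ndim == 1

# A multi-dimensional view (akin to a matrix) fails the ndim == 1 check.
matrix = memoryview(bytearray(6)).cast("B", (2, 3))
assert matrix.ndim == 2
```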
|
Requiring Output Sizes for Non-Streaming Decompression APIs
-----------------------------------------------------------

Non-streaming decompression APIs require that either the output size is
explicitly defined (either in the zstd frame header or passed into the
function) or that a max output size is specified. This restriction is for
your safety.

The *one-shot* decompression APIs store the decompressed result in a
single buffer. This means that a buffer needs to be pre-allocated to hold
the result. If the decompressed size is not known, then there is no universal
good default size to use. Any default will fail or will be highly sub-optimal
in some scenarios (it will either be too small or will put stress on the
memory allocator to allocate a too large block).

A *helpful* API may retry decompression with buffers of increasing size.
While useful, there are obvious performance disadvantages, namely redoing
decompression N times until it works. In addition, there is a security
concern. Say the input came from highly compressible data, like 1 GB of the
same byte value. The output size could be several magnitudes larger than the
input size. An input of <100KB could decompress to >1GB. Without a bounds
restriction on the decompressed size, certain inputs could exhaust all system
memory. That's not good and is why the maximum output size is limited.
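The magnitude of the hazard is easy to demonstrate with any codec. A sketch
using the standard library's ``zlib`` (the exact compressed size varies
slightly between zlib versions):

```python
import zlib

# One megabyte of a single repeated byte value: highly compressible input.
original = b"\x00" * (1024 * 1024)
compressed = zlib.compress(original, 9)

# The compressed payload is a tiny fraction of the decompressed size, so a
# decompressor with no output bound can be made to allocate huge buffers
# from a very small input.
assert len(compressed) < 4096
assert len(zlib.decompress(compressed)) == 1024 * 1024
print("compression ratio roughly %d:1" % (len(original) // len(compressed)))
```

Scale the same trick up to gigabytes of input and an unbounded decompressor
becomes a denial-of-service vector; bounding the output size defuses it.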
|
Note on Zstandard's *Experimental* API
======================================

Many of the Zstandard APIs used by this module are marked as *experimental*
within the Zstandard project.

It is unclear how Zstandard's C API will evolve over time, especially with
regards to this *experimental* functionality. We will try to maintain
backwards compatibility at the Python API level. However, we cannot
guarantee this for things not under our control.

Since a copy of the Zstandard source code is distributed with this
module and since we compile against it, the behavior of a specific
version of this module should be constant for all of time. So if you
pin the version of this module used in your projects (which is a Python
best practice), you should be shielded from unwanted future changes.

Donate
======

A lot of time has been invested into this project by the author.