   dctx = zstd.ZstdDecompressor()
   dobj = dctx.decompressobj()
   data = dobj.decompress(compressed_chunk_0)
   data = dobj.decompress(compressed_chunk_1)

|
Batch Decompression API
^^^^^^^^^^^^^^^^^^^^^^^

(Experimental. Not yet supported in CFFI bindings.)

``multi_decompress_to_buffer()`` performs decompression of multiple
frames as a single operation and returns a ``BufferWithSegmentsCollection``
containing decompressed data for all inputs.

Compressed frames can be passed to the function as a ``BufferWithSegments``,
a ``BufferWithSegmentsCollection``, or as a list containing objects that
conform to the buffer protocol. For best performance, pass a
``BufferWithSegmentsCollection`` or a ``BufferWithSegments``, as
minimal input validation will be done for those types. If calling from
Python (as opposed to C), constructing one of these instances may add
overhead that cancels out the cost of validating list inputs.

The decompressed size of each frame must be discoverable. It can either be
embedded within the zstd frame (via the ``write_content_size=True`` argument
to ``ZstdCompressor``) or passed in via the ``decompressed_sizes`` argument.

The ``decompressed_sizes`` argument is an object conforming to the buffer
protocol which holds an array of 64-bit unsigned integers in the machine's
native format, defining the decompressed size of each frame. Passing this
argument avoids having to scan each frame for its decompressed size, which
can add noticeable overhead in some scenarios.

The ``threads`` argument controls the number of threads to use to perform
decompression operations. The default (``0``) or the value ``1`` means to
use a single thread. Negative values use the number of logical CPUs in the
machine.

.. note::

   It is possible to pass a ``mmap.mmap()`` instance into this function by
   wrapping it with a ``BufferWithSegments`` instance (which will define the
   offsets of frames within the memory mapped region).

This function is logically equivalent to performing ``dctx.decompress()``
on each input frame and returning the result.

This function exists to perform decompression on multiple frames as fast
as possible and with as little overhead as possible. Since decompression is
performed as a single operation and the decompressed output is stored in
a single buffer, extra memory allocations, Python objects, and Python function
calls are avoided. This is ideal for scenarios where callers need to access
decompressed data for multiple frames.

Currently, the implementation always spawns multiple threads when requested,
even if the amount of work to do is small. In the future, it will be smarter
about avoiding threads and their associated overhead when the amount of
work to do is small.

Content-Only Dictionary Chain Decompression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``decompress_content_dict_chain(frames)`` performs decompression of a list of
zstd frames produced using chained *content-only* dictionary compression. Such
frames are produced by compressing each non-initial input with a *content-only*
dictionary consisting of the fulltext of the previous input.

Each zstd frame **must** have the content size written.

The following Python code can be used to produce a *content-only dictionary
chain*::

   def make_chain(inputs):
       frames = []

       # First frame is compressed in standalone/discrete mode.
       zctx = zstd.ZstdCompressor(write_content_size=True)
       frames.append(zctx.compress(inputs[0]))

       # Subsequent frames use the previous fulltext as a content-only
       # dictionary.
       for i, raw in enumerate(inputs[1:]):
           dict_data = zstd.ZstdCompressionDict(inputs[i])
           zctx = zstd.ZstdCompressor(write_content_size=True,
                                      dict_data=dict_data)
           frames.append(zctx.compress(raw))

       return frames

``decompress_content_dict_chain()`` returns the uncompressed data of the last
element in the input chain.

It is possible to implement *content-only dictionary chain* decompression
on top of other Python APIs. However, this function will likely be significantly
faster, especially for long input chains, as it avoids the overhead of
instantiating and passing around intermediate objects between C and Python.

Multi-Threaded Compression
--------------------------

``ZstdCompressor`` accepts a ``threads`` argument that controls the number
of threads to use for compression. The way this works is that input is split
into segments and each segment is fed into a worker pool for compression. Once
a segment is compressed, it is flushed/appended to the output.

The segment size for multi-threaded compression is chosen from the window size
of the compressor. This is derived from the ``window_log`` attribute of a
``CompressionParameters`` instance. By default, segment sizes are in the 1+MB
range.

If multi-threaded compression is requested and the input is smaller than the
configured segment size, only a single compression thread will be used. If the
input is smaller than the segment size multiplied by the thread pool size, or
if data cannot be delivered to the compressor fast enough, not all requested
compressor threads may be active simultaneously.

Compared to non-multi-threaded compression, multi-threaded compression has
higher per-operation overhead. This includes extra memory operations,
thread creation, lock acquisition, etc.

Because multi-threaded compression operates on *N* discrete compression
*states*, its output will likely be larger than that of non-multi-threaded
compression. The difference is usually small, but there is a CPU/wall time
versus size trade-off that may warrant investigation.

Output from multi-threaded compression does not require any special handling
on the decompression side. In other words, any zstd decompressor should be able
to consume data produced with multi-threaded compression.

Dictionary Creation and Management
----------------------------------

Compression dictionaries are represented as the ``ZstdCompressionDict`` type.

Instances can be constructed from bytes::

   dict_data = zstd.ZstdCompressionDict(data)

You can obtain the raw data in the dict (useful for persisting and constructing
a ``ZstdCompressionDict`` later) via ``as_bytes()``::

   dict_data = zstd.train_dictionary(size, samples)
   raw_data = dict_data.as_bytes()
|

The following named arguments to ``train_dictionary`` can also be used
to further control dictionary generation.

selectivity
   Integer selectivity level. Default is 9. Larger values yield more data in
   the dictionary.
level
   Integer compression level. Default is 6.
dict_id
   Integer dictionary ID for the produced dictionary. Default is 0, which
   means to use a random value.
notifications
   Controls writing of informational messages to ``stderr``. ``0`` (the
   default) means to write nothing. ``1`` writes errors. ``2`` writes
   progression info. ``3`` writes more details. And ``4`` writes all info.

Cover Dictionaries
^^^^^^^^^^^^^^^^^^

An alternate dictionary training mechanism named *cover* is also available.
More details about this training mechanism are available in the paper
*Effective Construction of Relative Lempel-Ziv Dictionaries* (authors:
Liao, Petri, Moffat, Wirth).

To use this mechanism, use ``zstd.train_cover_dictionary()`` instead of
``zstd.train_dictionary()``. The function behaves nearly the same except
its arguments are different and the returned dictionary will contain ``k``
and ``d`` attributes reflecting the parameters to the cover algorithm.

.. note::

   The ``k`` and ``d`` attributes are only populated on dictionary
   instances created by this function. If a ``ZstdCompressionDict`` is
   constructed from raw bytes data, the ``k`` and ``d`` attributes will
   be ``0``.

The segment and dmer size parameters to the cover algorithm can either be
specified manually, or you can ask ``train_cover_dictionary()`` to try
multiple values and pick the best one, where *best* means the smallest
compressed data size.

In manual mode, the ``k`` and ``d`` arguments must be specified or a
``ZstdError`` will be raised.

In automatic mode (triggered by specifying ``optimize=True``), ``k``
and ``d`` are optional. If a value isn't specified, then default values for
both are tested. The ``steps`` argument can control the number of steps
through ``k`` values. The ``level`` argument defines the compression level
that will be used when testing the compressed size. And ``threads`` can
specify the number of threads to use for concurrent operation.

This function takes the following arguments:

dict_size
   Target size in bytes of the dictionary to generate.
samples
   A list of bytes holding samples the dictionary will be trained from.
k
   Parameter to the cover algorithm defining the segment size. A reasonable
   range is [16, 2048+].
d
   Parameter to the cover algorithm defining the dmer size. A reasonable
   range is [6, 16]. ``d`` must be less than or equal to ``k``.
dict_id
   Integer dictionary ID for the produced dictionary. Default is 0, which uses
   a random value.
optimize
   When true, test dictionary generation with multiple parameters.
level
   Integer target compression level when testing compression with
   ``optimize=True``. Default is 1.
steps
   Number of steps through ``k`` values to perform when ``optimize=True``.
   Default is 32.
threads
   Number of threads to use when ``optimize=True``. Default is 0, which means
   to use a single thread. A negative value can be specified to use as many
   threads as there are detected logical CPUs.
notifications
   Controls writing of informational messages to ``stderr``. See the
   documentation for ``train_dictionary()`` for more.


Explicit Compression Parameters
-------------------------------

Zstandard's integer compression levels along with the input size and dictionary

example, the difference between *context* reuse and non-reuse for 100,000
100 byte inputs will be significant (possibly over 10x faster to reuse contexts)
whereas 10 1,000,000 byte inputs will be more similar in speed (because the
time spent doing compression dwarfs time spent creating new *contexts*).
|
Buffer Types
------------

The API exposes a handful of custom types for interfacing with memory buffers.
The primary goal of these types is to facilitate efficient multi-object
operations.

The essential idea is to have a single memory allocation provide backing
storage for multiple logical objects. This has 2 main advantages: fewer
allocations and optimal memory access patterns. This avoids having to allocate
a Python object for each logical object and furthermore ensures that access of
data for objects can be sequential (read: fast) in memory.

BufferWithSegments
^^^^^^^^^^^^^^^^^^

The ``BufferWithSegments`` type represents a memory buffer containing N
discrete items of known lengths (segments). It is essentially a fixed size
memory address and an array of 2-tuples of ``(offset, length)`` 64-bit
unsigned native endian integers defining the byte offset and length of each
segment within the buffer.

Instances behave like containers.

``len()`` returns the number of segments within the instance.

``o[index]`` or ``__getitem__`` obtains a ``BufferSegment`` representing an
individual segment within the backing buffer. That returned object references
(not copies) memory. This means that iterating all objects doesn't copy
data within the buffer.

The ``.size`` attribute contains the total size in bytes of the backing
buffer.

Instances conform to the buffer protocol. So a reference to the backing bytes
can be obtained via ``memoryview(o)``. A *copy* of the backing bytes can also
be obtained via ``.tobytes()``.

The ``.segments`` attribute exposes the array of ``(offset, length)`` pairs
for segments within the buffer. It is a ``BufferSegments`` type.
|
BufferSegment
^^^^^^^^^^^^^

The ``BufferSegment`` type represents a segment within a ``BufferWithSegments``.
It is essentially a reference to N bytes within a ``BufferWithSegments``.

``len()`` returns the length of the segment in bytes.

``.offset`` contains the byte offset of this segment within its parent
``BufferWithSegments`` instance.

The object conforms to the buffer protocol. ``.tobytes()`` can be called to
obtain a ``bytes`` instance with a copy of the backing bytes.

BufferSegments
^^^^^^^^^^^^^^

This type represents an array of ``(offset, length)`` integers defining segments
within a ``BufferWithSegments``.

The array members are 64-bit unsigned integers using host/native bit order.

Instances conform to the buffer protocol.

BufferWithSegmentsCollection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``BufferWithSegmentsCollection`` type represents a virtual spanning view
of multiple ``BufferWithSegments`` instances.

Instances are constructed from 1 or more ``BufferWithSegments`` instances. The
resulting object behaves like an ordered sequence whose members are the
segments within each ``BufferWithSegments``.

``len()`` returns the number of segments within all ``BufferWithSegments``
instances.

``o[index]`` and ``__getitem__(index)`` return the ``BufferSegment`` at
that offset as if all ``BufferWithSegments`` instances were a single
entity.

If the object is composed of 2 ``BufferWithSegments`` instances, with the
first having 2 segments and the second having 3 segments, then ``b[0]``
and ``b[1]`` access segments in the first object and ``b[2]``, ``b[3]``,
and ``b[4]`` access segments from the second.
|
Choosing an API
===============

There are multiple APIs for performing compression and decompression. This is
because different applications have different needs and the library wants to
facilitate optimal use in as many use cases as possible.

From a high level, APIs are divided into *one-shot* and *streaming*. See
the ``Concepts`` section for a description of how these are different at
the C layer.

The *one-shot* APIs are useful for small data, where the input or output
size is known. (The size can come from a buffer length, file size, or
a value stored in the zstd frame header.) A limitation of the *one-shot* APIs
is that input and output must fit in memory simultaneously. For, say, a 4 GB
input, this is often not feasible.

The *one-shot* APIs also perform all work as a single operation. So, if you
feed them large input, it could take a long time for the function to return.

The streaming APIs do not have the limitations of the simple API. But the
price you pay for this flexibility is that they are more complex than a
single function call.

The streaming APIs put the caller in control of compression and decompression
behavior by allowing them to directly control either the input or output side
of the operation.

With the *streaming input*, *compressor*, and *decompressor* APIs, the caller
has full control over the input to the compression or decompression stream.
They can directly choose when new data is operated on.

With the *streaming output* APIs, the caller has full control over the output
of the compression or decompression stream. It can choose when to receive
new data.

When using the *streaming* APIs that operate on file-like or stream objects,
it is important to consider what happens in that object when I/O is requested.
There is potential for long pauses as data is read or written from the
underlying stream (say from interacting with a filesystem or network). This
could add considerable overhead.

Concepts
========

It is important to have a basic understanding of how Zstandard works in order
to optimally use this library. In addition, there are some low-level Python
concepts that are worth explaining to aid understanding. This section aims to
provide that knowledge.

Zstandard Frames and Compression Format
---------------------------------------

Compressed zstandard data almost always exists within a container called a
*frame*. (For the technically curious, see the
`specification <https://github.com/facebook/zstd/blob/3bee41a70eaf343fbcae3637b3f6edbe52f35ed8/doc/zstd_compression_format.md>`_.)

The frame contains a header and optional trailer. The header contains a
magic number to self-identify as a zstd frame and a description of the
compressed data that follows.

Among other things, the frame *optionally* contains the size of the
decompressed data the frame represents, a 32-bit checksum of the
decompressed data (to facilitate verification during decompression),
and the ID of the dictionary used to compress the data.

Storing the original content size in the frame (``write_content_size=True``
to ``ZstdCompressor``) is important for performance in some scenarios. Having
the decompressed size stored there (or storing it elsewhere) allows
decompression to perform a single memory allocation that is exactly sized to
the output. This is faster than continuously growing a memory buffer to hold
output.
|
Compression and Decompression Contexts
--------------------------------------

In order to perform a compression or decompression operation with the zstd
C API, you need what's called a *context*. A context essentially holds
configuration and state for a compression or decompression operation. For
example, a compression context holds the configured compression level.

Contexts can be reused for multiple operations. Since creating and
destroying contexts is not free, there are performance advantages to
reusing contexts.

The ``ZstdCompressor`` and ``ZstdDecompressor`` types are essentially
wrappers around these contexts in the zstd C API.
|
One-shot And Streaming Operations
---------------------------------

A compression or decompression operation can either be performed as a
single *one-shot* operation or as a continuous *streaming* operation.

In one-shot mode (the *simple* APIs provided by the Python interface),
**all** input is handed to the compressor or decompressor as a single buffer
and **all** output is returned as a single buffer.

In streaming mode, input is delivered to the compressor or decompressor as
a series of chunks via multiple function calls. Likewise, output is
obtained in chunks as well.

Streaming operations require an additional *stream* object to be created
to track the operation. These are logical extensions of *context*
instances.

There are advantages and disadvantages to each mode of operation. There
are scenarios where certain modes can't be used. See the
``Choosing an API`` section for more.
|
Dictionaries
------------

A compression *dictionary* is essentially data used to seed the compressor
state so it can achieve better compression. The idea is that if you are
compressing a lot of similar pieces of data (e.g. JSON documents or anything
sharing similar structure), then you can find common patterns across multiple
objects and then leverage those common patterns during compression and
decompression operations to achieve better compression ratios.

Dictionary compression is generally only useful for small inputs - data no
larger than a few kilobytes. The upper bound on this range is highly dependent
on the input data and the dictionary.

Python Buffer Protocol
----------------------

Many functions in the library operate on objects that implement Python's
`buffer protocol <https://docs.python.org/3.6/c-api/buffer.html>`_.

The *buffer protocol* is an internal implementation detail of a Python
type that allows instances of that type (objects) to be exposed as a raw
pointer (or buffer) in the C API. In other words, it allows objects to be
exposed as an array of bytes.

From the perspective of the C API, objects implementing the *buffer protocol*
all look the same: they are just a pointer to a memory address of a defined
length. This allows the C API to be largely type agnostic when accessing their
data. This allows custom types to be passed in without first converting them
to a specific type.

Many Python types implement the buffer protocol. These include ``bytes``
(``str`` on Python 2), ``bytearray``, ``array.array``, ``io.BytesIO``,
``mmap.mmap``, and ``memoryview``.

``python-zstandard`` APIs that accept objects conforming to the buffer
protocol require that the buffer is *C contiguous* and has a single
dimension (``ndim==1``). This is usually the case. An example of where it
is not is a Numpy matrix type.
|
Requiring Output Sizes for Non-Streaming Decompression APIs
-----------------------------------------------------------

Non-streaming decompression APIs require that either the output size is
explicitly defined (either in the zstd frame header or passed into the
function) or that a max output size is specified. This restriction is for
your safety.

The *one-shot* decompression APIs store the decompressed result in a
single buffer. This means that a buffer needs to be pre-allocated to hold
the result. If the decompressed size is not known, then there is no universal
good default size to use. Any default will fail or will be highly sub-optimal
in some scenarios (it will either be too small or will put stress on the
memory allocator to allocate a too large block).

A *helpful* API may retry decompression with buffers of increasing size.
While useful, there are obvious performance disadvantages, namely redoing
decompression N times until it works. In addition, there is a security
concern. Say the input came from highly compressible data, like 1 GB of the
same byte value. The output size could be several orders of magnitude larger
than the input size. An input of <100KB could decompress to >1GB. Without a
bounds restriction on the decompressed size, certain inputs could exhaust all
system memory. That's not good and is why the maximum output size is limited.

Note on Zstandard's *Experimental* API
======================================

Many of the Zstandard APIs used by this module are marked as *experimental*
within the Zstandard project. This includes a large number of useful