Can a compression scheme be made invariant (for example, with respect to ordering) and achieve (average/expected) compression efficiency exceeding that of sorting (or canonicalising) the input?
-
It's a given that by sorting your data first, you'll generally get better compression. For the purposes of this question, consider hierarchical JSON data in which a hash element conveys the same practical information regardless of the ordering of its key-value pairs. In other words, the caller is rarely interested in the original input ordering (they may simply have forgotten to sort the data). Sorting the key-value pairs at each hierarchical level is likely to compress better than a random or even the original arrangement, but there are rare cases where a traditional compression algorithm does slightly better with some small perturbation of the input ordering (so an impractical approach is to try all permutations and pick the one that compresses best). Taken as a whole, given the nested structure, there may also be a per-data-set ordering optimisation that yields slightly better compression than sorting by key at each level (some other structured arrangement of the data). However, I wanted to open this up and think about theoretical principles for compression given an invariant property of the input.

Suppose for now that we do want to capture the exact ordering of the key-value pairs within each hash, and suppose that for this data each ordering occurs with equal probability (the data content itself is fixed for this demonstration). Then one way to do that is to optimally compress the sorted version (or compress all versions and take the best) and separately encode the specific ordering with a minimal, information-theoretic representation. Given that these variants of a fixed JSON data set occur with equal probability (not necessarily a practical assumption in the real world, but useful for illustration), and supposing we have an array of hashes one level deep, [{1:2,3:4,...},{5:6,7:8,...},...], the number of possible orderings is (n!)^m (where m is the number of outer array elements and n is the number of keys in each hash), so the ordering information amounts to roughly m*log2(n!) bits: essentially the index of the specific permutation. (In fact, up to a fraction of a bit, arithmetic coding of that permutation index would represent the ordering portion of the data about as efficiently as possible.)

We have here a separation of the data content we're interested in from an invariant property (an aspect of the data we're not concerned with, in this case the ordering within each hash). This could be likened to the distinction between lossy and lossless compression. In lossy compression we may be satisfied to listen to an imperfect recreation of an audio track as long as we can't tell the difference. A programmer would usually be unhappy to see random bit flips in their JSON data, but in most if not all cases there is at least some invariance at some level of the problem, program, or algorithm, whether it's expressed through the structure of the data or algorithmically (and maybe an ordering is only needed at a particular stage of the algorithm, or in certain cases). So as an example I suggest the ordering of hash keys in JSON.
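As a minimal sketch of that content/ordering split (my own illustration, not part of the question's setup; permutation_index is a hypothetical helper name): sort the keys of a hash to get the canonical content, and separately record the original ordering as a single integer rank among the n! permutations, which costs at most ceil(log2(n!)) bits per hash.

    import math

    def permutation_index(original_keys, canonical_keys):
        # Rank of original_keys among the permutations of canonical_keys
        # (a Lehmer-code style rank). Storing it needs at most
        # ceil(log2(n!)) bits.
        remaining = list(canonical_keys)
        rank = 0
        for key in original_keys:
            pos = remaining.index(key)
            rank += pos * math.factorial(len(remaining) - 1)
            remaining.pop(pos)
        return rank

    original = ["7", "1", "9", "3"]       # key order as it arrived
    canonical = sorted(original)          # the content we actually care about
    rank = permutation_index(original, canonical)
    print(canonical, rank, math.log2(math.factorial(len(original))))
    # ['1', '3', '7', '9'] 13 4.58...

Compress the canonical form with any off-the-shelf compressor and, if the caller really does want the original order back, append the ranks (arithmetic-coded, they cost essentially m*log2(n!) bits in total); if the caller doesn't care, simply drop them.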
Now consider another facet of practical compression schemes: if you feed them every possible ordering of the data (with the content fixed), the optimal case (usually the sorted one, but not necessarily) is not significantly better than the unsorted case, even across a family of inputs. Other facets of the data are being compressed in a way that has no direct reference or correlation to the ordering of that data. If a compression scheme included a reversible sort, and the structure of the data were much more regular (for example, periodic), then the difference might approach the information-theoretic measure of the ordering content, m*log2(n!) bits. I'm not saying that the difference, even given an ideal compression scheme suited to this partial invariance, would equal that amount, since there is coupling and interdependence, especially in a practical system, but it is at least a measure to keep in mind.

Basically, if I see only a small difference in compression efficiency when I sort the keys recursively compared to the unsorted version, and I then consider the set of outputs (and corresponding efficiencies) over all possible input orderings, then the ordering information is somehow embedded in the encoding as part of the inherent information content, but it's coupled with the rest of the information in such a way that sorting isn't the way to optimally reduce the encoding. I'd like to know how compressors (or a family of theoretical compressors) can be constructed to achieve better efficiency given that they are free to return an arbitrary ordering of the data (or, more generally, any representative under some other kind of invariance) upon decompression. I'm interested in the concepts and theory here, which allows for "impractical" ideas like trying all possible input values or searching implausibly large spaces. But since most compression schemes focus on perfect reconstruction of the input, they don't know about any invariance. Purely for theory (moving even further from practical considerations), they could be fed the set of all valid representations of the data, for example all possible orderings. (That probably wouldn't work well, because if even one ordering were disallowed or missing from the set of all permutations, that strange exception would have to be implicitly encoded, compared with a scheme that algorithmically "understands" that sort order is unimportant.)

It might be easier to consider the general concept of invariance, so I'm proposing hierarchical JSON as an instance. The answer can be framed in terms of an assumption about the statistical properties of the input JSON (a particular instance or model for that JSON), and it can then be illustrated how an existing compression approach wouldn't necessarily observe, or optimally exploit, that invariance. But the main thing I'm getting at is: what would be needed to compress better given that we also express an invariance, i.e. a property of the data we're not interested in (for example ordering)? It's open-ended how that invariance specification step would be done, since I'm starting from an arbitrary compression scheme and considering a particular input instance for illustration purposes.
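The simplest construction along those lines that I can sketch (assuming only key order inside each hash is irrelevant; canonicalize, compress_invariant and decompress_invariant are names invented for illustration) is a codec that canonicalises before compressing and makes no promise about key order on decompression:

    import json
    import lzma

    def canonicalize(obj):
        # Recursively sort dictionary keys; array order is preserved.
        if isinstance(obj, dict):
            return {k: canonicalize(obj[k]) for k in sorted(obj)}
        if isinstance(obj, list):
            return [canonicalize(v) for v in obj]
        return obj

    def compress_invariant(data):
        # Only the canonical form is encoded; key order is deliberately lost.
        return lzma.compress(json.dumps(canonicalize(data)).encode())

    def decompress_invariant(blob):
        # Returns the data with keys in canonical (sorted) order.
        return json.loads(lzma.decompress(blob))

This only recovers whatever saving the back-end compressor happens to find in the sorted form, which, as the measurements below show, is far less than the m*log2(n!) bound; a compressor whose model explicitly treats each hash as a set of key-value pairs rather than as a byte sequence would be needed to approach that bound.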
Just to measure this, I created an array of 10,000 hash elements, each with 20 key-value pairs where every key and value is a string holding a 15-digit random floating point value between 0.0 and 1.0. The ordering information is roughly 60 bits per array element (an individual 20-key hash), which amounts to roughly 75 kilobytes overall. The uncompressed data is 8418k; gzip gives 3381k-3385k, bzip2 gives 2844k-2851k, and lzma gives 2914k-2942k. So the range is at most 7k with the first two algorithms, while lzma shows a 28k delta. I don't have the resources to try all possible orderings, but I'd guess the delta between the best and worst case might be 12k-15k for gzip/bzip2 and 30k-40k for LZMA (I may be biased in assuming LZMA is better at identifying the ordering just because it showed a bigger delta; these are guesses and would vary case by case). The delta between the average (or weighted) case and the optimal one would be smaller still. The worst case might "need" up to 20k of additional information for the first two algorithms. I can't say for sure that there's a way to save 75k minus 20k (or even up to 75k) given that we'd be fine with an arbitrary ordering of the decompressed data. But information about the order is already embedded in the encoding (since perfect reconstruction is possible), even though it's coupled with the content of the data itself. So if we drop our interest in encoding that ordering information, even if it isn't cleanly separable, perhaps we can avoid paying for some of it.

I'm open to any kind of theoretical approach, however computationally intensive. That includes a curiosity about families of compression schemes that perform optimally under various models of input. I understand that the ordering is in many ways coupled with the model of the input data and with the model the compressor chooses for that data, so it's not a matter of simply removing it. The example above is also not entirely practical or common, since the keys and values are chosen randomly and independently, which doesn't match the interdependence found in most heterogeneous JSON data. But I'd like to know what kinds of algorithms could notice patterns in the data while, by a prior directive, deciding not to notice other patterns, in order to achieve better efficiency.
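For reference, a small script along these lines (the seed, the exact string formatting, and the resulting sizes are my own assumptions and won't match the numbers above exactly) that generates comparable data, prints the m*log2(n!) bound, and compares sorted versus unsorted key order under gzip, bzip2 and lzma:

    import bz2, gzip, json, lzma, math, random

    M, N = 10_000, 20                      # array elements, keys per hash

    random.seed(0)                         # arbitrary seed; fixes the content
    elements = [[(f"{random.random():.15f}", f"{random.random():.15f}")
                 for _ in range(N)] for _ in range(M)]

    unsorted_json = json.dumps([dict(pairs) for pairs in elements]).encode()
    sorted_json = json.dumps([dict(sorted(pairs)) for pairs in elements]).encode()

    # Information-theoretic cost of the key orderings: M * log2(N!) bits.
    ordering_kb = M * math.log2(math.factorial(N)) / 8 / 1024
    print(f"ordering information ~ {ordering_kb:.0f}k")   # ~75k

    for name, compress in [("gzip", gzip.compress),
                           ("bzip2", bz2.compress),
                           ("lzma", lzma.compress)]:
        u, s = len(compress(unsorted_json)), len(compress(sorted_json))
        print(f"{name}: unsorted {u // 1024}k, sorted {s // 1024}k, "
              f"delta {(u - s) / 1024:.1f}k")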
-
Answer:
It's been a while since I dabbled in compression algorithms, but theoretically a compression tool can be written that builds an optimal, static dictionary by examining the entire data set upfront (as opposed to building a running dictionary as it reads the data stream). For the purpose of this discussion, let's define optimal as the most efficient dictionary possible (or the first one, in case there's more than one), and let's take it for granted that the optimal static dictionary can be found in reasonable time with current resources. In that case, it shouldn't matter in what order the keys, or the hashes themselves, occur: the compression tool should achieve the same efficiency every time. That is, if your dictionary says "AA" => "X" and "BC" => "Y", then it shouldn't matter whether you have [{"AA": "BC"}, {"BC": "AA"}] or [{"BC": "AA"}, {"AA": "BC"}]. What does this mean for the original question? Simply that you can reorder all you want (and, if you need to preserve order, just transmit the ordering information however you like, maybe even as part of the data stream, so it also gets compressed) and our theoretical compressor should achieve the same compression ratio each time. Of course, I would wager that finding such an optimal dictionary for an arbitrary data set is probably NP-hard, so I'm also genuinely curious whether any studies in data compression can take advantage of invariance in the underlying data to achieve greater efficiency than conventional (running) dictionary compression schemes.
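A toy way to see the order-independence claim (using an idealised entropy code over global symbol frequencies as a stand-in for the optimal static dictionary; static_dictionary_cost is just an illustrative name): because the code is derived from whole-data-set statistics, which are themselves order-independent, the coded size is identical for any reordering, including the [{"AA": "BC"}, {"BC": "AA"}] example above.

    import json, math
    from collections import Counter

    def static_dictionary_cost(text):
        # Ideal coded size in bits when every symbol gets a code of length
        # -log2(p) based on its global frequency. Depends only on the
        # multiset of symbols, never on their order.
        counts = Counter(text)
        total = sum(counts.values())
        return sum(c * -math.log2(c / total) for c in sorted(counts.values()))

    a = json.dumps([{"AA": "BC"}, {"BC": "AA"}])
    b = json.dumps([{"BC": "AA"}, {"AA": "BC"}])
    print(static_dictionary_cost(a) == static_dictionary_cost(b))   # True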
Alistair Israel at Quora
Other answers
Compression is based on the joint probability distribution of your symbols. Reordering or other manipulation (say, differential coding) may indeed change your joint distribution and allow you to get better compression; it all depends on your input data/symbols. Gzip et al., I believe, are based on the LZ algorithm, which builds a running dictionary. For such algorithms, on average, it should make little difference whether a specific pattern is at the front or at the end of a file. So, in some sense, all text compression algorithms are order independent: it should make little difference whether you compress Abstract + Intro + Chap 1 + Summary or Summary + Intro + Chap 1 + Abstract. In Huffman coding, you create your mapping based on an a priori estimation of your symbol probabilities, so the order in which those symbols arrive is again irrelevant; however, if all your symbols have uniform probability, then Huffman-like codecs are not going to help you much. Reordering could help in some coders (say, with run-length coding), but then you need to keep track of the order, which may defeat the purpose of reordering. Before messing with reordering, I would experiment with simple differential coding.
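A quick way to try that last suggestion (my own sketch; whether delta coding helps at all depends entirely on the data):

    import lzma, random

    random.seed(0)                                  # arbitrary example data
    values = sorted(random.random() for _ in range(10_000))

    raw = ",".join(f"{v:.15f}" for v in values).encode()
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    diff = ",".join(f"{d:.15f}" for d in deltas).encode()

    # Compare the compressed size of the values themselves against the
    # compressed size of their successive differences.
    print(len(lzma.compress(raw)), len(lzma.compress(diff)))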
Anonymous