tensorflow-compression

Data compression in TensorFlow

View on GitHub

Entropy bottleneck layer

This layer exposes a high-level interface to model the entropy (the amount of information conveyed) of the tensor passing through it. During training, this can be use to impose a (soft) entropy constraint on its activations, limiting the amount of information flowing through the layer. Note that this is distinct from other types of bottlenecks, which reduce the dimensionality of the space, for example. Dimensionality reduction does not limit the amount of information, and does not enable efficient data compression per se.

After training, this layer can be used to compress any input tensor to a string, which may be written to a file, and to decompress a file which it previously generated back to a reconstructed tensor (possibly on a different machine having access to the same model checkpoint). For this, it uses the range coder documented in the next section. The entropies estimated during training or evaluation are approximately equal to the average length of the strings in bits.

The layer implements a flexible probability density model to estimate entropy, which is described in the appendix of the paper (please cite the paper if you use this code for scientific work):

“Variational image compression with a scale hyperprior”
J. Ballé, D. Minnen, S. Singh, S. J. Hwang, N. Johnston
https://arxiv.org/abs/1802.01436

The layer assumes that the input tensor is at least 2D, with a batch dimension at the beginning and a channel dimension as specified by data_format. The layer trains an independent probability density model for each channel, but assumes that across all other dimensions, the inputs are i.i.d. (independent and identically distributed). Because the entropy (and hence, average codelength) is a function of the densities, this assumption may have a direct effect on the compression performance.

Because data compression always involves discretization, the outputs of the layer are generally only approximations of its inputs. During training, discretization is modeled using additive uniform noise to ensure differentiability. The entropies computed during training are differential entropies. During evaluation, the data is actually quantized, and the entropies are discrete (Shannon entropies). To make sure the approximated tensor values are good enough for practical purposes, the training phase must be used to balance the quality of the approximation with the entropy, by adding an entropy term to the training loss, as in the following example.

Training

Here, we use the entropy bottleneck to compress the latent representation of an autoencoder. The data vectors x in this case are 4D tensors in 'channels_last' format (for example, 16x16 pixel grayscale images).

Note that forward_transform and backward_transform are placeholders and can be any appropriate artifical neural network. We’ve found that it generally helps not to use batch normalization, and to sandwich the bottleneck between two linear transforms or convolutions (i.e. to have no nonlinearities directly before and after).

# Build autoencoder.
x = tf.placeholder(tf.float32, shape=[None, 16, 16, 1])
y = forward_transform(x)
entropy_bottleneck = EntropyBottleneck()
y_, likelihoods = entropy_bottleneck(y, training=True)
x_ = backward_transform(y_)

# Information content (= predicted codelength) in bits of each batch element
# (note that taking the natural logarithm and dividing by `log(2)` is
# equivalent to taking base-2 logarithms):
bits = tf.reduce_sum(tf.log(likelihoods), axis=(1, 2, 3)) / -np.log(2)

# Squared difference of each batch element:
squared_error = tf.reduce_sum(tf.squared_difference(x, x_), axis=(1, 2, 3))

# The loss is a weighted sum of mean squared error and entropy (average
# information content), where the weight controls the trade-off between
# approximation error and entropy.
main_loss = 0.5 * tf.reduce_mean(squared_error) + tf.reduce_mean(bits)

# Minimize loss and auxiliary loss, and execute update op.
main_optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
main_step = main_optimizer.minimize(main_loss)
# 1e-3 is a good starting point for the learning rate of the auxiliary loss,
# assuming Adam is used.
aux_optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
aux_step = aux_optimizer.minimize(entropy_bottleneck.losses[0])
step = tf.group(main_step, aux_step, entropy_bottleneck.updates[0])

Note that the layer always produces exactly one auxiliary loss and one update op, which are only significant for compression and decompression. To use the compression feature, the auxiliary loss must be minimized during or after training. After that, the update op must be executed at least once. Here, we simply attach them to the main training step.

Evaluation

# Build autoencoder.
x = tf.placeholder(tf.float32, shape=[None, 16, 16, 1])
y = forward_transform(x)
y_, likelihoods = EntropyBottleneck()(y, training=False)
x_ = backward_transform(y_)

# Information content (= predicted codelength) in bits of each batch element:
bits = tf.reduce_sum(tf.log(likelihoods), axis=(1, 2, 3)) / -np.log(2)

# Squared difference of each batch element:
squared_error = tf.reduce_sum(tf.squared_difference(x, x_), axis=(1, 2, 3))

# The loss is a weighted sum of mean squared error and entropy (average
# information content), where the weight controls the trade-off between
# approximation error and entropy.
loss = 0.5 * tf.reduce_mean(squared_error) + tf.reduce_mean(bits)

To be able to compress the bottleneck tensor and decompress it in a different session, or on a different machine, you need three items:

Compression

x = tf.placeholder(tf.float32, shape=[None, 16, 16, 1])
y = forward_transform(x)
strings = EntropyBottleneck().compress(y)
shape = tf.shape(y)[1:]

Decompression

strings = tf.placeholder(tf.string, shape=[None])
shape = tf.placeholder(tf.int32, shape=[3])
entropy_bottleneck = EntropyBottleneck(dtype=tf.float32)
y_ = entropy_bottleneck.decompress(strings, shape, channels=5)
x_ = backward_transform(y_)

Here, we assumed that the tensor produced by the forward transform has 5 channels.

The above four use cases can also be implemented within the same session (i.e. on the same EntropyBottleneck instance), for testing purposes, etc., by calling the object more than once.