Note: The method described in this article does not apply to PNG files or ICC profiles larger than 64 KB in JPEGs.

In digital images, one way to label a colour space is to embed an ICC file. Many image formats such as JPEG, HEIC, and TIFF support embedded ICC files, but they may store ICC data differently. For example, JPEG stores it in APP2 marker segments, while HEIC stores it in a colr box. Parsing each format separately and then pulling the ICC part out of the result is tedious, and it is easy to miss edge cases.

Take the ISO 21496-1 HDR extension for JPEG as an example: it uses multiple concatenated JPEG streams, with MPF markers indicating their positions. When files like this are parsed with the default behaviour of libraries such as PIL, you often get content only from the first JPEG stream, and cannot extract ICC data that may appear in later streams.

So, is there a way to ignore image format entirely and extract ICC files straight from raw binary data? ICC files are self-describing, and they have very clear signatures and structured data that help us identify and extract them directly. As long as they have not been compressed inside the file byte stream, we can try direct detection and extraction.

Special cases such as PNG

In the PNG format, the ICC profile is stored in a chunk called iCCP (Embedded ICC profile). Its structure is as follows:

  • Profile name: 1–79 bytes of ASCII string.
  • Null separator: 1 byte (0x00).
  • Compression method: 1 byte (currently only 0x00, meaning zlib/deflate compression).
  • Compressed profile data: all remaining bytes are zlib-compressed ICC data.

Because of this compression, you can no longer find signatures like 'acsp' directly in the byte stream, so direct extraction is not possible.
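To make the layout above concrete, here is a minimal sketch that splits an iCCP chunk payload into its parts and decompresses the profile. It assumes you have already isolated the chunk payload by other means (the function name decode_iccp_payload is made up for this illustration), and it is not part of the signature-scanning method described in this article.

```python
import zlib


def decode_iccp_payload(payload: bytes) -> tuple[str, bytes]:
    """Split an iCCP chunk payload into (profile name, decompressed ICC bytes)."""
    # The profile name runs up to the first null byte.
    name_end = payload.index(b"\x00")
    name = payload[:name_end].decode("latin-1")
    # The byte after the separator is the compression method; only 0 is defined.
    method = payload[name_end + 1]
    if method != 0:
        raise ValueError(f"unknown compression method {method}")
    # Everything after that is a zlib stream holding the ICC profile.
    return name, zlib.decompress(payload[name_end + 2:])
```

Only after this decompression step does the 'acsp' signature become visible at offset 36 of the result.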

TIFF can use ZIP and other compression for image data, but this does not affect the non-image parts of the file, so a complete and uncompressed ICC byte stream can still be detected.

Also, the APP2 marker block used for ICC data in JPEG can hold at most about 64 KB of payload. If the ICC file is too complex, it is split into multiple segments. In that case, APP2 marker header data gets mixed into the extracted contiguous byte stream, which corrupts the ICC file and makes it unparsable.
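For such split profiles, a format-aware fallback is needed. The sketch below is one possible way to do it, not the method of this article: it walks the marker segments of a single JPEG stream and joins the APP2 chunks. It relies on the usual embedding convention, where each chunk starts with the 12-byte identifier "ICC_PROFILE\0", a 1-based sequence number, and a total chunk count. The function name is hypothetical, and the walker is deliberately simplified (it stops at the first SOS marker and does not handle fill bytes).

```python
import struct
from typing import Optional

ICC_TAG = b"ICC_PROFILE\x00"  # identifier at the start of each ICC APP2 payload


def icc_from_jpeg_app2(data: bytes) -> Optional[bytes]:
    """Reassemble an ICC profile from the APP2 segments of one JPEG stream."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    chunks = {}
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: entropy-coded image data follows, stop here
            break
        # The 2-byte length counts itself but not the 2-byte marker.
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        body = data[pos + 4:pos + 2 + seg_len]
        if marker == 0xE2 and body.startswith(ICC_TAG) and len(body) >= 14:
            seq = body[12]          # 1-based chunk index; body[13] is the total count
            chunks[seq] = body[14:]
        pos += 2 + seg_len
    if not chunks:
        return None
    # Join the chunks in sequence order, without the per-segment headers.
    return b"".join(chunks[i] for i in sorted(chunks))
```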

Structural characteristics of ICC files

According to the ICC specification (ICC.1:2022), any valid ICC file contains a 128-byte header with two especially important fields:

  • Bytes 0–3: total file size, stored as uInt32Number (unsigned 32-bit integer, big-endian).
  • Bytes 36–39: fixed signature 'acsp' (hex 61 63 73 70).

This means that no matter what file an ICC profile is embedded in, as long as we can find the 'acsp' signature in the binary stream, we can move 36 bytes backwards to locate the ICC start, then read the first 4 bytes to get the full ICC size and extract it completely.
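As a small worked example of these offsets, the sketch below builds a synthetic 132-byte dummy "profile" (not a real ICC file), buries it inside unrelated bytes, and recovers it from the position of the signature alone. The helper name locate_profile is an assumption for illustration.

```python
import struct

SIGNATURE = b"acsp"
SIGNATURE_OFFSET = 36  # position of 'acsp' inside the ICC header


def locate_profile(data: bytes, sig_pos: int) -> tuple[int, int]:
    """Given the offset of an 'acsp' match, return (profile start, declared size)."""
    start = sig_pos - SIGNATURE_OFFSET
    # Bytes 0-3 of the header: profile size, big-endian unsigned 32-bit integer.
    (size,) = struct.unpack_from(">I", data, start)
    return start, size


# Dummy 128-byte header: size field, padding, signature, padding.
header = struct.pack(">I", 132) + bytes(32) + SIGNATURE + bytes(88)
blob = b"\x00" * 10 + header + bytes(4) + b"trailing data"

start, size = locate_profile(blob, blob.find(SIGNATURE))
profile = blob[start:start + size]
print(start, size, len(profile))  # 10 132 132
```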

Here is a simple extraction approach:

  1. Scan the whole file in binary mode and search for b'acsp'; matches must not be within the first 36 bytes of the file.
  2. For each match, move 36 bytes back to find the start, then read 4 bytes and convert them to an unsigned 32-bit integer.
  3. Check whether this integer is reasonable: it must be greater than 128 (the header alone takes 128 bytes) and no greater than the number of bytes remaining from the start position.
  4. Read that many bytes from the start position as a complete byte stream.
  5. Perform a simple validation to see whether it is an ICC file.
  6. Continue searching forward.
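The steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual script: the function names are made up, the validation step is reduced to a device class check against the seven classes of ICC.1 (iccMAX profiles use additional classes), and the size check accepts a profile that ends exactly at the end of the file.

```python
SIGNATURE = b"acsp"
# Device classes from ICC.1; treat this set as an assumption to adjust.
KNOWN_CLASSES = {b"scnr", b"mntr", b"prtr", b"link", b"spac", b"abst", b"nmcl"}


def looks_like_icc(candidate: bytes) -> bool:
    """Step 5: very light validation based on the device class field."""
    return candidate[12:16] in KNOWN_CLASSES


def extract_icc_profiles(data: bytes) -> list[bytes]:
    """Return every byte range in data that passes the checks above."""
    profiles = []
    # Step 1: a match cannot lie within the first 36 bytes.
    pos = data.find(SIGNATURE, 36)
    while pos != -1:
        # Step 2: step back 36 bytes and read the big-endian size field.
        start = pos - 36
        size = int.from_bytes(data[start:start + 4], "big")
        # Step 3: plausibility check on the declared size.
        if 128 < size <= len(data) - start:
            # Step 4: cut out the candidate profile.
            candidate = data[start:start + size]
            if looks_like_icc(candidate):
                profiles.append(candidate)
        # Step 6: keep searching after this match.
        pos = data.find(SIGNATURE, pos + 1)
    return profiles
```

For a file on disk, something like extract_icc_profiles(open(path, "rb").read()) would be enough, since image files are usually small enough to read into memory at once.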

Simple validation method

To confirm that extracted data is indeed an ICC file, we can do a further lightweight parse of these bytes, for example checking the version and device class in the header.

  • Bytes 8–11: version. Byte 8 stores the major version; the high and low nibbles of byte 9 store the next two version digits respectively; bytes 10–11 should be 0. For example, 04 30 00 00 means version 4.3.0.
  • Bytes 12–15: device class, stored as a 4-character ASCII code. Common values include mntr (display device), scnr (input device), etc.
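A minimal sketch of reading these two fields from an extracted profile might look like this (the function name is hypothetical, and no other header fields are checked):

```python
def parse_version_and_class(profile: bytes) -> tuple[str, str]:
    """Return (version string, device class) from an ICC header."""
    major = profile[8]            # byte 8: major version
    minor = profile[9] >> 4       # high nibble of byte 9
    bugfix = profile[9] & 0x0F    # low nibble of byte 9
    # Bytes 12-15: four-character device class code.
    device_class = profile[12:16].decode("ascii", errors="replace")
    return f"{major}.{minor}.{bugfix}", device_class


# Dummy header with version bytes 04 30 00 00 and class 'mntr'.
header = bytes(8) + bytes([0x04, 0x30, 0x00, 0x00]) + b"mntr" + bytes(112)
print(parse_version_and_class(header))  # ('4.3.0', 'mntr')
```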

These relatively fixed fields can be used for quick validation. Assuming uniformly random bytes, the chance that the four bytes at any given offset happen to spell b'acsp' is $1/2^{32}$, so accidental matches are rare even in files of many megabytes, and the size and header checks filter out most of those that do occur. In tests across nearly a hundred images in various formats, there were no false positives or misses. Still, it is worth emphasising that this rough-and-ready method is only suitable for quickly extracting ICC files, not for any serious production use, and it cannot be used to extract ICC profiles from PNG files or ICC profiles larger than 64 KB in JPEGs.

Python implementation

Here is a simple Python implementation of the method above. It can scan and extract one or more ICC files from an input file, then do a basic parse of their version and device class.


uv run extract_icc.py image.jpg

A few small findings

From ICC files extracted from various places, here are a few observations:

  • The typical size of a basic RGB ICC file is about 530 bytes, including header, description, copyright, primary XYZ values, white point XYZ, parametric transfer function, and chromatic adaptation matrix.
  • A classic sRGB ICC file is about 3 KB. The main reason is that it uses a 1D-LUT transfer function. Even if the three channels reuse the same data, it still needs 1024 points, taking up 2060 bytes. That is almost as large as a Bilibili video cover image. Bilibili covers stored in AVIF are usually only around 3–4 KB. They use CICP to identify colour space instead of ICC, which saves a lot of space (relative to already small image files).
  • Some ICC files with specific conversion intents can be around 30 KB, mainly due to large 3D LUTs (such as A2B0) and parts used to describe HDR transformations. This can even be larger than storing a separate gain map: a half-resolution greyscale gain map with advanced encoding may take only a few KB.
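The 2060-byte figure for the sRGB curve can be checked with a little arithmetic: a curveType element consists of a 4-byte 'curv' signature, 4 reserved bytes, a 4-byte entry count, and then one 16-bit value per point, so 12 + 1024 × 2 = 2060. The sketch below builds such an element with a placeholder linear ramp (not the real sRGB curve values) just to confirm the size; build_curv is a made-up helper name.

```python
import struct


def build_curv(points: list[int]) -> bytes:
    """Serialise a list of 16-bit values as an ICC curveType element."""
    return (
        b"curv"                                    # type signature
        + bytes(4)                                 # reserved, must be zero
        + struct.pack(">I", len(points))           # number of entries
        + struct.pack(f">{len(points)}H", *points) # big-endian uInt16 entries
    )


# Placeholder ramp with 1024 points; real sRGB data would differ.
lut = [round(i * 65535 / 1023) for i in range(1024)]
print(len(build_curv(lut)))  # 2060
```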

This is step 0 (or maybe step -1) on my ICC colour management learning journey. Next up, I’ll share my study notes and insights.