Most programming languages contain powerful features, that used correctly are incredibly powerful, but used incorrectly can be incredibly dangerous. Serialization (and deserialization) is one such feature available in most modern programming languages. As mentioned in a previous article:
“Serialization is a feature of programming languages that allows the state of in-memory objects to be represented in a standard format, which can be written to disk or transmitted across a network.”
So why is deserialization dangerous?
Serialization and, more importantly, deserialization of data is unsafe due to the simple fact that the data being processed is trusted implicitly as being “correct.” So if you’re taking data such as program variables from a non trusted source you’re making it possible for an attacker to control program flow. Additionally many programming languages now support serialization of not just data (e.g. strings, arrays, etc.) but also of code objects. For example with Python pickle() you can actually serialize user defined classes, you can take a section of code, ship it to a remote system, and it is executed there.
Of course this means that anyone with the ability to send a serialized object to such a system can now execute arbitrary code easily, with the full privileges of the program running it.
Some examples of failure
Unlike many classes of security vulnerabilities you cannot really accidentally create a deserialization flaw. Unlike memory management flaws for example which can easily occur due to a single off-by-one calculation, or misuse of variable type, the only way to create a deserialization flaw is to use deserialization. Some quick examples of failure include:
CVE-2012-4406 – OpenStack Swift (an object store) used Python pickle() to store metadata in memcached (which is a simple key/value store and does not support authentication), so an attacker with access to memcached could cause arbitrary code execution on all the servers using Swift.
CVE-2013-2165 – In JBoss’s RichFaces ResourceBuilderImpl.java the classes which could be called were not restricted allowing an attacker to interact with classes that could result in arbitrary code execution.
There are many more examples spanning virtually every major OS and platform vendor unfortunately. Please note that virtually every modern language includes serialization which is not safe by default to use (Perl Storage, Ruby Marshal, etc.).
So how do we serialize safely?
The simplest way to serialize and deserialize data safely is to use a format that does not include support for code objects. Your best bet for serialization almost all forms of data safely in a widely supported format is JSON. And when I say widely supported I mean everything from Cobol and Fortran to Awk, Tcl and Qt. JSON supports pairs (key:value), arrays and elements and within these a wide variety of data types including strings, numbers, objects (JSON objects), arrays, true, false and null. JSON objects can contain additional JSON objects, so you can for example serialize a number of things into discrete JSON objects and then shove those into a single large JSON (using an array for example).
Legacy code
But what if you are dealing with legacy code and can’t convert to JSON? On the receiving (deserializing end) you can attempt to monkey patch the code to restrict the objects allowed in the serialized data. However most languages do not make this very easy or safe and a determined attacker will be able to bypass them in most cases. An excellent paper is available from BlackHat USA 2011 which covers any number of clever techniques to exploit Python pickle().
What if you need to serialize code objects?
But what if you actually need to serialize and deserialize code objects? Since it’s impossible to determine if code is safe or not you have to trust the code you are running. One way to establish that the code has not been modified in transit, or comes from an untrusted source is to use code signing. Code signing is very difficult to do correctly and very easy to get wrong. For example you need to:
- Ensure the data is from a trusted source
- Ensure the data has not been modified, truncated or added to in transit
- Ensure that the data is not being replayed (e.g. sending valid code objects out of order can result in manipulation of the program state)
- Ensure that if data is blocked (e.g. blocking code that should be executed but is not, leaving the program in an inconsistent state) you can return to a known good state
To name a few major concerns. Creating a trusted framework for remote code execution is outside the scope of this article, however there are a number of such frameworks.
Conclusion
If data must be transported in a serialized format use JSON. At the very least this will ensure that you have access to high quality libraries for the parsing of the data, and that code cannot be directly embedded as it can with other formats such as Python pickle(). Additionally you should ideally encrypt and authenticate the data if it is sent over a network, an attacker that can manipulate program variables can almost certainly modify the program execution in a way that allows privilege escalation or other malicious behavior. Finally you should authenticate the data and prevent replay attacks (e.g. where the attacker records and re-sends a previous sessions data), chances are if you are using JSON you can simply wrap the session in TLS with an authentication layer (such as certificates or username and password or tokens).