JSON, Homoiconicity, and Database Access

During a recent review of an internal web application based on the Node.js platform, we discovered that combining JavaScript Object Notation (JSON) and database access (database query generators or object-relational mappers, ORMs) creates interesting security challenges, particularly for JavaScript programming environments.

To see why, we first have to examine traditional SQL injection.

Traditional SQL injection

Most programming languages do not track where strings and numbers come from. Looking at a string object, it is not possible to tell if the object corresponds to a string literal in the source code, or input data which was read from a network socket. Combined with certain programming practices, this lack of discrimination leads to security vulnerabilities. Early web applications relied on string concatenation to construct SQL queries before sending them to the database, using Perl constructs like this to load a row from the users table:

# WRONG: SQL injection vulnerability
$dbh->selectrow_hashref(qq{
  SELECT * FROM users WHERE users.user = '$user'
})

But if the externally supplied value for $user is "'; DROP TABLE users; --", instead of loading the user, the database may end up deleting the users table, due to SQL injection. Here’s the effective SQL statement after expansion of such a value:

  SELECT * FROM users WHERE users.user = ''; DROP TABLE users; --'

Because the provenance of strings is not tracked by the programming environment (as explained above), the SQL database driver only sees the entire query string and cannot easily reject such crafted queries.

Experience showed again and again that simply trying to avoid pasting untrusted data into query strings did not work. Too much data which looks trustworthy at first glance turns out to be under external control. This is why current guidelines recommend employing parametrized queries (sometimes also called prepared statements), where the SQL query string is (usually) a string literal, and the variable parameters are kept separate, combined only in the database driver itself (which has the necessary database-specific knowledge to perform any required quoting of the variables).

Homoiconicity and Query-By-Example

Query-By-Example is a way of constructing database queries based on example values. Consider a web application as an example. It might have a users table, containing columns such as user_id (a serial primary key), name, password (we assume the password is stored in the clear, also this practice is debatable), a flag that indicates if the user is an administrator, a last_login column, and several more.

We could describe a concrete row in the users table like this, using JavaScript Object Notation (JSON):

{
  "user_id": 1,
  "name": "admin",
  "password": "secret",
  "is_admin": true,
  "last_login": 1431519292
}

The query-by-example style of writing database queries takes such a row descriptor, omits some unknown parts, and treats the rest as the column values to match. We could check user name an password during a login operation like this:

{
  "name": "admin",
  "password": "secret",
}

If the database returns a row, we know that the user exists, and that the login attempt has been successful.

But we can do better. With some additional syntax, we can even express query operators. We could select the regular users who have logged in today (“1431475200” refers to midnight UTC, and "$gte" stands for “greater or equal”) with this query:

{
  "last_login": {"$gte": 1431475200},
  "is_admin": false
}

This is in fact the query syntax used by Sequelize, a object-relational mapping tool (ORM) for Node.js.

This achieves homoiconicity refers to a property of programming environment where code (here: database queries) and data look very much alike, roughly speaking, and can be manipulated with similar programming language constructors. It is often hailed as a primary design achievement of the programming language Lisp. Homoiconicity makes query construction with the Sequelize toolkit particularly convenient. But it also means that there are no clear boundaries between code and data, similar to the old way of constructing SQL query strings using string concatenation, as explained above.

Getting JSON To The Database

Some server-side programming frameworks, notably Node.js, automatically decode bodies of POST requests of content type application/json into JavaScript JSON objects. In the case of Node.js, these JSON objects are indistinguishable from other such objects created by the application code. In other words, there is no marker class or other attribute which allows to tell apart objects which come from inputs and objects which were created by (for example) object literals in the source.

Here is a simple example of a hypothetical login request. When Node.js processes the POST request on he left, it assigns a JavaScript object to the the req.body field in exactly the same way the JavaScript code on the right does.

POST request	Application code
POST /user/auth HTTP/1.0 Content-Type: application/json {"name":"admin","password":"secret"}	req.body = { name: "admin", password: "secret" }

In a Node.js application using Sequelize, the application would first define a model User, and then use it as part of the authentication procedure, in code similar to this (for the sake of this example, we still assume the password is stored in plain text, the reason for that will be come clear immediately):

User.findOne({
  where: {
    name: req.body.name,
    password: req.body.password
  }
}).then(function (user) {
  if (user) {
    // We got a user object, which means that login was successful.
    …
  } else {
    // No user object, login failure.
    …
  }
})

The query-by-example part is highlighted.

However, this construction has a security issue which is very difficult to fix. Suppose that the POST request looks like this instead:

POST /user/auth HTTP/1.0
Content-Type: application/json

{
  "name": {"$gte": ""},
  "password": {"$gte": ""}
}

This means that Sequelize will be invoked with this query (and the markers included here are invisible to the Sequelize code, they just illustrate the data that came from the post request):

User.findOne({
  where: {
    name: {"$gte": ""},
    password: {"$gte": ""}
  }
})

Sequelize will translate this into a query similar to this one:

SELECT * FROM users where name >= ''  AND password >= '';

Any string is greater than or equal to the empty string, so this query will find any user in the system, regardless of the user name or password. Unless there are other constraints imposed by the application, this allows an attacker to bypass authentication.

What can be done about this? Unfortunately, not much. Validating POST request contents and checking that all the values passed to database queries are of the expected type (string, number or Boolean) works to mitigate individual injection issues, but the experience with SQL injection issues mentioned at the beginning of this post suggests that this is not likely to work out in practice, particularly in Node.js, where so much data is exposed as JSON objects. Another option would be to break homoiconicity, and mark in the query syntax where the query begins and data ends. Getting this right is a bit tricky. Other Node.js database frameworks do not describe query structure in terms of JSON objects at all; Knex.js and Bookshelf.js are in this category.

Due to the prevalence of JSON, such issues are most likely to occur within Node.js applications and frameworks. However, already in July 2014, Kazuho Oku described a JSON injection issue in the SQL::Maker Perl package, discovered by his colleague Toshiharu Sugiyama.

Update (2015-05-26): After publishing this blog post, we learned that a very similar issue has also been described in the context of MongoDB: Hacking NodeJS and MongoDB.

Other fixable issues in Sequelize

Sequelize overloads the findOne method with a convenience feature for primary-key based lookup. This encourages programmers to write code like this:

User.findOne(req.body.user_id).then(function (user) {
  … // Process results.
}

This allows attackers to ship a complete query object (with the “{where: …}” wrapper) in a POST request. Even with strict query-by-example queries, this can be abused to probe the values of normally inaccessible table columns. This can be done efficiently using comparison operators (with one bit leaking per query) and binary search.

But there is another issue. This construct

User.findOne({
  where: "user_id IN (SELECT user_id " +
    "FROM blocked_users WHERE unblock_time IS NULL)"
}).then(function (user) {
  … // Process results.
}

pastes the marked string directly into the generated SQL query (here it is used to express something that would be difficult to do directly in Sequelize (say, because the blocked_users table is not modeled). With the “findOne(req.body.user_id)” example above, a POST request such as

POST /user/auth HTTP/1.0
Content-Type: application/json

{"user_id":{"where":"0=1; DROP TABLE users;--"}}

would result in a generated query, with the highlighted parts coming from the request:

SELECT * FROM users WHERE 0=1; DROP TABLE users;--;

(This will not work with some databases and database drivers which reject multi-statement queries. In such cases, fairly efficient information leaks can be created with sub-queries and a binary search approach.)

This is not a defect in Sequelize, it is a deliberate feature. Perhaps it would be better if this functionality were not reachable with plain JSON objects. Sequelize already supports marker objects for including literals, and a similar marker object could be used for verbatim SQL.

The Sequelize upstream developers have mitigated the first issue in version 3.0.0. A new method, findById (with an alias, findByPrimary), has been added which queries exclusively by primary keys (“{where: …}” queries are not supported). At the same time, the search-by-primary-key automation has been removed from findOne, forcing applications to choose explicitly between primary key lookup and full JSON-based query expression. This explicit choice means that the second issue (although not completely removed from version 3.0.0) is no longer directly exposed. But as expected, altering the structure of a query by introducing JSON constructs (as with the "$gte example is still possible, and to prevent that, applications have to check the JSON values that they put into Sequelize queries.

Conclusion

JSON-based query-by-example expressions can be an intuitive way to write database queries. However, this approach, when taken further and enhanced with operators, can lead to a reemergence of injection issues which are reminiscent of SQL injection, something these tools try to avoid by operating at a higher abstraction level. If you, as an application developer, decide to use such a tool, then you will have to make sure that data passed into queries has been properly sanitized.

007 Software