Working on a test case with sysbench, I encountered this:

I was really surprised. First, and the most important, id is a primary key and the rand() function should produce just one value. How come it returns two rows? Second, why is the response time 0.30 sec? That seems really high for a primary key access.

Looking further:

So it is a primary key, but MySQL does not use an index, and it returns two rows. Is this a bug?

Deterministic vs nondeterministic functions

Turned out it is not a bug at all. It is pretty logical behavior from MySQL, but it is not what we would expect. First, why a full table scan? Well, rand() is nondeterministic function. That means we do not know what it will return ahead of time, and actually that is exactly the purpose of rand() – to return a random value. In this case, it is only logical to evaluate the function for each row, each time, and compare the results. i.e. in our case

  1. Read row 1, get the value of id, evaluate the value of RAND(), compare
  2. Proceed using the same algorithm with the remaining rows.

In other words, as the value of rand() is not known (not evaluated) beforehand, so we can’t use an index.

And in this case – rand() function – we have another interesting consequence. For larger tables with an auto_increment primary key, the probability of matching the rand() value and the auto_increment value is higher, so we can get multiple rows back. In fact, if we read the whole table from the beginning and keep comparing the auto_inc sequence with “the roll of the dice”, we can get many rows back.

That behavior is totally counter-intuitive. Nevertheless, to me, it’s also the only correct behavior.

We expect to have the rand() function evaluated before running the query.  This can actually be achieved by assigning rand() to a variable:

This would meet our expectations.

There are (at least) two bug reports filed, with very interesting discussion:

  1. rand() used in scalar functions returns multiple rows
  2. SELECT on PK with ROUND(RAND()) give wrong errors

Other databases

I wanted to see how it works in other SQL databases. In PostgreSQL, the behavior is exactly the same as MySQL:

And SQLite seems different, evaluating the random() function beforehand:

Conclusion

Be careful when using MySQL nondeterministic functions in  a “where” condition – rand() is the most interesting example – as their behavior may surprise you. Many people believe this to be a bug that should be fixed. Let me know in the comments: do you think it is a bug or not (and why)? I would also be interested to know how it works in other, non-opensource databases (Microsoft SQL Server, Oracle, etc)

PS: Finally, I’ve got a “clever” idea – what if I “trick” MySQL by using the deterministic keyword…

MySQL stored functions: deterministic vs not deterministic

So, I wanted to see how it works with MySQL stored functions if they are assigned “deterministic” and “not deterministic” keywords. First, I wanted to “trick” mysql and pass the deterministic to the stored function but use rand() inside. Ok, this is not what you really want to do!

From MySQL manual about MySQL stored routines we can read:

Assessment of the nature of a routine is based on the “honesty” of the creator: MySQL does not check that a routine declared DETERMINISTIC is free of statements that produce nondeterministic results. However, misdeclaring a routine might affect results or affect performance. Declaring a nondeterministic routine as DETERMINISTIC might lead to unexpected results by causing the optimizer to make incorrect execution plan choices. Declaring a deterministic routine as NONDETERMINISTIC might diminish performance by causing available optimizations not to be used.

The result is interesting:

So MySQL optimizer detected the problem (somehow).

If I use the NOT DETERMINISTIC keyword, then MySQL works the same as when using the rand() function:

 


Photo by dylan nolte on Unsplash

6 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Ernie Souhrada

I think this is absolutely a bug. Standard rules of algebra would say that we evaluate expressions from the inside out, so we should be able to do the RAND(), then the ROUND(), and be left with a statement equivalent to SELECT * FROM table WHERE id = – in which case we should get back 0 or 1 rows.

The RAND() function is nondeterministic, yes, but it is never going to return more than one value – as opposed to some nondeterministic function that might return an unknown number of rows – so why can the optimizer / parser not be smart enough to figure that out? It used to be the case (and maybe it still is) that if you tried to do an EXPLAIN on a SELECT that contained a subquery, MySQL would actually run the subquery before giving you the optimizer’s plan – this seems like a much less complicated variant of that scenario.

Joe

I say it’s not a bug. Imagine doing a simulation. Say you had a column containing the object’s (represented by the row) chance of success. Each time you run the query, you want back a random list of rows which “succeeded’ during the simulation. You’d write something like:

SELECT * FROM agents WHERE success_prob > rand()

If rand() evaluated before the the query executed (instead of row by row), then all of the rows that returned when rand() = .2 would also be returned when rand() = .1 (plus some additional rows). There would be no variation among which rows got selected… just among the quantity of rows selected .

The author correctly points out that a variable should be used to store a single random value before query exeution if you want a single random value to evaluate against all rows. But the WHERE clause evaluates for each row, so it makes sense that rand() should evaluate for each row separately if it is included in the WHERE clause.

Peter Zaitsev

You may also wish to check this HackerNews Discussion https://news.ycombinator.com/item?id=18617809

Øystein Grøvlen

AFAICT, the described behavior is according to the SQL standard. In theory, a WHERE clause should be evaluated for all rows. A database system may optimize this and not inspect every row if it has some other mechanism to identify which rows will satisfy the WHERE clause (e.g., an index). Also, functions calls that are constant (i.e., deterministic and does not depend on input from the current row) may be evaluated once. In this case, the function is non-deterministic, so it needs to be evaluated on every row.