I C no randomness here
C is see. Punny.
I was doing a unit when at one point, the lecturer was generating pseudo-random numbers as an example on a short introduction to learning how to use C. Unfortunately, the standard C way to generate random numbers….isn’t very good.
Not that it matters since for the unit at least, the distribution of randomness wasn’t very important. It was just used as a more interesting example to show how to print a multidimensional array (printing out an array of all the same numbers isn’t very interesting), as opposed to, say, cryptographic purposes. But in any case, I thought I’ll share a bit about why this method isn’t very good.
Note: Ironically, I wrote the bulk of this article during my Lab session for this unit, since my tutor was teaching the basics of C, which I am already familiar with.
The C way to generate random numbers.
In most tutorials you find online, you’ll likely find something like this:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAX_VAL 100
#define MIN_VAL 1
int main(int argc, char *argv[]) {
srand(time(NULL));
for (int i = 0; i < 100; i++) {
printf("%d ", rand() % MAX_VAL + MIN_VAL);
}
printf("\n");
return 0;
}
Let’s break this down a bit.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
The stdio.h
library provides functions like printf
. stdlib.h
provides access to srand
and rand
. time.h
as it name suggest, provides access to time
.
#define MAX_VAL 100
#define MIN_VAL 1
Pre-processor directives that defines MAX_VAL
and MIN_VAL
as 100 and 1 respectively. You can think of them as constants, which I use later because I want to generate random numbers between the range of 1 to 100 inclusive.
Generally, I would use a
const
qualifier likeconst int MAX_VAL = 100
instead. Using#define
is a bit dangerous, you can think of it as a pre-processing step where the compiler replaces all instances ofMAX_VAL
with 100. One example (there are more) is if you were to do something likeprintf("MAX_VAL: %d", MAX_VAL1)
(note the 1 at the end), instead of throwing a compiler error, it will happily print out “1001”. Using a variable would throw an error since that would obviously be invalid syntax. As a side-note, The C/C++ pre-processor by itself is arguably Turing-Complete (probably unintentionally). Check out my writeup on a Google CTF Challenge where they made an entire interpreted language using the C++ preprocessor.
int main(int argc, char *argv[])
This is the main function of the program. I don’t actually use the command-line arguments here, but left it in out of habit.
srand(time(NULL));
Now for the interesting part, let’s start by figuring it out inside-out, with first time
then srand
.
time()
Let’s see the documentation for the time
function. Let’s check out the documentation with man 2 time.
If you’re not familiar with theman
command, it provides a manual page. The number corresponds to what section number it is for. 1 is used for programs/shell commands, 2 is for system calls (which thetime
function is), etc. Runman man
for more info.
So time returns the number of seconds since epoch (the first of January on 1970, UTC time).
The argument is a pointer to also store the returned result in.NULL
in C defined to as 0 (#define NULL 0
or sometimes #define NULL (void*)0
), and since the description says it uses the argument if it is only non-NULL, we can ignore it since it won’t do anything.
But what exactly is time_t
? Let’s run man time_t
to find out.
The C standard specifies it is of an integer type, but doesn’t give any elaboration beyond that. As far as I can tell, currently on most systems, it is interpreted to be a signed 32 bit integer.
Recall that a signed 32 bit integer can only store from -2147483648 ($-2^{31}$) to 2147483647 ($2^{31}-1$). (raised to the power of 31 instead of 32 because 1 bit is used for the sign.). So what happens after 2147483647 seconds after Epoch? What happens then? Well, you’ll experience an integer overflow problem and it’ll wrap back to 0. You might think that these 2 billion seconds seconds might be a very long time and we don’t have to worry about it- but you’ll be wrong. This will happen in the year 2038, and this problem is known as the Year 2038 problem. If you’re not young, this will probably sound similar to the Y2K problem in the year 2000 that you might have known about. The solution would be to migrate to 64-bit signed integers instead (or use some other system other than epoch), which some vendors are moving towards.
srand() and rand()
Computers are really bad at generating random numbers, they’re deterministic and doing exactly what you tell them to do. The solution is usually to have some source of entropy to draw from to generate random numbers. This can be things like your CPU temperature, noise from USB port, interrupt context switching timings, etc.
We usually call this a “seed” value, where we start off with this “random” value, and then we do a series of calculations to generate more “random” values based on this seed value (using the rand()
function, which we use later on), though in actuality, if you used the same seed, you would get the same series of values.
There are various methods (called PseudoRandom Number Generators, or PRNGs) to generate these “more random values” from a seed value. One popular implementation is the Mersenne Twister, and there are others like xorshift and linear shift feedback registers (LSFR). These aren’t designed to be cryptographically secure, so if you need to use it for cryptographic purposes, you should look into Cryptographically-Secure PRNGs (CSPRNG), which are typically used in stream ciphers (the “seed” in this case would be the “key”).
You might be familiar with the game Minecraft, where your world generation depends on the world “seed”, and you can regenerate the world using the same seed on another computer. The same idea applies here.
So if you modify my code and replace srand(time(NULL))
to srand(0)
instead (or any fixed number besides 0), every time you run the program, it should show the same series of values. Before, it changed every time you ran the program because the argument would be based on the current time, and that changes every time you run the program (unless you tried to quickly run the program twice within a second, which would give you the same series of numbers).
As a side note,
srand
actually takes in an unsigned integer, but that doesn’t really matter since the signed integer returned bytime()
is implicitly cast to an unsigned integer. You can catch things like this by turning on additional warnings, for example by using-Wconversion
. I usually only use-Wall
though, which catches most things.
The Attack
First of all, recall how I said the time()
function returns a 32 bit integer. This isn’t too-large an area to brute-force.
So say I have a list of the random numbers.
I can just have a running counter ranging over all possibly 32-bit values, use that as a seed, and keep generating numbers until I find a match! Once that is done, I would know what seed value was used.
But even worse, since the time()
returns a number of seconds since 1970, if we know roughly the time that the program was ran, we can narrow it down the brute-force search even further!
Another point of view of looking at it: previously we were concerned with finding the seed value so we can know what the next random values would be. But since the seed is also the time we can also use this to figure out when the person ran the program (which may be sensitive information, it can, for example, be used to estimate what part of the world a person lives by seeing what time they usually run the program and cross referencing it with daytime. This was one of the methods used by coder Stefan Thomas to guess who Satoshi Nakamoto is by times of his posts on the forum he frequented as well as git commits, which interestingly, seemed to indicate that he wasn’t living in Japan as he claimed- unless he was for some reason only active in the wee hours of the night for some reason.).
Another problem
The rand() function returns an integer between 0 and RAND_MAX.
![](2022-08-04_15-16.png “From \“man 3 rand\””)
If you take another look at my code, you’ll see how I made it fit between 1 and 100 inclusive.
printf("%d ", rand() % MAX_VAL + MIN_VAL);
If you didn’t already know, %
is the remainder operator (sometimes some people call it the modulus operator). It returns the remainder when you divide it by MAX_VAL
. So with 100, it would return a value between 0 and 99 since those are the only remainder values you can get when you divide by 100. Then, I increase it by MIN_VAL since I want it to exclude 0 and include 100, so it would be a value between 1 and 100. (Note that if you want to have a random value between 2 and 100, you will have to do some additional maths. You can’t simply just change MIN_VAL to 2 because that would change it getting a value from 2 to 101. I’ll leave this as an exercise for you. Hint: You should think about getting the distance between the maximum value and the minimum value.)
Do you see the problem?
Recall that rand()
returns a value between 0 and RAND_MAX
. Imagine for second that RAND_MAX
is 100 (it isn’t; but let’s just pretend). No problem there, the chance of getting a remainder of “1” would be $\frac{1}{100} (only if rand() is 1), and “2” would also be \frac{1}{100} (only if rand() is 2), etc.
Now how about if RAND_MAX
is 200? Still no problem, the chance of getting a remainder of “1” would be $\frac{2}{200} = \frac{1}{100}$ (only if rand() is 1 or 101), and so on.
But what if RAND_MAX is 101? Now, the chance of getting a remainder of “1” would be $2/101$, but every other remainder like “2” would have a chance of \frac{1}{101}. I won’t be taking advantage of this problem later on, but just thought this was interesting and something you should take note of.
Application
Here, I have a simple goal of figuring out when my lecturer filmed his pre-recorded video. He had mentioned that it was recorded during a previous semester, and from the content in the video, he occasionally makes reference to “Chat”, so it would probably be during a live class session on Zoom, likely be during class hours on a weekday.
So let’s look at his code random value generation code.
And here is the random values that were printed out.
Below is my brute-forcing code. I used the starting value of 1596243661, which is the First of August of 2020, since that would be around the time that the last time this unit was done and during this week, and I know it can’t be from 2019, since there was no COVID at the time, meaning no virtual classes.
Note that i only used the first 16 values to match against. This is because this should be sufficient to ensure that it wasn’t just “luck” that it was matching. Assuming the values are truly random, the chances of the first number matching is $\frac{1}{1000}$, then chances of the first two numbers matching are $(\frac{1}{1000})^{2}$, and generally the chances of the first $n$ numbers matching are $\frac{1}{1000^{n}}$, and so the chances of somehow coincidentally stumbling upon a different seed value with the same generated numbers, is $\frac{1}{1000^{16}}$ which is….pretty darn low, we don’t have to worry about it.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h> // for memcmp (memory compare)
// I unfortunately can't use const int NUM_VALS = 16;
// because I use it for array initialistions, which doesn't allow for
// variablesto be used to specify the number of elements.
// In C++, I could use const or constexpr and it would work fine.
#define NUM_VALS 16
int main(int argc, char* argv[]) {
int to_find[NUM_VALS] = {423, 398, 138, 347, 339, 680, 804, 406, 943, 581, 355, 99, 200, 859, 705, 226};
int generated[NUM_VALS] = {0};
int seed = 1596243661;
// why doesn't C have bools :(
int found = 0;
while (found != 1) {
srand(seed);
for (int i = 0;i < NUM_VALS;i++) {
generated[i] = rand() % 1000 + 1;
}
// compare 16 * 4 bytes (4 bytes per int) from the known values
// and the generated values.
if (memcmp(to_find, generated, sizeof(int) * NUM_VALS) == 0) {
found = 1;
}
else {
seed++;
}
}
// Just using some helper functions to change back the seconds from
// epoch into a human readable date/time.
time_t t_seed = (time_t) seed;
struct tm* timeinfo;
timeinfo = localtime(&t_seed); // returns in my country's (based on my computer locale) timezone
printf("Time/Date program executed: %s\n", asctime(timeinfo));
return 0;
}
Running the program, I got the following output:-
If we check back with the time on my screenshot earlier, the date/time matches as well! (The screenshot was taken from slightly later in the video, which is why it isn’t an exact match). Interestingly, it seems that the video was taken not during class hours as I suspected, but on a Sunday night. Looking at my University’s calendar for that year, this would be on the Sunday between Week 1 and Week 2. I don’t see any holidays of any sort on either week, so I can only assume it was a replacement for a class that was cancelled by the lecturer for personal reasons. (EDIT: I have since reviewed the video, and it turns out I was mixing up this video with another- there was no “Chat” he was talking to. So he probably recorded this on a Sunday night in preparation for Week 2’s pre-lab video. Which lines up because this video is part of my current semester’s week 2 pre-lab video as well).
Wrap-up
So how should you generate random numbers?
If you were writing in C++, the standard library provides a a separate way to generate random numbers, and it recommend the use of a class called std::random_device
to securely generate a random seed value from a non-deterministic source (e.g. a hardware device), and it has functions for you to generate values between certain numbers, so you don’t have to worry abou the remainder thing.
If you are on a UNIX-based system, you can open the file /dev/urandom
and use a seed value from there (the file will always be changing, you don’t have to worry about other programs reading it).
As I mentioned, the randomness of the values wasn’t important in the unit and was just used as an example, so it really wasn’t an issue. I just thought it an interesting opportunity to have some fun. As a recap, the two core issues are:-
- Remainder operator used may be biased towards certain values.
- Epoch time used as seed value for random number generation.
I just find it interesting that this is the standard way to generate random numbers in C, if you were to read any introductory tutorial or book to the language. I would not be surprised if there are actual important commercial software that were programmed by inexperienced programmers that have this problem. (I’m assuming the business is cheaping out on their developers and hiring the cheapest of the cheap who aren’t very experienced. Proper larger businesses hopefully invested more on more experienced developers). So perhaps there is, for example, a lottery number generator out there with these flaws, waiting for a clever guy to figure it out and earn some money. 😛