Operating System Challenges
|The operating system is in charge.|
|Multi-core systems alleviate many of the obstacles to critical timing that apply to single processor systems.|
At times, the operating system takes processing time away from running programs. Even if a program reads a clock with microsecond precision, its timing is only as good as the processor’s availability to actually read that clock. For example, assume a software application attempts to continuously read a hardware crystal clock that updates exactly every millisecond. If the application reads the clock for a period of 10 seconds (10,000 milliseconds), you would expect it to observe the clock value change exactly 10,000 times, and the difference between sequential reads to be either 0 or 1 millisecond.
However, when such a test application is run on a modern operating system, it will occasionally observe a difference between sequential reads that is significantly greater than 1 millisecond. Assuming that the hardware crystal clock is independent and runs continuously, these results can only be explained by the application being paused at times during execution. When the operating system suspends the application that is reading the clock, the crystal clock itself continues to run, and the clock ticks that elapse are “missed” by the suspended application.
We refer to the percentage of clock ticks that are missed by the software as the “miss tick rate.” The maximum amount of time that elapses between two consecutive reads of the clock is referred to as the “maximum miss tick duration.”
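The two measures defined above can be illustrated with a short Python sketch. This is not how E-Prime performs its own tests; it is a minimal illustration that assumes the clock is read as whole milliseconds and that a gap of N milliseconds between consecutive reads means N - 1 ticks were missed. The function names `miss_tick_stats` and `sample_clock` are hypothetical.

```python
import time


def miss_tick_stats(readings_ms):
    """Summarize a sequence of whole-millisecond clock readings.

    Returns (miss_tick_rate_percent, max_miss_tick_duration_ms).
    Assumption: a gap of N ms between two reads means N - 1 ticks were missed.
    """
    if len(readings_ms) < 2:
        return 0.0, 0
    total_ticks = readings_ms[-1] - readings_ms[0]
    missed = 0
    max_gap = 0
    for prev, curr in zip(readings_ms, readings_ms[1:]):
        gap = curr - prev
        if gap > 1:
            missed += gap - 1  # ticks that elapsed without being observed
        max_gap = max(max_gap, gap)
    rate = 100.0 * missed / total_ticks if total_ticks else 0.0
    return rate, max_gap


def sample_clock(duration_s=1.0):
    """Busy-read a monotonic clock and record each new millisecond value seen."""
    start = time.perf_counter_ns()
    end = start + int(duration_s * 1e9)
    readings = []
    now = start
    while now < end:
        now = time.perf_counter_ns()
        ms = (now - start) // 1_000_000  # truncate to whole milliseconds
        if not readings or ms != readings[-1]:
            readings.append(ms)  # store only changes to keep the list small
    return readings
```

Calling `miss_tick_stats(sample_clock(10.0))` would approximate the 10-second test described above. The values obtained depend on the machine, the operating system, and whatever else is running at the time.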
Modern multi-tasking operating systems (e.g., Windows, Mac OS, Unix) will not allow exclusive and continuous execution of any process, because the operating system must take cycles to perform critical functions (e.g., virtual memory management). No program running on a common desktop operating system can guarantee accurate measurements at every millisecond. In other words, you cannot count on a 0% miss tick rate. For example, Windows will indiscriminately remove parts of a program from memory to test whether the program really needs them, in order to reduce the application’s working set. This operation and others like it will produce random timing delays (see footnote 1).
Let’s consider a different example: a rotating checkerboard program. We implemented this experiment in E-Prime but without properly using E-Prime’s timing features. Figure 1 below illustrates the measured times from two such runs. The thick line at 200 ms shows the intended time. Session 1 (solid line) and Session 2 (dashed line) have a median display time of 236 ms, with irregular spikes as high as 374 ms. In addition to the median timing error, the long pauses happen at different times each time the program is run.
Figure 1. Actual display durations for displays run on 2 sessions relative to the intended (200 ms) duration.
What can account for these results? The task specification is simple: present each display for 200 ms. Yet this is not what happens. Instead, the results illustrate three glaring problems. First, the median duration is about 36 ms longer than was intended. Second, there are severe spikes indicating long unintended pauses in the experiment. Third, the spikes happen at random times.
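The first two problems can be quantified from a list of measured display durations, as in the Python sketch below. The function name `summarize_durations`, the 50 ms spike margin, and the sample values are all hypothetical choices for illustration. The sample values are not the data behind Figure 1.

```python
from statistics import median


def summarize_durations(durations_ms, intended_ms, spike_margin_ms=50):
    """Report the median timing error and the positions of long pauses.

    A "spike" is taken here to be any display lasting more than
    spike_margin_ms longer than the median (an assumed threshold).
    """
    med = median(durations_ms)
    spikes = [i for i, d in enumerate(durations_ms) if d > med + spike_margin_ms]
    return {
        "median_ms": med,
        "median_error_ms": med - intended_ms,
        "spike_indices": spikes,
    }


# Illustrative values only, not measurements from the experiment.
session = [236, 235, 238, 236, 374, 237, 236]
summary = summarize_durations(session, intended_ms=200)
```

Comparing `spike_indices` across sessions is one way to check the third problem: if the spikes come from the operating system rather than from the experiment itself, the indices should differ from run to run.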
The first problem, that the median duration is longer than the specified duration, is most strongly influenced by the time required for stimulus preparation and is detailed in the next section.
The second and third faults in the Figure 1 timing data are related to the spikes in timing durations. Why are some delays very long, and why do they occur at random intervals, changing even when the program is run under exactly the same conditions? The spikes are due to the Windows operating system taking control of the machine and halting the program to perform various administrative and/or “clean-up” activities. The operating system will take control and halt the program without any indication (other than the program being temporarily non-responsive). The program pauses for a period ranging from a few milliseconds to hundreds of milliseconds while the operating system performs actions such as managing virtual memory (i.e., portions of the computer’s memory are stored to or reloaded from disk to provide a larger effective address space).
These actions can occur at nearly any time (see footnote 2). They appear random because their timing depends on all of the other events that have occurred in the operating system since it was powered up, and perhaps even since it was installed. For example, if you read in a large word processing document, most of the “real” memory that you have is used up. You now start the experimental program, and the program begins reading in files for the images. Perhaps there is space in memory for the first fourteen images, and they come in fast (100 ms) as the operating system still has room in memory to store them. On the fifteenth image, however, the operating system has no new space in memory to store the image. The operating system then decides to save some of the word processing document to disk to make room. Because it was the operating system’s management routine that decided to make more space, this takes a higher priority than the program, and as a result the program is halted.
The process of the operating system procuring room and scanning the system to perform various housekeeping operations can take tens to hundreds of milliseconds. During this time, the program is paused. The program will once again read images fine for a while, and then may pause again. In the example shown in Figure 1, a spike occurs in the display duration on Image 54, Session 2.
It is very important to remember that the operating system is in control of your computer, not you. You do not issue commands to the operating system; you only issue requests. You may ask the operating system to do things, but it will, at times, delay your task to perform what it views as critical functions. The operating system typically pauses an executing program dozens of times a second without recording the behind-the-scenes timing changes.
A Note About Multi-Core and Multi-Processor Systems
Our testing has shown that many of the timing problems that arise from the operating system attending to other programs are significantly reduced when running on a processor with multiple cores. For example, Appendix A reports a 0% miss tick rate when we tested E-Prime on two different multi-core CPU systems, but a miss tick rate of approximately 10% when running the same test on a single-processor machine. Nevertheless, operating system-related timing errors cannot be completely eliminated, and the E-Prime solutions described in this chapter should still be applied to multi-core and multi-processor systems.
1 The Windows NT documentation on the virtual-memory manager states: The working-set manager (part of the operating system) periodically tests the quota by stealing valid pages of memory from a process… This process is performed indiscriminately to all processes in the system. Kath, R. (1992, December 12). The Virtual-Memory Manager in Windows NT. Microsoft Developer Network Technology Group, 12.
2 A request from your program or any other program can trigger operating system actions. With mapping of virtual memory to disk, your program and any files you read may not even be in memory. As you need parts of your program, they are loaded in (e.g., you present a stimulus from a different condition). You may see more delays in the first block of your experiment, as the operating system loads parts of your program as they are needed and continues to clean up from the previous program that was running.