-- Contributed by James St. James, Millikin University
While many of the people using E-Prime are intimately familiar with the intricacies of research, many are not, or are just beginning to learn. In this article, we present a brief overview of experimental research, including a review of some basic terms and ideas. It is not the purpose of this document to serve as an introduction to the use of E-Prime. Rather, its purpose is to aid the user of E-Prime in the conceptual development of experiments, and to consider a number of broad classes of issues regarding experimentation in psychology. If you are already quite familiar with research methodology as it pertains to psychological research, feel free to skip this article; it is particularly useful for those in need of a 'refresher course' in research methods.
Experimental Design Considerations
We begin with a consideration of some basic principles of research design. Then we consider a number of details of the single-trial, reaction time paradigm that is the basis of much of current psychological research. Skip any sections that cover familiar material. Parts of what we include below will seem too obvious to some, but we hope to aid the person using E-Prime who will benefit from a reminder of basic terms, or who is just beginning the process of learning to do research. Our emphasis here is on experimental research, and our examples lie there, but observational and correlational research requires consideration of many of the same points.
Because what follows is not a complete textbook of research methods, the references cited are largely to general sources. The reader is encouraged to go to those sources for more complete coverage and primary references. Many textbooks of research methods in psychology are available that will treat the general topics below in more detail. Most do not include considerations specific to single-trial reaction-time procedures, which we detail.
Definitions
Because they are so widely used in our discussion, we begin by defining dependent and independent variables and controls.
Dependent and Independent Variables
In designing and setting up an experiment using E-Prime, independent and dependent variables will have to be named. Dependent variables (DV’s) are measures of outcome, such as reaction time and accuracy. Independent variables (IV’s) are the aspects of an experiment that are manipulated by the experimenter. Note that, in an experiment, the value of the outcome measure is assumed to
depend upon, or be caused by, the condition under which the participant was tested—the level of the independent variable. Hence, it is a dependent variable.
Independent variables have two or more levels, which define the conditions under which the participant is tested. Examples would be type of stimulus, timing of stimuli, or any other aspect of the experiment that will be manipulated. The independent variables may be manipulated by randomly assigning participants to conditions (levels of the IV), or by applying each condition to each participant, in a random or counterbalanced order. In discussing statistical analysis of experiments, independent variables are also sometimes called factors. An experiment with more than one IV is said to use a factorial design.
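As a concrete (and entirely hypothetical) illustration of how the levels of multiple IV's cross in a factorial design, the following Python sketch enumerates the conditions of a 2 x 3 design; the factor names and levels are invented for illustration.

```python
from itertools import product

# Hypothetical factors for illustration: a 2 x 3 factorial design.
factors = {
    "stimulus_type": ["word", "nonword"],
    "soa_ms": [100, 300, 500],
}

# Each unique combination of levels defines one condition (trial type).
conditions = [dict(zip(factors, levels)) for levels in product(*factors.values())]
for c in conditions:
    print(c)  # 6 conditions in all: 2 levels x 3 levels
```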
Controls
Confounding variables are aspects of the experimental situation that are correlated with the IV’s that the experiment is intended to study, and that may be producing (or hiding) differences among different levels of the IV’s. An example may help. Suppose that a researcher wishes to compare two methods of teaching basic statistics. She teaches two sections of the course, and so decides to teach her 8:00 am class by Method A and her 10:00 am class by Method B. Suppose she finds that students who learned by Method B have significantly higher mean scores on the common final exam than those taught by Method A. Can she conclude that Method B is better? Hardly. Perhaps students are less alert in 8:00 classes than in 10:00 classes. However, suppose that she finds that there is no difference between the classes on the final exam. Can she conclude that the method of teaching doesn’t matter? Again, hardly. In this case, perhaps Method A is actually superior, but the 8:00 class was only half awake, and the only reason they did as well as those taught by Method B was that Method A was sufficiently better to overcome the problem of inattentiveness. In this example, time of day is confounded with method of teaching. (Confounding method of teaching with time of day is not the only problem with this design. The lack of random assignment of students to classes is also a problem.)
Controls include any aspects of the experiment that are intended to remove the influence of confounding variables. Controls are usually intended to remove variability caused by confounds, by making them constants, not variables. In the example above, that would involve controlling the time of day at which the classes were taught. Another example: in research involving visual displays, the experimenter might be concerned about the effective size of the stimuli, which would vary if different participants sat at different distances from the computer monitor. In that case, a relevant control would be to use a viewing hood or chin rest to make the distance from the screen the same for all participants. A third example: making sure that equal numbers of males and females are in the group of participants tested at each level of the independent variable, if it is suspected that there might be sex differences in performance. By having equal numbers of males and females, any effect of sex would be the same for each level of the IV, and any differences in average performance for the two levels would not be due to having more males in one group or more females in another.
Note that in the case of assigning equal numbers of males and females to each level of an IV, sex has actually been added as a blocking variable. If the participants' sex is recorded in the data file, the data can later be analyzed separately for males and females to explicitly check for sex differences. Blocking variables should always be included as "independent variables" in a data file. An advantage of matching groups on a blocking variable is that it serves both to control that confound and to permit examination of its influence.
Order effects are an important class of confounds, especially in experiments in which each participant serves at each level of the IV. Here, experiencing one level of the IV may change performance at another level. Examples would be when experience in one condition provides practice that improves performance in another condition, or when experience of the first condition induces a strategy that affects performance on the other. Two general solutions are available: counterbalancing and randomization. Complete counterbalancing guarantees that each condition precedes or follows each of the others equally often. (For experimental designs with many levels of an IV, complete counterbalancing is usually not possible, due to the number of participants required. In this case, a Latin square design can approximate complete counterbalancing.) An alternative is to randomize the order of presentation of the experimental conditions for each participant. Over a fairly large number of participants, this will approximate counterbalancing. Note that with either counterbalancing or randomization, recording the order of the conditions in the data file will permit a later explicit comparison of the performance of participants who received different orders of the experimental conditions.
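To make these options concrete, here is a minimal Python sketch (not E-Prime code) that builds a balanced Latin square of condition orders; a fully random order per participant could instead be drawn with random.sample, as shown at the end.

```python
import random

def balanced_latin_square(conditions):
    """One presentation order per row. With an even number of conditions,
    each condition appears in each serial position equally often, and each
    condition immediately follows every other exactly once across rows."""
    n = len(conditions)
    # Standard zig-zag first row: 0, 1, n-1, 2, n-2, ...
    first, lo, hi = [0], 1, n - 1
    while len(first) < n:
        first.append(lo)
        lo += 1
        if len(first) < n:
            first.append(hi)
            hi -= 1
    # Each later row shifts every index by one (mod n).
    return [[conditions[(i + r) % n] for i in first] for r in range(n)]

conditions = ["A", "B", "C", "D"]
for order in balanced_latin_square(conditions):
    print(order)

# Alternative: a fresh random order for each participant.
print(random.sample(conditions, len(conditions)))
```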
Before Beginning
Before beginning to design an experiment, carefully consider the broader research question that you are trying to answer. While the use of computers with software such as E-Prime makes it easier to run experiments, there are still substantial costs involved in paying participants and in the time spent testing participants and analyzing data. For that reason, time spent "up front" on careful theoretical considerations will avoid wasted effort and increase the chance of getting an interpretable and publishable result. In this section, we consider a number of issues that need to be addressed before and during the detailed process of experimental design.
What are the questions that need to be answered?
Before beginning to design an experiment, have a clear formulation of the questions you are trying to answer. Specify a hypothesis, or a statement about the expected effects of an independent variable on the dependent variable (e.g., reaction time will decrease as the flanking letters are moved farther from the target letter). The hypothesis may come from an explicit theory, may represent an extension of previous research, or may come from personal observation. In exploratory research, the questions may concern the nature of a phenomenon—what are the conditions under which the phenomenon (e.g., a visual illusion) occurs? Here, the concern is not with testing a theory, but with delineating a phenomenon. In confirmatory research, the research questions concern the explicit test of a theory about the nature of a phenomenon. If the experimental result is predicted in advance by the theory, that tends to confirm the theory. However, if an experimental result contradicts a prediction of the theory, it suggests that the theory is at least incomplete, and possibly incorrect. (A thorough discussion of the confirmation and falsification of theories lies far beyond the scope of this article. See Elmes, Kantowitz, & Roediger, 1992.)
How can the research questions be answered?
Whether the research is exploratory or confirmatory, the questions to be answered must be as specific as possible, so that what would count as evidence is clear. It is important that the questions be posed in such a manner that some kind of experimental task can be devised that can answer the questions. Comparisons are at the heart of any scientific question—it is expected that a dependent variable will, in fact, vary as the level of the independent variable(s) is changed. In confirmatory research, there is a specific prediction of at least the direction (possibly the degree) of the differences in DV’s as IV’s vary. For example, a theory might predict that RT would increase as the intensity of some IV changes. In exploratory research, there is no precise prediction about how the DV will change, but there is the expectation that changes in the IV’s studied will produce changes in the DV. If they do not, not much exploration has taken place.
How will data be analyzed?
The experiment and the data analysis should be co-designed. It is extremely important that the methods of data analysis be known in advance of collecting the data. There are several reasons why this is so. Since the point of the research is to make comparisons of DV’s at varying levels of IV’s, it should be clear in advance what comparisons would be made and how they will be made statistically. This can avoid nasty surprises later, such as discovering that a crucial variable was not recorded, or (worse yet) that no appropriate statistical test is available. There is some additional discussion of statistical analysis of single-trial RT data below.
Before collecting the data, it is useful to draw a graph of the expected data, to be clear as to what patterns of RT's would support or disconfirm the hypothesis. If there are other plausible assumptions about the effects of the IV(s), graph those as well. Such a graph will help clarify predictions. Plotting the expected means along with expected standard error bars (perhaps derived from pilot testing) can give a good perspective on what is expected to be seen, and what differences might be significant. As a rule of thumb, a difference between means of about two standard errors is likely to be significant. A statistical power analysis is useful as well, to help judge the likelihood of getting a statistically significant difference between means based on the size of the differences expected, the variability of the data, and the sample size.
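As one concrete way to run the power analysis mentioned above, the following sketch uses the statsmodels package (an assumption; any power calculator will do); the effect size, alpha, and power values are illustrative.

```python
# A minimal power-analysis sketch, assuming statsmodels is installed.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # expected difference in SD units (Cohen's d); illustrative
    alpha=0.05,       # significance criterion
    power=0.80,       # desired probability of detecting the effect
)
print(f"Participants needed per group: {n_per_group:.1f}")  # ~64 for these values
```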
How will the experimental tasks be presented?
A careful review of the pertinent literature is a natural starting place for any research, usually focusing on previous theoretical and empirical developments. However, a review of the Methods sections of experiments using similar tasks is also likely to be rewarding. Such a review may alert you to considerations you had not thought of, saving much pilot testing. If others have published research using similar tasks, it might be worthwhile to discuss the design with the authors and take advantage of any insights not made a part of the formal report of Methods.
A host of considerations comes into play in the detailed design of an experiment and the analysis of the data it produces. While some are specific to a restricted research domain, others are more general. The discussion below of the minutiae of single-trial, reaction-time research highlights many of those considerations.
Implementing a Computerized Experiment
Once you have thoroughly thought out the question you wish to answer and how you plan on answering it, you are ready to begin designing the experiment. Do not rush the previous planning stage. It is critical to have properly prepared before attempting to design or implement an experiment.
Constructing the experiment
Work from the inside out (or the bottom up). The best way to construct an experiment is to get a few trials going before typing in full stimulus lists and instructions. We recommend leaving instruction screens blank and specifying a minimal list of stimuli; usually one from each level of a single IV is sufficient. Once certain that the basic trial is running and that the data are stored correctly, add the other IV's, additional stimuli, instructions, and other details. It is fairly often the case that in setting up an experiment, it becomes clear that further variables need to be specified in stimulus lists. Going back to a long list of trials and adding those to each can be frustrating. Thorough testing with a few trials of each type will usually catch such errors while they are easy to correct. As an example, suppose that you have to type in 200 words to serve as stimuli, and designate each word by frequency and length. If you then decide to add concreteness as an additional IV, you must enter the concreteness rating for each of the 200 words. If, however, you first test the experiment with only four words and discover that an additional IV is needed, only four entries must be fixed.
Pilot testing
Once the experiment is set up, perform a couple of levels of pilot testing. The first level is to sit through the whole experiment yourself. You may notice errors you did not spot before, or realize that the experiment is too long and should be run over multiple sessions. Do not expect participants to undergo an experimental procedure that you are not willing to sit through yourself. This is especially important if someone else sets up the experiment according to your specifications. As the Cold War arms-control negotiators used to say, "Trust, but verify." The second level of pilot testing should be to have two or three people complete the experiment. These should be laboratory assistants, colleagues, or others who might spot potential problems. Especially if using students as pilot participants, let them know that they should report anything that seems like a problem.
Once the pilot data are collected, perform a full data analysis. Although it isn't likely that so few participants will give the statistical power needed for "significance," you can satisfy yourself that the relevant variables are recorded and that you know how to proceed with the analysis. An important aspect of the analysis is knowing how to structure the data for the program being used. Note that most statistical programs will read a tab-, comma-, or space-delimited ASCII (or DOS) file, which should have the data for each participant on a single line. With reaction-time research, it is common to use the mean RT's for each condition as the data for analysis, rather than single-trial data. That can be produced using the Analyze feature of the E-DataAid application within E-Prime.
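For those doing that aggregation step outside E-Prime, here is a minimal pandas sketch; the file name and the column names (Subject, Condition, RT, ACC) are assumptions standing in for whatever the actual data file contains.

```python
import pandas as pd

# Hypothetical tab-delimited single-trial data file.
trials = pd.read_csv("pilot_data.txt", sep="\t")

# Keep correct trials only, then average RT within each condition.
correct = trials[trials["ACC"] == 1]
means = correct.groupby(["Subject", "Condition"])["RT"].mean()

# One row per participant, one column per condition: the wide layout
# most statistics packages expect.
wide = means.unstack("Condition")
wide.to_csv("mean_rts.txt", sep="\t")
print(wide)
```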
Formal data collection
Once formal data collection has begun with actual research participants, it is a good idea to debrief at least the first few participants rather extensively when they finish the experiment. Check that they understood the instructions. Ask whether they noticed anything that seemed unusual, and ask them about strategies they may have adopted. Participants sometimes read into an experiment all sorts of demand characteristics that the experimenter never intended. Do not assume that the participants are going to tell you about aspects of the experiment that bothered or puzzled them. Therefore, explicitly ask whether anything seemed "wrong." Note also that the colleagues or laboratory assistants who may have served as pilot participants bring a special expertise to bear, so they may spot problems a naïve participant would not. However, they may also, for the very same reason, overlook problems in instructions and procedures that will bother the naïve participant.
Also review both single-participant and mean data as the first few participants complete the experiment. Look for extremes of variability. In a single-trial, reaction-time paradigm, look at a table of mean RT's, standard deviations, and error rates by trial type. Extreme standard deviations or error rates may indicate that a particular trial type is not being presented as expected, or that participants are not reacting as instructions suggested.
Familiarizing participants with the situation
Especially with computerized experiments, it is sometimes necessary to spend time making sure participants are comfortable, because they need to do the experimental task without being distracted by the surroundings. Many researchers rely on undergraduates as participants, and can assume a familiarity with computers. However, in an elderly population, the participants may not be familiar with computers. In research with psychiatric patients, a strange situation may significantly alter the participant’s ability to comprehend and focus on the experimental task. Giving a session of practice just to overcome the threat of a new, strange environment may be well worth the extra time and trouble. Research with children introduces other problems, such as understanding instructions. The use of a response box with just a few keys might be helpful in some of these situations by reducing the distractions inherent in a computer keyboard with its 100+ keys.
If data collection will take place on a computer other than the one used for setting up the experiment, be sure the pilot testing is done on the computer to be used for the final experiment. Problems that can arise include differences in display sizes, as well as problems of switching to a different graphics adapter.
Where will data collection take place?
Give some prior consideration to the setting for data collection. Most often, this is done with one participant at a time, in a laboratory setting. Sometimes, however, the data collection may take place in a hospital or clinic, or another setting.
Considerations in regard to the location of the laboratory include:
1) Limiting outside noise and interruptions. If tests must be done in a noisy environment, it may help to use a white-noise generator played over a speaker or headphones to block most extraneous sounds. If several participants are tested at once, it helps to have dividers or separate booths, since this discourages the participants from chatting among themselves.
2) Control of lighting. Glare on the computer monitor is sometimes a problem, especially in a relatively dimly lit room. This can be a major problem when using brief, data-limited displays. Adjust the position of the monitor to eliminate glare. Also adjust the brightness and contrast so that the display is clear and sharp. Note that a display seen in a brightly lit room may look too bright (and fuzzy) when the lights are dimmed.
3) Control of access to the computer. It is a good idea to place the computer itself where the participant cannot reach the controls (i.e., so that they do not reboot the machine, pop out a floppy disk, or adjust the monitor settings).
4) Comfort. If participants must sit for lengthy experimental sessions, be sure to have a comfortable chair. A few minutes spent adjusting the chair for a tall or short participant may reduce their discomfort considerably. Ambient temperature should also be comfortable.
5) Testing multiple participants. If several computers are available, consider testing several participants at once. Verbal instructions can be used if all participants start at the same time, but if they do not, it may be better to present all instructions on the screen. If so, be sure to test those instructions thoroughly beforehand—instructions that were thought to be perfectly clear may not be for the participant population. Another consideration for multiple-participant testing arises when using auditory presentation of stimuli or tones to signal error trials. Participants can easily be confused about where the sounds are coming from; however, headphones can usually avoid that problem.
Is a keyboard the right input device?
Typically, keyboards are used for response collection, usually limiting the allowable keys to those used for responses. However, in many situations, a keyboard may cause problems. Participants can easily be confused about which keys are being used. In a darkened room, locating the right keys can be difficult. If participants have to look at the keyboard to locate the keys they need, the effect on recorded reaction times can be disastrous. Especially with children, the temptation to play with the keyboard may be too great. A good alternative is to use a response box with only a limited selection of keys, such as the Chronos® Response and Stimulus Device available from Psychology Software Tools. Custom response boxes can also be made, using the Custom Expansion Kit with the Chronos® Response and Stimulus Device.
The Single-Trial, Reaction-Time Paradigm
An experiment using the single-trial, reaction-time paradigm consists of one or more blocks, or sets, of trials. Each trial consists of the presentation of at least one stimulus, and the collection of the time required for the participant to respond. The trials vary (within or between blocks), with each trial type representing a single level of an IV (or the unique combination of levels of two or more IV’s). The principal DV is RT, though accuracy (usually percent error or percent correct) is also examined as a secondary DV. Both RT and accuracy are recorded as the DV’s for each trial, but the analysis is usually based on the mean RT (or percent correct) for each trial type, averaged across all correct trials.
The concern for single-trial, reaction time experiments is how various independent variables affect RT—that is, how RT is changed when we deliberately manipulate the stimuli in some way. Inferences about cognition and perception are then made, based on the pattern of RT changes with changes in the independent variable(s). However, RT is also affected by many variables that are not of direct interest. These potentially confounding variables must be controlled in some way so that they do not influence the outcome.
Following a definition of RT, we present a discussion of the events that occur in a typical RT experiment. Then we discuss a number of confounds that can affect RT, and which must be considered in designing experiments employing RT as a dependent variable.
RT defined. For most research in psychology, RT is defined as the time from the onset of a stimulus to the time the participant responds. For computerized experiments, this is usually the time from stimulus onset until a key is pressed indicating a response.
It is important to note that RT can vary, depending on the particular response required. Suppose that there are two versions of an experiment, differing only in how the participants respond to indicate which of the two stimulus types has been seen. In one version, they must press the '1' and '2' keys on the computer keyboard to indicate which type of stimulus appeared. In the other version, they must press a lever to the left or right to indicate the stimulus. Overall RT might well be longer in the case of the lever-press, because the mechanical resistance is higher, or because the distance to be moved is farther, or because different muscles are employed in the two types of responses. Caution is therefore required in comparing the results of experiments using different responses: differences in the obtained RT's might be due solely to mechanical factors, and not reflect any differences of interest. Whether a relatively fast key-press or a relatively slow lever-press was used will affect overall RT, but in either case, the difference in time to respond to the two types of stimuli may be about the same. In comparing experiments, then, the crucial issue is whether the same pattern of differences in RT's is seen, rather than whether overall RT differed.
While we have defined RT as the time from stimulus onset to a response, it is sometimes defined in other ways. In much research in kinesiology, for example, RT is defined in relation to the onset of a muscle potential (electromyographic signal), while the time from that first electrical activity in the muscle to when the response movement itself is completed is called Motor Time. Because RT is sometimes defined differently, and because it can depend on the nature of the response apparatus, it is important
in RT research that the definition of RT and the nature of the response be made explicit, and reported in the Procedures section of the research report.
RT is also sometimes classified as simple RT or choice RT. In simple RT, a participant makes one response to a single stimulus. This requires only a judgement about the presence of a stimulus, and does not involve a decision about the nature of the stimulus. Choice RT is measured when more than one type of stimulus can occur, and the participant must indicate the stimulus type by his or her choice of responses. Because research on simple RT is rare, “RT” means choice RT unless noted otherwise.
General Considerations
In developing the general considerations for RT research, we examine issues concerning the events that take place on each trial, how blocks of trials may differ, and finally how these combine to form the experiment as a whole.
An example experiment
To permit concrete examples of the issues discussed, we begin by outlining an experiment that could be fairly easily implemented in E-Prime. The intent here is clarity, rather than scientific importance. Suppose that you wish to examine how RT to a stimulus is affected by changes in the location of the stimulus. Visual acuity is best for objects in foveal vision—the small, central part of the visual field, and drops rapidly for objects further away in peripheral vision. But does that affect RT? The following experiment would help answer that question.
The principal dependent variable is RT, with accuracy (percent correct) as a secondary dependent variable. The independent variables are the location of the stimulus and whether it is adjusted in size to compensate for poorer peripheral acuity. The stimulus is a letter, presented in a random location on the screen. Stimulus letters are centered on locations 0, 2, 4, 8, and 16 degrees left and right of straight-ahead. Letter sizes are chosen to compensate for the distance from central vision (reference). The letters to be presented are C, G, O, and Q, with one response for C and O and another for G and Q. These letters were chosen because C and G share a lot of feature overlap, as do O and Q, so the discrimination is fairly difficult. Four different letters are used so that participants cannot rely on a single feature, such as the tail of the Q, for the discrimination.
What happens on each trial?
Typically, RT experiments consist of one or more series (blocks) of trials. While the specific stimulus may vary from trial to trial, certain aspects of the experiment are usually the same on each trial. There is often a fixation mark of some kind, to let the participant know where he or she should be looking when the trial starts. Initiation of a trial may be under the participant’s control, allowing the participant to begin a trial whenever he or she is ready. Alternatively, initiation of a trial may be automatic, controlled by the experimenter or computer. In this case, a warning signal is typically given, to allow the participant to get ready for the trial. Sometimes the appearance of the fixation mark acts as the warning, and sometimes a tone or other signal is used. After a trial is initiated (by the participant or automatically), there is usually a brief delay before the stimulus appears. This delay is called the foreperiod, and may vary from trial to trial or be fixed (unvarying). The foreperiod is usually fixed for choice RT tasks.
At the end of the foreperiod, the stimulus is presented. In many experiments there is only a single event making up the overall stimulus. In others, there may be distracter elements displayed on the screen, or stimuli that serve as primes. In either event, timing of the reaction begins when the critical stimulus is displayed. The critical stimulus refers to the element in the display that determines the appropriate reaction (i.e., which key to press). This is sometimes called the “imperative” stimulus. The stimulus duration (how long it remains in view) will largely be controlled by the nature of the stimulus display. For example, if eye movements during the stimulus presentation could affect the experiment, a very brief (say, 100 ms) presentation is often used, since it takes about 200 ms after the stimulus appears for an eye movement to begin. If the stimulus duration is so short that the participant gets only a glance at the stimulus, the display is described as a data-limited display. Other situations involving data-limited displays are discussed below.
Another issue for defining a trial is that of how long to give the participant to respond. Typically, the participant must respond with a key-press within some limited time. The choice of that time depends on the sorts of RT’s expected, with the time allowed being set so as to encompass any legitimate trials. If the task is an easy one, with RT on most trials being less than 500 ms, the time allowed for a response may be relatively brief (e.g., two seconds or so). If no response occurs in that time period, the trial is counted as an omission. Many harder tasks, however, have typical RT’s of 1-2 seconds. In this case, the time allowed for a response should be increased accordingly.
Feedback about accuracy and/or RT is usually given following a response. Feedback about accuracy is usually provided, telling participants whether they were right or wrong in their choice of a response. It should be noted, though, that participants are generally aware of having made an incorrect response. The accuracy feedback emphasizes the importance of correct responding. Because the usual RT instructions emphasize speed of reactions, RT feedback is important, since it lets participants monitor their own performance. Many researchers prefer not to report RT on error trials, to avoid encouraging participants to respond so quickly that accuracy is reduced.
The inter-trial interval (ITI) is the time from the end of one trial to the beginning of the next. If the participant controls initiation of the next trial, the participant also controls the ITI. When it is important to control ITI, trial initiation must be controlled by the computer or experimenter.
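The sequence of trial events just described can be summarized in skeletal Python. Everything here is a simulation stand-in rather than E-Prime code: the display and response routines are hypothetical placeholders, and all timing values are illustrative.

```python
import random
import time

# Illustrative timing parameters, in seconds (assumptions).
FOREPERIOD = 0.5          # fixed foreperiod before the stimulus
RESPONSE_DEADLINE = 2.0   # time allowed before scoring an omission
ITI = 1.0                 # inter-trial interval

# Console stand-ins for real display/response routines (hypothetical).
def show_fixation():          print("+")
def wait_for_start_key():     pass  # real code would wait for a key press
def show_stimulus(stimulus):  print(stimulus)
def show_feedback(acc, rt):   print("correct" if acc else "error", f"({rt * 1000:.0f} ms)")

def collect_response(deadline):
    """Simulated key press; a real experiment would poll the keyboard."""
    rt = random.uniform(0.3, 2.5)
    if rt > deadline:
        return None, None
    time.sleep(rt)
    return random.choice(["1", "2"]), time.monotonic()

def run_trial(stimulus, correct_key):
    show_fixation()
    wait_for_start_key()              # participant-initiated trial
    time.sleep(FOREPERIOD)            # fixed foreperiod
    onset = time.monotonic()
    show_stimulus(stimulus)           # critical (imperative) stimulus;
                                      # a data-limited display would erase
                                      # it after ~100 ms while the response
                                      # window continues
    key, press_time = collect_response(RESPONSE_DEADLINE)
    if key is None:                   # no response in time: an omission
        return {"rt": None, "acc": None}
    rt = press_time - onset           # RT: stimulus onset to key press
    acc = int(key == correct_key)
    show_feedback(acc, rt)            # feedback after the response
    time.sleep(ITI)                   # inter-trial interval
    return {"rt": rt, "acc": acc}

print(run_trial("C", "1"))
```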
In some experiments, there may be more than just a single stimulus presented on each trial, or there may be a prime and then a stimulus that calls for a response (sometimes called the imperative stimulus). For example, if the participant must judge whether two letters they see are the same or different, they might see one letter and then see the second some short time later. The delay before the second stimulus can be described in two ways: the inter-stimulus interval (ISI) is the time from the offset of the first stimulus to the onset of the second, while the stimulus onset asynchrony (SOA) is the time from the onset of the first stimulus to the onset of the second.
In the example experiment on visual acuity, a central fixation mark would be required so that measures of stimulus location would be accurate. Because the locations must be specified and the proper size letters chosen to compensate for distance from fixation, it would be necessary to control the distance from the participant to the screen, using a viewing hood or chin-rest. The distance to the screen, and the resulting display sizes (in degrees of visual angle—see below) should be included in the Methods section of the final report. To be sure that participants do not turn their eyes and re-fixate the letter in central vision, a data-limited display would be needed; a 150 ms display would control for this. Participants might employ a strategy of guessing the location of the upcoming stimulus and moving their eyes there in advance, and thus not be looking at the fixation when the trials begin. This can be prevented by stressing in the instructions that participants should be looking directly at the fixation when they start each trial, and also by randomly presenting the stimuli to the left or right of fixation. If participants adopt a guessing strategy, this will lead to them missing many stimuli completely, and the high error rates will clearly show that there is a problem.
Because of the brief displays and the need to guarantee that the participant is looking at the fixation, participant-initiated trials should be used, with a fixed foreperiod. RT’s for this task should be fairly fast, so it would probably be appropriate to limit the allowed response time to 2 seconds or less. Accuracy feedback would be used, with RT reported on correct trials only.
What happens within a block of trials?
The entire series of trials making up an experiment is usually divided into blocks of trials. The division may simply reflect time constraints. In a long experiment, it is best to ensure that participants take occasional pauses, so it may be best to break the entire series into shorter blocks, with rest pauses between them. More importantly, the division of the experiment into blocks may be an integral part of the experiment itself. The rest of this section treats that situation.
Blocked versus random presentation
Suppose there are two or more different sorts of trials being presented in an experiment (two or more independent variables, with two or more levels of each). A question to consider is whether these different sorts of trials should be presented together in each block with the various types alternating in random order, or whether the series of trials should be blocked, with all trials of one type presented, followed by all trials of the other.
Compare two versions of the letter-identification experiment, each containing two types of trials. On four-choice trials, participants must respond by pressing one of four keys to indicate which of four letters is present (four-choice RT); on two-choice trials, only two letters are used (two-choice RT). In order to directly compare the four-choice RT to the two-choice RT, the two types of trials could occur either randomly (within a single block), or blocked, with all of the two-choice trials occurring together and all of the four-choice trials occurring together. The two versions of the experiment differ only in that choice of random versus blocked presentation.
In general, we expect that RT will increase with the number of choices (Wickens, 1992). If participants completed one block of two-choice and one block of four-choice, that would probably be the outcome. But with random presentation, that may not be so. Why not? In this experiment, it is likely that random presentation would lead the participants to ignore whether the trial is two- or four-choice. That is, the participants seeing the stimuli in random order may not bother to pay attention to whether the trial involves two choices or four, but rather treat all of the trials as if they involved four possible choices. That would increase the mean RT for two-choice trials, while having no effect on four-choice trials. That is, the results of the experiment depend (in part) on the choice of blocked versus random presentation of the stimulus types.
In general, then, the choice of random or blocked presentation must depend on whether participants given random ordering of trials will adopt different strategies than those given blocked order. In the case of the experiment above, participants in the random-order experiment might adopt the strategy of ignoring whether there were two choices or four, and treat the two-choice trials as if they were four-choice trials. Thus, the blocked version gives us the better estimate of the actual time required for choosing between two and four response choices.
When blocked presentation is used, the issue of counterbalancing of treatment orders is raised. In the blocked version of the two- versus four-response experiment (two levels of one independent variable), half of the participants would do the two-choice trials first, while half would do the four-choice trials first. This counterbalancing is designed to remove (or at least balance) any effects of carry-over from one block of trials to the next.
Certain confounding variables are usually controlled by counterbalancing. One is the mapping of stimuli to responses. If the experimenter is interested in comparing the speed of reactions to the targets 'C' and 'O' with the speed for the targets 'G' and 'Q' in the two-response version of the experiment above, half of the participants should respond by pressing the '1' key for 'C' and 'O' and the '2' key for 'G' and 'Q,' and half should respond in the opposite way, pressing the '2' key for 'C' and 'O.' This controls for any possible difference in RT due to the different responses themselves, and is necessary because some muscular actions take longer than others.
If comparing detection of letters under conditions in which the letters either were or were not adjusted in size, the comparison of interest is adjusted vs. constant size; since the '1'- and '2'-response trials will be averaged together, counterbalancing may not be necessary. In other experiments, however, it can be absolutely crucial. Consider, for example, a version of the letter-choice experiment in which two letters are presented and the participant must indicate that the letters are the same by pressing one key, or that they are different by pressing another. Since one aspect of that experiment is to compare "same" and "different" responses, it would be important to counterbalance the mapping of the response keys to same and different stimuli. Otherwise, a difference between RT to "same" and "different" might be interpreted as reflecting the difference in the stimuli, when it really reflected a difference in the time to press the '1' key versus the '2' key; the apparent difference would be due to a lack of proper counterbalancing. (Alternatively, a failure to counterbalance might lead to a finding of no difference, when there really was one.) A sketch of how such an assignment might be made follows.
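This minimal Python sketch assigns the key mapping by participant number, so that the two mappings occur equally often; the mapping itself follows the C/O vs. G/Q example above.

```python
# Counterbalance the stimulus-response mapping across participants:
# even-numbered participants get one mapping, odd-numbered the other.
def response_mapping(participant_id):
    if participant_id % 2 == 0:
        return {"C": "1", "O": "1", "G": "2", "Q": "2"}
    return {"C": "2", "O": "2", "G": "1", "Q": "1"}

print(response_mapping(7))  # e.g., participant 7 presses '2' for C and O
```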
Ordering of trials within a block
When each type of trial is presented within a single block of trials, it is almost always the practice to randomize the order of trials. This is equivalent to writing down each trial on a card (including multiple cards for repetitions of the same stimulus) and then shuffling the cards. There is, however, a problem that can be caused by randomization. Suppose that there are two types of trials. In a single block, 100 trials of each type are presented, in a random order. It is likely that some fairly long sequences of a single trial type will occur, with a single type presented 7 or 8 times in a row. Because humans expect randomness to produce shorter sequences than it really does, participants tend to fall into the gambler’s fallacy. If a single trial type occurs 6 times in a row, participants will often either decide that the other trial type is “overdue” and expect it, or they will decide that the type they have seen is more likely to occur and expect it again. In either case, if the expectation is correct, the participant will probably be very fast and accurate. If the expectation is wrong, the participant will be slow and error-prone.
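One common remedy is to constrain the randomization so that no trial type can repeat more than a fixed number of times in a row. A minimal sketch follows, using rejection sampling with an illustrative limit of three; note that constraining run lengths makes the sequence slightly less random, so the constraint used should be reported.

```python
import random

def shuffled_with_max_run(trials, max_run=3):
    """Shuffle until no trial type occurs more than max_run times in a row."""
    while True:
        order = random.sample(trials, len(trials))
        run = longest = 1
        for prev, cur in zip(order, order[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        if longest <= max_run:
            return order

trials = ["A"] * 10 + ["B"] * 10  # two trial types, 10 trials each
print(shuffled_with_max_run(trials))
```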
What happens within the whole experiment?
An experiment is composed of one or more blocks of trials. If the experiment is particularly long, it may be broken down into sessions of one or more blocks each. In that case, counterbalancing of blocks across sessions may also be required. An experiment most often begins with instructions about the nature of the experiment, and some practice trials. When the experiment is concluded, some form of debriefing is often used to show the participant the purpose of the experiment and to permit questions about it. Instructions, practice, and debriefing are considered separately below.
Instructions
The purpose of the instructions, in any experiment, is to let the participant know what will be happening and what the correct responses are. In RT research, instructions should also emphasize that participants are to respond as quickly as possible while still remaining accurate. “Accurate” is typically considered 10% or fewer errors, though this would also depend on the specific experiment.
In long experiments, it is also advisable to instruct participants to take occasional breaks. If trials are initiated by the participants, these breaks are under the participants' control. Otherwise, it is a good idea to "build in" breaks by having blocks of trials that are fairly short (e.g., 5-10 minutes). Occasional breaks keep the participants from just staring at the screen and pressing keys like zombies. This means that participants are less error-prone, and also that RT is less subject to added variability due to eye strain, mental fatigue, and the like.
Practice
Most experiments ask people to do unfamiliar tasks, and require them to indicate their responses by pressing keys that have no previous association with the stimulus. If asked to press the '1' key if a 'C' or 'O' appears and the '2' key if a 'G' or a 'Q' appears, participants must first learn to associate 1 with C and O and 2 with G and Q. At first, participants will be very slow and error-prone in their responses, simply because they have to carefully think about which key to press after they identify the target letter. After a while, participants no longer have to think about which key to press, and their responses become faster and more accurate. For this reason, some practice on the task is usually given before data collection actually begins. The effect of this practice is to reduce the variability of RT during the experiment itself. The number of practice trials can be determined during pilot testing. It is also a good idea to stand and watch the participant during the practice trials, to be sure they understand the task. You may sometimes need to encourage them to slow down, if they are making many errors. Once they clearly understand the task, encourage them to try to speed up. Presenting the mean accuracy after each trial or block of trials can be useful.
In a short experiment, completed in a single session, one block of practice trials is usually all that is needed. If the experiment extends over several sessions, a brief block of practice trials is usually given at the beginning of each session, and the first session is often treated entirely as practice. If the type of stimulus display or the responses change from block to block, it might also be necessary to have practice before each block of trials.
Debriefing
When an experiment is over, it is usual to debrief the participant. The debriefing typically is a simple matter of telling the participant what pattern of RT’s is expected to be found and why. That is, the debriefing is used to explain to the participant what the experiment was about. Participants may also be shown their individual results. A second reason for a debriefing is to get comments from the participants about their own experience. While such comments may not be part of the data proper, they can sometimes reveal the use of strategies that the experimenter had not considered, or may even point out flaws in the design. Remember that participants have spent some of their time during the experiment trying to figure out “what is going on.” In doing so, they may notice things about the experiment that the experimenter never noticed—including problems.
How many trials?
Why not just have the participant respond once to each type of display, and take that single RT as the “score” for that condition? This would certainly be faster, since few trials would be needed. The problem with using this procedure, however, is that it ignores the large variability in RT that is due to factors other than the independent variables. RT varies from trial to trial, even if the stimulus does not. That variability comes from momentary changes in attention and muscular preparation, among other things. Note that participants cannot pay attention evenly and uniformly for any length of time. Even when you are listening to a fascinating lecture, you will find your attention wandering from time to time. The same thing happens in RT experiments, when the participant sits doing trial after trial. Occasionally, participants will start a trial when their attention is not focused on the task. When this happens, a very long RT usually results. Long RT’s due to inattentiveness would be expected to occur about equally often for all stimulus types, so averaging a few such trials with many others does not create a problem.
Another way to look at the problem of number of trials per condition is to realize that the RT on each trial provides an estimate of that participant's "true" RT for that condition. Each individual estimate is not very reliable, for the reasons given above. Therefore, averaging a number of estimates (RT's on many trials) provides a better (more reliable) estimate of "true" RT. Recall that the confidence interval estimate of a population mean becomes more and more precise as the sample size increases. Similarly, the estimate of true RT becomes better and better as sample size increases, though in this instance, sample size refers to the number of trials per participant, rather than the number of participants. By employing the formula for the confidence interval, one can determine the number of trials needed to achieve a certain level of accuracy. In practice, 15-30 trials per condition per participant seem to provide a satisfactory result. This is enough trials that a few aberrant trials will have little effect on the mean RT for that condition.
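As a worked example of that calculation, under the usual normal approximation the required number of trials is n = (z s / E)^2, where s is the within-condition standard deviation of RT and E is the desired margin of error; the values below are illustrative.

```python
import math

z = 1.96       # 95% confidence
sd = 100.0     # assumed within-condition SD of RT, in ms (illustrative)
margin = 40.0  # desired precision of the mean RT estimate, in ms

n = (z * sd / margin) ** 2
print(f"Trials needed per condition: {math.ceil(n)}")  # 25, within the 15-30 range
```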
Between- Versus Within-Participants Designs
Another issue of importance to RT experiments is that of whether the independent variables should be manipulated between participants or within participants. Between-participants variables are ones where different participants are tested on each level of the variable. For the example of two- versus four-choice RT, that would mean that participants do either the two-choice version or the four-choice version, but not both. Within-participants variables are those where each participant is tested at each level of the variable. For the same example, this would mean that each participant does both two- and four-choice trials (in either random or blocked order).
Which method is preferred? We use a different example here, to simplify. Suppose an experimenter wanted to determine the effect of alcohol on RT’s to a simple stimulus, and had 20 participants available. He or she could randomly assign 10 participants to perform the task drunk and 10 to perform it sober, then compare those mean RT’s. This would be a between-participants design. But why not
test each participant both sober and drunk? That way there are 20 participants in each condition. This would be a within-participants design. (Of course, the experimenter would want to counterbalance the order, and test some participants sober and then drunk, and others drunk and then sober.) It should be clear that an analysis based on 20 participants per group is more powerful than one based on only 10 participants per group. (Note that the type of statistical analysis would change slightly, since a within-participants design violates the assumption of independent samples. In this case, comparing two means, the t-test for independent samples would be used with the between-participants design, and the t-test for dependent ("correlated," "matched-pairs") samples with the within-participants design. If there were several levels of dosage used, the appropriate test would be the standard ANOVA for the between-participants design, and the repeated-measures ANOVA for the within-participants design.)
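A brief sketch of that analysis choice using scipy (an assumption; any statistics package offers the same tests), with made-up mean RT's per participant:

```python
from scipy import stats

# Mean RT per participant in each condition (illustrative numbers, in ms).
drunk = [512, 540, 498, 560, 530, 505, 549, 522, 537, 515]
sober = [470, 488, 455, 502, 481, 460, 495, 473, 490, 468]

# Between-participants design: two separate groups of 10.
t_between, p_between = stats.ttest_ind(drunk, sober)

# Within-participants design: the same participants tested twice,
# so the samples are paired, not independent.
t_within, p_within = stats.ttest_rel(drunk, sober)

print(f"independent-samples t = {t_between:.2f}, p = {p_between:.4f}")
print(f"paired-samples t = {t_within:.2f}, p = {p_within:.4f}")
```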
The main thing to note about the example above is that a within-participants design is clearly better, if it is appropriate to use it, because it effectively increases sample size. But there are severe limitations to its use as well. A within-participants design works fine in this example because if the experimenter tests participants drunk, then tests them sober a few days later, she can be fairly sure that the only systematic difference in the participants is in whether or not they are sober. Similarly, when comparing RT to two versus four stimuli, making a choice between two stimuli probably does not have a later effect on making a choice between four stimuli (or vice versa)—at least if the trials were blocked. But in many situations the assumption that there is no carry-over from one condition to another is not justified. For example, to compare RT in naming meaningless shapes following two different types of training, a between-participants design is needed, because once a participant learns something by one method, that learning cannot be "erased." If participants performed faster following the second round of learning, is it because that method of learning is better? Or is the difference simply due to the added learning? Another situation in which a between-participants design is required is when the variable is "attached" to the person, and cannot be experimentally manipulated. Variables of this kind include sex, race, ethnic background, and religion.
In general, then, within-participants designs are to be preferred if it is reasonable to assume that there are no carry-over effects of one level of an independent variable on performance at other levels of that independent variable. If that assumption is not reasonable, a between-participants design should be used. Note that this is similar to the issue of random order of trials versus blocking by trial type—if encountering one level of a variable might induce a strategy that carries over to another level, the levels should be blocked, when using a within-participants design. If blocking will not solve the problem, a between-participants design will be necessary.
Another way of thinking about when to use a within-participants design is to consider whether the effect of the experimental “treatment” or manipulation wears off. If the passage of time will erase the effects of the manipulation, a within-participants design may be appropriate. An old joke illustrates the point. A lady walking down the street saw a man lying drunk in the gutter. “Sir,” she said, in obvious disgust, “You are drunk!” Opening one eye the man replied, “Yes, madam, and you are ugly. And tomorrow, I shall be sober.” Some treatments wear off, and thus are candidates for within-participant manipulation.
There are also some experiments that employ both within- and between-participants independent variables. These are usually referred to as mixed designs. For example, to compare the patterns of RT’s for males and females in the letter-identification experiment, sex would be added as another independent variable, in addition to location and whether the letter size was adjusted to compensate for distance from central vision. Location and adjustment would be within-participants variables. But sex (male vs. female) would be a between-participants variable, since no participant could be in both groups.
Other Considerations in RT Research
A number of other factors that must be considered in designing research employing RT as a dependent variable are discussed below. Wickens (1992, Chapter 8) provides a more detailed account of most of these same issues.
Speed-accuracy trade-off
In research employing RT as the dependent variable, the interest is usually in showing that RT differs for different levels of the IV(s). A serious problem can arise, however, if the conditions associated with faster RT also have higher error rates. Such a situation is called a speed-accuracy trade-off, because the participants may be sacrificing (trading) lower accuracy for greater speed. That is, they may be faster on those trials because they are pushing themselves for speed, but ignoring the higher error rate that often goes with that effort. Consider the comparison of RT’s in the letter-identification task.
Suppose that no differences in RT were found with increased distance from foveal vision, in contrast to the expected finding of an increase in RT to identify letters seen less clearly. If the error rates were seen to increase with distance from foveal vision, this would suggest that participants were trading accuracy for speed—in order to maintain the same speed of response under more difficult conditions, the participants were permitting the error rates to climb.
Fortunately, in most RT research a speed-accuracy trade-off does not occur. In fact, most of the time the fastest conditions will have the lowest error rates, while the longest RT’s will come in conditions with the highest error rates. In this case, difficult stimuli lead to both slow and sloppy responses. In any case, it is a wise practice to examine error rates for evidence of a speed-accuracy trade-off. To avoid this problem, instructions to the participants usually stress that they must be as fast as they can in each condition but without sacrificing accuracy. That is, the error rates should be uniformly low for all conditions.
Stimulus-response compatibility
In most RT research, the connection between the stimulus and the response is arbitrary. Participants may be instructed to press ‘<’ for an S and ‘>’ for an H, or ‘>’ for an S and ‘<’ for an H. But occasionally the mapping is not arbitrary. Consider the same experiment, but using L and R as stimuli, instead of S and H. If participants had to press ‘<’ for an R and ‘>’ for an L, for example, they might be both slower and more error-prone than otherwise, because of the association of L with “left” and R with “right.” Making a “left” response to an R might well produce some response competition, resulting in a slowing of RT. Basically, any time a stimulus implies a certain direction of response (such as L and R implying left and right responses), there are potential problems of S-R compatibility.
Probability of a stimulus
In most experiments with RT as a dependent variable, each type of stimulus is presented equally often. In this way, participants are discouraged from guessing, since each stimulus is equally likely on each trial. Sometimes, however, one stimulus may be presented more often than another, and this can have major effects on RT (and error rate). In general, the most common stimulus is responded to more quickly and more accurately. Why is this so? Suppose that in the experiment on recognizing S and H the participants were presented an H 80% of the time, and an S 20% of the time. Participants would quickly realize this, and would come to expect an H most of the time. On any trial, if the target is an H, there is likely to be a faster response. But if the target is an S, the participants must overcome their expectancy of, and preparation for, an H. The result is a slower response, and a higher probability of error.
Because of these considerations, it is best to always have the different trial types equally likely whenever randomization is used. Unequal stimulus probabilities are best avoided, unless they form part of the research itself.
Number of different responses
RT increases as the number of possible responses increases. This relationship has long been known, and was quantified in the early 1950's, when Hick and Hyman, working independently, each noted that RT increases linearly with the logarithm (base 2) of the number of alternatives. That means that each additional alternative increases RT, but by a smaller amount as the number of responses grows. This effect is not usually of much concern, but it must be kept in mind when comparing the results of several experiments (i.e., if they used different numbers of response alternatives, the RT's cannot be directly compared).
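To see the shape of this function, the following Python sketch computes predicted RTs from the Hick-Hyman relation RT = a + b log2(N); the intercept and slope values are hypothetical, chosen only for illustration.

import math

a, b = 200.0, 150.0  # hypothetical intercept (ms) and slope (ms per bit)

for n in (2, 4, 8, 16):
    rt = a + b * math.log2(n)
    print(f"{n:2d} alternatives -> predicted RT {rt:.0f} ms")

# Each doubling of the number of alternatives adds the same fixed
# increment (b ms), so going from 8 to 16 choices costs no more
# than going from 2 to 4.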
Intensity and contrast
At least for low levels of illumination, the more intense the stimulus, the faster the RT. Once the stimulus reaches an intensity where it is clearly visible, however, further increases will have little effect. Similarly, increasing contrast (the difference in intensity between the stimulus and the background) will decrease RT, up to a point where the stimulus is clearly visible. Either low intensity or low contrast would produce a data-limited display. A very brief stimulus is another example of a data-limited display.
One common problem in controlling intensity and contrast is ambient light (the light present in the room). A display that may seem very weak under ordinary room lighting may seem quite bright when room lights are off and windows covered. In experiments employing brief, data-limited stimulus displays, it is important that ambient light be carefully controlled.
In addition to lowering apparent intensity and contrast, ambient light may result in glare or reflections on the display screen of the computer. In this case, lights must be shielded or the computer moved to prevent such interference.
Stimulus location
The location of the stimulus can have a powerful effect on both RT and error rates. Visual acuity drops quickly as stimuli are moved away from the fovea—the narrow area of vision straight ahead that is about 2° wide. A person with 20/20 vision in the fovea will typically have about 20/80 vision 2.5° from straight-ahead. At 10° from straight ahead most people have worse than 20/300 vision. To put this in perspective, at a viewing distance of 57 cm (22.5”), each centimeter is about 1° of visual angle, so a letter displayed 2.5 cm (about 1”) from fixation will be seen quite poorly.
For these reasons, retinal locus (where on the retina the image of the stimulus falls) must be controlled by randomization or counterbalancing if the stimuli are not all presented in the same location. If one type of stimulus were presented in the fovea and another in the periphery, differences in RT might occur (or fail to occur) that are due to differences in the locations of the stimuli rather than to differences in the stimuli themselves.
Note that the retinal size of a stimulus is a function of its distance from the eye. If the size of the stimuli is a concern, then the location of the participant's head relative to the screen must also be controlled. This is often done by use of a chin rest or viewing hood to keep the participant's head relatively stable. The viewing distance should then be specified in the Method section of the final report. Sizes of stimuli are also reported in degrees of visual angle, rather than in millimeters or inches.
Calculation of stimulus sizes in degrees of visual angle is a matter of simple trigonometry: the exact visual angle is twice the arctangent of W/2D. For the small angles typical of display work, a good approximation is obtained by the formula

Size in degrees of visual angle = 57.3W/D

…where W = width of the display and D = viewing distance, with W and D in the same units of measurement.
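For those who prefer to compute this directly, the following Python sketch compares the exact arctangent calculation with the 57.3W/D approximation; the example width and viewing distance are arbitrary.

import math

def visual_angle_exact(w, d):
    """Exact visual angle in degrees: twice the arctangent of w/2d."""
    return math.degrees(2 * math.atan(w / (2 * d)))

def visual_angle_approx(w, d):
    """Small-angle approximation given in the text: 57.3 * W / D."""
    return 57.3 * w / d

w, d = 2.5, 57.0  # a 2.5 cm stimulus viewed from 57 cm
print(visual_angle_exact(w, d))   # about 2.51 degrees
print(visual_angle_approx(w, d))  # about 2.51 degrees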
Statistical Analysis of RT Data
While this brief review of single-trial RT research cannot include an extensive discussion of data analysis, a few points deserve comment.
The typical analysis of single-trial RT data employs the analysis of variance (ANOVA) to compare mean RT's under various treatment conditions as defined by the levels of the independent variables. For within-participants variables, a repeated-measures ANOVA is employed. Sometimes, both within- and between-participants factors occur in the same experiment, resulting in a mixed ANOVA. For the example experiment on letter identification, there are two independent variables that define the types of trials for the analysis. One is the location of the stimulus, which has six levels (0, 1, 2, 4, 8, and 16°). The other is whether or not the stimulus was adjusted in size to correct for poor acuity, which has two levels (adjusted or not). For this analysis, the mean RT for each of the twelve conditions would be calculated for each participant, and those values would serve as the data. The ANOVA then compares the means of those values based on all participants to determine statistical significance2.
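Where trial-level data are available (e.g., exported from E-Prime), the per-participant condition means and the repeated-measures ANOVA can be computed in a few lines. The following is a minimal sketch using statsmodels' AnovaRM in Python; the column names, the simulated RTs, and the number of participants are all hypothetical stand-ins for real data.

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
locations = [0, 1, 2, 4, 8, 16]          # degrees from fixation
adjustments = ["adjusted", "unadjusted"]

# Simulate 20 correct trials per cell for each of 12 participants.
rows = [
    {"subject": s, "location": loc, "adjusted": adj,
     "rt": rng.normal(400 + 5 * loc, 30)}
    for s in range(1, 13)
    for loc in locations
    for adj in adjustments
    for _ in range(20)
]
trials = pd.DataFrame(rows)

# aggregate_func="mean" collapses each subject x location x adjusted
# cell to a single mean RT, which serves as the datum for that cell.
result = AnovaRM(trials, depvar="rt", subject="subject",
                 within=["location", "adjusted"],
                 aggregate_func="mean").fit()
print(result.anova_table)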
In addition to the analysis of RT's, there should be a parallel analysis of error rates, expressed either as percent correct or as percent error. (Since percent error is just 100 minus percent correct, these analyses yield the same result.) In general, error rates should parallel RT's: faster conditions have lower error rates. If faster RT's are associated with higher error rates, a speed-accuracy trade-off should be suspected, and interpretation of RT differences should be made only with extreme caution.
In almost all instances, RT analyses are based on correct trials only. It is good practice to examine the overall error rate for each participant. While what constitutes an acceptable rate will differ from experiment to experiment, it is common practice to delete the data of any participant whose error rate is clearly higher than the norm; in such cases, it is likely that the participant either misunderstood the instructions or was simply unable to perform the task. If at all possible, the maximum acceptable error rate should be set in advance, so that there is no danger of deleting a participant's data merely because they do not conform to the expected results. Pilot testing should help in setting that maximum.
Another issue in the analysis of RT data involves outliers: extremely deviant RT's that occur on occasional trials, usually extremely slow ones. Many researchers assume that such extreme RT's reflect momentary inattention or confusion, and that they are therefore properly omitted from the analysis before calculating the mean RT by condition for individual participants. A common criterion is to omit any trial whose RT is more than three standard deviations from the mean for that condition. That can be done based on the mean and standard deviation of RT either for all participants or for each individual participant. The latter is clearly indicated if there are large differences in RT between participants. More sophisticated schemes for treating outliers have been suggested (Ratcliff, 1993; Ulrich & Miller, 1994).
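As an illustration, the following Python sketch applies the three-standard-deviation rule per participant and per condition, continuing with the hypothetical trials DataFrame from the ANOVA sketch above.

# Drop trials more than k SDs from their own subject-by-condition cell mean.
def trim_outliers(df, by=("subject", "location", "adjusted"), k=3.0):
    g = df.groupby(list(by))["rt"]
    z = (df["rt"] - g.transform("mean")) / g.transform("std")
    return df[z.abs() <= k]

clean = trim_outliers(trials)
print(f"removed {len(trials) - len(clean)} of {len(trials)} trials")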
The repeated-measures ANOVA, which is almost always used for significance testing with RT data, makes an assumption of sphericity, often described as “compound symmetry,” or equality of the covariances across all pairs of conditions. That assumption is seldom met in real data. Most statistical packages compute adjusted values of p based on either the Greenhouse-Geisser statistic or the newer, less conservative Huynh-Feldt statistic. In general, these corrected values of p should be used in assessing statistical significance.
1We realize that it is now fashionable to refer to the persons from whom we obtain data as “participants” (Publication Manual of the American Psychological Association, 4th ed., 1994). We continue to use the term “subject,” because the whole point of doing an experiment is that you, the experimenter, manipulate the independent variable. It is precisely because the person has agreed to temporarily suspend control and let you decide the level of the IV to which they will be exposed, or the order of the levels, that makes the study an experiment. Of course, the subject may remove himself or herself from the study at any time, but for as long as they participate, subjects have allowed you to subject them to the conditions you choose. “Participant” suggests a level of free choice that is not a part of an experiment (beyond the freedom to choose whether or not to participate and to withdraw).
2A discussion of the controversy surrounding the merits of traditional null-hypothesis testing is beyond the scope of this article. See several articles in the January, 1997 issue of Psychological Science and Chow (1998) for discussions of this topic.
Comments
Thanks to James St. James for this extensive and valuable article on managing computerized research! I do, however, take issue with one small point.
Under the section "Ordering of Trials within a block", this article states, "E-Prime permits setting a maximum on the number of repetitions of a single trial type ..." But based on my deep knowledge of E-Prime, this statement is at the very least misleading. E-Prime most certainly does not contain any straightforward mechanism for generally setting an arbitrary maximum on the number of repetitions of a single trial type. But if I have missed something, I would love for somebody to show me!
E-Prime 2 and later do, indeed, include a setting to prevent re-using the same List level as the first level after a reset (i.e., reshuffle) -- sort of like dealing a deck of cards, taking note of the last card dealt, then reshuffling the deck, looking at the top card, and, if that card matches the last card dealt, moving it to some other random place in the deck before dealing out any more cards. Fittingly, E-Prime terms this "No Repeat After Reset", and it does nothing more than that. This mechanism does not in any way, however, allow for setting an arbitrary maximum on the number of repetitions of a single trial type.
That said, one may, with enough programming skill, write code to enforce any arbitrary randomization constraint they like, so the statement in the article is true in that fringe sense. The PST website contains some example code for how to do that, although using an inefficient sort of "Bogosort" algorithm (look that up on Wikipedia). Typically, in order to enforce specific sorts of randomization constraints, one may more simply generate suitable sequences outside of E-Prime, and then run those sequences in sequential order from E-Prime.
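For example, here is a minimal Python sketch of that reshuffle-until-valid idea, generating (outside of E-Prime) a sequence with at most a given number of consecutive repetitions of the same trial type; all names here are illustrative, and for severe constraints a constructive generator would be preferable to this rejection-sampling approach.

import itertools
import random

def constrained_sequence(types, reps, max_run=3, seed=None):
    """Shuffle until no run of identical trial types exceeds max_run."""
    rng = random.Random(seed)
    trials = [t for t in types for _ in range(reps)]
    while True:
        rng.shuffle(trials)
        if max(len(list(g)) for _, g in itertools.groupby(trials)) <= max_run:
            return trials

seq = constrained_sequence(["S", "H"], reps=20, max_run=3, seed=1)
print("".join(seq))
# Write sequences like this to a file, then run them in sequential
# order from E-Prime.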
For more discussion on this topic, please go to the E-Prime Google Group (groups.google.com/group/e-prime) and run searches using the terms "random", "pseudorandom", "pseudo-random", "constraint", and "constrain".