Daniel J. Simons (2013-03-27 14:58)

Hi again, Brian. Thanks for the further thoughts. I think we mostly are in agreement on the logic of the argument, especially the idea that different tasks may produce more sensitive measures of the same underlying mechanism (in the technical sense of a bigger signal-to-noise ratio), and that more sensitive measures will yield bigger effect sizes. I think the reason cognitive psychologists have been skeptical of some of the large effect sizes for conceptual priming is that the measures do not appear to be the sort that would yield greater sensitivity to spreading activation (e.g., single trials with long response latencies and presumably a lot of variability) and, at least on the surface, they seem to require more steps for the semantic link between the prime and the behavior. Much of the discussion earlier in this thread focused on ways in which those assumptions might be wrong, mostly by arguing that the priming does not operate via a straightforward set of semantic associations. That is, most of the interesting alternatives have involved alternative mechanisms or pathways. I think that's a reasonable way to go, because it doesn't seem plausible that these outcome measures will be more sensitive than more traditional measures of spreading activation.

If there's one thing you can say about cognitive psychology, it's that researchers tend to focus extensively on optimizing tasks. Sometimes they lose sight of the phenomenon of interest and just study the task itself. After about 50 years of research on semantic priming using various lexical decision and other tasks, I'm willing to wager that the sensitivity of those measures would be pretty hard to beat. Of course, it's possible that the literature has become myopic and missed what might be far more sensitive measures. I'm just guessing that's unlikely here.

If that's true, then it would be surprising to find a one-trial method that leads to substantially greater effect sizes for the same sort of underlying mechanism. That's why these sorts of effect size comparisons seem valid and why some other sort of mechanism would seem to be a more plausible way to explain these sorts of effects. They're easier to accommodate (and justify less skepticism) if they don't operate using the same mechanisms of semantic priming studied using other (presumably more sensitive) cognitive tasks.

On your broader question: You're right that, in principle, the same mechanisms could produce effects at multiple time scales. In the case of spreading activation and semantic priming, the effects tend to get weaker with longer delays and with more links in the chain of associations, and it would be surprising to get stronger effects with longer time lags and more links in the chain. I think the reasons to appeal to a different mechanism (or at least a different or more direct pathway or outcome) come more from the points above, but in this case, time scale contributes to the argument because of what we do know about the time scale of semantic priming of this sort. It's an argument that's specific to this particular issue, not one about timing in general.

bjs (2013-03-27 01:34)

Oh, and: the only remaining question I have about the social priming discussion itself has to do with timing. You noted that: "If you want to make the argument that the social measure allows for bigger effect sizes, you would need to make the case for a different mechanism, one that operated on the scale of seconds rather than 10s of milliseconds." Must one appeal to a different underlying mechanism simply because the task itself operates on a longer time scale? Does the timing of different mechanisms even have to enter into such a discussion? For example, could we have a measure of priming that wasn't time-based at all (e.g., stem completion), but that still yielded larger effect sizes than an RT-based measure (e.g., lexical decisions)? And in that case, might the two tasks be tapping the very same underlying mechanism of spreading activation, even though the former operated on a scale of seconds while the latter operated on a scale of 10s of milliseconds? (I'm obviously not suggesting this is the case w/ the walking research, so this might all be beside the point for the specific discussion about priming that you're having. But again, I'm more interested in the general form of these arguments.)

bjs (2013-03-27 01:33)

(Continued)

So in the end, measured effect sizes will be influenced by many factors, including (1) the "# of steps" in the underlying mechanism (i.e., what you were focused on in the original argument about priming), (2) the degree to which the dependent measure allows for mechanism-related variability, (3) the degree to which the dependent measure allows effects to be swamped or masked by mechanism-unrelated variability, and (4) the # of observations per subject, etc. Reasoning about "effect-size chaining" (a la your "If both X and Y operate via Mechanism Z...") is only possible when you think that the influence of #1 on the effect size is going to swamp differences related to #2 and #3, etc.

Thanks again,

-bjs

bjs (2013-03-27 01:32)

Hi Dan. Thanks for taking the time to help clarify these things for me; your response makes a lot of sense, and is very helpful.

I still think that these discussions of what we might call "effect size chaining" can't make sense unless we always take the dependent measures into account in an explicit manner. To the extent we differ on this at all, though, this might just be an issue of rhetorical style rather than logical substance.

I guess I can't really comment on social priming per se, since I don't know much about that (and I wasn't even paying attention to the single-trial vs. multiple-trials issue, on which a bit more below). But I think that this issue can be sharpened even within the realm of semantic priming itself. Imagine that you had two tasks for measuring semantic priming, one of which was pretty good and one of which was pretty crappy (either swamping the effect w/ independent variance [per your points about "walking" dependent measures] or simply not allowing for much mechanism-related variance in the first place). (I'm not sure what the good measure would be -- maybe some variant of stem completion? -- but the crappy measure might be a timed lexical-decision task.) It seems to me that you could easily end up with a situation in which the robin-priming-animal effect size (as measured with the good task) was larger than the robin-priming-bird effect size (as measured with the crappy task) -- even though you were studying the same "mechanism" in each case.

In other domains that I know even better, I'm sure that you could have this situation. For example, object-based attention effects are larger and more reliable when the 'objects' in question enjoy properties such as closure. But if I measured OBA for closed objects in a spatial cueing task but OBA for un-closed "objects" in a divided-attention task, then I could easily get a larger measured effect size for the latter. The reason is that RT-based spatial cueing tasks ("press a key as soon as you see the probe") allow for relatively little task-related variance, since the speed w/ which you make a detection response is just never going to be that slow or that fast; you'll end up w/ an effect magnitude of a few 10s of milliseconds at most, and you'll rarely be able to experimentally discriminate effects of different strengths, etc. Meanwhile, an accuracy-based divided-attention task ("press one of two keys to tell me if the two quickly-flashed probes were the same or different") allows for much larger differences, for differences of different magnitudes to be compared, etc.

Your point about "walking" is well taken: a measure that allows for more variance *unrelated* to the mechanism we care about will be worse and will yield a smaller measured effect size. (And yes, we agree that walking wouldn't be a good measure of semantic priming itself!) But tasks can and frequently do also differ in terms of how much variance they allow that *is* related to the mechanism that we care about -- with more variability of that sort implying a better task and perhaps a larger resulting effect size. So yes, we want "controlled, boring measures" -- but even so, some controlled, boring measures are better (and will produce bigger effect sizes) than others. (Back to OBA: your point about the disadvantages of single-trial measures is well taken, but this is a domain where, e.g., I can probably get a larger and more reliable effect in a 3-trial MOT experiment than I could in a 30-trial spatial-cueing experiment.)

Daniel J. Simons (2013-03-26 15:47)

Continuation of last reply...

3) Your argument about effect sizes is wrong, though. I think you're conflating the nature of the dependent measure with the size of a measured effect. A more variable outcome measure should lead to *less* reliable effects, which in turn would lead to smaller *effect sizes*, even if it leads to a bigger absolute measure. Walking down the hallway is measured in seconds, whereas lexical decision is measured in 10s of milliseconds. That doesn't mean walking as an outcome measure will produce a larger effect size. For example, walking would be a much worse measure of lexical priming (walk to door 1 for a word and door 2 for a non-word) -- your latencies would be much longer and more variable, but the variability would swamp whatever mechanism contributes to a difference in lexical decision times (because that mechanism operates on the level of 10s of milliseconds).

The big advantage of using such controlled, boring measures is that they can produce a more reliable estimate of differences across conditions. And more reliable estimates produce bigger effect sizes for a given difference between conditions. So the fact that walking down the hallway has greater room for variability does not necessarily mean you'll get a bigger effect. You might, if the mechanism underlying the effect operates on a scale of seconds. But most models of spreading activation don't operate on that scale.

If you want to make the argument that the social measure allows for bigger effect sizes, you would need to make the case for a different mechanism, one that operated on the scale of seconds rather than 10s of milliseconds. That might well be the case for these sorts of social effects, but then they could no longer be explained by more typical mechanisms for conceptual priming and spreading activation. That's why I argued that some other mechanism would be needed (or a complete rethinking of the mechanisms for spreading activation).
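To put rough numbers on that variability point, here is a minimal simulation sketch in R (all values invented for illustration): the same 30 ms difference yields a respectable standardized effect against roughly 100 ms of reaction-time noise, but almost none against seconds of walking-time variability.

```r
# Minimal sketch (invented numbers): the same 30 ms priming difference
# produces a much smaller standardized effect size when the outcome is noisier.
set.seed(1)
n <- 1e5
lex_neutral  <- rnorm(n, mean = 600, sd = 100)   # lexical decision, ms
lex_primed   <- rnorm(n, mean = 570, sd = 100)
walk_neutral <- rnorm(n, mean = 8.00, sd = 1.5)  # hallway walk, seconds
walk_primed  <- rnorm(n, mean = 7.97, sd = 1.5)  # same 30 ms difference

cohens_d <- function(a, b) (mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
cohens_d(lex_neutral, lex_primed)    # ~0.30
cohens_d(walk_neutral, walk_primed)  # ~0.02
```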
Daniel J. Simons (2013-03-26 15:47)

Hi Brian (I assume),

You're missing a couple things, I think :-).

1) Your description of my premise isn't quite right, even if the logic of what you say mostly is. You state "If phenomenon X depends on phenomenon Y..." The real question is more of the following form: "If both X and Y operate via Mechanism Z, and X requires more steps in Mechanism Z than does Y, then X should be a smaller effect than Y."

If you think of this in terms of classic subtractive logic, each additional associative link should have a cost, meaning that a more remote associate should produce smaller effects than a closer associate. In classic spreading activation form, seeing "robin" primes "bird" more than it primes "animal" because you have to go through more steps to get to animal. More remote associates produce smaller priming effects.

In the case of some of the priming research, some cognitive psychologists wonder why the effect sizes for conceptual social priming (warm cup primes the concept of warmth, which spreads to related meanings of warmth, which then spreads to pro-social behavior) should be bigger than conceptual priming for much more direct semantic associates. They shouldn't be if the mechanism works the same way. That's an argument for why a different mechanism would be needed to produce larger effects, and some of the folks commenting on this post have suggested such alternatives.

2) Your argument about different measures being differentially sensitive to the same underlying mechanism is a good one. Some measures are more sensitive than others, and that might account for differences in the size of the obtained effects. For that logic to hold, though, you would need to make the case that a single-trial measure produces a more reliable result than an average across many trials. As someone who does a lot of single-trial studies, I'm not sure I'd go that route. In essence, you'd have to argue that the dependent measure used in social priming research produces a more *reliable* measure of the underlying mechanism. I think that will be a tough sell with a single-trial measure. Tasks with multiple trials might be a better case for this sort of argument.

Continued in next reply...
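A small sketch of the reliability point in (2), again with invented numbers: averaging many trials shrinks measurement noise by the square root of the trial count, so a multi-trial measure will usually yield a larger standardized effect than a single-trial one for the same underlying difference.

```r
# Sketch (invented numbers): averaging k trials shrinks measurement noise
# by sqrt(k), which inflates the standardized effect for the same true difference.
set.seed(2)
true_diff <- 30    # ms of priming per subject
trial_sd  <- 150   # trial-to-trial noise, ms
n_subj    <- 1e4

single_trial <- rnorm(n_subj, true_diff, trial_sd)                       # 1 trial per subject
avg_50       <- replicate(n_subj, mean(rnorm(50, true_diff, trial_sd)))  # 50 trials per subject

mean(single_trial) / sd(single_trial)  # d ~ 0.2
mean(avg_50) / sd(avg_50)              # d ~ 1.4 (about 0.2 * sqrt(50))
```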
bjs (2013-03-26 13:34)

Hi Dan, I'm not sure if you're still reading this older thread (which I just saw), but here's another perspective on "Reconciling the size of social priming effects with the apparently smaller size of explicit semantic priming". I wanted to address this issue more generally than some of the past comments (though these thoughts aren't inconsistent w/ some of the other posts).

Several people I respect (including you and Hal Pashler) have recently been making this general argument: if phenomenon X depends on phenomenon Y, then the measured effect size of X shouldn't be bigger than that of Y.

I completely fail to understand this argument; it seems like a non sequitur to me, so I must be missing something. In general, the measured effect size of any phenomenon will be a function of both (1) the underlying phenomenon itself, and (2) the task used to measure it. As a result, you can take the very same phenomenon and reliably get two very different effect sizes as a result of measuring it with two different tasks. (In vision, e.g., take the phenomenon of object-based attention; here you'll reliably get a smallish effect size if you measure it w/ spatial cueing, a medium effect size if you measure it w/ divided-attention tasks, and a larger effect size if you measure it w/ something like multiple-object tracking.) It's presumably the same effect in every case (since we care about underlying processes in the mind, not paradigms used by scientists), but that effect gets filtered through different task constraints.

So the real form of your argument, though it's not stated as such, must be: if phenomenon X depends on phenomenon Y, then the measured effect size of X as measured with task A shouldn't be bigger than that of Y as measured with task B. But that certainly doesn't follow, since B could itself constrain the effect size much more than A.

Back to the actual topic being discussed: semantic/associative priming in cognitive psychology has typically/historically been measured w/ piddly little fast-response-time tasks (a la lexical decisions) that allow room for effects of only a few 10s (or 100s at most) of milliseconds -- such that you'll probably never be able to get a huge effect size, regardless of what you're measuring. This isn't because the underlying process of spreading activation is weak; it's because the bottleneck through which you're measuring it is narrow. But many of the 'social priming' phenomena are measured with tasks that themselves allow for much more variable performance (even when measuring "response" time -- e.g., when the response is the time taken to walk down the hall), and so it's not surprising that the measured effect sizes would be much larger, even if they depend in part on some of the same underlying processes. (And I think this is a larger sociological difference between the two fields: relative to cognitive psychologists' tasks, social psychologists' tasks tend to be much harder to pin onto specific underlying cognitive processes, but they're sooooo much better suited for yielding huge effect sizes.)

In short, if task A is a great task with lots of room for variable performance, while task B is a crappy and highly constraining task, then you can easily get a larger measured effect size for X than for Y, even if X depends on Y.

In other words, vis-a-vis "Reconciling the size of social priming effects with the apparently smaller size of explicit semantic priming": there's really nothing to reconcile in the first place, either in this specific case or in general. The only rare context where that argument might work would be if the paradigms were identical...

What am I missing?

-bjs

Daniel J. Simons (2013-03-15 11:46)

Josh -- I don't think the thought experiment makes any assumptions at all about the truth value of the initial study. Rather, it just asks what the odds of getting an effect size of a given magnitude/sign would be if the null hypothesis were true. The larger thought experiment asks whether a second study can be considered to have replicated the pattern of the first, whether or not the first result was true or a false positive. It isn't a conditional probability because I'm not making any assumptions about whether the initial result is real or a false positive. I'm just asking whether a second study replicates the pattern shown in the first one. We don't know the ground truth about the size of the actual effect in this thought experiment (just as we can only estimate it through experimental results in reality).

In a real-world situation, we just don't know whether an initial result is a false positive or even an accurate estimate of an effect. And no one study can tell us whether the original was a false positive. The broader goal, then, is to estimate the true underlying effect size. The best approach, without pre-existing knowledge of the ground truth, is to conduct a meta-analysis across many studies. If the cumulative effect size approaches zero across many studies, then the original likely was a false positive, but we never know that for certain (we're estimating reality).

My point in this thought experiment is to note that the sign of an effect in a replication does not provide strong confirmation of the original effect.
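That last point is easy to check by simulation (sample sizes arbitrary): when the true effect is zero, an independent replication lands on the same side as the original about half the time, so a matching sign by itself is weak confirmation.

```r
# Sketch (arbitrary n): under a true null, how often does a replication's
# effect point in the same direction as the original study's effect?
set.seed(3)
same_sign <- replicate(2e4, {
  d1 <- mean(rnorm(20)) - mean(rnorm(20))  # original study, true effect = 0
  d2 <- mean(rnorm(20)) - mean(rnorm(20))  # direct replication
  sign(d1) == sign(d2)
})
mean(same_sign)  # ~0.50: a coin flip
```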
Josh (2013-03-15 10:03)

This is an interesting thought experiment. I'm not sure it contains the right numbers. What you are looking at is the probability, given that the first experiment was a false positive, that the second experiment of a given sample size will go in the same direction.

I think that typically what we care about is the non-conditional probability: what is the joint probability that the first result was a false positive and the second one went in the same direction? Answering that question involves a lot more guesswork (we need to know the prior probability of false positives, which we don't, and people's guesstimates seem to run pretty much the whole spectrum).

Daniel J. Simons (2013-03-13 11:12)

Thanks Michael. Absolutely right: From a statistical perspective, you can treat the order of the original study and replication as arbitrary. The case in which you just run an experiment twice with different subjects is an idealized version of that, and order doesn't matter (in principle, assuming the population hasn't changed over time). Combining the two meta-analytically without differentiating which came first (with various tools that allow you to treat them as a random effect as well) makes sense in that case.

That said, the idealized case in which the two studies are conducted independently isn't always the case in reality (even though they can be treated that way for meta-analytic purposes if they are the same study design with the same population). In most cases, an initial study has a disproportionate influence on how people think about a finding. Murder is shouted on page 1, but corrections appear on page C26. Moreover, the first study does give us information we can use to adjust our prior beliefs about the size of the effect when conducting a follow-up study. When deciding what to do for the follow-up (e.g., power, etc.), knowledge of the first result can guide decisions about the second. If you update your priors based on the first result, then the two studies aren't really two tests of the same effect using the same prior beliefs. The order is irrelevant if you don't update your priors based on your initial evidence. But it's relevant if you do.

Of course, treating them as equivalent and disregarding the order in which they were conducted is a good way to estimate the overall effect and to try to generalize to the larger population of possible experiments of the same sort. If they are direct replications using the same procedure, that's the right way to go (and you can use a common set of prior beliefs to look at the overall effect, etc., in the way Michael suggests).

Anonymous (2013-03-13 09:03)

Suppose you ran two experiments (the second being an attempted replication of the first) on different subjects and you want to know what their aggregated data tell us. As a Bayesian, you shouldn't care which came first. Data should be undated (with some exceptions). The first experiment is as much a replication of the second as vice versa. Thus the idea of replication is misleading. If you have two experiments that are similar enough to compare effects, you should think of experiments as a random effect, i.e., these experiments are a sample from an infinite population of possible experiments, with subjects nested within experiments. The tools for doing this are readily available in the R packages for linear mixed-effects models, or in programs such as HLM.
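For concreteness, a sketch of that setup using the lme4 package (assuming it is installed; the data and variable names here are simulated and invented, and with only two experiments the experiment-level variance is only weakly identified, so expect a rough or singular fit):

```r
# Sketch: pool an original study and its replication, treating experiment
# as a random effect. Data are simulated; with just two experiments the
# experiment-level variance estimate will be very rough.
library(lme4)
set.seed(4)
d <- data.frame(
  experiment = factor(rep(c("exp1", "exp2"), each = 80)),
  condition  = factor(rep(c("control", "primed"), times = 80)),
  outcome    = rnorm(160)
)
fit <- lmer(outcome ~ condition + (1 | experiment), data = d)
summary(fit)  # fixed effect of condition = pooled estimate across experiments
```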
hans (2013-03-12 03:46)

Sorry, I do tend to get lost in the regulatory work, given that I think it's the more exciting avenue now. Regarding the automaticity part: I myself am not really wedded to any automaticity claim. Such responses may well be controllable given specific conditions and depending on the representation. I do really like the questions that you are asking, though.

The distinction to me seems to rely more on the type of representation, which may be one assumption that differs. The level of representation is not necessarily a semantic-network type of representation. In fact, many people have stressed that representations may rely on soft versus hard interfaces (Zajonc) or "simulations" (Barsalou). This is an assumption in much of the embodiment work (although the radical embodiment people would disagree). So Zajonc and Markus, for example, talk about manipulating heart rate, which would also change the way people operate in specific behaviors (I cannot remember the specific DV there). The same would hold for the specific work we talk about here.

My approach to date has mostly relied on descriptions from other theories. That is, warmth seems to evoke a communal sharing mindset (one of Alan Fiske's relational models), which generally lets people engage in trusting relationships (akin to mother-infant relationships, those between romantic partners, and so forth). Warmth (as related to touch) should be one of the early cues for preverbal infants to interpret the environment as safe and trustworthy (in another relational model, authority ranking, preverbal infants seem to associate dominance/submission with power -- see Lotte Thomsen's work).

Now, it is not the case that just anybody comes to associate warmth with a good-quality relationship (in which generosity is afforded). In fact, we find that those children who do not associate physical warmth with good-quality care (i.e., insecurely attached) do not become more generous (you do see the effect for those who associate good-quality care with warmth -- securely attached). Is this relying on semantic networks? I doubt it, so the nature of the representation is different (and "priming" of warmth in this case may well be different). However, it is true that people have started to tease apart these mechanisms already to some extent.

Ainsworth also stressed early on that mothers of avoidant children tend to be more reluctant to touch. So the effects of warmth should be primarily geared towards "something" about touch and close proximity (and probably also specific types of smells related to the mother-infant interaction). Warmth just seems to be the most dominant of these, because throughout evolution it should have been associated with all of the most dominant communal acts (like sex, sharing of fluids, providing care for an infant).

That basically also means that this is really not directly related to climatic differences. Climatic differences and differences in national culture are far more complex. That is, in warm climates people go outside far more frequently, they should interact more, and they have different baselines for temperature. There should be SOME relation, but it's a lot harder to tease apart (we did find effects of a warm room on feelings of closeness, but again, this is an incidental manipulation, not a climatic difference). For what it's worth, we did do an analysis of an existing dataset, but the effects seem to be quite unstable (paper is still under review: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2215865).

Daniel J. Simons (2013-03-11 20:27)

Thanks Hans. No problem about the self-promotion -- great to let people know about that sort of special topic. I'm not sure I completely follow the top-down/bottom-up distinction you've drawn here. I don't tend to think of priming effects as top-down at all, and they typically are described as automatic, which implies a more bottom-up or obligatory process rather than something guided by intentions. The idea that physical warmth triggers some sort of regulatory process seems reasonable. What I'd like to see is a more specific explanation for how those regulatory behaviors then lead to pro-social behavior or judgments about the warmth of another's personality rather than to the many other things they could do. More broadly, if physical warmth leads to such behaviors and judgments, are people in warmer climates inherently more charitable? Do people living in cold places think everyone is cold? That doesn't seem quite right to me. If not, is it the relative change in feelings of warmth at that moment? If so, would people be more pro-social right after walking outside on a warm day? What are the constraints of these effects? What are their limits? Knowing more about the hypothesized chain of "events" that leads from the prime to the outcome would help in better understanding these effects.

hans (2013-03-11 18:06)

Hi Dan,

Thank you as well for the detailed comments. Let me make another distinction. I think that for many of the embodiment effects we can make a broad distinction between two types of effects -- similar to many other effects in cognition: top-down and bottom-up. Many of the top-down effects may fall under the view that some have argued for as conceptual metaphors/metaphoric transfer. I'd think that these are far more bound to context (e.g., the link between time and space, which has also been suggested to be moderated by culture).

The work that we have been doing should be closely linked to ANS functions and oxytocinergic effects. We have found one link with skin temperature (i.e., social exclusion leads to lower skin temperatures, and warmth mitigates the effect). We are trying to work out these effects beyond these relatively simple 'priming' effects (if one still wants to call it priming, if it is simply vasoconstriction/vasodilation). That does not mean that there are no top-down effects (e.g., through internal working models; see Fay & Maner, and our own work with children), but the most basic effects are also obtained with young children (e.g., young infants are soothed by warmth when they are stressed, more so than by a pleasurable treatment with sucrose).

There are a whole bunch of these converging effects (see, for example, the problems oxytocin-deficient mice have in regulating temperature, oxytocin promoting heat transfer in mammals when feeding pups, and oxytocin-thermoregulation links in humans), but I agree that the full mechanisms are not clear yet. That's OK; it's ambitious to work them out, and I think progress is being made. Forgive the blatant self-promotion, but the working out of mechanisms is something that we will try to invite here: http://www.frontiersin.org/Cognition/researchtopics/Mechanisms_of_well-adjusted_an_1/1627

As a 'primer': we are working to include a 'registered reports' section, based on what Chris Chambers proposed at Cortex, which I hope will increase confidence in these mechanisms/effects.

Daniel J. Simons (2013-03-11 17:22)

Ah. Good point. Thanks David. I do like the idea of using the power of the original study as a way to assess an unstated belief about the size of the effect. When an effect is surprising, the lack of power and the necessity of finding a large effect are particularly problematic. For a surprising effect, you shouldn't assume it will be large, right?

Anonymous (2013-03-11 17:10)

Briefly: You're indeed right that I don't favor using p-values to establish the legitimacy of an effect. But that's the common practice. My only small point was that an investigator who plays that game is barring himself/herself from reporting any effect smaller than one sufficient to attain the critical p-level given the N, and is therefore implicitly accepting that an effect smaller than that is one that should never see the light of day. Which, as much as anything, shows the flaw in the whole logic of NHST.

David Funder

Daniel J. Simons (2013-03-11 16:01)

That approach has some merit -- basically, you're asking what the criterion effect size would have been if p had been exactly .05. In that case, power would be 50%. So you're assuming that the study was conducted with 50% power to detect an effect of that size (under NHST). The problem is that we know that published effect size estimates are likely inflated (due to publication bias, etc.). So I would prefer assuming that the effect size actually is smaller than the published effect and then using a large enough sample to find that smaller effect size with at least 80% power in the replication attempt. I think the best approach is to make sure that the replication attempt has enough power to detect a similar sort of effect.

The one aspect of your logic that I disagree with is your statement that the effect with p of exactly .05 is the "minimum effect size that would make the finding worth taking seriously." I'm not a fan of using p-values in this way for determining the legitimacy of an effect (I know that's probably not what you meant, of course). The measured p-value will vary quite a bit across repeated versions of the same study.

You're absolutely right that the importance of an effect size varies wildly with the nature of the inference. The effect size of aspirin on heart attacks is about r = .02, and the study required 10,000 subjects. That is a life-saving effect size. Most psychology studies are not that large, and given the amount of variability in psychology measures, most effect sizes that small would not be of practical importance. The real question, as you suggest, is what sorts of effects we are capable of detecting reliably given the sample sizes we use. (Stay tuned -- another post coming soon on that point.)

Daniel J. Simons (2013-03-11 15:46)

Thanks Maarten. Yes, that is a fascinating situation. And I could easily imagine cases in which some set of labs consistently produces an effect and others do not. The key, then, will be in trying to understand what differs. One advantage of using a shared, vetted protocol (as we're doing at Perspectives on Psych Science) is that we can identify as many of the necessary manipulation checks and method details in advance as possible. That helps rule out the simple mistakes, leaving the more interesting ones.

Anonymous (2013-03-11 15:44)

I agree with everything in this post and wish to add one thought, concerning the statement "We have to treat near-zero effect sizes as failures to replicate..." Any study can have an effect that goes one direction, goes the other direction, or is so close to zero that it might as well be zero. But how close to zero is that? It depends on many things, including theoretical context (e.g., what other effects are competing as explanations for the phenomenon, and what is their size?) and practical implications (e.g., how many lives will be saved by the widespread use of a treatment that has this effect size?).

The situation is a bit different when we are talking about replication. Here, the question is whether the new study obtained an effect size large enough to support the existence of the original effect. But how large is that? Here's an interesting possibility: if the original finding was reported in the context of NHST, then an N was reported along with a critical p-value. What is the smallest effect size that would have attained that critical p-value, given that N? The answer to this question tells us what the original investigator, implicitly, is saying is the minimum effect size that would make the finding worth taking seriously. Because if it had been smaller than that, the investigator wouldn't have reported it!

If this logic is correct, then it follows that a subsequent study that does not attain at least this effect size cannot count as being "big enough" to confirm the finding, and if the confidence interval around the new effect doesn't include the old effect size, then the original study can be said to have been disconfirmed!

Did I get anything wrong here?
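That implicit threshold is straightforward to compute for a simple design. Here is a sketch in R for a two-tailed, two-sample t-test with equal group sizes (the sample sizes are invented for illustration), together with the kind of power calculation Dan describes in his reply above:

```r
# Smallest standardized effect (Cohen's d) that can reach p = .05 in a
# two-sample, two-tailed t-test with n subjects per group.
min_detectable_d <- function(n_per_group, alpha = 0.05) {
  df <- 2 * n_per_group - 2
  qt(1 - alpha / 2, df) * sqrt(2 / n_per_group)
}
min_detectable_d(20)   # ~0.64: with 20 per group, nothing smaller can be "significant"
min_detectable_d(100)  # ~0.28

# Powering a replication for a smaller assumed effect (e.g., half as big):
power.t.test(delta = 0.32, sd = 1, sig.level = .05, power = .80)  # n per group ~154
```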
Anonymous (2013-03-11 14:10)

Yes, well put, I agree with everything you wrote. I would only add that if the differences themselves reproduce (team A keeps finding A, team B keeps finding B, for example), then it is definitely time for adversarial collaboration. That might be the situation that is most interesting in terms of theory development.

Daniel J. Simons (2013-03-11 13:42)

I completely agree -- the results of one replication cannot falsify an original result. That's one reason I've pushed for multiple replications using a shared protocol for APS. The goal is to get a better handle on the size of the effect. And we shouldn't take the results of any one study as definitive (unless it has enormous power and an airtight design). I also don't think we necessarily need to assume shoddy research practices if, upon multiple replications, the original result turns out to be wrong. Statistically, we should expect some false positive results. And, with publication bias as it is, we should expect that most published findings overestimate the true effect size. The original study might have followed best research practices and still produced a false positive. In my view, we should treat a single positive result much the same way that we treat a single negative one -- it's suggestive but not definitive evidence for the true size of the effect in the population.

For what it's worth, I don't think of replication as a questionable-practices-detection mechanism (although it might do that on occasion). Rather, I view it as a way to obtain a better estimate of the true size of an effect in reality. If an effect has no theoretical or empirical importance, then there's not much reason to bother replicating it. But for results that are theoretically important (even if the theories are only weakly elaborated), direct replications help verify the size of a result.

In my view, there's too much emphasis on replication as fraud detection and debunking. Replication helps to shore up an original finding by showing that it is robust. My hope and expectation is that most findings in our field will withstand such direct replication (particularly when the original study had adequate power to detect an effect with a reasonable effect size). Of course, underpowered studies are more likely to produce false alarms with large effect sizes, so those are the ones that are most in need of direct replication to verify the actual effect size. That doesn't mean they were fraudulent, just that they provided a far less precise estimate of the effect size. Now, if they were produced via p-hacking and other questionable practices, they will be even more likely to be false alarms. I hope such practices are less prevalent than they seem to be, but if not, then direct replication is even more crucial. There's no point in fleshing out sophisticated theories if the results they are based on prove to be insubstantial.
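The overestimation point can be seen in a quick simulation (all numbers invented): if only results with p < .05 see print, the published estimates of a modest true effect run well above it.

```r
# Sketch (invented numbers): filter simulated studies on p < .05 and the
# surviving effect size estimates overshoot the true effect.
set.seed(5)
true_d <- 0.3; n <- 20                 # true effect and n per group
est <- replicate(2e4, {
  x <- rnorm(n, true_d); y <- rnorm(n)
  c(d = mean(x) - mean(y), p = t.test(x, y)$p.value)
})
mean(est["d", ])                   # ~0.30: all studies together are unbiased
mean(est["d", est["p", ] < .05])   # ~0.78: the "publishable" ones are inflated
```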
Anonymous (2013-03-11 13:29)

Don't get me wrong: I think replication is important, and direct replication is sadly too rare. I just don't think that the result of one replication can falsify the original result. If it is different from the original result, it tells you that something interesting is going on and further replications should be done. I agree that a series of non-replications will get the original result and theory in trouble, to the point where it will be discarded. That point will be reached when there are no more theoretically interesting reasons to uphold the original result in the face of mounting counter-evidence, and it will have to be put down to 'banal' causes such as questionable research practices. But just as one result isn't conclusive until it is replicated (that's why direct replications are so important, after all), one replication is not conclusive either until it is replicated -- until a point is reached where the research community decides that the replications don't produce interesting differences anymore and the original theory is either accepted, or amended, or dropped.

If you could really judge the adequacy of an experiment regardless of the theories describing it, then why do direct replications at all? Only to rule out fraud and questionable research practices? I agree that is (sadly) an important consideration, but I'm sure it's not the only one.

Daniel J. Simons (2013-03-11 13:15)

Hi Hans. Thank you for the detailed comments. You might well be right that "social" primes are inherently stronger than simpler semantic primes, although I would like to know what mechanisms in particular make them stronger and how the representational structure differs from other forms of semantic representation. I guess it's not clear to me what it means to be visceral and how such visceral connections would lead to stronger effects. That does require a different mechanism, perhaps some way of strengthening the weights between concepts or a more direct pathway through which priming induces its effects.

Interesting that mood or positive valence doesn't induce the same sort of effects. That does seemingly eliminate one possible alternative pathway that wouldn't need to operate by spreading from one concept to another (prime -> positive affect -> generalized positive responding). I guess what I haven't seen yet (and that's probably due to my own lack of expertise in some of these areas) is a structural/mechanistic explanation for how these priming effects operate and what sorts of pathways or concepts are being primed. How are these concepts represented, and how are those representations connected such that priming has predictable effects? Without such a model, it's tough for me to wrap my head around how a warmth prime could lead to increased prosocial behavior or increased anger (maybe that's with different primes). I'd like to see a more mechanistic account that spells out each step in the process from prime to outcome. That would also help differentiate the effects from the sorts of semantic spreading-activation effects I mentioned.

For what it's worth, I only mentioned the Williams & Bargh finding because it was the one I had seen (it was in Science and is cited a lot). I didn't intend to neglect others in that field -- I just used it as what I saw to be an example of claims of conceptual priming. I imagine there are many others I could have used in its place.

Daniel J. Simons (2013-03-11 11:55)

I don't really agree with your semantic distinction between the process and the result. The word "replication" is used in both ways, and the context typically disambiguates. If I say that I'm going to try to replicate some result, I'm referring to the process of conducting a replication. If I say "that result replicates," I'm referring to the outcome of repeating the study. I don't follow why that necessarily leads to confusion.

I agree that the interpretation of the results of a replication study must vary as a function of how that replication was conducted. If it was a conceptual replication, the interpretation should be different than for a direct replication. That said, I completely disagree with the claim that falsification doesn't work at the frontier of science. That is precisely where direct replication is needed. If direct replication attempts consistently fail to reproduce the same effect, then we know that the original effect, as described, is either wrong or the description was incomplete. Of course it's always possible that undescribed factors contributed to the original result, but the direct replications still constrain (falsify, in a sense) the generality of that result. I do think it is possible to judge the adequacy of a replication even if you don't know all of the factors that might matter in theory. A direct replication is an attempt to reproduce a result, and that should be possible for any scientific result, regardless of the richness of the theories describing it. After all, theories eventually will be proven wrong or incomplete, but the evidence used to generate them should be robust.

Adversarial collaborations are great when they work (they're rare), especially when two theories make conflicting predictions. But even in the absence of direct theoretical predictions, direct replication is necessary to verify the strength of the results themselves.

Daniel J. Simons (2013-03-11 10:56)

For the registered replication reports, we are focusing only on direct replication of an original study rather than on extensions of that study in new directions. There are other outlets for replicate+extend papers. Our goal for these is to have multiple replications of a single study using a shared protocol.