Monday, January 30, 2012

The cumulative dissertation---a five-step program to success

Here is an informative text from kisswin on the subject of cumulative dissertations:
A special type of doctorate is the so-called cumulative dissertation. While a traditional doctorate finishes with one doctoral thesis, a cumulative dissertation consists of various publications which are then combined into one complete work and evaluated. The publications are normally papers, articles etc. which have been published in renowned (“peer reviewed”) professional journals. The presented publications are evaluated and chosen by qualified experts (“peer reviewers”). The review thereby ensures that the publications meet the standards of renowned professional journals and conferences. Furthermore, the prestige of professional journals is connected with the Journal Impact Factor (JIF). The JIF does not make statements about the quality of an article but measures the frequency with which an article in a journal has been cited in a given period of time in other journals. Nevertheless, it is important to note that a comparison can be difficult because of different citation rates within different research fields. In spite of that, a high JIF increases the prestige of professional journals. Whether scientific work is published depends on many different factors. Therefore a cumulative dissertation is often less calculable concerning time schedules than a traditional dissertation. In principle, publications are important for a traditional dissertation as well, but they are not as important for a traditional dissertation as for a cumulative dissertation. Up to now, cumulative dissertations are rather rare at German universities and no standardised method exists for German-speaking countries. The respective conditions can be found in the regulations for doctoral studies of the respective university.
This topic is of great interest to me, and it should be of great interest to any PhD student in Germany where cumulative PhDs are offered. The cumulative diss. is the best possible way to do a dissertation. Two of my students, Titus and Umesh, just got done with cumulatives, and it's amazing: three papers each.
When I got done with my PhD, I had zero publications stemming directly from the PhD work. It took me one more year to get a book published, and then another three years before I got my first major paper published. Compared to that, having three papers under review or published (in both Titus's and Umesh's cases, one is published and two are under review) at the moment of submission is amazing and worth replicating. Even if the submitted papers are rejected, it is great to hit the ground running.

The only problem is: it's very hard to replicate this kind of performance. In order to do a cumulative, you need to get started real early with writing and submitting articles, and in a typical PhD, one cannot get enough data for a publication until one is well into the second year (in my own case, I had data within 3 months of becoming all-but-dissertation, but I had gone to India with five laptops and gathered 250 subjects' data in a month or two). Also, in psycholinguistics, journal articles are long-drawn-out affairs, with ritualistic introductions that go on and on, and the obligatory long General Discussion, usually consisting of large amounts of waffling and wild speculation (I exaggerate wildly, but GDs are too long for my taste; in my own journal articles I try to create a new culture by keeping them short, but reviewers always complain about too-short GDs). It would be awesome if journals would encourage pithiness rather than (and I use this word in its correct sense, cf. Sarah Palin's illiteracy on display) verbiage. But that's not going to happen any time soon.

So students have to be able to write; even if the advisor (i.e., me) is intensively involved in the writing, they still have to be able to do the core writing themselves. This is hard enough for native speakers of English.

Also, students have to be ready to put up with the ugliness and fundamental lack of friendliness of the review process. Reviewers are often former graduate students who (even after they have become extremely-former grad students) were trained in green-beret universities to become attack dogs, and never learnt to become human again. Many reviewers like to nail a paper just because they can. No paper is ever perfect (at least, I have never read one yet that I would call perfect), but for many reviewers this is no reason to let a paper through.

Getting eviscerated by a nasty reviewer is not a pleasant introduction to the scientific process. The cumulative student has to develop nerves of steel when the rejection letter comes in, and that's one more thing to learn. But it takes many years to develop the thick skin necessary to be able to subsist within this kind of review culture. Unless one is born tough, it's hard to recover from this shock right away.

If one considers the fact that a paper often goes through many revisions before it is accepted, it can take three years or more to get a single paper accepted. If it's controversial (as most good papers are), then the chances of an argument with the reviewers are much greater.

So how can a student deliver three papers (this is informally our requirement at Potsdam) as part of a cumulative, and within three years? Is this possible to achieve?

My advice for a successful three-year cumulative:

Five-step success program for a cumulative dissertation 
Results guaranteed if you drink the potion:

1. Don't waste two or more years trying to figure out what to do for your PhD. German PhDs require no course work; you are expected to start right away on research. My experience has been that whoever loses time in the beginning pays a heavy price towards the end (well, duh).
2. Forget about holidays, parties and the like and forget about the 9 to 5 work-life-balance stuff that you may have seen in Work-Life-Balance folders lying around the university; they are relevant for a lot of people, but not for a PhD student. A PhD is a 24 hr a day 7 days a week job, and a German PhD with a three-year window even more so (in the US, students in linguistics typically take five years, often much more, to get their PhD done). Anyone who thinks they can do a PhD in three years and have a life is doomed. You should only be doing a PhD if you enjoy it anyway. It should be your entire life, for those few years. Never forget that the three letters in the abbreviation, PhD, stand for Total Immersion Program. Of course, I exaggerate a bit here; you have to sleep, eat. But the point is that if you are doing a PhD, it has to be a priority in your life for the few years that you are busy doing it.
3. As soon as you know what your first paper will be on, and as soon as the data comes in and you know you have a publishable result, don't waste time. The Daily Show is not going anywhere; it's always going to be online, and the Colbert Report too. It's possible to have a rough draft of a normal-sized paper ready in two to three weeks of full-time work. You just have to have the discipline to do it. If you are in my lab, I envy you, because your advisor usually reads papers very quickly and responds to drafts with lightning speed. He can help you get that draft ready for submission. But you have to produce a draft first.
4. If you worked hard the first year, you have data. By the first quarter of your second year, you should have a paper under review or under revision after an eviscerating rejection. By the first quarter of the second year, you should have a second paper under review/rejection. By the end of the third year, you should have your third paper under review. That's it. It doesn't matter how many times the papers get rejected (revise and resubmit, to a different journal if they tell you, "we never want to see your face again"), they need to be in the pipeline. Just before you submit the cumulative dissertation, you should have the papers under review (for the 100th time, if need be).
5. That's it, you have submitted a cumulative with three publications either published or under review. As a last step, after submission, have a ritualized party: print out all those rejection letters and burn them (not in the Besprechungsraum please).

Open data initiatives

There has always been a vague desire on the part of experimentalists to have publicly obtainable data that's already been published and is in the public domain.

In psycholinguistics, the first such database I know of is Reinhold Kliegl's PMR2.  

In my lab, we are (or rather I am) also thinking of some way to provide easy access to our published lab data (in addition to listing it on PMR2, it would be good to have a local copy so that one is not 100% dependent on an archive maintained by someone else). 

I just heard about another data repository, directed more to linguistics (but also to psychologists): CLARIN-D.

In this context, I've been thinking about what properties our local repository should have (this is only about our own public repository). Here is a preliminary list:

1. Data access should require login and registration, as well as an "I agree" button to obtain agreement to the terms of the data release (below).
2. The data should be released on the condition that any new result derived from the data should be uploaded there so that people can follow the history of what happened with that data. The people downloading the data should cite the original work where the data were reported.
3. Once the re-analysis of the original data is in the public domain and the new analysis has been uploaded, it would be ideal if there were room for comments from others (e.g., a response from the original authors). I.e., this would be like a blog, but the blog should be integrated seamlessly with the repository, and not be a separate interface (see PMR2 for an example of what I would not like to have). Downloaders and users of our data should also agree to show us their reanalysis before publishing it, so that we have the chance to respond if we find something that we disagree about (e.g., how to remove extreme values).
4. We should release full data for the published study. E.g., all items used, all fillers (excluding filler experiments that are not published yet), and the raw as well as analyzed data (which could be non-raw, e.g., aggregated). People often dislike releasing all their data (I have had several people refuse to release the raw data, making re-analysis effectively impossible); they limit the release to just the data in the format that allows exactly the analysis already done, nothing further. What's the use of that kind of a data release? Suppose I want to look for a particular kind of confound in the data, and I can only do it if I have the raw data; a data release of the reduced dataset would be useless (I have been in this situation, and I could not use the released data).
5. Our own analysis for a particular dataset should be a Sweave'd document, with .Rnw, .R source, the data itself, a pdf. Ideally the paper should be the Sweave file. If every downloadable item has this collection of items, it will have a completely predictable structure, easy to understand for the outsider. I know that developing standards is hard, even within the confines of our own lab, but it might be worth it.
6. The data should not be in .Rda files, but rather as text files. I have had some problems accessing .Rda files in a new version of R that were created with an older version of R.
7. There has to be a contact person locally whom people from outside can contact (and it's not gonna be me!). That's the central problem with good ideas; they always require some work.
8. There should be a possibility to upload a new, improved data analysis even after the data is published. For example, I published a paper in 2004, when the state of my statistical knowledge was even more miserable than it is right now. I would like to post a revised analysis, done to the best of my current ability. There should be space for that in the interface. This cannot count as a re-analysis of the dataset by an outside, third party, and therefore it should be presented as part of the lab data set but marked as "revised data analysis", or something like that.
9. What about our re-analyses of *other* people's data? For example, Titus reanalyzed the Meseguer et al dataset; this should be presented not as original data from our lab; there should be a separate section for showcasing reanalyses that we did.
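
To make points 5 and 6 concrete, here is a minimal sketch (in Python, purely for illustration) of how one might check that a dataset folder follows the predictable structure described above. The function name check_dataset_dir and the exact file suffixes are my own assumptions, not an existing tool or an agreed lab standard.

```python
from pathlib import Path

# Assumed convention (hypothetical): each dataset folder holds a Sweave
# source (.Rnw), the extracted R code (.R), a compiled pdf, and the data
# as plain text rather than as a binary .Rda file.
REQUIRED_SUFFIXES = {".rnw", ".r", ".pdf"}
TEXT_DATA_SUFFIXES = {".txt", ".csv", ".tsv"}
BINARY_DATA_SUFFIXES = {".rda", ".rdata"}


def check_dataset_dir(path):
    """Return a list of problems found in one dataset folder (empty if none)."""
    # Collect the lower-cased suffixes of all regular files in the folder.
    suffixes = {p.suffix.lower() for p in Path(path).iterdir() if p.is_file()}
    problems = []
    for suffix in sorted(REQUIRED_SUFFIXES - suffixes):
        problems.append(f"missing a {suffix} file")
    if not suffixes & TEXT_DATA_SUFFIXES:
        problems.append("no plain-text data file")
    for suffix in sorted(suffixes & BINARY_DATA_SUFFIXES):
        problems.append(f"binary {suffix} file found; export it as text")
    return problems
```

Running check_dataset_dir on every folder before it goes public would catch an item that is missing its source files or that ships only an .Rda file.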

I'm looking forward to suggestions for further improvements (from anyone, not just lab members, that's why it's in the public domain).

Thursday, January 26, 2012

Tuesday, January 24, 2012

Titus von der Malsburg has submitted his dissertation

We inaugurate this lab blog with the news that Titus has submitted his dissertation! This is a special occasion for me too, since he's the first person I'll be graduating. Congratulations Titus!