Introduction
This page describes a few assorted themes that underlie my general writing style. I don't actually have an explicit style; rather, these are just some ideas that may run through my head when deciding how to write an essay.
Contents
- Introduction
- My history
- To use big words or not?
- How much expertise to assume?
- Modularity and essay length
- Updates
- Why I don't like crossposting
- Single source of truth
- Repeating myself
- Logical quotation
- Sentence-case capitalization
- Should "etc" have a period?
- Voice
- Ambiguous pronouns
- Drafts
- Outlining and "house of cards"
- Transitions, sectioning, and pictures
- Summaries: good
- Conclusions: bad
- Proofs and programming
- Going off on tangents
- To use discursive footnotes or not?
- Link rot
- Variable names
- Date formatting
- Miscellaneous formatting preferences
- Using smileys in emails
- Extracting information from source articles
- Quoting factual material
- Citations after every sentence
- Page numbers in citations
- Citing Wikipedia
- Errors in published articles
- Making one's uncertainty explicit
- Picture-filled presentations
- PDF vs. HTML
- No-frills websites
- Should you use a pseudonym?
- Acknowledgments
- Footnotes
My history
I think I wrote fairly normally from a young age through middle school, though I did enjoy playing with language, in part inspired by grammar I learned in my German class.
In high school, I began studying vocabulary words and generally reading the dictionary as preparation for SATs and for fun. I also enjoyed the word puzzles of Shakespeare's plays and was inspired by Ralph Nader's large vocabulary. These factors contributed to an increasing bombast in my writing. I endeavored to employ sesquipedalia wherever possible, even if they rendered my prose more cumbersome to read. When writing school assignments, I reined in my grandiloquence to avoid sounding weird, but when writing in private journals, I let loose with unusual verbal constructions that I found satisfying.
Around the start of college, I began to change my approach somewhat. One inspiration was George Orwell's famous "Politics and the English Language", which includes these recommendations:
(ii) Never use a long word where a short one will do. [...]
(v) Never use a foreign phrase, a scientific word, or a jargon word if you can think of an everyday English equivalent.
I became less concerned about making my sentences sound like 18th-century English prose. As a result, I was also able to write faster.
Around the same time, I became somewhat less obsessive about perfect grammar after I watched a documentary which pointed out that grammar is ultimately arbitrary. For instance, I realized, it's not actually more clear to use "whom" instead of "who" in the objective case; doing so is just a rule that developed and is now on its way out.
To use big words or not?
I remain conflicted on whether to use less common vocabulary words. Some reasons in favor of doing so:
- I personally find it really enjoyable to read articles that use big words. They sort of tickle my brain.
- Doing so may signal some degree of sophistication. For instance, big words are common in New Yorker articles and other so-called "high culture" publications.
And reasons against:
- Some have told me that my prose can occasionally be hard to read due to big words.
- One man's sophistication is another man's pretentiousness.
In general, my current approach is to use big words freely, either when they seem particularly apposite, or just to add flavor or humor to the writing. However, I don't go out of my way to figure out how to change a sentence so that it contains bigger words.
Prose vs code
We could compare simple words in natural-language text to the core functions of a programming language, like addition or printing output to the terminal. These basic words are likely to be successfully interpreted by almost anyone. Meanwhile, more complex words in prose can be compared with programming functions defined in external libraries: they provide powerful specific functionality, but not everyone has these words/functions "installed" in their brains/computers yet. Sometimes you can accomplish the same task as an external library's function using combinations of core operations in the programming language, just like you can explain a jargon word using combinations of simpler words.
With programming, I prefer to only use external libraries when necessary, so that the code can run on more people's machines without requiring them to import additional libraries. More libraries means more dependencies that could break things or pose a security risk when they're updated in the future, or they might fail to be maintained.
In the case of natural language, it's also true that using complex words will cause your writing not to fully "run" in the brains of more readers, but this situation is much less severe than in the case of programming, because often a reader can fail to know a few words and still understand most of the text. Moreover, it's often good to encourage readers to "install" more words in their brains for future use, which doesn't significantly add maintenance or security complexity. The definitions of words change slowly enough that updates to their meanings are unlikely to "break" existing writing any time soon, although over long time periods this can be a problem (Law 2018).
In non-technical prose, another consideration is that repeating the same simple words over and over can feel repetitive, and using a variety of words adds spice to the writing. However, in contexts where precision is essential, it may be best to reuse the same word to avoid confusion.
How much expertise to assume?
My essays tend to presume a high level of reader knowledge. Partly this is because many of my readers are experts, and I worry about talking down to them. The other reason is that in the age of Wikipedia, it's possible to look up details on which readers aren't clear, and I don't find compelling the idea of reproducing in my essays introductory information that already exists elsewhere. Why should I reinvent the wheel if Wikipedia already explained it better? Of course, this assumes that my readers have enough motivation to look up what they don't know.
I also tend to assume a high level of philosophical sophistication, again partly because I worry about talking down to experienced readers and partly because explaining the basics has already been done suitably by others. For the most part, I try to write only novel ideas—or at least material that's specific to myself. (For instance, most of what I say in this piece has probably already been written by someone somewhere, but it's not "standard knowledge" the way that, say, the basic schools of ethics are.)
I also freely assume mathematical sophistication by readers and don't worry much about including equations. On the other hand, I think it's often better to explain a concept in words and by means of examples than using formulas because formulas are often less scrutable in an unnecessary way.
Math envy
I suspect that some academics try to over-use Greek symbols and notation in order to dress up their articles as more rigorous and deserving of more admiration. The fewer people who can understand it, the more intelligent you must be, right? Other authors probably over-use notation because they worked out the ideas in their heads using special notation and don't realize how hard it can be for outsiders to pick up the notation.
I really like this blog post:
math envy. Y'know, the idea that math=intelligence. This utter foolishness leads to the simultaneous fear and awe of anyone who throws math around, as if the presence of mere symbols and equations demonstrates the clear superiority of the author's throbbing, bulging,... intellect. This utter foolishness leads, therefore, to authors who feel the need to add superfluous "mathematics" to their writings in order to demonstrate that their... intelligence measures up that of their colleagues.
Well, turns out, someone finally got around to doing a study on math envy: Kimmo Ericksson (2012) "The nonsense math effect", Judgment and Decision Making 7(6). As expected, those with less training in mathematics tend to rate utterly irrelevant "mathematical content" more highly than its absence. [...] Not to name names, but I've read more than one NLP paper that throws in some worthless equation just to try to look more worthwhile.
Against mathematical elitism
Honestly, I think a lot of mathematical ideas are quite amenable to purely verbal, conceptual explanation. This doesn't necessarily make them less rigorous, and people are more likely to use the ideas properly if they understand them at a conceptual level than if they blindly manipulate symbols. Correspondingly, I think most math is not beyond the ken of Average Joe, and if something doesn't make sense, it's probably because the author didn't explain the idea well enough rather than because the idea is inherently inaccessible to regular minds. Much of math is difficult merely because it's so detailed and requires so much background knowledge—sort of like a legal document—rather than because you need to be a genius to comprehend it.
Consider Newtonian physics. We have a rich vocabulary to describe an object's speed, acceleration, mass, shape, wind resistance, friction, and much more. All of this can be done without equations. We can explain and visualize an object's behavior at an intuitive level. There's no reason why the same can't be done for any branch of math or physics. All the components of an equation or proof can in principle be described in conceptual terms. With enough training, abstract mathematical operations can all become as intuitive as Newtonian physics. The main barriers are just the amount of material that needs to be learned and lack of time/motivation to learn it.
When people say "Math is too hard", I think what's often really going on is "I'm not interested enough in math to learn all the background material and spend a lot of time training my brain to make mathematical theorems more intuitive."
Modularity and essay length
I try to keep my essays roughly independent of one another—with cross-linking when necessary—because this reduces dependencies and doesn't force readers to have read some essays before others. This is also useful because I typically write the essays out of order.
In the past I've tended to write long essays out of convenience, but I now think shorter essays are probably better. One reason is that, though I'm not an expert at search-engine optimization, I would guess that long pages don't necessarily rank better than short pages. If the match between a query and a web page is measured by similarity of normalized word distributions, then having more words shouldn't generally improve the match, and indeed, having more words might reduce the match because the page would cover more total topics. Of course, it's possible that page length would improve the quality score assigned to a page.
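To make the word-distribution guess concrete, here is a minimal sketch. It assumes the match score is the cosine similarity between normalized word-frequency vectors; that scoring rule, the query, and the two page texts are all made up for illustration and are not a description of how any real search engine ranks pages.

```python
import math
from collections import Counter


def word_distribution(text):
    """Return a dict mapping each word to its share of all words in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical query and pages, invented for this example.
query = "link rot"
short_page = "link rot breaks old links and link rot loses traffic"
# The long page is the short page plus words about unrelated topics.
long_page = short_page + " quotation marks title case footnotes drafts outlines summaries"

query_dist = word_distribution(query)
short_score = cosine_similarity(query_dist, word_distribution(short_page))
long_score = cosine_similarity(query_dist, word_distribution(long_page))

print(f"short page: {short_score:.3f}")
print(f"long page:  {long_score:.3f}")
```

Under these assumptions, the page padded with off-topic words scores lower against the query than the focused page, which is the effect guessed at above. A real ranking system presumably weighs many other signals.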
The net effect of page length on ranking any given web page is unclear, but if pages are shorter, you can have more of them, thereby increasing total page views from search engines. Of course, what matters is how much of your content people read rather than just how many page views there are, but assuming people usually don't read much of a page they stumble upon, then having more page views would increase the total number of words that people read. Plus, more page views means exposure to more total people.
As of 2016, I try to write essays that are as focused as possible on a single issue while still being self-contained and not requiring lots of cross-referencing to other web pages. This approach is inspired by programming, where one is advised to write small functions and classes.
Another benefit of shorter articles is that you don't have to worry as much about accidentally messing stuff up when editing the article—such as accidentally deleting parts of the article or doing a poorly designed find-and-replace on the article text—because there's less total content in the article to mess up, so the damage if you do mess something up is more limited.
Updates
I dislike the idea of writing essays that will become outdated. My website is a living document that I try to update as my views change (though because of the number of writings I have, I can't always update everything satisfactorily). It's more helpful to readers if a single essay on a coherent topic contains the full set of my thoughts than if I have different thoughts from different times scattered across several date-stamped blog posts. Thus, I treat my site a lot like a wiki—a private wiki that only I edit.
Why I don't like crossposting
People sometimes crosspost a blog post that first appeared on one site to another site. I'm not particularly annoyed when other people do this, and I can see some cases where it makes sense. But I personally dislike crossposting and try to avoid it. Why?
- The most important reason is that I update many of my writings over time, and if I have N copies of the same article in different places, I need to update all N copies. This is one of the reasons programmers avoid duplicating code. If you're lazy and don't update all versions, then people may read the out-of-date version. (See the next section on "Single source of truth".) This is also why I don't like uploading papers to sites like academia.edu, although published academic papers are, sadly, less likely to be updated over time, making the problem less severe in that case.
- Crossposting feels somewhat spammy, because if you want people to read your post, you should just link to it, not copy over the whole text. Crossposting also may unfairly give you an advantage on Google because now if either one of the two copies of your article ranks high for a query, you get clicks. On the flip side, search engines might penalize you for duplicate content. Using "canonical links" could help with this, but most informal, copy-and-paste instances of crossposting that people do probably don't make use of this possibility.
As an alternative to crossposting, what about publishing directly on someone else's website and not on your own site? I don't like this option either because I have complete control over my own site, including making sure that it stays online, making sure that formatting doesn't get messed up by style updates, avoiding broken links by creating redirects when necessary, and making sure the site is backed up. If I also have content on other people's sites, I need to worry about those things for those other sites too. (Several sites where I've posted content in the past have gone offline or had their formatting messed up.)
It's also nice to have all my main published content on just one or a small number of websites for reasons like the following. Imagine that one day you realize you've been using a technical term incorrectly throughout your writings. Or maybe you've been using some part of HTML syntax wrong. You want to go through your articles and correct the mistakes. If your articles are all on one or two sites, you can download those sites, grep through them, find the places where revisions are required, and then make those revisions. If you have your content on more total sites, you have to download and grep through more different things, and to make the fixes, you have to log in to more different website accounts. Or maybe someone else posted the article for you on their blog, in which case you'd have to pester that person to make the edit on your behalf. Maybe the person who originally posted the content 5 years ago has now moved on to another phase of life and is hard to get in contact with. And so on. All these issues are ameliorated by concentrating the content you care about in a small number of places that you fully control.
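For readers who don't have grep handy, the search step can be sketched in Python. This is only an illustration: the function name find_term, the list of file suffixes, and the case-insensitive literal matching are assumptions of the sketch, not a description of my actual workflow.

```python
import re
from pathlib import Path


def find_term(root, term, suffixes=(".html", ".htm", ".txt")):
    """Yield (path, line_number, line) for each line under root that contains term."""
    # Match the term literally and case-insensitively (an assumption of this sketch).
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and files whose suffix is not in the allowed list.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield path, line_number, line.strip()
```

Pointing find_term at the folder holding a downloaded copy of a site, together with the misused term, gives the list of places that need revision. With content spread over many sites, the same search has to be repeated once per downloaded copy.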
What do I recommend instead of crossposting? Reddit. Have a forum where it's ok to just share links to an article without copying over the full article text, but where you can still have high-quality, lengthy, Google-indexed comment threads with upvoting/downvoting. Reddit also allows for custom posts on a subreddit in case there's no original article to link to, or in case the OP wants to comment on the article rather than just linking to it.
Single source of truth
In a situation where data (HTML files, images, PDF files, etc.) can constantly be updated, it's important to have a "single source of truth" that stores the definitive, latest version of your content. For this reason, I try to keep my website clean, such as by deleting images that I no longer use and setting up redirects whenever a page moves. If you keep old files lying around, you may forget whether they're supposed to be there for some reason, and you risk one day accidentally linking to an old copy of, say, a PDF file rather than to the latest version. Therefore, I recommend keeping only the latest version of the file on your website.
Of course, it's good to retain old copies of files for version control and redundant backups. In fact, duplicating content many times over makes it more secure against data loss and data corruption. However, this should be done somewhere other than on your live site (such as in a "backups" folder in your personal files). Your backups can be duplicative and messy because you won't confuse them for being the "live", latest versions of your website files.
I try to use the same practice of cleaning out old content for my personal files as well. Here's an example of why this is valuable. In the past I created a todo-list file (call it file1.txt), and later, I carelessly created a copy of the file and modified it (call it file2.txt) while leaving the old version lying around. Many years later, when I was reviewing these files, I didn't know if there was anything in file1.txt that I still wanted, or whether I only needed file2.txt. As a result, I had to spend a good deal of time reviewing and comparing the two files. If I had instead deleted file1.txt after I knew I was done with it, I would have saved myself this trouble. The general principle is to organize and annotate your data (including deleting it if it's no longer needed, or moving it to a folder explicitly marked for archiving of old material) while you have fresh in your head information about what should be done with it, rather than needing to figure that out over again at some later date. Cleanliness on your computer is not just for neat freaks; it actually has significant utilitarian value.
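If you do end up with two diverged copies, part of the comparison can be automated. The following is a rough sketch, assuming both files are plain UTF-8 text and that a line counts as "kept" only if it appears verbatim (ignoring surrounding whitespace) in the newer file. The function name lines_only_in_old is hypothetical.

```python
from pathlib import Path


def lines_only_in_old(old_path, new_path):
    """Return non-blank lines of old_path that do not appear anywhere in new_path."""
    new_lines = {
        line.strip()
        for line in Path(new_path).read_text(encoding="utf-8").splitlines()
    }
    missing = []
    for line in Path(old_path).read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        # Keep only lines with content that the newer file lacks.
        if stripped and stripped not in new_lines:
            missing.append(stripped)
    return missing
```

Calling lines_only_in_old("file1.txt", "file2.txt") and getting back an empty list would suggest the old file holds nothing the new one lacks. Reworded lines would still show up as missing, so this narrows the manual review rather than replacing it.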
Repeating myself
I generally aim to avoid repeating myself in my online writings. For instance, I try to use facts, quotes, or detailed arguments in only one location on my websites. The motivation for this is that I don't want to be someone who creates a lot of content just by repeating the same points. This comment from Bill Watterson on why he ended Calvin and Hobbes has stuck with me:
By the end of 10 years, I'd said pretty much everything I had come there to say.
It's always better to leave the party early. If I had rolled along with the strip's popularity and repeated myself for another five, 10 or 20 years, the people now "grieving" for "Calvin and Hobbes" would be wishing me dead and cursing newspapers for running tedious, ancient strips like mine instead of acquiring fresher, livelier talent. And I'd be agreeing with them.
That said, I have, perhaps unfortunately, repeated myself a few times when writing about subjects like consciousness, although my goal with writing so many different consciousness essays was partly to drive my viewpoint home by saying it using many different explanations.
I also repeat myself when I write pieces or make videos that are aimed at a different audience than people who will read my detailed writings. And I often repeat myself in comment sections of different blogs or on different Facebook threads.
Logical quotation
In 10th grade (2002), I was taught to use the American style of quotation, where periods and commas go inside quotation marks even when they don't belong, like "this." This style contrasts with "logical quotation", like what I did just there—keeping the commas and periods where they logically belong. One American friend of mine encouraged logical quotation because it made more sense even though it wasn't standard, but I stuck to American quotation lest readers mentally penalize me on the assumption that I didn't know the rules at all. But in 2014, I discovered that Wikipedia uses logical quotation, so I switched to that thenceforth. Because I haven't gone back to change the quotation style in my earlier writings, you can date sentences I've written as before or after mid-2014 based on quotation style (in analogy with stratigraphic dating).
Sentence-case capitalization
Besides American-style quotation, another rule I was taught in school that I eventually abandoned was capitalizing all major words in article titles; this is called "title case" capitalization. I continued doing this until 2018 because I thought it was how you were supposed to do things. However, different websites use different styles. For example, Wikipedia articles, as well as many news sites, use so-called "sentence case" capitalization, in which only the first word and proper nouns are capitalized. I think sentence case is better for several reasons, including the following:
- Sentence case is easier. When using sentence-case capitalization, you don't have to agonize over whether a word gets capitalized based on its part of speech, number of letters, and so on. There are websites where you can enter a title and get the appropriate capitalization, but doing that takes extra effort. Plus, there's a variety of different ways to capitalize the major words in a title, which means you might accidentally switch between different methods over time, causing inconsistency. With sentence-case capitalization, there's not much ambiguity about which words are capitalized.
- Title case loses information about which words are proper nouns.
When you're writing text with a typewriter, perhaps title case can be helpful as a way to distinguish the title from the body of a paper. But on most websites, you can use text styles and placement to distinguish the title, making capitalization unnecessary.
Should "etc" have a period?
The standard abbreviation of "et cetera" is "etc." I'm usually too lazy to add italics, so it could be written as "etc.". I used to consider it a harmless error when people wrote "etc" without a period, but as of 2018, I've decided I generally prefer the period-less "etc". The reason is that adding a period can make it look like you're ending the sentence. For example, if I write "Go buy pens/paper/etc. at the store.", it could look at first glance like the sentence is just "Go buy pens/paper/etc." In particular, word-processing programs sometimes assume that "etc." ends a sentence, and therefore, these programs may wrongly capitalize the following word: "Go buy pens/paper/etc. At the store." This is less problematic for abbreviations like "i.e." that often are immediately followed by a comma, which shows that the sentence isn't ending.
Voice
I mostly write the way I would speak. One friend told me that my "writing is not heavily stylised writing, but it's very pleasing to read. It's warm and comforting, like I'm being hugged by the words as I read them." It's also easiest to write quickly with a conversational tone.
Ambiguous pronouns
I've noticed that many of my most confusing sentences are those that use "it", "these", and similar pronouns. The purpose of pronouns is usually to avoid using the same word/phrase twice in a row, but using the same word/phrase twice in a row is better than the opposite problem of having an unclear sentence. I think writers should probably err more on the side of clumsy but clear sentences than elegant but vague ones.
Of course, sometimes there are ways to write a passage that avoid confusion while still using a pronoun instead of the original word. This page gives one example of that:
Error: When Samuel dropped the goblet onto the glass table, it broke. (What broke? The table or the goblet?)
Correction: The goblet broke when Samuel dropped it onto the glass table.
Drafts
When I was in school, teachers would often insist on writing a first draft and then rewriting the essay into a final draft. I found this aggravating and stopped doing it once it was no longer required. The problem is that I'm fatigued by the second-round draft and don't put my full effort into it because it feels like I'm just doing the same thing over. I wonder if multiple drafts were more important in typewriter days before electronic word processors made it possible to rewrite arbitrary pieces of text without disrupting the whole essay.
Once an essay is done, I reread it twice, using the first pass to carefully comb the words and the second to check the overall fluency of the sentences and transitions.
Outlining and "house of cards"
When I think of a new essay I want to write, it feels like an avalanche building up in my brain. Ideas keep accumulating and picking up steam the more I think about the topic. I rarely explicitly write outlines of my essays, but I plan the general structure in my head, possibly including some key sentences. I need to write down the ideas all at once while they're fresh or else I risk forgetting some of them. A blog post I read a while back and can't find now referred to this fragility of ideas in one's head as a "house of cards". If you're distracted for too long, the short-term memory traces fade, and the house collapses.
Transitions, sectioning, and pictures
Teachers in school make a big deal of transition sentences to connect paragraphs. I find myself naturally including these in cases where they seem sensible. I often picture transitions between paragraphs like dominos or puzzle pieces: Two paragraphs fit together by sharing some idea at their borders.
Schools tend to emphasize essays with only raw text. This seems unfortunate, because raw text is suboptimal for conveying ideas efficiently. Readable essays make copious use of sections, which not only allow for quickly finding information but also provide a built-in tl;dr for a piece, similar to self-documenting function names in code. Wikipedia seems to understand this.
I also try to use bullets, numbering, and other text structures as much as possible because
- this distinguishes the items more clearly than using marker words in a paragraph would, and
- scaffolding beyond a mass of text in a paragraph is more helpful to skimmers.
Likewise, diagrams and pictures convey ideas more clearly and quickly than text does. They should be used as much as is sensible, though admittedly they also require more effort to create.
Summaries: good
Until 2007, most of the essays I wrote took the style of many philosophy papers by jumping into the subject without any Abstract or Summary at the beginning. A reader of my site told me s/he prefers papers with abstracts (as do most scientists). From this point onward I began adding Summaries to most of my writings. This was one of the best pieces of advice I've ever gotten on how to write well. I think my essays are clearer by having a Summary at the top, and the Summary also makes it much easier for casual readers to glean the gist of my point rather than navigating away on the grounds that my piece was too long.
I really think that almost any writing longer than a few paragraphs should have a Summary (except for fiction where the plot would be spoiled by doing so). Often magazine-style articles begin with a catchy event to grab the reader's attention. I think this isn't nice to the reader. Even these pieces should offer a Summary before jumping into the juicy details, since there are many cases where a reader simply lacks the time to digest the whole piece.
Conclusions: bad
Many of my essays for school were required to use the five-paragraph format: introduction, three body paragraphs, and conclusion. I often found this annoying, because I ended up saying basically the same things in the second half of my introduction as in the conclusion. I didn't see any point to having both. I felt like a five-paragraph essay was a mozzarella stick with a huge amount of bread on it: there was just a tiny amount of substance in the three body paragraphs surrounded by two repetitive summaries on either end.
Now that I don't have requirements on my writing, I usually omit conclusions. Sometimes I design the last sentence or two of the essay to wrap up and repeat a high-level idea, but I don't think this is necessary. If the reader wants a conclusion, he can go back and reread the "Summary".
Perhaps George W. Bush would be more fond of making the same point again in a concluding section. In 2005, he said "See, in my line of work you got to keep repeating things over and over and over again for the truth to sink in, to kind of catapult the propaganda."
Conclusions can be valuable if, by explaining the same thing in different words, they make your thesis more understandable to readers.
A downside is that if you change your findings from a study, you have to edit more total things (both the "Summary" and the "Conclusion" rather than just the "Summary").
I think whether a conclusion makes sense can be evaluated on a case-by-case basis. If you expect to basically rewrite previous paragraphs, then don't add one. If you'd synthesize things in a new way, then do include one.
Proofs and programming
In college I remember enjoying the puzzle of fitting together an essay in an elegant way. I was writing proofs and computer programs at the same time, and I remarked to myself how similar all three forms of writing were: All involved a pleasantly creative process of stitching together a nice design that would be clear, effective, and organized.
Essays are organized into sections; code is organized into functions; proofs are organized using lemmas. Subroutines of a program feel almost identical to lemmas of a proof: You take input assumptions, do some processing, and output a conclusion. Essays organized around a core argument may also involve "lemmas" when arguing for each step.
Going off on tangents
A friend mentioned to me that many of my writings ramble more on tangential topics than is standard in the academic literature. This is because I like to include side comments about something interesting when the opportunity arises. Our thoughts are not laser-focused on demonstrating a single argument, and I think essays can be the same way. Tangents may add interest or have academic value in their own right, and I find that including them can spice up prose. Daniel Dennett seems to agree, as his writings are profuse with illustrations and side comments that, while only somewhat relevant to the context at hand, make his overall discussion more fun and memorable, while also teaching the reader some interesting tidbits along the way.
To use discursive footnotes or not?
In the distant past I didn't shy from using discursive footnotes, i.e., footnotes that give more detail about some topic. The main selling point of such footnotes is that they avoid bogging down the text with unnecessary tangents. However, these days I generally try to minimize explanation within footnotes and incorporate the information into the main text. In analogy with Blaise Pascal's quote that "I would have written a shorter letter, but I did not have the time", I think of footnote-filled text as what one writes when one doesn't have the time to organize the information into the main text. As you can see, I still sometimes do use footnotes (even in this article), but I generally try to avoid them, and when I do use footnotes, I sometimes consider it a mark of laziness. (In some cases, the footnote does seem truly tangential enough not to be part of the main text.)
My disinclination toward discursive footnotes comes from my own experience as a reader. I'm the kind of reader who wants to read all the discursive footnotes, but when I do, my reading is disrupted by going to the footnote and then returning to the main text, especially if the footnote is mid-sentence. Footnotes are especially cumbersome if they're at the end of the article or book because I either have to go back and forth or open two different browser tabs, one opened on the main text and one opened on the footnotes. In addition, if you convert an article to audio using text-to-speech software, it's even harder to follow footnotes properly. As a reader, I would prefer for the material in the discursive footnote to be introduced gently within the text itself. Apparently I'm not alone in this sentiment. Mills (2014): "Those long discursive notes is a major cause of why John Q. Public hates writing that 'looks like a dissertation.' If we divide our narrative discussion into two separate threads, one in the text and one in the notes, we force our readers to try and follow two threads at once. Many will quit in frustration or annoyance."
Of course, possibly many people have the opposite reading experience: they don't read discursive footnotes and are annoyed when the main text gets too bogged down in details and qualifications.
Link rot
Link rot on the web is terrible. In my informal experience, it seems that maybe ~5-10% of my external links break per year.
When I move the locations of articles on my site, I'm careful to set up redirects. As a result, I think almost no links to my sites should be broken, except for links to individual sections within an article or links to image files. But many other websites aren't so careful, and as a result, they lose out on potential traffic that would have come from old links.
The way I've decided to preemptively combat link rot is to include the title of an article in the title="" attribute of the hyperlink, which you can see by hovering over the hyperlinked text. Doing so helps because even if a link is broken, it's usually possible to find the page by searching its title. Putting link titles in the title="" field also has the advantage that, if a url isn't very descriptive, readers can preview what the article is about by hovering on the hyperlink and examining its title, rather than needing to click through. Unfortunately, I don't think this works for mobile readers.
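As a rough illustration of the idea, here is a minimal Python sketch that builds a hyperlink with the article's title stored in the title="" attribute. The function name link_with_title, the example url, and the example article title are all hypothetical; this is one way such a link could be generated, not a description of how any particular site does it.

```python
from html import escape


def link_with_title(url: str, article_title: str, link_text: str) -> str:
    """Return an <a> tag whose title attribute stores the article's title."""
    # escape(..., quote=True) also escapes double quotes, so the values are
    # safe to place inside double-quoted HTML attributes.
    return '<a href="{}" title="{}">{}</a>'.format(
        escape(url, quote=True),
        escape(article_title, quote=True),
        escape(link_text),
    )


# Hypothetical url and title, used only for illustration.
example_tag = link_with_title(
    "https://example.com/p?id=42",
    'Why "Link Rot" Matters',
    "this essay",
)
print(example_tag)
# <a href="https://example.com/p?id=42" title="Why &quot;Link Rot&quot; Matters">this essay</a>
```

Even though the example url says nothing about the page's content, the title survives in the markup, so a reader could still search for it if the link breaks.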
Looking up the old link on Internet Archive's Wayback Machine also often solves link rot, but not always.
For scholarly papers, I try to use DOI urls (i.e., https://doi.org/______, where the ______ is the article's DOI) because these are supposed to be persistent urls.
What should website authors do about link rot? I think it's acceptable not to fix broken links, since doing so requires eternal vigilance, and most authors aren't able to put in that level of effort. I've considered using a tool to scan for broken links in my essays, but I usually have more urgent tasks to attend to; maybe some day I'll get around to it.
If you do have time to fix broken links, here's a possible approach:
- Google for the title and/or other identifying information of the article to see if the page has moved to a new url. If so, link to that page.
- If there's no new version of the page, link to a recent Wayback Machine copy of the page.
- Failing that, simply keep the old link, perhaps with a note saying it's broken.
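The three fallbacks above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function names wayback_url and choose_replacement are hypothetical, and the sketch assumes you have already done the manual work of searching for a new url and looking for an archived snapshot. The only outside fact relied on is the Wayback Machine's address pattern of https://web.archive.org/web/ followed by a numeric timestamp and the original url.

```python
from typing import Optional


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine address for a snapshot of original_url.

    timestamp is the snapshot time written as digits, e.g. "20180407000000".
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def choose_replacement(
    old_url: str,
    new_url: Optional[str] = None,
    snapshot_timestamp: Optional[str] = None,
) -> str:
    """Apply the three fallbacks in order and return the text to link to."""
    if new_url is not None:
        # Step 1: the page moved, and a search turned up its new address.
        return new_url
    if snapshot_timestamp is not None:
        # Step 2: no new version exists, but an archived copy does.
        return wayback_url(old_url, snapshot_timestamp)
    # Step 3: keep the old link and flag it as broken.
    return f"{old_url} (broken link)"


# Hypothetical inputs, for illustration only.
print(choose_replacement("http://example.com/old", new_url="https://example.com/new"))
print(choose_replacement("http://example.com/old", snapshot_timestamp="20180407000000"))
print(choose_replacement("http://example.com/old"))
```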
I think it's bad to remove broken links entirely because they provide important citation information. (You wouldn't remove a citation to an old manuscript even if no one could find an extant copy.) Plus, the site may come back online in the future, and knowing the url might help you track down the site's author. If you absolutely must remove broken urls, I would at least preserve the information about what the url was within an HTML comment next to the hyperlink, although few readers will ever realize that this information is available.
Should you prefer to use Wayback Machine rather than looking for a new version of the page? This has the advantage of pointing to the web content in roughly the form you saw it when citing it, whereas it's possible that the new page you're linking to is significantly different than the old one. (Of course, that can be true even if the url hasn't changed.) On the other hand, it's generally good to point readers to the latest version of the information.
Another argument for using a Wayback Machine url is that once you replace the old url, readers won't know what the old url was and so won't be able to look up the old content for themselves in the Wayback Machine; in contrast, readers can always Google for a new version of the page if they want to.
Finally, the phenomenon of link rot makes me more inclined to quote from a source text rather than merely linking to a source text and assuming that the reader can find the relevant information there. In 10 or 20 years, the source text may be inaccessible, so I had better save the relevant information myself. Indeed, I wish I could preserve the entire source text on my website as a backup against data loss, but copyright issues would probably stand in the way. UPDATE: I learned that the "WebCite page combing form" "allows you to submit a URL or file whose links will be archived using WebCite." So it seems that you can in fact conveniently archive all the pages cited in an article that you or someone else wrote.
Variable names
One of my friends said that math would look less intimidating if it were written in the style of programming, where variables had full, descriptive names rather than single Greek letters. I agree. On the other hand, manipulating such equations would be harder, and the equations would be much longer (probably requiring several lines of "code").
When programming, I prefer to use long variable names, because they're self-documenting, and unlike with comments, you're more likely to remember to refactor them as your program changes. They make code bulkier, yes, but I'd rather understand bulky code than puzzle over cleaner code. Ease of understanding is extremely important for other people and for your future self, who will have forgotten how the code worked within a few years.
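To make the tradeoff concrete, here is a small Python sketch of the same calculation written twice, once with terse math-style names and once with long descriptive names. The compound-growth formula and all of the names are hypothetical choices made for this example.

```python
# Terse, math-style names: compact, but the reader has to remember
# (or be told in a comment) what p, r, and n stand for.
def fv(p, r, n):
    return p * (1 + r) ** n


# Long, self-documenting names: bulkier, but readable without a comment.
def future_value(principal, annual_interest_rate, number_of_years):
    growth_factor_per_year = 1 + annual_interest_rate
    return principal * growth_factor_per_year ** number_of_years


# Both versions perform the same arithmetic, so they agree exactly.
print(fv(1000, 0.05, 2) == future_value(1000, 0.05, 2))  # True
```

If the meaning of an input later changes (say, from an annual rate to a monthly rate), the long name looks wrong at every place it's used, which is a nudge to update it. A stale comment gives no such nudge.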
Date formatting
If you write a month using numbers rather than letters, there can be ambiguity about the date depending on whether the month or the day is supposed to come first. For example, is "04/07" April 7 or July 4? I prefer to write the month using letters to obviate this ambiguity. Using letters also provides a helpful separator between the numbers of the year and day, so that you can write something like 2018Apr07 without needing hyphens to delimit the parts.
I prefer to write the year first (like "2018Apr07") rather than writing the day first (like "07Apr2018") because with the year first, sorted file names will be organized by year rather than by whatever day of the month they happen to be from. For human readers, too, putting the year first makes sense because usually the year is the most important piece of information, unless it's, say, a news story where anyone who reads it soon after it's published already knows the year. I also include a leading zero when necessary (like "2018Apr07" rather than "2018Apr7") so that if file names are blindly sorted as text, then, for instance, "2018Apr07" will appear before rather than after "2018Apr11". (That said, I find that some file managers recognize numbers within file names as numbers when sorting, making a leading zero in the day unnecessary.)
If you're really optimizing for proper sorting of file names, then using numbers for months is better: "2018-04-07". A former colleague of mine strongly preferred this format for reasons of file-name sorting. I'm uneasy about ambiguity regarding month and day order, but if you're writing the files for yourself and consistently stick to the YYYY-MM-DD format, then this is indeed arguably the best way to go.
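The sorting behavior described above can be checked with a short Python sketch. The helper names letter_month_stamp and numeric_stamp and the sample dates are hypothetical; the month abbreviations are hard-coded in English so the result doesn't depend on the machine's locale.

```python
from datetime import date

MONTH_ABBREVIATIONS = [
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
]


def letter_month_stamp(d: date) -> str:
    """Format like 2018Apr07: year, three-letter month, zero-padded day."""
    return f"{d.year}{MONTH_ABBREVIATIONS[d.month - 1]}{d.day:02d}"


def numeric_stamp(d: date) -> str:
    """Format like 2018-04-07 (YYYY-MM-DD)."""
    return d.isoformat()


sample_dates = [date(2018, 4, 11), date(2018, 1, 30), date(2018, 4, 7), date(2017, 12, 25)]

# Plain text sorting, as a file manager that sorts names blindly would do.
letter_sorted = sorted(letter_month_stamp(d) for d in sample_dates)
numeric_sorted = sorted(numeric_stamp(d) for d in sample_dates)

print(letter_sorted)
# ['2017Dec25', '2018Apr07', '2018Apr11', '2018Jan30']
print(numeric_sorted)
# ['2017-12-25', '2018-01-30', '2018-04-07', '2018-04-11']

# Without the leading zero, April 11 sorts before April 7.
print(sorted(["2018Apr7", "2018Apr11"]))
# ['2018Apr11', '2018Apr7']
```

With letters for the month, years sort correctly and the leading zero keeps days in order within a month, but the months themselves come out alphabetically (April before January). Only the all-numeric format sorts fully chronologically as plain text.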
Miscellaneous formatting preferences
Should you write the numerical range between 100 and 300 as "100-300" or "100 to 300"? I prefer using "to" because the "-" symbol could be interpreted as subtraction.
I prefer using underscores rather than spaces in file and folder names, because spaces can cause headaches. On the command line, spaces need to be escaped. And in urls, spaces become %20. Unfortunately, the underscore key on a keyboard is harder to type than the spacebar. It would be nice if there were a key for "Underscore Lock", similar to "Caps Lock", that would cause the spacebar to output underscores instead of spaces. You could then use this when typing long file names with lots of underscores (or long variable names when programming).
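Both headaches are easy to see with Python's standard library. In this minimal sketch the file name is hypothetical; urllib.parse.quote shows what a url-encoder does to the name, and shlex.quote shows how the name has to be written to be safe on a POSIX-style command line.

```python
import shlex
from urllib.parse import quote

name_with_spaces = "my summer notes.txt"
name_with_underscores = name_with_spaces.replace(" ", "_")

# In a url, each space is percent-encoded; underscores pass through unchanged.
url_spaces = quote(name_with_spaces)
url_underscores = quote(name_with_underscores)
print(url_spaces)       # my%20summer%20notes.txt
print(url_underscores)  # my_summer_notes.txt

# On the command line, the name with spaces must be quoted (or escaped);
# the name with underscores can be typed as-is.
shell_spaces = shlex.quote(name_with_spaces)
shell_underscores = shlex.quote(name_with_underscores)
print(shell_spaces)       # 'my summer notes.txt'
print(shell_underscores)  # my_summer_notes.txt
```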
Using smileys in emails
I use smileys :) a lot in emails, Facebook comments, and so on. Happy emoticons help break the ice in communication and combat the problem that tone doesn't carry well through text. It puzzles me that emoticons weren't invented centuries ago; they're amazingly useful and convey a lot of information in a small number of characters.
More generally, I try to be cheerful and positive in communication except in rare situations where that isn't warranted. This helps the conversation participants feel more warm toward each other and themselves. It's also more effective at "winning friends and influencing people". Smiles and "thank you"s can be especially useful for defusing situations that might otherwise turn into confrontation.
I use smileys and Facebook "Like"s the way bonobos use sex:
Sexual activity generally plays a major role in bonobo society, being used as what some scientists perceive as a greeting, a means of forming social bonds, a means of conflict resolution, and postconflict reconciliation.[40]
On one Rationally Speaking episode, Julia Galef cited a book (which I can't now find) about what people can learn from dog-training psychology. One example from the book involved a person A who wanted person B to call A more often, but whenever B does call A, A begins the conversation by asking angrily, "Why don't you call me more?!" Part of the answer to A's question is that whenever B does call, the action is negatively reinforced by A's accusations. In general, emotional reinforcement of this type is powerful and can make the difference between having lots of friends and collaborators or having few.
Extracting information from source articles
When I was in 9th grade, one of my teachers required the class to use a particular method of note-taking when writing a research paper. That method was to take notes on our sources on index cards, one note per card. We were supposed to rewrite the source material in our own words on the index cards. Then we could reorder the cards into the proper sequence for the research paper and write the paper based on the cards, without consulting the original sources.
My guess is that a main purpose of this approach was to reduce the risk of accidental plagiarism. The concern was probably that if a student read a source and then directly transferred its information into her paper, the student would be more likely to copy the language of the original source.
I think this approach to note-taking is suboptimal. In fact, currently I don't even do "note-taking" for my writings. Instead, I type up a skeleton outline of the article I plan to write, read through a paper, and as I'm reading, I directly take note of important information and add it to my own article. I think this approach is better than note-taking because it reduces the risk of error. If I read 10 papers and take notes on them before adding any of that information to my own article, I will probably have forgotten some details of the papers, and I might mix up some of the different papers I read. As a result, my writing about those articles may be somewhat error-prone. This seems especially likely if the articles I'm citing are, for example, scientific studies where I want to report the details of the study location, sample size, and conclusions precisely. It's very easy to introduce minor errors in reporting such information unless you extract the important information from the study immediately after reading it. Memory of details fades quickly over time.
As for plagiarism, if you're writing about a source article simultaneously with reading that article, then you can explicitly reword your sentences to avoid plagiarism. In contrast, if you write your sentences long after having read the source article, you might, on occasion, accidentally write the facts in the same way as the source article did. That said, reading the source article right before extracting its information into your own article may limit your creativity about how to write the information in your article, slightly increasing a broad kind of plagiarism caused by having a similar structure of exposition as the source article, even if the words are different. There does seem to be some tension between not plagiarizing vs. not making mistakes in reporting information from a source article. Personally, I prefer to solve this dilemma by quoting information from source articles, as discussed in the next section.
Quoting factual material
In 2001, I wrote a school research paper in which I included a number of quotes that contained statistical data from source materials. My teacher told me that unless a passage from the source text is unusually distinctive, I should rewrite the information in my own words rather than quoting it. I grudgingly did so.
Rewriting in one's own words may be necessary for formal publications, but when writing informal pieces, I prefer to revert to the habit of quoting factual information if it's particularly well stated in the source material. The main reason is that doing this reduces my probability of making mistakes when transferring the information to my own page. It's very easy to miss subtle things when moving data from one source to another. For example, maybe the source paper says "adult mosquito population", but you just write "mosquito population", forgetting to qualify that it's the adult mosquitoes only. Lots of small oversights like this may be introduced when trying to rewrite information in one's own words.
Including quotes is especially valuable for readers when I'm citing web articles without page numbers. Quoting allows readers to Ctrl+f for the quote within its original context on the source web page.
LeVar Burton famously said on Reading Rainbow: "But you don't have to take my word for it." Likewise, when you quote a source text, readers don't have to take your word that you've correctly paraphrased the material and that you're not distorting what the original author said. (Distortions of meaning are still possible with quotations, but they're arguably harder to pull off and easier to discover, because readers can directly search for the quote and check its surrounding context, rather than wondering whether the source author actually said what was attributed to her somewhere in her whole article/book.)
In general, quoting rather than paraphrasing source material seems to me like almost a strict improvement in terms of the quality and transparency of presenting research findings. It's a shame that this practice isn't more widely accepted for formal writing.
Kaj Sotala gave me the following feedback on this section:
I find that as a reader, quotes often tend to impose a small cognitive cost that makes for heavier reading. Long quotes usually have a somewhat different style and context than the work that's quoting them, and this seems to cause a small "switching cost" in my head as I have to adjust for the new style and context, and then switch back when returning back to the original essay.
For this reason I think it's probably better to rewrite things in one's own words, as that will make the content fit your existing style and flow of text and let the reader absorb the information without an additional cost. (Though in practice I still leave a lot of quotes in my text since it's easier.)
I probably agree with this if you're writing for a very large audience, such that the additional cost to the writer of rewording and checking the material is small compared with the collective benefit to the readers. But for most of what I write, the number of readers is relatively small.
One alternative to quoting extensively from source material directly in your article can be to write your article in your own words but include copious footnotes in which you extensively quote the original sources. We might call this a "quotes in footnotes" approach. In this case, a quote becomes like an extended part of a citation, pointing readers to the location in the source text that gives the information that you're referring to. Personally I don't like using many footnotes because they make reading choppy (requiring the reader to switch between the main text and the footnotes a lot), but for readers who want to skip over the details, the "quotes in footnotes" approach may be appealing.
Citations after every sentence
In school, I was taught that if the information in several consecutive sentences comes from the same source, you should wait to include a citation for it until the last of those consecutive sentences, to avoid repetition. I see the same done on Wikipedia in many cases.
I strongly dislike this rule. It allows for ambiguity about how much of the stated content actually comes from the cited source. For the same reason, it can lead people to assume that what you've written doesn't have a citation. Except when it's obvious that all my information comes from a given source (such as because I'm discussing the author explicitly in my text), I try to include citations after every sentence.
This is especially important for shared writing pieces like Wikipedia articles, because someone might insert a sentence in between your original two or more sentences, in which case your earlier, uncited sentences will no longer be consecutive with the final sentence that contains the citation.
Page numbers in citations
I find it unfortunate that standard scientific in-text citations, such as "(Smith 2009)", don't include page numbers. Of course, this is understandable if one is citing some general theme expressed by the entirety of an article or book. But when citing a particular fact, I prefer to include a page number, since this makes it easier for others (including fact-checking reviewers) to find the original information.
It's harder to point to the location of information when citing an HTML website, but one helpful approach is to quote the sentence(s) of the original source that contain(s) the information (perhaps in a footnote or the title="" field of the hyperlink if not in the main text of your piece) so that readers can Ctrl+f for the quoted text in the original source.
Wise (2000) shares my frustration with omission of page numbers in scientific citations. He explains:
Before me is a book chapter I have written, to be published by a scientific press. When I turned it in to the publisher, I gave footnotes citing the pages at which every proposition upon which I relied could be found. I was informed that this is not good scientific notation. The publisher returned the chapter with instructions for me to go through each footnote and delete the offending references to the exact pages.
What is going on? A physicist friend says that some colleagues aim to outline their achievements while giving away as little as possible to competitors. A primatologist friend believes that scientific specialization is now so common that scientists write only for colleagues in their disciplines, who can be assumed to have read everything that the author has.
Henige (2006), pp. 103-04:
Particularly disconcerting is the disconnect between this unconcern with precision in citation and the extraordinary care taken to assure that submitted papers measure up in other ways.[14] [...] Assuming that referees are also deprived of this information, it raises the question of why they should be satisfied with this restricted capacity to check authors’ conclusions.
Rekdal (2014), p. 573:
Direct quotations are electronically much more easily searchable than paraphrased sections, and locators are therefore more crucial for the latter. Despite the emergence of tools such as full text databases and Google Books, we still need the page numbers, particularly for source material that does not appear in the form of a direct quotation.[9]
Citing Wikipedia
Some people love to hate Wikipedia, but as has often been shown, Wikipedia is generally very accurate. For articles where many editors have made detailed contributions, I would generally trust Wikipedia more than any other source, even a journal article in Nature. The reason is similar to Linus's Law in software development: "given enough eyeballs, all bugs are shallow". Academic peer review is very imperfect and can in many cases allow errors to slip in (see the next section). My own experience with peer review suggests that reviewers do relatively little checking of the details of one's paper. Plus, academic writings are static and can't be fixed, while Wikipedia can. And sometimes, academic authors give questionable information the veneer of authenticity by including it in a journal article.
That said, there are many Wikipedia articles that aren't particularly trustworthy because they've only been edited by one or two people or because they don't have citations. I tend to consider such articles similar to blog posts in how much I trust them.
If you're going to the trouble of consulting primary sources for information, then obviously doing that is superior to citing secondhand information on Wikipedia (unless the Wikipedia article corrects errors or provides other important context). However, in many cases, it's not efficient to read all the primary sources on a topic, and in these instances, citing Wikipedia makes sense.
Errors in published articles
de Lacey et al. (1985)
de Lacey et al. (1985) examined the accuracy of "quotations" (statements that had citations) and the citations themselves in six medical journals (p. 884). I think de Lacey et al. (1985) defined "quotations" as "All direct quotations of, indirect references to, or summaries of another author's work" (p. 884). de Lacey et al. (1985) discovered that "Of all references, 12% contained errors" that were either slightly or seriously misleading (p. 885).
This is a rather astonishing number, although in my opinion, these errors may not always be as bad as one might think. de Lacey et al. (1985) give some examples (pp. 884-85) of misleading errors. This example (p. 884) of a "slightly misleading" quotation doesn't seem atrocious to me:
a quotation that reducing weight by decreasing intake of energy lowered the blood pressure in most obese hypertensive subjects. The original source, however, studied the effect of a combined low energy and low salt diet on weight and blood pressure.
Apparently, this statement is misleading because of oversimplification. Likewise, de Lacey et al. (1985) explain (p. 885): "Misleading quotations were often due to oversimplification in summarising another author's figures."
The following example (p. 884) of a "seriously misleading" error also seems possibly excusable to me:
One correspondent said: "several studies have shown that the immediate memory span is intact," referring to patients with Korsakoff's syndrome. One of the two quoted sources was a paper on the psychological aspects of rehabilitation in cases of brain injury, with no mention of patients with Korsakoff's syndrome.
Since two sources were quoted here, maybe the first one was directly about the stated finding, while this other source was added as general background reading on memory problems, not intended to buttress the stated claim? (Sadly, standard academic citation methods don't readily allow for distinguishing what kind of information a given citation is supposed to provide.) That said, since the misleading article used the phrase "several studies", maybe it's implied that both of the sources cited should have supported the claim.
Some of the example errors that de Lacey et al. (1985) furnish do seem more serious to me. These findings made me realize I should be slightly more skeptical about any statement I read, even in top journals.
In a 1985 follow-up letter to de Lacey et al. (1985), S. R. Lowry found that in letters published by the BMJ, 12% of quotations were "inaccurate", and another 21% were "slightly inaccurate" (p. 1421). Lowry says (p. 1421): "The journal does not check everything, and as a result a third of direct quotations and 8% of references printed were inaccurate to some extent."
Making one's uncertainty explicit
In high school, I was taught not to use phrases like "I think" because such wording was self-evident: Saying "I think X" is equivalent to just saying "X". Formal writing often discourages indications of uncertainty or meta-level discussion about one's process, perhaps because doing so would signal weakness? Unfortunately, this means that when writing formal papers, you either have to drop interesting ideas/information that isn't "sufficiently rigorous", or you include the non-rigorous idea/information without indicating that it's not actually well established in the hopes that peer reviewers won't complain.
I think (see what I did there?) this formal writing style is suboptimal. For example, saying "I think X" is qualitatively different from saying "X". The former sentence tells the reader that this is your opinion or a guess that you're making, rather than an established fact or the opinion of some other entity. Likewise, hedging statements, used appropriately rather than just for politeness, can inform readers about how much weight to give a claim. For example, saying "I would intuitively guess that X" or "I haven't read much about this topic, but my impression is that X" is more useful than either (a) declaring that X is the case or (b) not saying anything about X because you're not certain.
Comments about one's own research process can serve similar functions—e.g., marking where you've only read a study's "Abstract" rather than the full text, in order to tell the reader that there's some risk that you're misinterpreting the study due to not having read the fine print. This is a useful compromise between not marking uncertainty vs. writing vastly slower due to having to check everything you say thoroughly first.
Picture-filled presentations
I'm bad at following two verbal trains simultaneously. As a result, I dislike presentations in which slides contain lots of text that I'm expected to read at the same time as the speaker is talking. It's commonly advised to use bullet points rather than full sentences in presentations, presumably for this reason. But I like to go a step further and include only a bare minimum of text in my presentation slides. I might include a text title on the slide and maybe some numbers or captions, but I prefer for most of my slide to be just one or several pictures illustrating what I'm talking about.
PDF vs. HTML
When I created my first website in 2006, it was initially a PDF file of essays. A friend advised me that HTML was more readable, so I converted to HTML format instead. Now I strongly prefer HTML, for several reasons:
- I make edits to my essays all the time. With HTML, I can just edit a single text file on my site, and the change is done. With PDF, I would have to generate and upload a whole new version of the piece, which would take more time.
- Suppose you have a PDF document that you created with Microsoft Word 15 years ago. Now you discover an error in it that you want to correct. You no longer have a copy of the original Word document, or maybe you do, but its formatting is weird when it's opened in a newer version of Word. In this case, it would require significant effort to create a new version of the PDF with the error corrected. In contrast, if the document had been HTML, the correction would have been extremely easy to make. Of course, a PDF advocate could point out that this scenario might not happen with other PDF-creation software that's more stable over time than Word is.
- The formatting is more flexible with HTML. For example, readers can make the font as big or small as they want. They can change the background color. And so on.
- With HTML, you can click links that navigate around within the essay, like in the table of contents or to view a bibliographic entry, and then when you click the "Back" button, you go back to where you were. In a PDF document, if you click an in-text link (e.g., to see a reference in the bibliography), when you click "Back", you navigate away from the whole PDF. At least, this is the behavior I see in Chrome on Windows.
- HTML allows for JavaScript calculations, interactive graphs, embedded videos, hover-over footnotes, etc.
- HTML text is easier to copy and paste all at once without page numbers, headers, etc. getting in the way. (This is useful for me when converting articles to audio format.)
- HTML can easily be converted to PDF using your browser's "Save as PDF" feature, but the reverse isn't true.
- HTML can handle equations with tools like MathJax. There are tools to export TeX to HTML. I assume there are tools that can mimic the behavior of BibTeX (maybe this?).
- With HTML pages, you automatically provide the source code for the web page. This is useful for readers who want to see how you implemented something on your website or copy some of your source HTML for their own website. It's also useful for you, because you can't possibly lose the original source code and leave yourself unable to make further edits to a page. In contrast, I think PDF files don't usually include their source code? For example, if you generate a PDF using LaTeX and put it on your website, you may have to upload the .tex source separately if you want readers to have access to it. If you lose the original .tex file, you may be out of luck if you want to edit the PDF in the future. Including LaTeX source code in a PDF can be done (TeX - LaTeX Stack Exchange "Is there some way ..."), but it's not automatic the way it is for HTML. (Pedants may complain about my use of the term "source code" to describe HTML markup given that HTML is not a programming language, but I think "source code" is the clearest phrase, and it seems to be commonly used to describe the raw HTML of a page.)
- With HTML, you can easily see the hidden information on your web page, because it's just in a text file. Decoding the hidden information in PDFs can require more effort.
- HTML is probably less intimidating for non-academic readers, since most websites are in HTML.
- My anecdotal impression is that PDFs may not rank as well on Google, but I haven't found any verification of this supposition on SEO sites, so it may be wrong.
The main benefits of PDF have to do with formatting standardization, document integrity, permanence, etc. For example, you don't have to worry about the formatting of images getting messed up by a change to your website's CSS styling or by changes to how browsers render web pages. PDF documents keep all relevant files, including figures and images, contained within one document, while with HTML, the images are usually stored separately, and if they get lost or moved, the links to the images in the HTML document will break. These are significant benefits of PDFs, but even from a permanence standpoint, I probably favor HTML because usually what's most important is the raw text of a document, and nothing beats text files (such as HTML files) from a data-preservation standpoint. In addition, the text in HTML documents is easier to manipulate, which makes forward migration of formats easier.
Because HTML documents are text files, they're less vulnerable to file-format rot than most other file types on account of their structural simplicity. Of course, the graphical rendering of HTML documents, especially those using fancy CSS and JavaScript, is arguably more breakable than the rendering of PDFs, which have their layout and styling baked in. But I care much less about stylistic presentation than about making sure the raw words are preserved, since the words of a document are generally the hardest thing to replace. (Format-rot considerations equally well favor, e.g., .tex files, which are also plain text.)
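As a rough illustration of how easy it is to get at the raw words of an HTML file, here's a minimal sketch that uses only Python's standard library. The names TextExtractor and extract_text and the sample markup are made up for this example, and a real migration script would need to handle many more cases (entities in odd places, alt text, and so on).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # pieces of text found so far
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the words of an HTML string as one space-separated line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, invented for this sketch.
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> words.</p></body></html>"
)
print(extract_text(sample))
```

Doing the equivalent for a PDF generally requires a third-party library, which is part of why I consider HTML the easier format to migrate.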
Another consideration is that PDFs are still (unfortunately) more common for academic articles, so they may superficially appear more professional for that reason. Also, PDFs have page numbers, which is helpful if people citing your piece refer to those page numbers (though as mentioned earlier, many authors sadly don't include page numbers in citations).
McBurnett (2008) contains a similar list of reasons in favor of HTML, though not all of it is up to date. The author concludes: "Good, standards-compliant HTML is almost always better for use on the web."
No-frills websites
As I've gotten older, I've had to keep more and more things in my life up to date, including my websites. I've also been thinking about the broader societal task of preserving important intellectual output for the long term. These considerations have made me increasingly appreciate the value of simplicity.
The fancy JavaScript animations and complicated styling that make your website look cool today may become a maintenance headache in five years, as technology changes or as you try to refactor your website. The more moving parts your website has, the more work will be required to migrate to a new content management system down the road.
I've come across aging websites where images no longer work or other site functionality is broken. Degradation problems are less likely if you minimize external dependencies and keep your website simple, using mainly vanilla HTML, minimal JavaScript, and so on.
If you use AJAX to build your web pages and load additional content on demand, this makes it more cumbersome for archivists or data miners to download your web pages.
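To make this concrete, here's a small sketch of what a simple archiver that doesn't execute JavaScript would see. Both pages, the "/api/essay" endpoint, and the names VisibleText and words_without_javascript are hypothetical, invented for the illustration.

```python
from html.parser import HTMLParser

# A page whose text is in the HTML itself.
STATIC_PAGE = """<html><body>
<h1>My essay</h1>
<p>The full text lives in the HTML itself.</p>
</body></html>"""

# A page that fetches its text on demand (hypothetical endpoint).
AJAX_PAGE = """<html><body>
<div id="content"></div>
<script>
fetch("/api/essay").then(r => r.text()).then(t => {
  document.getElementById("content").textContent = t;
});
</script>
</body></html>"""


class VisibleText(HTMLParser):
    """Collect the words present in the markup, ignoring script bodies."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script:
            self.words.extend(data.split())


def words_without_javascript(html):
    """Return the words a non-JavaScript crawler would find in the page."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words


print(len(words_without_javascript(STATIC_PAGE)))  # words found in the static page
print(len(words_without_javascript(AJAX_PAGE)))    # words found in the AJAX page
```

The static page yields all of its words, while the AJAX page yields none, because its content only exists after a browser runs the script. An archivist would need a headless browser or knowledge of the site's API to capture it.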
I was once asked why I write mathematical equations using plain text rather than using a LaTeX plugin for my website. Initially the reason was that I hadn't gotten around to exploring the relevant plugins, but my sentiment now (as of ~2018) is that I don't want to use a LaTeX plugin, for reasons of simplicity. Unless you're writing extremely large equations with big division symbols or something, I find that plain-text equations are almost as readable as LaTeX ones.
I miss the days when most websites were simple HTML, and I enjoy finding people who still have 1990s-style homepages. Unfortunately for me, the overall trend on the web is toward greater complexity—and with it, fragility (not to mention CPU demands).
Should you use a pseudonym?
Writing under your real name rather than a pseudonym has a number of benefits. For example:
- Readers can feel that they know you better if they can put a face to a name, follow you on Facebook, and so on. I suspect that this increases the sense of trust that readers have toward your writings. (Of course, you could use a fake picture and a fake biography, but this would worsen the loss of trust readers would feel if your identity were exposed.)
- You can share links to your blog on your own Facebook/Twitter profiles and can point friends and family to your work.
- You don't have the cognitive overhead of maintaining two identities and thinking about whether a given action will reveal your other identity.
- A pseudonym might leak eventually, especially if your friends know about your pseudonym and accidentally or deliberately spread the information about who you are.
- Your body of work is unified under a single name rather than scattered between two names. This is especially relevant for academics who need to worry about citation counts.
- If you're vain (like most of us are to some degree), you may feel more motivated to write under your real name because then "you" (the real you) get credit for your efforts.
Some benefits of writing under a pseudonym:
- You might feel more free to say what you really think without fear of ostracism by friends, family, and online trolls.
- You leave open the option to reveal your name later. In contrast, if you start using your real name, it's nearly impossible to scrub all of your mentions from the web (although scrubbing just a few mentions from the web may be enough depending on your situation).
- You leave open the option of working in high-ranking political or corporate positions, which would be difficult to do if you had a paper trail of controversial views on the web.
I wonder if society will gradually become more tolerant of people who wrote embarrassing or controversial things in their youth, because these days many young people have extensive paper trails online. But for now, if you aim for a top-ranking job in politics or corporate management, you probably need to play it safe. If you aim to become a US Senator or the CEO of Apple, it may be best to avoid writing even under a pseudonym, because your pseudonym could leak, especially if you're running a political race against another campaign that's doing opposition research on you. Accounts get hacked, and your computer could potentially be infected with spyware, although the risk of privacy breaches of this sort is probably low unless you're already famous.
For most jobs, even corporate jobs, I suspect that a web presence under your real name wouldn't be a huge problem, and perhaps some employers would see your blog as a sign of your productivity and creativity. (If I were hiring for an effective-altruist organization, I would be more likely to hire an applicant who had an extensive history of high-quality blogging.) When I was a software engineer at Microsoft, no one at the company ever expressed concern about my online writings, and many colleagues thought it was cool that I had a philosophical website. Talking philosophy was sometimes a social icebreaker. However, at more traditional and snooty workplaces, such as companies where you have to wear a suit and tie every day, I suspect that writing weird things online might carry some small cost, with the cost increasing if you aim to hold higher-level management jobs.
Acknowledgments
A discussion with Caspar Oesterheld improved my views on the question of PDFs vs. HTML. Denis Drescher inspired a point I made about footnote citations.
Footnotes
- The vector-space model of similarity is effectively normalized by the length of a document via the norm of the tf-idf vector in the denominator of the cosine formula. Likewise, the BM25 match score is sort of normalized for document length. If you strip out a lot of parameters, the "tf" fraction is something like (term frequency)/|D|, i.e., normalized for document length. Of course, the exact function is messy, and when this is one input to a complex ranking model, the effect of document length becomes messier still.
- I struck out "doing that" here because it's unclear if it refers to "using the same word/phrase twice in a row" or "avoiding using the same word/phrase twice in a row".
- I'm using "negative reinforcement" in a colloquial sense. In formal terminology, what I actually mean is "positively punished", since "negative reinforcement" technically means removal of a bad stimulus.