PrePubMed Help

Policies are here. PrePubMed was created because at the time there weren't any services indexing preprints (except Google Scholar, which has some search limitations and doesn't let you download its data). Now there are multiple services indexing preprints, most notably Europe PMC. Ideally, operating an indexing service wouldn't take any time or effort. However, because the preprint servers (except arXiv) don't have an API, a change to a server's HTML or an error in a post can break my indexing code, and given my current projects I don't foresee having the time to babysit the daily indexing. Although the data in PrePubMed is no longer up to date, it is still worth keeping the server up, since a lot of people come here for the error detection tools (and the database contains records for preprints which were secretly retracted).

How accurate is the data?

The various preprint servers were scraped on a daily basis (except arXiv, which has an API), and the data was first written to text files, then to an SQL database. Unfortunately, several things could go wrong in this process. In theory no preprints should be missed during scraping, so the text files should contain all of the preprints I intended to index, but I think I have noticed preprints which got missed somehow, and I am not sure why or how often that happened. Sometimes a preprint server made an error when posting a preprint. That error may not have caused a problem during scraping, but then raised an error when it was time to write to the database. In that case the preprint was not written to the database, and unfortunately neither were the other preprints from that server for that day which had not been written yet. The database could also be locked for some reason and throw an error, again resulting in missing preprints. For the database issues I can go back and add the preprints to the database since they are in the text files, and for preprints missing from the text files I can rescrape the entire server to get them, but I don't regularly rebuild the database or check for missing preprints.
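The database failure described above can be reproduced in a few lines. This is only a sketch: SQLite, the `preprint` table, and the `write_batch` helper are assumptions made for illustration, not PrePubMed's actual database or code. It shows how one malformed record can stop the rest of a day's batch from being written, and how skipping the bad record instead would keep the others.

```python
import sqlite3

def write_batch(conn, records, skip_bad=False):
    """Insert (title, server) records; return (rows written, failed indexes)."""
    written, failed = 0, []
    for i, record in enumerate(records):
        try:
            conn.execute(
                "INSERT INTO preprint (title, server) VALUES (?, ?)", record
            )
            written += 1
        except sqlite3.Error:
            failed.append(i)
            if not skip_bad:
                break  # the rest of the day's batch is never written
    conn.commit()
    return written, failed

def fresh_db():
    """Hypothetical schema: a title is required, mimicking a strict column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE preprint (title TEXT NOT NULL, server TEXT)")
    return conn

# The second record has no title, as if the server posted it incorrectly.
day = [("Paper A", "bioRxiv"), (None, "bioRxiv"), ("Paper C", "bioRxiv")]
stopped = write_batch(fresh_db(), day)                  # stops at the bad record
tolerant = write_batch(fresh_db(), day, skip_bad=True)  # skips it and continues
```

In the first call "Paper C" is lost even though nothing is wrong with it, which is the kind of gap described above.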

As a result, I can't guarantee that PrePubMed will have all of the preprints you are looking for. If you are trying to do a comprehensive search, I would recommend using multiple search engines. I assume Google Scholar doesn't miss any preprints, but if you sort by date you will only get preprints it indexed in the past year. Europe PMC now indexes preprints, but it seems to get its data from Crossref, and I've noticed some issues with that data. Scilit is another option for finding preprints, though I'm not sure how it gets its data.

A note on other servers such as Figshare and OSF Preprints

Most uploads on Figshare with the label "Paper" are not actually preprints and as a result it is difficult to accurately index preprints from Figshare. I previously attempted indexing Figshare, and you can see that data here, but I no longer index Figshare.

OSF Preprints, along with its offshoots PsyArXiv and SocArXiv, primarily contains social science preprints. I know this because I individually looked through all of the OSF preprints. PrePubMed was created for biology preprints because there wasn't a central hub for these types of preprints. Math and physics preprints have arXiv, and social science preprints now have the OSF.

A note on preprints.org

This is a new preprint server that aims to host preprints from a wide range of disciplines. Currently I have decided to index the Biology, Life Sciences, and Medicine & Pharmacology subjects. This server also hosts a variety of different types of articles: Article, Review, Conference Paper, Data Descriptor, Brief Report, Case Report, Communication, Short Note, Technical Note, Hypothesis. It is unclear to me what the quality of some of these types of articles will be, so I am currently only indexing the preprints labeled "Article" or "Review".

A note on The Winnower

After much thought I have decided to index all articles on The Winnower with the "paper" designation. Articles on The Winnower are not preprints and are closer to blog posts. Although PrePubMed is meant for articles that will eventually be indexed by PubMed, I support nontraditional forms of communicating work. I normally only index biology-related articles, for example only the q-bio section of arXiv and only certain categories of Figshare, but to support The Winnower and its mission I have indexed all of their categories, including Reddit AMAs.

How will this affect your searches? It likely won't. The Winnower does not have abstracts for its articles, so your search terms will only be searching against titles from The Winnower. As a result, it is unlikely you will be seeing blog posts show up in your RSS feeds instead of preprints. And if you do happen to get a blog post from The Winnower, because your search matched the title it might be something you want to check out.

Some terminology and overview

I will be using the word "phrase" when it comes to queries enclosed by double quotes. To count as a phrase, the quoted text must contain at least one space. For example, "highly significant" would be a phrase, "highly-significant" would not. Double quotes have no effect when you use them on a query without a space separator.
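The definition above can be written as a small check. This is a sketch of the rule as stated, not PrePubMed's code, and `is_phrase` is a hypothetical name.

```python
def is_phrase(query):
    """True if the query is wrapped in double quotes and contains a space."""
    q = query.strip()
    if len(q) < 2 or not (q.startswith('"') and q.endswith('"')):
        return False
    # A space must appear between the quotes for the text to be a phrase.
    return " " in q[1:-1].strip()

examples = ['"highly significant"', '"highly-significant"', 'highly significant']
results = [is_phrase(q) for q in examples]
```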

I attempted to make searching PrePubMed similar to PubMed. First, your query is broken up into substrings based on whether or not you have any quoted phrases. Punctuation is removed from the unquoted substrings by converting it to spaces, and then I check whether the unquoted substrings contain any author names in the database. Any terms which do not match an author are then screened against PubMed's stopwords. If they pass this screen, they will be searched against article Titles/Abstracts, along with any identified author names, using AND logic. Phrases in double quotes will not have punctuation or stopwords removed and will be searched against Titles/Abstracts. I decided against allowing search tags such as [au] since your query is automatically checked for authors, and PrePubMed only indexes a small amount of information for each article.
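The steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `parse_query` name, the tiny `STOPWORDS` and `AUTHORS` sets (stand-ins for PubMed's stopword list and the author index), and the simple "last name plus initials" lookup. The real matching is certainly more involved.

```python
import re
import string

# ASCII punctuation except the hyphen, apostrophe, and square brackets,
# which do not appear in the list under "What punctuation gets removed?".
REMOVED = "".join(c for c in string.punctuation if c not in "'-[]")
TO_SPACES = str.maketrans(REMOVED, " " * len(REMOVED))

STOPWORDS = {"a", "and", "in", "of", "the"}  # hypothetical subset
AUTHORS = {"wong js"}                        # hypothetical author index

def parse_query(query):
    """Split a query into quoted phrases, author names, and other terms."""
    phrases, authors, terms = [], [], []
    # With a capture group, re.split leaves quoted text at the odd indexes.
    for i, chunk in enumerate(re.split(r'"([^"]*)"', query)):
        if i % 2 == 1 and " " in chunk.strip():
            phrases.append(chunk)  # kept verbatim, stopwords included
            continue
        # Unquoted text (or quotes around a single term): punctuation to spaces.
        words = chunk.translate(TO_SPACES).lower().split()
        j = 0
        while j < len(words):
            pair = " ".join(words[j:j + 2])
            if pair in AUTHORS:              # authors are checked before stopwords
                authors.append(pair)
                j += 2
            else:
                if words[j] not in STOPWORDS:
                    terms.append(words[j])
                j += 1
    return {"phrases": phrases, "authors": authors, "terms": terms}

parsed = parse_query('Wong JS. "highly significant" expression of the gene')
```

Here `parsed` holds one phrase, one author, and the terms `expression` and `gene`; all of them would then be combined with AND logic.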

How to search for authors

You can search for authors almost exactly like in PubMed. Note that, unlike in PubMed, author names get preference over stopwords. So if someone has the last name "The", you can type in the name without any problems.

One way to search for authors is to enter the last name followed by a space and up to two initials. Trailing commas and periods do not matter since they are converted to spaces and ultimately removed. Internal punctuation other than hyphens will cause your term to be broken up. Suffixes such as Jr are not allowed. If you are searching for an author, do not put the name plus initials in quotes, as it will then only be searched against the Titles/Abstracts.

You can also perform a Full Name search exactly like in PubMed. For example, Julia s Wong and Wong Julia s will both work (unless someone has the last name Julia and first name s). Because author names are not indexed manually, it is impossible to distinguish a multi-part first or last name from a middle name.

For example, if someone's name is
first name: Ricky Bobby
last name: Ferrell

Then PrePubMed will index that as
first: Ricky
middle: Bobby
last: Ferrell

As a result, when searching a complicated name you should search using the very first first name and very last last name.
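The example above amounts to splitting a name on whitespace and treating everything between the first and last token as the middle name. A minimal sketch of that behavior, with a hypothetical `index_name` helper (not PrePubMed's code):

```python
def index_name(full_name):
    """Split a name on whitespace into first, middle, and last parts."""
    parts = full_name.split()
    if len(parts) < 2:
        # A single token is treated as a last name in this sketch.
        return {"first": "", "middle": "", "last": "".join(parts)}
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),  # everything between first and last
        "last": parts[-1],
    }

indexed = index_name("Ricky Bobby Ferrell")
```

Because "Bobby" ends up in the middle slot, a search built from "Ricky" and "Ferrell" is the one that lines up with the indexed record.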

How to search by journal

You can't. PrePubMed is journal agnostic and I believe that where your article is published should not impact viewership. However, there do appear to be differences in the quality of the preprint servers with regard to indexing information such as author names or ensuring that an article isn't duplicated. I want the information in PrePubMed to approach the accuracy of PubMed and will be contacting the preprint servers to work towards this goal. If one preprint server is clearly the best I will consider endorsing its use.

How to search by subject area

You kind of can. When you perform a search I provide the list of the subject areas associated with each article, and you can click on them to perform a search for that exact subject area. The problem is that there is not a consistent subject area system among the seven journals that PrePubMed indexes. As a result, clicking on a tag for Bioinformatics may not return all articles related to bioinformatics. Because of this, I do not provide the ability to perform a custom search with subject areas.

How to search by affiliation

You can search for affiliation with the advanced search option. Note that Figshare does not list affiliations for authors, so a search for an affiliation will not return any Figshare preprints. See "Using advanced search" for more details.

What do I do about duplicated or questionable articles?

Nothing. It is the responsibility of the preprint server to not publish duplicated articles (they should be different versions of the same article instead). Also, how am I supposed to know which version of the article the authors want indexed in PrePubMed? I also do not believe I should have the authority to prevent articles from being indexed. If your article passes the screening process at the preprint servers it will be indexed (even if it uses the word God). It is the job of post-publication peer review to determine whether or not your article is useful, and PrePubMed facilitates that process by providing a means to find your article.

How to search by dates

You can't. I do index dates, and articles are sorted by date when you perform a search. However, the problem with dates is that they change for preprints. When someone submits a revised version of an article, it then gets a more recent date. As a result, an article can be originally published years ago but have a recent date, which is misleading. Once an article is indexed by PrePubMed, if a revision to the article is posted, I do not update the date in PrePubMed. If I did, then someone could submit a minimally revised article to get it to appear at the top of the search results, which is what currently happens at the preprint servers PrePubMed indexes.

What punctuation gets removed?

All of these will be converted to spaces if you do not enclose your phrase in double quotes:
! # " $ % & ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~
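As a sketch, the conversion can be expressed with a translation table built from the characters listed above. The names `REMOVED` and `strip_punctuation` are hypothetical, and collapsing the resulting runs of spaces is an assumption. The last line also works out which ASCII punctuation marks are not in the list and therefore survive.

```python
import string

# The characters from the list above, copied as-is.
REMOVED = '!#"$%&()*+,./:;<=>?@\\^_`{|}~'
TO_SPACES = str.maketrans(REMOVED, " " * len(REMOVED))

def strip_punctuation(text):
    """Convert listed punctuation to spaces, then collapse extra spaces."""
    return " ".join(text.translate(TO_SPACES).split())

cleaned = strip_punctuation("p<0.05 (two-sided)")

# ASCII punctuation that is NOT in the list and is therefore left alone.
kept = sorted(set(string.punctuation) - set(REMOVED))
```

Here `kept` contains the apostrophe, the hyphen, and the two square brackets, which is consistent with hyphenated names staying in one piece.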

Using advanced search

Advanced search allows you to specify exactly what you want searched, which may not be possible with the default search since it automatically identifies authors and removes whitespace, stopwords, and punctuation.

Anything you enter will be treated as if it were a quoted phrase, even single words. For example, this search:
Abstract Query 1: highly significant
will search for the exact phrase "highly significant"

If you want an abstract that contains both words but don't care about the order, then you need to perform this search:
Abstract Query 1: highly
Abstract Query 2: significant
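The difference between the two searches can be sketched as below. This assumes each query is matched as an exact, case-insensitive substring of the abstract; the matching rules PrePubMed actually uses (case handling, word boundaries) are not stated here, and `abstract_matches` is a hypothetical name.

```python
def abstract_matches(abstract, queries):
    """True if every query occurs in the abstract as an exact substring."""
    text = abstract.lower()
    # AND logic: all queries must be found, in any order.
    return all(q.lower() in text for q in queries)

abstract = "The effect was significant and highly reproducible."
one_field = abstract_matches(abstract, ["highly significant"])
two_fields = abstract_matches(abstract, ["highly", "significant"])
```

The single-field search misses this abstract because the two words are not adjacent, while the two-field search finds it.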

Advanced Search is the only way that you can search for affiliation. Advanced search is also the only method for searching for a non-ASCII character. For example, you can search for β with Advanced Search, but not with the default search.

You should keep in mind that all Author names are stored with only ASCII characters, so if you search for a name like Łaszcz, you will not receive any results. You must write the name as Laszcz.
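If you are generating author queries from a list of names, one way to produce an ASCII form first is sketched below. How PrePubMed itself transliterates names is not documented beyond the Łaszcz example, so treat the `to_ascii` helper and its small `SPECIAL` table as assumptions and check the results for unusual names.

```python
import unicodedata

# Letters such as Ł have no Unicode decomposition, so they need an explicit
# mapping. This table is a small illustrative subset, not a complete list.
SPECIAL = str.maketrans({"Ł": "L", "ł": "l", "Ø": "O", "ø": "o", "ß": "ss"})

def to_ascii(name):
    """Return an ASCII-only version of a name."""
    # NFKD splits accented letters into a base letter plus combining marks;
    # encoding to ASCII with errors ignored then drops the marks.
    decomposed = unicodedata.normalize("NFKD", name.translate(SPECIAL))
    return decomposed.encode("ascii", "ignore").decode("ascii")

ascii_name = to_ascii("Łaszcz")
```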

Titles, Abstracts, and Affiliations are stored with UTF-8 encoding, so you should be able to search for non-ASCII characters in those search fields.

Note that all fields in Advanced Search will be searched together using AND logic. Also note that when using advanced search there will be no information in the "search details" box since you know what the search details are (or at least you should since you just typed them in).