John Wilkin’s Orphan Books Analysis: 2.5 Million So Far

Posted on Thursday, February 17th, 2011 at 5:41 pm

John Wilkin is the Executive Director of HathiTrust, the consortium that many of Google’s partner libraries are using to store and index their digital copies of books scanned by Google. He’s written and posted a thoughtful attempt to use HathiTrust data to answer the perennial question of how many orphan books there are. Previous efforts have relied on data from WorldCat and Bowker. In addition to being a new and substantially independent dataset, the HathiTrust data is relatively clean and also, for obvious reasons, tracks the collections of major research libraries reasonably well.

The resulting study, Bibliographic Indeterminacy and the Scale of Problems and Opportunities of “Rights” in Digital Collection Building, takes HathiTrust’s information on publication years and US/non-US status and combines it with estimates from previous studies on public-domain status and the ability to find (or not) copyright owners. Some of the numbers are guesses, and Wilkin is quite open about it, but he also supplies reasons for the guesses he makes. His conclusion:

Our data spotlight the likely scope of the public domain and the probable large role of orphans in our bibliographic landscape. The following are some key findings of our preliminary analysis:

  1. The percentage of public domain books in the collective collection—not simply the current 5+ million books, but the collection as it expands—is unlikely to grow to more than 33% of the total number of books we will put online. Using the numbers assembled here, the percentage of public domain materials, not including government documents, will be 28%.
  2. The body of orphan works—works whose rights holders we cannot locate—is likely to be extremely large, and perhaps the largest body of materials. If the guesses made here are right, 50% of the volumes will be orphan works. This 50% breaks down as follows: 12.6% will come from the years 1923-1963, 13.6% from 1964-1977, and 23.8% from 1978 and years that follow. (The percentage of orphan works relative to all works decreases as time passes; the number of orphan works increases in more recent years because more works are published in later years.) Indeed, if this speculation is right, our incomplete collection today includes more than 2.5 million orphan works, of which more than 800,000 are US orphans.
  3. The likely size of the corpus of in-copyright publications for which we are able to identify a known rights holder will be roughly the same size as, or slightly smaller than, the body of public domain materials. Again, using these speculative numbers, they may comprise as little as 22% of the total number of books.

Even before we are finished digitizing our collections, the potential numbers are significant and surprising: more than 800,000 US orphans and nearly 2 million non-US orphans.
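The arithmetic behind the headline number is simple enough to check. A minimal sketch, treating Wilkin’s guessed per-period percentages and the “5+ million” collection size from the post as point estimates (the variable names are mine, for illustration):

```python
# Back-of-the-envelope check of the orphan-work estimate.
# Figures come from Wilkin's post: a ~5 million volume collection
# and guessed orphan shares by publication period.

total_volumes = 5_000_000  # "the current 5+ million books"

# Wilkin's estimated orphan share of the collection, by period
orphan_shares = {
    "1923-1963": 0.126,
    "1964-1977": 0.136,
    "1978 on":   0.238,
}

orphan_fraction = sum(orphan_shares.values())
orphan_volumes = total_volumes * orphan_fraction

print(f"orphan share of collection: {orphan_fraction:.0%}")
print(f"estimated orphan volumes:   {orphan_volumes:,.0f}")
```

The three period shares sum to 50%, and 50% of 5 million gives the 2.5 million orphans in the post’s title; a larger final collection would push the count proportionally higher.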

Worth reading for the charts alone.

This entry was posted in Analysis.