Pepys’ Diary: Exported Data README

v1.1, 2012-05-29.

The Diary of Samuel Pepys features all the entries written by the 17th-century London diarist, with accompanying background information and annotations by readers.

A zip file of much of the data in JSON format, including this README, can always be found at: http://www.pepysdiary.com/export/json/pepysdiary_json.zip (around 20MB). The files include 3,434 Diary Entries, nearly 4,800 Encyclopedia Topics, over 60,000 Annotations, and more than 350 thumbnail portraits of people mentioned.

If you have any suggestions for improvements or additions, please do drop me, Phil Gyford, a line: phil@gyford.com.

The Data

Each of the JSON files contains two top-level elements, meta and data. So each file's basic structure is like this:

{
    "meta":
    {
        "generated":"2011-01-16T15:33:32+00:00",
        "type":"..."
    },
    "data":
    [
        {
            ...
        },
        {
            ...
        }
    ]
}

The generated field is the time at which this file was generated, in ISO 8601 format.

The type field is one of diary, topics, categories or annotations.

The yearly Diary files also contain a year field which contains the year this file is for (ie, the file diary/1660_entries.json has a year field with a value of 1660).

The data element contains an array of objects. Each object is a single entity of a particular type, ie, a single Diary entry, a single Category or a single Topic. We'll look at the data contained in each of these objects.
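
As a rough illustration of the structure described above, here is a minimal Python sketch that parses one file's contents and checks the meta element. The function name parse_export and the inline sample are made up for this example; the sample is a cut-down stand-in for a yearly Diary file, and it assumes the year field is stored as a number.

```python
import json
from datetime import datetime

# The four values the README lists for the "type" field.
VALID_TYPES = {"diary", "topics", "categories", "annotations"}


def parse_export(raw_json):
    """Parse the text of one exported file and return (meta, items)."""
    doc = json.loads(raw_json)
    meta, items = doc["meta"], doc["data"]
    if meta["type"] not in VALID_TYPES:
        raise ValueError("unexpected type: %r" % (meta["type"],))
    return meta, items


# Cut-down stand-in for a file such as diary/1660_entries.json.
# Assumption: "year" is a JSON number, not a string.
sample = """
{
    "meta": {"generated": "2011-01-16T15:33:32+00:00", "type": "diary", "year": 1660},
    "data": [{"id": 87, "date": "1660-01-01"}]
}
"""

meta, items = parse_export(sample)
# "generated" is ISO 8601, so the standard library can parse it (Python 3.7+).
generated = datetime.fromisoformat(meta["generated"])
print(meta["type"], meta.get("year"), generated.year, len(items))
```

To use it on a real file, read the file's text (as UTF-8) and pass it to parse_export.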

Every field is present in every data item, including those marked "optional". An optional field with no value holds an empty string (if the field is a string) or null (if the field is a number).
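
Since an empty optional field can be either an empty string or null, it may be convenient to treat both the same way. A small sketch, assuming the item has already been parsed into a Python dict; the helper name field_or_none is hypothetical, and the sample item is a subset of the Jemima Carteret Topic shown later in this document.

```python
def field_or_none(item, name):
    """Return the field's value, or None if it is an empty optional field."""
    value = item[name]  # every field is always present, so no KeyError expected
    if value is None or value == "":
        return None
    return value


# A subset of the fields from the person example further down.
topic = {
    "id": 114,
    "excerpt": "Daughter of Lord Sandwich, married Philip Carteret in 1665.",
    "latitude": None,   # JSON null becomes None
    "shape": "",        # empty optional string
}

print(field_or_none(topic, "excerpt"))
print(field_or_none(topic, "latitude"), field_or_none(topic, "shape"))
```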

Diary

An example of one of the objects from the Diary JSON files' data element:

{
    "id": 87,
    "title": "Sunday 1 January 1659\/60",
    "date": "1660-01-01",
    "permalink": "http:\/\/www.pepysdiary.com\/archive\/1660\/01\/01\/",
    "comment_count": 52,
    "text": "<p>Blessed be God, at the end of the last year I was in very good health, without any sense of my old pain, but upon taking of cold.<sup>1<\/sup> I lived in <a href=\"http:\/\/www.pepysdiary.com\/p\/6919.php\">Axe Yard<\/a> having ... to our own home.<\/p>",
    "footnotes": "<p>The year did not legally begin ... Own TIme, book i.<\/li>\n<\/ol>"
}

Here's a description of each field:

id
(number) A unique identifier for this Diary entry. Non-consecutive, but unique among both Diary entries and Encyclopedia Topics.
title
(string) The title of the Diary entry, ie, the date in text format. Note that dates early in the year have a year of the form "1659/60" because the year didn't officially start until 25th March in England at the time. A date marked "1659/60" is treated as being in 1660 for the purposes of other fields, such as...
date
(string) The date of this Diary entry, in year-month-day format.
permalink
(string) The URL of this day's entry at pepysdiary.com.
comment_count
(number) The number of comments/annotations posted by users on this Diary entry on the website.
text
(string) The text of the Diary entry in HTML. Paragraphs are included. Each entry contains links to relevant pages in the Encyclopedia of the form http://www.pepysdiary.com/p/6919.php where the 6919 is the id of the Encyclopedia Topic (see below). Note that occasionally there are also links to other pages in the diary, of the same form as the permalink field. The only other HTML included is:
  • <sup>1</sup> or (in later entries) <sup id="fnr1-1665-01-16"><a href="#fn1-1665-01-16">1</a></sup> which point/link to footnotes (see the next field).
  • <i>l.</i>, italic tags around occurrences of l. s. d. (markers for pounds, shillings and pence).
footnotes
(string, optional) Footnotes for this entry, if any. Usually these are an ordered list (<ol>) of footnotes, but occasionally they have paragraphs as well or instead. Many have <a> links within them, often to Encyclopedia Topics. Some footnotes include <p>, and maybe <blockquote>, tags within the <li> tags.
Later entries may have footnotes with HTML ids and links back to the text, eg:
<li id="fn1-1665-01-16">Among the State Papers ...  1664-65, p. 122) <a href="#fnr1-1665-01-16">&#8617;</a></li>
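
Because links to Encyclopedia Topics in the text field follow a fixed URL pattern, the Topic ids an entry refers to can be pulled out with a regular expression. This is only a sketch: the function name is hypothetical, it assumes the JSON has already been decoded (so the escaped slashes are plain slashes), and it assumes every Topic link is written with exactly this href form.

```python
import re

# Matches links of the form http://www.pepysdiary.com/p/6919.php
TOPIC_LINK = re.compile(r'href="http://www\.pepysdiary\.com/p/(\d+)\.php"')


def linked_topic_ids(html):
    """Return the Topic ids linked from some entry HTML, in order, without repeats."""
    ids = [int(match) for match in TOPIC_LINK.findall(html)]
    return list(dict.fromkeys(ids))  # drops duplicates, keeps first-seen order


# Shortened stand-in for an entry's decoded "text" field.
html = (
    '<p>I lived in <a href="http://www.pepysdiary.com/p/6919.php">Axe Yard</a> '
    'with <a href="http://www.pepysdiary.com/p/112.php">my Lord</a>, '
    'in <a href="http://www.pepysdiary.com/p/6919.php">Axe Yard</a>.</p>'
)

print(linked_topic_ids(html))
```

The same function should work on the footnotes field, since its links to Topics use the same form.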

Encyclopedia Categories

An example of one of the Categories from the encyclopedia/categories.json file's data element:

{
    "id": 10,
    "title": "Food",
    "parent_id": 173
}
id
(number) The unique identifier for this Category. Non-consecutive and only unique within Categories.
title
(string) The name of the Category.
parent_id
(number) The unique id of the parent of this Category. If the parent_id is 0 this is a top-level Category with no parent. Using this it should be possible to reconstruct the hierarchy of the Encyclopedia.
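
One way the hierarchy might be reconstructed from parent_id is sketched below. The function names are made up for this example, and the title of Category 173 is a placeholder, because only the Food example above comes from the real data.

```python
def index_by_id(categories):
    """Map each Category id to its object."""
    return {cat["id"]: cat for cat in categories}


def children_by_parent(categories):
    """Map each parent_id to the ids of its direct children (0 = top level)."""
    children = {}
    for cat in categories:
        children.setdefault(cat["parent_id"], []).append(cat["id"])
    return children


def path_from_root(cat_id, by_id):
    """Return Category titles from the top-level ancestor down to cat_id."""
    path, seen = [], set()
    # A parent_id of 0 marks a top-level Category; 'seen' guards against loops.
    while cat_id != 0 and cat_id not in seen:
        seen.add(cat_id)
        cat = by_id[cat_id]
        path.append(cat["title"])
        cat_id = cat["parent_id"]
    return path[::-1]


categories = [
    {"id": 10, "title": "Food", "parent_id": 173},
    # Placeholder title: the real name of Category 173 is not shown in this README.
    {"id": 173, "title": "(hypothetical parent)", "parent_id": 0},
]

by_id = index_by_id(categories)
print(children_by_parent(categories))
print(path_from_root(10, by_id))
```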

Encyclopedia Topics

Two examples of Topics from the encyclopedia/topics.json file's data element. Some of the fields only apply to certain kinds of Topic (although they are all always present). First, a person:

{
    "id": 114,
    "title": "Jemima Carteret (b. Mountagu, \"Mrs\/Lady Jem\")",
    "title_sort": "Carteret, Jemima (b. Mountagu, \"Mrs\/Lady Jem\")",
    "excerpt": "Daughter of Lord Sandwich, married Philip Carteret in 1665.",
    "text": "<p>Daughter of <a href=\"http:\/\/www.pepysdiary.com\/p\/112.php\">Lord Sandwich<\/a> ... <\/p>\n",
    "text_wheatley": "<p>Mrs. Jemimah, or Mrs. Jem, ... <\/p>\n",
    "published_date": "2002-12-27",
    "ping_count": 113,
    "comment_count": 5,
    "categories": [
        {
            "id": 2,
            "primary": true
        }
    ],
    "text_author": "Phil Gyford",
    "latitude": null,
    "longitude": null,
    "zoom": null,
    "shape": "",
    "map_category": "",
    "thumbnail_image": false,
    "wikipedia_page": ""
}

The second example is of a location:

{
    "id": 230,
    "title": "New Palace Yard",
    "title_sort": "New Palace Yard",
    "excerpt": "To the northwest of the Houses of Parliament ... ",
    "text": "",
    "text_wheatley": "",
    "published_date": "2003-01-28",
    "ping_count": 11,
    "comment_count": 5,
    "categories": [
        {
            "id": 28,
            "primary": true
        }
    ],
    "text_author": "Phil Gyford",
    "latitude": 51.500585069288,
    "longitude": -0.125532746315,
    "zoom": 15,
    "shape": "51.500856,-0.126257;51.500819,-0.124782;51.500188,-0.124916;51.500214,-0.12513;51.500234,-0.125157;51.500248,-0.125281;51.500441,-0.126064;51.500538,-0.126294;51.500638,-0.126498;51.500859,-0.126665;51.500856,-0.126257",
    "map_category": "road",
    "thumbnail_image": false,
    "wikipedia_page": "New_Palace_Yard"
}
id
(number) The unique id of this Topic, as used in links from the Diary entries. These are non-consecutive but each is unique within both Encyclopedia Topics and Diary entries.
title
(string) The name of this Topic.
title_sort
(string) The name of the Topic in a form more suitable for sorting a list of Topics alphabetically. For many Topics title_sort will be the same as title. But for the names of people, title_sort will have their surname listed first, as in the example above: Carteret, Jemima (b. Mountagu, "Mrs/Lady Jem").
excerpt
(string, optional) A brief piece of plain text (no HTML) summarising the Topic. These are the texts in the pop-up boxes you see if you visit the website and mouse over a hyperlink within one of the Diary entries.
text
(string, optional) Some HTML text describing the Topic. If present, this can vary from a few words to a long essay.
text_wheatley
(string, optional) Some HTML text describing the Topic, taken from the footnotes of the 1893 edition of the Diary, written by Henry Wheatley.
published_date
(string) The date on which this Topic was first published on the website. As with Topic ids, these are broadly in the order in which the Topics appear in the diary, but shouldn't be relied on to be so.
ping_count
(number) The number of times this Topic has been linked to from the Diary. I'm not 100% sure of the accuracy of this, but it should be broadly correct.
comment_count
(number) The number of comments/annotations written by users about this Topic on the website.
categories
(array of objects) Each categories object has id (number) and primary (boolean) fields. The id field corresponds to the ids of Categories in the encyclopedia/categories.json file. The primary field indicates whether this is the primary Category for this Topic (although this has little real meaning/use). Each Topic should have at least one Category. I don't think any have more than two.
text_author
(string, optional) If there is anything written in the text field, this is the name of its author. If you display the text anywhere, please also display the name of its author.
latitude
(number, optional) Some of the Topics which are locations have latitude and longitude positions. If the location also has a shape (see below), the latitude and longitude indicate a roughly central point which you could, for example, center a map of the shape on.
longitude
(number, optional) See latitude, above.
zoom
(number, optional) If the Topic has latitude and longitude then the zoom field is set to a suitable value to use with Google Maps as an initial zoom level. eg, a map of a building in London will be zoomed in further than one of a city in another country.
shape
(string, optional) Some Topics which are locations describe an area or road. Some of these have shape data, which consist of a series of latitude and longitude points describing either a shape outline or a line on a map. Lat/lon are separated with commas, and each pair of points is separated by a semicolon. If the first and last pairs of points are identical, this is a closed outline (eg, a town square or area of a city); otherwise it is a line (eg, a road).
map_category
(string, optional) Some Topics which are locations have been assigned a map_category, which is unrelated to the overall Category hierarchy. This describes a small set of categories which locations can be divided into, for displaying separately on maps. The current possible values are:
area
A broad area within London, eg Covent Garden or St James's Park.
gate
One of the gates into and out of the old City of London, eg Temple Bar or Newgate.
home
One of the buildings in which Pepys lived.
misc
Something that doesn't fit into one of the other categories.
road
A road, street or square in London, eg Leadenhall Street or Spital Square.
stair
One of the landing stairs or docks on the banks of the River Thames, eg Tower Dock or Whitefriars Stairs.
town/village
A settlement outside of London, eg Marylebone. Note that what counted as "London" was a lot smaller in the 17th century.
Note that locations, shapes and these categories are a work in progress and are far from complete.
thumbnail_image
(boolean) Does this Topic have a thumbnail image included (see below)? This is only ever true for (some) people, not for Topics in any other category.
wikipedia_page
(string, optional) If this Topic has a relevant page on the English-language version of Wikipedia, the unique part of the page's URL is included here. eg, if the value of wikipedia_page is Church_of_St._Margaret%2C_Westminster the Wikipedia page is at http://en.wikipedia.org/wiki/Church_of_St._Margaret%2C_Westminster.
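
The shape and wikipedia_page fields described above both need a little processing before use. The following sketch shows one way to do it; the function names are hypothetical, and the shape string is the one from the New Palace Yard example.

```python
def parse_shape(shape):
    """Turn a shape string into ([(lat, lon), ...], is_closed)."""
    if not shape:  # empty optional field
        return [], False
    points = []
    for pair in shape.split(";"):   # pairs are separated by semicolons
        lat, lon = pair.split(",")  # lat/lon are separated by a comma
        points.append((float(lat), float(lon)))
    # Identical first and last points mean a closed outline, otherwise a line.
    is_closed = len(points) > 1 and points[0] == points[-1]
    return points, is_closed


def wikipedia_url(wikipedia_page):
    """Build the full Wikipedia URL, or return None if the field is empty."""
    if not wikipedia_page:
        return None
    return "http://en.wikipedia.org/wiki/" + wikipedia_page


shape = (
    "51.500856,-0.126257;51.500819,-0.124782;51.500188,-0.124916;"
    "51.500214,-0.12513;51.500234,-0.125157;51.500248,-0.125281;"
    "51.500441,-0.126064;51.500538,-0.126294;51.500638,-0.126498;"
    "51.500859,-0.126665;51.500856,-0.126257"
)

points, is_closed = parse_shape(shape)
print(len(points), is_closed)
print(wikipedia_url("New_Palace_Yard"))
```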

Encyclopedia Thumbnails

Also included is a directory of several hundred JPEG images, each one a small portrait of a person who has a Topic in the Encyclopedia. Each image is named like 112.jpg, where 112 corresponds to the id of the Topic. Every Topic that has a thumbnail_image value of true should have a corresponding thumbnail file. All the images are 100 x 120 pixels in size, and are taken from images on Wikipedia.
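
A short sketch of mapping a Topic to its thumbnail file. The directory name passed in below is a placeholder, since this README does not name the thumbnail directory; substitute wherever the images sit in your copy of the export. The function name is also made up.

```python
from pathlib import PurePosixPath


def thumbnail_path(topic, thumbnail_dir):
    """Return the expected image path for a Topic, or None if it has no thumbnail."""
    if not topic["thumbnail_image"]:
        return None
    # Images are named after the Topic id, eg 112.jpg
    return PurePosixPath(thumbnail_dir) / ("%d.jpg" % topic["id"])


# "thumbnails" is a placeholder directory name, not taken from the export.
print(thumbnail_path({"id": 112, "thumbnail_image": True}, "thumbnails"))
print(thumbnail_path({"id": 230, "thumbnail_image": False}, "thumbnails"))
```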

Annotations

Annotations (comments) are included for every Diary entry and every Encyclopedia Topic (if any). The meta element in each Annotation JSON file is either like this:

"meta":
{
    "generated": "2011-01-16T15:33:32+00:00",
    "type": "annotations",
    "source": 1661
}

Which shows these Annotations are for all the Diary entries in 1661, or like this:

"meta":
{
    "generated": "2011-01-16T15:33:32+00:00",
    "type": "annotations",
    "source": "encyclopedia"
}

Which shows these Annotations are from all the Encyclopedia Topics.

The Diary Annotations are split into a different file per year simply because there are so many of them that processing the whole lot in one JSON file can cause problems.

Here's an example of an Annotation from the encyclopedia/annotations.json file's data element:

{
    "id": 227,
    "author_id": 11,
    "author_name": "language hat",
    "author_url": "http:\/\/www.languagehat.blogspot.com\/",
    "text": "Axe Yard was actually on the site of the later Fludyer St., just south of Downing St.; see the annotations to:\r\n<a href=\"http:\/\/www.pepysdiary.com\/p\/102.php\">http:\/\/www.pepysdiary.com\/p\/102.php<\/a>",
    "created_on": "2003-01-08T03:41:14+00:00",
    "topic_id": 106
}

The only difference for an Annotation in one of the Diary years is that the topic_id is replaced with an entry_id.

A description of each field:

id
(number) A unique ID for this Annotation. This is non-consecutive and is unique across all Annotations, whether on Diary entries or Encyclopedia Topics.
author_id
(number) A unique ID for the author of this Annotation. This ID is unique across all Annotations -- Diary and Encyclopedia.
When posting an Annotation, each author was required to supply an email address. When generating these JSON files, each different email address was assumed to indicate a different person, and unique IDs were created for each one. This is not foolproof — authors may have used different email addresses across the life of the site — but it's as good as we'll get. The purpose of the ID is also to avoid distributing the authors' email addresses.
Also note that this ID is not guaranteed to be consistent across subsequent dumps of this data.
author_name
(string) The Annotation author's name.
author_url
(string) This field is always present but is often an empty string. If present it (probably) indicates the author's homepage etc. Even if present, don't assume it is a valid URL.
text
(string) The text of the Annotation, written by its author. May contain HTML <a href...> tags, but no other HTML.
created_on
(string) The date and time this Annotation was posted on, in ISO 8601 format.
topic_id or entry_id
(number) Only one of these is present. The topic_id maps to the id of items in the Encyclopedia Topics JSON file. The entry_id maps to the id of entries in the Diary JSON files.
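
To attach Annotations to the things they comment on, the source value in meta can be used to decide which of topic_id or entry_id to read. A minimal sketch, with a hypothetical function name; the second Annotation in the sample is invented purely to show the sorting.

```python
from collections import defaultdict
from datetime import datetime


def group_annotations(meta, annotations):
    """Group Annotations by the Topic or Diary entry they belong to, oldest first."""
    # Encyclopedia files use topic_id; the yearly Diary files use entry_id.
    key = "topic_id" if meta["source"] == "encyclopedia" else "entry_id"
    grouped = defaultdict(list)
    for annotation in annotations:
        grouped[annotation[key]].append(annotation)
    for items in grouped.values():
        # created_on is ISO 8601, so it parses directly (Python 3.7+).
        items.sort(key=lambda a: datetime.fromisoformat(a["created_on"]))
    return grouped


meta = {"type": "annotations", "source": "encyclopedia"}
annotations = [
    {"id": 227, "created_on": "2003-01-08T03:41:14+00:00", "topic_id": 106},
    # Invented Annotation, only here to demonstrate ordering.
    {"id": 9001, "created_on": "2003-01-07T12:00:00+00:00", "topic_id": 106},
]

grouped = group_annotations(meta, annotations)
print([a["id"] for a in grouped[106]])
```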

Licence

Very, very broadly, you're free to do what you like with this data, so long as you credit Phil Gyford (and the authors of Annotations) and your work is non-commercial. More detail:

The text of the Diary itself (but not the links within the text) comes from Project Gutenberg and is in the public domain. Minor typos have occasionally been fixed.

The thumbnail images all come from Wikipedia and are also considered public domain.

In both the above cases you're expected to check your own country's copyright laws to ensure this applies...

Annotations have been written by many individuals and are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 1.0 Generic (CC BY-NC-SA 1.0) licence.

Everything else is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) licence.

All non-public-domain elements are copyright Phil Gyford, 2002-2012, except Encyclopedia texts which remain the copyright of their respective authors.

If you pass any of the accompanying data files on, please be sure to include this README document.

Versions

v1.0, 2011-01-22
First release.
v1.01, 2011-01-24
Corrected typo in 'New Palace Yard' example in README.
v1.1, 2012-05-29
Changes:
  • Added 1669 Diary entries
  • Added Annotations
  • Added type to every JSON file
  • Added unique ID to every Diary entry