A Dataset of Cryptic Crossword Clues

A Dataset of Cryptic Crossword Clues is a dataset of cryptic crossword[1] clues, indicators and charades, collected from various blogs and publicly available digital archives.

The project scrapes several blogs and digital archives for cryptic crosswords. From the collected web pages, it parses each clue together with its answer, clue number, the blogger’s explanation and commentary, the puzzle title and the publication date, and extracts them into a tabular dataset. The result is over half a million clues from cryptic crosswords published over the past twelve years.
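To make the shape of one row concrete, here is a minimal sketch of how a downloaded CSV export might be read into typed records. The field names (`clue`, `answer`, `clue_number`, `explanation`, `puzzle_title`, `publication_date`) simply mirror the list above and are assumptions, as is the ISO date format; check the datasheet for the real column names before relying on them.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import Iterator, Optional


@dataclass
class ClueRecord:
    # Field names are illustrative guesses, not the dataset's confirmed columns.
    clue: str
    answer: str
    clue_number: str
    explanation: str
    puzzle_title: str
    publication_date: Optional[date]


def parse_clue_rows(csv_text: str) -> Iterator[ClueRecord]:
    """Yield one ClueRecord per CSV row, assuming the header names above."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw_date = (row.get("publication_date") or "").strip()
        yield ClueRecord(
            clue=row.get("clue") or "",
            answer=row.get("answer") or "",
            clue_number=row.get("clue_number") or "",
            explanation=row.get("explanation") or "",
            puzzle_title=row.get("puzzle_title") or "",
            # Assumes ISO 8601 dates (YYYY-MM-DD); a missing date becomes None.
            publication_date=date.fromisoformat(raw_date) if raw_date else None,
        )
```

Missing or empty cells fall back to an empty string (or `None` for the date), so incomplete rows do not raise.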

Two other datasets are subsequently derived from the clues: wordplay indicators and charades (a.k.a. substitutions). All told, the derived datasets contain over ten thousand wordplay indicators and over sixty thousand charades.

Currently the sources for clues are:

The data can be viewed online and downloaded for free (CSV, JSON, SQLite, advanced[3]). Detailed documentation can be found on the datasheet, and the source code for creating the dataset is available on GitHub.
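Because the JSON output is paginated at 100 rows per page, collecting a full table means walking the pages in order. The sketch below shows one way to do that, assuming page-number pagination; the real endpoint may use a cursor or a "next" link instead. `iter_all_rows` and `fake_fetch_page` are hypothetical names, and the fetch function is passed in so that no URL or response format is assumed here.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


def iter_all_rows(
    fetch_page: Callable[[int], List[Row]], page_size: int = 100
) -> Iterator[Row]:
    """Yield rows page by page until a short (or empty) page is returned."""
    page = 1
    while True:
        rows = fetch_page(page)
        yield from rows
        if len(rows) < page_size:
            break  # a short page means there is nothing left to fetch
        page += 1


# Stand-in for a real HTTP call: serves 250 made-up rows, 100 per page.
_FAKE_ROWS: List[Row] = [{"rowid": i} for i in range(250)]


def fake_fetch_page(page: int) -> List[Row]:
    start = (page - 1) * 100
    return _FAKE_ROWS[start:start + 100]


all_rows = list(iter_all_rows(fake_fetch_page))
print(len(all_rows))
```

In real use, `fake_fetch_page` would be replaced by a function that requests one page of JSON and returns its rows as a list. For bulk work, the SQLite download is likely the simpler route.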

Send all comments, suggestions and complaints to george[æ]

Please share and enjoy!

~ George Ho

  1. If you’re new to cryptic crosswords, rejoice! A whole new world awaits you! The New Yorker has an excellent introduction to cryptic crosswords, and Matt Gritzmacher has a daily newsletter with links to crosswords.

  2. .puz files were provided courtesy of Michael F. Gill. As of August 2021, The New York Times no longer supports .puz files.

  3. The CSV request returns only the first 1000 rows; click here to stream all rows (this will take a while). The JSON request is paginated, with 100 rows per page.