SFGram

A database of public Science-Fiction books, novels and movies

SFGram (Science-Fiction Gram) is a dataset of public-domain science-fiction novels, books and movie covers. It is designed to let researchers study the evolution of science-fiction literature over time and test machine learning algorithms on authorship attribution and document classification tasks. All the documents are now in the public domain and were obtained from Project Gutenberg or the archive.org website.

Some SF magazines

   

| Magazine | Start | End |
| --- | --- | --- |
| Galaxy Magazine | January 1950 | March 1995 |
| IF Magazine | March 1952 | September 1986 |

The Dataset

The dataset is composed of the following files and directories.

| File | Type | Description |
| --- | --- | --- |
| authors | Dir | Contains one file per author with information about that author. Each file is named “authorsXXXXX.json” where XXXXX is the author ID with leading zeros. |
| book-contents | Dir | Contains the text documents holding the novels and books. Each file is named “bookXXXXX.txt” where XXXXX is the book ID with leading zeros. |
| book-covers | Dir | Contains all book covers. Each file is named “bookXXXXX-NAME.jpg” where XXXXX is the book ID and NAME is the name of the original file found when the dataset was created. |
| book-images | Dir | Contains the images found on the Wikipedia page of the corresponding book, if that page exists. |
| books | Dir | Contains the JSON files holding the information about each book. |
| authors.json | JSON | A JSON file containing all JSON objects present in the “authors” directory, in a list named “authors”. |
| books.json | JSON | A JSON file containing all JSON objects present in the “books” directory, in a list named “books”. |
| countries.json | JSON | A JSON file containing a list of objects. Each object represents a country described by its name and an ID. The object also contains a list of IDs of the books published by authors linked to this country. |
| years.json | JSON | A JSON file containing a list of objects. Each object represents a year with the list of IDs of the books published that year. |
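
As a minimal sketch of how the aggregate files might be consumed, the following Python snippet indexes the records of authors.json or books.json by their "id" field. It assumes the top-level layout described above (an object holding a list named "authors" or "books"). The function names index_by_id and load_index are hypothetical helpers, not part of the dataset.

```python
import json
from pathlib import Path


def index_by_id(payload, list_name):
    """Map each record's "id" to the record itself.

    Assumes `payload` is a dict holding a list under `list_name`
    ("authors" or "books"), as the file table describes.
    """
    return {record["id"]: record for record in payload[list_name]}


def load_index(path, list_name):
    """Read an aggregate JSON file from disk and index it by ID."""
    with Path(path).open(encoding="utf-8") as handle:
        return index_by_id(json.load(handle), list_name)


# Tiny in-memory stand-in for authors.json (illustrative values only).
sample = {"authors": [{"id": 5, "name": "Ayn Rand", "books": [6]}]}
authors = index_by_id(sample, "authors")
print(authors[5]["name"])
```

With the real files, a call such as load_index("authors.json", "authors") would return the same kind of dictionary, provided the path points at the downloaded dataset.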

Authors

The authors directory and the file authors.json contain the data describing each author in the dataset. This description covers a wide range of information such as name, country, gender, biography and the list of books. The complete list of fields contained in each author’s profile is the following.

| Field | Type | Description |
| --- | --- | --- |
| name | String | Author’s name |
| countries | Array | List of country IDs linked to the author |
| gender | String | Author’s gender (f or m) |
| wikipedia | JSON object | JSON object containing the URL of the author’s Wikipedia page and a flag telling whether the page was found |
| n_books | Integer | Number of books written by this author in the dataset |
| summary | String | A short biography found on Wikipedia (if it exists) |
| born | Formatted string | Birth date written as YYYY-MM-DD HH:MM:SS |
| books | Array | A list of IDs of books written by this author |
| id | Integer | The author’s ID |
| died | Formatted string | Death date, if it exists, written as YYYY-MM-DD HH:MM:SS |

The following JSON sample shows the profile of the author Ayn Rand as an example.

{ 
   "name": "Ayn Rand", 
   "countries": [ 2, 16 ], 
   "gender": "f", 
   "wikipedia": { 
      "url": "https://en.wikipedia.org/wiki/Ayn_Rand", 
      "found": true 
   }, 
   "n_books": 1, 
   "summary": "Ayn Rand (; born Alisa Zinov'yevna Rosenbaum, Russian: February 2 [O.S. January 20] 1905 \u2013 March 6, 1982) was a Russian-American novelist, philosopher, playwright, and screenwriter. She is known for her two best-selling novels, The Fountainhead and Atlas Shrugged, and for developing a philosophical system she called Objectivism. Educated in Russia, she moved to the United States in 1926. She had a play produced on Broadway in 1935\u20131936. After two early novels that were initially unsuccessful in America, she achieved fame with her 1943 novel, The Fountainhead.\nIn 1957, Rand published her best-known work, the novel Atlas Shrugged. Afterward, she turned to non-fiction to promote her philosophy, publishing her own magazines and releasing several collections of essays until her death in 1982. Rand advocated reason as the only means of acquiring knowledge, and rejected faith and religion. She supported rational and ethical egoism, and rejected altruism. In politics, she condemned the initiation of force as immoral, and opposed collectivism and statism as well as anarchism, and instead supported laissez-faire capitalism, which she defined as the system based on recognizing individual rights. In art, Rand promoted romantic realism. She was sharply critical of most philosophers and philosophical traditions known to her, except for Aristotle, Thomas Aquinas, and classical liberals.\nLiterary critics received Rand's fiction with mixed reviews, and academia generally ignored or rejected her philosophy, though academic interest has increased in recent decades. The Objectivist movement attempts to spread her ideas, both to the public and in academic settings. She has been a significant influence among libertarians and American conservatives.", 
   "born": "1905-02-02 00:00:00", 
   "books": [ 6 ], 
   "id": 5, 
   "died": "1982-03-06 00:00:00" 
}
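
The "born" and "died" fields can be parsed with the standard library. The sketch below is only an illustration: the helper names parse_date and age_at_death are hypothetical, and it assumes that a missing date appears as an absent key, null or an empty string, which the dataset description does not specify.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # layout of the "born" and "died" fields


def parse_date(value):
    """Return a datetime, or None when the field is missing or empty."""
    return datetime.strptime(value, DATE_FORMAT) if value else None


def age_at_death(author):
    """Whole years between "born" and "died", or None if either is absent."""
    born = parse_date(author.get("born"))
    died = parse_date(author.get("died"))
    if born is None or died is None:
        return None
    # Subtract one year when the death date falls before that year's birthday.
    before_birthday = (died.month, died.day) < (born.month, born.day)
    return died.year - born.year - int(before_birthday)


# Dates copied from the sample profile above.
rand = {"born": "1905-02-02 00:00:00", "died": "1982-03-06 00:00:00"}
print(age_at_death(rand))  # 77
```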

Books

The books directory and the file books.json contain all the information about the documents, such as title, author’s name and image URLs. The complete list of fields contained in each book’s file is the following.

| Field | Type | Description |
| --- | --- | --- |
| content_name | String | The name of the text file in the “book-contents” directory that contains the book |
| author_name | String | Name of the main author |
| images_urls | Array | A list of strings giving the URLs where the images in the “book-images” directory were found |
| year | Integer | The year the book was published |
| images | Array | An array containing all URLs to images linked to this book |
| id | Integer | The unique ID of this book |
| category | String | The book category |
| genres | Array | An array of strings listing the genres the book belongs to, as defined on Goodreads |
| copyright | String | A string describing the copyright status of this book (mostly in the public domain) |
| title | String | The title of the book |
| wikipedia | JSON | A JSON object containing the URL of the Wikipedia page, if available |
| average_rating | Float | The average rating given on Goodreads |
| goodreads | JSON | A JSON object containing the URL of the Goodreads page, if available |
| similar_books | Array | An array of strings giving the titles of books similar to this one (as defined on Goodreads) |
| description | String | A short abstract of the book |
| loc_class | String | The class of the book as defined by Project Gutenberg |
| gutenberg | JSON | A JSON object containing the URL and ID of this book on the Project Gutenberg website |
| authors | Array | An array of IDs of the authors who contributed to this document |
| language | String | The language of the available content |
| countries | Array | A list of IDs of the countries linked to the authors of this document |
| release_date | Formatted string | The release date as YYYY-MM-DD (the sample below appends a T00:00:00 time part) |
| author | Integer | The ID of the main author |
| cover | String | The URL of the cover |
| content_cleaned | Boolean | True if the content has been cleaned of copyright information and references |
| classes | Array | The classes assigned to this document on Goodreads |
| content_available | Boolean | True if the content of this document is available in the dataset |
| n_authors | Integer | The exact number of authors who contributed to this document |

The following JSON sample shows an example of book information.

{
    "content_name": "8681.txt.utf-8", 
    "author_name": "Robert Barr", 
    "images_urls": [], 
    "year": 1894, 
    "images": [], 
    "id": 229, 
    "category": "Text", 
    "genres": [], 
    "copyright": "Public domain in the USA.", 
    "title": "The Face and the Mask", 
    "wikipedia": {
        "found": false
    }, 
    "average_rating": 4.43, 
    "rating_count": 1, 
    "goodreads": {
        "url": "https://www.goodreads.com/book/show/9066959-the-face-and-the-mask", 
        "found": true
    }, 
    "similar_books": [], 
    "description": "Novel by the teacher, journalist, editor and novelist, born in Glasgow, Scotland and educated in Canada. In 1876 he became a member of the staff of the Detroit Free Press, in which his contributions appeared under the signature \"Luke Sharp.\" In 1881 he removed to London, to establish the weekly English edition of the Free Press, and in 1892 he joined Jerome K. Jerome in founding the Idler magazine, from whose co-editorship he retired in 1895. He was a prolific author, producing many popular novels of the day.", 
    "loc_class": "PR: Language and Literatures: English literature", 
    "gutenberg": {
        "url": "http://www.gutenberg.org/ebooks/8681", 
        "num": 8681
    }, 
    "authors": [
        75
    ], 
    "language": "English", 
    "countries": [
        8, 
        1
    ], 
    "release_date": "2004-11-14T00:00:00", 
    "author": 75, 
    "cover": "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png", 
    "content_cleaned": false, 
    "classes": [
        "to-read", 
        "21st-century-lit", 
        "books-i-have", 
        "short-story-thurs"
    ], 
    "content_available": true, 
    "n_authors": 1
}
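
A book record can be turned into a file path and a parsed date with a few lines of Python. This is a sketch under stated assumptions: the helper names content_path and release_datetime and the "sfgram" root folder are hypothetical, and the path is built from the "content_name" field rather than from the “bookXXXXX.txt” pattern, since the sample above uses a different file name.

```python
from datetime import datetime
from pathlib import Path


def content_path(book, dataset_root):
    """Path to the book's text file, or None when no content is shipped."""
    if not book.get("content_available"):
        return None
    return Path(dataset_root) / "book-contents" / book["content_name"]


def release_datetime(book):
    """Parse "release_date"; accepts YYYY-MM-DD and YYYY-MM-DDTHH:MM:SS."""
    return datetime.fromisoformat(book["release_date"])


# Fields copied from the sample record above; "sfgram" is a placeholder root.
book = {
    "content_name": "8681.txt.utf-8",
    "content_available": True,
    "release_date": "2004-11-14T00:00:00",
}
print(content_path(book, "sfgram"))
print(release_datetime(book).year)  # 2004
```

Checking "content_available" first avoids building a path to a file that the dataset does not include.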

Countries

The countries.json file describes each country, including the books and authors linked to it.

| Field | Type | Description |
| --- | --- | --- |
| books | Array | A list of IDs of books whose authors are linked to this country |
| id | Integer | The ID corresponding to this country |
| name | String | Country’s name |
| authors | Array | An array of IDs corresponding to the authors linked to this country |

The following JSON sample shows the entry for the United Kingdom with its books and authors.

{
   "movies": [], 
   "books": [
      950, 1, 2, 8, 9, 17, 18, 20, 31, 32, 46, 47, 48, 49, 51, 63, 69, 79, 86, 87, 135, 153, 164, 201, 229, 233, 234, 321, 349, 380, 392, 478, 483, 623, 649, 710, 725, 749, 763, 798, 836, 926, 928
   ], 
   "id": 1, 
   "name": "United Kingdom", 
   "authors": [
      1, 7, 10, 11, 33, 54, 57, 75, 76, 77, 106, 160, 246
   ]
}
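
The country IDs stored in a book record can be resolved to names through the country objects. The following sketch assumes the list of country objects has already been loaded from countries.json. The function name country_names is hypothetical, and skipping unknown IDs is a design choice of this example, not documented dataset behaviour.

```python
def country_names(book, countries):
    """Names of the countries linked to a book, in the book's own order.

    `countries` is the list of country objects from countries.json.
    IDs with no matching country object are skipped rather than raising.
    """
    names_by_id = {country["id"]: country["name"] for country in countries}
    return [
        names_by_id[country_id]
        for country_id in book.get("countries", [])
        if country_id in names_by_id
    ]


# Reduced versions of the sample records shown in this document.
countries = [{"id": 1, "name": "United Kingdom", "books": [229], "authors": [75]}]
book = {"id": 229, "countries": [8, 1]}
print(country_names(book, countries))  # ['United Kingdom']
```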

Years

The years.json file lists, for each year in the dataset, the associated movies and books.

| Field | Type | Description |
| --- | --- | --- |
| movies | Array | A list of IDs of movies |
| n_books | Integer | How many books were published this year |
| books | Array | An array of IDs corresponding to books published this year |

The following JSON sample shows an example entry for the year 2017.

{
 "movies": [],
 "books": [],
 "n_books": 5,
 "year": 2017 
}
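
In the sample above, "n_books" is 5 while the "books" array is empty, so it may be worth checking the two against each other before relying on either. The sketch below lists the entries where they differ. Whether the two values are meant to agree is an assumption, and count_mismatches is a hypothetical helper name.

```python
def count_mismatches(years):
    """Return (year, n_books, len(books)) for entries whose counts disagree."""
    return [
        (entry["year"], entry["n_books"], len(entry["books"]))
        for entry in years
        if entry["n_books"] != len(entry["books"])
    ]


# The sample entry shown above.
years = [{"movies": [], "books": [], "n_books": 5, "year": 2017}]
print(count_mismatches(years))  # [(2017, 5, 0)]
```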

Download

Click on the following button to download the latest version of SFGram.
