SFGram

A database of public Science-Fiction books, novels and movies

SFGram (Science-Fiction Gram) is a dataset of public science-fiction novels, books and movie covers. It is designed for researchers who study the evolution of science-fiction literature over time or who want to test machine learning algorithms on authorship attribution and document classification tasks. All documents are now in the public domain and were obtained from Project Gutenberg or the archive.org website.

Some SF magazines

   

| Magazine | Start | End |
| --- | --- | --- |
| Galaxy Magazine | January 1950 | March 1995 |
| IF Magazine | March 1952 | September 1986 |

The Dataset

The dataset is composed of the following files and directories.

| File | Type | Description |
| --- | --- | --- |
| authors | Dir | Contains the author files, which hold information about each author. Each file is named “authorsXXXXX.json”, where XXXXX is the author ID padded with leading zeros. |
| book-contents | Dir | Contains the text documents of the novels and books. Each file is named “bookXXXXX.txt”, where XXXXX is the book ID padded with leading zeros. |
| book-covers | Dir | Contains all book covers. Each file is named “bookXXXXX-NAME.jpg”, where XXXXX is the book ID and NAME is the name of the original file found when the dataset was created. |
| book-images | Dir | Contains the images found on the Wikipedia page of the corresponding book, if that page exists. |
| books | Dir | Contains the book JSON files, which hold information about each book. |
| authors.json | JSON | A JSON file containing all JSON objects present in the “authors” directory, in a list named “authors”. |
| books.json | JSON | A JSON file containing all JSON objects present in the “books” directory, in a list named “books”. |
| countries.json | JSON | A JSON file containing a list of objects. Each object represents a country described by its name and an ID. The object also contains a list of IDs of the books published by authors linked to this country. |
| years.json | JSON | A JSON file containing a list of objects. Each object represents a year, with the list of IDs of the books published that year. |
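The naming scheme and the top-level JSON files described above can be handled with a few small helpers. The following is a minimal sketch, not an official loader: it assumes that XXXXX is always five digits wide, that the files are UTF-8 encoded, and that authors.json and books.json are JSON objects whose lists sit under the keys “authors” and “books”. The directory name `sfgram` is a placeholder for wherever the archive was extracted.

```python
import json
from pathlib import Path


def author_file(root: Path, author_id: int) -> Path:
    """Path of one author file, e.g. authors/authors00005.json (assumed 5-digit padding)."""
    return root / "authors" / f"authors{author_id:05d}.json"


def book_content_file(root: Path, book_id: int) -> Path:
    """Path of one book text, e.g. book-contents/book00229.txt (assumed 5-digit padding)."""
    return root / "book-contents" / f"book{book_id:05d}.txt"


def load_list(root: Path, filename: str, key: str) -> list:
    """Load a top-level JSON file and return the list stored under `key`."""
    with open(root / filename, encoding="utf-8") as handle:
        return json.load(handle)[key]


# "sfgram" is a hypothetical extraction directory.
print(author_file(Path("sfgram"), 5).as_posix())  # sfgram/authors/authors00005.json
```

With the archive extracted, `load_list(Path("sfgram"), "books.json", "books")` would return the list of book records. Where a book record carries a content_name field, that value is probably a safer way to locate the text file than rebuilding the name from the ID.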

Authors

The authors directory and the file authors.json contain the data describing each author in the dataset. Each description covers a wide range of information, such as name, country, gender, biography and the list of books. The complete list of fields contained in each author’s profile is the following.

| Field | Type | Description |
| --- | --- | --- |
| name | String | Author’s name |
| countries | Array | List of country IDs linked to the author |
| gender | String | Author’s gender (f or m) |
| wikipedia | JSON object | JSON object containing the URL of the author’s Wikipedia page and a flag indicating whether the page was found |
| n_books | Integer | Number of books written by this author in the dataset |
| summary | String | A short biography found on Wikipedia (if it exists) |
| born | Formatted string | Birth date written as YYYY-MM-DD HH:MM:SS |
| books | Array | A list of IDs of books written by this author |
| id | Integer | The author’s ID |
| died | Formatted string | Death date, if it exists, written as YYYY-MM-DD HH:MM:SS |

The following JSON sample shows the profile of the author Ayn Rand as an example.

{ 
   "name": "Ayn Rand", 
   "countries": [ 2, 16 ], 
   "gender": "f", 
   "wikipedia": { 
      "url": "https://en.wikipedia.org/wiki/Ayn_Rand", 
      "found": true 
   }, 
   "n_books": 1, 
   "summary": "Ayn Rand (; born Alisa Zinov'yevna Rosenbaum, Russian: February 2 [O.S. January 20] 1905 \u2013 March 6, 1982) was a Russian-American novelist, philosopher, playwright, and screenwriter. She is known for her two best-selling novels, The Fountainhead and Atlas Shrugged, and for developing a philosophical system she called Objectivism. Educated in Russia, she moved to the United States in 1926. She had a play produced on Broadway in 1935\u20131936. After two early novels that were initially unsuccessful in America, she achieved fame with her 1943 novel, The Fountainhead.\nIn 1957, Rand published her best-known work, the novel Atlas Shrugged. Afterward, she turned to non-fiction to promote her philosophy, publishing her own magazines and releasing several collections of essays until her death in 1982. Rand advocated reason as the only means of acquiring knowledge, and rejected faith and religion. She supported rational and ethical egoism, and rejected altruism. In politics, she condemned the initiation of force as immoral, and opposed collectivism and statism as well as anarchism, and instead supported laissez-faire capitalism, which she defined as the system based on recognizing individual rights. In art, Rand promoted romantic realism. She was sharply critical of most philosophers and philosophical traditions known to her, except for Aristotle, Thomas Aquinas, and classical liberals.\nLiterary critics received Rand's fiction with mixed reviews, and academia generally ignored or rejected her philosophy, though academic interest has increased in recent decades. The Objectivist movement attempts to spread her ideas, both to the public and in academic settings. She has been a significant influence among libertarians and American conservatives.", 
   "born": "1905-02-02 00:00:00", 
   "books": [ 6 ], 
   "id": 5, 
   "died": "1982-03-06 00:00:00" 
}
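The born and died fields use the YYYY-MM-DD HH:MM:SS layout, so they can be parsed with the standard library. The sketch below computes an author's age at death from those two fields. How a missing date is stored is not stated in the description, so the code assumes it may be absent, null or an empty string. The helper names are illustrative only.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # layout given in the field table


def parse_date(value):
    """Return a datetime, or None when the field is missing or empty (assumption)."""
    if not value:
        return None
    return datetime.strptime(value, DATE_FORMAT)


def age_at_death(author: dict):
    """Whole years between birth and death, or None if either date is unknown."""
    born = parse_date(author.get("born"))
    died = parse_date(author.get("died"))
    if born is None or died is None:
        return None
    years = died.year - born.year
    # Subtract one year if the death occurred before the birthday in that year.
    if (died.month, died.day) < (born.month, born.day):
        years -= 1
    return years


# Reduced copy of the sample profile above.
author = {
    "name": "Ayn Rand",
    "born": "1905-02-02 00:00:00",
    "died": "1982-03-06 00:00:00",
}
print(age_at_death(author))  # 77
```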

Books

The books directory and the file books.json contain all the information about the documents, such as title, author’s name and image URLs. The complete list of fields contained in each book’s file is the following.

| Field | Type | Description |
| --- | --- | --- |
| content_name | String | The name of the text file in the “book-contents” directory that contains the book |
| author_name | String | Name of the main author |
| images_urls | Array | A list of strings giving the URLs where the images in the “book-images” directory were found |
| year | Integer | The year the book was published |
| images | Array | An array containing all URLs to images linked to this book |
| id | Integer | The unique ID of this book |
| category | String | The book category |
| genres | Array | An array of strings listing the genres the book belongs to, as defined on Goodreads |
| copyright | String | A string describing the copyright status of this book (mostly public domain) |
| title | String | The title of the book |
| wikipedia | JSON | A JSON object containing the URL of the Wikipedia page, if available |
| average_rating | Float | The average rating given on Goodreads |
| goodreads | JSON | A JSON object containing the URL of the Goodreads page, if available |
| similar_books | Array | An array of strings giving the titles of books similar to this one (as defined on Goodreads) |
| description | String | A short abstract of the book |
| loc_class | String | The class of the book as defined by Project Gutenberg |
| gutenberg | JSON | A JSON object containing the URL and ID of this book on the Project Gutenberg website |
| authors | Array | An array of IDs of the authors who contributed to this document |
| language | String | The language of the available content |
| countries | Array | A list of IDs of the countries linked to the authors of this document |
| release_date | Formatted string | The release date as YYYY-MM-DD (the sample below also carries a time part, as in 2004-11-14T00:00:00) |
| author | Integer | The ID of the main author |
| cover | String | The URL of the cover |
| content_cleaned | Boolean | True if the content has been cleaned of copyright information and references |
| classes | Array | The different classes describing this document, as specified on Goodreads |
| content_available | Boolean | True if the content of this document is available in the dataset |
| n_authors | Integer | The exact number of authors who contributed to this document |

The following JSON sample shows an example of book information.

{
    "content_name": "8681.txt.utf-8", 
    "author_name": "Robert Barr", 
    "images_urls": [], 
    "year": 1894, 
    "images": [], 
    "id": 229, 
    "category": "Text", 
    "genres": [], 
    "copyright": "Public domain in the USA.", 
    "title": "The Face and the Mask", 
    "wikipedia": {
        "found": false
    }, 
    "average_rating": 4.43, 
    "rating_count": 1, 
    "goodreads": {
        "url": "https://www.goodreads.com/book/show/9066959-the-face-and-the-mask", 
        "found": true
    }, 
    "similar_books": [], 
    "description": "Novel by the teacher, journalist, editor and novelist, born in Glasgow, Scotland and educated in Canada. In 1876 he became a member of the staff of the Detroit Free Press, in which his contributions appeared under the signature \"Luke Sharp.\" In 1881 he removed to London, to establish the weekly English edition of the Free Press, and in 1892 he joined Jerome K. Jerome in founding the Idler magazine, from whose co-editorship he retired in 1895. He was a prolific author, producing many popular novels of the day.", 
    "loc_class": "PR: Language and Literatures: English literature", 
    "gutenberg": {
        "url": "http://www.gutenberg.org/ebooks/8681", 
        "num": 8681
    }, 
    "authors": [
        75
    ], 
    "language": "English", 
    "countries": [
        8, 
        1
    ], 
    "release_date": "2004-11-14T00:00:00", 
    "author": 75, 
    "cover": "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png", 
    "content_cleaned": false, 
    "classes": [
        "to-read", 
        "21st-century-lit", 
        "books-i-have", 
        "short-story-thurs"
    ], 
    "content_available": true, 
    "n_authors": 1
}
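For an authorship attribution experiment, the book records can be filtered down to single-author texts whose content is actually present. The sketch below shows one possible selection using the content_available, language, n_authors, content_name and author fields. The function name and the second record are hypothetical; the first record is a reduced copy of the sample above.

```python
def attribution_pairs(books, language="English"):
    """Return (content_name, author_id) pairs usable as labelled examples.

    Keeps only books whose text is available, written in `language`,
    and credited to exactly one author.
    """
    pairs = []
    for book in books:
        if not book.get("content_available"):
            continue
        if book.get("language") != language:
            continue
        if book.get("n_authors") != 1:
            continue
        pairs.append((book["content_name"], book["author"]))
    return pairs


books = [
    {   # reduced copy of the sample record
        "id": 229, "content_name": "8681.txt.utf-8", "author": 75,
        "n_authors": 1, "language": "English", "content_available": True,
    },
    {   # hypothetical record: two authors, so it is skipped
        "id": 9999, "content_name": "example.txt", "author": 1,
        "n_authors": 2, "language": "English", "content_available": True,
    },
]
print(attribution_pairs(books))  # [('8681.txt.utf-8', 75)]
```

Filtering on content_cleaned as well may be worthwhile, since leftover licence text in uncleaned files could leak into a classifier.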

Countries

The countries.json file describes each country, including the books and authors linked to it.

| Field | Type | Description |
| --- | --- | --- |
| books | Array | A list of IDs of books whose authors are linked to this country |
| id | Integer | The ID corresponding to this country |
| name | String | Country’s name |
| authors | Array | An array of IDs of the authors linked to this country |

The following JSON sample shows the entry for the United Kingdom with its books and authors.

{
   "movies": [], 
   "books": [
      950, 1, 2, 8, 9, 17, 18, 20, 31, 32, 46, 47, 48, 49, 51, 63, 69, 79, 86, 87, 135, 153, 164, 201, 229, 233, 234, 321, 349, 380, 392, 478, 483, 623, 649, 710, 725, 749, 763, 798, 836, 926, 928
   ], 
   "id": 1, 
   "name": "United Kingdom", 
   "authors": [
      1, 7, 10, 11, 33, 54, 57, 75, 76, 77, 106, 160, 246
   ]
}
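Because country entries store only author IDs, turning them into names requires a lookup against the author records. The following sketch joins the two lists in memory. The function name is illustrative, and the records are reduced versions of the samples shown in this document (author 75 is Robert Barr, author 5 is Ayn Rand).

```python
def authors_of_country(country_name, countries, authors):
    """Return the names of the authors linked to the country called `country_name`."""
    names_by_id = {author["id"]: author["name"] for author in authors}
    for country in countries:
        if country["name"] == country_name:
            # Skip IDs with no matching author record instead of failing.
            return [names_by_id[i] for i in country["authors"] if i in names_by_id]
    return []


countries = [
    {"id": 1, "name": "United Kingdom", "books": [229], "authors": [75]},
]
authors = [
    {"id": 5, "name": "Ayn Rand"},
    {"id": 75, "name": "Robert Barr"},
]
print(authors_of_country("United Kingdom", countries, authors))  # ['Robert Barr']
```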

Years

The years.json file describes each year in the dataset, including the movies and books published that year.

| Field | Type | Description |
| --- | --- | --- |
| movies | Array | A list of IDs of movies |
| n_books | Integer | How many books were published this year |
| books | Array | An array of IDs of the books published this year |

The following JSON sample shows an example entry for the year 2017.

{
 "movies": [],
 "books": [],
 "n_books": 5,
 "year": 2017 
}
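Year entries lend themselves to simple counts over time, for instance the number of books per decade. The sketch below sums the n_books field. In the sample above n_books is 5 while the books list is empty, so the two may not always agree, and using n_books here is a choice rather than a documented rule. The first entry is the sample; the other two are hypothetical.

```python
from collections import Counter


def books_per_decade(years):
    """Sum n_books per decade and return a dict ordered by decade."""
    totals = Counter()
    for entry in years:
        decade = (entry["year"] // 10) * 10
        totals[decade] += entry["n_books"]
    return dict(sorted(totals.items()))


years = [
    {"movies": [], "books": [], "n_books": 5, "year": 2017},     # sample from the text
    {"movies": [], "books": [229], "n_books": 1, "year": 1894},  # hypothetical entry
    {"movies": [], "books": [], "n_books": 2, "year": 1897},     # hypothetical entry
]
print(books_per_decade(years))  # {1890: 3, 2010: 5}
```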

Download

Click on the following button to download the latest version of SFGram.

Documents
