Skip to content

Intro to MongoDB and the Nobel Prize Dataset - Lesson 1

Why MongoDB?

  • All Materials Project data is stored using MongoDB
  • Our API query language is based on MongoDB syntax

MongoDB Structure

  • Data objects represented by documents
  • Documents organized into collections
  • Collections make up a database

Data Structure

object → {field: value, field1: value1, ...}

fields: string

values: string, int/double, true, false, null, array, object, ...

example:

{
    name: Sue,
    age: 28,
    lawSpecialties: [copyright, tax],
    canMeet:
        { 
            mon: True,
            tues: False,
            wed: True,
            thurs: True,
            fri: False,
        }    
}

JavaScript Object Notation (JSON)

object → {string: value, string1: value1, ...}

values: string, number, true, false, null, object, array

JSON \(\longleftrightarrow\) Python

objects → {string: value,string1: value1, ...}

---> dictionaries (with str keys)

arrays → [value, value1, ...]

---> lists

values: string, number, true, false, null, object, array

str, int, float, True, False, None, dict, list

Accessing MongoDB

We can access our MongoDB databases using an instance of MongoClient from the pymongo package.

from pymongo import MongoClient

client = MongoClient()
print(client)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

You can access databases and collections as attributes and/or treat them as dictionary keys.

client.nobel == client["nobel"]
True
client.nobel.prizes == client["nobel"]["prizes"]
True

We can also connect to the nobel database.

db = client.nobel
print(db)
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'nobel')

Searching for documents

Let's see what a document looks like in the prizes collection using the find_one() method. This takes an optional filter argument. Passing an empty filter ({}) is the same as passing no filter. In Python, the returned document takes the form of a dictionary. The keys of the dictionary are the (root-level) "fields" of the document.

db.prizes.find_one({})
{'_id': ObjectId('5f1f4e9430688833f8e30309'),
 'year': '2019',
 'category': 'chemistry',
 'laureates': [{'id': '976',
   'firstname': 'John',
   'surname': 'Goodenough',
   'motivation': '"for the development of lithium-ion batteries"',
   'share': '3'},
  {'id': '977',
   'firstname': 'M. Stanley',
   'surname': 'Whittingham',
   'motivation': '"for the development of lithium-ion batteries"',
   'share': '3'},
  {'id': '978',
   'firstname': 'Akira',
   'surname': 'Yoshino',
   'motivation': '"for the development of lithium-ion batteries"',
   'share': '3'}]}

We can now add a filter to our search to ensure that the returned prizes document contains data for a physics nobel prize.

criteria = {"category": "physics"}
db.prizes.find_one(criteria)
{'_id': ObjectId('5f1f4e9430688833f8e3030d'),
 'year': '2019',
 'category': 'physics',
 'overallMotivation': '"for contributions to our understanding of the evolution of the universe and Earth’s place in the cosmos"',
 'laureates': [{'id': '973',
   'firstname': 'James',
   'surname': 'Peebles',
   'motivation': '"for theoretical discoveries in physical cosmology"',
   'share': '2'},
  {'id': '974',
   'firstname': 'Michel',
   'surname': 'Mayor',
   'motivation': '"for the discovery of an exoplanet orbiting a solar-type star"',
   'share': '4'},
  {'id': '975',
   'firstname': 'Didier',
   'surname': 'Queloz',
   'motivation': '"for the discovery of an exoplanet orbiting a solar-type star"',
   'share': '4'}]}

You may iterate over a collection, collecting from each document. However, a collection is not a list, so we can't write for doc in <collection> to iterate over documents. Instead, we can use the find() method to produce an iterable called a cursor, and instead write for doc in <collection>.find() to iterate over documents.

- Using find() to iterate over documents in the laureates collection, sum the total number of laureate prizes. The length of doc["prizes"] for a laureate document doc is the number of prizes won by that laureate. Store the sum in the variable count.

# Compute the total number of laureate prizes
count = 0
cursor = db.laureates.find({})
for doc in cursor:
    num_prizes = len(doc["prizes"])
    count += num_prizes
count
950

Listing and Counting - Lesson 2

from pymongo import MongoClient

client = MongoClient()
print(client)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

Listing databases and collections

Recall: - Data objects represented by documents - Documents organized into collections - Collections make up a database

We can list database names by calling list_database_names() on a client instance, and we can list collection names by calling list_collection_names() on a database instance.

- How many databases are managed by client?

# Save a list of names of the databases managed by client
db_names = client.list_database_names()
print(db_names)

# Save a list of names of the collections managed by the "nobel" database
nobel_coll_names = client.nobel.list_collection_names()
print(nobel_coll_names)
['admin', 'config', 'drone_test', 'fireworks', 'local', 'maggma_test', 'mp_workshop', 'nobel']
['prizes', 'laureates']

Counting documents

# Connect to our "nobel" database
db = client.nobel

The count_documents() method of a collection can be used to count the number of documents matching a particular filter. Let's use the method to count the total number of nobel laureates who died in the USA.

An example laureates document:

{'_id': ObjectId('5b9ac94ff35b63cf5231ccb1'),
 'born': '1845-03-27',
 'bornCity': 'Lennep (now Remscheid)',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'died': '1923-02-10',
 'diedCity': 'Munich',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'firstname': 'Wilhelm Conrad',
 'gender': 'male',
 'id': '1',
 'prizes': [{'affiliations': [{'city': 'Munich',
                               'country': 'Germany',
                               'name': 'Munich University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services '
                           'he has rendered by the discovery of the '
                           'remarkable rays subsequently named after him"',
             'share': '1',
             'year': '1901'}],
 'surname': 'Röntgen'}

db.laureates.count_documents({"diedCountry": "USA"})
222

We can now build out our filter to reduce the total number of matching documents:

  • Create a filter criteria to count the number of laureates who died in the USA but were born ("bornCountry") in Germany.
# Create a filter for laureates who died in the USA but were born in Germany
criteria = {"diedCountry": "USA", "bornCountry": "Germany"}

# Save a count
count = db.laureates.count_documents(criteria)
print(count)
5

- Count laureates who died in the USA, were born in Germany, and whose first name ("firstname") was "Albert".

# Create a filter for laureates who died in the USA, were born in Germany, and whose first name was Albert.
criteria = {"diedCountry": "USA", "bornCountry": "Germany", "firstname": "Albert"}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
1

db.laureates.find_one(criteria)
{'_id': ObjectId('5f1f4e418defad6d9f1ccead'),
 'id': '26',
 'firstname': 'Albert',
 'surname': 'Einstein',
 'born': '1879-03-14',
 'died': '1955-04-18',
 'bornCountry': 'Germany',
 'bornCountryCode': 'DE',
 'bornCity': 'Ulm',
 'diedCountry': 'USA',
 'diedCountryCode': 'US',
 'diedCity': 'Princeton, NJ',
 'gender': 'male',
 'prizes': [{'year': '1921',
   'category': 'physics',
   'share': '1',
   'motivation': '"for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect"',
   'affiliations': [{'name': 'Kaiser-Wilhelm-Institut (now Max-Planck-Institut) für Physik',
     'city': 'Berlin',
     'country': 'Germany'}]}]}

Finding, and Dot Notation - Lesson 3

from pymongo import MongoClient

client = MongoClient()
db = client.nobel
print(client)
print(db)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'nobel')

Query operators

Query operators provide us more options when constructing filters. For example, we may wish to find documents where a field's value matches any of a set of options. To do this we can use the $in query operator.

- How many laureates were born in any of "Canada", "Mexico", or "USA"?

#Save a filter for laureates who were born in with Canada, Mexico, or the USA.
criteria = {"bornCountry": {"$in": ["Canada", "Mexico", "USA"]}}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
295

If we wish to accept all but one option as a value for a field, we can use the $ne (not equal) operator.

- How many laureates died in the USA but were not born in the USA?

# Save a filter for laureates who died in the USA and were not born there.
criteria = {"bornCountry": {"$ne": "USA"}, "diedCountry": "USA"}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
71

Below is a list of comparison query operators that are available in MongoDB:

$eq -- Matches values that are equal to a specified value.
$gt -- Matches values that are greater than a specified value.
$gte -- Matches values that are greater than or equal to a specified value.
$in -- Matches any of the values specified in an array.
$lt -- Matches values that are less than a specified value.
$lte -- Matches values that are less than or equal to a specified value.
$ne -- Matches all values that are not equal to a specified value.
$nin -- Matches none of the values specified in an array.

In addition to comparison query operators, we can use logical query operators in our filters as well.

- How many laureates died in Canada OR were born in France?

#Save a filter for laureates who were born in Canada OR died in France.
criteria = {"$or":[{"diedCountry": "Canada"}, {"bornCountry": "France"}]}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
59

Below is a list of logical query operators that are available in MongoDB:

$and -- Joins query clauses with a logical AND returns all documents that match the conditions of both clauses.
$not -- Inverts the effect of a query expression and returns documents that do not match the query expression.
$nor -- Joins query clauses with a logical NOR returns all documents that fail to match both clauses.
$or -- Joins query clauses with a logical OR returns all documents that match the conditions of either clause.

See https://docs.mongodb.com/manual/reference/operator/query/ for more operators!

Using dot notation

You will notice that the prizes field of a laureates document consists of an array of subdocuments:

An example laureates document:

{'_id': ObjectId('5b9ac94ff35b63cf5231ccb1'),
 'born': '1845-03-27',
 'bornCity': 'Lennep (now Remscheid)',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'died': '1923-02-10',
 'diedCity': 'Munich',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'firstname': 'Wilhelm Conrad',
 'gender': 'male',
 'id': '1',
 'prizes': [{'affiliations': [{'city': 'Munich',
                               'country': 'Germany',
                               'name': 'Munich University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services '
                           'he has rendered by the discovery of the '
                           'remarkable rays subsequently named after him"',
             'share': '1',
             'year': '1901'}],
 'surname': 'Röntgen'}

An easy way to construct filters for array entries or subdocument fields is to use dot notation.

- Use the $exists operator to find out how many laureates were awarded at least two prizes?

#Save a filter for laureates who were awarded at least two prizes. 
criteria = {"prizes.1":{"$exists": True}}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
6

- Estimate how many laureates were awarded at least two prizes in physics specifically?

#Save a filter to estimate the number of laureates who were awarded at least two prizes in physics. 
criteria = {"prizes.0.category": "physics", "prizes.1.category": "physics"}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
1

db.laureates.find_one(criteria)
{'_id': ObjectId('5f1f4e418defad6d9f1cced5'),
 'id': '66',
 'firstname': 'John',
 'surname': 'Bardeen',
 'born': '1908-05-23',
 'died': '1991-01-30',
 'bornCountry': 'USA',
 'bornCountryCode': 'US',
 'bornCity': 'Madison, WI',
 'diedCountry': 'USA',
 'diedCountryCode': 'US',
 'diedCity': 'Boston, MA',
 'gender': 'male',
 'prizes': [{'year': '1956',
   'category': 'physics',
   'share': '3',
   'motivation': '"for their researches on semiconductors and their discovery of the transistor effect"',
   'affiliations': [{'name': 'University of Illinois',
     'city': 'Urbana, IL',
     'country': 'USA'}]},
  {'year': '1972',
   'category': 'physics',
   'share': '3',
   'motivation': '"for their jointly developed theory of superconductivity, usually called the BCS-theory"',
   'affiliations': [{'name': 'University of Illinois',
     'city': 'Urbana, IL',
     'country': 'USA'}]}]}