Skip to content

Intro to MongoDB and the Nobel Prize Dataset - Lesson 1

Why MongoDB?

  • All Materials Project data is stored using MongoDB
  • Our API query language is based on MongoDB syntax

MongoDB Structure

  • Collections make up a database
  • Documents organized into collections
  • Data objects represented by documents

Data Structure

object → {field: value, field1: value1, ...}

fields: string

values: string, int/double, true, false, null, array, object, ...

example:

{
    name: Sue,
    age: 28,
    lawSpecialties: [copyright, tax],
    canMeet:
        { 
            mon: True,
            tues: False,
            wed: True,
            thurs: True,
            fri: False,
        }    
}

JavaScript Object Notation (JSON)

object → {string: value, string1: value1, ...}

values: string, number, true, false, null, object, array

JSON \(\longleftrightarrow\) Python

objects → {string: value,string1: value1, ...}

---> dictionaries (with str keys)

arrays → [value, value1, ...]

---> lists

values: string, number, true, false, null, object, array

str, int, float, True, False, None, dict, list

Accessing MongoDB

We can access our MongoDB databases using an instance of MongoClient from the pymongo package.

from pymongo import MongoClient

client = MongoClient()
print(client)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

You can access databases and collections as attributes and/or treat them as dictionary keys.

client.nobel
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'nobel')
client["nobel"]
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'nobel')
client.nobel == client["nobel"]
True
client.nobel.prizes == client["nobel"]["prizes"]
True

We can also connect to the nobel database.

db = client.nobel
print(db)
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'nobel')

Searching for documents with find_one()

Let's see what a document looks like in the prizes collection using the find_one() method. This takes an optional filter argument. Passing an empty filter ({}) is the same as passing no filter. In Python, the returned document takes the form of a dictionary. The keys of the dictionary are the (root-level) "fields" of the document.

db.prizes.find_one({})
{'_id': ObjectId('6111e3211614513ac60bf8f3'),
 'year': '2020',
 'category': 'chemistry',
 'laureates': [{'id': '991',
   'firstname': 'Emmanuelle',
   'surname': 'Charpentier',
   'motivation': '"for the development of a method for genome editing"',
   'share': '2'},
  {'id': '992',
   'firstname': 'Jennifer A.',
   'surname': 'Doudna',
   'motivation': '"for the development of a method for genome editing"',
   'share': '2'}]}

We can now add a filter to our search to ensure that the returned prizes document contains data for a physics nobel prize.

criteria = {"category": "physics"}
db.prizes.find_one(criteria)
{'_id': ObjectId('6111e3211614513ac60bf8f7'),
 'year': '2020',
 'category': 'physics',
 'laureates': [{'id': '988',
   'firstname': 'Roger',
   'surname': 'Penrose',
   'motivation': '"for the discovery that black hole formation is a robust prediction of the general theory of relativity"',
   'share': '2'},
  {'id': '989',
   'firstname': 'Reinhard',
   'surname': 'Genzel',
   'motivation': '"for the discovery of a supermassive compact object at the centre of our galaxy"',
   'share': '4'},
  {'id': '990',
   'firstname': 'Andrea',
   'surname': 'Ghez',
   'motivation': '"for the discovery of a supermassive compact object at the centre of our galaxy"',
   'share': '4'}]}

Iterating through documents with find()

A collection is not a list, so we can't write for doc in <collection> to iterate over documents. Instead, we use the find() method to produce an iterable called a cursor, and instead write for doc in <collection>.find() to iterate over documents.

Example: Using find() to iterate over documents in the laureates collection, sum the total number of laureate prizes. The length of doc["prizes"] for a laureate document doc is the number of prizes won by that laureate. Store the sum in the variable count.

db.laureates.find_one({})
{'_id': ObjectId('6111e3211614513ac60bfb7f'),
 'id': '1',
 'firstname': 'Wilhelm Conrad',
 'surname': 'Röntgen',
 'born': '1845-03-27',
 'died': '1923-02-10',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'bornCity': 'Lennep (now Remscheid)',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'diedCity': 'Munich',
 'gender': 'male',
 'prizes': [{'year': '1901',
   'category': 'physics',
   'share': '1',
   'motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"',
   'affiliations': [{'name': 'Munich University',
     'city': 'Munich',
     'country': 'Germany'}]}]}
# Compute the total number of laureate prizes
count = 0
cursor = db.laureates.find({})
for doc in cursor:
    num_prizes = len(doc["prizes"])
    count += num_prizes
count
962

Listing and Counting - Lesson 2

from pymongo import MongoClient

client = MongoClient()
print(client)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

Listing databases and collections

Recall: - Data objects represented by documents - Documents organized into collections - Collections make up a database

We can list database names by calling list_database_names() on a client instance, and we can list collection names by calling list_collection_names() on a database instance.

  • How many databases are managed by client?
# Save a list of names of the databases managed by client
db_names = client.list_database_names()
print(db_names)

# Save a list of names of the collections managed by the "nobel" database
nobel_coll_names = client.nobel.list_collection_names()
print(nobel_coll_names)
['admin', 'config', 'local', 'nobel']
['laureates', 'prizes']

Counting documents

# Connect to our "nobel" database
db = client.nobel

The count_documents() method of a collection can be used to count the number of documents matching a particular filter. Let's use the method to count the total number of nobel laureates who were born in Egypt.

db.laureates.find_one({})
{'_id': ObjectId('6111e3211614513ac60bfb7f'),
 'id': '1',
 'firstname': 'Wilhelm Conrad',
 'surname': 'Röntgen',
 'born': '1845-03-27',
 'died': '1923-02-10',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'bornCity': 'Lennep (now Remscheid)',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'diedCity': 'Munich',
 'gender': 'male',
 'prizes': [{'year': '1901',
   'category': 'physics',
   'share': '1',
   'motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"',
   'affiliations': [{'name': 'Munich University',
     'city': 'Munich',
     'country': 'Germany'}]}]}
db.laureates.count_documents({"bornCountry": "Egypt"})
6

We can now build out our filter to reduce the total number of matching documents:

  • Create a filter criteria to count the number of laureates who died in the United Kingdom but were born ("bornCountry") in Egypt.
# Create a filter for laureates who died in the USA but were born in Germany
criteria = {"diedCountry": "United Kingdom", "bornCountry": "Egypt"}

# Save a count
count = db.laureates.count_documents(criteria)
print(count)
1

Finding, and Dot Notation - Lesson 3

from pymongo import MongoClient

client = MongoClient()
db = client.nobel
print(client)
print(db)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'nobel')

Query operators

Query operators provide us more options when constructing filters. For example, we may wish to find documents where a field's value matches any of a set of options. To do this we can use the $in query operator.

  • How many laureates were born in any of "Canada", "Mexico", or "USA"?
#Save a filter for laureates who were born in with Canada, Mexico, or the USA.
criteria = {"bornCountry": {"$in": ["Canada", "Mexico", "USA"]}}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
302

If we wish to accept all but one option as a value for a field, we can use the $ne (not equal) operator.

  • How many laureates died in the USA but were not born in the USA?
# Save a filter for laureates who died in the USA and were not born there.
criteria = {"diedCountry": "USA", "bornCountry": {"$ne": "USA"}}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
72

Below is a list of comparison query operators that are available in MongoDB:

$eq -- Matches values that are equal to a specified value.
$gt -- Matches values that are greater than a specified value.
$gte -- Matches values that are greater than or equal to a specified value.
$in -- Matches any of the values specified in an array.
$lt -- Matches values that are less than a specified value.
$lte -- Matches values that are less than or equal to a specified value.
$ne -- Matches all values that are not equal to a specified value.
$nin -- Matches none of the values specified in an array.

In addition to comparison query operators, we can use logical query operators in our filters as well.

  • How many laureates died in Canada OR were born in France?
#Save a filter for laureates who were born in Canada OR died in France.
criteria = {"$or":[{"diedCountry": "Canada"}, {"bornCountry": "France"}]}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
60

Below is a list of logical query operators that are available in MongoDB:

$and -- Joins query clauses with a logical AND returns all documents that match the conditions of both clauses.
$not -- Inverts the effect of a query expression and returns documents that do not match the query expression.
$nor -- Joins query clauses with a logical NOR returns all documents that fail to match both clauses.
$or -- Joins query clauses with a logical OR returns all documents that match the conditions of either clause.

The $regex operator can be used for text searching within values. Regex is short for regular expressions and is a powerful text searching technique with pattern recognition far beyond what we discuss here.

  • How many laureates were born in 1879?
db.laureates.find_one({})
{'_id': ObjectId('6111e3211614513ac60bfb7f'),
 'id': '1',
 'firstname': 'Wilhelm Conrad',
 'surname': 'Röntgen',
 'born': '1845-03-27',
 'died': '1923-02-10',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'bornCity': 'Lennep (now Remscheid)',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'diedCity': 'Munich',
 'gender': 'male',
 'prizes': [{'year': '1901',
   'category': 'physics',
   'share': '1',
   'motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"',
   'affiliations': [{'name': 'Munich University',
     'city': 'Munich',
     'country': 'Germany'}]}]}
#Save a filter for laureates who were born in 1879.
criteria = {"born":{"$regex":"1879-"}}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
6

See https://docs.mongodb.com/manual/reference/operator/query/ for more operators!

Using dot notation

You will notice that the prizes field of a laureates document consists of an array of subdocuments:

An example laureates document:

{'_id': ObjectId('5b9ac94ff35b63cf5231ccb1'),
 'born': '1845-03-27',
 'bornCity': 'Lennep (now Remscheid)',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'died': '1923-02-10',
 'diedCity': 'Munich',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'firstname': 'Wilhelm Conrad',
 'gender': 'male',
 'id': '1',
 'prizes': [{'affiliations': [{'city': 'Munich',
                               'country': 'Germany',
                               'name': 'Munich University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services '
                           'he has rendered by the discovery of the '
                           'remarkable rays subsequently named after him"',
             'share': '1',
             'year': '1901'}],
 'surname': 'Röntgen'}

An easy way to construct filters for array entries or subdocument fields is to use dot notation.

  • Use the $exists operator to find out how many laureates were awarded at least two prizes
#Save a filter for laureates who were awarded at least two prizes. 
criteria = {"prizes.1":{"$exists": True}}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
6

How many laureates were awarded at least two prizes in physics?

db.laureates.find_one(criteria)["prizes"][0]["category"]
'physics'
#Save a filter to estimate the number of laureates who were awarded at least two prizes in physics. 
criteria = {"prizes.0.category": "physics", "prizes.1.category": "physics"}

# Count them and save the count.
count = db.laureates.count_documents(criteria)
print(count)
1

db.laureates.find_one(criteria)
{'_id': ObjectId('6111e3211614513ac60bfbbf'),
 'id': '66',
 'firstname': 'John',
 'surname': 'Bardeen',
 'born': '1908-05-23',
 'died': '1991-01-30',
 'bornCountry': 'USA',
 'bornCountryCode': 'US',
 'bornCity': 'Madison, WI',
 'diedCountry': 'USA',
 'diedCountryCode': 'US',
 'diedCity': 'Boston, MA',
 'gender': 'male',
 'prizes': [{'year': '1956',
   'category': 'physics',
   'share': '3',
   'motivation': '"for their researches on semiconductors and their discovery of the transistor effect"',
   'affiliations': [{'name': 'University of Illinois',
     'city': 'Urbana, IL',
     'country': 'USA'}]},
  {'year': '1972',
   'category': 'physics',
   'share': '3',
   'motivation': '"for their jointly developed theory of superconductivity, usually called the BCS-theory"',
   'affiliations': [{'name': 'University of Illinois',
     'city': 'Urbana, IL',
     'country': 'USA'}]}]}