Skip to content

MongoDB Part 2

1: Working with Distinct Values, $elemMatch, and Regex

An exceptional laureate

from pymongo import MongoClient

client = MongoClient()
db = client.nobel
db.laureates.find_one({"prizes.2": {"$exists": True}})
{'_id': ObjectId('5f1f4e418defad6d9f1cd070'),
 'id': '482',
 'firstname': 'International Committee of the Red Cross',
 'born': '1863-00-00',
 'died': '0000-00-00',
 'gender': 'org',
 'prizes': [{'year': '1917',
   'category': 'peace',
   'share': '1',
   'motivation': '"for the efforts to take care of wounded soldiers and prisoners of war and their families"',
   'affiliations': [[]]},
  {'year': '1944',
   'category': 'peace',
   'share': '1',
   'motivation': '"for the great work it has performed during the war on behalf of humanity"',
   'affiliations': [[]]},
  {'year': '1963',
   'category': 'peace',
   'share': '2',
   'motivation': '"for promoting the principles of the Geneva Convention and cooperation with the UN"',
   'affiliations': [[]]}]}

Using .distinct()

db.laureates.distinct("gender")
['female', 'male', 'org']

  • A convenience method for a common aggregation (like count_documents)
  • We will not cover custom aggregations in this lesson, but the aggregate method is powerful.
  • distinct aggregation is efficient if there is a collection index on the field
  • We will learn how to create an index later in this lesson
  • No index needed here: collection fits in memory, has ≲ 1,000 documents

All prize categories vs those with multi-winners

db.laureates.distinct("prizes.category")
['chemistry', 'economics', 'literature', 'medicine', 'peace', 'physics']
db.laureates.distinct("prizes.category",{"prizes.1": {"$exists": True}})
['chemistry', 'peace', 'physics']

Distinct example: Born here, went there

In which countries have USA-born laureates had affiliations for their prizes?

db.laureates.distinct("prizes.affiliations.country",{"bornCountry":"USA"})
['Australia', 'Denmark', 'USA', 'United Kingdom']

Enter $elemMatch

db.laureates.count_documents({
    "prizes": {
        "category": "physics",
        "share": "1"
    }
})
0

Why is the above number zero?

db.laureates.count_documents({
    "prizes.category": "physics",
    "prizes.share": "1"
})
48
db.laureates.count_documents({
    "prizes": {
        "$elemMatch": {
            "category": "physics",
            "share": "1"
        }
    }
})
47

Why are these numbers different?

db.laureates.find_one({"firstname": "Marie"})
{'_id': ObjectId('5f1f4e418defad6d9f1cce9a'),
 'id': '6',
 'firstname': 'Marie',
 'surname': 'Curie',
 'born': '1867-11-07',
 'died': '1934-07-04',
 'bornCountry': 'Russian Empire (now Poland)',
 'bornCountryCode': 'PL',
 'bornCity': 'Warsaw',
 'diedCountry': 'France',
 'diedCountryCode': 'FR',
 'diedCity': 'Sallanches',
 'gender': 'female',
 'prizes': [{'year': '1903',
   'category': 'physics',
   'share': '4',
   'motivation': '"in recognition of the extraordinary services they have rendered by their joint researches on the radiation phenomena discovered by Professor Henri Becquerel"',
   'affiliations': [[]]},
  {'year': '1911',
   'category': 'chemistry',
   'share': '1',
   'motivation': '"in recognition of her services to the advancement of chemistry by the discovery of the elements radium and polonium, by the isolation of radium and the study of the nature and compounds of this remarkable element"',
   'affiliations': [{'name': 'Sorbonne University',
     'city': 'Paris',
     'country': 'France'}]}]}

$elemMatch example: Sharing in physics after World War II

What is the approximate ratio of the number of laureates who won an unshared, i.e.

{"share": "1"}

, prize in physics after World War II, i.e.

{"year": {"$gte": "1945"}}

, to the number of laureates who won a shared prize in physics after World War II?

# Save a filter for laureates with unshared prizes
unshared = {
    "prizes": {"$elemMatch": {
        "category": "physics",
        "share": "1",
        "year": {"$gte": "1945"},
    }}}

# Save a filter for laureates with shared prizes
shared = {
    "prizes": {"$elemMatch": {
        "category": "physics",
        "share": {"$gt": "1"},
        "year": {"$gte": "1945"},
    }}}

ratio = db.laureates.count_documents(unshared) / db.laureates.count_documents(shared)
print(ratio)
0.1232876712328767

Finding a substring with $regex

db.laureates.distinct(
    "bornCountry",
    {"bornCountry": {"$regex": "Poland"}})
['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']

Beginning and ending

db.laureates.distinct(
    "bornCountry", 
     {"bornCountry": {"$regex":"^Poland"}})
['Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)']
db.laureates.distinct(
    "bornCountry", 
     {"bornCountry": {"$regex":"^Poland \(now"}})
['Poland (now Belarus)', 'Poland (now Lithuania)', 'Poland (now Ukraine)']
db.laureates.distinct(
    "bornCountry", 
     {"bornCountry": {"$regex":"now Poland\)$"}})
['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']

Regex example: Glenn, George, and others in the G.S. crew

There are two laureates with Berkeley, California as a prize affiliation city that have the initials G.S. - Glenn Seaborg and George Smoot.

How many laureates in total have a first name beginning with "G" and a surname beginning with "S"?

db.laureates.count_documents({"firstname": {"$regex":"^G"}, "surname": {"$regex":"^S"}})
10

2: Projection and Sorting

What is "projection"?

  • reducing data to fewer dimensions
  • Asking certain data to "speak up"!

Projection in MongoDB

When using db.collection.find(), the first argument is the filter and the second is the projection.

db.laureates.find({}, {})
<pymongo.cursor.Cursor at 0x7f8360632850>

Since db.collection.find() returns a cursor, we need to make it into a list:

list(db.laureates.find({},{}))[:3]
[{'_id': ObjectId('5f1f4e418defad6d9f1cce95')},
 {'_id': ObjectId('5f1f4e418defad6d9f1cce96')},
 {'_id': ObjectId('5f1f4e418defad6d9f1cce97')}]

An empty projection dictionary by default only displays the entry "_id" field. To project out the prizes.affiliation field without the "_id" field, we would use a projection equal to {"prizes.affiliations": 1, "_id": 0}.

list(db.laureates.find({},{"prizes.affiliations": 1, "_id": 0}))[:3]
[{'prizes': [{'affiliations': [{'name': 'Munich University',
      'city': 'Munich',
      'country': 'Germany'}]}]},
 {'prizes': [{'affiliations': [{'name': 'Leiden University',
      'city': 'Leiden',
      'country': 'the Netherlands'}]}]},
 {'prizes': [{'affiliations': [{'name': 'Amsterdam University',
      'city': 'Amsterdam',
      'country': 'the Netherlands'}]}]}]

where the 1 value turns on "prizes.affiliations" and the 0 value turns off "_id".

Missing fields

list(db.laureates.find({"gender": "org"}, {"firstname": 1, "born": 1, "_id": 0}))
[{'firstname': 'Institute of International Law', 'born': '1873-00-00'},
 {'firstname': 'Permanent International Peace Bureau', 'born': '1891-00-00'},
 {'firstname': 'International Committee of the Red Cross',
  'born': '1863-00-00'},
 {'firstname': 'Nansen International Office for Refugees',
  'born': '1921-00-00'},
 {'firstname': 'Friends Service Council', 'born': '1647-00-00'},
 {'firstname': 'American Friends Service Committee', 'born': '1917-00-00'},
 {'firstname': 'Office of the United Nations High Commissioner for Refugees',
  'born': '1951-00-00'},
 {'firstname': 'League of Red Cross Societies', 'born': '1919-00-00'},
 {'firstname': "United Nations Children's Fund", 'born': '1946-00-00'},
 {'firstname': 'International Labour Organization', 'born': '1919-00-00'},
 {'firstname': 'Amnesty International', 'born': '1961-00-00'},
 {'firstname': 'International Physicians for the Prevention of Nuclear War',
  'born': '1980-00-00'},
 {'firstname': 'United Nations Peacekeeping Forces', 'born': '1948-00-00'},
 {'firstname': 'Pugwash Conferences on Science and World Affairs',
  'born': '1957-00-00'},
 {'firstname': 'International Campaign to Ban Landmines',
  'born': '1992-00-00'},
 {'firstname': 'Médecins Sans Frontières', 'born': '1971-00-00'},
 {'firstname': 'United Nations', 'born': '1945-00-00'},
 {'firstname': 'International Atomic Energy Agency', 'born': '1957-00-00'},
 {'firstname': 'Grameen Bank', 'born': '1976-00-00'},
 {'firstname': 'Intergovernmental Panel on Climate Change',
  'born': '1988-00-00'},
 {'firstname': 'European Union (EU)', 'born': '1952-00-00'},
 {'firstname': 'Organisation for the Prohibition of Chemical Weapons',
  'born': '1997-00-00'},
 {'firstname': 'National Dialogue Quartet '},
 {'firstname': 'International Campaign to Abolish Nuclear Weapons (ICAN)',
  'born': '2007-00-00'}]

Projection example: Shares of the 1963 prize in physics

Let's examine the laureates of the 1963 prize in physics and how they split the prize. Here is a query without projection:

db.laureates.find({"prizes": {"$elemMatch": {"category": "physics", "year": "1963"}}})

How would we fetch the laureates' full names and prize share info?

projection = {"firstname": 1, "surname": 1, "prizes.share": 1, "_id": 0}

list(db.laureates.find({"prizes": {"$elemMatch": {"category": "physics", "year": "1963"}}}, projection))
[{'firstname': 'Eugene', 'surname': 'Wigner', 'prizes': [{'share': '2'}]},
 {'firstname': 'Maria',
  'surname': 'Goeppert Mayer',
  'prizes': [{'share': '4'}]},
 {'firstname': 'J. Hans D.', 'surname': 'Jensen', 'prizes': [{'share': '4'}]}]

Sorting post-query with Python

from operator import itemgetter

docs = list(db.prizes.find({"category": "physics"},
                           {"year": 1}))

docs = sorted(docs, key=itemgetter("year"))

print([doc["year"] for doc in docs][:5])
['1901', '1902', '1903', '1904', '1905']

docs = sorted(docs, key=itemgetter("year"), reverse=True)

print([doc["year"] for doc in docs][:5])
['2019', '2018', '2017', '2016', '2015']

Sorting in-query with MongoDB

cursor = db.prizes.find({"category": "physics"},
                        {"year": 1},
                        sort=[("year", 1)])

print([doc["year"] for doc in cursor][:5])
['1901', '1902', '1903', '1904', '1905']

cursor = db.prizes.find({"category": "physics"},
                        {"year": 1},
                        sort=[("year", -1)])

print([doc["year"] for doc in cursor][:5])
['2019', '2018', '2017', '2016', '2015']

Primary and secondary sorting

for doc in db.prizes.find(
        {"year": {"$gt": "1966", "$lt": "1970"}},
        {"category": 1, "year": 1, "_id": 0},
        sort=[("year", 1), ("category", -1)]):
    print(doc["year"],doc["category"])
1967 physics
1967 peace
1967 medicine
1967 literature
1967 chemistry
1968 physics
1968 peace
1968 medicine
1968 literature
1968 chemistry
1969 physics
1969 peace
1969 medicine
1969 literature
1969 economics
1969 chemistry

Sorting example: What the sort?

This block prints out the first five projections of a sorted query. What "sort" argument fills the blank?

docs = list(db.laureates.find(
    {"born": {"$gte": "1900"}, "prizes.year": {"$gte": "1954"}, "gender":{"$in":["male","female"]}},
    {"born": 1, "prizes.year": 1, "_id": 0},
    sort = ____))
for doc in docs[:5]:
    print(doc)
{'born': '1916-08-25', 'prizes': [{'year': '1954'}]}
{'born': '1915-06-15', 'prizes': [{'year': '1954'}]}
{'born': '1901-02-28', 'prizes': [{'year': '1954'}, {'year': '1962'}]}
{'born': '1913-07-12', 'prizes': [{'year': '1955'}]}
{'born': '1911-01-26', 'prizes': [{'year': '1955'}]}

Primary sorting is ascending by "prizes.year". Secondary sort is descending by "born". Thus, sort=[("prizes.year",1),("born",-1)].

my_sort = [("prizes.year",1),("born",-1)]

docs = list(db.laureates.find(
    {"born": {"$gte": "1900"}, "prizes.year": {"$gte": "1954"}, "gender":{"$in":["male","female"]}},
    {"born": 1, "prizes.year": 1, "_id": 0},
    sort = my_sort))
for doc in docs[:5]:
    print(doc)
{'born': '1916-08-25', 'prizes': [{'year': '1954'}]}
{'born': '1915-06-15', 'prizes': [{'year': '1954'}]}
{'born': '1901-02-28', 'prizes': [{'year': '1962'}, {'year': '1954'}]}
{'born': '1913-07-12', 'prizes': [{'year': '1955'}]}
{'born': '1911-01-26', 'prizes': [{'year': '1955'}]}

3: Indexing

What are indexes?

When to use indexes?

  • Queries with high specificity
  • Large documents
  • Large collections

Getting index information

db.prizes.index_information()
{'_id_': {'v': 2, 'key': [('_id', 1)], 'ns': 'nobel.prizes'}}

Gauging performance before indexing

Jupyter Notebook %%timeitmagic (same as python -m timeit "[expression]")

%%timeit
docs = list(db.prizes.find({"year": "1901"}))
577 µs ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
docs = list(db.prizes.find({}, sort=[("year", 1)]))
5 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
docs = list(db.prizes.find({}, sort=[("year", -1)]))
4.79 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Adding a single-field index

* index model: list of (field, direction)pairs. * directions: 1 (ascending) and -1 (descending)

db.prizes.create_index([("year", 1)])
'year_1'
db.prizes.index_information()
{'_id_': {'v': 2, 'key': [('_id', 1)], 'ns': 'nobel.prizes'},
 'year_1': {'v': 2, 'key': [('year', 1)], 'ns': 'nobel.prizes'}}
%%timeit
docs = list(db.prizes.find({"year": "1901"}))
330 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
docs = list(db.prizes.find({}, sort=[("year", 1)]))
3.89 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
docs = list(db.prizes.find({}, sort=[("year", -1)]))
4.07 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

db.prizes.drop_index([("year", 1)])
db.prizes.create_index([("year", -1)])
'year_-1'
%%timeit
docs = list(db.prizes.find({}, sort=[("year", -1)]))
4.36 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

db.prizes.drop_index([("year", -1)])

Adding a compound (multiple-field) index

  • index "covering" a query with projection
%%timeit
list(db.prizes.find({"category": "economics"}, {"year": 1, "_id": 0}))
620 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

db.prizes.create_index([("category", 1), ("year", 1)])
'category_1_year_1'
%%timeit
list(db.prizes.find({"category": "economics"}, {"year": 1, "_id": 0}))
435 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

db.prizes.drop_index([("category", 1), ("year", 1)])

* index "covering" a query with projection and sorting

%%timeit
db.prizes.find_one(
    {"category": "economics"},
    {"year": 1, "_id": 0},
    sort=[("year", 1)]
)
698 µs ± 62.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

db.prizes.create_index([("category", 1), ("year", 1)])
'category_1_year_1'
%%timeit
db.prizes.find_one(
    {"category": "economics"},
    {"year": 1, "_id": 0},
    sort=[("year", 1)]
)
504 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

db.prizes.drop_index([("category", 1), ("year", 1)])